For compaction to be able to purge expired data, like tombstones, a
sstable set snapshot is set in the compaction descriptor.
That's a decision that belongs to task type. For example, all regular
compaction enable GC, whereas scrub for example doesn't for safety
reasons.
The problem is that the decision is being made by every instantiation
of compaction_descriptor in the strategies, which is both unnecessary
and also adds lots of boilerplate to the code, making it hard to
understand and work with.
As sstable set snapshot is an implementation detail, a new method
is being added to compaction_descriptor to make the intention
clearer, making the interface easier to understand.
can_purge_tombstones, used previously by rewrite task only, is being
reused for communicating GC intention into task::compact_sstables().
The boilerplate was a pain when adding a new strategy method for
the ongoing work on cleanup, described by issue #10097.
Another benefit is that we'll now only create a set snapshot when
compaction will really run. Before, it could happen that the snapshot
would be discarded if the compaction attempt had to be postponed,
which is a waste of cpu cycles.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This makes host id mismatch cause a warning and stop being fatal,
to un-break node replacement dtests.
Should be revisited if/when the underlying problem (double setting of
local host id on a replacing node) is fixed.
Refs #10148
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Message-Id: <20220303085049.186259-1-michael.livshin@scylladb.com>
There's nothing specific to scylla in the lister
classes, they could (and maybe should) be part of
the seastar library.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
directory_lister provides a simpler interface compared to lister.
After creating the directory_lister,
its async get() method should be called repeatedly,
returning a std::optional<directory_entry> each call,
until it returns a disengaged entry or an error.
This is especially suitable for coroutines
as demonstrated in the unit tests that were added.
For example:
```c++
auto dl = directory_lister(path);
while (auto de = co_await dl.get()) {
co_await process(*de);
}
```
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#9835
* github.com:scylladb/scylla:
sstable_directory: process_sstable_dir: use directory_lister
sstable_directory: process_sstable_dir: fixup indentation
sstable_directory: coroutinize process_sstable_dir
lister: add directory_lister
Add an additional sstable validation step to check that originating
host id matches the local host id.
This is only done for ME-and-up sstables, which do not come from
upload/, and when the local host id is known.
When local host id is unknown, check that the sstable belongs to a
system keyspace, i.e. whether it is plausible that Scylla is still
booting up and hasn't loaded/generated the local host id yet.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.
Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.
The changes we applied mechanically with a script, except to
licenses/README.md.
Closes#9937
This will be useful to allow sstable_directory user to filter out
sstables that should not be reshaped. The default filter is
implemented as including everything.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Reshape isn't mandatory for correctness, unlike resharding.
So we can allow boot to continue even in face of reshape
failure. Without this, boot will fail right away due to
unhandled exception. This is intended to make population
more resilient as any exception, even "benign" ones,
may cause boot to fail. It's better to allow boot to
continue from where it left off, as if there's an exception
like io error, or OOM, population will be unable to
complete anyway.
This patch was written based on observation that dangling
errors in interposer consumer used by compaction can cause
a different exception to be triggered, like broken_promise,
when user asked reshape to stop. This can no longer happen
now, but better safe than sorry.
So regular compaction can now pick on backlog once node is
online.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220107130539.14899-1-raphaelsc@scylladb.com>
Move replica-oriented classes to the replica namespace. The main
classes moved are ::database, ::keyspace, and ::table, but a few
ancillary classes are also moved. There are certainly classes that
should be moved but aren't (like distributed_loader) but we have
to start somewhere.
References are adjusted treewide. In many cases, it is obvious that
a call site should not access the replica (but the data_dictionary
instead), but that is left for separate work.
scylla-gdb.py is adjusted to look for both the new and old names.
The database, keyspace, and table classes represent the replica-only
part of the objects after which they are named. Reading from a table
doesn't give you the full data, just the replica's view, and it is not
consistent since reconciliation is applied on the coordinator.
As a first step in acknowledging this, move the related files to
a replica/ subdirectory.
Make compaction procedure switch to table_state. Only function in
compaction.cc still directly using table is
get_fully_expired_sstables(T,...), but subsequently we'll make it
switch to table_state and then we can finally stop including database.hh
in the compaction code.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
"This series removes layer violation in compaction, and also
simplifies compaction manager and how it interacts with compaction
procedure."
* 'compaction_manager_layer_violation_fix/v4' of github.com:raphaelsc/scylla:
compaction: split compaction info and data for control
compaction_manager: use task when stopping a given compaction type
compaction: remove start_size and end_size from compaction_info
compaction_manager: introduce helpers for task
compaction_manager: introduce explicit ctor for task
compaction: kill sstables field in compaction_info
compaction: kill table pointer in compaction_info
compaction: simplify procedure to stop ongoing compactions
compaction: move management of compaction_info to compaction_manager
compaction: move output run id from compaction_info into task
compaction_info must only contain info data to be exported to the
outside world, whereas compaction_data will contain data for
controlling compaction behavior and stats which change as
compaction progresses.
This separation makes the interface clearer, also allowing for
future improvements like removing direct references to table
in compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Today, compaction is calling compaction manager to register / deregister
the compaction_info created by it.
This is a layer violation because manager sits one layer above
compaction, so manager should be responsible for managing compaction
info.
From now on, compaction_info will be created and managed by
compaction_manager. compaction will only have a reference to info,
which it can use to update the world about compaction progress.
This will allow compaction_manager to be simplified as info can be
coupled with its respective task, allowing duplication to be removed
and layer violation to be fixed.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
compaction_info must only contain info data to be exported to the
outside world, whereas compaction_data will contain data for
controlling compaction behavior and stats which change as
compaction progresses.
This separation makes the interface clearer, also allowing for
future improvements like removing direct references to table
in compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Today, compaction is calling compaction manager to register / deregister
the compaction_info created by it.
This is a layer violation because manager sits one layer above
compaction, so manager should be responsible for managing compaction
info.
From now on, compaction_info will be created and managed by
compaction_manager. compaction will only have a reference to info,
which it can use to update the world about compaction progress.
This will allow compaction_manager to be simplified as info can be
coupled with its respective task, allowing duplication to be removed
and layer violation to be fixed.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Indicate whether the compaction job should be aborted
due to an error using a new, compaction_aborted_exception type,
vs. compaction_stopped_exception that indicates
the task should be stopped due to some external event that
doesn't indicate an error (like shutdown or api call).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
the name compaction_options is confusing as it overlaps in meaning
with compaction_descriptor. hard to reason what are the exact
difference between them, without digging into the implementation.
compaction_options is intended to only carry options specific to
a give compaction type, like a mode for scrub, so let's rename
it to compaction_type_options to make it clearer for the
readers.
[avi: adjust for scrub changes]
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210908003934.152054-1-raphaelsc@scylladb.com>
Now reshape can be aborted on either boot or refresh.
The workflow is:
1) reshape starts
2) user notices it's taking too long
3) nodetool stop RESHAPE
the good thing is that completed reshape work isn't lost, allowing
table to enjoy the benefits of all reshaping done up to the abortion
point.
Fixes#7738.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
All callers use it to do operations that are closely associated with one
of the standard compaction types, so no reason to pass in a custom
string instead of the compaction type enum.
Since compaction is layered on top of sstables, let's move all compaction code
into a new top-level directory.
This change will give me extra motivation to remove all layer violations, like
sstable calling compaction-specific code, and compaction entanglement with
other components like table and storage service.
Next steps:
- remove all layer violations
- move compaction code in sstables namespace into a new one for compaction.
- move compaction unit tests into its own file
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210707194058.87060-1-raphaelsc@scylladb.com>
Keep descriptors in a map so it could be searched easily by generation.
and possibly delete the descriptor, if found, in the presence of
a temporary toc component.
A following patch will add support to create_links for moving
sstables between directories. It is based on keeping a TemporaryTOC
file in the destination directory while linking all source components.
If scylla crashes here, the destination sstable will have both
its TemporaryTOC and TOC components and it needs to be removed
to roll the move backwards.
Then, create_links will atomically move the TemporaryTOC from
the destination back to the source directory, to toggle rolling
back to rolling forward by marking the source sstable for removal.
If scylla crashes here, the source sstable will have both
its TemporaryTOC and TOC components and it needs to be removed
to roll the move forward.
Add unit test for this case.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This is a regression caused by aebd965f0.
After the sstable_directory changes, resharding now waits for all sstables
to be exhausted before releasing reference to them, which prevents their
resources like disk space and fd from being released. Let's restore the
old behavior of incrementally releasing resources, reducing the space
requirement significantly.
Fixes#7463.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20201020140939.118787-1-raphaelsc@scylladb.com>
Although each sstable_directory limits concurrency using
max_concurrent_for_each, there could be a large number
of calls to do_for_each_sstable running in parallel
(e.g per keyspace X per table in the distributed_loader).
To cap parallelism across sstable_directory instances and
concurrent calls to do_for_each_sstable, start a sharded<semaphore>
and pass a shared semaphore& to the sstable_directory:s.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Use max_concurrent_for_each instead of parallel_for_each in
sstable_directory::parallel_for_each_restricted to avoid
creating potentially thousands of continuations,
one for each sstable.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
C++20 introduced `contains` member functions for maps and sets for
checking whether an element is present in the collection. Previously
`count` function was often used in various ways.
`contains` does not only express the intend of the code better but also
does it in more unified way.
This commit replaces all the occurences of the `count` with the
`contains`.
Tests: unit(dev)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <b4ef3b4bc24f49abe04a2aba0ddd946009c9fcb2.1597314640.git.piotr@scylladb.com>
"
This patchset adds a reshape operation to each compaction strategy;
that is a strategy-specific way of detecting if SSTables are in-strategy
or off-strategy, and in case they are offstrategy moving them to in-strategy.
Often times the number of SSTables in a particular slice of the sstable set
matters for that decision (number of SSTables in the same time window for TWCS,
number of SSTables per tier for STCS, number of L0 SSTables for LCS). We want
to be more lenient for operations that keep the node offline, like reshape at
boot, but more forgiving for operations like upload, which run in maintenance
mode. To accomodate for that the threshold for considering a slice of the SSTable
set offstrategy is passed as a parameter
Once this patchset is applied, the upload directory will reshape the SSTables
before moving them to the main directory (if needed). One side effect of it
is that it is no longer necessary to take locks for the refresh operation nor
disable writes in the table.
With the infrastructure that we have built in the upload directory, we can
apply the same set of steps to populate_column_family. Using the sstable_directory
to scan the files we can reshard and reshape (usually if we resharded a reshape
will be necessary) with the node still offline. This has the benefit of never
adding shared SSTables to the table.
Applying this patchset will unlock a host of cleanups:
- we can get rid of all testing for shared sstables, sstable_need_rewrite, etc.
- we can remove the resharding backlog tracker.
and many others. Most cleanups are deferred for a later patchset, though.
"
* 'reshard-reshape-v4' of github.com:glommer/scylla:
distributed_loader: reshard before the node is made online
distributed_loader: rework uploading of SSTables
sstable_directory: add helper to reshape existing unshared sstables
compaction_strategy: add method to reshape SSTables
compaction: add a new compaction type, Reshape
compaction: add a size and throught pretty printer.
compaction: add default implementation for some pure functions
tests: fix fragile database tests
distributed_loader.cc: add a helper function to extract the highest SSTable version found
distributed_loader.cc : extract highest_generation_seen code
compaction_manager: rename run_resharding_job
distributed_loader: assume populate_column_families is run in shard 0
api: do not allow user to meddle with auto compaction too early
upload: use custom error handler for upload directory
sstable_directory: fix debug message
Uploading of SSTables is problematic: for historical reasons it takes a
lock that may have to wait for ongoing compactions to finish, then it
disables writes in the table, and then it goes loading SSTables as if it
knew nothing about them.
With the sstable_directory infrastructure we can do much better:
* we can reshard and reshape the SSTables in place, keeping the number
of SSTables in check. Because this is an background process we can be
fairly aggressive and set the reshape mode to strict.
* we can then move the SSTables directly into the main directory.
Because we know they are few in number we can call the more elegant
add_sstable_and_invalidate_cache instead of the open coding currently
done by load_new_sstables
* we know they are not shared (if they were, we resharded them),
simplifying the load process even further.
The major changes after this patch is applied is that all compactions
(resharding and reshape) needed to make the SSTables in-strategy are
done in the streaming class, which reduces the impact of this operation
on the node. When the SSTables are loaded, subsequent reads will not
suffer as we will not be adding shared SSTables in potential high
numbers, nor will we reshard in the compaction class.
There is also no more need for a lock in the upload process so in the
fast path where users are uploading a set of SSTables from a backup this
should essentially be instantaneous. The lock, as well as the code to
disable and enable table writes is removed.
A future improvement is to bypass the staging directory too, in which
case the reshaping compaction would already generate the view updates.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Before moving SSTables to the main directory, we may need to reshape them
into in-strategy. This patch provides helper code that reshapes the SSTables
that are known to be unshared local in the sstable directory, and updates the
sstable directory with the result.
Rehaping can be made more or less aggressive by passing a reshape mode
(relaxed or strict), which will influence the amount of SSTables reshape
can tolerate to consider a particular slice of the SSTable set
offstrategy.
Because the compaction expects an std::vector everywhere, we changed
our chunked vector for the unshared sstables to a std::vector so we
can more easily pass it around without conversions.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
It will be used to run any custom job where the caller provides a
function. One such example is indeed resharding, but reshaping SSTables
can also fall here.
The semaphore is also renamed, and we'll allow only one custom job at a
time (across all possible types).
We also remove the assumption of the scheduling group. The caller has to
have already placed the code in the correct CPU scheduling group. The
I/O priority class comes from the descriptor.
To make sure that we don't regress, we wrap the entire reshard-at-boot
code in the compaction class. Currently the setup would be done in the
main group, and the actual resharding in the compaction group. Note that
this is temporary, as this code is about to change.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The seastar api v4 changes the return type of when_all_succeed. This
patch adds discard_result when that is best solution to handle the
change.
This doesn't do the actual update to v4 since there are still a few
issues left to fix in seastar. A patch doing just the update will
follow.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200617233150.918110-1-espindola@scylladb.com>
I just noticed while working on the reshape patches that there
is an extra format bracket in two of the debug message. As they
are debug I've seen them less often than the others and that slipped.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Off-strategy SSTables are SSTables that do not conform to the invariants
that the compaction strategies define. Examples of offstrategy SSTables
are SSTables acquired over bootstrap, resharding when the cpu count
changes or imported from other databases through our upload directory.
This patch introduces a new class, sstable_directory, that will
handle SSTables that are present in a directory that is not one of the
directories where the table expects its SSTables.
There is much to be done to support off-strategy compactions fully. To
make sure we make incremental progress, this patch implements enough
code to handle resharding of SSTables in the upload directory. SSTables
are resharded in place, before we start accessing the files.
Later, we will take other steps before we finally move the SSTables into
the main directory. But for now, starting with resharding will not only
allow us to start small, but it will also allow us to start unleashing
much needed cleanups in many places. For instance, once we start
resharding on boot before making the SSTables available, we will be able
to expurge all places in Scylla where, during normal operations, we have
extra handler code for the fact that SSTables could be shared.
Tests: a new test is added and it passes in debug mode.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
When we are scanning an sstable directory, we want to filter out the
manifest file in most situations. The table class has a filter for that,
but it is a static filter that doesn't depend on table for anything. We
are better off removing it and putting in another independent location.
While it seems wasteful to use a new header just for that, this header
will soon be populated with the sstable_directory class.
Tests: unit (dev)
Signed-off-by: Glauber Costa <glauber@scylladb.com>