With compaction_manager switching to table_state, we'll need to
introduce a method in table_state to return maintenance set.
So better to have a descriptive name for main set.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
We should separate the scheduling groups used for major compaction
from the the regular compaction scheduling group so that
the latter can be affected by the backlog tracker in case
backlog accumulates during a long running major compaction.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The backlog definition for leveled is incorrectly built on the assumption that
the world must reach the state of zero amplification, i.e. everything in the
last level. The actual goal is space amplification of 1.1.
In reality, LCS just wants that for every level L, level L is fan_out=10 times
larger than L-1. See more in commit 9de7abdc80 which adjusts LCS to conform
to this goal.
If level 3 = 1000G, level 2 = 100G, level 1 = 10G, level 0 = 1G, that should
return zero backlog as space amplification is (1000+100+10+1)/1000 = ~1.1
But today, LCS calculates high backlog for the layout above, as it will only be
satisfied once everything is promoted to the maximum level. That's completely
disconnected from what the strategy actually wants. Therefore, a mismatch.
With today's definition, the backlog for any SSTable is:
sizeof(sstable) * (Lmax - levelof(sstable)) * fan_out
where Lmax = maximum level,
and fan_out = LCS' fan out which is 10 by default
That's essentially calculating the total cost for data in the SSTable to climb
up to the maximum level. Of course, if a SSTable is at the maximum level,
(Lmax - levelof(sstable)) returns zero, therefore backlog for it is zero.
Take a look at this example:
If L0 sstable is 0.16G, then its backlog = 0.16G * (3 - 0) * 10 = 4.8G
0.16G = LCS' default fragment size
Maximum level (Lmax in formula) can be easily 3 as:
log10 of (30G/0.16G=~187 sstables)) = ~2.27
~2.27 means that data has exceeded level 2 capacity and so needs 3 levels.
So 3 L0 sstables could add ~15G of backlog. With 1G memory per shard (30:1 disk
memory ratio), that's normalized backlog of ~15, which translates into
additional ~500 shares. That's halfway to full compaction speed.
With more files in higher levels, we can easily get to a normalized backlog
above 30, resulting in 1k shares.
The suboptimal backlog definition causes either table using LCS or coexisting
tables to run with more shares than needed, causing compaction to steal
resources, resulting in higher latency and reduced throughput.
To solve this problem, a new formula is used which will basically calculate
the amount of work needed to achieve the layout goal. We no longer want to
promote everything to the last level, but instead we'll incrementally calculate
the backlog in each level L, which is the amount of work needed such that the
next level L + 1 is at least fan_out times bigger.
Fixes#10583.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
No functional changes intended - this series is quite verbose,
but after it's in, it should be considerably easier to change
the type of SSTable generations to something else - e.g. a string
or timeUUID.
Closes#10533
Then caller can decide whether to copy or move candidate set into the
function. cleanup_sstables_compaction_task can move candidates as
it's no longer needed once it retrieves all descriptors.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Today, all compaction strategies will clean up their files using the
incremental approach of one sstable being rewritten at a time.
Turns out that's not the best approach performance wise. Let's take
STCS for example. As cleanup finishes rewriting one file, the output
file is placed into the sstable set. Regular now can compact that
file with another that was already there (e.g. produced by flush after
cleanup started). Inefficient compactions like this can keep happening
as cleanup incrementally places output file into the candidate list
for regular.
This method will allow strategies to clean up their files in batches.
For example, STCS can clean up all files in smallest tiers in single
round, allowing the output data to be added at once. So next compaction
rounds can be more efficient in terms of writeamp. Another benefit is
that deduplication and GC can happen more efficiently.
The drawback is the space requirement, as we no longer compact one file
a a time. However, the impact is minimized by cleaning up the smallest
tier first. With leveled strategy for example, even though 90% of data
is in highest level, the space requirement is not a problem because
we can apply the incremental compaction on its behalf. The same applies
to ICS. With STCS, the requirement is the size of the tier being
compacted, but that's already expected by its users anyway.
By the time being, all strategies have it unimplemented. so they still
use the old behavior where files are rewritten on at a time.
This will allow us to incrementally implement the cleanup method for
all compaction strategies.
Refs #10097.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
For compaction to be able to purge expired data, like tombstones, a
sstable set snapshot is set in the compaction descriptor.
That's a decision that belongs to task type. For example, all regular
compaction enable GC, whereas scrub for example doesn't for safety
reasons.
The problem is that the decision is being made by every instantiation
of compaction_descriptor in the strategies, which is both unnecessary
and also adds lots of boilerplate to the code, making it hard to
understand and work with.
As sstable set snapshot is an implementation detail, a new method
is being added to compaction_descriptor to make the intention
clearer, making the interface easier to understand.
can_purge_tombstones, used previously by rewrite task only, is being
reused for communicating GC intention into task::compact_sstables().
The boilerplate was a pain when adding a new strategy method for
the ongoing work on cleanup, described by issue #10097.
Another benefit is that we'll now only create a set snapshot when
compaction will really run. Before, it could happen that the snapshot
would be discarded if the compaction attempt had to be postponed,
which is a waste of cpu cycles.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Introduced in commit: ddd693c6d7
We're not emplacing newer windows in the tracker, causing
std::out_of_range when replacing sstables for windows.
Let's fix the logic and add an unit test to cover this.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220301194944.95096-1-raphaelsc@scylladb.com>
Today, compaction can act much more aggressive than it really has to, because
the strategy and its definition of backlog are completely decoupled.
The backlog definition for size-tiered, which is inherited by all
strategies (e.g.: LCS L0, TWCS' windows), is built on the assumption that the
world must reach the state of zero amplification. But that's unrealistic and
goes against the intent amplification defined by the compaction strategy.
For example, size tiered is a write oriented strategy which allows for extra
space amplification for compaction to keep up with the high write rate.
It can be seen today, in many deployments, that compaction shares is either
close to 1000, or even stuck at 1000, even though there's nothing to be done,
i.e. the compaction strategy is completely satisfied.
When there's a single sstable per tier, for example.
This means that whenever a new compaction job kicks in, it will act much more
aggressive because of the high shares, caused by false backlog of the existing
tables. This translates into higher P99 latencies and reduced throughput.
Solution
========
This problem can be fixed, as proposed in the document "Fixing compaction
aggressiveness due to suboptimal definition of zero backlog by controller" [1],
by removing backlog of tiers that don't have to be compacted now, like a tier
that has a single file. That's about coupling the strategy goal with the
backlog definition. So once strategy becomes satisfied, so will the controller.
Low-efficiency compaction, like compacting 2 files only or cross-tier, only
happens when system is under little load and can proceed at a slower pace.
Once efficient jobs show up, ongoing compactions, even if inefficient, will get
more shares (as efficient jobs add to the backlog) so compaction won't fall
behind.
With this approach, throughput and latency is improved as cpu time is no longer
stolen (unnecessarily) from the foreground requests.
[1]: https://docs.google.com/document/d/1EQnXXGWg6z7VAwI4u8AaUX1vFduClaf6WOMt2wem5oQFixes#4588.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This new interface allows table to communicate multiple changes in the
SSTable set with a single call, which is useful on compaction completion
for example.
With this new interface, the size tiered backlog tracker will be able to
know when compaction completed, which will allow it to recompute tiers
and their backlog contribution, if any. Without it, tiered tracker
would have to recompute tiers for every change, which would be terribly
expensive.
The old remove/add interface are being removed in favor of the new one.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.
Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.
The changes we applied mechanically with a script, except to
licenses/README.md.
Closes#9937
The gc_grace_seconds is a very fragile and broken design inherited from
Cassandra. Deleted data can be resurrected if cluster wide repair is not
performed within gc_grace_seconds. This design pushes the job of making
the database consistency to the user. In practice, it is very hard to
guarantee repair is performed within gc_grace_seconds all the time. For
example, repair workload has the lowest priority in the system which can
be slowed down by the higher priority workload, so that there is no
guarantee when a repair can finish. A gc_grace_seconds value that is
used to work might not work after data volume grows in a cluster. Users
might want to avoid running repair during a specific period where
latency is the top priority for their business.
To solve this problem, an automatic mechanism to protect data
resurrection is proposed and implemented. The main idea is to remove the
tombstone only after the range that covers the tombstone is repaired.
In this patch, a new table option tombstone_gc is added. The option is
used to configure tombstone gc mode. For example:
1) GC a tombstone after gc_grace_seconds
cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'timeout'} ;
This is the default mode. If no tombstone_gc option is specified by the
user. The old gc_grace_seconds based gc will be used.
2) Never GC a tombstone
cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'disabled'};
3) GC a tombstone immediately
cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'immediate'};
4) GC a tombstone after repair
cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair'};
In addition to the 'mode' option, another option 'propagation_delay_in_seconds'
is added. It defines the max time a write could possibly delay before it
eventually arrives at a node.
A new gossip feature TOMBSTONE_GC_OPTIONS is added. The new tombstone_gc
option can only be used after the whole cluster supports the new
feature. A mixed cluster works with no problem.
Tests: compaction_test.py, ninja test
Fixes#3560
[avi: resolve conflicts vs data_dictionary]
This strategy method was introduced unnecessarily. We assume it was
going to be needed, but turns out it was never needed, not even
for ICS. Also it's built on a wrong assumption as an output
sstable run being generated can never be compacted in parallel
as the non-overlapping requirement can be easily broken.
LCS for example can allow parallel compaction on different runs
(levels) but correctness cannto be guaranteed with same runs
are compacted in parallel.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Last method in compaction_strategy using table. From now on,
compaction strategy no longer works directly with table.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
From now on, get_major_compaction_job() will use table_state instead of
a plain reference to table.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
From now on, get_sstables_for_compaction() will use table_state.
With table_state, we avoid layer violations like strategy using
manager and also makes testing easier.
Compaction unit tests were temporarily disabled to avoid a giant
commit which is hard to parse.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Similar to LCS, let's reduce table dependency in DTCS, to make it
easier to switch to table_state.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This warning can catch a virtual function that thinks it
overrides another, but doesn't, because the two functions
have different signatures. This isn't very likely since most
of our virtual functions override pure virtuals, but it's
still worth having.
Enable the warning and fix numerous violations.
Closes#9347
The sstable_list is destroyed right after the temporary
lw_shared_ptr<sstable_list> returned from `cf.get_sstables()`
is dereferenced.
Fixes#9138
Test: unit(dev)
DTest: resharding_test.py:ReshardingTombstones_with_DateTieredCompactionStrategy.disable_tombstone_removal_during_reshard_test (debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210804075813.42526-1-bhalevy@scylladb.com>
Since compaction is layered on top of sstables, let's move all compaction code
into a new top-level directory.
This change will give me extra motivation to remove all layer violations, like
sstable calling compaction-specific code, and compaction entanglement with
other components like table and storage service.
Next steps:
- remove all layer violations
- move compaction code in sstables namespace into a new one for compaction.
- move compaction unit tests into its own file
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210707194058.87060-1-raphaelsc@scylladb.com>