Commit Graph

20 Commits

Author SHA1 Message Date
Raphael S. Carvalho
2a9bfa3e3f compaction_strategy: get_cleanup_compaction_jobs: accept candidates by value
Then caller can decide whether to copy or move candidate set into the
function. cleanup_sstables_compaction_task can move candidates as
it's no longer needed once it retrieves all descriptors.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-03-29 09:49:13 -03:00
Raphael S. Carvalho
44e9e10414 compaction_strategy: Allow strategies to define their own cleanup strategy
Today, all compaction strategies will clean up their files using the
incremental approach of one sstable being rewritten at a time.

Turns out that's not the best approach performance wise. Let's take
STCS for example. As cleanup finishes rewriting one file, the output
file is placed into the sstable set. Regular now can compact that
file with another that was already there (e.g. produced by flush after
cleanup started). Inefficient compactions like this can keep happening
as cleanup incrementally places output file into the candidate list
for regular.

This method will allow strategies to clean up their files in batches.
For example, STCS can clean up all files in smallest tiers in single
round, allowing the output data to be added at once. So next compaction
rounds can be more efficient in terms of writeamp. Another benefit is
that deduplication and GC can happen more efficiently.

The drawback is the space requirement, as we no longer compact one file
a a time. However, the impact is minimized by cleaning up the smallest
tier first. With leveled strategy for example, even though 90% of data
is in highest level, the space requirement is not a problem because
we can apply the incremental compaction on its behalf. The same applies
to ICS. With STCS, the requirement is the size of the tier being
compacted, but that's already expected by its users anyway.

By the time being, all strategies have it unimplemented. so they still
use the old behavior where files are rewritten on at a time.
This will allow us to incrementally implement the cleanup method for
all compaction strategies.

Refs #10097.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-03-23 00:04:03 -03:00
Raphael S. Carvalho
c25d8f6770 compaction: Move decision of garbage collection from strategy to task type
For compaction to be able to purge expired data, like tombstones, a
sstable set snapshot is set in the compaction descriptor.

That's a decision that belongs to task type. For example, all regular
compaction enable GC, whereas scrub for example doesn't for safety
reasons.

The problem is that the decision is being made by every instantiation
of compaction_descriptor in the strategies, which is both unnecessary
and also adds lots of boilerplate to the code, making it hard to
understand and work with.

As sstable set snapshot is an implementation detail, a new method
is being added to compaction_descriptor to make the intention
clearer, making the interface easier to understand.

can_purge_tombstones, used previously by rewrite task only, is being
reused for communicating GC intention into task::compact_sstables().

The boilerplate was a pain when adding a new strategy method for
the ongoing work on cleanup, described by issue #10097.
Another benefit is that we'll now only create a set snapshot when
compaction will really run. Before, it could happen that the snapshot
would be discarded if the compaction attempt had to be postponed,
which is a waste of cpu cycles.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-03-21 12:14:04 -03:00
Raphael S. Carvalho
2dba0670ad compaction: Fix time_window_backlog_tracker::replace_sstables()
Introduced in commit: ddd693c6d7

We're not emplacing newer windows in the tracker, causing
std::out_of_range when replacing sstables for windows.

Let's fix the logic and add an unit test to cover this.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220301194944.95096-1-raphaelsc@scylladb.com>
2022-03-02 10:08:40 +02:00
Raphael S. Carvalho
1d9f53c881 compaction: Redefine compaction backlog to tame compaction aggressiveness
Today, compaction can act much more aggressive than it really has to, because
the strategy and its definition of backlog are completely decoupled.

The backlog definition for size-tiered, which is inherited by all
strategies (e.g.: LCS L0, TWCS' windows), is built on the assumption that the
world must reach the state of zero amplification. But that's unrealistic and
goes against the intent amplification defined by the compaction strategy.
For example, size tiered is a write oriented strategy which allows for extra
space amplification for compaction to keep up with the high write rate.

It can be seen today, in many deployments, that compaction shares is either
close to 1000, or even stuck at 1000, even though there's nothing to be done,
i.e. the compaction strategy is completely satisfied.
When there's a single sstable per tier, for example.
This means that whenever a new compaction job kicks in, it will act much more
aggressive because of the high shares, caused by false backlog of the existing
tables. This translates into higher P99 latencies and reduced throughput.

Solution
========
This problem can be fixed, as proposed in the document "Fixing compaction
aggressiveness due to suboptimal definition of zero backlog by controller" [1],
by removing backlog of tiers that don't have to be compacted now, like a tier
that has a single file. That's about coupling the strategy goal with the
backlog definition. So once strategy becomes satisfied, so will the controller.

Low-efficiency compaction, like compacting 2 files only or cross-tier, only
happens when system is under little load and can proceed at a slower pace.
Once efficient jobs show up, ongoing compactions, even if inefficient, will get
more shares (as efficient jobs add to the backlog) so compaction won't fall
behind.

With this approach, throughput and latency is improved as cpu time is no longer
stolen (unnecessarily) from the foreground requests.

[1]: https://docs.google.com/document/d/1EQnXXGWg6z7VAwI4u8AaUX1vFduClaf6WOMt2wem5oQ

Fixes #4588.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-02-24 18:57:38 -03:00
Raphael S. Carvalho
ddd693c6d7 compaction_backlog_tracker: Batch changes through a new replacement interface
This new interface allows table to communicate multiple changes in the
SSTable set with a single call, which is useful on compaction completion
for example.
With this new interface, the size tiered backlog tracker will be able to
know when compaction completed, which will allow it to recompute tiers
and their backlog contribution, if any. Without it, tiered tracker
would have to recompute tiers for every change, which would be terribly
expensive.
The old remove/add interface are being removed in favor of the new one.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-02-24 15:34:16 -03:00
Avi Kivity
fcb8d040e8 treewide: use Software Package Data Exchange (SPDX) license identifiers
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.

Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.

The changes we applied mechanically with a script, except to
licenses/README.md.

Closes #9937
2022-01-18 12:15:18 +01:00
Botond Dénes
1ba19c2aa4 compaction/compaction_strategy: convert make_interposer_consumer() to v2
The underlying timestamp-based splitter is v2 already.
2022-01-07 13:51:59 +02:00
Asias He
a8ad385ecd repair: Get rid of the gc_grace_seconds
The gc_grace_seconds is a very fragile and broken design inherited from
Cassandra. Deleted data can be resurrected if cluster wide repair is not
performed within gc_grace_seconds. This design pushes the job of making
the database consistency to the user. In practice, it is very hard to
guarantee repair is performed within gc_grace_seconds all the time. For
example, repair workload has the lowest priority in the system which can
be slowed down by the higher priority workload, so that there is no
guarantee when a repair can finish. A gc_grace_seconds value that is
used to work might not work after data volume grows in a cluster. Users
might want to avoid running repair during a specific period where
latency is the top priority for their business.

To solve this problem, an automatic mechanism to protect data
resurrection is proposed and implemented. The main idea is to remove the
tombstone only after the range that covers the tombstone is repaired.

In this patch, a new table option tombstone_gc is added. The option is
used to configure tombstone gc mode. For example:

1) GC a tombstone after gc_grace_seconds

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'timeout'} ;

This is the default mode. If no tombstone_gc option is specified by the
user. The old gc_grace_seconds based gc will be used.

2) Never GC a tombstone

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'disabled'};

3) GC a tombstone immediately

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'immediate'};

4) GC a tombstone after repair

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair'};

In addition to the 'mode' option, another option 'propagation_delay_in_seconds'
is added. It defines the max time a write could possibly delay before it
eventually arrives at a node.

A new gossip feature TOMBSTONE_GC_OPTIONS is added. The new tombstone_gc
option can only be used after the whole cluster supports the new
feature. A mixed cluster works with no problem.

Tests: compaction_test.py, ninja test

Fixes #3560

[avi: resolve conflicts vs data_dictionary]
2022-01-04 19:48:14 +02:00
Raphael S. Carvalho
49f40c8791 compaction: Implement strategy control and wire it
This implements strategy control interface for both manager and
tests, and wire it.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-13 16:05:23 -03:00
Raphael S. Carvalho
9725e5efa9 compaction_strategy: kill unused can_compact_partial_runs()
This strategy method was introduced unnecessarily. We assume it was
going to be needed, but turns out it was never needed, not even
for ICS. Also it's built on a wrong assumption as an output
sstable run being generated can never be compacted in parallel
as the non-overlapping requirement can be easily broken.
LCS for example can allow parallel compaction on different runs
(levels) but correctness cannto be guaranteed with same runs
are compacted in parallel.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-03 12:20:51 -03:00
Raphael S. Carvalho
8d9704c030 compaction: LCS: kill needless include of database.hh
This is part of work for reducing compilation time and removing
layer violation in compaction.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211120042232.106651-1-raphaelsc@scylladb.com>
2021-11-20 18:28:55 +02:00
Raphael S. Carvalho
bb5a8682f3 compaction: stop including database.hh for compaction_strategy
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-11-09 11:29:47 -03:00
Raphael S. Carvalho
e2f6a47999 compaction: switch to table_state in estimated_pending_compactions()
Last method in compaction_strategy using table. From now on,
compaction strategy no longer works directly with table.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-11-09 11:25:28 -03:00
Raphael S. Carvalho
93ae9225f7 compaction: switch to table_state in compaction_strategy::get_major_compaction_job()
From now on, get_major_compaction_job() will use table_state instead of
a plain reference to table.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-11-09 11:25:22 -03:00
Raphael S. Carvalho
d881310b52 compaction: switch to table_state in compaction_strategy::get_sstables_for_compaction()
From now on, get_sstables_for_compaction() will use table_state.
With table_state, we avoid layer violations like strategy using
manager and also makes testing easier.

Compaction unit tests were temporarily disabled to avoid a giant
commit which is hard to parse.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-11-09 10:52:14 -03:00
Raphael S. Carvalho
9f2d2eee98 DTCS: reduce table dependency for task estimation
Similar to LCS, let's reduce table dependency in DTCS, to make it
easier to switch to table_state.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-11-09 10:50:29 -03:00
Avi Kivity
daf028210b build: enable -Winconsistent-missing-override warning
This warning can catch a virtual function that thinks it
overrides another, but doesn't, because the two functions
have different signatures. This isn't very likely since most
of our virtual functions override pure virtuals, but it's
still worth having.

Enable the warning and fix numerous violations.

Closes #9347
2021-09-15 12:55:54 +03:00
Benny Halevy
3ad0067272 date_tiered_manifest: get_now: fix use after free of sstable_list
The sstable_list is destroyed right after the temporary
lw_shared_ptr<sstable_list> returned from `cf.get_sstables()`
is dereferenced.

Fixes #9138

Test: unit(dev)
DTest: resharding_test.py:ReshardingTombstones_with_DateTieredCompactionStrategy.disable_tombstone_removal_during_reshard_test (debug)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210804075813.42526-1-bhalevy@scylladb.com>
2021-08-04 15:24:47 +03:00
Raphael S. Carvalho
1924e8d2b6 treewide: Move compaction code into a new top-level compaction dir
Since compaction is layered on top of sstables, let's move all compaction code
into a new top-level directory.
This change will give me extra motivation to remove all layer violations, like
sstable calling compaction-specific code, and compaction entanglement with
other components like table and storage service.

Next steps:
- remove all layer violations
- move compaction code in sstables namespace into a new one for compaction.
- move compaction unit tests into its own file

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210707194058.87060-1-raphaelsc@scylladb.com>
2021-07-07 23:21:51 +03:00