40 Commits

Author SHA1 Message Date
Botond Dénes
86ed627fc4 compaction: move code to namespace compaction
The namespace usage in this directory is very inconsistent, with files
and classes scattered in:
* global namespace
* namespace compaction
* namespace sstables

With cases, where all three used in the same file. This code used to
live in sstables/ and some of it still retains namespace sstables as a
heritage of that time. The mismatch between the dir (future module) and
the namespace used is confusing, so finish the migration and move all
code in compaction/ to namespace compaction too.

This patch, although large, is mechanic and only the following kind of
changes are made:
* replace namespace sstable {} with namespace compaction {}
* add namespace compaction {}
* drop/add sstables::
* drop/add compaction::
* move around forward-declarations so they are in the correct namespace
  context

This refactoring revealed some awkward leftover coupling between
sstables and compaction, in sstables/sstable_set.cc, where the
make_sstable_set() methods of compaction strategies are implemented.
2025-09-25 15:03:56 +03:00
Raphael S. Carvalho
9d3755f276 replica: Futurize retrieval of sstable sets in compaction_group_view
This will allow upcoming work to gently produce a sstable set for
each compaction group view. Example: repaired and unrepaired.

Locking strategy for compaction's sstable selection:
Since sstable retrieval path became futurized, tasks in compaction
manager will now hold the write lock (compaction_state::lock)
when retrieving the sstable list, feeding them into compaction
strategy, and finally registering selected sstables as compacting.
The last step prevents another concurrent task from picking the
same sstable. Previously, all those steps were atomic, but
we have seen stall in that area in large installations, so
futurization of that area would come sooner or later.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-08-08 06:58:00 +03:00
Raphael S. Carvalho
20c3301a1a treewide: Futurize estimation of pending compaction tasks
This is to allow futurization of compaction_group_view method that
retrieves sstable set.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-08-08 06:51:29 +03:00
Raphael S. Carvalho
2c4a9ba70c treewide: Rename table_state to compaction_group_view
Since table_state is a view to a compaction group, it makes sense
to rename it as so.

With upcoming incremental repair, each replica::compaction_group
will be actually two compaction groups, so there will be two
views for each replica::compaction_group.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-08-08 06:51:28 +03:00
Botond Dénes
efc48caea5 readers/mutation_reader: s/reader_consumer_v2/mutation_reader_consumer/ 2025-05-09 07:53:29 -04:00
Raphael S. Carvalho
21d1e78457 compaction: Wire table_state into make_sstable_set()
This will be useful for feeding token range owned by compaction group
into sstable set.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-04-29 15:47:33 -03:00
Avi Kivity
f3eade2f62 treewide: relicense to ScyllaDB-Source-Available-1.0
Drop the AGPL license in favor of a source-available license.
See the blog post [1] for details.

[1] https://www.scylladb.com/2024/12/18/why-were-moving-to-a-source-available-license/
2024-12-18 17:45:13 +02:00
Kefu Chai
d81ed5adb4 compaction: explain make_interpose_consumer() in compaction strategy
Add documentation to clarify the purpose and behavior of
make_interpose_consumer() in the compaction_strategy_impl class. This
method is crucial for building layered processing pipelines but its
semantics were previously undocumented.

The added documentation explains how:
- It decorates end consumers with additional processing steps
- It enables construction of processing pipelines
- The original consumer's semantics are preserved

This improves code maintainability by making the pipeline construction
pattern more apparent to developers.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21336
2024-10-30 13:22:00 +03:00
Raphael S. Carvalho
0ce8ee03f1 compaction: wire storage free space into reshape procedure
After this, TWCS reshape procedure can be changed to limit job
to 10% of available space.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-06-13 12:53:27 -03:00
Ferenc Szili
b50a9f9bab removed forward declaration of resharding_descriptor
resharding_descriptor has been removed in e40aa042 in 2020
2024-03-22 11:35:10 +01:00
Ferenc Szili
5a65169f46 compaction: implement unchecked_tombstone_compaction
This change adds the missing Cassandra compaction option unchecked_tombstone_compaction.
Setting this option to true causes the compaction to ignore tombstone_threshold,
and decide whether to do a compaction only on the value of tombstone_compaction_interval
2024-03-22 11:19:43 +01:00
Kefu Chai
eb9216ef11 compaction: do not include unused headers
these unused includes were identified by clangd. see
https://clangd.llvm.org/guides/include-cleaner#unused-include-warning
for more details on the "Unused include" warning.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16707
2024-01-10 11:07:36 +02:00
Raphael S. Carvalho
b551f4abd2 streaming: Improve partition estimation with TWCS
When off-strategy is disabled, data segregation is not postponed,
meaning that getting partition estimate right is important to
decrease filter's false positives. With streaming, we don't
have min and max timestamps at destination, well, we could have
extended the RPC verb to send them, but turns out we can deduce
easily the amount of windows using default TTL. Given partitioner
random nature, it's not absurd to assume that a given range being
streamed may overlap with all windows, meaning that each range
will yield one sstable for each window when segregating incoming
data. Today, we assume the worst of 100 windows (which is the
max amount of sstables the input data can be segregated into)
due to the lack of metadata for estimating the window count.
But given that users are recommended to target a max of ~20
windows, it means partition estimate is being downsized 5x more
than needed. Let's improve it by using default TTL when
estimating window count, so even on absence of timestamp
metadata, the partition estimation won't be way off.

Fixes #15704.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-11-08 12:10:03 +02:00
Raphael S. Carvalho
fded314e46 sstables: Fix update of tombstone GC settings to have immediate effect
After "repair: Get rid of the gc_grace_seconds", the sstable's schema (mode,
gc period if applicable, etc) is used to estimate the amount of droppable
data (or determine full expiration = max_deletion_time < gc_before).
It could happen that the user switched from timeout to repair mode, but
sstables will still use the old mode, despite the user asked for a new one.
Another example is when you play with value of grace period, to prevent
data resurrection if repair won't be able to run in a timely manner.
The problem persists until all sstables using old GC settings are recompacted
or node is restarted.
To fix this, we have to feed latest schema into sstable procedures used
for expiration purposes.

Fixes #15643.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#15746
2023-10-19 16:27:59 +03:00
Raphael S. Carvalho
8997fe0625 compaction: Switch to strategy_control::candidates() for regular compaction
Now everything is prepared for the switch, let's do it.

Now let's wait for ICS to enjoy the set of changes.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-09-25 17:18:21 -03:00
Aleksandra Martyniuk
ac08b57555 compaction: cql3: validate options in check_restricted_table_properties
Check whether valid compaction strategy options are set for the given
strategy type in check_restricted_table_properties.
2023-09-13 16:59:40 +02:00
Aleksandra Martyniuk
0ed39af221 compaction: validate common compaction strategy options
Add compaction_strategy_impl::validate_options to validate common
compaction strategy options.
2023-09-13 16:59:40 +02:00
Aleksandra Martyniuk
a2e6081984 compaction: split compaction_strategy_impl constructor
Split compaction_strategy_impl constructor into methods that will
be reused for validation.

Add additional checks providing that options' values are legal.
2023-09-13 16:59:40 +02:00
Aleksandra Martyniuk
84fd90e472 compaction: add method to validate min and max threshold
Add compaction_strategy_impl::validate_min_max_threshold method
that will be used to validate min and max threshold values
for different compaction methods.
2023-09-13 16:59:40 +02:00
Aleksandra Martyniuk
702c19f941 compaction: make compaction strategy keys static constexpr 2023-09-13 16:59:40 +02:00
Pavel Emelyanov
66e43912d6 code: Switch to seastar API level 7
In that level no io_priority_class-es exist. Instead, all the IO happens
in the context of current sched-group. File API no longer accepts prio
class argument (and makes io_intent arg mandatory to impls).

So the change consists of
- removing all usage of io_priority_class
- patching file_impl's inheritants to updated API
- priority manager goes away altogether
- IO bandwidth update is performed on respective sched group
- tune-up scylla-gdb.py io_queues command

The first change is huge and was made semi-autimatically by:
- grep io_priority_class | default_priority_class
- remove all calls, found methods' args and class' fields

Patching file_impl-s is smaller, but also mechanical:
- replace io_priority_class& argument with io_intent* one
- pass intent to lower file (if applicatble)

Dropping the priority manager is:
- git-rm .cc and .hh
- sed out all the #include-s
- fix configure.py and cmakefile

The scylla-gdb.py update is a bit hairry -- it needs to use task queues
list for IO classes names and shares, but to detect it should it checks
for the "commitlog" group is present.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #13963
2023-06-06 13:29:16 +03:00
Benny Halevy
73280c0a15 compaction: refactor compaction_fwd.hh out of compaction_descriptor.hh
So it can be used in the next patch that will refactor
compaction_state out of class compaction_manager.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-04-10 23:19:04 +03:00
Raphael S. Carvalho
1ffe2f04ef compaction: add table_state param to compaction_strategy::notify_completion()
once compaction_strategy is made staless, the state must be retrieved
in notify_completion() through table_state.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-03-27 13:40:02 -03:00
Raphael S. Carvalho
232e71f2cf compaction: add const-qualifier to a few compaction_strategy methods
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-03-27 11:13:10 -03:00
Raphael S. Carvalho
b88acffd66 replica: Allow one compaction_backlog_tracker for each compaction_group
Today, compaction_backlog_tracker is managed in each compaction_strategy
implementation. So every compaction strategy is managing its own
tracker and providing a reference to it through get_backlog_tracker().

But this prevents each group from having its own tracker, because
there's only a single compaction_strategy instance per table.
To remove this limitation, compaction_strategy impl will no longer
manage trackers but will instead provide an interface for trackers
to be created, such that each compaction group will be allowed to
have its own tracker, which will be managed by compaction manager.

On compaction strategy change, table will update each group with
the new tracker, which is created using the previously introduced
ompaction_group_sstable_set_updater.

Now table's backlog will be the sum of all compaction_group backlogs.
The normalization factor is applied on the sum, so we don't have
to adjust each individual backlog to any factor.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-11-11 09:22:51 -03:00
Raphael S. Carvalho
fedafd76eb compaction: kill undefined get_unimplemented_backlog_tracker()
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-11-11 09:17:37 -03:00
Benny Halevy
572d534d0d sstables: get_gc_before_*: get tombstone_gc_state from caller
Pass the tombstone_gc_state from the compaction_strategy
to sstables get_gc_before_* functions using the table state
to get to the tombstone_gc_state.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-06 23:05:39 +03:00
Benny Halevy
e3f561db31 compaction_manager: major_compaction_task: run in maintenance scheduling groupt
We should separate the scheduling groups used for major compaction
from the the regular compaction scheduling group so that
the latter can be affected by the backlog tracker in case
backlog accumulates during a long running major compaction.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-06 18:18:45 +03:00
Raphael S. Carvalho
2a9bfa3e3f compaction_strategy: get_cleanup_compaction_jobs: accept candidates by value
Then caller can decide whether to copy or move candidate set into the
function. cleanup_sstables_compaction_task can move candidates as
it's no longer needed once it retrieves all descriptors.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-03-29 09:49:13 -03:00
Raphael S. Carvalho
44e9e10414 compaction_strategy: Allow strategies to define their own cleanup strategy
Today, all compaction strategies will clean up their files using the
incremental approach of one sstable being rewritten at a time.

Turns out that's not the best approach performance wise. Let's take
STCS for example. As cleanup finishes rewriting one file, the output
file is placed into the sstable set. Regular now can compact that
file with another that was already there (e.g. produced by flush after
cleanup started). Inefficient compactions like this can keep happening
as cleanup incrementally places output file into the candidate list
for regular.

This method will allow strategies to clean up their files in batches.
For example, STCS can clean up all files in smallest tiers in single
round, allowing the output data to be added at once. So next compaction
rounds can be more efficient in terms of writeamp. Another benefit is
that deduplication and GC can happen more efficiently.

The drawback is the space requirement, as we no longer compact one file
a a time. However, the impact is minimized by cleaning up the smallest
tier first. With leveled strategy for example, even though 90% of data
is in highest level, the space requirement is not a problem because
we can apply the incremental compaction on its behalf. The same applies
to ICS. With STCS, the requirement is the size of the tier being
compacted, but that's already expected by its users anyway.

By the time being, all strategies have it unimplemented. so they still
use the old behavior where files are rewritten on at a time.
This will allow us to incrementally implement the cleanup method for
all compaction strategies.

Refs #10097.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-03-23 00:04:03 -03:00
Avi Kivity
fcb8d040e8 treewide: use Software Package Data Exchange (SPDX) license identifiers
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.

Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.

The changes we applied mechanically with a script, except to
licenses/README.md.

Closes #9937
2022-01-18 12:15:18 +01:00
Botond Dénes
1ba19c2aa4 compaction/compaction_strategy: convert make_interposer_consumer() to v2
The underlying timestamp-based splitter is v2 already.
2022-01-07 13:51:59 +02:00
Asias He
a8ad385ecd repair: Get rid of the gc_grace_seconds
The gc_grace_seconds is a very fragile and broken design inherited from
Cassandra. Deleted data can be resurrected if cluster wide repair is not
performed within gc_grace_seconds. This design pushes the job of making
the database consistency to the user. In practice, it is very hard to
guarantee repair is performed within gc_grace_seconds all the time. For
example, repair workload has the lowest priority in the system which can
be slowed down by the higher priority workload, so that there is no
guarantee when a repair can finish. A gc_grace_seconds value that is
used to work might not work after data volume grows in a cluster. Users
might want to avoid running repair during a specific period where
latency is the top priority for their business.

To solve this problem, an automatic mechanism to protect data
resurrection is proposed and implemented. The main idea is to remove the
tombstone only after the range that covers the tombstone is repaired.

In this patch, a new table option tombstone_gc is added. The option is
used to configure tombstone gc mode. For example:

1) GC a tombstone after gc_grace_seconds

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'timeout'} ;

This is the default mode. If no tombstone_gc option is specified by the
user. The old gc_grace_seconds based gc will be used.

2) Never GC a tombstone

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'disabled'};

3) GC a tombstone immediately

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'immediate'};

4) GC a tombstone after repair

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair'};

In addition to the 'mode' option, another option 'propagation_delay_in_seconds'
is added. It defines the max time a write could possibly delay before it
eventually arrives at a node.

A new gossip feature TOMBSTONE_GC_OPTIONS is added. The new tombstone_gc
option can only be used after the whole cluster supports the new
feature. A mixed cluster works with no problem.

Tests: compaction_test.py, ninja test

Fixes #3560

[avi: resolve conflicts vs data_dictionary]
2022-01-04 19:48:14 +02:00
Raphael S. Carvalho
49f40c8791 compaction: Implement strategy control and wire it
This implements strategy control interface for both manager and
tests, and wire it.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-13 16:05:23 -03:00
Raphael S. Carvalho
9725e5efa9 compaction_strategy: kill unused can_compact_partial_runs()
This strategy method was introduced unnecessarily. We assume it was
going to be needed, but turns out it was never needed, not even
for ICS. Also it's built on a wrong assumption as an output
sstable run being generated can never be compacted in parallel
as the non-overlapping requirement can be easily broken.
LCS for example can allow parallel compaction on different runs
(levels) but correctness cannto be guaranteed with same runs
are compacted in parallel.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-03 12:20:51 -03:00
Raphael S. Carvalho
bb5a8682f3 compaction: stop including database.hh for compaction_strategy
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-11-09 11:29:47 -03:00
Raphael S. Carvalho
e2f6a47999 compaction: switch to table_state in estimated_pending_compactions()
Last method in compaction_strategy using table. From now on,
compaction strategy no longer works directly with table.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-11-09 11:25:28 -03:00
Raphael S. Carvalho
93ae9225f7 compaction: switch to table_state in compaction_strategy::get_major_compaction_job()
From now on, get_major_compaction_job() will use table_state instead of
a plain reference to table.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-11-09 11:25:22 -03:00
Raphael S. Carvalho
d881310b52 compaction: switch to table_state in compaction_strategy::get_sstables_for_compaction()
From now on, get_sstables_for_compaction() will use table_state.
With table_state, we avoid layer violations like strategy using
manager and also makes testing easier.

Compaction unit tests were temporarily disabled to avoid a giant
commit which is hard to parse.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-11-09 10:52:14 -03:00
Raphael S. Carvalho
1924e8d2b6 treewide: Move compaction code into a new top-level compaction dir
Since compaction is layered on top of sstables, let's move all compaction code
into a new top-level directory.
This change will give me extra motivation to remove all layer violations, like
sstable calling compaction-specific code, and compaction entanglement with
other components like table and storage service.

Next steps:
- remove all layer violations
- move compaction code in sstables namespace into a new one for compaction.
- move compaction unit tests into its own file

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210707194058.87060-1-raphaelsc@scylladb.com>
2021-07-07 23:21:51 +03:00