Commit Graph

44 Commits

Author SHA1 Message Date
Lakshmi Narayanan Sreethar
18c071c94b compaction: fix use after free when strategy is altered during compaction
The `compaction_strategy_state` class holds strategy specific state via
a `std::variant` containing different state types. When a compaction
strategy performs compaction, it retrieves a reference to its state from
the `compaction_strategy_state` object. If the table's compaction
strategy is ALTERed while a compaction is in progress, the
`compaction_strategy_state` object gets replaced, destroying the old
state. This leaves the ongoing compaction holding a dangling reference,
resulting in a use after free.

Fix this by using `seastar::shared_ptr` for the state variant
alternatives(`leveled_compaction_strategy_state_ptr` and
`time_window_compaction_strategy_state_ptr`). The compaction strategies
now hold a copy of the shared_ptr, ensuring the state remains valid for
the duration of the compaction even if the strategy is altered.

The `compaction_strategy_state` itself is still passed by reference and
only the variant alternatives use shared_ptrs. This allows ongoing
compactions to retain ownership of the state independently of the
wrapper's lifetime.

Fixes #25913

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-10-17 22:57:05 +05:30
Lakshmi Narayanan Sreethar
35159e5b02 compaction/twcs: pass compaction_strategy_state to internal methods
During TWCS compaction, multiple methods independently fetch the
compaction_strategy_state using get_state(). This can lead to
inconsistencies if the compaction strategy is ALTERed while the
compaction is in progress.

This patch fixes a part of this issue by passing down the state to the
lower level methods as parameters instead of fetching it repeatedly.

Refs #25913

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-10-17 21:26:30 +05:30
Botond Dénes
86ed627fc4 compaction: move code to namespace compaction
The namespace usage in this directory is very inconsistent, with files
and classes scattered in:
* global namespace
* namespace compaction
* namespace sstables

With cases, where all three used in the same file. This code used to
live in sstables/ and some of it still retains namespace sstables as a
heritage of that time. The mismatch between the dir (future module) and
the namespace used is confusing, so finish the migration and move all
code in compaction/ to namespace compaction too.

This patch, although large, is mechanic and only the following kind of
changes are made:
* replace namespace sstable {} with namespace compaction {}
* add namespace compaction {}
* drop/add sstables::
* drop/add compaction::
* move around forward-declarations so they are in the correct namespace
  context

This refactoring revealed some awkward leftover coupling between
sstables and compaction, in sstables/sstable_set.cc, where the
make_sstable_set() methods of compaction strategies are implemented.
2025-09-25 15:03:56 +03:00
Ernest Zaslavsky
5ba5aec1f8 treewide: Move mutation related files to a mutation directory
As requested in #22104, moved the files and fixed other includes and build system.

Moved files:
 - combine.hh
 - collection_mutation.hh
 - collection_mutation.cc
 - converting_mutation_partition_applier.hh
 - converting_mutation_partition_applier.cc
 - counters.hh
 - counters.cc
 - timestamp.hh

Fixes: #22104

This is a cleanup, no need to backport

Closes scylladb/scylladb#25085
2025-09-24 13:23:38 +03:00
Lakshmi Narayanan Sreethar
f2308a2ce5 compaction/twcs: use on_internal_error for invalid timestamp resolution
When the `time_window_compaction_strategy::to_timestamp_type` encounters
an invalid timestamp resolution it throws an `std::runtime_error`. This
patch replaces it with `on_internal_error()` and also logs the invalid
timestamp resolution for easier debugging.

Refs #25913

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#26138
2025-09-24 09:14:16 +03:00
Raphael S. Carvalho
9d3755f276 replica: Futurize retrieval of sstable sets in compaction_group_view
This will allow upcoming work to gently produce a sstable set for
each compaction group view. Example: repaired and unrepaired.

Locking strategy for compaction's sstable selection:
Since sstable retrieval path became futurized, tasks in compaction
manager will now hold the write lock (compaction_state::lock)
when retrieving the sstable list, feeding them into compaction
strategy, and finally registering selected sstables as compacting.
The last step prevents another concurrent task from picking the
same sstable. Previously, all those steps were atomic, but
we have seen stall in that area in large installations, so
futurization of that area would come sooner or later.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-08-08 06:58:00 +03:00
Raphael S. Carvalho
20c3301a1a treewide: Futurize estimation of pending compaction tasks
This is to allow futurization of compaction_group_view method that
retrieves sstable set.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-08-08 06:51:29 +03:00
Raphael S. Carvalho
2c4a9ba70c treewide: Rename table_state to compaction_group_view
Since table_state is a view to a compaction group, it makes sense
to rename it as so.

With upcoming incremental repair, each replica::compaction_group
will be actually two compaction groups, so there will be two
views for each replica::compaction_group.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-08-08 06:51:28 +03:00
Botond Dénes
efc48caea5 readers/mutation_reader: s/reader_consumer_v2/mutation_reader_consumer/ 2025-05-09 07:53:29 -04:00
Raphael S. Carvalho
21d1e78457 compaction: Wire table_state into make_sstable_set()
This will be useful for feeding token range owned by compaction group
into sstable set.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-04-29 15:47:33 -03:00
Avi Kivity
f3eade2f62 treewide: relicense to ScyllaDB-Source-Available-1.0
Drop the AGPL license in favor of a source-available license.
See the blog post [1] for details.

[1] https://www.scylladb.com/2024/12/18/why-were-moving-to-a-source-available-license/
2024-12-18 17:45:13 +02:00
Benny Halevy
284dbc51c3 time_window_compaction_strategy: estimated_pending_compactions: reestimate compactions rather than using cached value
Currently, `estimated_pending_compactions` uses a precalculated value
calculated by `update_estimated_compaction_by_tasks`, which, in turn,
is called by `get_compaction_candidates`.  That means that, if
`estimated_pending_compactions` is called, e.g. right after
major compaction, it will return an outdated value that was
calculated prior to major compaction, and so, it is no longer
relevant.

Instead, just recalculate the value in `estimated_pending_compactions`
and drop `update_estimated_compaction_by_tasks`.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-10-07 10:15:19 +03:00
Raphael S. Carvalho
ace4e5111e compaction: Reduce twcs off-strategy space overhead to 10% of free space
TWCS off-strategy suffers with 100% space overhead, so a big TWCS table
can cause scylla to run out of disk space during node ops.

To not penalize TWCS tables, that take a small percentage of disk,
with increased write ampl, TWCS off-strategy will be restricted to
10% of free disk space. Then small tables can still compact all
disjoint sstables in a single round.

Fixes #16514.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-06-13 13:06:51 -03:00
Raphael S. Carvalho
0ce8ee03f1 compaction: wire storage free space into reshape procedure
After this, TWCS reshape procedure can be changed to limit job
to 10% of available space.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-06-13 12:53:27 -03:00
Raphael S. Carvalho
157a5c4b1b treewide: Avoid using namespace sstables in header to avoid conflicts
That's needed for compaction_group.hh to be included in headers.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-11-23 17:36:57 +02:00
Raphael S. Carvalho
b551f4abd2 streaming: Improve partition estimation with TWCS
When off-strategy is disabled, data segregation is not postponed,
meaning that getting partition estimate right is important to
decrease filter's false positives. With streaming, we don't
have min and max timestamps at destination, well, we could have
extended the RPC verb to send them, but turns out we can deduce
easily the amount of windows using default TTL. Given partitioner
random nature, it's not absurd to assume that a given range being
streamed may overlap with all windows, meaning that each range
will yield one sstable for each window when segregating incoming
data. Today, we assume the worst of 100 windows (which is the
max amount of sstables the input data can be segregated into)
due to the lack of metadata for estimating the window count.
But given that users are recommended to target a max of ~20
windows, it means partition estimate is being downsized 5x more
than needed. Let's improve it by using default TTL when
estimating window count, so even on absence of timestamp
metadata, the partition estimation won't be way off.

Fixes #15704.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-11-08 12:10:03 +02:00
Raphael S. Carvalho
8997fe0625 compaction: Switch to strategy_control::candidates() for regular compaction
Now everything is prepared for the switch, let's do it.

Now let's wait for ICS to enjoy the set of changes.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-09-25 17:18:21 -03:00
Aleksandra Martyniuk
44744d6229 compaction: validate options used in different compaction strategies
For each compaction strategy, validate whether options values are valid.
2023-09-13 16:59:40 +02:00
Aleksandra Martyniuk
c8c3c0e6a6 compaction: split time_window_compaction_strategy_options construtor
Split time_window_compaction_strategy_options constructor into
functions that will be reused for validation.
2023-09-13 16:59:40 +02:00
Aleksandra Martyniuk
a01dd1351e compaction: add validate method to compaction_strategy_options
Add temporarily empty validate method to compaction_strategy_options.
The method will validate the options and help determining whether
only the allowed options were set.
2023-09-13 16:59:40 +02:00
Benny Halevy
e5cf6f0897 time_window_compaction_strategy_options: make copy and move-able
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-09-13 16:59:40 +02:00
Kefu Chai
bab16eb30e treewide: remove #includes not use directly
for faster build times and clear inter-module dependencies, we
should not #includes headers not directly used. instead, we should
only #include the headers directly used by a certain compilation
unit.

in this change, the source files under "/compaction" directories
are checked using clangd, which identifies the cases where we have
an #include which is not directly used. all the #includes identified
by clangd are removed. because some source files rely on the incorrectly
included header file, those ones are updated to #include the header
file they directly use.

if a forward declaration suffice, the declaration is added instead.

see also https://clangd.llvm.org/guides/include-cleaner#unused-include-warning

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-18 17:36:31 +08:00
Pavel Emelyanov
66e43912d6 code: Switch to seastar API level 7
In that level no io_priority_class-es exist. Instead, all the IO happens
in the context of current sched-group. File API no longer accepts prio
class argument (and makes io_intent arg mandatory to impls).

So the change consists of
- removing all usage of io_priority_class
- patching file_impl's inheritants to updated API
- priority manager goes away altogether
- IO bandwidth update is performed on respective sched group
- tune-up scylla-gdb.py io_queues command

The first change is huge and was made semi-autimatically by:
- grep io_priority_class | default_priority_class
- remove all calls, found methods' args and class' fields

Patching file_impl-s is smaller, but also mechanical:
- replace io_priority_class& argument with io_intent* one
- pass intent to lower file (if applicatble)

Dropping the priority manager is:
- git-rm .cc and .hh
- sed out all the #include-s
- fix configure.py and cmakefile

The scylla-gdb.py update is a bit hairry -- it needs to use task queues
list for IO classes names and shares, but to detect it should it checks
for the "commitlog" group is present.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #13963
2023-06-06 13:29:16 +03:00
Raphael S. Carvalho
a47bac931c Move TWCS option from table into TWCS itself
enable_optimized_twcs_queries is specific to TWCS, therefore it
belongs to TWCS, not replica::table.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #13489
2023-04-14 08:28:16 +03:00
Raphael S. Carvalho
989afbf83b compaction: TWCS: wire up compaction_strategy_state
TWCS no longer keeps internal state, and will now rely on state
managed by each compaction group through compaction::table_state.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-03-28 08:48:15 -03:00
Raphael S. Carvalho
e2f38baa92 compaction: TWCS: prepare for stateless strategy
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-03-27 13:40:01 -03:00
Raphael S. Carvalho
017f432b8f compaction: TWCS: extract state into a separate struct
This is a step towards decoupling compaction strategy (computation)
and its state. Making the former stateless.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-03-27 13:38:47 -03:00
Raphael S. Carvalho
232e71f2cf compaction: add const-qualifier to a few compaction_strategy methods
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-03-27 11:13:10 -03:00
Pavel Emelyanov
9bdea110a6 code: Reduce fanout of sstables(_manager)?.hh over headers
This change removes sstables.hh from some other headers replacing it
with version.hh and shared_sstable.hh. Also this drops
sstables_manager.hh from some more headers, because this header
propagates sstables.hh via self. That change is pretty straightforward,
but has a recochet in database.hh that needs disk-error-handler.hh.

Without the patch touch sstables/sstable.hh results in 409 targets
recompillation, with the patch -- 299 targets.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #12222
2022-12-07 14:34:19 +02:00
Raphael S. Carvalho
b88acffd66 replica: Allow one compaction_backlog_tracker for each compaction_group
Today, compaction_backlog_tracker is managed in each compaction_strategy
implementation. So every compaction strategy is managing its own
tracker and providing a reference to it through get_backlog_tracker().

But this prevents each group from having its own tracker, because
there's only a single compaction_strategy instance per table.
To remove this limitation, compaction_strategy impl will no longer
manage trackers but will instead provide an interface for trackers
to be created, such that each compaction group will be allowed to
have its own tracker, which will be managed by compaction manager.

On compaction strategy change, table will update each group with
the new tracker, which is created using the previously introduced
ompaction_group_sstable_set_updater.

Now table's backlog will be the sum of all compaction_group backlogs.
The normalization factor is applied on the sum, so we don't have
to adjust each individual backlog to any factor.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-11-11 09:22:51 -03:00
Avi Kivity
5937b1fa23 treewide: remove empty comments in top-of-files
After fcb8d040 ("treewide: use Software Package Data Exchange
(SPDX) license identifiers"), many dual-licensed files were
left with empty comments on top. Remove them to avoid visual
noise.

Closes #10562
2022-05-13 07:11:58 +02:00
Raphael S. Carvalho
568bb40127 compaction: TWCS: Implement cleanup method for bucket awareness
This continues the work in a69d98c3d0,
by implementing the cleanup method in TWCS to make it bucket aware.
Till now, the default impl was used which cleanups on file at a
time, starting from the smallest.

The cleanup strategy for TWCS is simple. It's simply calling the
size tiered cleanup method for each bucket, so there will be
one job for each tier in each window.

The next strategies to receive this improvement are LCS and ICS
(the latter one being only available in enterprise).

Refs #10097.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-03-29 09:52:06 -03:00
Raphael S. Carvalho
8f4c04c38a compaction: TWCS: change get_buckets() signature to work with const qualified functions
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-03-29 09:49:14 -03:00
Avi Kivity
fcb8d040e8 treewide: use Software Package Data Exchange (SPDX) license identifiers
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.

Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.

The changes we applied mechanically with a script, except to
licenses/README.md.

Closes #9937
2022-01-18 12:15:18 +01:00
Botond Dénes
1ba19c2aa4 compaction/compaction_strategy: convert make_interposer_consumer() to v2
The underlying timestamp-based splitter is v2 already.
2022-01-07 13:51:59 +02:00
Asias He
a8ad385ecd repair: Get rid of the gc_grace_seconds
The gc_grace_seconds is a very fragile and broken design inherited from
Cassandra. Deleted data can be resurrected if cluster wide repair is not
performed within gc_grace_seconds. This design pushes the job of making
the database consistency to the user. In practice, it is very hard to
guarantee repair is performed within gc_grace_seconds all the time. For
example, repair workload has the lowest priority in the system which can
be slowed down by the higher priority workload, so that there is no
guarantee when a repair can finish. A gc_grace_seconds value that is
used to work might not work after data volume grows in a cluster. Users
might want to avoid running repair during a specific period where
latency is the top priority for their business.

To solve this problem, an automatic mechanism to protect data
resurrection is proposed and implemented. The main idea is to remove the
tombstone only after the range that covers the tombstone is repaired.

In this patch, a new table option tombstone_gc is added. The option is
used to configure tombstone gc mode. For example:

1) GC a tombstone after gc_grace_seconds

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'timeout'} ;

This is the default mode. If no tombstone_gc option is specified by the
user. The old gc_grace_seconds based gc will be used.

2) Never GC a tombstone

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'disabled'};

3) GC a tombstone immediately

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'immediate'};

4) GC a tombstone after repair

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair'};

In addition to the 'mode' option, another option 'propagation_delay_in_seconds'
is added. It defines the max time a write could possibly delay before it
eventually arrives at a node.

A new gossip feature TOMBSTONE_GC_OPTIONS is added. The new tombstone_gc
option can only be used after the whole cluster supports the new
feature. A mixed cluster works with no problem.

Tests: compaction_test.py, ninja test

Fixes #3560

[avi: resolve conflicts vs data_dictionary]
2022-01-04 19:48:14 +02:00
Raphael S. Carvalho
8eace8fc49 TWCS: Make sure major on past window is done on all its sstables
Once current window is sealed, TWCS is supposed to compact all its
sstables into one. If there's ongoing compaction, it can happen that
sstables are missed and therefore past windows will contain more than
one sstable. Additionally, it could happen that major doesn't happen
at all if under heavy load. All these problems are fixed by serializing
major on past window and also postponing it if manager refuses to run
the job now.

Fixes #9553.

Reviewed-by: Benny Halevy <bhalevy@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-13 16:10:43 -03:00
Raphael S. Carvalho
2dc890d8e6 TWCS: remove needless param for STCS options
STCS option can be retrieved from class member, as newest_bucket()
is no longer a static function. let's get rid of it.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-13 16:05:40 -03:00
Raphael S. Carvalho
41a5736aaf TWCS: kill unused param in newest_bucket()
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-13 16:05:36 -03:00
Raphael S. Carvalho
49f40c8791 compaction: Implement strategy control and wire it
This implements strategy control interface for both manager and
tests, and wire it.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-13 16:05:23 -03:00
Raphael S. Carvalho
e2f6a47999 compaction: switch to table_state in estimated_pending_compactions()
Last method in compaction_strategy using table. From now on,
compaction strategy no longer works directly with table.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-11-09 11:25:28 -03:00
Raphael S. Carvalho
d881310b52 compaction: switch to table_state in compaction_strategy::get_sstables_for_compaction()
From now on, get_sstables_for_compaction() will use table_state.
With table_state, we avoid layer violations like strategy using
manager and also makes testing easier.

Compaction unit tests were temporarily disabled to avoid a giant
commit which is hard to parse.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-11-09 10:52:14 -03:00
Avi Kivity
daf028210b build: enable -Winconsistent-missing-override warning
This warning can catch a virtual function that thinks it
overrides another, but doesn't, because the two functions
have different signatures. This isn't very likely since most
of our virtual functions override pure virtuals, but it's
still worth having.

Enable the warning and fix numerous violations.

Closes #9347
2021-09-15 12:55:54 +03:00
Raphael S. Carvalho
1924e8d2b6 treewide: Move compaction code into a new top-level compaction dir
Since compaction is layered on top of sstables, let's move all compaction code
into a new top-level directory.
This change will give me extra motivation to remove all layer violations, like
sstable calling compaction-specific code, and compaction entanglement with
other components like table and storage service.

Next steps:
- remove all layer violations
- move compaction code in sstables namespace into a new one for compaction.
- move compaction unit tests into its own file

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210707194058.87060-1-raphaelsc@scylladb.com>
2021-07-07 23:21:51 +03:00