115 Commits

Author SHA1 Message Date
Michał Chojnowski
1cfce430f1 replica/table: keep track of total pre-compression file size
Every table and sstable set keeps track of the total file size
of contained sstables.

Due to a feature request, we also want to keep track of the hypothetical
file size if Data files were uncompressed, to add a metric that
shows the compression ratio of sstables.

We achieve this by replacing the relevant `uint_64 bytes_on_disk`
counters everywhere with a struct that contains both the actual
(post-compression) size and the hypothetical pre-compression size.

This patch isn't supposed to change any observable behavior.
In the next patch, we will use these changes to add a new metric.
2025-11-13 00:49:57 +01:00
Botond Dénes
9c85046f93 sstables,compaction: move compaction exceptions to compaction/
sstables/exceptions.hh still hosts some compaction specific exception
types. Move them over to the new compaction/exceptions.hh, to make the
compaction module more self-contained.
2025-09-29 06:49:14 +03:00
Botond Dénes
86ed627fc4 compaction: move code to namespace compaction
The namespace usage in this directory is very inconsistent, with files
and classes scattered in:
* global namespace
* namespace compaction
* namespace sstables

With cases, where all three used in the same file. This code used to
live in sstables/ and some of it still retains namespace sstables as a
heritage of that time. The mismatch between the dir (future module) and
the namespace used is confusing, so finish the migration and move all
code in compaction/ to namespace compaction too.

This patch, although large, is mechanic and only the following kind of
changes are made:
* replace namespace sstable {} with namespace compaction {}
* add namespace compaction {}
* drop/add sstables::
* drop/add compaction::
* move around forward-declarations so they are in the correct namespace
  context

This refactoring revealed some awkward leftover coupling between
sstables and compaction, in sstables/sstable_set.cc, where the
make_sstable_set() methods of compaction strategies are implemented.
2025-09-25 15:03:56 +03:00
Aleksandra Martyniuk
20f55ea1b8 compaction: handle exception in expected_total_workload
expected_total_workload methods of scrub compaction tasks create
a vector of table_info based on table names. If any table was already
dropped, then the exception is thrown. It leaves table_info in corrupted
state and node crashes with `free(): invalid size`.

Return std::nullopt if an exception was thrown to indicate that
total workload cannot be found.

Fixes: #25941.
2025-09-10 12:13:37 +02:00
Botond Dénes
6116f9e11b Merge 'Compaction tasks progress' from Aleksandra Martyniuk
Determine the progress of compaction tasks that have
children.

The progress of a compaction task is calculated using the default
get_progress method. If the expected_total_workload method is
implemented, the default progress is computed as:
(sum of child task progresses) / (expected total workload)

If expected_total_workload is not defined, progress is estimated based
on children progresses. However, in this case, the total progress may
increase over time as the task executes.

All compaction tasks, except for reshape tasks, implement the
expected_children_number method. To compute expected_total_workload,
iterate over all SSTables covered by the task and sum their sizes. Note
that expected_total_workload is just an approximation and the real workload
may differ if SStables set for the keyspace/table/compaction group changes.

Reshape tasks are an exception, as their scope is determined during
execution. Hence, for these tasks expected_total_workload isn't defined
and their progress (both total and completed) is determined based
on currently created children.

Fixes: https://github.com/scylladb/scylladb/issues/8392.
Fixes: https://github.com/scylladb/scylladb/issues/6406.
Fixes: https://github.com/scylladb/scylladb/issues/7845.

New feature, no backport needed

Closes scylladb/scylladb#15158

* github.com:scylladb/scylladb:
  test: add compaction task progress test
  compaction: set progress unit for compaction tasks
  compaction: find expected workload for reshard tasks
  compaction: find expected workload for global cleanup compaction tasks
  compaction: find expected workload for global major compaction tasks
  compaction: find expected workload for keyspace compaction tasks
  compaction: find expected workload for shard compaction tasks
  compaction: find expected workload for table compaction tasks
  compaction: return empty progress when compaction_size isn't set
  compaction: update compaction_data::compaction_size at once
  tasks: do not check expected workload for done task
2025-09-03 13:23:42 +03:00
Aleksandra Martyniuk
0844a057d1 compaction: use current_task_type 2025-08-29 17:08:00 +02:00
Aleksandra Martyniuk
773ae73704 test: add compaction task progress test 2025-08-28 12:10:13 +02:00
Aleksandra Martyniuk
046742bd18 compaction: find expected workload for reshard tasks
Find expected workload in bytes of reshard tasks.

The workload of table_resharding_compaction_task_impl is found
at the beginning of its execution. Before that, expected_total_workload()
returns std::nullopt, which means that the progress for this task
won't be shown.
2025-08-28 12:10:13 +02:00
Aleksandra Martyniuk
a2381380f2 compaction: find expected workload for global cleanup compaction tasks
Sum bytes of all sstables of all non local vnode keyspaces.
2025-08-28 12:10:13 +02:00
Aleksandra Martyniuk
cc9e342cb6 compaction: find expected workload for global major compaction tasks
Sum bytes of all sstables of all keyspaces.
2025-08-28 12:10:13 +02:00
Aleksandra Martyniuk
b12fc50de7 compaction: find expected workload for keyspace compaction tasks
Add compaction_task_impl::get_keyspace_task_workload that sums
the bytes in all sstables of this keyspace.

This function is used to find the expected workload of the following
keyspace compaction types:
- major;
- cleanup;
- offstrategy;
- upgrade_sstables;
- scrub.
2025-08-28 12:10:07 +02:00
Aleksandra Martyniuk
82e241eb00 compaction: find expected workload for shard compaction tasks
Add compaction_task_impl::get_shard_task_workload that sums
the bytes in all sstables of this keyspace on this shard.

This function is used to find the expected workload of the following
shard compaction types:
- major;
- cleanup;
- offstrategy;
- upgrade_sstables;
- scrub.
2025-08-28 12:04:16 +02:00
Aleksandra Martyniuk
926753e8bf compaction: find expected workload for table compaction tasks
Add compaction_task_impl::get_table_task_workload that sums
the bytes in all sstables in the table.

This function is used to find the expected workload of the following
compaction types:
- major;
- cleanup;
- offstrategy;
- upgrade_sstables;
- scrub.
2025-08-28 10:41:22 +02:00
Aleksandra Martyniuk
bd28c50d84 compaction: return empty progress when compaction_size isn't set
Currently, progress of compaction task executors is reported in bytes.
However, if compaction_size isn't set for compaction task executor,
the executor's progress is shown as 1/1 (if it has finished) or 0/1
(otherwise).

In the following patches, the progress of executors' parent task will
be found based on its children. Hence, to avoid mixing different progress
units, the binary progress is no longer used.

Return empty progress when compaction_size isn't set.

Drop task_manager::task::impl::get_binary_progress as it's no longer
used.
2025-08-27 17:51:21 +02:00
Asias He
f9021777d8 compaction: Add tablet incremental repair support
This patch addes incremental_repair support in compaction.

- The sstables are split into repaired and unrepaired set.

- Repaired and unrepaired set compact sperately.

- The repaired_at from sstable and sstables_repaired_at from
  system.tablets table are used to decide if a sstable is repaired or
  not.

- Different compactions tasks, e.g., minor, major, scrub, split, are
  serialized with tablet repair.
2025-08-18 11:01:21 +08:00
Raphael S. Carvalho
2c4a9ba70c treewide: Rename table_state to compaction_group_view
Since table_state is a view to a compaction group, it makes sense
to rename it as so.

With upcoming incremental repair, each replica::compaction_group
will be actually two compaction groups, so there will be two
views for each replica::compaction_group.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-08-08 06:51:28 +03:00
Benny Halevy
bd62421c05 keyspace: rename get_vnode_effective_replication_map
to get_static_effective_replication_map, in preparation
for separating local_effective_replication_map from
vnode_effective_replication_map (both are per-keyspace).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-06 13:40:43 +03:00
Benny Halevy
ec85678de1 locator: abstract_replication_strategy: define is_local
Prefer for specializing the local replication strategy,
local effective replication map, et. al byt defining
an is_local() predicate, similar to uses_tablets().

Note that is_vnode_based() still applies to local replication
strategy.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-06 13:34:23 +03:00
Kefu Chai
9fdbe0e74b tree: Remove unused boost headers
This commit eliminates unused boost header includes from the tree.

Removing these unnecessary includes reduces dependencies on the
external Boost.Adapters library, leading to faster compile times
and a slightly cleaner codebase.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22997
2025-02-25 10:32:32 +03:00
Kefu Chai
9a20fb43ab tree: replace boost::min_element() with std::ranges::min_element()
in order to reduce the external header dependency, let's switch to
the standardlized std::ranges::min_element().

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22572
2025-02-05 21:54:01 +02:00
Avi Kivity
f3eade2f62 treewide: relicense to ScyllaDB-Source-Available-1.0
Drop the AGPL license in favor of a source-available license.
See the blog post [1] for details.

[1] https://www.scylladb.com/2024/12/18/why-were-moving-to-a-source-available-license/
2024-12-18 17:45:13 +02:00
Avi Kivity
9024e4940c counters.hh: drop unused boost includes
Re-add them to source files that need them.

Closes scylladb/scylladb#21738
2024-12-05 12:27:41 +02:00
Kefu Chai
bab12e3a98 treewide: migrate from boost::adaptors::transformed to std::views::transform
now that we are allowed to use C++23. we now have the luxury of using
`std::views::transform`.

in this change, we:

- replace `boost::adaptors::transformed` with `std::views::transform`
- use `fmt::join()` when appropriate where `boost::algorithm::join()`
  is not applicable to a range view returned by `std::view::transform`.
- use `std::ranges::fold_left()` to accumulate the range returned by
  `std::view::transform`
- use `std::ranges::fold_left()` to get the maximum element in the
  range returned by `std::view::transform`
- use `std::ranges::min()` to get the minimal element in the range
  returned by `std::view::transform`
- use `std::ranges::equal()` to compare the range views returned
  by `std::view::transform`
- remove unused `#include <boost/range/adaptor/transformed.hpp>`
- use `std::ranges::subrange()` instead of `boost::make_iterator_range()`,
  to feed `std::views::transform()` a view range.

to reduce the dependency to boost for better maintainability, and
leverage standard library features for better long-term support.

this change is part of our ongoing effort to modernize our codebase
and reduce external dependencies where possible.

limitations:

there are still a couple places where we are still using
`boost::adaptors::transformed` due to the lack of a C++23 alternative
for `boost::join()` and `boost::adaptors::uniqued`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21700
2024-12-03 09:41:32 +02:00
Kefu Chai
f436edfa22 mutation: remove unused "#include"s
these unused includes are identified by clang-include-cleaner. after
auditing the source files, all of the reports have been confirmed.

please note, because `mutation/mutation.hh` does not include
`seastar/coroutine/maybe_yield.hh` anymore, and quite a few source
files were relying on this header to bring in the declaration of
`maybe_yield()`, we have to include this header in the places where
this symbol is used. the same applies to `seastar/core/when_all.hh`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-11-29 14:01:44 +08:00
Botond Dénes
ccb433d767 Merge 'tasks: add api_task_ttl for tasks started with API' from Aleksandra Martyniuk
When users start an operation asynchronously with API, they are expected to check the operation's status. Hence, the status should be kept in task manager for reasonable time after the operation is done. The operations that are started internally usually don't need to stay in task manager for that long.

Add api_task_ttl that will be used for tasks started with API. By default it's 1 hour. The time for which non-API tasks stay in task manager isn't changed.

Fixes: #21499.
Refs: #21425.

No backport needed - previous versions may use task_ttl

Closes scylladb/scylladb#21505

* github.com:scylladb/scylladb:
  test: add test to check user_task_ttl
  tasks: api: move make_task method
  docs: nodetool: update backup and restore commands docs
  docs: update task manager docs
  nodetool: add nodetool tasks user-ttl command
  node_ops: use user task ttl for node ops virtual task
  tasks: use user_task_ttl for tasks started by user
  api: task_manager: add /task_manager/user_ttl to get and set user task ttl
  tasks: add task_manager::task::is_user_task method
  tasks: keep updateable_value of task_ttl in task manager
  db: config: add user_task_ttl_seconds named value
2024-11-27 09:57:57 +02:00
Kefu Chai
a5ee0c896b treewide: migrate from boost::adaptors::filtered to std::views::filter
Modernize the codebase by replacing Boost range adaptors with C++23 standard library views,
reducing external dependencies and leveraging modern C++ language features.

Key Changes:
- Replace `boost::adaptors::filtered` with `std::views::filter`
- Remove `#include <boost/range/adaptor/filtered.hpp>`
- Utilize standard library range views

Motivation:
- Reduce project's external dependency footprint
- Leverage standard library's range and view capabilities
- Improve long-term code maintainability
- Align with modern C++ best practices

Implementation Challenges and Considerations:
1. Range Conversion and Move Semantics
   - `std::ranges::to` adaptor requires rvalue references
   - Necessitated updates to variable and parameter constness
   - Example: `cql3/restrictions/statement_restrictions.cc` modified to remove `const`
     from `common` to enable efficient range conversion

2. Range Iteration and Mutation
   - Range views may mutate internal state during iteration
   - Cannot pass ranges by const reference in some scenarios
   - Solution: Pass ranges by rvalue reference to explicitly indicate
     state invalidation

Limitations:
- One instance of `boost::adaptors::filtered` temporarily preserved
  due to lack of a C++23 alternative for `boost::join()`
- A comprehensive replacement will be addressed in a follow-up change

This change is part of our ongoing effort to modernize the codebase,
reducing external dependencies and adopting modern C++ practices.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21648
2024-11-26 14:26:50 +02:00
Aleksandra Martyniuk
292d00463a tasks: add task_manager::task::is_user_task method 2024-11-25 14:21:53 +01:00
Pavel Emelyanov
7d8cc3ccc2 treewide,error_injection: Use inject(wait_for_message) overload
Many places want to inject a handler that waits for external kick. Now
there's convenience inject() method overload for this. It will result in
extra messages in logs, but so far no code/test cares about it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-10-30 16:53:33 +03:00
Lakshmi Narayanan Sreethar
84d06a13c7 api: compaction: add consider_only_existing_data option
Added a new parameter `consider_only_existing_data` to major compaction
API endpoints. When enabled, major compaction will:

- Force-flush all tables.
- Force a new active segment in the commit log.
- Compact all existing SSTables and garbage-collect tombstones by only
  checking the SSTables being compacted. Memtables, commit logs, and
  other SSTables not part of the compaction will not be checked, as they
  will only contain newer data that arrived after the compaction
  started.

The `consider_only_existing_data` is passed down to the compaction
descriptor's `gc_check_only_compacting_sstables` option to ensure that
only the existing data is considered for garbage collection.

The option is also passed to the `maybe_flush_commitlog` method to make
sure all the tables are flushed and a new active segment is created in
the commit log.

Fixes #19728

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-09-05 17:25:45 +05:30
Lakshmi Narayanan Sreethar
5e6bffc146 compaction: rename maybe_flush_all_tables to maybe_flush_commitlog
Major compaction flushes all tables as a part of flushing the commitlog.
After forcing new active segments in the commitlog, all the tables are
flushed to enable reclaim of older commitlog segments. The main goal is
to flush the commitlog and flushing all the table is just a dependency.

Rename maybe_flush_all_tables to maybe_flush_commitlog so that it
reflects the actual intent of the major compaction code. Added a new
wrapper method to database::flush_all_tables(),
database::flush_commitlog(), that is now called from
maybe_flush_commitlog.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-09-05 17:25:45 +05:30
Lakshmi Narayanan Sreethar
fa2488cc83 compaction: maybe_flush_all_tables: add new force_flush param
Add a new parameter, `force_flush` to the maybe_flush_all_tables()
method. Setting `force_flush` to true will flush all the tables
regardless of when they were flushed last. This will be used by the new
compaction option in a following patch.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-09-05 17:25:45 +05:30
Benny Halevy
686a8f2939 abstract_replication_strategy: make get_ranges async
To prevent stalls due to large number of tokens.
For example, large cluster with say 70 nodes can have
more than 16K tokens.

Fixes #19757

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-08-25 10:57:34 +03:00
Benny Halevy
2bbbe2a8bc database: get_keyspace_local_ranges: get vnode_effective_replication_map_ptr param
Prepare for making the function async.
Then, it will need to hold on to the erm while getting
the token_ranges asynchronously.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-08-25 10:55:33 +03:00
Benny Halevy
ea5a0cca10 compaction: task_manager_module: open code maybe_get_keyspace_local_ranges
It is used only here and can be simplified by
checking if the keyspace replication strategy
is per table by the caller.

Prepare for making get_keyspace_local_ranges async.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-08-25 10:25:32 +03:00
Pavel Emelyanov
38edbebb10 compaction_manager: Keep flush-all-before-major option on own config
Currently the major compaction task impl grabs this (non-updateable)
value from db::config. That's not good, all services including
compaction manager have their own configs from which they take options.
Said that, this patch puts the said option onto
compaction_manager::config, makes use of it and configures one from
db::config on start (and tests).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#20174
2024-08-23 10:31:55 +03:00
Aleksandra Martyniuk
c456a43173 compaction: replace optional<task_info> with task_info param
compaction_manager::perform_compaction does not create task manager
task for compaction if parent_info is set to std::nullopt. Currently,
we always want to create task manager task for compaction.

Remove optional from task info parameters which start compaction.
Track all compactions with task manager.
2024-08-02 14:38:46 +02:00
Raphael S. Carvalho
0ce8ee03f1 compaction: wire storage free space into reshape procedure
After this, TWCS reshape procedure can be changed to limit job
to 10% of available space.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-06-13 12:53:27 -03:00
Raphael S. Carvalho
b8bd4c51c2 replica: don't expose compaction_group to reshape task
compaction_group sits in replica layer and compaction layer is
supposed to talk to it through compaction::table_state only.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-06-13 12:43:14 -03:00
Aleksandra Martyniuk
532653f118 replica: replace table::as_table_state
Replace table::as_table_state with table::try_get_table_state_with_static_sharding
which throws if a table does not use static sharding.
2024-05-10 14:56:38 +02:00
Aleksandra Martyniuk
cf9913b0b7 compaction: pass compaction group id to reshape_compaction_group
Pass compaction group id to
shard_reshaping_compaction_task_impl::reshape_compaction_group.
Modify table::as_table_state to return table_state of the given
compaction group.
2024-05-10 14:56:38 +02:00
Pavel Emelyanov
1f44a374b8 error_injection: Overload inject() instead of inject_with_handler()
The inject_with_handler() method accepts a coroutine that can be called
wiht injection_handler. With such function as an argument, there's no
need in distinctive inject_with_handler() name for a method, it can be
overload of all the existing inject()-s

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-03-11 19:30:19 +03:00
Lakshmi Narayanan Sreethar
83fecc2f1f compaction: reshape sstables within compaction groups
For tables using tablet based replication strategies, the sstables
should be reshaped only within the compaction groups they belong to.
Updated shard_reshaping_compaction_task_impl to group the sstables based
on their compaction groups before reshaping them within the groups.

Fixes #16966

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-02-23 18:43:39 +05:30
Lakshmi Narayanan Sreethar
9fffd8905f compaction: reshape: update total reshaped size only on success
The total reshaped size should only be updated on reshape success and
not after reshape has been failed due to some exception.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-02-23 01:07:54 +05:30
Lakshmi Narayanan Sreethar
4fb099659a compaction: simplify exception handling in shard_reshaping_compaction_task_impl::run
Catch and handle the exceptions directly instead of rethrowing and
catching again.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-02-23 01:07:54 +05:30
Avi Kivity
eedb997568 Merge 'compaction: upgrade: handle keyspaces that use tablets' from Lakshmi Narayanan Sreethar
Tables in keyspaces governed by replication strategy that uses tablets, have separate effective_replication_maps. Update the upgrade compaction task to handle this when getting owned key ranges for a keyspace.

Fixes #16848

Closes scylladb/scylladb#17335

* github.com:scylladb/scylladb:
  compaction: upgrade: handle keyspaces that use tablets
  replica/database: add an optional variant to get_keyspace_local_ranges
2024-02-15 21:31:54 +02:00
Lakshmi Narayanan Sreethar
7a98877798 compaction: upgrade: handle keyspaces that use tablets
Tables in keyspaces governed by replication strategy that uses tablets, have
separate effective_replication_maps. Update the upgrade compaction task to
handle this when getting owned key ranges for a keyspace.

Fixes #16848

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-02-15 17:47:39 +05:30
Kefu Chai
caa20c491f storage_service: pass non-empty keyspace when performing cleanup_all
this change addresses the regression introduced by 5e0b3671, which
fall backs to local cleanup in cleanup_all. but 5e0b3671 failed to
pass the keyspace to the `shard_cleanup_keyspace_compaction_task_impl`
is its constructor parameter, that's why the test fails like
```
error executing POST request to http://localhost:10000/storage_service/cleanup_all with parameters {}: remote replied with status code 400 Bad Request:
Can't find a keyspace

```

where the string after "Can't find a keyspace" is empty.

in this change, the keyspace name of the keyspace to be cleaned is passed to
`shard_cleanup_keyspace_compaction_task_impl`.

we always enable the topology coordinator when performing testing,
that's why this issue does not pop up until the longevity test.

Fixes #17302
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#17320
2024-02-15 13:17:45 +02:00
Kefu Chai
5e0b3671d3 storage_service: fall back to local cleanup in cleanup_all
before this change, if no keyspaces are specified,
scylla-nodetool just enumerate all non-local keyspaces, and
call "/storage_service/keyspace_cleanup" on them one after another.
this is not quite efficient, as each this RESTful API call
force a new active commitlog segment, and flushes all tables.
so, if the target node of this command has N non-local keyspaces,
it would repeat the steps above for N times. this is not necessary.
and after a topology change, we would like to run a global
"nodetool cleanup" without specifying the keyspace, so this
is a typical use case which we do care about.

to address this performance issue, in this change, we improve
an existing RESTful API call "/storage_service/cleanup_all", so
if the topology coordinator is not enabled, we fall back to
a local cleanup to cleanup all non-local keyspaces.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-02-01 11:25:53 +08:00
Kefu Chai
4f90a875f6 compaction: format flush_mode without the helper
since flush_mode is moved out of major_compaction_task_impl, let's
drop the helper hosted in that class as well, and implement the
formatter witout it.

please note, the `__builtin_unreachable()` is dropped. it should
not change the behavior of the formatter. we don't put it in the
`default` branch in hope that `-Wswitch` can warn us in the case
when another enum of `flush_mode` is added, but we fail to handle
it somehow.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-02-01 11:25:53 +08:00
Kefu Chai
b39cc01bb3 compaction_manager: flush all tables before cleanup
according to the document "nodetool cleanup"

> Triggers removal of data that the node no longer owns

currently, scylla performs cleanup by rewriting the sstables. but
commitlog segments may still contain the mutations to the tables
which are dropped during sstable rewriting. when scylla server
restarts, the dirty mutations are replayed to the memtable. if
any of these dirty mutations changes the tables cleaned up. the
stale data are reapplied. this would lead to data resurrection.

so, in this change we following the same model of major compaction:

1. force new active segment,
2. flush all tables
3. perform cleanup using compaction, which rewrites the sstables
   of specified tables

because we already `flush()` all tables in
`cleanup_keyspace_compaction_task_impl::run()`, there is no need to
call `flush()` again, in `table::perform_cleanup_compaction()`, so
the `flush()` call is dropped in this function, and the tests using
this function are updated to call `flush()` manually to preserve
the existing behavior.

there are two callers of `cleanup_keyspace_compaction_task_impl`,

* one is `storage_service::sstable_cleanup_fiber()`, which listens
  for the events fired by topology_state_machine, which is in turn
  driven by, for instance, "/storage_service/cleanup_all" API.
  which cleanup all keyspaces in one after another.
* another is "/storage_service/keyspace_cleanup", which cleans up
  the specified keyspace.

in the first use case, we can force a new active segment for a single
time, so another parameter to the ctor of
`cleanup_keyspace_compaction_task_impl` is introduced to specify if
the `db.flush_all_tables()` call should be skiped.

please note, there are two possible optimizations,

1. force new active segment only if the mutations in it touches the
   tables being cleaned up
2. after forcing new active segment, only flush the (mem)tables
   mutated by the non-active segments

but let's leave them for following-up changes. this change is a
minimal fix for data resurrection issue.

Fixes #16757
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-02-01 11:25:53 +08:00