Every table and sstable set keeps track of the total file size
of contained sstables.
Due to a feature request, we also want to keep track of the hypothetical
file size if Data files were uncompressed, to add a metric that
shows the compression ratio of sstables.
We achieve this by replacing the relevant `uint_64 bytes_on_disk`
counters everywhere with a struct that contains both the actual
(post-compression) size and the hypothetical pre-compression size.
This patch isn't supposed to change any observable behavior.
In the next patch, we will use these changes to add a new metric.
sstables/exceptions.hh still hosts some compaction specific exception
types. Move them over to the new compaction/exceptions.hh, to make the
compaction module more self-contained.
The namespace usage in this directory is very inconsistent, with files
and classes scattered in:
* global namespace
* namespace compaction
* namespace sstables
With cases, where all three used in the same file. This code used to
live in sstables/ and some of it still retains namespace sstables as a
heritage of that time. The mismatch between the dir (future module) and
the namespace used is confusing, so finish the migration and move all
code in compaction/ to namespace compaction too.
This patch, although large, is mechanic and only the following kind of
changes are made:
* replace namespace sstable {} with namespace compaction {}
* add namespace compaction {}
* drop/add sstables::
* drop/add compaction::
* move around forward-declarations so they are in the correct namespace
context
This refactoring revealed some awkward leftover coupling between
sstables and compaction, in sstables/sstable_set.cc, where the
make_sstable_set() methods of compaction strategies are implemented.
expected_total_workload methods of scrub compaction tasks create
a vector of table_info based on table names. If any table was already
dropped, then the exception is thrown. It leaves table_info in corrupted
state and node crashes with `free(): invalid size`.
Return std::nullopt if an exception was thrown to indicate that
total workload cannot be found.
Fixes: #25941.
Determine the progress of compaction tasks that have
children.
The progress of a compaction task is calculated using the default
get_progress method. If the expected_total_workload method is
implemented, the default progress is computed as:
(sum of child task progresses) / (expected total workload)
If expected_total_workload is not defined, progress is estimated based
on children progresses. However, in this case, the total progress may
increase over time as the task executes.
All compaction tasks, except for reshape tasks, implement the
expected_children_number method. To compute expected_total_workload,
iterate over all SSTables covered by the task and sum their sizes. Note
that expected_total_workload is just an approximation and the real workload
may differ if SStables set for the keyspace/table/compaction group changes.
Reshape tasks are an exception, as their scope is determined during
execution. Hence, for these tasks expected_total_workload isn't defined
and their progress (both total and completed) is determined based
on currently created children.
Fixes: https://github.com/scylladb/scylladb/issues/8392.
Fixes: https://github.com/scylladb/scylladb/issues/6406.
Fixes: https://github.com/scylladb/scylladb/issues/7845.
New feature, no backport needed
Closesscylladb/scylladb#15158
* github.com:scylladb/scylladb:
test: add compaction task progress test
compaction: set progress unit for compaction tasks
compaction: find expected workload for reshard tasks
compaction: find expected workload for global cleanup compaction tasks
compaction: find expected workload for global major compaction tasks
compaction: find expected workload for keyspace compaction tasks
compaction: find expected workload for shard compaction tasks
compaction: find expected workload for table compaction tasks
compaction: return empty progress when compaction_size isn't set
compaction: update compaction_data::compaction_size at once
tasks: do not check expected workload for done task
Find expected workload in bytes of reshard tasks.
The workload of table_resharding_compaction_task_impl is found
at the beginning of its execution. Before that, expected_total_workload()
returns std::nullopt, which means that the progress for this task
won't be shown.
Add compaction_task_impl::get_keyspace_task_workload that sums
the bytes in all sstables of this keyspace.
This function is used to find the expected workload of the following
keyspace compaction types:
- major;
- cleanup;
- offstrategy;
- upgrade_sstables;
- scrub.
Add compaction_task_impl::get_shard_task_workload that sums
the bytes in all sstables of this keyspace on this shard.
This function is used to find the expected workload of the following
shard compaction types:
- major;
- cleanup;
- offstrategy;
- upgrade_sstables;
- scrub.
Add compaction_task_impl::get_table_task_workload that sums
the bytes in all sstables in the table.
This function is used to find the expected workload of the following
compaction types:
- major;
- cleanup;
- offstrategy;
- upgrade_sstables;
- scrub.
Currently, progress of compaction task executors is reported in bytes.
However, if compaction_size isn't set for compaction task executor,
the executor's progress is shown as 1/1 (if it has finished) or 0/1
(otherwise).
In the following patches, the progress of executors' parent task will
be found based on its children. Hence, to avoid mixing different progress
units, the binary progress is no longer used.
Return empty progress when compaction_size isn't set.
Drop task_manager::task::impl::get_binary_progress as it's no longer
used.
This patch addes incremental_repair support in compaction.
- The sstables are split into repaired and unrepaired set.
- Repaired and unrepaired set compact sperately.
- The repaired_at from sstable and sstables_repaired_at from
system.tablets table are used to decide if a sstable is repaired or
not.
- Different compactions tasks, e.g., minor, major, scrub, split, are
serialized with tablet repair.
Since table_state is a view to a compaction group, it makes sense
to rename it as so.
With upcoming incremental repair, each replica::compaction_group
will be actually two compaction groups, so there will be two
views for each replica::compaction_group.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
to get_static_effective_replication_map, in preparation
for separating local_effective_replication_map from
vnode_effective_replication_map (both are per-keyspace).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Prefer for specializing the local replication strategy,
local effective replication map, et. al byt defining
an is_local() predicate, similar to uses_tablets().
Note that is_vnode_based() still applies to local replication
strategy.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This commit eliminates unused boost header includes from the tree.
Removing these unnecessary includes reduces dependencies on the
external Boost.Adapters library, leading to faster compile times
and a slightly cleaner codebase.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22997
in order to reduce the external header dependency, let's switch to
the standardlized std::ranges::min_element().
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22572
now that we are allowed to use C++23. we now have the luxury of using
`std::views::transform`.
in this change, we:
- replace `boost::adaptors::transformed` with `std::views::transform`
- use `fmt::join()` when appropriate where `boost::algorithm::join()`
is not applicable to a range view returned by `std::view::transform`.
- use `std::ranges::fold_left()` to accumulate the range returned by
`std::view::transform`
- use `std::ranges::fold_left()` to get the maximum element in the
range returned by `std::view::transform`
- use `std::ranges::min()` to get the minimal element in the range
returned by `std::view::transform`
- use `std::ranges::equal()` to compare the range views returned
by `std::view::transform`
- remove unused `#include <boost/range/adaptor/transformed.hpp>`
- use `std::ranges::subrange()` instead of `boost::make_iterator_range()`,
to feed `std::views::transform()` a view range.
to reduce the dependency to boost for better maintainability, and
leverage standard library features for better long-term support.
this change is part of our ongoing effort to modernize our codebase
and reduce external dependencies where possible.
limitations:
there are still a couple places where we are still using
`boost::adaptors::transformed` due to the lack of a C++23 alternative
for `boost::join()` and `boost::adaptors::uniqued`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21700
these unused includes are identified by clang-include-cleaner. after
auditing the source files, all of the reports have been confirmed.
please note, because `mutation/mutation.hh` does not include
`seastar/coroutine/maybe_yield.hh` anymore, and quite a few source
files were relying on this header to bring in the declaration of
`maybe_yield()`, we have to include this header in the places where
this symbol is used. the same applies to `seastar/core/when_all.hh`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
When users start an operation asynchronously with API, they are expected to check the operation's status. Hence, the status should be kept in task manager for reasonable time after the operation is done. The operations that are started internally usually don't need to stay in task manager for that long.
Add api_task_ttl that will be used for tasks started with API. By default it's 1 hour. The time for which non-API tasks stay in task manager isn't changed.
Fixes: #21499.
Refs: #21425.
No backport needed - previous versions may use task_ttl
Closesscylladb/scylladb#21505
* github.com:scylladb/scylladb:
test: add test to check user_task_ttl
tasks: api: move make_task method
docs: nodetool: update backup and restore commands docs
docs: update task manager docs
nodetool: add nodetool tasks user-ttl command
node_ops: use user task ttl for node ops virtual task
tasks: use user_task_ttl for tasks started by user
api: task_manager: add /task_manager/user_ttl to get and set user task ttl
tasks: add task_manager::task::is_user_task method
tasks: keep updateable_value of task_ttl in task manager
db: config: add user_task_ttl_seconds named value
Modernize the codebase by replacing Boost range adaptors with C++23 standard library views,
reducing external dependencies and leveraging modern C++ language features.
Key Changes:
- Replace `boost::adaptors::filtered` with `std::views::filter`
- Remove `#include <boost/range/adaptor/filtered.hpp>`
- Utilize standard library range views
Motivation:
- Reduce project's external dependency footprint
- Leverage standard library's range and view capabilities
- Improve long-term code maintainability
- Align with modern C++ best practices
Implementation Challenges and Considerations:
1. Range Conversion and Move Semantics
- `std::ranges::to` adaptor requires rvalue references
- Necessitated updates to variable and parameter constness
- Example: `cql3/restrictions/statement_restrictions.cc` modified to remove `const`
from `common` to enable efficient range conversion
2. Range Iteration and Mutation
- Range views may mutate internal state during iteration
- Cannot pass ranges by const reference in some scenarios
- Solution: Pass ranges by rvalue reference to explicitly indicate
state invalidation
Limitations:
- One instance of `boost::adaptors::filtered` temporarily preserved
due to lack of a C++23 alternative for `boost::join()`
- A comprehensive replacement will be addressed in a follow-up change
This change is part of our ongoing effort to modernize the codebase,
reducing external dependencies and adopting modern C++ practices.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21648
Many places want to inject a handler that waits for external kick. Now
there's convenience inject() method overload for this. It will result in
extra messages in logs, but so far no code/test cares about it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Added a new parameter `consider_only_existing_data` to major compaction
API endpoints. When enabled, major compaction will:
- Force-flush all tables.
- Force a new active segment in the commit log.
- Compact all existing SSTables and garbage-collect tombstones by only
checking the SSTables being compacted. Memtables, commit logs, and
other SSTables not part of the compaction will not be checked, as they
will only contain newer data that arrived after the compaction
started.
The `consider_only_existing_data` is passed down to the compaction
descriptor's `gc_check_only_compacting_sstables` option to ensure that
only the existing data is considered for garbage collection.
The option is also passed to the `maybe_flush_commitlog` method to make
sure all the tables are flushed and a new active segment is created in
the commit log.
Fixes#19728
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Major compaction flushes all tables as a part of flushing the commitlog.
After forcing new active segments in the commitlog, all the tables are
flushed to enable reclaim of older commitlog segments. The main goal is
to flush the commitlog and flushing all the table is just a dependency.
Rename maybe_flush_all_tables to maybe_flush_commitlog so that it
reflects the actual intent of the major compaction code. Added a new
wrapper method to database::flush_all_tables(),
database::flush_commitlog(), that is now called from
maybe_flush_commitlog.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Add a new parameter, `force_flush` to the maybe_flush_all_tables()
method. Setting `force_flush` to true will flush all the tables
regardless of when they were flushed last. This will be used by the new
compaction option in a following patch.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
To prevent stalls due to large number of tokens.
For example, large cluster with say 70 nodes can have
more than 16K tokens.
Fixes#19757
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Prepare for making the function async.
Then, it will need to hold on to the erm while getting
the token_ranges asynchronously.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It is used only here and can be simplified by
checking if the keyspace replication strategy
is per table by the caller.
Prepare for making get_keyspace_local_ranges async.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently the major compaction task impl grabs this (non-updateable)
value from db::config. That's not good, all services including
compaction manager have their own configs from which they take options.
Said that, this patch puts the said option onto
compaction_manager::config, makes use of it and configures one from
db::config on start (and tests).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#20174
compaction_manager::perform_compaction does not create task manager
task for compaction if parent_info is set to std::nullopt. Currently,
we always want to create task manager task for compaction.
Remove optional from task info parameters which start compaction.
Track all compactions with task manager.
compaction_group sits in replica layer and compaction layer is
supposed to talk to it through compaction::table_state only.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Pass compaction group id to
shard_reshaping_compaction_task_impl::reshape_compaction_group.
Modify table::as_table_state to return table_state of the given
compaction group.
The inject_with_handler() method accepts a coroutine that can be called
wiht injection_handler. With such function as an argument, there's no
need in distinctive inject_with_handler() name for a method, it can be
overload of all the existing inject()-s
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
For tables using tablet based replication strategies, the sstables
should be reshaped only within the compaction groups they belong to.
Updated shard_reshaping_compaction_task_impl to group the sstables based
on their compaction groups before reshaping them within the groups.
Fixes#16966
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
The total reshaped size should only be updated on reshape success and
not after reshape has been failed due to some exception.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Catch and handle the exceptions directly instead of rethrowing and
catching again.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Tables in keyspaces governed by replication strategy that uses tablets, have separate effective_replication_maps. Update the upgrade compaction task to handle this when getting owned key ranges for a keyspace.
Fixes#16848Closesscylladb/scylladb#17335
* github.com:scylladb/scylladb:
compaction: upgrade: handle keyspaces that use tablets
replica/database: add an optional variant to get_keyspace_local_ranges
Tables in keyspaces governed by replication strategy that uses tablets, have
separate effective_replication_maps. Update the upgrade compaction task to
handle this when getting owned key ranges for a keyspace.
Fixes#16848
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
this change addresses the regression introduced by 5e0b3671, which
fall backs to local cleanup in cleanup_all. but 5e0b3671 failed to
pass the keyspace to the `shard_cleanup_keyspace_compaction_task_impl`
is its constructor parameter, that's why the test fails like
```
error executing POST request to http://localhost:10000/storage_service/cleanup_all with parameters {}: remote replied with status code 400 Bad Request:
Can't find a keyspace
```
where the string after "Can't find a keyspace" is empty.
in this change, the keyspace name of the keyspace to be cleaned is passed to
`shard_cleanup_keyspace_compaction_task_impl`.
we always enable the topology coordinator when performing testing,
that's why this issue does not pop up until the longevity test.
Fixes#17302
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#17320
before this change, if no keyspaces are specified,
scylla-nodetool just enumerate all non-local keyspaces, and
call "/storage_service/keyspace_cleanup" on them one after another.
this is not quite efficient, as each this RESTful API call
force a new active commitlog segment, and flushes all tables.
so, if the target node of this command has N non-local keyspaces,
it would repeat the steps above for N times. this is not necessary.
and after a topology change, we would like to run a global
"nodetool cleanup" without specifying the keyspace, so this
is a typical use case which we do care about.
to address this performance issue, in this change, we improve
an existing RESTful API call "/storage_service/cleanup_all", so
if the topology coordinator is not enabled, we fall back to
a local cleanup to cleanup all non-local keyspaces.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
since flush_mode is moved out of major_compaction_task_impl, let's
drop the helper hosted in that class as well, and implement the
formatter witout it.
please note, the `__builtin_unreachable()` is dropped. it should
not change the behavior of the formatter. we don't put it in the
`default` branch in hope that `-Wswitch` can warn us in the case
when another enum of `flush_mode` is added, but we fail to handle
it somehow.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
according to the document "nodetool cleanup"
> Triggers removal of data that the node no longer owns
currently, scylla performs cleanup by rewriting the sstables. but
commitlog segments may still contain the mutations to the tables
which are dropped during sstable rewriting. when scylla server
restarts, the dirty mutations are replayed to the memtable. if
any of these dirty mutations changes the tables cleaned up. the
stale data are reapplied. this would lead to data resurrection.
so, in this change we following the same model of major compaction:
1. force new active segment,
2. flush all tables
3. perform cleanup using compaction, which rewrites the sstables
of specified tables
because we already `flush()` all tables in
`cleanup_keyspace_compaction_task_impl::run()`, there is no need to
call `flush()` again, in `table::perform_cleanup_compaction()`, so
the `flush()` call is dropped in this function, and the tests using
this function are updated to call `flush()` manually to preserve
the existing behavior.
there are two callers of `cleanup_keyspace_compaction_task_impl`,
* one is `storage_service::sstable_cleanup_fiber()`, which listens
for the events fired by topology_state_machine, which is in turn
driven by, for instance, "/storage_service/cleanup_all" API.
which cleanup all keyspaces in one after another.
* another is "/storage_service/keyspace_cleanup", which cleans up
the specified keyspace.
in the first use case, we can force a new active segment for a single
time, so another parameter to the ctor of
`cleanup_keyspace_compaction_task_impl` is introduced to specify if
the `db.flush_all_tables()` call should be skiped.
please note, there are two possible optimizations,
1. force new active segment only if the mutations in it touches the
tables being cleaned up
2. after forcing new active segment, only flush the (mem)tables
mutated by the non-active segments
but let's leave them for following-up changes. this change is a
minimal fix for data resurrection issue.
Fixes#16757
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>