Our sstable format selection logic is weird, and hard to follow.
If I'm not misunderstanding, the pieces are:
1. There's the `sstable_format` config entry, which currently
doesn't do anything, but in the past it used to disable
cluster features for versions newer than the specified one.
2. There are deprecated and unused config entries for individual
versions (`enable_sstables_mc_format`, `enable_sstables_md_format`,
etc).
3. There is a cluster feature for each version:
ME_SSTABLE_FORMAT, MD_SSTABLE_FORMAT, etc.
(Currently all sstable version features have been grandfathered,
and aren't checked by the code anymore).
4. There's an entry in `system.scylla_local` which contains the
latest enabled sstable version. (Why? Isn't this directly derived
from cluster features anyway)?
5. There's `sstable_manager::_format` which contains the
sstable version to be used for new writes.
This field is updated by `sstables_format_selector`
based on cluster features and the `system.scylla_local` entry.
I don't see why those pieces are needed. Version selection has the
following constraints:
1. New sstables must be written with a format that supports existing
data. For example, range tombstones with an infinite bound are only
supported by sstables since version "mc". So if a range tombstone
with an infinite bound exists somewhere in the dataset,
the format chosen for new sstables has to be at least as new as "mc".
2. A new format might only be used after a corresponding cluster feature
is enabled. (Otherwise new sstables might become unreadable if they
are sent to another node, or if a node is downgraded).
3. The user should have a way to inhibit format ugprades if he wishes.
So far, constraint (1) has been fulfilled by never using formats older
than the newest format ever enabled on the node. (With an exception
for resharding and reshaping system tables).
Constraint (2) has been fulfilled by calling `sstable_manager::set_format`
only after the corresponsing cluster feature is enabled.
Constraint (3) has been fulfilled by the ability to inhibit cluster
features by setting `sstable_format` by some fixed value.
The main thing I don't like about this whole setup is that it doesn't
let me downgrade the preferred sstable format. After a format is
enabled, there is no way to go back to writing the old format again.
That is no good -- after I make some performance-sensitive changes
in a new format, it might turn out to be a pessimization for the
particular workload, and I want to be able to go back.
This patch aims to give a way to downgrade formats without violating
the constraints. What it does is:
1. The entry in `system.scylla_local` becomes obsolete.
After the patch we no longer update or read it.
As far as I understand, the purpose of this entry is to prevent
unwanted format downgrades (which is something cluster features
are designed for) and it's updated if and only if relevant
cluster features are updated. So there's no reason to have it,
we can just directly use cluster features.
2. `sstable_format_selector` gets deleted.
Without the `system.scylla_local` around, it's just a glorified
feature listener.
3. The format selection logic is moved into `sstable_manager`.
It already sees the `db::config` and the `gms::feature_service`.
For the foreseeable future, the knowledge of enabled cluster features
and current config should be enough information to pick the right formats.
4. The `sstable_format` entry in `db::config` is no longer intended to
inhibit cluster features. Instead, it is intended to select the
format for new sstables, and it becomes live-updatable.
5. Instead of writing new sstables with "highest supported" format,
(which used to be set by `sstables_format_selector`) we write
them with the "preferred" format, which is determined by
`sstable_manager` based on the combination of enabled features
and the current value of `sstable_format`.
Closesscylladb/scylladb#26092
[avi: Pavel found the reason for the scylla_local entry -
it predates stable storage for cluster features]
compaction/scrub: register sstables for compaction before validation
When `scrub --validate` runs, it collects all candidate sstables at the
start and validates them one by one in separate compaction tasks.
However, scrub in validate mode does not register these sstables for
compaction, which allows regular compaction to pick them up and
potentially compact them away before validation begins. This leads to
scrub failures because the sstables can no longer be found.
This patch fixes the issue by first disabling compaction, collecting the
sstables, and then registering them for compaction before starting
validation. This ensures that the enqueued sstables remain available for
the entire duration of the scrub validation task.
Fixes#23363
This reported scrub failure occurs on all versions that have the
checksum/digest validation feature for uncompressed sstables.
So, backport it to older versions.
Closesscylladb/scylladb#26034
* github.com:scylladb/scylladb:
compaction/scrub: register sstables for compaction before validation
compaction/scrub: handle exceptions when moving invalid sstables to quarantine
expected_total_workload methods of scrub compaction tasks create a vector of table_info based on table names. If any table was already dropped, then the exception is thrown. It leaves table_info in corrupted state and node crashes with `free(): invalid size`.
Return std::nullopt if an exception was thrown to indicate that total workload cannot be found.
Fixes: #25941.
No release branches affected
Closesscylladb/scylladb#25944
* github.com:scylladb/scylladb:
tasks: get progress of failed task based on children
compaction: handle exception in expected_total_workload
The compaction_manager::stop_compaction() method internally walks the
list of tasks and compares each task's compacting_table (which is
compaction group view pointer) with the given one. In case this
stop_compaction() method is called via API for a specific table, the
method walks the list of tasks for every compaction group from the
table, thus resulting in nr_groups * nr_tasks complexity. Not terrible,
but not nice either.
The proposal is to pass filtering function into the inner
do_stop_ongoing_compactions() method. Some users will pass a simple
"return true" lambda, but those that need to stop compactions for a
specitif table (e.g. -- the API handler) will effectively walk the
list of tasks once comparing the given compaction group's schema with
the target table one (spoiler: eventually this place will also be
simplified not to mess with replica::table at all).
One ugliness with the change is the way "scope" for logging message is
collected. If all tasks belong to the same table, then "for table ..."
is printed in logs. With the change the scope is no longer known
instantly and is evaluated dynamically while walking the list of tasks.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#25846
When `scrub --validate` runs, it collects all candidate sstables at the
start and validates them one by one in separate compaction tasks.
However, scrub in validate mode does not register these sstables for
compaction, which allows regular compaction to pick them up and
potentially compact them away before validation begins. This leads to
scrub failures because the sstables can no longer be found.
This patch fixes the issue by first disabling compaction, collecting the
sstables, and then registering them for compaction before starting
validation. This ensures that the enqueued sstables remain available for
the entire duration of the scrub validation task.
Fixes#23363
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
In validate mode, scrub moves invalid sstables into the quarantine
folder. If validation fails because the sstable files are missing from
disk, there is nothing to move, and the quarantine step will throw an
exception. Handle such exceptions so scrub can return a proper
compaction_result instead of propagating the exception to the caller.
This will help the testcase for #23363 to reliably determine if the
scrub has failed or not.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
The interface suggests the whole sstable cleanup is aborted with
'nodetool stop CLEANUP', but it is currently stopping only the
ongoing cleanup task, and the compaction manager will retry the
task since the error is not propagated all the way back to the
caller. With raft topology, the coordinator should retry it though
since cleanup became mandatory with automatic cleanup. So it's
only fixing the usage where cleanup is issued manually.
The stop exception is only propagated to the caller of cleanup.
When stopping tasks during shutdown, the exception is swallowed
and the error only returned to the caller.
Fixes#20823.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#24996
expected_total_workload methods of scrub compaction tasks create
a vector of table_info based on table names. If any table was already
dropped, then the exception is thrown. It leaves table_info in corrupted
state and node crashes with `free(): invalid size`.
Return std::nullopt if an exception was thrown to indicate that
total workload cannot be found.
Fixes: #25941.
Determine the progress of compaction tasks that have
children.
The progress of a compaction task is calculated using the default
get_progress method. If the expected_total_workload method is
implemented, the default progress is computed as:
(sum of child task progresses) / (expected total workload)
If expected_total_workload is not defined, progress is estimated based
on children progresses. However, in this case, the total progress may
increase over time as the task executes.
All compaction tasks, except for reshape tasks, implement the
expected_children_number method. To compute expected_total_workload,
iterate over all SSTables covered by the task and sum their sizes. Note
that expected_total_workload is just an approximation and the real workload
may differ if SStables set for the keyspace/table/compaction group changes.
Reshape tasks are an exception, as their scope is determined during
execution. Hence, for these tasks expected_total_workload isn't defined
and their progress (both total and completed) is determined based
on currently created children.
Fixes: https://github.com/scylladb/scylladb/issues/8392.
Fixes: https://github.com/scylladb/scylladb/issues/6406.
Fixes: https://github.com/scylladb/scylladb/issues/7845.
New feature, no backport needed
Closesscylladb/scylladb#15158
* github.com:scylladb/scylladb:
test: add compaction task progress test
compaction: set progress unit for compaction tasks
compaction: find expected workload for reshard tasks
compaction: find expected workload for global cleanup compaction tasks
compaction: find expected workload for global major compaction tasks
compaction: find expected workload for keyspace compaction tasks
compaction: find expected workload for shard compaction tasks
compaction: find expected workload for table compaction tasks
compaction: return empty progress when compaction_size isn't set
compaction: update compaction_data::compaction_size at once
tasks: do not check expected workload for done task
Currently, make_and_start_task returns a pointer to task_manager::task
that hides the implementation details. If we need to access
the implementation (e.g. because we want a task to "return" a value),
we need to make and start task step by step openly.
Return task_manager::task::impl from make_and_start_task. Use it
where possible.
Fixes: https://github.com/scylladb/scylladb/issues/22146.
Optimization; no backport
Closesscylladb/scylladb#25743
* github.com:scylladb/scylladb:
tasks: return task::impl from make_and_start_task
compaction: use current_task_type
repair: add new param to tablet_repair_task_impl
repair: add new params to shard_repair_task_impl
repair: pass argument by value
Currently, make_and_start_task returns a pointer to task_manager::task
that hides the implementation details. If we need to access
the implementation (e.g. because we want a task to "return" a value),
we need to make and start task step by step openly.
Return task_manager::task::impl from make_and_start_task. Use it
where possible.
Fixes: https://github.com/scylladb/scylladb/issues/22146.
Using a single state variable to keep track whether compaction
manager is enabled/disabled is insufficient, as multiple services
may independently request compactions to be disabled.
To address this, a counter is introduced to record how many times
the compaction manager has been drained. The manager is considered
enabled only when this counter reaches zero.
Introducing a counter, enabled and disabled states become obsolete.
So they are replaced with a single running state.
Find expected workload in bytes of reshard tasks.
The workload of table_resharding_compaction_task_impl is found
at the beginning of its execution. Before that, expected_total_workload()
returns std::nullopt, which means that the progress for this task
won't be shown.
Add compaction_task_impl::get_keyspace_task_workload that sums
the bytes in all sstables of this keyspace.
This function is used to find the expected workload of the following
keyspace compaction types:
- major;
- cleanup;
- offstrategy;
- upgrade_sstables;
- scrub.
Add compaction_task_impl::get_shard_task_workload that sums
the bytes in all sstables of this keyspace on this shard.
This function is used to find the expected workload of the following
shard compaction types:
- major;
- cleanup;
- offstrategy;
- upgrade_sstables;
- scrub.
Add compaction_task_impl::get_table_task_workload that sums
the bytes in all sstables in the table.
This function is used to find the expected workload of the following
compaction types:
- major;
- cleanup;
- offstrategy;
- upgrade_sstables;
- scrub.
Add precompiled header support to CMakeLists.txt and configure.py -
it improves compilation time by approximately 10%.
New header `stdafx.hh` is added, don't include it manually -
the compiler will include it for you. The header contains includes from
external libraries used by Scylla - seastar, standard library,
linux headers and zlib.
The feature is enabled by default, use CMake option `Scylla_USE_PRECOMPILED_HEADER`
or configure.py --disable-precompiled-header to disable.
The feature should be disabled, when trying to check headers - otherwise
you might get false negatives on missing includes from seastar / abseil and so on.
Note: following configuration needs to be added to ccache.conf:
sloppiness = pch_defines,time_macros
Closes#25182
Currently, progress of compaction task executors is reported in bytes.
However, if compaction_size isn't set for compaction task executor,
the executor's progress is shown as 1/1 (if it has finished) or 0/1
(otherwise).
In the following patches, the progress of executors' parent task will
be found based on its children. Hence, to avoid mixing different progress
units, the binary progress is no longer used.
Return empty progress when compaction_size isn't set.
Drop task_manager::task::impl::get_binary_progress as it's no longer
used.
Currently, in compaction::setup compaction_size is updated in a loop.
Due to that the total progress of compaction executors grows during
their execution.
Add the sstables sizes to a compaction_size variable. Update
compaction_data::compaction_size after the loop.
The central idea of incremental repair is to allow repair participants
to select and repair only a portion of the dataset to speed up the
repair process. All repair participants must utilize an identical
selection method to repair and synchronize the same selected dataset.
There are two primary selection methods: time-based and file-based. The
time-based method selects data within a specified time frame. It is
versatile but it is less efficient because it requires reading all of
the dataset and omitting data beyond the time frame. The file-based
method selects data from unrepaired SSTables and is more efficient
because it allows the entire SSTable to be omitted. This document patch
implements the file-based selection method.
Incremental repair will only be supported for tablet tables; it will not
be supported for vnode tables. On one hand, the legacy vnode is less
important to support. On the other hand, the incremental repair for
vnode is much harder to implement. With vnodes, a SSTalbe could contain
data for multiple vnode ranges. When a given vnode range is repaired,
only a portion of the SSTable is repaired. This complicates the
manipulation of SSTables significantly during both repair and
compaction. With tablets, an entire tablet is repaired so that a
sstable is either fully repaired or not repaired which is a huge
simplification.
This patch uses the repaired_at from sstables::statistics component to
mark a sstable as repaired. It uses a virtual clock as the repair
timestamp, i.e., using a monotonically increasing number for the
repaired_at field of a SSTable and sstables_repaired_at column in
system.tablets table. Notice that when a sstable is not repaired, the
repaired_at field will be set to the default value 0 by default. The
being_repaired in memory field of a SSTable is used to explicitly mark
that a SSTable is being selected. The following variables are used for
incremental repair:
The repaired_at on disk field of a SSTable is used.
- A 64-bit number increases sequentially
The sstables_repaired_at is added to the system.tablets table.
- repaired_at <= sstables_repaired_at means the sstable is repaired
The being_repaired in memory field of a SSTable is added.
- A repair UUID tells which sstable has participated in the repair
Initial test results:
1) Medium dataset results
Node amount: 3
Instance type: i4i.2xlarge
Disk usage per node: ~500GB
Cluster pre-populated with ~500GB of data before starting repairs job.
Results for Repair Timings:
The regular repair run took 210 mins.
Incremental repair 1st run took 183 mins, 2nd and 3rd runs took around 48s
The speedup is: 183 mins / 48s = 228X
2) Small dataset results
Node amount: 3
Instance type: i4i.2xlarge
Disk usage per node: ~167GB
Cluster pre-populated with ~167GB of data before starting the repairs job.
Regular repair 1st run took 110s, 2nd and 3rd runs took 110s.
Incremental repair 1st run took 110 seconds, 2nd and 3rd run took 1.5 seconds.
The speedup is: 110s / 1.5s = 73X
3) Large dataset results
Node amount: 6
Instance type: i4i.2xlarge, 3 racks
50% of base load, 50% read/write
Dataset == Sum of data on each node
Dataset Non-incremental repair (minutes)
1.3 TiB 31:07
3.5 TiB 25:10
5.0 TiB 19:03
6.3 TiB 31:42
Dataset Incremental repair (minutes)
1.3 TiB 24:32
3.0 TiB 13:06
4.0 TiB 5:23
4.8 TiB 7:14
5.6 TiB 3:58
6.3 TiB 7:33
7.0 TiB 6:55
Fixes#22472Closesscylladb/scylladb#24291
* github.com:scylladb/scylladb:
replica: Introduce get_compaction_reenablers_and_lock_holders_for_repair
compaction: Move compaction_reenabler to compaction_reenabler.hh
topology_coordinator: Make rpc::remote_verb_error to warning level
repair: Add metrics for sstable bytes read and skipped from sstables
test.py: Disable incremental for test_tombstone_gc_for_streaming_and_repair
test.py: Add tests for tablet incremental repair
repair: Add tablet incremental repair support
compaction: Add tablet incremental repair support
feature_service: Add TABLET_INCREMENTAL_REPAIR feature
tablet_allocator: Add tablet_force_tablet_count_increase and decrease
repair: Add incremental helpers
sstable: Add being_repaired to sstable
sstables: Add set_repaired_at to metadata_collector
mutation_compactor: Introduce add operator to compaction_stats
tablet: Add sstables_repaired_at to system.tablets table
test: Fix drain api in task_manager_client.py
This patch addes incremental_repair support in compaction.
- The sstables are split into repaired and unrepaired set.
- Repaired and unrepaired set compact sperately.
- The repaired_at from sstable and sstables_repaired_at from
system.tablets table are used to decide if a sstable is repaired or
not.
- Different compactions tasks, e.g., minor, major, scrub, split, are
serialized with tablet repair.
Allow possibly avoiding overlap checks in the case where the source of
the min-live timestamp is known to only contain data which was written
*after* expiry treshold. Expiry treshold is the upper bound of
tombstone.deletion_time that was already expired at the time of
obtaining this expiry treshold value. Meaning that any write originating
from after this point in time, was generated at a time when such
tombstone was already expired. Hence these writes are not relevant for
the purposes of overlap checks with the tombstone and so their min-live
timestamp can be ignored.
This is important for MV workloads, where writes generated now can have
timestamps going far back in time, possibly blocking tombstone GC of
much older [shadowable] tombstones.
Instead of storing it partially in tombstone_gc and partially in an
external map. Move all external parts into the new
shared_tombstone_gc_state. This new class is responsible for
keeping and updating the repair history. tombstone_gc_state just keeps
const pointers to the shared state as before and is only responsible for
querying the tombstone gc before times.
This separation makes the code easier to follow and also enables further
patching of tombstone_gc_state.
We are used to symbols definition being grouped in one .cc file, but a
symbol declaration and definition living in separate modules
(subfolders) is surprising.
Relocate always_gc, never_gc, can_always_purge and can_never_purge to
compaction/compaction.cc, from mutatiobn/mutation_partition.cc. The
declarations of these symbols is in
compaction/compaction_garbage_collector.hh.
This will allow upcoming work to gently produce a sstable set for
each compaction group view. Example: repaired and unrepaired.
Locking strategy for compaction's sstable selection:
Since sstable retrieval path became futurized, tasks in compaction
manager will now hold the write lock (compaction_state::lock)
when retrieving the sstable list, feeding them into compaction
strategy, and finally registering selected sstables as compacting.
The last step prevents another concurrent task from picking the
same sstable. Previously, all those steps were atomic, but
we have seen stall in that area in large installations, so
futurization of that area would come sooner or later.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Since there will be only one physical sstable set, it makes sense to move
backlog tracker to replica::compaction_group. With incremental repair,
it still makes sense to compute backlog accounting both logical sets,
since the compound backlog influences the overall read amplification,
and the total backlog across repaired and unrepaired sets can help
driving decisions like giving up on incremental repair when unrepaired
set is almost as large as the repaired set, causing an amplification
of 2.
Also it's needed for correctness because a sstable can move quickly
across the logical sets, and having one tracker for each logical
set could cause the sstable to not be erased in the old set it
belonged to;
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Since table_state is a view to a compaction group, it makes sense
to rename it as so.
With upcoming incremental repair, each replica::compaction_group
will be actually two compaction groups, so there will be two
views for each replica::compaction_group.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
to get_static_effective_replication_map, in preparation
for separating local_effective_replication_map from
vnode_effective_replication_map (both are per-keyspace).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Prefer for specializing the local replication strategy,
local effective replication map, et. al byt defining
an is_local() predicate, similar to uses_tablets().
Note that is_vnode_based() still applies to local replication
strategy.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When running compactions are aborted by the aforementioned helper, in logs there appear a line like
"Compaction for ks/cf was stopped due to: user-triggered operation". This message could've been better, since it may indicate several distinct reasons described with the same "user-triggered operation".
With this PR the message will help telling "truncate", "cleanup", "rewrite" and "split" from each other.
Closesscylladb/scylladb#25136
* https://github.com/scylladb/scylladb:
compaction: Pass "reason" to perform_task_on_all_files()
compaction: Pass "reason" to run_with_compaction_disabled()
compaction: Pass "reason" to stop_and_disable_compaction()
Compaction is routine and the log messages pollute the log files,
hiding important information.
All the data is available via `nodetool compactionhistory`.
Reduce noise by demoting those log messages to debug level.
One test is adjusted to use debug level for compaction, since it
listens for those messages.
Closesscylladb/scylladb#24949
This tells "cleanup" (done via try_perform_cleanup) and prepares the
ground for more callers (see next patch)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
As requested in #22114, moved the files and fixed other includes and build system.
Moved files:
- interval.hh
- Map_difference.hh
Fixes: #22114
This is a cleanup, no need to backport
Closesscylladb/scylladb#25095
Print the keyspace.table names, issue trace log messages also
when returning early if tombstone_gc is disabled or
when gc_check_only_compacting_sstables is set.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#24914
The `drain` method, cancels all running compactions and moves the
compaction manager into the disabled state. To move it back to
the enabled state, the `enable` method shall be called.
This, however, throws an assertion error as the submission time is
not cancelled and re-enabling the manager tries to arm the armed timer.
Thus, cancel the timer, when calling the drain method to disable
the compaction manager.
Fixes https://github.com/scylladb/scylladb/issues/24504
All versions are affected. So it's a good candidate for a backport.
Closesscylladb/scylladb#24505