When being stopped compaction manager may step on ENOSPC. This is not a
reason to fail stopping process with abort, better to warn this fact in
logs and proceed as if nothing happened
refs: #11245
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Make it the future-returning method and setup the _stop_future in its
only caller. Makes next patch much simpler
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
With v2 having individual bounds of range tombstone as separate
fragments, out-of-order fragments become more difficult to handle,
especially in the presence of active range tombstone.
Scrub in both SKIP and SEGREGATE mode closes the partition on
seeing the first invalid fragment (SEGREAGE re-opens it immediately).
If there is an active range tombstone, scrub now also has to take care
of closing said tombstone when closing the partition. In a normal stream
it could just use the last position-in-partition to create a closing
bound. But when out-of-order fragments are on the table this is not
possible: the closing bound may be found later in the stream, with a
position smaller than that of the current position-in-partition.
To prevent extending range tombstone changes like that, Scrub now aborts
the compaction on the first invalid fragment seen *inside* an active
range tombstone.
Fixing a v2 stream with range tombstone changes is definitely possible,
but non-trivial, so we defer it until there is demand for it.
This series also makes the mutation fragment stream validator check for
open range tombstones on partition-end and adds a comprehensive
test-suite for the validator.
Fixes: #10168
Tests: unit(dev)
* scrub-rtc-handling-fix/v2 of github.com/denesb/scylla.git:
compaction/compaction: abort scrub when attempting to rectify stream with active tombstone
test/boost/mutation_test: add test for mutation_fragment_stream_validator
mutation_fragment_stream_validator: validate range tombstone changes
(cherry picked from commit edd0481b38)
It was assumed that offstrategy compaction is always triggered by streaming/repair
where it would inherit the caller's scheduling group.
However, offstrategy is triggered by a timer via table::_off_strategy_trigger so I don't see
how the expiration of this timer will inherit anything from streaming/repair.
Also, since d309a86, offstrategy compaction
may be triggered by the api where it will run in the default scheduling group.
The bottom line is that the compaction manager needs to explicitly perform offstrategy compaction
in the maintenance scheduling group similar to `perform_sstable_scrub_validate_mode`.
Fixes#10151
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220302084821.2239706-1-bhalevy@scylladb.com>
(cherry picked from commit 0764e511bb)
First patch removes incorrect usage of rwlock which should be restricted to minor and major compaction tasks.
Second patch revives a semaphore, which was lost in 6737c88045, as we want major to not wait on off-strategy completion before deciding whether or not it should proceed with execution. It wouldn't proceed with execution if user asked major to stop while waiting for a chance to run.
For master, we're going to rely on abortable variant of get_units() to allow major to be quickly aborted.
Fixes#10485.
Closes#10582
* github.com:scylladb/scylla:
compaction_manager: Revive custom job semaphore
compaction_manager: Remove rwlock usage in run_custom_job()
In commit 6737c88045, we started using a single semaphore for
maintenance operations, which is a good change.
However, after introduction of off-strategy, major cannot proceed
until off-strategy is done reshaping all its input files.
If user requests major to abort, the command will only return
once off-strategy is done, and that can take lots of time.
In master, we'll allow pending major to be quickly aborted, but
that's not possible here as abortable variant of get_units()
is not available yet.
Here, we'll allow major to proceed in parallel to off-strategy,
so major can decide whether or not it should run in parallel.
Fixes#10485.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The rwlock usage was introduced in 2017 commit 10eaa2339e.
Resharding was online back then and we want to serialize it with
major.
Rwlock usage should be restricted to major and minor, as clearly
stated in the documentation, but we're still using it in
run_custom_job().
It gains us nothing, it only prevents off-strategy and other
custom jobs from running concurrently to major.
Let's kill this as we want to allow off-strategy to not prevent
a major from happening in parallel, as the former works only
on the maintenance sstable set and won't interfere with
the latter.
Refs #10485.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Dtest triggers the problem by:
1) creating table with LCS
2) disabling regular compaction
3) writing a few sstables
4) running maintenance compaction, e.g. cleanup
Once the maintenance compaction completes, disengaged optional _last_compacted_keys
triggers an exception in notify_completion().
_last_compacted_keys is used by regular for its round-robin file picking
policy. It stores the last compacted key for each level. Meaning it's
irrelevant for any other compaction type.
Regular compaction is responsible for initializing it when it runs for
the first time to pick files. But with it disabled, notify_completion()
will find it uninitialized, therefore resulting in bad_optional_access.
To fix this, the procedure is skipped if _last_compacted_keys is
disengaged. Regular compaction, once re-enabled, will be able to
fill _last_compacted_keys by looking at metadata of the files.
compaction_test.py::TestCompaction::test_disable_autocompaction_doesnt_
block_user_initiated_compactions[CLEANUP-LeveledCompactionStrategy]
now passes.
Fixes#10378.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#10508
(cherry picked from commit 8e99d3912e)
_estimated_remaining_tasks gets updated via get_next_non_expired_sstables ->
get_compaction_candidates, but otherwise if we return earlier from
get_sstables_for_compaction, it does not get updated and may go out of sync.
Refs #10418
(to be closed when the fix reaches branch-4.6)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#10419
(cherry picked from commit 01f41630a5)
Flushing the base table triggers view building
and corresponding compactions on the view tables.
Temporarily disable compaction on both the base
table and all its view before flush and snapshot
since those flushed sstables are about to be truncated
anyway right after the snapshot is taken.
This should make truncate go faster.
In the process, this series also embeds `database::truncate_views`
into `truncate` and coroutinizes both
Refs #6309
Test: unit(dev)
Closes#10203
* github.com:scylladb/scylla:
replica/database: truncate: fixup indentation
replica/database: truncate: temporarily disable compaction on table and views before flush
replica/database: truncate: coroutinize per-view logic
replica/database: open-code truncate_view in truncate
replica/database: truncate: coroutinize run_with_compaction_disabled lambda
replica/database: coroutinize truncate
compaction_manager: add disable_compaction method
(cherry picked from commit aab052c0d5)
Since regular compaction may run in parallel no lock
is required per-table.
We still acquire a read lock in this patch, for backporting
purposes, in case the branch doesn't contain
6737c88045.
But it can be removed entirely in master in a follow-up patch.
This should solve some of the slowness in cleanup compaction (and
likely in upgrade sstables seen in #10060, and
possibly #10166.
Fixes#10175
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#10177
(cherry picked from commit 11ea2ffc3c)
This reverts commit 4c05e5f966.
Moving cleanup to maintenance group made its operation time up to
10x slower than previous release. It's a blocker to 4.6 release,
so let's revert it until we figure this all out.
Probably this happens because maintenance group is fixed at a
relatively small constant, and cleanup may be incrementally
generating backlog for regular compaction, where the former is
fighting for resources against the latter.
Fixes#10060.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220213184306.91585-1-raphaelsc@scylladb.com>
(cherry picked from commit a9427f150a)
Now that the table layer is using perform_offstrategy,
submit_offstrategy is no longer in use and can be deleted.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Return a future from perform_offstrategy, resolved
when the offstrategy compaction completes so that callers
can wait on it.
submit_offstrategy still submits the offstrategy compaction
in the background.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.
Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.
The changes we applied mechanically with a script, except to
licenses/README.md.
Closes#9937
Currently, when TWCS reshape finds a bucket containing more than 32
files, it will blindly resize that bucket to 32.
That's very bad because it doesn't take into consideration that
compaction efficiency depends on relative sizes of files being
compacted together, meaning that a huge file can be compacted with
a tiny one, producing lots of write amplification.
To solve this problem, STCS reshape logic will now be reused in
each time bucket. So only similar-sized files are compacted together
and the time bucket will be considered reshaped once its size tiers
are properly compacted, according to the reshape mode.
Fixes#9938.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220117205000.121614-1-raphaelsc@scylladb.com>
"
With this series the mutation compactor can now consume a v2 stream. On
the output side it still uses v1, so it can now act as an online
v2->v1 converter. This allows us to push out v2->v1 conversion to as far
as the compactor, usually the next to last component in a read pipeline,
just before the final consumer. For reads this is as far as we can go,
as the intra-node ABI and hence the result-sets built are v1. For
compaction we could go further and eliminate conversion altogether, but
this requires some further work on both the compactor and the sstable
writer and so it is left to be done later.
To summarize, this patchset enables a v2 input for the compactor and it
updates compaction and single partition reads to use it.
"
* 'mutation-compactor-consume-v2/v1' of https://github.com/denesb/scylla:
table: add make_reader_v2()
querier: convert querier_cache and {data,mutation}_querier to v2
compaction: upgrade compaction::make_interposer_consumer() to v2
mutation_reader: remove unecessary stable_flattened_mutations_consumer
compaction/compaction_strategy: convert make_interposer_consumer() to v2
mutation_writer: migrate timestamp_based_splitting_writer to v2
mutation_writer: migrate shard_based_splitting_writer to v2
mutation_writer: add v2 clone of feed_writer and bucket_writer
flat_mutation_reader_v2: add reader_consumer_v2 typedef
mutation_reader: add v2 clone of queue_reader
compact_mutation: make start_new_page() independent of mutation_fragment version
compact_mutation: add support for consuming a v2 stream
compact_mutation: extract range tombstone consumption into own method
range_tombstone_assembler: add get_range_tombstone_change()
range_tombstone_assembler: add get_current_tombstone()
run_with_compaction_disabled() is used to temporarily disable compaction
for a table T. Not only regular compaction, but all types.
Turns out it's stopping all types but it's only preventing new regular
compactions from starting. So major for example can start even with
compaction temporarily disabled. This is fixed by not allowing
compaction of any type if disabled. This wasn't possible before as
scrub incorrectly ran entirely with compaction disabled, so it wouldn't
be able to start, but now it only disables compaction while retrieving
its candidate list.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220107154942.59800-1-raphaelsc@scylladb.com>
Said wrapper was conceived to make unmovable `compact_mutation` because
readers wanted movable consumers. But `compact_mutation` is movable for
years now, as all its unmovable bits were moved into an
`lw_shared_ptr<>` member. So drop this unnecessary wrapper and its
unnecessary usages.
Move the ::database, ::keyspace, and ::table classes to a new replica
namespace and replica/ directory. This designates objects that only
have meaning on a replica and should not be used on a coordinator
(but note that not all replica-only classes should be in this module,
for example compaction and sstables are lower-level objects that
deserve their own modules).
The module is imperfect - some additional classes like distributed_loader
should also be moved, but there is only one way to untie Gordian knots.
Closes#9872
* github.com:scylladb/scylla:
replica: move ::database, ::keyspace, and ::table to replica namespace
database: Move database, keyspace, table classes to replica/ directory
Move replica-oriented classes to the replica namespace. The main
classes moved are ::database, ::keyspace, and ::table, but a few
ancillary classes are also moved. There are certainly classes that
should be moved but aren't (like distributed_loader) but we have
to start somewhere.
References are adjusted treewide. In many cases, it is obvious that
a call site should not access the replica (but the data_dictionary
instead), but that is left for separate work.
scylla-gdb.py is adjusted to look for both the new and old names.
Tables waiting for a chance to run reshape wouldn't trigger stop
exception, as the exception was only being triggered for ongoing
compactions. Given that stop reshape API must abort all ongoing
tasks and all pending ones, let's change run_custom_job() to
trigger the exception if it found that the pending task was
asked to stop.
Tests:
dtest: compaction_additional_test.py::TestCompactionAdditional::test_stop_reshape_with_multiple_keyspaces
unit: dev
Fixes#9836.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211223002157.215571-1-raphaelsc@scylladb.com>
The database, keyspace, and table classes represent the replica-only
part of the objects after which they are named. Reading from a table
doesn't give you the full data, just the replica's view, and it is not
consistent since reconciliation is applied on the coordinator.
As a first step in acknowledging this, move the related files to
a replica/ subdirectory.
"
Completes coroutinization of rewrite_sstables().
tests: UNIT(debug)
"
* 'rewrite_sstable_coroutinization' of https://github.com/raphaelsc/scylla:
compaction_manager: coroutinize main loop in sstable rewrite procedure
compaction_manager: coroutinize exception handling in sstable rewrite procedure
compaction_manager: mark task::finish_compaction() as noexcept
compaction_manager: make maybe_stop_on_error() more flexible
We can stop only specific compaction types.
Reshard should be excluded since it mustn't be stopped.
And other types of compaction types like "VALIDATION" or "INDEX_BUILD"
are valid in terms of their syntax but unsupported by scylla so we better
return an error rather than appear to support them.
Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211222133449.2177746-1-bhalevy@scylladb.com>
It's hard to integrate maybe_stop_on_error() with coroutines as it
accepts a resolved future, not an exception pointer. Let's adjust
its interface, making it more flexible to work with.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The gc_grace_seconds is a very fragile and broken design inherited from
Cassandra. Deleted data can be resurrected if cluster wide repair is not
performed within gc_grace_seconds. This design pushes the job of making
the database consistency to the user. In practice, it is very hard to
guarantee repair is performed within gc_grace_seconds all the time. For
example, repair workload has the lowest priority in the system which can
be slowed down by the higher priority workload, so that there is no
guarantee when a repair can finish. A gc_grace_seconds value that is
used to work might not work after data volume grows in a cluster. Users
might want to avoid running repair during a specific period where
latency is the top priority for their business.
To solve this problem, an automatic mechanism to protect data
resurrection is proposed and implemented. The main idea is to remove the
tombstone only after the range that covers the tombstone is repaired.
In this patch, a new table option tombstone_gc is added. The option is
used to configure tombstone gc mode. For example:
1) GC a tombstone after gc_grace_seconds
cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'timeout'} ;
This is the default mode. If no tombstone_gc option is specified by the
user. The old gc_grace_seconds based gc will be used.
2) Never GC a tombstone
cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'disabled'};
3) GC a tombstone immediately
cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'immediate'};
4) GC a tombstone after repair
cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair'};
In addition to the 'mode' option, another option 'propagation_delay_in_seconds'
is added. It defines the max time a write could possibly delay before it
eventually arrives at a node.
A new gossip feature TOMBSTONE_GC_OPTIONS is added. The new tombstone_gc
option can only be used after the whole cluster supports the new
feature. A mixed cluster works with no problem.
Tests: compaction_test.py, ninja test
Fixes#3560
[avi: resolve conflicts vs data_dictionary]
"
TWCS perform STCS on a window as long as it's the most recent one.
From there on, TWCS will compact all files in the past window into
a single file. With some moderate write load, it could happen that
there's still some compaction activity in that past window, meaning
that per-window major may miss some files being currently compacted.
As a result, a past window may contain more than 1 file after all
compaction activity is done on its behalf, which may increase read
amplification. To avoid that, TWCS will now make sure that per-window
major is serialized, to make sure no files are missed.
Fixes#9553.
tests: unit(dev).
"
* 'fix_twcs_per_window_major_v3' of https://github.com/raphaelsc/scylla:
TWCS: Make sure major on past window is done on all its sstables
TWCS: remove needless param for STCS options
TWCS: kill unused param in newest_bucket()
compaction: Implement strategy control and wire it
compaction: Add interface to control strategy behavior.
Allow stopping compaction by type on a given keyspace and list of tables.
Also add api unit test suite that tests the existing `stop_compaction` api
and the new `stop_keyspace_compaction` api.
Fixes#9700Closes#9746
* github.com:scylladb/scylla:
api: storage_service: validate_keyspace: improve exception error message
api: compaction_manager: add stop_keyspace_compaction
api: storage_service: expose validate_keyspace and parse_tables
api: compaction_manager: stop_compaction: fix type description
compaction_manager: stop_compaction: expose optional table*
test: api: add basic compaction_manager test
Currently when scrub/validate is stopped (e.g. via the api),
scrub_validate_mode_validate_reader co_return:s without
closing the reader passed to it - causing a crash due
to internal error check, see #9766.
Throwing a compaction_stopped_exception rather than co_return:ing
an exception will be handled as any other exeption, including closing
the reader.
Fixes#9766
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211213125528.2422745-1-bhalevy@scylladb.com>
Once current window is sealed, TWCS is supposed to compact all its
sstables into one. If there's ongoing compaction, it can happen that
sstables are missed and therefore past windows will contain more than
one sstable. Additionally, it could happen that major doesn't happen
at all if under heavy load. All these problems are fixed by serializing
major on past window and also postponing it if manager refuses to run
the job now.
Fixes#9553.
Reviewed-by: Benny Halevy <bhalevy@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
STCS option can be retrieved from class member, as newest_bucket()
is no longer a static function. let's get rid of it.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>