Currently the function calls boost::partial_sort with a middle
iterator that might be out of bound and cause undefined behavior.
Check the vector size, and do a partial sort only if its longer
than `max_sstables`, otherwise sort the whole vector.
Fixesscylladb/scylladb#20608
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#20609
(cherry picked from commit 39ce358d82)
Refs: scylladb/scylladb#20609
flat_mutation_reader_v2 was introduced in a pair of commits in 2021:
e3309322c3 "Clone flat_mutation_reader related classes into v2 variants"
08b5773c12 "Adapt flat_mutation_reader_v2 to the new version of the API"
as a replacement for flat_mutation_reader, using range_tombstone_change
instead of range_tombstone to represent represent range tombstones. See
those commits for more information.
The transition was incremental; the last use of the original
flat_mutation_reader was removed in 2022 in commit
026f8cc1e7 "db: Use mutation_partition_v2 in mvcc"
In turn, flat_mutation_reader was introduced in 2017 in commit
748205ca75 "Introduce flat_mutation_reader"
To transition from a mutation_reader that nested rows within
a partition in a separate stream, to a flat reader that streamed
partitions and rows in the same stream.
Here, we reclaim the original name and rename the awkward
flat_mutation_reader_v2 to mutation_reader.
Note that mutation_fragment_v2 remains since we still use the original
for compatibilty, sometimes.
Some notes about the transition:
- files were also renamed. In one case (flat_mutation_reader_test.cc), the
rename target already existed, so we rename to
mutation_reader_another_test.cc.
- a namespace 'mutation_reader' with two definitions existed (in
mutation_reader_fwd.hh). Its contents was folded into the mutation_reader
class. As a result, a few #includes had to be adjusted.
Closesscylladb/scylladb#19356
TWCS off-strategy suffers with 100% space overhead, so a big TWCS table
can cause scylla to run out of disk space during node ops.
To not penalize TWCS tables, that take a small percentage of disk,
with increased write ampl, TWCS off-strategy will be restricted to
10% of free disk space. Then small tables can still compact all
disjoint sstables in a single round.
Fixes#16514.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
reader_consumer_v2 being a noncopyable_function imposes a restriction
when stacking one interposer consumer on top of another.
Think for example of a token-based segregator on top of a timestamp
based one.
To achieve that, the interposer consumer creator must be reentrant,
such that the consumer can be created on each "channel", but today
the creator becomes unusable after first usage.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we define a formatter for
map<timestamp_type, vector<shared_sstable>>. since the operator<<
for this type is only used in the .cc file, and the only use case
of it is to provide the formatter for fmt, so the operator<< based
formatter is remove in this change.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#16163
When off-strategy is disabled, data segregation is not postponed,
meaning that getting partition estimate right is important to
decrease filter's false positives. With streaming, we don't
have min and max timestamps at destination, well, we could have
extended the RPC verb to send them, but turns out we can deduce
easily the amount of windows using default TTL. Given partitioner
random nature, it's not absurd to assume that a given range being
streamed may overlap with all windows, meaning that each range
will yield one sstable for each window when segregating incoming
data. Today, we assume the worst of 100 windows (which is the
max amount of sstables the input data can be segregated into)
due to the lack of metadata for estimating the window count.
But given that users are recommended to target a max of ~20
windows, it means partition estimate is being downsized 5x more
than needed. Let's improve it by using default TTL when
estimating window count, so even on absence of timestamp
metadata, the partition estimation won't be way off.
Fixes#15704.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
After "repair: Get rid of the gc_grace_seconds", the sstable's schema (mode,
gc period if applicable, etc) is used to estimate the amount of droppable
data (or determine full expiration = max_deletion_time < gc_before).
It could happen that the user switched from timeout to repair mode, but
sstables will still use the old mode, despite the user asked for a new one.
Another example is when you play with value of grace period, to prevent
data resurrection if repair won't be able to run in a timely manner.
The problem persists until all sstables using old GC settings are recompacted
or node is restarted.
To fix this, we have to feed latest schema into sstable procedures used
for expiration purposes.
Fixes#15643.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#15746
Now everything is prepared for the switch, let's do it.
Now let's wait for ICS to enjoy the set of changes.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
To be consistent with other compaction_strategy_options,
time_window_compaction_strategy_options uses compaction_strategy_impl::get_value
and cql3::statements::property_definitions::to_long helpers for
parsing.
Add temporarily empty validate method to compaction_strategy_options.
The method will validate the options and help determining whether
only the allowed options were set.
Prevent switch case statements from falling through without annotation
([[fallthrough]]) proving that this was intended.
Existing intended cases were annotated.
Closes#14607
In that level no io_priority_class-es exist. Instead, all the IO happens
in the context of current sched-group. File API no longer accepts prio
class argument (and makes io_intent arg mandatory to impls).
So the change consists of
- removing all usage of io_priority_class
- patching file_impl's inheritants to updated API
- priority manager goes away altogether
- IO bandwidth update is performed on respective sched group
- tune-up scylla-gdb.py io_queues command
The first change is huge and was made semi-autimatically by:
- grep io_priority_class | default_priority_class
- remove all calls, found methods' args and class' fields
Patching file_impl-s is smaller, but also mechanical:
- replace io_priority_class& argument with io_intent* one
- pass intent to lower file (if applicatble)
Dropping the priority manager is:
- git-rm .cc and .hh
- sed out all the #include-s
- fix configure.py and cmakefile
The scylla-gdb.py update is a bit hairry -- it needs to use task queues
list for IO classes names and shares, but to detect it should it checks
for the "commitlog" group is present.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#13963
compaction strategies know how to pick files that are most likely to
satisfy tombstone purge conditions (i.e. not shadow data in uncompacting
files).
This logic can be bypassed if tombstone GC was disabled by user,
as it's a waste of effort to proceed with it until re-enabled.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
enable_optimized_twcs_queries is specific to TWCS, therefore it
belongs to TWCS, not replica::table.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#13489
TWCS no longer keeps internal state, and will now rely on state
managed by each compaction group through compaction::table_state.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This is a step towards decoupling compaction strategy (computation)
and its state. Making the former stateless.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Move mutation-related files to a new mutation/ directory. The names
are kept in the global namespace to reduce churn; the names are
unambiguous in any case.
mutation_reader remains in the readers/ module.
mutation_partition_v2.cc was missing from CMakeLists.txt; it's added in this
patch.
This is a step forward towards librarization or modularization of the
source base.
Closes#12788
This series introduces two configurable options when working with TWCS tables:
- `restrict_twcs_default_ttl` - a LiveUpdate-able tri_mode_restriction which defaults to WARN and will notify the user whenever a TWCS table is created without a `default_time_to_live` setting
- `twcs_max_window_count` - Which forbids the user from creating TWCS tables whose window count (buckets) are past a certain threshold. We default to 50, which should be enough for most use cases, and a setting of 0 effectively disables the check.
Refs: #6923Fixes: #9029Closes#11445
* github.com:scylladb/scylladb:
tests: cql_query_test: add mixed tests for verifying TWCS guard rails
tests: cql_query_test: add test for TWCS window size
tests: cql_query_test: add test for TWCS tables with no TTL defined
cql: add configurable restriction of default_time_to_live when for TimeWindowCompactionStrategy tables
cql: add max window restriction for TimeWindowCompactionStrategy
time_window_compaction_strategy: reject invalid window_sizes
cql3 - create/alter_table_statement: Make check_restricted_table_properties accept a schema_ptr
Scylla mistakenly allows an user to configure an invalid TWCS window_size <= 0, which effectively breaks the notion of compaction windows.
Interestingly enough, a <= 0 window size should be considered an undefined behavior as either we would create a new window every 0 duration (?) or the table would behave as STCS, the reader is encouraged to figure out which one of these is true. :-)
Cassandra, on the other hand, will properly throw a ConfigurationException when receiving such invalid window sizes and we now match the behavior to the same as Cassandra's.
Refs: #2336
Pass the tombstone_gc_state from the compaction_strategy
to sstables get_gc_before_* functions using the table state
to get to the tombstone_gc_state.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To make sure fully_expired sstables are not missed
if get_sstables_for_compaction is called just heuristically,
change the state by setting _last_expired_check
to the current time only when no fully_expired_sstables are found
among the candidates.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Print the compaction_strategy `this` pointer
so we can distinguish between different instance of the
compaction_strategy object (some code paths copy it and
some may instantiate a branch new compaction_strategy object).
The motivation is detecting when the side effects of this function are
applied on the "master" instance, stored in the table shard.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
_estimated_remaining_tasks gets updated via get_next_non_expired_sstables ->
get_compaction_candidates, but otherwise if we return earlier from
get_sstables_for_compaction, it does not get updated and may go out of sync.
Refs #10418
(to be closed when the fix reaches branch-4.6)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#10419
This continues the work in a69d98c3d0,
by implementing the cleanup method in TWCS to make it bucket aware.
Till now, the default impl was used which cleanups on file at a
time, starting from the smallest.
The cleanup strategy for TWCS is simple. It's simply calling the
size tiered cleanup method for each bucket, so there will be
one job for each tier in each window.
The next strategies to receive this improvement are LCS and ICS
(the latter one being only available in enterprise).
Refs #10097.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
For compaction to be able to purge expired data, like tombstones, a
sstable set snapshot is set in the compaction descriptor.
That's a decision that belongs to task type. For example, all regular
compaction enable GC, whereas scrub for example doesn't for safety
reasons.
The problem is that the decision is being made by every instantiation
of compaction_descriptor in the strategies, which is both unnecessary
and also adds lots of boilerplate to the code, making it hard to
understand and work with.
As sstable set snapshot is an implementation detail, a new method
is being added to compaction_descriptor to make the intention
clearer, making the interface easier to understand.
can_purge_tombstones, used previously by rewrite task only, is being
reused for communicating GC intention into task::compact_sstables().
The boilerplate was a pain when adding a new strategy method for
the ongoing work on cleanup, described by issue #10097.
Another benefit is that we'll now only create a set snapshot when
compaction will really run. Before, it could happen that the snapshot
would be discarded if the compaction attempt had to be postponed,
which is a waste of cpu cycles.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.
Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.
The changes we applied mechanically with a script, except to
licenses/README.md.
Closes#9937
Currently, when TWCS reshape finds a bucket containing more than 32
files, it will blindly resize that bucket to 32.
That's very bad because it doesn't take into consideration that
compaction efficiency depends on relative sizes of files being
compacted together, meaning that a huge file can be compacted with
a tiny one, producing lots of write amplification.
To solve this problem, STCS reshape logic will now be reused in
each time bucket. So only similar-sized files are compacted together
and the time bucket will be considered reshaped once its size tiers
are properly compacted, according to the reshape mode.
Fixes#9938.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220117205000.121614-1-raphaelsc@scylladb.com>
The gc_grace_seconds is a very fragile and broken design inherited from
Cassandra. Deleted data can be resurrected if cluster wide repair is not
performed within gc_grace_seconds. This design pushes the job of making
the database consistency to the user. In practice, it is very hard to
guarantee repair is performed within gc_grace_seconds all the time. For
example, repair workload has the lowest priority in the system which can
be slowed down by the higher priority workload, so that there is no
guarantee when a repair can finish. A gc_grace_seconds value that is
used to work might not work after data volume grows in a cluster. Users
might want to avoid running repair during a specific period where
latency is the top priority for their business.
To solve this problem, an automatic mechanism to protect data
resurrection is proposed and implemented. The main idea is to remove the
tombstone only after the range that covers the tombstone is repaired.
In this patch, a new table option tombstone_gc is added. The option is
used to configure tombstone gc mode. For example:
1) GC a tombstone after gc_grace_seconds
cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'timeout'} ;
This is the default mode. If no tombstone_gc option is specified by the
user. The old gc_grace_seconds based gc will be used.
2) Never GC a tombstone
cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'disabled'};
3) GC a tombstone immediately
cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'immediate'};
4) GC a tombstone after repair
cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair'};
In addition to the 'mode' option, another option 'propagation_delay_in_seconds'
is added. It defines the max time a write could possibly delay before it
eventually arrives at a node.
A new gossip feature TOMBSTONE_GC_OPTIONS is added. The new tombstone_gc
option can only be used after the whole cluster supports the new
feature. A mixed cluster works with no problem.
Tests: compaction_test.py, ninja test
Fixes#3560
[avi: resolve conflicts vs data_dictionary]
Once current window is sealed, TWCS is supposed to compact all its
sstables into one. If there's ongoing compaction, it can happen that
sstables are missed and therefore past windows will contain more than
one sstable. Additionally, it could happen that major doesn't happen
at all if under heavy load. All these problems are fixed by serializing
major on past window and also postponing it if manager refuses to run
the job now.
Fixes#9553.
Reviewed-by: Benny Halevy <bhalevy@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
STCS option can be retrieved from class member, as newest_bucket()
is no longer a static function. let's get rid of it.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
When reshaping TWCS table in relaxed mode, which is the case for
offstrategy and boot, disjoint tolerance is too strict, which can
lead those processes to do more work than needed.
Let's increase the tolerance to max threshold, which will limit the
amount of sstables opened in compaction to a reasonable amount.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211130132538.56285-1-raphaelsc@scylladb.com>
From now on, get_sstables_for_compaction() will use table_state.
With table_state, we avoid layer violations like strategy using
manager and also makes testing easier.
Compaction unit tests were temporarily disabled to avoid a giant
commit which is hard to parse.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>