When purging regular tombstones, consult the min_live_timestamp, if available.
This is safe since we don't need to protect dead data from resurrection, as it is already dead.
For shadowable_tombstones, consult the min_memtable_live_row_marker_timestamp,
if available, otherwise fall back to the min_live_timestamp.
If we see in a view table a shadowable tombstone with time T, then in any row where the row marker's timestamp is higher than T the shadowable tombstone is completely ignored and does not hide data in any column. The shadowable tombstone can therefore be purged safely, with no effect and no risk of resurrecting deleted data.
In other words, the rows that might cause problems for purging a shadowable tombstone with time T are rows whose row markers are older than or equal to T. So to know whether a whole sstable can cause problems for a shadowable tombstone of time T, we need to check whether the sstable's oldest row marker (and not its oldest column) is older than or equal to T. The same check applies to the memtable.
If both extended timestamp statistics are missing, fall back to the legacy (and inaccurate) min_timestamp.
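The fallback order described above can be sketched roughly as follows. This is a minimal illustration, not the code from the series: the `timestamp_stats` struct, the field name `min_live_row_marker_timestamp`, and both helper functions are hypothetical names chosen for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

using timestamp_type = int64_t;

// Hypothetical per-source statistics (one sstable or one memtable).
// The struct and its field names are assumptions made for this sketch.
struct timestamp_stats {
    timestamp_type min_timestamp;                                 // legacy: covers live and dead data
    std::optional<timestamp_type> min_live_timestamp;             // oldest live data, if tracked
    std::optional<timestamp_type> min_live_row_marker_timestamp;  // oldest live row marker, if tracked
};

// Oldest timestamp in the source that a tombstone of the given kind could affect.
inline timestamp_type min_relevant_timestamp(const timestamp_stats& s, bool shadowable) {
    if (shadowable && s.min_live_row_marker_timestamp) {
        return *s.min_live_row_marker_timestamp;
    }
    // Regular tombstones, or shadowable ones without row-marker statistics:
    // use the live timestamp, else the legacy (inaccurate) minimum.
    return s.min_live_timestamp.value_or(s.min_timestamp);
}

// A source blocks purging a tombstone with timestamp t if it may hold
// relevant data that is older than or equal to t.
inline bool blocks_purge(const timestamp_stats& s, timestamp_type t, bool shadowable) {
    return min_relevant_timestamp(s, shadowable) <= t;
}
```

With statistics `{min_timestamp = 10, min_live_timestamp = 50, min_live_row_marker_timestamp = 100}`, a regular tombstone at T = 60 is blocked by this source, while a shadowable tombstone at T = 60 is not.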
Fixes scylladb/scylladb#20423
Fixes scylladb/scylladb#20424
> [!NOTE]
> no backport needed at this time
> We may consider a backport later on, after some soak time in master/enterprise,
> since we do see tombstone accumulation in the field under some materialized views workloads
Closes scylladb/scylladb#20446
* github.com:scylladb/scylladb:
cql-pytest: add test_compaction_tombstone_gc
sstable_compaction_test: add mv_tombstone_purge_test
sstable_compaction_test: tombstone_purge_test: test that old deleted data do not inhibit tombstone garbage collection
sstable_compaction_test: tombstone_purge_test: add testlog debugging
sstable_compaction_test: tombstone_purge_test: make_expiring: use next_timestamp
sstable, compaction: add debug logging for extended min timestamp stats
compaction: get_max_purgeable_timestamp: use memtable and sstable extended timestamp stats
compaction: define max_purgeable_fn
tombstone: can_gc_fn: move declaration to compaction_garbage_collector.hh
sstables: scylla_metadata: add ext_timestamp_stats
compaction_group, storage_group, table_state: add extended timestamp stats getters
sstables, memtable: track live timestamps
memtable_encoding_stats_collector: update row_marker: do nothing if missing
Migrate the `system_distributed.view_build_status` table to `system.view_build_status_v2`. The writes to the v2 table are done via raft group0 operations.
The new parameter `view_builder_version` stored in `scylla_local` indicates whether nodes should use the old or the new table.
New clusters use v2. Otherwise, the migration to v2 is initiated by the topology coordinator when the feature is enabled. It reads all the rows from the old table and writes them to the new table, and sets `view_builder_version` to v2. When the change is applied, all view_builder services are updated to write and read from the v2 table.
The old table `system_distributed.view_build_status` is set to read virtually from the new table in order to maintain compatibility.
When removing a node from the cluster, we remove its rows from the table atomically (fixes https://github.com/scylladb/scylladb/issues/11836). Also, during the migration, we remove all invalid rows.
Fixes scylladb/scylladb#15329
dtest https://github.com/scylladb/scylla-dtest/pull/4827
Closes scylladb/scylladb#19745
* github.com:scylladb/scylladb:
view: test view_build_status table with node replace
test/pylib: use view_build_status_v2 table in wait_for_view
view_builder: common write view_build_status function
view_builder: improve migration to v2 with intermediate phase
view: delete node rows from view_build_status on node removal
view: sanitize view_build_status during migration
view: make old view_build_status table a virtual table
replica: move streaming_reader_lifecycle_policy to header file
view_builder: test view_build_status_v2
storage_service: add view_build_status to raft snapshot
view_builder: migration to v2
db:system_keyspace: add view_builder_version to scylla_local
view_builder: read view status from v2 table
view_builder: introduce writing status mutations via raft
view_builder: pass group0_client and qp to view_builder
view_builder: extract sys_dist status operations to functions
db:system_keyspace: add view_build_status_v2 table
To return the minimum live timestamp and live row-marker
timestamp across a compaction_group, storage_group, or
table_state.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Added a new parameter `consider_only_existing_data` to major compaction
API endpoints. When enabled, major compaction will:
- Force-flush all tables.
- Force a new active segment in the commit log.
- Compact all existing SSTables and garbage-collect tombstones by only
checking the SSTables being compacted. Memtables, commit logs, and
other SSTables not part of the compaction will not be checked, as they
will only contain newer data that arrived after the compaction
started.
The `consider_only_existing_data` is passed down to the compaction
descriptor's `gc_check_only_compacting_sstables` option to ensure that
only the existing data is considered for garbage collection.
The option is also passed to the `maybe_flush_commitlog` method to make
sure all the tables are flushed and a new active segment is created in
the commit log.
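A rough sketch of the effect of the option on tombstone garbage collection follows. It assumes that, without the option, the purge bound is limited by the oldest data outside the compaction job; the `outside_data` struct and the function below are hypothetical and only mirror the description above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>
#include <vector>

using timestamp_type = int64_t;

// Hypothetical view of the data that is NOT part of the compaction job.
struct outside_data {
    std::vector<timestamp_type> other_sstable_min_timestamps;
    std::optional<timestamp_type> memtable_min_timestamp;
};

// Tombstones with a timestamp below the returned value may be purged
// (the usual expiry rules are not modelled in this sketch).
inline timestamp_type max_purgeable_timestamp(const outside_data& d,
                                              bool gc_check_only_compacting_sstables) {
    timestamp_type result = std::numeric_limits<timestamp_type>::max();
    if (gc_check_only_compacting_sstables) {
        // Everything outside the job is assumed to hold only newer data,
        // so it places no limit on purging.
        return result;
    }
    for (timestamp_type ts : d.other_sstable_min_timestamps) {
        result = std::min(result, ts);
    }
    if (d.memtable_min_timestamp) {
        result = std::min(result, *d.memtable_min_timestamp);
    }
    return result;
}
```

The force-flush and the new commit log segment are what make the assumption in the early-return branch reasonable: once they complete, anything outside the compacted set arrived after the compaction started.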
Fixes #19728
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Major compaction flushes all tables as a part of flushing the commitlog.
After forcing new active segments in the commitlog, all the tables are
flushed to enable reclaim of older commitlog segments. The main goal is
to flush the commitlog; flushing all the tables is just a dependency.
Rename maybe_flush_all_tables to maybe_flush_commitlog so that it
reflects the actual intent of the major compaction code. Add a new
wrapper around database::flush_all_tables(),
database::flush_commitlog(), which is now called from
maybe_flush_commitlog.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
To prevent stalls due to a large number of tokens.
For example, a large cluster with, say, 70 nodes can have
more than 16K tokens.
Fixes #19757
Closes scylladb/scylladb#19758
* github.com:scylladb/scylladb:
abstract_replication_strategy: make get_ranges async
database: get_keyspace_local_ranges: get vnode_effective_replication_map_ptr param
compaction: task_manager_module: open code maybe_get_keyspace_local_ranges
alternator: ttl: token_ranges_owned_by_this_shard: let caller make the ranges_holder
alternator: ttl: can pass const gms::gossiper& to ranges_holder
alternator: ttl: ranges_holder_primary: unconstify _token_ranges member
alternator: ttl: refactor token_ranges_owned_by_this_shard
To prevent stalls due to a large number of tokens.
For example, a large cluster with, say, 70 nodes can have
more than 16K tokens.
Fixes #19757
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Prepare for making the function async.
Then, it will need to hold on to the erm while getting
the token_ranges asynchronously.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It is used only here and can be simplified by
checking if the keyspace replication strategy
is per table by the caller.
Prepare for making get_keyspace_local_ranges async.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
There's parse_table_directory_name() static helper in database.cc code
that is used by methods that parse table tree layout for snapshot.
Export this helper for external usage and rename to fit the format_...
one introduced by previous patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The helper makes a table directory name (not a full path) out of the table
name and uuid. This is to be symmetrical with yet another helper that converts
a directory name back to table name and uuid (next patch).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently, database::tables_metadata::add_table needs to hold a write
lock before adding a table. So, if we update other classes keeping
track of tables before calling add_table, and the method yields,
table's metadata will be inconsistent.
Set all table-related info in tables_metadata::add_table_helper (called
by add_table) so that the operation is atomic.
Analogously for remove_table.
Fixes: #19833.
Closes scylladb/scylladb#20064
Consider the following:
```
T
0 split prepare starts
1 repair starts
2 split prepare finishes
3 repair adds unsplit sstables
4 repair ends
5 split executes
```
If repair produces an sstable after the split prepare phase, the replica will not split that sstable later, as the prepare phase is considered completed already. That causes split execution to fail, as replicas weren't really prepared. This can also be triggered with load-and-stream, which shares the same write (consumer) path.
The approach to fix this is the same one employed to prevent a race between split and migration. If migration happens during the prepare phase, the source may miss the split request, but the tablet will still be split on the destination (if needed). Similarly, the repair writer becomes responsible for splitting the data if the underlying table is in split mode. That's implemented in replica::table for correctness, so if a node crashes, a new sstable that missed the split is still split before being added to the set.
Fixes #19378.
Fixes #19416.
Closes scylladb/scylladb#19427
* github.com:scylladb/scylladb:
tablets: Fix race between repair and split
compaction: Allow "offline" sstable to be split
The reconcilable_result is built as it would be constructed for
forward read queries for tables with reversed order.
Mutations constructed for reversed queries are consumed forward.
Drop overloaded reversed functions that reverse read_command and
reconcilable_result directly and keep only those requiring smart
pointers. They are not used any more.
Remove schema reversing in query() and query_mutations() methods.
Instead, a reversed schema shall be passed for reversed queries.
Rename a schema variable from s into query_schema for readability.
Reverse reads have already been with us for a while, so this back
door option of reading entire partitions forward and reversing them
afterwards can be retired.
Consider the following:
    T
    0   split prepare starts
    1                                   repair starts
    2   split prepare finishes
    3                                   repair adds unsplit sstables
    4                                   repair ends
    5   split executes
If repair produces an sstable after the split prepare phase, the
replica will not split that sstable later, as the prepare phase is
considered completed already. That causes split execution to fail,
as replicas weren't really prepared. This can also be triggered
with load-and-stream, which shares the same write (consumer) path.
The approach to fix this is the same one employed to prevent a race
between split and migration. If migration happens during the
prepare phase, the source may miss the split request, but the
tablet will still be split on the destination (if needed).
Similarly, the repair writer becomes responsible for splitting
the data if the underlying table is in split mode. That's
implemented in replica::table for correctness, so if a node
crashes, a new sstable that missed the split is still split before
being added to the set.
Fixes #19378.
Fixes #19416.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
If the parent_info argument of compaction_manager::perform_compaction
is std::nullopt, then the created compaction executor isn't tracked by
the task manager. Currently, all compaction operations should be visible
in the task manager.
Modify split methods to keep split executor in task manager. Get rid of
the option to bypass task manager.
Closesscylladb/scylladb#19995
* github.com:scylladb/scylladb:
compaction: replace optional<task_info> with task_info param
compaction: keep split executor in task manager
Commit ad0e6b79 (replica: Remove all_datadir from keyspace config) removed all_datadirs from keyspace config; now it's datadir's turn. After this change a keyspace no longer references any on-disk directories; only the sstables' storage driver attached to the keyspace's tables does.
refs #12707
Closes scylladb/scylladb#19866
* github.com:scylladb/scylladb:
replica: Remove keyspace::config::datadir
sstables/storage: Evaluate path for keyspace directory in storage
sstables/storage: Add sstables_manager arg to init_keyspace_storage()
assert() is traditionally disabled in release builds, but not in
scylladb. This hasn't caused problems so far, but the latest abseil
release includes a commit [1] that causes a 1000 insn/op regression when
NDEBUG is not defined.
Clearly, we must move towards a build system where NDEBUG is defined in
release builds. But we can't just define it blindly without vetting
all the assert() calls, as some were written with the expectation that
they are enabled in release mode.
To solve the conundrum, change all assert() calls to a new SCYLLA_ASSERT()
macro in utils/assert.hh. This macro is always defined and is not conditional
on NDEBUG, so we can later (after vetting Seastar) enable NDEBUG in release
mode.
[1] 66ef711d68
Closes scylladb/scylladb#20006
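For illustration, an assertion macro that stays enabled regardless of NDEBUG could look roughly like the sketch below. The macro name `ALWAYS_ASSERT_SKETCH` and the helper `checked_divide` are hypothetical; the actual SCYLLA_ASSERT() in utils/assert.hh may be implemented differently.

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical stand-in for an always-enabled assertion. Unlike assert(),
// it does not compile to nothing when NDEBUG is defined.
#define ALWAYS_ASSERT_SKETCH(expr) \
    do { \
        if (!(expr)) { \
            std::fprintf(stderr, "%s:%d: assertion failed: %s\n", \
                         __FILE__, __LINE__, #expr); \
            std::abort(); \
        } \
    } while (0)

// Example use: the check survives release builds, so callers may keep
// relying on it for correctness.
inline int checked_divide(int a, int b) {
    ALWAYS_ASSERT_SKETCH(b != 0);
    return a / b;
}
```

Because the expression is always evaluated, call sites that depend on side effects or on the check firing in release mode keep working after NDEBUG is turned on.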
compaction_manager::perform_compaction does not create task manager
task for compaction if parent_info is set to std::nullopt. Currently,
we always want to create task manager task for compaction.
Remove optional from task info parameters which start compaction.
Track all compactions with task manager.
It's finally no longer used. Now only sstables storage code "knows" that
keyspace may have its on-disk directory.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is in preparation for the following patch that adds abort_source
variable to the sstables_manager.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
It was added to make integration of storage groups easier, but it's
complicated since it's another source of truth and we could have
problems if it becomes inconsistent with the group map.
Fixes #18506.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
We currently disable tombstone GC for compaction done on the read path of streaming and repair, because those expired tombstones can still prevent data resurrection. With time-based tombstone GC, missing a repair for long enough can cause data resurrection because a tombstone is potentially GC'd before it could be spread to every node by repair. So repair disseminating these expired tombstones helps clusters which missed repair for long enough. It is not a guarantee because compaction could have done the GC itself, but it is better than nothing.
This last resort is getting less important with repair-based tombstone GC. Furthermore, we have seen this cause huge repair amplification in a cluster, where expired tombstones triggered repair replicating otherwise identical rows.
This series makes tombstone GC on the streaming/repair compaction path configurable with a config item. The new config item defaults to `false` (the current behaviour); setting it to `true` enables tombstone GC.
Fixes: https://github.com/scylladb/scylladb/issues/19015
Not a regression, no backport needed
Closes scylladb/scylladb#19016
* github.com:scylladb/scylladb:
test/topology_custom/test_repair: add test for enable_tombstone_gc_for_streaming_and_repair
replica/table: maybe_compact_for_streaming(): toggle tombstone GC based on the control flag
replica: propagate enable_tombstone_gc_for_streaming_and_repair to maybe_compact_for_streaming()
db/config: introduce enable_tombstone_gc_for_streaming_and_repair
Currently, a pending replica that applies a write on a table that has
materialized views, will build all the view updates as a normal replica,
only to realize at a late point, in db::view::get_view_natural_endpoint(),
that it doesn't have a paired view replica to send the updates to. It will
then either drop the view updates, or send them to a pending view
replica, if such exists.
This work is unnecessary since it may be dropped, and even if there is a
pending view replica to send the updates to, the updates that are built
by the pending replica may be wrong since it may have incomplete
information.
This commit fixes the inefficiency by skipping the view update building
step when applying an update on a pending replica.
The metric total_view_updates_on_wrong_node is added to count the cases
that a view update is determined to be unnecessary.
The test reproduces the scenario of writing to a table and applying
the update on a pending replica, and verifies that the pending replica
doesn't try to build view updates.
Fixes scylladb/scylladb#19152
Closes scylladb/scylladb#19488
flat_mutation_reader_v2 was introduced in a pair of commits in 2021:
e3309322c3 "Clone flat_mutation_reader related classes into v2 variants"
08b5773c12 "Adapt flat_mutation_reader_v2 to the new version of the API"
as a replacement for flat_mutation_reader, using range_tombstone_change
instead of range_tombstone to represent range tombstones. See
those commits for more information.
The transition was incremental; the last use of the original
flat_mutation_reader was removed in 2022 in commit
026f8cc1e7 "db: Use mutation_partition_v2 in mvcc"
In turn, flat_mutation_reader was introduced in 2017 in commit
748205ca75 "Introduce flat_mutation_reader"
to transition from a mutation_reader that nested rows within
a partition in a separate stream, to a flat reader that streamed
partitions and rows in the same stream.
Here, we reclaim the original name and rename the awkward
flat_mutation_reader_v2 to mutation_reader.
Note that mutation_fragment_v2 remains since we still use the original
for compatibility, sometimes.
Some notes about the transition:
- files were also renamed. In one case (flat_mutation_reader_test.cc), the
rename target already existed, so we rename to
mutation_reader_another_test.cc.
- a namespace 'mutation_reader' with two definitions existed (in
mutation_reader_fwd.hh). Its contents were folded into the mutation_reader
class. As a result, a few #includes had to be adjusted.
Closes scylladb/scylladb#19356
Normally, the space overhead for TWCS is 1/N, where N is the number of windows. But during off-strategy compaction, the overhead is 100% because input sstables cannot be released earlier.
Reshaping a TWCS table that takes ~50% of the available space can result in the system running out of space.
That's fixed by restricting every TWCS off-strategy job to 10% of the free space on disk. Tables that aren't big will not be penalized with increased write amplification, as all input (disjoint) sstables can still be compacted in a single round.
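The 10% restriction can be pictured with the simplified greedy sketch below. It is not the reshape code from this series: `plan_reshape_jobs` is a hypothetical function that only models grouping input sstable sizes into jobs whose total input stays within a tenth of the free space.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Splits input sstable sizes into jobs so that each job's total input stays
// within 10% of the free space. An sstable larger than the budget ends up in
// a job of its own, since it cannot be split further in this model.
inline std::vector<std::vector<uint64_t>> plan_reshape_jobs(
        const std::vector<uint64_t>& sstable_sizes, uint64_t free_space) {
    const uint64_t budget = free_space / 10;
    std::vector<std::vector<uint64_t>> jobs;
    uint64_t current = 0;  // total input size of the job being filled
    for (uint64_t size : sstable_sizes) {
        // Start a new job for the first sstable, or when adding this one
        // would push the current job over the budget.
        if (jobs.empty() || current + size > budget) {
            jobs.emplace_back();
            current = 0;
        }
        jobs.back().push_back(size);
        current += size;
    }
    return jobs;
}
```

For a small table, for example two sstables of sizes 10 and 20 with 1000 units of free space, everything fits into a single job, which matches the point above that small tables see no extra write amplification.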
Fixes #16514.
Closes scylladb/scylladb#18137
* github.com:scylladb/scylladb:
compaction: Reduce twcs off-strategy space overhead to 10% of free space
compaction: wire storage free space into reshape procedure
sstables: Allow to get free space from underlying storage
replica: don't expose compaction_group to reshape task
compaction_group sits in replica layer and compaction layer is
supposed to talk to it through compaction::table_state only.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Those functions cannot return nullptr; they throw when the group is not
found, so it is better to return a reference instead.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
When a function is created with the CREATE FUNCTION statement, the
statement handler does all the necessary preparations on its own. The
very same code exists in schema_tables, when the function is loaded on
boot. This patch generalizes both and keeps function language-specific
context creation inside lang/ code.
The creation function returns context via argument reference. It would
have been nicer if it was returned via future<>, but it's not suitable
for future<T> type :(
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
And, while at it, rename the local variable that refers to it to "manager",
not "wasm". Query processor and database also have getters named
"wasm()", these are not renamed yet to keep patch smaller (and those
getters are going to be reworked further anyway).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This vector of paths is only used to generate the same vector of paths
for table config, but the latter already has all the needed info.
It's the part of the plan to stop using paths/directories in keyspaces
and tables, because with storage-options tables no longer keep their
data in "files on disk", so this information goes to sstables storage
manager (refs #12707)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#19119
Tablet snapshot, used by migration, can race with compaction and
can find files deleted. That won't cause data loss because the
error is propagated back into the coordinator that decides to
retry streaming stage. So the consequence is delayed migration,
which might in turn reduce node operation throughput (e.g.
when decommissioning a node). It should be rare though, so
shouldn't have drastic consequences.
Fixes #18977.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#18979
Retrieval of tablet stats must be serialized with mutations to token metadata, as the former requires tablet id stability.
If tablet split is finalized while retrieving stats, the saved erm, used by all shards, can have a lower tablet count than the one in a particular shard, causing an abort, as the tablet map requires that any id fed into it is lower than its current tablet count.
Fixes #18085.
Closes scylladb/scylladb#18287
* github.com:scylladb/scylladb:
test: Fix flakiness in topology_experimental_raft/test_tablets
service: Use tablet read selector to determine which replica to account table stats
storage_service: Fix race between tablet split and stats retrieval
If tablet split is finalized while retrieving stats, the saved erm, used by all
shards, will be invalidated. It can either cause incorrect behavior or
crash if id is not available.
This is addressed by feeding the local tablet map into the "coordinator"
collecting stats from all shards. We will also no longer have a snapshot
of erm shared between shards to help intra-node migration. This is
simplified by serializing token metadata changes and the retrieval of
the stats (latter should complete pretty fast, so it shouldn't block
the former for any significant time).
Fixes #18085.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Currently, we do not explicitly set a scheduling group for the schema
commitlog which causes it to run in the default scheduling group (called
"main"). However:
- It is important and significant enough that it should run in a
scheduling group that is separate from the main one,
- It should not run in the existing "commitlog" group as user writes may
sometimes need to wait for schema commitlog writes (e.g. read barrier
done to learn the schema necessary to interpret the user write) and we
want to avoid priority inversion issues.
Therefore, introduce a new scheduling group dedicated to the schema
commitlog.
Fixes: scylladb/scylladb#15566
Closes scylladb/scylladb#18715
There's a set of API endpoints that toggle per-table auto-compaction and tombstone-gc booleans. They all live in two different .cc files under the api/ directory and duplicate each other's code. This PR generalizes those handlers, places them next to each other, fixes a leak on stop and, as a nice side effect, lightens the database.hh header.
Closes scylladb/scylladb#18703
* github.com:scylladb/scylladb:
api,database: Move auto-compaction toggle guard
api: Move some table manipulation helpers from storage_service
api: Move table-related calls from storage_service domain
api: Reimplement some endpoints using existing helpers
api: Lost unset of tombstone-gc endpoints
Toggling per-table auto-compaction enabling bit is guarded with
on-database boolean and raii guard. It's only used by a single
api/column_family.cc file, so it can live there.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>