Allow create_pending_deletion_log to delete a bunch of sstables
potentially resides in different prefixes (e.g. in the base directory
and under staging/).
The motivation arises from table::cleanup_tablet that calls compaction_group::cleanup on all cg:s via cleanup_compaction_groups. Cleanup, in turn, calls delete_sstables_atomically on all sstables in the compaction_group, in all states, including the normal state as well as staging - hence the requirement to support deleting sstables in different sub-directories.
Also, apparently truncate calls delete_atomically for all sstables too, via table::discard_sstables, so if it happened to be executed during view update generation, i.e. when there are sstables in staging, it should hit the assertion failure reported in https://github.com/scylladb/scylladb/issues/18862 as well (although I haven't seen it yet, but I see no reason why it would happen). So the issue was apparently present since the initial implementation of the pending_delete_log. It's just that with tablet migration it is more likely to be hit.
Fixesscylladb/scylladb#18862
Needs backport to 6.0 since tablets require this capability
Closesscylladb/scylladb#19555
* github.com:scylladb/scylladb:
sstable_directory: create_pending_deletion_log: place pending_delete log under the base directory
sstables: storage: keep base directory in base class
sstables: storage: define opened_directory in header file
sstable_directory: use only dirlog
Cleanup of a deallocated tablet throws an exception.
Since failed cleanup is retried, we end up in an infinite loop.
Ignore cleanup of deallocated storage groups.
Fixes: #19752.
Needs to be backported to all branches with tablets (6.0 and later)
Closesscylladb/scylladb#20584
* github.com:scylladb/scylladb:
test: check if cleanup of deallocated sg is ignored
replica: ignore cleanup of deallocated storage group
Currently, attempt to cleanup deallocated storage group throws
an exception. Failed tablet cleanup is retried, stucking
in an endless loop.
Ignore cleanup of deallocated storage group.
When purging regular tombstone consult the min_live_timestamp, if available.
This is safe since we don't need to protect dead data from resurrection, as it is already dead.
For shadowable_tombstones, consult the min_memtable_live_row_marker_timestamp,
if available, otherwise fallback to the min_live_timestamp.
If we see in a view table a shadowable tombstone with time T, then in any row where the row marker's timestamp is higher than T the shadowable tombstone is completely ignored and it doesn't hide any data in any column, so the shadowable tombstone can be safely purged without any effect or risk resurrecting any deleted data.
In other words, rows which might cause problems for purging a shadowable tombstone with time T are rows with row markers older or equal T. So to know if a whole sstable can cause problems for shadowable tombstone of time T, we need to check if the sstable's oldest row marker (and not oldest column) is older or equal T. And the same check applies similarly to the memtable.
If both extended timestamp statistics are missing, fallback to the legacy (and inaccurate) min_timestamp.
Fixesscylladb/scylladb#20423Fixesscylladb/scylladb#20424
> [!NOTE]
> no backport needed at this time
> We may consider backport later on after given some soak time in master/enterprise
> since we do see tombstone accumulation in the field under some materialized views workloads
Closesscylladb/scylladb#20446
* github.com:scylladb/scylladb:
cql-pytest: add test_compaction_tombstone_gc
sstable_compaction_test: add mv_tombstone_purge_test
sstable_compaction_test: tombstone_purge_test: test that old deleted data do not inhibit tombstone garbage collection
sstable_compaction_test: tombstone_purge_test: add testlog debugging
sstable_compaction_test: tombstone_purge_test: make_expiring: use next_timestamp
sstable, compaction: add debug logging for extended min timestamp stats
compaction: get_max_purgeable_timestamp: use memtable and sstable extended timestamp stats
compaction: define max_purgeable_fn
tombstone: can_gc_fn: move declaration to compaction_garbage_collector.hh
sstables: scylla_metadata: add ext_timestamp_stats
compaction_group, storage_group, table_state: add extended timestamp stats getters
sstables, memtable: track live timestamps
memtable_encoding_stats_collector: update row_marker: do nothing if missing
before this change, we rely on `using namespace seastar` to use
`seastar::format()` without qualifying the `format()` with its
namespace. this works fine until we changed the parameter type
of format string `seastar::format()` from `const char*` to
`fmt::format_string<...>`. this change practically invited
`seastar::format()` to the club of `std::format()` and `fmt::format()`,
where all members accept a templated parameter as its `fmt`
parameter. and `seastar::format()` is not the best candidate anymore.
despite that argument-dependent lookup (ADT for short) favors the
function which is in the same namespace as its parameter, but
`using namespace` makes `seastar::format()` more competitive,
so both `std::format()` and `seastar::format()` are considered
as the condidates.
that is what is happening scylladb in quite a few caller sites of
`format()`, hence ADT is not able to tell which function the winner
in the name lookup:
```
/__w/scylladb/scylladb/mutation/mutation_fragment_stream_validator.cc:265:12: error: call to 'format' is ambiguous
265 | return format("{} ({}.{} {})", _name_view, s.ks_name(), s.cf_name(), s.id());
| ^~~~~~
/usr/bin/../lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/format:4290:5: note: candidate function [with _Args = <const std::basic_string_view<char> &, const seastar::basic_sstring<char, unsigned int, 15> &, const seastar::basic_sstring<char, unsigned int, 15> &, const utils::tagged_uuid<table_id_tag> &>]
4290 | format(format_string<_Args...> __fmt, _Args&&... __args)
| ^
/__w/scylladb/scylladb/seastar/include/seastar/core/print.hh:143:1: note: candidate function [with A = <const std::basic_string_view<char> &, const seastar::basic_sstring<char, unsigned int, 15> &, const seastar::basic_sstring<char, unsigned int, 15> &, const utils::tagged_uuid<table_id_tag> &>]
143 | format(fmt::format_string<A...> fmt, A&&... a) {
| ^
```
in this change, we
change all `format()` to either `fmt::format()` or `seastar::format()`
with following rules:
- if the caller expects an `sstring` or `std::string_view`, change to
`seastar::format()`
- if the caller expects an `std::string`, change to `fmt::format()`.
because, `sstring::operator std::basic_string` would incur a deep
copy.
we will need another change to enable scylladb to compile with the
latest seastar. namely, to pass the format string as a templated
parameter down to helper functions which format their parameters.
to miminize the scope of this change, let's include that change when
bumping up the seastar submodule. as that change will depend on
the seastar change.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Migrate the `system_distributed.view_build_status` table to `system.view_build_status_v2`. The writes to the v2 table are done via raft group0 operations.
The new parameter `view_builder_version` stored in `scylla_local` indicates whether nodes should use the old or the new table.
New clusters use v2. Otherwise, the migration to v2 is initiated by the topology coordinator when the feature is enabled. It reads all the rows from the old table and writes them to the new table, and sets `view_builder_version` to v2. When the change is applied, all view_builder services are updated to write and read from the v2 table.
The old table `system_distributed.view_build_status` is set to read virtually from the new table in order to maintain compatibility.
When removing a node from the cluster, we remove its rows from the table atomically (fixes https://github.com/scylladb/scylladb/issues/11836). Also, during the migration, we remove all invalid rows.
Fixesscylladb/scylladb#15329
dtest https://github.com/scylladb/scylla-dtest/pull/4827Closesscylladb/scylladb#19745
* github.com:scylladb/scylladb:
view: test view_build_status table with node replace
test/pylib: use view_build_status_v2 table in wait_for_view
view_builder: common write view_build_status function
view_builder: improve migration to v2 with intermediate phase
view: delete node rows from view_build_status on node removal
view: sanitize view_build_status during migration
view: make old view_build_status table a virtual table
replica: move streaming_reader_lifecycle_policy to header file
view_builder: test view_build_status_v2
storage_service: add view_build_status to raft snapshot
view_builder: migration to v2
db:system_keyspace: add view_builder_version to scylla_local
view_builder: read view status from v2 table
view_builder: introduce writing status mutations via raft
view_builder: pass group0_client and qp to view_builder
view_builder: extract sys_dist status operations to functions
db:system_keyspace: add view_build_status_v2 table
The `database::get_all_tables_flushed_at` method returns a variable
without setting the computed all_tables_flushed_at value. This causes
its caller, `maybe_flush_all_tables` to flush all the tables everytime
regardless of when they were last flushed. Fix this by returning
the computed value from `database::get_all_tables_flushed_at`.
Fixes#20301
Requires a backport to 6.0 and 6.1 as they have the same issue.
Closesscylladb/scylladb#20471
* github.com:scylladb/scylladb:
cql-pytest: add test to verify compaction_flush_all_tables_before_major_seconds config
database::get_all_tables_flushed_at: fix return value
Before we add a new, is_shadowable, parameter to it.
And define global `can_always_purge` and `can_never_purge`
functions, a-la `always_gc` and `never_gc`.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To return the minimum live timestamp and live row-marker
timestamp across a compaction_group, storage_group, or
table_state.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When garbage collecting tombstones, we care only
about shadowing of live data. However, currently
we track min/max timestamp of both live and dead
data, but there is no problem with purging tombstones
that shadow dead data (expired or shdowed by other
tombstones in the sstable/memtable).
Also, for shadowable tombstones, we track live row marker timestamps
separately since, if the live row marker timestamp is greater than
a shadowable tombstone timestamp, then the row marker
would shadow the shadowable tombstone thus exposing the cells
in that row, even if their timestasmp may be smaller
than the shadow tombstone's.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Refs #18161
Yet another approach to dealing with large commitlog submissions.
We handle oversize single mutation by adding yet another entry
typo: fragmented. In this case we only add a fragment (aha) of
the data that needs storing into each entry, along with metadata
to correlate and reconstruct the full entry on replay.
Because these fragmented entries are spread over N segments, we
also need to add references from the first segment in a chain
to the subsequent ones. These are released once we clear the
relevant cf_id count in the base.
*
This approach has the downside that due to how serialization etc
works w.r.t. mutations, we need to create an intermediate buffer
to hold the full serialized target entry. This is then incrementally
written into entries of < max_mutation_size, successively requesting
more segments.
On replay, when encountering a fragment chain, the fragment is
added to a "state", i.e. a mapping of currently processing
frag chains. Once we've found all fragments and concatenated
the buffers into a single fragmented one, we can issue a
replay callback as usual.
Note that a replay caller will need to create and provide such
a state object. Old signature replay function remains for tests
and such.
This approach bumps the file format (docs to come).
To ensure "atomicity" we both force synchronization, and should
the whole op fail, we restore segment state (rewinding), thus
discarding data all we wrote.
Closesscylladb/scylladb#19472
* github.com:scylladb/scylladb:
commitlog/database: Make some commitlog options updatable + add feature listener
features/config: Add feature for fragmented commitlog entries
docs: Add entry on commitlog file format v4
commitlog_test: Add more oversized cases
commitlog_replayer: Replay segments in order created
commitlog_replayer: Use replay state to support fragmented entries
commitlog_replayer: coroutinize partly
commitlog: Handle oversized entries
If the row_marker is missing then its timestamp
is missing as well, so there's no point
calling update_timestamp for it. Better return early.
This should cause no functional change.
The following patch will add more logic
for tracking extended timestamp stats.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The `database::get_all_tables_flushed_at` method returns a variable
without setting the computed all_tables_flushed_at value. This causes
its caller, `maybe_flush_all_tables` to flush all the tables everytime
regardless of when they were last flushed. Fix this by returning
the computed value from `database::get_all_tables_flushed_at`.
Fixes#20301
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
To be able to atomically delete sstables both in
base table directory and in its sub-directories,
like `staging/`, use a shared pending_delete_dir
under under the base directory.
Note that this requires loading and processing
the base directory first.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Added a new parameter `consider_only_existing_data` to major compaction
API endpoints. When enabled, major compaction will:
- Force-flush all tables.
- Force a new active segment in the commit log.
- Compact all existing SSTables and garbage-collect tombstones by only
checking the SSTables being compacted. Memtables, commit logs, and
other SSTables not part of the compaction will not be checked, as they
will only contain newer data that arrived after the compaction
started.
The `consider_only_existing_data` is passed down to the compaction
descriptor's `gc_check_only_compacting_sstables` option to ensure that
only the existing data is considered for garbage collection.
The option is also passed to the `maybe_flush_commitlog` method to make
sure all the tables are flushed and a new active segment is created in
the commit log.
Fixes#19728
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Major compaction flushes all tables as a part of flushing the commitlog.
After forcing new active segments in the commitlog, all the tables are
flushed to enable reclaim of older commitlog segments. The main goal is
to flush the commitlog and flushing all the table is just a dependency.
Rename maybe_flush_all_tables to maybe_flush_commitlog so that it
reflects the actual intent of the major compaction code. Added a new
wrapper method to database::flush_all_tables(),
database::flush_commitlog(), that is now called from
maybe_flush_commitlog.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Makes some commitlog options runtime updatable. Most important for this case,
the usage of fragmented entries. Also adds a subscription in database on said
feature, to possibly enable once cluster enables it.
Hides the functionality behind a cluster feature, i.e. postspones
using it until an upgrade is complete etc. This to allow rolling back
even with dirty nodes, at least until a cluster is commited.
Feature can also be disabled by scylla option, just in case. This will
lock it out of whole cluster, but this is probably good, because depending
on off or on, certain schema/raft ops might fail or succeed (due to large
mutations), and this should probably be equivalent across nodes.
Handed over from https://github.com/scylladb/scylladb/pull/20149
This adds minimal implementation of the start-restore API call.
The method starts a task that runs load-and-stream functionality against sstables from S3 bucket. Arguments are:
```
endpoint -- the ID in object_store.yaml config file
bucket -- the target bucket to get objects from
keyspace -- the keyspace to work on
table -- the table to work on
snapshot -- the name of the snapshot from which the backup was taken
```
The task runs in the background, its task_id is returned from the method once it's spawned and it should be used via /task_manager API to track the task execution and completion.
Remote sstables components are scanned as if they were placed in local upload/ directory. Then colelcted sstables are fed into load-and-stream.
This branch has https://github.com/scylladb/scylladb/pull/19890 (Integrated backup), https://github.com/scylladb/scylladb/pull/20120 (S3 lister) and few more minor PRs merged in. The restore branch itself starts with [utils: Introduce abstract (directory) lister](29c867b54d) commit.
refs: https://github.com/scylladb/scylladb/issues/18392Closesscylladb/scylladb#20305
* github.com:scylladb/scylladb:
tools/scylla-nodetool: add restore integration
test/object_store: Add simple restore test
test/object_store: Generalize prepare_snapshot_for_backup()
code: Introduce restore API method
sstable_loader: Add sstables::storage_manager dependency
sstable_loader: Maintain task manager module
sstable_loader: Out-line constructor
distributed_loader: Split get_sstables_from_upload_dir()
sstables/storage: Compose uploaded sstable path simpler
sstable_directory: Prepare FS lister to scan files on S3
sstable_directory: Parse sstable component without full path
s3-client: Add support for lister::filter
utils: Introduce abstract (directory) lister
The method starts a task that uses sstables_loader load-and-stream
functionality to bring new sstables into the cluster. The existing
load-and-stream picks up sstables from upload/ directory, the newly
introduced task collects them from S3 bucket and given prefix (that
correspond to the path where backup API method put them).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Attempting to read a partition via `SELECT * FROM MUTATION_FRAGMENTS()`, which the node doesn't own, from a table using tablets causes a crash.
This is because when using tablets, the replica side simply doesn't handle requests for un-owned tokens and this triggers a crash.
We should probably improve how this is handled (an exception is better than a crash), but this is outside the scope of this PR.
This PR fixes this and also adds a reproducer test.
Fixes: https://github.com/scylladb/scylladb/issues/18786
Fixes a regression introduced in 6.0, so needs backport to 6.0 and 6.1
Closesscylladb/scylladb#20109
* github.com:scylladb/scylladb:
test/tablets: Test that reading tablets' mutations from MUTATION_FRAGMENTS works
replica/mutation_dump: enfore pinning of effective replication map
replica/mutation_dump: handle un-owned tokens (with tablets)
Next patches will need this method to initialize sstable_directory
differently and then do its regular processing. For that, split the
method into two, next patch will re-use the common part it needs.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
To prevent stalls due to large number of tokens.
For example, large cluster with say 70 nodes can have
more than 16K tokens.
Fixes#19757Closesscylladb/scylladb#19758
* github.com:scylladb/scylladb:
abstract_replication_strategy: make get_ranges async
database: get_keyspace_local_ranges: get vnode_effective_replication_map_ptr param
compaction: task_manager_module: open code maybe_get_keyspace_local_ranges
alternator: ttl: token_ranges_owned_by_this_shard: let caller make the ranges_holder
alternator: ttl: can pass const gms::gossiper& to ranges_holder
alternator: ttl: ranges_holder_primary: unconstify _token_ranges member
alternator: ttl: refactor token_ranges_owned_by_this_shard
Currently "table removal" is logged as a reason of compaction stop for table drop,
tablet cleanup and tablet split. Modify log to reflect the reason.
Closesscylladb/scylladb#20042
* github.com:scylladb/scylladb:
test: add test to check compaction stop log
compaction: fix compaction group stop reason
To prevent stalls due to large number of tokens.
For example, large cluster with say 70 nodes can have
more than 16K tokens.
Fixes#19757
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Prepare for making the function async.
Then, it will need to hold on to the erm while getting
the token_ranges asynchronously.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It is used only here and can be simplified by
checking if the keyspace replication strategy
is per table by the caller.
Prepare for making get_keyspace_local_ranges async.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
There's parse_table_directory_name() static helper in database.cc code
that is used by methods that parse table tree layout for snapshot.
Export this helper for external usage and rename to fit the format_...
one introduced by previous patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The one makes table directory (not full path) out of table name and
uuid. This is to be symmetrical with yet another helper that converts
dirctory name back to table name and uuid (next patch)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
By making it a required argument, making sure the topology version is
pinned for the duration of the query. This is needed because mutation
dump queries bypass the storage proxy, where this pinning usually takes
place. So it has to be enforced here.
When using tablets, the replica-side doesn't handle un-owned tokens.
table::shard_for_reads() will just return 0 for un-owned tokens, and a
later attempt at calling table::storage_group_for_token() with said
un-owned token will cause a crash (std::terminate due to
std::out_of_range thrown in noexcept context).
The replicas rely on the coordinator to not send stray requests, but for
select from mutation_fragments(table) queries, there is no coordinator
side who could do the correct dispatching. So do this in
mutation_dump(), just creating empty readers for un-owned tokens.
compaction_manager::remove passes "table removal" as a reason
of stopping ongoing compactions, but currently remove method
is also called when a tablet is migrated or split.
Pass the actual reason of compaction stop, so that logs aren't
misleading.
Now, when each shard storage_group_manager keeps
only the storage_groups for the tablet replica it owns,
we can simple return the storage_group map size
instead of counting the number of tablet replicas
mapped to this shard.
Add a unit test that sums the tablet count
on all shards and tests that the sum is equal
to the configured default `initial_tablets.
Fixes#18909
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#20223
Currently, database::tables_metadata::add_table needs to hold a write
lock before adding a table. So, if we update other classes keeping
track of tables before calling add_table, and the method yields,
table's metadata will be inconsistent.
Set all table-related info in tables_metadata::add_table_helper (called
by add_table) so that the operation is atomic.
Analogically for remove_table.
Fixes: #19833.
Closesscylladb/scylladb#20064
before this change, `scylla sstable shard-of` didn't support tablets,
because:
- with tablets enabled, data distribution uses the scheduler
- this replaces the previous method of mapping based on vnodes and shard numbers
- as a result, we can no longer deduce sstable mapping from token ranges
in this change, we:
- read `system.tablets` table to retrieve tablet information
- print the tablet's replica set (list of <host, shard> pairs)
- this helps users determine where a given sstable is hosted
This approach provides the closest equivalent functionality of
`shard-of` in the tablet era.
Fixesscylladb/scylladb#16488
---
no need to backport, it's an improvement, not a critical fix.
Closesscylladb/scylladb#20002
* github.com:scylladb/scylladb:
tools: enhance `scylla sstable shard-of` to support tablets
replica/tablets: extract tablet_replica_set_from_cell()
tools: extract get_table_directory() out
tools: extract read_mutation out
build: split the list of source file across multiple line
tools/scylla-sstable: print warning when running shard-of with tablets
Consider the following:
```
T
0 split prepare starts
1 repair starts
2 split prepare finishes
3 repair adds unsplit sstables
4 repair ends
5 split executes
```
If repair produces sstable after split prepare phase, the replica will not split that sstable later, as prepare phase is considered completed already. That causes split execution to fail as replicas weren't really prepared. This also can be triggered with load-and-stream which shares the same write (consumer) path.
The approach to fix this is the same employed to prevent a race between split and migration. If migration happens during prepare phase, it can happen source misses the split request, but the tablet will still be split on the destination (if needed). Similarly, the repair writer becomes responsible for splitting the data if underlying table is in split mode. That's implemented in replica::table for correctness, so if node crashes, the new sstable missing split is still split before added to the set.
Fixes#19378.
Fixes#19416.
**Please replace this line with justification for the backport/\* labels added to this PR**
Closesscylladb/scylladb#19427
* github.com:scylladb/scylladb:
tablets: Fix race between repair and split
compaction: Allow "offline" sstable to be split
Remove the existing copy constructor to enable the use of the implicit
copy constructor. This fixes the issue of `_sstable_set_ids` not being
copied in the current copy constructor.
Fixes#19519
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
The lock_table() method needs database, ks and cf to find the table on
all shards. The same can be achieved with the help of global_table_ptr
thing that all the core callers already have at hand.
There's a test that doesn't have global table, but it can get one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#20139
When a reverse slice is provided, it is given in the native reverse
format. Thus the ranges will be returned in the same order as stored
in the slice.
Therefore there is no need to distinguish between get_ranges and
get_native_ranges. The latter one gets dropped and get_ranges returns
ranges in the same order as stored in the slice.
The reconcilable_result is built as it would be constructed for
forward read queries for tables with reversed order.
Mutations constructed for reversed queries are consumed forward.
Drop overloaded reversed functions that reverse read_command and
reconcilable_result directly and keep only those requiring smart
pointers. They are not used any more.
Remove schema reversing in query() and query_mutations() methods.
Instead, a reversed schema shall be passed for reversed queries.
Rename a schema variable from s into query_schema for readability.
Reverse reads have already been with us for a while, thus this back
door option to read entire paritions forward and reversing them after
can be retired.
Consider the following:
T
0 split prepare starts
1 repair starts
2 split prepare finishes
3 repair adds unsplit sstables
4 repair ends
5 split executes
If repair produces sstable after split prepare phase, the replica
will not split that sstable later, as prepare phase is considered
completed already. That causes split execution to fail as replicas
weren't really prepared. This also can be triggered with
load-and-stream which shares the same write (consumer) path.
The approach to fix this is the same employed to prevent a race
between split and migration. If migration happens during prepare
phase, it can happen source misses the split request, but the
tablet will still be split on the destination (if needed).
Similarly, the repair writer becomes responsible for splitting
the data if underlying table is in split mode. That's implemented
in replica::table for correctness, so if node crashes, the new
sstable missing split is still split before added to the set.
Fixes#19378.
Fixes#19416.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Currently, each change to tablet metadata triggers a full metadata reload from disk. This is very wasteful, especially if the metadata change affects only a single row in the `system.tablets` table. This is the case when the tablet load balancer triggers a migration, this will affect a single row in the table, but today will trigger a full reload.
We expect tablet count to potentially grow to thousands and beyond and the overhead of this full reload can become significant.
This PR makes tablet metadata reload partial, instead of reloading all metadata on topology or schema changes, reload only the partitions that are affected by the change. Copy the rest from the in-memory state.
This is done with two passes: first the change mutations are scanned and a hint is produced. This hint is then passed down to the reload code, which will use it to only reload parts (rows/partitions) of the metadata that has actually changed.
The performance difference between full reload and partial reload is quite drastic:
```
INFO 2024-07-25 05:06:27,347 [shard 0:stat] testlog - Tablet metadata reload:
full 616.39ms
partial 0.18ms
```
This was measured with the modified (by this PR) `perf_tablets`, which creates 100 tables, each with 2K tablets. The test was modified to change a single tablet, then do a full and partial reload respectively, measuring the time it takes for reach.
Fixes: #15294
New feature, no backport needed.
Closesscylladb/scylladb#15541
* github.com:scylladb/scylladb:
test/perf/perf_tablets: add tablet metadata reload perf measurement
test/boost/tablets_test: add test for partial tablet metadata updates
db/schema_tables: pass tablet hint to update_tablet_metadata()
service/storage_service: load_tablet_metadata(): add hint parameter
service/migration_listener: update_tablet_metadata(): add hint parameter
service/raft/group0_state_machine: provide tablet change hint on topology change
service/storage_service: topology_state_load(): allow providing change hint
replica/tablets: add update_tablet_metadata()
replica/tablets: fix indentation
replica/tablets: extract tablet_metadata builder logic
replica/tablets: add get_tablet_metadata_change_hint() and update_tablet_metadata_change_hint()
locator/tablets: add tablet_map::clear_tablet_transition_info()
locator/tablets: make tablet_metadata cheap to copy
mutation/canonical_mutation: add key()