When storage_service::restore_tablets() resolves, it only means that the
tablet transitions are done, including restore transitions, but not
necessarily that they succeeded. So before resolving the restoration
task with success, we need to check that all sstables were downloaded
and, if not, resolve the task with an exception.
Test included. It uses fault injection to abort the download of a single
sstable early, then checks that the error is properly propagated back
to the task-waiting API.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
After each SSTable is successfully attached to the local table in
download_tablet_sstables(), update its downloaded status in
system_distributed.snapshot_sstables to true. This enables tracking
restore progress by counting how many SSTables have been downloaded.
Change attach_sstable() return type from future<> to
future<sstables::shared_sstable>, returning the SSTable that was
attached. This will be used to extract the SSTable identifier and
first token for updating the download status.
Add a method to update the downloaded status of a specific SSTable
entry in system_distributed.snapshot_sstables. This will be used
by the tablet restore process to mark SSTables as downloaded after
they have been successfully attached to the local table.
Add a 'downloaded' boolean column to the snapshot_sstables table
schema and the corresponding field to the snapshot_sstable_entry
struct. Update insert_snapshot_sstable() and get_snapshot_sstables()
to write and read this column.
This column will be used to track which SSTables have been
successfully downloaded during a tablet restore operation.
Move the TTL value used for snapshot_sstables rows from a local
variable in insert_snapshot_sstable() to a class-level constant
SNAPSHOT_SSTABLES_TTL_SECONDS, making it reusable by other methods.
The test is derived from test_restore_with_streaming_scopes(), with
the exception that it doesn't check streaming directions, doesn't
check mutations right after creation, and doesn't loop over scoped
sub-tests, because there's no scope concept here.
Also, it verifies just two topologies, which seems to be enough. The scopes
test has many topologies because of the nature of scoped restore; with
cluster-wide restore such flexibility is not required.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This patch adds
- Changes in sstables_loader::restore_tablets() method
It populates the system_distributed_keyspace.snapshot_sstables table
with the information read from the manifest
- Implementation of the tablet_restore_task_impl::run() method
It emplaces a bunch of tablet migrations with the "restore" kind
- Topology coordinator handling of tablet_transition_stage::restore
When seen, the coordinator calls the RESTORE_TABLET RPC against all tablet
replicas
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The topology coordinator will need to call this verb against existing
tablet replicas to ask them to restore tablet sstables. Here's the RPC
verb to do it.
It now returns an empty restore_result to make it "synchronous" -- the
co_await send_restore_tablets() won't resolve until the client call
finishes.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Extracts the data from the snapshot_sstables table and filters only the
sstables belonging to the current node and the tablet in question, then
starts downloading the matched sstables.
Extracted from Ernest's PR #28701 and piggy-backing on the refactoring from
another of Ernest's PRs, #28773. It will be used by the next patches.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When doing a cluster-wide restore using the topology coordinator, the
coordinator will need to serve a new tablet transition kind --
restore. For that, it will need to receive information about
where to perform the restore from -- the endpoint and bucket pair. This
data is available from nowhere but the tablet transition itself, so
add the "restore_config" member carrying this data.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The new cluster-wide tablets restore API is going to be asynchronous,
just like the existing node-local one. For that, task_manager tasks
will be used.
This patch adds a skeleton for the tablets-restore task with an empty run
method. Next patches will populate it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Withdrawn from #28701. The endpoint implementation from the PR is going
to be reworked, but the swagger description and set/unset placeholders
are very useful.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Ernest Zaslavsky <ernest.zaslavsky@scylladb.com>
When restoring a backup into a keyspace under a different name than the
one it had during backup, the snapshot_sstables table must
be populated with the _new_ keyspace name, not the one taken from the
manifest. The same is true for the table name.
This patch makes it possible to override the keyspace/table loaded from
the manifest file with the provided values. In the future it would also be
good to check that, if those values are not provided by the user, the values
read from different manifest files are the same.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Rename arguments and local variables in get_sstables_for_tablet to avoid
references to SSTable-specific terminology. This makes the function more
generic and better suited for reuse with different range types.
Generalize get_sstables_for_tablet by templating the return type so it
produces vectors matching the input range's value type. This makes the
function more flexible and prepares it for reuse in tablet-aware
restore.
Add getters for the first and last tokens in get_sstables_for_tablet to
make the function more generic and suitable for future use in the
tablet-aware restore code.
Move the download_sstable and get_sstables_for_tablet static functions
from sstables_loader into a new file to make them reusable by the
tablet-aware restore code.
Extract single-tablet range filtering into a new
get_sstables_for_tablet function, taken from the existing
get_sstables_for_tablets. This will later be reused in the
tablet-aware restore code.
Extract the logic for downloading a single sstable into a dedicated
function and reuse it in download_fully_contained_sstables. This
supports upcoming changes that consolidate common code.
Add a shard_id member to the minimal_sst_info struct as part of the
tablet-aware restore refactoring. This will support upcoming changes
that extract common code.
This change adds functionality for parsing backup manifests
and populating system_distributed.snapshot_sstables with
the content of the manifests.
This change is useful for tablet-aware restore. The function
introduced here will be called by the coordinator node
when restore starts to populate the snapshot_sstables table
with the data that workers need to execute the restore process.
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
Co-authored-by: Pavel Emelyanov <xemul@scylladb.com>
This patch adds the snapshot_sstables table with the following
schema:
```cql
CREATE TABLE system_distributed.snapshot_sstables (
    snapshot_name text,
    keyspace text, table text,
    datacenter text, rack text,
    id uuid,
    first_token bigint, last_token bigint,
    toc_name text, prefix text,
    PRIMARY KEY ((snapshot_name, keyspace, table, datacenter, rack), first_token, id));
```
The table will be populated by the coordinator node during the restore
phase (and later on during the backup phase to accommodate live-restore).
The content of this table is meant to be consumed by the restore worker nodes,
which will use this data to filter sstables and download them file by file.
Fixes SCYLLADB-263
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
The loader will need to populate and read data from the recently added
system_distributed.snapshot_sstables table, so this dependency is
truly needed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Next patches will introduce an RPC handler to restore a tablet on a
replica. The handler will be registered by sstables_loader, and it will
have to call that helper from storage_service, which thus needs to be
moved to public scope.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The goal of the split is to have try_transit_tablet() that
- doesn't throw if the tablet is in transition, but reports it back
- doesn't wait for the submitted transition to finish
The user will be the tablet-aware restore code, which will call this new
trying helper in parallel, then wait for all transitions to finish.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Implement tablet split, tablet merge and tablet migration for tables that use the experimental logstor storage engine.
* Tablet merge simply merges the histograms of segments of one compaction group with another.
* For tablet split we take the segments from the source compaction group, read them, write all live records to separate segments according to the split classifier, and move the separated segments to the target compaction groups.
* For tablet migration we use stream_blob, similarly to file streaming of sstables. We add a new op type for streaming a logstor segment. On the source we take a snapshot of the segments with an input stream that reads the segment, and on the target we create a sink that allocates a new segment on the target shard and writes to it.
* We also make some improvements to recovery and loading of segments. We add a segment header that contains useful information for non-mixed segments, such as the table and token range.
Refs SCYLLADB-770
no backport - still a new and experimental feature
Closes scylladb/scylladb#29207
* github.com:scylladb/scylladb:
test: logstor: additional logstor tests
docs/dev: add logstor on-disk format section
logstor: add version and crc to buffer header
test: logstor: tablet split/merge and migration
logstor: enable tablet balancing
logstor: streaming of logstor segments using stream_blob
logstor: add take_logstor_snapshot
logstor: segment input/output stream
logstor: implement compaction_group::cleanup
logstor: tablet split
logstor: tablet merge
logstor: add compaction reenabler
logstor: add segment header
logstor: serialize writes to active segment
replica: extend compaction_group functions for logstor
replica: add compaction_group_for_logstor_segment
logstor: code cleanup
Since we no longer support upgrades from versions that lack
v2 of the "view building status" code (building status is managed by raft), we can remove the v1 code and the upgrade code, and make sure we do not boot with the old "builder status" version.
The v2 version was introduced by 8d25a4d678, which is included in scylla-2025.1.0.
No backport needed since this is code removal.
Closes scylladb/scylladb#29105
* github.com:scylladb/scylladb:
view: drop unused v1 builder code
view: remove upgrade to raft code
Replace the range scan in read_verify_workload() with individual
single-partition queries, using the keys returned by
prepare_write_workload() instead of hard-coding them.
The range scan was previously observed to time out in debug mode after
a hard cluster restart. Single-partition reads are lighter on the
cluster and less likely to time out under load.
The new verification is also stricter: instead of merely checking that
the expected number of rows is returned, it verifies that each written
key is individually readable, catching any data-loss or key-identity
mismatch that the old count-only check would have missed.
This is the second attempt at stabilizing this test, after the recent
854c374ebf. That fix made sure that the
cluster has converged on topology and nodes see each other before running
the verify workload.
Fixes: SCYLLADB-1331
Closes scylladb/scylladb#29313
The supergroup replaces the streaming (a.k.a. maintenance) group, inherits its 200 shares, and consists of four sub-groups (all with equal shares of 200 within the new supergroup)
* maintenance_compaction. This group configures `compaction_manager::maintenance_sg()` group. User-triggered compaction runs in it
* backup. This group configures `snapshot_ctl::config::backup_sched_group`. Native backup activity runs there
* maintenance. This is a new "visible" name: everything that was called "maintenance" in the code ran in the "streaming" group; now it will run in "maintenance". The activities include those that don't communicate over RPC (see below why)
* `tablet_allocator::balance_tablets()`
* `sstables_manager::components_reclaim_reload_fiber()`
* `tablet_storage_group_manager::merge_completion_fiber()`
* metrics exporting http server altogether
* streaming. This is purely the existing streaming group that just moves under the new supergroup. Everything else that ran there continues doing so, including
* hints sender
* all view building related components (update generator, builder, workers)
* repair
* stream_manager
* messaging service (except for verb handlers that switch groups)
* join_cluster() activity
* REST API
* ... something else I forgot
The `--maintenance_io_throughput_mb_per_sec` option is introduced. It controls the IO throughput limit applied to the maintenance supergroup. If not set, the `--stream_io_throughput_mb_per_sec` option is used to preserve backward compatibility.
All new sched groups inherit `request_class::maintenance` (however, "backup" seems not to make any requests yet).
Moving more activities from "streaming" into "maintenance" (or their own groups) is possible, but one will need to take care of RPC group switching. The thing is that when a client makes an RPC call, the server may switch to one of the pre-negotiated scheduling groups. Verbs for existing activities that run in the "streaming" group are routed through an RPC index that negotiates the "streaming" group on the server side. If any of that client code moves to some other group, the server will still run the handlers in "streaming", which is not quite expected. That's one of the main reasons why only the selected fibers were moved to their own "maintenance" group. Similar for backup -- this code doesn't use RPC, so it can be moved. The restore code uses load-and-stream and the corresponding RPCs, so it cannot simply be moved into its own new group.
Fixes SCYLLADB-351
New feature, not backporting
Closes scylladb/scylladb#28542
* github.com:scylladb/scylladb:
code: Add maintenance/maintenance group
backup: Add maintenance/backup group
compaction: Add maintenance/maintenance_compaction group
main: Introduce maintenance supergroup
main: Move all maintenance sched group into streaming one
database: Use local variable for current_scheduling_group
code: Live-update IO throughputs from main
shared_tombstone_gc_state::update_repair_time() uses copy-on-write
semantics: each call copies the entire per_table_history_maps and the
per-table repair_history_map. repair_service::load_history() called
this once per history entry, making the load O(N²) in both time and
memory.
Introduce batch_update_repair_time() which performs a single
copy-on-write for any number of entries belonging to the same table.
Restructure load_history() to collect entries into batches of up to
1000 and flush each batch in one call, keeping peak memory bounded.
The batch size limit is intentional: the repair history table currently
has no bound on the number of entries and can grow large. Note that
this does not cause a problem in the in-memory history map itself:
entries are coalesced internally and only the latest repair time is
kept per range. The unbounded entry count only makes the batched
update during load expensive.
Fixes: SCYLLADB-104
Closes scylladb/scylladb#29326
Include non-primary key restrictions (e.g. regular column filters) in
the filter JSON sent to the Vector Store service. Previously only
partition key and clustering column restrictions were forwarded, so
filtering on regular columns was silently ignored.
Add get_nonprimary_key_restrictions() getter to statement_restrictions.
Add unit tests for non-primary key equality, range, and bind marker
restrictions in filter_test.
Fixes: SCYLLADB-970
Closes scylladb/scylladb#29019
For counter updates, use a counter ID that is constructed from the
node's rack instead of the node's host ID.
A rack can have at most two active tablet replicas at a time: a single
normal tablet replica, and during tablet migration there are two active
replicas, the normal and pending replica. Therefore we can have two
unique counter IDs per rack that are reused by all replicas in the rack.
We construct the counter ID from the rack UUID, which is constructed
from the name "dc:rack". The pending replica uses a deterministic
variation of the rack's counter ID by negating it.
This improves the performance and size of counter cells by having fewer
unique counter IDs and fewer counter shards in a counter cell.
Previously the number of counter shards was the number of different
host_ids that updated the counter, which is typically the number of
nodes in the cluster and keeps growing indefinitely as nodes are
replaced. With the rack-based counter ID, the number of counter shards
will be at most twice the number of different racks (including removed
racks, which should not be significant).
Fixes SCYLLADB-356
backport not needed - an enhancement
Closes scylladb/scylladb#28901
* github.com:scylladb/scylladb:
docs/dev: add counters doc
counters: reuse counter IDs by rack
Replace move_to_shard()/move_to_host() with as_bounce()/target_shard()/
target_host() to clarify the interface after bounce was extended to
support cross-node bouncing.
- Add virtual as_bounce() returning const bounce* to the base class
(nullptr by default, overridden in bounce to return this), replacing
the virtual move_to_shard() which conflated bounce detection with
shard access
- Rename move_to_shard() -> target_shard() (now non-virtual, returns
unsigned directly) and move_to_host() -> target_host() on bounce
- Replace dynamic_pointer_cast with static_pointer_cast at call sites
that already checked as_bounce()
- Move forward declarations of message types before the virtual
methods so as_bounce() can reference bounce
Fixes: SCYLLADB-1066
Closes scylladb/scylladb#29367
Motivation
----------
Since strongly consistent tables are based on the concept of Raft
groups, operations on them can get stuck for indefinite amounts of
time. That may be problematic, and so we'd like to implement a way
to cancel those operations at suitable times.
Description of solution
-----------------------
The situations we focus on are the following:
* Timed-out queries
* Leader changes
* Tablet migrations
* Table drops
* Node shutdowns
We handle each of them and provide validation tests.
Implementation strategy
-----------------------
1. Auxiliary commits.
2. Abort operations on timeout.
3. Abort operations on tablet removal.
4. Extend `client_state`.
5. Abort operation on shutdown.
6. Help `state_machine` be aborted as soon as possible.
Tests
-----
We provide tests that validate the correctness of the solution.
The total time spent on `test_strong_consistency.py`
(measured on my local machine, dev mode):
Before:
```
real 0m31.809s
user 1m3.048s
sys 0m21.812s
```
After:
```
real 0m34.523s
user 1m10.307s
sys 0m27.223s
```
The incremental differences in time can be found in the commit messages.
Fixes SCYLLADB-429
Backport: not needed. This is an enhancement to an experimental feature.
Closes scylladb/scylladb#28526
* github.com:scylladb/scylladb:
service: strong_consistency: Abort state_machine::apply when aborting server
service: strong_consistency: Abort ongoing operations when shutting down
service: client_state: Extend with abort_source
service: strong_consistency: Handle abort when removing Raft group
service: strong_consistency: Abort Raft operations on timeout
service: strong_consistency: Use timeout when mutating
service: strong_consistency: Fix indentation
service: strong_consistency: Enclose coordinator methods with try-catch
service: strong_consistency: Crash at unexpected exception
test: cluster: Extract default config & cmdline in test_strong_consistency.py
This reverts commit 8b4a91982b.
Two commits independently added rolling_max_tracker_test to test/boost/CMakeLists.txt:
8b4a919 cmake: add missing rolling_max_tracker_test and symmetric_key_test
f3a91df test/cmake: add missing tests to boost test suite
The second was merged two days after the first. They didn't conflict at
the code level and applied cleanly, resulting in duplicate add_scylla_test()
entries that break the CMake build:
CMake Error: add_executable cannot create target
"test_boost_rolling_max_tracker_test" because another target
with the same name already exists.
Remove the duplicate.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Reported-by: Łukasz Paszkowski <lukasz.paszkowski@scylladb.com>
The PR contains more code cleanups, mostly in the gossiper. It drops more gossiper states, leaving only NORMAL and SHUTDOWN; all other states are checked against the topology state. Those two are left because the SHUTDOWN state is propagated through the gossiper only, and when the node is not in SHUTDOWN it should be in some other state.
No need to backport. Cleanups.
Closes scylladb/scylladb#29129
* https://github.com/scylladb/scylladb:
storage_service: cleanup unused code
storage_service: simplify get_peer_info_for_update
gossiper: send shutdown notifications in parallel
gms: remove unused code
virtual_tables: no need to call gossiper if we already know that the node is in shutdown
gossiper: print node state from raft topology in the logs
gossiper: use is_shutdown instead of code it manually
gossiper: mark endpoint_state(inet_address ip) constructor as explicit
gossiper: remove unused code
gossiper: drop last use of LEFT state and drop the state
gossiper: drop unused STATUS_BOOTSTRAPPING state
gossiper: rename is_dead_state to is_left since this is all that the function checks now.
gossiper: use raft topology state instead of gossiper one when checking node's state
storage_service: drop check_for_endpoint_collision function
storage_service: drop is_first_node function
gossiper: remove unused REMOVED_TOKEN state
gossiper: remove unused advertise_token_removed function