The mechanics of the restore is like this:
- A /storage_service/tablets/restore API is called with (keyspace, table, endpoint, bucket, manifests) parameters
- First, it populates the system_distributed.snapshot_sstables table with the data read from the manifests
- Then it emplaces a bunch of tablet transitions (of a new "restore" kind), one for each tablet
- The topology coordinator handles the "restore" transition by calling a new RESTORE_TABLET RPC against all the current tablet replicas
- Each replica handles the RPC verb by:
  - Reading the snapshot_sstables table
  - Filtering the read sstable infos against the current node and the tablet being handled
  - Downloading and attaching the filtered sstables

This PR includes the system_distributed.snapshot_sstables table from @robertbindar and preparation work from @kreuzerkrieg that extracts raw sstable downloading and attaching from the existing generic sstables loading code.

This is the first step towards SCYLLADB-197 and lacks many things. In particular:
- the API only works for a single-DC cluster
- the caller needs to "lock" tablet boundaries with min/max tablet count
- not abortable
- no progress tracking
- sub-optimal (re-kicking the API on restore will re-download everything again)
- not re-attachable (if the API node dies, restoration proceeds, but the caller cannot "wait" for it to complete via another node)
- nodes download sstables in the maintenance/streaming sched group (should be moved to maintenance/backup)

Other follow-up items:
- have an actual swagger object specification for `backup_location`

Closes #28436
Closes #28657
Closes #28773
Closes scylladb/scylladb#28763

* github.com:scylladb/scylladb:
  test: Add test for backup vs migration race
  test: Restore resilience test
  sstables_loader: Fail tablet-restore task if not all sstables were downloaded
  sstables_loader: mark sstables as downloaded after attaching
  sstables_loader: return shared_sstable from attach_sstable
  db: add update_sstable_download_status method
  db: add downloaded column to snapshot_sstables
  db: extract snapshot_sstables TTL into class constant
  test: Add a test for tablet-aware restore
  tablets: Implement tablet-aware cluster-wide restore
  messaging: Add RESTORE_TABLET RPC verb
  sstables_loader: Add method to download and attach sstables for a tablet
  tablets: Add restore_config to tablet_transition_info
  sstables_loader: Add restore_tablets task skeleton
  test: Add rest_client helper to kick newly introduced API endpoint
  api: Add /storage_service/tablets/restore endpoint skeleton
  sstables_loader: Add keyspace and table arguments to manfiest loading helper
  sstables_loader_helpers: just reformat the code
  sstables_loader_helpers: generalize argument and variable names
  sstables_loader_helpers: generalize get_sstables_for_tablet
  sstables_loader_helpers: add token getters for tablet filtering
  sstables_loader_helpers: remove underscores from struct members
  sstables_loader: move download_sstable and get_sstables_for_tablet
  sstables_loader: extract single-tablet SST filtering
  sstables_loader: make download_sstable static
  sstables_loader: fix formating of the new `download_sstable` function
  sstables_loader: extract single SST download into a function
  sstables_loader: add shard_id to minimal_sst_info
  sstables_loader: add function for parsing backup manifests
  split utility functions for creating test data from database_test
  export make_storage_options_config from lib/test_services
  rjson: Add helpers for conversions to dht::token and sstable_id
  Add system_distributed_keyspace.snapshot_sstables
  add get_system_distributed_keyspace to cql_test_env
  code: Add system_distributed_keyspace dependency to sstables_loader
  storage_service: Export handle_raft_rpc() helper
  storage_service: Export do_tablet_operation()
  storage_service: Split transit_tablet() into two
  tablets: Add braces around tablet_transition_kind::repair switch
Keeping sstables on S3/GS
One of the ways to use object storage is to keep sstables directly on it as objects.
Enabling the feature
Currently the object-storage backend works only if keyspace-storage-options is listed
under experimental_features in scylla.yaml, for example:
experimental_features:
- keyspace-storage-options
It can also be enabled with the --experimental-features=keyspace-storage-options
command-line option when launching scylla.
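For example, assuming the scylla binary is on your PATH, starting it with the feature enabled looks like:
scylla --experimental-features=keyspace-storage-options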
Configuring AWS S3 access
You can define endpoint details in the scylla.yaml file. For example:
object_storage_endpoints:
- name: https://s3.us-east-1.amazonaws.com:443
aws_region: us-east-1
Configuring GCP storage access
Similarly to AWS, define endpoint details in scylla.yaml like:
object_storage_endpoints:
- name: https://storage.googleapis.com
type: gs
credentials_file: <gcp account credentials json file>
Typically, Google Cloud Storage uses the same endpoint URI everywhere (unless using a private proxy or mock
server), so name can also be set to the default moniker.
credentials_file can be omitted, in which case the default credentials on the machine
are used, i.e. the current user's credentials are resolved, with a fallback to machine credentials
if running on a GCP instance.
The environment variable GOOGLE_APPLICATION_CREDENTIALS can also be set to point to a
credentials file.
If no credentials file is set, the default credentials are searched for, i.e. application_default_credentials.json
in the local GCP data folder.
You can also set credentials_file to none to completely skip authentication. This is useful for testing
against mock servers.
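For example (the path below is a placeholder for your own service-account JSON file):
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json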
Local/Development Environment
In a local or development environment, you usually need to set AWS authentication tokens in environment variables to ensure the client works properly. For instance:
export AWS_ACCESS_KEY_ID=EXAMPLE_ACCESS_KEY_ID
export AWS_SECRET_ACCESS_KEY=EXAMPLE_SECRET_ACCESS_KEY
Additionally, you may include an aws_session_token, although this is not typically necessary for local or development environments:
export AWS_ACCESS_KEY_ID=EXAMPLE_ACCESS_KEY_ID
export AWS_SECRET_ACCESS_KEY=EXAMPLE_SECRET_ACCESS_KEY
export AWS_SESSION_TOKEN=EXAMPLE_TEMPORARY_SESSION_TOKEN
For GS, when using a local mock server, authentication is normally not needed.
Important Note
The examples above are intended for development or local environments. You should never use this approach in production. The Scylla S3 client will first attempt to access credentials from environment variables. If it fails to obtain credentials, it will then try to retrieve them from the AWS Security Token Service (STS) or the EC2 Instance Metadata Service.
For the EC2 Instance Metadata Service to function correctly, no additional configuration is required. However, STS requires the IAM Role ARN to be defined in the scylla.yaml file, as shown below:
object_storage_endpoints:
- name: https://s3.us-east-1.amazonaws.com:443
aws_region: us-east-1
iam_role_arn: arn:aws:iam::123456789012:instance-profile/my-instance-instance-profile
Creating keyspace with S3
The sstables location is keyspace-scoped. In order to create a keyspace with S3
storage, use CREATE KEYSPACE with STORAGE = { 'type': 'S3', 'endpoint': '$endpoint_name', 'bucket': '$bucket' }
parameters, where $endpoint_name must match the name of the corresponding
endpoint configured in the YAML file above.
In the following example, an endpoint named "s3.us-east-2.amazonaws.com" is
defined in scylla.yaml, and this endpoint is used when creating the
keyspace "ks".
in scylla.yaml:
object_storage_endpoints:
- name: https://s3.us-east-2.amazonaws.com:443
aws_region: us-east-2
and when creating the keyspace:
CREATE KEYSPACE ks
WITH REPLICATION = {
'class' : 'NetworkTopologyStrategy',
'replication_factor' : 1
}
AND STORAGE = {
'type' : 'S3',
'endpoint' : 's3.us-east-2.amazonaws.com',
'bucket' : 'bucket-for-testing'
};
Creating keyspace with GS
This mirrors the AWS S3 configuration.
in scylla.yaml:
object_storage_endpoints:
- name: default
credentials_file: <credentials file>|none
and when creating the keyspace:
CREATE KEYSPACE ks
WITH REPLICATION = {
'class' : 'NetworkTopologyStrategy',
'replication_factor' : 1
}
AND STORAGE = {
'type' : 'GS',
'endpoint' : 'default',
'bucket' : 'bucket-for-testing'
};
Copying sstables on S3/GS (backup)
It's possible to upload sstables from the data/ directory to S3 via the API. This is preferable to copying them externally, because all the resources needed for the operation (disk I/O bandwidth and IOPS, CPU time, network bandwidth) remain under Seastar's control, so the regular Scylla workload is not randomly affected.
The API endpoint name is /storage_service/backup and its Swagger description can be
found here. Accepted parameters are
- keyspace: the keyspace to copy sstables from
- table: the table to copy sstables from
- snapshot: the snapshot name to copy sstables from
- endpoint: the key in the object storage configuration file. Can be either an AWS or GCP endpoint
- bucket: bucket name to put sstables' files in
- prefix: prefix to put sstables' files under
Currently only snapshot backup is possible, so one first needs to take a snapshot.
All tables in a keyspace are uploaded; the destination object names will look like
s3://bucket/some/prefix/to/store/data/.../sstable
or
gs://bucket/some/prefix/to/store/data/.../sstable
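As a rough sketch of the whole flow, assuming the Scylla REST API listens on the default 127.0.0.1:10000 and using placeholder keyspace, table, snapshot, endpoint, bucket, and prefix names, a backup can be kicked off like this:
nodetool snapshot -t backup-tag ks
curl -X POST "http://127.0.0.1:10000/storage_service/backup?keyspace=ks&table=t&snapshot=backup-tag&endpoint=s3.us-east-2.amazonaws.com&bucket=bucket-for-testing&prefix=some/prefix"
The call is asynchronous; see the Swagger description mentioned above for the exact request and response details.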
System tables
There are a few system tables that object storage related code needs to touch in order to operate.
- system_distributed.snapshot_sstables - Used during restore by worker nodes to get the list of SSTables that need to be downloaded from object storage and restored locally.
- system.sstables - Used to keep track of SSTables on object storage when a keyspace is created with object storage storage_options.
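For a quick look at what these tables contain during debugging, something like the following can be used (a sketch; the exact columns differ between versions):
cqlsh -e "SELECT * FROM system_distributed.snapshot_sstables;"
cqlsh -e "SELECT * FROM system.sstables;"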
Manipulating S3 data
This section intends to give an overview of where, when and how we store data in S3 and provide a quick set of commands
which help gain local access to the data in case there is a need for manual intervention.
Most of the time it won't be necessary to touch the data on S3 directly: there are transparent REST APIs and Scylla Manager
commands for backup and restore, and Scylla can operate normally with S3 storage configured via the
CREATE KEYSPACE CQL statement documented in the ScyllaDB CQL extensions documentation.
However, if for some reason the SSTables become corrupted and need an offline scrub before re-uploading
or if a bug investigation leads to the need to analyze the backup data, follow the information below to access
that data.
Issue tracking the document here.
Object Storage Layout
There are currently three mechanisms in Scylla which write data to S3/GS:
- Scylla Manager backup
When performing a backup with sctool, a backup prefix is created within the bucket passed as argument and
under that prefix, Scylla Manager stores all the backup data of all the backup tasks organized by cluster name,
datacenter, keyspace, etc.
Follow the Specification page in the Scylla Manager documentation for the exact layout
under the backup prefix.
/storage_service/backup REST API
When using the /storage_service/backup REST API, the data is stored under the prefix passed as argument to the API.
The structure under this prefix is identical to what you’d find in the typical Scylla snapshot.
There is a manifest file which contains the list of Data files for each SSTable, the schema file, and all the SSTable
components, stored flat under the prefix.
scylla-bucket/prefix/
│
├── manifest.json
├── schema.cql
|
├── me-3gqe_1lnj_4sbpc2ezoscu9hhtor-big-Data.db
├── me-3gqe_1lnj_4sbpc2ezoscu9hhtor-big-Index.db
├── me-3gqe_1lnj_4sbpc2ezoscu9hhtor-big-Summary.db
├── ...
│
├── ma-1abx_k29m_9fyug3sdtjwj8krpqh-big-Data.db
├── ma-1abx_k29m_9fyug3sdtjwj8krpqh-big-Index.db
├── ma-1abx_k29m_9fyug3sdtjwj8krpqh-big-Summary.db
├── ...
│
└── ... (more SSTable components)
See the API documentation for more details about the actual backup request.
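To see this layout for an existing backup, the prefix can simply be listed (the bucket and prefix names below are placeholders):
aws s3 ls s3://scylla-bucket/prefix/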
The snapshot manifest
Each table snapshot directory contains a manifest.json file that lists the contents of the snapshot and some metadata. The json structure is as follows:
{
"manifest": {
"version": "1.0",
"scope": "node"
},
"node": {
"host_id": "<UUID>",
"datacenter": "mydc",
"rack": "myrack"
},
"snapshot": {
"name": "snapshot name",
"created_at": seconds_since_epoch,
"expires_at": seconds_since_epoch | null,
},
"table": {
"keyspace_name": "my_keyspace",
"table_name": "my_table",
"table_id": "<UUID>",
"tablets_type": "none|powof2",
"tablet_count": N
},
"sstables": [
{
"id": "67e35000-d8c6-11f0-9599-060de9f3bd1b",
"toc_name": "me-3gw7_0ndy_3wlq829wcsddgwha1n-big-TOC.txt",
"data_size": 75,
"index_size": 8,
"first_token": -8629266958227979430,
"last_token": 9168982884335614769,
},
{
"id": "67e35000-d8c6-11f0-85dc-0625e9f3bd1b",
"toc_name": "me-3gw7_0ndy_3wlq821a6cqlbmxrtn-big-TOC.txt",
"data_size": 73,
"index_size": 8,
"first_token": 221146791717891383,
"last_token": 7354559975791427036,
},
...
],
"files": [ ... ]
}
The `manifest` member contains the following attributes:
- `version` - representing the version of the manifest itself. It is incremented when members are added to or removed from the manifest.
- `scope` - the scope of metadata stored in this manifest file. The following scopes are supported:
- `node` - the manifest describes all SSTables owned by this node in this snapshot.
The `node` member contains metadata about this node that enables datacenter- or rack-aware restore.
- `host_id` - is the node's unique host_id (a UUID).
- `datacenter` - is the node's datacenter.
- `rack` - is the node's rack.
The `snapshot` member contains metadata about the snapshot.
- `name` - is the snapshot name (a.k.a. `tag`)
- `created_at` - is the time when the snapshot was created.
- `expires_at` - is an optional time when the snapshot expires and can be dropped, if a TTL was set for the snapshot. If there is no TTL, `expires_at` may be omitted, set to null, or set to 0.
The `table` member contains metadata about the table being snapshotted.
- `keyspace_name` and `table_name` - are self-explanatory.
- `table_id` - a universally unique identifier (UUID) of the table set when the table is created.
- `tablets_type`:
- `none` - if the keyspace uses vnodes replication
- `powof2` - if the keyspace uses tablets replication, and the tablet token ranges are based on powers of 2.
- `arbitrary` - if the keyspace uses tablets replication, and the tablet token ranges and count can be arbitrary.
- `tablet_count` - Optional. If `tablets_type` is not `none`, contains the number of tablets allocated in the table. If `tablets_type` is `powof2`, `tablet_count` would be a power of 2.
The `sstables` member is a list containing metadata about the SSTables in the snapshot.
- `id` - is the SSTable's unique id (a UUID). It is carried over with the SSTable when it's streamed as part of tablet migration, even if it gets a new generation.
- `toc_name` - is the name of the SSTable Table Of Contents (TOC) component.
- `data_size` and `index_size` - are the sizes of the SSTable's data and index components, respectively. They can be used to estimate how much disk space is needed for restore.
- `first_token` and `last_token` - are the first and last tokens in the SSTable, respectively. They can be used to determine whether an SSTable is fully contained in a (tablet) token range, to enable efficient file-based streaming of the SSTable.
The optional `files` member may contain a list of non-SSTable files included in the snapshot directory, not including the manifest.json file and schema.cql.
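For example, to list the SSTables recorded in a backup manifest together with their token ranges (a sketch that assumes jq is installed; the bucket and prefix names are placeholders):
aws s3 cp s3://scylla-bucket/prefix/manifest.json - | jq -r '.sstables[] | "\(.toc_name) \(.first_token) \(.last_token)"'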
CREATE KEYSPACE with S3/GS storage
When creating a keyspace with S3/GS storage, the data is stored under the bucket passed as argument to the CREATE KEYSPACE statement.
Once the statement is issued, Scylla will transparently use the S3/GS bucket as the location of the SSTables for that keyspace.
Like in the case above, there is no deep hierarchy for the data; all SSTable components are stored directly within the bucket, grouped per SSTable as shown below.
scylla-sstables-bucket/
│
├── 3gqe_1lnj_4sbpc2ezoscu9hhtor/
│ ├── Data.db
│ ├── Index.db
│ ├── Summary.db
│ └── ...
│
├── 1abx_k29m_9fyug3sdtjwj8krpqh/
│ ├── Data.db
│ ├── Index.db
│ ├── Summary.db
│ └── ...
│
└── ... (other SSTable folders)
Downloading, deleting, uploading SSTables
To manually manage sstables on S3, AWS CLI commands can be used. First, make sure awscli is
installed (see the installation guide) and that the proper credentials are set up so that you can access ScyllaDB S3 buckets.
Please make sure your ~/.aws/credentials file points to a valid set of S3 credentials:
either refresh the credentials if you use an OKTA-based fetching tool, or make sure they point to a valid IAM user with S3 access.
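If you have a plain access key/secret pair, the quickest way to set this up is to run aws configure, which prompts for the key, secret, and default region and writes them to ~/.aws/credentials:
aws configure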
Provided all the prerequisites above are fulfilled and you're able to run
aws s3 ls s3://your-bucket/
and see something (or at least not see an error if the bucket is empty), you're all set for the next commands.
NOTE: Please refer to the sections above for the prefix layout of each S3 use case.
Downloading SSTables
Fetching the SSTables of your backup can easily be done,
e.g. by copying each individual component:
aws s3 cp s3://your-bucket/path/to/sstable/me-3gqb_1izi_0pxn421yzymfw5c8zf-big-Data.db /local/path/to/sstable/component
or downloading an entire sstable using globs
aws s3 cp s3://your-bucket/path-to-sstables/ /local/path/for/sstables --exclude "*" --include 'some-sstable-generation-big-*' --recursive
Deleting SSTables
Delete components individually
aws s3 rm s3://your-bucket/path/to/sstable/me-3gqb_1izi_0pxn421yzymfw5c8zf-big-Data.db
or the entire SSTable using globs
aws s3 rm s3://your-bucket/path-to-sstables/ --exclude "*" --include 'some-sstable-generation-big-*' --recursive
Uploading SSTables
Upload components individually
aws s3 cp /local/path/to/sstable/me-3gqb_1izi_0pxn421yzymfw5c8zf-big-Data.db s3://your-bucket/path/to/sstable/component
or the entire SSTable using globs
aws s3 cp /local/path/for/sstables s3://your-bucket/path-to-sstables/ --exclude "*" --include 'some-sstable-generation-big-*' --recursive
Metadata touchups
In the case of Scylla Manager backups, if manual scrubbing is needed and SSTables will be re-uploaded,
multiple things need to be changed; the same applies if you need to drop some SSTables altogether.
As you might’ve seen in the Scylla Manager Specification Docs, we keep a JSON manifest per node
and that manifest file contains lots of SSTable-dependent information:
- list of SSTables per table owned by node
- total size of SSTables in the chunk of table owned
- total size of all chunks of tables owned
- the list of tokens owned by the node
As the field names suggest, all of the information in the list above depends on the SSTable contents, so any attempt
to locally fix a corrupt SSTable and re-upload it will most probably force you to update these fields in the node's manifest file.
There is a high likelihood that a scrubbed SSTable results in different values for all the fields specified above.
For the /storage_service/backup REST API, in theory only removing an entire SSTable from the backup requires changing
the manifest file (removing the corresponding SSTable entry); in all other cases, no metadata changes are needed.
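As a sketch of such a manifest edit (assuming jq is installed; the bucket, prefix, and TOC names below are placeholders), the corresponding entry can be dropped and the manifest re-uploaded like this:
aws s3 cp s3://your-bucket/path-to-sstables/manifest.json manifest.json
jq 'del(.sstables[] | select(.toc_name == "me-3gqb_1izi_0pxn421yzymfw5c8zf-big-TOC.txt"))' manifest.json > manifest.fixed.json
aws s3 cp manifest.fixed.json s3://your-bucket/path-to-sstables/manifest.json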
For the CREATE KEYSPACE on S3, there is no need to update any metadata as we currently don’t have any.
NOTE: Re-uploading a scrubbed SSTable means re-uploading all of its components, as most of them have likely changed.