To configure S3 storage, one needs to do
```
object_storage_endpoints:
- name: s3.us-east-1.amazonaws.com
port: 443
https: true
aws_region: us-east-1
```
and for GCS it's
```
object_storage_endpoints:
- name: https://storage.googleapis.com:443
type: gs
credentials_file: <gcp account credentials json file>
```
This PR updates the S3 part to look like
```
object_storage_endpoints:
- name: https://s3.us-east-1.amazonaws.com:443
aws_region: us-east-1
```
fixes: #26570
This is the second attempt; the previous one (#27360) was reverted because it always reported endpoint configs in the new format via API and CQL, even if the endpoint was configured in the old way. This "broke" Scylla Manager and some dtests. This version has that bug fixed: endpoints are reported in the same format they were configured with.
About the correctness of the changes.
No modifications to existing tests are made here, so the old format is respected correctly (as far as it's covered by tests). To prove the new format works, the test_get_object_store_endpoints test is extended to validate both options. Some preparations for this test come on their own with PR #28111, to show that they are valid and pass before the core code changes.
Enhancing the way configuration is made; likely no need to backport.
Closes scylladb/scylladb#28112
* github.com:scylladb/scylladb:
test: Validate S3 endpoints new format works
docs: Update docs according to new endpoints config option format
object_storage: Create s3 client with "extended" endpoint name
s3/storage: Tune config updating
sstable: Shuffle args for s3_client_wrapper
test: Rename badconf variable into objconf
test: Split the object_store/test_get_object_store_endpoints test
For this, add the s3::client::make(endpoint, ...) overload that accepts an
endpoint in proto://host:port format. It parses the provided URL
and calls the legacy overload, which accepts a raw host string and a config with
port, https bit, etc.
The generic object_storage_endpoint_param no longer needs to carry the
internal s3::endpoint_config; the config option parsing changes
accordingly.
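For illustration, here is a minimal standalone sketch of the parsing such an overload has to do (hypothetical names; the real overload is s3::client::make and it feeds the legacy factory):
```
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical result type mirroring the legacy parameters: a raw host
// plus the bits that used to live in s3::endpoint_config.
struct parsed_endpoint {
    std::string host;
    uint16_t port;
    bool https;
};

// Parse "proto://host:port" into the pieces the legacy factory expects.
parsed_endpoint parse_endpoint(const std::string& url) {
    auto proto_end = url.find("://");
    if (proto_end == std::string::npos) {
        throw std::invalid_argument("endpoint must look like proto://host:port");
    }
    auto proto = url.substr(0, proto_end);
    auto rest = url.substr(proto_end + 3);
    auto colon = rest.rfind(':');
    if (colon == std::string::npos) {
        throw std::invalid_argument("endpoint must carry an explicit port");
    }
    return parsed_endpoint{
        .host = rest.substr(0, colon),
        .port = static_cast<uint16_t>(std::stoi(rest.substr(colon + 1))),
        .https = proto == "https",
    };
}
```
E.g. parse_endpoint("https://s3.us-east-1.amazonaws.com:443") yields host s3.us-east-1.amazonaws.com, port 443 and https=true, matching the new config example above.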
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Don't prepare s3::endpoint_config from generic code, just pass the region
and iam_role_arn (those that can potentially change) to the callback.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Make it construct like gs_client_wrapper -- with a generic endpoint param
reference -- and do the storage-specific casts/gets/whatever internally.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
If the read is aborted via the permit (due to timeout), don't throw the
abort exception; instead, propagate it via the future chain.
Also, use try_catch<> instead of try ... catch to decorate
malformed_sstable_exception with the file name.
This reverts commit 1bb897c7ca, reversing
changes made to 954f2cbd2f. It makes
incompatible changes to the object storage configuration format, breaking
tests [1]. It's likely that it doesn't break any production configuration,
but we can't be sure.
Fixes #27966
Closes scylladb/scylladb#27969
The storage::snapshot() is used in two different modes -- one to save an sstable as a snapshot somewhere, and another to create a copy of an sstable. The latter use-case is "optimized" by snapshotting an sstable under a new generation, but that only holds for local storage. Although snapshot is not implemented for S3 storage, _cloning_ an sstable stored on S3 is not necessarily the same as taking a snapshot.
Another sign that snapshot and clone are different is that calling snapshot() for a snapshot proper and for a clone uses two very different sets of arguments -- snapshotting specifies a relative name and omits the new generation, while cloning doesn't need a "name" and instead provides a generation. Recently (#26528) cloning got an extra "leave_unsealed" tag, which makes no sense for snapshotting.
Having said that, this PR introduces the sstables::storage::clone() method and modifies both callers and implementations according to the above features of each. As a result, the code logic in both methods becomes much simpler, and a bunch of bool classes and "_tag" helper structures goes away.
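To make the contrast concrete, a hedged sketch of how the two entry points can be shaped after the split (illustrative stand-in types, not the actual Scylla declarations):
```
#include <string>

// Illustrative stand-ins; the real types are sstables::sstable and
// sstables::generation_type.
struct sstable {};
using generation_type = long;

struct storage {
    // Snapshot: caller provides a name relative to the base dir; the
    // sstable keeps its generation and the copy is always sealed.
    void snapshot(const sstable& sst, const std::string& relative_name);

    // Clone: caller provides the new generation and may leave the copy
    // unsealed; no destination name is involved.
    void clone(const sstable& sst, generation_type new_gen, bool leave_unsealed);
};
```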
Improving internal APIs, no need to backport
Closes scylladb/scylladb#27871
* github.com:scylladb/scylladb:
sstables, storage: Drop unused bool classes and tags
sstables/storage: Drop create_links_common() overloads
sstable: Simplify storage::snapshot()
sstables: Introduce storage::clone()
There's a seastar helper that does the same; no need to carry yet another
implementation.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#27851
There's a bunch of tagged create_links_common() overloads that call the
most generic one with properly crafted arguments and the link_mode.
Callers of those one-liners can craft the args themselves.
As a result, there's only one create_links_common() overload, and callers
explicitly specify what they want from it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now there are only two callers left -- sstable::snapshot() and
sstable::seal(), which wants to auto-backup the sealed sstable.
The snapshot arguments are:
- relative path, use _base_dir
- no new generation provided
- no leave-unsealed tag
With that, the implementation of filesystem_storage::snapshot() is as
simple as
- prepare full path relative to _base_dir
- touch new directory
- call create_links_common()
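A minimal sketch of those three steps, with std::filesystem standing in for the real seastar-based code and create_links_common() (names hypothetical):
```
#include <filesystem>
#include <vector>

namespace fs = std::filesystem;

// Sketch of the three steps above, using std::filesystem stand-ins.
void snapshot_sketch(const fs::path& base_dir, const fs::path& relative_name,
                     const std::vector<fs::path>& components) {
    fs::path dst = base_dir / relative_name;         // 1. full path under _base_dir
    fs::create_directories(dst);                     // 2. touch the new directory
    for (const auto& f : components) {               // 3. create_links_common():
        fs::create_hard_link(f, dst / f.filename()); //    hard-link each component
    }
}
```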
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
And call it from sstable::clone() instead of storage::snapshot().
The clone arguments are:
- target directory is storage::prefix(), that's _dir itself
- new generation is always provided, no need for optional
- leave_unsealed bool flag
With that, the implementation of filesystem_storage::clone() is as
simple as calling create_links_common(), forwarding the args and _dir to it. The
unification of the leave_unsealed branches will come a bit later, making this
code even shorter.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method in question knows that it writes snapshots to the local filesystem and actively uses this knowledge. This PR relaxes it and splits the logic into two parts -- one that orchestrates the sstables snapshot and collects the necessary metadata, and one that writes the metadata itself.
Closes scylladb/scylladb#27762
* github.com:scylladb/scylladb:
table: Move snapshot_file_set to table.cc
table: Rename and move snapshot_on_all_shards() method
table: Ditch jsondir variable
table, sstables: Pass snapshot name to sstable::snapshot()
table: Use snapshot_writer in write_manifest()
table: Use snapshot_writer in write_schema_as_cql()
table: Add snapshot_writer::sync()
table: Add snapshot_writer::init()
table: Introduce snapshot_writer
table: Move final sync and rename seal_snapshot()
table: Hide write_schema_as_cql()
table: Hide table::seal_snapshot()
table: Open-code finalize_snapshot()
table: Fix indentation after previous patch
table: Use smp::invoke_on_all() to populate the vector with filenames
table: Don't touch dir once more on seal_snapshot()
table: Open-code table::take_snapshot() into caller lambda
table: Move parts of table::take_snapshot to sstables_manager
table: Introduce table::take_snapshot()
table: Store the result of smp::submit_to in local variable
The db::config::object_storage_endpoints parameter is live-updateable, but when the update really happens, the new endpoints may fail to propagate to non-zero shards because of the way db::config sharding is implemented.
Refs: #7316
Fixes: #26509
Backport to 2025.3 and 2025.4; AFAIK there are setups with object storage configs for native backup.
Closes scylladb/scylladb#27689
* github.com:scylladb/scylladb:
sstables/storage_manager: Fix configured endpoints observer
test/object_store: Add test to validate how endpoint config update works
To configure S3 storage, one needs to do
```
object_storage_endpoints:
- name: s3.us-east-1.amazonaws.com
port: 443
https: true
aws_region: us-east-1
```
and for GCS it's
```
object_storage_endpoints:
- name: https://storage.googleapis.com:443
type: gs
credentials_file: <gcp account credentials json file>
```
This PR updates the S3 part to look like
```
object_storage_endpoints:
- name: https://s3.us-east-1.amazonaws.com:443
aws_region: us-east-1
```
fixes: #26570
Not-yet-released feature, no need to backport. Old configs are not accepted any longer; if that's needed, this decision will have to be revised.
Closes scylladb/scylladb#27360
* github.com:scylladb/scylladb:
object_storage: Temporarily handle pure endpoint addresses as endpoints
code: Remove dangling mentions of s3::endpoint_config
docs: Update docs according to new endpoints config option format
object_storage: Create s3 client with "extended" endpoint name
test: Add named constants for test_get_object_store_endpoints endpoint names
s3/storage: Tune config updating
sstable: Shuffle args for s3_client_wrapper
On start the manager creates an observer for the object_storage_endpoints
config parameter. The goal is to refresh the maintained set of endpoint
parameters and the client upon config change. The observer is created on
shard 0 only, and when kicked it calls manager.invoke-on-all to update
the manager on all shards.
However, there's a race here. The thing is that db::config values are
implicitly "sharded" under the hood with the help of a plain array. When
any code tries to read a value from db::config::something, the reading
code secretly gets the value from this inner array indexed by the
current shard id.
Next, when the config is updated, it first assigns the new value to the [0]
element of the hidden array, then calls the broadcast_to_all_shards() helper
that copies the values from the zeroth slot to all the others. But the
manager's observer is triggered when the new value is assigned at index
zero, and if the invoke-on-all lambda (mentioned above) happens to be
faster than broadcast_to_all_shards(), the non-zero shards will read old
values from db::config's inner array.
The fix is to instantiate the observer on all shards and have each update
only its local shard whenever the update is triggered.
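A toy model of the race and the fix (plain C++ with hypothetical structure; the real machinery is db::config's hidden per-shard array plus seastar's invoke_on_all):
```
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Toy model: db::config keeps one slot per shard; a reader on shard N
// secretly reads slot[N].
struct sharded_config_value {
    std::vector<std::string> slot;                    // one entry per shard
    std::vector<std::function<void()>> per_shard_obs; // the fix: observer per shard

    explicit sharded_config_value(std::size_t shards)
        : slot(shards), per_shard_obs(shards) {}

    void update(const std::string& v) {
        slot[0] = v;
        // Old scheme: a shard-0 observer fired HERE and fanned out via
        // invoke_on_all; a shard-N lambda could read slot[N] before the
        // broadcast below reached it, and see the stale value.
        for (std::size_t i = 1; i < slot.size(); i++) {
            slot[i] = v;                              // broadcast_to_all_shards()
        }
        // New scheme: each shard's observer runs only once its own slot
        // already holds the new value, and touches only local state.
        for (auto& obs : per_shard_obs) {
            if (obs) {
                obs();
            }
        }
    }
};
```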
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently sstable::snapshot() is called with the directory name where to put
snapshots. This patch changes it to accept a snapshot name instead.
This makes the table-sstable API unaware of the snapshot destination
storage type.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Move the loop over the vector of sstables that calls sstable->snapshot()
into the sstables manager.
This makes it symmetric with sstables_manager::delete_atomically() and
allows for future changes.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Split prepare can run concurrently with repair.
Consider this:
1) split prepare starts
2) incremental repair starts
3) split prepare finishes
4) incremental repair produces unsplit sstable
5) split is not happening on sstable produced by repair
5.1) that sstable is not marked as repaired yet
5.2) might belong to repairing set (has compaction disabled)
6) split executes
7) repairing or repaired set has unsplit sstable
If split was acked to the coordinator (meaning the prepare phase finished),
repair must make sure that all sstables produced by it are split.
That's not happening today with incremental repair, because it disables
split on sstables belonging to the repairing group. And there's a window
where sstables produced by repair belong to that group.
To solve the problem, we want the invariant that all sealed sstables
are split.
To achieve this, streaming consumers are patched to produce unsealed
sstables, and the new variant add_new_sstable_and_update_cache() will
take care of splitting the sstable while it's unsealed.
If no split is needed, the new sstable is sealed and attached.
This solution was also needed to interact nicely with out-of-space
prevention. If disk usage is critical, split must not happen on
restart, and the aforementioned invariant allows for that, since any
unsplit sstable left unsealed will be discarded on restart.
The streaming consumer will fail if disk usage is critical, too.
The reason the interposer consumer doesn't fully solve the problem is
that incremental repair can start before split, and the sstable
being produced when the split decision was emitted must be split before
being attached. So we need a solution which covers both scenarios.
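A hedged sketch of the resulting flow, with stub types named after the description above (the real code is asynchronous and lives in replica::table):
```
// Minimal stand-ins; the real types are replica::table and sstables::sstable.
struct sstable_stub { bool sealed = false; };

struct table_stub {
    bool needs_split(const sstable_stub&) const { return false; } // placeholder policy
    void split_while_unsealed(sstable_stub&) {}
    void attach(sstable_stub&) {}
};

// Seal only after any required split, so a sealed-but-unsplit sstable can
// never be observed -- and an unsealed one is discarded on restart anyway.
void add_new_sstable_and_update_cache_sketch(table_stub& t, sstable_stub& sst) {
    if (t.needs_split(sst)) {
        t.split_while_unsealed(sst); // unsplit + unsealed: safe to drop on restart
    }
    sst.sealed = true;               // invariant: sealed implies split
    t.attach(sst);                   // make it visible to reads and compaction
}
```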
Fixes #26041.
Fixes #27414.
Should be backported to 2025.4, which contains incremental repair.
Closes scylladb/scylladb#26528
* github.com:scylladb/scylladb:
test: Add reproducer for split vs intra-node migration race
test: Verify split failure on behalf of repair during critical disk utilization
test: boost: Add failure_when_adding_new_sstable_test
test: Add reproducer for split vs incremental repair race condition
compaction: Fail split of new sstable if manager is disabled
replica: Don't split in do_add_sstable_and_update_cache()
streaming: Leave sstables unsealed until attached to the table
replica: Wire add_new_sstables_and_update_cache() into intra-node streaming
replica: Wire add_new_sstable_and_update_cache() into file streaming consumer
replica: Wire add_new_sstable_and_update_cache() into streaming consumer
replica: Document old add_sstable_and_update_cache() variants
replica: Introduce add_new_sstables_and_update_cache()
replica: Introduce add_new_sstable_and_update_cache()
replica: Account for sstables being added before ACKing split
replica: Remove repair read lock from maybe_split_new_sstable()
compaction: Preserve state of input sstable in maybe_split_new_sstable()
Rename maybe_split_sstable() to maybe_split_new_sstable()
sstables: Allow storage::snapshot() to leave destination sstable unsealed
sstables: Add option to leave sstable unsealed in the stream sink
test: Verify unsealed sstable can be compacted
sstables: Allow unsealed sstable to be loaded
sstables: Restore sstable_writer_config::leave_unsealed
We currently have races, e.g. between moving an sstable from staging
using change_state, or taking a snapshot, and
rewrite_statistics, which replaces one of the sstable component files
when called, for example, from update_repaired_at by incremental repair.
Use a semaphore as a mutex to serialize those functions.
Note that there is no need for an rwlock, since the operations
are rare and read-only operations like snapshot don't
need to run in parallel.
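A minimal sketch of the pattern, assuming seastar's semaphore and get_units(); the class and method names here are illustrative:
```
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <seastar/core/semaphore.hh>

class sstable_ops_sketch {
    // A count of 1 turns the semaphore into a mutex; no rwlock needed
    // since the serialized operations are rare.
    seastar::semaphore _op_sem{1};

public:
    seastar::future<> rewrite_statistics() {
        // Units are held until the coroutine returns.
        auto units = co_await seastar::get_units(_op_sem, 1);
        // ... replace the Statistics component file; change_state() and
        // snapshot() acquire the same semaphore, so they cannot interleave.
    }
};
```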
Fixes #25919
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This reverts commit 866c96f536, reversing
changes made to 367633270a.
This change caused all longevities to fail, with a crash in parsing
scylla-metadata. The investigation is still ongoing, with no quick fix
in sight yet.
Fixes: #27496
Closes scylladb/scylladb#27518
That will be needed for file streaming to leave the output sstable unsealed.
We want the invariant that all sealed sstables are split after split
was ACKed.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This is crucial for splitting before sealing the sstable produced by
repair. This way, unsplit sstables won't be left sealed on disk.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
File streaming will have to load an unsealed sstable, so we need
to be able to parse components from temporary TOC instead.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This option was retired in commit 0959739216, but
it will again be needed in order to implement split before sealing.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This reverts commit 8192f45e84.
The merge exposed a bug where truncate (via drop) fails and causes Raft
errors, leading to schema inconsistencies across nodes. This results in
test_table_drop_with_auto_snapshot failures with 'Keyspace test does not exist'
errors.
The specific problematic change was in commit 19b6207f, which modified
truncate_table_on_all_shards to set use_sstable_identifier = true. This
causes exceptions during truncate that are not properly handled, leading
to the Raft applier fiber stopping and nodes losing schema synchronization.
To keep backward compatibility, support:
- old configs -- where the endpoint is just an address and the port is separate.
When this happens, format the "new" endpoint name
- lookup by address only. If this happens, scan all endpoints and see if
any one matches the provided address; both rules are sketched below
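A hedged sketch of the two rules (standalone C++, hypothetical names):
```
#include <cstdint>
#include <string>
#include <vector>

// Old-style config: address plus separate port/https -> "proto://host:port".
std::string make_extended_name(const std::string& host, uint16_t port, bool https) {
    return (https ? std::string("https://") : std::string("http://"))
            + host + ":" + std::to_string(port);
}

struct endpoint_entry { std::string name; /* proto://host:port */ };

// Address-only lookup: scan configured endpoints for one whose host part
// matches the provided address.
const endpoint_entry* find_by_address(const std::vector<endpoint_entry>& eps,
                                      const std::string& address) {
    for (const auto& e : eps) {
        if (e.name.find("://" + address + ":") != std::string::npos) {
            return &e;
        }
    }
    return nullptr;
}
```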
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
For this, add the s3::client::make(endpoint, ...) overload that accepts an
endpoint in proto://host:port format. It parses the provided URL
and calls the legacy overload, which accepts a raw host string and a config with
port, https bit, etc.
The generic object_storage_endpoint_param no longer needs to carry the
internal s3::endpoint_config; the config option parsing changes
accordingly.
Tests that generate the config files, and the docs, are updated.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Don't prepare s3::endpoint_config from generic code, just pass the region
and iam_role_arn (those that can potentially change) to the callback.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Make it construct like gs_client_wrapper -- with a generic endpoint param
reference -- and do the storage-specific casts/gets/whatever internally.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Tablet migration transfers sstable files without changing the origin
host-id. As it should, because those sstables were not written on the
destination host, and should be ignored by commit log replay.
So it's a normal situation, and it's confusing to see this warning in
the logs.
Fixes #26957
Closes scylladb/scylladb#27433
The distributed_loader::get_sstables_from_object_store() method accepts an endpoint parameter and internally wants the storage type for that endpoint (s3 or gcs). This is needed to construct a storage_options object to create an sstable object.
To get the type, the method scans the db::config option, but there's a much simpler way to get it.
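For illustration, the simpler lookup can be as small as this (hypothetical member layout; the real method is the storage_manager::get_endpoint_type() introduced in this PR):
```
#include <map>
#include <stdexcept>
#include <string>

enum class endpoint_type { s3, gcs };

struct storage_manager_sketch {
    // The manager already maintains the configured endpoints, so the type
    // can come straight from that map instead of re-scanning db::config.
    std::map<std::string, endpoint_type> _endpoints;

    endpoint_type get_endpoint_type(const std::string& endpoint) const {
        auto it = _endpoints.find(endpoint);
        if (it == _endpoints.end()) {
            throw std::invalid_argument("unknown object-storage endpoint: " + endpoint);
        }
        return it->second;
    }
};
```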
Code cleanup, no need to backport
Closes scylladb/scylladb#27381
* github.com:scylladb/scylladb:
sstables_loader: Provide endpoint type for get_sstables_from_object_store()
storage_manager: Introduce get_endpoint_type() method
storage_manager: Split get_endpoint_client()
This change adds a new option to the REST API and, correspondingly, to scylla nodetool: use_sstable_identifier.
When set, we use the sstable identifier, if available, to name each sstable in the snapshots directory
and the manifest.json file, rather than using the sstable generation.
This can be used by the user (e.g. Scylla Manager) for global deduplication with tablets, where an sstable
may be migrated across shards or across nodes; in this case, its generation may change, but its
sstable identifier remains stable.
Currently, Scylla Manager uses the sstable generation to detect sstables that are already backed up to
object storage and exist in previously backed-up snapshots.
Historically, the sstable generation was guaranteed to be unique only per table per node,
so the dedup code currently checks for duplicates in the node scope.
However, with tablet migration, sstables are renamed when migrated to a different shard,
i.e. their generation changes, and they may be renamed when migrated to another node;
but even if they are not, the dedup logic still assumes uniqueness only within a node.
To address both cases, we keep the sstable_id stable throughout the sstable life cycle (since 3a12ad96c7).
Given the globally unique sstable identifier, scylla manager can now detect duplicate sstables
in a wider scope. This can be cluster-wide, but we practically need only rack-wide deduplication
or dc-wide, as tablets are migrated across racks only in rare occasions (like when converting from a
numerical replication factor to a rack list containing a subset of the available racks in a datacenter).
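The naming choice described above is itself tiny; a hedged sketch (illustrative stand-in types):
```
#include <string>

// Illustrative stand-ins for the real generation and sstable-id types.
using generation = long;
using sstable_id = std::string; // a UUID in the real code

// Pick the snapshot basename: the globally unique identifier when requested,
// otherwise the node/table-local generation (the historical behavior).
std::string snapshot_basename(bool use_sstable_identifier,
                              const sstable_id& id, generation gen) {
    return use_sstable_identifier ? id : std::to_string(gen);
}
```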
Fixes #27181
* New feature, no backport required
Closes scylladb/scylladb#27184
* github.com:scylladb/scylladb:
database: truncate_table_on_all_shards: set use_sstable_identifier to true
nodetool: snapshot: add --use-sstable-identifier option
api: storage_service: take_snapshot: add use_sstable_identifier option
test: database_test: add snapshot_use_sstable_identifier_works
test: database_test: snapshot_works: add validate_manifest
sstable: write_scylla_metadata: add random_sstable_identifier error injection
table: snapshot_on_all_shards: take snapshot_options
sstable: add get_format getter
sstable: snapshot: add use_sstable_identifier option
db: snapshot_ctl: snapshot_options: add use_sstable_identifier options
db: snapshot_ctl: move skip_flush to struct snapshot_options
Large reserves in allocating_section can cause stalls. We already log
reserve increases, but we don't know which table they belong to:
lsa - LSA allocation failure, increasing reserve in section 0x600009f94590 to 128 segments;
Allocating sections used for updating the row cache on memtable flush are
notoriously problematic. Each table has its own row_cache, so its own
allocating_section(s). If we attached the table name to those sections, we
could identify which table is causing problems. In some issues we
suspected system.raft, but we can't be sure.
This patch allows naming allocating_sections for the purpose of
identifying them in such log messages. I use abstract_formatter for
this purpose to avoid the cost of formatting strings on the hot path
(e.g. index_reader), and also to avoid duplicating strings which are
already stored elsewhere.
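A minimal sketch of the lazy-naming idea (hypothetical interface standing in for abstract_formatter):
```
#include <iostream>
#include <string_view>

// The section stores something that can render a name, but rendering
// happens only when a log line is emitted, never on the allocation hot path.
struct section_name {
    virtual std::string_view name() const = 0; // called only while logging
    virtual ~section_name() = default;
};

struct static_section_name final : section_name {
    std::string_view _n; // points at a string already stored elsewhere
    explicit static_section_name(std::string_view n) : _n(n) {}
    std::string_view name() const override { return _n; }
};

void log_reserve_increase(const section_name* n, unsigned segments) {
    std::cerr << "LSA allocation failure, increasing reserve in section "
              << (n ? n->name() : "<unnamed>") << " to " << segments
              << " segments\n";
}
```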
Fixes #25799
Closes scylladb/scylladb#27470
This change replaces the plain file_writer with crc32_digest_file_writer
for all SSTable components that should be checksummed. The resulting component
digests are stored in the sstable structure and later persisted to disk
as part of the Scylla metadata component during writer::consume_end_of_stream.
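A hedged sketch of such a writer using zlib's crc32() (hypothetical class shape; the real one wraps Scylla's file_writer):
```
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <zlib.h> // crc32()

// Every write also folds the bytes into a running CRC32, which the caller
// can persist afterwards (in Scylla's case, into the Scylla metadata
// component at consume_end_of_stream time).
class crc32_digest_file_writer_sketch {
    std::ofstream _out;
    uint32_t _digest = 0; // crc32()'s canonical seed

public:
    explicit crc32_digest_file_writer_sketch(const char* path)
        : _out(path, std::ios::binary) {}

    void write(const char* buf, std::size_t len) {
        _out.write(buf, static_cast<std::streamsize>(len));
        _digest = static_cast<uint32_t>(
            crc32(_digest, reinterpret_cast<const Bytef*>(buf),
                  static_cast<uInt>(len)));
    }

    uint32_t digest() const { return _digest; }
};
```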
Test that taking a snapshot with the use_sstable_identifier
option (and injecting `random_sstable_identifier`) produces
file names in the snapshot that differ from the original
sstable names, and validate the manifest.json file accordingly.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To be used by a unit test in the following patch for testing
the snapshot use_sstable_identifier option.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
To be used by the snapshot code in the following patch
for manufacturing a basename using the sstable_id rather
than its generation.
When set to true, use the sstable_identifier as the sstable name
in the snapshot rather than its generation.
sstable::snapshot now returns the generation it used
for the sstable in the snapshot, based on the `use_sstable_identifier`
option, to be used by the upper layer generating the manifest.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add a TemporaryScylla component type to make atomic updates of the SSTable Scylla metadata possible, using temporary files
and atomic rename operations. This will be needed in a further commit to rewrite the metadata together with
the statistics component.
Introduce a template parameter to the checksummed file writer to support
digest-only calculation without storing chunk checksums.
This will be needed in the future to calculate the digest of other components.
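A minimal sketch of the compile-time switch (hypothetical names; the combine step is a placeholder, the real writer chains CRCs properly):
```
#include <cstdint>
#include <vector>

// One writer template; the per-chunk checksum storage is compiled out in
// digest-only mode.
template <bool StoreChunkChecksums>
class checksummed_writer_sketch {
    uint32_t _digest = 0;
    std::vector<uint32_t> _chunk_checksums; // stays empty when digest-only

public:
    void on_chunk(uint32_t chunk_crc) {
        _digest ^= chunk_crc; // placeholder combine; the real writer chains CRCs
        if constexpr (StoreChunkChecksums) {
            _chunk_checksums.push_back(chunk_crc);
        }
    }
    uint32_t digest() const { return _digest; }
};

using full_checksummed_writer = checksummed_writer_sketch<true>;  // checksums + digest
using digest_only_writer = checksummed_writer_sketch<false>;      // digest only
```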
This patch enables integrity checks in 'create_stream_sources()' by introducing a new 'sstable_data_stream_source_impl' class for handling the Data component of SSTables. The new implementation uses 'sstable::data_stream()' with 'integrity_check::yes' instead of the raw input_stream.
These additional checks require reading the digest and CRC components from disk, which may introduce some I/O overhead. For uncompressed SSTables, this involves loading and computing checksums and the digest from the data.
For compressed SSTables - where checksums are already embedded - the cost comes from reading, calculating and verifying the digest.
New test cases were added to verify that the integrity checks work correctly, detecting both data and digest mismatches.
Backport is not required, since it is a new feature
Fixes #21776
Closes scylladb/scylladb#26702
* github.com:scylladb/scylladb:
file_stream_test: add sstable file streaming integrity verification test cases
streaming: prioritize sender-side errors in tablet_stream_files
sstables: enable integrity check for data file streaming
sstables: Add compressed raw streaming support
sstables: Allow to read digest and checksum from user provided file instance
sstables: add overload of data_stream() to accept custom file_input_stream_options
This patch enables integrity check in 'create_stream_sources()' by introducing a new
'sstable_data_stream_source_impl' class for handling the Data component of
SSTables. The new implementation uses 'sstable::data_stream()' with 'integrity_check::yes' instead
of the raw input_stream.
These additional checks require reading the digest and CRC components from
disk, which may introduce some I/O overhead. For uncompressed SSTables,
this involves loading and computing checksums and digest from the data.
For compressed SSTables - where checksums are already embedded - the
cost comes from reading, calculating and verifying the digest.
Implement compressed_raw_file_data_source that streams compressed chunks
without decompression while verifying checksums and calculating digests.
Extends raw_stream enum to support compressed_chunks mode.
This data_source implementation will be used in the next commits
for file based streaming.
Add overloaded methods to read digest and checksum from user-provided file
handles:
- 'read_digest(file f)'
- 'read_checksum(file f)'
This will be useful for tablet file-based streaming to enable integrity verification, as the streaming code uses SSTable snapshots with open files to prevent missing components when SSTables are unlinked.