Commit Graph

4018 Commits

Author SHA1 Message Date
Łukasz Paszkowski
fde09fd136 reader_concurrency_semaphore: Add preemptive_abort_factor to constructors
The new parameter parametrizes the factor used to reject a read
during admission. Its value shall be between 0.0 and 1.0 where
  + 0.0 means a read will never get rejected during admission
  + 1.0 means a read will immediatelly get rejected during admission

Although passing values outside the interaval is possible, they
will have the exact same effects as they were clamped to [0.0, 1.0].
2026-01-28 14:20:01 +01:00
Łukasz Paszkowski
8829098e90 reader_concurrency_semaphore: Remove cpu_concurrency's default value
The commit 59faa6d, introduces a new parameter called cpu_concurrency
and sets its default value to 1 which violates the commit fbb83dd that
removes all default values from constructors but one used by the unit
tests.

The patch removes the default value of the cpu_concurrency parameter
and alters tests to use the test dedicated reader_concurrency_semaphore
constructor wherever possible.
2026-01-27 15:40:11 +01:00
Avi Kivity
f1c6094150 Merge 'Remove buffer_input_stream and limiting_input_stream from core code' from Pavel Emelyanov
These two streams mostly play together. The former provides an input_stream from read from in-memory temporary buffers, the latter wraps it to limit the size of provided temporary buffers. Both are used to test contiguous data consumer, also the buffer_input_stream has a caller in sstables reversing reader.

This PR removes the buffer_input_stream in favor of seastar memory_data_source, and moves the limiting_input_stream into test/lib.

Enanching testing code, not backporting

Closes scylladb/scylladb#28352

* github.com:scylladb/scylladb:
  code: Move limiting data source to test/lib
  util: Simplify limiting_data_source API
  util: Remove buffer_input_stream
  test: Use seastar::util::temporary_buffer_data_source in data consumer test
  sstables: Use seastar::util::as_input_stream() in mx reader
2026-01-26 22:05:59 +02:00
Pavel Emelyanov
2399bb8995 sstables: Use seastar::util::as_input_stream() in mx reader
Right now the code uses make_buffer_input_stream() helper that creates
an input stream with buffer_data_source_impl inside which, in turn,
provides the data_source_impl API over a single temporary_buffer.

Seastar has the very same facility, it's better to use it. Eventually
the buffer_data_source_impl will be removed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-26 12:43:17 +03:00
Pavel Emelyanov
1796997ace Merge 'restore: Enable direct download of fully contained SSTables' from Ernest Zaslavsky
This PR refactors the streaming subsystem to support direct download of fully contained sstables. Instead of streaming these files, they are downloaded and attached directly to their corresponding tables. This approach reduces overhead, simplifies logic, and improves efficiency. Expected node scope restore performance improvement: ~4 times faster in best case scenario when all sstables are fully contained.

1. Add storage options field to sstable Introduce a data member to store storage options, enabling distinction between local and object storage types.
2. Add method to create component source Extend the storage interface with a public method to create a data_source for any sstable component.
3. Inline streamer instance creation Remove make_sstable_streamer and inline its usage to allow different sets of arguments at call sites.
4. Skip streaming empty sstable sets Avoid unnecessary streaming calls when the sstable set is empty.
5. Enable direct download of contained sstables Replace streaming of fully contained sstables with direct download, attaching them to their corresponding table.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-200
Refs: https://github.com/scylladb/scylladb/issues/23908

No need to backport as this code targets 2026.2 release (for tablet-aware restore)

Closes scylladb/scylladb#26834

* github.com:scylladb/scylladb:
  tests: reuse test_backup_broken_streaming
  streaming: enable direct download of contained sstables
  storage: add method to create component source
  streaming: keep sharded database reference on tablet_sstable_streamer
  streaming: skip streaming empty sstable sets
  streaming: inline streamer instance creation
  tests: fix incorrect backup/restore test flow
2026-01-26 10:22:34 +03:00
Pavel Emelyanov
4a307d931a sstable: Replace buffer_data_sink_impl with seastar::util::basic_memory_data_sink
The former accumulates sstable writer writes into a vector of temporary
buffers. In seastar there's a generic memory data sink that provides a
sink to accumulate stream of bytes into any container.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-23 14:22:22 +03:00
Pavel Emelyanov
97b1340a68 sstables: Use seastar::util::as_input_stream() and remove buffer_data_source_impl
The latter is used to wrap vector of buffers into an input_stream.
Seastar already provides the very same functionality with the
convenience as_input_stream() helper.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-23 14:21:14 +03:00
Pavel Emelyanov
cb6ee05391 Merge 'Extend snapshot manifest.json with tablet-aware metadata' from Benny Halevy
This series extends the json manifest file we create when taking snapshots.
It adds the following metadata:
- manifesr version and scope
- snapshot name
- created_at and expires_at timestamps (#24061)
- node metadata (host_id, dc, rack)
- keyspace and table metadat
- tablet_count (#26352)
- per-sstable metadata (#26352)

Fixes [SCYLLADB-189](https://scylladb.atlassian.net/browse/SCYLLADB-189)
Fixes [SCYLLADB-195](https://scylladb.atlassian.net/browse/SCYLLADB-195)
Fixes [SCYLLADB-196](https://scylladb.atlassian.net/browse/SCYLLADB-196)

* Enhancement, no backport needed

[SCYLLADB-189]: https://scylladb.atlassian.net/browse/SCYLLADB-189?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[SCYLLADB-195]: https://scylladb.atlassian.net/browse/SCYLLADB-195?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[SCYLLADB-196]: https://scylladb.atlassian.net/browse/SCYLLADB-196?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#27945

* github.com:scylladb/scylladb:
  snapshot: keep per-sstable metadata in manifest.json
  snapshot: add table info and tablet_count to manifest.json
  snapshot: add basic support for snapshot ttl in manifest.json
  table: snapshot_on_all_shards: take snapshot_options
  db: snapshot_ctl: move skip_flush to struct snapshot_options
  snapshot: add snapshot name in manifest.json
  test: lib: cql_test_env: apply db::config::tablets_mode_for_new_keyspaces
  snapshot: add node info to manifest.json
  snapshot: add manifest info to manifest.json
  test: database_test: snapshot_works: add validate_manifest
2026-01-22 15:19:11 +03:00
Benny Halevy
d6557764b9 snapshot: keep per-sstable metadata in manifest.json
Adds a "sstables" array member to manifest.json.
For each sstables, keep the following metadata:
id - a uuid for the sstable (the sstable identifier
if the use-sstable-identifier option was used, otherwise
the sstable uuid generation)
toc_name - the name of the TOC.txt file
data_size and index_size - in bytes
first_token and last_token - of the sstable first and last keys.

Fixes: SCYLLADB-196

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-22 09:42:52 +02:00
Botond Dénes
4281d18c2e Merge 'schema: Apply sstable_compression_user_table_options to CQL aux and Alternator tables' from Nikos Dragazis
In PR 5b6570be52 we introduced the config option `sstable_compression_user_table_options` to allow adjusting the default compression settings for user tables. However, the new option was hooked into the CQL layer and applied only to CQL base tables, not to the whole spectrum of user tables: CQL auxiliary tables (materialized views, secondary indexes, CDC log tables), Alternator base tables, Alternator auxiliary tables (GSIs, LSIs, Streams).

This gap also led to inconsistent default compression algorithms after we changed the option’s default algorithm from LZ4 to LZ4WithDicts (adf9c426c2).

This series introduces a general “schema initializer” mechanism in `schema_builder` and uses it to apply the default compression settings uniformly across all user tables. This ensures that all base and aux tables take their default compression settings from config.

Fixes #26914.

Backport justification: LZ4WithDicts is the new default since 2025.4, but the config option exists since 2025.2. Based on severity, I suggest we backport only to 2025.4 to maintain consistency of the defaults.

Closes scylladb/scylladb#27204

* github.com:scylladb/scylladb:
  db/config: Update sstable_compression_user_table_options description
  schema: Add initializer for compression defaults
  schema: Generalize static configurators into schema initializers
  schema: Initialize static properties eagerly
  db: config: Add accessor for sstable_compression_user_table_options
  test: Check that CQL and Alternator tables respect compression config
2026-01-22 06:50:48 +02:00
Ernest Zaslavsky
51285785fa storage: add method to create component source
Extend the storage interface with a public method to create a
`data_source` for any sstable component.
2026-01-21 16:40:12 +02:00
Avi Kivity
bd08b6e5b2 Merge 'Unify configuration of object storage endpoints (take 2)' from Pavel Emelyanov
To configure S3 storage, one needs to do

```
object_storage_endpoints:
  - name: s3.us-east-1.amazonaws.com
    port: 443
    https: true
    aws_region: us-east-1
```

and for GCS it's

```
object_storage_endpoints:
  - name: https://storage.googleapis.com:433
    type: gs
    credentials_file: <gcp account credentials json file>
```

This PR updates the S3 part to look like

```
object_storage_endpoints:
  - name: https://s3.us-east-1.amazonaws.com:443
    aws_region: us-east-1
```

fixes: #26570

This is 2nd attempt, previous one (#27360) was reverted because it reported endpoint configs in new format via API and CQL always, even if the endpoint was configured in the old way. This "broke" scylla manager and some dtests. This version has this bug fixed, and endpoints are reported in the same format as they were configured with.

About correctness of the changes.

No modifications to existing tests are made here, so old format is respected correctly (as far as it's covered by tests). To prove the new format works the the test_get_object_store_endpoints is extended to validate both options. Some preparations to this test to make this happen come on their own with the PR #28111  to show that they are valid and pass before changing the core code.

Enhancing the way configuration is made, likely no need to backport.

Closes scylladb/scylladb#28112

* github.com:scylladb/scylladb:
  test: Validate S3 endpoints new format works
  docs: Update docs according to new endpoints config option format
  object_storage: Create s3 client with "extended" endpoint name
  s3/storage: Tune config updating
  sstable: Shuffle args for s3_client_wrapper
  test: Rename badconf variable into objconf
  test: Split the object_store/test_get_object_store_endpoints test
2026-01-14 18:29:03 +02:00
Nikos Dragazis
76b2d0f961 db: config: Add accessor for sstable_compression_user_table_options
The `sstable_compression_user_table_options` config option determines
the default compression settings for user tables.

In patch 2fc812a1b9, the default value of this option was changed from
LZ4 to LZ4WithDicts and a fallback logic was introduced during startup
to temporarily revert the option to LZ4 until the dictionary compression
feature is enabled.

Replace this fallback logic with an accessor that returns the correct
settings depending on the feature flag. This is cleaner and more
consistent with the way we handle the `sstable_format` option, where the
same problem appears (see `get_preferred_sstable_version()`).

As a consequence, the configuration option must always be accessed
through this accessor. Add a comment to point this out.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-01-13 18:30:38 +02:00
Pavel Emelyanov
f227de24b2 object_storage: Create s3 client with "extended" endpoint name
For this, add the s3::client::make(endpoint, ...) overload that accepts
endpoint in proto://host:port format. Then it parses the provided url
and calls the legacy one, that accepts raw host string and config with
port, https bit, etc.

The generic object_storage_endpoint_param no longer needs to carry the
internal s3::endpoint_config, the config option parsing changes
respectively.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-13 13:24:06 +03:00
Pavel Emelyanov
8f97e6b3de s3/storage: Tune config updating
Don't prepare s3::endpoint_config from generic code, jut pass the region
and iam_role_arn (those that can potentially change) to the callback.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-13 13:24:06 +03:00
Pavel Emelyanov
bee3564564 sstable: Shuffle args for s3_client_wrapper
Make it construct like gs_client_wrapper -- with generic endpoint param
reference and make the storage-specific casts/gets/whatever internally.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-13 13:24:06 +03:00
Botond Dénes
6801611b94 sstables/mx/reader: don't throw exceptions on the read-path
If the read is aborted via the permit (due to timeout) don't throw the
abort exception, instead propagate it via the future chain.
Also, use try_catch<> instead of try ... catch to decorate
malformed_sstable_exception with the file name.
2026-01-13 10:47:57 +02:00
Avi Kivity
0df85c8ae8 Revert "Merge 'Unify configuration of object storage endpoints' from Pavel Emelyanov"
This reverts commit 1bb897c7ca, reversing
changes made to 954f2cbd2f. It makes
incompatible changes to the object storage configuration format, breaking
tests [1]. It's likely that it doesn't break any production configuration,
but we can't be sure.

Fixes #27966

Closes scylladb/scylladb#27969
2026-01-05 08:53:41 +02:00
Avi Kivity
567c28dd0d Merge 'Decouple sstables::storage::snapshot() and ::clone() functionality' from Pavel Emelyanov
The storage::snapshot() is used in two different modes -- one to save sstable as snapshot somewhere, and another one to create a copy of sstable. The latter use-case is "optimized" by snapshotting an sstable under new generation, but it's only true for local storage. Despite for S3 storage snapshot is not implemented, _cloning_ sstable stored on S3 is not necessarily going to be the same as doing a snapshot.

Another sign of snapshot and clone being different is that calling snapshot() for snapshot itself and for clone use two very different sets of arguments -- snapshotting specifies relative name and omits new generation, while cloning doesn't need "name" and instead provides generation. Recently (#26528) cloning got extra "leave_unsealed" tag, that makes no sense for snapshotting.

Having said that, this PR introduces sstables::storage::clone() method and modifies both, callers and implementations, according to the above features of each. As a result, code logic in both methods become much simpler and a bunch of bool classes and "_tag" helper structures goes away.

Improving internal APIs, no need to backport

Closes scylladb/scylladb#27871

* github.com:scylladb/scylladb:
  sstables, storage: Drop unused bool classes and tags
  sstables/storage: Drop create_links_common() overloads
  sstable: Simplify storage::snapshot()
  sstables: Introduce storage::clone()
2025-12-29 17:50:54 +02:00
Pavel Emelyanov
2e33234e91 util: Remove lister::rmdir()
There's seastar helper that does the same, no need to carry yet another
implementation

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#27851
2025-12-28 19:46:19 +02:00
Pavel Emelyanov
712cc8b8f1 sstables, storage: Drop unused bool classes and tags
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-26 09:47:27 +03:00
Pavel Emelyanov
9e189da23a sstables/storage: Drop create_links_common() overloads
There's a bunch of tagged create_links_common() overloads that call the
most generic one with properly crafted arguments and the link_mode.
Callers of those one-liners can craft the args themselves.

As a result, there's only one create_links_common() overload and callers
explicitly specifying what they want from it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-26 09:47:27 +03:00
Pavel Emelyanov
32cf358f44 sstable: Simplify storage::snapshot()
Now there are only two callers left -- sstable::snapshot() and
sstable::seal() that wants to auto-backup the sealed sstable.

The snapshot arguments are:
- relative path, use _base_dir
- no new generation provided
- no leave-unsealed tag

With that, the implementation of filesystem_storage::snapshot() is as
simple as
- prepare full path relative to _base_dir
- touch new directory
- call create_links_common()

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-26 09:47:27 +03:00
Pavel Emelyanov
8e496a2f2f sstables: Introduce storage::clone()
And call it from sstable::clone() instead of storage::snapshot().

The snapshot arguements are:
- target directory is storage::prefix(), that's _dir itself
- new generation is always provided, no need for optional
- leave_unsealed bool flag

With that, the implementation of filesystem_storage::clone() is as
simple as call create_links_common() forwarding args and _dir to it. The
unification of leave_unsealed branches will come a bit later making this
code even shorter.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-26 09:47:27 +03:00
Botond Dénes
3071ccd54a Merge 'Storage-agnostic table::snapshot_on_all_shards()' from Pavel Emelyanov
The method in question knows that it writes snapshot to local filesystem and uses this actively. This PR relaxes this knowledge and splits the logic into two parts -- one that orchestrates sstables snapshot and collects the necessary metadata, and the code that writes the metadata itself.

Closes scylladb/scylladb#27762

* github.com:scylladb/scylladb:
  table: Move snapshot_file_set to table.cc
  table: Rename and move snapshot_on_all_shards() method
  table: Ditch jsondir variable
  table, sstables: Pass snapshot name to sstable::snapshot()
  table: Use snapshot_writer in write_manifest()
  table: Use snapshot_writer in write_schema_as_cql()
  table: Add snapshot_writer::sync()
  table: Add snapshot_writer::init()
  table: Introduce snapshot_writer
  table: Move final sync and rename seal_snapshot()
  table: Hide write_schema_as_cql()
  table: Hide table::seal_snapshot()
  table: Open-code finalize_snapshot()
  table: Fix indentation after previuous patch
  table: Use smp::invoke_on_all() to populate the vector with filenames
  table: Don't touch dir once more on seal_snapshot()
  table: Open-code table::take_snapshot() into caller lambda
  table: Move parts of table::take_snapshot to sstables_manager
  table: Introduce table::take_snapshot()
  table: Store the result of smp::submit_to in local variable
2025-12-24 13:46:47 +02:00
Botond Dénes
95d4c73eb1 Merge 'Make object storage config truly updateable' from Pavel Emelyanov
The db::config::object_storage_endpoints parameter is live-updateable, but when the update really happens, the new endpoints may fail to propagate to non-zero shards because of the way db::config sharding is implemented.

Refs: #7316
Fixes: #26509

Backport to 2025.3 and 2025.4, AFAIK there are set ups with object storage configs for native backup

Closes scylladb/scylladb#27689

* github.com:scylladb/scylladb:
  sstables/storage_manager: Fix configured endpoints observer
  test/object_store: Add test to validate how endpoint config update works
2025-12-24 13:42:44 +02:00
Botond Dénes
1bb897c7ca Merge 'Unify configuration of object storage endpoints' from Pavel Emelyanov
To configure S3 storage, one needs to do

```
object_storage_endpoints:
  - name: s3.us-east-1.amazonaws.com
    port: 443
    https: true
    aws_region: us-east-1
```

and for GCS it's

```
object_storage_endpoints:
  - name: https://storage.googleapis.com:433
    type: gs
    credentials_file: <gcp account credentials json file>
```

This PR updates the S3 part to look like

```
object_storage_endpoints:
  - name: https://s3.us-east-1.amazonaws.com:443
    aws_region: us-east-1
```

fixes: #26570

Not-yet released feature, no need to backport. Old configs are not accepted any longer. If it's needed, then this decision needs to be revised.

Closes scylladb/scylladb#27360

* github.com:scylladb/scylladb:
  object_storage: Temporarily handle pure endpoint addresses as endpoints
  code: Remove dangling mentions of s3::endpoint_config
  docs: Update docs according to new endpoints config option format
  object_storage: Create s3 client with "extended" endpoint name
  test: Add named constants for test_get_object_store_endpoints endpoint names
  s3/storage: Tune config updating
  sstable: Shuffle args for s3_client_wrapper
2025-12-24 06:59:02 +02:00
Pavel Emelyanov
132aa753da sstables/storage_manager: Fix configured endpoints observer
On start the manager creates observer for object_storage_endpoints
config parameter. The goal is to refresh the maintained set of endpoint
parameters and client upon config change. The observer is created on
shard 0 only, and when kicked it calls manager.invoke-on-all to update
manager on all shards.

However, there's a race here. The thing is that db::config values are
implicitly "sharded" under the hood with the help of plain array. When
any code tries to read a value from db::config::something, the reading
code secretly gets the value from this inner array indexed by the
current shard id.

Next, when the config is updated, it first assigns new values to [0]
element of the hidden array, then calls broadcast_to_all_shards() helper
that copies the valaues from zeroth slot to all the others. But the
manager's observer is triggered when the new value is assigned on zero
index, and if the invoke-on-all lambda (mentioned above) happens to be
faster than broadcast_to_all_shards, the non-zero shards will read old
values from db::config's inner array.

The fix is to instantiate observer on all shards and update only local
shard, whenever this update is triggered.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 12:43:11 +03:00
Pavel Emelyanov
a21aa5bdf6 table, sstables: Pass snapshot name to sstable::snapshot()
Currently sstable::snapshot() is called with directory name where to put
snapshots into. This patch changes it to accept snapshot name instead.
This makes the table-sstable API be unware of snapshot destination
storage type.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 12:14:36 +03:00
Pavel Emelyanov
ef3b651e1b table: Move parts of table::take_snapshot to sstables_manager
Move the loop over vector of sstables that calls sstable->snapshot()
into sstables manager.

This makes it symmetric with sstables_manager::delete_atomically() and
allows for future changes.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-23 11:57:25 +03:00
Botond Dénes
bfdd4f7776 Merge 'Synchronize incremental repair and tablet split' from Raphael Raph Carvalho
Split prepare can run concurrently with repair.

Consider this:

1) split prepare starts
2) incremental repair starts
3) split prepare finishes
4) incremental repair produces unsplit sstable
5) split is not happening on sstable produced by repair
        5.1) that sstable is not marked as repaired yet
        5.2) might belong to repairing set (has compaction disabled)
6) split executes
7) repairing or repaired set has unsplit sstable

If split was acked to coordinator (meaning prepare phase finished),
repair must make sure that all sstables produced by it are split.
It's not happening today with incremental repair because it disables
split on sstables belonging to repairing group. And there's a window
where sstables produced by repair belong to that group.

To solve the problem, we want the invariant where all sealed sstables
will be split.
To achieve this, streaming consumers are patched to produce unsealed
sstable, and the new variant add_new_sstable_and_update_cache() will
take care of splitting the sstable while it's unsealed.
If no split is needed, the new sstable will be sealed and attached.

This solution was also needed to interact nicely with out of space
prevention too. If disk usage is critical, split must not happen on
restart, and the invariant aforementioned allows for it, since any
unsplit sstable left unsealed will be discarded on restart.
The streaming consumer will fail if disk usage is critical too.

The reason interposer consumer doesn't fully solve the problem is
because incremental repair can start before split, and the sstable
being produced when split decision was emitted must be split before
attached. So we need a solution which covers both scenarios.

Fixes #26041.
Fixes #27414.

Should be backported to 2025.4 that contains incremental repair

Closes scylladb/scylladb#26528

* github.com:scylladb/scylladb:
  test: Add reproducer for split vs intra-node migration race
  test: Verify split failure on behalf of repair during critical disk utilization
  test: boost: Add failure_when_adding_new_sstable_test
  test: Add reproducer for split vs incremental repair race condition
  compaction: Fail split of new sstable if manager is disabled
  replica: Don't split in do_add_sstable_and_update_cache()
  streaming: Leave sstables unsealed until attached to the table
  replica: Wire add_new_sstables_and_update_cache() into intra-node streaming
  replica: Wire add_new_sstable_and_update_cache() into file streaming consumer
  replica: Wire add_new_sstable_and_update_cache() into streaming consumer
  replica: Document old add_sstable_and_update_cache() variants
  replica: Introduce add_new_sstables_and_update_cache()
  replica: Introduce add_new_sstable_and_update_cache()
  replica: Account for sstables being added before ACKing split
  replica: Remove repair read lock from maybe_split_new_sstable()
  compaction: Preserve state of input sstable in maybe_split_new_sstable()
  Rename maybe_split_sstable() to maybe_split_new_sstable()
  sstables: Allow storage::snapshot() to leave destination sstable unsealed
  sstables: Add option to leave sstable unsealed in the stream sink
  test: Verify unsealed sstable can be compacted
  sstables: Allow unsealed sstable to be loaded
  sstables: Restore sstable_writer_config::leave_unsealed
2025-12-23 07:28:56 +02:00
Benny Halevy
9e18cfbe17 sstable: add _mutate_sem to serialize link/move with components rewrite
We currently have races, like between moving an sstable from staging
using change_state, or when taking a snapshot, to e.g.
rewrite_statistics that replaces one of the sstable component files
when called, for example, from update_repaired_at by incremental repair.

Use a semaphore as a mutex to serialize those functions.
Note that there is no need for rwlock since the operations
are rare and read-only operations like snapshot don't
need to run in parallel.

Fixes #25919

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-16 17:06:45 +02:00
Botond Dénes
85f05fbe1b Revert "Merge 'Add digests for all sstable components in scylla metadata' from Taras Veretilnyk"
This reverts commit 866c96f536, reversing
changes made to 367633270a.

This change caused all longevities to fail, with a crash in parsing
scylla-metadata. The investigation is still ongoing, with no quick fix
in sight yet.

Fixes: #27496

Closes scylladb/scylladb#27518
2025-12-16 11:34:40 +02:00
Raphael S. Carvalho
1a077a80f1 sstables: Allow storage::snapshot() to leave destination sstable unsealed
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:50 -03:00
Raphael S. Carvalho
c5e840e460 sstables: Add option to leave sstable unsealed in the stream sink
That will be needed for file streaming to leave output sstable unsealed.

we want the invariant where all sealed sstables are split after split
was ACKed.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:50 -03:00
Raphael S. Carvalho
c10486a5e9 test: Verify unsealed sstable can be compacted
This is crucial for splitting before sealing the sstable produced by
repair. This way, unsplit sstables won't be left on disk sealed.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:50 -03:00
Raphael S. Carvalho
ab82428228 sstables: Allow unsealed sstable to be loaded
File streaming will have to load an unsealed sstable, so we need
to be able to parse components from temporary TOC instead.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:50 -03:00
Raphael S. Carvalho
b1be4ba2fc sstables: Restore sstable_writer_config::leave_unsealed
This option was retired in commit 0959739216, but
it will be again needed in order to implement split before sealing.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:50 -03:00
copilot-swe-agent[bot]
77ee7f3417 Revert "Merge 'Add option to use sstable identifier in snapshot' from Benny Halevy"
This reverts commit 8192f45e84.

The merge exposed a bug where truncate (via drop) fails and causes Raft
errors, leading to schema inconsistencies across nodes. This results in
test_table_drop_with_auto_snapshot failures with 'Keyspace test does not exist'
errors.

The specific problematic change was in commit 19b6207f which modified
truncate_table_on_all_shards to set use_sstable_identifier = true. This
causes exceptions during truncate that are not properly handled, leading
to Raft applier fiber stopping and nodes losing schema synchronization.
2025-12-12 03:55:13 +00:00
Pavel Emelyanov
a63508a6f7 object_storage: Temporarily handle pure endpoint addresses as endpoints
To keep backward compatibility, support

- old configs -- where endpoint is just an address and port is separate.
  When it happens, format the "new" endpoint name

- lookup by address-only. If it happens, scan all endpoints and see if
  any one matches the provided address

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-10 15:33:47 +03:00
Pavel Emelyanov
a3ca4fccef object_storage: Create s3 client with "extended" endpoint name
For this, add the s3::client::make(endpoint, ...) overload that accepts
endpoint in proto://host:port format. Then it parses the provided url
and calls the legacy one, that accepts raw host string and config with
port, https bit, etc.

The generic object_storage_endpoint_param no longer needs to carry the
internal s3::endpoint_config, the config option parsing changes
respectively.

Tests, that generate the config files, and docs are updated.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-10 15:33:47 +03:00
Pavel Emelyanov
932b008107 s3/storage: Tune config updating
Don't prepare s3::endpoint_config from generic code, jut pass the region
and iam_role_arn (those that can potentially change) to the callback.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-10 15:33:46 +03:00
Pavel Emelyanov
e47f0c6284 sstable: Shuffle args for s3_client_wrapper
Make it construct like gs_client_wrapper -- with generic endpoint param
reference and make the storage-specific casts/gets/whatever internally.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-12-10 15:33:46 +03:00
Tomasz Grabiec
7df610b73d sstables: Remove host id mismatch warning for sstable streaming
Tablet migration transfers sstable files without changing origin
host-id.  As it should, becuase those sstables were not written on the
destination host, and should be ignored by commit log replay.

So it's a normal situation, and it's confusing to see this warning in
logs.

Fixes #26957

Closes scylladb/scylladb#27433
2025-12-08 18:39:22 +02:00
Piotr Dulikowski
386309d6a0 Merge 'Improve the way distributed-loader constructs storage_options for backup sstables' from Pavel Emelyanov
The distributed_loader::get_sstables_from_object_store() method accepts an endpoint parameter and internally wants to get storage type for that endpoint (s3 or gcs). This is needed to construct storage_options object to create an sstable object.

To get the type, the method scans db::config option, but there's much simpler way to get one.

Code cleanup, no need to backport

Closes scylladb/scylladb#27381

* github.com:scylladb/scylladb:
  sstables_loader: Provide endpoint type for get_sstables_from_object_store()
  storage_manager: Introduce get_endpoint_type() method
  storage_manager: Split get_endpoint_client()
2025-12-08 16:55:20 +01:00
Pavel Emelyanov
8192f45e84 Merge 'Add option to use sstable identifier in snapshot' from Benny Halevy
This change adds a new option to the REST api and correspondingly, to scylla nodetool: use_sstable_identifier.
When set, we use the sstable identifier, if available, to name each sstable in the snapshots directory
and the manifest.json file, rather than using the sstable generation.

This can be used by the user (e.g. Scylla Manager) for global deduplication with tablets, where an sstable
may be migrated across shards or across nodes, and in this case, its generation may change, but its
sstable identifier remains sstable.

Currently, Scylla manager uses the sstable generation to detect sstables that are already backed up to
object storage and exist in previous backed up snapshots.
Historically, the sstable generation was guaranteed to be unique only per table per node,
so the dedup code currently checks for deduplication in the node scope.

However, with tablet migration, sstables are renamed when migrated to a different shard,
i.e. their generation changes, and they may be renamed when migrated to another node,
but even if they are not, the dedup logic still assumes uniqueness only within a node.

To address both cases, we keep the sstable_id stable throughout the sstable life cycle (since 3a12ad96c7).
Given the globally unique sstable identifier, scylla manager can now detect duplicate sstables
in a wider scope.  This can be cluster-wide, but we practically need only rack-wide deduplication
or dc-wide, as tablets are migrated across racks only in rare occasions (like when converting from a
numerical replication factor to a rack list containing a subset of the available racks in a datacenter).

Fixes #27181

* New feature, no backport required

Closes scylladb/scylladb#27184

* github.com:scylladb/scylladb:
  database: truncate_table_on_all_shards: set use_sstable_identifier to true
  nodetool: snapshot: add --use-sstable-identifier option
  api: storage_service: take_snapshot: add use_sstable_identifier option
  test: database_test: add snapshot_use_sstable_identifier_works
  test: database_test: snapshot_works: add validate_manifest
  sstable: write_scylla_metadata: add random_sstable_identifier error injection
  table: snapshot_on_all_shards: take snapshot_options
  sstable: add get_format getter
  sstable: snapshot: add use_sstable_identifier option
  db: snapshot_ctl: snapshot_options: add use_sstable_identifier options
  db: snapshot_ctl: move skip_flush to struct snapshot_options
2025-12-08 12:56:12 +03:00
Tomasz Grabiec
082342ecad Attach names to allocating sections for better debuggability
Large reserves in allocating_section can cause stalls. We already log
reserve increase, but we don't know which table it belongs to:

  lsa - LSA allocation failure, increasing reserve in section 0x600009f94590 to 128 segments;

Allocating sections used for updating row cache on memtable flush are
notoriously problematic. Each table has its own row_cache, so its own
allocating_section(s). If we attached table name to those sections, we
could identify which table is causing problems. In some issues we
suspected system.raft, but we can't be sure.

This patch allows naming allocating_sections for the purpose of
identifying them in such log messages. I use abstract_formatter for
this purpose to avoid the cost of formatting strings on the hot path
(e.g. index_reader). And also to avoid duplicating strings which are
already stored elsewhere.

Fixes #25799

Closes scylladb/scylladb#27470
2025-12-07 14:14:25 +02:00
Taras Veretilnyk
bc2e83bc1f sstables: store digest of all sstable components in scylla metadata
This change replaces plain file_writer with crc32_digest_file_writer
for all SSTable components that should be checksummed. The resulting component
digests are stored in the sstable structure and later persisted to disk
as part of the Scylla metadata component during writer::consume_end_of_stream.
2025-12-04 21:00:09 +01:00
Benny Halevy
07b92a1ee8 test: database_test: add snapshot_use_sstable_identifier_works
Test that taking a snapshot with the use_sstable_identifier
option (and injecting `random_sstable_identifier`) produces
different file names in the snapshot than the original
sstable names and validate te manifest.json file respectively.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-04 11:57:38 +02:00
Benny Halevy
28cb300d0a sstable: write_scylla_metadata: add random_sstable_identifier error injection
To be used by a unit test in the following patch for testing
the snapshot use_sstable_identifier option.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-04 11:55:50 +02:00