Seastar deprecated default-constructing input_stream and output_stream
(they are useless in that state), and also deprecated move-assigning
them after the fact.
Fix by wrapping both fields in std::optional, and using emplace() to
construct them in-place once the connected socket is available.
It would be nicer to make connect() a static method that returns
a connection, but that's a larger change.
Closesscylladb/scylladb#29627
When a malformed sstable error occurs, it is usually caused by actual sstable corruption — a cosmic ray, a bad disk write, etc. However, it can also be caused by memory corruption, where a data structure in memory happens to be read as sstable data. In the latter case, having a coredump of the process at the moment of the error is invaluable for post-mortem debugging, since the exception throwing/catching machinery destroys the stack frames that would point to the corruption site.
This patch series introduces `--abort-on-malformed-sstable-error`, a new command-line option (with `LiveUpdate` support) that, when set, causes the server to call `std::abort()` instead of throwing an exception whenever any sstable parse error is detected. This covers all code paths:
- Direct `throw malformed_sstable_exception(...)` sites (migrated to `throw_malformed_sstable_exception()`)
- Direct `throw bufsize_mismatch_exception(...)` sites (migrated to `throw_bufsize_mismatch_exception()`)
- `parse_assert()` failures (via `on_parse_error()`)
- BTI parse errors (via `on_bti_parse_error()`)
The implementation places the flag and helper functions in `sstables/sstables.cc`, next to the existing `on_parse_error()` / `on_bti_parse_error()` infrastructure.
The flag defaults to `false`, preserving current behaviour. It is intended to be enabled temporarily when investigating suspected memory corruption.
**Commit breakdown:**
1. Infrastructure: flag, getter/setter, and throw helpers in `sstables/sstables.cc`; config option wired up in `main.cc`
2. `on_parse_error()` and `on_bti_parse_error()` check the new flag
3. All ~50 `throw malformed_sstable_exception(...)` sites migrated
4. Both `throw bufsize_mismatch_exception(...)` sites migrated
Refs: SCYLLADB-1087
Backport: new feature, no backport
Closesscylladb/scylladb#29324
* github.com:scylladb/scylladb:
sstables: migrate all bufsize_mismatch_exception throw sites to throw_bufsize_mismatch_exception()
sstables: migrate all malformed_sstable_exception throw sites to throw_malformed_sstable_exception()
sstables: make on_parse_error() and on_bti_parse_error() respect --abort-on-malformed-sstable-error
sstables: disable abort-on-malformed-sstable-error in tests that corrupt sstables on purpose
sstables: introduce --abort-on-malformed-sstable-error infrastructure
sstables: refactor parse_path() to return std::expected<> instead of throwing
This series adds per-test bucket isolation to all S3 and GCS object storage tests. Previously, every test shared a single pre-created bucket, which meant tests could interfere with each other through leftover objects and could not run concurrently across multiple `test.py` processes without risking collisions.
New `create_bucket`, `delete_bucket`, and `delete_bucket_with_objects` methods on `s3::client`, following the existing `make_request` pattern. `create_bucket` handles the `BUCKET_ALREADY_OWNED_BY_YOU` error gracefully.
A new `s3_test_fixture` RAII class for C++ Boost tests that creates a uniquely-named bucket on construction (derived from the Boost test name and pid) and tears down everything — objects, bucket, client — on destruction. All S3 tests in `s3_test.cc` are migrated to use it, removing manual `deferred_delete_object` and `deferred_close` boilerplate. The minio server policy is broadened to allow dynamic bucket creation/deletion.
A `client::make` overload that accepts a custom `retry_strategy`, used in tests with a fast 1ms retry delay instead of exponential backoff, significantly reducing test runtime for transient errors during bucket lifecycle operations.
Python-side (`test/cluster/object_store`): each pytest fixture (`object_storage`, `s3_storage`, `s3_server`) now creates a unique bucket per test function via `create_test_bucket()` and destroys it on teardown. Bucket names are sanitized from the pytest node name with a short UUID suffix for uniqueness.
Object storage helpers (`S3Server`, `MinioWrapper`, `GSFront`, `GSServerImpl`, factory functions, CQL helpers, `s3_server` fixture) are extracted from `test/cluster/object_store/conftest.py` into a shared `test/pylib/object_storage.py` module, eliminating duplication across test suites. The conftest becomes a thin re-export wrapper. Old class names are preserved as aliases for backward compatibility.
| Test Name | new test specific retry strategy execution time (ms) | original execution time (ms) | Δ (ms) | Speedup |
|--------------------------------------------------------------|----------------:|-------------:|---------:|--------:|
| test_client_upload_file_multi_part_with_remainder_proxy | 19,261 | 61,395 | −42,134 | **3.2×** |
| test_client_upload_file_multi_part_without_remainder_proxy | 16,901 | 53,688 | −36,787 | **3.2×** |
| test_client_upload_file_single_part_proxy | 3,478 | 6,789 | −3,311 | **2.0×** |
| test_client_multipart_copy_upload_proxy | 1,303 | 1,619 | −316 | 1.2× |
| test_client_put_get_object_proxy | 150 | 365 | −215 | **2.4×** |
| test_client_readable_file_stream_proxy | 125 | 327 | −202 | **2.6×** |
| test_small_object_copy_proxy | 205 | 389 | −184 | 1.9× |
| test_client_put_get_tagging_proxy | 181 | 350 | −169 | 1.9× |
| test_client_multipart_upload_proxy | 1,252 | 1,416 | −164 | 1.1× |
| test_client_list_objects_proxy | 729 | 881 | −152 | 1.2× |
| test_chunked_download_data_source_with_delays_proxy | 830 | 960 | −130 | 1.2× |
| test_client_readable_file_proxy | 148 | 279 | −131 | 1.9× |
| test_client_upload_file_multi_part_with_remainder_minio | 3,358 | 3,170 | +188 | 0.9× |
| test_client_upload_file_multi_part_without_remainder_minio | 3,131 | 2,929 | +202 | 0.9× |
| test_client_upload_file_single_part_minio | 519 | 421 | +98 | 0.8× |
| test_download_data_source_proxy | 180 | 237 | −57 | 1.3× |
| test_client_list_objects_incomplete_proxy | 590 | 641 | −51 | 1.1× |
| test_large_object_copy_proxy | 952 | 991 | −39 | 1.0× |
| test_client_multipart_upload_fallback_proxy | 148 | 185 | −37 | 1.3× |
| test_client_multipart_copy_upload_minio | 641 | 674 | −33 | 1.1× |
No backport needed — this is a test infrastructure improvement with no production code impact beyond the new `s3::client` methods.
Closesscylladb/scylladb#29508
* github.com:scylladb/scylladb:
test: extract object storage helpers to test/pylib/object_storage.py
test: add per-test bucket isolation to object_store fixtures
s3: add client::make overload with custom retry strategy
test: add s3_test_fixture and migrate tests to per-bucket isolation
s3: add create_bucket and delete_bucket to client
s3_storage::make_source previously ignored its file f parameter and
constructed a fresh s3::client::readable_file per call. The new
file's _stats cache was empty, so the first dma_read_bulk issued a
HEAD via maybe_update_stats just to learn the object size before
the ranged GET -- one ~50 ms RTT per uncached read.
The file f passed in by the two callers (sstable::data_stream for
Data.db reads and index_reader::make_context for Index.db reads)
already wraps the sstable's _data_file or _index_file. Those file
objects had their stats populated at sstable open time by
update_info_for_opened_data, and they were wrapped with the
configured file_io_extensions when opened via open_component. Reusing
them is exactly what filesystem_storage::make_source does (one-line
make_file_data_source over f), so the s3 path simply matches it.
readable_file::size() is also updated to route through
maybe_update_stats(), so a .size() call populates the _stats cache
the same way .stat() does -- preventing a redundant HEAD on the
first subsequent read of components opened with .size() (Index,
Partitions, Rows in update_info_for_opened_data).
Closesscylladb/scylladb#29766
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Verify that the load balancer does not issue intranode migrations when
the load difference between shards is within the size_based_balance_threshold,
and that it does issue migrations when the difference exceeds the threshold.
The intranode shard balancing loop only stopped when the most-loaded
and least-loaded shard were the same (src == dst), meaning it would
keep issuing migrations until the load difference reached exactly 0.
This caused unnecessary migrations for negligible imbalances.
Apply the same is_balanced() threshold check that is already used for
inter-node balancing, so that intranode migrations stop when the
relative load difference between shards is within the configured
size_based_balance_threshold (default 1%).
Add some text about how the new transition works. It doesn't include
full feature description, just concentrates on the new transition and
the way it interacts with the rest of topology coordinator machinery.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The test starts regular backup+restore on a smaller cluster, but prior
to it spawns tablet migration from one node to another and locks it in
the middle with the help of block_tablet_streaming injection (even
though tablets have no data and there's nothing to stream, the injection
is located early enough to work).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The test checks that losing one of nodes from the cluster while restore
is handled. In particular:
- losing an API node makes the task waiting API to throw (apparently)
- losing coordinator or replica node makes the API call to fail, because
some tablets should fail to get restored. If the coordinator is lost,
it triggers coordinator re-election and new coordinator still notices
that a tablet that was replicated to "old" coordinator failed to get
restored and fails the restore anyway
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When the storage_service::restore_tablets() resolves, it only means that
tablet transitions are done, including restore transitions, but not
necessarily that they succeeded. So before resolving the restoration
task with success need to check if all sstables were downloaded and, if
not, resolve the task with exception.
Test included. It uses fault-injection to abort downloading of a single
sstable early, then checks that the error was properly propagated back
to the task waiting API
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
After each SSTable is successfully attached to the local table in
download_tablet_sstables(), update its downloaded status in
system_distributed.snapshot_sstables to true. This enables tracking
restore progress by counting how many SSTables have been downloaded.
Change attach_sstable() return type from future<> to
future<sstables::shared_sstable>, returning the SSTable that was
attached. This will be used to extract the SSTable identifier and
first token for updating the download status.
Add a method to update the downloaded status of a specific SSTable
entry in system_distributed.snapshot_sstables. This will be used
by the tablet restore process to mark SSTables as downloaded after
they have been successfully attached to the local table.
Add a 'downloaded' boolean column to the snapshot_sstables table
schema and the corresponding field to the snapshot_sstable_entry
struct. Update insert_snapshot_sstable() and get_snapshot_sstables()
to write and read this column.
This column will be used to track which SSTables have been
successfully downloaded during a tablet restore operation.
Co-authored-by: Pavel Emelyanov <xemul@scylladb.com>
Move the TTL value used for snapshot_sstables rows from a local
variable in insert_snapshot_sstable() to a class-level constant
SNAPSHOT_SSTABLES_TTL_SECONDS, making it reusable by other methods.
The test is derived from test_restore_with_streaming_scopes() one, with
the excaption that it doesn't check for streaming directions, doesn't
check mutations right after creation and doesn't loop over scoped
sub-tests, because there's no scope concept here.
Also it verifies just two topologies, it seems to be enough. The scopes
test has many topologies because of the nature of the scoped restore,
with cluster-wide restore such flexibility is not required.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This patch adds
- Changes in sstables_loader::restore_tablets() method
It populates the system_distributed_keyspace.snapshot_sstables table
with the information read from the manifest
- Implementation of tablet_restore_task_impl::run() method
It emplaces a bunch of tablet migrations with "restore" kind
- Topology coordinator handling of tablet_transition_stage::restore
When seen, the coordinator calls RESTORE_TABLET RPC against all tablet
replicas
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The topology coordinator will need to call this verb against existing
tablet replicas to ask them restore tablet sstables. Here's the RPC verb
to do it.
It now returns an empty restore_result to make it "synchronous" -- the
co_await send_restore_tablets() won't resolve until client call
finishes.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Extracts the data from snapshot_sstables tables and filters only
sstables belonging to current node and tablet in question, then starts
downloading the matched sstables
Extracted from Ernest PR #28701 and piggy-backs the refactoring from
another Ernest PR #28773. Will be used by next patches.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When doing cluster-wide restore using topology coordinator, the
coordinator will need to serve a bunch of new tablet transition kinds --
the restore one. For that, it will need to receive information about
from where to perform the restore -- the endpoint and bucket pair. This
data can be grabbed from nowhere but the tablet transition itself, so
add the "restore_config" member with this data.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The new cluster-wide tablets restore API is going to be asynchronous,
just like existing node-local one is. For that the task_manager tasks
will be used.
This patch adds a skeleton for tablets-restore task with empty run
method. Next patches will populate it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Withdrawn from #28701. The endpoint implementation from the PR is going
to be reworked, but the swagger description and set/unset placeholders
are very useful.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Ernest Zaslavsky <ernest.zaslavsky@scylladb.com>
When restoring a backup into a keyspace under a different name, than the
one at which it existed during backup, the snapshot_sstables table must
be populated with the _new_ keyspace name, not the one taken from
manifest. Same is true for table name.
This patch makes it possible to override keyspace/table loaded from
manifest file with the provided values. in the future it will also be
good to check that if those values are not provided by user, then values
read from different manifest files are the same.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Rename arguments and local variables in get_sstables_for_tablet to avoid
references to SSTable-specific terminology. This makes the function more
generic and better suited for reuse with different range types.
Generalize get_sstables_for_tablet by templating the return type so it
produces vectors matching the input range’s value type. This makes the
function more flexible and prepares it for reuse in tablet‑aware
restore.
Add getters for the first and last tokens in get_sstables_for_tablet to
make the function more generic and suitable for future use in the
tablet-aware restore code.
Move the download_sstable and get_sstables_for_tablet static functions
from sstables_loader into a new file to make them reusable by the
tablet-aware restore code.
Extract single-tablet range filtering into a new
get_sstables_for_tablet function, taken from the existing
get_sstables_for_tablets. This will later be reused in the
tablet-aware restore code.
Extract the logic for downloading a single SST into a dedicated
function and reuse it in download_fully_contained_sstables. This
supports upcoming changes that consolidate common code.
Add a shard_id member to the minimal_sst_info struct as part of the
tablet-aware restore refactoring. This will support upcoming changes
that extract common code.
This change adds functionality for parsing backup manifests
and populating system_distributed.snapshot_sstables with
the content of the manifests.
This change is useful for tablet-aware restore. The function
introduced here will be called by the coordinator node
when restore starts to populate the snapshot_sstables table
with the data that workers need to execute the restore process.
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
Co-authored-by: Pavel Emelyanov <xemul@scylladb.com>
This patch adds the snapshot_sstables table with the following
schema:
```cql
CREATE TABLE system_distributed.snapshot_sstables (
snapshot_name text,
keyspace text, table text,
datacenter text, rack text,
id uuid,
first_token bigint, last_token bigint,
toc_name text, prefix text)
PRIMARY KEY ((snapshot_name, keyspace, table, datacenter, rack), first_token, id);
```
The table will be populated by the coordinator node during the restore
phase (and later on during the backup phase to accomodate live-restore).
The content of this table is meant to be consumed by the restore worker nodes
which will use this data to filter and file-based download sstables.
Fixes SCYLLADB-263
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
The loader will need to populate and read data from
system_distributed.snapshot_sstables table added recently, so this
dependency is truly needed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Next patches will introduce an RPC handler to restore a tablet on
replica. The handler will be registered by sstables_loader, and it will
have to call that helper from storage_service which thus needs to be
moved to public scope.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The goal of the split is to have try_transit_tablet() that
- doesn't throw if tablet is in transition, but reports it back
- doesn't wait for the submitted transition to finish
The user will be in tablet-aware-restore, it will call this new trying
helper in parallel, then wait for all transitions to finish.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
- Add `large_rows_exceeding_threshold`, `large_cell_exceeding_threshold`, and `large_collection_exceeding_threshold` metrics to complement the existing `large_partition_exceeding_threshold`
- Add unit tests verifying stats counters increment correctly during SSTable writes
Backport is not needed
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1095Closesscylladb/scylladb#29722
* github.com:scylladb/scylladb:
test/boost: add tests for large data stats counters
db: add large data metrics for rows, cells, and collections
Replace the local buffer_data_sink_impl and buffer_data_source_impl
classes in create_memory_sink() and create_memory_source() with
seastar::util::memory_data_sink and seastar::util::memory_data_source
respectively, which are now available upstream.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Closesscylladb/scylladb#29616
Add test_large_data_stats_large_rows, test_large_data_stats_large_cells,
and test_large_data_stats_large_collections to verify that the
large_data_handler stats counters are correctly incremented during
SSTable writes and that unrelated counters remain at zero.
Previously only large_partition_exceeding_threshold was exposed as a
metric. Add three new counters to large_data_handler::stats and register
corresponding Prometheus metrics:
- large_rows_exceeding_threshold
- large_cell_exceeding_threshold
- large_collection_exceeding_threshold
The counters are incremented in maybe_record_large_rows() and
maybe_record_large_cells() following the same pattern used by the
existing partition metric.