The chunked download source sends large GET requests and then consumes data
as it arrives. Sometimes it can stop reading from socket early and drop the
in-flight data. The existing read-bytes metrics show only the number of
consumed bytes, we we also want to know the number of requested bytes
Refs #25770 (accounting of read-bytes)
Fixes#25876
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#25877
In S3 client both read and write metrics have three counters -- number
of requests made, number of bytes processed and request latency. In most
of the cases all three counters are updated at once -- upon response
arrival.
However, in case of chunked download source this way of accounting
metrics is misleading. In this code the request is made once, and then
the obtained bytes are consumed eventually as the data arrive.
Currently, each time a new portion of data is read from the socket the
number of read requests is incremented. That's wrong, the request is
made once, and this counter should also be incremented once, not for
every data buffer that arrived in response.
Same for read request latency -- it's "added" for every data buffer that
arrives, but it's a lenghy process, the _request_ latency should be
accounted once per responce. Maybe later we'll want to have "data
latency" metrics as well, but for what we have now it's request latency.
The number of read bytes is accounted properly, so not touched here.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#25770
Prevent Seastar from retrying HTTP requests to avoid buffer double-feed
issues when an entire request is retried. This could cause data
corruption in `chunked_download_source`. The change is global for every
instance of `s3_client`, but it is still safe because:
* Seastar's `http_client` resets connections regardless of retry behavior
* `s3_client` retry logic handles all error types—exceptions, HTTP errors,
and AWS-specific errors—via `http_retryable_client`
Create aws_error from raised exceptions when possible and respond
appropriately. Previously, non-aws_exception types leaked from the
request handler and were treated as non-retryable, causing potential
data corruption during download.
Handle case where the download loop exits after consuming all data,
but before receiving an empty buffer signaling EOF. Without this, the
next request is sent with a non-zero offset and zero length, resulting
in "Range request cannot be satisfied" errors. Now, an empty buffer is
pushed to indicate completion and exit the fiber properly.
Disable retries for S3 requests in the chunked download source to
prevent duplicate chunks from corrupting the buffer queue. The
response handler now throws an exception to bypass the retry
strategy, allowing the next range to be attempted cleanly.
This exception is only triggered for retryable errors; unretryable
ones immediately halt further requests.
It just std::move-s a buffer and a semaphore_units objects, both moves
are noexcept, so is the constructor itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#24552
Revamped the `range` class to actively manage its state by enforcing validation on all modifications. This prevents overflow, invalid states, and ensures the object size does not exceed the 5TiB limit in S3. This should address and prevent future problems related to this issue https://github.com/minio/minio/issues/21333
No backport needed since this problem related only to this change https://github.com/scylladb/scylladb/pull/23880Closesscylladb/scylladb#24312
* github.com:scylladb/scylladb:
s3_client: headers cleanup
s3_client: Refactor `range` class for state validation
Revamped the `range` class to actively manage its state by enforcing validation on all modifications. This prevents overflow, invalid states, and ensures the object size does not exceed the 5TiB limit in S3.
Revise how we report statistics for `chunked_download_source`. Ensure
metrics for downloaded but unconsumed data are visible, as they do not
contribute to read amplification, which is tracked separately.
Closesscylladb/scylladb#24491
The existing `download_source` implementation optimizes performance
by keeping the connection to S3 open and draining data directly from
the socket. While this eliminates the overhead (60-100ms) of repeatedly
establishing new connections, it leads to rapid exhaustion of client-
side connections.
On a single shard, two `mx_readers` for load and stream are enough to
trigger this issue. Since each client typically holds two connections,
readers keeping index and data sources open can cause deadlocks where
processes stall due to unavailable connections.
Introduce `chunked_download_source`, a new S3 download method built on
`download_source`, to dynamically manage connections:
- Buffers data in 5MiB chunks using a producer-consumer model
- Closes connections once buffers reach capacity, returning them to
the pool for other clients
- Uses a filling fiber that resumes fetching once buffers are
consumed from the queue
Performance remains comparable to `download_source`, achieving
95MiB/s for sequential 1GiB downloads from S3. However, preloading
large chunks may cause read amplification.
Fixes: https://github.com/scylladb/scylladb/issues/23785Closesscylladb/scylladb#23880
Implement the CopyObject API to directly copy S3 object from one location to another. This implementation consumes zero networking overhead on the client side since the object is copied internally by S3 machinery
Usage example: Backup of tiered SSTables - you already have SSTables on S3, CopyObject is the ideal way to go
No need to backport since we are adding new functionality for a future use
Closesscylladb/scylladb#23779
* github.com:scylladb/scylladb:
s3_client: implement S3 copy object
s3_client: improve exception message
s3_client: reposition local function for future use
This PR enhances S3 throughput by leveraging every available shard to upload backup files concurrently. By distributing the load across multiple shards, we significantly improve the upload performance. Each shard retrieves an SSTable and processes its files sequentially, ensuring efficient, file-by-file uploads.
To prevent uncontrolled fiber creation and potential resource exhaustion, the backup task employs a directory semaphore from the sstables_manager. This mechanism helps regulate concurrency at the directory level, ensuring stable and predictable performance during large-scale backup operations.
Refs #22460fixes: #22520
```
===========================================
Release build, master, smp-16, mem-32GiB
Bytes: 2342880184, backup time: 9.51 s
===========================================
Release build, this PR, smp-16, mem-32GiB
Bytes: 2342891015, backup time: 1.23 s
===========================================
```
Looks like it is faster at least x7.7
No backport needed since it (native backup) is still unused functionality
Closesscylladb/scylladb#23727
* github.com:scylladb/scylladb:
backup: Add test for invalid endpoint
backup_task: upload on all shards
backup_task: integrate sharded storage manager for upload
Use all shards to upload snapshot files to S3.
By using the sharded sstables_manager_for_table
infrastructure.
Refs #22460
Quick perf comparison
===========================================
Release build, master, smp-16, mem-32GiB
Bytes: 2342880184, backup time: 9.51 s
===========================================
Release build, this PR, smp-16, mem-32GiB
Bytes: 2342891015, backup time: 1.23 s
===========================================
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Co-authored-by: Ernest Zaslavsky <ernest.zaslavsky@scylladb.com>
Add support for the CopyObject API to enable direct copying of S3
objects between locations. This approach eliminates networking
overhead on the client side, as the operation is handled internally
by S3.
Name the gates and phased barriers we use
to make it easy to debug gate_closed_exception
Refs https://github.com/scylladb/seastar/pull/2688
* Enhancement only, no backport needed
Closesscylladb/scylladb#23329
* github.com:scylladb/scylladb:
utils: loading_cache: use named_gate
utils: flush_queue: use named_gate
sstables_manager: use named gate
sstables_loader: use named gate
utils: phased_barrier, pluggable: use named gate
utils: s3::client::multipart_upload: use named gate
utils: s3::client: use named_gate
transport: controller: use named gate
tracing: trace_keyspace_helper: use named gate
task_manager: module: use named gate
topology_coordinator: use named gate
storage_service: use named gate
storage_proxy: wait_for_hint_sync_point: use named gate
storage_proxy: remote: use named gate
service: session: use named gate
service: raft: raft_rpc: use named gate
service: raft: raft_group0: use named gate
service: raft: persistent_discovery: use named gate
service: raft: group0_state_machine: use named gate
service: migration_manager: use named gate
replica: table: use named gate
replica: compaction_group, storage_group: use named gate
redis: query_processor: use named gate
repair: repair_meta: use named gate
reader_concurrency_semaphore: use named gate
raft: server_impl: use named gate
querier_cache: use named gate
gms: gossiper: use named gate
generic_server: use named gate
db: sstables_format_listener: use named gate
db: snapshot: backup_task: use named gate
db: snapshot_ctl: use named gate
hints: hints_sender: use named gate
hints: manager: use named gate
hints: hint_endpoint_manager: use named gate
commitlog: segment_manager: use named gate
db: batchlog_manager: use named gate
query_processor: remote: use named gate
compaction: compaction_state: use named gate
alternator/server: use named_gate
When streaming files using multipart upload, switch from using
`output_stream::write(const char*, size_t)` to passing buffer objects
directly to `output_stream::write()`. This eliminates unnecessary memory
copying that occurred when the original implementation had to
defensively copy data before sending.
The buffer objects can now be safely reused by the output stream instead
of creating deep copies, which should improve performance by reducing
memory operations during S3 file uploads.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#23567
The new data source implementation runs a single GET for the whole range
specified and lends the body input_stream for the upper input_stream's
get()-s. Eventually, getting the data from the body stream EOFs or
fails. In either case, the existing body is closed and a new GET is
spawn with the updater Range header so that not to include the bytes
read so far.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The get_object_contiguous() formats the 'bytes=X-Y' one for its GET
request. The very same code will be needed by next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
As the IAM role is not configured to assume a role at this moment, it
makes sense to move the instance metadata credentials provider up in
the chain. This avoids unnecessary network calls and prevents log
clutter caused by failure messages.
Closesscylladb/scylladb#23360
This PR introduces several key improvements to bolster the reliability of our S3 client, particularly in handling intermittent authentication and TLS-related issues. The changes include:
1. **Automatic Credential Renewal and Request Retry**: When credentials expire, the new retry strategy now resets the credentials and set the client to the retryable state, so the client will re-authenticate, and automatically retry the request. This change prevents transient authentication failures from propagating as fatal errors.
2. **Enhanced Exception Unwrapping**: The client now extracts the embedded std::system_error from std::nested_exception instances that may be raised by the Seastar HTTP client when using TLS. This allows for more precise error reporting and handling.
3. **Expanded TLS Error Handling**: We've added support for retryable TLS error codes within the std::system_error handler. This modification enables the client to detect and recover from transient TLS issues by retrying the affected operations.
Together, these enhancements improve overall client robustness by ensuring smoother recovery from both credential and TLS-related errors.
No backport needed since it is an enhancement
Closesscylladb/scylladb#22150
* github.com:scylladb/scylladb:
aws_error: Add GNU TLS codes
s3_client: Handle nested std::system_error exceptions
s3_client: Start using new retry strategy
retry_strategy: Add custom retry strategy for S3 client
retry_strategy: Make `should_retry` awaitable
Enhance error handling by detecting and processing std::system_error exceptions
nested within std::nested_exception. This improvement ensures that system-level
errors wrapped in the exception chain are properly caught and managed, leading
to more robust error reporting and recovery.
* Previously, token expiration was considered a fatal error. With this change,
the `s3_client` uses new retry strategy that is trying to renew expired
creds
* Added related test to the `s3_proxy`
Introduced a new retry strategy that extends the default implementation.
The should_retry method is overridden to handle a specific case for expired credential tokens.
When an expired token error is detected, the credentials are reset so it is expected that the client will re-authenticates, and the
original request is retried.