Commit Graph

1982 Commits

Author SHA1 Message Date
Asias He
4a4fbae8f7 utils: Allow chunked_vector::erase to work with non-default-constructible type
This is needed for chunked_vector<frozen_mutation_fragment> in repair.
2025-07-29 13:43:17 +08:00
Botond Dénes
837424f7bb Merge 'Add Azure Key Provider for Encryption at Rest' from Nikos Dragazis
This PR introduces a new Key Provider to support Azure Key Vault as a Key Management System (KMS) for Encryption at Rest. The core design principle is the same as in the AWS and GCP key providers - an externally provided Vault key that is used to protect local data encryption keys (a process known as "key wrapping").

In more detail, this patch series consists of:
* Multiple Azure credential sources, offering a variety of authentication options (Service Principals, Managed Identities, environment variables, Azure CLI).
* The Azure host - the Key Vault endpoint bridge.
* The Azure Key Provider - the interface for the Azure host.
* Unit tests using real Azure resources (credentials and Vault keys).
* Log filtering logic to not expose sensitive data in the logs (plaintext keys, credentials, access tokens).

This is part of the overall effort to support Azure deployments.

Testing done:
* Unit tests.
* Manual test on an Azure VM with a Managed Identity.
* Manual test with credentials from Azure CLI.
* Manual test of `--azure-hosts` cmdline option.
* Manual test of log filtering.

Remaining items:
- [x] Create necessary Azure resources for CI.
- [x] Merge pipeline changes (https://github.com/scylladb/scylla-pkg/pull/5201).

Closes https://github.com/scylladb/scylla-enterprise/issues/1077.

New feature. No backport is needed.

Closes scylladb/scylladb#23920

* github.com:scylladb/scylladb:
  docs: Document the Azure Key Provider
  test: Add tests for Azure Key Provider
  pylib: Add mock server for Azure Key Vault
  encryption: Define and enable Azure Key Provider
  encryption: azure: Delegate hosts to shard 0
  encryption: Add Azure host cache
  encryption: Add config options for Azure hosts
  encryption: azure: Add override options
  encryption: azure: Add retries for transient errors
  encryption: azure: Implement init()
  encryption: azure: Implement get_key_by_id()
  encryption: azure: Add id-based key cache
  encryption: azure: Implement get_or_create_key()
  encryption: azure: Add credentials in Azure host
  encryption: azure: Add attribute-based key cache
  encryption: azure: Add skeleton for Azure host
  encryption: Templatize get_{kmip,kms,gcp}_host()
  encryption: gcp: Fix typo in docstring
  utils: azure: Get access token with default credentials
  utils: azure: Get access token from Azure CLI
  utils: azure: Get access token from IMDS
  utils: azure: Get access token with SP certificate
  utils: azure: Get access token with SP secret
  utils: rest: Add interface for request/response redaction logic
  utils: azure: Declare all Azure credential types
  utils: azure: Define interface for Azure credentials
  utils: Introduce base64url_{encode,decode}
2025-07-25 10:45:32 +03:00
Pavel Emelyanov
295165d8ea Merge 's3_client: Enhance s3_client error handling' from Ernest Zaslavsky
Enhance and fix error handling in the `chunked_download_source` to prevent errors seeping from the request callback. Also stop retrying on seastar's side since it is going to break the integrity of data which maybe downloaded more than once for the same range.

Fixes: https://github.com/scylladb/scylladb/issues/25043

Should be backported to 2025.3 since we have an intention to release native backup/restore feature

Closes scylladb/scylladb#24883

* github.com:scylladb/scylladb:
  s3_client: Disable Seastar-level retries in HTTP client creation
  s3_test: Validate handling of non-`aws_error` exceptions
  s3_client: Improve error handling in chunked_download_source
  aws_error: Add factory method for `aws_error` from exception
2025-07-22 10:40:39 +03:00
Ernest Zaslavsky
fc2c9dd290 s3_client: Disable Seastar-level retries in HTTP client creation
Prevent Seastar from retrying HTTP requests to avoid buffer double-feed
issues when an entire request is retried. This could cause data
corruption in `chunked_download_source`. The change is global for every
instance of `s3_client`, but it is still safe because:
* Seastar's `http_client` resets connections regardless of retry behavior
* `s3_client` retry logic handles all error types—exceptions, HTTP errors,
  and AWS-specific errors—via `http_retryable_client`
2025-07-21 17:03:23 +03:00
Ernest Zaslavsky
ba910b29ce s3_test: Validate handling of non-aws_error exceptions
Inject exceptions not wrapped in `aws_error` from request callback
lambda to verify they are properly caught and handled.
2025-07-21 16:52:43 +03:00
Ernest Zaslavsky
b7ae6507cd s3_client: Improve error handling in chunked_download_source
Create aws_error from raised exceptions when possible and respond
appropriately. Previously, non-aws_exception types leaked from the
request handler and were treated as non-retryable, causing potential
data corruption during download.
2025-07-21 16:49:47 +03:00
Ernest Zaslavsky
d53095d72f aws_error: Add factory method for aws_error from exception
Move `aws_error` creation logic out of `retryable_http_client` and
into the `aws_error` class to support reuse across components.
2025-07-21 16:42:44 +03:00
Ernest Zaslavsky
408aa289fe treewide: Move misc files to utils directory
As requested in #22114, moved the files and fixed other includes and build system.

Moved files:
- interval.hh
- Map_difference.hh

Fixes: #22114

This is a cleanup, no need to backport

Closes scylladb/scylladb#25095
2025-07-21 11:56:40 +03:00
Avi Kivity
3dfdcf7d7a Merge 'transport: remove throwing protocol_exception on connection start' from Dario Mirovic
`protocol_exception` is thrown in several places. This has become a performance issue, especially when starting/restarting a server. To alleviate this issue, throwing the exception has to be replaced with returning it as a result or an exceptional future.

This PR replaces throws in the `transport/server` module. This is achieved by using result_with_exception, and in some places, where suitable, just by creating and returning an exceptional future.

There are four commits in this PR. The first commit introduces tests in `test/cqlpy`. The second commit refactors transport server `handle_error` to not rethrow exceptions. The third commit refactors reusable buffer writer callbacks. The fourth commit replaces throwing `protocol_exception` to returning it.

Based on the comments on an issue linked in https://github.com/scylladb/scylladb/issues/24567, the main culprit from the side of protocol exceptions is the invalid protocol version one, so I tested that exception for performance.

In order to see if there is a measurable difference, a modified version of `test_protocol_version_mismatch` Python is used, with 100'000 runs across 10 processes (not threads, to avoid Python GIL). One test run consisted of 1 warm-up run and 5 measured runs. First test run has been executed on the current code, with throwing protocol exceptions. Second test urn has been executed on the new code, with returning protocol exceptions. The performance report is in https://github.com/scylladb/scylladb/pull/24738#issuecomment-3051611069. It shows ~10% gains in real, user, and sys time for this test.

Testing

Build: `release`

Test file: `test/cqlpy/test_protocol_exceptions.py`
Test name: `test_protocol_version_mismatch` (modified for mass connection requests)

Test arguments:
```
max_attempts=100'000
num_parallel=10
```

Throwing `protocol_exception` results:
```
real=1:26.97  user=10:00.27  sys=2:34.55  cpu=867%
real=1:26.95  user=9:57.10  sys=2:32.50  cpu=862%
real=1:26.93  user=9:56.54  sys=2:35.59  cpu=865%
real=1:26.96  user=9:54.95  sys=2:32.33  cpu=859%
real=1:26.96  user=9:53.39  sys=2:33.58  cpu=859%

real=1:26.95 user=9:56.85 sys=2:34.11 cpu=862%   # average
```

Returning `protocol_exception` as `result_with_exception` or an exceptional future:
```
real=1:18.46  user=9:12.21  sys=2:19.08  cpu=881%
real=1:18.44  user=9:04.03  sys=2:17.91  cpu=869%
real=1:18.47  user=9:12.94  sys=2:19.68  cpu=882%
real=1:18.49  user=9:13.60  sys=2:19.88  cpu=883%
real=1:18.48  user=9:11.76  sys=2:17.32  cpu=878%

real=1:18.47 user=9:10.91 sys=2:18.77 cpu=879%   # average
```

This PR replaced `transport/server` throws of `protocol_exception` with returns. There are a few other places where protocol exceptions are thrown, and there are many places where `invalid_request_exception` is thrown. That is out of scope of this single PR, so the PR just refs, and does not resolve issue #24567.

Refs: #24567

This PR improves performance in cases when protocol exceptions happen, for example during connection storms. It will require backporting.

Closes scylladb/scylladb#24738

* github.com:scylladb/scylladb:
  test/cqlpy: add cpp exception metric test conditions
  transport/server: replace protocol_exception throws with returns
  utils/reusable_buffer: accept non-throwing writer callbacks via result_with_exception
  transport/server: avoid exception-throw overhead in handle_error
  test/cqlpy: add protocol_exception tests
2025-07-20 17:42:30 +03:00
Dario Mirovic
9f4344a435 utils/reusable_buffer: accept non-throwing writer callbacks via result_with_exception
Make make_bytes_ostream and make_fragmented_temporary_buffer accept
writer callbacks that return utils::result_with_exception instead of
forcing them to throw on error. This lets callers propagate failures
by returning an error result rather than throwing an exception.

Introduce buffer_writer_for, bytes_ostream_writer, and fragmented_buffer_writer
concepts to simplify and document the template requirements on writer callbacks.

This patch does not modify the actual callbacks passed, except for the syntax
changes needed for successful compilation, without changing the logic.

Refs: #24567
2025-07-17 16:40:02 +02:00
Botond Dénes
fd6877c654 Merge 'alternator: avoid oversized allocation in Query/Scan' from Nadav Har'El
This series fixes one cause of oversized allocations - and therefore potentially stalls and increased tail latencies - in Alternator.

The first patch in the series is the main fix - the later patches are cleanups requested by reviewers but also involved other pre-existing code, so I did those cleanups as separate patches.

Alternator's Scan or Query operation return a page of results. When the number of items is not limited by a "Limit" parameter, the default is to return a 1 MB page. If items are short, a large number of them can fit in that 1MB. The test test_query.py::test_query_large_page_small_rows has 30,000 items returned in a single page.

In the response JSON, all these items are returned in a single array "Items". Before this patch, we build the full response as a RapidJSON object before sending it. The problem is that unfortunately, RapidJSON stores arrays as contiguous allocations. This results in large contiguous allocations in workloads that scan many small items, and large contiguous allocations can also cause stalls and high tail latencies. For example, before this patch, running

    test/alternator/run --runveryslow \
        test_query.py::test_query_large_page_small_rows

reports in the log:

    oversized allocation: 573440 bytes.

After this patch, this warning no longer appears.
The patch solves the problem by collecting the scanned items not in a RapidJSON array, but rather in a chunked_vector<rjson::value>, i.e, a chunked (non-contiguous) array of items (each a JSON value). After collecting this array separately from the response object, we need to print its content without actually inserting it into the object - we add a new function print_with_extra_array() to do that.

The new separate-chunked-vector technique is used when a large number (currently, >256) of items were scanned. When there is a smaller number of items in a page (this is typical when each item is longer), we just insert those items in the object and print it as before.

Beyond the original slow test that demonstrated the oversized allocation (which is now gone), this patch also includes a new test which exercises the new code with a scan of 700 (>256) items in a page - but this new test is fast enough to be permanently in our test suite and not a manual "veryslow" test as the other test.

Fixes #23535

The stalls caused by large allocations was seen by actual users, so it makes sense to backport this patch. On the other hand, the patch while not big is fairly intrusive (modifies the nomal Scan and Query path and also the later patches do some cleanup of additional code) so there is some small risk involved in the backport.

Closes scylladb/scylladb#24480

* github.com:scylladb/scylladb:
  alternator: clean up by co-routinizing
  alternator: avoid spamming the log when failing to write response
  alternator: clean up and simplify request_return_type
  alternator: avoid oversized allocation in Query/Scan
2025-07-17 11:30:40 +03:00
Ernest Zaslavsky
342e94261f s3_client: parse multipart response XML defensively
Ensure robust handling of XML responses when initiating multipart
uploads. Check for the existence of required nodes before access,
and throw an exception if the XML is empty or malformed.

Refs: https://github.com/scylladb/scylladb/issues/24676

Closes scylladb/scylladb#24990
2025-07-17 10:55:04 +03:00
Nikos Dragazis
eec49c4d78 utils: azure: Get access token with default credentials
Attempt to detect credentials from the system.

Inspired from the `DefaultAzureCredential` in the Azure C++ SDK, this
credential type detects credentials from the following sources (in this
order):

* environment variables (SP credentials - same variables as in Azure C++ SDK)
* Azure CLI
* IMDS

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-07-16 17:14:08 +03:00
Nikos Dragazis
937d6261c0 utils: azure: Get access token from Azure CLI
Implement token request with Azure CLI.

Inspired from the Azure C++ SDK's `AzureCliCredential`, this credential
type attempts to run the Azure CLI in a shell and parse the token from
its output. This is meant for development purposes, where a user has
already installed the Azure CLI and logged in with their user account.

Pass the following environment to the process:
* PATH
* HOME
* AZURE_CONFIG_DIR

Add a token factory to construct a token from the process output. Unlike
in Azure Entra and IMDS, the CLI's JSON output does not contain
'expires_in', and the token key is in camel case.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-07-16 17:14:08 +03:00
Nikos Dragazis
52a4bd83d5 utils: azure: Get access token from IMDS
Implement token request from IMDS.

No credentials are required for that - just a plain HTTP request on the
IMDS token endpoint.

Since the IMDS endpoint is a raw IP, it's not possible to reliably
determine whether IMDS is accessible or not (i.e., whether the node is
an Azure VM). Azure provides no node-local indication either. In lack of
a better choice, attempt to connect and declare failure if the
connection is not established within 3 seconds. Use a raw TCP socket for
this check, as the HTTP client currently lacks timeout or cancellation
support. Perform the check only once, during the first token refresh.

For the time being, do not support nodes with multiple user-assigned
managed identities. Expect the token request to fail in this case (IMDS
requires the identifier of the desired Managed Identity).

Add a token factory to correctly parse the HTTP response. This addresses
a discrepancy between token requests on IMDS and Azure Entra - the
'expires_in' field is a string in the former and an integer in the
latter.

Finally, implement a fail-fast retry policy for short-lived transient
errors.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-07-16 17:14:08 +03:00
Nikos Dragazis
919765fb7f utils: azure: Get access token with SP certificate
Implement token request for Service Principals with a certificate.

The request is the same as with a secret, except that the secret is
replaced with an assertion. The assertion is a JWT that is signed with
the certificate.

To be consistent with the Azure C++ SDK, expect the certificate and the
associated private key to be encoded in PEM format and be provided in a
single file.

The docs suggest using 'PS256' for the JWT's 'alg' claim. Since this is
not supported by our current JWT library (jwt-cpp), use 'RS256' instead.

The JWT also requires a unique identifier for the 'jti' claim. Use a
random UUID for that (it should suffice for our use cases).

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-07-16 17:14:08 +03:00
Nikos Dragazis
a671530af6 utils: azure: Get access token with SP secret
Implement token request for Service Principals with a secret.

The token request requires a TLS connection. When closing the
connection, do not wait for a response to the TLS `close_notify` alert.
Azure's OAuth server would ignore it and the Seastar `connected_socket`
would hang for 10 seconds.

Add log redaction logic to not expose sensitive data from the request
and response payloads.

Add a token factory to parse the HTTP response. This cannot be shared
with other credential types because the JSON format is not consistent.

Finally, implement a fail-fast retry policy for short-lived transient
errors.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-07-16 17:14:08 +03:00
Nikos Dragazis
66c8ffa9bf utils: rest: Add interface for request/response redaction logic
The rest http client, currently used by the AWS and GCP key providers,
logs the HTTP requests and responses unaltered. This causes some
sensitive data to be exposed (plaintext data encryption keys,
credentials, access tokens).

Add an interface to optionally redact any sensitive data from HTTP
headers and payloads.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-07-16 17:14:08 +03:00
Nikos Dragazis
0d0135dc4c utils: azure: Declare all Azure credential types
The goal is to mimic the Azure C++ SDK, which offers a variety of
credentials, depending on their type and source.

Declare the following credentials:
* Service Principal credentials
* Managed Identity credentials
* Azure CLI credentials
* Default credentials

Also, define a common exception for SP and MI credentials which are
network-based.

This patch only defines the API.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-07-16 17:14:08 +03:00
Nikos Dragazis
3c4face47b utils: azure: Define interface for Azure credentials
Azure authentication is token based - the client obtains an access token
with their credentials, and uses it as a bearer token to authorize
requests to Azure services.

Define a common API for all credential types. The API will consist of a
single `get_access_token()` function that will be returning a new or a
cached access token for some resource URI (defines token scope).

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-07-16 17:14:08 +03:00
Nikos Dragazis
57bc51342e utils: Introduce base64url_{encode,decode}
Add helpers for base64url encoding.

base64url is a variant of base64 that uses a URL-safe alphabet. It can
be constructed from base64 by replacing the '+' and '/' characters with
'-' and '_' respectively. Many implementations also strip the padding,
although this is not required by the spec [1].

This will be used in upcoming patches for Azure Key Vault requests that
require base64url-encoded payloads.

[1] https://datatracker.ietf.org/doc/html/rfc4648#section-5

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-07-16 17:14:08 +03:00
Avi Kivity
c762425ea7 Merge 'auth: move passwords::check call to alien thread' from Andrzej Jackowski
Analysis of customer stalls revealed that the function `detail::hash_with_salt` (invoked by `passwords::check`) often blocks the reactor. Internally, this function uses the external `crypt_r` function to compute password hashes, which is CPU-intensive.

This PR addresses the issue in two ways:
1) `sha-512` is now the only password hashing scheme for new passwords (it was already the common-case).
2) `passwords::check` is moved to a dedicated alien thread.

Regarding point 1: before this change, the following hashing schemes were supported by     `identify_best_supported_scheme()`: bcrypt_y, bcrypt_a, SHA-512, SHA-256, and MD5. The reason for this was that the `crypt_r` function used for password hashing comes from an external library (currently `libxcrypt`), and the supported hashing algorithms vary depending on the library in use. However:
- The bcrypt schemes never worked properly because their prefixes lack the required round count (e.g. `$2y$` instead of `$2y$05$`). Moreover, bcrypt is slower than SHA-512, so it  not good idea to fix or use it.
- SHA-256 and SHA-512 both belong to the SHA-2 family. Libraries that support one almost always support the other, so it’s very unlikely to find SHA-256 without SHA-512.
- MD5 is no longer considered secure for password hashing.

Regarding point 2: the `passwords::check` call now runs on a shared alien thread created at database startup. An `std::mutex` synchronizes that thread with the shards. In theory this could introduce a frequent lock contention, but in practice each shard handles only a few hundred new connections per second—even during storms. There is already `_conns_cpu_concurrency_semaphore` in `generic_server` limits the number of concurrent connection handlers.

Fixes https://github.com/scylladb/scylladb/issues/24524

Backport not needed, as it is a new feature.

Closes scylladb/scylladb#24924

* github.com:scylladb/scylladb:
  main: utils: add thread names to alien workers
  auth: move passwords::check call to alien thread
  test: wait for 3 clients with given username in test_service_level_api
  auth: refactor password checking in password_authenticator
  auth: make SHA-512 the only password hashing scheme for new passwords
  auth: whitespace change in identify_best_supported_scheme()
  auth: require scheme as parameter for `generate_salt`
  auth: check password hashing scheme support on authenticator start
2025-07-16 13:15:54 +03:00
Andrzej Jackowski
77a9b5919b main: utils: add thread names to alien workers
This commit adds a call to `pthread_setname_np` in
`alien_worker::spawn`, so each alien worker thread receives a
descriptive name. This makes debugging, monitoring, and performance
analysis easier by allowing alien workers to be clearly identified
in tools such as `perf`.
2025-07-15 23:29:21 +02:00
Nadav Har'El
2385fba4b6 alternator: avoid oversized allocation in Query/Scan
This patch fixes one cause of oversized allocations - and therefore
potentially stalls and increased tail latencies - in Alternator.

Alternator's Scan or Query operation return a page of results. When the
number of items is not limited by a "Limit" parameter, the default is
to return a 1 MB page. If items are short, a large number of them can
fit in that 1MB. The test test_query.py::test_query_large_page_small_rows
has 30,000 items returned in a single page.

In the response JSON, all these items are returned in a single array
"Items". Before this patch, we build the full response as a RapidJSON
object before sending it. The problem is that unfortunately, RapidJSON
stores arrays as contiguous allocations. This results in large
contiguous allocations in workloads that scan many small items, and
large contiguous allocations can also cause stalls and high tail
latencies. For example, before this patch, running

    test/alternator/run --runveryslow \
        test_query.py::test_query_large_page_small_rows

reports in the log:

    oversized allocation: 573440 bytes.

After this patch, this warning no longer appears.
The patch solves the problem by collecting the scanned items not in a
RapidJSON array, but rather in a chunked_vector<rjson::value>, i.e,
a chunked (non-contiguous) array of items (each a JSON value).
After collecting this array separately from the response object, we
need to print its content without actually inserting it into the object -
we add a new function print_with_extra_array() to do that.

The new separate-chunked-vector technique is used when a large number
(currently, >256) of items were scanned. When there is a smaller number
of items in a page (this is typical when each item is longer), we just
insert those items in the object and print it as before.

Beyond the original slow test that demonstrated the oversized allocation
(which is now gone), this patch also includes a new test which
exercises the new code with a scan of 700 (>256) items in a page -
but this new test is fast enough to be permanently in our test suite
and not a manual "veryslow" test as the other test.

Fixes #23535
2025-07-14 18:41:34 +03:00
Benny Halevy
0e455c0d45 utils: clear_gently: add support for sets
Since set and unordered_set do not allow modifying
their stored object in place, we need to first extract
each object, clear it gently, and only then destroy it.

To achieve that, introduce a new Extractable concept,
that extracts all items in a loop and calls clear_gently
on each extracted item, until the container is empty.

Add respective unit tests for set and unordered_set.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#24608
2025-07-13 12:30:45 +03:00
Pawel Pery
8d3c33f74a utils: refactor sequential_producer as abortable
This patch is a part of vector_store_client sharded service
implementation for a communication with vector-store service.

There is a need for abortable sequention_producer operator(). The
existing operator() is changed to allow timeout argument with default
time_point::max() (as current default usage) and the new operator() is
created with abort_source parameter.

Reference: VS-47
2025-07-08 16:29:55 +02:00
Yaniv Michael Kaul
82fba6b7c0 PowerPC: remove ppc stuff
We don't even compile-test it.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#24659
2025-07-08 10:38:23 +03:00
Dawid Mędrek
a151944fa6 treewide: Replace __builtin_expect with (un)likely
C++20 introduced two new attributes--likely and unlikely--that
function as a built-in replacement for __builtin_expect implemented
in various compilers. Since it makes code easier to read and it's
an integral part of the language, there's no reason to not use it
instead.

Closes scylladb/scylladb#24786
2025-07-03 13:34:04 +03:00
Pavel Emelyanov
fa0077fb77 Merge 'S3 chunked download source bug fixes' from Ernest Zaslavsky
- Fix missing negation in the `if` in the background downloading fiber
- Add test to catch this case
- Improve the s3 proxy to inject errors if the same resource requested more than once
- Suppress client retry since retrying the same request when each produces multiple buffers may lead to the same data appear more than once in the buffer deque
- Inject exception from the test to simulate response callback failure in the middle

No need to backport anything since this class in not used yet

Closes scylladb/scylladb#24657

* github.com:scylladb/scylladb:
  s3_test: Add s3_client test for non-retryable error handling
  s3_test: Add trace logging for default_retry_strategy
  s3_client: Fix edge case when the range is exhausted
  s3_client: Fix indentation in try..catch block
  s3_client: Stop retries in chunked download source
  s3_client: Enhance test coverage for retry logic
  s3_client: Add test for Content-Range fix
  s3_client: Fix missing negation
  s3_client: Refine logging
  s3_client: Improve logging placement for current_range output
2025-07-02 14:45:10 +03:00
Avi Kivity
1e0b015c8b Merge 'cql3: Represent create_statement using managed_bytes' from Dawid Mędrek
When describing a table, we need to do it carefully: if some
columns were dropped, we must specify that explicitly by

```
ALTER TABLE {table} DROP {column} USING TIMESTAMP ...
```

in the result of the DESCRIBE statement. Failing to do so
could lead to data resurrection.

However, if a table has been altered many, many times,
we might end up with a huge create statement. Constructing
it could, in turn, trigger an oversized allocation.
Some tests ran into that very problem in fact.

In this commit, we want to mitigate the problem: instead of
allocating a contiguous chunk of memory for the create
statement, we use `bytes_ostream` and `managed_bytes` to
possibly keep data scattered in memory. It makes handling
`cql3::description` less convenient in the code, but since
the struct is pretty much immediately serialized after
creating it, it's a very good trade-off.

A reproducer is intentionally not provided by this commit:
it's easy to test the change, but adding and dropping
a huge number of columns would take a really long amount
of time, so we need to omit it.

Fixes scylladb/scylladb#24018

Backport: all of the supported versions are affected, so we want to backport the changes there.

Closes scylladb/scylladb#24151

* github.com:scylladb/scylladb:
  cql3/description: Serialize only rvalues of description
  cql3: Represent create_statement using managed_string
  cql3/statements/describe_statement.cc: Don't copy descriptions
  cql3: Use managed_bytes instead of bytes in DESCRIBE
  utils/managed_string.hh: Introduce managed_string and fragmented_ostringstream
2025-07-01 21:59:38 +03:00
Ernest Zaslavsky
acf15eba8e s3_test: Add s3_client test for non-retryable error handling
Introduce a test that injects a non-retryable error and verifies
that the chunked download source throws an exception as expected.
2025-07-01 18:45:17 +03:00
Ernest Zaslavsky
49e8c14a86 s3_client: Fix edge case when the range is exhausted
Handle case where the download loop exits after consuming all data,
but before receiving an empty buffer signaling EOF. Without this, the
next request is sent with a non-zero offset and zero length, resulting
in "Range request cannot be satisfied" errors. Now, an empty buffer is
pushed to indicate completion and exit the fiber properly.
2025-07-01 18:45:17 +03:00
Ernest Zaslavsky
e50f247bf1 s3_client: Fix indentation in try..catch block
Correct indentation in the `try..catch` block to improve code
readability and maintain consistent formatting.
2025-07-01 18:45:17 +03:00
Ernest Zaslavsky
d2d69cbc8c s3_client: Stop retries in chunked download source
Disable retries for S3 requests in the chunked download source to
prevent duplicate chunks from corrupting the buffer queue. The
response handler now throws an exception to bypass the retry
strategy, allowing the next range to be attempted cleanly.

This exception is only triggered for retryable errors; unretryable
ones immediately halt further requests.
2025-07-01 18:45:17 +03:00
Ernest Zaslavsky
6d9cec558a s3_client: Fix missing negation
Restore a missing `not` in a conditional check that caused
incorrect behavior during S3 client execution.
2025-07-01 18:45:17 +03:00
Ernest Zaslavsky
e73b83e039 s3_client: Refine logging
Fix typo in log message to improve clarity and accuracy during
S3 operations.
2025-07-01 18:45:17 +03:00
Ernest Zaslavsky
f1d0690194 s3_client: Improve logging placement for current_range output
Relocated logging to occur after determining the `current_range`,
ensuring more relevant output during S3 client operations.
2025-07-01 18:45:17 +03:00
Michał Chojnowski
a29724479a utils/alien_worker: fix a data race in submit()
We move a `seastar::promise` on the external worker thread,
after the matching `seastar::future` was returned to the shard.

That's illegal. If the `promise` move occurs concurrently with some
operation (move, await) on the `future`, it becomes a data race
which could cause various kinds of corruption.

This patch fixes that by keeping the promise at a stable address
on the shard (inside a coroutine frame) and only passing through
the worker.

Fixes #24751

Closes scylladb/scylladb#24752
2025-07-01 15:13:04 +03:00
Dawid Mędrek
9cc3d49233 utils/managed_string.hh: Introduce managed_string and fragmented_ostringstream
Currently, we use `managed_bytes` to represent fragmented sequences of bytes.
In some cases, the type corresponds to generic bytes, while in some other cases
-- to strings of actual text. Because of that, it's very easy to get confused
what use `managed_bytes` serve in a specific piece of code. We should avoid it.

In this commit, we're introducing basic wrappers over `managed_bytes` and
`bytes_ostream` with a promise that they represent UTF-8-encoded strings.
The interface of those types are pretty basic, but they should be sufficient
for the most common use: filling a stream with characters and then extracting
a fragmented buffer from it.
2025-06-30 19:12:08 +02:00
Lakshmi Narayanan Sreethar
279253ffd0 utils/big_decimal: fix scale overflow when parsing values with large exponents
The exponent of a big decimal string is parsed as an int32, adjusted for
the removed fractional part, and stored as an int32. When parsing values
like `1.23E-2147483647`, the unscaled value becomes `123`, and the scale
is adjusted to `2147483647 + 2 = 2147483649`. This exceeds the int32
limit, and since the scale is stored as an int32, it overflows and wraps
around, losing the value.

This patch fixes that the by parsing the exponent as an int64 value and
then adjusting it for the fractional part. The adjusted scale is then
checked to see if it is still within int32 limits before storing. An
exception is thrown if it is not within the int32 limits.

Note that strings with exponents that exceed the int32 range, like
`0.01E2147483650`, were previously not parseable as a big decimal. They
are now accepted if the final adjusted scale fits within int32 limits.
For the above value, unscaled_value = 1 and scale = -2147483648, so it
is now accepted. This is in line with how Java's `BigDecimal` parses
strings.

Fixes: #24581

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#24640
2025-06-26 15:29:28 +03:00
Szymon Malewski
f28bab741d utils/exceptions.cc: Added check for exceptions::request_timeout_exception in is_timeout_exception function.
It solves the issue, where in some cases a timeout exceptions in CAS operations are logged incorrectly as a general failure.

Fixes #24591

Closes scylladb/scylladb#24619
2025-06-26 12:25:38 +02:00
Marcin Maliszkiewicz
45392ac29e utils: don't allow do discard updateable_value observer
If the object returned from observe() is destructured,
it stops observing, potentially causing subtle bugs.
Typically, the observer object is retained as a class member.
2025-06-23 17:54:01 +02:00
Pavel Emelyanov
dc166be663 s3: Mark claimed_buffer constructor noexcept
It just std::move-s a buffer and a semaphore_units objects, both moves
are noexcept, so is the constructor itself.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#24552
2025-06-18 20:36:45 +03:00
Pavel Emelyanov
b0766d1e73 Merge 's3_client: Refactor range class for state validation' from Ernest Zaslavsky
Revamped the `range` class to actively manage its state by enforcing validation on all modifications. This prevents overflow, invalid states, and ensures the object size does not exceed the 5TiB limit in S3. This should address and prevent future problems related to this issue https://github.com/minio/minio/issues/21333

No backport needed since this problem related only to this change https://github.com/scylladb/scylladb/pull/23880

Closes scylladb/scylladb#24312

* github.com:scylladb/scylladb:
  s3_client: headers cleanup
  s3_client: Refactor `range` class for state validation
2025-06-17 10:34:55 +03:00
Ernest Zaslavsky
e398576795 s3_client: Fix hang in get() on EOF by signaling condition variable
* Ensure _get_cv.signal() is called when an empty buffer received
* Prevents `get()` from stalling indefinitely while waiting on EOF
* Found when testing https://github.com/scylladb/scylladb/pull/23695

Closes scylladb/scylladb#24490
2025-06-17 10:33:19 +03:00
Calle Wilund
4a98c258f6 http: Add missing thread_local specifier for static
Refs #24447

Patch adding this somehow managed to leave out the thread_local
specifier. While gnutls cert object can be shared across shards
just fine, the actual shared_ptr here cannot, thus we could
cause memory errors.

Closes scylladb/scylladb#24514
2025-06-17 10:23:52 +03:00
Ernest Zaslavsky
1b20e0be4a s3_client: headers cleanup 2025-06-16 16:02:30 +03:00
Ernest Zaslavsky
9ad7a456fe s3_client: Refactor range class for state validation
Revamped the `range` class to actively manage its state by enforcing validation on all modifications. This prevents overflow, invalid states, and ensures the object size does not exceed the 5TiB limit in S3.
2025-06-16 16:02:24 +03:00
Ernest Zaslavsky
2b300c8eb9 s3_client: Improve reporting of S3 client statistics
Revise how we report statistics for `chunked_download_source`. Ensure
metrics for downloaded but unconsumed data are visible, as they do not
contribute to read amplification, which is tracked separately.

Closes scylladb/scylladb#24491
2025-06-16 09:33:57 +03:00
Ernest Zaslavsky
30199552ac s3_client: Mitigate connection exhaustion in download_source
The existing `download_source` implementation optimizes performance
by keeping the connection to S3 open and draining data directly from
the socket. While this eliminates the overhead (60-100ms) of repeatedly
establishing new connections, it leads to rapid exhaustion of client-
side connections.

On a single shard, two `mx_readers` for load and stream are enough to
trigger this issue. Since each client typically holds two connections,
readers keeping index and data sources open can cause deadlocks where
processes stall due to unavailable connections.

Introduce `chunked_download_source`, a new S3 download method built on
`download_source`, to dynamically manage connections:

- Buffers data in 5MiB chunks using a producer-consumer model
- Closes connections once buffers reach capacity, returning them to
  the pool for other clients
- Uses a filling fiber that resumes fetching once buffers are
  consumed from the queue

Performance remains comparable to `download_source`, achieving
95MiB/s for sequential 1GiB downloads from S3. However, preloading
large chunks may cause read amplification.

Fixes: https://github.com/scylladb/scylladb/issues/23785

Closes scylladb/scylladb#23880
2025-06-10 12:58:24 +03:00