2005 Commits

Author SHA1 Message Date
Botond Dénes
f8b79d563a Merge 's3: Minor refactoring and beautification of S3 client and tests' from Ernest Zaslavsky
This pull request introduces minor code refactoring and aesthetic improvements to the S3 client and its associated test suite. The changes focus on enhancing readability, consistency, and maintainability without altering any functional behavior.

No backport is required, as the modifications are purely cosmetic and do not impact functionality or compatibility.

Closes scylladb/scylladb#25490

* github.com:scylladb/scylladb:
  s3_client: relocate `req` creation closer to usage
  s3_client: reformat long logging lines for readability
  s3_test: extract file writing code to a function
2025-08-18 18:48:42 +03:00
Avi Kivity
96956e48c4 Merge 'utils: stall_free: detect clear_gently method of const payload types' from Benny Halevy
Currently, when a container or smart pointer holds a const payload
type, utils::clear_gently does not detect the object's clear_gently
method as the method is non-const and requires a mutable object,
as in the following example in class tablet_metadata:
```
    using tablet_map_ptr = foreign_ptr<lw_shared_ptr<const tablet_map>>;
    using table_to_tablet_map = std::unordered_map<table_id, tablet_map_ptr>;
```

That said, when a container is cleared gently the elements it holds
are destroyed anyhow, so we'd like to allow clearing them gently before
destruction.

This change still doesn't allow directly calling utils::clear_gently
on const objects.

And respective unit tests.

Fixes #24605
Fixes #25026

* This is an optimization that is not strictly required to backport (as https://github.com/scylladb/scylladb/pull/24618 dealt with clear_gently of `tablet_map_ptr = foreign_ptr<lw_shared_ptr<const tablet_map>>` well enough)

Closes scylladb/scylladb#24606

* github.com:scylladb/scylladb:
  utils: stall_free: detect clear_gently method of const payload types
  utils: stall_free: clear gently a foreign shared ptr only when use_count==1
2025-08-18 12:52:02 +03:00
Ernest Zaslavsky
a0016bd0cc s3_client: relocate req creation closer to usage
Move the creation of the `req` object to the point where it is
actually used, improving code clarity and reducing premature
initialization.
2025-08-14 16:18:43 +03:00
Ernest Zaslavsky
6ef2b0b510 s3_client: reformat long logging lines for readability
Break up excessively long logging statements to improve readability
and maintain consistent formatting across the codebase.
2025-08-14 16:18:43 +03:00
Ernest Zaslavsky
dd51e50f60 s3_client: add memory fallback in chunked_download_source
Introduce fallback logic in `chunked_download_source` to handle
memory exhaustion. When memory is low, feed the `deque` with only
one uncounted buffer at a time. This allows slow but steady progress
without getting stuck on the memory semaphore.

Fixes: https://github.com/scylladb/scylladb/issues/25453
Fixes: https://github.com/scylladb/scylladb/issues/25262

Closes scylladb/scylladb#25452
2025-08-14 09:52:10 +03:00
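The fallback described in `s3_client: add memory fallback in chunked_download_source` can be illustrated with a small synchronous sketch. It is not the S3 client code: `memory_budget`, `buffer` and `feed` are hypothetical names, the budget is a plain counter rather than a Seastar semaphore, and the rule "allow one uncounted buffer only while the queue is empty" is an assumption about what "one uncounted buffer at a time" means.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// A downloaded chunk. `counted` records whether its size was charged
// against the memory budget.
struct buffer {
    std::vector<char> data;
    bool counted;
};

// Hypothetical stand-in for the memory semaphore: a non-blocking counter.
class memory_budget {
    std::size_t _available;
public:
    explicit memory_budget(std::size_t bytes) : _available(bytes) {}
    bool try_acquire(std::size_t n) {
        if (n > _available) {
            return false;
        }
        _available -= n;
        return true;
    }
    void release(std::size_t n) { _available += n; }
};

// Tries to queue one buffer of `size` bytes. Returns true if a buffer was queued.
bool feed(std::deque<buffer>& queue, memory_budget& mem, std::size_t size) {
    assert(size > 0);
    if (mem.try_acquire(size)) {
        // Normal path: the buffer is charged against the budget.
        queue.push_back({std::vector<char>(size), true});
        return true;
    }
    if (queue.empty()) {
        // Fallback (assumed rule): memory is exhausted and the reader has
        // nothing to consume, so hand over a single uncounted buffer.
        queue.push_back({std::vector<char>(size), false});
        return true;
    }
    // Memory is exhausted but the reader still has data: wait.
    return false;
}
```

With a 10-byte budget, a first 8-byte `feed` is counted, a second is refused while the queue is non-empty, and once the reader drains the queue the next `feed` succeeds with an uncounted buffer, so the download keeps moving slowly instead of blocking.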
Ernest Zaslavsky
380c73ca03 s3_client: make memory semaphore acquisition abortable
Add `abort_source` to the `get_units` call for the memory semaphore
in the S3 client, allowing the acquisition process to be aborted.

Fixes: https://github.com/scylladb/scylladb/issues/25454

Closes scylladb/scylladb#25469
2025-08-13 08:48:55 +03:00
Benny Halevy
23ac80fc6b utils: stall_free: detect clear_gently method of const payload types
Currently, when a container or smart pointer holds a const payload
type, utils::clear_gently does not detect the object's clear_gently
method as the method is non-const and requires a mutable object,
as in the following example in class tablet_metadata:
```
    using tablet_map_ptr = foreign_ptr<lw_shared_ptr<const tablet_map>>;
    using table_to_tablet_map = std::unordered_map<table_id, tablet_map_ptr>;
```

That said, when a container is cleared gently the elements it holds
are destroyed anyhow, so we'd like to allow clearing them gently before
destruction.

This change still doesn't allow directly calling utils::clear_gently
on const objects.

And respective unit tests.

Fixes #24605

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-11 14:22:01 +03:00
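The detection problem described in `utils: stall_free: detect clear_gently method of const payload types` can be reproduced outside Seastar. The sketch below is an illustration only, not the code from the commit: `payload`, `has_clear_gently`, `payload_has_clear_gently` and `clear_and_destroy` are hypothetical names, `std::unique_ptr` stands in for the smart pointers, and everything is synchronous.

```cpp
#include <cassert>
#include <memory>
#include <type_traits>
#include <utility>

// Hypothetical payload: clear_gently() is non-const, so it cannot be called
// through a reference to a const object.
struct payload {
    int* cleared_count;
    void clear_gently() {
        assert(cleared_count != nullptr);
        ++*cleared_count;
    }
};

// True when clear_gently() can be called on a T& (const T& if T is const).
template <typename T, typename = void>
struct has_clear_gently : std::false_type {};

template <typename T>
struct has_clear_gently<T, std::void_t<decltype(std::declval<T&>().clear_gently())>>
    : std::true_type {};

// Detection that looks through the constness of the payload type.
template <typename T>
constexpr bool payload_has_clear_gently =
    has_clear_gently<std::remove_const_t<T>>::value;

// Clears the payload (if it supports that) right before destroying it.
template <typename T>
void clear_and_destroy(std::unique_ptr<T>& p) {
    if constexpr (payload_has_clear_gently<T>) {
        if (p) {
            // The object is about to be destroyed anyway. This cast is only
            // valid if the object was not originally created as const.
            const_cast<std::remove_const_t<T>&>(*p).clear_gently();
        }
    }
    p.reset();
}
```

Here `has_clear_gently<const payload>` is false while `payload_has_clear_gently<const payload>` is true, which mirrors the gap the commit message describes: the method exists, but plain detection on the const type misses it.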
Benny Halevy
cb9db2f396 utils: stall_free: clear gently a foreign shared ptr only when use_count==1
Unlike clear_gently of SharedPtr, clear_gently of a
`foreign_ptr<shared_ptr<T>>` calls clear_gently on the contained object
even if it's still shared and may still be in use.

This change examines the foreign shared pointer's use_count
and calls clear_gently on the shared object only when
its use_count reaches 1.

Fixes #25026

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-11 14:21:32 +03:00
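The rule from `utils: stall_free: clear gently a foreign shared ptr only when use_count==1` can be sketched with `std::shared_ptr`. This is a simplified, single-threaded illustration: `payload` and `clear_gently_if_last_owner` are hypothetical names, and the cross-shard aspect of `foreign_ptr` is left out entirely.

```cpp
#include <cassert>
#include <memory>

// Hypothetical payload that counts how often it was cleared.
struct payload {
    int* cleared_count;
    void clear_gently() {
        assert(cleared_count != nullptr);
        ++*cleared_count;
    }
};

// Clears the pointee only when this pointer is the last owner, then drops
// the reference. Other owners may still be using the object otherwise.
template <typename T>
void clear_gently_if_last_owner(std::shared_ptr<T>& p) {
    if (p && p.use_count() == 1) {
        p->clear_gently();
    }
    p.reset();
}
```

With two owners, releasing the first leaves the object untouched; only releasing the second one clears it before destruction.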
Tomasz Grabiec
f7c001deff Merge 'key: clustering_bounds_comparator: avoid thread_local initialization guard overhead' from Avi Kivity
I noticed clustering_bounds_comparator was running an unnecessary
thread_local initialization guard. This series switches the variable
to constinit initialization, removing the guard.

Performance measurements (perf-simple-query) show an unimpressive
20 instruction per op reduction. However, each instruction counts!

Before:

```
throughput:
	mean=   203642.54 standard-deviation=1102.99
	median= 204328.69 median-absolute-deviation=955.56
	maximum=204624.13 minimum=202222.19
instructions_per_op:
	mean=   42097.59 standard-deviation=40.07
	median= 42111.83 median-absolute-deviation=30.65
	maximum=42139.88 minimum=42044.91
cpu_cycles_per_op:
	mean=   22664.81 standard-deviation=131.28
	median= 22581.10 median-absolute-deviation=111.57
	maximum=22832.30 minimum=22553.24
```

After:

```
throughput:
	mean=   204397.73 standard-deviation=2277.71
	median= 204942.95 median-absolute-deviation=2191.54
	maximum=207588.30 minimum=202162.80
instructions_per_op:
	mean=   42087.21 standard-deviation=27.30
	median= 42092.75 median-absolute-deviation=20.33
	maximum=42108.33 minimum=42041.51
cpu_cycles_per_op:
	mean=   22589.79 standard-deviation=219.24
	median= 22544.82 median-absolute-deviation=191.98
	maximum=22835.11 minimum=22303.52
```

(Very) minor performance improvement, no backport suggested.

Closes scylladb/scylladb#25259

* github.com:scylladb/scylladb:
  keys: clustering_bounds_comparator: make thread_local _empty_prefix constinit
  keys: make empty creation clustering_key_prefix constexpr
  managed_bytes: make empty managed_bytes constexpr friendly
  keys: clustering_bounds_comparator: make _empty_prefix a prefix
2025-08-11 13:20:38 +02:00
Avi Kivity
90eb6e6241 Merge 'sstables/trie: implement BTI node format serialization and traversal' from Michał Chojnowski
This is the next part in the BTI index project.

Overarching issue: https://github.com/scylladb/scylladb/issues/19191
Previous part: https://github.com/scylladb/scylladb/pull/25154
Next part: implementing a trie cursor (the "set to key, step forwards, step backwards" thing) on top of the `node_reader` added here.

The new code added here is not used for anything yet, but it's posted as a separate PR
to keep things reviewably small.

This part implements the BTI trie node encoding, as described in https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/format/bti/BtiFormat.md#trie-nodes.
It contains the logic for encoding the abstract in-memory `writer_node`s (added in the previous PR)
into the on-disk format, and the logic for traversing the on-disk nodes during a read.

New functionality, no backporting needed.

Closes scylladb/scylladb#25317

* github.com:scylladb/scylladb:
  sstables/trie: add tests for BTI node serialization and traversal
  sstables/trie: implement BTI node traversal
  sstables/trie: implement BTI serialization
  utils/cached_file: add get_shared_page()
  utils/cached_file: replace a std::pair with a named struct
2025-08-07 12:15:42 +03:00
Pavel Emelyanov
10056a8c6d Merge 'Simplify credential reload: remove internal expiration checks' from Ernest Zaslavsky
This PR introduces a refinement in how credential renewal is triggered. Previously, the system attempted to renew credentials one hour before their expiration, but the credentials provider did not recognize them as expired—resulting in a no-op renewal that returned existing credentials. This led the timer fiber to immediately retry renewal, causing a renewal storm.

To resolve this, we remove the expiration check (and any other checks) from the `reload` method, assuming that whoever calls this method knows what they are doing.

Fixes: https://github.com/scylladb/scylladb/issues/25044

Should be backported to 2025.3 since we need this fix for the restore

Closes scylladb/scylladb#24961

* github.com:scylladb/scylladb:
  s3_creds: code cleanup
  s3_creds: Make `reload` unconditional
  s3_creds: Add test exposing credentials renewal issue
2025-08-05 17:49:13 +03:00
Avi Kivity
4c785b31c7 Merge 'List Alternator clients in system.clients virtual table' from Nadav Har'El
Before this series, the "system.clients" virtual table lists active connections (and their various properties, like client address, logged in username and client version) only for CQL requests. This series also adds Alternator clients to system.clients. One of the interesting use cases of this new feature is understanding exactly which SDK a user is using, without inspecting their application code. Different SDKs pass different "User-Agent" headers in requests, and that User-Agent will be visible in the system.clients entries for Alternator requests as the "driver_name" field.

Unlike CQL where logged in username, driver name, etc. applies to a complete connection, in the Alternator API, different requests can theoretically be signed by different users and carry different headers but still arrive over the same HTTP connection. So instead of listing the currently open Alternator *connections*, we will list the currently active *requests*.

The first three patches introduce utilities that will be useful in the implementation. The fourth patch is the implementation itself (which is quite simple with the utility introduced in the second patch), and the fifth patch is a regression test for the new feature. The sixth patch adds documentation, the seventh patch refactors generic_server to use the newly introduced utility class and reduce code duplication, and the eighth patch adds a small check to an existing test of CQL's system.clients.

Fixes #24993

This patch adds a new feature, so it doesn't require a backport. Nevertheless, if we want it to get to existing customers more quickly to allow us to better understand their use case by reading the system.clients table, we may want to consider backporting this patch to existing branches. There is some risk involved in this patch, because it adds code that gets run on every Alternator request, so a bug in it can cause problems for every Alternator request.

Closes scylladb/scylladb#25178

* github.com:scylladb/scylladb:
  test/cqlpy: slightly strengthen test for system.clients
  generic_server: use utils::scoped_item_list
  docs/alternator: document the system.clients system table in Alternator
  alternator: add test for Alternator clients in system.clients
  alternator: list active Alternator requests in system.clients
  utils: unit test for utils::scoped_item_list
  utils: add a scoped_item_list utility class
  utils: add "fatal" version of utils::on_internal_error()
2025-08-05 15:55:41 +03:00
Michał Chojnowski
6fe7dbaedc utils/cached_file: add get_shared_page()
BTI index is page-aware. It's designed to be read in page units.

Thus, we want a `cached_file` accessor which explicitly requests
a whole page, preferably without copying it.

`cached_file` already works in terms of reference-counted pages,
underneath. This commit only adds some accessors which let
us request those reference-counted page pointers more directly.
2025-08-05 00:56:50 +02:00
Michał Chojnowski
58d768e383 utils/cached_file: replace a std::pair with a named struct
Cosmetic change. For clarity.
2025-08-05 00:55:32 +02:00
Piotr Dulikowski
ec7832cc84 Merge 'Raft-based recovery procedure: simplify rolling restart with recovery_leader' from Patryk Jędrzejczak
The following steps are performed in sequence as part of the
Raft-based recovery procedure:
- set `recovery_leader` to the host ID of the recovery leader in
  `scylla.yaml` on all live nodes,
- send the `SIGHUP` signal to all Scylla processes to reload the config,
- perform a rolling restart (with the recovery leader being restarted
  first).

These steps are not intuitive and more complicated than they could be.

In this PR, we simplify these steps. From now on, we will be able to
simply set `recovery_leader` on each node just before restarting it.

Apart from making necessary changes in the code, we also update all
tests of the Raft-based recovery procedure and the user-facing
documentation.

Fixes scylladb/scylladb#25015

The Raft-based procedure was added in 2025.2. This PR makes the
procedure simpler and less error-prone, so it should be backported
to 2025.2 and 2025.3.

Closes scylladb/scylladb#25032

* github.com:scylladb/scylladb:
  docs: document the option to set recovery_leader later
  test: delay setting recovery_leader in the recovery procedure tests
  gossip: add recovery_leader to gossip_digest_syn
  db: system_keyspace: peers_table_read_fixup: remove rows with null host_id
  db/config, gms/gossiper: change recovery_leader to UUID
  db/config, utils: allow using UUID as a config option
2025-08-04 08:29:32 +02:00
Ernest Zaslavsky
837475ec6f s3_creds: code cleanup
Remove unnecessary code that is no longer used.
2025-08-04 09:26:11 +03:00
Ernest Zaslavsky
e4ebe6a309 s3_creds: Make reload unconditional
Assume that any caller invoking `reload` intends to refresh credentials.
Remove conditional logic that checks for expiration before reloading.
2025-08-03 17:41:35 +03:00
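The renewal problem fixed by `s3_creds: Make reload unconditional` can be modelled with a toy provider. This is a sketch under assumptions: `credentials_provider`, `reload_if_expired` and the integer clock in seconds are all hypothetical and only illustrate why a conditional reload turns an early renewal into a no-op.

```cpp
#include <cassert>

struct credentials {
    long expires_at = 0;  // seconds on a hypothetical clock
    int generation = 0;   // incremented on every real refresh
};

class credentials_provider {
    credentials _creds;
    long _lifetime;

    void fetch(long now) {
        _creds.expires_at = now + _lifetime;
        ++_creds.generation;
    }
public:
    explicit credentials_provider(long lifetime) : _lifetime(lifetime) {
        assert(lifetime > 0);
    }
    // Old behaviour as described above: refresh only if already expired,
    // so a renewal attempted before expiration does nothing.
    void reload_if_expired(long now) {
        if (now >= _creds.expires_at) {
            fetch(now);
        }
    }
    // New behaviour: the caller decided to refresh, so always refresh.
    void reload(long now) {
        fetch(now);
    }
    const credentials& get() const { return _creds; }
};
```

With a two-hour lifetime, a timer firing one hour before expiration gets the same credentials back from `reload_if_expired` and would immediately fire again, whereas `reload` returns fresh credentials and pushes the expiration forward.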
Avi Kivity
8b1bf46086 Merge 'sstables: introduce trie_writer' from Michał Chojnowski
This is the first part of a larger project meant to implement a trie-based
index format. (The same or almost the same as Cassandra's BTI).

As of this patch, the new code isn't used for anything yet,
but we introduce it separately from its users to keep PRs small enough
for reviewability.

This commit introduces trie_writer, a class responsible for turning a
stream of (key, value) pairs (already sorted by key) into a stream of
serializable nodes, such that:

1. Each node lies entirely within one page (guaranteed).
2. Parents are located in the same page as their children (best-effort).
3. Padding (unused space) is minimized (best-effort).

It does mostly what you would expect a "sorted keys -> trie" builder to do.
The hard part is calculating the sizes of nodes (which, in a well-packed on-disk
format, depend on the exact offsets of the node from its children) and grouping
them into pages.

This implementation mostly follows Cassandra's design of the same thing.
There are some differences, though. Notable ones:

1. The writer operates on chains of characters, rather than single characters.

   In Cassandra's implementation, the writer creates one node per character.
   A single long key can be translated to thousands of nodes.
   We create only one node per key. (Actually we split very long keys into
   a few nodes, but that's arbitrary and beside the point).

   For BTI's partition key index this doesn't matter.
   Since it only stores a minimal unique prefix of each key,
   and the trie is very balanced (due to token randomness),
   the average number of new characters added per key is very close to 1 anyway.
   (And the string-based logic might actually be a small pessimization, since
   manipulating a 1-byte string might be costlier than manipulating a single byte).

   But the row index might store arbitrarily long entries, and in that case the
   character-based logic might result in catastrophically bad performance.
   For reference: when writing a partition index, the total processing cost
   of a single node in the trie_writer is on the order of 800 instructions.
   Total processing cost of a single tiny partition during an `upgradesstables`
   operation is on the order of 10000 instructions. A small INSERT is on the
   order of 40000 instructions.

   So processing a single 1000-character clustering key in the trie_writer
   could cost as much as 20 INSERTs, which is scary. Even 100-character keys
   can be very expensive. With extremely long keys like that, the string-based
   logic is more than ~100x cheaper than character-based logic.
   (Note that only *new* characters matter here. If two index entries share a
   prefix, that prefix is only processed once. And the index is only populated
   with the minimal prefix needed to distinguish neighbours. So in practice,
   long chains might not happen often. But still, they are possible).

   I don't know if it makes sense to care about this case, but I figured the
   potential for problems is too big to ignore, so I switched to chain-based logic.

2. In the (assumed to be rare) case when a grouped subtree turns out to be bigger
   than a full page after revising the estimate, Cassandra splits it in a
   different way than us.
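
The "20 INSERTs" figure in difference 1 follows from the order-of-magnitude numbers quoted there, assuming one trie node per new character as in the character-based design. A worked check (the constant and function names are only for illustration):

```cpp
// Order-of-magnitude figures quoted in the text above.
constexpr long instructions_per_trie_node = 800;
constexpr long instructions_per_insert = 40000;

// Cost of a key with `new_characters` new characters, expressed in small
// INSERTs, when every new character becomes its own node.
constexpr long cost_in_inserts(long new_characters) {
    return new_characters * instructions_per_trie_node / instructions_per_insert;
}

static_assert(cost_in_inserts(1000) == 20);  // 800,000 instructions
static_assert(cost_in_inserts(100) == 2);    // 80,000 instructions
```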

For testability, there is some separation between the logic responsible
for turning a stream of keys into a stream of nodes, and the logic
responsible for turning a stream of nodes into a stream of bytes.
This commit only includes the first part. It doesn't implement the target
on-disk format yet.

The serialization logic is passed to trie_writer via a template parameter.

There is only one test added in this commit, which attempts to be exhaustive,
by testing all possible datasets up to some size. The run time of the test
grows exponentially with the parameter size. I picked a set of parameters
which runs fast enough while still being expressive enough to cover all
the logic. (I checked the code coverage). But I also tested it with greater parameters
on my own machine (and with DEVELOPER_BUILD enabled, which adds extra sanitization).

Refs scylladb/scylladb#19191

New functionality, no backporting needed.

Closes scylladb/scylladb#25154

* github.com:scylladb/scylladb:
  sstables: introduce trie_writer
  utils/bit_cast: add object_representation()
2025-08-01 20:23:24 +03:00
Nadav Har'El
186e6d3ce0 utils: add a scoped_item_list utility class
In a later patch, we'll want Alternator to maintain a list of ongoing
requests, and be able to list them when the system.clients table is
read. This patch introduces a new container, utils::scoped_item_list<T>,
that will help Alternator do that:

  1. Each request adds an item to the list, and receives a handle;
     When that handle goes out of scope the item is automatically
     deleted from the list.
  2. Also a method is provided for iterating over the list of items
     without risking a stall if the list is very long.

The new scoped_item_list<T> is heavily based on similar code that is
integrated inside generic_server.hh, which is used by CQL to similarly
maintain a list of active connections and their properties. However,
unfortunately that code is deeply integrated into the generic_server
class, and Alternator can't use generic_server because it uses Seastar's
HTTP server which isn't based on generic_server.

In contrast, the container defined in this patch is stand-alone and does
not depend on Alternator in any way. In a later patch in this series we
will modify generic_server to use the new scoped_item_list<> instead of
having that feature inside it.

The next patch is a unit test for the new class we are adding in this
patch.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-08-01 02:15:04 +03:00
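The handle-based design described in `utils: add a scoped_item_list utility class` can be sketched with `std::list`. This is not the real `utils::scoped_item_list<T>`: the class name `scoped_list`, its members and the plain `for_each` are assumptions, and the stall-free iteration mentioned in the commit message is not modelled.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>
#include <utility>

template <typename T>
class scoped_list {
    std::list<T> _items;
public:
    // RAII handle: erases its item from the list when it goes out of scope.
    // The list must outlive all of its handles.
    class handle {
        std::list<T>* _list;
        typename std::list<T>::iterator _it;
    public:
        handle(std::list<T>& list, typename std::list<T>::iterator it)
            : _list(&list), _it(it) {}
        handle(handle&& o) noexcept
            : _list(std::exchange(o._list, nullptr)), _it(o._it) {}
        handle(const handle&) = delete;
        handle& operator=(const handle&) = delete;
        handle& operator=(handle&&) = delete;
        ~handle() {
            if (_list) {
                _list->erase(_it);
            }
        }
        T& operator*() {
            assert(_list != nullptr);
            return *_it;
        }
    };

    // Adds an item and returns the handle that keeps it in the list.
    handle emplace(T item) {
        _items.push_back(std::move(item));
        return handle(_items, std::prev(_items.end()));
    }

    std::size_t size() const { return _items.size(); }

    // Visits every item currently in the list.
    template <typename Func>
    void for_each(Func func) const {
        for (const auto& item : _items) {
            func(item);
        }
    }
};
```

A request would hold the handle for its lifetime, so the list always reflects the requests that are currently in flight. `std::list` is used because erasing one element does not invalidate iterators held by other handles.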
Nadav Har'El
33476c7b06 utils: add "fatal" version of utils::on_internal_error()
utils::on_internal_error() is a wrapper for Seastar's on_internal_error()
which does not require a logger parameter - because it always uses one
logger ("on_internal_error"). Not needing a unique logger is especially
important when using on_internal_error() in a header file, where we
can't define a logger.

Seastar also has another similar function, on_fatal_internal_error(),
for which we forgot to implement a "utils" version (without a logger
parameter). This patch fixes that oversight.

In the next patch, we need to use on_fatal_internal_error() in a header
file, so the "utils" version will be useful. We will need the fatal
version because we will encounter an unexpected situation during server
destruction, and if we let the regular on_internal_error() just throw
an exception, we'll be left in an undefined state.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-08-01 02:15:04 +03:00
Avi Kivity
5c6c944797 managed_bytes: make empty managed_bytes constexpr friendly
Sprinkle constexpr where needed to make the default constructor,
move constructor, and destructor constexpr.

Add a test to verify.

This is needed to make a thread_local variable containing an
empty managed_bytes constinit, reducing thread-local guards.
2025-07-29 23:51:43 +03:00
Asias He
4a4fbae8f7 utils: Allow chunked_vector::erase to work with non-default-constructible type
This is needed for chunked_vector<frozen_mutation_fragment> in repair.
2025-07-29 13:43:17 +08:00
Botond Dénes
837424f7bb Merge 'Add Azure Key Provider for Encryption at Rest' from Nikos Dragazis
This PR introduces a new Key Provider to support Azure Key Vault as a Key Management System (KMS) for Encryption at Rest. The core design principle is the same as in the AWS and GCP key providers - an externally provided Vault key that is used to protect local data encryption keys (a process known as "key wrapping").

In more detail, this patch series consists of:
* Multiple Azure credential sources, offering a variety of authentication options (Service Principals, Managed Identities, environment variables, Azure CLI).
* The Azure host - the Key Vault endpoint bridge.
* The Azure Key Provider - the interface for the Azure host.
* Unit tests using real Azure resources (credentials and Vault keys).
* Log filtering logic to not expose sensitive data in the logs (plaintext keys, credentials, access tokens).

This is part of the overall effort to support Azure deployments.

Testing done:
* Unit tests.
* Manual test on an Azure VM with a Managed Identity.
* Manual test with credentials from Azure CLI.
* Manual test of `--azure-hosts` cmdline option.
* Manual test of log filtering.

Remaining items:
- [x] Create necessary Azure resources for CI.
- [x] Merge pipeline changes (https://github.com/scylladb/scylla-pkg/pull/5201).

Closes https://github.com/scylladb/scylla-enterprise/issues/1077.

New feature. No backport is needed.

Closes scylladb/scylladb#23920

* github.com:scylladb/scylladb:
  docs: Document the Azure Key Provider
  test: Add tests for Azure Key Provider
  pylib: Add mock server for Azure Key Vault
  encryption: Define and enable Azure Key Provider
  encryption: azure: Delegate hosts to shard 0
  encryption: Add Azure host cache
  encryption: Add config options for Azure hosts
  encryption: azure: Add override options
  encryption: azure: Add retries for transient errors
  encryption: azure: Implement init()
  encryption: azure: Implement get_key_by_id()
  encryption: azure: Add id-based key cache
  encryption: azure: Implement get_or_create_key()
  encryption: azure: Add credentials in Azure host
  encryption: azure: Add attribute-based key cache
  encryption: azure: Add skeleton for Azure host
  encryption: Templatize get_{kmip,kms,gcp}_host()
  encryption: gcp: Fix typo in docstring
  utils: azure: Get access token with default credentials
  utils: azure: Get access token from Azure CLI
  utils: azure: Get access token from IMDS
  utils: azure: Get access token with SP certificate
  utils: azure: Get access token with SP secret
  utils: rest: Add interface for request/response redaction logic
  utils: azure: Declare all Azure credential types
  utils: azure: Define interface for Azure credentials
  utils: Introduce base64url_{encode,decode}
2025-07-25 10:45:32 +03:00
Michał Chojnowski
0ca983ea91 utils/bit_cast: add object_representation()
A utility that casts a trivial object to the span of its bytes.
2025-07-23 17:03:05 +02:00
Patryk Jędrzejczak
ec69028907 db/config, utils: allow using UUID as a config option
We change the `recovery_leader` option to UUID in the following commit.
2025-07-23 15:36:45 +02:00
Pavel Emelyanov
295165d8ea Merge 's3_client: Enhance s3_client error handling' from Ernest Zaslavsky
Enhance and fix error handling in the `chunked_download_source` to prevent errors seeping from the request callback. Also stop retrying on Seastar's side, since retrying breaks the integrity of the data, which may be downloaded more than once for the same range.

Fixes: https://github.com/scylladb/scylladb/issues/25043

Should be backported to 2025.3, since we intend to release the native backup/restore feature.

Closes scylladb/scylladb#24883

* github.com:scylladb/scylladb:
  s3_client: Disable Seastar-level retries in HTTP client creation
  s3_test: Validate handling of non-`aws_error` exceptions
  s3_client: Improve error handling in chunked_download_source
  aws_error: Add factory method for `aws_error` from exception
2025-07-22 10:40:39 +03:00
Ernest Zaslavsky
fc2c9dd290 s3_client: Disable Seastar-level retries in HTTP client creation
Prevent Seastar from retrying HTTP requests to avoid buffer double-feed
issues when an entire request is retried. This could cause data
corruption in `chunked_download_source`. The change is global for every
instance of `s3_client`, but it is still safe because:
* Seastar's `http_client` resets connections regardless of retry behavior
* `s3_client` retry logic handles all error types—exceptions, HTTP errors,
  and AWS-specific errors—via `http_retryable_client`
2025-07-21 17:03:23 +03:00
Ernest Zaslavsky
ba910b29ce s3_test: Validate handling of non-aws_error exceptions
Inject exceptions not wrapped in `aws_error` from request callback
lambda to verify they are properly caught and handled.
2025-07-21 16:52:43 +03:00
Ernest Zaslavsky
b7ae6507cd s3_client: Improve error handling in chunked_download_source
Create aws_error from raised exceptions when possible and respond
appropriately. Previously, non-aws_exception types leaked from the
request handler and were treated as non-retryable, causing potential
data corruption during download.
2025-07-21 16:49:47 +03:00
Ernest Zaslavsky
d53095d72f aws_error: Add factory method for aws_error from exception
Move `aws_error` creation logic out of `retryable_http_client` and
into the `aws_error` class to support reuse across components.
2025-07-21 16:42:44 +03:00
Ernest Zaslavsky
408aa289fe treewide: Move misc files to utils directory
As requested in #22114, moved the files and fixed other includes and build system.

Moved files:
- interval.hh
- map_difference.hh

Fixes: #22114

This is a cleanup, no need to backport

Closes scylladb/scylladb#25095
2025-07-21 11:56:40 +03:00
Avi Kivity
3dfdcf7d7a Merge 'transport: remove throwing protocol_exception on connection start' from Dario Mirovic
`protocol_exception` is thrown in several places. This has become a performance issue, especially when starting/restarting a server. To alleviate this issue, throwing the exception has to be replaced with returning it as a result or an exceptional future.

This PR replaces throws in the `transport/server` module. This is achieved by using result_with_exception, and in some places, where suitable, just by creating and returning an exceptional future.

There are four commits in this PR. The first commit introduces tests in `test/cqlpy`. The second commit refactors transport server `handle_error` to not rethrow exceptions. The third commit refactors reusable buffer writer callbacks. The fourth commit replaces throwing `protocol_exception` with returning it.

Based on the comments on an issue linked in https://github.com/scylladb/scylladb/issues/24567, the main culprit from the side of protocol exceptions is the invalid protocol version one, so I tested that exception for performance.

In order to see if there is a measurable difference, a modified version of the `test_protocol_version_mismatch` Python test is used, with 100'000 runs across 10 processes (not threads, to avoid the Python GIL). One test run consisted of 1 warm-up run and 5 measured runs. The first test run was executed on the current code, with throwing protocol exceptions. The second test run was executed on the new code, with returning protocol exceptions. The performance report is in https://github.com/scylladb/scylladb/pull/24738#issuecomment-3051611069. It shows ~10% gains in real, user, and sys time for this test.

Testing

Build: `release`

Test file: `test/cqlpy/test_protocol_exceptions.py`
Test name: `test_protocol_version_mismatch` (modified for mass connection requests)

Test arguments:
```
max_attempts=100'000
num_parallel=10
```

Throwing `protocol_exception` results:
```
real=1:26.97  user=10:00.27  sys=2:34.55  cpu=867%
real=1:26.95  user=9:57.10  sys=2:32.50  cpu=862%
real=1:26.93  user=9:56.54  sys=2:35.59  cpu=865%
real=1:26.96  user=9:54.95  sys=2:32.33  cpu=859%
real=1:26.96  user=9:53.39  sys=2:33.58  cpu=859%

real=1:26.95 user=9:56.85 sys=2:34.11 cpu=862%   # average
```

Returning `protocol_exception` as `result_with_exception` or an exceptional future:
```
real=1:18.46  user=9:12.21  sys=2:19.08  cpu=881%
real=1:18.44  user=9:04.03  sys=2:17.91  cpu=869%
real=1:18.47  user=9:12.94  sys=2:19.68  cpu=882%
real=1:18.49  user=9:13.60  sys=2:19.88  cpu=883%
real=1:18.48  user=9:11.76  sys=2:17.32  cpu=878%

real=1:18.47 user=9:10.91 sys=2:18.77 cpu=879%   # average
```

This PR replaced `transport/server` throws of `protocol_exception` with returns. There are a few other places where protocol exceptions are thrown, and there are many places where `invalid_request_exception` is thrown. That is out of scope of this single PR, so the PR just refs, and does not resolve issue #24567.

Refs: #24567

This PR improves performance in cases when protocol exceptions happen, for example during connection storms. It will require backporting.

Closes scylladb/scylladb#24738

* github.com:scylladb/scylladb:
  test/cqlpy: add cpp exception metric test conditions
  transport/server: replace protocol_exception throws with returns
  utils/reusable_buffer: accept non-throwing writer callbacks via result_with_exception
  transport/server: avoid exception-throw overhead in handle_error
  test/cqlpy: add protocol_exception tests
2025-07-20 17:42:30 +03:00
Dario Mirovic
9f4344a435 utils/reusable_buffer: accept non-throwing writer callbacks via result_with_exception
Make make_bytes_ostream and make_fragmented_temporary_buffer accept
writer callbacks that return utils::result_with_exception instead of
forcing them to throw on error. This lets callers propagate failures
by returning an error result rather than throwing an exception.

Introduce buffer_writer_for, bytes_ostream_writer, and fragmented_buffer_writer
concepts to simplify and document the template requirements on writer callbacks.

This patch does not modify the actual callbacks passed, except for the syntax
changes needed for successful compilation, without changing the logic.

Refs: #24567
2025-07-17 16:40:02 +02:00
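The idea of accepting both throwing and result-returning writer callbacks can be sketched in plain C++17. This is an illustration only: `with_buffer` and `write_error` are hypothetical, `std::optional<write_error>` stands in for `utils::result_with_exception`, and a `std::vector<char>` stands in for the real buffer types.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical error type carried in the result instead of being thrown.
struct write_error {
    std::string message;
};

// Runs `writer` on `out`. Returns std::nullopt on success, or the error
// reported by the callback.
template <typename Writer>
std::optional<write_error> with_buffer(std::vector<char>& out, Writer&& writer) {
    using ret = std::invoke_result_t<Writer&, std::vector<char>&>;
    if constexpr (std::is_void_v<ret>) {
        // Legacy callback: returns nothing and reports errors by throwing.
        writer(out);
        return std::nullopt;
    } else {
        // New-style callback: returns an error result instead of throwing.
        static_assert(std::is_convertible_v<ret, std::optional<write_error>>);
        return writer(out);
    }
}
```

A caller that detects a protocol error can then `return write_error{...}` from its callback, and the failure reaches the caller of `with_buffer` as a value, without the cost of throwing and catching an exception.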
Botond Dénes
fd6877c654 Merge 'alternator: avoid oversized allocation in Query/Scan' from Nadav Har'El
This series fixes one cause of oversized allocations - and therefore potentially stalls and increased tail latencies - in Alternator.

The first patch in the series is the main fix. The later patches are cleanups requested by reviewers that also touch other pre-existing code, so I did those cleanups as separate patches.

Alternator's Scan or Query operation returns a page of results. When the number of items is not limited by a "Limit" parameter, the default is to return a 1 MB page. If items are short, a large number of them can fit in that 1MB. The test test_query.py::test_query_large_page_small_rows has 30,000 items returned in a single page.

In the response JSON, all these items are returned in a single array "Items". Before this patch, we build the full response as a RapidJSON object before sending it. The problem is that unfortunately, RapidJSON stores arrays as contiguous allocations. This results in large contiguous allocations in workloads that scan many small items, and large contiguous allocations can also cause stalls and high tail latencies. For example, before this patch, running

    test/alternator/run --runveryslow \
        test_query.py::test_query_large_page_small_rows

reports in the log:

    oversized allocation: 573440 bytes.

After this patch, this warning no longer appears.
The patch solves the problem by collecting the scanned items not in a RapidJSON array, but rather in a chunked_vector<rjson::value>, i.e., a chunked (non-contiguous) array of items (each a JSON value). After collecting this array separately from the response object, we need to print its content without actually inserting it into the object - we add a new function print_with_extra_array() to do that.

The new separate-chunked-vector technique is used when a large number (currently, >256) of items were scanned. When there is a smaller number of items in a page (this is typical when each item is longer), we just insert those items in the object and print it as before.

Beyond the original slow test that demonstrated the oversized allocation (which is now gone), this patch also includes a new test which exercises the new code with a scan of 700 (>256) items in a page - but this new test is fast enough to be permanently in our test suite and not a manual "veryslow" test as the other test.

Fixes #23535

The stalls caused by large allocations were seen by actual users, so it makes sense to backport this patch. On the other hand, the patch, while not big, is fairly intrusive (it modifies the normal Scan and Query path, and the later patches also clean up additional code), so there is some small risk involved in the backport.

Closes scylladb/scylladb#24480

* github.com:scylladb/scylladb:
  alternator: clean up by co-routinizing
  alternator: avoid spamming the log when failing to write response
  alternator: clean up and simplify request_return_type
  alternator: avoid oversized allocation in Query/Scan
2025-07-17 11:30:40 +03:00
Ernest Zaslavsky
342e94261f s3_client: parse multipart response XML defensively
Ensure robust handling of XML responses when initiating multipart
uploads. Check for the existence of required nodes before access,
and throw an exception if the XML is empty or malformed.

Refs: https://github.com/scylladb/scylladb/issues/24676

Closes scylladb/scylladb#24990
2025-07-17 10:55:04 +03:00
Nikos Dragazis
eec49c4d78 utils: azure: Get access token with default credentials
Attempt to detect credentials from the system.

Inspired by the `DefaultAzureCredential` in the Azure C++ SDK, this
credential type detects credentials from the following sources (in this
order):

* environment variables (SP credentials - same variables as in Azure C++ SDK)
* Azure CLI
* IMDS

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-07-16 17:14:08 +03:00
Nikos Dragazis
937d6261c0 utils: azure: Get access token from Azure CLI
Implement token request with Azure CLI.

Inspired by the Azure C++ SDK's `AzureCliCredential`, this credential
type attempts to run the Azure CLI in a shell and parse the token from
its output. This is meant for development purposes, where a user has
already installed the Azure CLI and logged in with their user account.

Pass the following environment to the process:
* PATH
* HOME
* AZURE_CONFIG_DIR

Add a token factory to construct a token from the process output. Unlike
in Azure Entra and IMDS, the CLI's JSON output does not contain
'expires_in', and the token key is in camel case.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-07-16 17:14:08 +03:00
Nikos Dragazis
52a4bd83d5 utils: azure: Get access token from IMDS
Implement token request from IMDS.

No credentials are required for that - just a plain HTTP request on the
IMDS token endpoint.

Since the IMDS endpoint is a raw IP, it's not possible to reliably
determine whether IMDS is accessible or not (i.e., whether the node is
an Azure VM). Azure provides no node-local indication either. For lack of
a better choice, attempt to connect and declare failure if the
connection is not established within 3 seconds. Use a raw TCP socket for
this check, as the HTTP client currently lacks timeout or cancellation
support. Perform the check only once, during the first token refresh.

For the time being, do not support nodes with multiple user-assigned
managed identities. Expect the token request to fail in this case (IMDS
requires the identifier of the desired Managed Identity).

Add a token factory to correctly parse the HTTP response. This addresses
a discrepancy between token requests on IMDS and Azure Entra - the
'expires_in' field is a string in the former and an integer in the
latter.

Finally, implement a fail-fast retry policy for short-lived transient
errors.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-07-16 17:14:08 +03:00
Nikos Dragazis
919765fb7f utils: azure: Get access token with SP certificate
Implement token request for Service Principals with a certificate.

The request is the same as with a secret, except that the secret is
replaced with an assertion. The assertion is a JWT that is signed with
the certificate.

To be consistent with the Azure C++ SDK, expect the certificate and the
associated private key to be encoded in PEM format and be provided in a
single file.

The docs suggest using 'PS256' for the JWT's 'alg' claim. Since this is
not supported by our current JWT library (jwt-cpp), use 'RS256' instead.

The JWT also requires a unique identifier for the 'jti' claim. Use a
random UUID for that (it should suffice for our use cases).

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-07-16 17:14:08 +03:00
Nikos Dragazis
a671530af6 utils: azure: Get access token with SP secret
Implement token request for Service Principals with a secret.

The token request requires a TLS connection. When closing the
connection, do not wait for a response to the TLS `close_notify` alert.
Azure's OAuth server would ignore it and the Seastar `connected_socket`
would hang for 10 seconds.

Add log redaction logic to not expose sensitive data from the request
and response payloads.

Add a token factory to parse the HTTP response. This cannot be shared
with other credential types because the JSON format is not consistent.

Finally, implement a fail-fast retry policy for short-lived transient
errors.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-07-16 17:14:08 +03:00
Nikos Dragazis
66c8ffa9bf utils: rest: Add interface for request/response redaction logic
The rest http client, currently used by the AWS and GCP key providers,
logs the HTTP requests and responses unaltered. This causes some
sensitive data to be exposed (plaintext data encryption keys,
credentials, access tokens).

Add an interface to optionally redact any sensitive data from HTTP
headers and payloads.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-07-16 17:14:08 +03:00
Nikos Dragazis
0d0135dc4c utils: azure: Declare all Azure credential types
The goal is to mimic the Azure C++ SDK, which offers a variety of
credentials, depending on their type and source.

Declare the following credentials:
* Service Principal credentials
* Managed Identity credentials
* Azure CLI credentials
* Default credentials

Also, define a common exception for SP and MI credentials which are
network-based.

This patch only defines the API.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-07-16 17:14:08 +03:00
Nikos Dragazis
3c4face47b utils: azure: Define interface for Azure credentials
Azure authentication is token based - the client obtains an access token
with their credentials, and uses it as a bearer token to authorize
requests to Azure services.

Define a common API for all credential types. The API will consist of a
single `get_access_token()` function that will be returning a new or a
cached access token for some resource URI (defines token scope).

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-07-16 17:14:08 +03:00
Nikos Dragazis
57bc51342e utils: Introduce base64url_{encode,decode}
Add helpers for base64url encoding.

base64url is a variant of base64 that uses a URL-safe alphabet. It can
be constructed from base64 by replacing the '+' and '/' characters with
'-' and '_' respectively. Many implementations also strip the padding,
although this is not required by the spec [1].
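
The alphabet swap described above can be sketched as follows. This is only an illustration of the base64-to-base64url mapping, not the helpers added by this patch; the function names `base64_to_base64url` and `base64url_to_base64` and the `strip_padding` flag are hypothetical.

```cpp
#include <algorithm>
#include <string>

// Convert standard base64 text to base64url (RFC 4648, section 5):
// '+' becomes '-', '/' becomes '_'. Padding removal is optional.
std::string base64_to_base64url(std::string s, bool strip_padding = true) {
    std::replace(s.begin(), s.end(), '+', '-');
    std::replace(s.begin(), s.end(), '/', '_');
    if (strip_padding) {
        // In valid base64, '=' can only appear as trailing padding.
        s.erase(std::find(s.begin(), s.end(), '='), s.end());
    }
    return s;
}

// Convert base64url text back to standard base64, restoring any
// padding that was stripped so the length is a multiple of 4.
std::string base64url_to_base64(std::string s) {
    std::replace(s.begin(), s.end(), '-', '+');
    std::replace(s.begin(), s.end(), '_', '/');
    while (s.size() % 4 != 0) {
        s.push_back('=');
    }
    return s;
}
```

For example, the two bytes 0xfb 0xff encode to `+/8=` in base64 and to `-_8` in unpadded base64url.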

This will be used in upcoming patches for Azure Key Vault requests that
require base64url-encoded payloads.

[1] https://datatracker.ietf.org/doc/html/rfc4648#section-5

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-07-16 17:14:08 +03:00
Avi Kivity
c762425ea7 Merge 'auth: move passwords::check call to alien thread' from Andrzej Jackowski
Analysis of customer stalls revealed that the function `detail::hash_with_salt` (invoked by `passwords::check`) often blocks the reactor. Internally, this function uses the external `crypt_r` function to compute password hashes, which is CPU-intensive.

This PR addresses the issue in two ways:
1) `sha-512` is now the only password hashing scheme for new passwords (it was already the common case).
2) `passwords::check` is moved to a dedicated alien thread.

Regarding point 1: before this change, the following hashing schemes were supported by `identify_best_supported_scheme()`: bcrypt_y, bcrypt_a, SHA-512, SHA-256, and MD5. The reason for this was that the `crypt_r` function used for password hashing comes from an external library (currently `libxcrypt`), and the supported hashing algorithms vary depending on the library in use. However:
- The bcrypt schemes never worked properly because their prefixes lack the required round count (e.g. `$2y$` instead of `$2y$05$`). Moreover, bcrypt is slower than SHA-512, so it is not a good idea to fix or use it.
- SHA-256 and SHA-512 both belong to the SHA-2 family. Libraries that support one almost always support the other, so it’s very unlikely to find SHA-256 without SHA-512.
- MD5 is no longer considered secure for password hashing.

Regarding point 2: the `passwords::check` call now runs on a shared alien thread created at database startup. An `std::mutex` synchronizes that thread with the shards. In theory this could introduce frequent lock contention, but in practice each shard handles only a few hundred new connections per second, even during storms. In addition, the existing `_conns_cpu_concurrency_semaphore` in `generic_server` already limits the number of concurrent connection handlers.

Fixes https://github.com/scylladb/scylladb/issues/24524

Backport not needed, as it is a new feature.

Closes scylladb/scylladb#24924

* github.com:scylladb/scylladb:
  main: utils: add thread names to alien workers
  auth: move passwords::check call to alien thread
  test: wait for 3 clients with given username in test_service_level_api
  auth: refactor password checking in password_authenticator
  auth: make SHA-512 the only password hashing scheme for new passwords
  auth: whitespace change in identify_best_supported_scheme()
  auth: require scheme as parameter for `generate_salt`
  auth: check password hashing scheme support on authenticator start
2025-07-16 13:15:54 +03:00
Andrzej Jackowski
77a9b5919b main: utils: add thread names to alien workers
This commit adds a call to `pthread_setname_np` in
`alien_worker::spawn`, so each alien worker thread receives a
descriptive name. This makes debugging, monitoring, and performance
analysis easier by allowing alien workers to be clearly identified
in tools such as `perf`.
2025-07-15 23:29:21 +02:00
Nadav Har'El
2385fba4b6 alternator: avoid oversized allocation in Query/Scan
This patch fixes one cause of oversized allocations - and therefore
potentially stalls and increased tail latencies - in Alternator.

Alternator's Scan and Query operations return a page of results. When the
number of items is not limited by a "Limit" parameter, the default is
to return a 1 MB page. If items are short, a large number of them can
fit in that 1MB. The test test_query.py::test_query_large_page_small_rows
has 30,000 items returned in a single page.

In the response JSON, all these items are returned in a single array
"Items". Before this patch, we build the full response as a RapidJSON
object before sending it. The problem is that unfortunately, RapidJSON
stores arrays as contiguous allocations. This results in large
contiguous allocations in workloads that scan many small items, and
large contiguous allocations can also cause stalls and high tail
latencies. For example, before this patch, running

    test/alternator/run --runveryslow \
        test_query.py::test_query_large_page_small_rows

reports in the log:

    oversized allocation: 573440 bytes.

After this patch, this warning no longer appears.
The patch solves the problem by collecting the scanned items not in a
RapidJSON array, but rather in a chunked_vector<rjson::value>, i.e.,
a chunked (non-contiguous) array of items (each a JSON value).
After collecting this array separately from the response object, we
need to print its content without actually inserting it into the object -
we add a new function print_with_extra_array() to do that.

The new separate-chunked-vector technique is used when a large number
(currently, >256) of items were scanned. When there is a smaller number
of items in a page (this is typical when each item is longer), we just
insert those items in the object and print it as before.
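
To make the chunked approach concrete, here is a minimal sketch of a chunked container plus a printer that emits the array without building one contiguous array of items. It is not Scylla's chunked_vector or print_with_extra_array(); `chunked_items`, `print_items`, and the chunk size are hypothetical, and items are plain pre-serialized strings rather than rjson::value.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Stores items in fixed-size chunks, so no single allocation for the
// items themselves grows beyond ChunkSize elements.
template <typename T, std::size_t ChunkSize = 128>
class chunked_items {
    std::vector<std::vector<T>> _chunks;
    std::size_t _size = 0;
public:
    void push_back(T item) {
        if (_chunks.empty() || _chunks.back().size() == ChunkSize) {
            _chunks.emplace_back();
            _chunks.back().reserve(ChunkSize);
        }
        _chunks.back().push_back(std::move(item));
        ++_size;
    }
    std::size_t size() const { return _size; }
    std::size_t chunk_count() const { return _chunks.size(); }
    const T& operator[](std::size_t i) const {
        return _chunks[i / ChunkSize][i % ChunkSize];
    }
};

// Prints {"Items":[...]} by walking the chunks; each item is assumed
// to be an already-serialized JSON value.
template <std::size_t N>
std::string print_items(const chunked_items<std::string, N>& items) {
    std::string out = "{\"Items\":[";
    for (std::size_t i = 0; i < items.size(); ++i) {
        if (i != 0) {
            out += ',';
        }
        out += items[i];
    }
    out += "]}";
    return out;
}
```

In this sketch the output string is still contiguous; the point illustrated is only that the collected items never live in one large array.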

Beyond the original slow test that demonstrated the oversized allocation
(which is now gone), this patch also includes a new test which
exercises the new code with a scan of 700 (>256) items in a page -
but this new test is fast enough to be permanently in our test suite
and not a manual "veryslow" test as the other test.

Fixes #23535
2025-07-14 18:41:34 +03:00
Benny Halevy
0e455c0d45 utils: clear_gently: add support for sets
Since set and unordered_set do not allow modifying
their stored object in place, we need to first extract
each object, clear it gently, and only then destroy it.

To achieve that, introduce a new Extractable concept,
that extracts all items in a loop and calls clear_gently
on each extracted item, until the container is empty.
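
The extraction loop can be sketched with the standard C++17 node-handle API. This is an illustration only: `drain_by_extraction` is a hypothetical name, the per-item cleanup is passed in as a callback, and a real implementation would be asynchronous and yield between items.

```cpp
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

// Repeatedly extracts the first node from the set and hands the item,
// now mutable, to clear_item. Returns the number of items processed.
template <typename Set, typename Fn>
std::size_t drain_by_extraction(Set& s, Fn&& clear_item) {
    std::size_t n = 0;
    while (!s.empty()) {
        // extract() unlinks the node without destroying the element.
        auto node = s.extract(s.begin());
        // Unlike *s.begin(), node.value() is a non-const reference.
        clear_item(node.value());
        ++n;
        // The node, and with it the cleared element, is destroyed here.
    }
    return n;
}
```

The same loop works for std::unordered_set, since both containers provide extract() and a node handle with a mutable value().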

Add respective unit tests for set and unordered_set.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#24608
2025-07-13 12:30:45 +03:00
Pawel Pery
8d3c33f74a utils: refactor sequential_producer as abortable
This patch is part of the vector_store_client sharded service
implementation for communication with the vector-store service.

There is a need for an abortable sequential_producer operator(). The
existing operator() is changed to accept a timeout argument with a
default of time_point::max() (matching current default usage), and a
new operator() overload is added that takes an abort_source parameter.

Reference: VS-47
2025-07-08 16:29:55 +02:00
Yaniv Michael Kaul
82fba6b7c0 PowerPC: remove ppc stuff
We don't even compile-test it.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#24659
2025-07-08 10:38:23 +03:00