Commit Graph

1968 Commits

Author SHA1 Message Date
Avi Kivity
0900a88884 Merge 'auth: move passwords::check call to alien thread' from Andrzej Jackowski
Analysis of customer stalls revealed that the function `detail::hash_with_salt` (invoked by `passwords::check`) often blocks the reactor. Internally, this function uses the external `crypt_r` function to compute password hashes, which is CPU-intensive.

This PR addresses the issue in two ways:
1) `sha-512` is now the only password hashing scheme for new passwords (it was already the common-case).
2) `passwords::check` is moved to a dedicated alien thread.

Regarding point 1: before this change, the following hashing schemes were supported by     `identify_best_supported_scheme()`: bcrypt_y, bcrypt_a, SHA-512, SHA-256, and MD5. The reason for this was that the `crypt_r` function used for password hashing comes from an external library (currently `libxcrypt`), and the supported hashing algorithms vary depending on the library in use. However:
- The bcrypt schemes never worked properly because their prefixes lack the required round count (e.g. `$2y$` instead of `$2y$05$`). Moreover, bcrypt is slower than SHA-512, so it  not good idea to fix or use it.
- SHA-256 and SHA-512 both belong to the SHA-2 family. Libraries that support one almost always support the other, so it’s very unlikely to find SHA-256 without SHA-512.
- MD5 is no longer considered secure for password hashing.

Regarding point 2: the `passwords::check` call now runs on a shared alien thread created at database startup. An `std::mutex` synchronizes that thread with the shards. In theory this could introduce a frequent lock contention, but in practice each shard handles only a few hundred new connections per second—even during storms. There is already `_conns_cpu_concurrency_semaphore` in `generic_server` limits the number of concurrent connection handlers.

Fixes https://github.com/scylladb/scylladb/issues/24524

Backport not needed, as it is a new feature.

Closes scylladb/scylladb#24924

* github.com:scylladb/scylladb:
  main: utils: add thread names to alien workers
  auth: move passwords::check call to alien thread
  test: wait for 3 clients with given username in test_service_level_api
  auth: refactor password checking in password_authenticator
  auth: make SHA-512 the only password hashing scheme for new passwords
  auth: whitespace change in identify_best_supported_scheme()
  auth: require scheme as parameter for `generate_salt`
  auth: check password hashing scheme support on authenticator start

(cherry picked from commit c762425ea7)
2025-09-07 13:38:33 +03:00
Nadav Har'El
5d6aa6e8c2 utils, alternator: fix detection of invalid base-64
This patch fixes an error-path bug in the base-64 decoding code in
utils/base64.cc, which among other things is used in Alternator to decode
blobs in JSON requests.

The base-64 decoding code has a lookup table, which was wrongly sized 255
bytes, but needed to be 256 bytes. This meant that if the byte 255 (0xFF)
was included in an invalid base-64 string, instead of detecting that this
is an invalid byte (since the only valid bytes in a base-64 string are
A-Z,a-z,0-9,+,/ and =), the code would either think it's valid with a
nonsense 6-bit part, or even crash on an out-of-bounds read.

Besides the trivial fix, this patch also includes a reproducing test,
which tries to write a blob as a supposedly base-64 encoded string with
a 0xFF byte in it. The test fails before this patch (the write succeeds,
unexpectedly), and passes after this patch (the write fails as
expected). The test also passes on DynamoDB.

Fixes #25701

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#25705

(cherry picked from commit ff91027eac)

Closes scylladb/scylladb#25767
2025-09-04 11:38:55 +03:00
Ernest Zaslavsky
8a017834a0 s3_client: add memory fallback in chunked_download_source
Introduce fallback logic in `chunked_download_source` to handle
memory exhaustion. When memory is low, feed the `deque` with only
one uncounted buffer at a time. This allows slow but steady progress
without getting stuck on the memory semaphore.

Fixes: https://github.com/scylladb/scylladb/issues/25453
Fixes: https://github.com/scylladb/scylladb/issues/25262

Closes scylladb/scylladb#25452

(cherry picked from commit dd51e50f60)

Closes scylladb/scylladb#25511
2025-08-15 13:30:38 +03:00
Ernest Zaslavsky
c70ba8384e s3_client: make memory semaphore acquisition abortable
Add `abort_source` to the `get_units` call for the memory semaphore
in the S3 client, allowing the acquisition process to be aborted.

Fixes: https://github.com/scylladb/scylladb/issues/25454

Closes scylladb/scylladb#25469

(cherry picked from commit 380c73ca03)

Closes scylladb/scylladb#25499
2025-08-14 10:34:28 +02:00
Patryk Jędrzejczak
1863386bc8 Merge '[Backport 2025.3] Raft-based recovery procedure: simplify rolling restart with recovery_leader' from Scylladb[bot]
The following steps are performed in sequence as part of the
Raft-based recovery procedure:
- set `recovery_leader` to the host ID of the recovery leader in
  `scylla.yaml` on all live nodes,
- send the `SIGHUP` signal to all Scylla processes to reload the config,
- perform a rolling restart (with the recovery leader being restarted
  first).

These steps are not intuitive and more complicated than they could be.

In this PR, we simplify these steps. From now on, we will be able to
simply set `recovery_leader` on each node just before restarting it.

Apart from making necessary changes in the code, we also update all
tests of the Raft-based recovery procedure and the user-facing
documentation.

Fixes scylladb/scylladb#25015

The Raft-based procedure was added in 2025.2. This PR makes the
procedure simpler and less error-prone, so it should be backported
to 2025.2 and 2025.3.

- (cherry picked from commit ec69028907)

- (cherry picked from commit 445a15ff45)

- (cherry picked from commit 23f59483b6)

- (cherry picked from commit ba5b5c7d2f)

- (cherry picked from commit 9e45e1159b)

- (cherry picked from commit f408d1fa4f)

Parent PR: #25032

Closes scylladb/scylladb#25335

* https://github.com/scylladb/scylladb:
  docs: document the option to set recovery_leader later
  test: delay setting recovery_leader in the recovery procedure tests
  gossip: add recovery_leader to gossip_digest_syn
  db: system_keyspace: peers_table_read_fixup: remove rows with null host_id
  db/config, gms/gossiper: change recovery_leader to UUID
  db/config, utils: allow using UUID as a config option
2025-08-07 09:58:13 +02:00
Ernest Zaslavsky
88779a6884 s3_creds: code cleanup
Remove unnecessary code which is no more used

(cherry picked from commit 837475ec6f)
2025-08-06 00:50:36 +00:00
Ernest Zaslavsky
ccdc98c8f0 s3_creds: Make reload unconditional
Assume that any caller invoking `reload` intends to refresh credentials.
Remove conditional logic that checks for expiration before reloading.

(cherry picked from commit e4ebe6a309)
2025-08-06 00:50:36 +00:00
Avi Kivity
a8193bd503 Merge '[Backport 2025.3] transport: remove throwing protocol_exception on connection start' from Dario Mirovic
`protocol_exception` is thrown in several places. This has become a performance issue, especially when starting/restarting a server. To alleviate this issue, throwing the exception has to be replaced with returning it as a result or an exceptional future.

This PR replaces throws in the `transport/server` module. This is achieved by using result_with_exception, and in some places, where suitable, just by creating and returning an exceptional future.

There are four commits in this PR. The first commit introduces tests in `test/cqlpy`. The second commit refactors transport server `handle_error` to not rethrow exceptions. The third commit refactors reusable buffer writer callbacks. The fourth commit replaces throwing `protocol_exception` to returning it.

Based on the comments on an issue linked in https://github.com/scylladb/scylladb/issues/24567, the main culprit from the side of protocol exceptions is the invalid protocol version one, so I tested that exception for performance.

In order to see if there is a measurable difference, a modified version of `test_protocol_version_mismatch` Python is used, with 100'000 runs across 10 processes (not threads, to avoid Python GIL). One test run consisted of 1 warm-up run and 5 measured runs. First test run has been executed on the current code, with throwing protocol exceptions. Second test urn has been executed on the new code, with returning protocol exceptions. The performance report is in https://github.com/scylladb/scylladb/pull/24738#issuecomment-3051611069. It shows ~10% gains in real, user, and sys time for this test.

Testing

Build: `release`

Test file: `test/cqlpy/test_protocol_exceptions.py`
Test name: `test_protocol_version_mismatch` (modified for mass connection requests)

Test arguments:
```
max_attempts=100'000
num_parallel=10
```

Throwing `protocol_exception` results:
```
real=1:26.97  user=10:00.27  sys=2:34.55  cpu=867%
real=1:26.95  user=9:57.10  sys=2:32.50  cpu=862%
real=1:26.93  user=9:56.54  sys=2:35.59  cpu=865%
real=1:26.96  user=9:54.95  sys=2:32.33  cpu=859%
real=1:26.96  user=9:53.39  sys=2:33.58  cpu=859%

real=1:26.95 user=9:56.85 sys=2:34.11 cpu=862%   # average
```

Returning `protocol_exception` as `result_with_exception` or an exceptional future:
```
real=1:18.46  user=9:12.21  sys=2:19.08  cpu=881%
real=1:18.44  user=9:04.03  sys=2:17.91  cpu=869%
real=1:18.47  user=9:12.94  sys=2:19.68  cpu=882%
real=1:18.49  user=9:13.60  sys=2:19.88  cpu=883%
real=1:18.48  user=9:11.76  sys=2:17.32  cpu=878%

real=1:18.47 user=9:10.91 sys=2:18.77 cpu=879%   # average
```

This PR replaced `transport/server` throws of `protocol_exception` with returns. There are a few other places where protocol exceptions are thrown, and there are many places where `invalid_request_exception` is thrown. That is out of scope of this single PR, so the PR just refs, and does not resolve issue #24567.

Refs: #24567
Fixes: #25271

This PR improves performance in cases when protocol exceptions happen, for example during connection storms. It will require backporting.

* (cherry picked from commit 7aaeed012e)

* (cherry picked from commit 30d424e0d3)

* (cherry picked from commit 9f4344a435)

* (cherry picked from commit 5390f92afc)

* (cherry picked from commit 4a6f71df68)

Parent PR: #24738

Closes scylladb/scylladb#25117

* github.com:scylladb/scylladb:
  test/cqlpy: add cpp exception metric test conditions
  transport/server: replace protocol_exception throws with returns
  utils/reusable_buffer: accept non-throwing writer callbacks via result_with_exception
  transport/server: avoid exception-throw overhead in handle_error
  test/cqlpy: add protocol_exception tests
2025-08-05 14:16:14 +03:00
Patryk Jędrzejczak
d18d2fa0cf db/config, utils: allow using UUID as a config option
We change the `recovery_leader` option to UUID in the following commit.

(cherry picked from commit ec69028907)
2025-08-05 10:59:39 +00:00
Dario Mirovic
1078a1f03a utils/reusable_buffer: accept non-throwing writer callbacks via result_with_exception
Make make_bytes_ostream and make_fragmented_temporary_buffer accept
writer callbacks that return utils::result_with_exception instead of
forcing them to throw on error. This lets callers propagate failures
by returning an error result rather than throwing an exception.

Introduce buffer_writer_for, bytes_ostream_writer, and fragmented_buffer_writer
concepts to simplify and document the template requirements on writer callbacks.

This patch does not modify the actual callbacks passed, except for the syntax
changes needed for successful compilation, without changing the logic.

Refs: #24567
Fixes: #25271
(cherry picked from commit 9f4344a435)
2025-07-30 21:35:15 +02:00
Pavel Emelyanov
99f328b7a7 Merge '[Backport 2025.3] s3_client: Enhance s3_client error handling' from Scylladb[bot]
Enhance and fix error handling in the `chunked_download_source` to prevent errors seeping from the request callback. Also stop retrying on seastar's side since it is going to break the integrity of data which maybe downloaded more than once for the same range.

Fixes: https://github.com/scylladb/scylladb/issues/25043

Should be backported to 2025.3 since we have an intention to release native backup/restore feature

- (cherry picked from commit d53095d72f)

- (cherry picked from commit b7ae6507cd)

- (cherry picked from commit ba910b29ce)

- (cherry picked from commit fc2c9dd290)

Parent PR: #24883

Closes scylladb/scylladb#25137

* github.com:scylladb/scylladb:
  s3_client: Disable Seastar-level retries in HTTP client creation
  s3_test: Validate handling of non-`aws_error` exceptions
  s3_client: Improve error handling in chunked_download_source
  aws_error: Add factory method for `aws_error` from exception
2025-07-29 14:42:45 +03:00
Nadav Har'El
b7da50d781 alternator: avoid oversized allocation in Query/Scan
This patch fixes one cause of oversized allocations - and therefore
potentially stalls and increased tail latencies - in Alternator.

Alternator's Scan or Query operation return a page of results. When the
number of items is not limited by a "Limit" parameter, the default is
to return a 1 MB page. If items are short, a large number of them can
fit in that 1MB. The test test_query.py::test_query_large_page_small_rows
has 30,000 items returned in a single page.

In the response JSON, all these items are returned in a single array
"Items". Before this patch, we build the full response as a RapidJSON
object before sending it. The problem is that unfortunately, RapidJSON
stores arrays as contiguous allocations. This results in large
contiguous allocations in workloads that scan many small items, and
large contiguous allocations can also cause stalls and high tail
latencies. For example, before this patch, running

    test/alternator/run --runveryslow \
        test_query.py::test_query_large_page_small_rows

reports in the log:

    oversized allocation: 573440 bytes.

After this patch, this warning no longer appears.
The patch solves the problem by collecting the scanned items not in a
RapidJSON array, but rather in a chunked_vector<rjson::value>, i.e,
a chunked (non-contiguous) array of items (each a JSON value).
After collecting this array separately from the response object, we
need to print its content without actually inserting it into the object -
we add a new function print_with_extra_array() to do that.

The new separate-chunked-vector technique is used when a large number
(currently, >256) of items were scanned. When there is a smaller number
of items in a page (this is typical when each item is longer), we just
insert those items in the object and print it as before.

Beyond the original slow test that demonstrated the oversized allocation
(which is now gone), this patch also includes a new test which
exercises the new code with a scan of 700 (>256) items in a page -
but this new test is fast enough to be permanently in our test suite
and not a manual "veryslow" test as the other test.

Fixes #23535

(cherry picked from commit 2385fba4b6)
2025-07-27 07:42:01 +00:00
Ernest Zaslavsky
e45852a595 s3_client: Disable Seastar-level retries in HTTP client creation
Prevent Seastar from retrying HTTP requests to avoid buffer double-feed
issues when an entire request is retried. This could cause data
corruption in `chunked_download_source`. The change is global for every
instance of `s3_client`, but it is still safe because:
* Seastar's `http_client` resets connections regardless of retry behavior
* `s3_client` retry logic handles all error types—exceptions, HTTP errors,
  and AWS-specific errors—via `http_retryable_client`

(cherry picked from commit fc2c9dd290)
2025-07-22 16:46:54 +00:00
Ernest Zaslavsky
fdf706a6eb s3_test: Validate handling of non-aws_error exceptions
Inject exceptions not wrapped in `aws_error` from request callback
lambda to verify they are properly caught and handled.

(cherry picked from commit ba910b29ce)
2025-07-22 16:46:53 +00:00
Ernest Zaslavsky
2bc3accf9c s3_client: Improve error handling in chunked_download_source
Create aws_error from raised exceptions when possible and respond
appropriately. Previously, non-aws_exception types leaked from the
request handler and were treated as non-retryable, causing potential
data corruption during download.

(cherry picked from commit b7ae6507cd)
2025-07-22 16:46:53 +00:00
Ernest Zaslavsky
0106d132bd aws_error: Add factory method for aws_error from exception
Move `aws_error` creation logic out of `retryable_http_client` and
into the `aws_error` class to support reuse across components.

(cherry picked from commit d53095d72f)
2025-07-22 16:46:53 +00:00
Ernest Zaslavsky
934359ea28 s3_client: parse multipart response XML defensively
Ensure robust handling of XML responses when initiating multipart
uploads. Check for the existence of required nodes before access,
and throw an exception if the XML is empty or malformed.

Refs: https://github.com/scylladb/scylladb/issues/24676

Closes scylladb/scylladb#24990

(cherry picked from commit 342e94261f)

Closes scylladb/scylladb#25057
2025-07-21 12:03:00 +02:00
Ernest Zaslavsky
873c8503cd s3_test: Add s3_client test for non-retryable error handling
Introduce a test that injects a non-retryable error and verifies
that the chunked download source throws an exception as expected.

(cherry picked from commit acf15eba8e)
2025-07-13 13:17:14 +00:00
Ernest Zaslavsky
7f303bfda3 s3_client: Fix edge case when the range is exhausted
Handle case where the download loop exits after consuming all data,
but before receiving an empty buffer signaling EOF. Without this, the
next request is sent with a non-zero offset and zero length, resulting
in "Range request cannot be satisfied" errors. Now, an empty buffer is
pushed to indicate completion and exit the fiber properly.

(cherry picked from commit 49e8c14a86)
2025-07-13 13:17:14 +00:00
Ernest Zaslavsky
22739df69f s3_client: Fix indentation in try..catch block
Correct indentation in the `try..catch` block to improve code
readability and maintain consistent formatting.

(cherry picked from commit e50f247bf1)
2025-07-13 13:17:14 +00:00
Ernest Zaslavsky
54db6ca088 s3_client: Stop retries in chunked download source
Disable retries for S3 requests in the chunked download source to
prevent duplicate chunks from corrupting the buffer queue. The
response handler now throws an exception to bypass the retry
strategy, allowing the next range to be attempted cleanly.

This exception is only triggered for retryable errors; unretryable
ones immediately halt further requests.

(cherry picked from commit d2d69cbc8c)
2025-07-13 13:17:14 +00:00
Ernest Zaslavsky
00f10e7f1d s3_client: Fix missing negation
Restore a missing `not` in a conditional check that caused
incorrect behavior during S3 client execution.

(cherry picked from commit 6d9cec558a)
2025-07-13 13:17:14 +00:00
Ernest Zaslavsky
4cd1792528 s3_client: Refine logging
Fix typo in log message to improve clarity and accuracy during
S3 operations.

(cherry picked from commit e73b83e039)
2025-07-13 13:17:14 +00:00
Ernest Zaslavsky
115e8c85e4 s3_client: Improve logging placement for current_range output
Relocated logging to occur after determining the `current_range`,
ensuring more relevant output during S3 client operations.

(cherry picked from commit f1d0690194)
2025-07-13 13:17:14 +00:00
Michał Chojnowski
1bd536a228 utils/alien_worker: fix a data race in submit()
We move a `seastar::promise` on the external worker thread,
after the matching `seastar::future` was returned to the shard.

That's illegal. If the `promise` move occurs concurrently with some
operation (move, await) on the `future`, it becomes a data race
which could cause various kinds of corruption.

This patch fixes that by keeping the promise at a stable address
on the shard (inside a coroutine frame) and only passing through
the worker.

Fixes #24751

Closes scylladb/scylladb#24752

(cherry picked from commit a29724479a)

Closes scylladb/scylladb#24780
2025-07-03 10:45:51 +03:00
Lakshmi Narayanan Sreethar
279253ffd0 utils/big_decimal: fix scale overflow when parsing values with large exponents
The exponent of a big decimal string is parsed as an int32, adjusted for
the removed fractional part, and stored as an int32. When parsing values
like `1.23E-2147483647`, the unscaled value becomes `123`, and the scale
is adjusted to `2147483647 + 2 = 2147483649`. This exceeds the int32
limit, and since the scale is stored as an int32, it overflows and wraps
around, losing the value.

This patch fixes that the by parsing the exponent as an int64 value and
then adjusting it for the fractional part. The adjusted scale is then
checked to see if it is still within int32 limits before storing. An
exception is thrown if it is not within the int32 limits.

Note that strings with exponents that exceed the int32 range, like
`0.01E2147483650`, were previously not parseable as a big decimal. They
are now accepted if the final adjusted scale fits within int32 limits.
For the above value, unscaled_value = 1 and scale = -2147483648, so it
is now accepted. This is in line with how Java's `BigDecimal` parses
strings.

Fixes: #24581

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#24640
2025-06-26 15:29:28 +03:00
Szymon Malewski
f28bab741d utils/exceptions.cc: Added check for exceptions::request_timeout_exception in is_timeout_exception function.
It solves the issue, where in some cases a timeout exceptions in CAS operations are logged incorrectly as a general failure.

Fixes #24591

Closes scylladb/scylladb#24619
2025-06-26 12:25:38 +02:00
Marcin Maliszkiewicz
45392ac29e utils: don't allow do discard updateable_value observer
If the object returned from observe() is destructured,
it stops observing, potentially causing subtle bugs.
Typically, the observer object is retained as a class member.
2025-06-23 17:54:01 +02:00
Pavel Emelyanov
dc166be663 s3: Mark claimed_buffer constructor noexcept
It just std::move-s a buffer and a semaphore_units objects, both moves
are noexcept, so is the constructor itself.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#24552
2025-06-18 20:36:45 +03:00
Pavel Emelyanov
b0766d1e73 Merge 's3_client: Refactor range class for state validation' from Ernest Zaslavsky
Revamped the `range` class to actively manage its state by enforcing validation on all modifications. This prevents overflow, invalid states, and ensures the object size does not exceed the 5TiB limit in S3. This should address and prevent future problems related to this issue https://github.com/minio/minio/issues/21333

No backport needed since this problem related only to this change https://github.com/scylladb/scylladb/pull/23880

Closes scylladb/scylladb#24312

* github.com:scylladb/scylladb:
  s3_client: headers cleanup
  s3_client: Refactor `range` class for state validation
2025-06-17 10:34:55 +03:00
Ernest Zaslavsky
e398576795 s3_client: Fix hang in get() on EOF by signaling condition variable
* Ensure _get_cv.signal() is called when an empty buffer received
* Prevents `get()` from stalling indefinitely while waiting on EOF
* Found when testing https://github.com/scylladb/scylladb/pull/23695

Closes scylladb/scylladb#24490
2025-06-17 10:33:19 +03:00
Calle Wilund
4a98c258f6 http: Add missing thread_local specifier for static
Refs #24447

Patch adding this somehow managed to leave out the thread_local
specifier. While gnutls cert object can be shared across shards
just fine, the actual shared_ptr here cannot, thus we could
cause memory errors.

Closes scylladb/scylladb#24514
2025-06-17 10:23:52 +03:00
Ernest Zaslavsky
1b20e0be4a s3_client: headers cleanup 2025-06-16 16:02:30 +03:00
Ernest Zaslavsky
9ad7a456fe s3_client: Refactor range class for state validation
Revamped the `range` class to actively manage its state by enforcing validation on all modifications. This prevents overflow, invalid states, and ensures the object size does not exceed the 5TiB limit in S3.
2025-06-16 16:02:24 +03:00
Ernest Zaslavsky
2b300c8eb9 s3_client: Improve reporting of S3 client statistics
Revise how we report statistics for `chunked_download_source`. Ensure
metrics for downloaded but unconsumed data are visible, as they do not
contribute to read amplification, which is tracked separately.

Closes scylladb/scylladb#24491
2025-06-16 09:33:57 +03:00
Ernest Zaslavsky
30199552ac s3_client: Mitigate connection exhaustion in download_source
The existing `download_source` implementation optimizes performance
by keeping the connection to S3 open and draining data directly from
the socket. While this eliminates the overhead (60-100ms) of repeatedly
establishing new connections, it leads to rapid exhaustion of client-
side connections.

On a single shard, two `mx_readers` for load and stream are enough to
trigger this issue. Since each client typically holds two connections,
readers keeping index and data sources open can cause deadlocks where
processes stall due to unavailable connections.

Introduce `chunked_download_source`, a new S3 download method built on
`download_source`, to dynamically manage connections:

- Buffers data in 5MiB chunks using a producer-consumer model
- Closes connections once buffers reach capacity, returning them to
  the pool for other clients
- Uses a filling fiber that resumes fetching once buffers are
  consumed from the queue

Performance remains comparable to `download_source`, achieving
95MiB/s for sequential 1GiB downloads from S3. However, preloading
large chunks may cause read amplification.

Fixes: https://github.com/scylladb/scylladb/issues/23785

Closes scylladb/scylladb#23880
2025-06-10 12:58:24 +03:00
Calle Wilund
80feb8b676 utils::http::dns_connection_factory: Use a shared certificate_credentials
Fixes #24447

This factory type, which is really more a data holder/connection producer
per connection instance, creates, if using https, a new certificate_credentials
on every instance. Which when used by S3 client is per client and
scheduling groups.

Which eventually means that we will do a set_system_trust + "cold" handshake
for every tls connection created this way.

This will cause both IO and cold/expensive certificate checking -> possible
stalls/wasted CPU. Since the credentials object in question is literally a
"just trust system", it could very well be shared across the shard.

This PR adds a thread local static cached credentials object and uses this
instead. Could consider moving this to seastar, but maybe this is too much.

Closes scylladb/scylladb#24448
2025-06-10 11:20:21 +03:00
Benny Halevy
8b387109fc disk_space_monitor: add space_source_registration
Register the current space_source_fn in an RAII
object that resets monitor._space_source to the
previous function when the RAII object is destroyed.

Use space_source_registration in database_test::
mutation_dump_generated_schema_deterministic_id_version
to prevent use-after-stack-return in the test.

Fixes #24314

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#24342
2025-06-04 16:25:24 +03:00
Calle Wilund
942477ecd9 encryption/utils: Move encryption httpclient to "general" REST client
Fixed #24296

While the HTTP client used for REST calls in AWS/GCP KMS integration (EAR)
is not general enough to be called a HTTP client as such, it is general
enough to be called a REST client (limited to stateless, single-op REST
calls).

Other code, like general auth integrations (hello Azure) and similar
could reuse this to lessen code duplication.

This patch simply moves the httpclient class from encryption to "rest"
namespace, and explicitly "limits" it to such usage. Making an alias
in encryption to avoid touching more files than needed.

Closes scylladb/scylladb#24297
2025-05-30 12:21:51 +03:00
Avi Kivity
f0ec9dd8f2 Merge 'utils/logalloc: enforce the max contiguous allocation size limit' from Michał Chojnowski
This series fixes the only known violation of logalloc's allocation size limits (in `chunked_managed_vector`), and then it make those limits hard.

Before the series, LSA handles overly-large allocations by forwarding them to the standard allocator. After the series, an attempt to do an overly large allocations via LSA will trigger an `on_internal_error` instead.

We do this because the allocator fallback logic turned out to have subtle and problematic accounting bugs.
We could fix them, or we can remove the mechanism altogether.
It's hard to say which choice is better. This PR arbitrarily makes the choice to remove the mechanism.
This makes the logic simpler, at the risk of escalating some allocation size bugs to crashes.

See the descriptions of individual commits for more details.

Fixes scylladb/scylladb#23850
Fixes scylladb/scylladb#23851
Fixes scylladb/scylladb#23854

I'm not sure if any of this should be backported or not.

The `chunked_managed_vector` fix could be backported, because it's a bugfix. It's an old bug, though, and we have never observed problems related to it.

The changes to `logalloc` aren't supposed to be fixing any observable problem, so a backport probably has more risk than benefit in this case.

Closes scylladb/scylladb#23944

* github.com:scylladb/scylladb:
  utils/logalloc: enforce LSA allocation size limits
  utils/lsa/chunked_managed_vector: fix the calculation of max_chunk_capacity()
2025-05-29 22:11:41 +03:00
Michał Chojnowski
cb02d47b10 utils/logalloc: enforce LSA allocation size limits
In order to guarantee a decent upper limit on fragmentation,
LSA only handles allocations smaller than 0.1 of a segment.

Allocations larger than this limit are permitted, but they are
not placed in LSA segments. Instead, they are forwarded to
the standard allocator.

We don't really have any use case for this "fallback".
As far as I can tell, it only exists for "historical"
reasons, from times where there were some data structures
which weren't fully adapted to LSA yet.

We don't the fallback to be used.
Long-lived standard allocations are undesirable.
They have higher internal fragmentation than LSA
allocations, and they can cause external fragmentation
in the standard allocator. So we want to eliminate them all.

The only reason to keep the fallback is to soften the impact
if some bug results in limit-exceeding LSA allocations happening
in production. In principle, the fallback turns a crash
(or something similarly drastic) into just a performance problem.

However, it turns out that the fallback is buggy.
Recently we had a bug which caused limit-exceeding LSA allocations
to happen.
And then it turned out that LSA reclaim doesn't deal fully correctly
with evictable non-LSA allocations, and the dirty_memory_manager
accounting for non-LSA allocations is completely wrong.
This resulted in subtle, serious, and hard to understand stability
problems in production.

Arguably the biggest problem is that the "fallback" allocations
weren't reported in any way. They were happening in some tests,
but they were silently permitted, so nobody noticed that they
should be eliminated. If we just had a rate-limited error log
that reports fallback allocations, they would have never got
into a release.

So maybe we could fix the fallback, add more tests for it,
add a warning for when it's used, and keep it.

But this PR instead opts for removing the fallback mechanism
altogether and failing fast. After the patch, if a non-conforming
allocation happens, it will trigger an `on_internal_error`.
With this, we risk a greater impact if some non-conforming allocations
happen in production, but we make the system simpler.

It's hard to say if it's a good tradeoff.
2025-05-29 13:05:08 +02:00
Michał Chojnowski
185a032044 utils/stream_compressor: allocate memory for zstd compressors externally
The default and recommended way to use zstd compressors is to let
zstd allocate and free memory for compressors on its own.

That's what we did for zstd compressors used in RPC compression.
But it turns out that it generates allocation patterns we dislike.

We expected zstd not to generate allocations after the context object
is initialized, but it turns out that it tries to downsize the context
sometimes (by reallocation). We don't want that because the allocations
generated by zstd are large (1 MiB with the parameters we use),
so repeating them periodically stresses the reclaimer.

We can avoid this by using the "static context" API of zstd,
in which the memory for context is allocated manually by the user
of the library. In this mode, zstd doesn't allocate anything
on its own.

The implementation details of this patch adds a consideration for
forward compatibility: later versions of Scylla can't use a
window size greater than the one we hardcoded in this patch
when talking to the old version of the decompressor.

(This is not a problem, since those compressors are only used
for RPC compression at the moment, where cross-version communication
can be prevented by bumping COMPRESSOR_NAME. But it's something
that the developer who changes the window size must _remember_ to do).

Fixes #24160
Fixes #24183

Closes scylladb/scylladb#24161
2025-05-27 12:43:11 +03:00
Avi Kivity
13a75ff835 utils: chunked_vector: add swap() method
Following std::vector(), we implement swap(). It's a simple matter
of swapping all the contents.

A unit test is added.
2025-05-14 16:19:40 +03:00
Avi Kivity
24e0d17def utils: chunked_vector: add range insert() overloads
Inserts an iterator range at some position.

Again we insert the range at the end and use std::rotate() to
move the newly inserted elements into place, forgoing possible
optimizations.

Unit tests are added.
2025-05-14 16:19:40 +03:00
Avi Kivity
9425a3c242 utils: chunked_vector: relax static_assert
chunked_vector is only implemented for types with a
non-throwing move constructor; this greatly simplifies
the implementation.

We have a static_assert to enforce it (should really
be a constraint, but chunked_vector predates C++ concepts).

This static_assert prevents forward declarations from compiling:

    class forward_declared;
    using a = utils::chunked_vector<forward_declared>;

`a` won't compile since the static_assert will be instantiated
and will fail since forward_declared is an incomplete type. Using
a constraint has the same problem.

Fix by moving the static_assert to the destructor. The destructor
won't be instantiated by the forward declaration, so it won't
trigger. It will trigger when someone destroys the vector; at this
point the types are no longer forward declared.
2025-05-14 16:19:40 +03:00
Avi Kivity
d6eefce145 utils: chunked_vector: implement erase() for single elements and ranges
Implement using std::rotate() and resize(). The elements to be erased
are rotated to the end, then resized out of existence.

Again we defer optimization for trivially copyable types.

Unit tests are added.

Needed for range_streamer with token_ranges using chunked_vector.
2025-05-14 16:19:37 +03:00
Avi Kivity
5301f3d0b5 utils: chunked_vector: implement insert() for single-element inserts
partition_range_compat's unwrap() needs insert if we are to
use it for chunked_vector (which we do).

Implement using push_back() and std::rotate().

emplace(iterator, args) is also implemented, though the benefit
is diluted (it will be moved after construction).

The implementation isn't optimal - if T is trivially copyable
then using std::memmove() will be much faster that std::rotate(),
but this complex optimization is left for later.

Unit tests are added.
2025-05-14 14:54:59 +03:00
Michał Chojnowski
c47f438db3 logalloc: make background_reclaimer::free_memory_threshold publicly visible
Wanted by the change to the background_reclaim test in the next patch.
2025-05-06 18:59:18 +02:00
Pavel Emelyanov
b56d6fbb84 Merge 'sstables: Fix quadratic space complexity in partitioned_sstable_set' from Raphael Raph Carvalho
Interval map is very susceptible to quadratic space behavior when it's flooded with many entries overlapping all (or most of) intervals, since each such entry will have presence on all intervals it overlaps with.

A trigger we observed was memtable flush storm, which creates many small "L0" sstables that spans roughly the entire token range.

Since we cannot rely on insertion order, solution will be about storing sstables with such wide ranges in a vector (unleveled).

There should be no consequence for single-key reads, since upper layer applies an additional filtering based on token of key being queried.
And for range scans, there can be an increase in memory usage, but not significant because the sstables span an wide range and would have been selected in the combined reader if the range of scan overlaps with them.

Anyway, this is a protection against storm of memtable flushes and shouldn't be the common scenario.

It works both with tablets and vnodes, by adjusting the token range spanned by compaction group accordingly.

Fixes #23634.

We can backport this into 2024.2, 2025.1, but we should let this cook in master for 1 month or so.

Closes scylladb/scylladb#23806

* github.com:scylladb/scylladb:
  test: Verify partitioned set store split and unsplit correctly
  sstables: Fix quadratic space complexity in partitioned_sstable_set
  compaction: Wire table_state into make_sstable_set()
  compaction: Introduce token_range() to table_state
  dht: Add overlap_ratio() for token range
2025-05-05 11:28:38 +03:00
Piotr Dulikowski
8ffe4b0308 utils::loading_cache: gracefully skip timer if gate closed
The loading_cache has a periodic timer which acquires the
_timer_reads_gate. The stop() method first closes the gate and then
cancels the timer - this order is necessary because the timer is
re-armed under the gate. However, the timer callback does not check
whether the gate was closed but tries to acquire it, which might result
in unhandled exception which is logged with ERROR severity.

Fix the timer callback by acquiring access to the gate at the beginning
and gracefully returning if the gate is closed. Even though the gate
used to be entered in the middle of the callback, it does not make sense
to execute the timer's logic at all if the cache is being stopped.

Fixes: scylladb/scylladb#23951

Closes scylladb/scylladb#23952
2025-04-30 16:43:22 +03:00