2224 Commits

Author SHA1 Message Date
Piotr Dulikowski
df68d0c0f7 directories: add missing seastar/util/closeable.hh include
Without this include the file would not compile on its own. The issue
was most likely masked by the use of precompiled headers in our CI.

Closes scylladb/scylladb#29170
2026-03-23 15:46:56 +03:00
Pavel Emelyanov
c4a0f6f2e6 object_store: Don't leave dangling objects by iterating moved-from names vector
The code in upload_file std::move()-s vector of names into
merge_objects() method, then iterates over this vector to delete
objects. The iteration is apparently a no-op on moved-from vector.

The fix is to make merge_objects() helper get vector of names by const
reference -- the method doesn't modify the names collection, the caller
keeps one in stable storage.

Fixes #29060

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29061
2026-03-20 10:09:30 +02:00
Pavel Emelyanov
712ba5a31f utils: Use yielding directory_lister in owner verification
Switch directories::do_verify_owner_and_mode() from lister::scan_dir() to
utils::directory_lister while preserving the previous hidden-entry
behavior.

Make do_verify_subpath use lister::filter_type directly so the
verification helper can pass it straight into directory_lister, and keep
a single yielding iteration loop for directory traversal.

Minus one scan_dir user twards scan_dir removal from code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29064
2026-03-20 10:08:38 +02:00
Pavel Emelyanov
961fc9e041 s3: Don't rearm credential timers when credentials are not refreshed
The update_credentials_and_rearm() may get "empty" credentials from
_creds_provider_chain.get_aws_credentials() -- it doesn't throw, but
returns default-initialized value. In that case the expires_at will be
set to time_point::min, and it's probably not a good idea to arm the
refresh timer and, even worse idea, to subtract 1h from it.

Fixes #29056

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29057
2026-03-20 10:07:01 +02:00
Pavel Emelyanov
0a8dc4532b s3: Fix missing upload ID in copy_part trace log
The format string had two {} placeholders but three arguments, the
_upload_id one is skipped from formatting

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29053
2026-03-20 10:05:44 +02:00
Avi Kivity
5e7fb08bf3 Merge 'Fix bad performance for densely populated partition index pages' from Tomasz Grabiec
This applies to small partition workload where index pages have high partition count, and the index doesn't fit in cache. It was observed that the count can be in the order of hundreds. In such a workload pages undergo constant population, LSA compaction, and LSA eviction, which has severe impact on CPU utilization.

Refs https://scylladb.atlassian.net/browse/SCYLLADB-620

This PR reduces the impact by several changes:

  - reducing memory footprint in the partition index. Assuming partition key size is 16 bytes, the cost dropped from 96 bytes to 36 bytes per partition.

  - flattening the object graph and amortizing storage. Storing entries directly in the vector. Storing all key values in a single managed_bytes. Making index_entry a trivial struct.

  - index entries and key storage are now trivially moveable, and batched inside vector storage
    so LSA migration can use memcpy(), which amortizes the cost per key. This reduces the cost of LSA segment compaction.

 - LSA eviction is now pretty much constant time for the whole page
   regardless of the number of entries, because elements are trivial and batched inside vectors.
   Page eviction cost dropped from 50 us to 1 us.

Performance evaluated with:

   scylla perf-simple-query -c1 -m200M --partitions=1000000

Before:

```
7774.96 tps (166.0 allocs/op, 521.7 logallocs/op,  54.0 tasks/op,  802428 insns/op,  430457 cycles/op,        0 errors)
7511.08 tps (166.1 allocs/op, 527.2 logallocs/op,  54.0 tasks/op,  804185 insns/op,  430752 cycles/op,        0 errors)
7740.44 tps (166.3 allocs/op, 526.2 logallocs/op,  54.2 tasks/op,  805347 insns/op,  432117 cycles/op,        0 errors)
7818.72 tps (165.2 allocs/op, 517.6 logallocs/op,  53.7 tasks/op,  794965 insns/op,  427751 cycles/op,        0 errors)
7865.49 tps (165.1 allocs/op, 513.3 logallocs/op,  53.6 tasks/op,  788898 insns/op,  425171 cycles/op,        0 errors)
```

After (+318%):

```
32492.40 tps (130.7 allocs/op,  12.8 logallocs/op,  36.1 tasks/op,  109236 insns/op,  103203 cycles/op,        0 errors)
32591.99 tps (130.4 allocs/op,  12.8 logallocs/op,  36.0 tasks/op,  108947 insns/op,  102889 cycles/op,        0 errors)
32514.52 tps (130.6 allocs/op,  12.8 logallocs/op,  36.0 tasks/op,  109118 insns/op,  103219 cycles/op,        0 errors)
32491.14 tps (130.6 allocs/op,  12.8 logallocs/op,  36.0 tasks/op,  109349 insns/op,  103272 cycles/op,        0 errors)
32582.90 tps (130.5 allocs/op,  12.8 logallocs/op,  36.0 tasks/op,  109269 insns/op,  102872 cycles/op,        0 errors)
32479.43 tps (130.6 allocs/op,  12.8 logallocs/op,  36.0 tasks/op,  109313 insns/op,  103242 cycles/op,        0 errors)
32418.48 tps (130.7 allocs/op,  12.8 logallocs/op,  36.1 tasks/op,  109201 insns/op,  103301 cycles/op,        0 errors)
31394.14 tps (130.7 allocs/op,  12.8 logallocs/op,  36.1 tasks/op,  109267 insns/op,  103301 cycles/op,        0 errors)
32298.55 tps (130.7 allocs/op,  12.8 logallocs/op,  36.1 tasks/op,  109323 insns/op,  103551 cycles/op,        0 errors)
```

When the workload is miss-only, with both row cache and index cache disabled (no cache maintenance cost):

  perf-simple-query -c1 -m200M --duration 6000 --partitions=100000 --enable-index-cache=0 --enable-cache=0

Before:

```
9124.57 tps (146.2 allocs/op, 789.0 logallocs/op,  45.3 tasks/op,  889320 insns/op,  357937 cycles/op,        0 errors)
9437.23 tps (146.1 allocs/op, 789.3 logallocs/op,  45.3 tasks/op,  889613 insns/op,  357782 cycles/op,        0 errors)
9455.65 tps (146.0 allocs/op, 787.4 logallocs/op,  45.2 tasks/op,  887606 insns/op,  357167 cycles/op,        0 errors)
9451.22 tps (146.0 allocs/op, 787.4 logallocs/op,  45.3 tasks/op,  887627 insns/op,  357357 cycles/op,        0 errors)
9429.50 tps (146.0 allocs/op, 787.4 logallocs/op,  45.3 tasks/op,  887761 insns/op,  358148 cycles/op,        0 errors)
9430.29 tps (146.1 allocs/op, 788.2 logallocs/op,  45.3 tasks/op,  888501 insns/op,  357679 cycles/op,        0 errors)
9454.08 tps (146.0 allocs/op, 787.3 logallocs/op,  45.3 tasks/op,  887545 insns/op,  357132 cycles/op,        0 errors)
```

After (+55%):

```
14484.84 tps (150.7 allocs/op,   6.5 logallocs/op,  44.7 tasks/op,  396164 insns/op,  229490 cycles/op,        0 errors)
14526.21 tps (150.8 allocs/op,   6.5 logallocs/op,  44.8 tasks/op,  396401 insns/op,  228824 cycles/op,        0 errors)
14567.53 tps (150.7 allocs/op,   6.5 logallocs/op,  44.7 tasks/op,  396319 insns/op,  228701 cycles/op,        0 errors)
14545.63 tps (150.6 allocs/op,   6.5 logallocs/op,  44.7 tasks/op,  395889 insns/op,  228493 cycles/op,        0 errors)
14626.06 tps (150.5 allocs/op,   6.5 logallocs/op,  44.7 tasks/op,  395254 insns/op,  227891 cycles/op,        0 errors)
14593.74 tps (150.5 allocs/op,   6.5 logallocs/op,  44.7 tasks/op,  395480 insns/op,  227993 cycles/op,        0 errors)
14538.10 tps (150.8 allocs/op,   6.5 logallocs/op,  44.8 tasks/op,  397035 insns/op,  228831 cycles/op,        0 errors)
14527.18 tps (150.8 allocs/op,   6.5 logallocs/op,  44.8 tasks/op,  396992 insns/op,  228839 cycles/op,        0 errors)
```

Same as above, but with summary ratio increased from 0.0005 to 0.005 (smaller pages):

Before:

```
33906.70 tps (146.1 allocs/op,  83.6 logallocs/op,  45.1 tasks/op,  170553 insns/op,   98104 cycles/op,        0 errors)
32696.16 tps (146.0 allocs/op,  83.5 logallocs/op,  45.1 tasks/op,  170369 insns/op,   98405 cycles/op,        0 errors)
33889.05 tps (146.1 allocs/op,  83.6 logallocs/op,  45.1 tasks/op,  170551 insns/op,   98135 cycles/op,        0 errors)
33893.24 tps (146.1 allocs/op,  83.5 logallocs/op,  45.1 tasks/op,  170488 insns/op,   98168 cycles/op,        0 errors)
33836.73 tps (146.1 allocs/op,  83.6 logallocs/op,  45.1 tasks/op,  170528 insns/op,   98226 cycles/op,        0 errors)
33897.61 tps (146.0 allocs/op,  83.5 logallocs/op,  45.1 tasks/op,  170428 insns/op,   98081 cycles/op,        0 errors)
33834.73 tps (146.1 allocs/op,  83.5 logallocs/op,  45.1 tasks/op,  170438 insns/op,   98178 cycles/op,        0 errors)
33776.31 tps (146.3 allocs/op,  83.9 logallocs/op,  45.2 tasks/op,  170958 insns/op,   98418 cycles/op,        0 errors)
33808.08 tps (146.3 allocs/op,  83.9 logallocs/op,  45.2 tasks/op,  170940 insns/op,   98388 cycles/op,        0 errors)
```

After (+18%):

```
40081.51 tps (148.2 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  121047 insns/op,   82231 cycles/op,        0 errors)
40005.85 tps (148.6 allocs/op,   4.4 logallocs/op,  45.2 tasks/op,  121327 insns/op,   82545 cycles/op,        0 errors)
39816.75 tps (148.3 allocs/op,   4.4 logallocs/op,  45.1 tasks/op,  121067 insns/op,   82419 cycles/op,        0 errors)
39953.11 tps (148.1 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  121027 insns/op,   82258 cycles/op,        0 errors)
40073.96 tps (148.2 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  121006 insns/op,   82313 cycles/op,        0 errors)
39882.25 tps (148.2 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  120925 insns/op,   82320 cycles/op,        0 errors)
39916.08 tps (148.3 allocs/op,   4.4 logallocs/op,  45.1 tasks/op,  121054 insns/op,   82393 cycles/op,        0 errors)
39786.30 tps (148.2 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  121027 insns/op,   82465 cycles/op,        0 errors)
38662.45 tps (148.3 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  121108 insns/op,   82312 cycles/op,        0 errors)
39849.42 tps (148.3 allocs/op,   4.4 logallocs/op,  45.1 tasks/op,  121098 insns/op,   82447 cycles/op,        0 errors)
```

Closes scylladb/scylladb#28603

* github.com:scylladb/scylladb:
  sstables: mx: index_reader: Optimize parsing for no promoted index case
  vint: Use std::countl_zero()
  test: sstable_partition_index_cache_test: Validate scenario of pages with sparse promoted index placement
  sstables: mx: index_reader: Amoritze partition key storage
  managed_bytes: Hoist write_fragmented() to common header
  utils: managed_vector: Use std::uninitialized_move() to move objects
  sstables: mx: index_reader: Keep promoted_index info next to index_entry
  sstables: mx: index_reader: Extract partition_index_page::clear_gently()
  sstables: mx: index_reader: Shave-off 16 bytes from index_entry by using raw_token
  sstables: mx: index_reader: Reduce allocation_section overhead during index page parsing by batching allocation
  sstables: mx: index_reader: Keep index_entry directly in the vector
  dht: Introduce raw_token
  test: perf_simple_query: Add 'sstable-format' command-line option
  test: perf_simple_query: Add 'sstable-summary-ratio' command-line option
  test: perf-simple-query: Add option to disable index cache
  test: cql_test_env: Respect enable-index-cache config
2026-03-19 14:42:50 +02:00
Pavel Emelyanov
d6c01be09b s3/client: Don't reconstruct regex on every parse_content_range call
Make the pattern static const so it is compiled once at first call rather
than on every Content-Range header parse.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29054
2026-03-18 17:56:33 +02:00
Tomasz Grabiec
1452e92567 managed_bytes: Hoist write_fragmented() to common header 2026-03-18 16:25:20 +01:00
Tomasz Grabiec
75e6412b1c utils: managed_vector: Use std::uninitialized_move() to move objects
It's shorter, and is supposed to be optimized for trivially-moveable
types.

Important for managed_vector<index_entry>, which can have lots of
elements.
2026-03-18 16:25:20 +01:00
Botond Dénes
5d868dcc55 Merge 's3_client: fix s3::range max value for object size' from Ernest Zaslavsky
- fix s3::range max value for object size which is 50TiB and not 5.
- refactor constants to make it accessible for all interested parties, also reuse these constants in tests

No need to backport, doubt we will encounter an object larger than 5TiB

Closes scylladb/scylladb#28601

* github.com:scylladb/scylladb:
  s3_client: reorganize tests in part_size_calculation_test
  s3_client: switch using s3 limits constants in tests
  s3_client: fix the s3::range max object size
  s3_client: remove "aws" prefix from object limits constants
  s3_client: make s3 object limits accessible
2026-03-17 16:34:42 +02:00
Avi Kivity
76b6784c1a Merge 'cql3: track CQL parsing memory cost and use it for admission control' from Marcin Maliszkiewicz
Use rolling_max_tracker to record gross bytes allocated during each
CQL parse.  The rolling maximum is then added to the memory estimate
for incoming QUERY and PREPARE requests so that the admission control
in the CQL transport layer accounts for parsing overhead.

The measured memory footprint serves as upper bound rather than
exact number but it's purpose is to prevent OOMs under unprepared
statements heavy load.

In benchmark 1G memory node shows decrease of non-LSA memory usage
from peak 320MB (our coordinator budget is 10% of 1G) to 96MB. While
tps drops from 1.2 kops to 0.8 kops. Drop in tps is expected as
memory admission kicks in trying to prevent OOM.

This is phase 1 of OOM prevention, potential next steps:
- add second admission in query_processor::get_statement trying to prevent potential thundering herd problem
- decrease cql_server memory pool size
- count reads in the memory pool
- add per service level memory pool and a shared one

Related https://scylladb.atlassian.net/browse/SCYLLADB-740
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-938

Backport: no, new feature, but we may reconsider if some customer needs it

Closes scylladb/scylladb#28919

* github.com:scylladb/scylladb:
  cql3: track CQL parsing memory cost and use it for admission control
  utils: add rolling max tracker
2026-03-12 19:59:52 +02:00
Marcin Maliszkiewicz
5b2a07b408 utils: add rolling max tracker
We will use it later to track parser memory
usage via per query samples.

Tests runtime in dev: 1.6s
2026-03-12 08:56:41 +01:00
Amnon Heiman
b22162c719 estimated_histogram.hh: adds estimated_histogram_with_max
This patch adds estimated_histogram_with_max template that will be a
based for specific estimated_histograms, eventually replacing the current
struct implementation.

Introduce estimated_histogram_with_max<Max> as a reusable wrapper around
approx_exponential_histogram<1, Max, 4>, providing merge support and the
same add helpers used by existing estimated_histogra type.

Add estimated_histogram_with_max_merge()

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2026-03-11 15:02:37 +02:00
Avi Kivity
c331796d28 Merge 'Support Min < Precision for approx_exponential_histogram' from Amnon Heiman
This series closes a gap in the approx_exponential_histogram implementation to
cover integer values starting from small Min values.

While the original implementation was focused on durations, where this limitation
was not an issue, over time, there has been a growing need for histograms that
cover smaller values, such as the number of SSTables or the number of items in a
batch.

The reason for the original limitation is inherent to the exponential histogram
math. The previous code required Min to be at least Precision to avoid negative
bit shifts in the exponential calculations.

After this series, approx_exponential_histogram allows Min to be smaller than
Precision by scaling values during indexing. The value is shifted left by
log2 Precision minus log2 Min or zero whichever is larger, and the existing
exponential math is applied. Bucket limits are then scaled back to the original
units. This keeps insertion and retrieval O(1) without runtime branching, at the
cost of repeated bucket limits for some values in the Min to Precision range.

Additional tests cover the new behavior.
Relates to #2785

** New feature, no need to backport. **

Closes scylladb/scylladb#28371

* github.com:scylladb/scylladb:
  estimated_histogram_test.cc: add to_metrics_histogram test
  histogram_metrics_helper.hh: Support Min < Precision
  estimated_histogram_test.cc: Add tests for approx_exponential_histogram with Min<Precision
  estimated_histogram.hh: support Min less than Precision histograms
2026-03-04 12:43:26 +02:00
Botond Dénes
fcc570c697 Merge 'Exorcise assertions from Alternator, using a new throwing_assert() macro' from Nadav Har'El
assert(), and SCYLLA_ASSERT() are evil (Refs #7871) because they can cause the entire Scylla cluster to crash mysteriously instead of cleanly failing the specific request that encountered a serious problem of failed pre-requisite.

In this two-patch series, in the first patch we introduce a new macro throwing_assert(), a convenient drop-in replacement for SCYLLA_ASSERT() but which has all the benefits of on_internal_error() instead of the dangers of SCYLLA_ASSERT().
In the second patch we use the new function to replace every call to SCYLLA_ASSERT() in Alternator by the new throwing_assert().

Here is an example from the second patch to demonstrate the power of this approach: The Alternator code uses the attrs_column() function to retrieve the ":attrs" column of a schema. Since every Alternator table always has an ":attrs" column in its schema, we felt safe to SCYLLA_ASSERT() that this column exists. However, imagine that one day because of a bug, one Alternator table is missing this column. Or maybe not a bug - maybe a malicious user on a shared cluster found a way to deliberately delete this column (e.g, with a CQL command!) and this check fails. Before this patch, the entire Scylla node will crash. If the same request is sent to all nodes - the entire cluster will crash. The user might not even know which request caused this crash. In contrast, after this patch, the specific operation - e.g., PutItem - will get an exception. Only this operation, and nothing else, will be aborted, and the user who sent this request will even get an "Internal Server Error" with the assertion-failure message, alerting them that this specific query is causing problems, while other queries might work normally.

There's no need to backport this patch - unless it becomes annoying that other branches don't have the throwing_assert() function and we want it to ease other backports.

Fixes #28308.

Closes scylladb/scylladb#28445

* github.com:scylladb/scylladb:
  alternator: replace SCYLLA_ASSERT with throwing_assert
  utils: introduce throwing_assert(), a safe replacement for assert
2026-02-27 15:35:36 +02:00
Amnon Heiman
0b4f28ae21 histogram_metrics_helper.hh: Support Min < Precision
to_metrics_histogram now collapses duplicate integer bucket bounds
caused by Min less than Precision scaling while always keeping native
histogram metadata.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2026-02-26 09:00:38 +02:00
Amnon Heiman
6c21e5f80c estimated_histogram.hh: support Min less than Precision histograms
approx_exponential_histogram is a pseudo exponential histogram implementation
that can insert and retrieve values into and from buckets in O 1 time.
The implementation uses power of two ranges and splits them linearly into
buckets. The number of buckets per power of two range is called Precision.

The original implementation aimed at covering large value ranges had a
limitation. The histogram Min value had to be greater than or equal to
Precision. As a result code that needs histograms for small integer values
could not use this implementation efficiently.

This change addresses that gap by handling the case where Min is less than
Precision. For Min smaller than Precision the value is scaled by a power of
two factor during indexing so the existing exponential math can be reused
without runtime branching. Bucket limits are scaled back to the original
units which can lead to repeated bucket limits in the Min to Precision
range for integer values.

Example with Min 2 and Precision 4
Buckets 2 2 3 3 4 5 6 7 8 10 12 14 and so on

Implementation details
Introduce SHIFT based on log2 Precision minus log2 Min when positive
Scale Min and Max by SHIFT for all exponential calculations
Compute NUM_BUCKETS using the standard log2 Max over Min formula
Use scaled value in find_bucket_index to avoid fractional bucket steps
Return bucket limits by scaling back to original units
Constraint relaxed from Min greater or equal to Precision to allow any Min
less than Max still power of two

This change maintains backward compatibility with existing histograms
while enabling efficient tracking of small integer values.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2026-02-26 00:46:14 +02:00
Nadav Har'El
d876e7cd0a utils: introduce throwing_assert(), a safe replacement for assert
This patch introduces throwing_assert(cond), a better and safer
replacement for assert(cond) or SCYLLA_ASSERT(cond). It aims to
eventually replace all assertions in Scylla and provide a real solution
to issue #7871 ("exorcise assertions from Scylla").

throwing_assert() is based on the existing on_internal_error() and
inherits all its benefits, but brings with it the *convenience* of
assert() and SCYLLA_ASSERT(): No need for a separate if(), new strings,
etc.  For example, you can do write just one line of throwing_assert():

    throwing_assert(p != nullptr);

Instead of much more verbose on_internal_error:

    if (p == nullptr) {
        utils::on_internal_error("assertion failed: p != nullptr")
    }

Like assert() and SCYLLA_ASSERT(), in our tests throwing_assert() dumps
core on failure. But its advantage over the other assertion functions
like becomes clear in production:

* assert() is compiled-out in release builds. This means that the
  condition is not checked, and the code after the failed condition
  continues to run normally, potentially to disasterous consequences.

  In contrast, throwing_assert() continues to check the condition even in
  release builds, and if the condition is false it throws an exception.
  This ensures that the code following the condition doesn't run.

* SCYLLA_ASSERT() in release builds checks the condition and *crashes*
  Scylla if the condition is not met.

  In contrast, throwing_assert() doesn't crash, but throws an exception.
  This means that the specific operation that encountered the error
  is aborted, instead of the entire server. It often also means that
  the user of this operation will see this error somehow and know
  which operation failed - instead of encountering a mysterious
  server (or even whole-cluster crash) without any indication which
  operation caused it.

Another benefit of throwing_assert() is that it logs the error message
(and also a backtrace!) to Scylla's usual logging mechanisms - not to
stderr like assert and SCYLLA_ASSERT write, where users sometimes can't
see what is written.

Fixes #28308.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-25 14:58:47 +02:00
Botond Dénes
99244179f7 Merge 'CQL transport: Add histogram-based request/response size tracking' from Amnon Heiman
This series closes a gap in how CQL request and response sizes are reported.

Previously, request_size and response_size were tracked as simple counters,
providing only cumulative totals per shard. This made it difficult to understand
the distribution of message sizes and identify potential issues with very large
or very small requests.

After this series, the CQL transport reports detailed histogram metrics showing
the distribution of request and response sizes. These histograms are tracked
per-instance, per-type (per ops), and per-scheduling-group, providing
much better visibility into CQL traffic patterns.

The histograms are collected for QUERY, EXECUTE, and BATCH operations, which are
the primary data path operations where message size distribution is most relevant.
This data can help identify:
- Clients sending unexpectedly large requests
- Operations with oversized result sets
- Scheduling group differences in traffic patterns

To support this, the series extends the approx_exponential_histogram template to
handle accurate sum, adds a bytes_histogram type alias optimized for byte-range measurements (1KB to 1GB).

The existing per-shard counter metrics are maintained for backward compatibility.
Metrics example:
```
scylla_transport_cql_request_bytes{kind="BATCH",scheduling_group_name="sl:default",shard="0"} 129808
scylla_transport_cql_request_bytes{kind="EXECUTE",scheduling_group_name="sl:default",shard="0"} 227409
scylla_transport_cql_request_bytes{kind="PREPARE",scheduling_group_name="sl:default",shard="0"} 631
scylla_transport_cql_request_bytes{kind="QUERY",scheduling_group_name="sl:default",shard="0"} 2809
scylla_transport_cql_request_bytes{kind="QUERY",scheduling_group_name="sl:driver",shard="0"} 4079
scylla_transport_cql_request_bytes{kind="REGISTER",scheduling_group_name="sl:default",shard="0"} 98
scylla_transport_cql_request_bytes{kind="STARTUP",scheduling_group_name="sl:driver",shard="0"} 432
scylla_transport_cql_request_histogram_bytes_sum{kind="QUERY",scheduling_group_name="sl:driver"} 4079
scylla_transport_cql_request_histogram_bytes_count{kind="QUERY",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="1024.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="2048.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="4096.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="8192.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="16384.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="32768.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="65536.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="131072.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="262144.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="524288.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="1048576.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="2097152.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="4194304.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="8388608.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="16777216.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="33554432.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="67108864.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="134217728.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="268435456.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="536870912.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="1073741824.000000",scheduling_group_name="sl:driver"} 57
```
**The field sees it as an important issue**

Fixes #14850

Closes scylladb/scylladb#28419

* github.com:scylladb/scylladb:
  test/boost/estimated_histogram_test.cc: Switch to real Sum
  transport/server: to bytes_histogram
  approx_exponential_histogram: Add sum() method for accurate value tracking
  utils/estimated_histogram.hh: Add bytes_histogram
2026-02-25 13:05:18 +02:00
Ernest Zaslavsky
321d4caf0c object_storage: add retryable machinery to object storage
remove hand rolled error handling from object storage client
and replace with common machinery that supports exception
handling and retrying when appropriate
2026-02-22 14:00:44 +02:00
Ernest Zaslavsky
24972da26d rest_client: add simple_send overload
add an overload to rest client `simple_send` to accept a retry_strategy for http's make_request
2026-02-22 14:00:44 +02:00
Avi Kivity
dee868b71a interval: avoid clang 23 warning on throw statement in potentially noexcept function
interval_data's move constructor is conditionally noexcept. It
contains a throw statemnt for the case that the underlying type's
move constructor can throw; that throw statemnt is never executed
if we're in the noexept branch. Clang 23 however doesn't understand
that, and warns about throwing in a noexcept function.

Fix that by rewriting the logic using seastar::defer(). In the
noexcept case, the optimizer should eliminate it as dead code.

Closes scylladb/scylladb#28710
2026-02-19 12:24:20 +03:00
Calle Wilund
8e71a6f52a gcp: Add handling of 429 (too many requests) to exponential backoff
Fixes: SCYLLADB-611

Adds http error code 429 to codes handled by exponential backoff.

Closes scylladb/scylladb#28588
2026-02-19 09:42:39 +01:00
Ernest Zaslavsky
d763bdabc2 s3_client: fix the s3::range max object size
in s3::Range class start using s3 global constant for two reasons:
1) uniformity, no need to introduce semantically same constant in each class
2) the value was wrong
2026-02-18 12:12:04 +02:00
Ernest Zaslavsky
24e70b30c8 s3_client: remove "aws" prefix from object limits constants
remove "aws" prefix from object limits constants since it is
irrelevant and unnecessary when sitting under s3 namespace
2026-02-18 12:12:04 +02:00
Ernest Zaslavsky
329c156600 s3_client: make s3 object limits accessible
make s3 limits constants publicly accessible to reuse it later
2026-02-18 12:12:04 +02:00
Pavel Emelyanov
89d8ae5cb6 Merge 'http: prepare http clients retry machinery refactoring' from Ernest Zaslavsky
Today S3 client has well established and well testes (hopefully) http request retry strategy, in the rest of clients it looks like we are trying to achieve the same writing the same code over and over again and of course missing corner cases that already been addressed in the S3 client.
This PR aims to extract the code that could assist other clients to detect the retryability of an error originating from the http client, reuse the built in seastar http client retryability and to minimize the boilerplate of http client exception handling

No backport needed since it is only refactoring of the existing code

Closes scylladb/scylladb#28250

* github.com:scylladb/scylladb:
  exceptions: add helper to build a chain of error handlers
  http: extract error classification code
  aws_error: extract `retryable` from aws_error
2026-02-18 10:06:37 +03:00
Pavel Emelyanov
2f10fd93be Merge 's3_client: Fix s3 part size and number of parts calculation' from Ernest Zaslavsky
- Correct `calc_part_size` function since it could return more than 10k parts
- Add tests
- Add more checks in `calc_part_size` to comply with S3 limits

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-640
Must be ported back to 2025.3/4 and 2026.1 since we may encounter this bug in production clusters

Closes scylladb/scylladb#28592

* github.com:scylladb/scylladb:
  s3_client: add more constrains to the calc_part_size
  s3_client: add tests for calc_part_size
  s3_client: correct multipart part-size logic to respect 10k limit
2026-02-18 10:04:53 +03:00
Botond Dénes
2e087882fa Merge 'GCS object storage. Fix incompatibilty issues with "real" GCS' from Calle Wilund
Fixes #28398
Fixes #28399

When used as path elements in google storage paths, the object names need to be URL encoded. Due to

a.) tests not really using prefixes including non-url valid chars (i.e. / etc)
and
b.) the mock server used for most testing not enforcing this particular aspect,

this was missed.

Modified unit tests to use prefixing for all names, so when running real GS, any errors like this will show.

"Real" GCS also behaves a bit different when listing with pager, compared to mock;
The former will not give a pager token for last page, only penultimate.
 Adds handling for this.

Needs backport to the releases that have (though might not really use) the feature, as it is technically possible to use google storage for backup and whatnot there, and it should work as expected.

Closes scylladb/scylladb#28400

* github.com:scylladb/scylladb:
  utils/gcp/object_storage: URL-encode object names in URL:s
  utils::gcp::object_storage: Fix list object pager end condition detection
2026-02-17 16:40:02 +02:00
Ernest Zaslavsky
034c6fbd87 s3_client: limit multipart upload concurrency
Prevent launching hundreds or thousands of fibers during multipart uploads
by capping concurrent part submissions to 16.

Closes scylladb/scylladb#28554
2026-02-16 13:32:58 +03:00
Ernest Zaslavsky
960adbb439 s3_client: add more constrains to the calc_part_size
Enforce more checks on part size and object size as defined in
"Amazon S3 multipart upload limits", see
https://docs.aws.amazon.com/AmazonS3/latest/userguide/qfacts.html and
https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingObjects.html
2026-02-10 13:15:07 +02:00
Ernest Zaslavsky
6280cb91ca s3_client: add tests for calc_part_size
Introduce tests that validate the corrected multipart part-size
calculation, including boundary conditions and error cases.
2026-02-10 13:13:26 +02:00
Ernest Zaslavsky
289e910cec s3_client: correct multipart part-size logic to respect 10k limit
The previous calculation could produce more than 10,000 parts for large
uploads because we mixed values in bytes and MiB when determining the
part size. This could result in selecting a part size that still
exceeded the AWS multipart upload limit. The updated logic now ensures
the number of parts never exceeds the allowed maximum.

This change also aligns the implementation with the code comment: we
prefer a 50 MiB part size because it provides the best performance, and
we use it whenever it fits within the 10,000-part limit. If it does not,
we increase the part size (in bytes, aligned to MiB) to stay within the
limit.
2026-02-10 13:13:25 +02:00
Ernest Zaslavsky
7142b1a08d exceptions: add helper to build a chain of error handlers
Generalize error handling by creating exception dispatcher which allows to write error handlers by sequentially applying handlers the same way one would write `catch ()` blocks
2026-02-09 08:48:41 +02:00
Ernest Zaslavsky
7fd62f042e http: extract error classification code
move http client related error classification code to a common location for future reuse
2026-02-09 08:48:41 +02:00
Ernest Zaslavsky
5beb7a2814 aws_error: extract retryable from aws_error
Move aws::retryable to common location to reuse it later in other http based clients
2026-02-09 08:48:41 +02:00
Amnon Heiman
5875bcca23 approx_exponential_histogram: Add sum() method for accurate value tracking
Previously, histogram sums were estimated by multiplying bucket offsets
by their counts, which produces inaccurate results - typically too high
when using upper limits or too low when using lower limits.

This patch adds accurate sum tracking to approx_exponential_histogram:
- Adds a _sum member variable to track the actual sum of all values
- Implements sum() method to return the accumulated total
- Updates add() to increment _sum for each value
- Modifies to_metrics_histogram() helper to use the new sum() method

This change is important as histograms will be used instead of counters for
byte statistics, where accurate totals are essential for metrics reporting.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2026-01-28 13:39:46 +02:00
Amnon Heiman
2fd453f4ec utils/estimated_histogram.hh: Add bytes_histogram
For various use cases, we need to report byte histograms, such as for
request and reply message sizes.

This patch introduce bytes_histogram as a type alias for
approx_exponential_histogram configured to track byte values from 1KB to
1GB with power-of-2 buckets (Precision=1).

This provides a convenient, performance-efficient histogram for
measuring message sizes, payload sizes, and other byte-based metrics.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2026-01-28 13:31:39 +02:00
Calle Wilund
87aa6c8387 utils/gcp/object_storage: URL-encode object names in URL:s
Fixes #28398

When used as path elements in google storage paths, the object names
need to be URL encoded. Due to a.) tests not really using prefixes including
non-url valid chars (i.e. / etc) and the mock server used for most
testing not enforcing this particular aspect, this was missed.

Modified unit tests to use prefixing for all names, so when run
in real GS, any errors like this will show.
2026-01-27 18:01:21 +01:00
Calle Wilund
a896d8d5e3 utils::gcp::object_storage: Fix list object pager end condition detection
Fixes #28399

When iterating with pager, the mock server and real GCS behaves differently.
The latter will not give a pager token for last page, only penultimate.

Need to handle.
2026-01-27 17:57:17 +01:00
Pavel Emelyanov
02af292869 Merge 'Introduce TTL and retries to address resolution' from Ernest Zaslavsky
In production environments, we observed cases where the S3 client would repeatedly fail to connect due to DNS entries becoming stale. Because the existing logic only attempted the first resolved address and lacked a way to refresh DNS state, the client could get stuck in a failure loop.

Introduce RR TTL and connection failure retry to
- re-resolve the RR in a timely manner
- forcefully reset and re-resolve addresses
- add a special case when the TTL is 0 and the record must be resolved for every request

Fixes: CUSTOMER-96
Fixes: CUSTOMER-139

Should be backported to 2025.3/4 and 2026.1 since we already encountered it in the production clusters for 2025.3

Closes scylladb/scylladb#27891

* github.com:scylladb/scylladb:
  connection_factory: includes cleanup
  dns_connection_factory: refine the move constructor
  connection_factory: retry on failure
  connection_factory: introduce TTL timer
  connection_factory: get rid of shared_future in dns_connection_factory
  connection_factory: extract connection logic into a member
  connection_factory: remove unnecessary `else`
  connection_factory: use all resolved DNS addresses
  s3_test: remove client double-close
2026-01-27 18:45:43 +03:00
Avi Kivity
f1c6094150 Merge 'Remove buffer_input_stream and limiting_input_stream from core code' from Pavel Emelyanov
These two streams mostly play together. The former provides an input_stream from read from in-memory temporary buffers, the latter wraps it to limit the size of provided temporary buffers. Both are used to test contiguous data consumer, also the buffer_input_stream has a caller in sstables reversing reader.

This PR removes the buffer_input_stream in favor of seastar memory_data_source, and moves the limiting_input_stream into test/lib.

Enanching testing code, not backporting

Closes scylladb/scylladb#28352

* github.com:scylladb/scylladb:
  code: Move limiting data source to test/lib
  util: Simplify limiting_data_source API
  util: Remove buffer_input_stream
  test: Use seastar::util::temporary_buffer_data_source in data consumer test
  sstables: Use seastar::util::as_input_stream() in mx reader
2026-01-26 22:05:59 +02:00
Ernest Zaslavsky
912c48a806 connection_factory: includes cleanup 2026-01-26 15:15:21 +02:00
Ernest Zaslavsky
3a31380b2c dns_connection_factory: refine the move constructor
Clean up the awkward move constructor that was declared in the header
but defaulted in a separate compilation unit, improving clarity and
consistency.
2026-01-26 15:15:15 +02:00
Ernest Zaslavsky
a05a4593a6 connection_factory: retry on failure
If connecting to a provided address throws, renew the address list and
retry once (and only once) before giving up.
2026-01-26 15:14:18 +02:00
Ernest Zaslavsky
6eb7dba352 connection_factory: introduce TTL timer
Add a TTL-based timer to connection_factory to automatically refresh
resolved host name addresses when they expire.
2026-01-26 15:11:49 +02:00
Pavel Emelyanov
77435206b9 code: Move limiting data source to test/lib
Only two tests use it now -- the limit-data-source-test iself and a test
that validates continuous_data_consumer template.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-26 12:49:42 +03:00
Pavel Emelyanov
111b376d0d util: Simplify limiting_data_source API
The source maintains "limit generator" -- a function that returns the
maximum size of bytes to return from the next buffer.

Currently all callers just return constant numbers from it. Passing a
function that returns non-constant one can, probably, be used for a
fuzzy test, but even the limiting-data-source-test itself doesn't do it,
so what's the point...

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-26 12:46:37 +03:00
Pavel Emelyanov
e297ed0b88 util: Remove buffer_input_stream
It's now unused. All the users had been patched to use seastar memory
data source implementation.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-26 12:46:10 +03:00
Ernest Zaslavsky
cb2aa85cf5 aws_error: handle all restartable nested exception types
Previously we only inspected std::system_error inside
std::nested_exception to support a specific TLS-related failure
mode. However, nested exceptions may contain any type, including
other restartable (retryable) errors. This change unwraps one
nested exception per iteration and re-applies all known handlers
until a match is found or the chain is exhausted.

Closes scylladb/scylladb#28240
2026-01-26 10:19:57 +03:00