Commit Graph

2245 Commits

Author SHA1 Message Date
Tomasz Grabiec
9daed59af9 Merge 'Tablet-aware restore' from Pavel Emelyanov
The mechanics of the restore are as follows:

- A /storage_service/tablets/restore API is called with (keyspace, table, endpoint, bucket, manifests) parameters
  - First, it populates the system_distributed.snapshot_sstables table with the data read from the manifests
  - Then it emplaces a bunch of tablet transitions (of a new "restore" kind), one for each tablet
- The topology coordinator handles the "restore" transition by calling a new RESTORE_TABLET RPC against all the current tablet replicas
- Each replica handles the RPC verb by
  - Reading the snapshot_sstables table
  - Filtering the read sstable infos against current node and tablet being handled
  - Downloading and attaching the filtered sstables

This PR includes system_distributed.snapshot_sstables table from @robertbindar and preparation work from @kreuzerkrieg that extracts raw sstables downloading and attaching from existing generic sstables loading code.

This is the first step towards SCYLLADB-197 and lacks many things. In particular:
- the API only works for single-DC cluster
- the caller needs to "lock" tablet boundaries with min/max tablet count
- not abortable
- no progress tracking
- sub-optimal (re-kicking the restore API re-downloads everything)
- not re-attachable (if the API node dies, restoration proceeds, but the caller cannot "wait" for it to complete via another node)
- nodes download sstables in the maintenance/streaming sched group (should be moved to maintenance/backup)

Other follow-up items:
- have an actual swagger object specification for `backup_location`

Closes #28436
Closes #28657
Closes #28773

Closes scylladb/scylladb#28763

* github.com:scylladb/scylladb:
  test: Add test for backup vs migration race
  test: Restore resilience test
  sstables_loader: Fail tablet-restore task if not all sstables were downloaded
  sstables_loader: mark sstables as downloaded after attaching
  sstables_loader: return shared_sstable from attach_sstable
  db: add update_sstable_download_status method
  db: add downloaded column to snapshot_sstables
  db: extract snapshot_sstables TTL into class constant
  test: Add a test for tablet-aware restore
  tablets: Implement tablet-aware cluster-wide restore
  messaging: Add RESTORE_TABLET RPC verb
  sstables_loader: Add method to download and attach sstables for a tablet
  tablets: Add restore_config to tablet_transition_info
  sstables_loader: Add restore_tablets task skeleton
  test: Add rest_client helper to kick newly introduced API endpoint
  api: Add /storage_service/tablets/restore endpoint skeleton
  sstables_loader: Add keyspace and table arguments to manifest loading helper
  sstables_loader_helpers: just reformat the code
  sstables_loader_helpers: generalize argument and variable names
  sstables_loader_helpers: generalize get_sstables_for_tablet
  sstables_loader_helpers: add token getters for tablet filtering
  sstables_loader_helpers: remove underscores from struct members
  sstables_loader: move download_sstable and get_sstables_for_tablet
  sstables_loader: extract single-tablet SST filtering
  sstables_loader: make download_sstable static
  sstables_loader: fix formatting of the new `download_sstable` function
  sstables_loader: extract single SST download into a function
  sstables_loader: add shard_id to minimal_sst_info
  sstables_loader: add function for parsing backup manifests
  split utility functions for creating test data from database_test
  export make_storage_options_config from lib/test_services
  rjson: Add helpers for conversions to dht::token and sstable_id
  Add system_distributed_keyspace.snapshot_sstables
  add get_system_distributed_keyspace to cql_test_env
  code: Add system_distributed_keyspace dependency to sstables_loader
  storage_service: Export handle_raft_rpc() helper
  storage_service: Export do_tablet_operation()
  storage_service: Split transit_tablet() into two
  tablets: Add braces around tablet_transition_kind::repair switch
2026-04-21 02:27:24 +02:00
Botond Dénes
69c58c6589 Merge 'streaming: add oos protection in mutation based streaming' from Łukasz Paszkowski
The mutation-fragment-based streaming path in `stream_session.cc` did not check whether the receiving node was in critical disk utilization mode before accepting incoming mutation fragments. This meant that operations like `nodetool refresh --load-and-stream`, which stream data through the `STREAM_MUTATION_FRAGMENTS` RPC handler, could push data onto a node that had already reached critical disk usage.

The file-based streaming path in stream_blob.cc already had this protection, but the load&stream path was missing it.

This patch adds a check for `is_in_critical_disk_utilization_mode()` in the `stream_mutation_fragments` handler in `stream_session.cc`, throwing a `replica::critical_disk_utilization_exception` when the node is at critical disk usage. This mirrors the existing protection in the blob streaming path and closes the gap that allowed data to be written to a node that should have been rejecting all incoming writes.
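
The shape of such a guard can be sketched as follows. This is a minimal model, not the code from `stream_session.cc`: `receiving_node`, `critical_disk_utilization_error` and the `int` fragments are hypothetical stand-ins, and only the method name `is_in_critical_disk_utilization_mode()` comes from the description above.

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for replica::critical_disk_utilization_exception.
struct critical_disk_utilization_error : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical model of the receiving node; not the real ScyllaDB type.
struct receiving_node {
    bool critical_disk_utilization = false;
    std::vector<int> applied_fragments;

    bool is_in_critical_disk_utilization_mode() const {
        return critical_disk_utilization;
    }
};

// Reject the whole stream up front, before any fragment is written.
void handle_stream_mutation_fragments(receiving_node& node,
                                      const std::vector<int>& fragments) {
    if (node.is_in_critical_disk_utilization_mode()) {
        throw critical_disk_utilization_error(
            "critical disk utilization: rejecting streamed mutation fragments");
    }
    node.applied_fragments.insert(node.applied_fragments.end(),
                                  fragments.begin(), fragments.end());
}
```

Checking before the first fragment is applied means a rejected stream leaves no partial data behind on the full node.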

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-901

The out of space prevention mechanism was introduced in 2025.4. The fix should be backported there and all later versions.

Closes scylladb/scylladb#28873

* github.com:scylladb/scylladb:
  streaming: reject mutation fragments on critical disk utilization
  test/cluster/storage: Add a reproducer for load-and-stream out-of-space rejection
  sstables: clean up TemporaryHashes file in wipe()
  sstables: add error injection point in write_components
  test/cluster/storage: extract validate_data_existence to module scope
  test/cluster: enable suppress_disk_space_threshold_checks in tests using data_file_capacity
  utils/disk_space_monitor: add error injection to suppress threshold checks
2026-04-20 17:56:36 +03:00
Nadav Har'El
f83270df12 Merge 'alternator/streams: Block tablet merges for Alternator Streams on tablet tables' from Piotr Szymaniak
DynamoDB Streams API can only convey a single parent per stream shard.
Tablet merges produce two parents, making them incompatible with
Alternator Streams. This series blocks tablet merges when streams are
active on a tablet table.

For CreateTable, a freshly created table has no pending merges, so
streams are enabled immediately with tablet merges blocked.

For UpdateTable on an existing table, stream enablement is deferred:
the user's intent is stored via `enable_requested`, tablet merges are
blocked (new merge decisions are suppressed and any active merge
decision is revoked), and the topology coordinator finalizes enablement
once no in-flight merges remain.

The topology coordinator is woken promptly on error injection release
and tablet split completion, reducing finalization latency from ~60s
to seconds.

`test_parent_children_merge` is marked xfail (merges are now blocked),
and downward (merge) steps are removed from `test_parent_filtering` and
`test_get_records_with_alternating_tablets_count`.

Not addressed here: using a topology request to preempt long-running
operations like repair (tracked in SCYLLADB-1304).

Refs SCYLLADB-461

Closes scylladb/scylladb#29224

* github.com:scylladb/scylladb:
  topology: Wake coordinator promptly for stream enablement lifecycle
  test/cluster: Test deferred stream enablement on tablet tables
  alternator/streams: Block tablet merges when Alternator Streams are enabled
2026-04-19 09:15:13 +03:00
Piotr Szymaniak
a2a0868c7d topology: Wake coordinator promptly for stream enablement lifecycle
The topology coordinator sleeps on a condition variable between
iterations. Several events relevant to Alternator stream enablement
did not wake it, causing delays of up to 60s (the periodic load
stats refresh interval) at each step:

1. Error injection release: when a test disables the
   delay_cdc_stream_finalization injection, the coordinator was
   not notified. Add an on_disable callback mechanism to the error
   injection framework (register_on_disable / unregister_on_disable)
   so subsystems can react when an injection is released. The
   topology coordinator uses this to broadcast its event.

2. Tablet split completion: after all local storage groups for a
   table finish splitting, split_ready_seq_number is set but the
   coordinator only discovered this via the periodic stats refresh.
   Add an on_tablet_split_ready callback to topology_state_machine
   that the coordinator sets to trigger_load_stats_refresh(). The
   split monitor in storage_service calls it when all compaction
   groups are split-ready, giving the coordinator fresh stats
   immediately so it can finalize the resize.

These changes reduce test_deferred_stream_enablement_on_tablets
from ~120s to ~13s and fix a production issue where Alternator
stream enablement could be delayed by up to 60s at each step of
the lifecycle (error injection release, split completion).
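
A rough sketch of the on_disable idea, under stated assumptions: the class below is a hypothetical single-threaded registry, and the signatures of `register_on_disable` / `unregister_on_disable` are guesses for illustration, not the error injection framework's actual API.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical, single-threaded model of an error injection registry.
class injection_registry {
    std::map<std::string, bool> _enabled;
    std::map<std::string, std::map<std::uint64_t, std::function<void()>>> _on_disable;
    std::uint64_t _next_id = 0;
public:
    void enable(const std::string& name) { _enabled[name] = true; }

    // Returns a handle that unregister_on_disable() accepts later.
    std::uint64_t register_on_disable(const std::string& name, std::function<void()> cb) {
        const std::uint64_t id = _next_id++;
        _on_disable[name].emplace(id, std::move(cb));
        return id;
    }

    void unregister_on_disable(const std::string& name, std::uint64_t id) {
        auto it = _on_disable.find(name);
        if (it != _on_disable.end()) {
            it->second.erase(id);
        }
    }

    // Disabling an injection notifies every subscriber, for example to wake
    // a sleeping coordinator. Callbacks must not (un)register from inside.
    void disable(const std::string& name) {
        _enabled[name] = false;
        auto it = _on_disable.find(name);
        if (it == _on_disable.end()) {
            return;
        }
        for (auto& entry : it->second) {
            entry.second();
        }
    }
};
```

A subscriber such as the topology coordinator would register a callback that broadcasts its event, and unregister it on shutdown so the registry never calls into a destroyed object.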
2026-04-19 03:54:33 +02:00
Nikos Dragazis
a00056381f utils: Add UUID::is_name_based()
The UUID class already provides `is_timestamp()` for identifying
time-based (version 1) UUIDs. Add the analogous `is_name_based()`
predicate for version 3 (name-based) UUIDs, along with a test.
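
For illustration, both predicates reduce to reading the 4-bit version field defined by RFC 4122. The struct below is a hypothetical minimal model rather than the real `utils::UUID` class; it assumes the UUID is stored as two 64-bit halves with the version in bits 12..15 of the most significant half.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical minimal UUID model: two 64-bit halves, as in RFC 4122.
struct simple_uuid {
    std::uint64_t most_sig_bits;
    std::uint64_t least_sig_bits;

    // The version lives in the top nibble of the time_hi_and_version field.
    int version() const {
        return static_cast<int>((most_sig_bits >> 12) & 0xf);
    }
    bool is_timestamp() const { return version() == 1; }   // time-based
    bool is_name_based() const { return version() == 3; }  // MD5 name-based
};
```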

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-04-17 20:58:39 +03:00
Botond Dénes
6eb2d15f39 Merge 'Replace CAS estimated histogram with estimated_histogram_with_max' from Amnon Heiman
ScyllaDB uses estimated_histogram in many places.
We already have a more efficient alternative: estimated_histogram_with_max. It is both CPU- and
memory-efficient, and it can be exported as Prometheus native histograms.

Its main limitation (which also has benefits) is that the bucket layout is fixed at compile time, so
histograms with different configurations cannot be mixed.

The end goal is to replace all uses of estimated_histogram in the codebase.
That migration requires a few small API adjustments, so it is done in steps.

This PR replaces estimated_histogram for CAS contention.
The PR includes a patch that adds functionality to the base approx_exponential_histogram, which will be used by the API.

The specific histograms are defined in a single place and cover the range 1-100; this makes future changes easy.

**New feature, no need to backport**

Closes scylladb/scylladb#29017

* github.com:scylladb/scylladb:
  storage_proxy: migrate CAS contention histograms to estimated_histogram_with_max
  estimated_histogram.hh: Add bucket offset and count to approx_exponential_histogram
2026-04-17 13:12:59 +03:00
Botond Dénes
33682fd14e Merge 'sstables/storage_manager: fix race between object storage config update and keyspace creation' from Dimitrios Symonidis
Previously, config_updater used a serialized_action to trigger update_config() when object_storage_endpoints changed. Because serialized_action::trigger() always schedules the action as a new reactor task (via semaphore::wait().then()), there was a window between the config value becoming visible to the REST API and update_config() actually running. This allowed a concurrent CREATE KEYSPACE to see the new endpoint via is_known_endpoint() before storage_manager had registered it in _object_storage_endpoints.

Now config observers run synchronously in a reactor turn and must not suspend. Split the previous monolithic async update_config() coroutine into two phases:

- Sync (in the observer, never suspends): storage_manager::_object_storage_endpoints is updated in place; for already-instantiated clients, update_config_sync swaps the new config atomically
- Async (per-client gate): background fibers finish the work that can't run in the observer — S3 refreshes credentials under _creds_sem; GCS drains and closes the replaced client.

Config reloads triggered by SIGHUP are applied on shard 0 and then broadcast to all other shards. An rwlock has also been introduced to make sure that the configuration has been propagated to all cores. This guarantees that a client requesting a config via the REST API will see a consistent snapshot.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-757
Fixes: [28141](https://github.com/scylladb/scylladb/issues/28141)

Closes scylladb/scylladb#28950

* github.com:scylladb/scylladb:
  test/object_store: verify object storage client creation and live reconfiguration
  sstables/utils/s3: split config update into sync and async parts
  test_config: improve logging for wait_for_config API
  db: introduce read-write lock to synchronize config updates with REST API
2026-04-16 10:20:43 +03:00
Łukasz Paszkowski
3726e31c03 utils/disk_space_monitor: add error injection to suppress threshold checks
Add the `suppress_disk_space_threshold_checks` error injection point
to the disk space monitor. When enabled, the threshold listener
short-circuits without evaluating disk utilization.

This is useful for tests that override disk capacity via `data_file_capacity`,
where the real disk usage causes the monitor to incorrectly report
critical utilization and activate out-of-space prevention mechanisms.
2026-04-16 08:38:33 +02:00
Dimitrios Symonidis
24a7b146fa sstables/utils/s3: split config update into sync and async parts
Config observers run synchronously in a reactor turn and must not
suspend. Split the previous monolithic async update_config() coroutine
into two phases:

Sync (runs in the observer, never suspends):
  - S3: atomically swap _cfg (lw_shared_ptr) and set a credentials
    refresh flag.
  - GCS: install a freshly constructed client; stash the old one for
    async cleanup.
  - storage_manager: update _object_storage_endpoints and fire the
    async cleanup via a gate-guarded background fiber.

Async (gate-guarded background fiber):
  - S3: acquire _creds_sem, invalidate and rearm credentials only if
    the refresh flag is set.
  - GCS: drain and close stashed old clients.
2026-04-15 14:28:31 +02:00
Botond Dénes
280fe7cfb7 Merge 'Make inclusion of config.hh cheaper' from Nadav Har'El
This is an attempt (mostly suggested and implemented by AI, but with a few hours of human babysitting...), to somewhat reduce compilation time by picking one template, named_value<T>, which is used in more than a hundred source files through the config.hh header, and making it use external instantiation: The different methods of named_value<T> for various T are instantiated only once (in config.cc), and the individual translation units don't need to compile them a hundred times.
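
The external-instantiation technique can be sketched in a single file. `named_value_sketch` is a hypothetical, heavily simplified stand-in for `named_value<T>`; in a real build the first half would live in the header and the second half in exactly one source file.

```cpp
#include <cassert>
#include <utility>

// --- what the header would contain ---
template <typename T>
class named_value_sketch {
    T _value;
public:
    explicit named_value_sketch(T value);  // declared only, defined out of line
    const T& operator()() const;
};

// Tell every includer not to instantiate the members for T = int implicitly.
extern template class named_value_sketch<int>;

// --- what a single .cc file would contain ---
template <typename T>
named_value_sketch<T>::named_value_sketch(T value) : _value(std::move(value)) {}

template <typename T>
const T& named_value_sketch<T>::operator()() const {
    return _value;
}

// The one explicit instantiation that all other translation units link against.
template class named_value_sketch<int>;
```

The members are deliberately defined outside the class body: member functions defined inside the class are implicitly inline, and the extern template declaration does not stop those from being emitted in every translation unit.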

The resulting saving is a little underwhelming: the total object-file size goes down about 1% (from 346,200 before the patch to 343,488 after the patch), and previous experience shows that this object-file size is proportional to the compilation time, most of which involves code generation. But I haven't been able to measure a speedup of the build itself.

1% is not nothing, but not a huge saving either. Though arguably, with 50 more of these patches, we can make the build twice as fast :-)

Refs #1.

Closes scylladb/scylladb#28992

* github.com:scylladb/scylladb:
  config: move named_value<T> method bodies out-of-line
  config: suppress named_value<T> instantiation in every source file
2026-04-15 14:40:15 +03:00
Tomasz Grabiec
266a225416 utils: avoid exceptions in disk_space_monitor polling loop
The poll loop used condition_variable::wait(timeout) to sleep between
iterations. On every normal timeout expiry, this threw a
condition_variable_timed_out exception, which incremented the C++
exception metric and triggered false alerts for support.

Replace the timed wait with a seastar::timer that broadcasts the
condition variable on expiry, combined with an untimed wait(). The
timer is cancelled automatically on scope exit when the wait is woken
early by trigger_poll() or abort.

Fixes SCYLLADB-1477

Closes scylladb/scylladb#29438
2026-04-15 14:40:15 +03:00
Robert Bindar
fd43995c11 rjson: Add helpers for conversions to dht::token and sstable_id
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-04-14 11:06:13 +03:00
Avi Kivity
0ae22a09d4 LICENSE: Update to version 1.1
Updated terms of non-commercial use (must be a never-customer).
2026-04-12 19:46:33 +03:00
Ernest Zaslavsky
1702d6e6d4 s3_client: pass through abort_source in copy_object
The abort_source parameter in s3::client::copy_object
was ignored — the function accepted it but always passed
nullptr to the underlying copy_s3_object. Forward it
properly so callers can cancel in-progress copies.
2026-04-07 18:16:52 +03:00
Ernest Zaslavsky
bfdc1e5267 gcp_client: fix copy_object request method and body
The GCP copy_object (rewrite API) had two bugs:

1. The request body was an empty string, but the GCP
   rewrite endpoint always parses it as JSON metadata.
   An empty string is not valid JSON, resulting in
   400 "Metadata in the request couldn't decode".
   Fix: send "{}" (empty JSON object) as the body.

2. The HTTP method was PUT, but the GCP Objects: rewrite
   API requires POST per the documentation.
   Fix: use POST.

Test coverage in a follow-up patch
2026-04-07 18:16:52 +03:00
Ernest Zaslavsky
ba785f6cab s3_client: use lowres_system_clock for aws_sigv4
Switch aws_sigv4 to lowres_system_clock since it is not affected by
time offsets often introduced in tests, which can skew db_clock. S3
requests cannot represent time shifts greater than 15 minutes from
server time, so a stable clock is required.
2026-04-05 11:07:17 +03:00
Ernest Zaslavsky
e08d779922 s3_client: add object_exists helper
Introduce `object_exists` to the S3 client to check whether an object
exists. This is primarily useful for test scenarios.
2026-04-05 11:07:16 +03:00
Ernest Zaslavsky
016b344a8a gcs_client: add object_exists helper
Introduce `object_exists` to the GCS client to check whether an object
exists. This is primarily useful for test scenarios.
2026-04-05 11:07:16 +03:00
Piotr Dulikowski
df68d0c0f7 directories: add missing seastar/util/closeable.hh include
Without this include the file would not compile on its own. The issue
was most likely masked by the use of precompiled headers in our CI.

Closes scylladb/scylladb#29170
2026-03-23 15:46:56 +03:00
Pavel Emelyanov
c4a0f6f2e6 object_store: Don't leave dangling objects by iterating moved-from names vector
The code in upload_file std::move()-s the vector of names into the
merge_objects() method, then iterates over this vector to delete
objects. Iterating over the moved-from vector is a no-op, so the
objects are left dangling.

The fix is to make merge_objects() helper get vector of names by const
reference -- the method doesn't modify the names collection, the caller
keeps one in stable storage.
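
The bug pattern and the fix can be reproduced in isolation. The functions below are hypothetical stand-ins, not the object_store code; they only count how many names the cleanup loop would visit.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for merge_objects(): both only read the names.
void merge_objects_by_value(std::vector<std::string> names) { (void)names; }
void merge_objects_by_cref(const std::vector<std::string>& names) { (void)names; }

// Buggy shape: move-constructing the parameter leaves `names` empty,
// so the cleanup loop below visits nothing.
std::size_t cleanup_after_merge_buggy(std::vector<std::string> names) {
    merge_objects_by_value(std::move(names));
    std::size_t deleted = 0;
    for (const auto& name : names) {  // iterates a moved-from vector
        (void)name;
        ++deleted;
    }
    return deleted;
}

// Fixed shape: the helper borrows the names, the caller keeps them.
std::size_t cleanup_after_merge_fixed(std::vector<std::string> names) {
    merge_objects_by_cref(names);
    std::size_t deleted = 0;
    for (const auto& name : names) {
        (void)name;
        ++deleted;
    }
    return deleted;
}
```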

Fixes #29060

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29061
2026-03-20 10:09:30 +02:00
Pavel Emelyanov
712ba5a31f utils: Use yielding directory_lister in owner verification
Switch directories::do_verify_owner_and_mode() from lister::scan_dir() to
utils::directory_lister while preserving the previous hidden-entry
behavior.

Make do_verify_subpath use lister::filter_type directly so the
verification helper can pass it straight into directory_lister, and keep
a single yielding iteration loop for directory traversal.

One fewer scan_dir user, on the way towards removing scan_dir from the code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29064
2026-03-20 10:08:38 +02:00
Pavel Emelyanov
961fc9e041 s3: Don't rearm credential timers when credentials are not refreshed
The update_credentials_and_rearm() may get "empty" credentials from
_creds_provider_chain.get_aws_credentials() -- it doesn't throw, but
returns default-initialized value. In that case the expires_at will be
set to time_point::min, and it's probably not a good idea to arm the
refresh timer and, even worse idea, to subtract 1h from it.
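
A minimal sketch of the guard, assuming that "empty" credentials are recognizable by `expires_at == time_point::min()`. The struct and function names are hypothetical, and the one-hour margin is taken from the description above.

```cpp
#include <cassert>
#include <chrono>
#include <optional>

using cred_clock = std::chrono::system_clock;

// Hypothetical credentials record; a default-constructed one models the
// "empty" result returned by the provider chain.
struct aws_credentials_sketch {
    cred_clock::time_point expires_at = cred_clock::time_point::min();

    bool empty() const {
        return expires_at == cred_clock::time_point::min();
    }
};

// Returns when the refresh timer should fire, or nullopt when it must not
// be armed at all.
std::optional<cred_clock::time_point> next_refresh(const aws_credentials_sketch& creds) {
    if (creds.empty()) {
        // time_point::min() - 1h would overflow the underlying integer.
        return std::nullopt;
    }
    return creds.expires_at - std::chrono::hours(1);
}
```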

Fixes #29056

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29057
2026-03-20 10:07:01 +02:00
Pavel Emelyanov
0a8dc4532b s3: Fix missing upload ID in copy_part trace log
The format string had two {} placeholders but three arguments, so the
_upload_id argument was never formatted.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29053
2026-03-20 10:05:44 +02:00
Avi Kivity
5e7fb08bf3 Merge 'Fix bad performance for densely populated partition index pages' from Tomasz Grabiec
This applies to small-partition workloads where index pages have a high partition count and the index doesn't fit in cache. It was observed that the count can be in the order of hundreds. In such a workload pages undergo constant population, LSA compaction, and LSA eviction, which has a severe impact on CPU utilization.

Refs https://scylladb.atlassian.net/browse/SCYLLADB-620

This PR reduces the impact by several changes:

  - reducing memory footprint in the partition index. Assuming partition key size is 16 bytes, the cost dropped from 96 bytes to 36 bytes per partition.

  - flattening the object graph and amortizing storage. Storing entries directly in the vector. Storing all key values in a single managed_bytes. Making index_entry a trivial struct.

  - index entries and key storage are now trivially moveable, and batched inside vector storage
    so LSA migration can use memcpy(), which amortizes the cost per key. This reduces the cost of LSA segment compaction.

  - LSA eviction is now pretty much constant time for the whole page
    regardless of the number of entries, because elements are trivial and batched inside vectors.
    Page eviction cost dropped from 50 us to 1 us.
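
The flattened layout described in the list above can be illustrated with a small model. This is only a sketch: `index_entry_sketch` and `index_page_sketch` are hypothetical, and they use `std::vector` and `std::string` where the real code uses LSA-managed containers such as `managed_bytes`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <type_traits>
#include <vector>

// Hypothetical trivial entry: no pointer to a separately allocated key.
struct index_entry_sketch {
    std::uint32_t key_offset;  // where this entry's key starts in the shared buffer
    std::uint64_t position;    // data file position of the partition
};
static_assert(std::is_trivially_copyable_v<index_entry_sketch>);

class index_page_sketch {
    std::vector<index_entry_sketch> _entries;  // entries stored by value
    std::string _keys;                         // all key bytes, back to back
public:
    void add(std::string_view key, std::uint64_t position) {
        _entries.push_back({static_cast<std::uint32_t>(_keys.size()), position});
        _keys.append(key);
    }

    std::size_t size() const { return _entries.size(); }

    // A key ends where the next one starts (or at the end of the buffer).
    std::string_view key(std::size_t i) const {
        assert(i < _entries.size());
        const std::size_t begin = _entries[i].key_offset;
        const std::size_t end = (i + 1 < _entries.size())
                ? _entries[i + 1].key_offset
                : _keys.size();
        return std::string_view(_keys).substr(begin, end - begin);
    }

    std::uint64_t position(std::size_t i) const {
        assert(i < _entries.size());
        return _entries[i].position;
    }
};
```

With this shape a page owns two allocations regardless of how many partitions it holds, so moving or freeing it does not require visiting each entry.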

Performance evaluated with:

   scylla perf-simple-query -c1 -m200M --partitions=1000000

Before:

```
7774.96 tps (166.0 allocs/op, 521.7 logallocs/op,  54.0 tasks/op,  802428 insns/op,  430457 cycles/op,        0 errors)
7511.08 tps (166.1 allocs/op, 527.2 logallocs/op,  54.0 tasks/op,  804185 insns/op,  430752 cycles/op,        0 errors)
7740.44 tps (166.3 allocs/op, 526.2 logallocs/op,  54.2 tasks/op,  805347 insns/op,  432117 cycles/op,        0 errors)
7818.72 tps (165.2 allocs/op, 517.6 logallocs/op,  53.7 tasks/op,  794965 insns/op,  427751 cycles/op,        0 errors)
7865.49 tps (165.1 allocs/op, 513.3 logallocs/op,  53.6 tasks/op,  788898 insns/op,  425171 cycles/op,        0 errors)
```

After (+318%):

```
32492.40 tps (130.7 allocs/op,  12.8 logallocs/op,  36.1 tasks/op,  109236 insns/op,  103203 cycles/op,        0 errors)
32591.99 tps (130.4 allocs/op,  12.8 logallocs/op,  36.0 tasks/op,  108947 insns/op,  102889 cycles/op,        0 errors)
32514.52 tps (130.6 allocs/op,  12.8 logallocs/op,  36.0 tasks/op,  109118 insns/op,  103219 cycles/op,        0 errors)
32491.14 tps (130.6 allocs/op,  12.8 logallocs/op,  36.0 tasks/op,  109349 insns/op,  103272 cycles/op,        0 errors)
32582.90 tps (130.5 allocs/op,  12.8 logallocs/op,  36.0 tasks/op,  109269 insns/op,  102872 cycles/op,        0 errors)
32479.43 tps (130.6 allocs/op,  12.8 logallocs/op,  36.0 tasks/op,  109313 insns/op,  103242 cycles/op,        0 errors)
32418.48 tps (130.7 allocs/op,  12.8 logallocs/op,  36.1 tasks/op,  109201 insns/op,  103301 cycles/op,        0 errors)
31394.14 tps (130.7 allocs/op,  12.8 logallocs/op,  36.1 tasks/op,  109267 insns/op,  103301 cycles/op,        0 errors)
32298.55 tps (130.7 allocs/op,  12.8 logallocs/op,  36.1 tasks/op,  109323 insns/op,  103551 cycles/op,        0 errors)
```

When the workload is miss-only, with both row cache and index cache disabled (no cache maintenance cost):

  perf-simple-query -c1 -m200M --duration 6000 --partitions=100000 --enable-index-cache=0 --enable-cache=0

Before:

```
9124.57 tps (146.2 allocs/op, 789.0 logallocs/op,  45.3 tasks/op,  889320 insns/op,  357937 cycles/op,        0 errors)
9437.23 tps (146.1 allocs/op, 789.3 logallocs/op,  45.3 tasks/op,  889613 insns/op,  357782 cycles/op,        0 errors)
9455.65 tps (146.0 allocs/op, 787.4 logallocs/op,  45.2 tasks/op,  887606 insns/op,  357167 cycles/op,        0 errors)
9451.22 tps (146.0 allocs/op, 787.4 logallocs/op,  45.3 tasks/op,  887627 insns/op,  357357 cycles/op,        0 errors)
9429.50 tps (146.0 allocs/op, 787.4 logallocs/op,  45.3 tasks/op,  887761 insns/op,  358148 cycles/op,        0 errors)
9430.29 tps (146.1 allocs/op, 788.2 logallocs/op,  45.3 tasks/op,  888501 insns/op,  357679 cycles/op,        0 errors)
9454.08 tps (146.0 allocs/op, 787.3 logallocs/op,  45.3 tasks/op,  887545 insns/op,  357132 cycles/op,        0 errors)
```

After (+55%):

```
14484.84 tps (150.7 allocs/op,   6.5 logallocs/op,  44.7 tasks/op,  396164 insns/op,  229490 cycles/op,        0 errors)
14526.21 tps (150.8 allocs/op,   6.5 logallocs/op,  44.8 tasks/op,  396401 insns/op,  228824 cycles/op,        0 errors)
14567.53 tps (150.7 allocs/op,   6.5 logallocs/op,  44.7 tasks/op,  396319 insns/op,  228701 cycles/op,        0 errors)
14545.63 tps (150.6 allocs/op,   6.5 logallocs/op,  44.7 tasks/op,  395889 insns/op,  228493 cycles/op,        0 errors)
14626.06 tps (150.5 allocs/op,   6.5 logallocs/op,  44.7 tasks/op,  395254 insns/op,  227891 cycles/op,        0 errors)
14593.74 tps (150.5 allocs/op,   6.5 logallocs/op,  44.7 tasks/op,  395480 insns/op,  227993 cycles/op,        0 errors)
14538.10 tps (150.8 allocs/op,   6.5 logallocs/op,  44.8 tasks/op,  397035 insns/op,  228831 cycles/op,        0 errors)
14527.18 tps (150.8 allocs/op,   6.5 logallocs/op,  44.8 tasks/op,  396992 insns/op,  228839 cycles/op,        0 errors)
```

Same as above, but with summary ratio increased from 0.0005 to 0.005 (smaller pages):

Before:

```
33906.70 tps (146.1 allocs/op,  83.6 logallocs/op,  45.1 tasks/op,  170553 insns/op,   98104 cycles/op,        0 errors)
32696.16 tps (146.0 allocs/op,  83.5 logallocs/op,  45.1 tasks/op,  170369 insns/op,   98405 cycles/op,        0 errors)
33889.05 tps (146.1 allocs/op,  83.6 logallocs/op,  45.1 tasks/op,  170551 insns/op,   98135 cycles/op,        0 errors)
33893.24 tps (146.1 allocs/op,  83.5 logallocs/op,  45.1 tasks/op,  170488 insns/op,   98168 cycles/op,        0 errors)
33836.73 tps (146.1 allocs/op,  83.6 logallocs/op,  45.1 tasks/op,  170528 insns/op,   98226 cycles/op,        0 errors)
33897.61 tps (146.0 allocs/op,  83.5 logallocs/op,  45.1 tasks/op,  170428 insns/op,   98081 cycles/op,        0 errors)
33834.73 tps (146.1 allocs/op,  83.5 logallocs/op,  45.1 tasks/op,  170438 insns/op,   98178 cycles/op,        0 errors)
33776.31 tps (146.3 allocs/op,  83.9 logallocs/op,  45.2 tasks/op,  170958 insns/op,   98418 cycles/op,        0 errors)
33808.08 tps (146.3 allocs/op,  83.9 logallocs/op,  45.2 tasks/op,  170940 insns/op,   98388 cycles/op,        0 errors)
```

After (+18%):

```
40081.51 tps (148.2 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  121047 insns/op,   82231 cycles/op,        0 errors)
40005.85 tps (148.6 allocs/op,   4.4 logallocs/op,  45.2 tasks/op,  121327 insns/op,   82545 cycles/op,        0 errors)
39816.75 tps (148.3 allocs/op,   4.4 logallocs/op,  45.1 tasks/op,  121067 insns/op,   82419 cycles/op,        0 errors)
39953.11 tps (148.1 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  121027 insns/op,   82258 cycles/op,        0 errors)
40073.96 tps (148.2 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  121006 insns/op,   82313 cycles/op,        0 errors)
39882.25 tps (148.2 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  120925 insns/op,   82320 cycles/op,        0 errors)
39916.08 tps (148.3 allocs/op,   4.4 logallocs/op,  45.1 tasks/op,  121054 insns/op,   82393 cycles/op,        0 errors)
39786.30 tps (148.2 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  121027 insns/op,   82465 cycles/op,        0 errors)
38662.45 tps (148.3 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  121108 insns/op,   82312 cycles/op,        0 errors)
39849.42 tps (148.3 allocs/op,   4.4 logallocs/op,  45.1 tasks/op,  121098 insns/op,   82447 cycles/op,        0 errors)
```

Closes scylladb/scylladb#28603

* github.com:scylladb/scylladb:
  sstables: mx: index_reader: Optimize parsing for no promoted index case
  vint: Use std::countl_zero()
  test: sstable_partition_index_cache_test: Validate scenario of pages with sparse promoted index placement
  sstables: mx: index_reader: Amortize partition key storage
  managed_bytes: Hoist write_fragmented() to common header
  utils: managed_vector: Use std::uninitialized_move() to move objects
  sstables: mx: index_reader: Keep promoted_index info next to index_entry
  sstables: mx: index_reader: Extract partition_index_page::clear_gently()
  sstables: mx: index_reader: Shave-off 16 bytes from index_entry by using raw_token
  sstables: mx: index_reader: Reduce allocation_section overhead during index page parsing by batching allocation
  sstables: mx: index_reader: Keep index_entry directly in the vector
  dht: Introduce raw_token
  test: perf_simple_query: Add 'sstable-format' command-line option
  test: perf_simple_query: Add 'sstable-summary-ratio' command-line option
  test: perf-simple-query: Add option to disable index cache
  test: cql_test_env: Respect enable-index-cache config
2026-03-19 14:42:50 +02:00
Pavel Emelyanov
d6c01be09b s3/client: Don't reconstruct regex on every parse_content_range call
Make the pattern static const so it is compiled once at first call rather
than on every Content-Range header parse.
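
A sketch of the pattern using `std::regex`. The pattern string, the `content_range` struct and the function signature are assumptions for illustration and may differ from the actual s3 client code.

```cpp
#include <cassert>
#include <optional>
#include <regex>
#include <string>

// Hypothetical result type for a "bytes first-last/total" header value.
struct content_range {
    unsigned long long first;
    unsigned long long last;
    unsigned long long total;
};

std::optional<content_range> parse_content_range(const std::string& header) {
    // Function-local static: the regex is compiled once, on the first call,
    // instead of on every parse.
    static const std::regex pattern(R"(bytes (\d+)-(\d+)/(\d+))");

    std::smatch match;
    if (!std::regex_match(header, match, pattern)) {
        return std::nullopt;
    }
    return content_range{std::stoull(match[1].str()),
                         std::stoull(match[2].str()),
                         std::stoull(match[3].str())};
}
```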

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29054
2026-03-18 17:56:33 +02:00
Tomasz Grabiec
1452e92567 managed_bytes: Hoist write_fragmented() to common header
2026-03-18 16:25:20 +01:00
Tomasz Grabiec
75e6412b1c utils: managed_vector: Use std::uninitialized_move() to move objects
It's shorter, and is supposed to be optimized for trivially-moveable
types.

Important for managed_vector<index_entry>, which can have lots of
elements.
2026-03-18 16:25:20 +01:00
Botond Dénes
5d868dcc55 Merge 's3_client: fix s3::range max value for object size' from Ernest Zaslavsky
- fix the s3::range max value for object size, which is 50TiB and not 5TiB.
- refactor the constants to make them accessible to all interested parties, and reuse these constants in tests

No need to backport, doubt we will encounter an object larger than 5TiB

Closes scylladb/scylladb#28601

* github.com:scylladb/scylladb:
  s3_client: reorganize tests in part_size_calculation_test
  s3_client: switch using s3 limits constants in tests
  s3_client: fix the s3::range max object size
  s3_client: remove "aws" prefix from object limits constants
  s3_client: make s3 object limits accessible
2026-03-17 16:34:42 +02:00
Avi Kivity
76b6784c1a Merge 'cql3: track CQL parsing memory cost and use it for admission control' from Marcin Maliszkiewicz
Use rolling_max_tracker to record gross bytes allocated during each
CQL parse.  The rolling maximum is then added to the memory estimate
for incoming QUERY and PREPARE requests so that the admission control
in the CQL transport layer accounts for parsing overhead.
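
One way to picture the mechanism, as a sketch only: the class below keeps the maximum over a fixed window of recent samples. The real `rolling_max_tracker` may be implemented differently; the class name, window policy and `estimate_request_memory` helper are all hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical rolling maximum over the last `window_size` samples.
class rolling_max_sketch {
    std::vector<std::size_t> _window;
    std::size_t _next = 0;
public:
    explicit rolling_max_sketch(std::size_t window_size) : _window(window_size, 0) {
        assert(window_size > 0);
    }

    // Record the bytes allocated by one parse, overwriting the oldest sample.
    void add_sample(std::size_t bytes) {
        _window[_next] = bytes;
        _next = (_next + 1) % _window.size();
    }

    std::size_t current_max() const {
        return *std::max_element(_window.begin(), _window.end());
    }
};

// Admission control charges the request size plus the recent worst-case
// parsing cost.
std::size_t estimate_request_memory(std::size_t request_bytes,
                                    const rolling_max_sketch& parse_cost) {
    return request_bytes + parse_cost.current_max();
}
```

Because old samples fall out of the window, one unusually expensive parse inflates the estimate only temporarily.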

The measured memory footprint serves as an upper bound rather than an
exact number, but its purpose is to prevent OOMs under a heavy load of
unprepared statements.

In a benchmark, a node with 1G of memory shows non-LSA memory usage
dropping from a peak of 320MB (our coordinator budget is 10% of 1G) to
96MB, while tps drops from 1.2 kops to 0.8 kops. The drop in tps is
expected as memory admission kicks in to prevent OOM.

This is phase 1 of OOM prevention, potential next steps:
- add second admission in query_processor::get_statement trying to prevent potential thundering herd problem
- decrease cql_server memory pool size
- count reads in the memory pool
- add per service level memory pool and a shared one

Related https://scylladb.atlassian.net/browse/SCYLLADB-740
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-938

Backport: no, new feature, but we may reconsider if some customer needs it

Closes scylladb/scylladb#28919

* github.com:scylladb/scylladb:
  cql3: track CQL parsing memory cost and use it for admission control
  utils: add rolling max tracker
2026-03-12 19:59:52 +02:00
Amnon Heiman
cedd049218 estimated_histogram.hh: Add bucket offset and count to approx_exponential_histogram
Add utility accessors to approx_exponential_histogram to export bucket
boundaries and bucket counts in a form suitable for display/tests when
Min < Precision causes repeated integer limits.

Add MAX compile-time constant alias for the template Max parameter.
Add get_buckets_offsets() to return bucket lower limits with duplicate
adjacent limits removed.

Add get_buckets_counts() to return counts aligned with the deduplicated
limits, merging counts from buckets that share the same lower limit.
Keep existing histogram behavior unchanged.
This new functionality is intended for API use and not for
performance-critical paths.
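
A rough sketch of the deduplication described above, assuming the raw bucket
limits and counts are available as two parallel vectors. The names
deduped_buckets_sketch and dedup_buckets are hypothetical and do not reflect
the actual accessor signatures:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result type: deduplicated lower limits and the counts aligned with them.
struct deduped_buckets_sketch {
    std::vector<uint64_t> offsets;
    std::vector<uint64_t> counts;
};

// Collapse adjacent buckets that share the same lower limit,
// summing their counts into a single entry.
inline deduped_buckets_sketch dedup_buckets(const std::vector<uint64_t>& limits,
                                            const std::vector<uint64_t>& counts) {
    assert(limits.size() == counts.size());
    deduped_buckets_sketch out;
    for (size_t i = 0; i < limits.size(); ++i) {
        if (!out.offsets.empty() && out.offsets.back() == limits[i]) {
            out.counts.back() += counts[i];   // same limit as the previous bucket: merge
        } else {
            out.offsets.push_back(limits[i]);
            out.counts.push_back(counts[i]);
        }
    }
    return out;
}
```

For limits 2 2 3 3 4 5 with counts 1 2 3 4 5 6, this yields limits 2 3 4 5
with counts 3 7 5 6.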

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2026-03-12 14:04:40 +01:00
Marcin Maliszkiewicz
5b2a07b408 utils: add rolling max tracker
We will use it later to track parser memory
usage via per query samples.

Tests runtime in dev: 1.6s
2026-03-12 08:56:41 +01:00
Amnon Heiman
b22162c719 estimated_histogram.hh: adds estimated_histogram_with_max
This patch adds the estimated_histogram_with_max template, which will be a
base for specific estimated_histograms, eventually replacing the current
struct implementation.

Introduce estimated_histogram_with_max<Max> as a reusable wrapper around
approx_exponential_histogram<1, Max, 4>, providing merge support and the
same add helpers used by the existing estimated_histogram type.

Add estimated_histogram_with_max_merge()

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2026-03-11 15:02:37 +02:00
Nadav Har'El
b411d436de config: move named_value<T> method bodies out-of-line
The previous commit added extern template declarations to suppress
named_value<T> instantiation in every translation unit, but those only
suppress non-inline members. All method bodies defined inside the class
body were inline and thus exempt from extern template, so they were
still emitted as weak symbols in every TU that used them.

Fix this by moving all named_value<T> method definitions out of the class
body in config_file.hh and into config_file_impl.hh as out-of-line template
definitions.  Since config_file_impl.hh is included only by db/config.cc,
utils/config_file.cc, sstables/compressor.cc, and
ent/encryption/encryption_config.cc, the method bodies are now compiled
in only those four TUs.

Also add the two missing explicit instantiation pairs that caused linker
errors:
- named_value<vector<object_storage_endpoint_param>> in db/config.cc
- named_value<encryption_config::string_string_map> in encryption_config.cc
2026-03-11 13:20:03 +02:00
Nadav Har'El
e0c13518ae config: suppress named_value<T> instantiation in every source file
config.hh is included by a large fraction of the codebase. It pulls in
utils/config_file.hh, whose named_value<T> template has its method
bodies defined in config_file_impl.hh. Those bodies depend on three of
the heaviest Boost headers – boost/program_options.hpp,
boost/lexical_cast.hpp, and boost/regex.hpp – as well as yaml-cpp.
Because the method bodies are reachable from config.hh, every
translation unit that includes config.hh was silently instantiating all
of named_value<T>'s methods (for each distinct T) and compiling that
Boost/yaml-cpp machinery from scratch.

Fix this by adding extern template struct declarations for all 32
distinct named_value<T> specialisations used by db::config:
- the 14 primitive/stdlib types go into utils/config_file.hh
- the 18 db-specific types (enum_option<…>, seed_provider_type, etc.)
  go into db/config.hh

Matching explicit template struct instantiation definitions are added in
db/config.cc, which is already the only translation unit that includes
config_file_impl.hh.  As a result the Boost/yaml-cpp template machinery
is compiled exactly once (in config.o) instead of being re-instantiated
in every including TU.
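
The pattern can be sketched in a single file as follows. The type
named_value_sketch and its to_text() method are hypothetical stand-ins; in a
real tree the first part would sit in the widely included header and the
second part in the one implementation file:

```cpp
#include <cassert>
#include <sstream>
#include <string>

// --- what the widely included header would contain ---
template <typename T>
struct named_value_sketch {
    T value;
    std::string to_text() const;   // declared only; the body lives elsewhere
};
// Explicit instantiation declaration: tells every including TU
// not to instantiate the members for T = int itself.
extern template struct named_value_sketch<int>;

// --- what the single implementation file would contain ---
template <typename T>
std::string named_value_sketch<T>::to_text() const {
    std::ostringstream os;          // the "heavy" dependency stays in this file
    os << value;
    return os.str();
}
// Explicit instantiation definition: the one place the members are compiled.
template struct named_value_sketch<int>;
```

Any specialisation that is used but has no matching explicit instantiation
definition would show up as an undefined symbol at link time.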

One subtlety: named_value<seed_provider_type> has an explicit member
specialisation of add_command_line_option.  Per [temp.expl.spec], such
a specialisation must be declared before any extern template declaration
of the enclosing class template, so a forward declaration of the
specialisation is added to config.hh ahead of the extern template line.

Also, for some of the types we explicitly instantiated in db/config.cc,
the named_value<T> constructor calls config_type_for<T>(), which we
also need to provide explicit specializations - some of them we already
had but some were missing.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-03-11 11:30:39 +02:00
Avi Kivity
c331796d28 Merge 'Support Min < Precision for approx_exponential_histogram' from Amnon Heiman
This series closes a gap in the approx_exponential_histogram implementation to
cover integer values starting from small Min values.

While the original implementation was focused on durations, where this limitation
was not an issue, over time, there has been a growing need for histograms that
cover smaller values, such as the number of SSTables or the number of items in a
batch.

The reason for the original limitation is inherent to the exponential histogram
math. The previous code required Min to be at least Precision to avoid negative
bit shifts in the exponential calculations.

After this series, approx_exponential_histogram allows Min to be smaller than
Precision by scaling values during indexing. The value is shifted left by
log2(Precision) minus log2(Min), or zero, whichever is larger, and the existing
exponential math is applied. Bucket limits are then scaled back to the original
units. This keeps insertion and retrieval O(1) without runtime branching, at the
cost of repeated bucket limits for some values in the Min to Precision range.

Additional tests cover the new behavior.
Relates to #2785

** New feature, no need to backport. **

Closes scylladb/scylladb#28371

* github.com:scylladb/scylladb:
  estimated_histogram_test.cc: add to_metrics_histogram test
  histogram_metrics_helper.hh: Support Min < Precision
  estimated_histogram_test.cc: Add tests for approx_exponential_histogram with Min<Precision
  estimated_histogram.hh: support Min less than Precision histograms
2026-03-04 12:43:26 +02:00
Botond Dénes
fcc570c697 Merge 'Exorcise assertions from Alternator, using a new throwing_assert() macro' from Nadav Har'El
assert() and SCYLLA_ASSERT() are evil (Refs #7871) because they can cause the entire Scylla cluster to crash mysteriously instead of cleanly failing the specific request that encountered a serious problem or a failed prerequisite.

In this two-patch series, in the first patch we introduce a new macro throwing_assert(), a convenient drop-in replacement for SCYLLA_ASSERT() but which has all the benefits of on_internal_error() instead of the dangers of SCYLLA_ASSERT().
In the second patch we use the new function to replace every call to SCYLLA_ASSERT() in Alternator by the new throwing_assert().

Here is an example from the second patch to demonstrate the power of this approach: The Alternator code uses the attrs_column() function to retrieve the ":attrs" column of a schema. Since every Alternator table always has an ":attrs" column in its schema, we felt safe to SCYLLA_ASSERT() that this column exists. However, imagine that one day because of a bug, one Alternator table is missing this column. Or maybe not a bug - maybe a malicious user on a shared cluster found a way to deliberately delete this column (e.g, with a CQL command!) and this check fails. Before this patch, the entire Scylla node will crash. If the same request is sent to all nodes - the entire cluster will crash. The user might not even know which request caused this crash. In contrast, after this patch, the specific operation - e.g., PutItem - will get an exception. Only this operation, and nothing else, will be aborted, and the user who sent this request will even get an "Internal Server Error" with the assertion-failure message, alerting them that this specific query is causing problems, while other queries might work normally.

There's no need to backport this patch - unless it becomes annoying that other branches don't have the throwing_assert() function and we want it to ease other backports.

Fixes #28308.

Closes scylladb/scylladb#28445

* github.com:scylladb/scylladb:
  alternator: replace SCYLLA_ASSERT with throwing_assert
  utils: introduce throwing_assert(), a safe replacement for assert
2026-02-27 15:35:36 +02:00
Amnon Heiman
0b4f28ae21 histogram_metrics_helper.hh: Support Min < Precision
to_metrics_histogram now collapses duplicate integer bucket bounds
caused by Min less than Precision scaling while always keeping native
histogram metadata.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2026-02-26 09:00:38 +02:00
Amnon Heiman
6c21e5f80c estimated_histogram.hh: support Min less than Precision histograms
approx_exponential_histogram is a pseudo exponential histogram implementation
that can insert and retrieve values into and from buckets in O(1) time.
The implementation uses power of two ranges and splits them linearly into
buckets. The number of buckets per power of two range is called Precision.

The original implementation aimed at covering large value ranges had a
limitation. The histogram Min value had to be greater than or equal to
Precision. As a result code that needs histograms for small integer values
could not use this implementation efficiently.

This change addresses that gap by handling the case where Min is less than
Precision. For Min smaller than Precision the value is scaled by a power of
two factor during indexing so the existing exponential math can be reused
without runtime branching. Bucket limits are scaled back to the original
units which can lead to repeated bucket limits in the Min to Precision
range for integer values.

Example with Min 2 and Precision 4:
Buckets: 2, 2, 3, 3, 4, 5, 6, 7, 8, 10, 12, 14, and so on

Implementation details:
- Introduce SHIFT based on log2(Precision) minus log2(Min) when positive
- Scale Min and Max by SHIFT for all exponential calculations
- Compute NUM_BUCKETS using the standard log2(Max / Min) formula
- Use the scaled value in find_bucket_index to avoid fractional bucket steps
- Return bucket limits by scaling back to original units
- Constraint relaxed from Min greater than or equal to Precision to allow any Min
  less than Max, still a power of two
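
The scaling can be illustrated with a small sketch that computes bucket lower
limits. The helpers log2_of and bucket_lower_limits are hypothetical and are
not the code in estimated_histogram.hh; both arguments are assumed to be
powers of two:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Integer log2 for a power-of-two input (hypothetical helper).
inline unsigned log2_of(uint64_t v) {
    unsigned r = 0;
    while (v > 1) { v >>= 1; ++r; }
    return r;
}

// Lower limits of the first `count` buckets, in original (unscaled) units.
inline std::vector<uint64_t> bucket_lower_limits(uint64_t min_value, uint64_t precision, size_t count) {
    assert(min_value > 0 && precision > 0);
    const unsigned lp = log2_of(precision);
    const unsigned lm = log2_of(min_value);
    const unsigned shift = lp > lm ? lp - lm : 0;     // zero when Min >= Precision
    const uint64_t scaled_min = min_value << shift;   // now at least Precision
    std::vector<uint64_t> limits;
    for (size_t i = 0; i < count; ++i) {
        const uint64_t range_start = scaled_min << (i / precision);  // power-of-two range
        const uint64_t step = range_start / precision;               // linear split of the range
        const uint64_t scaled_limit = range_start + (i % precision) * step;
        limits.push_back(scaled_limit >> shift);                     // back to original units
    }
    return limits;
}
```

With Min 2 and Precision 4, the shift is 1 and the first twelve limits come
out as 2 2 3 3 4 5 6 7 8 10 12 14, where the repeats are the cost of scaling
back to integers.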

This change maintains backward compatibility with existing histograms
while enabling efficient tracking of small integer values.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2026-02-26 00:46:14 +02:00
Nadav Har'El
d876e7cd0a utils: introduce throwing_assert(), a safe replacement for assert
This patch introduces throwing_assert(cond), a better and safer
replacement for assert(cond) or SCYLLA_ASSERT(cond). It aims to
eventually replace all assertions in Scylla and provide a real solution
to issue #7871 ("exorcise assertions from Scylla").

throwing_assert() is based on the existing on_internal_error() and
inherits all its benefits, but brings with it the *convenience* of
assert() and SCYLLA_ASSERT(): No need for a separate if(), new strings,
etc.  For example, you can write just one line of throwing_assert():

    throwing_assert(p != nullptr);

Instead of much more verbose on_internal_error:

    if (p == nullptr) {
        utils::on_internal_error("assertion failed: p != nullptr");
    }

Like assert() and SCYLLA_ASSERT(), in our tests throwing_assert() dumps
core on failure. But its advantage over the other assertion functions
becomes clear in production:

* assert() is compiled-out in release builds. This means that the
  condition is not checked, and the code after the failed condition
  continues to run normally, potentially to disastrous consequences.

  In contrast, throwing_assert() continues to check the condition even in
  release builds, and if the condition is false it throws an exception.
  This ensures that the code following the condition doesn't run.

* SCYLLA_ASSERT() in release builds checks the condition and *crashes*
  Scylla if the condition is not met.

  In contrast, throwing_assert() doesn't crash, but throws an exception.
  This means that the specific operation that encountered the error
  is aborted, instead of the entire server. It often also means that
  the user of this operation will see this error somehow and know
  which operation failed - instead of encountering a mysterious
  server (or even whole-cluster) crash without any indication which
  operation caused it.

Another benefit of throwing_assert() is that it logs the error message
(and also a backtrace!) to Scylla's usual logging mechanisms - not to
stderr, as assert and SCYLLA_ASSERT do, where users sometimes can't
see what is written.
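
To illustrate the behaviour, here is a simplified stand-in. It is not the real
macro: THROWING_ASSERT_SKETCH and checked_deref are hypothetical names, and
this version throws std::logic_error directly rather than going through
on_internal_error(), so it does no logging and prints no backtrace:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in: always checks the condition, even in release builds,
// and throws instead of aborting the process.
#define THROWING_ASSERT_SKETCH(cond)                                           \
    do {                                                                       \
        if (!(cond)) {                                                         \
            throw std::logic_error(std::string("assertion failed: ") + #cond); \
        }                                                                      \
    } while (0)

// Example caller: only this operation fails when the precondition is violated.
inline int checked_deref(const int* p) {
    THROWING_ASSERT_SKETCH(p != nullptr);
    return *p;
}
```

A caller that passes nullptr gets an exception whose message is
"assertion failed: p != nullptr", and code after the check does not run.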

Fixes #28308.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-25 14:58:47 +02:00
Botond Dénes
99244179f7 Merge 'CQL transport: Add histogram-based request/response size tracking' from Amnon Heiman
This series closes a gap in how CQL request and response sizes are reported.

Previously, request_size and response_size were tracked as simple counters,
providing only cumulative totals per shard. This made it difficult to understand
the distribution of message sizes and identify potential issues with very large
or very small requests.

After this series, the CQL transport reports detailed histogram metrics showing
the distribution of request and response sizes. These histograms are tracked
per-instance, per-type (per ops), and per-scheduling-group, providing
much better visibility into CQL traffic patterns.

The histograms are collected for QUERY, EXECUTE, and BATCH operations, which are
the primary data path operations where message size distribution is most relevant.
This data can help identify:
- Clients sending unexpectedly large requests
- Operations with oversized result sets
- Scheduling group differences in traffic patterns

To support this, the series extends the approx_exponential_histogram template to
track an accurate sum and adds a bytes_histogram type alias optimized for byte-range measurements (1KB to 1GB).

The existing per-shard counter metrics are maintained for backward compatibility.
Metrics example:
```
scylla_transport_cql_request_bytes{kind="BATCH",scheduling_group_name="sl:default",shard="0"} 129808
scylla_transport_cql_request_bytes{kind="EXECUTE",scheduling_group_name="sl:default",shard="0"} 227409
scylla_transport_cql_request_bytes{kind="PREPARE",scheduling_group_name="sl:default",shard="0"} 631
scylla_transport_cql_request_bytes{kind="QUERY",scheduling_group_name="sl:default",shard="0"} 2809
scylla_transport_cql_request_bytes{kind="QUERY",scheduling_group_name="sl:driver",shard="0"} 4079
scylla_transport_cql_request_bytes{kind="REGISTER",scheduling_group_name="sl:default",shard="0"} 98
scylla_transport_cql_request_bytes{kind="STARTUP",scheduling_group_name="sl:driver",shard="0"} 432
scylla_transport_cql_request_histogram_bytes_sum{kind="QUERY",scheduling_group_name="sl:driver"} 4079
scylla_transport_cql_request_histogram_bytes_count{kind="QUERY",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="1024.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="2048.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="4096.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="8192.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="16384.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="32768.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="65536.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="131072.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="262144.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="524288.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="1048576.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="2097152.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="4194304.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="8388608.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="16777216.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="33554432.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="67108864.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="134217728.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="268435456.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="536870912.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="1073741824.000000",scheduling_group_name="sl:driver"} 57
```
**The field sees it as an important issue**

Fixes #14850

Closes scylladb/scylladb#28419

* github.com:scylladb/scylladb:
  test/boost/estimated_histogram_test.cc: Switch to real Sum
  transport/server: to bytes_histogram
  approx_exponential_histogram: Add sum() method for accurate value tracking
  utils/estimated_histogram.hh: Add bytes_histogram
2026-02-25 13:05:18 +02:00
Ernest Zaslavsky
321d4caf0c object_storage: add retryable machinery to object storage
remove hand rolled error handling from object storage client
and replace with common machinery that supports exception
handling and retrying when appropriate
2026-02-22 14:00:44 +02:00
Ernest Zaslavsky
24972da26d rest_client: add simple_send overload
add an overload to rest client `simple_send` to accept a retry_strategy for http's make_request
2026-02-22 14:00:44 +02:00
Avi Kivity
dee868b71a interval: avoid clang 23 warning on throw statement in potentially noexcept function
interval_data's move constructor is conditionally noexcept. It
contains a throw statement for the case that the underlying type's
move constructor can throw; that throw statement is never executed
if we're in the noexcept branch. Clang 23 however doesn't understand
that, and warns about throwing in a noexcept function.

Fix that by rewriting the logic using seastar::defer(). In the
noexcept case, the optimizer should eliminate it as dead code.

Closes scylladb/scylladb#28710
2026-02-19 12:24:20 +03:00
Calle Wilund
8e71a6f52a gcp: Add handling of 429 (too many requests) to exponential backoff
Fixes: SCYLLADB-611

Adds http error code 429 to codes handled by exponential backoff.
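
A minimal sketch of what such handling might look like. Only the 429 code
comes from this commit; the function names, the 503 entry, and the
doubling-with-cap delay formula are illustrative assumptions, not the GCP
client's actual code:

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

// Hypothetical classification: 429 is from the commit, 503 is an assumed example.
inline bool is_retryable_status(int status) {
    return status == 429 || status == 503;
}

// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
inline std::chrono::milliseconds backoff_delay(unsigned attempt,
                                               std::chrono::milliseconds base,
                                               std::chrono::milliseconds cap) {
    assert(base.count() > 0 && cap >= base);
    const unsigned capped_attempt = std::min(attempt, 20u);   // keep the shift small
    const auto factor = static_cast<std::chrono::milliseconds::rep>(1) << capped_attempt;
    return std::min(cap, base * factor);
}
```

With a 100 ms base and a 5000 ms cap, the delays would be 100, 200, 400,
800 ms and so on until the cap is reached.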

Closes scylladb/scylladb#28588
2026-02-19 09:42:39 +01:00
Ernest Zaslavsky
d763bdabc2 s3_client: fix the s3::range max object size
in s3::Range class start using s3 global constant for two reasons:
1) uniformity, no need to introduce semantically same constant in each class
2) the value was wrong
2026-02-18 12:12:04 +02:00
Ernest Zaslavsky
24e70b30c8 s3_client: remove "aws" prefix from object limits constants
remove "aws" prefix from object limits constants since it is
irrelevant and unnecessary when sitting under s3 namespace
2026-02-18 12:12:04 +02:00
Ernest Zaslavsky
329c156600 s3_client: make s3 object limits accessible
make s3 limits constants publicly accessible to reuse it later
2026-02-18 12:12:04 +02:00
Pavel Emelyanov
89d8ae5cb6 Merge 'http: prepare http clients retry machinery refactoring' from Ernest Zaslavsky
Today the S3 client has a well-established and (hopefully) well-tested http request retry strategy. In the rest of the clients, it looks like we are trying to achieve the same thing by writing the same code over and over again and, of course, missing corner cases that have already been addressed in the S3 client.
This PR aims to extract the code that could help other clients detect the retryability of an error originating from the http client, reuse the built-in seastar http client retryability, and minimize the boilerplate of http client exception handling.

No backport needed since it is only refactoring of the existing code

Closes scylladb/scylladb#28250

* github.com:scylladb/scylladb:
  exceptions: add helper to build a chain of error handlers
  http: extract error classification code
  aws_error: extract `retryable` from aws_error
2026-02-18 10:06:37 +03:00
Pavel Emelyanov
2f10fd93be Merge 's3_client: Fix s3 part size and number of parts calculation' from Ernest Zaslavsky
- Correct `calc_part_size` function since it could return more than 10k parts
- Add tests
- Add more checks in `calc_part_size` to comply with S3 limits
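
A rough sketch of a part-size calculation that respects the 10k limit. The
names part_plan_sketch and plan_parts are hypothetical and this is not the
actual `calc_part_size`; it relies on the S3 multipart limits of at most
10,000 parts and a 5 MiB minimum part size, and it does not check the
maximum part size:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

constexpr uint64_t MiB = 1024 * 1024;
constexpr uint64_t min_part_size = 5 * MiB;   // S3 minimum for every part but the last
constexpr uint64_t max_parts = 10'000;        // S3 maximum number of parts per upload

struct part_plan_sketch {
    uint64_t part_size;
    uint64_t parts;
};

// Pick a part size no smaller than the preferred one and large enough
// that the object fits in at most max_parts parts.
inline part_plan_sketch plan_parts(uint64_t object_size, uint64_t preferred_part_size) {
    assert(object_size > 0);
    uint64_t part_size = std::max(preferred_part_size, min_part_size);
    // Smallest part size that keeps the part count within the limit (ceiling division).
    const uint64_t floor_for_limit = (object_size + max_parts - 1) / max_parts;
    part_size = std::max(part_size, floor_for_limit);
    const uint64_t parts = (object_size + part_size - 1) / part_size;
    return {part_size, parts};
}
```

For a 100 GiB object with a preferred part size of 5 MiB, the part size is
raised so that the upload needs exactly 10,000 parts instead of 20,480.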

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-640
Must be ported back to 2025.3/4 and 2026.1 since we may encounter this bug in production clusters

Closes scylladb/scylladb#28592

* github.com:scylladb/scylladb:
  s3_client: add more constrains to the calc_part_size
  s3_client: add tests for calc_part_size
  s3_client: correct multipart part-size logic to respect 10k limit
2026-02-18 10:04:53 +03:00
Botond Dénes
2e087882fa Merge 'GCS object storage. Fix incompatibilty issues with "real" GCS' from Calle Wilund
Fixes #28398
Fixes #28399

When used as path elements in google storage paths, the object names need to be URL encoded. Due to

a.) tests not really using prefixes including non-url valid chars (i.e. / etc)
and
b.) the mock server used for most testing not enforcing this particular aspect,

this was missed.
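
For illustration, a minimal percent-encoding sketch. The function
url_encode_sketch is hypothetical and is not the helper used in
utils/gcp/object_storage; it encodes every byte outside the RFC 3986
unreserved set, including '/':

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <string_view>

// Percent-encode everything except unreserved characters (letters, digits, - _ . ~).
inline std::string url_encode_sketch(std::string_view in) {
    static constexpr char hex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : in) {
        if (std::isalnum(c) || c == '-' || c == '_' || c == '.' || c == '~') {
            out.push_back(static_cast<char>(c));
        } else {
            out.push_back('%');
            out.push_back(hex[c >> 4]);      // high nibble
            out.push_back(hex[c & 0x0F]);    // low nibble
        }
    }
    return out;
}
```

An object name such as "backup/ks/table 1.db" would become
"backup%2Fks%2Ftable%201.db" when used as a single path element.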

Modified unit tests to use prefixing for all names, so when running real GS, any errors like this will show.

"Real" GCS also behaves a bit different when listing with pager, compared to mock;
The former will not give a pager token for the last page, only the penultimate.
Adds handling for this.

Needs backport to the releases that have (though might not really use) the feature, as it is technically possible to use google storage for backup and whatnot there, and it should work as expected.

Closes scylladb/scylladb#28400

* github.com:scylladb/scylladb:
  utils/gcp/object_storage: URL-encode object names in URL:s
  utils::gcp::object_storage: Fix list object pager end condition detection
2026-02-17 16:40:02 +02:00