Commit Graph

2067 Commits

Author SHA1 Message Date
Pavel Emelyanov
080c55a115 lister: Fix race between readdir and stat
Sometimes file::list_directory() returns entries without type set. In
that case lister calls file_type() on the entry name to get it. In case
the call returns disengaged type, the code assumes that some error
occurred and resolves into exception.

That's not correct. The file_type() method returns disengaged type only
if the file being inspected is missing (i.e. on ENOENT errno). But this
can validly happen if a file is removed between readdir and stat. In
that case it's not "some error happened", but an entry that should just
be skipped. If some error did happen, then file_type() would resolve into
an exceptional future on its own.
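
As an illustration only, the skip-on-missing rule can be sketched with plain `std::filesystem` instead of Seastar. The names `probe_type` and `list_entries` are hypothetical stand-ins and are not taken from the patch; `probe_type` mimics the described contract (disengaged result only for a missing file, exception for anything else).

```cpp
#include <cassert>
#include <filesystem>
#include <optional>
#include <system_error>
#include <utility>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical stand-in for file_type(): disengaged only when the file is
// missing, throws for every other failure.
std::optional<fs::file_type> probe_type(const fs::path& p) {
    std::error_code ec;
    fs::file_status st = fs::symlink_status(p, ec);
    if (st.type() == fs::file_type::not_found) {
        return std::nullopt;
    }
    if (ec) {
        throw fs::filesystem_error("probe_type", p, ec);
    }
    return st.type();
}

// Hypothetical lister step: resolve the type of each name that readdir
// reported, skipping names that vanished before they could be stat'ed.
std::vector<std::pair<fs::path, fs::file_type>>
list_entries(const std::vector<fs::path>& names) {
    std::vector<std::pair<fs::path, fs::file_type>> out;
    for (const auto& name : names) {
        auto type = probe_type(name);
        if (!type) {
            continue; // removed between readdir and stat: not an error
        }
        assert(*type != fs::file_type::not_found);
        out.emplace_back(name, *type);
    }
    return out;
}
```

The point of the sketch is that a disengaged result and a thrown error take different paths: the first drops the entry, the second propagates.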

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#26595

(cherry picked from commit d9bfbeda9a)

Closes scylladb/scylladb#26767
2025-10-29 11:29:57 +02:00
Ernest Zaslavsky
6f6b3a26c4 s3_client: tune logging level
Change all logging related to errors in `chunked_download_source` background download fiber to `info` to make it visible right away in logs.

(cherry picked from commit fdd0d66f6e)
2025-10-21 12:26:50 +00:00
Ernest Zaslavsky
4eb427976b s3_client: add logging
Add logging for the case when we encounter expired credentials. This shouldn't happen, but just in case.

(cherry picked from commit 4497325cd6)
2025-10-21 12:26:50 +00:00
Ernest Zaslavsky
94d49da8ec s3_client: improve exception handling for chunked downloads
Refactor the wrapping exception used in `chunked_download_source` to
prevent the retry strategy from reattempting failed requests. The new
implementation preserves the original `exception_ptr`, making the root
cause clearer and easier to diagnose.

(cherry picked from commit 1d34657b14)
2025-10-21 12:26:50 +00:00
Ernest Zaslavsky
f9bc211966 s3_client: fix indentation
Reformat `client::make_request` to fix the indentation of `if` block

(cherry picked from commit 58a1cff3db)
2025-10-21 12:26:50 +00:00
Ernest Zaslavsky
4aff338282 s3_client: add max for client level retries
To prevent the client from retrying indefinitely on time skew and authentication errors, add `max_attempts` to `client::make_request`

(cherry picked from commit 43acc0d9b9)
2025-10-21 12:26:50 +00:00
Ernest Zaslavsky
8b7dce8334 s3_client: remove s3_retry_strategy
It never worked as intended, so the credentials handling is moving to the same place where we handle time skew, since we have to reauthenticate the request

(cherry picked from commit 116823a6bc)
2025-10-21 12:26:50 +00:00
Ernest Zaslavsky
2afd323838 s3_client: support high-level request retries
Add an option to retry S3 requests at the highest level, including
reinitializing headers and reauthenticating. This addresses cases
where retrying the same request fails, such as when the S3 server
rejects a timestamp older than 15 minutes.
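
A minimal sketch of the idea, not the actual `s3_client` code: the request is rebuilt and re-signed on every attempt instead of resending the same bytes. The `request` struct, the `outcome` enum, `send_with_resign`, and the attempt bound are all hypothetical names chosen for this illustration.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical, simplified request: only the parts that go stale.
struct request {
    std::string date_header;
    std::string signature;
};

enum class outcome { ok, retryable_auth_error, fatal };

// Retry at the highest level: every attempt calls build_signed() again, so
// the timestamp and signature are fresh. The loop is bounded by max_attempts.
outcome send_with_resign(const std::function<request()>& build_signed,
                         const std::function<outcome(const request&)>& send,
                         unsigned max_attempts) {
    assert(max_attempts > 0);
    outcome last = outcome::fatal;
    for (unsigned attempt = 0; attempt < max_attempts; ++attempt) {
        request req = build_signed();   // fresh headers and signature
        last = send(req);
        if (last != outcome::retryable_auth_error) {
            return last;                // success or a non-retryable failure
        }
    }
    return last;                        // attempts exhausted
}
```

Retrying below this level would resend the stale timestamp, which is the failure mode the commit message describes.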

(cherry picked from commit 185d5cd0c6)
2025-10-21 12:26:50 +00:00
Ernest Zaslavsky
f2f415a742 s3_client: just reformat make_request
Just reformat previously changed methods to improve readability

(cherry picked from commit db1ca8d011)
2025-10-21 12:26:50 +00:00
Ernest Zaslavsky
5c2d8bd273 s3_client: unify make_request implementation
Refactor `make_request` to use a single core implementation that
handles authentication and issues the HTTP request. All overloads now
delegate to this unified method.

(cherry picked from commit 55fb2223b6)
2025-10-21 12:26:49 +00:00
Ernest Zaslavsky
04b9e98ef8 s3_client: track memory starvation in background filling fiber
Introduce a counter metric to monitor instances where the background
filling fiber is blocked due to insufficient memory in the S3 client.

Closes scylladb/scylladb#26466

(cherry picked from commit 413739824f)

Closes scylladb/scylladb#26555
2025-10-15 12:03:09 +02:00
Ernest Zaslavsky
5c6335e029 s3_client: fix when condition to prevent infinite locking
Refine condition variable predicate in filling fiber to avoid
indefinite waiting when `close` is invoked.
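
For illustration, here is the general pattern with `std::condition_variable` rather than Seastar's primitives. The class `fill_gate` and its "free slots" resource are hypothetical; the commit does not say what the fiber waits for. The relevant detail is that the predicate also tests the closed flag, so `close()` always wakes a waiter.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>

// Hypothetical gate a background filler waits on before producing more data.
class fill_gate {
    std::mutex _m;
    std::condition_variable _cv;
    std::size_t _free_slots = 0;
    bool _closed = false;
public:
    // Returns true when a slot was taken, false when the gate was closed.
    bool acquire_slot() {
        std::unique_lock<std::mutex> lk(_m);
        // Without "|| _closed" a waiter would sleep forever after close().
        _cv.wait(lk, [this] { return _free_slots > 0 || _closed; });
        if (_closed) {
            return false;
        }
        assert(_free_slots > 0);
        --_free_slots;
        return true;
    }

    void release_slots(std::size_t n) {
        {
            std::lock_guard<std::mutex> g(_m);
            _free_slots += n;
        }
        _cv.notify_all();
    }

    void close() {
        {
            std::lock_guard<std::mutex> g(_m);
            _closed = true;
        }
        _cv.notify_all();
    }
};
```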

Closes scylladb/scylladb#26449

(cherry picked from commit c2bab430d7)

Closes scylladb/scylladb#26497
2025-10-12 16:19:48 +03:00
Michał Chojnowski
879db5855d utils/config_file: fix a missing allowed_values propagation in one of named_value constructors
In one of the constructors of `named_value`, the `allowed_values`
argument isn't used.

(This means that if some config entry uses this constructor,
the values aren't validated on the config layer,
and might give some lower layer a bad surprise).

Fix that.
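
The class of bug can be sketched as follows. This is a simplified, hypothetical `named_value_sketch`, not the real `named_value` from `utils/config_file`; its constructor signatures are assumptions. Delegating every constructor to one that stores the allowed values makes it hard to drop the argument by accident.

```cpp
#include <algorithm>
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

class named_value_sketch {
    std::string _name;
    std::string _alias;
    std::string _value;
    std::vector<std::string> _allowed; // empty means "anything goes"

    bool is_allowed(const std::string& v) const {
        return _allowed.empty()
            || std::find(_allowed.begin(), _allowed.end(), v) != _allowed.end();
    }
public:
    named_value_sketch(std::string name, std::string def,
                       std::vector<std::string> allowed = {})
        : _name(std::move(name)), _value(std::move(def)), _allowed(std::move(allowed)) {
        assert(is_allowed(_value)); // the default itself must be valid
    }

    // Second constructor: forwards `allowed` instead of silently ignoring it.
    named_value_sketch(std::string name, std::string alias, std::string def,
                       std::vector<std::string> allowed)
        : named_value_sketch(std::move(name), std::move(def), std::move(allowed)) {
        _alias = std::move(alias);
    }

    void set(const std::string& v) {
        if (!is_allowed(v)) {
            throw std::invalid_argument(_name + ": value not allowed: " + v);
        }
        _value = v;
    }

    const std::string& value() const { return _value; }
};
```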

Fixes scylladb/scylladb#26371

Closes scylladb/scylladb#26196

(cherry picked from commit 3b338e36c2)

Closes scylladb/scylladb#26425
2025-10-09 13:19:41 +03:00
Avi Kivity
4d9271df98 Merge 'sstables: introduce sstable version ms' from Michał Chojnowski
This is yet another part in the BTI index project.

Overarching issue: https://github.com/scylladb/scylladb/issues/19191
Previous part: https://github.com/scylladb/scylladb/pull/25626
Next parts: make `ms` the default. Then, general tweaks and improvements. Later, potentially a full `da` format implementation.

This patch series introduces a new, Scylla-only sstable format version `ms`, which is like `me`, but with the index components (Summary.db and Index.db) replaced with BTI index components (Partitions.db and Rows.db), as they are in Cassandra 5.0's `da` format version.

(Eventually we want to just implement `da`, but there are several other changes (unrelated to the index files) between `me` and `da`. By adding this `ms` as an intermediate step we can adapt the new index formats without dragging all the other changes into the mix (and raising the risk of regressions, which is already high)).

The high-level structure of the PR is:
1. Introduce new component types — `Partitions` and `Rows`.
2. Teach `class sstable` to open them when they exist.
3. Teach the sstable writer how to write index data to them.
4. Teach `class sstable` and unit tests how to deal with sstables that have no `Index` or `Summary` (but have `Partitions` and `Rows` instead).
5. Introduce the new sstable version `ms`, specify that it has `Partitions` and `Rows` instead of `Index` and `Summary`.
6. Prepare unit tests for the appearance of `ms`.
7. Enable `ms` in unit tests.
8. Make `ms` enablable via db::config (with a silent fallback to `me` until the new `MS_SSTABLE_FORMAT` cluster feature is enabled).
9. Prepare integration tests for the appearance of `ms`.
10. Enable both `ms` and `me` in tests where we want both versions to be tested.

This series doesn't make `ms` the default yet, because that requires teaching Scylla Manager and a few dtests about the new format first. It can be enabled by setting `sstable_format: ms` in the config.

Per a review request, here is an example from `perf_fast_forward`, demonstrating some motivation for a new format. (Although not the main one. The main motivations are getting rid of restrictions on the RAM:disk ratio, and index read throughput for datasets with tiny partitions). The dataset was populated with `build/release/scylla perf-fast-forward --smp=1 --sstable-format=$VERSION --data-directory=data.$VERSION --column-index-size-in-kb=1 --populate --random-seed=0`.
This test involves a partition with 1000000 clustering rows (with 32-bit keys and 100-byte values) and ~500 index blocks, and queries a few particular rows from the partition. Since the branching factor for the BIG promoted index is 2 (it's a binary search), the lookup involves ~11.2 sequential page reads per row. The BTI format has a more reasonable branching factor, so it involves ~2.3 page reads per row.

`build/release/scylla perf-fast-forward --smp=1 --data-directory=perf_fast_forward_data/me --run-tests=large-partition-select-few-rows`:
```
offset  stride  rows     iterations    avg aio    aio      (KiB)
500000  1       1                70       18.0     18        128
500001  1       1               647       19.0     19        132
0       1000000 1               748       15.0     15        116
0       500000  2               372       29.0     29        284
0       250000  4               227       56.0     56        504
0       125000  8               116      106.0    106        928
0       62500   16               67      195.0    195       1732
```
`build/release/scylla perf-fast-forward --smp=1 --data-directory=perf_fast_forward_data/ms --run-tests=large-partition-select-few-rows`:
```
offset  stride  rows     iterations    avg aio    aio      (KiB)
500000  1       1                51        5.1      5         20
500001  1       1                64        5.3      5         20
0       1000000 1               679        4.0      4         16
0       500000  2               492        8.0      8         88
0       250000  4               804       16.0     16        232
0       125000  8               409       31.0     31        516
0       62500   16               97       54.0     54       1056
```

Index file size comparison for the default `perf_fast_forward` tables with `--random-seed=0`:
Large partition table (dominated by intra-partition index): 2.4 MB with `me`, 732 kB with `ms`.
For the small partitions table (dominated by inter-partition index): 11 MB with `me`, 8.4 MB with `ms`.

External tests:
I ran the SCT test `longevity-mv-si-4days-streaming-test` on 6 nodes with 30 shards each for 8 hours. No anomalies were observed.

New functionality, no backport needed.

Closes scylladb/scylladb#26215

* github.com:scylladb/scylladb:
  test/boost/bloom_filter_test: add test_rebuild_from_temporary_hashes
  test/cluster: add test_bti_index.py
  test: prepare bypass_cache_test.py for `ms` sstables
  sstables/trie/bti_index_reader: add a failure injection in advance_lower_and_check_if_present
  test/cqlpy/test_sstable_validation.py: prepare the test for `ms` sstables
  tools/scylla-sstable: add `--sstable-version=?` to `scylla sstable write`
  db/config: expose "ms" format to the users via database config
  test: in Python tests, prepare some sstable filename regexes for `ms`
  sstables: add `ms` to `all_sstable_versions`
  test/boost/sstable_3_x_test: add `ms` sstables to multi-version tests
  test/lib/index_reader_assertions: skip some row index checks for BTI indexes
  test/boost/sstable_inexact_index_test: explicitly use a `me` sstable
  test/boost/sstable_datafile_test: skip test_broken_promoted_index_is_skipped for `ms` sstables
  test/resource: add `ms` sample sstable files for relevant tests
  test/boost/sstable_compaction_test: prepare for `ms` sstables.
  test/boost/index_reader_test: prepare for `ms` sstables
  test/boost/bloom_filter_tests: prepare for `ms` sstables
  test/boost/sstable_datafile_test: prepare for `ms` sstables
  test/boost/sstable_test: prepare for `ms` sstables.
  sstables: introduce `ms` sstable format version
  tools/scylla-sstable: default to "preferred" sstable version, not "highest"
  sstables/mx/reader: use the same hashed_key for the bloom filter and the index reader
  sstables/trie/bti_index_reader: allow the caller to passing a precalculated murmur hash
  sstables/trie/bti_partition_index_writer: in add(), get the key hash from the caller
  sstables/mx: make Index and Summary components optional
  sstables: open Partitions.db early when it's needed to populate key range for sharding metadata
  sstables: adapt sstable::set_first_and_last_keys to sstables without Summary
  sstables: implement an alternative way to rebuild bloom filters for sstables without Index
  utils/bloom_filter: add `add(const hashed_key&)`
  sstables: adapt estimated_keys_for_range to sstables without Summary
  sstables: make `sstable::estimated_keys_for_range` asynchronous
  sstables/sstable: compute get_estimated_key_count() from Statistics instead of Summary
  replica/database: add table::estimated_partitions_in_range()
  sstables/mx: implement sstable::has_partition_key using a regular read
  sstables: use BTI index for queries, when present and enabled
  sstables/mx/writer: populate BTI index files
  sstables: create and open BTI index files, when enabled
  sstables: introduce Partition and Rows component types
  sstables/mx/writer: make `_pi_write_m.partition_tombstone` a `sstables::deletion_time`
2025-09-30 09:40:02 +03:00
Michał Chojnowski
c549afa1a9 utils/bloom_filter: add add(const hashed_key&)
In one of the next patches, we will want to use (in BTI partition
index writer) the same hash as used by the bloom filter,
and we'll also want to allow rebuilding the filter in a second
pass (after the whole sstable is written) from hashes (as opposed
to rebuilding from partition keys saved in Index.db, which is
something we sometimes do today) saved to a temporary file.

For those, we need an interface that allows us to compute the hash
externally, and only pass the hash to `add()`.
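
A rough sketch of such an interface, under stated assumptions: `hashed_key_sketch`, `make_hashed_key` and `bloom_sketch` are hypothetical, `std::hash` stands in for the real murmur hash, and the probe positions use generic double hashing rather than Scylla's actual scheme.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string_view>
#include <vector>

// Hypothetical precomputed hash that can be shared between components.
struct hashed_key_sketch {
    std::uint64_t h1;
    std::uint64_t h2;
};

// Stand-in hash: std::hash for h1, a multiplicative mix of h1 for h2.
hashed_key_sketch make_hashed_key(std::string_view key) {
    std::uint64_t h1 = std::hash<std::string_view>{}(key);
    std::uint64_t h2 = (h1 * 0x9E3779B97F4A7C15ull) | 1u; // forced odd
    return {h1, h2};
}

class bloom_sketch {
    std::vector<bool> _bits;
    unsigned _k;
    std::size_t position(const hashed_key_sketch& h, unsigned i) const {
        return static_cast<std::size_t>((h.h1 + i * h.h2) % _bits.size());
    }
public:
    bloom_sketch(std::size_t nbits, unsigned k) : _bits(nbits), _k(k) {
        assert(nbits > 0 && k > 0);
    }
    // The hash is computed by the caller and only passed in here.
    void add(const hashed_key_sketch& h) {
        for (unsigned i = 0; i < _k; ++i) {
            _bits[position(h, i)] = true;
        }
    }
    // Convenience overload for callers that only have the key.
    void add(std::string_view key) { add(make_hashed_key(key)); }

    bool maybe_contains(const hashed_key_sketch& h) const {
        for (unsigned i = 0; i < _k; ++i) {
            if (!_bits[position(h, i)]) {
                return false;
            }
        }
        return true;
    }
};
```

With this shape, hashes saved during the first pass are enough to rebuild a filter later; the original keys are not needed.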
2025-09-29 13:01:21 +02:00
Szymon Malewski
bb8004e52d utils: extend lru_string_map
This patch extends `lru_string_map` with `sized_string_map`, a class that helps to control cache size.
It implements cache resizing in a background thread.
2025-09-28 04:27:33 +02:00
Szymon Malewski
5332ceb24e utils: add lru_string_map
Adds an lru_string_map definition.
This structure maps string keys to templated arguments, allowing efficient lookup and insertion of keys.
Each lookup (and insertion) puts the key on an internal LRU list, and the entries may be efficiently removed in LRU order.
It will be a base for the expression cache in Alternator.
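
The general technique can be sketched as a hash index over a recency-ordered list. This is not the `lru_string_map` from the patch; `lru_map_sketch` and its member names are hypothetical and only illustrate how lookup, insertion and LRU-order removal fit together.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

template <typename V>
class lru_map_sketch {
    using entry = std::pair<std::string, V>;
    std::list<entry> _lru; // front = most recently used, back = oldest
    std::unordered_map<std::string, typename std::list<entry>::iterator> _index;
public:
    // Lookup; a hit moves the entry to the front of the LRU list.
    V* find(const std::string& key) {
        auto it = _index.find(key);
        if (it == _index.end()) {
            return nullptr;
        }
        _lru.splice(_lru.begin(), _lru, it->second); // iterators stay valid
        return &it->second->second;
    }

    // Insert or overwrite; either way the key becomes the most recent one.
    void insert(std::string key, V value) {
        if (V* existing = find(key)) {
            *existing = std::move(value);
            return;
        }
        _lru.emplace_front(std::move(key), std::move(value));
        _index.emplace(_lru.front().first, _lru.begin());
    }

    // Remove the least recently used entry; returns false when empty.
    bool evict_one() {
        if (_lru.empty()) {
            return false;
        }
        _index.erase(_lru.back().first);
        _lru.pop_back();
        return true;
    }

    std::size_t size() const {
        assert(_lru.size() == _index.size());
        return _index.size();
    }
};
```

Both lookup and eviction are constant time on average, because the list splice and the hash lookup do not depend on the number of entries.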
2025-09-28 04:06:00 +02:00
Pavel Emelyanov
8f815de1e0 Merge 'treewide: move away from accessing httpd::request::query_parameters' from Botond Dénes
Accessing this member directly is deprecated, migrate code to use {get,set}_query_param() and friends instead.

Fixes: https://github.com/scylladb/scylladb/issues/26023

Preparation for seastar update, no backport required.

Closes scylladb/scylladb#26024

* github.com:scylladb/scylladb:
  treewide: move away from accessing httpd::request::query_parameters
  test/pylib/s3_server_mock.py: better handle empty query params
2025-09-25 11:05:50 +03:00
Avi Kivity
fb5664a1d5 interval: split interval_bound implementation for const references
In f3dccc2215 ("interval: change start()/end() not to return
references to data members"), we introduced interval_bound_const_ref
as a lightweight alternative to interval_bound that does not carry
a T. This was needed because interval no longer contains
interval_bound:s.

This interval_bound_const_ref was just an interval_bound<const T&>,
and converting constructors and operators were added to move between
the interval_bound<T> and interval_bound<const T&>.

However, these happen to be illegal in C++ and just happened to work
in clang 20. Clang 21 tightened its checks and these are now flagged.
The problem is that when instantiating interval_bound<const T&> the
converting constructor looks like a copy constructor; and while it's
behind a constraint (that evaluates to false) the rules don't care
about that.

Fix this by having a separate interval_bound_const_ref template.
The new template is slightly better as it allows assignment (since
the payload is a pointer rather than a reference). Not that it's really
needed.

The C++ rule was reported [1] as too restrictive, but there is no
resolution yet.

[1] https://cplusplus.github.io/CWG/issues/2837.html

Closes scylladb/scylladb#26081
2025-09-24 13:57:21 +02:00
Botond Dénes
1ac7b4c35e treewide: move away from accessing httpd::request::query_parameters
Accessing this member directly is deprecated, migrate code to use
{get,set}_query_param() and friends instead.

Fixes: https://github.com/scylladb/scylladb/issues/26023
2025-09-24 11:52:15 +03:00
Ernest Zaslavsky
e56081d588 treewide: seastar module update and fix broken rest client
start using `write_body` in `rest/client` to properly set headers due to changes applied to seastar's http client

Seastar module update
```
b6be384e Merge 'http: generalize Content-Type setting' from Nadav Har'El
74472298 http: generalize request's Content-Type setting
9fd5a1cc http: generalize reply's Content-Type setting
a2665f38 memory: Remove deprecated enable_abort_on_allocation_failure()
d2a5a8a9 resource.cc: Remove some dead code
7ad9f424 http: Add support of multiple key repetitions for the request
a636baca task: Move task::get_backtrace() definition in its class
a0101efa Fixed "doxygen" spelling in error message
db969482 Merge 'http/reply: introduce set_cookie()' from Botond Dénes
5357b434 http/reply: introduce set_cookie()
1ddcf05f http/reply: make write_reply*() public
4b782d73 http/connection: start_response(): fix indentation
720feca0 http/reply: encapsulate reply writing in write_reply()
3e19917d Merge 'exceptions: log thrown and propagated exception with distinct log levels' from Botond Dénes
db9aea93 Merge 'Correctly wrap up abandoned yielding directory lister' from Pavel Emelyanov
dbb2bf3f test: Add test for input_stream::read_exactly()
a5308ec9 file/directory_lister: Correctly wrap up fallback generator
4f0811f4 file/directory_lister: Convert on-stack queue to shared pointer
59801da7 tests: Add directory lister early drop cases
33233032 http/reply: s/write_reply_to_connection/write_reply/
69b93620 http/reply: write_reply_{to_connection,headers}(): pass output stream
56e9bda7 test: Convert directory_test into seastar test
96782358 Merge 'Improve io_tester's seqwrite and append workloads' from Pavel Emelyanov
8b46e3d4 SEASTAR_ASSERT: assert to stderr and flush stream
3370e22a tutorial.md: use current_exception_as_future()
e977453a Add fixture support for seastar::testing
3e70d7f7 io_tester: Do not set append_is_unlikely unconditionally
2a4ae7b4 io_tester: Count file size overflows
5e678bb5 io_tester: Tuneup size overflow check
d5dad8ce io_tester: Move position management code to io_class_data
5586a056 io_tester: Rename seqwrite -> overwrite
92df2fb2 io_tester: Relax return value of create_and_fill_file()
03d9500d io_tester: Dont fill file for APPEND
d6844a7b io_tester: Indentation fix after previous patch
fb9e0088 io_tester: Coroutinize create_and_fill_file()
2f802f57 exceptions: log thrown and propagated exception with distinct log levels
4971fa70 util: move log-level into own header
39448fc1 Merge 'Fix and tune http::request setup by client' from Pavel Emelyanov
52d0c4fb iostream: Move output_stream::write(scattered_message) lower
7a52f734 Merge 'read_first_line: Missing pragma and licence' from Ernest Zaslavsky
d0881b7e read_first_line: Add missing license boilerplate
988a0e99 read_first_line:: Add missing `#pragma once`
42675266 http: Make client::make_request accept const request&
c7709fb5 http: Make request making API return exceptional future not throw
b68ed89b http: Move request content length header setup
1d96dac6 http: Move request version configuration
072e86f6 http: Setup request once
```

Closes scylladb/scylladb#25915

(cherry picked from commit 44d34663bc)

Closes scylladb/scylladb#26100
2025-09-19 11:40:59 +03:00
Avi Kivity
f6b6312cf4 Merge 'sstables/trie: prepare for integrating BTI indexes with sstable readers and writers' from Michał Chojnowski
This is yet another part in the BTI index project.

Overarching issue: https://github.com/scylladb/scylladb/issues/19191
Previous part: https://github.com/scylladb/scylladb/pull/25626
Next parts: introducing the new components, Partitions.db and Rows.db

This is the preparatory, uncontroversial part of https://github.com/scylladb/scylladb/pull/26039, which has been split out to a separate PR to make the main part (which, after a revision, will be posted later) smaller.

This series contains several small fixes and changes to BTI-related code added earlier, which either have to be done (i.e. propagating `reader_permit` to IO calls in index reads) or just deserved to be done. There's no single theme for the changes in this PR, refer to the individual commits for details.

The changes are for the sake of new and unreleased code. No backporting should be done.

Closes scylladb/scylladb#26075

* github.com:scylladb/scylladb:
  sstables/mx/reader: remove mx::make_reader_with_index_reader
  test/boost/bti_index_test: fix indentation
  sstables/trie/bti_index_reader: in last_block_offset(), return offset from the beginning of partition, not file
  sstables/trie: support reader_permit and trace_state properly
  sstables/trie/bti_node_reader: avoid calling into `cached_file` if the target position is already cached
  sstables/trie/bti_index_reader: get rid of the seastar::file wrapper in read_row_index_header
  sstables/trie/bti_index_reader: support BYPASS CACHE
  test/boost/bti_index_test: use read_bti_partitions_db_footer where appropriate
  sstables/trie: change the signature of bti_partition_index_writer::finish
  sstables/bti_index: improve signatures of special member functions in index writers
  streaming/stream_transfer_task: coroutinize `estimate_partitions()`
  types/comparable_bytes: add a missing implementation for date_type_impl
  sstables: remove an outdated FIXME
  storage_service: delete `get_splits()`
  sstables/trie: fix some comment typos in bti_index_reader.cc
  sstables/mx/writer: rename _pi_write_m.tomb to partition_tombstone
2025-09-18 12:10:27 +03:00
Pavel Emelyanov
65638232e8 Merge 'utils: azure: Catch system errors when probing IMDS and bump the verbosity of logs' from Nikos Dragazis
This PR fixes a bug in the Azure default credential provider that would cause the `test_azure_provider_with_incomplete_creds` unit test to be flaky. The provider would assume that an unreachable IMDS endpoint would always result in a timeout, but network errors are also possible (e.g., ICMP "host unreachable"). The issue is triggered by this particular test because it sets the IMDS endpoint to a non-routable address. Some routers choose to silently drop such packets, while others return ICMP errors. To fix it, the default credential provider has been updated to catch system errors as well.

This PR also raises the log level of the default credential provider from DEBUG to INFO, making it easier for operators to diagnose authentication issues.

More details in the commit messages.

Fixes #25641.

Closes scylladb/scylladb#25696

* github.com:scylladb/scylladb:
  utils: azure: Catch system errors when detecting IMDS
  utils: azure: Bump default credential logs from DEBUG to INFO
2025-09-18 07:43:00 +03:00
Ernest Zaslavsky
c9c245c756 rest_client: set version on http::request to avoid invalid state
Upcoming changes in Seastar cause `rest::simple_send` to move the
`http::request` into `seastar::http::experimental::client::make_request`
when called multiple times. This leaves the original request in an
invalid state. Specifically, the `_version` field becomes empty,
causing request validation to fail. This patch ensures `version` is
explicitly set to prevent such failures.

Fixes: https://github.com/scylladb/scylladb/issues/26018

Closes scylladb/scylladb#26066
2025-09-18 07:36:25 +03:00
Michał Chojnowski
1f85069389 sstables/trie: support reader_permit and trace_state properly
Before this patch, the `reader_permit` taken by `bti_index_reader`
wasn't actually being passed down to disk reads. In this patch,
we fix this FIXME by propagating the permit down to the I/O
operations on the `cached_file`.

Also, it didn't take `trace_state_ptr` at all.
In this patch, we add a `trace_state_ptr` argument and propagate
it down to disk reads.

(We combine the two changes because the permit and the trace state
are passed together everywhere anyway).
2025-09-17 12:22:40 +02:00
Benny Halevy
3a6208b319 utils: stall_free: clear_gently: release wrapped objects
As discussed in https://github.com/scylladb/scylladb/pull/24606#discussion_r2281870939
clear_gently of shared pointers should release the wrapped
object reference and when the object's use_count reaches 1,
the object itself would be cleared_gently, before it's destroyed.

This behavior is similar to the way we clear gently containers
like arrays or vectors, and so it is extended in this patch
to smart pointers like unique_ptr and foreign_ptr.

The unit tests are adjusted respectively to expect the
smart pointers to be reset after clear_gently, plus
the use of `reset()` for `foreign_ptr<shared_ptr<>>` was
replaced by `clear_gently().get()` which now ensures the
reference to a shared object is released, and awaited for,
if it happens on a foreign owner shard, unlike reset of
a foreign_ptr that kicks off destroy of that shared object
in the background on the owner shard - causing flakiness.

Fixes #25723

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#25759
2025-09-17 11:44:26 +03:00
Pavel Emelyanov
6fb66b796a s3: Add metrics to show S3 prefetch bytes
The chunked download source sends large GET requests and then consumes data
as it arrives. Sometimes it can stop reading from socket early and drop the
in-flight data. The existing read-bytes metrics show only the number of
consumed bytes, but we also want to know the number of requested bytes.

Refs #25770 (accounting of read-bytes)
Fixes #25876

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#25877
2025-09-16 23:40:47 +03:00
Nikos Dragazis
58e8142a06 utils: azure: Catch system errors when detecting IMDS
When the default credential provider probes IMDS to check its
availability, it assumes that application-level connection timeouts are
the only error that can occur when the node is not an Azure VM, i.e.,
the packets will be silently dropped somewhere in the network.

However, this has proven not always true for the
`test_azure_provider_with_incomplete_creds` unit test, which overrides
the default IMDS endpoint with a non-routeable IP from TEST-NET-1 [1].
This test has been reported to fail in some local setups where routers
respond with ICMP "host unreachable" errors instead of silently dropping
the packets. This error propagates to user space as an EHOSTUNREACH
system error, which is not caught by the default credential provider,
causing the test to fail. The reason we use a non-routeable address in
this test is to ensure that IMDS probing will always fail, even if
running the test on an Azure VM.

Theoretically, the same problem applies to the default IMDS endpoint as
well (169.254.169.254). The RFC 3927 [2] mandates that packets targeting
link-local addresses (169.254/16) must not be forwarded, but the exact
behavior is left to implementation.

Since we cannot predict how routers will behave, fix this by catching
all relevant system errors when probing IMDS.
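
As a sketch only: the probe below treats a timeout and a set of network-level system errors as "IMDS not available" and rethrows everything else. The function `imds_available`, the `probe_timeout` type, and the particular list of error codes are assumptions made for this illustration; the commit does not enumerate which errors it catches.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <system_error>

// Hypothetical application-level timeout raised by the probe.
struct probe_timeout : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Returns true when the probe succeeds, false when the endpoint is clearly
// unreachable. Unrelated errors are rethrown so they are not hidden.
bool imds_available(const std::function<void()>& probe) {
    assert(static_cast<bool>(probe));
    try {
        probe();
        return true;
    } catch (const probe_timeout&) {
        return false; // packets silently dropped somewhere in the network
    } catch (const std::system_error& e) {
        const std::error_code& c = e.code();
        // Assumed set of "unreachable" conditions, e.g. after an ICMP
        // "host unreachable" reply from a router.
        if (c == std::errc::host_unreachable
                || c == std::errc::network_unreachable
                || c == std::errc::connection_refused) {
            return false;
        }
        throw;
    }
}
```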

[1] https://datatracker.ietf.org/doc/html/rfc5735
[2] https://datatracker.ietf.org/doc/html/rfc3927

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-09-16 15:27:59 +03:00
Nikos Dragazis
78bcecd570 utils: azure: Bump default credential logs from DEBUG to INFO
The default credential provider produces diagnostic logs on each step as
it walks through the credential chain. These logs are useful for
operators to diagnose authentication problems as they expose information
about which credential sources are being evaluated, in which order, why
they fail, and which source is eventually selected.

Promote them from DEBUG to INFO level.

Additionally, concatenate the logs for environment credentials into a
single log statement to avoid interleaving with other logs.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-09-16 15:20:52 +03:00
Botond Dénes
ee7c85919e Revert "treewide: seastar module update and fix broken rest client"
This reverts commit 44d34663bc of PR
https://github.com/scylladb/scylladb/pull/25915.

Breaks artifact tests on ARM, blocking us from building new images from
master.
2025-09-16 08:31:08 +03:00
Ernest Zaslavsky
44d34663bc treewide: seastar module update and fix broken rest client
start using `write_body` in `rest/client` to properly set headers due to changes applied to seastar's http client

Seastar module update
```
b6be384e Merge 'http: generalize Content-Type setting' from Nadav Har'El
74472298 http: generalize request's Content-Type setting
9fd5a1cc http: generalize reply's Content-Type setting
a2665f38 memory: Remove deprecated enable_abort_on_allocation_failure()
d2a5a8a9 resource.cc: Remove some dead code
7ad9f424 http: Add support of multiple key repetitions for the request
a636baca task: Move task::get_backtrace() definition in its class
a0101efa Fixed "doxygen" spelling in error message
db969482 Merge 'http/reply: introduce set_cookie()' from Botond Dénes
5357b434 http/reply: introduce set_cookie()
1ddcf05f http/reply: make write_reply*() public
4b782d73 http/connection: start_response(): fix indentation
720feca0 http/reply: encapsulate reply writing in write_reply()
3e19917d Merge 'exceptions: log thrown and propagated exception with distinct log levels' from Botond Dénes
db9aea93 Merge 'Correctly wrap up abandoned yielding directory lister' from Pavel Emelyanov
dbb2bf3f test: Add test for input_stream::read_exactly()
a5308ec9 file/directory_lister: Correctly wrap up fallback generator
4f0811f4 file/directory_lister: Convert on-stack queue to shared pointer
59801da7 tests: Add directory lister early drop cases
33233032 http/reply: s/write_reply_to_connection/write_reply/
69b93620 http/reply: write_reply_{to_connection,headers}(): pass output stream
56e9bda7 test: Convert directory_test into seastar test
96782358 Merge 'Improve io_tester's seqwrite and append workloads' from Pavel Emelyanov
8b46e3d4 SEASTAR_ASSERT: assert to stderr and flush stream
3370e22a tutorial.md: use current_exception_as_future()
e977453a Add fixture support for seastar::testing
3e70d7f7 io_tester: Do not set append_is_unlikely unconditionally
2a4ae7b4 io_tester: Count file size overflows
5e678bb5 io_tester: Tuneup size overflow check
d5dad8ce io_tester: Move position management code to io_class_data
5586a056 io_tester: Rename seqwrite -> overwrite
92df2fb2 io_tester: Relax return value of create_and_fill_file()
03d9500d io_tester: Dont fill file for APPEND
d6844a7b io_tester: Indentation fix after previous patch
fb9e0088 io_tester: Coroutinize create_and_fill_file()
2f802f57 exceptions: log thrown and propagated exception with distinct log levels
4971fa70 util: move log-level into own header
39448fc1 Merge 'Fix and tune http::request setup by client' from Pavel Emelyanov
52d0c4fb iostream: Move output_stream::write(scattered_message) lower
7a52f734 Merge 'read_first_line: Missing pragma and licence' from Ernest Zaslavsky
d0881b7e read_first_line: Add missing license boilerplate
988a0e99 read_first_line:: Add missing `#pragma once`
42675266 http: Make client::make_request accept const request&
c7709fb5 http: Make request making API return exceptional future not throw
b68ed89b http: Move request content length header setup
1d96dac6 http: Move request version configuration
072e86f6 http: Setup request once
```

Closes scylladb/scylladb#25915
2025-09-13 17:14:28 +03:00
Radosław Cybulski
436150eb52 treewide: fix spelling errors
Fix spelling errors reported by copilot on github.
Remove single use namespace alias.

Closes scylladb/scylladb#25960
2025-09-12 15:58:19 +03:00
Avi Kivity
c91b326d5a Merge 'transport: replace throwing protocol_exception with returns' from Dario Mirovic
Replace throwing `protocol_exception` with returning it as a result or an exceptional future in the transport server module. The goal is to improve performance.

Most of the `protocol_exception` throws were made from `fragmented_temporary_buffer` module, by passing `exception_thrower()` to its `read*` methods. `fragmented_temporary_buffer` is changed so that it now accepts an exception creator, not exception thrower. `fragmented_temporary_buffer_concepts::ExceptionCreator` concept replaced `fragmented_temporary_buffer_concepts::ExceptionThrower` and all methods that have been throwing now return failed result of type `utils::result_with_eptr`. This change is then propagated to the callers.

The scope of this patch is `protocol_exception`, so commitlog just calls `.value()` method on the result. If the result failed, that will throw the exception from the result, as defined by `utils::result_with_eptr_throw_policy`. This means that the behavior of commitlog module stays the same.

transport server module handles results gracefully. All the caller functions that return non-future value `T` now return `utils::result_with_eptr<T>`. When the caller is a function that returns a future, and it receives failed result, `make_exception_future(std::move(failed_result).value())` is returned. The rest of the callstack up to the transport server `handle_error` function is already working without throwing, and that's how zero throws is achieved.
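
The result-returning pattern described above can be illustrated with a minimal sketch. It is not the actual `utils::result_with_eptr` implementation: `result_sketch` and `read_u8` are hypothetical names, and the `std::variant`-based storage is an assumption made only for illustration. It shows a reader that creates an exception object and returns it as a failed result, while `value()` rethrows on failure in the way the commitlog path is described as relying on.

```cpp
#include <cstddef>
#include <cstdint>
#include <exception>
#include <stdexcept>
#include <string>
#include <utility>
#include <variant>

// Hypothetical stand-in for utils::result_with_eptr<T>: holds either a value
// or a std::exception_ptr describing the failure.
template <typename T>
class result_sketch {
    std::variant<T, std::exception_ptr> _v;
public:
    result_sketch(T value) : _v(std::in_place_index<0>, std::move(value)) {}
    result_sketch(std::exception_ptr e) : _v(std::in_place_index<1>, std::move(e)) {}

    bool has_value() const { return _v.index() == 0; }

    // Mirrors the throw policy described above: value() rethrows the stored
    // exception if the result is a failure.
    T& value() {
        if (!has_value()) {
            std::rethrow_exception(std::get<1>(_v));
        }
        return std::get<0>(_v);
    }

    // Only valid on a failed result.
    std::exception_ptr error() const { return std::get<1>(_v); }
};

// Hypothetical reader: on a short buffer it creates the exception object and
// returns it as a failed result instead of throwing it to the caller.
inline result_sketch<std::uint8_t> read_u8(const std::string& buf, std::size_t pos) {
    if (pos >= buf.size()) {
        return std::make_exception_ptr(std::runtime_error("truncated frame"));
    }
    return static_cast<std::uint8_t>(buf[pos]);
}
```

A caller that wants the old behavior calls `.value()` and lets the exception propagate; a caller on the hot path checks `has_value()` and forwards `error()` without unwinding.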

cql3 module changes do the same as transport server module.

Benchmark that is not yet merged has commit `67fbe35833e2d23a8e9c2dcb5e04580231d8ec96`, [GitHub diff view](https://github.com/scylladb/scylladb/compare/master...nuivall:scylladb:perf_cql_raw). It uses either read or write query.

Command line used:
```
./build/release/scylla perf-cql-raw --workdir ~/tmp/scylladir --smp 1 --developer-mode 1 --workload write --duration 300 --concurrency 1000 --username cassandra --password cassandra 2>/dev/null
```
The only thing changed across runs is `--workload write`/`--workload read`.

Built and run on `release` target.

<details>

```
throughput:
        mean=   36946.04 standard-deviation=1831.28
        median= 37515.49 median-absolute-deviation=1544.52
        maximum=39748.41 minimum=28443.36
instructions_per_op:
        mean=   108105.70 standard-deviation=965.19
        median= 108052.56 median-absolute-deviation=53.47
        maximum=124735.92 minimum=107899.00
cpu_cycles_per_op:
        mean=   70065.73 standard-deviation=2328.50
        median= 69755.89 median-absolute-deviation=1250.85
        maximum=92631.48 minimum=66479.36

⏱  real=5:11.08  user=2:00.20  sys=2:25.55  cpu=85%
```

```
throughput:
        mean=   40718.30 standard-deviation=2237.16
        median= 41194.39 median-absolute-deviation=1723.72
        maximum=43974.56 minimum=34738.16
instructions_per_op:
        mean=   117083.62 standard-deviation=40.74
        median= 117087.54 median-absolute-deviation=31.95
        maximum=117215.34 minimum=116874.30
cpu_cycles_per_op:
        mean=   58777.43 standard-deviation=1225.70
        median= 58724.65 median-absolute-deviation=776.03
        maximum=64740.54 minimum=55922.58

⏱  real=5:12.37  user=27.461  sys=3:54.53  cpu=83%
```

```
throughput:
        mean=   37107.91 standard-deviation=1698.58
        median= 37185.53 median-absolute-deviation=1300.99
        maximum=40459.85 minimum=29224.83
instructions_per_op:
        mean=   108345.12 standard-deviation=931.33
        median= 108289.82 median-absolute-deviation=55.97
        maximum=124394.65 minimum=108188.37
cpu_cycles_per_op:
        mean=   70333.79 standard-deviation=2247.71
        median= 69985.47 median-absolute-deviation=1212.65
        maximum=92219.10 minimum=65881.72

⏱  real=5:10.98  user=2:40.01  sys=1:45.84  cpu=85%
```

```
throughput:
        mean=   38353.12 standard-deviation=1806.46
        median= 38971.17 median-absolute-deviation=1365.79
        maximum=41143.64 minimum=32967.57
instructions_per_op:
        mean=   117270.60 standard-deviation=35.50
        median= 117268.07 median-absolute-deviation=16.81
        maximum=117475.89 minimum=117073.74
cpu_cycles_per_op:
        mean=   57256.00 standard-deviation=1039.17
        median= 57341.93 median-absolute-deviation=634.50
        maximum=61993.62 minimum=54670.77

⏱  real=5:12.82  user=4:10.79  sys=11.530  cpu=83%
```

This shows ~240 instructions per op increase for reads and ~180 instructions per op increase for writes.

Tests have been run multiple times, with almost identical results. Each run lasted 300 seconds. Number of operations executed is roughly 38k per second * 300 seconds = 11.4m ops.

Update:

I have repeated the benchmark with clean state - reboot computer, put in performance mode, rebuild, closed other apps that might affect CPU and disk usage.

run count: 5 times before and 5 times after the patch
duration: 300 seconds

Average write throughput median before patch: 41155.99
Average write throughput median after patch: 42193.22

Median absolute deviation is also lower now, with values in range 350-550, while the previous runs' values were in range 750-1350.

</details>

Built and run on `release` target.

<details>

./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache false --bypass-cache 2>/dev/null

```
throughput:
        mean=   14910.90 standard-deviation=477.72
        median= 14956.73 median-absolute-deviation=294.16
        maximum=16061.18 minimum=13198.68
instructions_per_op:
        mean=   659591.63 standard-deviation=495.85
        median= 659595.46 median-absolute-deviation=324.91
        maximum=661184.94 minimum=658001.49
cpu_cycles_per_op:
        mean=   213301.49 standard-deviation=2724.27
        median= 212768.64 median-absolute-deviation=1403.85
        maximum=225837.15 minimum=208110.12

⏱  real=5:19.26  user=5:00.22  sys=15.827  cpu=98%
```

./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache false 2>/dev/null

```
throughput:
        mean=   93345.45 standard-deviation=4499.00
        median= 93915.52 median-absolute-deviation=2764.41
        maximum=104343.64 minimum=79816.66
instructions_per_op:
        mean=   65556.11 standard-deviation=97.42
        median= 65545.11 median-absolute-deviation=71.51
        maximum=65806.75 minimum=65346.25
cpu_cycles_per_op:
        mean=   34160.75 standard-deviation=803.02
        median= 33927.16 median-absolute-deviation=453.08
        maximum=39285.19 minimum=32547.13

⏱  real=5:03.23  user=4:29.46  sys=29.255  cpu=98%
```

./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache true 2>/dev/null

```
throughput:
        mean=   206982.18 standard-deviation=15894.64
        median= 208893.79 median-absolute-deviation=9923.41
        maximum=232630.14 minimum=127393.34
instructions_per_op:
        mean=   35983.27 standard-deviation=6.12
        median= 35982.75 median-absolute-deviation=3.75
        maximum=36008.24 minimum=35952.14
cpu_cycles_per_op:
        mean=   17374.87 standard-deviation=985.06
        median= 17140.81 median-absolute-deviation=368.86
        maximum=26125.38 minimum=16421.99

⏱  real=5:01.23  user=4:57.88  sys=0.124  cpu=98%
```

./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache false --bypass-cache 2>/dev/null

```
throughput:
        mean=   16198.26 standard-deviation=902.41
        median= 16094.02 median-absolute-deviation=588.58
        maximum=17890.10 minimum=13458.74
instructions_per_op:
        mean=   659752.73 standard-deviation=488.08
        median= 659789.16 median-absolute-deviation=334.35
        maximum=660881.69 minimum=658460.82
cpu_cycles_per_op:
        mean=   216070.70 standard-deviation=3491.26
        median= 215320.37 median-absolute-deviation=1678.06
        maximum=232396.48 minimum=209839.86

⏱  real=5:17.33  user=4:55.87  sys=18.425  cpu=99%
```

./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache false 2>/dev/null

```
throughput:
        mean=   97067.79 standard-deviation=2637.79
        median= 97058.93 median-absolute-deviation=1477.30
        maximum=106338.97 minimum=87457.60
instructions_per_op:
        mean=   65695.66 standard-deviation=58.43
        median= 65695.93 median-absolute-deviation=37.67
        maximum=65947.76 minimum=65547.05
cpu_cycles_per_op:
        mean=   34300.20 standard-deviation=704.66
        median= 34143.92 median-absolute-deviation=321.72
        maximum=38203.68 minimum=33427.46

⏱  real=5:03.22  user=4:31.56  sys=29.164  cpu=99%
```

./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache true 2>/dev/null

```
throughput:
        mean=   223495.91 standard-deviation=6134.95
        median= 224825.90 median-absolute-deviation=3302.09
        maximum=234859.90 minimum=193209.69
instructions_per_op:
        mean=   35981.41 standard-deviation=3.16
        median= 35981.13 median-absolute-deviation=2.12
        maximum=35991.46 minimum=35972.55
cpu_cycles_per_op:
        mean=   17482.26 standard-deviation=281.82
        median= 17424.08 median-absolute-deviation=143.91
        maximum=19120.68 minimum=16937.43

⏱  real=5:01.23  user=4:58.54  sys=0.136  cpu=99%
```

</details>

Fixes: #24567

This PR is a continuation of #24738 [transport: remove throwing protocol_exception on connection start](https://github.com/scylladb/scylladb/pull/24738). This PR does not solve a burning issue, but is rather an improvement in the same direction. As it is just an enhancement, it should not be backported.

Closes scylladb/scylladb#25408

* github.com:scylladb/scylladb:
  test/cqlpy: add protocol exception tests
  test/cqlpy: `test_protocol_exceptions.py` refactor message frame building
  test/cqlpy: `test_protocol_exceptions.py` refactor duplicate code
  transport: replace `make_frame` throw with return result
  cql3: remove throwing `protocol_exception`
  transport: replace throw in validate_utf8 with result_with_exception_ptr return
  transport: replace throwing protocol_exception with returns
  utils: add result_with_exception_ptr
  test/cqlpy: add unknown compression algorithm test case
2025-09-10 21:54:15 +03:00
Avi Kivity
fc64333040 Merge 'sstables/trie: add BTI index readers and writers' from Michał Chojnowski
This is yet another part in the BTI index project.

Overarching issue: https://github.com/scylladb/scylladb/issues/19191
Previous part: https://github.com/scylladb/scylladb/pull/25506/
Next part: plugging the BTI index readers and writers into sstable readers and writers.

The new code added in this PR isn't used outside of tests yet, but it's posted as a separate PR for reviewability.

This series implements, on top of the key translation logic, and abstract trie writing and traversal logic, a writer and a reader of sstable index files (which map primary keys to positions in Data.db), as described in f16fb6765b/src/java/org/apache/cassandra/io/sstable/format/bti/BtiFormat.md.

Caveats:
1. I think the added test has reasonable coverage, but that depends on running it multiple times. (Though it shouldn't need more than a few runs to catch any bug it covers). It's somewhat awkward as a test meant for running in CI, it's better as something you run many times after a relevant change.
2. These readers and writers are intended to be compatible with Cassandra, but I did *NOT* do any compatibility testing. The writers and readers added here have only been tested against each other, not against Cassandra's readers and writers.
3. This didn't undergo any proper benchmarking and optimization work. I was doing some measurements in the past, but everything was rewritten so much since then that my old measurements are effectively invalidated. Frankly I have no idea what the performance of all this branchy-branchy logic is now.

No backports needed, new functionality.

Closes scylladb/scylladb#25626

* github.com:scylladb/scylladb:
  test/manual: add bti_cassandra_compatibility_test
  test/lib/random_schema: add some constraints for generated uuid and time/date values
  test/lib/random_utils: add a variant of get_bytes which takes an `engine&`
  test/boost: add bti_index_test
  sstables/writer: add an accessor for the current write position in Data.db
  sstables/trie: introduce bti_index_reader
  sstables/trie: add bti_partition_index_writer.cc
  sstables/trie: add bti_row_index_writer.cc
  utils/bit_cast: add a new overload of write_unaligned()
  sstables/trie: add trie_writer::add_partial()
  sstables/consumer: add read_56()
  sstables/trie: make bti_node_reader::page_ptr copy-constructible
  sstables: extract abstract_index_reader from index_reader.hh to its own header
  sstables/trie: add an accessor to the file_writer under bti_node_sink
  sstables/types: make `deletion_time::operator tombstone()` const
  sstables/types: add sstables::deletion_time::make_live()
  sstables/trie: fix a special case in max_offset_from_child
  sstables/trie: handle `partition_region`s other than `clustered` in BTI position encoding
  sstables/trie: rewrite lcb_mismatch to handle fragment invalidation
  test/boost/bti_key_translation_test: fix a compilation error hidden behind `if constexpr`
2025-09-10 21:48:52 +03:00
Pavel Emelyanov
9deea3655f s3: Fix chunked download source metrics calculations
In S3 client both read and write metrics have three counters -- number
of requests made, number of bytes processed and request latency. In most
of the cases all three counters are updated at once -- upon response
arrival.

However, in case of chunked download source this way of accounting
metrics is misleading. In this code the request is made once, and then
the obtained bytes are consumed eventually as the data arrive.

Currently, each time a new portion of data is read from the socket the
number of read requests is incremented. That's wrong, the request is
made once, and this counter should also be incremented once, not for
every data buffer that arrived in response.

Same for read request latency -- it's "added" for every data buffer that
arrives, but it's a lengthy process, the _request_ latency should be
accounted once per response. Maybe later we'll want to have "data
latency" metrics as well, but for what we have now it's request latency.

The number of read bytes is accounted properly, so not touched here.
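
The intended accounting can be sketched as follows. This is only an illustration under assumed names (`read_stats_sketch`, `chunked_download_sketch`, `on_response`, `on_buffer` are hypothetical and do not come from the S3 client): the request counter and the request latency are updated once per response, while bytes are added for every buffer that arrives.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>

// Hypothetical counters; the real S3 client metrics are named differently.
struct read_stats_sketch {
    std::uint64_t requests = 0;
    std::uint64_t bytes = 0;
    std::chrono::nanoseconds latency{0};
};

// Hypothetical chunked download: one request, many data buffers.
class chunked_download_sketch {
    read_stats_sketch& _stats;
public:
    explicit chunked_download_sketch(read_stats_sketch& stats) : _stats(stats) {}

    // Called once, when the response to the single request arrives.
    void on_response(std::chrono::nanoseconds request_latency) {
        _stats.requests += 1;
        _stats.latency += request_latency;
    }

    // Called for every buffer read from the socket: only bytes are accounted.
    void on_buffer(std::size_t size) {
        _stats.bytes += size;
    }
};
```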

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#25770
2025-09-08 09:49:03 +03:00
Michał Chojnowski
a800fef633 utils/bit_cast: add a new overload of write_unaligned()
Does the same thing as the existing overload, but this one
takes `std::byte*` instead of `void*`, and it additionally
returns the pointer to the end position.
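
A minimal sketch of what such a pair of overloads might look like, assuming a `memcpy`-based implementation. The name `write_unaligned_sketch` is hypothetical and the code is not copied from `utils/bit_cast`. Returning the end position lets consecutive writes be chained.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Assumed shape of the existing overload: writes the object representation
// of v to a possibly unaligned address.
template <typename T>
inline void write_unaligned_sketch(void* out, const T& v) noexcept {
    static_assert(std::is_trivially_copyable_v<T>, "T must be trivially copyable");
    std::memcpy(out, &v, sizeof(v));
}

// Assumed shape of the new overload: same write, but takes std::byte* and
// returns the pointer just past the written bytes.
template <typename T>
inline std::byte* write_unaligned_sketch(std::byte* out, const T& v) noexcept {
    static_assert(std::is_trivially_copyable_v<T>, "T must be trivially copyable");
    std::memcpy(out, &v, sizeof(v));
    return out + sizeof(v);
}
```

With the second overload, a sequence of fields can be serialized as `p = write_unaligned_sketch(p, a); p = write_unaligned_sketch(p, b);` without tracking offsets by hand.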
2025-09-07 00:30:15 +02:00
Avi Kivity
ed483647a4 interval: specialize interval_data<T> for trivial types
C++ data movement algorithms (std::uninitialized_copy() and friends)
and the containers that use them optimize for trivially copyable
and destructible types by calling memcpy instead of using a loop
around constructors/destructors. Make intervals of trivially
copyable and destructible types also trivially copyable and
destructible by specializing interval_data<T> not to have
user-defined special member functions. This requires that T have
a default constructor since we can't skip construction when
!_start_exists or !_end_exists.

To choose whether we specialize or not, we look at default
constructibility (see above) and trivial destructibility. This is
wider than trivial copyability (a user-defined copy constructor
can exist) but is still beneficial, since the generated copy
constructor for interval_data<T> will be branch-free.

We don't implement the poison words in debug mode; nor are they
necessary, since we don't manage the lifetime of _start_value
and _end_value manually any more but let the compiler do that for us.

Note [1] prevents full conversion to memcpy for now, but we still
get branch free code.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121789
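
The selection rule described above can be sketched with a single optional bound. This is an illustration under stated assumptions, not the real `interval_data<T>`: `bound_data_sketch` and `use_trivial_layout_v` are hypothetical names, and only `_start_exists` and `_start_value` follow the commit message. The generic version manages the value's lifetime by hand; the specialization stores a default-constructed `T` and declares no special member functions, so it stays trivially copyable when `T` is.

```cpp
#include <new>
#include <string>
#include <type_traits>
#include <utility>

// Assumed selection criterion: default constructible and trivially destructible.
template <typename T>
inline constexpr bool use_trivial_layout_v =
    std::is_default_constructible_v<T> && std::is_trivially_destructible_v<T>;

// Generic version: the value lives in a union and is constructed and
// destroyed manually, which requires user-defined special member functions.
template <typename T, bool Trivial = use_trivial_layout_v<T>>
struct bound_data_sketch {
    bool _start_exists = false;
    union { T _start_value; };

    bound_data_sketch() {}
    explicit bound_data_sketch(T v) : _start_exists(true) {
        new (&_start_value) T(std::move(v));
    }
    bound_data_sketch(const bound_data_sketch& o) : _start_exists(o._start_exists) {
        if (_start_exists) {
            new (&_start_value) T(o._start_value);
        }
    }
    ~bound_data_sketch() {
        if (_start_exists) {
            _start_value.~T();
        }
    }
};

// Specialization: T is always constructed, so the compiler-generated copy
// constructor and destructor suffice and copying needs no branch.
template <typename T>
struct bound_data_sketch<T, true> {
    bool _start_exists = false;
    T _start_value{};

    bound_data_sketch() = default;
    explicit bound_data_sketch(T v) : _start_exists(true), _start_value(std::move(v)) {}
};

static_assert(std::is_trivially_copyable_v<bound_data_sketch<int>>);
static_assert(!std::is_trivially_copyable_v<bound_data_sketch<std::string>>);
```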
2025-09-06 18:38:24 +03:00
Avi Kivity
20751517a4 interval: split data members into new interval_data class
Prepare for specialized handling of trivial types by extracting
the data members of wrapping_interval<T> and the special member
functions (constructors/destructors/assignment) into a new
interval_data<T> template.

To avoid having to refer to data members with a this-> prefix,
add using declarations in wrapping_interval<T>.
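
The reason for the using declarations is that members of a dependent base class are not found by unqualified lookup inside a template. A minimal sketch, with hypothetical names (`data_sketch`, `wrapper_sketch`) standing in for `interval_data<T>` and `wrapping_interval<T>`:

```cpp
#include <utility>

// Hypothetical base holding the data members.
template <typename T>
struct data_sketch {
    T _start_value{};
    bool _start_exists = false;
};

// Hypothetical derived template. Without the using declarations below, every
// access would have to be written as this->_start_value or
// data_sketch<T>::_start_value, because the base depends on T.
template <typename T>
class wrapper_sketch : private data_sketch<T> {
    using data_sketch<T>::_start_value;
    using data_sketch<T>::_start_exists;
public:
    void set_start(T v) {
        _start_value = std::move(v);
        _start_exists = true;
    }
    bool has_start() const { return _start_exists; }
    const T& start() const { return _start_value; }
};
```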
2025-09-06 18:31:58 +03:00
Radosław Cybulski
c242234552 Revert "build: add precompiled headers to CMakeLists.txt"
This reverts commit 01bb7b629a.

Closes scylladb/scylladb#25735
2025-09-03 09:46:00 +03:00
Avi Kivity
7ed261fc52 Merge 'Inital GCP object storage support' from Calle Wilund
Adds infrastructure and client for interaction with GCP object storage services.

Note: this is just a client object usable for creating, listing, deleting and up/downloading of objects to/from said storage service. It makes no attempt at actually inserting it into the sstable storage flow. That can come later.

This PR breaks out GCP auth and some general REST call functionality into shared routines. Not all code is 100% reused, but at least some.

Test is added, though could be more comprehensive (feel free to suggest a test vector).
Test can run in either local mock server mode (default), or against actual GCP.
See `test/boost/gcp_object_storage_test.cc` for explanation on the config environment vars.
Default is to run the test against a temporary docker daemon.

Closes scylladb/scylladb#24629

* github.com:scylladb/scylladb:
  test::boost::gcp_object_storage_test: Initial unit tests for GCP obj storage
  proc-utils: Re-export waiting types from seastar
  proc-utils: Inherit environment from current process
  utils::gcp::object_storage: Add client for GCP object storage
  utils::http: Add optional external credentials to dns_connection_factory init
  utils::rest: Break out request wrapper and send logic
  encryption::gcp_host: Use shared gcp credentials + REST helpers
  utils::gcp: Move/add gcp credentials management to shared file
  utils::rest::client: Add formatter for seastar::http::reply
  utils::rest::client: Add helper routines for simple REST calls
  utils::http: Make shared system trust certificates public
2025-09-02 14:38:09 +03:00
Avi Kivity
fe308de8df Merge 'treewide: Add missing #pragma once' from Ernest Zaslavsky
Add missing #pragma once and license boilerplate to include headers.

Consider adding a CI step to catch missing header guards early. It can be done easily by running `cpplint` like below
```
 find . -path ./seastar -prune -o -path ./venv -prune -o -path ./idl -prune -o -type f \( -name "*.h" -o -name "*.hh" -o -name "*.hpp" \) -print0 | xargs -0 cpplint 2>&1 | grep "header guard found"
```

No backport is needed, the change is not "functional"

Closes scylladb/scylladb#25768

* github.com:scylladb/scylladb:
  treewide: Add missing license boilerplate
  treewide: Add missing `#pragma once`
2025-09-02 13:18:04 +03:00
Calle Wilund
4a5b547a86 utils::gcp::object_storage: Add client for GCP object storage
Adds a minimal client for GCP object storage operations:

* Create buckets
* Delete buckets
* List bucket content
* Copy/move bucket content
* Delete bucket content
* Upload bucket content
* Download bucket content
2025-09-01 18:03:44 +00:00
Calle Wilund
8f54b709ce utils::http: Add optional external credentials to dns_connection_factory init
Also allow creating the object using an endpoint expression.
Note: this moves code to the .cc file, because it introduces a few
more lines, and I feel we have too much stuff in headers as is.
2025-09-01 18:03:44 +00:00
Calle Wilund
0e9e1f7738 utils::rest: Break out request wrapper and send logic
Allows sharing some of the wrapping and logic outside the
single-call object/routine paths, using it also with an external
seastar::http::client, i.e. caching resources across several calls.
2025-09-01 18:03:44 +00:00
Calle Wilund
2b7ad605b3 utils::gcp: Move/add gcp credentials management to shared file
Copied from encryption::gcp_host. Light-weight impl of gcp credentials
management.
2025-09-01 18:03:44 +00:00
Calle Wilund
f6d7c7e300 utils::rest::client: Add formatter for seastar::http::reply
2025-09-01 18:03:44 +00:00
Calle Wilund
cc1e659abd utils::rest::client: Add helper routines for simple REST calls
Packing headers and unpacking response to json. Usable for esp. gcp
interaction.
2025-09-01 18:03:43 +00:00
Calle Wilund
886fcf1759 utils::http: Make shared system trust certificates public
So other clients/factories can share.
2025-09-01 18:03:43 +00:00
Ernest Zaslavsky
0e4292adb4 treewide: Add missing license boilerplate
Add missing license boilerplate to include headers
2025-09-01 14:58:32 +03:00
Ernest Zaslavsky
19345e539f treewide: Add missing #pragma once
Add missing `#pragma once` to include headers
2025-09-01 14:58:21 +03:00