Sometimes file::list_directory() returns entries without type set. In
thase case lister calls file_type() on the entry name to get it. In case
the call returns disengated type, the code assumes that some error
occurred and resolves into exception.
That's not correct. The file_type() method returns disengated type only
if the file being inspected is missing (i.e. on ENOENT errno). But this
can validly happen if a file is removed bettween readdir and stat. In
that case it's not "some error happened", but a enry should be just
skipped. In "some error happened", then file_type() would resolve into
exceptional future on its own.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#26595
(cherry picked from commit d9bfbeda9a)
Closesscylladb/scylladb#26767
Change all logging related to errors in `chunked_download_source` background download fiber to `info` to make it visible right away in logs.
(cherry picked from commit fdd0d66f6e)
Refactor the wrapping exception used in `chunked_download_source` to
prevent the retry strategy from reattempting failed requests. The new
implementation preserves the original `exception_ptr`, making the root
cause clearer and easier to diagnose.
(cherry picked from commit 1d34657b14)
To prevent client retrying indefinitely time skew and authentication errors add `max_attempts` to the `client::make_request`
(cherry picked from commit 43acc0d9b9)
It never worked as intended, so the credentials handling is moving to the same place where we handle time skew, since we have to reauthenticate the request
(cherry picked from commit 116823a6bc)
Add an option to retry S3 requests at the highest level, including
reinitializing headers and reauthenticating. This addresses cases
where retrying the same request fails, such as when the S3 server
rejects a timestamp older than 15 minutes.
(cherry picked from commit 185d5cd0c6)
Refactor `make_request` to use a single core implementation that
handles authentication and issues the HTTP request. All overloads now
delegate to this unified method.
(cherry picked from commit 55fb2223b6)
Introduce a counter metric to monitor instances where the background
filling fiber is blocked due to insufficient memory in the S3 client.
Closesscylladb/scylladb#26466
(cherry picked from commit 413739824f)
Closesscylladb/scylladb#26555
In one of the constructors of `named_value`, the `allowed_values`
argument isn't used.
(This means that if some config entry uses this constructor,
the values aren't validated on the config layer,
and might give some lower layer a bad surprise).
Fix that.
Fixesscyllladb/scylladb#26371Closesscylladb/scylladb#26196
(cherry picked from commit 3b338e36c2)
Closesscylladb/scylladb#26425
This is yet another part in the BTI index project.
Overarching issue: https://github.com/scylladb/scylladb/issues/19191
Previous part: https://github.com/scylladb/scylladb/pull/25626
Next parts: make `ms` the default. Then, general tweaks and improvements. Later, potentially a full `da` format implementation.
This patch series introduces a new, Scylla-only sstable format version `ms`, which is like `me`, but with the index components (Summary.db and Index.db) replaced with BTI index components (Partitions.db and Rows.db), as they are in Cassandra 5.0's `da` format version.
(Eventually we want to just implement `da`, but there are several other changes (unrelated to the index files) between `me` and `da`. By adding this `ms` as an intermediate step we can adapt the new index formats without dragging all the other changes into the mix (and raising the risk of regressions, which is already high)).
The high-level structure of the PR is:
1. Introduce new component types — `Partitions` and `Rows`.
2. Teach `class sstable` to open them when they exist.
3. Teach the sstable writer how to write index data to them.
4. Teach `class sstable` and unit tests how to deal with sstables that have no `Index` or `Summary` (but have `Partitions` and `Rows` instead).
5. Introduce the new sstable version `ms`, specify that it has `Partitions` and `Rows` instead of `Index` and `Summary`.
6. Prepare unit tests for the appearance of `ms`.
7. Enable `ms` in unit tests.
8. Make `ms` enablable via db::config (with a silent fall back to `me` until the new `MS_SSTABLE_FORMAT` cluster feature is enabled).
9. Prepare integration tests for the appearance of `ms`.
10. Enable both `ms` and `me` in tests where we want both versions to be tested.
This series doesn't make `ms` the default yet, because that requires teaching Scylla Manager and a few dtests about the new format first. It can be enabled by setting `sstable_format: ms` in the config.
Per a review request, here is an example from `perf_fast_forward`, demonstrating some motivation for a new format. (Although not the main one. The main motivations are getting rid of restrictions on the RAM:disk ratio, and index read throughput for datasets with tiny partitions). The dataset was populated with `build/release/scylla perf-fast-forward --smp=1 --sstable-format=$VERSION --data-directory=data.$VERSION --column-index-size-in-kb=1 --populate --random-seed=0`.
This test involves a partition with 1000000 clustering rows (with 32-bit keys and 100-byte values) and ~500 index blocks, and queries a few particular rows from the partition. Since the branching factor for the BIG promoted index is 2 (it's a binary search), the lookup involves ~11.2 sequential page reads per row. The BTI format has a more reasonable branching factor, so it involves ~2.3 page reads per row.
`build/release/scylla perf-fast-forward --smp=1 --data-directory=perf_fast_forward_data/me --run-tests=large-partition-select-few-rows`:
```
offset stride rows iterations avg aio aio (KiB)
500000 1 1 70 18.0 18 128
500001 1 1 647 19.0 19 132
0 1000000 1 748 15.0 15 116
0 500000 2 372 29.0 29 284
0 250000 4 227 56.0 56 504
0 125000 8 116 106.0 106 928
0 62500 16 67 195.0 195 1732
```
`build/release/scylla perf-fast-forward --smp=1 --data-directory=perf_fast_forward_data/ms --run-tests=large-partition-select-few-rows`:
```
offset stride rows iterations avg aio aio (KiB)
500000 1 1 51 5.1 5 20
500001 1 1 64 5.3 5 20
0 1000000 1 679 4.0 4 16
0 500000 2 492 8.0 8 88
0 250000 4 804 16.0 16 232
0 125000 8 409 31.0 31 516
0 62500 16 97 54.0 54 1056
```
Index file size comparison for the default `perf_fast_forward` tables with `--random-seed=0`:
Large partition table (dominated by intra-partition index): 2.4 MB with `me`, 732 kB with `ms`.
For the small partitions table (dominated by inter-partition index): 11 MB with `me`, 8.4 MB with `ms`.
External tests:
I ran SCT test `longevity-mv-si-4days-streaming-test` test on 6 nodes with 30 shards each for 8 hours. No anomalies were observed.
New functionality, no backport needed.
Closesscylladb/scylladb#26215
* github.com:scylladb/scylladb:
test/boost/bloom_filter_test: add test_rebuild_from_temporary_hashes
test/cluster: add test_bti_index.py
test: prepare bypass_cache_test.py for `ms` sstables
sstables/trie/bti_index_reader: add a failure injection in advance_lower_and_check_if_present
test/cqlpy/test_sstable_validation.py: prepare the test for `ms` sstables
tools/scylla-sstable: add `--sstable-version=?` to `scylla sstable write`
db/config: expose "ms" format to the users via database config
test: in Python tests, prepare some sstable filename regexes for `ms`
sstables: add `ms` to `all_sstable_versions`
test/boost/sstable_3_x_test: add `ms` sstables to multi-version tests
test/lib/index_reader_assertions: skip some row index checks for BTI indexes
test/boost/sstable_inexact_index_test: explicitly use a `me` sstable
test/boost/sstable_datafile_test: skip test_broken_promoted_index_is_skipped for `ms` sstables
test/resource: add `ms` sample sstable files for relevant tests
test/boost/sstable_compaction_test: prepare for `ms` sstables.
test/boost/index_reader_test: prepare for `ms` sstables
test/boost/bloom_filter_tests: prepare for `ms` sstables
test/boost/sstable_datafile_test: prepare for `ms` sstables
test/boost/sstable_test: prepare for `ms` sstables.
sstables: introduce `ms` sstable format version
tools/scylla-sstable: default to "preferred" sstable version, not "highest"
sstables/mx/reader: use the same hashed_key for the bloom filter and the index reader
sstables/trie/bti_index_reader: allow the caller to passing a precalculated murmur hash
sstables/trie/bti_partition_index_writer: in add(), get the key hash from the caller
sstables/mx: make Index and Summary components optional
sstables: open Partitions.db early when it's needed to populate key range for sharding metadata
sstables: adapt sstable::set_first_and_last_keys to sstables without Summary
sstables: implement an alternative way to rebuild bloom filters for sstables without Index
utils/bloom_filter: add `add(const hashed_key&)`
sstables: adapt estimated_keys_for_range to sstables without Summary
sstables: make `sstable::estimated_keys_for_range` asynchronous
sstables/sstable: compute get_estimated_key_count() from Statistics instead of Summary
replica/database: add table::estimated_partitions_in_range()
sstables/mx: implement sstable::has_partition_key using a regular read
sstables: use BTI index for queries, when present and enabled
sstables/mx/writer: populate BTI index files
sstables: create and open BTI index files, when enabled
sstables: introduce Partition and Rows component types
sstables/mx/writer: make `_pi_write_m.partition_tombstone` a `sstables::deletion_time`
In one of the next patches, we will want to use (in BTI partition
index writer) the same hash as used by the bloom filter,
and we'll also want to allow rebuilding the filter in a second
pass (after the whole sstable is written) from hashes (as opposed
to rebuilding from partition keys saved in Index.db, which is
something we sometimes do today) saved to a temporary file.
For those, we need an interface that allows us to compute the hash
externally, and only pass the hash to `add()`.
This patch extend `lru_string_map` with `sized_string_map` - a class that helps to control cache size.
It implements cache resizing in background thread.
Adds a lru_string_map definition.
This structure maps a string keys to templated arguments, allowing efficient lookup and adding keys.
Each lookup (and adding) puts the keys on internal LRU list and the entires may be efficiently removed in a LRU order.
It will be a base for the expression cache in Alternator.
Acecssing this member directly is deprecated, migrate code to use {get,set}_query_param() and friends instead.
Fixes: https://github.com/scylladb/scylladb/issues/26023
Preparation for seastar update, no backport required.
Closesscylladb/scylladb#26024
* github.com:scylladb/scylladb:
treewide: move away from accessing httpd::request::query_parameters
test/pylib/s3_server_mock.py: better handle empty query params
In f3dccc2215 ("interval: change start()/end() not to return
references to data members"), we introduced interval_bound_const_ref
as a lightweight alternative to interval_bound that does not carry
a T. This was needed because interval no longer contains
interval_bound:s.
This interval_bound_const_ref was just an interval_bound<const T&>,
and converting constructors and operators were added to move between
the interval_bound<T> and interval_bound<const T&>.
However, these happen to be illegal in C++ and just happened to work
in clang 20. Clang 21 tightened its checks and these are now flagged.
The problem is that when instantiating interval_bound<const T&> the
converting constructor looks like a copy constructor; and while it's
behind a constraint (that evaluates to false) the rules don't care
about that.
Fix this by having a separate interval_bound_const_ref template.
The new template is slightly better as it allows assignment (since
the payload is a pointer rather than a reference). Not that it's really
needed.
The C++ rule was reported [1] as too restrictive, but there is no
resolution yet.
[1] https://cplusplus.github.io/CWG/issues/2837.htmlClosesscylladb/scylladb#26081
start using `write_body` in `rest/client` to properly set headers due to changes applied to seastar's http client
Seastar module update
```
b6be384e Merge 'http: generalize Content-Type setting' from Nadav Har'El
74472298 http: generalize request's Content-Type setting
9fd5a1cc http: generalize reply's Content-Type setting
a2665f38 memory: Remove deprecated enable_abort_on_allocation_failure()
d2a5a8a9 resource.cc: Remove some dead code
7ad9f424 http: Add support of multiple key repetitions for the request
a636baca task: Move task::get_backtrace() definition in its class
a0101efa Fixed "doxygen" spelling in error message
db969482 Merge 'http/reply: introduce set_cookie()' from Botond Dénes
5357b434 http/reply: introduce set_cookie()
1ddcf05f http/reply: make write_reply*() public
4b782d73 http/connection: start_response(): fix indentation
720feca0 http/reply: encapsulate reply writing in write_reply()
3e19917d Merge 'exceptions: log thrown and propagated exception with distinct log levels' from Botond Dénes
db9aea93 Merge 'Correctly wrap up abandoned yielding directory lister' from Pavel Emelyanov
dbb2bf3f test: Add test for input_stream::read_exactly()
a5308ec9 file/directory_lister: Correctly wrap up fallback generator
4f0811f4 file/directory_lister: Convert on-stack queue to shared pointer
59801da7 tests: Add directory lister early drop cases
33233032 http/reply: s/write_reply_to_connection/write_reply/
69b93620 http/reply: write_reply_{to_connection,headers}(): pass output stream
56e9bda7 test: Convert directory_test into seastar test
96782358 Merge 'Improve io_tester's seqwrite and append workloads' from Pavel Emelyanov
8b46e3d4 SEASTAR_ASSERT: assert to stderr and flush stream
3370e22a tutorial.md: use current_exception_as_future()
e977453a Add fixture support for seastar::testing
3e70d7f7 io_tester: Do not set append_is_unlikely unconditionally
2a4ae7b4 io_tester: Count file size overflows
5e678bb5 io_tester: Tuneup size overflow check
d5dad8ce io_tester: Move position management code to io_class_data
5586a056 io_tester: Rename seqwrite -> overwrite
92df2fb2 io_tester: Relax return value of create_and_fill_file()
03d9500d io_tester: Dont fill file for APPEND
d6844a7b io_tester: Indentation fix after previous patch
fb9e0088 io_tester: Coroutinize create_and_fill_file()
2f802f57 exceptions: log thrown and propagated exception with distinct log levels
4971fa70 util: move log-level into own header
39448fc1 Merge 'Fix and tune http::request setup by client' from Pavel Emelyanov
52d0c4fb iostream: Move output_stream::write(scattered_message) lower
7a52f734 Merge 'read_first_line: Missing pragma and licence' from Ernest Zaslavsky
d0881b7e read_first_line: Add missing license boilerplate
988a0e99 read_first_line:: Add missing `#pragma once`
42675266 http: Make client::make_request accept const request&
c7709fb5 http: Make request making API return exceptional future not throw
b68ed89b http: Move request content length header setup
1d96dac6 http: Move request version configuration
072e86f6 http: Setup request once
```
Closesscylladb/scylladb#25915
(cherry picked from commit 44d34663bc)
Closesscylladb/scylladb#26100
This is yet another part in the BTI index project.
Overarching issue: https://github.com/scylladb/scylladb/issues/19191
Previous part: https://github.com/scylladb/scylladb/pull/25626
Next parts: introducing the new components, Partitions.db and Rows.db
This is the preparatory, uncontroversial part of https://github.com/scylladb/scylladb/pull/26039, which has been split out to a separate PR to make the main part (which, after a revision, will be posted later) smaller.
This series contains several small fixes and changes to BTI-related code added earlier, which either have to be done (i.e. propagating `reader_permit` to IO calls in index reads) or just deserved to be done. There's no single theme for the changes in this PR, refer to the individual commits for details.
The changes are for the sake of new and unreleased code. No backporting should be done.
Closesscylladb/scylladb#26075
* github.com:scylladb/scylladb:
sstables/mx/reader: remove mx::make_reader_with_index_reader
test/boost/bti_index_test: fix indentation
sstables/trie/bti_index_reader: in last_block_offset(), return offset from the beginning of partition, not file
sstables/trie: support reader_permit and trace_state properly
sstables/trie/bti_node_reader: avoid calling into `cached_file` if the target position is already cached
sstables/trie/bti_index_reader: get rid of the seastar::file wrapper in read_row_index_header
sstables/trie/bti_index_reader: support BYPASS CACHE
test/boost/bti_index_test: use read_bti_partitions_db_footer where appropriate
sstables/trie: change the signature of bti_partition_index_writer::finish
sstables/bti_index: improve signatures of special member functions in index writers
streaming/stream_transfer_task: coroutinize `estimate_partitions()`
types/comparable_bytes: add a missing implementation for date_type_impl
sstables: remove an outdated FIXME
storage_service: delete `get_splits()`
sstables/trie: fix some comment typos in bti_index_reader.cc
sstables/mx/writer: rename _pi_write_m.tomb to partition_tombstone
This PR fixes a bug in the Azure default credential provider that would cause the `test_azure_provider_with_incomplete_creds` unit test to be flaky. The provider would assume that an unreachable IMDS endpoint would always result in a timeout, but network errors are also possible (e.g., ICMP "host unreachable"). The issue is triggered by this particular test because it sets the IMDS endpoint to a non-routable address. Some routers choose to silently drop such packets, while others return ICMP errors. To fix it, the default credential provider has been updated to catch system errors as well.
This PR also raises the log level of the default credential provider from DEBUG to INFO, making it easier for operators to diagnose authentication issues.
More details in the commit messages.
Fixes#25641.
Closesscylladb/scylladb#25696
* github.com:scylladb/scylladb:
utils: azure: Catch system errors when detecting IMDS
utils: azure: Bump default credential logs from DEBUG to INFO
Upcoming changes in Seastar cause `rest::simple_send` to move the
`http::request` into `seastar::http::experimental::client::make_request`
when called multiple times. This leaves the original request in an
invalid state. Specifically, the `_version` field becomes empty,
causing request validation to fail. This patch ensures `version` is
explicitly set to prevent such failures.
Fixes: https://github.com/scylladb/scylladb/issues/26018Closesscylladb/scylladb#26066
Before this patch, `reader_permit` taken by `bti_index_reader`.
wasn't actually being passed down to disk reads. In this patch,
we fix this FIXME by propagating the permit down to the I/O
operations on the `cached_file`.
Also, it didn't take `trace_state_ptr` at all.
In this patch, we add a `trace_state_ptr` argument and propagate
it down to disk reads.
(We combine the two changes because the permit and the trace state
are passed together everywhere anyway).
As discussed in https://github.com/scylladb/scylladb/pull/24606#discussion_r2281870939
clear_gently of shared pointers should release the wrapped
object reference and when the object's use_count reaches 1,
the object itself would be cleared_gently, before it's destroyed.
This behavior is similar to the way we clear gently containers
like arrays or vectors, and so it is extended in this patch
to smart pointers like unique_ptr and foreign_ptr.
The unit tests are adjusted respectively to expect the
smart pointers to be reset after clear_gently, plus
the use of `reset()` for `foreign_ptr<shared_ptr<>>` was
replaced by `clear_gently().get()` which now ensures the
reference to a shared object is released, and awaited for,
if it happens on a foreign owner shard, unlike reset of
a foreign_ptr that kicks off destroy of that shared object
in the background on the owner shard - causing flakiness.
Fixes#25723
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#25759
The chunked download source sends large GET requests and then consumes data
as it arrives. Sometimes it can stop reading from socket early and drop the
in-flight data. The existing read-bytes metrics show only the number of
consumed bytes, we we also want to know the number of requested bytes
Refs #25770 (accounting of read-bytes)
Fixes#25876
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#25877
When the default credential provider probes IMDS to check its
availability, it assumes that application-level connection timeouts are
the only error that can occur when the node is not an Azure VM, i.e.,
the packets will be silently dropped somewhere in the network.
However, this has proven not always true for the
`test_azure_provider_with_incomplete_creds` unit test, which overrides
the default IMDS endpoint with a non-routeable IP from TEST-NET-1 [1].
This test has been reported to fail in some local setups where routers
respond with ICMP "host unreachable" errors instead of silently dropping
the packets. This error propagates to user space as an EHOSTUNREACH
system error, which is not caught by the default credential provider,
causing the test to fail. The reason we use a non-routeable address in
this test is to ensure that IMDS probing will always fail, even if
running the test on an Azure VM.
Theoretically, the same problem applies to the default IMDS endpoint as
well (169.254.169.254). The RFC 3927 [2] mandates that packets targeting
link-local addresses (169.254/16) must not be forwarded, but the exact
behavior is left to implementation.
Since we cannot predict how routers will behave, fix this by catching
all relevant system errors when probing IMDS.
[1] https://datatracker.ietf.org/doc/html/rfc5735
[2] https://datatracker.ietf.org/doc/html/rfc3927
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
The default credential provider produces diagnostic logs on each step as
it walks through the credential chain. These logs are useful for
operators to diagnose authentication problems as they expose information
about which credential sources are being evaluated, in which order, why
they fail, and which source is eventually selected.
Promote them from DEBUG to INFO level.
Additionally, concatenate the logs for environment credentials into a
single log statement to avoid interleaving with other logs.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
start using `write_body` in `rest/client` to properly set headers due to changes applied to seastar's http client
Seastar module update
```
b6be384e Merge 'http: generalize Content-Type setting' from Nadav Har'El
74472298 http: generalize request's Content-Type setting
9fd5a1cc http: generalize reply's Content-Type setting
a2665f38 memory: Remove deprecated enable_abort_on_allocation_failure()
d2a5a8a9 resource.cc: Remove some dead code
7ad9f424 http: Add support of multiple key repetitions for the request
a636baca task: Move task::get_backtrace() definition in its class
a0101efa Fixed "doxygen" spelling in error message
db969482 Merge 'http/reply: introduce set_cookie()' from Botond Dénes
5357b434 http/reply: introduce set_cookie()
1ddcf05f http/reply: make write_reply*() public
4b782d73 http/connection: start_response(): fix indentation
720feca0 http/reply: encapsulate reply writing in write_reply()
3e19917d Merge 'exceptions: log thrown and propagated exception with distinct log levels' from Botond Dénes
db9aea93 Merge 'Correctly wrap up abandoned yielding directory lister' from Pavel Emelyanov
dbb2bf3f test: Add test for input_stream::read_exactly()
a5308ec9 file/directory_lister: Correctly wrap up fallback generator
4f0811f4 file/directory_lister: Convert on-stack queue to shared pointer
59801da7 tests: Add directory lister early drop cases
33233032 http/reply: s/write_reply_to_connection/write_reply/
69b93620 http/reply: write_reply_{to_connection,headers}(): pass output stream
56e9bda7 test: Convert directory_test into seastar test
96782358 Merge 'Improve io_tester's seqwrite and append workloads' from Pavel Emelyanov
8b46e3d4 SEASTAR_ASSERT: assert to stderr and flush stream
3370e22a tutorial.md: use current_exception_as_future()
e977453a Add fixture support for seastar::testing
3e70d7f7 io_tester: Do not set append_is_unlikely unconditionally
2a4ae7b4 io_tester: Count file size overflows
5e678bb5 io_tester: Tuneup size overflow check
d5dad8ce io_tester: Move position management code to io_class_data
5586a056 io_tester: Rename seqwrite -> overwrite
92df2fb2 io_tester: Relax return value of create_and_fill_file()
03d9500d io_tester: Dont fill file for APPEND
d6844a7b io_tester: Indentation fix after previous patch
fb9e0088 io_tester: Coroutinize create_and_fill_file()
2f802f57 exceptions: log thrown and propagated exception with distinct log levels
4971fa70 util: move log-level into own header
39448fc1 Merge 'Fix and tune http::request setup by client' from Pavel Emelyanov
52d0c4fb iostream: Move output_stream::write(scattered_message) lower
7a52f734 Merge 'read_first_line: Missing pragma and licence' from Ernest Zaslavsky
d0881b7e read_first_line: Add missing license boilerplate
988a0e99 read_first_line:: Add missing `#pragma once`
42675266 http: Make client::make_request accept const request&
c7709fb5 http: Make request making API return exceptional future not throw
b68ed89b http: Move request content length header setup
1d96dac6 http: Move request version configuration
072e86f6 http: Setup request once
```
Closesscylladb/scylladb#25915
Replace throwing `protocol_exception` with returning it as a result or an exceptional future in the transport server module. The goal is to improve performance.
Most of the `protocol_exception` throws were made from `fragmented_temporary_buffer` module, by passing `exception_thrower()` to its `read*` methods. `fragmented_temporary_buffer` is changed so that it now accepts an exception creator, not exception thrower. `fragmented_temporary_buffer_concepts::ExceptionCreator` concept replaced `fragmented_temporary_buffer_concepts::ExceptionThrower` and all methods that have been throwing now return failed result of type `utils::result_with_eptr`. This change is then propagated to the callers.
The scope of this patch is `protocol_exception`, so commitlog just calls `.value()` method on the result. If the result failed, that will throw the exception from the result, as defined by `utils::result_with_eptr_throw_policy`. This means that the behavior of commitlog module stays the same.
transport server module handles results gracefully. All the caller functions that return non-future value `T` now return `utils::result_with_eptr<T>`. When the caller is a function that returns a future, and it receives failed result, `make_exception_future(std::move(failed_result).value())` is returned. The rest of the callstack up to the transport server `handle_error` function is already working without throwing, and that's how zero throws is achieved.
cql3 module changes do the same as transport server module.
Benchmark that is not yet merged has commit `67fbe35833e2d23a8e9c2dcb5e04580231d8ec96`, [GitHub diff view](https://github.com/scylladb/scylladb/compare/master...nuivall:scylladb:perf_cql_raw). It uses either read or write query.
Command line used:
```
./build/release/scylla perf-cql-raw --workdir ~/tmp/scylladir --smp 1 --developer-mode 1 --workload write --duration 300 --concurrency 1000 --username cassandra --password cassandra 2>/dev/null
```
The only thing changed across runs is `--workload write`/`--workload read`.
Built and run on `release` target.
<details>
```
throughput:
mean= 36946.04 standard-deviation=1831.28
median= 37515.49 median-absolute-deviation=1544.52
maximum=39748.41 minimum=28443.36
instructions_per_op:
mean= 108105.70 standard-deviation=965.19
median= 108052.56 median-absolute-deviation=53.47
maximum=124735.92 minimum=107899.00
cpu_cycles_per_op:
mean= 70065.73 standard-deviation=2328.50
median= 69755.89 median-absolute-deviation=1250.85
maximum=92631.48 minimum=66479.36
⏱ real=5:11.08 user=2:00.20 sys=2:25.55 cpu=85%
```
```
throughput:
mean= 40718.30 standard-deviation=2237.16
median= 41194.39 median-absolute-deviation=1723.72
maximum=43974.56 minimum=34738.16
instructions_per_op:
mean= 117083.62 standard-deviation=40.74
median= 117087.54 median-absolute-deviation=31.95
maximum=117215.34 minimum=116874.30
cpu_cycles_per_op:
mean= 58777.43 standard-deviation=1225.70
median= 58724.65 median-absolute-deviation=776.03
maximum=64740.54 minimum=55922.58
⏱ real=5:12.37 user=27.461 sys=3:54.53 cpu=83%
```
```
throughput:
mean= 37107.91 standard-deviation=1698.58
median= 37185.53 median-absolute-deviation=1300.99
maximum=40459.85 minimum=29224.83
instructions_per_op:
mean= 108345.12 standard-deviation=931.33
median= 108289.82 median-absolute-deviation=55.97
maximum=124394.65 minimum=108188.37
cpu_cycles_per_op:
mean= 70333.79 standard-deviation=2247.71
median= 69985.47 median-absolute-deviation=1212.65
maximum=92219.10 minimum=65881.72
⏱ real=5:10.98 user=2:40.01 sys=1:45.84 cpu=85%
```
```
throughput:
mean= 38353.12 standard-deviation=1806.46
median= 38971.17 median-absolute-deviation=1365.79
maximum=41143.64 minimum=32967.57
instructions_per_op:
mean= 117270.60 standard-deviation=35.50
median= 117268.07 median-absolute-deviation=16.81
maximum=117475.89 minimum=117073.74
cpu_cycles_per_op:
mean= 57256.00 standard-deviation=1039.17
median= 57341.93 median-absolute-deviation=634.50
maximum=61993.62 minimum=54670.77
⏱ real=5:12.82 user=4:10.79 sys=11.530 cpu=83%
```
This shows ~240 instructions per op increase for reads and ~180 instructions per op increase for writes.
Tests have been run multiple times, with almost identical results. Each run lasted 300 seconds. Number of operations executed is roughly 38k per second * 300 seconds = 11.4m ops.
Update:
I have repeated the benchmark with clean state - reboot computer, put in performance mode, rebuild, closed other apps that might affect CPU and disk usage.
run count: 5 times before and 5 times after the patch
duration: 300 seconds
Average write throughput median before patch: 41155.99
Average write throughput median after patch: 42193.22
Median absolute deviation is also lower now, with values in range 350-550, while the previous runs' values were in range 750-1350.
</details>
Built and run on `release` target.
<details>
./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache false --bypass-cache 2>/dev/null
```
throughput:
mean= 14910.90 standard-deviation=477.72
median= 14956.73 median-absolute-deviation=294.16
maximum=16061.18 minimum=13198.68
instructions_per_op:
mean= 659591.63 standard-deviation=495.85
median= 659595.46 median-absolute-deviation=324.91
maximum=661184.94 minimum=658001.49
cpu_cycles_per_op:
mean= 213301.49 standard-deviation=2724.27
median= 212768.64 median-absolute-deviation=1403.85
maximum=225837.15 minimum=208110.12
⏱ real=5:19.26 user=5:00.22 sys=15.827 cpu=98%
```
./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache false 2>/dev/null
```
throughput:
mean= 93345.45 standard-deviation=4499.00
median= 93915.52 median-absolute-deviation=2764.41
maximum=104343.64 minimum=79816.66
instructions_per_op:
mean= 65556.11 standard-deviation=97.42
median= 65545.11 median-absolute-deviation=71.51
maximum=65806.75 minimum=65346.25
cpu_cycles_per_op:
mean= 34160.75 standard-deviation=803.02
median= 33927.16 median-absolute-deviation=453.08
maximum=39285.19 minimum=32547.13
⏱ real=5:03.23 user=4:29.46 sys=29.255 cpu=98%
```
./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache true 2>/dev/null
```
throughput:
mean= 206982.18 standard-deviation=15894.64
median= 208893.79 median-absolute-deviation=9923.41
maximum=232630.14 minimum=127393.34
instructions_per_op:
mean= 35983.27 standard-deviation=6.12
median= 35982.75 median-absolute-deviation=3.75
maximum=36008.24 minimum=35952.14
cpu_cycles_per_op:
mean= 17374.87 standard-deviation=985.06
median= 17140.81 median-absolute-deviation=368.86
maximum=26125.38 minimum=16421.99
⏱ real=5:01.23 user=4:57.88 sys=0.124 cpu=98%
```
./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache false --bypass-cache 2>/dev/null
```
throughput:
mean= 16198.26 standard-deviation=902.41
median= 16094.02 median-absolute-deviation=588.58
maximum=17890.10 minimum=13458.74
instructions_per_op:
mean= 659752.73 standard-deviation=488.08
median= 659789.16 median-absolute-deviation=334.35
maximum=660881.69 minimum=658460.82
cpu_cycles_per_op:
mean= 216070.70 standard-deviation=3491.26
median= 215320.37 median-absolute-deviation=1678.06
maximum=232396.48 minimum=209839.86
⏱ real=5:17.33 user=4:55.87 sys=18.425 cpu=99%
```
./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache false 2>/dev/null
```
throughput:
mean= 97067.79 standard-deviation=2637.79
median= 97058.93 median-absolute-deviation=1477.30
maximum=106338.97 minimum=87457.60
instructions_per_op:
mean= 65695.66 standard-deviation=58.43
median= 65695.93 median-absolute-deviation=37.67
maximum=65947.76 minimum=65547.05
cpu_cycles_per_op:
mean= 34300.20 standard-deviation=704.66
median= 34143.92 median-absolute-deviation=321.72
maximum=38203.68 minimum=33427.46
⏱ real=5:03.22 user=4:31.56 sys=29.164 cpu=99%
```
./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache true 2>/dev/null
```
throughput:
mean= 223495.91 standard-deviation=6134.95
median= 224825.90 median-absolute-deviation=3302.09
maximum=234859.90 minimum=193209.69
instructions_per_op:
mean= 35981.41 standard-deviation=3.16
median= 35981.13 median-absolute-deviation=2.12
maximum=35991.46 minimum=35972.55
cpu_cycles_per_op:
mean= 17482.26 standard-deviation=281.82
median= 17424.08 median-absolute-deviation=143.91
maximum=19120.68 minimum=16937.43
⏱ real=5:01.23 user=4:58.54 sys=0.136 cpu=99%
```
</details>
Fixes: #24567
This PR is a continuation of #24738 [transport: remove throwing protocol_exception on connection start](https://github.com/scylladb/scylladb/pull/24738). This PR does not solve a burning issue, but is rather an improvement in the same direction. As it is just an enhancement, it should not be backported.
Closesscylladb/scylladb#25408
* github.com:scylladb/scylladb:
test/cqlpy: add protocol exception tests
test/cqlpy: `test_protocol_exceptions.py` refactor message frame building
test/cqlpy: `test_protocol_exceptions.py` refactor duplicate code
transport: replace `make_frame` throw with return result
cql3: remove throwing `protocol_exception`
transport: replace throw in validate_utf8 with result_with_exception_ptr return
transport: replace throwing protocol_exception with returns
utils: add result_with_exception_ptr
test/cqlpy: add unknown compression algorithm test case
This is yet another part in the BTI index project.
Overarching issue: https://github.com/scylladb/scylladb/issues/19191
Previous part: https://github.com/scylladb/scylladb/pull/25506/
Next part: plugging the BTI index readers and writers into sstable readers and writers.
The new code added in this PR isn't used outside of tests yet, but it's posted as a separate PR for reviewability.
This series implements, on top of the key translation logic, and abstract trie writing and traversal logic, a writer and a reader of sstable index files (which map primary keys to positions in Data.db), as described in f16fb6765b/src/java/org/apache/cassandra/io/sstable/format/bti/BtiFormat.md.
Caveats:
1. I think the added test has reasonable coverage, but that depends on running it multiple times. (Though it shouldn't need more than a few runs to catch any bug it covers). It's somewhat awkward as a test meant for running in CI, it's better as something you run many times after a relevant change.
2. These readers and writers are intended to be compatible with Cassandra, but I did *NOT* do any compatibility testing. The writers and readers added here have only been tested against each other, not against Cassandra's readers and writers.
3. This didn't undergo any proper benchmarking and optimization work. I was doing some measurements in the past, but everything was rewritten so much since then that the my old measurements are effectively invalidated. Frankly I have no idea what the performance of all this branchy-branchy logic is now.
No backports needed, new functionality.
Closesscylladb/scylladb#25626
* github.com:scylladb/scylladb:
test/manual: add bti_cassandra_compatibility_test
test/lib/random_schema: add some constraints for generated uuid and time/date values
test/lib/random_utils: add a variant of get_bytes which takes an `engine&`
test/boost: add bti_index_test
sstables/writer: add an accessor for the current write position in Data.db
sstables/trie: introduce bti_index_reader
sstables/trie: add bti_partition_index_writer.cc
sstables/trie: add bti_row_index_writer.cc
utils/bit_cast: add a new overload of write_unaligned()
sstables/trie: add trie_writer::add_partial()
sstables/consumer: add read_56()
sstables/trie: make bti_node_reader::page_ptr copy-constructible
sstables: extract abstract_index_reader from index_reader.hh to its own header
sstables/trie: add an accessor to the file_writer under bti_node_sink
sstables/types: make `deletion_time::operator tombstone()` const
sstables/types: add sstables::deletion_time::make_live()
sstables/trie: fix a special case in max_offset_from_child
sstables/trie: handle `partition_region`s other than `clustered` in BTI position encoding
sstables/trie: rewrite lcb_mismatch to handle fragment invalidation
test/boost/bti_key_translation_test: fix a compilation error hidden behind `if constexpr`
In S3 client both read and write metrics have three counters -- number
of requests made, number of bytes processed and request latency. In most
of the cases all three counters are updated at once -- upon response
arrival.
However, in case of chunked download source this way of accounting
metrics is misleading. In this code the request is made once, and then
the obtained bytes are consumed eventually as the data arrive.
Currently, each time a new portion of data is read from the socket the
number of read requests is incremented. That's wrong, the request is
made once, and this counter should also be incremented once, not for
every data buffer that arrived in response.
Same for read request latency -- it's "added" for every data buffer that
arrives, but it's a lenghy process, the _request_ latency should be
accounted once per responce. Maybe later we'll want to have "data
latency" metrics as well, but for what we have now it's request latency.
The number of read bytes is accounted properly, so not touched here.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#25770
Does the same thing as the existing overload, but this one
takes `std::byte*` instead of `void*`, and it additionally
returns the pointer to the end position.
C++ data movement algorithms (std::uninitialized_copy()) and friends
and the containers that use them optimize for trivially copyable
and destructible types by calling memcpy instead of using a loop
around constructors/destructors. Make intervals of trivially
copyable and destructible types also trivially copyable and
destructible by specializing interval_data<T> not to have
user-defined special member functions. This requires that T have
a default constructor since we can't skip construction when
!_start_exists or !_end_exists.
To choose whether we specialize or not, we look at default
constructiblity (see above) and trivial destructibility. This is
wider than trivial copyablity (a user-defined copy constructor
can exist) but is still beneficial, since the generated copy
constructor for interval_data<T> will be branch-free.
We don't implement the poison words in debug mode; nor are they
necessary, since we no don't manage the lifetime of _start_value
and _end_value manually any more but let the compiler do that for us.
Note [1] prevents full conversion to memcpy for now, but we still
get branch free code.
[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121789
Prepare for specialized handling of trivial types by extracting
the data members of wrapping_internal<T> and the special member
functions (constructors/destructors/assignment) into a new
interval_data<T> template.
To avoid having to refer to data member with a this-> prefix,
add using declarations in wrapping_interval<T>.
Adds infrastructure and client for interaction with GCP object storage services.
Note: this is just a client object usable for creating, listing, deleting and up/downloading of objects to/from said storage service. It makes no attempt at actually inserting it into the sstable storage flow. That can come later.
This PR breaks out GCP auth and some general REST call functionality into shared routines. Not all code is 100% reused, but at least some.
Test is added, though could be more comprehensive (feel free to suggest a test vector).
Test can run in either local mock server mode (default), or against actual GCP.
See `test/boost/gcp_object_storage_test.cc` for explanation on the config environment vars.
Default is to run the test against a temporary docker deamon.
Closesscylladb/scylladb#24629
* github.com:scylladb/scylladb:
test::boost::gcp_object_storage_test: Initial unit tests for GCP obj storage
proc-utils: Re-export waiting types from seastar
proc-utils: Inherit environment from current process
utils::gcp::object_storage: Add client for GCP object storage
utils::http: Add optional external credentials to dns_connection_factory init
utils::rest: Break out request wrapper and send logic
encryption::gcp_host: Use shared gcp credentials + REST helpers
utils::gcp: Move/add gcp credentials management to shared file
utils::rest::client: Add formatter for seastar::http::reply
utils::rest::client: Add helper routines for simple REST calls
utils::http: Make shared system trust certificates public
Also allow creating the object using an endpoint expression.
Note: this moves code to the .cc file, because it introduces a few
more lines, and I feel we have to much stuff in headers as is.
Allows sharing some of the wrapping and logic outside the
single-call object/routine paths, using it also with an external
seastar::http::client, i.e. caching resources across several calls.