scylladb

Author	SHA1	Message	Date
Avi Kivity	bd08b6e5b2	Merge 'Unify configuration of object storage endpoints (take 2)' from Pavel Emelyanov To configure S3 storage, one needs to do ``` object_storage_endpoints: - name: s3.us-east-1.amazonaws.com port: 443 https: true aws_region: us-east-1 ``` and for GCS it's ``` object_storage_endpoints: - name: https://storage.googleapis.com:433 type: gs credentials_file: <gcp account credentials json file> ``` This PR updates the S3 part to look like ``` object_storage_endpoints: - name: https://s3.us-east-1.amazonaws.com:443 aws_region: us-east-1 ``` fixes: #26570 This is the 2nd attempt; the previous one (#27360) was reverted because it always reported endpoint configs in the new format via API and CQL, even if the endpoint was configured in the old way. This "broke" scylla manager and some dtests. This version has this bug fixed, and endpoints are reported in the same format as they were configured with. About correctness of the changes. No modifications to existing tests are made here, so the old format is respected correctly (as far as it's covered by tests). To prove the new format works, the test_get_object_store_endpoints is extended to validate both options. Some preparations to this test to make this happen come on their own with the PR #28111 to show that they are valid and pass before changing the core code. Enhancing the way configuration is made, likely no need to backport. Closes scylladb/scylladb#28112 * github.com:scylladb/scylladb: test: Validate S3 endpoints new format works docs: Update docs according to new endpoints config option format object_storage: Create s3 client with "extended" endpoint name s3/storage: Tune config updating sstable: Shuffle args for s3_client_wrapper test: Rename badconf variable into objconf test: Split the object_store/test_get_object_store_endpoints test	2026-01-14 18:29:03 +02:00
Pavel Emelyanov	e57ee84662	util: Re-use seastar::util::memory_data_sink A data_sink that stores buffers into an in-memory collection had appeared in seastar recently. In Scylla there's a similar thing that uses memory_data_sink_buffer as a container, so it's possible to drop the data_sink_impl itself in favor of the seastar implementation. For that to work there should be an append_buffers() overload for the aforementioned container. For its nice implementation the container, in turn, needs to get a push_back() method and a value_type trait. The method already exists, but is called put(), so just rename it. There's one more user of this method in the S3 client, and it can enjoy the added append_buffers() helper. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#28124	2026-01-14 08:54:00 +02:00
Avi Kivity	c6dfae5661	treewide: #include Seastar headers with angle brackets Seastar is an external library from the point of view of ScyllaDB, so should be included with angle brackets. Closes scylladb/scylladb#27947	2026-01-13 14:56:15 +02:00
Pavel Emelyanov	f227de24b2	object_storage: Create s3 client with "extended" endpoint name For this, add the s3::client::make(endpoint, ...) overload that accepts endpoint in proto://host:port format. Then it parses the provided url and calls the legacy one, that accepts raw host string and config with port, https bit, etc. The generic object_storage_endpoint_param no longer needs to carry the internal s3::endpoint_config, the config option parsing changes respectively. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-01-13 13:24:06 +03:00
Pavel Emelyanov	8f97e6b3de	s3/storage: Tune config updating Don't prepare s3::endpoint_config from generic code, just pass the region and iam_role_arn (those that can potentially change) to the callback. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-01-13 13:24:06 +03:00
Botond Dénes	af6cb0d0a4	Merge 'raft topology: preserve IP -> ID mapping of a replacing node on restart' from Patryk Jędrzejczak We currently do it only for a bootstrapping node, which is a bug. The missing IP can cause an internal error, for example, in the following scenario: - replace fails during streaming, - all live nodes are shut down before the rollback of replace completes, - all live nodes are restarted, - live nodes start hitting internal error in all operations that require IP of the replacing node (like client requests or REST API requests coming from nodetool). We fix the bug here, but we do it separately for replace with different IP and replace with the same IP. For replace with different IP, we persist the IP -> host ID mapping in `system.peers` just like for bootstrap. That's necessary, since there is no other way to determine IP of the replacing node on restart. For replace with the same IP, we can't do the same. This would require deleting the row corresponding to the node being replaced from `system.peers`. That's fine in theory, as that node is permanently banned, so its IP shouldn't be needed. Unfortunately, we have many places in the code where we assume that IP of a topology member is always present in the address map or that a topology member is always present in the gossiper endpoint set. Examples of such places: - nodetool operations, - REST API endpoints, - `db::hints::manager::store_hint`, - `group0_voter_handler::update_nodes`. We could fix all those places and verify that drivers work properly when they see a node in the token metadata, but not in `system.peers`. However, that would be too risky to backport. We take a different approach. We recover IP of the replacing node on restart based on the state of the topology state machine and `system.peers` just after loading `system.peers`. We rely on the fact that group 0 is set up at this point. 
The only case where this assumption is incorrect is a restart in the Raft-based recovery procedure. However, hitting this problem then seems improbable, and even if it happens, we can restart the node again after ensuring that no client and REST API requests come before replace is rolled back on the new topology coordinator. Hence, it's not worth to complicate the fix (by e.g. looking at the persistent topology state instead of the in-memory state machine). Fixes #28057 Backport this PR to all branches as it fixes a problematic bug. Closes scylladb/scylladb#27435 * github.com:scylladb/scylladb: gossiper: add_saved_endpoint: make generations of excluded nodes negative test: introduce test_full_shutdown_during_replace utils: error_injection: allow aborting wait_for_message raft topology: preserve IP -> ID mapping of a replacing node on restart	2026-01-09 14:56:16 +02:00
Avi Kivity	0df85c8ae8	Revert "Merge 'Unify configuration of object storage endpoints' from Pavel Emelyanov" This reverts commit `1bb897c7ca`, reversing changes made to `954f2cbd2f`. It makes incompatible changes to the object storage configuration format, breaking tests [1]. It's likely that it doesn't break any production configuration, but we can't be sure. Fixes #27966 Closes scylladb/scylladb#27969	2026-01-05 08:53:41 +02:00
Benny Halevy	8d00266f88	directory_lister: add ctor with opened directory This ctor allows the caller to open the directory first, on its own, and pass it down to the directory_lister. Once all callers use this ctor we can get rid of the delayed open in the get() method. Also, it can be used to replace full-path based file_stat calls on listed entries with file_stat(directory, name) calls that are based on statat() and a relative path name that is present in the listed directory entry. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-01-04 11:05:18 +02:00
Benny Halevy	3e9b071838	Update seastar submodule * seastar f0298e40...4dcd4df5 (29): > file: provide a default implementation for file_impl::statat > util: Genralize memory_data_sink > defer: Replace static_assert() with concept > treewide: drop the support of fmtlib < 9.0.0 > test: Improve resilience of netsed scheduling fairness test > Merge 'file: Use query_device_alignment_info in blkdev_alignments ' from Kefu Chai file: Put alignment helpers in anonymous namespace file: Use query_device_alignment_info in blkdev_alignments > Merge 'file: Query physical block size and minimum I/O size' from Kefu Chai file: Apply physical_block_size override to filesystem files file: Use designated initializers in xfs_alignments iotune: Add physical block size detection disk_params: Add support for physical_block_size overrides from io_properties.yaml block_device: Query alignment requirements separately for memory and I/O > Merge 'json: formatter: fix formatting of std:string_view' from Benny Halevy json: formatter: fix formatting of std:string_view json: formatter: make sure std::string_view conforms to is_string_like Fixes #27887 > demos:improve the output of demo_with_io_intent() in file_demo > test: Add accept() vs accept_abort() socket test > file: Refine posix_file_impl alignments initialization > Add file::statat and a corresponding file_stat overload > cmake: don't compile memcached app for API < 9 > Merge 'Revert to ~old lifetime semantics for lvalues passed to then()-alikes' from Travis Downs future: adjust lifetime for lvalue continuations future: fix value class operator() > pollable_fd: Unfriend everything > Merge 'file: experimental_list_directory: use buffered generator' from Benny Halevy file: experimental_list_directory: use buffered generator file: define list_directory_generator_type > Merge 'Make datagram API use temporary_buffer<>-s' from Pavel Emelyanov net: Deprecate datagram::get_data() returning packet memcache: Fix indentation after previous patch 
memcache: Use new datagram::get_buffers() API dns: Use new datagram::get_buffers() API tests: Use new datagram::get_buffers() API demo: Use new datagram::get_buffers() API udp: Make datagram implementations return span of temporary_buffer-s > Merge 'Remove callback from timer_set::complete()' from Pavel Emelyanov reactor: Fix indentation after previous patch timers: Remove enabling callback from timer_set::complete() > treewide: avoid 'static sstring' in favor of 'constexpr string_view' > resource: Hide hwloc from public interface > Merge 'Fix handle_exception_type for lvalues' from Travis Downs futures_test: compile-time tests function_traits: handle reference_wrapper > posix_data_sink_impl: Assert to guard put UB > treewide: fix build with `SEASTAR_SSTRING` undefined > avoid deprecation warnings for json_exception > `util/variant_utils`: correct type deduction for `seastar::visit` > net/dns: fixed socket concurrent access > treewide: add missing headers > Merge 'Remove posix file helper file_read_state class' from Pavel Emelyanov file: Remove file_read_state test: Add a test for posix_file_impl::do_dma_read_bulk() > membarrier: simplify locking Adjust scylla to the following changes in scylla: - file_stat became polymorphic - needs explicit inference in table::snapshot_exists, table::get_snapshot_details - file::experimental_list_directory now returns list_directory_generator_type Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#27916	2025-12-30 19:37:13 +03:00
Patryk Jędrzejczak	4526dd93b1	utils: error_injection: allow aborting wait_for_message The test added in the following commit utilizes it.	2025-12-29 19:13:55 +01:00
Radosław Cybulski	dfa600fb8f	Add simple_value_with_expiry util class Add a `simple_value_with_expiry` utility class, which functions like a `std::optional` with an added timeout. When emplacing a value, the user needs to provide a timeout, after which the value expires (in which case the `simple_value_with_expiry` object behaves as if it was never set at all). Add boost tests for the new class.	2025-12-29 08:32:52 +01:00
Pavel Emelyanov	2e33234e91	util: Remove lister::rmdir() There's a seastar helper that does the same, so there is no need to carry yet another implementation. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#27851	2025-12-28 19:46:19 +02:00
Pavel Emelyanov	e963a8d603	checked-file: Implement experimental_list_directory() The method in question returns coroutine generator that co_yields directory_entry-s. In case the method is not implemented, seastar creates a fallback generator, that calls existing subscription-based list_directory() and co_yields them. And since checked file doesn't yet have it, fallback generator is used, thus skipping the lower file yielding lister. Not nice. This patch implements the generator lister for checked file, thus making full use of lower file generator lister too. A side note. It's not enough to implement it like return do_io_check([] { return lower_file->experimental_list_directory(); }); like list_directory() does, since io-checking will _not_ happen on directory reading itself, as it's supposed to. This is the problem of the check_file::list_directory() implementation -- it only checks for exception when creating the subscription (and it really never happens), but reading the directory itself happens without io checks. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#27850	2025-12-28 13:37:44 +02:00
Botond Dénes	1bb897c7ca	Merge 'Unify configuration of object storage endpoints' from Pavel Emelyanov To configure S3 storage, one needs to do ``` object_storage_endpoints: - name: s3.us-east-1.amazonaws.com port: 443 https: true aws_region: us-east-1 ``` and for GCS it's ``` object_storage_endpoints: - name: https://storage.googleapis.com:433 type: gs credentials_file: <gcp account credentials json file> ``` This PR updates the S3 part to look like ``` object_storage_endpoints: - name: https://s3.us-east-1.amazonaws.com:443 aws_region: us-east-1 ``` fixes: #26570 Not-yet released feature, no need to backport. Old configs are not accepted any longer. If it's needed, then this decision needs to be revised. Closes scylladb/scylladb#27360 * github.com:scylladb/scylladb: object_storage: Temporarily handle pure endpoint addresses as endpoints code: Remove dangling mentions of s3::endpoint_config docs: Update docs according to new endpoints config option format object_storage: Create s3 client with "extended" endpoint name test: Add named constants for test_get_object_store_endpoints endpoint names s3/storage: Tune config updating sstable: Shuffle args for s3_client_wrapper	2025-12-24 06:59:02 +02:00
copilot-swe-agent[bot]	288d4b49e9	Skip backtrace in lsa-timing logs for preemptible reclaim Preemptible reclaim is only done from the background reclaimer, so backtrace is not useful. It's also normal that it takes a long time. Skip the backtrace when reclaim is preemptible to reduce log noise. Fixes the issue where background reclaim was printing unnecessary backtraces in lsa-timing logs when operations took longer than the stall detection threshold. Closes: #27692 Co-authored-by: tgrabiec <283695+tgrabiec@users.noreply.github.com>	2025-12-22 20:02:40 +02:00
Michael Litvak	55f4a2b754	migration_listener: fix deadlock in nested notifications When calling a migration notification from the context of a notification callback, this could lead to a deadlock with unregistering a listener: A: the parent notification is called. it calls thread_for_each, where it acquires a read lock on the vector of listeners, and calls the callback function for each listener while holding the lock. B: a listener is unregistered. it calls `remove` and tries to acquire a write lock on the vector of listeners. it waits because the lock is held. A: the callback function calls another notification and calls thread_for_each which tries to acquire the read lock again. but it waits since there is a waiter. Currently we have such concrete scenario when creating a table, where the callback of `before_create_column_family` in the tablet allocator calls `before_allocate_tablet_map`, and this could deadlock with node shutdown where we unregister listeners. Fix this by not acquiring the read lock again in the nested notification. There is no need because the read lock is already held by the parent notification while the child notification is running. We add a function `thread_for_each_nested` that is similar to `thread_for_each` except it assumes the read lock is already held and doesn't acquire it, and it should be used for nested notifications instead of `thread_for_each`. Fixes scylladb/scylladb#27364 Closes scylladb/scylladb#27637	2025-12-17 14:00:28 +01:00
Botond Dénes	74347625f9	Merge 'test/alternator: add reproducers for more issues' from Nadav Har'El This series adds xfailing reproducers for two issues: #8070 and #27037: 27037 is about the case where, even with alternator_streams_increased_compatibility set to true, if an attribute is set to the same value it had but using a different JSON representation, an Alternator Streams event is unduly produced. 8070 is about the ability to write malformed values into the database and then fail during read, instead of failing, as expected, during the write. This issue was known for years, but we never really had a reproducer for it: it's not possible to reproduce it using clean boto3 code and we need to build a request manually. The first two patches are two small cleanups (including fixes #27372) that I did while preparing the real tests, which are in the final two patches. Closes scylladb/scylladb#27376 * github.com:scylladb/scylladb: test/alternator: add reproducer for bug with storing invalid values test/alternator: reproducer for issue 27375 utils/rjson: fix error messages from rjson::parse() test/alternator: extract get_signed_request() to util.py	2025-12-16 06:53:14 +02:00
Patryk Jędrzejczak	844545bb74	Merge 'treewide: fix cases of improper re-throwing of `std::exception_ptr`' from Emil Maskovsky Fix multiple cases where the captured `std::exception_ptr` has been re-thrown via simple `throw eptr;`, which results in losing the original exception type and details. Resolved at various places found by clang-tidy: 1. db::schema_applier When applying schema changes, the previous implementation attempted to handle exceptions by catching and rethrowing them, but did so incorrectly: using `throw ex` with a `std::exception_ptr` loses the original exception type and details. However, in this case, explicit exception handling is unnecessary. The only reason for catching was to ensure `ap.destroy()` is called before propagating the exception. This can be more cleanly and safely achieved using Seastar's `.finally()` continuation, which guarantees cleanup regardless of success or failure. 2. directories The `std::exception_ptr()` has been captured for logging and then again re-thrown incorrectly via `throw ex;`. We could use `std::rethrow_exception()` here instead, but it seems to be simpler to just use regular `throw;` to rethrow the original exception, and only use the `std::current_exception()` for logging (which is a pattern used in other places as well). 3. storage_service Here the exception has been re-thrown incorrectly in a coroutine. There it is best to use the `co_await coroutine::return_exception_ptr` to propagate exception more efficiently in a coroutine-friendly manner. Fixes: SCYLLADB-94 Refs: scylladb/scylladb#27501 No backport: This fixes an error logging issue, that isn't a production problem by itself (only found in test), therefore not backporting to older branches. 
Closes scylladb/scylladb#27613 * https://github.com/scylladb/scylladb: db: schema_applier: improve exception-safe cleanup directories: fix exception rethrowing storage_service: use coroutine-friendly exception propagation in join_node_response_handler	2025-12-15 13:56:45 +01:00
Pavel Emelyanov	3f7ee3ce5d	Merge 'batchlog: make replay (flush) faster' from Botond Dénes The batchlog table contains an entry for each logged batch that is processed by the local node as coordinator. These entries are typically very short lived, they are inserted when the batch is processed and deleted immediately after the batch is successfully applied. When a table has `tombstone_gc = {'mode': 'repair'}` enabled, every repair has to flush all hints and batchlogs, so that we can be certain that there is no live data in any of these, older than the last repair. Since batches can contain member queries from any number of tables, the whole batchlog has to be flushed, even if repair-mode tombstone-gc is enabled for a single table. Flushing the batchlog table happens by doing a batchlog replay. This involves reading the entire content of this table, and attempting to replay+delete any live entries (that are old enough to be replayed). Under normal operating circumstances, 99%+ of the content of the batchlog table is partition tombstones. Because of this, scanning the content of this table has to process thousands to millions of tombstones. This was observed to require up to 20 minutes to finish, causing repairs to slow down to a crawl, as the batchlog-flush has to be repeated at the end of the repair of each token-range. When trying to address this problem, the first idea was that we should expedite the garbage-collection of these accumulated tombstones. This experiment failed, see https://github.com/scylladb/scylladb/pull/23752. The commitlog proved to be a barrier that is impossible to bypass, preventing quick garbage-collection of tombstones. So long as a single commit-log segment is alive, holding content from the batchlog table, all tombstones written after are blocked from GC. The second approach, represented by this PR, is to not rely on tombstone GC to reduce the tombstone amount. 
Instead restructure the table such that a single higher-order tombstone can be used to shadow and allow for the eviction of the myriads of individual batchlog entry tombstones. This is realized by reorganizing the batchlog table such that individual batches are rows, not partitions. This new schema is introduced by the new `system.batchlog_v2` table, introduced by this PR: CREATE TABLE system.batchlog_v2 ( version int, stage int, shard int, written_at timestamp, id uuid, data blob, PRIMARY KEY ((version, stage, shard), written_at, id)); The new schema organization has the following goals: 1) Make post-replay batchlog cleanup possible with a simple range-tombstone. This allows dropping the individual dead batchlog entries, as they are shadowed by a higher level tombstone. This enables dropping tombstones without tombstone GC. 2) To make the above possible, introduce the stage key component: batchlog entries that fail the first replay attempt, are moved to the failed_replay stage, so the initial stage can be cleaned up safely. 3) Spread out the data among Scylla shards, via the batchlog shard column. 4) Make batchlog entries ordered by the batchlog create time (id). This allows for selecting batchlogs to replay, without post-filtering of batchlogs that are too young to be replayed. Fixes: https://github.com/scylladb/scylladb/issues/23358 This is an improvement, normally not a backport-candidate. We might override this and backport to allow wider use of `tombstone_gc: {'mode': 'repair'}`. 
Closes scylladb/scylladb#26671 * github.com:scylladb/scylladb: db/config: change batchlog_replay_cleanup_after_replays default to 1 test/boost/batchlog_manager_test: add test for batchlog cleanup replica/mutation_dump: always set position weight for clustering positions service/storage_proxy: s/batch_replay_throw/storage_proxy_fail_replay_batch/ test/lib: introduce error_injection.hh utils/error_injection: add debug log to disable() and disable_all() test/lib/cql_test_env: forward config to batchlog test/lib/cql_test_env: add batch type to execute_batch() test/lib/cql_assertions: add with_size(predicate) overload test/lib/cql_assertions: add source location to fail messages test/lib/cql_assertions: columns_assertions: add assert_for_columns_of_each_row() test/lib/cql_assertions: rows_assertions::assert_for_columns_of_row(): add index bound check test/lib/cql_assertions: columns_assertions: add T* with_typed_column() overload db/batchlog_manager: config: s/write_timeout/reply_timeot/ db,service: switch to system.batchlog_v2 db/system_keyspace: introduce system.batchlog_v2 service,db: extract generation of batchlog delete mutation service,db: extract get_batchlog_mutation_for() from storage-proxy db/batchlog_manager: only consider propagation delay with tombstone-gc=repair db/batchlog_manager: don't drop entire batch if one mutations' table was dropped data_dictionary: table: add get_truncation_time() db/batchlog_manager: batch(): replace map_reduce() with simple loop db/batchlog_manager: finish coroutinizing replay_all_failed_batches db/batchlog_manager: improve replayAllFailedBatches logs	2025-12-15 15:05:19 +03:00
Nadav Har'El	c06e63daed	Merge 'auth: start using SHA 512 hashing originated from musl with added yielding' from Andrzej Jackowski This patch series contains the following changes: - Incorporation of `crypt_sha512.c` from musl to our codebase - Conversion of `crypt_sha512.c` to C++ and coroutinization - Coroutinization of `auth::passwords::check` - Enabling use of `__crypt_sha512` originated from `crypt_sha512.c` for computing SHA 512 passwords of length <=255 - Addition of yielding in the aforementioned hashing implementation. The alien thread was a solution for reactor stalls caused by indivisible password‑hashing tasks (https://github.com/scylladb/scylladb/issues/24524). However, because there is only one alien thread, overall hashing throughput was reduced (see, e.g., https://github.com/scylladb/scylla-enterprise/issues/5711). To address this, the alien‑thread solution is reverted, and a hashing implementation with yielding is introduced in this patch series. Before this patch series, ScyllaDB used SHA-512 hashing provided by the `crypt_r` function, which in our case meant using the implementation from the `libxcrypt` library. Adding yielding to this `libxcrypt` implementation is problematic, both due to licensing (LGPL) and because the implementation is split into many functions across multiple files. In contrast, the SHA-512 implementation from `musl libc` has a more permissive license and is concise, which makes it easier to incorporate into the ScyllaDB codebase. The performance of this solution was compared with the previous implementation that used one alien thread and the implementation after the alien thread was reverted. 
The results (median) of `perf-cql-raw` with `--connection-per-request 1 --smp 10` parameters are as follows: - Alien thread: 41.5 new connections/s per shard - Reverted alien thread: 244.1 new connections/s per shard - This commit (yielding in hashing): 198.4 new connections/s per shard The roughly 20% performance deterioration compared to the old implementation without the alien thread comes from the fact that the new hashing algorithm implemented in `utils/crypt_sha512.cc` performs an expensive self-verification and stack cleanup. On the other hand, with smp=10 the current implementation achieves roughly 5x higher throughput than the alien thread. In addition, due to yielding added in this commit, the algorithm is expected to provide similar protection from stalls as the alien thread did. In a test that in parallel started a cassandra-stress workload and created thousands of new connections using python-driver, the values of `scylla_reactor_stalls_count` metric were as follows: - Alien thread: 109 stalls/shard total - Reverted alien thread: 13186 stalls/shard total - This commit (yielding in hashing): 149 stalls/shard total Similarly, the `scylla_scheduler_time_spent_on_task_quota_violations_ms` values were: - Alien thread: 1087 ms/shard total - Reverted alien thread: 72839 ms/shard total - This commit (yielding in hashing): 1623 ms/shard total To summarize, yielding during hashing computations achieves similar throughput to the old solution without the alien thread but also prevents stalls similarly to the alien thread. Fixes: scylladb/scylladb#26859 Refs: scylladb/scylla-enterprise#5711 No automatic backport. After this PR is completed, the alien thread should be rather reverted from older branches (2025.2-2025.4 because on 2025.1 it's already removed). Backporting of the other commits needs further discussion. 
Closes scylladb/scylladb#26860 * github.com:scylladb/scylladb: test/boost: add too_long_password to auth_passwords_test test/boost: add same_hashes_as_crypt_r to auth_passwords_test auth: utils: add yielding to crypt_sha512 auth: change return type of passwords::check to future auth: remove code duplication in verify_scheme test/boost: coroutinize auth_passwords_test utils: coroutinize crypt_sha512 utils: make crypt_sha512.cc to compile utils: license: import crypt_sha512.c from musl to the project Revert "auth: move passwords::check call to alien thread"	2025-12-14 14:01:01 +02:00
Emil Maskovsky	e6f5f2537e	directories: fix exception rethrowing Fix location identified by clang-tidy where `std::exception_ptr` was incorrectly rethrown using `throw ep;`. The correct approach is to use `std::rethrow_exception(ep)`, which preserves the original exception type and stack trace. But this can be even further simplified by logging the current exception with `std::current_exception()` and rethrowing using `throw;` instead of capturing and rethrowing a `std::exception_ptr`. This matches the idiomatic pattern used elsewhere in the codebase and improves clarity. This change ensures proper exception propagation and avoids type slicing or loss of diagnostic information.	2025-12-12 18:10:20 +01:00
Nadav Har'El	0c64e3be9a	Merge 'Unify and fix rjson string and string_view conversions' from Marcin Maliszkiewicz This patch-set consolidates and corrects rjson string conversion handling. It removes unnecessary string copies, ensures proper length usage and replaces ad-hoc conversions with consistent helper functions. Overall, the changes make rjson string handling safer, faster, and more uniform across the codebase. Backport: no, it's a refactor Closes scylladb/scylladb#27394 * github.com:scylladb/scylladb: fix rjson::value to bytes conversion with missing GetStringLength call alternator: change type from string to string_view in should_add_capacity fix rjson::value to string_view conversion with missing GetStringLength call use rjson::to_string_view when rjson::value gets converted using GetStringLength use rjson::to_sstring and rjson::to_string for various string conversions utils: use rjson document wrapper in instance_profile_credentials_provider::parse_creds utils: move rjson::to_string_view func to string related place utils: add to_sstring and to_string rjson helper	2025-12-11 12:05:41 +02:00
Nadav Har'El	3595941020	utils/rjson: fix error messages from rjson::parse() rjson::parse() when parsing JSON stored in a chunked_content (a vector of temporary buffers) failed to initialize its byte counter to 0, resulting in garbage positions in error messages like: Parsing JSON failed: Missing a name for object member. at 1452254 These error messages were most noticeable in Alternator, which parses JSON requests using a chunked_content, and reports these errors back to the user. The fix is trivial: add the missing initialization of the counter. The patch also adds a regression test for this bug - it sends a JSON corrupt at position 1, and expects to see "at 1" and not some large random number. Fixes #27372 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-12-11 11:17:01 +02:00
Benny Halevy	5f13880a91	utils: error_injection: wait_for_message: print injection_name and caller source_location on timeout When waiting for the condition variable times out we call on_internal_error, but unfortunately, the backtrace it generates is obfuscated by `coroutine_handle<seastar::internal::coroutine_traits_base<void>::promise_type>::resume`. To make the log more useful, print the error injection name and the caller's source_location in the timeout error message. Fixes #27531 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#27532	2025-12-10 23:25:54 +01:00
Andrzej Jackowski	98f431dd81	auth: utils: add yielding to crypt_sha512 This change allows yielding during hashing computations to prevent stalls. The performance of this solution was compared with the previous implementation that used one alien thread and the implementation after the alien thread was reverted. The results (median) of `perf-cql-raw` with `--connection-per-request 1 --smp 10` parameters are as follows: - Alien thread: 41.5 new connections/s per shard - Reverted alien thread: 244.1 new connections/s per shard - This commit (yielding in hashing): 198.4 new connections/s per shard The alien thread is limited by a single-core hashing throughput, which is roughly 400-500 hashes/s in the test environment. Therefore, with smp=10, the throughput is below 50 hashes/s, and the difference between the alien thread and other solutions further increases with higher smp. The roughly 20% performance deterioration compared to the old implementation without the alien thread comes from the fact that the new hashing algorithm implemented in `utils/crypt_sha512.cc` performs an expensive self-verification and stack cleanup. On the other hand, with smp=10 the current implementation achieves roughly 5x higher throughput than the alien thread. In addition, due to yielding added in this commit, the algorithm is expected to provide similar protection from stalls as the alien thread did. 
In a test that in parallel started a cassandra-stress workload and created thousands of new connections using python-driver, the values of `scylla_reactor_stalls_count` metric were as follows: - Alien thread: 109 stalls/shard total - Reverted alien thread: 13186 stalls/shard total - This commit (yielding in hashing): 149 stalls/shard total Similarly, the `scylla_scheduler_time_spent_on_task_quota_violations_ms` values were: - Alien thread: 1087 ms/shard total - Reverted alien thread: 72839 ms/shard total - This commit (yielding in hashing): 1623 ms/shard total To summarize, yielding during hashing computations achieves similar throughput to the old solution without the alien thread but also prevents stalls similarly to the alien thread. Fixes: scylladb/scylladb#26859 Refs: scylladb/scylla-enterprise#5711	2025-12-10 15:36:18 +01:00
Andrzej Jackowski	d7818b56df	utils: coroutinize crypt_sha512 Change `sha512crypt` and `__crypt_sha512` to coroutines to allow yielding during hash computations later in this patch series. Refs: scylladb/scylladb#26859	2025-12-10 15:36:18 +01:00
Andrzej Jackowski	033fed5734	utils: make crypt_sha512.cc compile The purpose of this change is to allow the usage of Seastar futures in crypt_sha512 later in this patch series. Refs: scylladb/scylladb#26859	2025-12-10 15:36:18 +01:00
Andrzej Jackowski	c6c30b7d0a	utils: license: import crypt_sha512.c from musl to the project This patch imports the `crypt_sha512.c` file from the musl library. We need it to incorporate yielding in the `crypt_r` function to avoid reactor stalls during long hashing computations. Before this patch series, ScyllaDB used SHA-512 hashing provided by the `crypt_r` function, which in our case meant using the implementation from the `libxcrypt` library. Adding yielding to this `libxcrypt` implementation is problematic, both due to licensing (LGPL) and because the implementation is split into many functions across multiple files. In contrast, the SHA-512 implementation from `musl libc` has a more permissive license and is concise, which makes it easier to incorporate into the ScyllaDB codebase. Both `crypt_sha512.c` and musl license are obtained from git.musl-libc.org: - https://git.musl-libc.org/cgit/musl/tree/src/crypt/crypt_sha512.c - https://git.musl-libc.org/cgit/musl/tree/COPYRIGHT Import commit: commit 1b76ff0767d01df72f692806ee5adee13c67ef88 Author: Alex Rønne Petersen <alex@alexrp.com> Date: Sun Oct 12 05:35:19 2025 +0200 s390x: shuffle register usage in __tls_get_offset to avoid r0 as address Refs: scylladb/scylladb#26859	2025-12-10 15:36:18 +01:00
Pavel Emelyanov	a3ca4fccef	object_storage: Create s3 client with "extended" endpoint name For this, add the s3::client::make(endpoint, ...) overload that accepts endpoint in proto://host:port format. Then it parses the provided url and calls the legacy one, that accepts raw host string and config with port, https bit, etc. The generic object_storage_endpoint_param no longer needs to carry the internal s3::endpoint_config, the config option parsing changes respectively. Tests, that generate the config files, and docs are updated. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-12-10 15:33:47 +03:00
Pavel Emelyanov	932b008107	s3/storage: Tune config updating Don't prepare s3::endpoint_config from generic code, just pass the region and iam_role_arn (those that can potentially change) to the callback. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-12-10 15:33:46 +03:00
Marcin Maliszkiewicz	060c2f7c0d	use rjson::to_string_view when rjson::value gets converted using GetStringLength This commit is only cosmetics, changes calls to GetStringLength into rjson::to_string_view with the same underlying implementation.	2025-12-09 19:27:21 +01:00
Marcin Maliszkiewicz	64149b57c3	use rjson::to_sstring and rjson::to_string for various string conversions In some cases we omit size checking, which is wrong because, according to the RapidJSON documentation, strings may contain a \0 byte in the middle.	2025-12-09 19:27:21 +01:00
Marcin Maliszkiewicz	4b004fcdfc	utils: use rjson document wrapper in instance_profile_credentials_provider::parse_creds So that we can use our common utility functions.	2025-12-09 19:27:21 +01:00
Marcin Maliszkiewicz	5e38b3071b	utils: move rjson::to_string_view func to string related place	2025-12-09 19:27:21 +01:00
Marcin Maliszkiewicz	225b3351fc	utils: add to_sstring and to_string rjson helper So that conversion code is common and it's easier to avoid accidental type conversions. Additionally, according to the RapidJSON library, size must be checked explicitly; this also avoids an extra iteration in char* to (s)string conversion.	2025-12-09 19:27:21 +01:00
Tomasz Grabiec	082342ecad	Attach names to allocating sections for better debuggability Large reserves in allocating_section can cause stalls. We already log reserve increase, but we don't know which table it belongs to: lsa - LSA allocation failure, increasing reserve in section 0x600009f94590 to 128 segments; Allocating sections used for updating row cache on memtable flush are notoriously problematic. Each table has its own row_cache, so its own allocating_section(s). If we attached table name to those sections, we could identify which table is causing problems. In some issues we suspected system.raft, but we can't be sure. This patch allows naming allocating_sections for the purpose of identifying them in such log messages. I use abstract_formatter for this purpose to avoid the cost of formatting strings on the hot path (e.g. index_reader). And also to avoid duplicating strings which are already stored elsewhere. Fixes #25799 Closes scylladb/scylladb#27470	2025-12-07 14:14:25 +02:00
Calle Wilund	4e289e8e6a	utils::http: Handle ipv6 numeric host part in URLs Fixes #27366 A URL with a numeric host part is formatted specially in the IPv6 case, to avoid confusion with the port part. The parser should handle this, e.g. http://[2001:db8:4006:812::200e]:8080 v2: * Include scheme-agnostic parse + case-insensitive scheme matching	2025-12-04 11:38:41 +00:00
Calle Wilund	4e7ec9333f	gcp::object_storage: Include auth in exponential back-off-retry Fixes #27268 Refs #27268 Includes the auth call in code covered by backoff-retry on server error, as well as moves the code to use the shared primitive for this and increase the resilience a bit (increase retry count). v2: * Don't do backoff if we need to refresh credentials. * Use abort source for backoff if avail v3: * Include other retryable conditions in auth check Closes scylladb/scylladb#27269	2025-12-02 15:08:49 +02:00
Botond Dénes	0a7df4b8ac	utils/error_injection: add debug log to disable() and disable_all() enable() and friends already have debug logs.	2025-12-02 14:21:26 +02:00
Ernest Zaslavsky	605f71d074	s3_client: handle additional transient network errors Add handling for a broader set of transient network-related `std::errc` values in `aws_error::from_system_error`. Treat these conditions as retryable when the client re-creates the socket for each request. Fixes: https://github.com/scylladb/scylladb/issues/27349 Closes scylladb/scylladb#27350	2025-12-02 11:44:40 +02:00
Avi Kivity	85db7b1caf	Merge 'address_map: Use more efficient and reliable replication method' from Tomasz Grabiec Primary issue with the old method is that each update is a separate cross-shard call, and all later updates queue behind it. If one of the shards has high latency for such calls, the queue may accumulate and system will appear unresponsive for mapping changes on non-zero shards. This happened in the field when one of the shards was overloaded with sstables and compaction work, which caused frequent stalls which delayed polling for ~100ms. A queue of 3k address updates accumulated, because we update mapping on each change of gossip states. This made bootstrap impossible because nodes couldn't learn about the IP mapping for the bootstrapping node and streaming failed. To protect against that, use a more efficient method of replication which requires a single cross-shard call to replicate all prior updates. It is also more reliable, if replication fails transiently for some reason, we don't give up and fail all later updates. Fixes #26865 Closes scylladb/scylladb#26941 * github.com:scylladb/scylladb: address_map: Use barrier() to wait for replication address_map: Use more efficient and reliable replication method utils: Introduce helper for replicated data structures	2025-11-23 19:15:12 +02:00
Radosław Cybulski	d589e68642	Add precompiled headers to CMakeLists.txt Add precompiled header support to CMakeLists.txt and configure.py - it improves compilation time by approximately 10%. New header `stdafx.hh` is added, don't include it manually - the compiler will include it for you. The header contains includes from external libraries used by Scylla - seastar, standard library, linux headers and zlib. The feature is enabled by default, use CMake option `Scylla_USE_PRECOMPILED_HEADER` or configure.py --disable-precompiled-header to disable. The feature should be disabled, when trying to check headers - otherwise you might get false negatives on missing includes from seastar / abseil and so on. Note: following configuration needs to be added to ccache.conf: sloppiness = pch_defines,time_macros,include_file_mtime,include_file_ctime Closes scylladb/scylladb#26617	2025-11-21 12:27:41 +02:00
Calle Wilund	03408b185e	utils::gcp::object_storage: Fix buffer alignment reordering trailing data Fixes #26874 Due to certain people (me) not being able to tell forward from backward, the data alignment to ensure partial uploads adhere to the 256k-align rule would potentially _reorder_ trailing buffers generated iff the source buffers input into the sink are small enough. Which, as a fun fact, they are in backup upload. Change the unit test to use raw sink IO and add two unit tests (of which the smaller size provokes the bug) that checks the same 64k buf segmented upload backup uses. Closes scylladb/scylladb#26938	2025-11-21 09:36:13 +02:00
Gleb Natapov	ad3cf2c174	utils: fix get_random_time_UUID_from_micros to generate correct time uuid According to the IETF spec, the uuid variant bits should be set to '10'. All others are either invalid or reserved. The patch changes the code to follow the spec. Closes scylladb/scylladb#27073	2025-11-20 10:27:29 +02:00
Benny Halevy	8ed36702ae	Update seastar submodule * seastar 63900e03...340e14a7 (19): > Merge 'rpc: harden sink_impl::close()' from Benny Halevy rpc: sink_impl::close: fixup indentation rpc: harden sink_impl::close() > http: Document the way "unread body bytes" accounting works > net: tighten port load balancing port access > coroutine: reimplement generator with buffered variant > Merge 'Stop using net::packet in posix data sink' from Pavel Emelyanov net/posix-stack: Don't use packet in posix_data_sink_impl reactor: Move fragment-vs-iovec static assertion reactor: Make backend::sendmsg() calls use std::span<iovec> utils: Introduce iovec_trim_front helper utils: Spannize iovec_len() > Merge 'Generalize memory data sink in tests' from Pavel Emelyanov test: Make output_stream_test splitting test case use own sink test: Make some output_stream_test cases use memory data sink test: Threadify one of output_stream_test test cases test: Make json_formatter_test use memory_data_sink test: Move memory_data_sink to its own header > dns: avoid using deprecated c-ares API > reactor: Move read_directory() to posix_file_impl > Merge 'rpc: sink_impl: batch sending and deletion of snd_buf:s' from Benny Halevy test: rpc_test: add test_rpc_stream_backpressure_across_shards reactor: add abort_on_too_long_task_queue option rpc: make sink flush and close noexcept rpc: sink_impl: batch sending and deletion of snd_buf:s rpc: move sink_impl and source_impl into internal namespace rpc: sink_impl: extend backpressure until snd_buf destroy > configure.py: fix --api-level help > Merge 'Close http client connection if handler doesn't consume all chunked-encoded body' from Pavel Emelyanov test: Fix indentation after previous patch test/http: Extend test for improper client handling of aborted requests test/http: Ignore EPIPE exception from server closing connection test/http: Split the partial response body read test http: Track "remaining bytes" for chunked_source_impl http: Switch 
content_length_source_impl to update remaining bytes > metrics: Add default ~config() > headers: Remove smp.hh from app-template.hh > prometheus: remove hostname and metric_help config > rpc: Tune up connection methods visibility > perf_tests: Fix build with fmt 12.0.0 by avoiding internal functions > doc: Fix some typos in codying style > reactor: Remove unused try_sleep() method directory_lister::get is adjusted in this patch to use the new experimental::coroutine::generator interface that was changed in scylladb/seastar@81f2dc9dd9 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#26913	2025-11-20 07:29:47 +03:00
Avi Kivity	0d68512b1f	stall_free: make variadic dispose_gently sequential Having variadic dispose_gently() clear inputs concurrently serves no purpose, since this is a CPU bound operation. It will just add more tasks for the reactor to process. Reduce disruption to other work by processing inputs sequentially. Closes scylladb/scylladb#26993	2025-11-20 07:16:16 +03:00
Tomasz Grabiec	ed8d127457	utils: Introduce helper for replicated data structures Key goals: - efficient (batching updates) - reliable (no lost updates) Will be used in data structures maintained on one designated owning shard and replicated to other shards.	2025-11-19 15:21:02 +01:00
Benny Halevy	a290505239	utils: stall_free: add dispose_gently dispose_gently consumes the object moved to it, clearing it gently before it's destroyed. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#26356	2025-11-11 12:20:18 +02:00
Calle Wilund	565c701226	utils::gcp::object_storage: Fix typo in semaphore init Fixes #26776 Semaphore storage is ssize_t, not size_t.	2025-11-05 10:22:22 +00:00
Pavel Emelyanov	ae0136792b	utils: Make directory_lister use generator lister from seastar The directory_lister uses utils::lister under the hood which accepts a callback to put directory_entry-s in. The directory_lister's callback then puts the entries into a queue and its .get() method pops entries from there to return to the caller. This patch simplifies this code by switching the directory_lister to use the experimental generator lister from seastar. With it, the entries to be returned from .get() are simply co_await-ed from calling the generator object (which co_yield-s them). As a result the directory_lister becomes smaller and drops the need for utils::lister. Since directory_lister was created as a replacement for that callback-based lister, the latter can eventually be removed. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#26586	2025-10-28 15:20:20 +02:00

2168 Commits