scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-31 20:16:43 +00:00

Author	SHA1	Message	Date
Benny Halevy	f06269dfca	utils: error_injection: wait_for_message: print injection_name and caller source_location on timeout When waiting for the condition variable times out we call on_internal_error, but unfortunately, the backtrace it generates is obfuscated by `coroutine_handle<seastar::internal::coroutine_traits_base<void>::promise_type>::resume`. To make the log more useful, print the error injection name and the caller's source_location in the timeout error message. Fixes #27531 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#27532 (cherry picked from commit `5f13880a91`) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#27581	2025-12-12 14:20:30 +01:00
Tomasz Grabiec	3cc75afbbe	utils: Introduce helper for replicated data structures Key goals: - efficient (batching updates) - reliable (no lost updates) Will be used in data structures maintained on one designated owning shard and replicated to other shards. (cherry picked from commit `ed8d127457`)	2025-12-05 13:25:14 +01:00
Nadav Har'El	9d27db5e98	utils: add "fatal" version of utils::on_internal_error() utils::on_internal_error() is a wrapper for Seastar's on_internal_error() which does not require a logger parameter - because it always uses one logger ("on_internal_error"). Not needing a unique logger is especially important when using on_internal_error() in a header file, where we can't define a logger. Seastar also has another similar function, on_fatal_internal_error(), for which we forgot to implement a "utils" version (without a logger parameter). This patch fixes that oversight. In the next patch, we need to use on_fatal_internal_error() in a header file, so the "utils" version will be useful. We will need the fatal version because we will encounter an unexpected situation during server destruction, and if we let the regular on_internal_error() just throw an exception, we'll be left in an undefined state. Signed-off-by: Nadav Har'El <nyh@scylladb.com> (cherry picked from commit `33476c7b06`)	2025-12-05 13:25:14 +01:00
Pavel Emelyanov	05c0f8ed03	lister: Fix race between readdir and stat Sometimes file::list_directory() returns entries without type set. In that case lister calls file_type() on the entry name to get it. If the call returns a disengaged type, the code assumes that some error occurred and resolves into an exception. That's not correct. The file_type() method returns a disengaged type only if the file being inspected is missing (i.e. on ENOENT errno). But this can validly happen if a file is removed between readdir and stat. In that case it's not "some error happened"; the entry should just be skipped. If "some error happened", then file_type() would resolve into an exceptional future on its own. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#26595 (cherry picked from commit `d9bfbeda9a`) Closes scylladb/scylladb#26759	2025-10-29 11:38:35 +02:00
Avi Kivity	5de570c9ae	Merge 'auth: move passwords::check call to alien thread' from Andrzej Jackowski Analysis of customer stalls revealed that the function `detail::hash_with_salt` (invoked by `passwords::check`) often blocks the reactor. Internally, this function uses the external `crypt_r` function to compute password hashes, which is CPU-intensive. This PR addresses the issue in two ways: 1) `sha-512` is now the only password hashing scheme for new passwords (it was already the common-case). 2) `passwords::check` is moved to a dedicated alien thread. Regarding point 1: before this change, the following hashing schemes were supported by `identify_best_supported_scheme()`: bcrypt_y, bcrypt_a, SHA-512, SHA-256, and MD5. The reason for this was that the `crypt_r` function used for password hashing comes from an external library (currently `libxcrypt`), and the supported hashing algorithms vary depending on the library in use. However: - The bcrypt schemes never worked properly because their prefixes lack the required round count (e.g. `$2y$` instead of `$2y$05$`). Moreover, bcrypt is slower than SHA-512, so it is not a good idea to fix or use it. - SHA-256 and SHA-512 both belong to the SHA-2 family. Libraries that support one almost always support the other, so it’s very unlikely to find SHA-256 without SHA-512. - MD5 is no longer considered secure for password hashing. Regarding point 2: the `passwords::check` call now runs on a shared alien thread created at database startup. An `std::mutex` synchronizes that thread with the shards. In theory this could introduce frequent lock contention, but in practice each shard handles only a few hundred new connections per second—even during storms. Also, `_conns_cpu_concurrency_semaphore` in `generic_server` already limits the number of concurrent connection handlers. Fixes https://github.com/scylladb/scylladb/issues/24524 Backport not needed, as it is a new feature. 
Closes scylladb/scylladb#24924 * github.com:scylladb/scylladb: main: utils: add thread names to alien workers auth: move passwords::check call to alien thread test: wait for 3 clients with given username in test_service_level_api auth: refactor password checking in password_authenticator auth: make SHA-512 the only password hashing scheme for new passwords auth: whitespace change in identify_best_supported_scheme() auth: require scheme as parameter for `generate_salt` auth: check password hashing scheme support on authenticator start (cherry picked from commit `c762425ea7`)	2025-09-07 14:30:26 +03:00
Nadav Har'El	c04b086929	alternator: avoid oversized allocation in Query/Scan This patch fixes one cause of oversized allocations - and therefore potentially stalls and increased tail latencies - in Alternator. Alternator's Scan or Query operation return a page of results. When the number of items is not limited by a "Limit" parameter, the default is to return a 1 MB page. If items are short, a large number of them can fit in that 1MB. The test test_query.py::test_query_large_page_small_rows has 30,000 items returned in a single page. In the response JSON, all these items are returned in a single array "Items". Before this patch, we build the full response as a RapidJSON object before sending it. The problem is that unfortunately, RapidJSON stores arrays as contiguous allocations. This results in large contiguous allocations in workloads that scan many small items, and large contiguous allocations can also cause stalls and high tail latencies. For example, before this patch, running test/alternator/run --runveryslow \ test_query.py::test_query_large_page_small_rows reports in the log: oversized allocation: 573440 bytes. After this patch, this warning no longer appears. The patch solves the problem by collecting the scanned items not in a RapidJSON array, but rather in a chunked_vector<rjson::value>, i.e, a chunked (non-contiguous) array of items (each a JSON value). After collecting this array separately from the response object, we need to print its content without actually inserting it into the object - we add a new function print_with_extra_array() to do that. The new separate-chunked-vector technique is used when a large number (currently, >256) of items were scanned. When there is a smaller number of items in a page (this is typical when each item is longer), we just insert those items in the object and print it as before. 
Beyond the original slow test that demonstrated the oversized allocation (which is now gone), this patch also includes a new test which exercises the new code with a scan of 700 (>256) items in a page - but this new test is fast enough to be permanently in our test suite and not a manual "veryslow" test as the other test. Fixes #23535 (cherry picked from commit `2385fba4b6`) Closes scylladb/scylladb#25654	2025-09-01 16:40:02 +03:00
Nadav Har'El	46bd9f2f27	utils, alternator: fix detection of invalid base-64 This patch fixes an error-path bug in the base-64 decoding code in utils/base64.cc, which among other things is used in Alternator to decode blobs in JSON requests. The base-64 decoding code has a lookup table, which was wrongly sized 255 bytes, but needed to be 256 bytes. This meant that if the byte 255 (0xFF) was included in an invalid base-64 string, instead of detecting that this is an invalid byte (since the only valid bytes in a base-64 string are A-Z,a-z,0-9,+,/ and =), the code would either think it's valid with a nonsense 6-bit part, or even crash on an out-of-bounds read. Besides the trivial fix, this patch also includes a reproducing test, which tries to write a blob as a supposedly base-64 encoded string with a 0xFF byte in it. The test fails before this patch (the write succeeds, unexpectedly), and passes after this patch (the write fails as expected). The test also passes on DynamoDB. Fixes #25701 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#25705 (cherry picked from commit `ff91027eac`) Closes scylladb/scylladb#25765	2025-09-01 09:07:00 +03:00
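The table-size bug described above can be illustrated with a minimal sketch. This is not the code in utils/base64.cc; the names `make_decode_table`, `decode_table`, `sextet` and the sentinel `invalid_char` are hypothetical. The point it shows is that a lookup table indexed by a byte needs 256 entries, so that byte 0xFF maps to an "invalid" entry instead of reading past the end.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical sentinel marking bytes that are not base-64 characters.
constexpr std::uint8_t invalid_char = 0xFF;

// One entry per possible byte value (0..255), so indexing with any
// unsigned char stays in bounds. A 255-entry table would miss byte 0xFF.
constexpr std::array<std::uint8_t, 256> make_decode_table() {
    std::array<std::uint8_t, 256> t{};
    for (auto& e : t) {
        e = invalid_char;
    }
    constexpr std::string_view alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    for (std::size_t i = 0; i < alphabet.size(); ++i) {
        t[static_cast<unsigned char>(alphabet[i])] = static_cast<std::uint8_t>(i);
    }
    return t;
}

inline constexpr auto decode_table = make_decode_table();

// Returns the 6-bit value of a base-64 character, or nullopt if the byte
// is not in the alphabet. Padding ('=') is left to the caller in this sketch.
constexpr std::optional<std::uint8_t> sextet(unsigned char c) {
    const std::uint8_t v = decode_table[c];
    if (v == invalid_char) {
        return std::nullopt;
    }
    return v;
}
```

With this layout, `sextet(0xFF)` reports an invalid byte rather than producing a nonsense 6-bit value.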
Avi Kivity	6171da6fbc	utils: chunked_vector: add swap() method Following std::vector(), we implement swap(). It's a simple matter of swapping all the contents. A unit test is added. (cherry picked from commit `13a75ff835`)	2025-08-25 12:44:13 +03:00
Avi Kivity	faaec66be7	utils: chunked_vector: add range insert() overloads Inserts an iterator range at some position. Again we insert the range at the end and use std::rotate() to move the newly inserted elements into place, forgoing possible optimizations. Unit tests are added. (cherry picked from commit `24e0d17def`)	2025-08-25 12:44:13 +03:00
Patryk Jędrzejczak	1434afa588	db/config, utils: allow using UUID as a config option We change the `recovery_leader` option to UUID in the following commit. (cherry picked from commit `ec69028907`)	2025-08-05 10:59:06 +00:00
Dario Mirovic	a8d38882ff	utils/reusable_buffer: accept non-throwing writer callbacks via result_with_exception Make make_bytes_ostream and make_fragmented_temporary_buffer accept writer callbacks that return utils::result_with_exception instead of forcing them to throw on error. This lets callers propagate failures by returning an error result rather than throwing an exception. Introduce buffer_writer_for, bytes_ostream_writer, and fragmented_buffer_writer concepts to simplify and document the template requirements on writer callbacks. This patch does not modify the actual callbacks passed, except for the syntax changes needed for successful compilation, without changing the logic. Refs: #24567 Fixes: #25272 (cherry picked from commit `9f4344a435`)	2025-07-30 21:54:41 +02:00
Ernest Zaslavsky	6d8350b20d	s3_client: parse multipart response XML defensively Ensure robust handling of XML responses when initiating multipart uploads. Check for the existence of required nodes before access, and throw an exception if the XML is empty or malformed. Refs: https://github.com/scylladb/scylladb/issues/24676 Closes scylladb/scylladb#24990 (cherry picked from commit `342e94261f`) Closes scylladb/scylladb#25054	2025-07-21 12:08:25 +02:00
Calle Wilund	0d61d63e7e	utils::http::dns_connection_factory: Use a shared certificate_credentials Fixes #24447 This factory type, which is really more a data holder/connection producer per connection instance, creates, if using https, a new certificate_credentials on every instance. Which when used by S3 client is per client and scheduling groups. Which eventually means that we will do a set_system_trust + "cold" handshake for every tls connection created this way. This will cause both IO and cold/expensive certificate checking -> possible stalls/wasted CPU. Since the credentials object in question is literally a "just trust system", it could very well be shared across the shard. This PR adds a thread local static cached credentials object and uses this instead. Could consider moving this to seastar, but maybe this is too much. Closes scylladb/scylladb#24448 (cherry picked from commit `80feb8b676`) Closes scylladb/scylladb#24461	2025-07-18 09:34:45 +03:00
Avi Kivity	8f65d7e63b	utils: chunked_vector: implement erase() for single elements and ranges Implement using std::rotate() and resize(). The elements to be erased are rotated to the end, then resized out of existence. Again we defer optimization for trivially copyable types. Unit tests are added. Needed for range_streamer with token_ranges using chunked_vector. (cherry picked from commit `d6eefce145`)	2025-07-16 07:43:29 +08:00
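The rotate-then-resize technique described above can be sketched on any random-access container. The helper name `erase_by_rotate` is hypothetical and the sketch is written against `std::vector`, not against `chunked_vector` itself.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

// Erase [first, last) by rotating the doomed elements to the end and then
// shrinking the container. Note that std::vector::resize(n) requires a
// default-constructible element type even when shrinking.
template <typename Container>
typename Container::iterator
erase_by_rotate(Container& c,
                typename Container::const_iterator first,
                typename Container::const_iterator last) {
    const auto start = std::distance(c.cbegin(), first);
    const auto count = std::distance(first, last);
    auto b = c.begin() + start;
    // After the rotate, the elements to erase occupy the last `count` slots.
    std::rotate(b, b + count, c.end());
    c.resize(c.size() - static_cast<typename Container::size_type>(count));
    // Iterator to the element that followed the erased range.
    return c.begin() + start;
}
```

The cost is linear in the number of elements after `first`, which matches the commit's remark that optimization for trivially copyable types is deferred.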
Avi Kivity	c6b0bacfb1	utils: chunked_vector: implement insert() for single-element inserts partition_range_compat's unwrap() needs insert if we are to use it for chunked_vector (which we do). Implement using push_back() and std::rotate(). emplace(iterator, args) is also implemented, though the benefit is diluted (it will be moved after construction). The implementation isn't optimal - if T is trivially copyable then using std::memmove() will be much faster than std::rotate(), but this complex optimization is left for later. Unit tests are added. (cherry picked from commit `5301f3d0b5`)	2025-07-16 07:43:21 +08:00
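A minimal sketch of the push_back()-plus-std::rotate() approach described above, written against `std::vector` for illustration. The helper name `insert_by_rotate` is hypothetical and is not the `chunked_vector` member itself.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <utility>
#include <vector>

// Insert `value` before `pos` by appending it and rotating it into place.
template <typename Container, typename T>
typename Container::iterator
insert_by_rotate(Container& c, typename Container::const_iterator pos, T&& value) {
    // Remember the position as an index: push_back may invalidate iterators.
    const auto idx = std::distance(c.cbegin(), pos);
    c.push_back(std::forward<T>(value));
    auto first = c.begin() + idx;
    // Rotate the last element to the front of [first, end).
    std::rotate(first, c.end() - 1, c.end());
    return first;
}
```

The rotate moves every element after the insertion point by one slot, which is why the commit notes that `std::memmove()` would be faster for trivially copyable types.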
Michał Chojnowski	08b117425e	utils/alien_worker: fix a data race in submit() We move a `seastar::promise` on the external worker thread, after the matching `seastar::future` was returned to the shard. That's illegal. If the `promise` move occurs concurrently with some operation (move, await) on the `future`, it becomes a data race which could cause various kinds of corruption. This patch fixes that by keeping the promise at a stable address on the shard (inside a coroutine frame) and only passing through the worker. Fixes #24751 Closes scylladb/scylladb#24752 (cherry picked from commit `a29724479a`) Closes scylladb/scylladb#24777	2025-07-03 11:02:30 +03:00
Avi Kivity	ee733c4d38	Merge '[Backport 2025.2] generic_server: fix connections semaphore config observer' from Scylladb[bot] In `ed3e4f33fd` we introduced new connection throttling feature which is controlled by uninitialized_connections_semaphore_cpu_concurrency config. But live updating of it was broken, this patch fixes it. When the temporary value from observer() is destroyed, it disconnects from updateable_value, so observation stops right away. We need to retain the observer. Backport: to 2025.2 where this feature was added Fixes: https://github.com/scylladb/scylladb/issues/24557 - (cherry picked from commit `c6a25b9140`) - (cherry picked from commit `45392ac29e`) - (cherry picked from commit `68ead01397`) Parent PR: #24484 Closes scylladb/scylladb#24679 * github.com:scylladb/scylladb: test: add test for live updates of generic server config utils: don't allow do discard updateable_value observer generic_server: fix connections semaphore config observer	2025-07-01 12:29:53 +03:00
Szymon Malewski	3bac46a18d	utils/exceptions.cc: Added check for `exceptions::request_timeout_exception` in `is_timeout_exception` function. It solves the issue where, in some cases, timeout exceptions in CAS operations are logged incorrectly as a general failure. Fixes #24591 Closes scylladb/scylladb#24619 (cherry picked from commit `f28bab741d`) Closes scylladb/scylladb#24687	2025-07-01 12:29:27 +03:00
Lakshmi Narayanan Sreethar	adab525151	utils/big_decimal: fix scale overflow when parsing values with large exponents The exponent of a big decimal string is parsed as an int32, adjusted for the removed fractional part, and stored as an int32. When parsing values like `1.23E-2147483647`, the unscaled value becomes `123`, and the scale is adjusted to `2147483647 + 2 = 2147483649`. This exceeds the int32 limit, and since the scale is stored as an int32, it overflows and wraps around, losing the value. This patch fixes that by parsing the exponent as an int64 value and then adjusting it for the fractional part. The adjusted scale is then checked to see if it is still within int32 limits before storing. An exception is thrown if it is not within the int32 limits. Note that strings with exponents that exceed the int32 range, like `0.01E2147483650`, were previously not parseable as a big decimal. They are now accepted if the final adjusted scale fits within int32 limits. For the above value, unscaled_value = 1 and scale = -2147483648, so it is now accepted. This is in line with how Java's `BigDecimal` parses strings. Fixes: #24581 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#24640 (cherry picked from commit `279253ffd0`) Closes scylladb/scylladb#24692	2025-07-01 12:28:55 +03:00
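The arithmetic in the two examples above can be sketched as a small helper. The function name `adjusted_scale` and its parameters are hypothetical, and the sketch assumes the relation scale = (digits after the decimal point) - exponent, which is consistent with both worked values in the commit message. It is not the actual parser in utils/big_decimal.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <stdexcept>

// Compute the scale in 64-bit arithmetic, then check that it fits in int32
// before narrowing. Assumes `exponent` and `fraction_digits` are small
// enough that the subtraction itself cannot overflow int64.
inline std::int32_t adjusted_scale(std::int64_t exponent, std::int64_t fraction_digits) {
    const std::int64_t scale = fraction_digits - exponent;
    if (scale < std::numeric_limits<std::int32_t>::min() ||
        scale > std::numeric_limits<std::int32_t>::max()) {
        throw std::out_of_range("big decimal scale does not fit in int32");
    }
    return static_cast<std::int32_t>(scale);
}
```

For `1.23E-2147483647` this gives 2 + 2147483647 = 2147483649, which is rejected; for `0.01E2147483650` it gives 2 - 2147483650 = -2147483648, which is accepted.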
Marcin Maliszkiewicz	011765ced8	utils: don't allow do discard updateable_value observer If the object returned from observe() is destroyed, it stops observing, potentially causing subtle bugs. Typically, the observer object is retained as a class member. (cherry picked from commit `45392ac29e`)	2025-06-26 14:50:25 +00:00
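The hazard described above, where a temporary observer disconnects as soon as it goes away, can be shown with a toy model. Everything here is hypothetical: `observable_int` and its nested `observer` are simplified stand-ins, not `updateable_value`, and marking the type `[[nodiscard]]` is just one standard C++ way to flag a discarded return value. The commit message does not say which mechanism the patch uses.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

class observable_int {
    using callback = std::function<void(int)>;
    std::vector<std::weak_ptr<callback>> _callbacks;
public:
    // The observer owns the callback; the observable only holds a weak
    // reference, so destroying the observer ends the observation.
    class [[nodiscard]] observer {
        std::shared_ptr<callback> _cb;
        friend class observable_int;
        explicit observer(std::shared_ptr<callback> cb) : _cb(std::move(cb)) {}
    };

    observer observe(callback cb) {
        auto p = std::make_shared<callback>(std::move(cb));
        _callbacks.push_back(p);
        return observer(std::move(p));
    }

    void set(int v) {
        for (auto& w : _callbacks) {
            if (auto cb = w.lock()) {
                (*cb)(v);  // only observers that are still alive get notified
            }
        }
    }
};
```

In this model, a call such as `cfg.observe(fn);` with the result ignored would draw a compiler warning, and the callback would never fire because the temporary observer is destroyed at the end of the statement.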
Benny Halevy	afa2b40ac9	disk_space_monitor: add space_source_registration Register the current space_source_fn in an RAII object that resets monitor._space_source to the previous function when the RAII object is destroyed. Use space_source_registration in database_test:: mutation_dump_generated_schema_deterministic_id_version to prevent use-after-stack-return in the test. Fixes #24314 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#24342 (cherry picked from commit `8b387109fc`) Closes scylladb/scylladb#24392	2025-06-24 10:02:23 +03:00
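The RAII pattern described above, installing a function and restoring the previous one on destruction, can be sketched as follows. The types `toy_monitor` and `scoped_space_source` are hypothetical simplifications and are not the `disk_space_monitor` or `space_source_registration` definitions from the patch.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical stand-in for a monitor with a replaceable space source.
struct toy_monitor {
    std::function<long()> space_source;
};

// Installs a new source on construction and puts the previous one back on
// destruction, so a source that captures stack variables cannot outlive them.
class scoped_space_source {
    toy_monitor& _monitor;
    std::function<long()> _previous;
public:
    scoped_space_source(toy_monitor& m, std::function<long()> fn)
        : _monitor(m)
        , _previous(std::exchange(m.space_source, std::move(fn))) {}

    ~scoped_space_source() {
        _monitor.space_source = std::move(_previous);
    }

    scoped_space_source(const scoped_space_source&) = delete;
    scoped_space_source& operator=(const scoped_space_source&) = delete;
};
```

A test would create the registration in the same scope as the variables its lambda captures, so the monitor stops calling the lambda before those variables are gone.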
Michał Chojnowski	a539ff6419	utils/lsa/chunked_managed_vector: fix the calculation of max_chunk_capacity() `chunked_managed_vector` is a vector-like container which splits its contents into multiple contiguous allocations if necessary, in order to fit within LSA's max preferred contiguous allocation limits. Each limited-size chunk is stored in a `managed_vector`. `managed_vector` is unaware of LSA's size limits. It's up to the user of `managed_vector` to pick a size which is small enough. This happens in `chunked_managed_vector::max_chunk_capacity()`. But the calculation is wrong, because it doesn't account for the fact that `managed_vector` has to place some metadata (the backreference pointer) inside the allocation. In effect, the chunks allocated by `chunked_managed_vector` are just a tiny bit larger than the limit, and the limit is violated. Fix this by accounting for the metadata. Also, before the patch `chunked_managed_vector::max_contiguous_allocation`, repeats the definition of logalloc::max_managed_object_size. This is begging for a bug if `logalloc::max_managed_object_size` changes one day. Adjust it so that `chunked_managed_vector` looks directly at `logalloc::max_managed_object_size`, as it means to. Fixes scylladb/scylladb#23854 (cherry picked from commit `7f9152babc`) Closes scylladb/scylladb#24371	2025-06-10 11:25:52 +03:00
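The off-by-metadata error described above comes down to simple arithmetic, sketched here. The function names and the numbers used in the checks (a 1024-byte limit, 8 bytes of metadata, 8-byte elements) are illustrative assumptions, not the real LSA limit or the actual `managed_vector` layout.

```cpp
#include <cassert>
#include <cstddef>

// Naive capacity: divides the whole limit by the element size and ignores
// the metadata stored in the same allocation.
constexpr std::size_t naive_chunk_capacity(std::size_t max_alloc, std::size_t elem_size) {
    return max_alloc / elem_size;
}

// Corrected capacity: reserve room for the metadata first.
constexpr std::size_t max_chunk_capacity(std::size_t max_alloc,
                                         std::size_t metadata_size,
                                         std::size_t elem_size) {
    return max_alloc > metadata_size ? (max_alloc - metadata_size) / elem_size : 0;
}

// Total bytes of one allocation holding `n` elements plus the metadata.
constexpr std::size_t allocation_size(std::size_t n,
                                      std::size_t metadata_size,
                                      std::size_t elem_size) {
    return metadata_size + n * elem_size;
}

// With the naive formula the chunk is slightly larger than the limit.
static_assert(allocation_size(naive_chunk_capacity(1024, 8), 8, 8) > 1024);
// With the metadata accounted for, the chunk fits.
static_assert(allocation_size(max_chunk_capacity(1024, 8, 8), 8, 8) <= 1024);
```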
Michał Chojnowski	6cd954de8d	utils/stream_compressor: allocate memory for zstd compressors externally The default and recommended way to use zstd compressors is to let zstd allocate and free memory for compressors on its own. That's what we did for zstd compressors used in RPC compression. But it turns out that it generates allocation patterns we dislike. We expected zstd not to generate allocations after the context object is initialized, but it turns out that it tries to downsize the context sometimes (by reallocation). We don't want that because the allocations generated by zstd are large (1 MiB with the parameters we use), so repeating them periodically stresses the reclaimer. We can avoid this by using the "static context" API of zstd, in which the memory for context is allocated manually by the user of the library. In this mode, zstd doesn't allocate anything on its own. The implementation details of this patch adds a consideration for forward compatibility: later versions of Scylla can't use a window size greater than the one we hardcoded in this patch when talking to the old version of the decompressor. (This is not a problem, since those compressors are only used for RPC compression at the moment, where cross-version communication can be prevented by bumping COMPRESSOR_NAME. But it's something that the developer who changes the window size must _remember_ to do). Fixes #24160 Fixes #24183 Closes scylladb/scylladb#24161 (cherry picked from commit `185a032044`) Closes scylladb/scylladb#24281	2025-06-03 10:02:34 +03:00
Michał Chojnowski	2c431c1ea2	logalloc: make background_reclaimer::free_memory_threshold publicly visible Wanted by the change to the background_reclaim test in the next patch. (cherry picked from commit `c47f438db3`)	2025-05-09 16:12:22 +00:00
Pavel Emelyanov	b56d6fbb84	Merge 'sstables: Fix quadratic space complexity in partitioned_sstable_set' from Raphael Raph Carvalho Interval map is very susceptible to quadratic space behavior when it's flooded with many entries overlapping all (or most of) intervals, since each such entry will have presence on all intervals it overlaps with. A trigger we observed was a memtable flush storm, which creates many small "L0" sstables that span roughly the entire token range. Since we cannot rely on insertion order, the solution is to store sstables with such wide ranges in a vector (unleveled). There should be no consequence for single-key reads, since the upper layer applies additional filtering based on the token of the key being queried. And for range scans, there can be an increase in memory usage, but not a significant one, because the sstables span a wide range and would have been selected in the combined reader if the range of the scan overlaps with them. Anyway, this is a protection against a storm of memtable flushes and shouldn't be the common scenario. It works both with tablets and vnodes, by adjusting the token range spanned by the compaction group accordingly. Fixes #23634. We can backport this into 2024.2, 2025.1, but we should let this cook in master for 1 month or so. Closes scylladb/scylladb#23806 * github.com:scylladb/scylladb: test: Verify partitioned set store split and unsplit correctly sstables: Fix quadratic space complexity in partitioned_sstable_set compaction: Wire table_state into make_sstable_set() compaction: Introduce token_range() to table_state dht: Add overlap_ratio() for token range	2025-05-05 11:28:38 +03:00
Piotr Dulikowski	8ffe4b0308	utils::loading_cache: gracefully skip timer if gate closed The loading_cache has a periodic timer which acquires the _timer_reads_gate. The stop() method first closes the gate and then cancels the timer - this order is necessary because the timer is re-armed under the gate. However, the timer callback does not check whether the gate was closed but tries to acquire it, which might result in unhandled exception which is logged with ERROR severity. Fix the timer callback by acquiring access to the gate at the beginning and gracefully returning if the gate is closed. Even though the gate used to be entered in the middle of the callback, it does not make sense to execute the timer's logic at all if the cache is being stopped. Fixes: scylladb/scylladb#23951 Closes scylladb/scylladb#23952	2025-04-30 16:43:22 +03:00
Raphael S. Carvalho	d5bee4c814	test: Verify partitioned set store split and unsplit correctly Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-04-29 15:47:33 -03:00
Pavel Emelyanov	324daac156	Merge 'Add CopyObject API implementation to S3 client' from Ernest Zaslavsky Implement the CopyObject API to directly copy S3 object from one location to another. This implementation consumes zero networking overhead on the client side since the object is copied internally by S3 machinery Usage example: Backup of tiered SSTables - you already have SSTables on S3, CopyObject is the ideal way to go No need to backport since we are adding new functionality for a future use Closes scylladb/scylladb#23779 * github.com:scylladb/scylladb: s3_client: implement S3 copy object s3_client: improve exception message s3_client: reposition local function for future use	2025-04-18 16:17:41 +03:00
Pavel Emelyanov	cc919b08c2	Merge 'backup: Optimize S3 throughput with shard-based upload' from Ernest Zaslavsky This PR enhances S3 throughput by leveraging every available shard to upload backup files concurrently. By distributing the load across multiple shards, we significantly improve the upload performance. Each shard retrieves an SSTable and processes its files sequentially, ensuring efficient, file-by-file uploads. To prevent uncontrolled fiber creation and potential resource exhaustion, the backup task employs a directory semaphore from the sstables_manager. This mechanism helps regulate concurrency at the directory level, ensuring stable and predictable performance during large-scale backup operations. Refs #22460 fixes: #22520 ``` =========================================== Release build, master, smp-16, mem-32GiB Bytes: 2342880184, backup time: 9.51 s =========================================== Release build, this PR, smp-16, mem-32GiB Bytes: 2342891015, backup time: 1.23 s =========================================== ``` Looks like it is faster at least x7.7 No backport needed since it (native backup) is still unused functionality Closes scylladb/scylladb#23727 * github.com:scylladb/scylladb: backup: Add test for invalid endpoint backup_task: upload on all shards backup_task: integrate sharded storage manager for upload	2025-04-18 16:17:41 +03:00
Avi Kivity	6b415cfd4b	Merge 'managed_bytes: in the copy constructor, respect the target preferred allocation size' from Michał Chojnowski Commit `14bf09f447` added a single-chunk layout to `managed_bytes`, which makes the overhead of `managed_bytes` smaller in the common case of a small buffer. But there was a bug in it. In the copy constructor of `managed_bytes`, a copy of a single-chunk `managed_bytes` is made single-chunk too. But this is wrong, because the source of the copy and the target of the copy might have different preferred max contiguous allocation sizes. In particular, if a `managed_bytes` of size between 13 kiB and 128 kiB is copied from the standard allocator into LSA, the resulting `managed_bytes` is a single chunk which violates LSA's preferred allocation size. (And therefore is placed by LSA in the standard allocator). In other words, since Scylla 6.0, cache and memtable cells between 13 kiB and 128 kiB are getting allocated in the standard allocator rather than inside LSA segments. Consequences of the bug: 1. Effective memory consumption of an affected cell is rounded up to the nearest power of 2. 2. With a pathological-enough allocation pattern (for example, one which somehow ends up placing a single 16 kiB memtable-owned allocation in every aligned 128 kiB span), memtable flushing could theoretically deadlock, because the allocator might be too fragmented to let the memtable grow by another 128 kiB segment, while keeping the sum of all allocations small enough to avoid triggering a flush. (Such an allocation pattern probably wouldn't happen in practice though). 3. It triggers a bug in reclaim which results in spurious allocation failures despite ample evictable memory. There is a path in the reclaimer procedure where we check whether reclamation succeeded by checking that the number of free LSA segments grew. 
But in the presence of evictable non-LSA allocations, this is wrong because the reclaim might have met its target by evicting the non-LSA allocations, in which case memory is returned directly to the standard allocator, rather than to the pool of free segments. If that happens, the reclaimer wrongly returns `reclaimed_nothing` to Seastar, which fails the allocation. Refs (possibly fixes) https://github.com/scylladb/scylladb/issues/21072 Fixes https://github.com/scylladb/scylladb/issues/22941 Fixes https://github.com/scylladb/scylladb/issues/22389 Fixes https://github.com/scylladb/scylladb/issues/23781 This is a regression fix, should be backported to all affected releases. Closes scylladb/scylladb#23782 * github.com:scylladb/scylladb: managed_bytes_test: add a reproducer for #23781 managed_bytes: in the copy constructor, respect the target preferred allocation size	2025-04-17 21:14:10 +03:00
Benny Halevy	b7212620f9	backup_task: upload on all shards Use all shards to upload snapshot files to S3. By using the sharded sstables_manager_for_table infrastructure. Refs #22460 Quick perf comparison =========================================== Release build, master, smp-16, mem-32GiB Bytes: 2342880184, backup time: 9.51 s =========================================== Release build, this PR, smp-16, mem-32GiB Bytes: 2342891015, backup time: 1.23 s =========================================== Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Co-authored-by: Ernest Zaslavsky <ernest.zaslavsky@scylladb.com>	2025-04-17 16:31:42 +03:00
Kefu Chai	b0cbe86780	s3/client: define a constant for security credential resource Instead of repeating it, define a constant and reuse it, which reduces repetition. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23713	2025-04-17 11:51:15 +03:00
Ernest Zaslavsky	a369dda049	s3_client: implement S3 copy object Add support for the CopyObject API to enable direct copying of S3 objects between locations. This approach eliminates networking overhead on the client side, as the operation is handled internally by S3.	2025-04-17 09:47:47 +03:00
Michał Chojnowski	4e2f62143b	managed_bytes: in the copy constructor, respect the target preferred allocation size Commit `14bf09f447` added a single-chunk layout to `managed_bytes`, which makes the overhead of `managed_bytes` smaller in the common case of a small buffer. But there was a bug in it. In the copy constructor of `managed_bytes`, a copy of a single-chunk `managed_bytes` is made single-chunk too. But this is wrong, because the source of the copy and the target of the copy might have different preferred max contiguous allocation sizes. In particular, if a `managed_bytes` of size between 13 kiB and 128 kiB is copied from the standard allocator into LSA, the resulting `managed_bytes` is a single chunk which violates LSA's preferred allocation size. (And therefore is placed by LSA in the standard allocator). In other words, since Scylla 6.0, cache and memtable cells between 13 kiB and 128 kiB are getting allocated in the standard allocator rather than inside LSA segments. Consequences of the bug: 1. Effective memory consumption of an affected cell is rounded up to the nearest power of 2. 2. With a pathological-enough allocation pattern (for example, one which somehow ends up placing a single 16 kiB memtable-owned allocation in every aligned 128 kiB span), memtable flushing could theoretically deadlock, because the allocator might be too fragmented to let the memtable grow by another 128 kiB segment, while keeping the sum of all allocations small enough to avoid triggering a flush. (Such an allocation pattern probably wouldn't happen in practice though). 3. It triggers a bug in reclaim which results in spurious allocation failures despite ample evictable memory. There is a path in the reclaimer procedure where we check whether reclamation succeeded by checking that the number of free LSA segments grew. 
But in the presence of evictable non-LSA allocations, this is wrong because the reclaim might have met its target by evicting the non-LSA allocations, in which case memory is returned directly to the standard allocator, rather than to the pool of free segments. If that happens, the reclaimer wrongly returns `reclaimed_nothing` to Seastar, which fails the allocation. Refs (possibly fixes) https://github.com/scylladb/scylladb/issues/21072 Fixes https://github.com/scylladb/scylladb/issues/22941 Fixes https://github.com/scylladb/scylladb/issues/22389 Fixes https://github.com/scylladb/scylladb/issues/23781	2025-04-16 22:06:06 +02:00
Ernest Zaslavsky	8929cb324e	s3_client: improve exception message Clarify that the multipart upload was aborted due to a failure in parsing ETags.	2025-04-16 18:58:22 +03:00
Ernest Zaslavsky	993953016f	s3_client: reposition local function for future use The local function has been relocated higher in the code to prepare for its usage in upcoming implementations.	2025-04-16 18:46:31 +03:00
Pavel Emelyanov	b25cb5af0c	Merge 'Use named gates' from Benny Halevy Name the gates and phased barriers we use to make it easy to debug gate_closed_exception Refs https://github.com/scylladb/seastar/pull/2688 * Enhancement only, no backport needed Closes scylladb/scylladb#23329 * github.com:scylladb/scylladb: utils: loading_cache: use named_gate utils: flush_queue: use named_gate sstables_manager: use named gate sstables_loader: use named gate utils: phased_barrier, pluggable: use named gate utils: s3::client::multipart_upload: use named gate utils: s3::client: use named_gate transport: controller: use named gate tracing: trace_keyspace_helper: use named gate task_manager: module: use named gate topology_coordinator: use named gate storage_service: use named gate storage_proxy: wait_for_hint_sync_point: use named gate storage_proxy: remote: use named gate service: session: use named gate service: raft: raft_rpc: use named gate service: raft: raft_group0: use named gate service: raft: persistent_discovery: use named gate service: raft: group0_state_machine: use named gate service: migration_manager: use named gate replica: table: use named gate replica: compaction_group, storage_group: use named gate redis: query_processor: use named gate repair: repair_meta: use named gate reader_concurrency_semaphore: use named gate raft: server_impl: use named gate querier_cache: use named gate gms: gossiper: use named gate generic_server: use named gate db: sstables_format_listener: use named gate db: snapshot: backup_task: use named gate db: snapshot_ctl: use named gate hints: hints_sender: use named gate hints: manager: use named gate hints: hint_endpoint_manager: use named gate commitlog: segment_manager: use named gate db: batchlog_manager: use named gate query_processor: remote: use named gate compaction: compaction_state: use named gate alternator/server: use named_gate	2025-04-14 20:56:32 +03:00
Kefu Chai	b3f709bed7	s3: remove an extraneous space Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23714	2025-04-14 13:02:58 +03:00
Benny Halevy	8d7e4d6c36	utils: loading_cache: use named_gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:47:09 +03:00
Benny Halevy	46f2a24772	utils: flush_queue: use named_gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:47:02 +03:00
Benny Halevy	e1fe82ed33	utils: phased_barrier, pluggable: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:47:00 +03:00
Benny Halevy	d3f498ae59	utils: s3::client::multipart_upload: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:47:00 +03:00
Benny Halevy	eea83464c7	utils: s3::client: use named_gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:46:51 +03:00
Pavel Emelyanov	d9853efa7c	Merge '[Out-of-space prevention] db: backup: prioritize sstables that were deleted from the table' from Benny Halevy The motivation behind this change is to free up disk space as early as possible. The reason is that a snapshot locks the space of all SSTables in the snapshot, and deleting them from the table, for example by compaction or tablet migration, won't free up their capacity until they are uploaded to object storage and deleted from the snapshot. This series adds prioritization of deleted sstables in two cases: First, after the snapshot dir is processed, the list of SSTable generations is cross-referenced with the list of SSTables presently in the table, and any generation that is not in the table is prioritized to be uploaded earlier. In addition, a subscription mechanism was added to sstables_manager and it is used in backup to prioritize SSTables that get deleted from the table directory during backup. This is particularly important when backup happens during high disk utilization (e.g. 90%). Without it, even if the cluster is scaled up and tablets are migrated away from the full nodes to new nodes, tablet cleanup might not free any space if all the tablet sstables are hardlinked to the snapshot taken for backup. 
* Enhancement, no backport needed Closes scylladb/scylladb#23241 * github.com:scylladb/scylladb: db: snapshot: backup_task: prioritize sstables deleted during upload sstables_manager: add subscriptions db: snapshot: backup_task: limit concurrency sstables: directory_semaphore: expose get_units db: snapshot: backup_task: add sharded sstables_manager database: expose get_sstables_manager(schema) db: snapshot: backup_task: do_backup: prioritize sstables that are already deleted from the table db: snapshot-ctl: pass table_id to backup_task db: snapshot-ctl: expose sharded db() getter db: snapshot: backup_task: do_backup: organize components by sstable generation db: snapshot: coroutinize backup_task db: snapshot: backup_task: refactor backup_file out of uploads_worker db: snapshot: backup_task: refactor uploads_worker out of do_backup db: snapshot: backup_task: process_snapshot_dir: initialize total progress utils/s3: upload_progress: init members to 0 db: snapshot: backup_task: do_backup: refactor process_snapshot_dir db: snapshot: backup_task: keep expection as member	2025-04-09 15:32:11 +03:00
Benny Halevy	6da215e8af	utils/s3: upload_progress: init members to 0 For default construction. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-09 08:44:52 +03:00
Botond Dénes	f7938e3f8b	utils/error_injection: add a way to set parameters from error injection points With this, it is now possible to have two-way communication between the error injection point and its enabler. The test can enable the error injection point, then wait until it is hit, before proceeding.	2025-04-08 00:11:36 -04:00
Kefu Chai	55777812d4	s3/client: Optimize file streaming with zero-copy multipart uploads When streaming files using multipart upload, switch from using `output_stream::write(const char*, size_t)` to passing buffer objects directly to `output_stream::write()`. This eliminates unnecessary memory copying that occurred when the original implementation had to defensively copy data before sending. The buffer objects can now be safely reused by the output stream instead of creating deep copies, which should improve performance by reducing memory operations during S3 file uploads. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23567	2025-04-07 12:50:06 +03:00
Michał Chojnowski	2bd393849c	utils/hashers: add get_sha256() Add a helper function which computes the SHA256 for a blob. We will use it to compute identifiers for SSTable compression dictionaries later.	2025-04-01 00:07:28 +02:00
Piotr Dulikowski	288216a89e	Merge 'Ignore wrapped exceptions `gate_closed_exception` and `rpc::closed_error` when node shuts down.' from Sergey Zolotukhin Normally, when a node is shutting down, `gate_closed_exception` and `rpc::closed_error` in `send_to_live_endpoints` should be ignored. However, if these exceptions are wrapped in a `nested_exception`, an error message is printed, causing tests to fail. This commit adds handling for nested exceptions in this case to prevent unnecessary error messages. Fixes scylladb/scylladb#23325 Fixes scylladb/scylladb#23305 Fixes scylladb/scylladb#21815 Backport: looks like this is quite a frequent issue, therefore backport to 2025.1. Closes scylladb/scylladb#23336 * github.com:scylladb/scylladb: database: Pass schema_ptr as const ref in `wrap_commitlog_add_error` database: Unify exception handling in `do_apply` and `apply_with_commitlog` storage_proxy: Ignore wrapped `gate_closed_exception` and `rpc::closed_error` when node shuts down. exceptions: Add `try_catch_nested` to universally handle nested exceptions of the same type.	2025-03-27 11:39:42 +01:00
Sergey Zolotukhin	6abfed9817	exceptions: Add `try_catch_nested` to universally handle nested exceptions of the same type.	2025-03-26 11:15:13 +01:00
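
The two entries above describe detecting an exception type even when it is wrapped in a `nested_exception`. The sketch below illustrates that general idea using only the C++ standard library. It is not Scylla's `try_catch_nested`, whose signature is not shown in this log; `closed_error`, `contains_nested` and `make_wrapped` are hypothetical names introduced for illustration.

```cpp
#include <exception>
#include <stdexcept>

// Hypothetical stand-in for an exception such as gate_closed_exception.
struct closed_error : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Returns true if `ep`, or any exception nested inside it, is of type T.
// Assumption: wrappers were created with std::throw_with_nested.
template <typename T>
bool contains_nested(std::exception_ptr ep) {
    while (ep) {
        try {
            std::rethrow_exception(ep);
        } catch (const T&) {
            return true;                       // found at this level
        } catch (const std::exception& e) {
            try {
                std::rethrow_if_nested(e);     // no-op if nothing is nested
                return false;                  // innermost level reached
            } catch (...) {
                ep = std::current_exception(); // descend one level
            }
        } catch (...) {
            return false;                      // unknown type, cannot descend
        }
    }
    return false;
}

// Builds a runtime_error that wraps a closed_error, for demonstration.
std::exception_ptr make_wrapped() {
    try {
        try {
            throw closed_error("gate closed");
        } catch (...) {
            std::throw_with_nested(std::runtime_error("write failed"));
        }
    } catch (...) {
        return std::current_exception();
    }
    return nullptr; // not reached
}
```

A caller on a shutdown path could then test `contains_nested<closed_error>(std::current_exception())` inside a `catch (...)` handler and suppress the error log when it returns true.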


1943 Commits