scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-30 19:46:48 +00:00

Author	SHA1	Message	Date
Piotr Smaroń	34c3688017	db: config: add live_updatable_config_params_changeable_via_cql option If `live_updatable_config_params_changeable_via_cql` is set to true, configuration parameters defined with the `liveness::LiveUpdate` option can be updated at runtime with CQL, i.e. by updating the `system.config` virtual table. If we don't want any configuration parameter to be changed at runtime by updating the `system.config` virtual table, this option should be set to false. This option should be set to false for e.g. cloud users, who can only perform CQL queries, and should not be able to change scylla's configuration on the fly. The current implementation is generic, but has a small drawback - messages returned to the user may not be fully accurate, consider: ``` cqlsh> UPDATE system.config SET value='2' WHERE name='task_ttl_in_seconds'; WriteFailure: Error from server: code=1500 [Replica(s) failed to execute write] message="option is not live-updateable" info={'failures': 1, 'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'} ``` where `task_ttl_in_seconds` has been defined with `liveness::LiveUpdate`, but because `live_updatable_config_params_changeable_via_cql` is set to `false` in `scylla.yaml`, `task_ttl_in_seconds` cannot be modified at runtime by updating the `system.config` virtual table. Fixes #14355 Closes #14382	2023-08-16 17:56:27 +03:00
Pavel Emelyanov	3c6686e181	bptree: Replace assert with static_assert The one runs under checked constexpr value anyway Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #14951	2023-08-06 16:36:12 +03:00
Kamil Braun	39ca07c49b	Merge 'Gossiper endpoint locking' from Benny Halevy This series cleans up and hardens the endpoint locking design and implementation in the gossiper and endpoint-state subscribers. We make sure that all notifications (except for `before_change`, that apparently can be dropped) are called under lock_endpoint, as well as all calls to gossiper::replicate, to serialize endpoint_state changes across all shards. An endpoint lock gets a unique permit_id that is passed to the notifications and passed back by them if the notification functions call the gossiper back for the same endpoint on paths that modify the endpoint_state and may acquire the same endpoint lock - to prevent a deadlock. Fixes scylladb/scylladb#14838 Refs scylladb/scylladb#14471 Closes #14845 * github.com:scylladb/scylladb: gossiper: replicate: ensure non-null permit gossiper: add_saved_endpoint: lock_endpoint gossiper: mark_as_shutdown: lock_endpoint gossiper: real_mark_alive: lock_endpoint gossiper: advertise_token_removed: lock_endpoint gossiper: do_status_check: lock_endpoint gossiper: remove_endpoint: lock_endpoint if needed gossiper: force_remove_endpoint: lock_endpoint if needed storage_service: lock_endpoint when removing node gossiper: use permit_id to serialize state changes while preventing deadlocks gossiper: lock_endpoint: add debug messages utils: UUID: make default tagged_uuid ctor constexpr gossiper: lock_endpoint must be called on shard 0 gossiper: replicate: simplify interface gossiper: mark_as_shutdown: make private gossiper: convict: make private gossiper: mark_as_shutdown: do not call convict	2023-08-02 13:50:08 +02:00
Benny Halevy	929d03b370	utils: UUID: make default tagged_uuid ctor constexpr So it can be used for gms::null_permit_id in the next patch Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-07-31 19:29:18 +03:00
Benny Halevy	60862c63dd	utils/directories: verify_owner_and_mode: add recursive flag Allow the caller to verify only the top level directories so that sub-directories can be verified selectively (in particular, skip validation of snapshots). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-07-31 16:01:43 +03:00
Raphael S. Carvalho	050ce9ef1d	cached_file: Evict unused pages that aren't linked to LRU yet It was found that cached_file dtor can hit the following assert after OOM: cached_file_test: utils/cached_file.hh:379: cached_file::~cached_file(): Assertion `_cache.empty()' failed. cached_file's dtor iterates through all entries and evicts those that are linked to LRU, under the assumption that all unused entries were linked to LRU. That's partially correct. get_page_ptr() may fetch more than 1 page due to read ahead, but it will only call cached_page::share() on the first page, the one that will be consumed now. share() is responsible for automatically placing the page into LRU once refcount drops to zero. If the read is aborted midway, before cached_file has a chance to hit the 2nd page (read ahead) in cache, it will remain there with refcount 0 and unlinked to LRU, in hope that a subsequent read will bring it out of that state. Our main user of cached_file is per-sstable index caching. If the scenario above happens, and the sstable and its associated cached_file are destroyed before the 2nd page is hit, cached_file will not be able to clear all the cache because some of the pages are unused and not linked. A page read ahead will now be linked into LRU so it doesn't sit in memory indefinitely. This also allows the cached_file dtor to clear all cache if some of those pages brought in advance aren't fetched later. A reproducer was added. Fixes #14814. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #14818	2023-07-27 00:01:46 +02:00
Kefu Chai	a8254111ef	utils: drop operator<< for pretty printers since all callers of these operators have switched to fmt formatters. let's drop them. the tests are updated accordingly. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-07-17 14:02:13 +08:00
Kefu Chai	fc6b84ec1f	utils: add fmt formatter for pretty printers add fmt formatter for `utils::pretty_printed_data_size` and `utils::pretty_printed_throughput`. this is a part of a series migrating from `operator<<(ostream&, ..)` based formatting to fmtlib based formatting. the goal here is to enable fmtlib to print `utils::pretty_printed_data_size` and `utils::pretty_printed_throughput` without the help of `operator<<`. please note, despite that it's more popular to use the IEC prefixes when presenting the size of storage, i.e., MiB for 1024^2 bytes instead of MB for 1000^2 bytes, we are still using the SI binary prefixes as the default binary prefix, in order to preserve the existing behavior. also, we use the singular form of "byte" when formatting "1". this is more correct. the tests are updated accordingly. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-07-17 14:02:13 +08:00
Kefu Chai	567b453689	utils: avoid using out-of-range index in pretty_printers before this change, if the formatted size is greater than a petabyte, `exp` would be 6. but we still use it as the index to find the suffix in `suffixes`, even though the array's size is 6. so we would be referencing random bits after "PB" for the suffix of the formatted size. in this change * loop over the suffixes for better readability and to avoid off-by-one errors. * add tests for both pretty printers Branches: 5.1,5.2,5.3 Fixes #14702 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #14713	2023-07-16 18:46:09 +03:00
Mikołaj Grzebieluch	b165f1e88b	utils: error injection: check if it is an ongoing one-shot injection in is_enabled Change it for consistency with `enabled_injections`. Closes #14597	2023-07-13 15:56:33 +02:00
Kamil Braun	a2fe63349d	Merge 'utils: error injection: add a string-to-string map of injection's parameters' from Mikołaj Grzebieluch Add `parameters` map to `injection_shared_data`. Now tests can attach string data to injections that can be read in injected code via `injection_handler`. Closes #14521 Closes #14608 * github.com:scylladb/scylladb: tests: add a `parameters` argument to code that enables injections api/error_injection: add passing injection's parameters to enable endpoint tests: utils: error injection: add test for injection's parameters utils: error injection: add a string-to-string map of injection's parameters utils: error injection: rename received_messages_counter to injection_shared_data	2023-07-13 11:52:15 +02:00
Mikołaj Grzebieluch	f60580ab3e	utils: error injection: add a string-to-string map of injection's parameters Add `parameters` map. Now tests can attach string data to injections that can be read in injected code via `injection_handler`.	2023-07-13 10:10:52 +02:00
Mikołaj Grzebieluch	b33714a0f0	utils: error injection: rename received_messages_counter to injection_shared_data For now, `received_messages_counter` has only data for messaging the injection. In the future, there will be more data to keep, for example, a string-to-string map of injection's parameters. Rename this class and its attributes.	2023-07-13 10:10:52 +02:00
Kamil Braun	9d4b3c6036	test: use correct timestamp resolution in `test_group0_history_clearing_old_entries` In `10c1f1dc80` I fixed `make_group0_history_state_id_mutation` to use correct timestamp resolution (microseconds instead of milliseconds) which was supposed to fix the flakiness of `test_group0_history_clearing_old_entries`. Unfortunately, the test is still flaky, although now it's failing at a later step -- this is because I was sloppy and I didn't adjust this second part of the test to also use microsecond resolution. The test is counting the number of entries in the `system.group0_history` table that are older than a certain timestamp, but it's doing the counting using millisecond resolution, causing it to give results that are off by one sometimes. Fix it by using microseconds everywhere. Fixes #14653 Closes #14670	2023-07-13 10:33:52 +03:00
Tomasz Grabiec	e8ee0a2f86	Merge 'group0_state_machine: use correct comparison for timeuuids in `merger`' from Kamil Braun In `d2a4079bbe`, `merger` was modified so that when we merge a command, `last_group0_state_id` is taken to be the maximum of the merged command's state_id and the current `last_group0_state_id`. This is necessary for achieving the same behavior as if the commands were applied individually instead of being merged -- where we take the maximum state ID from `group0_history` table which was applied until now (because the table is sorted using the state IDs and we take the greatest row). However, a subtle bug was introduced -- the `std::max` function uses the `utils::UUID` standard comparison operator which is unfortunately not the same as timeuuid comparison that Scylla performs when sorting the `group0_history` table. So in rare cases it could return the smaller of the two timeuuids w.r.t. the correct timeuuid ordering. This would then lead to commands being applied which should have been turned to no-ops due to the `prev_state_id` check -- and then, for example, permanent schema desync or worse. Fix it by using the correct comparison method. Fixes: #14600 Closes #14616 * github.com:scylladb/scylladb: utils/UUID: reference `timeuuid_tri_compare` in `UUID::operator<=>` comment group0_state_machine: use correct comparison for timeuuids in `merger` utils/UUID: introduce `timeuuid_tri_compare` for `const UUID&` utils/UUID: introduce `timeuuid_tri_compare` for `const int8_t*`	2023-07-12 14:48:18 +02:00
Kamil Braun	051728318d	utils/UUID: reference `timeuuid_tri_compare` in `UUID::operator<=>` comment	2023-07-11 13:19:50 +02:00
Kamil Braun	5ce802676f	utils/UUID: introduce `timeuuid_tri_compare` for `const UUID&` The existing `timeuuid_tri_compare` operates on UUIDs serialized in byte buffers. Introduce a version which operates directly on the `utils::UUID` type. To reuse existing comparison code, we serialize to a buffer before comparing. But we avoid allocations by using `std::array`. Since the serialized size needs to be known at compile time for `std::array`, mark `UUID::serialized_size()` as `constexpr`.	2023-07-11 11:48:02 +02:00
Kamil Braun	668beedadc	utils/UUID: introduce `timeuuid_tri_compare` for `const int8_t*` `timeuuid_tri_compare` takes `bytes_view` parameters and converts them to `const int8_t*` before comparing. Extract the part that operates on `const int8_t*` to a separate function which we will reuse in a later commit.	2023-07-11 11:48:02 +02:00
Kefu Chai	ef78b31b43	s3/client: add tagging ops with tagging ops, we will be able to attach kv pairs to an object. this will allow us to mark sstable components with taggings, and filter them based on them. * test/pylib/minio_server.py: enable anonymous user to perform more actions. because the tagging related ops are not enabled by "mc anonymous set public", we have to enable them using "set-json" subcommand. * utils/s3/client: add methods to manipulate taggings. * test/boost/s3_test: add a simple test accordingly. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #14486	2023-07-11 09:30:46 +03:00
Kefu Chai	0dca0a7f27	build: cmake: include pretty_printers.cc in util we added pretty_printers.cc back in `83c70ac04f`, in which configure.py is updated. so let's sync the CMake building system accordingly. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #14442	2023-07-11 09:16:33 +03:00
Avi Kivity	0cabf4eeb9	build: disable implicit fallthrough Prevent switch case statements from falling through without annotation ([[fallthrough]]) proving that this was intended. Existing intended cases were annotated. Closes #14607	2023-07-10 19:36:06 +02:00
Kefu Chai	26dcfea84a	estimated_histogram: do not use dynamic format_string fmtlib allows us to specify the field width dynamically, so specify the field width in the same statement formatting the argument improves the readability. and use the constexpr fmt string allows us to switch to compile-time formatter supported by fmtlib v8. this change also use `fmt::print()` to format the argument right to the output ostream, instead of creating a temporary sstring, and copy it to the output ostream. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #14579	2023-07-08 15:10:41 +03:00
Mikołaj Grzebieluch	086b3369f4	utils: error injection: add inject_with_handler for interactions with injected code Currently, it is hard for injected code to wait for some events, for example, requests on some REST endpoint. This commit adds the `inject_with_handler` method that executes the injected function and passes `injection_handler` as its argument. The `injection_handler` class is used to wait for events inside the injected code. The `error_injection` class can notify the injection's handler or handlers associated with the injection on all shards about the received message. There is a counter of received messages in `received_messages_counter`; it is shared between the injection_data, which is created once when enabling an injection on a given shard, and all `injection_handler`s, that are created separately for each firing of this injection. The `counter` is incremented when receiving a message from the REST endpoint and the condition variable is signaled. Each `injection_handler` (separate for each firing) stores its own private counter, `_read_messages_counter`. That private counter is incremented whenever we wait for a message, and compared to the received counter. We sleep on the condition variable if not enough messages were received.	2023-07-06 12:32:07 +02:00
Mikołaj Grzebieluch	01bc6f5294	utils: error injection: create structure for error injections data This enables holding additional data associated with the injection.	2023-07-05 13:52:46 +02:00
Raphael S. Carvalho	83c70ac04f	utils: Extract pretty printers into a header Can be easily reused elsewhere. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2023-06-26 21:58:20 -03:00
Petr Gusev	1e851262f2	storage_proxy: handler responses, use pointers to default constructed values instead of nulls The current Seastar RPC infrastructure lacks support for null values in tuples in handler responses. In this commit we add the make_default_rpc_tuple function, which solves the problem by returning pointers to default-constructed values for smart pointer types rather than nulls. The problem was introduced in this commit `2d791a5ed4`. The function `encode_replica_exception_for_rpc` used `default_tuple_maker` callback to create tuples containing exceptions. Callers returned pointers to default-constructed values in this callback, e.g. `foreign_ptr(make_lw_shared<reconcilable_result>())`. The commit changed this to just `SourceTuple{}`, which means nullptr for pointer types. Fixes: #14282 Closes #14352	2023-06-26 11:10:38 +03:00
Tomasz Grabiec	ad6d2b42f2	test: Extract throttle object to separate header	2023-06-21 00:58:24 +02:00
Botond Dénes	ddf8547f25	Merge 'Add concurrency control and workload isolation for S3 client' from Pavel Emelyanov In its current state s3 client uses a single default-configured http client thus making different sched classes' workload compete with each other for sockets to make requests on. There's an attempt to handle that in upload-sink implementation that limits itself with some small number of concurrent PUT requests, but that doesn't help much as many sinks don't share this limit. This PR makes S3 client maintain a set of http clients, one per sched-group, configures maximum number of TCP connections proportional to group's shares and removes the artificial limit from sinks thus making them share the group's http concurrency limit. As a side effect, the upload-sink fixes the no-writes-after-flush protection -- if it's violated, write will result in exception, while currently it just hangs on a semaphore forever. fixes: #13458 fixes: #13320 fixes: #13021 Closes #14187 * github.com:scylladb/scylladb: s3/client: Replace skink flush semaphore with gate s3/client: Configure different max-connections on http clients s3/client: Maintain several http clients on-board s3/client: Remove now unused http reference from sink and file s3/client: Add make_request() method	2023-06-20 07:09:21 +03:00
Petr Gusev	2d791a5ed4	storage_proxy.cc: refactor encode_replica_exception_for_rpc We are going to add fencing to read RPCs, it would be easier to do it once for all three of them. This refactoring enables this since it allows to use encode_replica_exception_for_rpc for handle_read_digest.	2023-06-15 15:52:50 +04:00
Pavel Emelyanov	c1c1752f88	s3/client: Replace skink flush semaphore with gate Uploading sinks have an internal semaphore limiting the maximum number of uploading parts and pieces with the value of two. This approach has several drawbacks. 1. The number is random. It could as well be three, four or any other 2. Jumbo upload in fact violates this parallelism, because it applies to the maximum number of pieces _and_ the maximum number of parts in each piece that can be uploaded in parallel. Thus jumbo upload results in four parts in parallel. 3. Multiple uploads don't sync with each other, so uploading N objects would result in N * 2 (or even N * 4 with jumbo) uploads in parallel. 4. Single upload could benefit from using more sockets if no other uploads happen in parallel. IOW -- limit should be shard-wide, not single-upload-wide Previous patches already put the per-shard parallelism under (some) control, so this semaphore is in fact used as a way to collect background uploading fibers on final flush and thus can be replaced with a gate. As a side effect, this fixes an issue that writes-after-flush shouldn't happen (see #13320) -- when flushed the upload gate is closed and subsequent writes would hit gate-closed error. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-06-08 18:38:57 +03:00
Pavel Emelyanov	99b92f0ed8	s3/client: Configure different max-connections on http clients After previous patch different sched groups got different http clients. By default each client is started with 100 allowed connections. This can be too much -- 100 * nr-sched-groups * smp::count can be quite a huge number. Also, different groups should have different parallelism, e.g. flush/compaction doesn't care that much about latency and can use fewer sockets while query class is more welcome to have larger concurrency. As a starter -- configure http clients with maximum shares/100 sockets. Thus query class would have 10 and flush/compaction -- 1. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-06-08 18:35:59 +03:00
Pavel Emelyanov	81d1bfce2a	s3/client: Maintain several http clients on-board The intent is to isolate workloads from different sched groups from each other and not let one sched group consume all sockets from the http client thus affecting requests made by other sched groups. The contention happens in the maximum number of sockets an http client may have (see scylladb/seastar#1652). If requests take time and client is asked to make more and more it will eventually stop spawning new connections and would get blocked internally waiting for running requests to complete and put a socket back to pool. If a sched group workload (e.g. -- memtable flush) consumes all the available sockets then workload from another group (e.g. -- query) would be blocked thus spoiling its latency (which is poor on its own, but still) After this change S3 client maintains a sched_group:http_client map thus making sure different sched groups don't clash with each other so that e.g. query requests don't wait for flush/compaction to release a socket. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-06-08 18:28:55 +03:00
Pavel Emelyanov	a8492a065b	s3/client: Remove now unused http reference from sink and file Now these two classes use client-> calls and don't need the http& shortcut Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-06-08 18:28:30 +03:00
Pavel Emelyanov	b9ee0d385b	s3/client: Add make_request() method This helper call will serve several purposes. First, make necessary preparations to the request before making it, in particular -- calling authorize() Second, there's the need to re-make requests that failed with "connection closed" error (see #13736) Third, one S3 client is shared between different scheduling groups. In order to isolate groups' workload from each other different http clients should be used, and this helper will be in charge of selecting one Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-06-08 18:19:19 +03:00
Pavel Emelyanov	ce6a1ca13b	Update seastar submodule * seastar afe39231...99d28ff0 (16): > file/util: Include seastar.hh > http/exception: Use http::reply explicitly > http/client: Include lost condition-variable.hh > util: file: drop unnecessary include of reactor.hh > tests: perf: add a markdown printer > http/client: Introduce unexpected_status_error for client requests > sharded: avoid #include <seastar/core/reactor.hh> for run_in_background() > code: Use std::is_invocable_r_v instead of InvokeReturns > http/client: Add ability to change pool size on the fly > http/client: Add getters for active/idle connections counts > http/client: Count and limit the number of connections > http/client: Add connection->client RAII backref > build: use the user-specified compiler when building DPDK > build: use proper toolchain based on specified compiler > build: only pass CMAKE_C_COMPILER when building ingredients > build: use specified compiler when building liburing Two changes are folded into the commit: 1. missing seastar/core/coroutine.hh include in one .cc file that got it indirectly included before seastar reactor.hh drop from file.hh 2. http client now returns unexpected_status_error instead of std::runtime_error, so s3 test is updated respectively Closes #14168	2023-06-07 20:25:49 +03:00
Pavel Emelyanov	66e43912d6	code: Switch to seastar API level 7 In that level no io_priority_class-es exist. Instead, all the IO happens in the context of current sched-group. File API no longer accepts prio class argument (and makes io_intent arg mandatory to impls). So the change consists of - removing all usage of io_priority_class - patching file_impl's inheritors to the updated API - priority manager goes away altogether - IO bandwidth update is performed on respective sched group - tune-up scylla-gdb.py io_queues command The first change is huge and was made semi-automatically by: - grep io_priority_class \| default_priority_class - remove all calls, found methods' args and class' fields Patching file_impl-s is smaller, but also mechanical: - replace io_priority_class& argument with io_intent* one - pass intent to lower file (if applicable) Dropping the priority manager is: - git-rm .cc and .hh - sed out all the #include-s - fix configure.py and cmakefile The scylla-gdb.py update is a bit hairy -- it needs to use task queues list for IO classes names and shares, but to detect whether it should, it checks whether the "commitlog" group is present. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #13963	2023-06-06 13:29:16 +03:00
Nadav Har'El	d2e089777b	Merge 'Yield while building large results in Alternator - rjson::print, executor::batch_get_item' from Marcin Maliszkiewicz Adds preemption points used in Alternator when: - sending bigger json response - building results for BatchGetItem I've tested manually by inserting in preemptible sections (e.g. before `os.write`) code similar to: auto start = std::chrono::steady_clock::now(); do { } while ((std::chrono::steady_clock::now() - start) < 100ms); and seeing reactor stall times. After the patch they were not increasing while before they kept building up due to no preemption. Refs #7926 Fixes #13689 Closes #12351 * github.com:scylladb/scylladb: alternator: remove redundant flush call in make_streamed utils: yield when streaming json in print() alternator: yield during BatchGetItem operation	2023-06-04 23:22:51 +03:00
Kefu Chai	82cac8e7cf	treewide: s/std::source_location/seastar::compact::source_location/ CWG 2631 (https://cplusplus.github.io/CWG/issues/2631.html) reports an issue on how the default argument is evaluated. this problem is more obvious when it comes to how `std::source_location::current()` is evaluated as a default argument. but not all compilers have the same behavior, see https://godbolt.org/z/PK865KdG4. notably, clang-15 evaluates the default argument at the callee site. so we need to check the capability of the compiler and fall back to the one defined by util/source_location-compat.hh if the compiler suffers from CWG 2631. and clang-16 implemented CWG2631 in https://reviews.llvm.org/D136554. But unfortunately, this change was not backported to clang-15. before switching over to clang-16, for using std::source_location::current() as the default parameter and expecting the behavior defined by CWG2631, we have to use the compatibility layer provided by Seastar. otherwise we always end up having the source_location at the callee side, which is not interesting under most circumstances. so in this change, all places using the idiom of passing std::source_location::current() as the default parameter are changed to use seastar::compat::source_location::current(). despite that we have `#include "seastarx.h"` for opening the seastar namespace, to disambiguate the "namespace compat" defined somewhere in scylladb, the fully qualified name of `seastar::compat::source_location::current()` is used. see also `09a3c63345`, where we used std::source_location as an alias of std::experimental::source_location if it was available. but this does not apply to the settings of our current toolchain, where we have GCC-12 and Clang-15. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #14086	2023-05-30 15:10:12 +03:00
Avi Kivity	2303f08eea	utils: logalloc: correct asan_interface.h location It's a system header, so it deserves angle brackets. Closes #14036	2023-05-29 23:03:25 +03:00
Pavel Emelyanov	2eb88945ea	utils: Restore indentation after previous patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-05-26 18:53:14 +03:00
Pavel Emelyanov	4ebb812df0	utils: Coroutinize verify_owner_and_mode() There's a helper verification_error() that prints a warning and returns an exceptional future. The one is converted into a void throwing one. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-05-26 18:52:15 +03:00
Botond Dénes	5a14c3311a	Merge 'Break S3 upload 50Gb file limit' from Pavel Emelyanov Current S3 uploading sink has implicit limit for the final file size that comes from two places. First, S3 protocol declares that uploading parts count from 1 to 10000 (inclusive). Second, uploading sink sends out parts once they grow above S3 minimal part size which is 5Mb. Since sstables puts data in 128kb (or smaller) portions, parts are almost exactly 5Mb in size, so the total uploading size cannot grow above ~50Gb. That's too low. To break the limit the new sink (called jumbo sink) uses the UploadPartCopy S3 call that helps splicing several objects into one right on the server. Jumbo sink starts uploading parts into an intermediate temporary object called a piece and named ${original_object}_${piece_number}. When the number of parts in current piece grows above the configured limit the piece is finalized and upload-copied into the object as its next part, then deleted. This happens in the background, meanwhile the new piece is created and subsequent data is put into it. When the sink is flushed the current piece is flushed as is and also squashed into the object. The new jumbo sink is capable of uploading ~500Tb of data, which looks enough. fixes: #13019 Closes #13577 * github.com:scylladb/scylladb: sstables: Switch data and index sink to use jumbo uploader s3/test: Tune-up multipart upload test alignment s3/test: Add jumbo upload test s3/client: Wait for background upload fiber on close-abort c3/client: Implement jumbo upload sink s3/client: Move memory buffers to upload_sink from base s3/client: Move last part upload out of finalize_upload() s3/client: Merge do_flush() with upload_part() s3/client: Rename upload_sink -> upload_sink_base	2023-05-25 11:44:06 +03:00
Petr Gusev	79c6bf0885	clear_gently: remove noexcept for rvalue references overload We use this overload in vnode_erm, one of the arguments is boost::icl::interval_map, whose move constructor is not noexcept.	2023-05-24 12:08:19 +04:00
Petr Gusev	e0bc98a217	sequenced_set: add extract_vector method Can be useful if we want to reuse the vector when we are done with this sequenced_set instance.	2023-05-21 11:33:38 +04:00
Petr Gusev	700eb90ed8	stall_free.hh: add clear_gently for rvalues	2023-05-21 11:33:33 +04:00
Petr Gusev	4a127c3782	stall_free.hh: relax Container requirement We don't use the return value of erase, so we can allow it to return anything. We'll need this for ring_mapping, since boost::icl::interval_map::erase(it) returns void.	2023-05-19 22:11:09 +04:00
Pavel Emelyanov	908d0d2e6a	s3/client: Wait for background upload fiber on close-abort When uploading a part (and a piece) there can be one or more background fibers handling the upload. In case client needs to abort the operation it calls .close() without flush()ing. In this case the S3 API Abort is made and the sink can be terminated. It's expected that background fibers would resolve on their own eventually, but it's not quite the case. First, they hold units for the semaphore and the semaphore should be alive by the time units are returned. Second, the PUT (or copy) request can finish successfully and it may be sitting in the reactor queue waiting for its continuation to get scheduled. The continuation references sink via "this" capture to put the part etag. Finally, in case of piece uploading the copy fiber needs _client at the end to issue delete-object API call dropping the no longer needed part. That said -- background fibers must be waited upon on .close() if the closing is aborting (if it's a successful close, then the fibers must have been picked up by the final flush() call). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-05-16 12:23:18 +03:00
Pavel Emelyanov	f9686926c2	c3/client: Implement jumbo upload sink The sink is also in charge of uploading large objects in parts, but this time each part is put with the help of upload-part-copy API call, not the regular upload-part one. To make it work the new sink inherits from the uploading base class, but instead of keeping memory_data_sink_buffers with parts it keeps a sink to upload a temporary intermediate object with parts. When the object is "full", i.e. the number of parts in it hits the limit, the object is flushed, then copied into the target object with the S3 API call, then deletes the intermediate object. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-05-16 12:23:18 +03:00
Pavel Emelyanov	8fa3294ae1	s3/client: Move memory buffers to upload_sink from base All the buffers manipulations now happen in the upload_sink class and the respective member can be removed from base class. The base class only messes with the buffers in its upload_part() call, but that's unavoidable, as uploading part implies sending its contents which sits in buffers. Now the base class can be re-used for uploading parts with the help of copy-part API call (next patches) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-05-16 12:19:50 +03:00
Pavel Emelyanov	2ac5ecd659	s3/client: Move last part upload out of finalize_upload() This change has two reasons. First, is to facilitate moving the memory_data_sink_buffers from base class, i.e. -- continuation of the previous patch. Also this fixes a corner case -- if final sink flush happens right after the previous part was sent for uploading, the finalization doesn't happen and sink closing aborts the upload even if it was successful. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-05-16 12:19:50 +03:00
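
The size limits quoted in the 'Break S3 upload 50Gb file limit' merge above can be checked with a little arithmetic. The following is only a sketch, not code from the repository: the function names are hypothetical, and the figure of 10000 parts per piece is an assumption chosen because it reproduces the "~500Tb" estimate (the commit only says the per-piece limit is configured).

```python
MIB = 1024 * 1024
MAX_PARTS = 10_000        # S3 numbers upload parts from 1 to 10000 (inclusive)
MIN_PART_SIZE = 5 * MIB   # the sink sends a part once it grows above ~5 MB


def plain_upload_limit(part_size=MIN_PART_SIZE, max_parts=MAX_PARTS):
    """Largest object a plain multipart upload can produce, in bytes."""
    return part_size * max_parts


def jumbo_upload_limit(parts_per_piece, part_size=MIN_PART_SIZE, max_parts=MAX_PARTS):
    """Largest object when every part of the target is itself a copied piece.

    parts_per_piece is the configured per-piece limit (an assumption here).
    """
    piece_size = parts_per_piece * part_size
    return piece_size * max_parts


def piece_name(object_name, piece_number):
    """Temporary object name, following ${original_object}_${piece_number}."""
    return f"{object_name}_{piece_number}"


if __name__ == "__main__":
    print(plain_upload_limit() / 1024**3, "GiB")        # roughly 50 GB
    print(jumbo_upload_limit(10_000) / 1024**4, "TiB")  # roughly 500 TB
    print(piece_name("Data.db", 3))
```

With parts of almost exactly 5 MiB, the plain sink tops out just under 49 GiB, which matches the "~50Gb" in the message; stacking a second level of 10000 pieces multiplies that by 10000.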


1490 Commits
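
The out-of-range index fixed in commit 567b453689 (pretty_printers) can be illustrated outside C++. This is a minimal sketch under stated assumptions, not the Scylla implementation: `pretty_size`, the suffix spellings, the "value suffix" output format, and the integer truncation are all hypothetical. It only shows the technique named in the commit, which is to loop over the suffixes instead of computing an exponent and indexing with it.

```python
SUFFIXES = ("bytes", "kB", "MB", "GB", "TB", "PB")  # six entries, as in the commit


def pretty_size(size, base=1000):
    """Format a byte count, never stepping past the last suffix."""
    value = size
    # Divide at most len(SUFFIXES) - 1 times, so the chosen suffix always
    # exists; anything beyond a petabyte is reported as a large number of PB.
    for suffix in SUFFIXES[:-1]:
        if value < base:
            break
        value //= base
    else:
        suffix = SUFFIXES[-1]
    if suffix == "bytes" and value == 1:
        suffix = "byte"  # singular form when formatting "1"
    return f"{value} {suffix}"


if __name__ == "__main__":
    for n in (1, 999, 1500, 10**15, 10**18):
        print(n, "->", pretty_size(n))
```

An exponent-based version would compute 6 for 10**18 and read past the end of the six-element array; the loop form has no index that can run out of range.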