scylladb

Author	SHA1	Message	Date
Benny Halevy	2ceecc9d2a	generic_server: server: do_accepts: prevent gate_closed_exception do_accepts might be called after `_gate` was closed. In this case it should just return early rather than throw gate_closed_exception, similar to how it breaks out of the infinite for loop when the _gate is closed. With this change, do_accepts (and consequently, _listeners_stopped) should never fail, as it catches and ignores all exceptions in the loop. Fixes #23775 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#23818	2025-05-13 20:00:04 +03:00
Tomasz Grabiec	fadfbe8459	Merge 'transport: storage_proxy: release ERM when waiting for query timeout' from Andrzej Jackowski Before this change, if a read executor had just enough targets to achieve the query's CL, and there was a connection drop (e.g. node failure), the read executor waited for the entire request timeout to give drivers time to execute a speculative read in the meantime. Such behavior doesn't work well when a very long query timeout (e.g. 1800s) is set, because the unfinished request blocks topology changes. This change implements a mechanism to throw a new read_failure_exception_with_timeout in the aforementioned scenario. The exception is caught by the CQL server, which conducts the waiting after the ERM is released. The new exception inherits from read_failure_exception, because layers that don't catch the exception (such as the mapreduce service) should handle it just like a regular read_failure. However, when the CQL server catches the exception, it returns read_timeout_exception to the client, because after the additional waiting such an error message is more appropriate (read_timeout_exception was also returned before this change was introduced). This change: - Rewrite cql_server::connection::process_request_one to use seastar::futurize_invoke and try_catch<> instead of utils::result_try - Add new read_failure_exception_with_timeout and throw it in storage_proxy - Add sleep in CQL server when the new exception is caught - Catch local exceptions in Mapreduce Service and convert them to std::runtime_error. - Add get_cql_exclusive to manager_client.py - Add test_long_query_timeout_erm No backport needed - minor issue fix. Closes scylladb/scylladb#23156 * github.com:scylladb/scylladb: test: add test_long_query_timeout_erm test: add get_cql_exclusive to manager_client.py mapreduce: catch local read_failure_exception_with_timeout transport: storage_proxy: release ERM when waiting for query timeout transport: remove redundant references in process_request_one transport: fix the indentation in process_request_one transport: add futures in CQL server exception handling	2025-05-08 12:45:49 +02:00
Andrzej Jackowski	1fca994c7b	transport: storage_proxy: release ERM when waiting for query timeout Before this change, if a read executor had just enough targets to achieve the query's CL, and there was a connection drop (e.g. node failure), the read executor waited for the entire request timeout to give drivers time to execute a speculative read in the meantime. Such behavior doesn't work well when a very long query timeout (e.g. 1800s) is set, because the unfinished request blocks topology changes. This change implements a mechanism to throw a new read_failure_exception_with_timeout in the aforementioned scenario. The exception is caught by the CQL server, which conducts the waiting after the ERM is released. The new exception inherits from read_failure_exception, because layers that don't catch the exception (such as the mapreduce service) should handle it just like a regular read_failure. However, when the CQL server catches the exception, it returns read_timeout_exception to the client, because after the additional waiting such an error message is more appropriate (read_timeout_exception was also returned before this change was introduced). This change: - Add new read_failure_exception_with_timeout exception - Add throw of read_failure_exception_with_timeout in storage_proxy - Add abort_source to CQL server, as well as to_stop() method for the correct abort handling - Add sleep in CQL server when the new exception is caught Refs #21831	2025-04-23 09:29:47 +02:00
Pavel Emelyanov	8b2cababb6	generic_server: Don't mess with db::config The db::config is top-level configuration of scylla, we generally try to avoid using it even in scylla components: each uses its own config initialized by the service creator out of the db::config itself. The generic_server is no exception, all the more so since it already has its own config. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23705	2025-04-16 17:02:30 +03:00
Benny Halevy	bc69bc3de7	generic_server: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:48 +03:00
Marcin Maliszkiewicz	ce18909688	transport: move on_connection_close into connection destructor To make the code more robust by ensuring closing code is always executed.	2025-04-09 13:50:19 +02:00
Marcin Maliszkiewicz	599f4d312b	transport: add blocked and shed connection metrics This adds some visibility into connection storm mitigations added in following commits.	2025-04-09 10:49:18 +02:00
Marcin Maliszkiewicz	26518704ab	generic_server: throttle and shed incoming connections according to semaphore limit If uninitialized_connections_semaphore_cpu_concurrency (default 2) connections are being processed, we start delaying the acceptance of new connections. Connections which are in the network IO state are not counted towards this limit and they can move to the cpu phase without blocking. So it can happen that we process more concurrent new connections, but that's a necessary tradeoff to make progress during a storm without implementing more advanced machinery (e.g. a priority queue).	2025-04-09 10:48:51 +02:00
Marcin Maliszkiewicz	9f5de2c256	generic_server: add data source and sink wrappers bookkeeping network IO They release semaphore units when we start network IO and acquire them when we enter the cpu intensive phase. We use consume() so it doesn't block, because we don't want connections we started processing to compete with new incoming connections. Otherwise, during a connection storm we wouldn't make much progress. As a simplification, we treat disk IO (if there is any) as cpu work.	2025-04-09 10:48:42 +02:00
Marcin Maliszkiewicz	c56116372e	generic_server: coroutinize part of server::do_accepts	2025-04-09 10:48:42 +02:00
Marcin Maliszkiewicz	ed82bede39	generic_server: add semaphore for limiting new connections concurrency It will be used in following commits.	2025-04-09 10:30:58 +02:00
Marcin Maliszkiewicz	33122d3f93	generic_server: add config to the constructor	2025-04-09 10:30:58 +02:00
Marcin Maliszkiewicz	474e84199c	generic_server: add on_connection_ready handler This patch cleans up the code a bit so that the ready state is set in a single place, and adds a handler which will allow adding logic when a connection is made ready; that logic will be added in the following commits.	2025-04-09 10:30:58 +02:00
Calle Wilund	e49f2046e5	generic_server: Update conditions for is_broken_pipe_or_connection_reset Refs scylla-enterprise#5185 Fixes #22901 If a tls socket gets EPIPE the error is not translated to a specific gnutls error code, but only a generic ERROR_PULL/PUSH. Since we treat EPIPE as ignorable for plain sockets, we need to unwind the nested exception here to detect that the error was in fact due to this, so we can suppress log output for it. Closes scylladb/scylladb#22888	2025-02-25 10:35:11 +02:00
Kefu Chai	3aeecd4264	generic_server: correct typo in comment s/invokation/invocation/ Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22697	2025-02-05 11:48:50 +03:00
Calle Wilund	c59c87c233	generic_server: Allow sharing reloadability of certificates across shards Adds an optional callback to "listen", returning the shard local object instance. If provided, instead of creating a "full" reloadable certificate object, only do so on shard 0, and use the callback to reload other shards "manually".	2025-01-27 16:16:23 +00:00
Piotr Dulikowski	6d90a933cd	transport/server: use scheduling group assigned to current user Now, when the user logs in and the connection becomes authenticated, the processing loop of the connection is switched to the scheduling group that corresponds to the service level assigned to the logged in user. The scheduling group is also updated when the service level assigned to this user changes. Starting from this commit, the scheduling groups managed by the service level controller are actually being used by user workload.	2025-01-02 07:13:34 +01:00
Avi Kivity	f3eade2f62	treewide: relicense to ScyllaDB-Source-Available-1.0 Drop the AGPL license in favor of a source-available license. See the blog post [1] for details. [1] https://www.scylladb.com/2024/12/18/why-were-moving-to-a-source-available-license/	2024-12-18 17:45:13 +02:00
Michał Jadwiszczak	fe67efda5b	Revert "generic_server: use async function in `for_each_gently()`" This reverts commit `324b3c43c0`. It isn't safe to do asynchronous calls in `for_each_gently`, as the connection may be disconnected while a call in callback preempts. Fixes scylladb/scylla#21801	2024-12-05 13:32:47 +01:00
Laszlo Ersek	49bff3b1ab	generic_server: make server::stop() idempotent After server::shutdown(), make server::stop() more robust too, by allowing callers (internal or external) to call it several times (not concurrently though, just yet; see <https://github.com/scylladb/scylladb/issues/20309>). Suggested-by: Benny Halevy <bhalevy@scylladb.com> Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>	2024-08-28 15:54:31 +02:00
Laszlo Ersek	1138347e7e	generic_server: coroutinize server::shutdown() By turning server::shutdown() into a coroutine, we need not dynamically allocate "nr_conn". Verified as follows: (1) In terminal #1: build/Dev/scylla --overprovisioned --developer-mode=yes \ --memory=2G --smp=1 --default-log-level error \ --logger-log-level cql_server=debug:cql_server_controller=debug > INFO [...] cql_server_controller - Starting listening for CQL clients > on 127.0.0.1:9042 (unencrypted, > non-shard-aware) > INFO [...] cql_server_controller - Starting listening for CQL clients > on 127.0.0.1:19042 (unencrypted, > shard-aware) (2) In terminals #2 and #3: tools/cqlsh/bin/cqlsh.py (3) Press ^C in terminal #1: > DEBUG [...] cql_server - abort accept nr_total=2 > DEBUG [...] cql_server - abort accept 1 out of 2 done > DEBUG [...] cql_server - abort accept 2 out of 2 done > DEBUG [...] cql_server - shutdown connection nr_total=4 > DEBUG [...] cql_server - shutdown connection 1 out of 4 done > DEBUG [...] cql_server - shutdown connection 2 out of 4 done > DEBUG [...] cql_server - shutdown connection 3 out of 4 done > DEBUG [...] cql_server - shutdown connection 4 out of 4 done > INFO [...] cql_server_controller - CQL server stopped This patch is best viewed with "git show --word-diff=color". Suggested-by: Benny Halevy <bhalevy@scylladb.com> Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>	2024-08-28 10:59:44 +02:00
Laszlo Ersek	2216275ebd	generic_server: make server::shutdown() idempotent Make server::shutdown() more robust by allowing callers (internal or external) to call it several times (not concurrently though, just yet; see <https://github.com/scylladb/scylladb/issues/20309>). Suggested-by: Benny Halevy <bhalevy@scylladb.com> Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>	2024-08-28 10:59:44 +02:00
Laszlo Ersek	5a04743663	generic_server: convert connection tracking to seastar::gate If we call server::stop() right after "server" construction, it hangs: With the server never listening (never accepting connections and never serving connections), nothing ever calls server::maybe_stop(). Consequently, co_await _all_connections_stopped.get_future(); at the end of server::stop() deadlocks. Such a server::stop() call does occur in controller::do_start_server() [transport/controller.cc], when - cserver->start() (sharded<cql_server>::start()) constructs a "server"-derived object, - start_listening_on_tcp_sockets() throws an exception before reaching listen_on_all_shards() (for example because it fails to set up client encryption -- certificate file is inaccessible etc.), - the "deferred_action" cserver->stop().get(); is invoked during cleanup. (The cserver->stop() call exposing the connection tracking problem dates back to commit `ae4d5a60ca` ("transport::controller: Shut down distributed object on startup exception", 2020-11-25), and it's been triggerable through the above code path since commit `6b178f9a4a` ("transport/controller: split configuring sockets into separate functions", 2024-02-05).) Tracking live connections and connection acceptances seems like a good fit for "seastar::gate", so rewrite the tracking with that. "seastar::gate" can be closed (and the returned future can be waited for) without anyone ever having entered the gate. NOTE: this change makes it quite clear that neither server::stop() nor server::shutdown() may be called multiple times. The permitted sequences are: - server::shutdown() + server::stop() - or just server::stop(). Fixes #10305 Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>	2024-08-28 10:59:44 +02:00
Michał Jadwiszczak	324b3c43c0	generic_server: use async function in `for_each_gently()` In the following patch, we will add a method to update service level parameters for each cql connection. To support this, this patch allows passing an async function as a parameter to the `for_each_gently()` method.	2024-08-08 10:42:09 +02:00
Pavel Emelyanov	ddd2623418	generic_server: Fix indentation after previous patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-05-03 12:29:08 +03:00
Pavel Emelyanov	a1daa7093e	generic_server: Coroutinize listen() method Straightforward. Indentation is deliberately left broken. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-05-03 12:28:42 +03:00
Pavel Emelyanov	030f1ef81c	generic_server: Rename creds argument to builder So that it doesn't clash with local creds variable that will appear in this method after its coroutinization. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-05-03 12:27:37 +03:00
Kefu Chai	372a4d1b79	treewide: do not define FMT_DEPRECATED_OSTREAM since we do not rely on FMT_DEPRECATED_OSTREAM to define the fmt::formatter for us anymore, let's stop defining `FMT_DEPRECATED_OSTREAM`. in this change, * utils: drop the range formatters in to_string.hh and to_string.c, as we don't use them anymore. and the tests for them in test/boost/string_format_test.cc are removed accordingly. * utils: use fmt to print chunk_vector and small_vector, as we are not able to print the elements using operator<< anymore after switching to {fmt} formatters. * test/boost: specialize fmt::details::is_std_string_like<bytes> due to a bug in {fmt} v9, {fmt} fails to format a range whose element type is `basic_sstring<uint8_t>`, as it considers it a string-like type, but `basic_sstring<uint8_t>`'s char type is signed char, not char. this issue does not exist in {fmt} v10, so, in this change, we add a workaround to explicitly specialize the type trait to ensure that {fmt} formats this type using its `fmt::formatter` specialization instead of trying to format it as a string. also, {fmt}'s generic ranges formatter calls the pair formatter's `set_brackets()` and `set_separator()` methods when printing the range, but the operator<< based formatter does not provide these methods, so we have to include this change in the change switching to {fmt}, otherwise the change specializing `fmt::details::is_std_string_like<bytes>` won't compile. * test/boost: in tests, we use `BOOST_REQUIRE_EQUAL()` and its friends for comparing values. but without the operator<< based formatters, Boost.Test would not be able to print them. after removing the homebrew formatters, we need to use the generic `boost_test_print_type()` helper to do this job. so we are including `test_utils.hh` in tests so that we can print the formattable types. * treewide: add `#include "utils/to_string.hh"` where `fmt::formatter<optional<>>` is used. * configure.py: do not define FMT_DEPRECATED_OSTREAM * cmake: do not define FMT_DEPRECATED_OSTREAM Refs #13245 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-04-19 22:57:36 +08:00
Kefu Chai	a439ebcfce	treewide: include fmt/ranges.h and/or fmt/std.h before this change, we relied on the default-generated fmt::formatter created from operator<<, but fmt v10 dropped the default-generated formatter. in this change, we include `fmt/ranges.h` and/or `fmt/std.h` for formatting container types, like vector, map, optional and variant, using {fmt} instead of the homebrew formatter based on operator<<. with this change, the changes adding fmt::formatter and the changes using ostream formatter explicitly, we are allowed to drop the `FMT_DEPRECATED_OSTREAM` macro. Refs scylladb#13245 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-04-19 22:56:16 +08:00
Mikołaj Grzebieluch	4cecda7ead	transport/controller: pass unix_domain_socket_permissions to generic_server::listen	2024-02-05 14:22:03 +01:00
Avi Kivity	7cb1c10fed	treewide: replace seastar::future::get0() with seastar::future::get() get0() dates back from the days where Seastar futures carried tuples, and get0() was a way to get the first (and usually only) element. Now it's a distraction, and Seastar is likely to deprecate and remove it. Replace with seastar::future::get(), which does the same thing.	2024-02-02 22:12:57 +08:00
Michał Jadwiszczak	0083ddd7a0	generic_server: use mutable reference in `for_each_gently` Make `generic_server::gentle_iterator` a mutable iterator to allow `for_each_gently` to make changes to the connections. Fixes: #16035 Closes scylladb/scylladb#16036	2023-11-14 14:25:22 +02:00
Pavel Emelyanov	4682c7f9a5	generic_server: Introduce shutdown() The method waits for listening sockets to stop listening and aborts the connected sockets, but doesn't wait for the established connections to finish processing. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-09-11 17:37:48 +03:00
Pavel Emelyanov	6dcf653995	generic_server: Decouple server stopped from connection stopped The _stopped future resolves when all "sockets" stop -- listening and connected ones. Future patching will need to wait for listening sockets to stop separately from connected ones. Rename the `_stopped` to reflect what it is now while at it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-09-11 17:32:07 +03:00
Calle Wilund	890f1f4ad3	generic_server: Handle TLS error codes indicating broken pipe Fixes #14625 In broken pipe detection, handle also TLS error codes. Requires https://github.com/scylladb/seastar/pull/1729 Closes #14626	2023-07-12 16:04:33 +03:00
Avi Kivity	6aa91c13c5	Merge 'Optimize topology::compare_endpoints' from Benny Halevy The code for compare_endpoints originates at the dawn of time (`bc034aeaec`) and is called on the fast path from storage_proxy via `sort_by_proximity`. This series considerably reduces the function's footprint by: 1. carefully coding the many comparisons in the function so as to reduce the number of conditional branches (apparently the compiler isn't doing a good enough job at optimizing it in this case) 2. avoiding sstring copies in topology::get_{datacenter,rack} Closes #12761 * github.com:scylladb/scylladb: topology: optimize compare_endpoints to_string: add print operators for std::{weak,partial}_ordering utils: to_sstring: deinline std::strong_ordering print operator move to_string.hh to utils/ test: network_topology: add test_topology_compare_endpoints	2023-03-07 15:17:19 +02:00
Kefu Chai	0cb842797a	treewide: do not define/capture unused variables these warnings are found by Clang-17 after removing `-Wno-unused-lambda-capture` and '-Wno-unused-variable' from the list of disabled warnings in `configure.py`. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-02-15 22:57:18 +02:00
Benny Halevy	25ebc63b82	move to_string.hh to utils/ Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-02-15 11:09:04 +02:00
Pavel Emelyanov	f035313b16	generic_server: Gentle iterator Add the ability to iterate over the list of connections in a "gentle" manner, i.e. -- preempting the loop when required. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-02-18 14:25:08 +03:00
Pavel Emelyanov	661c12066b	generic_server: Type alias For simpler future patching Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-02-18 14:25:07 +03:00
Avi Kivity	fcb8d040e8	treewide: use Software Package Data Exchange (SPDX) license identifiers Instead of lengthy blurbs, switch to single-line, machine-readable standardized (https://spdx.dev) license identifiers. The Linux kernel switched long ago, so there is strong precedent. Three cases are handled: AGPL-only, Apache-only, and dual licensed. For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0), reasoning that our changes are extensive enough to apply our license. The changes were applied mechanically with a script, except to licenses/README.md. Closes #9937	2022-01-18 12:15:18 +01:00
Pavel Emelyanov	ba16318457	generic_server: Keep server alive during conn background processing There's at least one tiny race in generic_server code. The trailing .handle_exception after the conn->process() captures this, but since the whole continuation chain happens in the background, this (the server) can be released, thus causing the whole lambda to execute on a freed generic_server instance. This, in turn, is not nice because the captured this is used to get a _logger from. The fix is based on the observation that all connections pin the server in memory until all of them (connections) are destructed. That said, to keep the server alive in the aforementioned lambda it's enough to make sure the conn variable (it's an lw_shared_ptr on the connection) is alive in it. To avoid generating a bunch of tiny continuations with an identical set of captures, tail the single .then_wrapped() one and do whatever is needed to wrap up the connection processing in it. tests: unit(dev) fixes: #9316 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20211115105818.11348-1-xemul@scylladb.com>	2021-11-16 11:10:39 +02:00
Pavel Emelyanov	c7b0b25494	transport, generic_server: Remove no longer used functionality After subscription management was moved onto controller level a bunch of code can be dropped: - passing migration notifier beyond controller - event_notifier's _stopped bit - event_notifier .stop() method - event_notifier empty constructor and destructor - generic_server's on_stop virtual method Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-22 18:41:32 +03:00
Avi Kivity	a55b434a2b	treewide: extend copyright statements to present day	2021-06-06 19:18:49 +03:00
Pekka Enberg	2b6438c044	generic_server: Rename "maybe_idle" to "maybe_stop"	2021-04-13 14:13:24 +03:00
Pekka Enberg	16f262b852	transport, redis: Use generic server::listen() Let's pull up cql_server listen() to generic_server::server base class and convert redis_server to use it.	2021-04-13 14:13:24 +03:00
Pekka Enberg	6c619e4462	transport/server: Remove "redis_server" prefix from logging The logger itself has the name "redis_server" that appears in the logs.	2021-04-13 13:57:22 +03:00
Pekka Enberg	f560b3daa3	generic_server: Remove unneeded static_pointer_cast<> Now that do_accepts() is in generic_server, we can get rid of the static_pointer_cast<>.	2021-04-13 13:57:22 +03:00
Pekka Enberg	ac90a8ea50	transport, redis: Use generic server::do_accepts() The cql_server and redis_server share the same ancestor of do_accepts(). Let's pull up the cql_server version of do_accept() (that has more functionality) to generic_server::server and use it in the redis_server too.	2021-04-13 13:57:21 +03:00
Pekka Enberg	3689db26fc	transport, redis: Use generic server::process() Pull up the cql_server process() to base class and convert redis_server to use it. Please note that this fixes EPIPE and connection reset issue in the Redis server, which was fixed in the CQL server in commit `1a8630e6a` ("transport: silence "broken pipe" and "connection reset by peer" errors").	2021-04-13 13:56:45 +03:00

53 Commits
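Commits 5a04743663 and 2ceecc9d2a rely on seastar::gate semantics: a gate can be closed even if nothing ever entered it, entering a closed gate fails with gate_closed_exception, and do_accepts should return early rather than throw once the gate is closed. The following is a hypothetical, single-threaded analogue in standard C++ only. simple_gate, gate_closed_error and accept_pending are illustrative names, not Seastar or ScyllaDB APIs, and the real gate's close() returns a future that resolves when the last holder leaves, rather than a bool.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical stand-in for seastar::gate_closed_exception.
struct gate_closed_error : std::runtime_error {
    gate_closed_error() : std::runtime_error("gate closed") {}
};

// Single-threaded analogue of a gate: counts holders and refuses new
// entries once closed. Not Seastar's API; close() here reports whether
// the gate is already drained instead of returning a future.
class simple_gate {
    std::size_t count_ = 0;
    bool closed_ = false;
public:
    void enter() {
        if (closed_) {
            throw gate_closed_error();
        }
        ++count_;
    }
    void leave() {
        assert(count_ > 0);
        --count_;
    }
    bool is_closed() const { return closed_; }
    // Closing is valid even if nobody ever entered the gate.
    bool close() {
        closed_ = true;
        return count_ == 0;
    }
    bool drained() const { return closed_ && count_ == 0; }
};

// Hypothetical accept loop: check the gate before entering it, so a loop
// started after close() returns early instead of throwing.
int accept_pending(simple_gate& g, int pending) {
    int accepted = 0;
    for (int i = 0; i < pending; ++i) {
        if (g.is_closed()) {
            break;  // covers both "closed before start" and "closed mid-loop"
        }
        g.enter();
        ++accepted;  // the connection would be handled here
        g.leave();
    }
    return accepted;
}
```

Checking is_closed() before enter() turns the shutdown race into a normal exit path, which matches the intent that the accept loop should never fail.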
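Commits 26518704ab and 9f5de2c256 describe a semaphore that bounds how many new connections are in a CPU-bound phase. Units are returned when a connection starts network IO and taken back with consume(), which, per the commit message, does not block, when CPU work resumes. The model below is a hypothetical, single-threaded sketch of that accounting only. cpu_phase_limiter and its method names are illustrative, and the limit of 2 mirrors the default stated in the commit message.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the throttling scheme: a counter of free
// "CPU phase" slots for connections that are still being set up.
class cpu_phase_limiter {
    std::int64_t available_;
public:
    explicit cpu_phase_limiter(std::int64_t limit) : available_(limit) {}

    // A new connection needs a free slot; otherwise accepting it is delayed.
    bool try_admit() {
        if (available_ <= 0) {
            return false;
        }
        --available_;
        return true;
    }

    // A connection that starts network IO gives its slot back...
    void on_network_io_start() { ++available_; }

    // ...and takes it back unconditionally when CPU work resumes, like
    // consume(): the count may go negative, so connections already in
    // progress never wait behind newly arriving ones.
    void on_cpu_phase_start() { --available_; }

    // Hypothetical: the slot is released for good once setup completes.
    void on_connection_ready() { ++available_; }

    std::int64_t available() const { return available_; }
};
```

In this model three connections can be in flight with a limit of 2, which is the tradeoff the commit accepts in exchange for guaranteed progress during a connection storm.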
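Commit 1fca994c7b makes read_failure_exception_with_timeout inherit from read_failure_exception. Layers unaware of the new type then treat it as an ordinary read failure, while the CQL server catches the derived type first, waits out the timeout after the ERM has been released, and reports a read timeout to the client. The following is a hypothetical sketch of that catch ordering. The class bodies, the timeout_ms field, and the layer functions are illustrative stand-ins, not ScyllaDB's actual types or signatures.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Illustrative stand-ins; the real ScyllaDB exceptions carry more state.
struct read_failure_exception : std::runtime_error {
    using std::runtime_error::runtime_error;
};

struct read_failure_exception_with_timeout : read_failure_exception {
    long timeout_ms;  // hypothetical: how long the caller should still wait
    read_failure_exception_with_timeout(const std::string& what, long t)
        : read_failure_exception(what), timeout_ms(t) {}
};

// Hypothetical storage-proxy read that fails early instead of holding
// resources until the request timeout expires.
void storage_proxy_read() {
    throw read_failure_exception_with_timeout("not enough live targets", 1500);
}

// A layer that only knows the base class sees an ordinary read failure.
std::string generic_layer(void (*op)()) {
    try {
        op();
    } catch (const read_failure_exception&) {
        return "read_failure";
    }
    return "ok";
}

// CQL-server-like layer: the derived handler must come first. Recording
// the timeout stands in for sleeping; the client then gets a read timeout.
std::string cql_layer(void (*op)(), long& waited_ms) {
    try {
        op();
    } catch (const read_failure_exception_with_timeout& e) {
        waited_ms = e.timeout_ms;
        return "read_timeout";
    } catch (const read_failure_exception&) {
        return "read_failure";
    }
    return "ok";
}
```

Reversing the two handlers in cql_layer would make the derived handler unreachable, so the server would never get the chance to wait before replying.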
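Commit e49f2046e5 explains that over TLS an EPIPE surfaces only as a generic push/pull error, so the broken-pipe check has to unwind nested exceptions. Below is a minimal standard-C++ sketch of that unwinding, using std::nested_exception and std::system_error. is_broken_pipe_or_reset is a hypothetical name, not the ScyllaDB function, and in the test the std::runtime_error wrapper merely stands in for the TLS layer's generic error.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <system_error>

// Hypothetical helper: walk a chain of nested exceptions and report whether
// any level is a std::system_error meaning EPIPE or ECONNRESET.
bool is_broken_pipe_or_reset(std::exception_ptr ep) {
    while (ep) {
        std::exception_ptr next;
        try {
            std::rethrow_exception(ep);
        } catch (const std::exception& e) {
            if (auto* se = dynamic_cast<const std::system_error*>(&e)) {
                const std::error_code code = se->code();
                if (code == std::errc::broken_pipe || code == std::errc::connection_reset) {
                    return true;
                }
            }
            // Descend one level if this exception wraps another one
            // (e.g. one thrown via std::throw_with_nested).
            if (auto* nested = dynamic_cast<const std::nested_exception*>(&e)) {
                next = nested->nested_ptr();
            }
        } catch (...) {
            return false;  // not a std::exception; nothing more to inspect
        }
        ep = next;
    }
    return false;
}
```

For a wrapped error, the outer exception does not match, so the helper descends to the nested std::system_error and detects the broken pipe there.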