This patch splits the timed_rate_moving_average functionality in two: a
data class, rates_moving_average, and a wrapper class,
timed_rate_moving_average, that uses a timer to update the rates
periodically.
To make the transition as simple as possible, timed_rate_moving_average
keeps the original API.
A new helper class meter_timer was introduced to handle the timer update
functionality.
This change required only minimal adaptation in other parts of the
code.
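The shape of the split can be sketched as follows. This is a hypothetical illustration: the class names come from the patch, but the internals (the smoothing factor, the update formula, the manual-tick timer stand-in) are assumptions for the sake of a runnable example.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Data class: holds the counters and the rate-update logic, no timer.
class rates_moving_average {
    double _count = 0;
    double _rate = 0;
    static constexpr double alpha = 0.2; // smoothing factor (assumed)
public:
    void mark(double n = 1) { _count += n; }
    // Called periodically (by the timer wrapper) to fold the latest counts
    // into an exponentially weighted rate.
    void update(std::chrono::seconds interval) {
        double instant = _count / double(interval.count());
        _rate += alpha * (instant - _rate);
        _count = 0;
    }
    double rate() const { return _rate; }
};

// Stand-in for the meter_timer helper: stores the update callback so a
// test can drive "ticks" manually; a real implementation arms a timer.
class meter_timer {
    std::function<void()> _on_tick;
public:
    explicit meter_timer(std::function<void()> f) : _on_tick(std::move(f)) {}
    void tick() { _on_tick(); }
};

// Wrapper keeping the original API by composing the two pieces.
class timed_rate_moving_average {
    rates_moving_average _data;
    meter_timer _timer{[this] { _data.update(std::chrono::seconds(1)); }};
public:
    void mark(double n = 1) { _data.mark(n); }
    double rate() const { return _data.rate(); }
    meter_timer& timer() { return _timer; }
};
```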
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This PR gets rid of exception throws/rethrows on the replica side for writes and single-partition reads. This goal is achieved without using `boost::outcome` but rather by replacing the parts of the code which throw with appropriate seastar idioms and by introducing two helper functions:
1. `try_catch` allows inspecting the type and value behind an `std::exception_ptr`. When libstdc++ is used, this function does not need to throw the exception and avoids the very costly unwind process. This is based on the "How to catch an exception_ptr without even try-ing" proposal mentioned in https://github.com/scylladb/scylla/issues/10260.
This function makes it possible to replace the current `try..catch` chains which inspect the exception type and account for it in the metrics.
Example:
```c++
// Before
try {
    std::rethrow_exception(eptr);
} catch (std::runtime_error& ex) {
    // 1
} catch (...) {
    // 2
}

// After
if (auto* ex = try_catch<std::runtime_error>(eptr)) {
    // 1
} else {
    // 2
}
```
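A portable, semantics-only sketch of `try_catch` looks like this. Note that this fallback still rethrows; the optimized libstdc++ version inspects the exception's type_info through ABI internals (cf. abi/eh_ia64.hh) and never unwinds. The returned pointer's validity past the catch block is implementation-dependent in this sketch, whereas the real helper guarantees it for as long as the `exception_ptr` is alive.

```cpp
#include <exception>
#include <stdexcept>

// Returns a pointer to the exception object if eptr holds (a subclass of) T,
// otherwise nullptr. Semantics-only fallback: it rethrows internally.
template <typename T>
T* try_catch(std::exception_ptr& eptr) noexcept {
    try {
        std::rethrow_exception(eptr);
    } catch (T& ex) {
        return &ex; // do not dereference after eptr is gone
    } catch (...) {
        return nullptr;
    }
}
```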
2. `make_nested_exception_ptr` which is meant to be a replacement for `std::throw_with_nested`. Unlike the original function, it does not require an exception being currently thrown and does not throw itself - instead, it takes the nested exception as an `std::exception_ptr` and produces another `std::exception_ptr` itself.
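A semantics-only sketch of `make_nested_exception_ptr` can be built on `std::throw_with_nested`. This portable fallback still throws internally, which is exactly what the real helper is meant to avoid; it only illustrates the resulting value.

```cpp
#include <exception>
#include <stdexcept>

// Wrap `nested` inside `ex` and return the combination as an exception_ptr.
// Portable fallback: it throws internally; the real helper builds the same
// result without unwinding.
template <typename T>
std::exception_ptr make_nested_exception_ptr(T&& ex, std::exception_ptr nested) {
    try {
        std::rethrow_exception(nested);
    } catch (...) {
        // Inside a handler, throw_with_nested records current_exception()
        // as the nested pointer.
        try {
            std::throw_with_nested(std::forward<T>(ex));
        } catch (...) {
            return std::current_exception();
        }
    }
    return {}; // unreachable
}
```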
Apart from the above, seastar idioms such as `make_exception_future`, `co_await as_future`, `co_return coroutine::exception()` are used to propagate exceptions without throwing. This brings the number of exception throws to zero for single partition reads and writes (tested with scylla-bench, --mode=read and --mode=write).
Results from `perf_simple_query`:
```
Before (719724e4df):
Writes:
Normal:
127841.40 tps ( 56.2 allocs/op, 13.2 tasks/op, 50042 insns/op, 0 errors)
Timeouts:
94770.81 tps ( 53.1 allocs/op, 5.1 tasks/op, 78678 insns/op, 1000000 errors)
Reads:
Normal:
138902.31 tps ( 65.1 allocs/op, 12.1 tasks/op, 43106 insns/op, 0 errors)
Timeouts:
62447.01 tps ( 49.7 allocs/op, 12.1 tasks/op, 135984 insns/op, 936846 errors)
After (d8ac4c02bfb7786dc9ed30d2db3b99df09bf448f):
Writes:
Normal:
127359.12 tps ( 56.2 allocs/op, 13.2 tasks/op, 49782 insns/op, 0 errors)
Timeouts:
163068.38 tps ( 52.1 allocs/op, 5.1 tasks/op, 40615 insns/op, 1000000 errors)
Reads:
Normal:
151221.15 tps ( 65.1 allocs/op, 12.1 tasks/op, 43028 insns/op, 0 errors)
Timeouts:
192094.11 tps ( 41.2 allocs/op, 12.1 tasks/op, 33403 insns/op, 960604 errors)
```
Closes #10368
* github.com:scylladb/scylla:
database: avoid rethrows when handling exceptions from commitlog
database: convert throw_commitlog_add_error to use make_nested_exception_ptr
utils: add make_nested_exception_ptr
storage_proxy: don't rethrow when inspecting replica exceptions on write path
database: don't rethrow rate_limit_exception
storage_proxy: don't rethrow the exception in abstract_read_resolver::error
utils/exceptions.cc: don't rethrow in is_timeout_exception
utils/exceptions: add try_catch
utils: add abi/eh_ia64.hh
storage_proxy: don't rethrow exceptions from replicas when accounting read stats
message: get rid of throws in send_message{,_timeout,_abortable}
database/{query,query_mutations}: don't rethrow read semaphore exceptions
Convert most use sites from `co_return coroutine::make_exception`
to `co_await coroutine::return_exception{,_ptr}` where possible.
In cases where this is done in a catch clause, convert to
`co_return coroutine::exception`, generating an exception_ptr
if needed.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #10972
Due to its sharded and token-based architecture, Scylla works best when the user workload is more or less uniformly balanced across all nodes and shards. However, a common case when this assumption is broken is the "hot partition" - suddenly, a single partition starts getting a lot more reads and writes in comparison to other partitions. Because the shards owning the partition have only a fraction of the total cluster capacity, this quickly causes latency problems for other partitions within the same shard and vnode.
This PR introduces a per-partition rate limiting feature. Now, users can choose to apply per-partition limits to their tables of choice using a schema extension:
```
ALTER TABLE ks.tbl
WITH per_partition_rate_limit = {
'max_writes_per_second': 100,
'max_reads_per_second': 200
};
```
Reads and writes which are detected to go over that quota are rejected, and the rejection is reported to the client with a new RATE_LIMIT_ERROR CQL error code - existing error codes didn't fit the rate limit error well, so a new one is added. This code is implemented as part of a CQL protocol extension and is returned only to clients that requested the extension - if not, the existing CONFIG_ERROR will be used instead.
Limits are tracked and enforced on the replica side. If a write fails with some replicas reporting the rate limit being reached, the rate limit error is propagated to the client. Additionally, the following optimization is implemented: if the coordinator shard/node is also a replica, we account the operation against the rate limit early and return an error on exceeding it before sending any messages to other replicas at all.
The PR covers regular, non-batch writes and single-partition reads. LWT and counters are not covered here.
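The contract of the replica-side limiter can be sketched with a simple per-second counting scheme. This is a hypothetical illustration only: Scylla's actual db::rate_limiter is more elaborate (fixed-size counter buckets, coordination across replicas), and the names below are invented for the sketch.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical sketch: count operations per partition within a one-second
// window and reject once the configured limit is hit.
class rate_limiter {
    struct counter {
        uint64_t second = 0; // window this counter belongs to
        uint64_t ops = 0;    // operations seen in that window
    };
    std::unordered_map<uint64_t, counter> _per_partition; // token -> counter
public:
    // Returns true if the operation is admitted, false if it must be
    // rejected with a rate-limit error.
    bool account(uint64_t partition_token, uint64_t limit_per_second,
                 uint64_t now_seconds) {
        auto& c = _per_partition[partition_token];
        if (c.second != now_seconds) { // new window: reset the count
            c.second = now_seconds;
            c.ops = 0;
        }
        return ++c.ops <= limit_per_second;
    }
};
```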
Results of `perf_simple_query --smp=1 --operations-per-shard=1000000`:
- Write mode:
```
8f690fdd47 (PR base):
129644.11 tps ( 56.2 allocs/op, 13.2 tasks/op, 49785 insns/op)
This PR:
125564.01 tps ( 56.2 allocs/op, 13.2 tasks/op, 49825 insns/op)
```
- Read mode:
```
8f690fdd47 (PR base):
150026.63 tps ( 63.1 allocs/op, 12.1 tasks/op, 42806 insns/op)
This PR:
151043.00 tps ( 63.1 allocs/op, 12.1 tasks/op, 43075 insns/op)
```
Manual upgrade test:
- Start 3 nodes, 4 shards each, Scylla version 8f690fdd47
- Create a keyspace with scylla-bench, RF=3
- Start reading and writing with scylla-bench with CL=QUORUM
- Manually upgrade nodes one by one to the version from this PR
- Upgrade succeeded; apart from a small number of operations that failed while each node was being taken down, all reads/writes succeeded
- Successfully altered the scylla-bench table to have a read and write limit and those limits were enforced as expected
Fixes: #4703
Closes #9810
* github.com:scylladb/scylla:
storage_proxy: metrics for per-partition rate limiting of reads
storage_proxy: metrics for per-partition rate limiting of writes
database: add stats for per partition rate limiting
tests: add per_partition_rate_limit_test
config: add add_per_partition_rate_limit_extension function for testing
cf_prop_defs: guard per-partition rate limit with a feature
query-request: add allow_limit flag
storage_proxy: add allow rate limit flag to get_read_executor
storage_proxy: resultize return type of get_read_executor
storage_proxy: add per partition rate limit info to read RPC
storage_proxy: add per partition rate limit info to query_result_local(_digest)
storage_proxy: add allow rate limit flag to mutate/mutate_result
storage_proxy: add allow rate limit flag to mutate_internal
storage_proxy: add allow rate limit flag to mutate_begin
storage_proxy: choose the right per partition rate limit info in write handler
storage_proxy: resultize return types of write handler creation path
storage_proxy: add per partition rate limit to mutation_holders
storage_proxy: add per partition rate limit info to write RPC
storage_proxy: add per partition rate limit info to mutate_locally
database: apply per-partition rate limiting for reads/writes
database: move and rename: classify_query -> classify_request
schema: add per_partition_rate_limit schema extension
db: add rate_limiter
storage_proxy: propagate rate_limit_exception through read RPC
gms: add TYPED_ERRORS_IN_READ_RPC cluster feature
storage_proxy: pass rate_limit_exception through write RPC
replica: add rate_limit_exception and a simple serialization framework
docs: design doc for per-partition rate limiting
transport: add rate_limit_error
Adds a metric "read_rate_limited" which indicates how many times a read
operation was rejected due to per-partition rate limiting. The metric
differentiates between reads rejected by the coordinator and reads
rejected by replicas.
Adds a metric "write_rate_limited" which indicates how many times a
write operation was rejected due to per-partition rate limiting. The
metric differentiates between writes rejected by the coordinator and
writes rejected by replicas.
Adds a flag to get_read_executor which decides whether the read should
be rate limited or not. The read executors were modified to choose the
appropriate per partition rate limit info parameter and send it to the
replicas.
Now, get_read_executor is able to return coordinator exceptions without
throwing them. In an upcoming commit, it will start returning rate limit
exceptions in some cases, and it is preferable to return them without
throwing.
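The "resultized" calling convention can be sketched with a minimal value-or-exception type. This is a hypothetical stand-in: Scylla's actual result type differs, and the function below is invented to show the shape of the error path.

```cpp
#include <exception>
#include <stdexcept>
#include <string>
#include <variant>
#include <utility>

// Minimal stand-in for a "resultized" return type: either a value or an
// exception_ptr, so the error path needs no throw.
template <typename T>
class result {
    std::variant<T, std::exception_ptr> _v;
public:
    result(T value) : _v(std::move(value)) {}
    result(std::exception_ptr e) : _v(std::move(e)) {}
    bool has_value() const { return _v.index() == 0; }
    T& value() { return std::get<0>(_v); }
    std::exception_ptr error() const { return std::get<1>(_v); }
};

// A get_read_executor-like function (name and logic hypothetical) can now
// report a rejection without unwinding the stack:
result<std::string> get_executor(bool over_limit) {
    if (over_limit) {
        return std::make_exception_ptr(std::runtime_error("rate limited"));
    }
    return std::string("executor");
}
```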
The query_result_local and query_result_local_digest methods were
updated to accept db::per_partition_rate_limit::info structure and pass
it on to database::accept.
Now, mutate/mutate_result accept a flag which decides whether the write
should be rate limited or not.
The new parameter is mandatory and all call sites were updated.
The mutate_prepare and create_write_response_handler(_helper) functions
are modified to be able to return exceptions without throwing them. In
an upcoming commit, create_write_response_handler will sometimes return
a rate limit exception, and it is preferable to return it without
throwing.
This commit modifies the read RPC and the storage_proxy logic so that
the coordinator knows whether a read operation failed due to rate limit
being exceeded, and returns `exceptions::rate_limit_exception` if that
happens.
This commit modifies the storage_proxy logic so that the coordinator
knows whether a write operation failed due to rate limit being exceeded,
and returns `exceptions::rate_limit_exception` when that happens.
The reference is already at hand. get_ep_stats() calls another
helper that also maps endpoint to datacenter, but it can receive the
obtained dc sstring via an argument.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The latter will need it to get dc info from. All the callers are either
storage proxy or have storage proxy pointer/reference to get topology
from.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Proxy has shared token metadata from which it can get the topology.
This change obsoletes static get_local_dc() helper.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
coroutine::parallel_for_each avoids an allocation and is therefore preferred. The lifetime
of the function object is less ambiguous, so it is also safer. Replace all eligible
occurrences (i.e. where the caller is a coroutine).
One case (storage_service::node_ops_cmd_heartbeat_updater()) needed a little extra
attention since there was a handle_exception() continuation attached. It is converted
to a try/catch.
Closes #10699
After fcb8d040 ("treewide: use Software Package Data Exchange
(SPDX) license identifiers"), many dual-licensed files were
left with empty comments on top. Remove them to avoid visual
noise.
Closes #10562
The latency counters at those places are explicitly start()ed beforehand. The
is_start() check is only necessary when using the latency_counter with a
histogram that may or may not start the counter (as is the case
in several methods of the table class).
tests: unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This series futurizes two synchronous functions used for data reconciliation:
`data_read_resolver::resolve` and `to_data_query_result` and does so
by introducing lower-level asynchronous infrastructure:
`mutation_partition_view::accept_gently`,
`frozen_mutation::unfreeze_gently` and `frozen_mutation::consume_gently`,
and `mutation::consume_gently`.
This trades some cycles on this cold path to prevent known reactor stalls.
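The gist of the "_gently" pattern can be shown without Seastar: do the same work as the synchronous version, but give the scheduler a chance to run other tasks between partitions. In the sketch below, the injected `yield` callback stands in for Seastar's preemption check (`co_await coroutine::maybe_yield()`); everything else is a hypothetical simplification.

```cpp
#include <functional>
#include <vector>

// Process partitions one by one, offering to yield after each. In Seastar
// the yield is a cheap preemption check, not an unconditional reschedule.
void consume_gently(const std::vector<int>& partitions,
                    const std::function<void(int)>& consume,
                    const std::function<void()>& yield) {
    for (int p : partitions) {
        consume(p);
        yield(); // bounds the work done between scheduling points
    }
}
```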
Fixes #2361
Fixes #10038
Closes #10482
* github.com:scylladb/scylla:
mutation: add consume_gently
frozen_mutation: add consume_gently
query: coroutinize to_data_query_result
frozen_mutation: add unfreeze_gently
mutation_partition_view: add accept_gently methods
storage_proxy: futurize data_read_resolver::resolve
Reduce stalls by maybe yielding in-between partitions,
and by awaiting unfreeze_gently where possible.
Refs #10038
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Allow yielding in data_read_resolver::resolve to
prevent reactor stalls.
TODO: unfreeze_gently, to prevent stalls due
to large partitions.
Refs #2361
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Each feature has a private variable and a public accessor. Since the
accessor effectively makes the variable public, avoid the intermediary
and make the variable public directly.
To ease mechanical translation, the variable name is chosen as
the function name (without the cluster_supports_ prefix).
References throughout the codebase are adjusted.
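The mechanical translation looks roughly like this (the feature name below is hypothetical; only the before/after shape is taken from the commit):

```cpp
// Before: private variable plus a trivial public accessor.
class feature_service_before {
    bool _digest_multipartition_read = false;
public:
    bool cluster_supports_digest_multipartition_read() const {
        return _digest_multipartition_read;
    }
};

// After: the variable is public, named after the accessor without the
// cluster_supports_ prefix; call sites drop the function-call parentheses.
struct feature_service_after {
    bool digest_multipartition_read = false;
};
```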
The table::get_hit_rate needs gossiper to get hitrates state from.
There's no way to carry gossiper reference on the table itself, so it's
up to the callers of that method to provide it. Fortunately, there's
only one caller -- the proxy -- but the call chain to carry the
reference is not very short ... oh, well.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
We don't need the database to determine the shard of the mutation,
only its schema. So move the implementation to the respective
definitions of mutation and frozen_mutation.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #10430
Seastar is an external library from Scylla's point of view so
we should use the angle bracket #include style. Most of the source
follows this, this patch fixes a few stragglers.
Also fix cases of #include which reached into seastar's directory
tree directly, via #include "seastar/include/seastar/...", changing them
to refer to just <seastar/...>.
Closes #10433
The do_with() means we have an unconditional allocation, so we can
justify the coroutine's allocation (replacing it). Meanwhile,
coroutine::parallel_for_each() reduces an allocation if mutate_locally()
blocks.
Closes #10387
Currently, rpc handlers are all lambdas inside
storage_proxy::init_messaging_service(). This means any stack trace
refers to storage_proxy::init_messaging_service::lambda#n instead of
a meaningful function name, and it makes init_messaging_service()
very intimidating.
Fix that by moving all such lambdas to regular member functions.
This is easy now that they don't capture anything except `this`,
which we provide during registration via std::bind_front().
A few #includes and forward declarations had to be added to
storage_proxy.hh. This is unfortunate, but can only be solved
by splitting storage_proxy into a client part and a server part.
We'd like to make the server callbacks member functions, rather
than lambdas, so we need to eliminate their captures. This patch
eliminates 'ms' by referring to the already existing member '_messaging'
instead.
We'd like to make the server callbacks member functions, rather
than lambdas, so we need to eliminate their captures. This patch
eliminates 'mm' by making it a member variable and capturing 'this'
instead. In one case 'mm' was used by a handle_write() intermediate
lambda so we have to make that non-static and capture it too.
uninit_messaging_service() clears the member variable to preserve
the same lifetime 'mm' had before, in case that's important.
Filtering remote rpc errors based on exception type did not work because
the remote errors were reported as std::runtime_error and all rpc
exceptions inherit from it. New rpc propagates remote errors using
special type rpc::remote_verb_error now, so we can filter on that
instead.
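The fix can be illustrated with a filtering predicate. Here `remote_verb_error` is modeled as a plain `std::runtime_error` subclass for the sketch; the real type lives in the rpc library, and a dedicated type lets the caller match remote errors precisely where catching `std::runtime_error` would also match unrelated local errors.

```cpp
#include <exception>
#include <stdexcept>

// Stand-in for rpc::remote_verb_error (modeled as a runtime_error subclass).
struct remote_verb_error : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// True only for errors that genuinely came from the remote side; a plain
// runtime_error (which all rpc exceptions inherit from) no longer matches.
bool is_remote_error(std::exception_ptr eptr) {
    try {
        std::rethrow_exception(eptr);
    } catch (const remote_verb_error&) {
        return true;
    } catch (...) {
        return false;
    }
}
```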
Fixes #10339
Message-Id: <YlQYV5G6GksDytGp@scylladb.com>
A node that runs a DDL query while its cluster does not have a quorum
cannot be shut down since the query is not abortable. The series makes it
abortable and also fixes the order in which components are shut down to
avoid the deadlock.
* gleb/raft_shutdown_v4 of git@github.com:scylladb/scylla-dev.git:
migration_manager: drain migration manager before stopping protocol servers on shutdown
migration_manager: pass abort source to raft primitives
storage_proxy: relax some read error reporting
Silence the request_aborted read error since it is expected to happen during
shutdown, and report remote rpc errors as warnings instead of errors: if
they are indeed severe they should be handled by the rpc client, but
OTOH some non-critical errors are expected to happen during shutdown.
The host id is cached on the db::config object, which is available in
all the places that need it. This allows removing the method in
question from system_keyspace, so that code needing the host_id no
longer has to depend on a system_keyspace instance.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>