This pull request adds partial admission control to Thrift frontend. The solution is partial mostly because the Thrift layer, aside from allowing Thrift messages, may also be used as a base protocol for CQL messages. Coupling admission control to this one is a little bit more complicated due to how the layer currently works - a Thrift handler, created once per connection, keeps a local `query_state` instance for the occasion of handling CQL requests. However, `query_state` should be kept per query, not per connection, so adding admission control to this aspect of the frontend is left for later.
Finally, the way service permits are passed from the server, via the handler factory, handler and then to queries is hacky. I haven't figured out how to force Thrift to pass custom context per query, so the way it works now is by relying on the fact that the server does not yield (in Seastar sense) between having read the request and launching the proper handler. Due to that, it's possible to just store the service permit in the server itself, pass the reference (address) to it down to the handler, and then read it back from the handling code and claim ownership of it. It works, but if anyone has a better idea, please share.
Refs #4826Closes#8313
* github.com:scylladb/scylla:
thrift: add support for max_concurrent_requests_per_shard
thrift: add metrics for admission control
thrift: add a counter for in-flight requests
thrift: add a counter for blocked requests
thrift: partially add admission control
service_permit: add a getter for the number of units held
thrift: coroutinize processing a request
memory_limiter: add a missing seastarx include
LCS reshape is currently inefficient for repair-based operation, because
the disjoint run of 256 sstables is reshaped into bigger L0 files, which
will be then integrated into the main sstable set.
On reshape completion, LCS has to compact those big L0 files onto higher
levels, until last level is reached, producing bad write amplification.
A much better approach is to instead compact that disjoint run into the
best possible level L, which can be figured out with:
log (base fan_out) of (total_size / max_sstable_size)
This compaction will be essentially a copy operation. It's important
to do it rather than only mutating the level of sstables because we have
to reshape the input run according to LCS parameters like sstable size.
For repair-based bootstrap/replace, the input disjoint run is now efficiently
reshaped into an ideal level L, so there's no compaction backlog once
reshape completes.
This behavior will manifest in the log as this:
LeveledManifest - Reshaping 256 disjoint sstables in level 0 into level 2
For repair-based decommission/removenode though, which reshape wasn't
wired on yet, level L may temporarily hold 2 disjoint runs, which overlap
one another, but LCS itself will incrementally merge them through either
promotion of L-1 into L, or by detecting overlapping in level L and
merging the overlapping sstables.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210329171826.42873-1-raphaelsc@scylladb.com>
Failures in this test typically happen inside the test consumer object.
These however don't stop the test as the code invoking the consumer
object handles exceptions coming from it. So the test will run to
completion and will fail again when comparing the produced output with
the expected one. This results in distracting failures. The real problem
is not the difference in the output, but the first check that failed,
which is however buried in the noise. To prevent this add an "ok" flag
which is set to false if the consumer fails. In this case the additional
checks are skipped in the end to not generate useless noise.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210326083147.26113-2-bdenes@scylladb.com>
"
When a permit is destroyed we check if it still holds on to any
resources in the destructor. Any resources the permit still holds on are
leaked resources, as users should have released these. Currently we just
invoke `on_internal_error_noexcept()` to handle this, which -- depending
on the configuration -- will result in an error message or an assert. In
the former case, the resources will be leaked for good. This mini-series
fixes this, by signaling back these resources to the semaphore. This
helps avoid an eventual complete dry-up of all semaphore resources and a
subsequent complete shutdown of reads.
Tests: unit(release, debug)
"
* 'reader-permit-signal-leaked-resources/v1' of https://github.com/denesb/scylla:
reader_permit: signal leaked resources
test: test_reader_lifecycle_policy: keep semaphores alive until all ops cease
sstables: generate_summary(): extend the lifecycle of the reader concurrency semaphore
Said test has two separate logical readers, but they share the same
permit, which is illegal. This didn't cause any problems yet, but soon
the semaphore will start to keep score of active/inactive permits which
will be confused by such sharing, so have them use separate permits.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210326083147.26113-1-bdenes@scylladb.com>
Clean up step() function by moving state specific processing into per
state functions. This way it is easier to see how each state handles
individual messages. No functional changes here.
Message-Id: <YGHCiTWjq+L/jVCB@scylladb.com>
The command can be used to inspect IO queues of a local reactor.
Example output:
```
(gdb) scylla io-queues
Dev 0:
Class: |shares: |ptr:
--------------------------------------------------------------------------------
"default" |1 |(seastar::priority_class_data *)0x6000002c6500
"commitlog" |1000 |(seastar::priority_class_data *)0x6000003ad940
"memtable_flush" |1000 |(seastar::priority_class_data *)0x6000005cb300
"streaming" |200 |(seastar::priority_class_data *)0x0
"query" |1000 |(seastar::priority_class_data *)0x600000718580
"compaction" |1000 |(seastar::priority_class_data *)0x6000030ef0c0
Max request size: 2147483647
Max capacity: Ticket(weight: 4194303, size: 4194303)
Capacity tail: Ticket(weight: 73168384, size: 100561888)
Capacity head: Ticket(weight: 77360511, size: 104242143)
Resources executing: Ticket(weight: 2176, size: 514048)
Resources queued: Ticket(weight: 384, size: 98304)
Handles: (1)
Class 0x6000005d7278:
Ticket(weight: 128, size: 32768)
Ticket(weight: 128, size: 32768)
Ticket(weight: 128, size: 32768)
Pending in sink: (0)
```
Created when debugging a core dump. Turned out not to be immediately useful for this use case, but I'm publishing it since it may come in handy in future investigations.
Closes#8362
* github.com:scylladb/scylla:
scylla-gdb: add io-queues command
scylla-gdb.py: add parsing std::priority_queue
scylla-gdb.py: add parsing std::atomic
scylla-gdb.py: add parsing std::shared_ptr
scylla-db.py: add parsing intrusive_slist
`flat_mutation_reader::consume_pausable` is widely used in Scylla. Some places
worth mentioning are memtables and combined readers but there are others as
well.
This patchset improves `consume_pausable` in three ways:
1. it removes unnecessary allocation
2. it rearranges ifs to not check the same thing twice
3. for a consumer that returns plain stop_iteration not a future<stop_iteration>
it reduces the amount of future usage
Test: unit(dev, release, debug)
Combined reader microbenchmark has shown from 2% to 22% improvement in median
execution time while memtable microbenchmark has shown from 3.6% to 7.8%
improvement in median execution time.
Before the change:
```
./build/release/test/perf/perf_mutation_readers --random-seed 3549335083
single run iterations: 0
single run duration: 1.000s
number of runs: 5
number of cores: 16
random seed: 3549335083
test iterations median mad min max
combined.one_row 1316234 140.120ns 0.020ns 140.074ns 140.141ns
combined.single_active 7332 91.484us 31.890ns 91.453us 91.778us
combined.many_overlapping 945 870.973us 429.720ns 868.625us 871.403us
combined.disjoint_interleaved 7102 85.989us 7.847ns 85.973us 85.997us
combined.disjoint_ranges 7129 85.570us 7.840ns 85.562us 85.596us
combined.overlapping_partitions_disjoint_rows 5458 124.787us 56.738ns 124.731us 125.370us
clustering_combined.ranges_generic 1920688 217.940ns 0.184ns 217.742ns 218.275ns
clustering_combined.ranges_specialized 1935318 194.610ns 0.199ns 194.210ns 195.228ns
memtable.one_partition_one_row 624001 1.600us 1.405ns 1.599us 1.605us
memtable.one_partition_many_rows 79551 12.555us 1.829ns 12.549us 12.558us
memtable.many_partitions_one_row 40557 24.748us 77.083ns 24.644us 25.135us
memtable.many_partitions_many_rows 3220 310.429us 57.628ns 310.295us 311.189us
```
After the change:
```
./build/release/test/perf/perf_mutation_readers --random-seed 3549335083
single run iterations: 0
single run duration: 1.000s
number of runs: 5
number of cores: 16
random seed: 3549335083
test iterations median mad min max
combined.one_row 1358839 109.222ns 0.122ns 109.089ns 109.348ns
combined.single_active 7525 87.305us 25.540ns 87.273us 87.362us
combined.many_overlapping 962 853.195us 1.904us 851.244us 855.142us
combined.disjoint_interleaved 7310 81.988us 28.877ns 81.949us 82.032us
combined.disjoint_ranges 7315 81.699us 37.144ns 81.662us 81.874us
combined.overlapping_partitions_disjoint_rows 5591 120.964us 15.294ns 120.949us 121.120us
clustering_combined.ranges_generic 1954722 211.993ns 0.052ns 211.883ns 212.084ns
clustering_combined.ranges_specialized 2042194 187.807ns 0.066ns 187.732ns 188.289ns
memtable.one_partition_one_row 648701 1.542us 0.339ns 1.542us 1.543us
memtable.one_partition_many_rows 85007 11.759us 1.168ns 11.752us 11.782us
memtable.many_partitions_one_row 43893 22.805us 17.147ns 22.782us 22.843us
memtable.many_partitions_many_rows 3441 290.220us 41.720ns 290.172us 290.306us
```
Closes#8359
* github.com:scylladb/scylla:
flat_mutation_reader: optimize consume_pausable for some consumers
flat_mutation_reader: special case consumers in consume_pausable
flat_mutation_reader: Change order of checks in consume_pausable
flat_mutation_reader: fix indentation in consume_pausable
flat_mutation_reader: Remove allocation in consume_pausable
perf: Add benchmarks for large partitions
This commit adds admission control in the form of passing
service permits to the Thrift server.
The support is partial, because Thrift also supports running CQL
queries, and for that purpose a query_state object is kept
in the Thrift handler. However, the handler is generally created
once per connection, not once per query, and the query_state object
is supposed to keep the state of a single query only.
In order to keep this series simpler, the CQL-on-top-of-Thrift
layer is not touched and is left as TODO.
Moreover, the Thrift layer does not make it easy to pass custom
per-query context (like service_permit), so the implementation
uses a trick: the service permit is created on the server
and then passed as reference to its connections and their respective
Thrift handlers. Then, each time a query is read from the socket,
this service permit is overwritten and then read back from the Thrift
handler. This mechanism heavily relies on the fact that there are
zero preemption points between overwriting the service permit
and reading it back by the handler. Otherwise, races may occur.
This assumption was verified by code inspection + empirical tests,
but if somebody is aware that it may not always hold, please speak up.
The Thrift layer is functional, but it's not usually the first-choice protocol for Scylla users, so it's hereby disabled by default.
Fixes#8336Closes#8338
* github.com:scylladb/scylla:
docs: mention disabling Thrift by default
db,config: disable Thrift by default
Raft instance needs to update RPC subsystem on changes in
configuration, so that RPC can deliver messages to the new nodes
in configuration, as well as dispose of the old nodes.
I.e. the nodes which are not the part of the most recent
configuration anymore.
The effective scope of RPC mappings is limited by the piece of
code which sends messages to both the "new" nodes (which
are added to the cluster with the most recent configuration
change) and the "old" nodes which are removed from the cluster.
Until the messages are successfully delivered to at least
the majority of "old" nodes and we have heard back from them,
the mappings should be kept intact. After that point the RPC
mappings for the removed nodes are no longer of interest
and thus can be immediately disposed.
There is also another problem to be solved: in Raft an instance may
need to communicate with a peer outside its current configuration.
This may happen, e.g., when a follower falls out of sync with the
majority and then a configuration is changed and a leader not present
in the old configuration is elected.
The solution is to introduce the concept of "expirable" updates to
the RPC subsystem.
When RPC receives a message from an unknown peer, it also adds the
return address of the peer to the address map with a TTL. Should
we need to respond to the peer, its address will be known.
An outgoing communication to an unconfigured peer is impossible.
* manmanson/raft_mappings_wiring_v12:
raft: update README.md with info on RPC server address mappings
raft: wire up `rpc::add_server` and `rpc::remove_server` for configuration changes
raft/fsm: add optional `rpc_configuration` field to fsm_output
raft: maintain current rpc context in `server_impl`
raft: use `.contains` instead of `.count` for std::set in `raft::configuration::diff`
raft: unit-tests for `raft_address_map`
raft: support expiring server address mappings for rpc module
consumers that return stop_iteration not future<stop_iteration> don't
have to consume a single fragment per each iteration of repeat. They can
consume whole buffer in each iteration.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
consume_pausable works with consumers that return either stop_iteration
or future<stop_iteration>. So far it was calling futurize_invoke for
both. This patch special cases consumers that return
future<stop_iteration> and don't call futurize_invoke for them as this
is unnecessary work.
More importantly, this will allow the following patch to optimize
consumers that return plain stop_iteration.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
This way we can avoid checking is_buffer_empty twice.
Compiler might be able to optimize this out but why depend on it
when the alternative is not less readable.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Code was left with wrong indentation by the previous commit that
removed do_with call around the code that's currently present.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
The allocation was introduced in 515bed90bb but I couldn't figure out
why it's needed. It seems that the consumer can just be captured inside
lambda. Tests seem to support the idea.
Indentation will be fixed in the following commit to make the review
easier.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
dbuild detects if the kernel is using cgroupv2 by checking if the
cgroup2 filesystem is mounted on /sys/fs/cgroup. However, on Ubuntu
20.10, the cgroup filesystem is mounted on /sys/fs/cgroup and the
cgroup2 filesystem is mounted on /sys/fs/cgroup/unified. This second
mount matches the search expression and gives a false positive.
Fix by adding a space at the end; this will fail to match
/sys/fs/cgroup/unified.
Closes#8355
Describe the high-level scheme of managing RPC mappings and
also expand on the introduction of "expirable" RPC mappings concept
and why these are needed.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Raft instance needs to update RPC subsystem on changes in
configuration, so that RPC can deliver messages to the new nodes
in configuration, as well as dispose of the old nodes.
I.e. the nodes which are not the part of the most recent
configuration anymore.
The effective scope of RPC mappings is limited by the piece of
code which sends messages to both the "new" nodes (which
are added to the cluster with the most recent configuration
change) and the "old" nodes which are removed from the cluster.
Until the messages are successfully delivered to at least
the majority of "old" nodes and we have heard back from them,
the mappings should be kept intact. After that point the RPC
mappings for the removed nodes are no longer of interest
and thus can be immediately disposed.
Tests: unit(dev)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
The field is set in `fsm.get_output` whenever
`_log.last_conf_idx()` or the term changes.
Also, add `_last_conf_idx` and `_last_term` to
`fsm::last_observed_state`, they are utilized
in the condition to evaluate current rpc
configuration in `fsm.get_output()`.
This will be used later to update rpc config state
stored in `server_impl` and maintain rpc address map.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
The start-stop code is drifting towards a straightforward
scheme of a bunch of
service foo;
foo.start();
auto stop_foo = defer([&foo] { foo.stop(); });
blocks. The drain_on_shutdown() and its relation to drain()
and decommission() is a big hurdle on the way of this effort.
This set unifies drain() and drain_on_shutdown() so that drain
really becomes just some first steps of the regular shutdown,
i.e. -- what it should be. Some synchronisation bits around it
are still needed, though.
This unification also closes a bunch not-yet-caught bugs when
parts of the system remained running in case shutdown happens
after nodetool drain. In this case the whole drain_on_sutdown()
becomes a noop (just returns drain()'s future) and what's
missing in drain() becomes missing on shutdown.
tests: unit(dev), dtest(simple_boot_shutdown : dev),
manual(start+stop, start+drain+stop : dev)
refs: #2737
* xemul/br-drain-on-shutdown:
drain_on_shutdown: Simplify
drain: Fix indentation
storage_service: Unify drain and drain_on_shutdown
storage_proxy: Drain and unsubscribe in main.cc
migration_manager: Stop it in two phases
stream_manager: Stop instances on drain
batchlog_manager: Stop its instances on shutdown
tracing: Shutdown tracing in drain
tracing: Stop it in main.cc
system_distributed_keyspace: Stop it in main.cc
storage_service: Move (un)subscription to migration events
Introduce rpc server_address that represents the
last observed state of address mappings
for RPC module.
It does not correspond to any kind of configuration
in the raft sense, just an artificial construct
corresponding to the largest set of server
addresses coming from both previous and current
raft configurations (to be able to contact both
joining and leaving servers).
This will be used later to update rpc module mappings
when cluster configuration changes.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
`std::unordered_set::contains` is introduced in C++20 and provides
clearer semantics to check existence of a given element in a set.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
This patch introduces `raft_address_map` class to abstract
the notion of expirable address mappings for a raft rpc module.
In Raft an instance may need to communicate with a peer outside
its current configuration. This may happen, e.g., when a follower
falls out of sync with the majority and then a configuration is
changed and a leader not present in the old configuration is elected.
The solution is to introduce the concept of "expirable" updates to
the RPC subsystem.
When RPC receives a message from an unknown peer, it also adds the
return address of the peer to the address map with a TTL. Should
we need to respond to the peer, its address will be known.
An outgoing communication to an unconfigured peer is impossible.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
The modern version of this method doesn't need the
run_with_no_api_lock(), as it's launched on shard 0
anyway, neither it needs logging before and after
as it's done by the deferred action from main that
calls it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now they only differ in one bit -- compaction manager is
drained on drain and is left running (until regular stop)
on shutdown. So this unification adds a boolean flag for
this case.
Also the indentation is deliberately left broken for the
sake of patch readability.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>