Permits have to wait for re-admission after having been evicted. This
happens via `reader_permit::maybe_wait_readmission()`. The user of this
method -- the evictable reader -- uses it to re-wait admission when the
underlying reader was evicted. There is one tricky scenario however,
when the underlying reader is created for the first time. When the
evictable reader is part of a multishard query stack, the created reader
might in fact be a resumed, saved one. These readers are kept in an
inactive state until actually resumed. The evictable reader shares it
permit with the to-be-resumed reader so it can check whether it has been
evicted while saved and needs to wait readmission before being resumed.
In this flow it is critical that there is no preemption point between
this check and actually resuming the reader, because if there is, the
reader might end up actually recreated, without having waited for
readmission first.
To help avoid this situation, the existing `maybe_wait_readmission()` is
split into two methods:
* `bool reader_permit::needs_readmission()`
* `future<> reader_permit::wait_for_readmission()`
The evictable reader can now ensure there is no preemption point between
`needs_readmission()` and resuming the reader.
Fixes: #10187
Tests: unit(release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20220315105851.170364-1-bdenes@scylladb.com>
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.
Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.
The changes we applied mechanically with a script, except to
licenses/README.md.
Closes#9937
This will make it easier, for example, to enforce memory limits in lower
levels of the flat_mutation_reader stack.
By default the size is unlimited. However, for specific queries it is
possible to store a different value (for example, obtained from a
`read_command` object) through a setter.
Now that the timeout is stored in the reader
permit use it for admission rather than a timeout
parameter.
Note that evictable_reader::next_partition
currently passes db::no_timeout to
resume_or_create_reader, which propagated to
maybe_wait_readmission, but it seems to be
an oversight of the f_m_r api that doesn't
pass a timeout to next_partition().
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The timeout needs to be propagated to the reader's permit.
Reset it to db::no_timeout in repair_reader::pause().
Warn if set_timeout asks to change the timeout too far into the
past (100ms). It is possible that it will be passed a
past timeout from the rcp path, where the message timeout
is applied (as duration) over the local lowres_clock time
and parallel read_data messages that share the query may end
up having close, but different timeout values.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This patch flips two "switches":
1) It switches admission to be up-front.
2) It changes the admission algorithm.
(1) by now all permits are obtained up-front, so this patch just yanks
out the restricted reader from all reader stacks and simultaneously
switches all `obtain_permit_nowait()` calls to `obtain_permit()`. By
doing this admission is now waited on when creating the permit.
(2) we switch to an admission algorithm that adds a new aspect to the
existing resource availability: the number of used/blocked reads. Namely
it only admits new reads if in addition to the necessary amount of
resources being available, all currently used readers are blocked. In
other words we only admit new reads if all currently admitted reads
requires something other than CPU to progress. They are either waiting
on I/O, a remote shard, or attention from their consumers (not used
currently).
We flip these two switches at the same time because up-front admission
means cache reads now need to obtain a permit too. For cache reads the
optimal concurrency is 1. Anything above that just increases latency
(without increasing throughput). So we want to make sure that if a cache
reader hits it doesn't get any competition for CPU and it can run to
completion. We admit new reads only if the read misses and has to go to
disk.
Another change made to accommodate this switch is the replacement of the
replica side read execution stages which the reader concurrency
semaphore as an execution stage. This replacement is needed because with
the introduction of up-front admission, reads are not independent of
each other any-more. One read executed can influence whether later reads
executed will be admitted or not, and execution stages require
independent operations to work well. By moving the execution stage into
the semaphore, we have an execution stage which is in control of both
admission and running the operations in batches, avoiding the bad
interaction between the two.
We want to make permits be admitted up-front, before even being created.
As part of this change, we will get rid of the `wait_admission()`
method on the permit, instead, the permit will be created as a result of
waiting for admission (just like back some time ago).
To allow evicted readers to wait for re-admission, a new method
`maybe_wait_readmission()` is created, which waits for readmission if
the permit is in evicted state.
Also refactor the internals of the semaphore to support and favor
up-front admission code. As up-front admission is the future we want the
permit code to be organized in such a way that it is natural to use with
it. This means that the "old-style" admission code might suffer but we
tolerate this as it is on its way out. To this end the following changes
were done:
* Add a _base_resources field to reader_permit which tracks the base
cost of said permit. This is passed in the constructor and is used in
the first and subsequent admissions.
* The base cost is now managed internally by the permit, instead of
relying on an external `resource_units` instance, though the old way
is still supported temporarily.
* Change the admission pipeline to favor the new permit-internally
managed base cost variant.
* Compatibility with old-style admission: permits are created with 0
base resources, base resources are set with the compatibility method
`set_base_resources()` right before admission, then externalized again
after admission with `base_resource_as_resource_units()`. These
methods will be gone when the old style admission is retired (together
with `wait_admission()`).
By enabling shared from this for impl and adding a reader permit
constructor which takes a shared pointer to an impl.
This allows impl members to invoke functions requiring a `reader_permit`
instance as a parameter.
Distinguish between permits that are blocked and those that are not.
Conceptually a blocked permit is one that needs to wait on either I/O or
a remote shard to proceed.
This information will be used by admission, which will only admit new
reads when all currently used ones are blocked. More on that in the
commit introducing this new admission type.
This patch only adds the infrastructure, block sites are not marked yet.
Distinguish between permits that are used and those that are not. These
are two subtypes of the current 'active' state (and replace it).
Conceptually a permit is used when any readers associated with it have a
pending call to any of their async methods, i.e. the consumer is
actively consuming from them.
This information will be used for admission, together with a new blocked
state introduced by a future patch.
This patch only adds the infrastructure, use sites are not marked yet.
We want to introduce more fine-grained states for permits than what we
have currently, splitting the current 'active' state into multiple
sub-states. As a preparatory step, introduce an evicted state too, to
keep track of permits that were evicted while being inactive. This will
be important in determining what permits need to re-wait admission, once
we keep permits across pages. Having an evicted state also aids
validating internal state transitions.
This commit conceptually reverts 4c8ab10. Said commit was meant to
prevent the scenario where memory-only permits -- those that don't pass
admission but still consume memory -- completely prevent the admission
of reads, possibly even causing a deadlock because a permit might even
blocks its own admission. The protection introduced by said commit
however proved to be very problematic. It made the status of resources
on the permit very hard to reason about and created loopholes via which
permits could accumulate without tracking or they could even leak
resources. Instead of continuing to patch this broken system, this
commit does away with this "protection" based on the observation that
deadlocks are now prevented anyway by the admission criteria introduced
by 0fe75571d9, which admits a read anyway when all the initial count
resources are available (meaning no admitted reader is alive),
regardless of availability of memory.
The benefits of this revert is that the semaphore now knows about all
the resources and is able to do its job better as it is not "lied to"
about resource by the permits. Furthermore the status of a permit's
resources is much simpler to reason about, there are no more loopholes
in unexpected state transitions to swallow/leak resources.
To prove that this revert is indeed safe, in the next commit we add
robust tests that stress test admission on a highly contested semaphore.
This patch also does away with the registered/admitted differentiation
of permits, as this doesn't make much sense anymore, instead these two
are unified into a single "active" state. One can always tell whether a
permit was admitted or not from whether it owns count resources anyway.
This state will be used for permits that are not in admitted state when
registered as inactive. We can have such reads if a read can be served
entirely from cache/memtables and it doesn't have to go to disk and
hence doesn't go through admission. These permits currently don't
forward their cost to the semaphore so they won't prevent their own
admission creating a deadlock. However, when in inactive state, we do
want to keep tabs on their resource consumption so we don't accumulate
too much of these inactive reads. So introduce a new state for these
non-admitted inactive reads. When entering the inactive state, the
permit registers its cost with the semaphore, and when unregistered as
inactive, it retracts it. This is a workaround (khm hack) until #4758 is
solved and all permits will be admitted on creation.
The reader concurrency semaphore timing out or its queue being overflown
are fairly common events both in production and in testing. At the same
time it is a hard to diagnose problem that often has a benign cause
(especially during testing), but it is equally possible that it points
to something serious. So when this error starts to appear in logs,
usually we want to investigate and the investigation is lengthy...
either involves looking at metrics or coredumps or both.
This patch intends to jumpstart this process by dumping a diagnostics on
semaphore timeout or queue overflow. The diagnostics is printed to the
log with debug level to avoid excessive spamming. It contains a
histogram of all the permits associated with the problematic semaphore
organized by table, operation and state.
Example:
DEBUG 2020-10-08 17:05:26,115 [shard 0] reader_concurrency_semaphore -
Semaphore _read_concurrency_sem: timed out, dumping permit diagnostics:
Permits with state admitted, sorted by memory
memory count name
3499M 27 ks.test:data-query
3499M 27 total
Permits with state waiting, sorted by count
count memory name
1 0B ks.test:drain
7650 0B ks.test:data-query
7651 0B total
Permits with state registered, sorted by count
count memory name
0 0B total
Total: permits: 7678, memory: 3499M
This allows determining several things at glance:
* What are the tables involved
* What are the operations involved
* Where is the memory
This can speed up a follow-up investigation greatly, or it can even be
enough on its own to determine that the issue is benign.
Instead of a simple boolean, designating whether the permit was already
admitted or not, add a proper state field with a value for all the
different states the permit can be in. Currently there are three such
states:
* registered - the permit was created and started accounting resource
consumption.
* waiting - the permit was queued to wait for admission.
* admitted - the permit was successfully admitted.
The state will be used for debugging purposes, both during coredump
debugging as well as for dumping diagnostics data about permits.
Require a schema and an operation name to be given to each permit when
created. The schema is of the table the read is executed against, and
the operation name, which is some name identifying the operation the
permit is part of. Ideally this should be different for each site the
permit is created at, to be able to discern not only different kind of
reads, but different code paths the read took.
As not all read can be associated with one schema, the schema is allowed
to be null.
The name will be used for debugging purposes, both for coredump
debugging and runtime logging of permit-related diagnostics.
This can be used with standard containers and other containers that use
the std::allocator interface to track the allocations made by them via a
reader_permit.
In the next patches we plan to start tracking the memory consumption of
the actual allocations made by the circular_buffer<mutation_fragment>,
as well as the memory consumed by the mutation fragments.
This means that readers will start consuming memory off the permit right
after being constructed. Ironically this can prevent the reader from
being admitted, due to its own pre-admission memory consumption. To
prevent this hold on forwarding the memory consumption to the semaphore,
until the permit is actually admitted.
In the next patches the reader permit will gain members that are shared
across all instances of the same permit. To facilitate this move all
internals into an impl class, of which the permit stores a shared
pointer. We use a shared_ptr to avoid defining `impl` in the header.
This is how the reader permit started in the beginning. We've done a
full circle. :)
And do all consuming and signalling through these methods. These
operations will soon be more involved than the simple forwarding they do
today, so we want to centralize them to a single method pair.
In the next patches we want to introduce per-permit resource tracking --
that is, have each permit track the amount of resource consumed through
it. For this, we need all consumption to happen through a permit, and
not directly with the semaphore.
Remove `no_reader_permit()` and all ways to create empty (invalid)
permits. All permits are guaranteed to be valid now and are only
obtainable from a semaphore.
`reader_permit::semaphore()` now returns a reference, as it is
guaranteed to always have a valid semaphore reference.
Permits are now created with `make_permit()` and code is using the
permit to do all resource consumption tracking and admission waiting, so
we can remove these from the semaphore. This allows us to remove some
now unused code from the permit as well, namely the `base_cost` which
was used to track the resource amount the permit was created with. Now
this amount is also tracked with a `resource_units` RAII object, returned
from `reader_permit::wait_admission()`, so it can be removed. Curiously,
this reduces the reader permit to be glorified semaphore pointer. Still,
the permit abstraction is worth keeping, because it allows us to make
changes to how the resource tracking part of the semaphore works,
without having to change the huge amount of code sites passing around
the permit.
We want to make `read_permit` the single interface through which reads
interact with the concurrency limiting mechanism. So far it was only
usable to track memory consumption. Add the missing `wait_admission()`
and `consume_resources()` to the permit API. As opposed to
`reader_concurrency_semaphore::` equivalents which returned a
permit, the `reader_permit::` variants jut return
`reader_permit::resource_units` which is an RAII holder for the acquired
units. This also allows for the permit to be created earlier, before the
reader is admitted, allowing for tracking pre-admission memory usage as
well. In fact this is what we are going to do in the next patches.
This patch also introduces a `broken()` method on the reader concurrency
semaphore which resolves waiters with an exception. This method is also
called internally from the semaphore's destructor. This is needed
because the semaphore can now have external waiters, who has to be
resolved before the semaphore itself is destroyed.
We want to refactor reader_permit::memory_units to work in terms of
reader_resources, as we are planning to use it for guarding count
resources as well. This patch makes the first step: renames it from
memory_units to resources_units. Since this is a very noisy change, we
do it in a separate patch, the semantic change is in the next patch.
This patch is a bag of fixes/cleanups that were omitted from the reader
memory tracking series due to contributor error. It contains the
following changes:
* Get rid of unused `increase()` and `decrease()` methods.
* Make all constructors and assignment operators `noexcept`.
* Make move assignment operator safe w.r.t. self assignment.
* `reset()`: consume the new amount before releasing the old amount,
to prevent a transient window where new readers might be admitted.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200206143007.633069-1-bdenes@scylladb.com>
Previously `tracking_file_impl::make_tracked_buf()`. In the next patches
we plan on using this outside `tracking_file_impl`, so make it public
and templatize on the char type.
Similar to `seastar::semaphore_units`, this allows consuming and
releasing memory via an RAII object. In addition to that, it also allows
tracking changing values. This feature was designed to be used for
tracking the ever changing memory consumption of the buffers of
`flat_mutation_reader`:s.
This is now the only supported way of consuming memory from a permit.
In the next patches we will replace `reader_resource_tracker` and have
code use the `reader_permit` directly. In subsequent patches, the
`reader_permit` will get even more usages as we attempt to make the
tracking of reader resource more accurate by tracking more parts of it.
So the grand plan is that the current `reader_concurrency_semaphore.hh`
is split into two headers:
* `reader_concurrency_semaphore.hh` - containing the semaphore proper.
* `reader_permit.hh` - a very lightweight header, to be used by
components which only want to track various parts of the resource
consumption of reads.