In that level no io_priority_class-es exist. Instead, all the IO happens
in the context of current sched-group. File API no longer accepts prio
class argument (and makes io_intent arg mandatory to impls).
So the change consists of
- removing all usage of io_priority_class
- patching file_impl's inheritants to updated API
- priority manager goes away altogether
- IO bandwidth update is performed on respective sched group
- tune-up scylla-gdb.py io_queues command
The first change is huge and was made semi-autimatically by:
- grep io_priority_class | default_priority_class
- remove all calls, found methods' args and class' fields
Patching file_impl-s is smaller, but also mechanical:
- replace io_priority_class& argument with io_intent* one
- pass intent to lower file (if applicatble)
Dropping the priority manager is:
- git-rm .cc and .hh
- sed out all the #include-s
- fix configure.py and cmakefile
The scylla-gdb.py update is a bit hairry -- it needs to use task queues
list for IO classes names and shares, but to detect it should it checks
for the "commitlog" group is present.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#13963
This PR contains some small improvements to the safety of consuming/releasing resources to/from the semaphore:
* reader_permit: make the low-level `consume()/signal()` API private, making the only user (an RAII class) friend.
* reader_resources: split `reset()` into `noexcept` and potentially throwing variant.
* reader_resources::reset_to(): try harder to avoid calling `consume()` (when the new resource amount is smaller then the previous one)
Closes#13678
* github.com:scylladb/scylladb:
reader_permit: resource_units::reset_to(): try harder to avoid calling consume()
reader_permit: split resource_units::reset()
reader_permit: make consume()/signal() API private
The execution loop consumes permits from the _ready_list and executes
them. The _ready_list usually contains a single permit. When the
_ready_list is not empty, new permits are queued until it becomes empty.
The execution loops relies on admission checks triggered by the read
releasing resouces, to bring in any queued read into the _ready_list,
while it is executing the current read. But in some cases the current
read might not free any resorces and thus fail to trigger an admission
check and the currently queued permits will sit in the queue until
another source triggers an admission check.
I don't yet know how this situation can occur, if at all, but it is
reproducible with a simple unit test, so it is best to cover this
corner-case in the off-chance it happens in the wild.
Add an explicit admission check to the execution loop, after the
_ready_list is exhausted, to make sure any waiters that can be admitted
with an empty _ready_list are admitted immediately and execution
continues.
Fixes: #13540Closes#13541
When requesting memory via `reader_permit::request_memory()`, the
requested amount is added to `_requested_memory` member of the permit
impl. This is because multiple concurrent requests may be blocked and
waiting at the same time. When the requests are fulfilled, the entire
amount is consumed and individual requests track their requested amount
with `resource_units` to release later.
There is a corner-case related to this: if a reader permit is registered
as inactive while it is waiting for memory, its active requests are
killed with `std::bad_alloc`, but the `_requested_memory` fields is not
cleared. If the read survives because the killed requests were part of
a non-vital background read-ahead, a later memory request will also
include amount from the failed requests. This extra amount wil not be
released and hence will cause a resource leak when the permit is
destroyed.
Fix by detecting this corner case and clearing the `_requested_memory`
field. Modify the existing unit test for the scenario of a permit
waiting on memory being registered as inactive, to also cover this
corner case, reproducing the bug.
Fixes: #13539Closes#13679
In https://github.com/scylladb/scylladb/pull/13482 we renamed the reader permit states to more descriptive names. That PR however only covered only the states themselves and their usages, as well as the documentation in `docs/dev`.
This PR is a followup to said PR, completing the name changes: renaming all symbols, names, comments etc, so all is consistent and up-to-date.
Closes#13573
* github.com:scylladb/scylladb:
reader_concurrency_semaphore: misc updates w.r.t. recent permit state name changes
reader_concurrency_semaphore: update permit members w.r.t. recent permit state name changes
reader_concurrency_semaphore: update RAII state guard classes w.r.t. recent permit state name changes
reader_concurrency_semaphore: update API w.r.t. recent permit state name changes
reader_concurrency_semaphore: update stats w.r.t. recent permit state name changes
a signed/unsigned comparsion can overflow. and GCC-13 rightly points
this out. so let's use `std::cmp_greater_equal()` when comparing
unsigned and signed for greater-or-equal.
```
/home/kefu/dev/scylladb/reader_concurrency_semaphore.cc:931:76: error: comparison of integer expressions of different signedness: ‘long int’ and ‘uint64_t’ {aka ‘long unsigned int’} [-Werror=sign-compare]
931 | if (_resources.memory <= 0 && (consumed_resources().memory + r.memory) >= get_kill_limit()) [[unlikely]] {
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Currently, the `reset_to()` implementation calls `consume(new_amount)` (if
not zero), then calls `signal(old_amount)`. This means that even if
`reset_to()` is a net reduction in the amount of resources, there is a
call to `consume()` which can now potentially throw.
Add a special case for when the new amount of resources is strictly
smaller than the old amount. In this case, just call `signal()` with the
difference. This not just avoids a potential `std::bad_alloc`, but also
helps relieving memory pressure when this is most needed, by not failing
calls to release memory.
Into reset_to() and reset_to_zero(). The latter replaces `reset()` with
the default 0 resources argument, which was often called from noexcept
contexts. Splitting it out from `reset()` allows for a specialized
implementation that is guaranteed to be `noexcept` indeed and thus
peace of mind.
Update comments, test names and etc. that are still using the old terminology for
permit state names, bring them up to date with the recent state name changes.
The names of these states have been the source of confusion ever since
they were introduced. Give them names which better reflects their true
meaning and gives less room for misinterpretation. The changes are:
* active/unused -> active
* active/used -> active/need_cpu
* active/blocked -> active/await
Hopefully the new names do a better job at conveying what these states
really mean:
* active - a regular admitted permit, which is active (as opposed to
an inactive permit).
* active/need_cpu - an active permit which was marked as needing CPU for
the read to make progress. This permit prevents admission of new
permits while it is in this state.
* active/await - a former active/need_cpu permit, which has to wait on
I/O or a remote shard. While in this state, it doesn't block the
admission of new permits (pending other criteria such as resource
availability).
Inactive readers should only be evicted to free up resources for waiting
readers. Evicting them when waiters are not admitted for any other
reason than resources is wasteful and leads to extra load later on when
these evicted readers have to be recreated end requeued.
This patch changes the logic on both the registering path and the
admission path to not evict inactive readers unless there are readers
actually waiting on resources.
A unit-test is also added, reproducing the overly-agressive eviction and
checking that it doesn't happen anymore.
Fixes: #11803Closes#13286
Notably, to admission execution and eviction. Registering/unregistering
the permit as inactive is not traced, as this happens on every
buffer-fill for range scans.
Semaphore trace messages have a "[reader_concurrency_semaphore]" prefix
to allow them to be clearly associated with the semaphore.
To make sure all tracing done on a certain page will make its way into
the appropriate trace session.
This is a contination of the previous patch (which added trace pointer
to the permit).
And propagate it down to where it is created. This will be used to add
trace points for semaphore related events, but this will come in the
next patches.
Print the semaphore stats below the permit listing and remove the
currently redundant "Total: " line.
Some of the stats printed here are already exported as metrics, but
instead of trying to cherry-pick and risk some metrics falling through
the cracks, just print everything, there aren't that many anyway.
When diagnosing problems, knowing why permits were queued is very
valuable. Record the reason in a new stats, one for each reason a permit
can be queued.
Kill said read's memory requests with std::bad_alloc and dequeue it from
the memory wait list, then evict it on the spot.
Now that `_inactive_reads` just store permits, we can do this easily.
Add an member of type `inactive_read` to reader permit, and store permit
instances in `_inactive_reads`. This list is now just another intrusive
list the permit can be linked into, depending on its state.
Inactive read handles now just store a reader permit pointer.
Following the same scheme we used to make the wait lists intrusive.
Permits are added to the ready list intrusive list while waiting to be
executed and moved back to the _permit_list when de-queued from this
list.
We now use a conditional variable for signaling when there are permits
ready to be executed.
Used while the permit is in the _ready_list, waiting for the execution
loop to pick it up. This just acknowledging the existence of this
wait-state. This state will now show up in permit diagnostics printouts
and we can now determine whether a permit is waiting for execution,
without checking which queue it is in.
Instead of using expiring_fifo to store queued permits, use the same
intrusive list mechanism we use to keep track of all permits.
Permits are now moved between the _permit_list and the wait queues,
depending on which state they are in. This means _permit_list is now not
the definitive list containing all permits, instead it is the list
containing all permits that are not in a more specialized queue at the
moment.
Code wishing to iterate over all permits should now use
foreach_permits(). For outside code, this was already the only way and
internal users are already patched.
Making the wait lists intrusive allows us to dequeue a permit from any
position, with nothing but a permit reference at hand. It also means
the wait queues don't have any additional memory requirements, other
than the memory for the permit itself.
Timeout while being queued is now handled by the permit's on_timeout()
callback.
Instead of the `entry` wrapper. In _wait_list and _ready_list, that is.
Data stored in the `entry` wrapper is moved to a new
`reader_permit::auxiliary_data` type. This makes the reader permit
self-sufficient. This in turn prepares the ground for the ability to
de-queue a permit from any queue, with nothing but a permit reference at
hand: no need to have back pointer to wrappers and/or iterators.
Currently the reader_permit has some private methods that only the
semaphore's internal calls. But this method of communication is not
consistent, other times the semaphore accesses the permit impl directly,
calling methods on that.
This commit introduces operator * and -> for reader_permit. With this,
the semaphore internals always call the reader_permit::impl methods
direcly, either via a direct reference, or via the above operators.
This makes the permit internface a little narrower and reduces
boilerplate code.
Use it to keep track of all permits that are currently waiting on
something: admission, memory or execution.
Currently we keep track of size, by adding up the result of size() of
the various queues. In future patches we are going to change the queues
such that they will not have constant time size anymore, move to an
explicit counter in preperation to that.
Another change this commit makes is to also include ready list entries
in this counter. Permits in the ready list are also waiters, they wait
to be executed. Soon we will have a separate wait state for this too.
Instead of having callers use get_timeout(), then compare it against the
current time, set up a timeout timer in the permit, which assigned a new
`_ex` member (a `std::exception_ptr`) to the appropriate exception type
when it fires.
Callers can now just poll check_abort() which will throw when `_ex`
is not null. This is more natural and allows for more general reasons
for aborting reads in the future.
This prepares the ground for timeouts being managed inside the permit,
instead of by the semaphore. Including timing out while in a wait queue.
This param is from a time when _permit_list was not accessible from the
outside, so it was passed along the semaphore instance to avoid making
the diagnostics methods friends.
To allow the semaphore freedom in how permits are stored, the
diagnostics code is instead made to use foreach_permit(), instead of
accessing the underlying list directly.
As the diagnostics code wants reader_permit::impl& directly, a new
variant of foreach_permit() passing impl references is introduced.
It already is conceptually, as it passes const references to the permits
it iterates over. The only reason it wasn't const before is a technical
issue which is solved here with a const_cast.
Instead of open-coding the same, in an incomplete way.
clear_inactive_reads() does incomplete eviction in severeal ways:
* it doesn't decrement _stats.inactive_reads
* it doesn't set the permit to evicted state
* it doesn't cancel the ttl timer (if any)
* it doesn't call the eviction notifier on the permit (if there is one)
The list goes on. We already have an evict() method that all this
correctly, use that instead of the current badly open-coded alternative.
This patch also enhances the existing test for clear_inactive_reads()
and adds a new one specifically for `stop()` being called while having
inactive reads.
Fixes: #13048Closes#13049
these warnings are found by Clang-17 after removing
`-Wno-unused-lambda-capture` and '-Wno-unused-variable' from
the list of disabled warnings in `configure.py`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Schema related files are moved there. This excludes schema files that
also interact with mutations, because the mutation module depends on
the schema. Those files will have to go into a separate module.
Closes#12858
When the memory consumption of the semaphore reaches the configured
serialize threshold, all but the blessed permit is blocked from
consuming any more memory. This ensures that past this limit, only one
permit at a time can consume memory.
Such a blessed permit can be registered inactive. Before this patch, it
would still retain its blessed status when doing so. This could result
in this permit being re-queued for admission if it was evicted in the
meanwhile, potentially resulting in a complete deadlock of the semaphore:
* admission queue permits cannot be admitted because there is no memory
* admitter permits are all queued on memory, as none of them are blessed
This patch strips the blessed status from the permit when it is
registered as inactive. It also adds a unit test to verify this happens.
Fixes: #12603Closes#12694
When the collective memory consumption of all readers goes above
$kill_limit_multiplier * $memory_limit, consume() will throw
std::bad_alloc(), instantly unwinding the read that is unlucky enough
to have requested the last bytes of memory. This should help situation
where there are some problematic partitions, either because of large
cells or because they are scattered in too many sstables. Currently
nothing prevents such reads from bringing down the entire node via OOM.
reset() is called from the destructor, with null resources. Calling
consume() can be avoided in this case and in fact it is required as
consume() is soon going to throw in some cases.
Use the recently added `request_memory()` to aquire the memory units for
the I/O. This allows blocking all but one readers when memory
consumption grows too high.