This patchset combines two important changes to the way reader permits
are created and admitted:
1) It switches admission to be up-front.
2) It changes the admission algorithm.
(1) Currently permits are created before the read is started, but they
only wait for admission when going to the disk. This leaves the
resources consumption of cache and memtables reads unbounded, possibly
leading to OOM (rare but happens). This series changes this that permits
are admitted at the moment they are creating making admission up-front
-- at least those reads that pass admission at all (some don't).
(2) Admission currently is based on availability of resources. We have a
certain amount of memory available, which derived from the memory
available to the shard, as well a hardcoded count resource. Reads are
admitted when a count and a certain amount (base cost) of memory is
available. This patchset adds a new aspect to this admission process
beyond the existing resource availability: the number of used/blocked
reads. Namely it only admits new reads if in addition to the necessary
amount of resources being available, all currently used readers are
blocked. In other words we only admit new reads if all currently
admitted reads requires something other than CPU to progress. They are
either waiting on I/O, a remote shard, or attention from their consumers
(not used currently).
The reason for making these two changes at the same time is that
up-front admission means cache reads now need to obtain a permit too.
For cache reads the optimal concurrency is 1. Anything above that just
increases latency (without increasing throughput). So we want to make sure
that if a cache reader hits it doesn't get any competition for CPU and
it can run to completion. We admit new reads only if the read misses and
has to go to disk.
A side effect of these changes is that the execution stages from the
replica-side read path are replaced with the reader concurrency
semaphore as an execution stage. This is necessary due to bad
interaction between said execution stages and up-front admission. This
has an important consequence: read timeouts are more strictly enforced
because the execution stage doesn't have a timeout so it can execute
already timed-out reads too. This is not the case with the semaphore's
queue which will drop timed-out reads. Another consequence is that, now
data and mutation reads share the same execution stage, which increases
its effectiveness, on the other hand system and user reads don't
anymore.
Fixes: #4758Fixes: #5718
Tests: unit(dev, release, debug)
* 'reader-concurrency-semaphore-in-front-of-the-cache/v5.3' of https://github.com/denesb/scylla: (54 commits)
test/boost/reader_concurrency_semaphore_test: add used/blocked test
test/boost/reader_concurrency_semaphore_test: add admission test
reader_permit: add operator<< for reader_resources
reader_concurrency_semaphore: add reads_{admitted,enqueued} stats
table: make_sstable_reader(): fix indentation
table: clean up make_sstable_reader()
database: remove now unused query execution stages
mutation_reader: remove now unused restricting_reader
sstables: sstable_set: remove now unused make_restricted_range_sstable_reader()
reader_permit: remove now unused wait_admission()
reader_concurrency_semaphore: remove now unused obtain_permit_nowait()
reader_concurrency_semaphore: admission: flip the switch
database: increase semaphore max queue size
test: index_with_paging_test: increase semaphore's queue size
reader_concurrency_semaphore: add set_max_queue_size()
test: mutation_reader_test: remove restricted reader tests
reader_concurrency_semaphore: remove now unused make_permit()
test: reader_concurrency_semaphore_test: move away from make_permit()
test: move away from make_permit()
treewide: use make_tracking_only_permit()
...