scylladb

Author	SHA1	Message	Date
Botond Dénes	61028ad718	evictable_reader: avoid preemption pitfall around waiting for readmission Permits have to wait for re-admission after having been evicted. This happens via `reader_permit::maybe_wait_readmission()`. The user of this method -- the evictable reader -- uses it to re-wait admission when the underlying reader was evicted. There is one tricky scenario however, when the underlying reader is created for the first time. When the evictable reader is part of a multishard query stack, the created reader might in fact be a resumed, saved one. These readers are kept in an inactive state until actually resumed. The evictable reader shares its permit with the to-be-resumed reader so it can check whether it has been evicted while saved and needs to wait readmission before being resumed. In this flow it is critical that there is no preemption point between this check and actually resuming the reader, because if there is, the reader might end up actually recreated, without having waited for readmission first. To help avoid this situation, the existing `maybe_wait_readmission()` is split into two methods: * `bool reader_permit::needs_readmission()` * `future<> reader_permit::wait_for_readmission()` The evictable reader can now ensure there is no preemption point between `needs_readmission()` and resuming the reader. Fixes: #10187 Tests: unit(release) Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20220315105851.170364-1-bdenes@scylladb.com>	2022-03-15 14:37:22 +02:00
Avi Kivity	fcb8d040e8	treewide: use Software Package Data Exchange (SPDX) license identifiers Instead of lengthy blurbs, switch to single-line, machine-readable standardized (https://spdx.dev) license identifiers. The Linux kernel switched long ago, so there is strong precedent. Three cases are handled: AGPL-only, Apache-only, and dual licensed. For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0), reasoning that our changes are extensive enough to apply our license. The changes were applied mechanically with a script, except to licenses/README.md. Closes #9937	2022-01-18 12:15:18 +01:00
Botond Dénes	4762ddec0f	reader_permit: add release_base_resource() Signals base resources to the semaphore and zeros it. This basically undoes admission.	2022-01-07 14:06:31 +02:00
Kamil Braun	e8824986dd	reader_permit: make query max result size accessible from the permit This will make it easier, for example, to enforce memory limits in lower levels of the flat_mutation_reader stack. By default the size is unlimited. However, for specific queries it is possible to store a different value (for example, obtained from a `read_command` object) through a setter.	2021-09-14 13:27:25 +02:00
Benny Halevy	4e3dcfd7d6	reader_concurrency_semaphore: use permit timeout for admission Now that the timeout is stored in the reader permit use it for admission rather than a timeout parameter. Note that evictable_reader::next_partition currently passes db::no_timeout to resume_or_create_reader, which propagated to maybe_wait_readmission, but it seems to be an oversight of the f_m_r api that doesn't pass a timeout to next_partition(). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-08-24 16:30:51 +03:00
Benny Halevy	eeab5f77d9	repair: row_level: read_mutation_fragment: set reader timeout The timeout needs to be propagated to the reader's permit. Reset it to db::no_timeout in repair_reader::pause(). Warn if set_timeout asks to change the timeout too far into the past (100ms). It is possible that it will be passed a past timeout from the rpc path, where the message timeout is applied (as duration) over the local lowres_clock time and parallel read_data messages that share the query may end up having close, but different timeout values. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-08-24 16:30:40 +03:00
Benny Halevy	fe479aca1d	reader_permit: add timeout member To replace the timeout parameter passed to flat_mutation_reader methods. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-08-24 14:29:44 +03:00
Botond Dénes	b81f39cec9	reader_permit: add operator<< for reader_resources And use it in tests, it results in actually useful error messages.	2021-07-14 17:19:02 +03:00
Botond Dénes	5b8d6f02eb	reader_permit: remove now unused wait_admission()	2021-07-14 17:19:02 +03:00
Botond Dénes	1b7eea0f52	reader_concurrency_semaphore: admission: flip the switch This patch flips two "switches": 1) It switches admission to be up-front. 2) It changes the admission algorithm. (1) by now all permits are obtained up-front, so this patch just yanks out the restricted reader from all reader stacks and simultaneously switches all `obtain_permit_nowait()` calls to `obtain_permit()`. By doing this admission is now waited on when creating the permit. (2) we switch to an admission algorithm that adds a new aspect to the existing resource availability: the number of used/blocked reads. Namely it only admits new reads if in addition to the necessary amount of resources being available, all currently used readers are blocked. In other words we only admit new reads if all currently admitted reads require something other than CPU to progress. They are either waiting on I/O, a remote shard, or attention from their consumers (not used currently). We flip these two switches at the same time because up-front admission means cache reads now need to obtain a permit too. For cache reads the optimal concurrency is 1. Anything above that just increases latency (without increasing throughput). So we want to make sure that if a cache reader hits it doesn't get any competition for CPU and it can run to completion. We admit new reads only if the read misses and has to go to disk. Another change made to accommodate this switch is the replacement of the replica side read execution stages with the reader concurrency semaphore as an execution stage. This replacement is needed because with the introduction of up-front admission, reads are not independent of each other anymore. One read executed can influence whether later reads executed will be admitted or not, and execution stages require independent operations to work well. By moving the execution stage into the semaphore, we have an execution stage which is in control of both admission and running the operations in batches, avoiding the bad interaction between the two.	2021-07-14 17:19:02 +03:00
Botond Dénes	844a99a91a	reader_concurrency_semaphore: prepare for up-front admission We want to make permits be admitted up-front, before even being created. As part of this change, we will get rid of the `wait_admission()` method on the permit, instead, the permit will be created as a result of waiting for admission (just like back some time ago). To allow evicted readers to wait for re-admission, a new method `maybe_wait_readmission()` is created, which waits for readmission if the permit is in evicted state. Also refactor the internals of the semaphore to support and favor up-front admission code. As up-front admission is the future we want the permit code to be organized in such a way that it is natural to use with it. This means that the "old-style" admission code might suffer but we tolerate this as it is on its way out. To this end the following changes were done: * Add a _base_resources field to reader_permit which tracks the base cost of said permit. This is passed in the constructor and is used in the first and subsequent admissions. * The base cost is now managed internally by the permit, instead of relying on an external `resource_units` instance, though the old way is still supported temporarily. * Change the admission pipeline to favor the new permit-internally managed base cost variant. * Compatibility with old-style admission: permits are created with 0 base resources, base resources are set with the compatibility method `set_base_resources()` right before admission, then externalized again after admission with `base_resource_as_resource_units()`. These methods will be gone when the old style admission is retired (together with `wait_admission()`).	2021-07-14 16:48:43 +03:00
Botond Dénes	05e6881c73	reader_permit: allow constructing reader_permit from impl& By enabling shared from this for impl and adding a reader permit constructor which takes a shared pointer to an impl. This allows impl members to invoke functions requiring a `reader_permit` instance as a parameter.	2021-07-14 16:48:43 +03:00
Botond Dénes	aa480fa3f9	reader_permit: allow marking blocked Distinguish between permits that are blocked and those that are not. Conceptually a blocked permit is one that needs to wait on either I/O or a remote shard to proceed. This information will be used by admission, which will only admit new reads when all currently used ones are blocked. More on that in the commit introducing this new admission type. This patch only adds the infrastructure, block sites are not marked yet.	2021-07-14 16:48:43 +03:00
Botond Dénes	a5dc48b4b1	reader_permit: allow marking it as used Distinguish between permits that are used and those that are not. These are two subtypes of the current 'active' state (and replace it). Conceptually a permit is used when any readers associated with it have a pending call to any of their async methods, i.e. the consumer is actively consuming from them. This information will be used for admission, together with a new blocked state introduced by a future patch. This patch only adds the infrastructure, use sites are not marked yet.	2021-07-14 16:48:43 +03:00
Botond Dénes	5a20861a1d	reader_permit: add reader_permit_opt	2021-07-14 16:48:43 +03:00
Botond Dénes	a251cc2368	reader_permit: introduce evicted state We want to introduce more fine-grained states for permits than what we have currently, splitting the current 'active' state into multiple sub-states. As a preparatory step, introduce an evicted state too, to keep track of permits that were evicted while being inactive. This will be important in determining what permits need to re-wait admission, once we keep permits across pages. Having an evicted state also aids validating internal state transitions.	2021-07-14 16:48:43 +03:00
Pavel Solodovnikov	76bea23174	treewide: reduce header interdependencies Use forward declarations wherever possible. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com> Closes #8813	2021-06-07 15:58:35 +03:00
Avi Kivity	a55b434a2b	treewide: extend copyright statements to present day	2021-06-06 19:18:49 +03:00
Botond Dénes	caaa8ef59a	reader_permit: always forward resources This commit conceptually reverts `4c8ab10`. Said commit was meant to prevent the scenario where memory-only permits -- those that don't pass admission but still consume memory -- completely prevent the admission of reads, possibly even causing a deadlock because a permit might even block its own admission. The protection introduced by said commit however proved to be very problematic. It made the status of resources on the permit very hard to reason about and created loopholes via which permits could accumulate without tracking or they could even leak resources. Instead of continuing to patch this broken system, this commit does away with this "protection" based on the observation that deadlocks are now prevented anyway by the admission criteria introduced by `0fe75571d9`, which admits a read anyway when all the initial count resources are available (meaning no admitted reader is alive), regardless of availability of memory. The benefit of this revert is that the semaphore now knows about all the resources and is able to do its job better as it is not "lied to" about resources by the permits. Furthermore the status of a permit's resources is much simpler to reason about, there are no more loopholes in unexpected state transitions to swallow/leak resources. To prove that this revert is indeed safe, in the next commit we add robust tests that stress test admission on a highly contested semaphore. This patch also does away with the registered/admitted differentiation of permits, as this doesn't make much sense anymore, instead these two are unified into a single "active" state. One can always tell whether a permit was admitted or not from whether it owns count resources anyway.	2021-04-26 15:56:56 +03:00
Benny Halevy	81391b845f	reader_permit: expose description method Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:16:10 +03:00
Botond Dénes	a14bb4ba94	reader_permit: add inactive state This state will be used for permits that are not in admitted state when registered as inactive. We can have such reads if a read can be served entirely from cache/memtables and it doesn't have to go to disk and hence doesn't go through admission. These permits currently don't forward their cost to the semaphore so they won't prevent their own admission creating a deadlock. However, when in inactive state, we do want to keep tabs on their resource consumption so we don't accumulate too much of these inactive reads. So introduce a new state for these non-admitted inactive reads. When entering the inactive state, the permit registers its cost with the semaphore, and when unregistered as inactive, it retracts it. This is a workaround (khm hack) until #4758 is solved and all permits will be admitted on creation.	2021-03-18 14:58:21 +02:00
Botond Dénes	18454e4a80	reader_concurrency_semaphore: dump permit diagnostics on timeout or queue overflow The reader concurrency semaphore timing out or its queue being overflown are fairly common events both in production and in testing. At the same time it is a hard to diagnose problem that often has a benign cause (especially during testing), but it is equally possible that it points to something serious. So when this error starts to appear in logs, usually we want to investigate and the investigation is lengthy... either involves looking at metrics or coredumps or both. This patch intends to jumpstart this process by dumping diagnostics on semaphore timeout or queue overflow. The diagnostics are printed to the log with debug level to avoid excessive spamming. They contain a histogram of all the permits associated with the problematic semaphore organized by table, operation and state. Example: DEBUG 2020-10-08 17:05:26,115 [shard 0] reader_concurrency_semaphore - Semaphore _read_concurrency_sem: timed out, dumping permit diagnostics: Permits with state admitted, sorted by memory memory count name 3499M 27 ks.test:data-query 3499M 27 total Permits with state waiting, sorted by count count memory name 1 0B ks.test:drain 7650 0B ks.test:data-query 7651 0B total Permits with state registered, sorted by count count memory name 0 0B total Total: permits: 7678, memory: 3499M This allows determining several things at a glance: * What are the tables involved * What are the operations involved * Where is the memory This can speed up a follow-up investigation greatly, or it can even be enough on its own to determine that the issue is benign.	2020-10-13 12:32:14 +03:00
Botond Dénes	70fa543c31	reader_concurrency_semaphore: add state to permits Instead of a simple boolean, designating whether the permit was already admitted or not, add a proper state field with a value for all the different states the permit can be in. Currently there are three such states: * registered - the permit was created and started accounting resource consumption. * waiting - the permit was queued to wait for admission. * admitted - the permit was successfully admitted. The state will be used for debugging purposes, both during coredump debugging as well as for dumping diagnostics data about permits.	2020-10-13 12:32:13 +03:00
Botond Dénes	ff623e70b3	reader_concurrency_semaphore: name permits Require a schema and an operation name to be given to each permit when created. The schema is of the table the read is executed against, and the operation name is some name identifying the operation the permit is part of. Ideally this should be different for each site the permit is created at, to be able to discern not only different kinds of reads, but different code paths the read took. As not all reads can be associated with one schema, the schema is allowed to be null. The name will be used for debugging purposes, both for coredump debugging and runtime logging of permit-related diagnostics.	2020-10-13 12:32:13 +03:00
Botond Dénes	73a6b97c75	reader_permit: add consumed_resources() accessor That allows querying the amount of resources accounted through this permit, and by extension by this logical read.	2020-10-06 08:18:42 +03:00
Botond Dénes	63578bf0a7	reader_permit: reader_resources: add operator==	2020-09-28 11:27:49 +03:00
Botond Dénes	52662f17ea	reader_permit: resource_units: add permit() and resources() accessors	2020-09-28 11:27:29 +03:00
Botond Dénes	c1215592da	reader_permit: introduce tracking_allocator This can be used with standard containers and other containers that use the std::allocator interface to track the allocations made by them via a reader_permit.	2020-09-28 08:46:22 +03:00
Botond Dénes	f10abf6e35	reader_permit: reader_resources: add with_memory() factory function To make creating reader resource with just memory more convenient and more readable at the same time.	2020-09-28 08:46:22 +03:00
Botond Dénes	4c8ab10563	reader_permit: only forward resource consumption to semaphore after admission In the next patches we plan to start tracking the memory consumption of the actual allocations made by the circular_buffer<mutation_fragment>, as well as the memory consumed by the mutation fragments. This means that readers will start consuming memory off the permit right after being constructed. Ironically this can prevent the reader from being admitted, due to its own pre-admission memory consumption. To prevent this hold on forwarding the memory consumption to the semaphore, until the permit is actually admitted.	2020-09-28 08:46:22 +03:00
Botond Dénes	cd953a36fd	reader_permit: move internals to impl In the next patches the reader permit will gain members that are shared across all instances of the same permit. To facilitate this move all internals into an impl class, of which the permit stores a shared pointer. We use a shared_ptr to avoid defining `impl` in the header. This is how the reader permit started in the beginning. We've done a full circle. :)	2020-09-28 08:46:22 +03:00
Botond Dénes	12372731cb	reader_permit: add consume()/signal() And do all consuming and signalling through these methods. These operations will soon be more involved than the simple forwarding they do today, so we want to centralize them to a single method pair.	2020-09-28 08:46:22 +03:00
Botond Dénes	375815e650	reader_permit::resource_units: store permit instead of semaphore In the next patches we want to introduce per-permit resource tracking -- that is, have each permit track the amount of resource consumed through it. For this, we need all consumption to happen through a permit, and not directly with the semaphore.	2020-09-28 08:46:22 +03:00
Botond Dénes	04d83f6678	reader_permit: move resource_units declaration outside the reader_permit class In the next patch we want to store a `reader_permit` instance inside `resource_units` so a full definition of the former must be available.	2020-09-28 08:46:22 +03:00
Botond Dénes	3bb25eefb6	reader_permit: remove unused release() method Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20200924090040.240906-1-bdenes@scylladb.com>	2020-09-24 12:28:00 +03:00
Botond Dénes	e5db1ce785	reader_permit: reader_resources: add operator- and operator+ In addition to the already available operator+= and operator-=.	2020-07-20 11:23:39 +03:00
Botond Dénes	3cd2598ab3	reader_permit: forbid empty permits Remove `no_reader_permit()` and all ways to create empty (invalid) permits. All permits are guaranteed to be valid now and are only obtainable from a semaphore. `reader_permit::semaphore()` now returns a reference, as it is guaranteed to always have a valid semaphore reference.	2020-05-28 11:34:35 +03:00
Botond Dénes	e40b1fc3c8	reader_permit: fix reader_resources::operator bool	2020-05-28 11:34:35 +03:00
Botond Dénes	f417b9a3ea	reader_concurrency_semaphore: remove wait_admission and consume_resources() Permits are now created with `make_permit()` and code is using the permit to do all resource consumption tracking and admission waiting, so we can remove these from the semaphore. This allows us to remove some now unused code from the permit as well, namely the `base_cost` which was used to track the resource amount the permit was created with. Now this amount is also tracked with a `resource_units` RAII object, returned from `reader_permit::wait_admission()`, so it can be removed. Curiously, this reduces the reader permit to a glorified semaphore pointer. Still, the permit abstraction is worth keeping, because it allows us to make changes to how the resource tracking part of the semaphore works, without having to change the huge amount of code sites passing around the permit.	2020-05-28 11:34:35 +03:00
Botond Dénes	bf4ade8917	reader_permit: resource_units: introduce add() Allows merging two resource_units into one.	2020-05-28 11:34:35 +03:00
Botond Dénes	4d7250d12b	reader_permit: add wait_admission We want to make `reader_permit` the single interface through which reads interact with the concurrency limiting mechanism. So far it was only usable to track memory consumption. Add the missing `wait_admission()` and `consume_resources()` to the permit API. As opposed to `reader_concurrency_semaphore::` equivalents which returned a permit, the `reader_permit::` variants just return `reader_permit::resource_units` which is an RAII holder for the acquired units. This also allows for the permit to be created earlier, before the reader is admitted, allowing for tracking pre-admission memory usage as well. In fact this is what we are going to do in the next patches. This patch also introduces a `broken()` method on the reader concurrency semaphore which resolves waiters with an exception. This method is also called internally from the semaphore's destructor. This is needed because the semaphore can now have external waiters, which have to be resolved before the semaphore itself is destroyed.	2020-05-28 11:34:35 +03:00
Botond Dénes	bd793d6e19	reader_permit: resource_units: work in terms of reader_resources Refactor resource_units semantically as well to work in terms of reader_resources, instead of just memory.	2020-05-28 11:34:35 +03:00
Botond Dénes	0f9c24631a	reader_permit: s/memory_units/resource_units/ We want to refactor reader_permit::memory_units to work in terms of reader_resources, as we are planning to use it for guarding count resources as well. This patch makes the first step: renames it from memory_units to resource_units. Since this is a very noisy change, we do it in a separate patch, the semantic change is in the next patch.	2020-05-28 11:34:35 +03:00
Botond Dénes	434d32befe	reader_permit: tidy up reader_permit::memory_units This patch is a bag of fixes/cleanups that were omitted from the reader memory tracking series due to contributor error. It contains the following changes: * Get rid of unused `increase()` and `decrease()` methods. * Make all constructors and assignment operators `noexcept`. * Make move assignment operator safe w.r.t. self assignment. * `reset()`: consume the new amount before releasing the old amount, to prevent a transient window where new readers might be admitted. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20200206143007.633069-1-bdenes@scylladb.com>	2020-02-06 16:35:07 +02:00
Botond Dénes	dea24ca859	reader_permit: expose make_tracked_temporary_buffer() Previously `tracking_file_impl::make_tracked_buf()`. In the next patches we plan on using this outside `tracking_file_impl`, so make it public and templatize on the char type.	2020-01-28 08:13:16 +02:00
Botond Dénes	16cea36a94	reader_permit: introduce make_tracked_file() Free function equivalent of `reader_resource_tracker::track_file()`, using a `reader_permit` directly.	2020-01-28 08:13:16 +02:00
Botond Dénes	1859a03629	reader_permit: introduce memory_units Similar to `seastar::semaphore_units`, this allows consuming and releasing memory via an RAII object. In addition to that, it also allows tracking changing values. This feature was designed to be used for tracking the ever changing memory consumption of the buffers of `flat_mutation_reader`:s. This is now the only supported way of consuming memory from a permit.	2020-01-28 08:13:16 +02:00
Botond Dénes	c0f96db2d9	reader_concurrency_semaphore: mv reader_resources and reader_permit to reader_permit.hh In the next patches we will replace `reader_resource_tracker` and have code use the `reader_permit` directly. In subsequent patches, the `reader_permit` will get even more usages as we attempt to make the tracking of reader resource more accurate by tracking more parts of it. So the grand plan is that the current `reader_concurrency_semaphore.hh` is split into two headers: * `reader_concurrency_semaphore.hh` - containing the semaphore proper. * `reader_permit.hh` - a very lightweight header, to be used by components which only want to track various parts of the resource consumption of reads.	2020-01-28 08:13:16 +02:00

48 Commits
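
The admission rule described in commits 1b7eea0f52 and caaa8ef59a can be sketched as a small standalone model. This is only an illustration under stated assumptions: `resources`, `admission_model` and all of its members are hypothetical names invented here, not ScyllaDB's actual classes, and the real semaphore also handles queuing, timeouts and eviction, which are left out.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified model of the admission rule; not ScyllaDB code.
struct resources {
    int count = 0;            // number of reads
    std::int64_t memory = 0;  // bytes
};

class admission_model {
    resources _initial;
    resources _available;
    int _used = 0;     // permits whose consumer is actively reading
    int _blocked = 0;  // used permits waiting on I/O or a remote shard

public:
    explicit admission_model(resources initial)
        : _initial(initial), _available(initial) {}

    bool can_admit(resources need) const {
        // Deadlock guard: if all initial count resources are available,
        // no admitted read is alive, so admit regardless of memory.
        if (_available.count == _initial.count) {
            return true;
        }
        const bool enough = _available.count >= need.count
                && _available.memory >= need.memory;
        // Admit only if every currently used permit is blocked, i.e. no
        // admitted read is competing for CPU.
        return enough && _used == _blocked;
    }

    void admit(resources need) {
        assert(can_admit(need));
        _available.count -= need.count;
        _available.memory -= need.memory;
    }

    void mark_used() { ++_used; }
    void mark_blocked() { assert(_blocked < _used); ++_blocked; }
    void mark_unblocked() { assert(_blocked > 0); --_blocked; }
};
```

In this model a second read is refused while the first one is used and not blocked, and becomes admissible once the first read is marked blocked, which mirrors the "cache reads want concurrency 1" reasoning in the commit message.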