scylladb

Author	SHA1	Message	Date
Benny Halevy	679e73053f	reader_concurrency_semaphore: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:48 +03:00
Botond Dénes	f2d5819645	reader_concurrency_semaphore: with_permit(): proper clean-up after queue overload with_permit() creates a permit, with a self-reference, to avoid attaching a continuation to the permit's run function. This self-reference is used to keep the permit alive, until the execution loop processes it. This self reference has to be carefully cleared on error-paths, otherwise the permit will become a zombie, effectively leaking memory. Instead of trying to handle all loose ends, get rid of this self-reference altogether: ask caller to provide a place to save the permit, where it will survive until the end of the call. This makes the call-site a little bit less nice, but it gets rid of a whole class of possible bugs. Fixes: #22588 Closes scylladb/scylladb#22624	2025-02-04 21:27:16 +02:00
Piotr Dulikowski	7383013f43	replica/database: add reader concurrency semaphore groups Replace the reader concurrency semaphores for user reads and view updates with the newly introduced reader concurrency semaphore group, which assigns a semaphore for each service level. Each group is statically assigned to some pool of memory on startup and dynamically distributes this memory between the semaphores, relative to the number of shares of the corresponding scheduling group. The intent of having a separate reader concurrency semaphore for each scheduling group is to prevent priority inversion issues due to reads with different priorities waiting on the same semaphore, as well as to make memory allocation more fair between service levels due to the adjusted number of shares.	2025-01-02 07:13:34 +01:00
Tomasz Grabiec	bf3d0b3543	reader_concurrency_semaphore: Optimize resource_units destruction by postponing wait list processing Observed 3% throughput improvement in sstable-heavy workload bounded by CPU. SStable parsing involves lots of buffer operations which obtain and destroy resource_units. Before the patch, resource_units destruction invoked maybe_admit_waiters(), which performs some computations on waiting permits. We don't really need to admit on each change of resources, since the CPU is used by other things anyway. We can batch the computation. There is already a fiber which does this for processing the _ready_list. We can reuse it for processing _wait_list as well. The changes violate an assumption made by tests that releasing resources immediately triggers an admission check. Therefore, some of the BOOST_REQUIRE_EQUAL checks need to be replaced with REQUIRE_EVENTUALLY_EQUAL as the admission check is now done in the fiber processing the _ready_list. `perf-simple-query` --tablets --smp 1 -m 1G results obtained for fixed 400MHz frequency: Before: ``` enable-cache=1 Running test with config: {partitions=10000, concurrency=100, mode=read, frontend=cql, query_single_key=no, counters=no} Disabling auto compaction Creating 10000 partitions... 
112590.60 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 41353 insns/op, 17992 cycles/op, 0 errors) 122620.68 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 41310 insns/op, 17713 cycles/op, 0 errors) 118169.48 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 41353 insns/op, 17857 cycles/op, 0 errors) 120634.65 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 41328 insns/op, 17733 cycles/op, 0 errors) 117317.18 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 41347 insns/op, 17822 cycles/op, 0 errors) throughput: mean=118266.52 standard-deviation=3797.81 median=118169.48 median-absolute-deviation=2368.13 maximum=122620.68 minimum=112590.60 instructions_per_op: mean=41337.86 standard-deviation=18.73 median=41346.89 median-absolute-deviation=14.64 maximum=41352.53 minimum=41309.83 cpu_cycles_per_op: mean=17823.50 standard-deviation=111.75 median=17821.97 median-absolute-deviation=90.45 maximum=17992.04 minimum=17713.00 ``` After ``` enable-cache=1 Running test with config: {partitions=10000, concurrency=100, mode=read, frontend=cql, query_single_key=no, counters=no} Disabling auto compaction Creating 10000 partitions... 
123689.63 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 40997 insns/op, 17384 cycles/op, 0 errors) 129643.24 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 40997 insns/op, 17325 cycles/op, 0 errors) 128907.27 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 41009 insns/op, 17325 cycles/op, 0 errors) 130342.56 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 40993 insns/op, 17286 cycles/op, 0 errors) 130294.09 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 40972 insns/op, 17336 cycles/op, 0 errors) throughput: mean=128575.36 standard-deviation=2792.75 median=129643.24 median-absolute-deviation=1718.73 maximum=130342.56 minimum=123689.63 instructions_per_op: mean=40993.51 standard-deviation=13.23 median=40996.73 median-absolute-deviation=3.30 maximum=41008.86 minimum=40972.48 cpu_cycles_per_op: mean=17331.16 standard-deviation=35.02 median=17324.84 median-absolute-deviation=6.49 maximum=17383.97 minimum=17286.33 ``` Closes scylladb/scylladb#21918 [avi: patch was co-authored by Łukasz Paszkowski <lukasz.paszkowski@scylladb.com>]	2024-12-30 23:37:46 +02:00
Avi Kivity	f3eade2f62	treewide: relicense to ScyllaDB-Source-Available-1.0 Drop the AGPL license in favor of a source-available license. See the blog post [1] for details. [1] https://www.scylladb.com/2024/12/18/why-were-moving-to-a-source-available-license/	2024-12-18 17:45:13 +02:00
Botond Dénes	c34127092d	reader_concurrency_semaphore: test constructor: don't ignore metrics param The for_tests constructor has a metrics parameter defaulted to register_metrics::no, but when delegating to the other constructor, a hard-coded register_metrics::no is passed. This makes no difference currently, because all callers use the default and the hard-coded value corresponds to it. Let's fix it nevertheless to avoid any future surprises. Closes scylladb/scylladb#20007	2024-08-04 21:14:42 +03:00
Botond Dénes	07c0a8a6f8	reader_concurrency_semaphore: wire in the configurable cpu concurrency Before this patch, the semaphore was hard-wired to stop admission, if there is even a single permit, which is in the need_cpu state. Therefore, keeping the CPU concurrency at 1. This patch makes use of the new cpu_concurrency parameter, which was wired in in the last patches, allowing for a configurable amount of concurrent need_cpu permits. This is to address workloads where some small subset of reads are expected to be slow, and can hold up faster reads behind them in the semaphore queue.	2024-06-27 09:57:11 -04:00
Botond Dénes	59faa6d4ff	reader_concurrency_semaphore: add cpu_concurrency constructor parameter In the case of the user semaphore, this receives the new reader_concurrency_semaphore_cpu_limit config item. Not used yet.	2024-06-27 09:57:11 -04:00
Avi Kivity	fdc1449392	treewide: rename flat_mutation_reader_v2 to mutation_reader flat_mutation_reader_v2 was introduced in a pair of commits in 2021: `e3309322c3` "Clone flat_mutation_reader related classes into v2 variants" `08b5773c12` "Adapt flat_mutation_reader_v2 to the new version of the API" as a replacement for flat_mutation_reader, using range_tombstone_change instead of range_tombstone to represent range tombstones. See those commits for more information. The transition was incremental; the last use of the original flat_mutation_reader was removed in 2022 in commit `026f8cc1e7` "db: Use mutation_partition_v2 in mvcc" In turn, flat_mutation_reader was introduced in 2017 in commit `748205ca75` "Introduce flat_mutation_reader" to transition from a mutation_reader that nested rows within a partition in a separate stream, to a flat reader that streamed partitions and rows in the same stream. Here, we reclaim the original name and rename the awkward flat_mutation_reader_v2 to mutation_reader. Note that mutation_fragment_v2 remains since we still use the original for compatibility, sometimes. Some notes about the transition: - files were also renamed. In one case (flat_mutation_reader_test.cc), the rename target already existed, so we rename to mutation_reader_another_test.cc. - a namespace 'mutation_reader' with two definitions existed (in mutation_reader_fwd.hh). Its contents were folded into the mutation_reader class. As a result, a few #includes had to be adjusted. Closes scylladb/scylladb#19356	2024-06-21 07:12:06 +03:00
Botond Dénes	ba0cc29d82	reader_concurrency_semaphore: make count parameter live-update So that the amount of count resources can be changed at run-time, triggered by e.g. a config change. Previous constant-count based constructor is left intact, to avoid patching all clients, as only a small subset will want the new functionality.	2024-06-13 01:59:21 -04:00
Botond Dénes	3c813fbb99	reader_concurrency_semaphore: add range param to evict_inactive_reads_for_table() When the new optional parameter has a value, evict only inactive reads, whose ranges overlap with the provided range. The range for the inactive read is provided in `register_inactive_read()`. If the inactive read has no range, overlap is assumed and the read is evicted. This will be used to evict all inactive reads that could potentially use a cleaned-up tablet.	2024-04-30 01:31:08 -04:00
Botond Dénes	9e7a957ffb	reader_concurrency_semaphore: allow storing a range with the inactive reader This allows specifying the range the inactive read is reading from. To be used in the next patch to selectively evict inactive reads whose range overlaps with a certain (tablet) range.	2024-04-30 01:31:08 -04:00
Lakshmi Narayanan Sreethar	76f0d5e35b	reader_permit: store schema_ptr instead of raw schema pointer Store schema_ptr in reader permit instead of storing a const pointer to schema to ensure that the schema doesn't get changed elsewhere when the permit is holding on to it. Also update the constructors and all the relevant callers to pass down schema_ptr instead of a raw pointer. Fixes #16180 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#16658	2024-01-11 08:37:56 +02:00
Avi Kivity	7fce057cda	database, reader_concurrency_semaphore: deduplicate reader_concurrency_semaphore metrics reader_concurrency_semaphore metrics are triplicated: each metric is registered for the streaming, user, and system classes. To fix, just move the metrics registration from database to reader_concurrency_semaphore, so each reader_concurrency_semaphore instantiated will register its metrics (if its creator asked for it). Adjust the names given to reader_concurrency_semaphore so we don't change the labels. scylla-gdb is adjusted to support the new names.	2023-12-13 09:16:18 -05:00
Botond Dénes	e1b30f50be	reader_concurrency_semaphore: add register_metrics constructor parameter To be used in the next patch to control whether the semaphore registers and exports metrics or not. We want to move metric registration to the semaphore but we don't want all semaphores to export metrics. The decision on whether a semaphore should or shouldn't export metrics should be made on a case-by-case basis so this new parameter has no default value (except for the for_tests constructor).	2023-12-13 06:25:45 -05:00
Yaniv Kaul	ae2ab6000a	Typos: fix typos in code Fixes some more typos as found by codespell run on the code. In this commit, there are more user-visible errors. Refs: https://github.com/scylladb/scylladb/issues/16255	2023-12-05 15:18:11 +02:00
Botond Dénes	804403f618	reader_concurrency_semaphore: update RAII state guard classes w.r.t. recent permit state name changes They are still using the old terminology for permit state names, bring them up to date with the recent state name changes.	2023-04-19 05:20:42 -04:00
Botond Dénes	89328ce447	reader_concurrency_semaphore: update API w.r.t. recent permit state name changes It is still using the old terminology for permit state names, bring it up to date with the recent state name changes.	2023-04-19 05:18:13 -04:00
Botond Dénes	3919effe2d	reader_concurrency_semaphore: update stats w.r.t. recent permit state name changes It is still using the old terminology for permit state names, bring it up to date with the recent state name changes.	2023-04-19 05:17:34 -04:00
Botond Dénes	bd57471e54	reader_concurrency_semaphore: don't evict inactive readers needlessly Inactive readers should only be evicted to free up resources for waiting readers. Evicting them when waiters are not admitted for any other reason than resources is wasteful and leads to extra load later on when these evicted readers have to be recreated and requeued. This patch changes the logic on both the registering path and the admission path to not evict inactive readers unless there are readers actually waiting on resources. A unit-test is also added, reproducing the overly-aggressive eviction and checking that it doesn't happen anymore. Fixes: #11803 Closes #13286	2023-04-13 15:20:18 +03:00
Botond Dénes	156e5d346d	reader_permit: keep trace_state pointer on permit And propagate it down to where it is created. This will be used to add trace points for semaphore related events, but this will come in the next patches.	2023-03-22 04:58:01 -04:00
Botond Dénes	7b701ac52e	reader_concurrency_semaphore: add stats to record reason for queueing permits When diagnosing problems, knowing why permits were queued is very valuable. Record the reason in a new stats, one for each reason a permit can be queued.	2023-03-17 03:15:41 -04:00
Botond Dénes	bb00405818	reader_concurrency_semaphore: can_admit_read(): also return reason for rejection So caller can bump the appropriate counters or log the reason why the request cannot be admitted.	2023-03-17 03:15:40 -04:00
Botond Dénes	4f5657422d	reader_concurrency_semaphore: move _permit_list next to the other lists A mostly cosmetic change. Also add a comment mentioning that this is the catch-all list.	2023-03-13 08:07:53 -04:00
Botond Dénes	6181c08191	reader_concurrency_semaphore: move inactive_read to .cc It is not used in the header anymore and moving it to the .cc allows us to remove the dependency on flat_mutation_reader_v2.hh.	2023-03-13 08:07:53 -04:00
Botond Dénes	e56ec9373d	reader_concurrency_semaphore: store permits in _inactive_reads Add a member of type `inactive_read` to reader permit, and store permit instances in `_inactive_reads`. This list is now just another intrusive list the permit can be linked into, depending on its state. Inactive read handles now just store a reader permit pointer.	2023-03-13 08:07:53 -04:00
Botond Dénes	d11f9efbfe	reader_concurrency_semaphore: inactive_read: de-inline more methods They will soon need to access reader_permit::impl internals, only available in the .cc file.	2023-03-13 08:07:53 -04:00
Botond Dénes	8e296e8e05	reader_concurrency_semaphore: make _ready_list intrusive Following the same scheme we used to make the wait lists intrusive. Permits are added to the intrusive ready list while waiting to be executed and moved back to the _permit_list when de-queued from this list. We now use a condition variable for signaling when there are permits ready to be executed.	2023-03-13 08:07:53 -04:00
Botond Dénes	6229f8b1a6	reader_concurrency_semaphore: make wait lists intrusive Instead of using expiring_fifo to store queued permits, use the same intrusive list mechanism we use to keep track of all permits. Permits are now moved between the _permit_list and the wait queues, depending on which state they are in. This means _permit_list is now not the definitive list containing all permits, instead it is the list containing all permits that are not in a more specialized queue at the moment. Code wishing to iterate over all permits should now use foreach_permit(). For outside code, this was already the only way and internal users are already patched. Making the wait lists intrusive allows us to dequeue a permit from any position, with nothing but a permit reference at hand. It also means the wait queues don't have any additional memory requirements, other than the memory for the permit itself. Timeout while being queued is now handled by the permit's on_timeout() callback.	2023-03-09 07:11:49 -05:00
Botond Dénes	9ea9a48dbc	reader_concurrency_semaphore: move most wait_queue methods out-of-line They will soon depend on the definition of the reader_permit::impl, which is only available in the .cc file.	2023-03-09 06:53:11 -05:00
Botond Dénes	1d27dd8f0e	reader_concurrency_semaphore: store permits directly in queues Instead of the `entry` wrapper. In _wait_list and _ready_list, that is. Data stored in the `entry` wrapper is moved to a new `reader_permit::auxiliary_data` type. This makes the reader permit self-sufficient. This in turn prepares the ground for the ability to de-queue a permit from any queue, with nothing but a permit reference at hand: no need to have back pointer to wrappers and/or iterators.	2023-03-09 06:53:11 -05:00
Botond Dénes	f5b80fdfd8	reader_concurrency_semaphore: remove redundant waiters() member There is now a field in stats with the same information, use that.	2023-03-09 06:53:11 -05:00
Botond Dénes	74a5981dbe	reader_concurrency_semaphore: add waiters counter Use it to keep track of all permits that are currently waiting on something: admission, memory or execution. Currently we keep track of size, by adding up the result of size() of the various queues. In future patches we are going to change the queues such that they will not have constant time size anymore, move to an explicit counter in preparation for that. Another change this commit makes is to also include ready list entries in this counter. Permits in the ready list are also waiters, they wait to be executed. Soon we will have a separate wait state for this too.	2023-03-09 06:53:11 -05:00
Botond Dénes	23f4e250c2	reader_concurrency_semaphore: maybe_dump_permit_diagnostics(): remove permit list param This param is from a time when _permit_list was not accessible from the outside, so it was passed along the semaphore instance to avoid making the diagnostics methods friends. To allow the semaphore freedom in how permits are stored, the diagnostics code is instead made to use foreach_permit(), instead of accessing the underlying list directly. As the diagnostics code wants reader_permit::impl& directly, a new variant of foreach_permit() passing impl references is introduced.	2023-03-09 05:19:59 -05:00
Botond Dénes	59dc15682b	reader_concurrency_semaphore: make foreach_permit() const It already is conceptually, as it passes const references to the permits it iterates over. The only reason it wasn't const before is a technical issue which is solved here with a const_cast.	2023-03-09 05:19:59 -05:00
Botond Dénes	34cdcaffae	reader_concurrency_semaphore: un-bless permits when they become inactive When the memory consumption of the semaphore reaches the configured serialize threshold, all but the blessed permit is blocked from consuming any more memory. This ensures that past this limit, only one permit at a time can consume memory. Such a blessed permit can be registered inactive. Before this patch, it would still retain its blessed status when doing so. This could result in this permit being re-queued for admission if it was evicted in the meanwhile, potentially resulting in a complete deadlock of the semaphore: * admission queue permits cannot be admitted because there is no memory * admitter permits are all queued on memory, as none of them are blessed This patch strips the blessed status from the permit when it is registered as inactive. It also adds a unit test to verify this happens. Fixes: #12603 Closes #12694	2023-02-01 21:02:17 +02:00
Botond Dénes	7f8469db27	reader_concurrency_semaphore: add foreach_permit() Allows iterating over all permits.	2023-01-17 05:27:04 -05:00
Botond Dénes	4c70b58993	reader_concurrency_semaphore: document the new memory limits	2023-01-17 05:27:04 -05:00
Botond Dénes	edb32cb171	reader_concurrency_semaphore: add OOM killer When the collective memory consumption of all readers goes above $kill_limit_multiplier * $memory_limit, consume() will throw std::bad_alloc(), instantly unwinding the read that is unlucky enough to have requested the last bytes of memory. This should help in situations where there are some problematic partitions, either because of large cells or because they are scattered in too many sstables. Currently nothing prevents such reads from bringing down the entire node via OOM.	2023-01-17 05:27:04 -05:00
Botond Dénes	81e2a2be7d	reader_concurrency_semaphore: make consume() and signal() private Using this API is quite dangerous as any mistakes can lead to leaking resources from the semaphore. Also, soon we will tie this API closer to permits, so they won't be as generic. Make them private so we don't have to worry about correct usage. All external users are patched away already.	2023-01-17 05:27:04 -05:00
Botond Dénes	8f9e8aafdf	reader_concurrency_semaphore: move consume() out-of-line It's about to get a little bit more complex.	2023-01-17 05:27:04 -05:00
Botond Dénes	9ed5d861be	reader_concurrency_semaphore: add request_memory() A possibly blocking request for more memory. If the collective memory consumption of all reads goes above $serialize_limit_multiplier * $memory_limit this request will block for all but one reader (the first requester). Until this situation is resolved, that is, while memory stays above the limit explained above, only this one reader is allowed to make progress. This should help rein in the memory consumption of reads in a situation where their memory consumption used to balloon without constraints before.	2023-01-17 05:27:04 -05:00
Botond Dénes	969beebe5f	reader_concurrency_semaphore: wrap wait list The wait list will become two lists soon. To keep callers simple (as if there was still one list) we wrap it with a wrapper which abstracts this away.	2023-01-16 02:05:27 -05:00
Botond Dénes	8658cfc066	reader_concurrency_semaphore: add {serialize,kill}_limit_multiplier parameters Propagate the recently added reader_concurrency_semaphore_{serialize,kill}_limit_multiplier config items to the semaphore. Not used yet.	2023-01-16 02:05:27 -05:00
Botond Dénes	2c0de50969	reader_concurrency_semaphore: add disk_reads and sstables_read stats And the infrastructure to reader_permit to update them. The infrastructure is not wired in yet. These metrics will be used to count the number of reads gone to disk and the number of sstables read currently respectively.	2023-01-03 09:37:29 -05:00
Botond Dénes	dcd2deb5af	replica/database: fix active_reads_memory_consumption_metric Rename to reads_memory_consumption and drop the "active" from the description as well. This metric tracks the memory consumption of all reads: active or inactive. We don't even currently have a way to track the memory consumption of only active reads. Drop the part of the description which explains the interaction with other metrics: this part is outdated and the new interactions are much more complicated, no way to explain in a metric description. Also ask the semaphore to calculate the memory amount, instead of doing it in the metric itself.	2023-01-03 09:25:47 -05:00
Botond Dénes	929481ea9c	replica/database: fix active_reads metric This metric has been broken for a long time, since inactive reads were introduced. As calculated currently, it includes all permits that passed admission, including inactive reads. On the other hand, it excludes permits created bypassing admission. Fix by using the newly introduced (in this patch) reader_concurrency_semaphore::active_reads() as the basis of this metric: this now includes all permits (reads) that are currently active, excluding waiters and inactive reads.	2023-01-03 08:12:25 -05:00
Botond Dénes	b39ca29b3c	reader_concurrency_semaphore: do_wait_admission(): detect admission-waiter anomaly The semaphore should admit readers as soon as it can. So at any point in time there should be either no waiters, or the semaphore shouldn't be able to admit new reads. Otherwise something went wrong. Detect this when queuing up reads and dump the diagnostics if detected. Even though tests should ensure this should never happen, recently we've seen a race between eviction and enqueuing producing such situations. This is very hard to write tests for, so add built-in detection and protection instead. Detecting this is very cheap anyway.	2022-11-18 11:35:47 +02:00
Avi Kivity	6b0afb968d	Merge 'reader_concurrency_semaphore: add set_resources()' from Botond Dénes Allowing to change the total or initial resources the semaphore has. After calling `set_resources()` the semaphore will look as if it was created with the specified amount of resources. Use the new method in `replica::database::revert_initial_system_read_concurrency_boost()` so it doesn't lead to strange semaphore diagnostics output. Currently the system semaphore has 90/100 count units when there are no reads against it, which has led to some confusion. I also plan on using the new facility in enterprise. Closes #11772 * github.com:scylladb/scylladb: replica/database: revert initial boost to system semaphore with set_resources() reader_concurrency_semaphore: add set_resources()	2022-10-19 18:04:20 +03:00
Botond Dénes	7fbad8de87	reader_concurrency_semaphore: unify admission logic across all paths The semaphore currently has two admission paths: the obtain_permit()/with_permit() methods which admits permits on user request (the front door) and the maybe_admit_waiters() which admits permits based on internal events like memory resource being returned (the back door). The two paths used their own admission conditions and naturally this means that they diverged in time. Notably, maybe_admit_waiters() did not look at inactive readers assuming that if there are waiters there cannot be inactive readers. This is not true however since we merged the execution-stage into the semaphore. Waiters can queue up even when there are inactive reads and thus maybe_admit_waiters() has to consider evicting some of them to see if this would allow for admitting new reads. To avoid such divergence in the future, the admission logic was moved into a new method can_admit_read() which is now shared between the two method families. This method now checks for the possibility of evicting inactive readers as well. The admission logic was tuned slightly to only consider evicting inactive readers if there is a real possibility that this will result in admissions: notably, before this patch, resource availability was checked before stalls were (used permits == blocked permits), so we could evict readers even if this couldn't help. Because now eviction can be started from maybe_admit_waiters(), which is also downstream from eviction, we added a flag to avoid recursive evict -> maybe admit -> evict ... loops. Fixes: #11770 Closes #11784	2022-10-18 17:07:43 +03:00


149 Commits
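
The two memory thresholds described in the request_memory(), OOM killer and un-bless commits above can be illustrated with a minimal sketch. It is not the ScyllaDB implementation: the class `memory_limiter_sketch`, the integer permit ids and every method name below are hypothetical, and the choice to compare the consumption after the request against the limits is an assumption. Only the two rules come from the log: past serialize_limit_multiplier * memory_limit a single "blessed" permit may keep consuming, and past kill_limit_multiplier * memory_limit the request fails with std::bad_alloc.

```cpp
#include <cstddef>
#include <new>
#include <optional>

// Hypothetical, simplified model of the serialize/kill memory limits.
// None of these names are taken from the ScyllaDB sources.
enum class memory_decision { granted, must_wait };

class memory_limiter_sketch {
    std::size_t _memory_limit;
    double _serialize_multiplier;
    double _kill_multiplier;
    std::size_t _consumed = 0;
    // Id of the single permit allowed to consume past the serialize limit.
    std::optional<int> _blessed;

    bool above_serialize_limit(std::size_t bytes) const {
        return static_cast<double>(bytes) > _serialize_multiplier * static_cast<double>(_memory_limit);
    }

public:
    memory_limiter_sketch(std::size_t memory_limit, double serialize_multiplier, double kill_multiplier)
        : _memory_limit(memory_limit)
        , _serialize_multiplier(serialize_multiplier)
        , _kill_multiplier(kill_multiplier) {}

    // Assumption: limits are checked against the consumption the request would produce.
    memory_decision request_memory(int permit_id, std::size_t bytes) {
        const std::size_t after = _consumed + bytes;
        if (static_cast<double>(after) > _kill_multiplier * static_cast<double>(_memory_limit)) {
            // Kill limit: unwind the read that asked for the last bytes.
            throw std::bad_alloc();
        }
        if (above_serialize_limit(after)) {
            if (!_blessed) {
                _blessed = permit_id; // the first requester becomes the blessed permit
            }
            if (*_blessed != permit_id) {
                return memory_decision::must_wait; // everyone else queues on memory
            }
        }
        _consumed = after;
        return memory_decision::granted;
    }

    void release(std::size_t bytes) {
        _consumed -= (bytes > _consumed ? _consumed : bytes);
        if (!above_serialize_limit(_consumed)) {
            _blessed.reset(); // back under the serialize limit: no serialization needed
        }
    }

    // Mirrors the un-bless fix: an inactive permit must not keep the blessed status.
    void on_became_inactive(int permit_id) {
        if (_blessed == permit_id) {
            _blessed.reset();
        }
    }

    std::size_t consumed() const { return _consumed; }
};
```

With a limit of 100 bytes and multipliers 2 and 4, a permit that pushes consumption past 200 bytes becomes the blessed one, other permits get `must_wait` until it releases memory or becomes inactive, and any request that would exceed 400 bytes throws.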