scylladb

Author	SHA1	Message	Date
Nadav Har'El	d598a94b43	Merge: everywhere: mark deferred actions noexcept Merged patch series by Benny Halevy: Prepare for updating seastar submodule to a change that requires deferred actions to be noexcept (and return void). Test: unit(dev, debug) * tag 'deferred_action-noexcept-v1' of github.com:bhalevy/scylla: everywhere: make deferred actions noexcept cql3: prepare_context: mark methods noexcept commitlog: segment, segment_manager: mark methods noexcept everywhere: cleanup defer.hh includes	2021-08-23 11:16:17 +03:00
Benny Halevy	4439e5c132	everywhere: cleanup defer.hh includes Get rid of unused includes of seastar/util/{defer,closeable}.hh and add a few that are missing from source files. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-08-22 21:11:39 +03:00
Vlad Zolotarov	7bd1bcd779	loading_shared_values/loading_cache: get rid of iterators interface and return value_ptr from find(...) instead The loading_shared_values/loading_cache iterators interface is dangerous/fragile because an iterator doesn't "lock" the entry it points to, and if there is a preemption point between acquiring a non-end() iterator and dereferencing it, the corresponding cache entry may have already been evicted (for whatever reason, e.g. cache size constraints or expiration); dereferencing may then end up in a use-after-free, and we don't have any protection against it in the value_extractor_fn today. And this is in addition to #8920. So, instead of trying to fix the iterator interface this patch kills two birds in a single shot: we are ditching the iterators interface completely and return value_ptr from find(...) instead - the same one we are returning from loading_cache::get_ptr(...) asynchronous APIs. A similar rework is done to the loading_shared_values that loading_cache is based on: we drop the iterators interface and return loading_shared_values::entry_ptr from find(...) instead. loading_cache::value_ptr already takes care of "lock"ing the returned value so that it remains readable even if it's evicted from the cache by the time one tries to read it. And of course it also takes care of updating the last read time stamp and moving the corresponding item to the top of the MRU list. Fixes #8920 Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <20210817222404.3097708-1-vladz@scylladb.com>	2021-08-22 16:49:40 +03:00
Piotr Dulikowski	5a0942a0f8	utils,alternator: move base64 code from alternator to utils The base64 encoding/decoding functions will be used for serialization of hint sync point descriptions. Base64 format is not specific to Alternator, so it can be moved to utils.	2021-08-09 09:24:36 +02:00
Michael Livshin	0eb2eb1b44	rename `coarse_clock` to `coarse_steady_clock` Also add a comment to explain why it exists. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com> Closes #9123	2021-08-02 17:41:21 +03:00
Michael Livshin	71d721a97e	logalloc: add on-stall memory reclaim diagnostics Reuse the existing `reclaim_timer` for stall detection. * Since a timer is now set around every reclaim and compaction, use a coarse one for speed. * Set log level according to conditions (stalls deserve a warning). * Add compaction/migration/eviction/allocation stats. Refs #4186. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-08-01 21:51:08 +03:00
Michael Livshin	68ab3948f8	utils: add a coarse clock Implement a millisecond-resolution `std::chrono`-style clock using `CLOCK_MONOTONIC_COARSE`. The use cases are those where you care about clock sampling latency more than about accuracy. Assuming non-ancient versions of the kernel & libc, all clock types recognized by `clock_gettime()` are implemented through a vDSO, so `clock_gettime()` is not an actual system call. That means that even `CLOCK_MONOTONIC` (which is what `std::chrono::steady_clock` uses) is not terribly expensive in practice. But `CLOCK_MONOTONIC_COARSE` is still 3.5 times faster than that (on my machine the latencies are 4ns versus 14ns) and is also supposed to be easier on the cache. The actual granularity of `CLOCK_MONOTONIC_COARSE` is one tick (on x86-64, anyway) -- but `clock_getres()` says it has millisecond resolution, so we use that. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-08-01 21:51:08 +03:00
Michael Livshin	20c760e638	logalloc: split tracker::impl::reclaim into reclaim & reclaim_locked Similarly to compact_and_evict(). Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-08-01 19:34:13 +03:00
Michael Livshin	a96aed3973	logalloc: metrics: remove unneeded captures and a pleonasm Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-08-01 19:34:13 +03:00
Michael Livshin	aa6c8ef582	logalloc: add metrics for evicted and freed memory Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-08-01 19:34:13 +03:00
Michael Livshin	a6283b322b	logalloc: count evicted memory Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-08-01 19:34:13 +03:00
Michael Livshin	4bcd91a09a	logalloc: count freed memory (On the individual free() request level, i.e. similarly to allocs) Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-08-01 19:34:13 +03:00
Piotr Sarna	60072045db	Merge 'cql3: replace cql3::selection::selectable::raw ... hierarchy with expressions' from Avi Kivity Currently, the grammar has two parallel hierarchies. One hierarchy is used in the WHERE clause, and is based on a combination of `term` and expressions. The other is used in the SELECT clause, and is using the cql3::selection::selectable hierarchy. There is some overlap between the hierarchies: both can name columns. Logically, however, they overlap completely - in SQL anything you can select you can filter on, and vice versa. So merging the two hierarchies is important if we want to enrich CQL. This series does that, partially (see below), converting the SELECT clause to expressions. There is another hierarchy split: between the "raw", pre-prepare object hierarchy, and post-prepare non-raw. This series limits itself to converting the raw hierarchy and leaves the non-raw hierarchy alone. An important design choice is not to have this raw/non-raw split in expressions. Note that most of the hierarchy is completely parallel: addition is addition both before prepare and after prepare (but see [1]). The main difference is around identifiers - before preparation they are unresolved, and after preparation they become `column_definition` objects. We resolve that by having two separate types: `unresolved_identifier` for the pre-prepare phase, and the existing `column_value` for post-prepare phase. Alternative choices would be to keep a separate expression::raw variant, or to template the expression variant on whether it is raw or not. I think it would cause undue bloat and confusion. Note the series introduces many on_internal_error() calls. This is because there is not a lot of overlap in the hierarchies today; you can't have a cast in the WHERE clause, for example. These on_internal_error() calls cannot be triggered since the grammar does not yet allow such expressions to be expressed. 
As we expand the grammar, they will have to be replaced with working implementations. Lastly, field selection is expressible in both hierarchies. This series does not yet merge the two representations (`column_value.sub` vs `field_selection`), but it should be easy to do so later. [1] the `+` operator can also be translated to list concatenation, which we may choose to represent by yet another type. Test: unit(dev) Closes #9087 * github.com:scylladb/scylla: cql3: expression: update find_atom, count_if for function_call, cast, field_selection cql3: expressions: fix printing of nested expressions cql3: selection: replace selectable::raw with expression cql3: expression: convert selectable::with_field_selection::raw to expression cql3: expression: convert selectable::with_cast::raw to expression cql3: expression: convert selectable::with_anonymous_function::raw to expression cql3: expression: convert selectable::with_function_call::raw to expressions cql3: selectable: make selectable::raw forward-declarable cql3: expressions: convert writetime_or_ttl::raw to expression cql3: expression: add convenience constructor from expression element to nested expression utils: introduce variant_element.hh cql3: expression: use nested_expression in binary_operator cql3: expression: introduce nested_expression class Convert column_identifier_raw's use as selectable to expressions make column_identifier::raw forward declarable cql3: introduce selectable::with_expression::raw	2021-07-30 09:57:39 +02:00
Avi Kivity	14fd886c72	utils: int_range: change to std::strong_ordering Ref #1449.	2021-07-28 13:29:50 +03:00
Avi Kivity	89bd7737f3	utils: big_decimal: change to std::strong_ordering Ref #1449.	2021-07-28 13:28:21 +03:00
Avi Kivity	59941c536c	utils: fragment_range: change to std::strong_ordering Ref #1449.	2021-07-28 13:27:49 +03:00
Avi Kivity	7729ff03ad	uuid: change comparators to std::strong_ordering Ref #1449.	2021-07-28 13:20:32 +03:00
Avi Kivity	636b133cbc	utils: introduce variant_element.hh A type trait (is_variant_element) and a concept (VariantElement) that tell if a type T is a member of a variant or not. It can be used even if the variant's elements are not yet defined (just forward-declared).	2021-07-27 20:08:47 +03:00
Pavel Emelyanov	c2a36f5668	utils: Introduce immutable_collection<> Working with collections can be done via const and non-const references. In the former case the collection can only be read from (find, iterate, etc.); in the latter it's possible to alter the collection (erase elements from it or insert them into it). Also, the const-ness of the collection reference is transparently inherited by the returned _elements_ of the collection, so when holding a const reference to a collection it's impossible to modify the found element. This patch introduces an immutable_collection -- a wrapper over an arbitrary collection that makes sure the collection itself is not modified, while the elements obtained from it can be non-const. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-27 20:06:53 +03:00
Pavel Emelyanov	d1c693473a	btree: Generalize some iterator methods The non-const iterator has a constructor from a key pointer and the tree_if_singular method. There's no reason why these two are absent in the const_iterator. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-27 20:06:53 +03:00
Pavel Emelyanov	6ef27c9fa1	btree: Make iterators not modify the tree itself The const_iterator cannot modify anything, but the plain iterator has public methods to remove the key from the tree. To control how the tree is modified this method must be marked private and modification by iterator should come from somewhere else. This somewhere else is the existing key_grabber that's already used to move keys between trees. Generalize this ability to move a key out of a tree (i.e. -- erase). Once done -- mark the iterator::erase_and_dispose private. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-27 20:06:53 +03:00
Pavel Emelyanov	0f53e83a8e	range_tombstone_list, code: Mark external_memory_usage noexcept The range_tombstone_list's method is at the top of the stack of calls each not throwing anything, so do the deep-dive noexcept marking. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-07-27 20:06:53 +03:00
Piotr Sarna	e9d26dd7ed	utils/coroutine: wrap a helper in utils namespace The class name `coroutine` became problematic since seastar introduced it as a namespace for coroutine helpers. To avoid a clash, the class from scylla is wrapped in a separate namespace. Without this patch, Seastar submodule update fails to compile. Message-Id: <6cb91455a7ac3793bc78d161e2cb4174cf6a1606.1626949573.git.sarna@scylladb.com>	2021-07-22 13:28:43 +03:00
Tomasz Grabiec	dcd05f77b1	lsa: Avoid excessive eviction if region is not compactible Introduced in `d72b91053b`. If region was not compactible, for example because it has dense segments, we would keep evicting even though the target for reclaimed segments was met. In the worst case we may have to evict whole cache. Refs #9038 (unlikely to be the cause though) Message-Id: <20210720104039.463662-1-tgrabiec@scylladb.com>	2021-07-20 14:36:14 +03:00
Tomasz Grabiec	50ec3ea295	lsa: Fix misaccounting of used space when allocating lsa_buffers lsa_buffer allocations are aligned to 4K. If a smaller size is requested, the whole 4K is used. However, only the requested size was used in accounting segment occupancy. This can confuse the reclaimer, which may think the segment is sparse while it is actually dense, and compacting it will yield no or little gain. This can cause inefficient memory reclamation or lack of progress. Refs #9038 Message-Id: <20210720104110.463812-1-tgrabiec@scylladb.com>	2021-07-20 14:08:06 +03:00
Tomasz Grabiec	a8528cb24d	lsa: Fix uninitialized field access resulting in hangs during segment compaction _free_space may be initialized with garbage so the kind() getter should only look at the bit which corresponds to the kind. Misclassification of a segment as being of a different kind may result in a hang during segment compaction. Surfaced in debug mode build where the field is filled with 0xbebebebe. Introduced in `b5ca0eb2a2`. Fixes #9057 Message-Id: <20210719232734.443964-1-tgrabiec@scylladb.com>	2021-07-20 02:33:21 +03:00
Nadav Har'El	2cc8c40c07	Merge 'Fix some issues found by gcc 11' from Avi Kivity This series fixes some issues that gcc 11 complains about. I believe all are correct errors from the standard's view. Clang accepts the changed code. Note that this is not enough to build with gcc 11, but it's a start. Closes #9007 * github.com:scylladb/scylla: utils: compact-radix-tree: detemplate array_of<> utils: compact-radix-tree: don't redefine type as member raft: avoid changing meaning of a symbol inside a class cql3: lists: catch polymorphic exceptions by reference	2021-07-12 11:17:57 +03:00
Avi Kivity	29c9570556	utils: compact-radix-tree: detemplate array_of<> The radix tree template defines a nested class template array_of; both a generic template and a fully specialized version. However, gcc (I believe correctly) rejects the fully specialized template that happens to be a member of another class template. As it happens, we don't really need a template here at all. Define a non-template class for each of the cases we need, and use std::conditional_t to select the type we need.	2021-07-11 18:16:21 +03:00
Avi Kivity	f576ecb7cc	utils: compact-radix-tree: don't redefine type as member The `direct_layout` and `indirect_layout` template classes accept a template parameter named `Layout` of type `layout`, and re-export `Layout` as a static data member named `layout`. This redefinition of `layout` is disliked by gcc. Fix by renaming the static data member to `this_layout` and adjust all references.	2021-07-11 18:16:21 +03:00
Avi Kivity	222ef17305	build, treewide: enable -Wredundant-move Returning a function parameter guarantees copy elision and does not require a std::move(). Enable -Wredundant-move to warn us that the move is unneeded, and gain slightly more readable code. A few violations are trivially adjusted. Closes #9004	2021-07-11 12:53:02 +03:00
Benny Halevy	023d103fee	utils: exceptions: is_timeout_exception: add timed_out_error Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210708083252.1934651-1-bhalevy@scylladb.com>	2021-07-08 15:23:29 +03:00
Avi Kivity	99d5355007	Merge "Cache sstable indexes in memory" from Tomasz " The main goal of this series is to improve efficiency of reads from large partitions by reducing amount of I/O needed to read the sstable index. This is achieved by caching index file pages and partition index entries in memory. Currently, the pages are cached by individual reads only for the duration of the read. This was done to facilitate binary search in the promoted index (intra-partition index). After this series, all reads share the index file page cache, which stays around even after reads stop. The page cache is subject to eviction. It uses the same region as the current row cache and shares the LRU with row cache entries. This means that LRU objects need to be virtualized. This series takes an easy approach and does this by introducing a virtual base class. This adds an overhead to row cache entry to store the vtable pointer. SStable indexes have a hierarchy. There is a summary, which is a sparse partition key index into the full partition index. This one is already kept in memory. The partition index is divided by the summary into pages. Each entry in the partition index contains promoted index, which is a sparse index into atoms identified by the clustering key (rows, tombstones). In order to read the promoted index, the reader needs to read the partition index entry first. To speed this up, this series also adds caching of partition index entries. This cache survives reads and is subject to eviction, just like the index file page cache. The unit of caching is the partition index page. Without this cache, each access to promoted index would have to be preceded with the parsing of the partition index page containing the partition key. Performance testing results follow. 
1) scylla-bench large partition reads Populated with: perf_fast_forward --run-tests=large-partition-skips --datasets=sb-large-part-ds1 \ -c1 -m1G --populate --value-size=1024 --rows=10000000 Single partition, 9G data file, 4MB index file Test execution: build/release/scylla -c1 -m4G scylla-bench -workload uniform -mode read -limit 1 -concurrency 100 -partition-count 1 \ -clustering-row-count 10000000 -duration 60m TL;DR: after: 2x throughput, 0.5 median latency Before (`c1daf2bb24`): Results Time (avg): 5m21.033180213s Total ops: 966951 Total rows: 966951 Operations/s: 3011.997048812112 Rows/s: 3011.997048812112 Latency: max: 74.055679ms 99.9th: 63.569919ms 99th: 41.320447ms 95th: 38.076415ms 90th: 37.158911ms median: 34.537471ms mean: 33.195994ms After: Results Time (avg): 5m14.706669345s Total ops: 2042831 Total rows: 2042831 Operations/s: 6491.22243800942 Rows/s: 6491.22243800942 Latency: max: 60.096511ms 99.9th: 35.520511ms 99th: 27.000831ms 95th: 23.986175ms 90th: 21.659647ms median: 15.040511ms mean: 15.402076ms 2) scylla-bench small partitions I tested several scenarios with a varying data set size, e.g. data fully fitting in memory, half fitting, and being much larger. The improvement varied a bit but in all cases the "after" code performed slightly better. Below is a representative run over data set which does not fit in memory. 
scylla -c1 -m4G scylla-bench -workload uniform -mode read -concurrency 400 -partition-count 10000000 \ -clustering-row-count 1 -duration 60m -no-lower-bound Before: Time (avg): 51.072411913s Total ops: 3165885 Total rows: 3165885 Operations/s: 61988.164024260645 Rows/s: 61988.164024260645 Latency: max: 34.045951ms 99.9th: 25.985023ms 99th: 23.298047ms 95th: 19.070975ms 90th: 17.530879ms median: 3.899391ms mean: 6.450616ms After: Time (avg): 50.232410679s Total ops: 3778863 Total rows: 3778863 Operations/s: 75227.58014424688 Rows/s: 75227.58014424688 Latency: max: 37.027839ms 99.9th: 24.805375ms 99th: 18.219007ms 95th: 14.090239ms 90th: 12.124159ms median: 4.030463ms mean: 5.315111ms The results include the warmup phase which populates the partition index cache, so the hot-cache effect is dampened in the statistics. See the 99th percentile. Latency gets better after the cache warms up which moves it lower. 3) perf_fast_forward --run-tests=large-partition-skips Caching is not used here, included to show there are no regressions for the cold cache case. 
TL;DR: No significant change perf_fast_forward --run-tests=large-partition-skips --datasets=large-part-ds1 -c1 -m1G Config: rows: 10000000, value size: 2000 Before: read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu 1 0 36.429822 4 10000000 274500 62 274521 274429 153889.2 153883 19696986 153853 0 0 0 0 0 0 0 22.5% 1 1 36.856236 4 5000000 135662 7 135670 135650 155652.0 155652 19704117 139326 1 0 1 1 0 0 0 38.1% 1 8 36.347667 4 1111112 30569 0 30570 30569 155652.0 155652 19704117 139071 1 0 1 1 0 0 0 19.5% 1 16 36.278866 4 588236 16214 1 16215 16213 155652.0 155652 19704117 139073 1 0 1 1 0 0 0 16.6% 1 32 36.174784 4 303031 8377 0 8377 8376 155652.0 155652 19704117 139056 1 0 1 1 0 0 0 12.3% 1 64 36.147104 4 153847 4256 0 4256 4256 155652.0 155652 19704117 139109 1 0 1 1 0 0 0 11.1% 1 256 9.895288 4 38911 3932 1 3933 3930 100869.2 100868 3178298 59944 38912 0 1 1 0 0 0 14.3% 1 1024 2.599921 4 9757 3753 0 3753 3753 26604.0 26604 801850 15071 9758 0 1 1 0 0 0 14.6% 1 4096 0.784568 4 2441 3111 1 3111 3109 7982.0 7982 205946 3772 2442 0 1 1 0 0 0 13.8% 64 1 36.553975 4 9846154 269359 10 269369 269337 155663.8 155652 19704117 139230 1 0 1 1 0 0 0 28.2% 64 8 36.509694 4 8888896 243467 8 243475 243449 155652.0 155652 19704117 139120 1 0 1 1 0 0 0 26.5% 64 16 36.466282 4 8000000 219381 4 219385 219374 155652.0 155652 19704117 139232 1 0 1 1 0 0 0 24.8% 64 32 36.395926 4 6666688 183171 6 183180 183165 155652.0 155652 19704117 139158 1 0 1 1 0 0 0 21.8% 64 64 36.296856 4 5000000 137753 4 137757 137737 155652.0 155652 19704117 139105 1 0 1 1 0 0 0 17.7% 64 256 20.590392 4 2000000 97133 18 97151 94996 135248.8 131395 7877402 98335 31282 0 1 1 0 0 0 15.7% 64 1024 6.225773 4 588288 94492 1436 95434 88748 46066.5 41321 2324378 30360 9193 0 1 1 0 0 0 15.8% 64 4096 1.856069 4 153856 82893 54 82948 82721 16115.0 16043 583674 11574 2675 0 1 1 0 0 0 16.3% After: read skip time (s) 
iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu 1 0 36.429240 4 10000000 274505 38 274515 274417 153887.8 153883 19696986 153849 0 0 0 0 0 0 0 22.4% 1 1 36.933806 4 5000000 135377 15 135385 135354 155658.0 155658 19704085 139398 1 0 1 1 0 0 0 40.0% 1 8 36.419187 4 1111112 30509 2 30510 30507 155658.0 155658 19704085 139233 1 0 1 1 0 0 0 22.0% 1 16 36.353475 4 588236 16181 0 16182 16181 155658.0 155658 19704085 139183 1 0 1 1 0 0 0 19.2% 1 32 36.251356 4 303031 8359 0 8359 8359 155658.0 155658 19704085 139120 1 0 1 1 0 0 0 14.8% 1 64 36.203692 4 153847 4249 0 4250 4249 155658.0 155658 19704085 139071 1 0 1 1 0 0 0 13.0% 1 256 9.965876 4 38911 3904 0 3906 3904 100875.2 100874 3178266 60108 38912 0 1 1 0 0 0 17.9% 1 1024 2.637501 4 9757 3699 1 3700 3697 26610.0 26610 801818 15071 9758 0 1 1 0 0 0 19.5% 1 4096 0.806745 4 2441 3026 1 3027 3024 7988.0 7988 205914 3773 2442 0 1 1 0 0 0 18.3% 64 1 36.611243 4 9846154 268938 5 268942 268921 155669.8 155705 19704085 139330 2 0 1 1 0 0 0 29.9% 64 8 36.559471 4 8888896 243135 11 243156 243124 155658.0 155658 19704085 139261 1 0 1 1 0 0 0 28.1% 64 16 36.510319 4 8000000 219116 15 219126 219101 155658.0 155658 19704085 139173 1 0 1 1 0 0 0 26.3% 64 32 36.439069 4 6666688 182954 9 182964 182943 155658.0 155658 19704085 139274 1 0 1 1 0 0 0 23.2% 64 64 36.334808 4 5000000 137609 11 137612 137596 155658.0 155658 19704085 139258 2 0 1 1 0 0 0 19.1% 64 256 20.624759 4 2000000 96971 88 97059 92717 138296.0 131401 7877370 98332 31282 0 1 1 0 0 0 17.2% 64 1024 6.260598 4 588288 93967 1429 94905 88051 45939.5 41327 2324346 30361 9193 0 1 1 0 0 0 17.8% 64 4096 1.881338 4 153856 81780 140 81920 81520 16109.8 16092 582714 11617 2678 0 1 1 0 0 0 18.2% 4) perf_fast_forward --run-tests=large-partition-slicing Caching enabled, each line shows the median run from many iterations TL;DR: We can observe reduction in IO which translates to reduction in execution time, 
especially for slicing in the middle of partition. perf_fast_forward --run-tests=large-partition-slicing --datasets=large-part-ds1 -c1 -m1G --keep-cache-across-test-cases Config: rows: 10000000, value size: 2000 Before: offset read time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk allocs tasks insns/f cpu 0 1 0.000491 127 1 2037 24 2109 127 4.0 4 128 2 2 0 1 1 0 0 0 157 80 3058208 15.0% 0 32 0.000561 1740 32 56995 410 60031 47208 5.0 5 160 3 2 0 1 1 0 0 0 386 111 113353 17.5% 0 256 0.002052 488 256 124736 7111 144762 89053 16.6 17 672 14 2 0 1 1 0 0 0 2113 446 52669 18.6% 0 4096 0.016437 61 4096 249199 692 252389 244995 69.4 69 8640 57 5 0 1 1 0 0 0 26638 1717 23321 22.4% 5000000 1 0.002171 221 1 461 2 466 221 25.0 25 268 3 3 0 1 1 0 0 0 638 376 14311524 10.2% 5000000 32 0.002392 404 32 13376 48 13528 13015 27.0 27 332 5 3 0 1 1 0 0 0 931 432 489691 11.9% 5000000 256 0.003659 279 256 69967 764 73130 52563 39.5 41 780 19 3 0 1 1 0 0 0 2689 825 93756 15.8% 5000000 4096 0.018592 55 4096 220313 433 234214 218803 94.2 94 9484 62 9 0 1 1 0 0 0 27349 2213 26562 21.0% After: offset read time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk allocs tasks insns/f cpu 0 1 0.000229 115 1 4371 85 4585 115 2.1 2 64 1 1 1 0 0 0 0 0 90 31 1314749 22.2% 0 32 0.000277 2174 32 115674 1015 128109 14144 3.0 3 96 2 1 1 0 0 0 0 0 319 62 52508 26.1% 0 256 0.001786 576 256 143298 5534 179142 113715 14.7 17 544 15 1 1 0 0 0 0 0 2110 453 45419 21.4% 0 4096 0.015498 61 4096 264289 2006 268850 259342 67.4 67 8576 59 4 1 0 0 0 0 0 26657 1738 22897 23.7% 5000000 1 0.000415 233 1 2411 15 2456 234 4.1 4 128 2 2 1 0 0 0 0 0 199 72 2644719 16.8% 5000000 32 0.000635 1413 32 50398 349 51149 46439 6.0 6 192 4 2 1 0 0 0 0 0 458 128 125893 18.6% 5000000 256 0.002028 486 256 126228 3024 146327 82559 17.8 18 1024 13 4 1 0 0 0 0 0 2123 385 51787 
19.6% 5000000 4096 0.016836 61 4096 243294 814 263434 241660 73.0 73 9344 62 8 1 0 0 0 0 0 26922 1920 24389 22.4% Future work: - Check the impact on non-uniform workloads. Caching sstable indexes takes space away from the row cache which may reduce the hit ratio. - Reduce memory footprint of partition index cache. Currently, about 8x bloat over the on-disk size. - Disable cache population for "bypass cache" reads - Add a switch to disable sstable index caching, per-node, maybe per-table - Better sstable index format. Current format leads to inefficiency in caching since only some elements of the cached page can be hot. A B-tree index would be more efficient. Same applies to the partition index. Only some elements in the partition index page can be hot. - Add heuristic for reducing index file IO size when large partitions are anticipated. If we're bound by disk's bandwidth it's wasteful to read the front of promoted index using 32K IO, better use 4K which should cover the partition entry and then let binary search read the rest. In V2: - Fixed perf_fast_forward regression in the number of IOs used to read partition index page The reader uses 32K reads, which were split by page cache into 4K reads Fix by propagating IO size hints to page cache and using single IO to populate it. New patch: "cached_file: Issue single I/O for the whole read range on miss" - Avoid large allocations to store partition index page entries (due to managed_vector storage). There is a unit test which detects this and fails. Fixed by implementing chunked_managed_vector, based on chunked_vector. - fixed bug in cached_file::evict_gently() where the wrong allocation strategy was used to free btree chunks - Simplify region_impl::free_buf() according to Avi's suggestions - Fit segment_kind in segment_descriptor::_free_space and lift requirement that _buf_pointers emptiness determines the kind - Workaround sigsegv which was most likely due to coroutine miscompilation. 
Worked around by manipulating local object scope. - Wire up system/drop_sstable_caches RESTful API - Fix use-after-move on permit for the old scanning ka/la index reader - Fixed more cases of double open_data() in tests leading to assert failure - Adjusted cached_file class doc to account for changes in behavior. - Rebased Fixes #7079. Refs #363. " * tag 'sstable-index-caching-v2' of github.com:tgrabiec/scylla: (39 commits) api: Drop sstable index caches on system/drop_sstable_caches cached_file: Issue single I/O for the whole read range on miss row_cache: cache_tracker: Do not register metrics when constructed for tests sstables, cached_file: Evict cache gently when sstable is destroyed sstables: Hide partition_index_cache implementation away from sstables.hh sstables: Drop shared_index_lists alias sstables: Destroy partition index cache gently sstables: Cache partition index pages in LSA and link to LRU utils: Introduce lsa::weak_ptr<> sstables: Rename index_list to partition_index_page and shared_index_lists to partition_index_cache sstables, cached_file: Avoid copying buffers from cache when parsing promoted index cached_file: Introduce get_page_units() sstables: read: Document that primitive_consumer::read_32() is alloc-free sstables: read: Count partition index page evictions sstables: Drop the _use_binary_search flag from index entries sstables: index_reader: Keep index objects under LSA lsa: chunked_managed_vector: Adapt more to managed_vector utils: lsa: chunked_managed_vector: Make LSA-aware test: chunked_managed_vector_test: Make exception_safe_class standard layout lsa: Copy chunked_vector to chunked_managed_vector ...	2021-07-07 18:17:10 +03:00
Avi Kivity	4c01a88c9d	logalloc: do not capture backtraces by default in debug mode logalloc has a nice leak/double-free sanitizer, with the nice feature of capturing backtraces to make error reports easy to track down. But capturing backtraces is itself very expensive. This patch makes backtrace capture optional, reducing database_test runtime from 30 minutes to 20 minutes on my machine. Closes #8978	2021-07-06 00:18:22 +02:00
Tomasz Grabiec	f553db69f7	cached_file: Issue single I/O for the whole read range on miss Currently, reading a page range would issue I/O for each missing page. This is inefficient, better to issue a single I/O for the whole range and populate cache from that. As an optimization, issue a single I/O if the first page is missing. This is important for index reads which optimistically try to read 32KB of index file to read the partition index page.	2021-07-02 19:02:14 +02:00
Tomasz Grabiec	1f74863bf8	sstables, cached_file: Evict cache gently when sstable is destroyed We must evict before the _cached_index_file associated with the sstable goes away. Better to do it gently to avoid stalls.	2021-07-02 19:02:14 +02:00
Tomasz Grabiec	9f957f1cf9	sstables: Cache partition index pages in LSA and link to LRU As part of this change, the container for partition index pages was changed from utils::loading_shared_values to intrusive_btree. This is to avoid reactor stalls which the former induces with a large number of elements (pages) due to its use of a hashtable under the hood, which reallocates contiguous storage.	2021-07-02 19:02:14 +02:00
Tomasz Grabiec	b3728f7d9b	utils: Introduce lsa::weak_ptr<> Simplifies managing non-owning references to LSA-managed objects. The lsa::weak_ptr is a smart pointer which is not invalidated by LSA and can be used safely in any allocator context. Dereferencing it will always give a valid reference. This can be used as a building block for implementing cursors into LSA-based caches. Example simple use: // LSA-managed struct X : public lsa::weakly_referencable<X> { int value; }; lsa::weak_ptr<X> x_ptr = with_allocator(region(), [] { X* x = current_allocator().construct<X>(); return x->weak_from_this(); }); std::cout << x_ptr->value;	2021-07-02 19:02:14 +02:00
Tomasz Grabiec	934824394a	sstables, cached_file: Avoid copying buffers from cache when parsing promoted index	2021-07-02 19:02:14 +02:00
Tomasz Grabiec	7b6f18b4ed	cached_file: Introduce get_page_units() Will be needed later for reading a page view which cannot use make_tracked_temporary_buffer(). Standardize on get_page_units(), converting existing code to wrap the units in a deleter.	2021-07-02 19:02:14 +02:00
Tomasz Grabiec	20ef54e9ed	lsa: chunked_managed_vector: Adapt more to managed_vector For seamless transition.	2021-07-02 19:02:14 +02:00
Tomasz Grabiec	78e5b9fd85	utils: lsa: chunked_managed_vector: Make LSA-aware The max chunk size is set to be 10% of segment size.	2021-07-02 19:02:14 +02:00
Tomasz Grabiec	c87ea09535	lsa: Copy chunked_vector to chunked_managed_vector In preparation for adapting it to LSA. Split into two steps to make review easier.	2021-07-02 19:02:14 +02:00
Tomasz Grabiec	1523a7d367	utils: managed_vector: Make clear_and_release() public Will be needed by index reader to ensure that destructor doesn't invoke the allocator so that all is destroyed in the desired allocation context before the object is destroyed.	2021-07-02 19:02:14 +02:00
Tomasz Grabiec	484e06d69b	cached_file: Always start at offset 0 All current uses start at offset 0, so simplify the code by assuming it.	2021-07-02 19:02:13 +02:00
Tomasz Grabiec	078a6e422b	sstables: Cache all index file reads After this patch, there is a single index file page cache per sstable, shared by index readers. The cache survives reads, which reduces the amount of I/O on subsequent reads. As part of this, cached_file needed to be adjusted in the following ways. The page cache may occupy a significant portion of memory. Keeping the pages in the standard allocator could cause memory fragmentation problems. To avoid them, the cached_file is changed to keep buffers in LSA using the lsa_buffer allocation method. When a page is needed by the seastar I/O layer, it needs to be copied to a temporary_buffer which is stable, so must be allocated in the standard allocator space. We copy the page on-demand. Concurrent requests for the same page will share the temporary_buffer. When a page is not used, it only lives in the LSA space. In the subsequent patches cached_file::stream will be adjusted to also support access via cached_page::ptr_type directly, to avoid materializing a temporary_buffer. While a page is used, it is not linked in the LRU so that it is not freed. This ensures that the storage which is actively consumed remains stable, either via temporary_buffer (kept alive by its deleter), or by cached_page::ptr_type directly.	2021-07-02 19:02:13 +02:00
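The rule in 078a6e422b that a page in use is not linked in the LRU can be sketched independently of LSA and seastar. Everything below is hypothetical (`pinned_lru_sketch`, `add`, `acquire`, `release`, `evict_one`), and pages are reduced to integer ids. The sketch only shows the bookkeeping, under the assumption that use is tracked with a per-page counter: a page leaves the LRU list when its use count goes from 0 to 1 and rejoins it when the count returns to 0, so eviction can never pick a page that is being read.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <unordered_map>

// Hypothetical model of "pages in use are unlinked from the LRU"; not ScyllaDB code.
class pinned_lru_sketch {
    using page_id = std::size_t;
    std::list<page_id> _lru;  // front = least recently used; holds unused pages only
    std::unordered_map<page_id, std::list<page_id>::iterator> _pos;  // position of linked pages
    std::unordered_map<page_id, unsigned> _use_count;  // every cached page and its use count

    void link(page_id p) { _pos[p] = _lru.insert(_lru.end(), p); }
    void unlink(page_id p) {
        _lru.erase(_pos.at(p));
        _pos.erase(p);
    }

public:
    // Adds a new, unused page. Assumes `p` is not cached yet.
    void add(page_id p) {
        assert(_use_count.count(p) == 0);
        _use_count[p] = 0;
        link(p);
    }

    // Marks the page as in use; the first user unlinks it from the LRU.
    void acquire(page_id p) {
        unsigned& n = _use_count.at(p);
        if (n++ == 0) {
            unlink(p);
        }
    }

    // Drops one use; the last user links the page back as most recently used.
    void release(page_id p) {
        unsigned& n = _use_count.at(p);
        assert(n > 0);
        if (--n == 0) {
            link(p);
        }
    }

    // Evicts the least recently used unused page; returns false if none is evictable.
    bool evict_one() {
        if (_lru.empty()) {
            return false;
        }
        page_id victim = _lru.front();
        _lru.pop_front();
        _pos.erase(victim);
        _use_count.erase(victim);
        return true;
    }

    bool contains(page_id p) const { return _use_count.count(p) != 0; }
};
```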
Tomasz Grabiec	b5ca0eb2a2	lsa: Introduce lsa_buffer lsa_buffer is similar in spirit to std::unique_ptr<char[]>. It owns buffers allocated inside LSA segments. It uses an alternative allocation method which differs from regular LSA allocations in the following ways: 1) LSA segments only hold buffers, they don't hold metadata. They also don't mix with standard allocations. So a 128K segment can hold 32 4K buffers. 2) objects' lifetime is managed by lsa_buffer, an owning smart pointer, which is automatically updated when buffers are migrated to another segment. This makes LSA allocations easier to use and off-loads metadata management to the client (which can keep the lsa_buffer wherever it wants). The metadata is kept inside segment_descriptor, in a vector. Each allocated buffer will have an entangled object there (8 bytes), which is paired with an entangled object inside lsa_buffer. The reason to have an alternative allocation method is to efficiently pack buffers inside LSA segments.	2021-07-02 19:02:13 +02:00
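Two points from b5ca0eb2a2 lend themselves to a short sketch: the packing arithmetic (a 128K segment holding 32 buffers of 4K) and the idea of paired objects that stay linked when one of them moves. The code below is an assumption about how such a pair could behave, not the `entangled` helper from the ScyllaDB tree; `entangled_sketch`, `make_paired_with`, `peer` and `buffers_per_segment` are hypothetical names. Each object holds one pointer to its peer (8 bytes on a typical 64-bit platform), and moving an object rewrites the peer's pointer so that the pair survives relocation.

```cpp
#include <cstddef>
#include <utility>

// With no per-buffer metadata stored in the segment, buffers pack exactly.
constexpr std::size_t buffers_per_segment(std::size_t segment_size, std::size_t buffer_size) {
    return segment_size / buffer_size;
}
static_assert(buffers_per_segment(128 * 1024, 4 * 1024) == 32, "128K / 4K");

// Hypothetical sketch of a pair of mutually linked, movable objects.
class entangled_sketch {
    entangled_sketch* _peer = nullptr;

    void unlink() noexcept {
        if (_peer) {
            _peer->_peer = nullptr;
            _peer = nullptr;
        }
    }

public:
    entangled_sketch() = default;

    // Creates an object paired with `other`. Assumes `other` is currently unpaired.
    static entangled_sketch make_paired_with(entangled_sketch& other) {
        entangled_sketch e;
        e._peer = &other;
        other._peer = &e;
        return e;  // a move (or elision) leaves other._peer pointing at the result
    }

    // Moving takes over the link and redirects the peer to the new address.
    entangled_sketch(entangled_sketch&& o) noexcept : _peer(o._peer) {
        if (_peer) {
            _peer->_peer = this;
        }
        o._peer = nullptr;
    }

    entangled_sketch& operator=(entangled_sketch&& o) noexcept {
        if (this != &o) {
            unlink();
            _peer = o._peer;
            if (_peer) {
                _peer->_peer = this;
            }
            o._peer = nullptr;
        }
        return *this;
    }

    entangled_sketch(const entangled_sketch&) = delete;
    entangled_sketch& operator=(const entangled_sketch&) = delete;

    // Destroying one side leaves the other side unpaired.
    ~entangled_sketch() { unlink(); }

    entangled_sketch* peer() const noexcept { return _peer; }
};
```

In this model the segment-side object and the owner-side object can each be relocated without the other side needing to know, which is the property the commit message relies on when buffers are migrated to another segment.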
Tomasz Grabiec	a23f27034f	lsa: Introduce entangled helper Will be useful in building higher-level LSA tools.	2021-07-02 19:02:13 +02:00
Tomasz Grabiec	056f14063e	lsa: Encapsulate segment_descriptor::_free_space access Prepares for reusing some of its bits for storing segment kind.	2021-07-02 19:02:13 +02:00
Tomasz Grabiec	019956739d	cached_file: Switch to bplus::tree In order to be able to move it to LSA later.	2021-07-02 10:25:58 +02:00
Tomasz Grabiec	8fbea0b5b7	utils: cached_file: Introduce file wrapper It's an adaptor between seastar::file and cached_file. It gives a seastar::file which will serve reads using a given cached_file as a read-through cache.	2021-07-02 10:25:58 +02:00

1028 Commits