scylladb

Author	SHA1	Message	Date
Tomasz Grabiec	082342ecad	Attach names to allocating sections for better debuggability Large reserves in allocating_section can cause stalls. We already log reserve increase, but we don't know which table it belongs to: lsa - LSA allocation failure, increasing reserve in section 0x600009f94590 to 128 segments; Allocating sections used for updating row cache on memtable flush are notoriously problematic. Each table has its own row_cache, so its own allocating_section(s). If we attached table name to those sections, we could identify which table is causing problems. In some issues we suspected system.raft, but we can't be sure. This patch allows naming allocating_sections for the purpose of identifying them in such log messages. I use abstract_formatter for this purpose to avoid the cost of formatting strings on the hot path (e.g. index_reader). And also to avoid duplicating strings which are already stored elsewhere. Fixes #25799 Closes scylladb/scylladb#27470	2025-12-07 14:14:25 +02:00
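The naming idea in the commit above can be sketched roughly as follows. This is only an illustration under stated assumptions: `lazy_name` and `named_section` are hypothetical names, and `lazy_name` is a simplified stand-in, not the `abstract_formatter` the commit refers to. The point it shows is that the section stores a callback rather than a formatted string, so the name is only rendered on the rare failure path and the table name is not duplicated.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical stand-in for a lazily formatted name: it stores a callback
// instead of a string, so nothing is formatted until a message is built.
class lazy_name {
    std::function<void(std::ostream&)> _write;
public:
    lazy_name() = default;
    explicit lazy_name(std::function<void(std::ostream&)> write)
        : _write(std::move(write)) {}

    friend std::ostream& operator<<(std::ostream& os, const lazy_name& n) {
        if (n._write) {
            n._write(os);
        } else {
            os << "(unnamed)";
        }
        return os;
    }
};

// Hypothetical, heavily simplified section: it only tracks a reserve and a name.
class named_section {
    lazy_name _name;
    std::size_t _reserve_segments = 1;
public:
    named_section() = default;
    explicit named_section(lazy_name name) : _name(std::move(name)) {}

    // Rare failure path: the only place where the name is actually formatted.
    std::string on_allocation_failure() {
        _reserve_segments *= 2;
        std::ostringstream msg;
        msg << "LSA allocation failure, increasing reserve in section "
            << _name << " to " << _reserve_segments << " segments";
        return msg.str();
    }
};
```

In this sketch the callback captures references to strings owned elsewhere, so whoever creates the section has to keep those strings alive for as long as the section may log.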
Avi Kivity	f0ec9dd8f2	Merge 'utils/logalloc: enforce the max contiguous allocation size limit' from Michał Chojnowski This series fixes the only known violation of logalloc's allocation size limits (in `chunked_managed_vector`), and then it makes those limits hard. Before the series, LSA handles overly large allocations by forwarding them to the standard allocator. After the series, an attempt to do an overly large allocation via LSA will trigger an `on_internal_error` instead. We do this because the allocator fallback logic turned out to have subtle and problematic accounting bugs. We could fix them, or we could remove the mechanism altogether. It's hard to say which choice is better. This PR arbitrarily makes the choice to remove the mechanism. This makes the logic simpler, at the risk of escalating some allocation size bugs to crashes. See the descriptions of individual commits for more details. Fixes scylladb/scylladb#23850 Fixes scylladb/scylladb#23851 Fixes scylladb/scylladb#23854 I'm not sure if any of this should be backported or not. The `chunked_managed_vector` fix could be backported, because it's a bugfix. It's an old bug, though, and we have never observed problems related to it. The changes to `logalloc` aren't supposed to be fixing any observable problem, so a backport probably has more risk than benefit in this case. Closes scylladb/scylladb#23944 * github.com:scylladb/scylladb: utils/logalloc: enforce LSA allocation size limits utils/lsa/chunked_managed_vector: fix the calculation of max_chunk_capacity()	2025-05-29 22:11:41 +03:00
Michał Chojnowski	c47f438db3	logalloc: make background_reclaimer::free_memory_threshold publicly visible Wanted by the change to the background_reclaim test in the next patch.	2025-05-06 18:59:18 +02:00
Michał Chojnowski	7f9152babc	utils/lsa/chunked_managed_vector: fix the calculation of max_chunk_capacity() `chunked_managed_vector` is a vector-like container which splits its contents into multiple contiguous allocations if necessary, in order to fit within LSA's max preferred contiguous allocation limits. Each limited-size chunk is stored in a `managed_vector`. `managed_vector` is unaware of LSA's size limits. It's up to the user of `managed_vector` to pick a size which is small enough. This happens in `chunked_managed_vector::max_chunk_capacity()`. But the calculation is wrong, because it doesn't account for the fact that `managed_vector` has to place some metadata (the backreference pointer) inside the allocation. In effect, the chunks allocated by `chunked_managed_vector` are just a tiny bit larger than the limit, and the limit is violated. Fix this by accounting for the metadata. Also, before the patch, `chunked_managed_vector::max_contiguous_allocation` repeats the definition of `logalloc::max_managed_object_size`. This is begging for a bug if `logalloc::max_managed_object_size` changes one day. Adjust it so that `chunked_managed_vector` looks directly at `logalloc::max_managed_object_size`, as it means to.	2025-04-28 12:30:13 +02:00
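The arithmetic behind this fix can be sketched as below. Everything here is an assumption made for illustration: `assumed_max_object_size` is a placeholder and not the real value of `logalloc::max_managed_object_size`, the metadata is modelled as a single pointer, alignment padding is ignored, and the function names are hypothetical. The sketch only shows why dividing the limit by the element size overshoots once the backreference is stored in the same allocation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed placeholder for the LSA limit; not the real value of
// logalloc::max_managed_object_size.
constexpr std::size_t assumed_max_object_size = 12 * 1024;

// Assumed per-allocation metadata: one backreference pointer.
constexpr std::size_t chunk_metadata_size = sizeof(void*);

// Bytes taken by one chunk holding n elements (alignment padding ignored).
template <typename T>
constexpr std::size_t chunk_allocation_size(std::size_t n) {
    return chunk_metadata_size + n * sizeof(T);
}

// Capacity computed without the metadata: the chunk ends up over the limit.
template <typename T>
constexpr std::size_t naive_max_chunk_capacity() {
    return std::max<std::size_t>(assumed_max_object_size / sizeof(T), 1);
}

// Capacity computed after reserving room for the metadata.
template <typename T>
constexpr std::size_t fixed_max_chunk_capacity() {
    return std::max<std::size_t>(
        (assumed_max_object_size - chunk_metadata_size) / sizeof(T), 1);
}

// The naive capacity violates the limit by exactly the metadata size;
// the fixed capacity stays within it.
static_assert(chunk_allocation_size<std::uint64_t>(
                  naive_max_chunk_capacity<std::uint64_t>()) > assumed_max_object_size);
static_assert(chunk_allocation_size<std::uint64_t>(
                  fixed_max_chunk_capacity<std::uint64_t>()) <= assumed_max_object_size);
```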
Kefu Chai	7215d4bfe9	utils: do not include unused headers these unused includes were identified by clang-include-cleaner. after auditing these source files, all of the reports have been confirmed. please note, because quite a few source files relied on `utils/to_string.hh` to pull in the specialization of `fmt::formatter<std::optional<T>>`, after removing `#include <fmt/std.h>` from `utils/to_string.hh`, we have to include `fmt/std.h` directly. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2025-01-14 07:56:39 -05:00
Avi Kivity	f3eade2f62	treewide: relicense to ScyllaDB-Source-Available-1.0 Drop the AGPL license in favor of a source-available license. See the blog post [1] for details. [1] https://www.scylladb.com/2024/12/18/why-were-moving-to-a-source-available-license/	2024-12-18 17:45:13 +02:00
Avi Kivity	aa1270a00c	treewide: change assert() to SCYLLA_ASSERT() assert() is traditionally disabled in release builds, but not in scylladb. This hasn't caused problems so far, but the latest abseil release includes a commit [1] that causes a 1000 insn/op regression when NDEBUG is not defined. Clearly, we must move towards a build system where NDEBUG is defined in release builds. But we can't just define it blindly without vetting all the assert() calls, as some were written with the expectation that they are enabled in release mode. To solve the conundrum, change all assert() calls to a new SCYLLA_ASSERT() macro in utils/assert.hh. This macro is always defined and is not conditional on NDEBUG, so we can later (after vetting Seastar) enable NDEBUG in release mode. [1] `66ef711d68` Closes scylladb/scylladb#20006	2024-08-05 08:23:35 +03:00
Benny Halevy	e5ca65f78b	test/perf: report also log_allocations/op Currently perf-simple-query --write ignores log allocations that happen on the memtable apply path. This change adds tracking and accounting of the number of log allocations, and reporting thereof. For reference, here's the output of build/release/scylla perf-simple-query --write --default-log-level=error --random-seed=1 -c 1 ``` random-seed=1 enable-cache=1 Running test with config: {partitions=10000, concurrency=100, mode=write, frontend=cql, query_single_key=no, counters=no} Disabling auto compaction 78073.55 tps ( 59.4 allocs/op, 16.3 logallocs/op, 14.3 tasks/op, 52991 insns/op, 0 errors) 77263.59 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 53282 insns/op, 0 errors) 79913.07 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 53295 insns/op, 0 errors) 79554.32 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 53284 insns/op, 0 errors) 79151.53 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 53289 insns/op, 0 errors) median 79151.53 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 53289 insns/op, 0 errors) median absolute deviation: 761.54 maximum: 79913.07 minimum: 77263.59 ``` Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-05-02 18:42:41 +03:00
Kefu Chai	168ade72f8	treewide: replace formatter<std::string_view> with formatter<string_view> in {fmt} before v10, it provides the specialization of `fmt::formatter<..>` for `std::string_view` as well as the specialization of `fmt::formatter<..>` for `fmt::string_view` which is an implementation builtin in {fmt} for compatibility of pre-C++17. and this type is used even if the code is compiled with C++ standard greater or equal to C++17. also, before v10, the `fmt::formatter<std::string_view>::format()` is defined so it accepts `std::string_view`. after v10, `fmt::formatter<std::string_view>` still exists, but it is now defined using `format_as()` machinery, so its `format()` method does not actually accept `std::string_view`, it accepts `fmt::string_view`, as the former can be converted to `fmt::string_view`. this is why we can inherit from `fmt::formatter<std::string_view>` and use `formatter<std::string_view>::format(foo, ctx);` to implement the `format()` method with {fmt} v9, but we cannot do this with {fmt} v10, and we would have the following compilation failure: ``` FAILED: service/CMakeFiles/service.dir/RelWithDebInfo/topology_state_machine.cc.o /home/kefu/.local/bin/clang++ -DFMT_DEPRECATED_OSTREAM -DFMT_SHARED -DSCYLLA_BUILD_MODE=release -DSEASTAR_API_LEVEL=7 -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SSTRING -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"RelWithDebInfo\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/gen -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/seastar/gen/include -I/home/kefu/dev/scylladb/build/seastar/gen/src -ffunction-sections -fdata-sections -O3 -g -gz -std=gnu++20 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend
-Wno-enum-constexpr-conversion -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb=. -march=westmere -mllvm -inline-threshold=2500 -fno-slp-vectorize -U_FORTIFY_SOURCE -Werror=unused-result -MD -MT service/CMakeFiles/service.dir/RelWithDebInfo/topology_state_machine.cc.o -MF service/CMakeFiles/service.dir/RelWithDebInfo/topology_state_machine.cc.o.d -o service/CMakeFiles/service.dir/RelWithDebInfo/topology_state_machine.cc.o -c /home/kefu/dev/scylladb/service/topology_state_machine.cc /home/kefu/dev/scylladb/service/topology_state_machine.cc:254:41: error: no matching member function for call to 'format' 254 \| return formatter<std::string_view>::format(it->second, ctx); \| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~ /usr/include/fmt/core.h:2759:22: note: candidate function template not viable: no known conversion from 'seastar::basic_sstring<char, unsigned int, 15>' to 'const fmt::basic_string_view<char>' for 1st argument 2759 \| FMT_CONSTEXPR auto format(const T& val, FormatContext& ctx) const \| ^ ~~~~~~~~~~~~ ``` because the inherited `format()` method actually comes from `fmt::formatter<fmt::string_view>`. to reduce the confusion, in this change, we just inherit from `fmt::format<string_view>`, where `string_view` is actually `fmt::string_view`. this follows the document at https://fmt.dev/latest/api.html#formatting-user-defined-types, and since there is less indirection under the hood -- we do not use the specialization created by `FMT_FORMAT_AS` which inherit from `formatter<fmt::string_view>`, hopefully this can improve the compilation speed a little bit. also, this change addresses the build failure with {fmt} v10. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#18299	2024-04-19 07:44:07 +03:00
Kefu Chai	3d9054991b	utils/logalloc: add fmt::formatter for occupancy_stats before this change, we rely on the default-generated fmt::formatter created from operator<<, but fmt v10 dropped the default-generated formatter. in this change, we define formatters for `occupancy_stats`, and drop its operator<<. Refs #13245 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-02-23 11:32:41 +08:00
Botond Dénes	c0da6bcfb8	utils/logalloc: handle utils::memory_limit_reached in with_reclaiming_disabled() Said method catches bad-allocs and retries the passed-in function after raising the reserves. This does nothing to help the function succeed if the bad alloc was thrown from the semaphore, because the kill limit was reached. In this case the read should be left to fail and terminate. Now that the semaphore is throwing utils::memory_limit_reached in this case, we can distinguish this case and just re-throw the exception.	2023-09-27 10:28:00 -04:00
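The control flow described in this commit can be sketched as a retry loop with two catch clauses. This is a simplified illustration, not the actual logalloc code: `run_with_retry` and the doubling of `reserve` are hypothetical, and the local `memory_limit_reached` type is a stand-in that is assumed to derive from `std::bad_alloc`, which is why it has to be caught before the generic handler.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Stand-in for utils::memory_limit_reached. Assumption: it derives from
// std::bad_alloc, so it must be caught before the generic bad_alloc handler.
struct memory_limit_reached : std::bad_alloc {
    const char* what() const noexcept override { return "memory limit reached"; }
};

// Hypothetical retry loop: an ordinary bad_alloc raises the reserve and
// retries, while memory_limit_reached is passed through to the caller.
template <typename Func>
auto run_with_retry(std::size_t& reserve, Func&& fn) {
    for (;;) {
        try {
            return fn();
        } catch (const memory_limit_reached&) {
            // Raising reserves cannot help here; let the operation fail.
            throw;
        } catch (const std::bad_alloc&) {
            // Allocation failed for lack of reserved memory: grow and retry.
            reserve *= 2;
        }
    }
}
```

The order of the handlers matters in this sketch: with the two clauses swapped, the generic handler would swallow the limit exception and the loop would retry forever.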
Pavel Emelyanov	30959fc9b1	lsa, test: Extend memory footprint test with per-type total sizes When memory footprint test is over it prints total size taken by row cache, memtable and sstables as well as individual objects' sizes. It's also nice to know the details on the row-cache's individual objects. This patch extends the printing with total size of allocated object types according to migrator_fn types. Sample output: mutation footprint: - in cache: 11040928 - in memtable: 9142424 - in sstable: mc: 2160000 md: 2160000 me: 2160000 - frozen: 540 - canonical: 827 - query result: 342 sizeof(cache_entry) = 64 sizeof(memtable_entry) = 64 sizeof(bptree::node) = 288 sizeof(bptree::data) = 72 -- sizeof(decorated_key) = 32 -- sizeof(mutation_partition) = 96 -- -- sizeof(_static_row) = 8 -- -- sizeof(_rows) = 24 -- -- sizeof(_row_tombstones) = 40 sizeof(rows_entry) = 144 sizeof(evictable) = 24 sizeof(deletable_row) = 72 sizeof(row) = 16 radix_tree::inner_node::node_sizes = 48 80 144 272 528 1040 radix_tree::leaf_node::node_sizes = 120 216 416 816 3104 sizeof(atomic_cell_or_collection) = 16 btree::linear_node_size(1) = 24 btree::inner_node_size = 216 btree::leaf_node_size = 120 LSA stats: N18compact_radix_tree4treeI13cell_and_hashjE9leaf_nodeE: 360 N5bplus4dataIl15intrusive_arrayI11cache_entryEN3dht25raw_token_less_comparatorELm16ELNS_10key_searchE0ELNS_10with_debugE0EEE: 5040 N5bplus4nodeIl15intrusive_arrayI11cache_entryEN3dht25raw_token_less_comparatorELm16ELNS_10key_searchE0ELNS_10with_debugE0EEE: 19296 17partition_version: 952416 N11intrusive_b4nodeI10rows_entryXadL_ZNS1_5_linkEEENS1_11tri_compareELm12ELm20ELNS_10key_searchE0ELNS_10with_debugE0EEE: 317472 10rows_entry: 1429056 12blob_storage: 254 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#15434	2023-09-18 11:23:18 +02:00
Botond Dénes	a55903c839	utils/logalloc: add use_standard_allocator_segment_pool_backend() Creating a standard-memory-allocator backend for the segment store. This is targeted towards tools, which want to configure LSA with a segment store backend that is appropriate for the standard allocator (which they want to use). We want to be able to use this in both release and debug mode. The former will be used by tools and the latter will be used to run the logalloc tests with this new backend, making sure it works and doesn't regress. For this latter, we have to allow the release and debug stores to coexist in the same build and for the debug store to be able to delegate to the release store when the standard allocator backend is used.	2022-09-16 13:02:40 +03:00
Botond Dénes	499b9a3a7c	utils/logalloc: tracker_reclaimer_lock: get shard tracker via constructor arg	2022-08-23 10:38:58 +03:00
Botond Dénes	7d17d675af	utils/logalloc: move global stat accessors to tracker These are pretend free functions that access globals in the background. Make them members of the tracker instead, which has everything needed locally to compute them. Callers still have to access these stats through the global tracker instance, but this can be changed to happen through a local instance. Soon....	2022-08-23 10:38:58 +03:00
Botond Dénes	f406151a86	utils/logalloc: allocating_section: don't use the global tracker Instead, get the tracker instance from the region. This requires adding a `region&` parameter to `with_reserve()`. This brings us one step closer to eliminating the global tracker.	2022-08-23 10:38:58 +03:00
Botond Dénes	5b86dfc35a	utils/logalloc: add tracker member to basic_region_impl For now this member is initialized from the global tracker instance. But it allows the members of region impl to be detached from said global, making a step towards removing it.	2022-08-23 10:38:58 +03:00
Benny Halevy	6e961ead3b	logalloc: mark free functions noexcept Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:43:17 +03:00
Benny Halevy	705b42efe2	logalloc: allocating_section: mark functions noexcept Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:43:17 +03:00
Benny Halevy	f9db708376	logalloc: allocating_section: guard: mark constructor noexcept Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:43:17 +03:00
Benny Halevy	5416808367	logalloc: reclaim_lock: mark functions noexcept Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:43:17 +03:00
Benny Halevy	95b0e41abb	logalloc: tracker_reclaimer_lock: mark constructor noexcept Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:43:17 +03:00
Benny Halevy	ed9e036509	logalloc: mark shard_tracker noexcept Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:43:17 +03:00
Benny Halevy	d6e6ffc741	logalloc: region: mark functions const/noexcept Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:43:17 +03:00
Benny Halevy	2beee4a6cd	logalloc: basic_region_impl: mark functions noexcept Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:43:17 +03:00
Benny Halevy	fe50c76dbc	logalloc: tracker: mark functions const/noexcept Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:40:50 +03:00
Benny Halevy	a49619a601	logalloc: occupancy_stats: mark functions noexcept Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 13:17:43 +03:00
Benny Halevy	a6356539bf	logalloc: lsa_buffer: mark functions noexcept Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 10:22:35 +03:00
Avi Kivity	5b541bed72	logalloc: drop region_impl public accessors With the region heap handle removed from logalloc::region, there is nothing remaining there that needs violation of the abstraction boundary, so we can drop these hacks.	2022-07-26 11:12:10 +03:00
Avi Kivity	2cb5f79e9d	logalloc, dirty_memory_manager: move size-tracking binomial heap out of logalloc The region_group mechanism used an intrusive heap handle embedded in logalloc::region to allow region_group:s to track the largest region. But with region_group moved out of logalloc, the handle is out of place. Move it out, introducing a new intermediate class size_tracked_region to hold the heap handle. We might eventually merge the new class into memtable (which derives from it), but that requires a large rearrangement of unit tests, so defer that.	2022-07-26 11:12:10 +03:00
Avi Kivity	ee720fa23b	logalloc: relax lifetime rules around region_listener Currently, a region_listener is added during construction and removed during destruction. This was done to mimic the old region(region_group&) constructor, as region_listener replaces region_group. However, this makes moving the binomial heap handle outside logalloc difficult. The natural place for the handle is in a derived class of logalloc::region (e.g. memtable), but members of this derived class will be destroyed earlier than the logalloc::region here. We could play tricks with an earlier base class but it's better to just decouple region lifecycle from listener lifecycle. Do that by adding listen()/unlisten() methods. Some small awkwardness remains in that merge() implicitly unlistens (see comment in region::unlisten). Unit tests are adjusted.	2022-07-26 11:12:10 +03:00
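A minimal sketch of the decoupling described above, with hypothetical names throughout (`usage_listener`, `simple_region` and `counting_listener` are not the real `region_listener` or `logalloc::region` interfaces). It only illustrates the lifetime point: the listener is attached and detached explicitly, so it does not have to exist for the whole lifetime of the region.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical listener interface; the real region_listener has a
// different and larger set of callbacks.
struct usage_listener {
    virtual ~usage_listener() = default;
    virtual void on_increase(std::size_t bytes) = 0;
};

// Hypothetical region: the listener is attached and detached explicitly
// instead of being fixed at construction and destruction time.
class simple_region {
    usage_listener* _listener = nullptr;
    std::size_t _used = 0;
public:
    void listen(usage_listener* l) noexcept { _listener = l; }
    void unlisten() noexcept { _listener = nullptr; }

    void alloc(std::size_t bytes) {
        _used += bytes;
        if (_listener) {
            _listener->on_increase(bytes);
        }
    }

    std::size_t used() const noexcept { return _used; }
};

// Example listener that sums up the usage it was told about.
struct counting_listener : usage_listener {
    std::size_t total = 0;
    void on_increase(std::size_t bytes) override { total += bytes; }
};
```

With this shape, a class that owns both a region and its listener state can call `unlisten()` in its own destructor, before its members are destroyed.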
Avi Kivity	fbe8ea7727	logalloc, dirty_memory_manager: move region_group and associated code region_group is an abstraction that allows accounting for groups of regions, but the cost/benefit ratio of maintaining the abstraction is poor. Each time we need to change the decision algorithm of memtable flushing (admittedly rarely), we need to distill that into an abstraction for region_groups and then use it. An example is virtual region groups; we wanted to account for the partially flushed memtables and had to invent region groups to stand in their place. Rather than continuing to invest in the abstraction, break it now and move it to the memtable dirty memory manager which is responsible for making those decisions. The relevant code is moved to dirty_memory_manager.hh and dirty_memory_manager.cc (new file), and a new unit test file is added as well. A downside of the change is that unit testing will be more difficult.	2022-07-26 11:12:10 +03:00
Avi Kivity	bffee2540f	logalloc: expose tracker_reclaimer_lock tracker_reclaimer_lock is used by region_group, which is being moved out of logalloc, so expose it.	2022-07-26 11:12:10 +03:00
Avi Kivity	652ab6f4a2	logalloc: reduce friendship between region and region_group - add conversions between region and region_impl - add accessor for the binomial heap handle - add accessor for region_impl::id() - remove friend declarations This helps in moving region_group to a different source file, where the definitions of region_impl will not be visible.	2022-07-26 11:12:10 +03:00
Avi Kivity	c91ee9d04e	logalloc: decouple region_group from region As a first step in moving region_group away from logalloc, decouple communications between region and region_group. We introduce region_listener, that listens for the events that region passed directly to region_group. A region_group now installs a region_listener in a region, instead of having region know about the region_group directly. This decoupling is still leaky: - merge() chooses to forget the merged-from region's region_listener. This happens to be suitable for the only user of merge(). - We're still embedding the binomial heap handle, used by region_group to keep track of region sizes, in regions. A complete decoupling would transfer that responsibility to region_group.	2022-07-26 11:12:03 +03:00
Pavel Emelyanov	ffbf19ee3c	code: Convert is_future result_of assertions into invoke_result concept Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-02-24 19:47:32 +03:00
Avi Kivity	fcb8d040e8	treewide: use Software Package Data Exchange (SPDX) license identifiers Instead of lengthy blurbs, switch to single-line, machine-readable standardized (https://spdx.dev) license identifiers. The Linux kernel switched long ago, so there is strong precedent. Three cases are handled: AGPL-only, Apache-only, and dual licensed. For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0), reasoning that our changes are extensive enough to apply our license. The changes were applied mechanically with a script, except to licenses/README.md. Closes #9937	2022-01-18 12:15:18 +01:00
Michael Livshin	a6283b322b	logalloc: count evicted memory Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-08-01 19:34:13 +03:00
Michael Livshin	4bcd91a09a	logalloc: count freed memory (On the individual free() request level, i.e. similarly to allocs) Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2021-08-01 19:34:13 +03:00
Avi Kivity	99d5355007	Merge "Cache sstable indexes in memory" from Tomasz " The main goal of this series is to improve efficiency of reads from large partitions by reducing amount of I/O needed to read the sstable index. This is achieved by caching index file pages and partition index entries in memory. Currently, the pages are cached by individual reads only for the duration of the read. This was done to facilitate binary search in the promoted index (intra-partition index). After this series, all reads share the index file page cache, which stays around even after reads stop. The page cache is subject to eviction. It uses the same region as the current row cache and shares the LRU with row cache entries. This means that LRU objects need to be virtualized. This series takes an easy approach and does this by introducing a virtual base class. This adds an overhead to row cache entry to store the vtable pointer. SStable indexes have a hierarchy. There is a summary, which is a sparse partition key index into the full partition index. This one is already kept in memory. The partition index is divided by the summary into pages. Each entry in the partition index contains promoted index, which is a sparse index into atoms identified by the clustering key (rows, tombstones). In order to read the promoted index, the reader needs to read the partition index entry first. To speed this up, this series also adds caching of partition index entries. This cache survives reads and is subject to eviction, just like the index file page cache. The unit of caching is the partition index page. Without this cache, each access to promoted index would have to be preceded with the parsing of the partition index page containing the partition key. Performance testing results follow. 
1) scylla-bench large partition reads Populated with: perf_fast_forward --run-tests=large-partition-skips --datasets=sb-large-part-ds1 \ -c1 -m1G --populate --value-size=1024 --rows=10000000 Single partition, 9G data file, 4MB index file Test execution: build/release/scylla -c1 -m4G scylla-bench -workload uniform -mode read -limit 1 -concurrency 100 -partition-count 1 \ -clustering-row-count 10000000 -duration 60m TL;DR: after: 2x throughput, 0.5 median latency Before (`c1daf2bb24`): Results Time (avg): 5m21.033180213s Total ops: 966951 Total rows: 966951 Operations/s: 3011.997048812112 Rows/s: 3011.997048812112 Latency: max: 74.055679ms 99.9th: 63.569919ms 99th: 41.320447ms 95th: 38.076415ms 90th: 37.158911ms median: 34.537471ms mean: 33.195994ms After: Results Time (avg): 5m14.706669345s Total ops: 2042831 Total rows: 2042831 Operations/s: 6491.22243800942 Rows/s: 6491.22243800942 Latency: max: 60.096511ms 99.9th: 35.520511ms 99th: 27.000831ms 95th: 23.986175ms 90th: 21.659647ms median: 15.040511ms mean: 15.402076ms 2) scylla-bench small partitions I tested several scenarios with a varying data set size, e.g. data fully fitting in memory, half fitting, and being much larger. The improvement varied a bit but in all cases the "after" code performed slightly better. Below is a representative run over data set which does not fit in memory. 
scylla -c1 -m4G scylla-bench -workload uniform -mode read -concurrency 400 -partition-count 10000000 \ -clustering-row-count 1 -duration 60m -no-lower-bound Before: Time (avg): 51.072411913s Total ops: 3165885 Total rows: 3165885 Operations/s: 61988.164024260645 Rows/s: 61988.164024260645 Latency: max: 34.045951ms 99.9th: 25.985023ms 99th: 23.298047ms 95th: 19.070975ms 90th: 17.530879ms median: 3.899391ms mean: 6.450616ms After: Time (avg): 50.232410679s Total ops: 3778863 Total rows: 3778863 Operations/s: 75227.58014424688 Rows/s: 75227.58014424688 Latency: max: 37.027839ms 99.9th: 24.805375ms 99th: 18.219007ms 95th: 14.090239ms 90th: 12.124159ms median: 4.030463ms mean: 5.315111ms The results include the warmup phase which populates the partition index cache, so the hot-cache effect is dampened in the statistics. See the 99th percentile. Latency gets better after the cache warms up which moves it lower. 3) perf_fast_forward --run-tests=large-partition-skips Caching is not used here, included to show there are no regressions for the cold cache case. 
TL;DR: No significant change perf_fast_forward --run-tests=large-partition-skips --datasets=large-part-ds1 -c1 -m1G Config: rows: 10000000, value size: 2000 Before: read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu 1 0 36.429822 4 10000000 274500 62 274521 274429 153889.2 153883 19696986 153853 0 0 0 0 0 0 0 22.5% 1 1 36.856236 4 5000000 135662 7 135670 135650 155652.0 155652 19704117 139326 1 0 1 1 0 0 0 38.1% 1 8 36.347667 4 1111112 30569 0 30570 30569 155652.0 155652 19704117 139071 1 0 1 1 0 0 0 19.5% 1 16 36.278866 4 588236 16214 1 16215 16213 155652.0 155652 19704117 139073 1 0 1 1 0 0 0 16.6% 1 32 36.174784 4 303031 8377 0 8377 8376 155652.0 155652 19704117 139056 1 0 1 1 0 0 0 12.3% 1 64 36.147104 4 153847 4256 0 4256 4256 155652.0 155652 19704117 139109 1 0 1 1 0 0 0 11.1% 1 256 9.895288 4 38911 3932 1 3933 3930 100869.2 100868 3178298 59944 38912 0 1 1 0 0 0 14.3% 1 1024 2.599921 4 9757 3753 0 3753 3753 26604.0 26604 801850 15071 9758 0 1 1 0 0 0 14.6% 1 4096 0.784568 4 2441 3111 1 3111 3109 7982.0 7982 205946 3772 2442 0 1 1 0 0 0 13.8% 64 1 36.553975 4 9846154 269359 10 269369 269337 155663.8 155652 19704117 139230 1 0 1 1 0 0 0 28.2% 64 8 36.509694 4 8888896 243467 8 243475 243449 155652.0 155652 19704117 139120 1 0 1 1 0 0 0 26.5% 64 16 36.466282 4 8000000 219381 4 219385 219374 155652.0 155652 19704117 139232 1 0 1 1 0 0 0 24.8% 64 32 36.395926 4 6666688 183171 6 183180 183165 155652.0 155652 19704117 139158 1 0 1 1 0 0 0 21.8% 64 64 36.296856 4 5000000 137753 4 137757 137737 155652.0 155652 19704117 139105 1 0 1 1 0 0 0 17.7% 64 256 20.590392 4 2000000 97133 18 97151 94996 135248.8 131395 7877402 98335 31282 0 1 1 0 0 0 15.7% 64 1024 6.225773 4 588288 94492 1436 95434 88748 46066.5 41321 2324378 30360 9193 0 1 1 0 0 0 15.8% 64 4096 1.856069 4 153856 82893 54 82948 82721 16115.0 16043 583674 11574 2675 0 1 1 0 0 0 16.3% After: read skip time (s) 
iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu 1 0 36.429240 4 10000000 274505 38 274515 274417 153887.8 153883 19696986 153849 0 0 0 0 0 0 0 22.4% 1 1 36.933806 4 5000000 135377 15 135385 135354 155658.0 155658 19704085 139398 1 0 1 1 0 0 0 40.0% 1 8 36.419187 4 1111112 30509 2 30510 30507 155658.0 155658 19704085 139233 1 0 1 1 0 0 0 22.0% 1 16 36.353475 4 588236 16181 0 16182 16181 155658.0 155658 19704085 139183 1 0 1 1 0 0 0 19.2% 1 32 36.251356 4 303031 8359 0 8359 8359 155658.0 155658 19704085 139120 1 0 1 1 0 0 0 14.8% 1 64 36.203692 4 153847 4249 0 4250 4249 155658.0 155658 19704085 139071 1 0 1 1 0 0 0 13.0% 1 256 9.965876 4 38911 3904 0 3906 3904 100875.2 100874 3178266 60108 38912 0 1 1 0 0 0 17.9% 1 1024 2.637501 4 9757 3699 1 3700 3697 26610.0 26610 801818 15071 9758 0 1 1 0 0 0 19.5% 1 4096 0.806745 4 2441 3026 1 3027 3024 7988.0 7988 205914 3773 2442 0 1 1 0 0 0 18.3% 64 1 36.611243 4 9846154 268938 5 268942 268921 155669.8 155705 19704085 139330 2 0 1 1 0 0 0 29.9% 64 8 36.559471 4 8888896 243135 11 243156 243124 155658.0 155658 19704085 139261 1 0 1 1 0 0 0 28.1% 64 16 36.510319 4 8000000 219116 15 219126 219101 155658.0 155658 19704085 139173 1 0 1 1 0 0 0 26.3% 64 32 36.439069 4 6666688 182954 9 182964 182943 155658.0 155658 19704085 139274 1 0 1 1 0 0 0 23.2% 64 64 36.334808 4 5000000 137609 11 137612 137596 155658.0 155658 19704085 139258 2 0 1 1 0 0 0 19.1% 64 256 20.624759 4 2000000 96971 88 97059 92717 138296.0 131401 7877370 98332 31282 0 1 1 0 0 0 17.2% 64 1024 6.260598 4 588288 93967 1429 94905 88051 45939.5 41327 2324346 30361 9193 0 1 1 0 0 0 17.8% 64 4096 1.881338 4 153856 81780 140 81920 81520 16109.8 16092 582714 11617 2678 0 1 1 0 0 0 18.2% 4) perf_fast_forward --run-tests=large-partition-slicing Caching enabled, each line shows the median run from many iterations TL;DR: We can observe reduction in IO which translates to reduction in execution time, 
especially for slicing in the middle of partition.

perf_fast_forward --run-tests=large-partition-slicing --datasets=large-part-ds1 -c1 -m1G --keep-cache-across-test-cases

Config: rows: 10000000, value size: 2000

Before:

 offset  read  time (s)  iterations  frags  frag/s  mad f/s  max f/s  min f/s  avg aio  aio  (KiB)  blocked  dropped  idx hit  idx miss  idx blk  c hit  c miss  c blk  allocs  tasks   insns/f    cpu
      0     1  0.000491         127      1    2037       24     2109      127      4.0    4    128        2        2        0         1        1      0       0      0     157     80   3058208  15.0%
      0    32  0.000561        1740     32   56995      410    60031    47208      5.0    5    160        3        2        0         1        1      0       0      0     386    111    113353  17.5%
      0   256  0.002052         488    256  124736     7111   144762    89053     16.6   17    672       14        2        0         1        1      0       0      0    2113    446     52669  18.6%
      0  4096  0.016437          61   4096  249199      692   252389   244995     69.4   69   8640       57        5        0         1        1      0       0      0   26638   1717     23321  22.4%
5000000     1  0.002171         221      1     461        2      466      221     25.0   25    268        3        3        0         1        1      0       0      0     638    376  14311524  10.2%
5000000    32  0.002392         404     32   13376       48    13528    13015     27.0   27    332        5        3        0         1        1      0       0      0     931    432    489691  11.9%
5000000   256  0.003659         279    256   69967      764    73130    52563     39.5   41    780       19        3        0         1        1      0       0      0    2689    825     93756  15.8%
5000000  4096  0.018592          55   4096  220313      433   234214   218803     94.2   94   9484       62        9        0         1        1      0       0      0   27349   2213     26562  21.0%

After:

 offset  read  time (s)  iterations  frags  frag/s  mad f/s  max f/s  min f/s  avg aio  aio  (KiB)  blocked  dropped  idx hit  idx miss  idx blk  c hit  c miss  c blk  allocs  tasks   insns/f    cpu
      0     1  0.000229         115      1    4371       85     4585      115      2.1    2     64        1        1        1         0        0      0       0      0      90     31   1314749  22.2%
      0    32  0.000277        2174     32  115674     1015   128109    14144      3.0    3     96        2        1        1         0        0      0       0      0     319     62     52508  26.1%
      0   256  0.001786         576    256  143298     5534   179142   113715     14.7   17    544       15        1        1         0        0      0       0      0    2110    453     45419  21.4%
      0  4096  0.015498          61   4096  264289     2006   268850   259342     67.4   67   8576       59        4        1         0        0      0       0      0   26657   1738     22897  23.7%
5000000     1  0.000415         233      1    2411       15     2456      234      4.1    4    128        2        2        1         0        0      0       0      0     199     72   2644719  16.8%
5000000    32  0.000635        1413     32   50398      349    51149    46439      6.0    6    192        4        2        1         0        0      0       0      0     458    128    125893  18.6%
5000000   256  0.002028         486    256  126228     3024   146327    82559     17.8   18   1024       13        4        1         0        0      0       0      0    2123    385     51787  19.6%
5000000  4096  0.016836          61   4096  243294      814   263434   241660     73.0   73   9344       62        8        1         0        0      0       0      0   26922   1920     24389  22.4%

Future work:

- Check the impact on non-uniform workloads. Caching sstable indexes takes space away from the row cache which may reduce the hit ratio.
- Reduce memory footprint of partition index cache. Currently, about 8x bloat over the on-disk size.
- Disable cache population for "bypass cache" reads
- Add a switch to disable sstable index caching, per-node, maybe per-table
- Better sstable index format. Current format leads to inefficiency in caching since only some elements of the cached page can be hot. A B-tree index would be more efficient. Same applies to the partition index. Only some elements in the partition index page can be hot.
- Add heuristic for reducing index file IO size when large partitions are anticipated. If we're bound by disk's bandwidth it's wasteful to read the front of promoted index using 32K IO, better use 4K which should cover the partition entry and then let binary search read the rest.

In V2:

- Fixed perf_fast_forward regression in the number of IOs used to read partition index page
  The reader uses 32K reads, which were split by page cache into 4K reads
  Fix by propagating IO size hints to page cache and using single IO to populate it.
  New patch: "cached_file: Issue single I/O for the whole read range on miss"
- Avoid large allocations to store partition index page entries (due to managed_vector storage). There is a unit test which detects this and fails. Fixed by implementing chunked_managed_vector, based on chunked_vector.
- fixed bug in cached_file::evict_gently() where the wrong allocation strategy was used to free btree chunks
- Simplify region_impl::free_buf() according to Avi's suggestions
- Fit segment_kind in segment_descriptor::_free_space and lift requirement that _buf_pointers emptiness determines the kind
- Workaround sigsegv which was most likely due to coroutine miscompilation. Worked around by manipulating local object scope.
- Wire up system/drop_sstable_caches RESTful API
- Fix use-after-move on permit for the old scanning ka/la index reader
- Fixed more cases of double open_data() in tests leading to assert failure
- Adjusted cached_file class doc to account for changes in behavior.
- Rebased

Fixes #7079.
Refs #363.
"

* tag 'sstable-index-caching-v2' of github.com:tgrabiec/scylla: (39 commits)
  api: Drop sstable index caches on system/drop_sstable_caches
  cached_file: Issue single I/O for the whole read range on miss
  row_cache: cache_tracker: Do not register metrics when constructed for tests
  sstables, cached_file: Evict cache gently when sstable is destroyed
  sstables: Hide partition_index_cache implementation away from sstables.hh
  sstables: Drop shared_index_lists alias
  sstables: Destroy partition index cache gently
  sstables: Cache partition index pages in LSA and link to LRU
  utils: Introduce lsa::weak_ptr<>
  sstables: Rename index_list to partition_index_page and shared_index_lists to partition_index_cache
  sstables, cached_file: Avoid copying buffers from cache when parsing promoted index
  cached_file: Introduce get_page_units()
  sstables: read: Document that primitive_consumer::read_32() is alloc-free
  sstables: read: Count partition index page evictions
  sstables: Drop the _use_binary_search flag from index entries
  sstables: index_reader: Keep index objects under LSA
  lsa: chunked_managed_vector: Adapt more to managed_vector
  utils: lsa: chunked_managed_vector: Make LSA-aware
  test: chunked_managed_vector_test: Make exception_safe_class standard layout
  lsa: Copy chunked_vector to chunked_managed_vector
  ...	2021-07-07 18:17:10 +03:00
Avi Kivity	4c01a88c9d	logalloc: do not capture backtraces by default in debug mode logalloc has a nice leak/double-free sanitizer, with the nice feature of capturing backtraces to make error reports easy to track down. But capturing backtraces is itself very expensive. This patch makes backtrace capture optional, reducing database_test runtime from 30 minutes to 20 minutes on my machine. Closes #8978	2021-07-06 00:18:22 +02:00
Tomasz Grabiec	b5ca0eb2a2	lsa: Introduce lsa_buffer lsa_buffer is similar in spirit to std::unique_ptr<char[]>. It owns buffers allocated inside LSA segments. It uses an alternative allocation method which differs from regular LSA allocations in the following ways: 1) LSA segments only hold buffers, they don't hold metadata. They also don't mix with standard allocations. So a 128K segment can hold 32 4K buffers. 2) objects' lifetime is managed by lsa_buffer, an owning smart pointer, which is automatically updated when buffers are migrated to another segment. This makes LSA allocations easier to use and off-loads metadata management to the client (which can keep the lsa_buffer wherever it wants). The metadata is kept inside segment_descriptor, in a vector. Each allocated buffer will have an entangled object there (8 bytes), which is paired with an entangled object inside lsa_buffer. The reason to have an alternative allocation method is to efficiently pack buffers inside LSA segments.	2021-07-02 19:02:13 +02:00
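The packing arithmetic in the lsa_buffer commit message can be checked with a minimal sketch. The 128K segment size and the 8-byte entangled object come from the message itself; the constant and function names below are hypothetical and are not taken from logalloc.

```cpp
#include <cstddef>

// Sizes quoted in the commit message. The names are hypothetical,
// chosen for this illustration only.
constexpr std::size_t segment_size = 128 * 1024;   // one LSA segment
constexpr std::size_t entangled_ref_size = 8;      // descriptor metadata per buffer

// Number of buffers of the given size that fit into a buffer-only segment.
// Segments hold no metadata, so the whole segment is usable.
constexpr std::size_t buffers_per_segment(std::size_t buffer_size) {
    return buffer_size == 0 ? 0 : segment_size / buffer_size;
}

// Metadata kept outside the segment (in the descriptor) when it is full.
constexpr std::size_t descriptor_bytes_per_segment(std::size_t buffer_size) {
    return buffers_per_segment(buffer_size) * entangled_ref_size;
}

static_assert(buffers_per_segment(4096) == 32,
              "a 128K segment holds 32 4K buffers");
static_assert(descriptor_bytes_per_segment(4096) == 256,
              "32 buffers need 32 * 8 bytes of descriptor metadata");
```

Under these assumptions a full segment of 4K buffers costs 256 bytes of out-of-segment metadata, which is why power-of-two buffers pack without waste.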
Benny Halevy	02917c79b6	logalloc: get rid of unused _descendant_blocked_requests Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210620064204.1709957-1-bhalevy@scylladb.com>	2021-06-22 15:58:56 +02:00
Pavel Solodovnikov	76bea23174	treewide: reduce header interdependencies Use forward declarations wherever possible. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com> Closes #8813	2021-06-07 15:58:35 +03:00
Avi Kivity	a55b434a2b	treewide: extend copyright statements to present day	2021-06-06 19:18:49 +03:00
Avi Kivity	ae660eeec4	logalloc: reduce minimum lsa reserve in allocating_section to 1 Many workloads have fairly constant and small request sizes, so we don't need large reserves for them. These workloads suffer needlessly from the current large reserve of 10 segments (1.2MB) when they really need a few hundred bytes. Reduce the reserve to a minimum of 1 segment. Note that due to #8542 this can make a large difference. Consider a workload that has a 1000-byte footprint in cache. If we've just consumed some free memory and reduced the reserve to zero, then we'll evict about 50,000 objects before proceeding to compact. With the reserve reduced to 1, we'll evict 128 objects. All this for 1000 bytes of memory. Of course, #8542 should be fixed, but reducing the reserve provides some quick relief and makes sense even with the larger fix. The reserve will quickly grow for workloads that handle bigger requests, so they won't see an impact from the reduction. Closes #8572	2021-05-02 15:22:04 +02:00
Avi Kivity	ca0c006b37	logalloc: background reclaim Set up a coroutine in a new scheduling group to ensure there is a "cushion" of free memory. It reclaims in preemptible mode in order to reduce reactor stalls (contrast with synchronous reclaim, which cannot preempt until it has achieved its goal). The free memory target is arbitrarily set at 60MB. The reclaimer's shares are proportional to the distance from the free memory target; so a workload that allocates memory rapidly will have the background reclaimer working harder. I rolled my own condition variable here, mostly as an experiment. seastar::condition_variable requires several allocations, while the one here requires none. We should formalize it after we gain more experience with it.	2021-02-14 19:09:29 +02:00
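The "shares proportional to the distance from the free memory target" rule can be sketched as below. Only the 60MB target comes from the commit message; the 1000-share ceiling, the linear scaling, the reading of "MB" as MiB, and all names are assumptions made for illustration and may differ from what logalloc actually does.

```cpp
#include <cstdint>

// Free-memory target from the commit message (read here as 60 MiB, an assumption).
constexpr std::uint64_t free_memory_target = 60ULL * 1024 * 1024;

// Hypothetical ceiling for the reclaimer's scheduling-group shares.
constexpr std::uint64_t max_shares = 1000;

// Shares grow linearly with the shortfall below the target:
// no shortfall gives 0 shares, no free memory at all gives max_shares.
constexpr unsigned reclaimer_shares(std::uint64_t free_bytes) {
    return free_bytes >= free_memory_target
        ? 0u
        : static_cast<unsigned>(
              (free_memory_target - free_bytes) * max_shares / free_memory_target);
}

static_assert(reclaimer_shares(free_memory_target) == 0,
              "at the target the reclaimer is idle");
static_assert(reclaimer_shares(free_memory_target / 2) == 500,
              "halfway to the target gives half the shares");
static_assert(reclaimer_shares(0) == 1000,
              "with no free memory the reclaimer gets the full share count");
```

A workload that drains free memory quickly drives the shortfall up, so under this model the reclaimer is scheduled more aggressively exactly when the cushion is thinnest.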
Botond Dénes	7b56ed6057	utils: logalloc: add lsa_global_occupancy_stats() Allows querying the occupancy stats of all the lsa memory.	2020-11-17 15:13:21 +02:00
Avi Kivity	7ac59dcc98	lsa: decay reserves The log-structured allocator (LSA) reserves memory when performing operations, since its operations are performed with reclaiming disabled and if it runs out, it cannot evict cache to gain more. The amount of memory to reserve is remembered across calls so that it does not have to repeat the fail/increase-reserve/retry cycle for every operation. However, we currently never decay the amount to reserve. This means that if a single operation increased the reserve in the distant past, all current operations also require this large reserve. Large reserves are expensive since they can cause large amounts of cache to be evicted. This patch adds reserve decay. The time-to-decay is inversely proportional to reserve size: 10GB/reserve. This means that a 20MB reserve is halved after 500 operations (10GB/20MB) while a 20kB reserve is halved after 500,000 operations (10GB/20kB). So large, expensive reserves are decayed quickly while small, inexpensive reserves are decayed slowly to reduce the risk of allocation failures and exceptions. A unit test is added. Fixes #325.	2020-09-08 15:59:25 +03:00
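The decay rule quoted in the commit message (time-to-decay = 10GB / reserve) reproduces both worked figures when decimal units are used. A minimal sketch follows; the names are hypothetical, and treating a zero reserve as "never decays" is an assumption of this example rather than documented behavior.

```cpp
#include <cstdint>
#include <limits>

// Decay budget from the commit message: 10GB, taken here as 10^10 bytes.
constexpr std::uint64_t decay_budget_bytes = 10000000000ULL;

// Number of operations after which a reserve of the given size is halved.
// A zero reserve has nothing to decay, so it maps to "never" (assumption).
constexpr std::uint64_t operations_until_halved(std::uint64_t reserve_bytes) {
    return reserve_bytes == 0
        ? std::numeric_limits<std::uint64_t>::max()
        : decay_budget_bytes / reserve_bytes;
}

static_assert(operations_until_halved(20000000) == 500,
              "a 20MB reserve is halved after 500 operations");
static_assert(operations_until_halved(20000) == 500000,
              "a 20kB reserve is halved after 500,000 operations");
```

The inverse proportionality is the point of the design: a reserve a thousand times larger is halved a thousand times sooner, so expensive reserves do not linger.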
Pavel Emelyanov	3237796e00	region: Mark trivial noexcept methods as such Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-07-09 14:41:37 +03:00

139 Commits