Reclaim consults the _regions vector, so we don't want it moving around while
allocating more capacity. For that we take the reclaim lock. However, that
can cause a false-positive OOM during startup:
1. all memory is allocated to LSA as part of priming (2baa16b371)
2. the _regions vector is resized from 64k to 128k, requiring a segment
to be freed (plenty are free)
3. but reclaiming_lock is taken, so we cannot reclaim anything.
To fix, resize the _regions vector outside the lock.
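A minimal sketch of the resulting ordering (names are hypothetical, not the actual Scylla code): grow the vector before taking the reclaim lock, so the allocation it triggers can still be satisfied by reclaiming free segments.

    #include <algorithm>
    #include <cstddef>
    #include <mutex>
    #include <vector>

    struct region_impl;

    struct tracker {
        std::vector<region_impl*> _regions;
        std::mutex _reclaim_mutex; // stand-in for the real reclaim lock

        void register_region(region_impl* r) {
            if (_regions.size() == _regions.capacity()) {
                // May allocate, and therefore reclaim; the lock is not held yet.
                _regions.reserve(std::max<std::size_t>(64, _regions.capacity() * 2));
            }
            std::lock_guard<std::mutex> guard(_reclaim_mutex);
            _regions.push_back(r); // cannot reallocate: capacity reserved above
        }
    };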
Fixes #6003.
Message-Id: <20200311091217.1112081-1-avi@scylladb.com>
"
The original fix (10f6b125c8) didn't take into account the case where
the memtable that failed to flush (Refs flush) is not a flushable
memtable because it's not the latest in the memtable list. If that
happens, it means no other memtable is flushable either, because
otherwise it would have been picked due to evictable_occupancy().
Therefore the right action is to not flush anything in this case.
Suspected to have been observed in #4982. I didn't manage to reproduce
it after triggering a failed memtable flush.
Fixes #3717
"
* tag 'avoid-ooming-with-flush-continuations-v2' of github.com:tgrabiec/scylla:
database: Avoid OOMing with flush continuations after failed memtable flush
lsa: Introduce operator bool() to occupancy_stats
lsa: Expose region_impl::evictable_occupancy in the region class
`segment_manager' now uses a decorated version of `timed_out_error'
with a hardcoded name. On the other hand, `region_group' uses a named
`on_request_expiry' within its `expiring_fifo'.
This simplifies the debug implementation, and it should now work with
scylla-gdb.py.
It is not clear what, if anything, is lost by not using random
ids. They were never being reused in the debug implementation anyway.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190618144755.31212-1-espindola@scylladb.com>
In debug mode the LSA needs objects to be 8-byte aligned in order to
maximise coverage from the AddressSanitizer.
Usually `close_active()` creates a dummy object that covers the end of
the segment being closed. However, if the last real object ends in the
last eight bytes of the segment then that dummy won't be created because
of the alignment requirements. This broke exit conditions on loops
trying to read all objects in the segment and caused them to attempt to
dereference an address at the end of the segment. This patch fixes that.
Fixes#4653.
This change aligns descriptors and values to 8 bytes so that poisoning
a descriptor or value doesn't interfere with other descriptors and
values.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
With this patch, when using asan, we poison segment memory that has
been allocated from the system but should not be accessible to user
code.
Should help with debugging use-after-free bugs.
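The general shape of the technique, using AddressSanitizer's public interface (a sketch, not the actual Scylla code):

    #include <cstddef>
    #include <sanitizer/asan_interface.h> // macros are no-ops without -fsanitize=address

    // Memory held by the segment pool but not handed out to user code is
    // poisoned, so any access to it is reported by AddressSanitizer.
    void on_segment_acquired_from_system(char* seg, std::size_t segment_size) {
        ASAN_POISON_MEMORY_REGION(seg, segment_size);
    }

    void on_object_allocated(char* obj, std::size_t size) {
        ASAN_UNPOISON_MEMORY_REGION(obj, size); // user code may now touch it
    }

    void on_object_freed(char* obj, std::size_t size) {
        ASAN_POISON_MEMORY_REGION(obj, size); // catches use-after-free
    }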
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190607140313.5988-1-espindola@scylladb.com>
A lot of code in scylla is only reachable if SEASTAR_DEFAULT_ALLOCATOR
is not defined. In particular, refill_emergency_reserve in the default
allocator case is empty, but in the seastar allocator case it compacts
segments.
I am trying to debug a crash that seems to involve memory corruption
around the lsa allocator, and being able to use a debug build for that
would be awesome.
This patch reduces the differences between the two cases by having a
common segment_pool that defers only a few operations to different
segment_store implementations.
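A hypothetical sketch of that split (not the actual interface): the shared segment_pool keeps all bookkeeping and compaction logic, and defers only raw segment acquisition and release to a segment_store.

    struct segment;

    struct segment_store {
        virtual ~segment_store() = default;
        virtual segment* allocate() = 0;     // obtain a segment from the backend
        virtual void free(segment* seg) = 0; // return a segment to the backend
    };

    // One implementation sits on top of the seastar allocator, another on top
    // of the standard allocator for SEASTAR_DEFAULT_ALLOCATOR (debug) builds;
    // the segment_pool code above them is shared by both build modes.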
Tests: unit (debug, dev)
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190606020937.118205-1-espindola@scylladb.com>
compact_and_evict() gets memory_to_release in bytes while the
reclamation step is in segments.
Broken in f092decd90.
It doesn't make much difference with the current default step of 1
segment since we cannot reclaim less than that, so shouldn't cause
problems in practice.
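The mismatch, sketched with hypothetical names (and assuming the usual 128 KiB LSA segment size): a step expressed in segments must be scaled before being compared with a byte count.

    #include <cstddef>

    constexpr std::size_t segment_size = 128 * 1024; // assumed segment size

    // Convert the reclamation step (segments) into the units of
    // memory_to_release (bytes) before comparing or subtracting.
    std::size_t step_in_bytes(std::size_t reclamation_step_in_segments) {
        return reclamation_step_in_segments * segment_size;
    }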
Message-Id: <1556013920-29676-1-git-send-email-tgrabiec@scylladb.com>
When we start the LSA reclamation it can be that
segment_pool::_free_segments is 0 under some conditions and
segment_pool::_current_emergency_reserve_goal is set to 1. The
reclamation step is 1 segment, and compact_and_evict_locked() frees 1
segment back into the segment_pool. However,
segment_pool::reclaim_segments() doesn't free anything to the standard
allocator because the condition _free_segments >
_current_emergency_reserve_goal is false. As a result,
tracker::impl::reclaim() returns 0 as the amount of released memory,
tracker::reclaim() returns
memory::reclaiming_result::reclaimed_nothing and the seastar allocator
thinks it's a real OOM and throws std::bad_alloc.
The fix is to change compact_and_evict() to make sure that reserves
are met, by releasing more if they're not met at entry.
This change also allows us to drop the variant of allocate_segment()
which accepts the reclamation step as a means to refill reserves
faster. This is now not needed, because compact_and_evict() will look
at the reserve deficit to increase the amount of memory to reclaim.
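A sketch of the fix with hypothetical names: on entry, widen the request by the reserve deficit, so the segments reclaimed both satisfy the caller and restore the emergency reserve before reclaim_segments() checks it.

    #include <cstddef>

    struct segment_pool_state {
        std::size_t free_segments;
        std::size_t emergency_reserve_goal;
        std::size_t segment_size;
    };

    std::size_t adjusted_memory_to_release(const segment_pool_state& pool,
                                           std::size_t memory_to_release) {
        std::size_t deficit = pool.emergency_reserve_goal > pool.free_segments
                                  ? pool.emergency_reserve_goal - pool.free_segments
                                  : 0;
        return memory_to_release + deficit * pool.segment_size;
    }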
Fixes #4445
Message-Id: <1555671713-16530-1-git-send-email-tgrabiec@scylladb.com>
When --abort-on-lsa-bad-alloc is enabled we want to abort whenever
we think we may be out of memory.
We covered failures due to bad_alloc thrown from inside of the
allocation section, but did not cover failures from reservations done
at the beginning of with_reserve(). Fix by moving the trap into
reserve().
Message-Id: <1553258915-27929-1-git-send-email-tgrabiec@scylladb.com>
allocate_segment() can fail even though we're not out of memory, when
it's invoked inside an allocating section with the cache region
locked. That section may later succeed when retried after memory
reclamation.
We should ignore bad_alloc thrown inside allocating section body and
fail only when the whole section fails.
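The behaviour described above, in simplified form (hypothetical names, not the actual allocating_section implementation):

    #include <new> // std::bad_alloc

    void grow_reserves_or_rethrow(); // hypothetical: reclaims with all regions
                                     // unlocked, rethrows on a real OOM

    template <typename Func>
    auto run_allocating_section(Func&& body) {
        for (;;) {
            try {
                return body(); // reclaim is disabled while the body runs
            } catch (const std::bad_alloc&) {
                // Not necessarily a real OOM: retry with larger reserves.
                grow_reserves_or_rethrow();
            }
        }
    }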
Fixes #2924
Message-Id: <1550597493-22500-1-git-send-email-tgrabiec@scylladb.com>
When the reclaim request was satisfied from the pool there's no need
to call compact_and_evict_locked(). This allows us to avoid calling
boost::range::make_heap(), which is a tiny performance difference, as
well as some confusing log messages.
Message-Id: <1548091941-8534-1-git-send-email-tgrabiec@scylladb.com>
In C++17 there are standard ways of requesting aligned memory, so
seastar doesn't need to provide one.
This patch is in preparation for removing with_alignment from seastar.
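The standard C++17 facilities in question:

    #include <cstdlib> // std::aligned_alloc
    #include <new>     // std::align_val_t, aligned operator new

    int main() {
        // size must be a multiple of the alignment for std::aligned_alloc
        void* a = std::aligned_alloc(64, 4096);
        std::free(a);

        void* b = ::operator new(4096, std::align_val_t(64));
        ::operator delete(b, std::align_val_t(64));
    }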
Tests: unit (debug)
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190107191019.22295-1-espindola@scylladb.com>
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().
Mechanically converted with https://github.com/avikivity/unsprint.
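An illustrative before/after of the conversion (headers and namespaces as in the surrounding code):

    // Before: now throws at runtime, since %s no longer accepts an int
    //   auto s = sprint("%s", 5);
    // After: fmt-style placeholders via seastar's format()
    auto s = format("{}", 5);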
The flusher picks the memtable list which contains the largest region
according to region_impl::evictable_occupancy().total_space(), which
follows region::occupancy().total_space(). But only the latest
memtable in the list can start flushing. It can happen that the
memtable corresponding to the largest region was already flushed to an
sstable (flush permit released), but not yet fsynced or moved to
cache, so it's still in the memtable list.
The latest memtable in the winning list may be small, or empty, in
which case the soft pressure flusher will not be able to make much
progress. There could be other memtable lists with non-empty
(flushable) latest memtables. This can lead to writes unnecessarily
blocking on dirty.
I observed this for the system memtable group, where it's easy for the
memtables to overshoot small soft pressure limits. The flusher kept
trying to flush empty memtables, while the previous non-empty memtable
was still in the group.
The CPU scheduler makes this worse, because it runs memtable_to_cache
in a separate scheduling group, which further delays the removal of the
flushed memtable from the memtable list.
This patch fixes the problem by making regions corresponding to
memtables which started flushing report evictable_occupancy() as 0, so
that they're picked by the flusher last.
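A sketch of the idea with hypothetical member names (not the actual patch): once a flush has started, the region stops advertising its occupancy as evictable, so the flusher ranks it last.

    #include <cstddef>

    struct occupancy_stats {
        std::size_t free_space = 0;
        std::size_t total_space = 0;
    };

    struct region_impl {
        bool _flush_started = false; // hypothetical flag
        occupancy_stats _occupancy;

        occupancy_stats evictable_occupancy() const {
            // A memtable that is already being flushed cannot free memory by
            // being flushed again, so report zero evictable occupancy.
            return _flush_started ? occupancy_stats{} : _occupancy;
        }
    };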
Fixes #3716.
Message-Id: <1535040132-11153-2-git-send-email-tgrabiec@scylladb.com>
Let the user specify which scheduling group should run the
releaser, since it is running functions on the user's behalf.
Perhaps a cleaner interface is to require the user to call
a long-running function for the releaser, and so we'd just
inherit its scheduling group, but that's a much bigger change.
The current LSA sanitizer performs only basic checks on the use of
migrators, without doing any additional reporting in case an error is
detected. This patch enhances it so that when a problem is detected the
relevant stack traces get printed.
object_descriptor uses a special encoding for migrator ids which assumes
that the valid ones are in a range smaller than uint32_t. Let's add some
static asserts that make this fact more visible.
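An illustrative static assert (the constants are hypothetical; the real bounds come from the descriptor encoding):

    #include <cstdint>

    constexpr std::uint32_t max_encodable_migrator_id = 1u << 22; // hypothetical bound
    constexpr std::uint32_t max_registered_migrators = 1024;      // hypothetical

    static_assert(max_registered_migrators < max_encodable_migrator_id,
                  "migrator ids must fit the object_descriptor encoding");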
"
The main optimization is in the patch titled "lsa: Reduce amount of segment compactions".
I measured a 50% reduction in cache update run time in a steady state for an
append-only workload with a large partition, in the perf_row_cache_update version from:
c3f9e6ce1f/tests/perf_row_cache_update.cc
Other workloads, and other allocation sites, could probably also see the
improvement.
"
* tag 'tgrabiec/reduce-lsa-segment-compactions-v1' of github.com:tgrabiec/scylla:
lsa: Expose counters for allocation and compaction throughput
lsa: Reduce amount of segment compactions
lsa: Avoid the call to segment_pool::descriptor() in compact()
lsa: Make reclamation on reserve refill more efficient
Reclaiming memory through segment compaction is expensive. For
occupancy of 85%, in order to reclaim one free segment, we need to
compact 7 segments, by migrating 6 segments worth of data. This results
in significant amplification. Compaction involves moving objects,
which in some cases is expensive in itself as well
(See https://github.com/scylladb/scylla/issues/3247).
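To make the amplification explicit: compacting n segments at occupancy f yields
n*(1-f) segments of free space, so freeing one whole segment requires
n = 1/(1-f) segments compacted and n-1 segments' worth of data migrated. At
f = 0.85 that gives n = 1/0.15 which is about 6.7, i.e. the 7 compactions and
roughly 6 migrated segments above.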
This patch reduces the amount of segment compaction in favor of doing
more eviction. It especially helps workloads in which LRU order
matches allocation order, in which case there will be no segment
compaction, just eviction.
In the perf_row_cache_update test case for a large partition with lots of
rows, which simulates an appending workload, I measured that for each new
object allocated, 2 needed to be migrated before the patch. After the
patch, only 0.003 objects are migrated per allocation. This reduces the
run time of the cache update part by 50%.
Currently reserve refill allocates segments repeatedly until the
reserve threshold is met. If a single segment allocation needs to
reclaim memory, it will ask the reclaimer for one segment. The
reclaimer could make better decisions if it knew the total number of
segments we try to allocate. In particular, it would not attempt to
compact any segment until it has evicted that total amount of memory
first, which may reduce the total amount of segment compaction during
refill.
This patch changes refill to increase the reclamation step used by
allocate_segment() so that it matches the total amount of memory we
refill.
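A sketch of the refill change (hypothetical names): size the reclamation step to the whole deficit up front, so the reclaimer can prefer eviction over per-segment compaction for the duration of the refill.

    #include <cstddef>

    struct pool {
        std::size_t free_segments = 0;
        std::size_t reserve_threshold = 0;
        std::size_t current_reclamation_step = 1; // consulted by allocate_segment()

        // Stand-in: the real allocate_segment() may reclaim
        // current_reclamation_step segments' worth of memory first.
        void allocate_segment() { ++free_segments; }

        void refill_reserve() {
            if (free_segments >= reserve_threshold) {
                return;
            }
            current_reclamation_step = reserve_threshold - free_segments;
            while (free_segments < reserve_threshold) {
                allocate_segment();
            }
            current_reclamation_step = 1;
        }
    };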
While the migration function should have enough information to obtain
the object size itself, the LSA logic needs to compute it as well.
IMR is going to make calculating object sizes more expensive, so by
providing the information to the migrator we can avoid some needless
operations.
It is non-trivial to get the size of an IMR object. However, the
standard allocator doesn't really need it and LSA can compute it itself
by asking the migrator.
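Roughly the shape of the migrator interface after this change (a sketch, not an exact copy of the Scylla declaration):

    #include <cstddef>

    struct migrate_fn_type {
        virtual ~migrate_fn_type() = default;
        // LSA already computed the size, so pass it along rather than making
        // the migrator recompute it for an IMR object.
        virtual void migrate(void* src, void* dst, std::size_t size) const = 0;
        // For callers that don't know the size up front (e.g. the standard
        // allocator path), LSA can still ask the migrator.
        virtual std::size_t size(const void* obj) const = 0;
    };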
Having migrators dynamically registered and deregistered opens a new
class of bugs. This patch adds some additional checks in debug mode
with the hope of catching any misuse early.
With the introduction of the new in-memory representation we will get
type- and schema-dependent migrators. Since there is no bound on how many
times they can be created and destroyed, it is better to be safe and
reuse registered migrator ids.
* seastar 70aecca...ac02df7 (5):
> Merge "Prefix preprocessor definitions" from Jesse
> cmake: Do not enable warnings transitively
> posix: prevent unused variable warning
> build: Adjust DPDK options to fix compilation
> io_scheduler: adjust property names
DEBUG, DEFAULT_ALLOCATOR, and HAVE_LZ4_COMPRESS_DEFAULT macro
references are now prefixed with SEASTAR_. Some may need to become
Scylla macros.
_lsa_managed is always 1:1 with _region, so we can remove it, saving
some space in the segment descriptor vector.
Tests: unit (release), logalloc_test (debug)
Message-Id: <20180410122606.10671-1-avi@scylladb.com>
Address Sanitizer has a global limit on the number of allocations
(note: not number of allocations less number of frees, but cumulative
number of allocations). Running some tests in debug mode on a machine
with sufficient memory can break that limit.
Work around that limit by restricting the amount of memory the
debug mode segment_pool can allocate. It's also nicer for running
the test on a workstation.
To segregate std and lsa allocations, we prime the segment pool
during initialization so that lsa will release lower-addressed
memory to std, rather than lsa and std competing for memory at
random addresses.
However, tests often evict all of lsa memory for their own
purposes, which defeats this priming.
Extract the functionality into a new prime_segment_pool()
function for use in tests that rely on allocation segregation.
We may fail to reclaim because a region has reclaim disabled (usually because
it is in an allocating_section). Failed reclaims can cause high CPU usage
if all of the lower addresses happen to be in a reclaim-disabled region (this
is somewhat mitigated by the fact that checking for reclaim disabled is very
cheap), but worse, failing a segment reclaim can lead to reclaimed memory
being fragmented. This results in the original allocation continuing to fail.
To combat that, we limit the number of failed reclaims. If we reach the limit,
we fail the reclaim. The surrounding allocating_section will release the
reclaim_lock, and increase reserves, which will result in reclaim being
retried with all regions being reclaimable, and succeed in allocating
contiguous memory.
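A sketch of the bounded-failure loop (names and the limit are hypothetical): give up after a fixed number of failed per-segment reclaims so the surrounding allocating_section can drop its reclaim_lock, grow reserves and retry.

    #include <cstddef>

    constexpr std::size_t max_failed_reclaims = 10; // hypothetical limit

    bool try_reclaim_segment(std::size_t idx); // the real attempt; omitted here

    bool reclaim_one_segment(std::size_t n_segments) {
        std::size_t failed = 0;
        for (std::size_t i = 0; i < n_segments; ++i) {
            if (try_reclaim_segment(i)) {
                return true;
            }
            if (++failed >= max_failed_reclaims) {
                return false; // caller retries with all regions reclaimable
            }
        }
        return false;
    }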
Since lsa tries to keep some non-lsa memory as reserve, we end up
with three blocks of memory: at low addresses, non-lsa memory that was
allocated during startup or subsequently freed by lsa; at middle addresses,
lsa; and at the top addresses, memory that lsa left alone during initial
cache population due to the reserve.
After time passes, both std and lsa will allocate from the top section,
causing a mix of lsa and non-lsa memory. Since lsa tries to free from
lower addresses, this mix will stay there forever, increasing fragmentation.
Fix that by disabling the reserve during startup and allocating all of memory
for lsa. Any further allocation will then have to be satisfied by lsa first
freeing memory from the low addresses, so we will now have just two sections
of memory: low addresses for std, and top addresses for lsa.
Note that this startup allocation does not page in lsa segments, since the
segment constructor does not touch memory.
This patch replaces the zones mechanism with something simpler: a
single segment at a time is moved from the standard allocator to lsa,
and vice versa. Fragmentation resistance is (hopefully) achieved
by having lsa prefer high addresses for lsa data, and return segments
at low address to the standard allocator. Over time, the two will move
apart.
Moving just one segment at a time reduces the latency costs of
transferring memory between free and std.
LSA being an allocator built on top of the standard may hide some
erroneous usage from AddressSanitizer. Moreover, it has its own classes
of bugs that could be caused by incorrect user behaviour (e.g. migrator
returning wrong object size).
This patch adds a basic sanitizer for the LSA that is active in debug
mode and verifies that the allocator is used correctly. If a problem is
found, it prints information about the affected object that it has
collected earlier: the address and size of the object as well as a
backtrace of the allocation site. At the moment the following errors are
being checked for:
* leaks, objects not freed at region destructor
* attempts to free objects at invalid address
* mismatch between object size at allocation and free
* mismatch between object size at allocation and as reported by the
migrator
* internal LSA error: attempt to allocate object at already used
address
* internal LSA error: attempt to merge regions containing allocated
objects at conflicting addresses
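A sketch of the bookkeeping such a sanitizer needs (hypothetical layout): record each live allocation's size and allocation-site backtrace, verify on free, and report whatever is left at region destruction.

    #include <cstddef>
    #include <string>
    #include <unordered_map>

    struct allocation_info {
        std::size_t size;
        std::string backtrace; // captured at the allocation site
    };

    struct lsa_sanitizer {
        std::unordered_map<const void*, allocation_info> _live;

        void on_allocate(const void* obj, std::size_t size, std::string bt) {
            if (!_live.emplace(obj, allocation_info{size, std::move(bt)}).second) {
                report_error("allocation at already used address", obj);
            }
        }

        void on_free(const void* obj, std::size_t size) {
            auto it = _live.find(obj);
            if (it == _live.end()) {
                report_error("free of invalid address", obj);
            } else if (it->second.size != size) {
                report_error("size mismatch between allocation and free", obj);
            } else {
                _live.erase(it);
            }
        }

        // Prints the error together with the saved allocation info; omitted.
        void report_error(const char* what, const void* obj);
    };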
Message-Id: <20180226122314.32049-1-pdziepak@scylladb.com>