scylladb

Author	SHA1	Message	Date
Avi Kivity	7161244130	Merge seastar upstream * seastar 70aecca...ac02df7 (5): > Merge "Prefix preprocessor definitions" from Jesse > cmake: Do not enable warnings transitively > posix: prevent unused variable warning > build: Adjust DPDK options to fix compilation > io_scheduler: adjust property names DEBUG, DEFAULT_ALLOCATOR, and HAVE_LZ4_COMPRESS_DEFAULT macro references prefixed with SEASTAR_. Some may need to become Scylla macros.	2018-04-29 11:03:21 +03:00
Avi Kivity	fc488adc72	logalloc: remove segment_descriptor::_lsa_managed _lsa_managed is always 1:1 with _region, so we can remove it, saving some space in the segment descriptor vector. Tests: unit (release), logalloc_test (debug) Message-Id: <20180410122606.10671-1-avi@scylladb.com>	2018-04-10 13:54:38 +01:00
Avi Kivity	2c670f6161	logalloc: limit std segment allocations in debug mode Address Sanitizer has a global limit on the number of allocations (note: not number of allocations less number of frees, but cumulative number of allocations). Running some tests in debug mode on a machine with sufficient memory can break that limit. Work around that limit by restricting the amount of memory the debug mode segment_pool can allocate. It's also nicer for running the test on a workstation.	2018-04-07 21:04:10 +03:00
Avi Kivity	2baa16b371	logalloc: introduce prime_segment_pool() To segregate std and lsa allocations, we prime the segment pool during initialization so that lsa will release lower-addressed memory to std, rather than lsa and std competing for memory at random addresses. However, tests often evict all of lsa memory for their own purposes, which defeats this priming. Extract the functionality into a new prime_segment_pool() function for use in tests that rely on allocation segregation.	2018-04-07 14:52:58 +03:00
Avi Kivity	ff6325ee7e	logalloc: limit non-contiguous reclaims We may fail to reclaim because a region has reclaim disabled (usually because it is in an allocating_section). Failed reclaims can cause high CPU usage if all of the lower addresses happen to be in a reclaim-disabled region (this is somewhat mitigated by the fact that checking for reclaim disabled is very cheap), but worse, failing a segment reclaim can lead to reclaimed memory being fragmented. This results in the original allocation continuing to fail. To combat that, we limit the number of failed reclaims. If we reach the limit, we fail the reclaim. The surrounding allocating_section will release the reclaim_lock, and increase reserves, which will result in reclaim being retried with all regions being reclaimable, and succeed in allocating contiguous memory.	2018-04-07 14:52:58 +03:00
Avi Kivity	c6c659ce7a	logalloc: pre-allocate all memory as lsa on startup Since lsa tries to keep some non-lsa memory as reserve, we end up with three blocks of memory: at low addresses, non-lsa memory that was allocated during startup or subsequently freed by lsa; at middle addresses, lsa; and at the top addresses, memory that lsa left alone during initial cache population due to the reserve. After time passes, both std and lsa will allocate from the top section, causing a mix of lsa and non-lsa memory. Since lsa tries to free from lower addresses, this mix will stay there forever, increasing fragmentation. Fix that by disabling the reserve during startup and allocating all of memory for lsa. Any further allocation will then have to be satisfied by lsa first freeing memory from the low addresses, so we will now have just two sections of memory: low addresses for std, and top addresses for lsa. Note that this startup allocation does not page in lsa segments, since the segment constructor does not touch memory.	2018-04-07 14:52:58 +03:00
Avi Kivity	14510ae986	dynamic_bitset: get rid of resize() Makes it easier to modify later on. Maybe "dynamic" is not so justified now.	2018-04-07 14:52:58 +03:00
Avi Kivity	3f17dbfcbc	logalloc: get rid of the emergency reserve stack Instead of keeping specific segments in the emergency reserve, just keep the number of segments in the reserve. This simplifies the code considerably.	2018-04-07 14:52:55 +03:00
Avi Kivity	fa73d844e9	logalloc: replace zones with segment-at-a-time alloc/free This patch replaces the zones mechanism with something simpler: a single segment is moved from the standard allocator to lsa and vice versa, at a time. Fragmentation resistance is (hopefully) achieved by having lsa prefer high addresses for lsa data, and return segments at low address to the standard allocator. Over time, the two will move apart. Moving just one segment at a time reduces the latency costs of transferring memory between free and std.	2018-04-07 13:48:40 +03:00
Paweł Dziepak	5dfa36c526	lsa: add basic sanitizer LSA being an allocator built on top of the standard may hide some erroneous usage from AddressSanitizer. Moreover, it has its own classes of bugs that could be caused by incorrect user behaviour (e.g. migrator returning wrong object size). This patch adds basic sanitizer for the LSA that is active in the debug mode and verifies if the allocator is used correctly and if a problem is found prints information about the affected object that it has collected earlier. That includes the address and size of an object as well as backtrace of the allocation site. At the moment the following errors are being checked for: * leaks, objects not freed at region destructor * attempts to free objects at invalid address * mismatch between object size at allocation and free * mismatch between object size at allocation and as reported by the migrator * internal LSA error: attempt to allocate object at already used address * internal LSA error: attempt to merge regions containing allocated objects at conflicting addresses Message-Id: <20180226122314.32049-1-pdziepak@scylladb.com>	2018-02-26 14:35:13 +02:00
Tomasz Grabiec	7e0ff8a920	lsa: Disable allocation failure injection inside merge() Fixes termination in tests due to throw from merge(), which is noexcept.	2018-02-14 16:42:49 +01:00
Tomasz Grabiec	66701c1671	lsa: Make region deregistration robust against duplicates	2018-02-14 16:42:49 +01:00
Tomasz Grabiec	cf876bbe2d	lsa: Make region allocation exception safe We were not unregistering in case add() fails.	2018-02-14 16:42:49 +01:00
Paweł Dziepak	dcd79af8ed	lsa: optimise disabling reclamation and invalidation counter Most of the lsa gory details are hidden in utils/logalloc.cc. That includes the actual implementation of a lsa region: region_impl. However, there is code in the hot path that often accesses the _reclaiming_enabled member as well as its base class allocation_strategy. In order to optimise those accesses another class is introduced: basic_region_impl that inherits from allocation_strategy and is a base of region_impl. It is defined in utils/logalloc.hh so that it is publicly visible and its member functions are inlineable from anywhere in the code. This class is supposed to be as small as possible, but contain all members and functions that are accessed from the fast path and should be inlined.	2018-01-30 18:33:26 +01:00
Paweł Dziepak	d825ae37bf	lsa: split alloc section into reserving and reclamation-disabled parts Allocating sections reserve a certain amount of memory, then disable reclamation and attempt to perform the given operation. If that fails due to std::bad_alloc the reserve is increased and the operation is retried. Reserving memory is expensive while just disabling reclamation isn't. Moreover, the code that runs inside the section needs to be safely retryable. This means that we want the amount of logic running with reclamation disabled as small as possible, even if it means entering and leaving the section multiple times. In order to reduce the performance penalty of such a solution the memory reserving and reclamation disabling parts of the allocating sections are separated.	2018-01-30 18:33:26 +01:00
Tomasz Grabiec	5c85e9c2db	lsa: Expose max_zone_segments for tests	2018-01-16 13:17:20 +01:00
Tomasz Grabiec	99708cc498	lsa: Expose tracker::non_lsa_used_space() So that it can be used in unit tests.	2018-01-16 13:17:20 +01:00
Tomasz Grabiec	e5f8176c32	lsa: Fix memory leak on zone reclaim _free_segments_in_zones is not adjusted by segment_pool::reclaim_segments() for empty zones on reclaim under some conditions. For instance when some zone becomes empty due to regular free() and then reclaiming is called from the std allocator, and it is satisfied from a zone after the one which is empty. This would result in free memory in such zone to appear as being leaked due to corrupted free segment count, which may cause a later reclaim to fail. This could result in bad_allocs. The fix is to always collect such zones. Fixes #3129 Refs #3119 Refs #3120	2018-01-16 13:17:11 +01:00
Tomasz Grabiec	34ccf234ea	Integrate with allocation failure injection framework	2017-11-07 15:33:24 +01:00
Avi Kivity	a2f26f7b29	log_histogram: rename to log_heap log_histogram is not really a histogram, it is a heap-like container. Rename to log_heap in case we do want a log_histogram one day. Message-Id: <20170916172137.30941-1-avi@scylladb.com>	2017-09-18 12:44:05 +02:00
Tomasz Grabiec	87be474c19	lsa: Move reclaim counter concept to allocation_strategy level So that generic code can detect invalidation of references. Also, to allow reusing the same mechanism for signalling external reference invalidation.	2017-09-13 17:38:08 +02:00
Tomasz Grabiec	5d2f2bc90b	lsa: Mark region::merge() as noexcept It seems to satisfy this, and row_cache::do_update() will rely on it to simplify error handling. Message-Id: <1504023113-30374-1-git-send-email-tgrabiec@scylladb.com>	2017-08-29 19:17:17 +03:00
Avi Kivity	5a2439e702	main: check for large allocations Large allocations can require cache evictions to be satisfied, and can therefore induce long latencies. Enable the seastar large allocation warning so we can hunt them down and fix them. Message-Id: <20170819135212.25230-1-avi@scylladb.com>	2017-08-21 10:25:40 +03:00
Tomasz Grabiec	3489c68a68	lsa: Fix performance regression in eviction and compact_on_idle Region comparator, used by the two, calls region_impl::min_occupancy(), which calls log_histogram::largest(). The latter is O(N) in terms of the number of segments, and is supposed to be used only in tests. We should call one_of_largest() instead, which is O(1). This caused compact_on_idle() to take more CPU as the number of segments grew (even when there was nothing to compact). Eviction would see the same kind of slow down as well. Introduced in `11b5076b3c`. Message-Id: <1498641973-20054-1-git-send-email-tgrabiec@scylladb.com>	2017-06-28 12:32:43 +03:00
Avi Kivity	ebaeefa02b	Merge seastar upstream (seastar namespace) - introduced "seastarx.hh" header, which does a "using namespace seastar"; - 'net' namespace conflicts with seastar::net, renamed to 'netw'. - 'transport' namespace conflicts with seastar::transport, renamed to cql_transport. - "logger" global variables now conflict with logger global type, renamed to xlogger. - other minor changes	2017-05-21 12:26:15 +03:00
Avi Kivity	1d12d69881	logalloc: define segment_zone::maximum_size Yield build errors with some compilers, if missing.	2017-05-01 16:31:29 +03:00
Paweł Dziepak	f5cf86484e	lsa: introduce upper bound on zone size Attempting to create huge zones may introduce significant latency. This patch introduces the maximum allowed zone size so that the time spent trying to allocate and initialising zone is bounded. Fixes #2335. Message-Id: <20170428145916.28093-1-pdziepak@scylladb.com>	2017-04-30 10:58:11 +03:00
Pekka Enberg	940c3f4330	Merge "Clang fixes (part 2)" from Avi "This series fixes some more errors found by clang, with the aim of enabling clang/zapcc as a supported compiler. A single issue remains, but it's probably in std::experimental::optional::swap(); not in our code." * tag 'clang/2/v1' of https://github.com/avikivity/scylla: sstable_test: avoid passing negative non-type template arguments to unsigned parameters UUID: add more comparison operators sstable_datafile_test: avoid string_view user-defined literal conversion operator mutation_source_test: avoid template function without template keyword cql_query_test: define static variable cql_query_test: add braces for single-item collection initializers storage_service: don't use typeid(temporary) logalloc: remove unused max_occupancy_for_compaction storage_proxy: drop overzealous use of __int128_t in recently-modified-no-read-repair logic storage_proxy: drop unused member access from return value storage_proxy: fix reference bound to temporary in data_read_resolver::less_compare read_repair_decision: fix operator<<(std::ostream&, ...)	2017-04-24 20:32:16 +03:00
Avi Kivity	6d9e18fd61	logalloc: reduce descriptor overhead Every lsa-allocated object is prefixed by a header that contains information needed to free or migrate it. This includes its size (for freeing) and an 8-byte migrator (for migrating). Together with some flags, the overhead is 14 bytes (16 bytes if the default alignment is used). This patch reduces the header size to 1 byte (8 bytes if the default alignment is used). It uses the following techniques: - ULEB128-like encoding (actually more like ULEB64) so a live object's header can typically be stored using 1 byte - indirection, so that migrators can be encoded in a small index pointing to a migrator table, rather than using an 8-byte pointer; this exploits the fact that only a small number of types are stored in LSA - moving the responsibility for determining an object's size to its migrator, rather than storing it in the header; this exploits the fact that the migrator stores type information, and object size is in fact information about the type The patch improves the results of memory_footprint_test as follows: Before: - in cache: 976 - in memtable: 947 After: mutation footprint: - in cache: 880 - in memtable: 858 A reduction of about 10%. Further reductions are possible by reducing the alignment of lsa objects. logalloc_test was adjusted to free more objects, since with the lower footprint, rounding errors (to full segments) are different and caused false errors to be detected. Missing: adjustments to scylla-gdb.py; will be done after we agree on the new descriptor's format.	2017-04-24 12:23:12 +02:00
Avi Kivity	9303b09a64	logalloc: remove unused max_occupancy_for_compaction Noticed by clang.	2017-04-22 21:09:41 +03:00
Tomasz Grabiec	20f4c9bf23	lsa: Reduce reclamation latency Currently eviction is performed until occupancy of the whole region drops below the 85% threshold. This may take a while if region had high occupancy and is large. We could improve the situation by only evicting until occupancy of the sparsest segment drops below the threshold, as is done by this change. I tested this using a c-s read workload in which the condition triggers in the cache region, with 1G per shard: lsa-timing - Reclamation cycle took 12.934 us. lsa-timing - Reclamation cycle took 47.771 us. lsa-timing - Reclamation cycle took 125.946 us. lsa-timing - Reclamation cycle took 144356 us. lsa-timing - Reclamation cycle took 655.765 us. lsa-timing - Reclamation cycle took 693.418 us. lsa-timing - Reclamation cycle took 509.869 us. lsa-timing - Reclamation cycle took 1139.15 us. The 144ms pause is when large eviction is necessary. Statistics for reclamation pauses for a read workload over larger-than-memory data set: Before: avg = 865.796362 stdev = 10253.498038 min = 93.891000 max = 264078.000000 sum = 574022.988000 samples = 663 After: avg = 513.685650 stdev = 275.270157 min = 212.286000 max = 1089.670000 sum = 340573.586000 samples = 663 Refs #1634. Message-Id: <1484730859-11969-1-git-send-email-tgrabiec@scylladb.com>	2017-04-21 12:52:31 +02:00
Tomasz Grabiec	c83768d6bb	log_histogram: Allow non-power-of-two minimum values We will want to reuse the min_size mechanism for the whole compaction threshold, including the occupancy threshold. That threshold is close to the segment size and we cannot pick a power of two which would be close enough to what we need. Therefore, change log_histogram to support arbitrary minimum base. bucket_of() was moved into log_histogram_options so that it can be used in number_of_buckets(), which makes for a simple and much less error-prone implementation.	2017-04-21 10:54:50 +02:00
Tomasz Grabiec	7a800c54bf	lsa: Use regular compaction threshold in on-idle compaction Idle-time compaction should not produce not-compactible segments because that means we would have to evict a lot when we finally need to reclaim some memory, so that occupancy falls below the regular compaction threshold. This may cause latency spikes. Refs #1634.	2017-04-20 15:00:15 +02:00
Tomasz Grabiec	7aa286439f	lsa: Add getter for region's eviction function	2017-04-20 14:51:42 +02:00
Duarte Nunes	af37a3fdbf	logalloc: Fix compilation error This patch moves a function using the region_impl type after the type has been defined. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170418124551.25369-1-duarte@scylladb.com>	2017-04-18 15:56:26 +03:00
Avi Kivity	844529fe33	logalloc: avoid auto in function argument declaration 'auto' in a non-lambda function argument is not legal C++, and is hard to read besides. Replace with the right type. Since the right type is private, add some friendship.	2017-04-17 23:18:44 +03:00
Paweł Dziepak	0318dccafd	lsa: avoid unnecessary segment migrations during reclaim segment_zone::migrate_all_segments() was trying to migrate all segments inside a zone to the other one hoping that the original one could be completely freed. This was an attempt to optimise for throughput. However, this may unnecessarily hurt latency if the zone is large, but only few segments are required to satisfy reclaimer's demands. Message-Id: <20170410171912.26821-1-pdziepak@scylladb.com>	2017-04-11 08:55:29 +02:00
Tomasz Grabiec	3609665b19	lsa: Fix debug-mode compilation error By moving definitions of setters out of #ifdef	2017-03-16 18:23:05 +01:00
Tomasz Grabiec	88e7b3ff79	lsa: Ensure can_allocate_more_memory() always leaves a gap above seastar's min_free_memory() One of the goals of can_allocate_more_memory() is to prevent depleting seastar's free memory close to its minimum, leaving a head room above that minimum so that standard allocations will not cause reclamation immediately. Currently the function doesn't take into account the actual threshold used by the seastar allocator, so there could be no gap or even could go below the minimum. Fix that by ensuring there's always a gap above min_free_memory(). min_gap was reduced to 1 MiB so that low memory setups are not impacted significantly by the change. Message-Id: <1489667863-15099-1-git-send-email-tgrabiec@scylladb.com>	2017-03-16 12:42:50 +00:00
Tomasz Grabiec	4ab8b255da	lsa: Allow adjusting reserves in allocating_section	2017-03-16 10:21:10 +01:00
Duarte Nunes	d32c848d73	utils/logalloc: Change linkage of hist_options to external Change linkage of segment_descriptor_hist_options to external to keep good old GCC5 happy, despite C++11 allowing static linkage of non-type template arguments. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170309213206.10383-1-duarte@scylladb.com>	2017-03-10 11:02:51 +02:00
Duarte Nunes	ca4f5cabd4	lsa: Extract log_histogram class Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-03-04 14:47:19 +01:00
Duarte Nunes	2b6abd5a91	lsa: Make log_histogram more generic Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-03-03 17:59:17 +01:00
Duarte Nunes	3819e6d55f	lsa: log_histogram cleanups Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-03-03 17:09:07 +01:00
Duarte Nunes	11b5076b3c	lsa: Use log histogram for closed segments This patch replaces the current heap with a logarithmic histogram to hold the closed segment descriptors. This histogram stores elements in different buckets according to their size. Values are mapped to a sequence of power-of-two ranges that are split in N sub-buckets. Values less than a minimum value are placed in bucket 0, whereas values bigger than a maximum value are not admitted. There is some loss of precision as segments are now not totally ordered, and precision decreases the more sparse a segment is. This allows to reduce the cost of the computations needed when freeing from a closed segment. Performance results for perf_simple_query -c4 --duration 60 before after diff read 43954.27 45246.10 +2.9% write 48911.54 52807.76 +7.9% Fixes #1442 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170227235328.27937-1-duarte@scylladb.com>	2017-02-28 18:40:38 +02:00
Tomasz Grabiec	c70ebc7ca5	lsa: Make reclaim_timer enclose segment_pool::reclaim_segments() LSA timing did not include segment migration. It does after this change. Message-Id: <1486657046-9378-1-git-send-email-tgrabiec@scylladb.com>	2017-02-09 17:07:59 +00:00
Tomasz Grabiec	e40fb438f5	lsa: Avoid avalanche releasing of requests Before, the logic for releasing writes blocked on dirty worked like this: 1) When region group size changes and it is not under pressure and there are some requests blocked, then schedule request releasing task 2) request releasing task, if no pressure, runs one request and if there are still blocked requests, schedules next request releasing task If requests don't change the size of the region group, then either some request executes or there is a request releasing task scheduled. The amount of scheduled tasks is at most 1, there is a single thread of execution. However, if requests themselves would change the size of the group, then each such change would schedule yet another request releasing thread, growing the task queue size by one. The group size can also change when memory is reclaimed from the groups (e.g. when contains sparse segments). Compaction may start many request releasing threads due to group size updates. Such behavior is detrimental for performance and stability if there are a lot of blocked requests. This can happen on 1.5 even with modest concurrency because timed out requests stay in the queue. This is less likely on 1.6 where they are dropped from the queue. The releasing of tasks may start to dominate over other processes in the system. When the amount of scheduled tasks reaches 1000, polling stops and server becomes unresponsive until all of the released requests are done, which is either when they start to block on dirty memory again or run out of blocked requests. It may take a while to reach pressure condition after memtable flush if it brings virtual dirty much below the threshold, which is currently the case for workloads with overwrites producing sparse regions. Refs #2021. Fix by ensuring there is at most one request releasing thread at a time. There will be one releasing fiber per region group which is woken up when pressure is lifted. It executes blocked requests until pressure occurs. The logic for notification across hierarchy was replaced by calling region_group::notify_relief() from region_group::update() on the broadest relieved group.	2017-02-01 17:41:55 +01:00
Tomasz Grabiec	d55baa0cd1	lsa: Move definitions to .cc	2017-02-01 17:41:55 +01:00
Tomasz Grabiec	8f8b111b33	lsa: Simplify hard pressure notification management The hard pressure was only signalled on region group when run_when_memory_available() was called after the pressure condition was met. So the following loop is always an infinite loop rather than stopping when enough is allocated to cause pressure: while (!gr.under_pressure()) { region.allocate(...); } It's cleaner if pressure notification works not only if run_when_memory_available() is used but whenever condition changes, like we do for the soft pressure. There is a comment in run_when_memory_available() which gives reasons why notifications are called from there, but I think those reasons no longer hold: - we already notify on soft pressure conditions from update(), and if that is safe, notifying about hard pressure should also be safe. I checked and it looks safe to me. - avoiding notification in the rare case when we stopped writing right after crossing the threshold doesn't seem beneficial. It's unlikely in the first place, and one could argue it's better to actually flush now so that when writes resume they will not block.	2017-02-01 17:41:55 +01:00
Tomasz Grabiec	2c7902fb2b	Revert "lsa: Reduce reclamation latency" This reverts commit `d61002cc33`. Introduced a regression in row_cache_alloc_stress. The problem is that reclaim_from_evictable() evicts way too much after the refactor due to the stop condition not taking into account how much data was evicted so far and only looking at occupancy of the minimal segment. This may lead to eviction of the whole region.	2017-01-26 10:43:18 +01:00
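
Note on commit 6d9e18fd61 ("logalloc: reduce descriptor overhead"): the message describes a "ULEB128-like" encoding that lets a small header value fit in one byte. The following is a minimal sketch of plain unsigned LEB128, given only to illustrate the general idea. It is not the descriptor format used in utils/logalloc.cc, which the log does not show, and the names encode_varint and decode_varint are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from logalloc): append `value` as an unsigned
// LEB128 varint. Each byte carries 7 payload bits, least significant
// group first; the high bit is set when more bytes follow.
inline void encode_varint(uint64_t value, std::vector<uint8_t>& out) {
    do {
        uint8_t byte = value & 0x7f;
        value >>= 7;
        if (value != 0) {
            byte |= 0x80;  // continuation bit
        }
        out.push_back(byte);
    } while (value != 0);
}

// Hypothetical helper: decode one varint starting at `pos` and advance
// `pos` past the bytes consumed. Stops at the end of the buffer or after
// 64 bits of payload, so malformed input cannot cause an oversized shift.
inline uint64_t decode_varint(const std::vector<uint8_t>& in, size_t& pos) {
    uint64_t value = 0;
    unsigned shift = 0;
    while (pos < in.size() && shift < 64) {
        uint8_t byte = in[pos++];
        value |= uint64_t(byte & 0x7f) << shift;
        if ((byte & 0x80) == 0) {
            break;  // last byte of this varint
        }
        shift += 7;
    }
    return value;
}
```

With this scheme any value below 128 occupies a single byte, which is consistent with the commit's remark that a live object's header can typically be stored using 1 byte once the migrator is reduced to a small table index.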

146 Commits