scylladb

Author	SHA1	Message	Date
Tomasz Grabiec	34ccf234ea	Integrate with allocation failure injection framework	2017-11-07 15:33:24 +01:00
Avi Kivity	a2f26f7b29	log_histogram: rename to log_heap log_histogram is not really a histogram, it is a heap-like container. Rename to log_heap in case we do want a log_histogram one day. Message-Id: <20170916172137.30941-1-avi@scylladb.com>	2017-09-18 12:44:05 +02:00
Tomasz Grabiec	87be474c19	lsa: Move reclaim counter concept to allocation_strategy level So that generic code can detect invalidation of references. Also, to allow reusing the same mechanism for signalling external reference invalidation.	2017-09-13 17:38:08 +02:00
Tomasz Grabiec	5d2f2bc90b	lsa: Mark region::merge() as noexcept It seems to satisfy this, and row_cache::do_update() will rely on it to simplify error handling. Message-Id: <1504023113-30374-1-git-send-email-tgrabiec@scylladb.com>	2017-08-29 19:17:17 +03:00
Avi Kivity	5a2439e702	main: check for large allocations Large allocations can require cache evictions to be satisfied, and can therefore induce long latencies. Enable the seastar large allocation warning so we can hunt them down and fix them. Message-Id: <20170819135212.25230-1-avi@scylladb.com>	2017-08-21 10:25:40 +03:00
Tomasz Grabiec	3489c68a68	lsa: Fix performance regression in eviction and compact_on_idle Region comparator, used by the two, calls region_impl::min_occupancy(), which calls log_histogram::largest(). The latter is O(N) in terms of the number of segments, and is supposed to be used only in tests. We should call one_of_largest() instead, which is O(1). This caused compact_on_idle() to take more CPU as the number of segments grew (even when there was nothing to compact). Eviction would see the same kind of slow down as well. Introduced in `11b5076b3c`. Message-Id: <1498641973-20054-1-git-send-email-tgrabiec@scylladb.com>	2017-06-28 12:32:43 +03:00
Avi Kivity	ebaeefa02b	Merge seastar upstream (seastar namespace) - introduced "seastarx.hh" header, which does a "using namespace seastar"; - 'net' namespace conflicts with seastar::net, renamed to 'netw'. - 'transport' namespace conflicts with seastar::transport, renamed to cql_transport. - "logger" global variables now conflict with logger global type, renamed to xlogger. - other minor changes	2017-05-21 12:26:15 +03:00
Avi Kivity	1d12d69881	logalloc: define segment_zone::maximum_size Yield build errors with some compilers, if missing.	2017-05-01 16:31:29 +03:00
Paweł Dziepak	f5cf86484e	lsa: introduce upper bound on zone size Attempting to create huge zones may introduce significant latency. This patch introduces the maximum allowed zone size so that the time spent trying to allocate and initialising zone is bounded. Fixes #2335. Message-Id: <20170428145916.28093-1-pdziepak@scylladb.com>	2017-04-30 10:58:11 +03:00
Pekka Enberg	940c3f4330	Merge "Clang fixes (part 2)" from Avi "This series fixes some more errors found by clang, with the aim of enabling clang/zapcc as a supported compiler. A single issue remains, but it's probably in std::experimental::optional::swap(); not in our code." * tag 'clang/2/v1' of https://github.com/avikivity/scylla: sstable_test: avoid passing negative non-type template arguments to unsigned parameters UUID: add more comparison operators sstable_datafile_test: avoid string_view user-defined literal conversion operator mutation_source_test: avoid template function without template keyword cql_query_test: define static variable cql_query_test: add braces for single-item collection initializers storage_service: don't use typeid(temporary) logalloc: remove unused max_occupancy_for_compaction storage_proxy: drop overzealous use of __int128_t in recently-modified-no-read-repair logic storage_proxy: drop unused member access from return value storage_proxy: fix reference bound to temporary in data_read_resolver::less_compare read_repair_decision: fix operator<<(std::ostream&, ...)	2017-04-24 20:32:16 +03:00
Avi Kivity	6d9e18fd61	logalloc: reduce descriptor overhead Every lsa-allocated object is prefixed by a header that contains information needed to free or migrate it. This includes its size (for freeing) and an 8-byte migrator (for migrating). Together with some flags, the overhead is 14 bytes (16 bytes if the default alignment is used). This patch reduces the header size to 1 byte (8 bytes if the default alignment is used). It uses the following techniques: - ULEB128-like encoding (actually more like ULEB64) so a live object's header can typically be stored using 1 byte - indirection, so that migrators can be encoded in a small index pointing to a migrator table, rather than using an 8-byte pointer; this exploits the fact that only a small number of types are stored in LSA - moving the responsibility for determining an object's size to its migrator, rather than storing it in the header; this exploits the fact that the migrator stores type information, and object size is in fact information about the type The patch improves the results of memory_footprint_test as following: Before: - in cache: 976 - in memtable: 947 After: mutation footprint: - in cache: 880 - in memtable: 858 A reduction of about 10%. Further reductions are possible by reducing the alignment of lsa objects. logalloc_test was adjusted to free more objects, since with the lower footprint, rounding errors (to full segments) are different and caused false errors to be detected. Missing: adjustments to scylla-gdb.py; will be done after we agree on the new descriptor's format.	2017-04-24 12:23:12 +02:00
Avi Kivity	9303b09a64	logalloc: remove unused max_occupancy_for_compaction Noticed by clang.	2017-04-22 21:09:41 +03:00
Tomasz Grabiec	20f4c9bf23	lsa: Reduce reclamation latency Currently eviction is performed until occupancy of the whole region drops below the 85% threshold. This may take a while if region had high occupancy and is large. We could improve the situation by only evicting until occupancy of the sparsest segment drops below the threshold, as is done by this change. I tested this using a c-s read workload in which the condition triggers in the cache region, with 1G per shard: lsa-timing - Reclamation cycle took 12.934 us. lsa-timing - Reclamation cycle took 47.771 us. lsa-timing - Reclamation cycle took 125.946 us. lsa-timing - Reclamation cycle took 144356 us. lsa-timing - Reclamation cycle took 655.765 us. lsa-timing - Reclamation cycle took 693.418 us. lsa-timing - Reclamation cycle took 509.869 us. lsa-timing - Reclamation cycle took 1139.15 us. The 144ms pause is when large eviction is necessary. Statistics for reclamation pauses for a read workload over larger-than-memory data set: Before: avg = 865.796362 stdev = 10253.498038 min = 93.891000 max = 264078.000000 sum = 574022.988000 samples = 663 After: avg = 513.685650 stdev = 275.270157 min = 212.286000 max = 1089.670000 sum = 340573.586000 samples = 663 Refs #1634. Message-Id: <1484730859-11969-1-git-send-email-tgrabiec@scylladb.com>	2017-04-21 12:52:31 +02:00
Tomasz Grabiec	c83768d6bb	log_histogram: Allow non-power-of-two minimum values We will want to reuse the min_size mechanism for the whole compaction threshold, including the occupancy threshold. That threshold is close to the segment size and we cannot pick a power of two which would be close enough to what we need. Therefore, change log_histogram to support arbitrary minimum base. bucket_of() was moved into log_histogram_options so that it can be used in number_of_buckets(), which makes for a simple and much less error-prone implementation.	2017-04-21 10:54:50 +02:00
Tomasz Grabiec	7a800c54bf	lsa: Use regular compaction threshold in on-idle compaction Idle-time compaction should not produce not-compactible segments because that means we would have to evict a lot when we finally need to reclaim some memory, so that occupancy falls below the regular compaction threshold. This may cause latency spikes. Refs #1634.	2017-04-20 15:00:15 +02:00
Tomasz Grabiec	7aa286439f	lsa: Add getter for region's eviction function	2017-04-20 14:51:42 +02:00
Duarte Nunes	af37a3fdbf	logalloc: Fix compilation error This patch moves a function using the region_impl type after the type has been defined. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170418124551.25369-1-duarte@scylladb.com>	2017-04-18 15:56:26 +03:00
Avi Kivity	844529fe33	logalloc: avoid auto in function argument declaration 'auto' in a non-lambda function argument is not legal C++, and is hard to read besides. Replace with the right type. Since the right type is private, add some friendship.	2017-04-17 23:18:44 +03:00
Paweł Dziepak	0318dccafd	lsa: avoid unnecessary segment migrations during reclaim segment_zone::migrate_all_segments() was trying to migrate all segments inside a zone to the other one hoping that the original one could be completely freed. This was an attempt to optimise for throughput. However, this may unnecessarily hurt latency if the zone is large, but only few segments are required to satisfy reclaimer's demands. Message-Id: <20170410171912.26821-1-pdziepak@scylladb.com>	2017-04-11 08:55:29 +02:00
Tomasz Grabiec	3609665b19	lsa: Fix debug-mode compilation error By moving definitions of setters out of #ifdef	2017-03-16 18:23:05 +01:00
Tomasz Grabiec	88e7b3ff79	lsa: Ensure can_allocate_more_memory() always leaves a gap above seastar's min_free_memory() One of the goals of can_allocate_more_memory() is to prevent depleting seastar's free memory close to its minimum, leaving a head room above that minimum so that standard allocations will not cause reclamation immediately. Currently the function doesn't take into account actual threshold used by the seastar allocator, so there could be no gap or even could go below the minimum. Fix that by ensuring there's always a gap above min_free_memory(). min_gap was reduced to 1 MiB so that low memory setups are not impacted significantly by the change. Message-Id: <1489667863-15099-1-git-send-email-tgrabiec@scylladb.com>	2017-03-16 12:42:50 +00:00
Tomasz Grabiec	4ab8b255da	lsa: Allow adjusting reserves in allocating_section	2017-03-16 10:21:10 +01:00
Duarte Nunes	d32c848d73	utils/logalloc: Change linkage of hist_options to external Change linkage of segment_descriptor_hist_options to external to keep good old GCC5 happy, despite C++11 allowing static linkage of non-type template arguments. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170309213206.10383-1-duarte@scylladb.com>	2017-03-10 11:02:51 +02:00
Duarte Nunes	ca4f5cabd4	lsa: Extract log_histogram class Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-03-04 14:47:19 +01:00
Duarte Nunes	2b6abd5a91	lsa: Make log_histogram more generic Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-03-03 17:59:17 +01:00
Duarte Nunes	3819e6d55f	lsa: log_histogram cleanups Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-03-03 17:09:07 +01:00
Duarte Nunes	11b5076b3c	lsa: Use log histogram for closed segments This patch replaces the current heap with a logarithmic histogram to hold the closed segment descriptors. This histogram stores elements in different buckets according to their size. Values are mapped to a sequence of power-of-two ranges that are split in N sub-buckets. Values less than a minimum value are placed in bucket 0, whereas values bigger than a maximum value are not admitted. There is some loss of precision as segments are now not totally ordered, and precision decreases the more sparse a segment is. This allows to reduce the cost of the computations needed when freeing from a closed segment. Performance results for perf_simple_query -c4 --duration 60 before after diff read 43954.27 45246.10 +2.9% write 48911.54 52807.76 +7.9% Fixes #1442 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170227235328.27937-1-duarte@scylladb.com>	2017-02-28 18:40:38 +02:00
Tomasz Grabiec	c70ebc7ca5	lsa: Make reclaim_timer enclose segment_pool::reclaim_segments() LSA timing did not include segment migration. It does after this change. Message-Id: <1486657046-9378-1-git-send-email-tgrabiec@scylladb.com>	2017-02-09 17:07:59 +00:00
Tomasz Grabiec	e40fb438f5	lsa: Avoid avalanche releasing of requests Before, the logic for releasing writes blocked on dirty worked like this: 1) When region group size changes and it is not under pressure and there are some requests blocked, then schedule request releasing task 2) request releasing task, if no pressure, runs one request and if there are still blocked requests, schedules next request releasing task If requests don't change the size of the region group, then either some request executes or there is a request releasing task scheduled. The amount of scheduled tasks is at most 1, there is a single thread of execution. However, if requests themselves would change the size of the group, then each such change would schedule yet another request releasing thread, growing the task queue size by one. The group size can also change when memory is reclaimed from the groups (e.g. when contains sparse segments). Compaction may start many request releasing threads due to group size updates. Such behavior is detrimental for performance and stability if there are a lot of blocked requests. This can happen on 1.5 even with modest concurrency because timed out requests stay in the queue. This is less likely on 1.6 where they are dropped from the queue. The releasing of tasks may start to dominate over other processes in the system. When the amount of scheduled tasks reaches 1000, polling stops and server becomes unresponsive until all of the released requests are done, which is either when they start to block on dirty memory again or run out of blocked requests. It may take a while to reach pressure condition after memtable flush if it brings virtual dirty much below the threshold, which is currently the case for workloads with overwrites producing sparse regions. Refs #2021. Fix by ensuring there is at most one request releasing thread at a time. There will be one releasing fiber per region group which is woken up when pressure is lifted. It executes blocked requests until pressure occurs. The logic for notification across hierarchy was replaced by calling region_group::notify_relief() from region_group::update() on the broadest relieved group.	2017-02-01 17:41:55 +01:00
Tomasz Grabiec	d55baa0cd1	lsa: Move definitions to .cc	2017-02-01 17:41:55 +01:00
Tomasz Grabiec	8f8b111b33	lsa: Simplify hard pressure notification management The hard pressure was only signalled on region group when run_when_memory_available() was called after the pressure condition was met. So the following loop is always an infinite loop rather than stopping when enough is allocated to cause pressure: while (!gr.under_pressure()) { region.allocate(...); } It's cleaner if pressure notification works not only if run_when_memory_available() is used but whenever condition changes, like we do for the soft pressure. There is comment in run_when_memory_available() which gives reasons why notifications are called from there, but I think those reasons no longer hold: - we already notify on soft pressure conditions from update(), and if that is safe, notifying about hard pressure should also be safe. I checked and it looks safe to me. - avoiding notification in the rare case when we stopped writing right after crossing the threshold doesn't seem beneficial. It's unlikely in the first place, and one could argue it's better to actually flush now so that when writes resume they will not block.	2017-02-01 17:41:55 +01:00
Tomasz Grabiec	2c7902fb2b	Revert "lsa: Reduce reclamation latency" This reverts commit `d61002cc33`. Introduced a regression in row_cache_alloc_stress. The problem is that reclaim_from_evictable() evicts way too much after the refactor due to the stop condition not taking into account how much data was evicted so far and only looking at occupancy of the minimal segment. This may lead to eviction of the whole region.	2017-01-26 10:43:18 +01:00
Tomasz Grabiec	d61002cc33	lsa: Reduce reclamation latency Currently eviction is performed until occupancy of the whole region drops below the 85% threshold. This may take a while if region had high occupancy and is large. We could improve the situation by only evicting until occupancy of the sparsest segment drops below the threshold, as is done by this change. I tested this using a c-s read workload in which the condition triggers in the cache region, with 1G per shard: lsa-timing - Reclamation cycle took 12.934 us. lsa-timing - Reclamation cycle took 47.771 us. lsa-timing - Reclamation cycle took 125.946 us. lsa-timing - Reclamation cycle took 144356 us. lsa-timing - Reclamation cycle took 655.765 us. lsa-timing - Reclamation cycle took 693.418 us. lsa-timing - Reclamation cycle took 509.869 us. lsa-timing - Reclamation cycle took 1139.15 us. The 144ms pause is when large eviction is necessary. The change improves worst case latency. Reclamation time statistics over 30 second period after cache fills up, in microseconds: Before: avg = 1524.283148 stdev = 11021.021118 min = 12.934000 max = 144356.000000 sum = 257603.852000 samples = 169 After: avg = 1317.362414 stdev = 1913.542802 min = 263.935000 max = 19244.600000 sum = 175209.201000 samples = 133 Refs #1634. Message-Id: <1484730859-11969-1-git-send-email-tgrabiec@scylladb.com>	2017-01-19 17:35:36 +02:00
Vlad Zolotarov	022bca16bf	utils::logalloc: move collectd counters registration to metrics registration layer Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-01-10 16:24:55 -05:00
Tomasz Grabiec	e14caaef60	utils/logalloc: Add ability to timeout run_when_memory_available() task	2016-11-29 16:40:58 +01:00
Paweł Dziepak	b8d737ff0a	tests/row_cache_test: verify that eviction follows lru Refs #1847. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com> Message-Id: <1479231555-28191-1-git-send-email-pdziepak@scylladb.com>	2016-11-15 18:57:54 +01:00
Tomasz Grabiec	6548132423	lsa: Make logalloc::tracker::full_compaction() compact all reclaimable regions is_compactible() will pass on very small regions. full_compaction() is only used in tests to force objects to be moved due to compaction, so we want all reclaimable regions to be compacted.	2016-10-18 11:16:08 +02:00
Paweł Dziepak	d08cffd3c7	lsa: avoid exceptions during segment_zone creation LSA tries to allocate zones as large as possible (while still leaving enough free space for the standard allocator). It uses the amount of free memory in order to guess how much it can get, but that obviously doesn't account for fragmentation and the allocation attempt may fail. This patch changes the LSA code so that it doesn't throw in case zone couldn't be created but just returns a null pointer which should be more performant if the LSA memory cannot grow any more. Fixes #1394. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com> Message-Id: <1476435031-5601-1-git-send-email-pdziepak@scylladb.com>	2016-10-14 11:08:24 +02:00
Tomasz Grabiec	e617bcd8a7	logalloc: disable abort on allocation failure in places in which it is benign Some places start big expecting allocation failure, then reduce the requested size. Let's not abort in such cases. Message-Id: <1476295120-32047-1-git-send-email-tgrabiec@scylladb.com>	2016-10-13 10:53:32 +03:00
Glauber Costa	86aa0b830d	LSA: allow a group to query its own region group Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-10-04 10:39:10 -04:00
Glauber Costa	28e3f2f6ee	LSA: export information about object memory footprint We allocate objects of a certain size, but we use a bit more memory to hold them. To get a clearer picture about how much memory will an object cost us, we need help from the allocator. This patch exports an interface that allow users to query into a specific allocator to get that information. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-10-04 10:39:10 -04:00
Tomasz Grabiec	b0b28696b5	scylla-gdb: Add 'scylla lsa-segment' command Allows one to examine contents of LSA segment. Example: (gdb) scylla lsa-segment 0x601000480000 0x601000480e70: live size=144 migrator=standard_migrator<cache_entry>::object 0x601000480f10: live size=144 migrator=standard_migrator<cache_entry>::object 0x601000480fb0: free size=192 0x60100048107e: free size=42 0x6010004814e0: free size=192 0x6010004815ae: free size=40 0x6010004815e8: free size=192 0x6010004816b8: live size=144 migrator=standard_migrator<cache_entry>::object 0x601000481758: free size=192 ...	2016-09-20 16:53:21 +02:00
Glauber Costa	fe6a0d97d1	logalloc: make sure allocations in release_requests don't recurse back into the allocator Calls like later() and with_gate() may allocate memory, although that is not very common. This can create a problem in the sense that it will potentially recurse and bring us back to the allocator during free - which is the very thing we are trying to avoid with the call to later(). This patch wraps the relevant calls in the reclaimer lock. This does mean that the allocation may fail if we are under severe pressure - which includes having exhausted all reserved space - but at least we won't recurse back to the allocator. To make sure we do this as early as possible, we just fold both release_requests and do_release_requests into a single function. Thanks Tomek for the suggestion. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <980245ccc17960cf4fcbbfedb29d1878a98d85d8.1470254846.git.glauber@scylladb.com>	2016-08-04 11:16:53 +02:00
Tomasz Grabiec	9476bc5a31	Introduce --abort-on-lsa-bad-alloc command line option Useful for triggering core dump on allocation failure inside LSA, which makes it easier to debug allocation failures. They normally don't cause aborts, just fail the current operation, which makes it hard to figure out what was the cause of allocation failure. Message-Id: <1470233631-18508-1-git-send-email-tgrabiec@scylladb.com>	2016-08-03 17:26:44 +03:00
Avi Kivity	d261927fa3	logalloc: change sprint() of a pointer to use void* explicitly Otherwise, fmtlib dislikes it.	2016-07-18 19:37:16 +03:00
Glauber Costa	4e81f19ab5	LSA: fix typo in region merge There are many potentially tricky things about referring to different regions from the LSA perspective. Madness, however, is not one of them. I can only assume we meant made? Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <8eb81f35de4b208a494e43cb392eea07b87b2bf1.1466534798.git.glauber@scylladb.com>	2016-06-21 22:58:44 +03:00
Tomasz Grabiec	e783b58e3b	Merge branch 'glommer/LSA-throttler-v6' from git@github.com:glommer/scylla.gi From Glauber: This is my new take at the "Move throttler to the LSA" series, except this one don't actually move anything anywhere: I am leaving all memtable conversion out, and instead I am sending just the LSA bits + LSA active reclaim. This should help us see where we are going, and then we can discuss all memtable changes in a series on its own, logically separated (and hopefully already integrated with virtual dirty). [tgrabiec: trivial merge conflicts in logalloc.cc]	2016-06-21 10:22:26 +02:00
Glauber Costa	579d121db8	LSA: export largest region We now keep the regions sorted by size, and the children region groups as well. Internally, the LSA has all information it needs to make size-based reclaim decisions. However, we don't do reclaim internally, but rather warn our user that a pressure situation is mounted. The user of a region_group doesn't need to evict the largest region in case of pressure and is free to do whatever it chooses - including nothing. But more likely than not, taking into account which region is the largest makes sense. This patch puts together this last missing piece of the puzzle, and exports the information we have internally to the user. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-06-20 18:51:00 -04:00
Glauber Costa	35f8a2ce2c	LSA: add a backpointer to the region from its private data Region is implemented using the pimpl pattern (region_impl), and all its relevant data is present in a private structure instead of the region itself. That private structure is the one that the other parts of the LSA will refer to, the region_group being the prime example. To allow classes such as the region_group the externally export a particular region, we will introduce a backpointer region_impl -> region. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-06-20 18:50:59 -04:00
Glauber Costa	38a402307d	LSA: enhance region_group reclaimer We are currently just allowing the region_group to specify a throttle_threshold, that triggers throttling when a certain amount of memory is reached. We would like to notify the callers that such condition is reached, so that the callers can do something to alleviate it - like triggering flushes of their structures. The approach we are taking here is to pass a reclaimer instance. Any user of a region_group can specialize its methods start_reclaiming and stop_reclaiming that will be called when the region_group becomes under pressure or ceases to be, respectively. Now that we have such facility, it makes more sense to move the throttle_threshold here than having it separately. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-06-20 18:50:59 -04:00
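
Commit `11b5076b3c` in the log above describes how closed segments are bucketed: values map to a sequence of power-of-two ranges, each split into N sub-buckets, with values below a minimum placed in bucket 0 and values above a maximum not admitted. The following is a minimal sketch of such a mapping, not the code from that commit. The struct `log_bucket_options`, the helper `floor_log2`, the signature of `bucket_of` as written here, and the restrictions that the minimum is a power of two and at least as large as the number of sub-buckets are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Illustrative options; the field names and values are assumptions,
// not the ones used in utils/logalloc.
struct log_bucket_options {
    std::uint64_t min_value;   // power of two; smaller values land in bucket 0
    std::uint64_t max_value;   // larger values are not admitted
    unsigned sub_bucket_bits;  // each power-of-two range has 2^bits sub-buckets
};

// Index of the highest set bit; v must be non-zero.
inline unsigned floor_log2(std::uint64_t v) {
    unsigned r = 0;
    while (v >>= 1) {
        ++r;
    }
    return r;
}

// Maps a value to its bucket index.
// Assumes min_value is a power of two and min_value >= 2^sub_bucket_bits.
inline std::size_t bucket_of(const log_bucket_options& o, std::uint64_t value) {
    if (value > o.max_value) {
        throw std::out_of_range("value above the maximum is not admitted");
    }
    if (value < o.min_value) {
        return 0;
    }
    const unsigned k = floor_log2(value);                          // value is in [2^k, 2^(k+1))
    const unsigned range = k - floor_log2(o.min_value);            // which power-of-two range
    const std::uint64_t offset = value - (std::uint64_t(1) << k);  // position inside the range
    const std::uint64_t sub = offset >> (k - o.sub_bucket_bits);   // which sub-bucket
    return 1 + (std::size_t(range) << o.sub_bucket_bits) + std::size_t(sub);
}
```

With `min_value = 16`, `max_value = 1024` and two sub-bucket bits, the values 16 to 19 share bucket 1, 20 to 23 share bucket 2, and 32 starts bucket 5. Values inside one bucket are not ordered relative to each other, which matches the commit's remark that segments are no longer totally ordered.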


128 Commits