The segment heap is a max-heap, with the sparsest segments on top. When
we free from a segment its occupancy decreases, so its position in
the heap should increase (move towards the top).
This bug caused segments to be picked for compaction in the wrong
order. In extreme cases this can lead to a livelock; in other cases it
may just increase compaction latency.
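A minimal sketch of the intended invariant, assuming a boost::heap-style mutable heap (the actual segment heap code differs; names here are illustrative):

    #include <boost/heap/fibonacci_heap.hpp>
    #include <cstddef>

    struct segment {
        size_t free_space = 0;
    };

    struct sparser_on_top {
        // max-heap: the segment with the most free space sorts highest
        bool operator()(const segment* a, const segment* b) const {
            return a->free_space < b->free_space;
        }
    };

    using segment_heap = boost::heap::fibonacci_heap<segment*,
            boost::heap::compare<sparser_on_top>>;

    // Freeing grows free_space, so the segment's heap position may only
    // move towards the top: the handle must be adjusted with increase(),
    // not decrease().
    void on_free(segment_heap& heap, segment_heap::handle_type h, size_t bytes) {
        (*h)->free_space += bytes;
        heap.increase(h);
    }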
Also fixes https://github.com/cloudius-systems/seastar/issues/54
==5658==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x6250006b7848 at pc 0x1413e02 bp 0x7fff7cd7f1e0 sp 0x7fff7cd7f1d8
WRITE of size 8 at 0x6250006b7848 thread T0
#0 0x1413e01 in unsigned long* std::__copy_move<false, false, std::random_access_iterator_tag>::__copy_m<std::_Deque_iterator<unsigned long, unsigned long&, unsigned long*>, unsigned long*>(std::_Deque_iterator<unsigned long, unsigned long&, unsigned long*>, std::_Deque_iterator<unsigned long, unsigned long&, unsigned long*>, unsigned long*) /usr/include/c++/4.9/bits/stl_algobase.h:336
#1 0x1413c59 in unsigned long* std::__copy_move_a<false, std::_Deque_iterator<unsigned long, unsigned long&, unsigned long*>, unsigned long*>(std::_Deque_iterator<unsigned long, unsigned long&, unsigned long*>, std::_Deque_iterator<unsigned long, unsigned long&, unsigned long*>, unsigned long*) /usr/include/c++/4.9/bits/stl_algobase.h:396
#2 0x1413aea in unsigned long* std::__copy_move_a2<false, std::_Deque_iterator<unsigned long, unsigned long&, unsigned long*>, unsigned long*>(std::_Deque_iterator<unsigned long, unsigned long&, unsigned long*>, std::_Deque_iterator<unsigned long, unsigned long&, unsigned long*>, unsigned long*) /usr/include/c++/4.9/bits/stl_algobase.h:434
#3 0x14138df in unsigned long* std::copy<std::_Deque_iterator<unsigned long, unsigned long&, unsigned long*>, unsigned long*>(std::_Deque_iterator<unsigned long, unsigned long&, unsigned long*>, std::_Deque_iterator<unsigned long, unsigned long&, unsigned long*>, unsigned long*) /usr/include/c++/4.9/bits/stl_algobase.h:466
#4 0x1413545 in unsigned long* std::__copy_n<std::_Deque_iterator<unsigned long, unsigned long&, unsigned long*>, unsigned long, unsigned long*>(std::_Deque_iterator<unsigned long, unsigned long&, unsigned long*>, unsigned long, unsigned long*, std::random_access_iterator_tag) /usr/include/c++/4.9/bits/stl_algo.h:779
#5 0x1412d44 in unsigned long* std::copy_n<std::_Deque_iterator<unsigned long, unsigned long&, unsigned long*>, unsigned long, unsigned long*>(std::_Deque_iterator<unsigned long, unsigned long&, unsigned long*>, unsigned long, unsigned long*) /usr/include/c++/4.9/bits/stl_algo.h:804
#6 0x14112b3 in unsigned long large_bitset::load<std::_Deque_iterator<unsigned long, unsigned long&, unsigned long*> >(std::_Deque_iterator<unsigned long, unsigned long&, unsigned long*>, std::_Deque_iterator<unsigned long, unsigned long&, unsigned long*>, unsigned long) utils/large_bitset.hh:81
#7 0x13fcfc9 in _ZZZN8sstables7sstable11read_filterEvENKUlRT_E_clINS_6filterEEEDaS2_ENKUlvE_clEv (/home/tgrabiec/src/urchin/build/debug/scylla+0x13fcfc9)
#8 0x1400a50 in apply /home/tgrabiec/src/urchin/seastar/core/apply.hh:34
#9 0x1400afb in apply<sstables::sstable::read_filter()::<lambda(auto:25&)> [with auto:25 = sstables::filter]::<lambda()> > /home/tgrabiec/src/urchin/seastar/core/apply.hh:42
#10 0x1400bb2 in apply<sstables::sstable::read_filter()::<lambda(auto:25&)> [with auto:25 = sstables::filter]::<lambda()> > /home/tgrabiec/src/urchin/seastar/core/future.hh:1062
#11 0x140f1b7 in _ZZN6futureIIEE4thenIZZN8sstables7sstable11read_filterEvENKUlRT_E_clINS2_6filterEEEDaS5_EUlvE_S0_EET0_OT_ENUlOS4_E_clI12future_stateIIEEEEDaSC_ (/home/tgrabiec/src/urchin/build/debug/scylla+0x140f1b7)
#12 0x140f350 in run /home/tgrabiec/src/urchin/seastar/core/future.hh:359
#13 0x426e2c in reactor::run_tasks(circular_buffer<std::unique_ptr<task, std::default_delete<task> >, std::allocator<std::unique_ptr<task, std::default_delete<task> > > >&, unsigned long) core/reactor.cc:1093
#14 0x429cb1 in reactor::run() core/reactor.cc:1190
#15 0x72bc69 in app_template::run_deprecated(int, char**, std::function<void ()>&&) core/app-template.cc:122
#16 0xa119bc in main /home/tgrabiec/src/urchin/main.cc:279
#17 0x7ffc1b6beec4 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21ec4)
#18 0x412558 (/home/tgrabiec/src/urchin/build/debug/scylla+0x412558)
0x6250006b7848 is located 0 bytes to the right of 8008-byte region [0x6250006b5900,0x6250006b7848)
allocated by thread T0 here:
#0 0x7ffc1cf6c7df in __interceptor_malloc (/usr/lib/x86_64-linux-gnu/libasan.so.1+0x547df)
#1 0x7ffc204eef17 in operator new(unsigned long) (/usr/lib/x86_64-linux-gnu/libstdc++.so.6+0x8df17)
#2 0xfa5d4f in large_bitset::large_bitset(unsigned long) utils/large_bitset.cc:15
#3 0x13fcec6 in _ZZZN8sstables7sstable11read_filterEvENKUlRT_E_clINS_6filterEEEDaS2_ENKUlvE_clEv (/home/tgrabiec/src/urchin/build/debug/scylla+0x13fcec6)
#4 0x1400a50 in apply /home/tgrabiec/src/urchin/seastar/core/apply.hh:34
#5 0x1400afb in apply<sstables::sstable::read_filter()::<lambda(auto:25&)> [with auto:25 = sstables::filter]::<lambda()> > /home/tgrabiec/src/urchin/seastar/core/apply.hh:42
#6 0x1400bb2 in apply<sstables::sstable::read_filter()::<lambda(auto:25&)> [with auto:25 = sstables::filter]::<lambda()> > /home/tgrabiec/src/urchin/seastar/core/future.hh:1062
#7 0x140f1b7 in _ZZN6futureIIEE4thenIZZN8sstables7sstable11read_filterEvENKUlRT_E_clINS2_6filterEEEDaS5_EUlvE_S0_EET0_OT_ENUlOS4_E_clI12future_stateIIEEEEDaSC_ (/home/tgrabiec/src/urchin/build/debug/scylla+0x140f1b7)
#8 0x140f350 in run /home/tgrabiec/src/urchin/seastar/core/future.hh:359
#9 0x426e2c in reactor::run_tasks(circular_buffer<std::unique_ptr<task, std::default_delete<task> >, std::allocator<std::unique_ptr<task, std::default_delete<task> > > >&, unsigned long) core/reactor.cc:1093
#10 0x429cb1 in reactor::run() core/reactor.cc:1190
#11 0x72bc69 in app_template::run_deprecated(int, char**, std::function<void ()>&&) core/app-template.cc:122
#12 0xa119bc in main /home/tgrabiec/src/urchin/main.cc:279
#13 0x7ffc1b6beec4 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21ec4)
SUMMARY: AddressSanitizer: heap-buffer-overflow /usr/include/c++/4.9/bits/stl_algobase.h:336 unsigned long* std::__copy_move<false, false, std::random_access_iterator_tag>::__copy_m<std::_Deque_iterator<unsigned long, unsigned long&, unsigned long*>, unsigned long*>(std::_Deque_iterator<unsigned long, unsigned long&, unsigned long*>, std::_Deque_iterator<unsigned long, unsigned long&, unsigned long*>, unsigned long*)
Shadow bytes around the buggy address:
0x0c4a800ceeb0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c4a800ceec0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c4a800ceed0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c4a800ceee0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c4a800ceef0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x0c4a800cef00: 00 00 00 00 00 00 00 00 00[fa]fa fa fa fa fa fa
0x0c4a800cef10: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c4a800cef20: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c4a800cef30: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c4a800cef40: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c4a800cef50: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Heap right redzone: fb
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack partial redzone: f4
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Contiguous container OOB:fc
ASan internal: fe
==5658==ABORTING
Single-bit accessors are very slow, especially because we don't support
setting a bit to a given value (only set to 1 and clear to 0). This makes
loading and retrieving the contents of a bitmap painfully slow.
Fix by providing iterator-based load() and save() methods. The methods
support partial load/save so that access to very large bitmaps can be
split over multiple tasks.
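A sketch of the approach with illustrative names (not the actual large_bitset interface): storage is an array of 64-bit words, and load()/save() move whole words through iterators instead of touching bits one at a time:

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    class word_bitset {
        std::vector<uint64_t> _words;
    public:
        explicit word_bitset(size_t bits) : _words((bits + 63) / 64) {}

        // Partial load: fills words starting at word_offset, so a huge
        // bitmap can be loaded in chunks from several tasks.
        template <typename Iterator>
        void load(Iterator first, Iterator last, size_t word_offset = 0) {
            std::copy(first, last, _words.begin() + word_offset);
        }

        // Partial save: writes nr_words words starting at word_offset.
        template <typename Iterator>
        void save(Iterator out, size_t word_offset, size_t nr_words) const {
            std::copy_n(_words.begin() + word_offset, nr_words, out);
        }
    };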
A region being merged can still be in use, but after merging its
compaction_lock and reclaim counter no longer work. This can lead to
use-after-compact-without-re-lookup errors.
Fix by making the source region the same as the target region; they
will share compaction locks and reclaim counters, so lookup avoidance
will still work correctly.
Fixes #286.
In some cases a region may be in a state where it is not empty and yet
nothing can be evicted from it. For example, when creating the first
entry, the reclaimer may be invoked during creation, before the entry
is linked. We therefore can't rely on emptiness as a stop condition for
reclamation; the eviction function must signal whether it made forward
progress.
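Sketched, the stop condition changes from "region is empty" to "the evictor made progress" (hypothetical names):

    #include <cstddef>

    // evict_some() returns the number of bytes it freed; zero means no
    // forward progress was possible (e.g. the only entry is still being
    // constructed and is not linked yet).
    template <typename EvictFn>
    size_t reclaim(size_t target, EvictFn evict_some) {
        size_t reclaimed = 0;
        while (reclaimed < target) {
            size_t freed = evict_some();
            if (!freed) {
                break;  // don't spin waiting for the region to become empty
            }
            reclaimed += freed;
        }
        return reclaimed;
    }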
Related to #259. In some cases we need to allocate memory and hold the
reclaim lock at the same time. If that region holds most of the
reclaimable memory, allocations inside such a code section may
fail. allocating_section works around the problem: it learns how big
the reserves should be from past executions of the critical section and
tries to ensure proper reserves before entering the section.
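Roughly, the pattern looks like this (a sketch with hypothetical stand-in helpers, not the exact logalloc interface):

    #include <cstddef>
    #include <new>

    struct region {};
    struct reclaim_lock {                       // stand-in: pins references into the region
        explicit reclaim_lock(region&) {}
    };
    inline void ensure_free_memory(size_t) {}   // stand-in: evict/compact until the reserve is met

    class allocating_section {
        size_t _reserve = 128 * 1024;           // learned lower bound on the section's needs
    public:
        template <typename Func>
        decltype(auto) operator()(region& r, Func&& func) {
            while (true) {
                ensure_free_memory(_reserve);   // build up reserves before locking
                reclaim_lock guard(r);
                try {
                    return func();
                } catch (const std::bad_alloc&) {
                    _reserve *= 2;              // the reserve was too small; learn and retry
                }
            }
        }
    };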
Sometimes we may close an empty active segment, if all data in it was
evicted. Normally a segment is removed as soon as the last object in
it is freed, but if the segment is already empty when closed, no one is
supposed to call free on it. Such segments would be quickly reclaimed
during compaction, but it's possible that we will destroy the region
before compaction gets to them. Currently we fail on an assertion
which checks that there are no segments left. This change fixes the
problem by handling empty closed segments when the region is
destroyed.
Memory releasing is invoked from destructors, so it should not throw. As a
consequence it should not allocate memory either, so the emergency segment
pool was switched from std::deque<> to an allocation-free intrusive stack.
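A minimal sketch of such a stack (illustrative, not the actual emergency pool code): the link field lives inside the segment itself, so push/pop never allocate and cannot throw:

    struct segment {
        segment* next = nullptr;  // intrusive link; meaningful only while stacked
    };

    class segment_stack {
        segment* _top = nullptr;
    public:
        void push(segment* s) noexcept {
            s->next = _top;
            _top = s;
        }
        segment* pop() noexcept {
            segment* s = _top;
            if (s) {
                _top = s->next;
            }
            return s;
        }
        bool empty() const noexcept { return !_top; }
    };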
Copying values may throw (std::bad_alloc, for example), which would
result in a leak, because the destructor is not called when the
constructor throws.
May manifest as the following assertion failure:
utils/logalloc.cc:672: virtual logalloc::region_impl::~region_impl(): Assertion `_active->is_empty()' failed.
Fixes complaint about ignored result:
utils/file_lock.cc: In destructor 'utils::file_lock::impl::~impl()':
utils/file_lock.cc:29:33: error: ignoring return value of 'int lockf(int, int, __off_t)', declared with attribute warn_unused_result [-Werror=unused-result]
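The kind of change that silences it, sketched (not the exact code): consume the result explicitly, since a destructor can neither throw nor do much about a failed unlock:

    #include <unistd.h>

    struct impl {
        int _fd;
        ~impl() {
            auto r = ::lockf(_fd, F_ULOCK, 0);
            (void)r;  // best-effort unlock; nothing sensible to do on failure here
        }
    };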
Disabling compaction of a region is currently done in order to keep
references valid. But disabling only compaction is not enough: we
also need to disable eviction, since it invalidates references
too. Rather than introducing another type of lock, compaction
and eviction are now controlled together, generalized as "reclaiming"
(hence the reclaim_lock).
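Conceptually (a sketch with illustrative names): one RAII guard gates both operations, since either can invalidate references:

    class region_impl {
        unsigned _reclaim_disabled = 0;
    public:
        void disable_reclaim() noexcept { ++_reclaim_disabled; }
        void enable_reclaim() noexcept { --_reclaim_disabled; }
        // Gates compaction *and* eviction; checked before either runs.
        bool reclaiming_enabled() const noexcept { return _reclaim_disabled == 0; }
    };

    class reclaim_lock {
        region_impl& _r;
    public:
        explicit reclaim_lock(region_impl& r) : _r(r) { _r.disable_reclaim(); }
        ~reclaim_lock() { _r.enable_reclaim(); }
        reclaim_lock(const reclaim_lock&) = delete;
        reclaim_lock& operator=(const reclaim_lock&) = delete;
    };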
The goal is to make allocation less likely to fail. With an async
reclaimer there is an implicit bound on the amount of memory that can
be allocated between deferring points, but this bound is difficult to
enforce. A sync reclaimer lifts this limitation.
Also, allocations which previously could not be satisfied because of
fragmentation now have a higher chance of succeeding, although,
depending on how much memory is fragmented, that could involve
evicting a lot of segments from cache, so we should still avoid such
allocations.
The downside of sync reclaiming is that references into regions may now
be invalidated not only across deferring points but at any allocation
site. compaction_lock can be used to pin data, preferably just
temporarily.
* seastar 5176352...68fee6c (1):
> Merge "Memory reclamation infrastructure follow-up" from Tomasz
Adjusted logalloc::tracker's reclaimer to fit the new API.
"Initial implementation/transposition of commit log replay.
* Changes replay position to be shard aware
* Commit log segment IDs now follow basically the same scheme as Origin:
max(previous ID, wall clock time in ms) + shard info (for us)
* SSTables now use the DB definition of replay_position.
* Stores and propagates (compaction) flush replay positions in sstables
* If CL segments are left over from a previous run, they and the existing
sstables are inspected for their high-water marks, and replay then starts
from those marks to recover mutations potentially lost in a crash
* Note that a CPU count change is "handled" only in so much as shard matching
is done against the _previous_ run's shards, not the current ones.
Known limitations:
* Mutations deserialized from old CL segments are _not_ fully validated
against existing schemas.
* System::truncated_at (not currently used) does not handle sharding afaik,
so watermark IDs coming from there are dubious.
* Mutations that fail to apply (invalid, broken) are not placed in blob files
like Origin does. Partly because I am lazy, but also partly because our serial
format differs, and we currently have no tools to do anything useful with it.
* No replay filtering (Origin allows a system property to designate a filter
file, detailing which keyspace/cfs to replay). Partly because we have no
system properties.
There is no unit test for the commit log replayer (yet), because I could not
really come up with a good one given the existing test infrastructure (it's
tricky to kill stuff just "right").
The functionality is verified by manual testing, i.e. running scylla,
building up data (cassandra-stress), kill -9 + restart.
This of course does not fully validate whether the resulting DB is
100% identical to the one at kill -9 time, but at least it verified that
replay took place and mutations were applied.
(Note that Origin also lacks validity testing)"
Fixes #98.
"I saw about 4% improvement in perf_sstable write on muninn with this. The
decorated_key comparison is gone from the perf profile now. Now most of the
work inside the reader is for copying the mutation."
By using a recognized idiom, gcc can optimize the unaligned little-endian
load into a single instruction (actually less than an instruction, as it
combines the load with a succeeding arithmetic operation).
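One such recognized idiom is going through memcpy into a local (a sketch; the actual helper may differ):

    #include <cstdint>
    #include <cstring>

    inline uint64_t read_le64(const void* p) {
        uint64_t v;
        // gcc folds this memcpy into a single unaligned load on x86;
        // on a little-endian target the bytes are already in order
        std::memcpy(&v, p, sizeof(v));
        return v;
    }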
"Initial implementation/transposition of commit log replay.
* Changes replay position to be shard aware
* Commit log segment ID:s now follow basically the same scheme as origin;
max(previous ID, wall clock time in ms) + shard info (for us)
* SStables now use the DB definition of replay_position.
* Stores and propagates (compaction) flush replay positions in sstables
* If CL segments are left over from a previous run, they, and existing
sstables are inspected for high water mark, and then replayed from
those marks to amend mutations potentially lost in a crash
* Note that CPU count change is "handled" in so much that shard matching is
per _previous_ runs shards, not current.
Known limitations:
* Mutations deserialized from old CL segments are _not_ fully validated
against existing schemas.
* System::truncated_at (not currently used) does not handle sharding afaik,
so watermark ID:s coming from there are dubious.
* Mutations that fail to apply (invalid, broken) are not placed in blob files
like origin. Partly because I am lazy, but also partly because our serial
format differs, and we currently have no tools to do anything useful with it
* No replay filtering (Origin allows a system property to designate a filter
file, detailing which keyspace/cf:s to replay). Partly because we have no
system properties.
There is no unit test for the commit log replayer (yet).
Because I could not really come up with a good one given the test
infrastructure that exists (tricky to kill stuff just "right").
The functionality is verified by manual testing, i.e. running scylla,
building up data (cassandra-stress), kill -9 + restart.
This of course does not really fully validate whether the resulting DB is
100% valid compared to the one at k-9, but at least it verified that replay
took place, and mutations where applied.
(Note that origin also lacks validity testing)"
When stopping a task we shouldn't retry a compaction, because if we
are removing a cf and an error happens, we would push the cf back
into the queue, possibly leading to a use-after-free.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
It was noticed that the same sstable files could be selected for
compaction if concurrent compactions happen on the same cf.
That's possible because the compaction manager uses two tasks for
handling compactions.
The solution is to not duplicate a cf in the compaction manager queue,
and to re-schedule compaction for a cf if needed.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>