The segment heap is a max-heap, with the sparsest segments on top. When
we free from a segment its occupancy decreases, so its position in
the heap should increase (move towards the top).
This bug caused segments to be picked for compaction in the wrong
order. In extreme cases this can lead to a livelock; in other cases it
may just increase compaction latency.
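A minimal sketch of the intended invariant, assuming a boost::heap-style mutable heap (the actual segment heap code differs; names here are illustrative):

    #include <boost/heap/fibonacci_heap.hpp>
    #include <cstddef>

    struct segment {
        size_t free_space = 0;
    };

    struct sparser_on_top {
        // max-heap: the segment with the most free space sorts highest
        bool operator()(const segment* a, const segment* b) const {
            return a->free_space < b->free_space;
        }
    };

    using segment_heap = boost::heap::fibonacci_heap<segment*,
            boost::heap::compare<sparser_on_top>>;

    // Freeing grows free_space, so the segment's heap position may only
    // move towards the top: the handle must be adjusted with increase(),
    // not decrease().
    void on_free(segment_heap& heap, segment_heap::handle_type h, size_t bytes) {
        (*h)->free_space += bytes;
        heap.increase(h);
    }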
Also fixes https://github.com/cloudius-systems/seastar/issues/54
==5658==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x6250006b7848 at pc 0x1413e02 bp 0x7fff7cd7f1e0 sp 0x7fff7cd7f1d8
WRITE of size 8 at 0x6250006b7848 thread T0
#0 0x1413e01 in unsigned long* std::__copy_move<false, false, std::random_access_iterator_tag>::__copy_m<std::_Deque_iterator<unsigned long, unsigned long&, unsigned long*>, unsigned long*>(std::_Deque_iterator<unsigned long, unsigned long&, unsigned long*>, std::_Deque_iterator<unsigned long, unsigned long&, unsigned long*>, unsigned long*) /usr/include/c++/4.9/bits/stl_algobase.h:336
#1 0x1413c59 in unsigned long* std::__copy_move_a<false, std::_Deque_iterator<unsigned long, unsigned long&, unsigned long*>, unsigned long*>(std::_Deque_iterator<unsigned long, unsigned long&, unsigned long*>, std::_Deque_iterator<unsigned long, unsigned long&, unsigned long*>, unsigned long*) /usr/include/c++/4.9/bits/stl_algobase.h:396
#2 0x1413aea in unsigned long* std::__copy_move_a2<false, std::_Deque_iterator<unsigned long, unsigned long&, unsigned long*>, unsigned long*>(std::_Deque_iterator<unsigned long, unsigned long&, unsigned long*>, std::_Deque_iterator<unsigned long, unsigned long&, unsigned long*>, unsigned long*) /usr/include/c++/4.9/bits/stl_algobase.h:434
#3 0x14138df in unsigned long* std::copy<std::_Deque_iterator<unsigned long, unsigned long&, unsigned long*>, unsigned long*>(std::_Deque_iterator<unsigned long, unsigned long&, unsigned long*>, std::_Deque_iterator<unsigned long, unsigned long&, unsigned long*>, unsigned long*) /usr/include/c++/4.9/bits/stl_algobase.h:466
#4 0x1413545 in unsigned long* std::__copy_n<std::_Deque_iterator<unsigned long, unsigned long&, unsigned long*>, unsigned long, unsigned long*>(std::_Deque_iterator<unsigned long, unsigned long&, unsigned long*>, unsigned long, unsigned long*, std::random_access_iterator_tag) /usr/include/c++/4.9/bits/stl_algo.h:779
#5 0x1412d44 in unsigned long* std::copy_n<std::_Deque_iterator<unsigned long, unsigned long&, unsigned long*>, unsigned long, unsigned long*>(std::_Deque_iterator<unsigned long, unsigned long&, unsigned long*>, unsigned long, unsigned long*) /usr/include/c++/4.9/bits/stl_algo.h:804
#6 0x14112b3 in unsigned long large_bitset::load<std::_Deque_iterator<unsigned long, unsigned long&, unsigned long*> >(std::_Deque_iterator<unsigned long, unsigned long&, unsigned long*>, std::_Deque_iterator<unsigned long, unsigned long&, unsigned long*>, unsigned long) utils/large_bitset.hh:81
#7 0x13fcfc9 in _ZZZN8sstables7sstable11read_filterEvENKUlRT_E_clINS_6filterEEEDaS2_ENKUlvE_clEv (/home/tgrabiec/src/urchin/build/debug/scylla+0x13fcfc9)
#8 0x1400a50 in apply /home/tgrabiec/src/urchin/seastar/core/apply.hh:34
#9 0x1400afb in apply<sstables::sstable::read_filter()::<lambda(auto:25&)> [with auto:25 = sstables::filter]::<lambda()> > /home/tgrabiec/src/urchin/seastar/core/apply.hh:42
#10 0x1400bb2 in apply<sstables::sstable::read_filter()::<lambda(auto:25&)> [with auto:25 = sstables::filter]::<lambda()> > /home/tgrabiec/src/urchin/seastar/core/future.hh:1062
#11 0x140f1b7 in _ZZN6futureIIEE4thenIZZN8sstables7sstable11read_filterEvENKUlRT_E_clINS2_6filterEEEDaS5_EUlvE_S0_EET0_OT_ENUlOS4_E_clI12future_stateIIEEEEDaSC_ (/home/tgrabiec/src/urchin/build/debug/scylla+0x140f1b7)
#12 0x140f350 in run /home/tgrabiec/src/urchin/seastar/core/future.hh:359
#13 0x426e2c in reactor::run_tasks(circular_buffer<std::unique_ptr<task, std::default_delete<task> >, std::allocator<std::unique_ptr<task, std::default_delete<task> > > >&, unsigned long) core/reactor.cc:1093
#14 0x429cb1 in reactor::run() core/reactor.cc:1190
#15 0x72bc69 in app_template::run_deprecated(int, char**, std::function<void ()>&&) core/app-template.cc:122
#16 0xa119bc in main /home/tgrabiec/src/urchin/main.cc:279
#17 0x7ffc1b6beec4 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21ec4)
#18 0x412558 (/home/tgrabiec/src/urchin/build/debug/scylla+0x412558)
0x6250006b7848 is located 0 bytes to the right of 8008-byte region [0x6250006b5900,0x6250006b7848)
allocated by thread T0 here:
#0 0x7ffc1cf6c7df in __interceptor_malloc (/usr/lib/x86_64-linux-gnu/libasan.so.1+0x547df)
#1 0x7ffc204eef17 in operator new(unsigned long) (/usr/lib/x86_64-linux-gnu/libstdc++.so.6+0x8df17)
#2 0xfa5d4f in large_bitset::large_bitset(unsigned long) utils/large_bitset.cc:15
#3 0x13fcec6 in _ZZZN8sstables7sstable11read_filterEvENKUlRT_E_clINS_6filterEEEDaS2_ENKUlvE_clEv (/home/tgrabiec/src/urchin/build/debug/scylla+0x13fcec6)
#4 0x1400a50 in apply /home/tgrabiec/src/urchin/seastar/core/apply.hh:34
#5 0x1400afb in apply<sstables::sstable::read_filter()::<lambda(auto:25&)> [with auto:25 = sstables::filter]::<lambda()> > /home/tgrabiec/src/urchin/seastar/core/apply.hh:42
#6 0x1400bb2 in apply<sstables::sstable::read_filter()::<lambda(auto:25&)> [with auto:25 = sstables::filter]::<lambda()> > /home/tgrabiec/src/urchin/seastar/core/future.hh:1062
#7 0x140f1b7 in _ZZN6futureIIEE4thenIZZN8sstables7sstable11read_filterEvENKUlRT_E_clINS2_6filterEEEDaS5_EUlvE_S0_EET0_OT_ENUlOS4_E_clI12future_stateIIEEEEDaSC_ (/home/tgrabiec/src/urchin/build/debug/scylla+0x140f1b7)
#8 0x140f350 in run /home/tgrabiec/src/urchin/seastar/core/future.hh:359
#9 0x426e2c in reactor::run_tasks(circular_buffer<std::unique_ptr<task, std::default_delete<task> >, std::allocator<std::unique_ptr<task, std::default_delete<task> > > >&, unsigned long) core/reactor.cc:1093
#10 0x429cb1 in reactor::run() core/reactor.cc:1190
#11 0x72bc69 in app_template::run_deprecated(int, char**, std::function<void ()>&&) core/app-template.cc:122
#12 0xa119bc in main /home/tgrabiec/src/urchin/main.cc:279
#13 0x7ffc1b6beec4 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21ec4)
SUMMARY: AddressSanitizer: heap-buffer-overflow /usr/include/c++/4.9/bits/stl_algobase.h:336 unsigned long* std::__copy_move<false, false, std::random_access_iterator_tag>::__copy_m<std::_Deque_iterator<unsigned long, unsigned long&, unsigned long*>, unsigned long*>(std::_Deque_iterator<unsigned long, unsigned long&, unsigned long*>, std::_Deque_iterator<unsigned long, unsigned long&, unsigned long*>, unsigned long*)
Shadow bytes around the buggy address:
0x0c4a800ceeb0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c4a800ceec0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c4a800ceed0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c4a800ceee0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c4a800ceef0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x0c4a800cef00: 00 00 00 00 00 00 00 00 00[fa]fa fa fa fa fa fa
0x0c4a800cef10: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c4a800cef20: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c4a800cef30: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c4a800cef40: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c4a800cef50: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Heap right redzone: fb
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack partial redzone: f4
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Contiguous container OOB:fc
ASan internal: fe
==5658==ABORTING
Single-bit accessors are very slow, especially because we don't support
setting a bit to a given value (only set to 1 and clear to 0). This makes
loading and retrieving the contents of a bitmap painfully slow.
Fix by providing iterator-based load() and save() methods. The methods
support partial load/save so that access to very large bitmaps can be
split over multiple tasks.
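A sketch of the approach with illustrative names (not the actual large_bitset interface): storage is an array of 64-bit words, and load()/save() move whole words through iterators instead of touching bits one at a time:

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    class word_bitset {
        std::vector<uint64_t> _words;
    public:
        explicit word_bitset(size_t bits) : _words((bits + 63) / 64) {}

        // Partial load: fills words starting at word_offset, so a huge
        // bitmap can be loaded in chunks from several tasks.
        template <typename Iterator>
        void load(Iterator first, Iterator last, size_t word_offset = 0) {
            std::copy(first, last, _words.begin() + word_offset);
        }

        // Partial save: writes nr_words words starting at word_offset.
        template <typename Iterator>
        void save(Iterator out, size_t word_offset, size_t nr_words) const {
            std::copy_n(_words.begin() + word_offset, nr_words, out);
        }
    };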
A region being merged can still be in use, but after merging its
compaction_lock and reclaim counter no longer work. This can lead to
use-after-compact-without-re-lookup errors.
Fix by making the source region the same as the target region; they
will share compaction locks and reclaim counters, so lookup avoidance
will still work correctly.
Fixes #286.
In some cases a region may be in a state where it is not empty and yet
nothing can be evicted from it. For example, when creating the first
entry, the reclaimer may be invoked during creation, before the entry
is linked. We therefore can't rely on emptiness as a stop condition for
reclamation; the eviction function must signal whether it made forward
progress.
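Sketched, the stop condition changes from "region is empty" to "the evictor made progress" (hypothetical names):

    #include <cstddef>

    // evict_some() returns the number of bytes it freed; zero means no
    // forward progress was possible (e.g. the only entry is still being
    // constructed and is not linked yet).
    template <typename EvictFn>
    size_t reclaim(size_t target, EvictFn evict_some) {
        size_t reclaimed = 0;
        while (reclaimed < target) {
            size_t freed = evict_some();
            if (!freed) {
                break;  // don't spin waiting for the region to become empty
            }
            reclaimed += freed;
        }
        return reclaimed;
    }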
Related to #259. In some cases we need to allocate memory and hold the
reclaim lock at the same time. If that region holds most of the
reclaimable memory, allocations inside such a code section may
fail. allocating_section works around the problem: it learns how big
the reserves should be from past executions of the critical section and
tries to ensure proper reserves before entering the section.
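Roughly, the pattern looks like this (a sketch with hypothetical stand-in helpers, not the exact logalloc interface):

    #include <cstddef>
    #include <new>

    struct region {};
    struct reclaim_lock {                       // stand-in: pins references into the region
        explicit reclaim_lock(region&) {}
    };
    inline void ensure_free_memory(size_t) {}   // stand-in: evict/compact until the reserve is met

    class allocating_section {
        size_t _reserve = 128 * 1024;           // learned lower bound on the section's needs
    public:
        template <typename Func>
        decltype(auto) operator()(region& r, Func&& func) {
            while (true) {
                ensure_free_memory(_reserve);   // build up reserves before locking
                reclaim_lock guard(r);
                try {
                    return func();
                } catch (const std::bad_alloc&) {
                    _reserve *= 2;              // the reserve was too small; learn and retry
                }
            }
        }
    };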
Sometimes we may close an empty active segment, if all data in it was
evicted. Normally a segment is removed as soon as the last object in
it is freed, but if the segment is already empty when closed, no one is
supposed to call free on it. Such segments would be quickly reclaimed
during compaction, but it's possible that we will destroy the region
before compaction gets to them. Currently we fail on an assertion
which checks that there are no segments left. This change fixes the
problem by handling empty closed segments when the region is
destroyed.
Memory releasing is invoked from destructors, so it should not throw. As a
consequence it should not allocate memory either, so the emergency segment
pool was switched from std::deque<> to an allocation-free intrusive stack.
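A minimal sketch of such a stack (illustrative, not the actual emergency pool code): the link field lives inside the segment itself, so push/pop never allocate and cannot throw:

    struct segment {
        segment* next = nullptr;  // intrusive link; meaningful only while stacked
    };

    class segment_stack {
        segment* _top = nullptr;
    public:
        void push(segment* s) noexcept {
            s->next = _top;
            _top = s;
        }
        segment* pop() noexcept {
            segment* s = _top;
            if (s) {
                _top = s->next;
            }
            return s;
        }
        bool empty() const noexcept { return !_top; }
    };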
Copying values may throw (std::bad_alloc, for example), which would
result in a leak, because the destructor is not called when the
constructor throws.
May manifest as the following assertion failure:
utils/logalloc.cc:672: virtual logalloc::region_impl::~region_impl(): Assertion `_active->is_empty()' failed.
Fixes complaint about ignored result:
utils/file_lock.cc: In destructor 'utils::file_lock::impl::~impl()':
utils/file_lock.cc:29:33: error: ignoring return value of 'int lockf(int, int, __off_t)', declared with attribute warn_unused_result [-Werror=unused-result]
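The kind of change that silences it, sketched (not the exact code): consume the result explicitly, since a destructor can neither throw nor do much about a failed unlock:

    #include <unistd.h>

    struct impl {
        int _fd;
        ~impl() {
            auto r = ::lockf(_fd, F_ULOCK, 0);
            (void)r;  // best-effort unlock; nothing sensible to do on failure here
        }
    };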
Disabling compaction of a region is currently done in order to keep
references valid. But disabling only compaction is not enough: we
also need to disable eviction, since it invalidates references
too. Rather than introducing another type of lock, compaction
and eviction are now controlled together, generalized as "reclaiming"
(hence the reclaim_lock).
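Conceptually (a sketch with illustrative names): one RAII guard gates both operations, since either can invalidate references:

    class region_impl {
        unsigned _reclaim_disabled = 0;
    public:
        void disable_reclaim() noexcept { ++_reclaim_disabled; }
        void enable_reclaim() noexcept { --_reclaim_disabled; }
        // Gates compaction *and* eviction; checked before either runs.
        bool reclaiming_enabled() const noexcept { return _reclaim_disabled == 0; }
    };

    class reclaim_lock {
        region_impl& _r;
    public:
        explicit reclaim_lock(region_impl& r) : _r(r) { _r.disable_reclaim(); }
        ~reclaim_lock() { _r.enable_reclaim(); }
        reclaim_lock(const reclaim_lock&) = delete;
        reclaim_lock& operator=(const reclaim_lock&) = delete;
    };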
The goal is to make allocation less likely to fail. With an async
reclaimer there is an implicit bound on the amount of memory that can
be allocated between deferring points, but this bound is difficult to
enforce. A sync reclaimer lifts this limitation.
Also, allocations which previously could not be satisfied because of
fragmentation now have a higher chance of succeeding, although,
depending on how much memory is fragmented, that could involve
evicting a lot of segments from cache, so we should still avoid such
allocations.
The downside of sync reclaiming is that references into regions may now
be invalidated not only across deferring points but at any allocation
site. compaction_lock can be used to pin data, preferably just
temporarily.
* seastar 5176352...68fee6c (1):
> Merge "Memory reclamation infrastructure follow-up" from Tomasz
Adjusted logalloc::tracker's reclaimer to fit the new API.
"Initial implementation/transposition of commit log replay.
* Changes replay position to be shard aware
* Commit log segment IDs now follow basically the same scheme as Origin:
max(previous ID, wall clock time in ms) + shard info (for us)
* SSTables now use the DB definition of replay_position.
* Stores and propagates (compaction) flush replay positions in sstables
* If CL segments are left over from a previous run, they and the existing
sstables are inspected for their high-water marks, and replay then starts
from those marks to recover mutations potentially lost in a crash
* Note that a CPU count change is "handled" only in so much as shard matching
is done against the _previous_ run's shards, not the current ones.
Known limitations:
* Mutations deserialized from old CL segments are _not_ fully validated
against existing schemas.
* System::truncated_at (not currently used) does not handle sharding afaik,
so watermark IDs coming from there are dubious.
* Mutations that fail to apply (invalid, broken) are not placed in blob files
like Origin does. Partly because I am lazy, but also partly because our serial
format differs, and we currently have no tools to do anything useful with it.
* No replay filtering (Origin allows a system property to designate a filter
file, detailing which keyspace/cfs to replay). Partly because we have no
system properties.
There is no unit test for the commit log replayer (yet), because I could not
really come up with a good one given the existing test infrastructure (it's
tricky to kill stuff just "right").
The functionality is verified by manual testing, i.e. running scylla,
building up data (cassandra-stress), kill -9 + restart.
This of course does not fully validate whether the resulting DB is
100% identical to the one at kill -9 time, but at least it verified that
replay took place and mutations were applied.
(Note that Origin also lacks validity testing)"
Fixes #98.
"I saw about 4% improvement in perf_sstable write on muninn with this. The
decorated_key comparison is gone from the perf profile now. Now most of the
work inside the reader is for copying the mutation."
By using a recognized idiom, gcc can optimize the unaligned little-endian
load into a single instruction (actually less than an instruction, as it
combines the load with a succeeding arithmetic operation).
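One such recognized idiom is going through memcpy into a local (a sketch; the actual helper may differ):

    #include <cstdint>
    #include <cstring>

    inline uint64_t read_le64(const void* p) {
        uint64_t v;
        // gcc folds this memcpy into a single unaligned load on x86;
        // on a little-endian target the bytes are already in order
        std::memcpy(&v, p, sizeof(v));
        return v;
    }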
"Initial implementation/transposition of commit log replay.
* Changes replay position to be shard aware
* Commit log segment ID:s now follow basically the same scheme as origin;
max(previous ID, wall clock time in ms) + shard info (for us)
* SStables now use the DB definition of replay_position.
* Stores and propagates (compaction) flush replay positions in sstables
* If CL segments are left over from a previous run, they, and existing
sstables are inspected for high water mark, and then replayed from
those marks to amend mutations potentially lost in a crash
* Note that CPU count change is "handled" in so much that shard matching is
per _previous_ runs shards, not current.
Known limitations:
* Mutations deserialized from old CL segments are _not_ fully validated
against existing schemas.
* System::truncated_at (not currently used) does not handle sharding afaik,
so watermark ID:s coming from there are dubious.
* Mutations that fail to apply (invalid, broken) are not placed in blob files
like origin. Partly because I am lazy, but also partly because our serial
format differs, and we currently have no tools to do anything useful with it
* No replay filtering (Origin allows a system property to designate a filter
file, detailing which keyspace/cf:s to replay). Partly because we have no
system properties.
There is no unit test for the commit log replayer (yet).
Because I could not really come up with a good one given the test
infrastructure that exists (tricky to kill stuff just "right").
The functionality is verified by manual testing, i.e. running scylla,
building up data (cassandra-stress), kill -9 + restart.
This of course does not really fully validate whether the resulting DB is
100% valid compared to the one at k-9, but at least it verified that replay
took place, and mutations where applied.
(Note that origin also lacks validity testing)"
When stopping a task we shouldn't retry a compaction, because if we
are removing a cf and an error happens, we would push the cf back
into the queue, possibly leading to a use-after-free.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
It was noticed that the same sstable files could be selected for
compaction if concurrent compactions happen on the same cf.
That's possible because the compaction manager uses two tasks for
handling compactions.
The solution is to not duplicate a cf in the compaction manager queue,
and to re-schedule compaction for a cf if needed.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>