scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-04 05:53:13 +00:00

Author	SHA1	Message	Date
Tomasz Grabiec	c82325a76c	lsa: Make region evictor signal forward progress In some cases region may be in a state where it is not empty and nothing could be evicted from it. For example when creating the first entry, reclaimer may get invoked during creation before it gets linked. We therefore can't rely on emptiness as a stop condition for reclamation, the eviction function shall signal us if it made forward progress.	2015-09-06 21:25:44 +02:00
Tomasz Grabiec	94f0db933f	lsa: Fix typo in the word 'emergency'	2015-09-06 21:24:59 +02:00
Tomasz Grabiec	200562abe7	lsa: Reclaim over-max segments from segment pool reserve	2015-09-06 21:24:59 +02:00
Tomasz Grabiec	d022a1a4a3	lsa: Introduce allocating_section Related to #259. In some cases we need to allocate memory and hold reclaim lock at the same time. If that region holds most of the reclaimable memory, allocations inside that code section may fail. allocating_section is a work-around of the problem. It learns how big reserves should be from past execution of critical section and tries to ensure proper reserves before entering the section.	2015-09-06 21:24:59 +02:00
Tomasz Grabiec	3caad2294b	lsa: Tolerate empty segments when region is destroyed Sometimes we may close an empty active segment, if all data in it was evicted. Normally segments are removed as soon as the last object in it is freed, but if the segment is already empty when closed, no one is supposed to call free on it. Such segments would be quickly reclaimed during compaction, but it's possible that we will destroy the region before they're reclaimed by compaction. Currently we would fail on an assertion which checks that there are no segments. This change fixes the problem by handling empty closed segments when region is destroyed.	2015-09-06 21:24:59 +02:00
Tomasz Grabiec	c37aa73051	lsa: Drop alignment requirement from segment	2015-09-06 21:24:59 +02:00
Tomasz Grabiec	2c1536b5a7	lsa: Make free() path noexcept Memory releasing is invoked from destructors so should not throw. As a consequence it should not allocate memory, so emergency segment pool was switched from std::deque<> to an alloc-free intrusive stack.	2015-09-06 21:24:59 +02:00
Tomasz Grabiec	f404a238bb	allocation_strategy: Make construct() exception-safe	2015-09-06 21:24:59 +02:00
Tomasz Grabiec	0d7cdab0ff	utils: managed_vector: Make copy assignment exception-safe	2015-09-06 21:24:58 +02:00
Tomasz Grabiec	81d81c81f2	utils: managed_vector: Make copy constructor exception-safe Copying values may throw (std::bad_alloc for example), which will result in a leak because destructor will not be called when constructor throws. May manifest as the following assertion failure: utils/logalloc.cc:672: virtual logalloc::region_impl::~region_impl(): Assertion `_active->is_empty()' failed.	2015-09-06 21:24:58 +02:00
Tomasz Grabiec	fa8d530cc2	lsa: Add ability to trace reclaiming latency	2015-09-06 21:24:58 +02:00
Tomasz Grabiec	25e4f10c14	utils/file_lock: Fix compilation error Fixes complaint about ignored result: utils/file_lock.cc: In destructor 'utils::file_lock::impl::~impl()': utils/file_lock.cc:29:33: error: ignoring return value of 'int lockf(int, int, __off_t)', declared with attribute warn_unused_result [-Werror=unused-result]	2015-09-04 13:00:54 +02:00
Avi Kivity	ef582753c0	Merge "Lock files for CL+Data dirs" from Calle "Acquires and maintains lock files during execution. Lock files are deleted on "clean" exit, and re-taken on uncontended startup. Fixes #34 "	2015-09-01 19:18:46 +03:00
Calle Wilund	e3abde50f9	Add lock file helper class "file_lock" Using posix "lockf" per-fd region locking	2015-09-01 17:50:18 +02:00
Tomasz Grabiec	870e9e5729	lsa: Replace compaction_lock with broader reclaim_lock Disabling compaction of a region is currently done in order to keep the references valid. But disabling only compaction is not enough, we also need to disable eviction, as it also invalidates references. Rather than introducing another type of lock, compaction and eviction are controlled together, generalized as "reclaiming" (hence the reclaim_lock).	2015-09-01 17:29:04 +03:00
Tomasz Grabiec	48569651ea	lsa: Fix calculation of bytes.non_lsa_used_space	2015-09-01 17:29:03 +03:00
Tomasz Grabiec	d20fae96a2	lsa: Make reclaimer run synchronously with allocations The goal is to make allocation less likely to fail. With async reclaimer there is an implicit bound on the amount of memory that can be allocated between deferring points. This bound is difficult to enforce though. Sync reclaimer lifts this limitation off. Also, allocations which could not be satisfied before because of fragmentation now will have higher chances of succeeding, although depending on how much memory is fragmented, that could involve evicting a lot of segments from cache, so we should still avoid them. Downside of sync reclaiming is that now references into regions may be invalidated not only across deferring points but at any allocation site. compaction_lock can be used to pin data, preferably just temporarily.	2015-08-31 21:50:18 +02:00
Tomasz Grabiec	6105c05dbe	lsa: Introduce compaction_lock helper	2015-08-31 21:50:17 +02:00
Tomasz Grabiec	42dce17c82	lsa: Fix documentation for eviction functions	2015-08-31 21:50:17 +02:00
Avi Kivity	203b349722	Merge seastar upstream * seastar 5176352...68fee6c (1): > Merge "Memory reclamation infrastructure follow-up" from Tomasz Adjusted logalloc::tracker's reclaimer to fit new API	2015-08-31 20:01:07 +03:00
Paweł Dziepak	8d0419b621	managed_bytes: simplify empty() Signed-off-by: Paweł Dziepak <pdziepak@cloudius-systems.com>	2015-08-31 17:29:16 +02:00
Paweł Dziepak	0343fddeb0	managed_bytes: simplify default constructor Signed-off-by: Paweł Dziepak <pdziepak@cloudius-systems.com>	2015-08-31 17:29:16 +02:00
Paweł Dziepak	956c27e021	utils: add LSA-aware vector capable of internal storage Signed-off-by: Paweł Dziepak <pdziepak@cloudius-systems.com>	2015-08-31 17:29:13 +02:00
Avi Kivity	702de43ce3	Merge "Commit log replay" from Calle "Initial implementation/transposition of commit log replay. * Changes replay position to be shard aware * Commit log segment ID:s now follow basically the same scheme as origin; max(previous ID, wall clock time in ms) + shard info (for us) * SStables now use the DB definition of replay_position. * Stores and propagates (compaction) flush replay positions in sstables * If CL segments are left over from a previous run, they, and existing sstables are inspected for high water mark, and then replayed from those marks to amend mutations potentially lost in a crash * Note that CPU count change is "handled" in so much that shard matching is per _previous_ runs shards, not current. Known limitations: * Mutations deserialized from old CL segments are _not_ fully validated against existing schemas. * System::truncated_at (not currently used) does not handle sharding afaik, so watermark ID:s coming from there are dubious. * Mutations that fail to apply (invalid, broken) are not placed in blob files like origin. Partly because I am lazy, but also partly because our serial format differs, and we currently have no tools to do anything useful with it * No replay filtering (Origin allows a system property to designate a filter file, detailing which keyspace/cf:s to replay). Partly because we have no system properties. There is no unit test for the commit log replayer (yet). Because I could not really come up with a good one given the test infrastructure that exists (tricky to kill stuff just "right"). The functionality is verified by manual testing, i.e. running scylla, building up data (cassandra-stress), kill -9 + restart. This of course does not really fully validate whether the resulting DB is 100% valid compared to the one at k-9, but at least it verified that replay took place, and mutations were applied. (Note that origin also lacks validity testing)" Fixes #98.	2015-08-31 15:58:12 +03:00
Calle Wilund	c040565bf9	runtime: expose boot_time (boot == app start, I did not rename the var).	2015-08-31 14:29:45 +02:00
Avi Kivity	8c69098c89	Merge "Optimize memtable's scanning_reader" from Tomasz "I saw about 4% improvement in perf_sstable write on muninn with this. The decorated_key comparison is gone from the perf profile now. Now most of the work inside the reader is for copying the mutation."	2015-08-31 15:07:27 +03:00
Tomasz Grabiec	110a55886c	lsa: Introduce region::compaction_counter()	2015-08-31 13:58:42 +02:00
Tomasz Grabiec	9ad3dbe592	lsa: Add region::compaction_enabled()	2015-08-31 13:58:42 +02:00
Tomasz Grabiec	048387782a	lsa: Rename region::set_compactible() to set_compaction_enabled() To avoid confusion with region_impl::is_compactible() when the getter is added.	2015-08-31 13:58:42 +02:00
Avi Kivity	f171d71c16	utils: optimize murmur3_hash data fetch By using a recognized idiom, gcc can optimize the unaligned little endian load as a single instruction (actually less than an instruction, as it combines it with a succeeding arithmetic operation).	2015-08-28 12:37:43 +03:00
Avi Kivity	5f62f7a288	Revert "Merge "Commit log replay" from Calle" Due to test breakage. This reverts commit `43a4491043`, reversing changes made to `5dcf1ab71a`.	2015-08-27 12:39:08 +03:00
Avi Kivity	4e3c9c5493	Merge "compaction manager fixes" from Raphael	2015-08-27 11:05:26 +03:00
Avi Kivity	43a4491043	Merge "Commit log replay" from Calle "Initial implementation/transposition of commit log replay. * Changes replay position to be shard aware * Commit log segment ID:s now follow basically the same scheme as origin; max(previous ID, wall clock time in ms) + shard info (for us) * SStables now use the DB definition of replay_position. * Stores and propagates (compaction) flush replay positions in sstables * If CL segments are left over from a previous run, they, and existing sstables are inspected for high water mark, and then replayed from those marks to amend mutations potentially lost in a crash * Note that CPU count change is "handled" in so much that shard matching is per _previous_ runs shards, not current. Known limitations: * Mutations deserialized from old CL segments are _not_ fully validated against existing schemas. * System::truncated_at (not currently used) does not handle sharding afaik, so watermark ID:s coming from there are dubious. * Mutations that fail to apply (invalid, broken) are not placed in blob files like origin. Partly because I am lazy, but also partly because our serial format differs, and we currently have no tools to do anything useful with it * No replay filtering (Origin allows a system property to designate a filter file, detailing which keyspace/cf:s to replay). Partly because we have no system properties. There is no unit test for the commit log replayer (yet). Because I could not really come up with a good one given the test infrastructure that exists (tricky to kill stuff just "right"). The functionality is verified by manual testing, i.e. running scylla, building up data (cassandra-stress), kill -9 + restart. This of course does not really fully validate whether the resulting DB is 100% valid compared to the one at k-9, but at least it verified that replay took place, and mutations were applied. (Note that origin also lacks validity testing)"	2015-08-27 10:53:36 +03:00
Calle Wilund	45d07d2744	runtime: expose boot_time (boot == app start, I did not rename the var).	2015-08-25 09:14:40 +02:00
Avi Kivity	0617aecb62	lsa: downgrade "no compactible pool" warning to trace It's a fairly standard condition.	2015-08-24 17:26:48 +02:00
Raphael S. Carvalho	41b6d430c0	compaction_manager: do not retry compaction if stopping task If stopping a task, we shouldn't retry a compaction because if removing a cf, we would push back the cf into the back of the queue if an error happened, and that would possibly lead to a use-after-free. Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>	2015-08-24 11:23:24 -03:00
Raphael S. Carvalho	4c9c144987	compaction_manager: avoid concurrent compaction on the same cf It was noticed that the same sstable files could be selected for compaction if concurrent compaction happens on the same cf. That's possible because compaction manager uses 2 tasks for handling compactions. Solution is to not duplicate cf in the compaction manager queue, and re-schedule compaction for a cf if needed. Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>	2015-08-24 11:11:47 -03:00
Avi Kivity	77b3212c88	lsa: provide a fallback during normal allocation Instead of failing normal allocations when the seastar allocator cannot allocate a segment, provide a generous reserve. An allocation failure will now be satisfied from the reserve, but it will still trigger a reclaim. This allows hiding low-memory conditions from the user.	2015-08-23 16:38:04 +03:00
Avi Kivity	1bb840bb72	sstables: use large_bitset in bloom filter Avoids allocation failures due to multi-megabyte filters.	2015-08-23 12:22:49 +03:00
Avi Kivity	e928bcaf19	utils: introduce large_bitset Like boost::dynamic_bitset, but less capable. On the other hand it avoids very large allocations, which are incurred by the bloom filter's bitset on even moderately sized sstables.	2015-08-23 12:22:49 +03:00
Raphael S. Carvalho	c6ea25c5fb	compaction_manager: fix compaction_manager::stop For stopping a task of compaction manager, we first close the gate used by compaction then bust semaphore via semaphore::broken(). The problem is that semaphore::broken() only signals waiters, and so subsequent semaphore::wait() calls would succeed and the task would remain alive forever. The fix is to signal semaphore, forcing the task to exit via gate exception, so we will no longer rely on semaphore::broken() for finishing the task. That's possible because we try to access the gate right after we waited on semaphore. Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>	2015-08-22 20:38:12 +03:00
Avi Kivity	f531f36a44	lsa: fix types in logs	2015-08-20 15:29:08 +03:00
Avi Kivity	9012f991bf	logalloc: really allow dipping into the emergency pool during reclaim The RAII wrapper for the emergency pool was invoked without an object, and so had no effect.	2015-08-20 12:10:03 +03:00
Avi Kivity	9ed2bbb25c	lsa: introduce region_group A region_group is a nestable group of regions, for cumulative statistics purposes.	2015-08-19 19:36:40 +03:00
Avi Kivity	71aad57ca8	lsa: make region::impl a top-level class Makes using forward declarations possible.	2015-08-19 14:43:17 +03:00
Avi Kivity	5252d5ec9b	managed_bytes: fix self-assignment	2015-08-19 11:18:07 +03:00
Avi Kivity	00f39c4e1a	managed_bytes: add small string optimization	2015-08-19 11:18:07 +03:00
Raphael S. Carvalho	820ba6f4d2	adapt compaction manager for column family removal We need a way to remove a column family from the compaction manager because when dropping a column family we need to make sure that the compaction manager doesn't hold a reference to it anymore. So compaction manager queue is now of column_family, allowing us to cancel requests pertaining to a column family being dropped. There may be an ongoing compaction for the column family being dropped, so we also need to wait for its termination. Testcase for compaction manager was also adapted and improved. Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>	2015-08-18 11:38:06 +03:00
Avi Kivity	608c0b8460	Merge "initial work on compaction manager API" from Raphael	2015-08-17 17:24:13 +03:00
Avi Kivity	932ddc328c	logalloc: optimize current_allocation_strategy() This heavily used function shows up in many places in the profile (as part of other functions), so it's worth optimizing by eliminating the special case for the standard allocator. Use a statically allocated object instead. (a non-thread-local object is fine since it has no data members).	2015-08-17 16:51:10 +03:00

143 Commits