Disabling compaction of a region is currently done in order to keep
references into it valid. But disabling compaction alone is not enough;
we also need to disable eviction, since it invalidates references as
well. Rather than introducing another type of lock, compaction and
eviction are controlled together, generalized as "reclaiming" (hence
the reclaim_lock).
The goal is to make allocation less likely to fail. With the async
reclaimer there is an implicit bound on the amount of memory that can
be allocated between deferring points, but this bound is difficult to
enforce. The sync reclaimer lifts this limitation.
Also, allocations which previously could not be satisfied because of
fragmentation now have a higher chance of succeeding, although,
depending on how much memory is fragmented, that could involve
evicting a lot of segments from cache, so we should still avoid such
allocations.
The downside of sync reclaiming is that references into regions may now
be invalidated not only across deferring points but at any allocation
site. compaction_lock can be used to pin data, preferably only
temporarily.
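
A minimal usage sketch of pinning with the lock; apart from reclaim_lock
itself (named above), the accessors here are assumed for illustration:

    logalloc::region& region = cache_region();   // hypothetical accessor
    {
        logalloc::reclaim_lock lock(region);     // blocks compaction and eviction
        auto& entry = find_entry(region, key);   // reference into the region...
        process(entry);                          // ...stays valid even if this
                                                 // allocates and triggers reclaim
    }                                            // unlocked; keep the scope short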
* seastar 5176352...68fee6c (1):
> Merge "Memory reclamation infrastructure follow-up" from Tomasz
Adjusted logalloc::tracker's reclaimer to fit new API
"Initial implementation/transposition of commit log replay.
* Changes replay position to be shard aware
* Commit log segment IDs now follow basically the same scheme as origin;
max(previous ID, wall clock time in ms) + shard info (for us)
* SSTables now use the DB definition of replay_position.
* Stores and propagates (compaction) flush replay positions in sstables
* If CL segments are left over from a previous run, they, and existing
sstables, are inspected for high-water marks, and replay then proceeds from
those marks to amend mutations potentially lost in a crash
* Note that a CPU count change is "handled" only insofar as shard matching is
done against the _previous_ run's shards, not the current ones.
Known limitations:
* Mutations deserialized from old CL segments are _not_ fully validated
against existing schemas.
* System::truncated_at (not currently used) does not handle sharding afaik,
so watermark IDs coming from there are dubious.
* Mutations that fail to apply (invalid, broken) are not placed in blob files
like in origin. Partly because I am lazy, but also partly because our
serialization format differs, and we currently have no tools to do anything
useful with such files.
* No replay filtering (origin allows a system property to designate a filter
file detailing which keyspace/CFs to replay). Partly because we have no
system properties.
There is no unit test for the commit log replayer (yet), because I could
not really come up with a good one given the existing test
infrastructure (it is tricky to kill stuff just "right").
The functionality is verified by manual testing, i.e. running scylla,
building up data (cassandra-stress), then kill -9 + restart.
This of course does not fully validate that the resulting DB is 100%
identical to the one at the time of the kill -9, but at least it verifies
that replay took place and that mutations were applied.
(Note that origin also lacks validity testing)"
Fixes #98.
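
A sketch of the segment ID scheme described above; the bit layout and
all names here are assumptions for illustration, not the actual
implementation:

    #include <algorithm>
    #include <chrono>
    #include <cstdint>

    static uint64_t next_segment_id(uint64_t prev_base, unsigned shard) {
        using namespace std::chrono;
        uint64_t now_ms = duration_cast<milliseconds>(
                system_clock::now().time_since_epoch()).count();
        // max() keeps IDs monotonic even if the wall clock jumps backwards
        uint64_t base = std::max(prev_base + 1, now_ms);
        // fold in the shard so IDs from different shards cannot collide
        return (base << 10) | (shard & 0x3ff);   // hypothetical layout
    }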
"I saw about 4% improvement in perf_sstable write on muninn with this. The
decorated_key comparison is gone from the perf profile now. Now most of the
work inside the reader is for copying the mutation."
By using a recognized idiom, gcc can optimize the unaligned little-endian
load into a single instruction (actually less than an instruction, since it
combines it with a succeeding arithmetic operation).
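
A minimal sketch of the idiom in question, shown for a 32-bit value:
memcpy into an integer, which gcc folds into a single unaligned load
(a little-endian host is assumed here; otherwise add a byte swap):

    #include <cstdint>
    #include <cstring>

    inline uint32_t read_le_u32(const void* p) {
        uint32_t v;
        std::memcpy(&v, p, sizeof(v));  // compiles to one mov on x86
        return v;   // no byte-by-byte shifting and or-ing in the generated code
    }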
"Initial implementation/transposition of commit log replay.
* Changes replay position to be shard aware
* Commit log segment ID:s now follow basically the same scheme as origin;
max(previous ID, wall clock time in ms) + shard info (for us)
* SStables now use the DB definition of replay_position.
* Stores and propagates (compaction) flush replay positions in sstables
* If CL segments are left over from a previous run, they, and existing
sstables are inspected for high water mark, and then replayed from
those marks to amend mutations potentially lost in a crash
* Note that CPU count change is "handled" in so much that shard matching is
per _previous_ runs shards, not current.
Known limitations:
* Mutations deserialized from old CL segments are _not_ fully validated
against existing schemas.
* System::truncated_at (not currently used) does not handle sharding afaik,
so watermark ID:s coming from there are dubious.
* Mutations that fail to apply (invalid, broken) are not placed in blob files
like origin. Partly because I am lazy, but also partly because our serial
format differs, and we currently have no tools to do anything useful with it
* No replay filtering (Origin allows a system property to designate a filter
file, detailing which keyspace/cf:s to replay). Partly because we have no
system properties.
There is no unit test for the commit log replayer (yet).
Because I could not really come up with a good one given the test
infrastructure that exists (tricky to kill stuff just "right").
The functionality is verified by manual testing, i.e. running scylla,
building up data (cassandra-stress), kill -9 + restart.
This of course does not really fully validate whether the resulting DB is
100% valid compared to the one at k-9, but at least it verified that replay
took place, and mutations where applied.
(Note that origin also lacks validity testing)"
When stopping a task, we shouldn't retry a compaction: if we are
removing a cf and an error happens, we would push the cf back into
the back of the queue, and that would possibly lead to a
use-after-free.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
It was noticed that the same sstable files could be selected for
compaction if concurrent compactions happen on the same cf.
That's possible because the compaction manager uses 2 tasks for
handling compactions.
The solution is to not duplicate a cf in the compaction manager queue,
and to re-schedule compaction for a cf if needed.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Instead of failing normal allocations when the seastar allocator cannot
allocate a segment, provide a generous reserve. An allocation failure
will now be satisfied from the reserve, but it will still trigger a
reclaim. This allows hiding low-memory conditions from the user.
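
The gist of the failure path, as an illustrative pseudo-flow (not
seastar's actual code; the helper names are assumed):

    void* allocate(size_t n) {
        if (void* p = try_allocate(n)) {
            return p;
        }
        void* p = take_from_reserve(n);   // generous reserve absorbs the failure
        trigger_reclaim();                // start reclaiming to refill the reserve
        return p;                         // the caller never sees the shortage
    }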
Like boost::dynamic_bitset, but less capable. On the other hand, it
avoids the very large allocations incurred by the bloom filter's bitset
on even moderately sized sstables.
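
A minimal sketch of the idea, with illustrative names and an assumed
chunk size (not the actual class): the bits live in fixed-size chunks,
so no single allocation grows with the filter size:

    #include <cstdint>
    #include <memory>
    #include <vector>

    class chunked_bitset {
        static constexpr size_t bits_per_chunk = 128 * 1024 * 8;  // 128 KiB chunks
        std::vector<std::unique_ptr<uint64_t[]>> _chunks;
    public:
        explicit chunked_bitset(size_t nbits) {
            for (size_t n = 0; n < nbits; n += bits_per_chunk) {
                // make_unique value-initializes, so all bits start cleared
                _chunks.push_back(std::make_unique<uint64_t[]>(bits_per_chunk / 64));
            }
        }
        void set(size_t i) {
            _chunks[i / bits_per_chunk][i % bits_per_chunk / 64] |= uint64_t(1) << (i % 64);
        }
        bool test(size_t i) const {
            return _chunks[i / bits_per_chunk][i % bits_per_chunk / 64] >> (i % 64) & 1;
        }
    };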
To stop a compaction manager task, we first close the gate used by
compaction, then break the semaphore via semaphore::broken().
The problem is that semaphore::broken() only signals current waiters,
so subsequent semaphore::wait() calls would still succeed and the task
would remain alive forever.
The fix is to signal the semaphore, forcing the task to exit via a gate
exception, so we no longer rely on semaphore::broken() for finishing
the task. That's possible because we access the gate right after
waiting on the semaphore.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
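
The resulting pattern, roughly; a simplified sketch, not the actual
compaction manager code (the real code also enters the gate around the
compaction work itself so that stop() waits for it):

    seastar::future<> task_loop() {
        return seastar::keep_doing([this] {
            return _sem.wait().then([this] {
                _gate.check();    // throws gate_closed_exception after stop()
                return do_compaction();
            });
        }).handle_exception_type([] (const seastar::gate_closed_exception&) {
            // expected exit path when stopping
        });
    }

    seastar::future<> stop() {
        auto f = _gate.close();   // closes the gate, waits for entered work
        _sem.signal();            // wake the fiber; its next gate check throws
        return f;
    }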
We need a way to remove a column family from the compaction manager,
because when dropping a column family we must make sure the compaction
manager no longer holds a reference to it.
The compaction manager queue is now a queue of column_family, allowing
us to cancel requests pertaining to a column family being dropped.
There may be an ongoing compaction for the column family being
dropped, so we also need to wait for its termination.
The testcase for the compaction manager was also adapted and improved.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
This heavily used function shows up in many places in the profile (as part
of other functions), so it's worth optimizing by eliminating the special
case for the standard allocator. Use a statically allocated object instead;
a non-thread-local object is fine since it has no data members.
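
The assumed shape of the change (names are illustrative): one shared,
stateless instance replaces the per-call special case:

    // Safe to share across shards: the object carries no data members.
    static standard_allocation_strategy standard_allocation_strategy_instance;

    allocation_strategy& standard_allocator() {
        return standard_allocation_strategy_instance;
    }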
While #152 is still open, we need to allow moderately sized allocations
to succeed. Extend the segment size to 256k, which allows threads to be
allocated.
Fixes #151.
"Histograms are used to collect latency information, in Origin, many of the
operations are timed, this is a potential performance issue. This series adds
an option to sample the operations, where small amount will be timed and the
most will only be counted.
This will give an estimation for the statistics, while keeping an accurate
count of the total events and have neglectible performance impact.
The first to use the modified histogram are the column family for their read
and write."
Conflicts:
database.hh
To free memory, we need to allocate memory: in LSA compaction, we convert
N segments with an average occupancy of (N-1)/N into N-1 new segments.
However, to do that we need to allocate segments, which we may not be able
to do due to the very low-memory condition that caused us to compact in
the first place.
Fix by introducing a segment reserve, which we normally try to keep full.
During low-memory conditions, we temporarily allow allocating from the
emergency reserve.
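
To make the numbers concrete: with N = 8 and 7/8 average occupancy, eight
segments compact into seven fully packed ones, but each copy step needs a
fresh segment before an old one can be released, so at least one spare
segment must be available at exactly the moment memory is scarcest; the
reserve provides that spare.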
Currently, each column family creates a fiber to handle compaction requests
in parallel with the rest of the system. If there are N column families,
N compactions could be running in parallel, which is definitely horrible.
To solve that problem, a per-database compaction manager is introduced
here. The compaction manager services compaction requests from N column
families. Parallelism is made available by creating more than one fiber
to service the requests; that is, N compaction requests will be served
by M fibers.
A submitted compaction request goes to a job queue shared among all
fibers, and the fiber with the lowest number of pending jobs is
signalled.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
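
Condensed, the scheme looks roughly like this. An illustrative sketch
with a single shared semaphore; the actual code tracks per-fiber pending
counts so it can signal the least-loaded fiber:

    struct compaction_manager_sketch {
        std::deque<column_family*> _queue;   // pending compaction requests
        seastar::semaphore _pending{0};      // one unit per queued request

        void start(unsigned m) {
            for (unsigned i = 0; i < m; ++i) {
                // one service fiber; its future is kept elsewhere in real code
                (void)seastar::keep_doing([this] {
                    return _pending.wait().then([this] {
                        auto cf = _queue.front();
                        _queue.pop_front();
                        return compact(cf);  // hypothetical helper
                    });
                });
            }
        }
        void submit(column_family* cf) {
            _queue.push_back(cf);
            _pending.signal();               // wake one idle fiber
        }
    };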
The histogram object is used both as a general counter for the number of
events and for statistics and sampling.
This changes the histogram implementation so that it supports sparse
sampling while keeping the total number of events accurate.
The implementation includes the following:
Remove the template nature of the histogram, as it is used only for
timers, and use the name ihistogram instead.
If in the future we need a histogram for other types, we can reuse the
histogram name for it.
A total counter was added that counts the number of events that are part
of the statistics calculation.
Helper methods were added to ihistogram to handle the latency
counter object.
Based on the sample mask, it marks the latency object as started when
the bitwise AND of the counter and the mask is non-zero, and its mark
method accepts the latency object: if the latency was not started, it
is not added, and only the 'count' counter, which counts the total
number of events, is incremented.
This should reduce the impact of latency calculation to a negligible
effect.
Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
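
A sketch of the sampling logic, using the is_start() helper described in
the following message. The names, the mask value, and the exact sampling
condition are assumptions based on the description above (chosen here so
that only a small fraction of events is timed):

    struct ihistogram_sketch {
        uint64_t count = 0;              // total events, always accurate
        uint64_t sample_mask = 0x7F;     // assumed: time ~1 in 128 events

        void set_latency(latency_counter& lc) {
            if ((count & sample_mask) == 0) {  // sampling condition (assumption)
                lc.start();                    // only sampled events get a timestamp
            }
        }
        void mark(latency_counter& lc) {
            ++count;                           // every event is counted
            if (lc.is_start()) {               // only sampled events...
                add(lc.stop());                // ...feed the histogram buckets
            }
        }
    };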
When doing a sparse latency check, it is required to know whether a
latency object was started.
This returns true if the start timer was set.
Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
When the LSA reclaimer cannot reclaim more space by compaction, it
will reclaim memory by evicting from evictable regions.
Currently the only evictable region is the one owned by the row cache.
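
In other words, simplified and with illustrative names:

    size_t reclaim(size_t target) {
        size_t freed = compact_lsa_regions(target);  // first, defragment
        if (freed < target) {                        // still short? evict from
            freed += evict(target - freed);          // evictable regions
        }                                            // (today: the row cache)
        return freed;
    }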
Requiring alignment means that there must be 64K of contiguous space
to allocate each 32K segment. When memory is fragmented, we may fail
to allocate such a segment even though there is plenty of free space.
This especially hurts forward progress of compaction, which frees
segments randomly and relies on the fact that freeing a segment makes
it available to the next segment request.
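
To make the arithmetic concrete: a 32K-aligned 32K segment can begin at
only one address within any 32K-aligned window, so in the worst case a
free range must stretch to nearly 64K (32K of payload plus up to 32K - 1
bytes of alignment padding) before it is guaranteed to contain a usable
slot.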