Use steady_clock instead of high_resolution_clock where a monotonic
clock is required. high_resolution_clock is essentially a
system_clock (wall clock) and therefore must not be assumed to be
monotonic, since the wall clock may move backwards due to time/date adjustments.
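For illustration, a minimal sketch of the intended pattern (standard C++ only, not code from this patch):

    #include <chrono>
    #include <thread>

    int main() {
        // steady_clock is monotonic: durations measured with it are immune to
        // wall-clock (system_clock) adjustments such as NTP or date changes.
        auto start = std::chrono::steady_clock::now();
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
        auto elapsed = std::chrono::steady_clock::now() - start;
        (void)elapsed; // always non-negative, unlike a wall-clock measurement
    }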
Fixes issue #638
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
This list will store compaction_stats for each ongoing compaction.
That's why register and deregister methods are provided.
This change is important for the compaction stats API, which needs data
about each ongoing compaction, such as progress, ks, cf, etc.
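For illustration only, a hedged sketch of the register/deregister idea (all names hypothetical, not the actual compaction_manager code):

    #include <list>

    struct compaction_stats_stub {};   // stand-in: progress, ks, cf, etc.

    class ongoing_compactions {
        std::list<compaction_stats_stub*> _stats;   // one entry per ongoing compaction
    public:
        void register_compaction(compaction_stats_stub* s)   { _stats.push_back(s); }
        void deregister_compaction(compaction_stats_stub* s) { _stats.remove(s); }
        const std::list<compaction_stats_stub*>& all() const { return _stats; }
    };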
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This fixes a compile error:
In function `logalloc::segment_zone::segment_zone()':
/home/lmr/Code/scylla/utils/logalloc.cc:412: undefined reference to `logalloc::segment_zone::minimum_size'
collect2: error: ld returned 1 exit status
ninja: build stopped: subcommand failed.
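For context, the usual cause of such a link error before C++17 is a static constexpr member that is ODR-used without an out-of-class definition; a hedged sketch of that general pattern (the value and the actual fix in this patch may differ):

    #include <cstddef>

    // header: in-class declaration with initializer
    struct zone_like {
        static constexpr std::size_t minimum_size = 1; // illustrative value only
    };

    // one .cc file: out-of-class definition, needed before C++17 whenever the
    // member is ODR-used (e.g. bound to a const reference by std::max/std::min)
    constexpr std::size_t zone_like::minimum_size;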
Signed-off-by: Lucas Meneghel Rodrigues <lmr@scylladb.com>
blob_storage is defined with the packed attribute, which makes its alignment
requirement equal to 1. This means that its members may be unaligned.
GCC is obviously aware of that and will generate appropriate code
(and not generate ubsan checks). However, there are a few places where
members of blob_storage are accessed via pointers; these have to be
wrapped with unaligned_cast<> to let the compiler know that the location
pointed to may not be properly aligned.
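To illustrate the problem (sketch only; blob_storage itself and unaligned_cast<> differ in detail):

    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    // Packed, so the alignment requirement is 1 and members may sit at any offset.
    struct [[gnu::packed]] blob_like {
        void* next;
        uint32_t size;
    };

    // A plain uint32_t* formed from the member claims 4-byte alignment that the
    // packed layout cannot guarantee; ubsan flags such loads. Reading through
    // memcpy (what an unaligned-aware cast expresses more conveniently) is safe:
    uint32_t read_size(const blob_like* b) {
        uint32_t v;
        std::memcpy(&v, reinterpret_cast<const char*>(b) + offsetof(blob_like, size), sizeof(v));
        return v;
    }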
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
"This series attempts to make LSA more friendly for large (i.e. bigger
than LSA segment) allocations. It is achieved by introducing segment
zones – large, contiguous areas of segments and using them to allocate
segments instead of calling malloc() directly.
Zones can be shrunk when needed to reclaim memory and segments can be
migrated either to reduce number of zone or to defragment one in order
to be able to shrink it. LSA tries to keep all segments at the lower
addresses and reclaims memory starting from the zones in the highest
parts of the address space."
Originally, lsa allocated each segment independently, which could result
in high memory fragmentation. As a result, many compaction and eviction
passes may be needed to release a sufficiently big contiguous memory
block.
These problems are solved by the introduction of segment zones, contiguous
groups of segments. All segments are allocated from zones and the
algorithm tries to keep the number of zones to a minimum. Moreover,
segments can be migrated between zones or inside a zone in order to deal
with fragmentation inside a zone.
Segment zones can be shrunk but cannot grow. The segment pool keeps a tree
containing all zones ordered by their base addresses. This tree is used
only by the memory reclaimer. There is also a list of zones that have
at least one free segment, which is used during allocation.
Segment allocation doesn't have any preference for which segment (and zone)
to choose. Each zone contains a free list of unused segments. If there
are no zones with free segments, a new one is created.
Segment reclamation migrates segments from the zones higher in memory
to the ones at lower addresses. The remaining zones are shrunk until the
requested number of segments is reclaimed.
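A hedged sketch of this bookkeeping (illustrative data structures only; the real logalloc code differs):

    #include <cstddef>
    #include <map>
    #include <vector>

    struct zone {
        char* base;                          // zones are kept ordered by base address
        std::vector<std::size_t> free_list;  // indices of unused segments in this zone
    };

    struct segment_pool_sketch {
        std::map<char*, zone*> zones_by_base;      // used only by the memory reclaimer
        std::vector<zone*> zones_with_free_segments;
        std::size_t segment_size;

        // Allocation has no preference: take any zone with a free segment,
        // creating a new zone only when none has one.
        char* allocate_segment() {
            if (zones_with_free_segments.empty()) {
                zones_with_free_segments.push_back(make_new_zone());
            }
            zone* z = zones_with_free_segments.back();
            std::size_t idx = z->free_list.back();
            z->free_list.pop_back();
            if (z->free_list.empty()) {
                zones_with_free_segments.pop_back();
            }
            return z->base + idx * segment_size;
        }

        zone* make_new_zone();  // reserves a large contiguous area (not shown)
    };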
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
A dynamic bitset implementation that provides functions to search for
both set and cleared bits in both directions.
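For illustration, the rough shape of such a structure (names and layout hypothetical, not the actual class):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    class dynamic_bitset_sketch {
        std::vector<uint64_t> _words;   // 64 bits per word
    public:
        explicit dynamic_bitset_sketch(std::size_t nbits) : _words((nbits + 63) / 64) {}
        void set(std::size_t i)   { _words[i / 64] |=  (uint64_t(1) << (i % 64)); }
        void clear(std::size_t i) { _words[i / 64] &= ~(uint64_t(1) << (i % 64)); }
        bool test(std::size_t i) const { return (_words[i / 64] >> (i % 64)) & 1; }
        // The searches described above (next/previous set bit, next/previous
        // cleared bit) would scan word-at-a-time from a given position.
    };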
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Scattering of blobs from Avi:
This patchset converts the stack to scatter managed_bytes in lsa memory,
allowing large blobs (and collections) to be stored in memtable and cache.
Outside memtable/cache, they are still stored sequentially, but it is assumed
that the number of transient objects is bounded.
The approach taken here is to scatter managed_bytes data in multiple
blob_storage objects, but to linearize them back when accessing (for
example, to merge cells). This allows simple access through the normal
bytes_view. It causes an extra two copies, but copying a megabyte twice
is cheap compared to accessing a megabyte's worth of small cells, so
per-byte throughput is increased.
Testing shows that lsa large object space is kept at zero, but throughput
is bad because Scylla easily overwhelms the disk with large blobs; we'll
need Glauber's throttling patches or a really fast disk to see good
throughput with this.
Instead of allocating a single blob_storage, chain multiple blob_storage
objects in a list, each limited not to exceed the allocation_strategy's
max_preferred_allocation_size. This allows lsa to allocate each blob_storage
object as an lsa managed object that can be migrated in memory.
Also provide linearize()/scatter() methods that can be used to temporarily
consolidate the storage into a single blob_storage. This makes the data
contiguous, so we can use a regular bytes_view to examine it.
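A simplified sketch of the scattered layout and of linearize() (illustrative only; the real blob_storage also carries backrefs for LSA migration):

    #include <cstdint>
    #include <vector>

    struct fragment {
        fragment* next;       // chain of blob_storage-like pieces
        uint32_t frag_size;   // each piece stays under the preferred allocation size
        int8_t data[];        // payload bytes (GNU flexible array member)
    };

    // linearize(): copy all fragments into one contiguous buffer so callers can
    // keep examining the value through a plain bytes_view-style object.
    std::vector<int8_t> linearize(const fragment* head) {
        std::vector<int8_t> out;
        for (auto f = head; f; f = f->next) {
            out.insert(out.end(), f->data, f->data + f->frag_size);
        }
        return out;
    }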
Our premier allocation_strategy, lsa, prefers to limit allocations below
a tenth of the segment size so they can be moved around; larger allocations
are pinned and can cause memory fragmentation.
Provide an API so that objects can query for this preferred size limit.
For now, lsa is not updated to expose its own limit; this will be done
after the full stack is updated to make use of the limit, otherwise
intermediate steps would not work correctly.
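A hedged sketch of what such a query could look like (names hypothetical; the base strategy keeps reporting an unconstrained limit for now):

    #include <cstddef>
    #include <limits>

    class allocation_strategy_sketch {
    public:
        virtual ~allocation_strategy_sketch() = default;
        // Largest allocation the strategy can still relocate freely; callers split
        // bigger buffers into chunks of at most this size.
        virtual std::size_t preferred_max_allocation_size() const {
            return std::numeric_limits<std::size_t>::max();   // default: unconstrained
        }
    };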
This patch adds a started counter that is used to track the number of
operations that were started.
This counter serves two purposes: it is a better indication of when to
sample the data, and it is used to indicate how many operations are
pending.
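In other words (sketch only, field names not taken from the patch):

    #include <cstdint>

    struct op_counters {
        uint64_t started = 0;     // bumped when an operation starts
        uint64_t completed = 0;   // bumped when it finishes

        // number of operations currently in flight
        uint64_t pending() const { return started - completed; }
    };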
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
boost::heap::binomial_heap allocates a helper object in push() and,
therefore, may throw an exception. This shouldn't happen during
compaction.
The solution is to reserve space for this helper object in
segment_descriptor and use a custom allocator with
boost::heap::binomial_heap.
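A rough sketch of that allocator idea (simplified; the real one must also satisfy the full Allocator requirements used by boost::heap):

    #include <cstddef>

    // Each segment_descriptor reserves raw storage large enough for the single
    // heap node that binomial_heap::push() allocates for it, so allocate() can
    // simply hand that storage out and never throw.
    template <typename T>
    struct preallocated_node_allocator {
        using value_type = T;
        void* _reserved;   // storage embedded in the segment descriptor

        explicit preallocated_node_allocator(void* reserved) : _reserved(reserved) {}

        T* allocate(std::size_t) { return static_cast<T*>(_reserved); }
        void deallocate(T*, std::size_t) noexcept {}   // owned by the descriptor
    };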
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
The LSA memory reclaimer logic assumes that the amount of memory used by LSA
equals segments_in_use * segment_size. However, LSA is also responsible
for eviction of large objects, which do not affect the used segment count,
e.g. a region with no used segments may still use a lot of memory for
large objects. The solution is to switch from measuring memory in used
segments to a used bytes count that also includes large objects.
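Put as a formula (variable names illustrative, not from the patch):

    #include <cstddef>

    // Old reclaimer view: segments_in_use * segment_size, which makes a region
    // holding only large, non-segment objects look empty. New view: count bytes,
    // so large objects are included as well.
    std::size_t region_used_memory(std::size_t segments_in_use, std::size_t segment_size,
                                   std::size_t large_object_bytes) {
        return segments_in_use * segment_size + large_object_bytes;
    }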
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
While objects above max_manage_object_size aren't stored in the
LSA segments, they are still considered to belong to the LSA
region and are evictable using that region's evictor.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
remove() is the function used to remove every reference to a cf from
the compaction manager. This function works by removing the cf from the
queue and waiting for a possible ongoing compaction on the cf.
However, a cf may be re-queued by a compaction manager task if there
is a pending compaction when a compaction finishes.
If the cf is still referenced by the time remove() returns, we could end
up with a use-after-free. To fix that, a task shouldn't re-queue a
cf if it was asked to stop. The stat pending_tasks was also not
being updated when a cf was removed from the task queue.
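For illustration, a hedged sketch of the re-queue check (names hypothetical, not the actual compaction_manager code):

    #include <deque>

    struct column_family;

    struct compaction_manager_sketch {
        std::deque<column_family*> pending;
        long pending_tasks = 0;
        bool stopping = false;

        // Called by a task when a compaction on cf finishes: only re-queue if we
        // were not asked to stop, so remove(cf) can never race with a re-queue.
        void maybe_requeue(column_family* cf, bool more_work) {
            if (!stopping && more_work) {
                pending.push_back(cf);
                ++pending_tasks;
            }
        }
    };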
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Add utils::fb_utilities::set_broadcast_address().
Set it to either the broadcast_address or listen_address configuration value,
if an appropriate value is set. If neither of the two values above
is set, abort the application.
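The selection rule, as a sketch (illustrative, not the actual init code):

    #include <cstdlib>
    #include <string>

    std::string pick_broadcast_address(const std::string& broadcast_address,
                                       const std::string& listen_address) {
        if (!broadcast_address.empty()) {
            return broadcast_address;        // explicit broadcast_address wins
        }
        if (!listen_address.empty()) {
            return listen_address;           // otherwise fall back to listen_address
        }
        std::abort();                        // neither configured: abort the application
    }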
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
New in v2:
- Simplify the utils::fb_utilities::get_broadcast() logic.
From Pawel:
This series enables the row cache to serve range queries. In order to achieve
that, the row cache needs to know whether there are some other partitions in
the specified range that are not cached and need to be read from the sstables.
That information is provided by key_readers, which work very similarly to
mutation_readers, but return only the decorated keys of partitions in
range. In the case of sstables, the key_reader is implemented using the partition
index.
An approach like this has the disadvantage of needing to access the disk
even if all partitions in the range are cached. There are (at least) two
ways of dealing with that problem:
- cache the partition index - that will also help in all other places where it
is needed
- add a flag to cache_entry which, when set, indicates that the immediate
successor of the partition is also in the cache. Such a flag would be set
by the mutation reader and cleared during eviction. It would also allow
newly created mutations from the memtable to be moved to the cache, provided
that both their successors and predecessors are already there.
The key_reader part of this patchset adds a lot of new code that probably
won't be used in any other place, but the alternative would be to always
interleave reads from the cache with reads from the sstables, and that would
be heavier on the partition index, which isn't cached.
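For illustration, the rough shape of such a reader (hypothetical and synchronous here; the real key_reader is future-based and built on the sstable partition index):

    // Stand-in for dht::decorated_key.
    struct key_stub {};

    struct key_reader_sketch {
        virtual ~key_reader_sketch() = default;
        // Fills 'k' with the next partition key in the queried range and returns
        // true, or returns false when the range is exhausted. No mutation data
        // is materialized, which is the whole point compared to mutation_reader.
        virtual bool next(key_stub& k) = 0;
    };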
Fixes #185.
The migrator tells lsa how to move an object when it is compacted.
Currently it is a function pointer, which means we must know how to move
the object at compile time. Making it an object allows us to build the
migration function at runtime, making it suitable for runtime-defined types
(such as tuples and user-defined types).
In the future, we may also store the size there for fixed-size types,
reducing lsa overhead.
C++ variable templates would have made this patch smaller, but unfortunately
they are only supported on gcc 5+.
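Roughly, the change is from a plain function pointer to a polymorphic object along these lines (simplified sketch, not the exact interface):

    #include <cstddef>

    struct migrator_sketch {
        virtual ~migrator_sketch() = default;
        // Move a live object of the given size from src to dst during compaction.
        // Being an object, the move logic can be assembled at runtime, e.g. for
        // tuples and user-defined types.
        virtual void migrate(void* src, void* dst, std::size_t size) const = 0;
    };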
"Fixes: #469
We occasionally generate memtables that are not empty, yet have no
high replay_position set. (Typical case is CL replay, but apparently
there are others).
Moreover, we can do this repeatedly, and thus get caught in the flush
queue ordering restrictions.
Solve this by treating a flush without replay_position as a flush at the
highest running position, i.e. "last" in queue. Note that this will not
affect the actual flush operation, nor CL callbacks, only anyone waiting
for the operation(s) to complete.
To do this, the flush_queue had its restrictions eased, and some introspection
methods added."
This is certainly the right thing to do and seems to fix #403. However,
I didn't manage to convince myself that this would cause problems for
binomial_heap, given that binomial_heap::erase() calls siftup()
anyway:
    void erase(handle_type handle)
    {
        node_pointer n = handle.node_;
        siftup(n, force_inf());
        top_element = n;
        pop();
    }

    void increase (handle_type handle)
    {
        node_pointer n = handle.node_;
        siftup(n, *this);
        update_top_element();
        sanity_check();
    }
As long as we guarantee that the execution order for the post-ops is
upheld, we can allow insertion of multiple ops on the same key.
This is implemented by adding a ref count to each position.
The restriction then becomes that an added key must either be larger
than any already existing key, _OR_ already exist. In the latter case,
we still know that we have not finished this position and signaled
"upwards".
The previous version did looping on post-execution and signaling of waiters.
This could "race" with an op just finishing if task reordering happened.
This version simplifies the code significantly (and raises the question of why
it was not written like this in the first place... Shame on me) by simply
building a promise-dependency chain between _previous_ queue items and the next
instead.
Also, the code now handles propagation of the return value from the "Func" pre-op
to the "Post" op, with exceptions automatically handled.
Small utility to order operation -> post-operation,
so that the "post" step is guaranteed to only be run
when all "post"-ops for lower valued keys (T) have been completed.
This is a generalized utility, mainly so that it can be tested.
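For illustration, a toy synchronous analogue of the guarantee (the real utility is asynchronous and built on futures):

    #include <cstddef>
    #include <functional>
    #include <map>
    #include <vector>

    class ordered_post_ops {
        struct slot {
            std::size_t pending = 0;                      // ops started but not yet finished
            std::vector<std::function<void()>> posts;     // post-ops waiting to run
        };
        std::map<int, slot> _slots;                       // key type T is int here

    public:
        // A newly begun key must be >= the largest existing key, or already present.
        void begin(int key) { ++_slots[key].pending; }

        void end(int key, std::function<void()> post_op) {
            auto& s = _slots.at(key);
            s.posts.push_back(std::move(post_op));
            --s.pending;
            // Run post-ops strictly in key order: drain from the lowest key upwards
            // while every op registered for that key has finished.
            while (!_slots.empty() && _slots.begin()->second.pending == 0) {
                auto lowest = _slots.begin();
                for (auto& p : lowest->second.posts) {
                    p();
                }
                _slots.erase(lowest);
            }
        }
    };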