scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-30 11:36:54 +00:00

Author	SHA1	Message	Date
Avi Kivity	7161244130	Merge seastar upstream * seastar 70aecca...ac02df7 (5): > Merge "Prefix preprocessor definitions" from Jesse > cmake: Do not enable warnings transitively > posix: prevent unused variable warning > build: Adjust DPDK options to fix compilation > io_scheduler: adjust property names DEBUG, DEFAULT_ALLOCATOR, and HAVE_LZ4_COMPRESS_DEFAULT macro references prefixed with SEASTAR_. Some may need to become Scylla macros.	2018-04-29 11:03:21 +03:00
Vladimir Krivopalov	f6f99919da	Factor out min_tracker and max_tracker as common helpers. They will be re-used for collecting encoding statistics which is needed to write SSTables 3.0. Part of #1969. Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>	2018-04-25 14:58:47 -07:00
Piotr Jastrzebski	fdad8eba97	buffer_input_stream: make it possible to specify chunk size This will allow to force input stream to return its data in chunks of a specified size. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2018-04-16 21:11:13 +02:00
Piotr Jastrzebski	cc6e619aa9	Introduce make_limiting_data_source This method takes a data_source and returns another data_source that returns data from the input source but in chunks of limited size. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2018-04-16 20:56:30 +02:00
Avi Kivity	fc488adc72	logalloc: remove segment_descriptor::_lsa_managed _lsa_managed is always 1:1 with _region, so we can remove it, saving some space in the segment descriptor vector. Tests: unit (release), logalloc_test (debug) Message-Id: <20180410122606.10671-1-avi@scylladb.com>	2018-04-10 13:54:38 +01:00
Glauber Costa	b2f9958071	large_bitset: use a chunked_vector internally and simplify API save and load functions for the large_bitset were introduced by Avi with `d590e327c0`. In that commit, Avi says: "... providing iterator-based load() and save() methods. The methods support partial load/save so that access to very large bitmaps can be split over multiple tasks." The only user of this interface is SSTables. And turns out we don't really split the access like that. What we do instead is to create a chunked vector and then pass its begin() method with position = 0 and let it write everything. The problem here is that this requires the chunked vector to be fully initialized, not just reserved. If the bitmap is large enough that in itself can take a long time without yielding (up to 16ms seen in my setup). We can simplify things considerably by moving the large_bitset to use a chunked vector internally: it already uses a poor man's version of it by allocating chunks internally (it predates the chunked_vector). By doing that, we can turn save() into a simple copy operation, and do away with load altogether by adding a new constructor that will just copy an existing chunked_vector. Fixes #3341 Tests: unit (release) Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <20180409234726.28219-1-glauber@scylladb.com>	2018-04-10 10:25:06 +03:00
Avi Kivity	2c670f6161	logalloc: limit std segment allocations in debug mode Address Sanitizer has a global limit on the number of allocations (note: not number of allocations less number of frees, but cumulative number of allocations). Running some tests in debug mode on a machine with sufficient memory can break that limit. Work around that limit by restricting the amount of memory the debug mode segment_pool can allocate. It's also nicer for running the test on a workstation.	2018-04-07 21:04:10 +03:00
Avi Kivity	2baa16b371	logalloc: introduce prime_segment_pool() To segregate std and lsa allocations, we prime the segment pool during initialization so that lsa will release lower-addressed memory to std, rather than lsa and std competing for memory at random addresses. However, tests often evict all of lsa memory for their own purposes, which defeats this priming. Extract the functionality into a new prime_segment_pool() function for use in tests that rely on allocation segregation.	2018-04-07 14:52:58 +03:00
Avi Kivity	ff6325ee7e	logalloc: limit non-contiguous reclaims We may fail to reclaim because a region has reclaim disabled (usually because it is in an allocating_section). Failed reclaims can cause high CPU usage if all of the lower addresses happen to be in a reclaim-disabled region (this is somewhat mitigated by the fact that checking for reclaim disabled is very cheap), but worse, failing a segment reclaim can lead to reclaimed memory being fragmented. This results in the original allocation continuing to fail. To combat that, we limit the number of failed reclaims. If we reach the limit, we fail the reclaim. The surrounding allocating_section will release the reclaim_lock, and increase reserves, which will result in reclaim being retried with all regions being reclaimable, and succeed in allocating contiguous memory.	2018-04-07 14:52:58 +03:00
Avi Kivity	c6c659ce7a	logalloc: pre-allocate all memory as lsa on startup Since lsa tries to keep some non-lsa memory as reserve, we end up with three blocks of memory: at low addresses, non-lsa memory that was allocated during startup or subsequently freed by lsa; at middle addresses, lsa; and at the top addresses, memory that lsa left alone during initial cache population due to the reserve. After time passes, both std and lsa will allocate from the top section, causing a mix of lsa and non-lsa memory. Since lsa tries to free from lower addresses, this mix will stay there forever, increasing fragmentation. Fix that by disabling the reserve during startup and allocating all of memory for lsa. Any further allocation will then have to be satisfied by lsa first freeing memory from the low addresses, so we will now have just two sections of memory: low addresses for std, and top addresses for lsa. Note that this startup allocation does not page in lsa segments, since the segment constructor does not touch memory.	2018-04-07 14:52:58 +03:00
Avi Kivity	ff52767ec9	dynamic_bitset: optimize for large sets Add 1:64 summary bitmaps so that searching for set bits is O(log n) instead of O(n).	2018-04-07 14:52:58 +03:00
Avi Kivity	14510ae986	dynamic_bitset: get rid of resize() Makes it easier to modify later on. Maybe "dynamic" is not so justified now.	2018-04-07 14:52:58 +03:00
Avi Kivity	f219ae1275	dynamic_bitset: remove find__clear() variants They are no longer used, and cannot be efficiently implemented for large bitsets using a summary vector approach without slowing down the find__set() variants, which are used. Also remove find_previous_set() for the same reason.	2018-04-07 14:52:58 +03:00
Avi Kivity	54db0f3d30	logalloc: reduce segment size to 128k Reducing the segment size reduces the time needed to compact segments, and increases the number of segments that can be compacted (and so the probability of finding low-occupancy segments). 128k is the size of I/O buffers and of thread stacks, so we can't go lower than that without more significant changes.	2018-04-07 14:52:58 +03:00
Avi Kivity	3f17dbfcbc	logalloc: get rid of the emergency reserve stack Instead of keeping specific segments in the emergency reserve, just keep the number of segments in the reserve. This simplifies the code considerably.	2018-04-07 14:52:55 +03:00
Avi Kivity	fa73d844e9	logalloc: replace zones with segment-at-a-time alloc/free This patch replaces the zones mechanism with something simpler: a single segment is moved from the standard allocator to lsa and vice versa, at a time. Fragmentation resistance is (hopefully) achieved by having lsa prefer high addresses for lsa data, and return segments at low address to the standard allocator. Over time, the two will move apart. Moving just one segment at a time reduces the latency costs of transferring memory between free and std.	2018-04-07 13:48:40 +03:00
Avi Kivity	7ab52947dc	conf: define named_value<log_level> externally While building with -O1, I saw that the linker could not find the vtable for named_value<log_level>. Rather than fixing up the includes (and likely lengthening build time), fix by defining the class as an extern template, preventing it from being instantiated at the call site. Message-Id: <20180401150235.13451-1-avi@scylladb.com>	2018-04-02 19:23:06 +01:00
Avi Kivity	c9aa9f0d86	Revert "logalloc: capture current scheduling group for deferring function" This reverts commit `3b53f922a3`. It's broken in two ways: 1. concrete_allocating_function::allocate()'s caller, region_group::start_releaser() loop, will delete the object as soon as it returns; however we scheduled some work depending on `this` in a separate continuation (via with_scheduling_group()) 2. the calling loop's termination condition depends on the work being done immediately, not later.	2018-03-29 16:08:12 +03:00
Avi Kivity	16a7650873	Merge "More extensions: commitlog + system tables" from Calle " Additional extension points. * Allows wrapping commitlog file io (including hinted handoff). * Allows system schema modification on boot, allowing extensions to inject extensions into hardcoded schemas. Note: to make commitlog file extensions work, we need to both enforce we can be notified on segment delete, and thus need to fix the old issue of hard ::unlink call in segment destructor. Segment delete is therefore moved to a batch routine, run at intervals/flush. Replay segments and hints are also deleted via the commitlog object, ensuring an extension is notified (metadata). Configurable listeners are now allowed to inject configuration object into the main config. I.e. a local object can, either by becoming a "configurable" or manually, add references to self-describing values that will be parsed from the scylla.yaml file, effectively extending it. All these wonderful abstractions courtesy of encryption of course. But super generalized! " * 'calle/commitlog_ext' of github.com:scylladb/seastar-dev: db::extensions: Allow extensions to modify (system) schemas db::commitlog: Add commitlog/hints file io extension db::commitlog: Do segment delete async + force replay delete go via CL main/init: Change configurable callbacks and calls to allow adding opts util::config_file: Add "add" config item overload	2018-03-26 16:18:22 +03:00
Glauber Costa	3b53f922a3	logalloc: capture current scheduling group for deferring function When we call run_when_memory_available, it is entirely possible that the caller is doing that inside a scheduling_group. If we don't defer we will execute correctly. But if we do defer, the current code will execute - in the future - with the default scheduling group. This patch fixes that by capturing the caller scheduling group and making sure the function is executed later using it. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-03-20 16:58:35 -04:00
Calle Wilund	fc97e39782	util::config_file: Add "add" config item overload	2018-03-19 12:24:04 +00:00
Avi Kivity	9eb7c0c65b	Merge "Remove (some) reactor stalls in the SSTable code" from Glauber " This is an improvement on my latest series. Instead of just dealing with the problem of destroying the Summary that I have identified in a previous test, I have tried to find other sources of stalls. Some of them are on readers and would affect early processes and operations like nodetool refresh. Others are on writers, which can affect any SSTable being written. Two of those stalls (on large filter, on summary read), I saw in a synthetic benchmark where I used very small values + nodetool compact to generate one SSTable with many keys. They were 80ms and 20ms respectively, and now they are totally gone. For others, I just tried to be safe (for instance, if we know reading/writing large vectors can be costly, just always insert preemption points in them). With all of these patches applied, I no longer see stalls coming from the SSTable code in those tests (although given enough time, I am sure I can find more). Tests: unit (release) Fixes: #3282, Fixes #3281, Fixes #3269 " * 'sstables-stalls-v3-updated' of github.com:glommer/scylla: large_bitset/bloom filter: add preemption points in loops sstables: read filter in a thread abstract summary entry version of the token with a token view add a token_view sstables: rework summary entries reading sstables: avoid calls to resize for vectors sstables: replace potentially large for loop with do_until summary_entry: do not store key bytes in each summary entry tests: change tests to make summary non-copyable chunked_vector: do not iterate to destruct trivially destructible types	2018-03-16 09:43:36 +01:00
Glauber Costa	7fd31088f2	large_bitset/bloom filter: add preemption points in loops SSTables that contain many keys - a common case with small partitions in long lived nodes - can generate filters that are quite large. I have seen stalls over 80ms when reading a filter that was the result of a 6h write load of very small keys after nodetool compact (filter was in the 100s of MB) Similar care should be taken when creating the filter, as if the estimated number of partitions is big, the resulting large_bitset can be quite big as well. If we treat the i_filter.hh and large_bitset.hh interfaces as truly generic, then maybe we should have an in_thread version along with a common version. But the bloom filter is the only user for both and even if that changes in the future, it is still a good idea to run something with a massive loop in a thread. So for simplicity, I am just asserting that we are on a thread to avoid surprises, and inserting preemption points in the loops. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-03-15 12:24:15 -04:00
Vladimir Krivopalov	5c3b32a9bf	Remove to_boost_visitor helper. The minimal Boost version required for Scylla now is 1.58 and this helper is no longer needed. Replaced it with more generic visitation utils from Seastar. Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com> Message-Id: <e589ace7ac411d3d55dead475a8a2271f51642f1.1520976010.git.vladimir@scylladb.com>	2018-03-14 23:49:07 +00:00
Glauber Costa	00d04b49a0	chunked_vector: do not iterate to destruct trivially destructible types Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-03-14 09:16:54 -04:00
Asias He	8624467e26	utils: Remove utils/utils.cc It is used to make sure the header compiles in the early days. Message-Id: <531fc6570805bd163afedd53f5d71e1b79a477d1.1520840644.git.asias@scylladb.com>	2018-03-12 09:47:40 +02:00
Tomasz Grabiec	654d4b76c0	anchorless_list: Introduce all_elements_reversed()	2018-03-06 11:50:26 +01:00
Avi Kivity	d973445a94	Merge "sstable/schema extensions" from Calle " Adds extension points to schema/sstables to enable hooking in stuff, like, say, something that modifies how sstable disk io works. (Cough, cough, encryption) Extensions are processed as property keywords in CQL. To add an extension, a "module" must register it into the extensions object on boot time. To avoid globals (and yet don't), extensions are reachable from config (and thus from db). Table/view tables already contain an extension element, so we utilize this to persist config. schema_tables tables/views from mutations now require a "context" object (currently only extensions, but abstracted for easier further changes). Because of how schemas currently operate, there is a super lame workaround to allow "schema_registry" access to config and by extension extensions. DB, upon instantiation, calls a thread local global "init" in schema_registry and registers the config. It, in turn, can then call table_from_mutations as required. Includes the (modified) patch to encapsulate compression into objects, mainly because it is nice to encapsulate, and isolate a little. 
" * 'calle/extensions-v5' of github.com:scylladb/seastar-dev: extensions: Small unit test sstables: Process extensions on file open sstables::types: Add optional extensions attribute to scylla metadata sstables::disk_types: Add hash and comparator(sstring) to disk_string schema_tables: Load/save extensions table cql: Add schema extensions processing to properties schema_tables: Require context object in schema load path schema_tables: Add opaque context object config_file_impl: Remove ostream operators main/init: Formalize configurables + add extensions to init call db::config: Add extensions as a config sub-object db::extensions: Configuration object to store various extensions cql3::statements::property_definitions: Use std::variant instead of any sstables: Add extension type for wrapping file io schema: Add opaque type to represent extensions sstables::compress/compress: Make compression a virtual object	2018-02-26 17:15:29 +02:00
Paweł Dziepak	5dfa36c526	lsa: add basic sanitizer LSA being an allocator built on top of the standard may hide some erroneous usage from AddressSanitizer. Moreover, it has its own classes of bugs that could be caused by incorrect user behaviour (e.g. migrator returning wrong object size). This patch adds basic sanitizer for the LSA that is active in the debug mode and verifies if the allocator is used correctly and if a problem is found prints information about the affected object that it has collected earlier. That includes the address and size of an object as well as backtrace of the allocation site. At the moment the following errors are being checked for: * leaks, objects not freed at region destructor * attempts to free objects at invalid address * mismatch between object size at allocation and free * mismatch between object size at allocation and as reported by the migrator * internal LSA error: attempt to allocate object at already used address * internal LSA error: attempt to merge regions containing allocated objects at conflicting addresses Message-Id: <20180226122314.32049-1-pdziepak@scylladb.com>	2018-02-26 14:35:13 +02:00
Vladimir Krivopalov	721bd3eef6	Added missing 'override' to skip() in buffer_input_stream and prepended_input_stream. Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com> Message-Id: <4e91bead8de7f6fa9b3bfdab8bda73efdb22749d.1519152303.git.vladimir@scylladb.com>	2018-02-20 19:49:11 +00:00
Duarte Nunes	9ce0be60d4	utils/flush_queue: Remove unused function Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20180216234502.23931-1-duarte@scylladb.com>	2018-02-19 13:09:11 +00:00
Tomasz Grabiec	7e0ff8a920	lsa: Disable allocation failure injection inside merge() Fixes termination in tests due to throw from merge(), which is noexcept.	2018-02-14 16:42:49 +01:00
Tomasz Grabiec	66701c1671	lsa: Make region deregistration robust against duplicates	2018-02-14 16:42:49 +01:00
Tomasz Grabiec	cf876bbe2d	lsa: Make region allocation exception safe We were not unregistering in case add() fails.	2018-02-14 16:42:49 +01:00
Calle Wilund	2ee68ce0d4	config_file_impl: Remove ostream operators We don't generate default strings for command line, so these are not needed as such, and conflict with other operators in to_string.hh	2018-02-07 10:11:46 +00:00
Paweł Dziepak	dcd79af8ed	lsa: optimise disabling reclamation and invalidation counter Most of the lsa gory details are hidden in utils/logalloc.cc. That includes the actual implementation of a lsa region: region_impl. However, there is code in the hot path that often accesses the _reclaiming_enabled member as well as its base class allocation_strategy. In order to optimise those accesses another class is introduced: basic_region_impl that inherits from allocation_strategy and is a base of region_impl. It is defined in utils/logalloc.hh so that it is publicly visible and its member functions are inlineable from anywhere in the code. This class is supposed to be as small as possible, but contain all members and functions that are accessed from the fast path and should be inlined.	2018-01-30 18:33:26 +01:00
Paweł Dziepak	d825ae37bf	lsa: split alloc section into reserving and reclamation-disabled parts An allocating section reserves a certain amount of memory, then disables reclamation and attempts to perform the given operation. If that fails due to std::bad_alloc the reserve is increased and the operation is retried. Reserving memory is expensive while just disabling reclamation isn't. Moreover, the code that runs inside the section needs to be safely retryable. This means that we want the amount of logic running with reclamation disabled to be as small as possible, even if it means entering and leaving the section multiple times. In order to reduce the performance penalty of such a solution, the memory reserving and reclamation disabling parts of the allocating sections are separated.	2018-01-30 18:33:26 +01:00
Paweł Dziepak	eb2e88e925	linearization_context: remove non-trivial operations from fast path Since linearization_context is thread_local every time it is accessed the compiler needs to emit code that checks if it was already constructed and does so if it wasn't. Moreover, upon leaving the context from the outermost scope the map needs to be cleared. All these operations impose some performance overhead and aren't really necessary if no buffers were linearised (the expected case). This patch rearranges the code so that linearization_context is trivially constructible and the map is cleared only if it was modified.	2018-01-30 18:33:25 +01:00
Vladimir Krivopalov	9fdf4b24b5	Add helper input streams: buffer_input_stream and prepended_input_stream. buffer_input_stream is a simple input_stream wrapping a single temporary_buffer. prepended_input_stream suits for the case when some data has been read into a buffer and the rest is still in a stream. It accepts a buffer and a data_source and first reads from the buffer and then, when it ends, proceeds reading from the data_source. Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>	2018-01-29 11:57:04 -08:00
Botond Dénes	12b1520415	exponential_backoff_retry::do_until_value(): restore indentation Deferred from previous patch. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <a10053f6c0ed8a24a74e51f1df4e9a5acf59922d.1517222195.git.bdenes@scylladb.com>	2018-01-29 10:50:01 +00:00
Botond Dénes	e0c082616a	exponential_backoff_retry::do_until_value(): fix use-after-move The exponential_backoff_retry instance is captured by move and is then indirectly moved again as repeat_until_value() moves the lambda it's passed into its internal state. This caused problems as internal lambdas store references to the instance and these references go stale after the move. To fix this keep hold of the exponential_backoff_retry instance in an enclosing do_with() to make it safe for internal lambdas to reference it. Indentation will be fixed by the next patch. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <adc49d25a6176756d60e092f3713c0c897732382.1517222195.git.bdenes@scylladb.com>	2018-01-29 10:50:01 +00:00
Duarte Nunes	bfe5a8e96f	utils/managed_vector: Return reference to emplaced element We are in 2018, after all. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20180126105417.54285-1-duarte@scylladb.com>	2018-01-26 13:49:56 +01:00
Tomasz Grabiec	1292315579	anchorless_list: Introduce last()	2018-01-18 11:32:49 +01:00
Tomasz Grabiec	5c85e9c2db	lsa: Expose max_zone_segments for tests	2018-01-16 13:17:20 +01:00
Tomasz Grabiec	99708cc498	lsa: Expose tracker::non_lsa_used_space() So that it can be used in unit tests.	2018-01-16 13:17:20 +01:00
Tomasz Grabiec	e5f8176c32	lsa: Fix memory leak on zone reclaim _free_segments_in_zones is not adjusted by segment_pool::reclaim_segments() for empty zones on reclaim under some conditions. For instance when some zone becomes empty due to regular free() and then reclaiming is called from the std allocator, and it is satisfied from a zone after the one which is empty. This would result in free memory in such zone to appear as being leaked due to corrupted free segment count, which may cause a later reclaim to fail. This could result in bad_allocs. The fix is to always collect such zones. Fixes #3129 Refs #3119 Refs #3120	2018-01-16 13:17:11 +01:00
Glauber Costa	80c4a211d8	consolidate timeout_clock At the moment, various different subsystems use their different ideas of what a timeout_clock is. This makes it a bit harder to pass timeouts between them because although most are actually a lowres_clock, that is not guaranteed to be the case. As a matter of fact, the timeout for restricted reads is expressed as nanoseconds, which is not a valid duration in the lowres_clock. As a first step towards fixing this, we'll consolidate all of the existing timeout_clocks in one, now called db::timeout_clock. Other things that tend to be expressed in terms of that clock--like the fact that the maximum time_point means no timeout and a semaphore that wait()s with that resolution are also moved to the common header. In the upcoming patch we will fix the restricted reader timeouts to be expressed in terms of the new timeout_clock. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-01-11 12:07:41 -05:00
Duarte Nunes	40ad65666f	utils/exponential_backoff_retry: Add helper to automate retries This patch adds the do_until_value static member function to exponential_backoff_retry, which retries the specified function until it returns an engaged optional. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-12-28 13:00:28 +00:00
Duarte Nunes	9a602c7796	utils/exponential_backoff_retry: Add abort_source-based retry Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-12-28 13:00:28 +00:00
Duarte Nunes	1374f898b9	Merge seastar upstream Class optimized_optional was moved into seastar, and its usage simplified so move_and_disengage() is replaced in favour of std::exchange(_, { }). * seastar adaca37...b0f5591 (9): > Merge "core: Introduce cancellation mechanism" from Duarte > Fix Seastar build that no longer builds with --enable-dpdk after the recent commit fd87ea2 > noncopyable_function: support function objects whose move constructors throw > Adding new hardware options to new config format, using new config format for dpdk device > Fix check for Boost version during pre-build configuration. > variant_utils: add variant_visitor constructor for C++17 mode > Merge "Allows json object to be stream to an" from Amnon > Merge 'Default to C++17' from Avi > Add const version of subscript operator to circular_buffer Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20171228112126.18142-1-duarte@scylladb.com>	2017-12-28 13:24:18 +02:00
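
The dynamic_bitset commit ff52767ec9 in the log above describes adding 1:64 summary bitmaps so that searching for set bits no longer scans every word. A minimal sketch of that idea follows, assuming only two levels; the class name `summarized_bitset` and all of its members are hypothetical and are not taken from Scylla's `utils/dynamic_bitset`. With a single summary level the search still scans the summary words linearly; reaching the O(log n) bound the commit mentions would presumably require stacking further summary levels in the same way.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical two-level bitset, for illustration only.
// Level 0 holds the actual bits; level 1 holds one summary bit per
// level-0 word, set if and only if that word is non-zero (a 1:64 ratio).
class summarized_bitset {
    std::vector<std::uint64_t> _bits;     // level 0: one bit per element
    std::vector<std::uint64_t> _summary;  // level 1: one bit per _bits word
    std::size_t _size;

    // Index of the lowest set bit; the caller guarantees w != 0.
    static unsigned lowest_set_bit(std::uint64_t w) {
        unsigned n = 0;
        while ((w & 1) == 0) {
            w >>= 1;
            ++n;
        }
        return n;
    }

public:
    static constexpr std::size_t npos = static_cast<std::size_t>(-1);

    explicit summarized_bitset(std::size_t nbits)
        : _bits((nbits + 63) / 64, 0)
        , _summary((_bits.size() + 63) / 64, 0)
        , _size(nbits) {}

    void set(std::size_t i) {
        assert(i < _size);
        const std::size_t word = i / 64;
        _bits[word] |= std::uint64_t(1) << (i % 64);
        // Mark the containing word as non-empty in the summary.
        _summary[word / 64] |= std::uint64_t(1) << (word % 64);
    }

    void clear(std::size_t i) {
        assert(i < _size);
        const std::size_t word = i / 64;
        _bits[word] &= ~(std::uint64_t(1) << (i % 64));
        // Drop the summary bit only once the whole word is empty.
        if (_bits[word] == 0) {
            _summary[word / 64] &= ~(std::uint64_t(1) << (word % 64));
        }
    }

    bool test(std::size_t i) const {
        assert(i < _size);
        return (_bits[i / 64] >> (i % 64)) & 1;
    }

    // Scans summary words only, so each step skips 64 * 64 = 4096 bits.
    std::size_t find_first_set() const {
        for (std::size_t s = 0; s < _summary.size(); ++s) {
            if (_summary[s] != 0) {
                const std::size_t word = s * 64 + lowest_set_bit(_summary[s]);
                return word * 64 + lowest_set_bit(_bits[word]);
            }
        }
        return npos;
    }
};
```

In this sketch, `find_first_set()` on a 10000-bit set inspects at most three summary words before descending to a single level-0 word. The cost is that `set()` and `clear()` each touch one extra word to keep the summary consistent.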


496 Commits