scylladb

Author	SHA1	Message	Date
Avi Kivity	29a5047982	utils: error_injection: convert enable_if to concepts Constrain inject() with a requires clause rather than enable_if, simplifying the code and compiler diagnostics. Note that the second instance could not have been called, since the template argument does not appear in the function parameter list and thus could not be deduced. This is corrected here. Closes #8322	2021-03-21 09:28:23 +02:00
Piotr Sarna	2509b7dbde	Merge 'dht: convert ring_position and decorated_key to std::strong_ordering' from Avi Kivity As #1449 notes, trichotomic comparators returning int are dangerous as they can be mistaken for less comparators. This series converts dht::ring_position and dht::decorated_key, as well as a few closely related downstream types, to return std::strong_ordering. Closes #8225 * github.com:scylladb/scylla: dht: ring_position, decorated_key: convert tri_comparators to std::strong_ordering pager: rephrase misleading comparison check test: total_order_checks: prepare for std::strong_ordering test: mutation_test: prepare merge_container for std::strong_ordering intrusive_array: prepare for std::strong_ordering utils: collection-concepts: prepare for std::strong_ordering	2021-03-18 11:51:54 +01:00
Avi Kivity	fe0f983dfb	intrusive_array: prepare for std::strong_ordering Newer comparators can return std::strong_ordering, so don't expect an int.	2021-03-18 12:40:05 +02:00
Avi Kivity	9fbe4850c9	utils: collection-concepts: prepare for std::strong_ordering collection-concepts includes a Comparable concept for a trichotomic comparator function, used in intrusive btree and double_decker. Prepare for std::strong_ordering by also allowing std::strong_ordering as a return type. Once we've cleaned the code base, we can tighten it to only allow std::strong_ordering.	2021-03-18 12:40:03 +02:00
Michał Chojnowski	5c3385730b	treewide: get rid of unaligned_cast unaligned_cast violates strict aliasing rules. Replace it with safe equivalents.	2021-03-17 17:00:41 +01:00
Michał Chojnowski	4e35befcf2	treewide: get rid of incorrect reinterpret casts In some places we use the `reinterpret_cast<const net::packed<T>>(&x)` pattern to reinterpret memory. This is a violation of C++'s aliasing rules, which invokes undefined behaviour. The blessed way to correctly reinterpret memory is to copy it into a new object. Let's do that. Note: the reinterpret_cast way has no performance advantage. Compilers recognize the memory copy pattern and optimize it away.	2021-03-17 17:00:38 +01:00
Avi Kivity	290897ddbc	logalloc: background reclaim: use default scheduling group for adjusting shares If the shares are currently low, we might not get enough CPU time to adjust the shares in time. This is currently no-op, since Seastar runs the callback outside scheduling groups (and only uses the scheduling group for inherited continuations); but better be insulated against such details.	2021-03-15 13:54:49 +02:00
Avi Kivity	a87f6498c3	logalloc: background reclaim: log shares adjustment under trace level Useful when debugging, but too noisy at any other time.	2021-03-15 13:54:49 +02:00
Avi Kivity	ce1b1d6ec4	logalloc: background reclaim: fix shares not updated by periodic timer adjust_shares() thinks it needs to do nothing if the main loop is running, but in reality it can only avoid waking the main loop; it still needs to adjust the shares unconditionally. Otherwise, the background reclaim shares can get locked into a low value. Fix by splitting the conditional into two.	2021-03-15 13:54:37 +02:00
Nadav Har'El	f41dac2a3a	alternator: avoid large contiguous allocation for request body Alternator request sizes can be up to 16 MB, but the current implementation had the Seastar HTTP server read the entire request as a contiguous string, and then processed it. We can't avoid reading the entire request up-front - we want to verify its integrity before doing any additional processing on it. But there is no reason why the entire request needs to be stored in one big contiguous allocation. This is always a bad idea. We should use a non-contiguous buffer, and that's the goal of this patch. We use a new Seastar HTTPD feature where we can ask for an input stream, instead of a string, for the request's body. We then begin the request handling by reading the content of this stream into a vector<temporary_buffer<char>> (which we alias "chunked_content"). We then use this non-contiguous buffer to verify the request's signature and if successful - parse the request JSON and finally execute it. Beyond avoiding contiguous allocations, another benefit of this patch is that while parsing a long request composed of chunks, we free each chunk as soon as its parsing has completed. This reduces the peak amount of memory used by the query - we no longer need to store both unparsed and parsed versions of the request at the same time. Although we already had tests with requests of different lengths, most of them were short enough to only have one chunk, and only a few had 2 or 3 chunks. So we also add a test which makes a much longer request (a BatchWriteItem with large items), which in my experiment had 17 chunks. The goal of this test is to verify that the new signature and JSON parsing code which needs to cross chunk boundaries works as expected. Fixes #7213. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210309222525.1628234-1-nyh@scylladb.com>	2021-03-10 09:22:34 +01:00
Avi Kivity	9038a81317	treewide: drop SEASTAR_CONCEPT Since Scylla requires C++20, there is no need to protect concept definitions or usages with SEASTAR_CONCEPT; it just clutters the code. This patch therefore removes all uses. Closes #8236	2021-03-08 16:04:20 +01:00
Tomasz Grabiec	ecb6c56a2a	Merge 'lsa: background reclaim' from Avi Kivity This series adds background reclaim to lsa, with the goal that most large allocations can be satisfied from available free memory, and reclaim work can be done from a preemptible context. If the workload has free cpu, then background reclaim will utilize that free cpu, reducing latency for the main workload. Otherwise, background reclaim will compete with the main workload, but since that work needs to happen anyway, throughput will not be reduced. A unit test is added to verify it works. Fixes #1634. Closes #8044 * github.com:scylladb/scylla: test: logalloc_test: test background reclaim logalloc: reduce gap between std min_free and logalloc min_free logalloc: background reclaim logalloc: preemptible reclaim	2021-02-24 13:23:30 +01:00
Avi Kivity	78d1afeabd	Merge "Use radix tree to store cells on a row" from Pavel E " Current storage of cells in a row is a union of vector and set. The vector holds 5 cell_and_hash's inline, up to 32 ones in the external storage and then it's switched to std::set. Once switched, the whole union becomes a waste of space, as its size is sizeof(vector head) + 5 * sizeof(cell and hash) = 90+ bytes and only 3 pointers from it are used (std::set header). Also the overhead to keep cell_and_hash as a set entry is more than the size of the structure itself. Column ids are 32-bit integers that most likely come sequentially. For this kind of a search key a radix tree (with some care for non-sequential cases) can be beneficial. This set introduces a compact radix tree, that uses 7-bit sub values from the search key to index on each node and compacts the nodes themselves for better memory usage. Then the row::_storage is replaced with the new tree. The most notable result is the memory footprint decrease, for wide rows down to 2x times. The performance of micro-benchmarks is a bit lower for small rows and (!) higher for longer (8+ cells). 
The numbers are in patch #12 (spoiler: they are better than for v2) v3: - trimmed size of radix down to 7 bits - simplified the nodes layouts, now there are 2 of them (was 4) - enhanced perf_mutation to test N-cells schema - added AVX intra-nodes search for medium-sized nodes - added .clone_from() method that helped to improve perf_mutation - minor - changed functions not to return values via refs-arguments - fixed nested classes to properly use language constructors - renamed index_to to key_t to distinguish from node_index_t - improved recurring variadic templates not to use sentinel argument - use standard concepts v2: - fixed potential mis-compilation due to strict-aliasing violation - added oracle test (radix tree is compared with std::map) - added radix to perf_collection - cosmetic changes (concepts, comments, names) A note on item 1 from v2 changelog. The nodes are no longer packed perfectly, each has grown 3 bytes. But it turned out that when used as cells container most of this growth drowned in lsa alignments. next todo: - aarch64 version of 16-keys node search tests: unit(dev), unit(debug for radix), pref(dev) " 'br-radix-tree-for-cells-3' of https://github.com/xemul/scylla: test/memory_footpring: Print radix tree node sizes row: Remove old storages row: Prepare row::equal for switch row: Prepare row::difference for switch row: Introduce radix tree storage type row-equal: Re-declare the cells_equal lambda test: Add tests for radix tree utils: Compact radix tree array-search: Add helpers to search for a byte in array test/perf_collection: Add callback to check the speed of clone test/perf_mutation: Add option to run with more than 1 columns test/perf_mutation: Prepare to have several regular columns test/perf_mutation: Use builder to build schema	2021-02-18 21:19:14 +02:00
Botond Dénes	ba7a9d2ac3	imr: switch back to open-coded description of structures Commit `aab6b0ee27` introduced the controversial new IMR format, which relied on a very template-heavy infrastructure to generate serialization and deserialization code via template meta-programming. The promise was that this new format, beyond solving the problems the previous open-coded representation had (working on linearized buffers), will speed up migrating other components to this IMR format, as the IMR infrastructure reduces code bloat, makes the code more readable via declarative type descriptions as well as safer. However, the results were almost the opposite. The template meta-programming used by the IMR infrastructure proved very hard to understand. Developers don't want to read or modify it. Maintainers don't want to see it being used anywhere else. In short, nobody wants to touch it. This commit does a conceptual revert of `aab6b0ee27`. A verbatim revert is not possible because related code evolved a lot since the merge. Also, going back to the previous code would mean we regress as we'd revert the move to fragmented buffers. So this revert is only conceptual, it changes the underlying infrastructure back to the previous open-coded one, but keeps the fragmented buffers, as well as the interface of the related components (to the extent possible). Fixes: #5578	2021-02-16 23:43:07 +01:00
Michał Chojnowski	25a9569cc4	utils: managed_bytes: add a few trivial helper methods We will use them in the upcoming IMR removal patch.	2021-02-16 23:43:07 +01:00
Michał Chojnowski	3f248ca7cc	utils: fragment_range: move FragmentedView helpers to fragment_range.hh In the upcoming IMR removal patch we will need read_simple() and similar helpers for FragmentedView outside of types.hh. For now, let's move them to fragment_range.hh, where FragmentedView is defined. Since it's a widely included header, we should consider moving them to a more specialized header later.	2021-02-16 21:35:15 +01:00
Michał Chojnowski	8a06a576aa	utils: fragment_range: add single_fragmented_mutable_view We will use it later in the upcoming IMR removal patch.	2021-02-16 21:35:15 +01:00
Michał Chojnowski	7b662b9315	utils: fragment_range: implement FragmentRange for fragment_range This will allow us to pass FragmentedView instances to places where FragmentRange is expected.	2021-02-16 21:35:15 +01:00
Michał Chojnowski	f972f90193	utils: mutable_view: add front() We will use it in the upcoming patches.	2021-02-16 21:35:14 +01:00
Pavel Emelyanov	9baf1226dc	test/memory_footpring: Print radix tree node sizes After switching cells storage onto compact radix tree it becomes useful to know the tree nodes' sizes. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-02-15 20:41:09 +03:00
Pavel Emelyanov	a5bd68ae5d	utils: Compact radix tree The tree uses integral type as a search key. On each level the local index is next 7 bits from the key, respectively for 32-bit key we have 5 levels. The tree uses 2 memory packing techniques -- prefix compaction and growing node layouts. The prefix compaction is used when a node has only one child. In this case such a node is replaced in its parent with this only child and the child in question keeps "prefix:length" pair on board, that's used to check if the short-cut lookup took the correct path. The growing node layouts makes the nodes occupy as much memory as needed to keep the _present_ keys and there are 2 kinds of layouts. Direct layout is array, intra-node search is plain indexing. The layout storage grows in vector-like manner, but there's a special case for the maximum-sized layout that helps avoiding some boundary checks. Indirect layout keeps two arrays on board -- with values and with indices. The intra-node search is thus a lookup in the latter array first. This layout is used to save memory for sparse keys. Lookup is optimized with SIMD instructions. Inner nodes use direct layouts, as they occupy ~1% of memory and thus need not to be memory efficient. At the same time lookup of a key in the tree potentially walks several inner nodes, so speeding up search for them is beneficial. Leaf nodes are indirect, since they are 99% of memory and thus need to be packed well. The single indirect lookup when searching in the tree doesn't slow things down notably even on insertion stress test. Said that * inner nodes are: header + 4 / 8 / 16 / 32 / 64 / 128 pointers * leaf nodes are : header + 4 / 8 / 16 / 32 bytes + <same nr> objects or header + 16 bytes bitmap + 128 objects The header is - backreference (8 bytes) - prefix (4 bytes) - size, layout, capacity (1 byte each) The iterator is one-direction (for simplicity) but it enough for its main target -- the sparse array of cells on a row. 
Also the iterator has an .index() method that reports back the index of the entry at which it points. This greatly simplifies the tree scans by the class row further. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-02-15 19:25:10 +03:00
Pavel Emelyanov	d43ad8738c	array-search: Add helpers to search for a byte in array The radix tree code will need the code to find 8-bit value in an array of some fixed size, so here are the helpers. Those that allow for SIMD implementations are such for x86_64 TODO: Add aarch64 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-02-15 17:47:59 +03:00
Avi Kivity	cb4e1bb0b9	logalloc: reduce gap between std min_free and logalloc min_free With the larger gap, logalloc reserved more memory for std than the background reclaim threshold for running, so it was triggered rarely. With the gap reduced, background reclaim is constantly running in an allocating workload (e.g. cache misses).	2021-02-14 19:09:29 +02:00
Avi Kivity	ca0c006b37	logalloc: background reclaim Set up a coroutine in a new scheduling group to ensure there is a "cushion" of free memory. It reclaims in preemptible mode in order to reduce reactor stalls (contrast with synchronous reclaim that cannot preempt until it achieved its goal). The free memory target is arbitrarily set at 60MB. The reclaimer's shares are proportional to the distance from the free memory target; so a workload that allocates memory rapidly will have the background reclaimer working harder. I rolled my own condition variable here, mostly as an experiment. seastar::condition_variable requires several allocations, while the one here requires none. We should formalize it after we gain more experience with it.	2021-02-14 19:09:29 +02:00
Avi Kivity	35076dd2d3	logalloc: preemptible reclaim Add an option (currently unused by all callers) to preempt reclaim. If reclaim is preempted, it just stops what it is doing, even if it reclaimed nothing. This is useful for background reclaim. Currently, preemption checks are on segment granularity. This is probably too coarse, and should be refined later, but is already better than the current granularity which does not allow preemption until the entire requested memory size was reclaimed.	2021-02-14 19:09:29 +02:00
Piotr Sarna	aa39130a20	bounded_stats_queue: add missing const qualifiers Most of the methods of this utility are effectively const. Message-Id: <ed376ab74b6323cf770cc0a1314edbae0b16111e.1612953601.git.sarna@scylladb.com>	2021-02-10 13:04:35 +02:00
Pavel Emelyanov	2f7c03d84c	utils: Intrusive B-tree (with tests) The design of the tree goes from the row-cache needs, which are 1. Insert/Remove do not invalidate iterators 2. Elements are LSA-manageable 3. Low key overhead 4. External tri-comparator 5. As few actions on insert/remove as possible With the above the design is Two types of nodes -- inner and leaf. Both types keep pointer on parent nodes and N pointers on keys (not keys themselves). Two differences: inner nodes have array of pointers on kids, leaf nodes keep pointer on the tree (to update left- and rightmost tree pointers on node move). Nodes do not keep pointers/references on trees, thus we have O(1) move of any object, but O(logN) to get the tree size. Fortunately, with big keys-per-node value this won't result in too many steps. In turn, the tree has 3 pointers -- root, left- and rightmost leaves. The latter is for constant-time begin() and end(). Keys are managed by user with the help of embeddable member_hook instance, which is 1 pointer in size. The code was copied from the B+ tree one, then heavily reworked, the internal algorithms turned out to differ quite significantly. For the sake of mutation_partition::apply_monotonically(), which needs to move an element from one tree into another, there's a key_grabber helping wrapper that allows doing this move respecting the exception-safety requirement. As measured by the perf_collections test the B-tree with 8 keys is faster than the std::set, but slower than the B+tree: vs set vs b+tree fill: +13% -6% find: +23% -35% Another neat thing is that 1-key insertion-removal is ~40% faster than for BST (the same number of allocations, but the key object is smaller, fewer pointers to set-up and fewer instructions to execute when linking node with root). v4: - equip insertion methods with on_alloc_point() calls to catch potential exception guarantees violations earlier - add unlink_leftmost_without_rebalance. 
The method is borrowed from boost intrusive set, and is added to kill two birds -- provide it, as it turns out to be popular, and use a bit faster step-by-step tree destruction than plain begin+erase loop v3: - introduce "inline" root node that is embedded into tree object and in which the 1st key is inserted. This greatly improves the 1-key-tree performance, which is pretty common case for rows cache v2: - introduce "linear" root leaf that grows on demand This improves the memory consumption for small trees. This linear node may and should over-grow the NodeSize parameter. This comes from the fact that there are two big per-key memory spikes on small trees -- 1-key root leaf and the first split, when the tree becomes 1-key root with two half-filled leaves. If the linear extension goes above NodeSize it can flatten even the 2nd peak - mitigate the keys indirection a bit Prefetching the keys while doing the intra-node linear scan and the nodes while descending the tree gives ~+5% of fill and find - generalize stress tests for B and B+ trees - cosmetic changes TODO: - fix a few inefficiencies in the core code (walks the sub-tree twice sometimes) - try to optimize the leaf nodes, that are not left-/rightmost not to carry unused tree pointer on board Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-02-02 09:30:29 +03:00
Raphael S. Carvalho	298d54ceb0	utils/fragment_temporary_buffer: don't push empty fragment if data size is fragment-aligned last fragment is unconditionally pushed to set of fragments, so if data size is fragment-aligned, an empty fragment will be needlessly pushed to the back of the fragment set. note: i haven't tested if empty fragment at back of set will cause issues, i think it won't, but this should be avoided anyway. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210129231532.871405-3-raphaelsc@scylladb.com>	2021-01-30 20:54:20 +02:00
Raphael S. Carvalho	e745f1e697	utils/fragmented_temporary_buffer: avoid reallocations by reserving upfront Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210129231532.871405-2-raphaelsc@scylladb.com>	2021-01-30 20:54:20 +02:00
Raphael S. Carvalho	08e838d4b5	utils/fragmented_temporary_buffer: simplify allocate_to_fit() 1) reuse default_fragment_size for knowledge of max fragment size 2) fragments_count is not a good name as it doesn't include last non-full fragment (if present), so rename it. 3) simplify calculation of last fragment size Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210129231532.871405-1-raphaelsc@scylladb.com>	2021-01-30 20:54:20 +02:00
Pavel Solodovnikov	d14dc030ac	utils: add `fragmented_temporary_buffer::allocate_to_fit` Introduce `fragmented_temporary_buffer::allocate_to_fit` static function returning an instance of the buffer of a specified size. The allocated buffer fragments have a size of at most 128kb. `bytes_ostream` has the same hard-coded limit, so just use the same here. This patch will be later needed for `raft::log_entry` raw data serialization when writing to the underlying persistent storage. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-01-29 01:59:16 +03:00
Nadav Har'El	49440d67ad	Merge: Fix multiple issues with timeuuid type Merged patch series by Konstantin Osipov: "This series improves uniqueness of generated timeuuids and changes list append/prepend logic to use client/LWT timestamp in timeuuids generated for list keys. Timeuuid compare functions are optimized. The test coverage is extended for all of the above." uuid: add a comment warning against UUID::operator< uuid: replace slow versions of timeuuid compare with optimized/tested versions. test: add tests for legacy uuid compare & msb monotonicity test: add a test case for append/prepend limit test: add a test case for monotonicity of timeuuid least significant bits uuid: implement optimized timeuuid compare test: add a test case for list prepend/append with custom timestamp lists: rewrite list prepend to use append machinery lists: use query timestamp for list cell values during append uuid: fill in UUID node identifier part of UUID test: add a CQL test for list append/prepend operations	2021-01-21 13:20:07 +02:00
Konstantin Osipov	e18e2cb9f2	uuid: add a comment warning against UUID::operator<	2021-01-21 13:03:59 +03:00
Konstantin Osipov	56d8d166cb	test: add tests for legacy uuid compare & msb monotonicity	2021-01-21 13:03:59 +03:00
Konstantin Osipov	0af3758aff	uuid: implement optimized timeuuid compare Introduce uint64_t based comparator for serialized timeuuids. Respect Cassandra legacy for timeuuid compare order. Scylla uses two versions of timeuuid compare: - one for timeuuid values stored in uuid columns - a different one for timeuuid values stored in timeuuid columns. This commit re-implements the implementations of these comparators in types.cc and deprecates the respective implementations types.cc. They will be removed in a following patch. A micro-benchmark at https://github.com/alecco/timeuuid-bench/ shows 2-4x speed up of the new comparators.	2021-01-21 13:03:59 +03:00
Konstantin Osipov	2b8ce83eea	lists: use query timestamp for list cell values during append Scylla list cells are represented internally as a map of timeuuid => value. To append a new value to a list the coordinator generates a timeuuid reflecting the current time as key and adds a value to the map using this key. Before this patch, Scylla always generated a timeuuid for a new value, even if the query had a user supplied or LWT timestamp. This could break LWT linearizability. User supplied timestamps were ignored. This is reported as https://github.com/scylladb/scylla/issues/7611 A statement which appended multiple values to a list or a BATCH generated an own microsecond-resolution timeuuid for each value: BEGIN BATCH UPDATE ... SET a = a + [3] UPDATE ... SET a = a + [4] APPLY BATCH UPDATE ... SET a = a + [3, 4] To fix the bug, it's necessary to preserve monotonicity of timeuuids within a batch or multi-value append, but make sure they all use the microsecond time, as is set by LWT or user. To explain the fix, it's first necessary to recall the structure of time-based UUIDs: 60 bits: time since start of GMT epoch, year 1582, represented in 100-nanosecond units 4 bits: version 14 bits: clock sequence, a random number to avoid duplicates in case system clock is adjusted 2 bits: type 48 bits: MAC address (or other hardware address) The purpose of clockseq bits is as defined in https://tools.ietf.org/html/rfc4122#section-4.1.5 is to reduce the probability of UUID collision in case clock goes back in time or node id changes. The implementation should reset it whenever one of these events may occur. Since LWT microsecond time is guaranteed to be unique by Paxos, the RFC provisioning for clockseq and MAC slots becomes excessive. The fix thus changes timeuuid slot content in the following way: - time component now contains the same microsecond time for all values of a statement or a batch. The time is unique and monotonic in case of LWT. 
Otherwise it's almost always monotonic, but may not be unique if two timestamps are created on different coordinators. - clockseq component is used to store a sequence number which is unique and monotonic for all values within the statement/batch. - to protect against time back-adjustments and duplicates if time is auto-generated, MAC component contains a random (spoof) MAC address, re-created on each restart. The address is different at each shard. The change is made for all sources of time: user, generated, LWT. Conditioning the list key generation algorithm on the source of time would unnecessarily complicate the code while not increasing the quality (uniqueness) of created list keys. Since 14 bits of clockseq provide us with only 16383 distinct slots per statement or batch, 3 extra bits in nanosecond part of the time are used to extend the range to 131071 values per statement/batch. If the range is exceeded, an exception is produced. A twist on the use of clockseq to extend timeuuid uniqueness is that Scylla, like Cassandra, uses int8 compare to compare lower bits of timeuuid for ordering. The patch takes this into account and sign-complements the clockseq value to make it monotonic according to the legacy compare function. Fixes #7611 test: unit (dev)	2021-01-21 13:03:59 +03:00
Konstantin Osipov	6d1781be36	uuid: fill in UUID node identifier part of UUID Before this patch, UUID generation code was not creating sufficiently unique IDs: the 6 byte node identifier was mostly empty, i.e. only containing shard id. This could lead to collisions between queries executed concurrently at different coordinators, and, since timeuuid is used as key in list append and prepend operations, lead to lost updates. To generate a unique node id, the patch uses a combination of hardware MAC address (or a random number if no hardware address is available) and the current shard id. The shard id is mixed into higher bits of MAC, to reduce the chance of NIC collision within the same network. With sufficiently unique timeuuids as list cell keys, such updates are no longer lost, but a multi-value update can still be "merged" with another multi-value update. E.g. if node A executes SET l = l + [4, 5] and node B executes SET l = l + [6, 7], the list value could be any of [4, 5, 6, 7], [4, 6, 5, 7], [6, 4, 5, 7] and so on. At least we are now less likely to get any value lost. Fixes #6208. @todo: initialize UUID subsystem explicitly in main() and switch to using seastar::engine().net().network_interfaces() test: unit (dev)	2021-01-21 13:03:53 +03:00
Avi Kivity	4cfaab208e	allocation_strategy: set preferred max contiguous allocation to 128k for standard allocations Now that managed_bytes and its users do not assume that a managed_bytes instance allocated using standard_allocation_strategy is non-fragmented, we can set the preferred max contiguous allocation to 128k. This causes managed_bytes to fragment instances that are larger than this size. Note that managed_bytes is the only user. Closes #7943	2021-01-21 11:15:13 +02:00
Michał Chojnowski	85048b349b	memtable: fix accounting of managed_bytes in partition_snapshot_accounter managed_bytes has a small overhead per each fragment. Due to that, managed_bytes containing the same data can have different total memory usage in different allocators. The smaller the preferred max allocation size setting is, the more fragments are needed and the greater total per-fragment overhead is. In particular, managed_bytes allocated in the LSA could grow in memory usage when copied to the standard allocator, if the standard allocator had a preferred max allocation setting smaller than the LSA. partition_snapshot_accounter calculates the amount of memory used by mutation fragments in the memtable (where they are allocated with LSA) based on the memory usage after they are copied to the standard allocator. This could result in an overestimation, as explained above. But partition_snapshot_accounter must not overestimate the amount of freed memory, as doing otherwise might result in OOM situations. This patch prevents the overaccounting by adding minimal_external_memory_usage(): a new version of external_memory_usage(), which ignores allocator-dependent overhead. In particular, it includes the per-fragment overhead in managed_bytes only once, no matter how many fragments there are.	2021-01-15 18:21:13 +01:00
Michał Chojnowski	72ecbd6936	utils: fragment_range: add a fragment iterator for FragmentedView A stylistic change. Iterators are the idiomatic way to iterate in C++.	2021-01-15 14:05:44 +01:00
Pavel Solodovnikov	eb523d4ac8	utils: remove unused linearization facilities in `managed_bytes` class Remove the following bits of `managed_bytes` since they are unused: * `with_linearized_managed_bytes` function template * `linearization_context_guard` RAII wrapper class for managing `linearization_context` instances. * `do_linearize` function * `linearization_context` class Since there is no more public or private methods in `managed_class` to linearize the value except for explicit `with_linearized()`, which doesn't use any of aforementioned parts, we can safely remove these. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-01-08 14:16:08 +01:00
Avi Kivity	3bf6b78668	utils: managed_bytes: remove linearizing accessors Accessors that require linearization, such as data(), begin(), and casting to bytes_view, are no longer used and are now removed.	2021-01-08 14:16:08 +01:00
Michał Chojnowski	bf0ec63e34	utils: managed_bytes: add managed_bytes_view::operator[] This operator has a single purpose: an easier port of legacy_compound_view from bytes_view to managed_bytes_view. It is inefficient and should be removed as soon as legacy_compound_view stops using operator[].	2021-01-08 14:16:08 +01:00
Michał Chojnowski	778269151a	utils: managed_bytes: introduce managed_bytes_view managed_bytes_view is a non-owning view into managed_bytes. It can also be implicitly constructed from bytes_view. It conforms to the FragmentedView concept and is mainly used through that interface. It will be used as a replacement for bytes_view occurrences currently obtained by linearizing managed_bytes.	2021-01-08 14:16:08 +01:00
Michał Chojnowski	cf7d25b98d	utils: fragment_range: add serialization helpers for FragmentedMutableView We will use them to write to managed_bytes_view in an upcoming patch, to avoid linearization in compound_type::serialize_value.	2021-01-08 14:16:07 +01:00
Michał Chojnowski	4822730752	utils: mutable_view: add substr() Analogous to bytes_view::substr. This bit of functionality will be used to implement managed_bytes_mutable_view.	2021-01-08 13:17:46 +01:00
Michał Chojnowski	6c97027f85	utils: fragment_range: add compare_unsigned We will use it to compare fragmented buffers (mainly managed_bytes_view in types, compound, and tests) without linearization.	2021-01-04 22:50:45 +01:00
Michał Chojnowski	2d28471a59	utils: managed_bytes: make the constructors from bytes and bytes_view explicit Conversions from views to owners have no business being implicit. Besides, they would also cause various ambiguity problems when adding managed_bytes_view.	2021-01-04 22:22:12 +01:00
Avi Kivity	0f7b6dd180	utils: managed_bytes: introduce with_linearized() This is a temporary scaffold for weaning ourselves off linearization. It differs from with_linearized_managed_bytes in that it does not rely on the environment (linearization_context) and so is easier to remove.	2020-12-20 15:14:44 +01:00
Avi Kivity	c37e495958	utils: managed_bytes: constrain with_linearized_managed_bytes() The passed function must be called with no parameters; document and enforce it.	2020-12-20 15:14:44 +01:00
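
Commit 4e35befcf2 above says the correct way to reinterpret memory is to copy it into a new object instead of using `reinterpret_cast`. The following is a minimal sketch of that pattern, not the code from the commit: `read_unaligned`, `write_unaligned` and `round_trips` are hypothetical names chosen for illustration, and the only library call relied on is `std::memcpy`.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical helper (not Scylla's API): read a T from possibly unaligned
// bytes by copying them into a fresh object, so no pointer of the wrong
// type is ever dereferenced.
template <typename T>
T read_unaligned(const void* src) noexcept {
    static_assert(std::is_trivially_copyable<T>::value,
                  "memcpy-based reads require a trivially copyable type");
    T value;
    std::memcpy(&value, src, sizeof(T));
    return value;
}

// Hypothetical counterpart: store a T into possibly unaligned bytes.
template <typename T>
void write_unaligned(void* dst, const T& value) noexcept {
    static_assert(std::is_trivially_copyable<T>::value,
                  "memcpy-based writes require a trivially copyable type");
    std::memcpy(dst, &value, sizeof(T));
}

// Writes v at an odd offset (misaligned for uint32_t) and reads it back.
inline bool round_trips(std::uint32_t v) {
    unsigned char buf[sizeof(v) + 1] = {};
    write_unaligned(buf + 1, v);
    return read_unaligned<std::uint32_t>(buf + 1) == v;
}
```

For fixed, small sizes such as these, optimizing compilers typically turn the `std::memcpy` call into a plain load or store, which matches the commit's remark that the cast has no performance advantage.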


914 Commits