Commit Graph

926 Commits

Author SHA1 Message Date
Avi Kivity
ec3db140cb utils: data_input: replace enable_if with tightened concept
std::is_fundamental isn't a good constraint, since it includes nullptr_t
and void. Replace it with std::integral, which is sufficient. Use a concept
instead of enable_if to simplify the code.
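
A minimal sketch of the shape of the change (the reader function below is
illustrative, not the actual data_input API):

    #include <concepts>
    #include <cstring>

    // Tightened constraint: void and nullptr_t no longer qualify, and the
    // concept reads better than the old enable_if gate.
    template <std::integral T>
    T read(const char* p) {
        T v;
        std::memcpy(&v, p, sizeof(T));  // byte-wise copy sidesteps alignment issues
        return v;
    }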

Closes #8450
2021-04-11 18:56:21 +03:00
Pavel Emelyanov
26e27e27e8 btree: Add operator bool()
The btree's iterators allow simple checking for the '== tree.end()'
condition. For this check neither the tree itself nor the ending
iterator is required; one just needs to check whether the _idx value is
npos.

One additional change is required to make it work -- when removing
an entry from the inline node, the _idx should be set to npos.

This change is, well, a bugfix. An iterator left with 0 in _idx is
treated as a valid one. However, the bug is non-triggerable: if such
an "invalid" iterator is compared against tree.end(), the check returns
true, because the tree pointers coincide.

So this patch adds an operator bool() to the btree iterator to facilitate
simpler checking of whether it has reached the end of the collection.
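
A sketch of the idea, assuming npos marks the end as described above:

    #include <cstddef>

    struct iterator {
        static constexpr std::size_t npos = std::size_t(-1);
        std::size_t _idx = npos;
        // true while the iterator points at a live entry, false at end()
        explicit operator bool() const noexcept { return _idx != npos; }
    };
    // usage: if (it) { ... }  instead of  if (it != tree.end()) { ... }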

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-09 10:05:47 +03:00
Konstantin Osipov
c83cf1f965 uuid: switch the API to use std::chrono
A follow-up for the patch for #7611. This change was requested
during review and moved out of #7611 to reduce its scope.

The patch switches UUID_gen API from using plain integers to
hold time units to units from std::chrono.

For one, we plan to switch the entire code base to std::chrono units
to ensure type safety. Secondly, using std::chrono units allows us to
increase code reuse with template metaprogramming and remove a few
UUID_gen functions that became redundant as a result. (A sketch of the
change's flavor follows the list below.)

* switch get_time_UUID(), unix_timestamp(), get_time_UUID_raw(),
  min_time_UUID(), max_time_UUID(), create_time_safe() to
  std::chrono
* remove unused variant of from_unix_timestamp()
* remove unused get_time_UUID_bytes(), create_time_unsafe(),
  redundant get_adjusted_timestamp()
* inline get_raw_UUID_bytes()
* collapse two similar implementations of get_time_UUID()
* switch internal constants to std::chrono
* remove unnecessary unique_ptr from UUID_gen::_instance
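
A sketch of the change's flavor (the function name and signature are
approximations, not the exact UUID_gen API):

    #include <chrono>
    #include <cstdint>

    // Timeuuid time is 100ns ticks since 1582-10-15; 0x01B21DD213814000 is
    // the well-known tick offset of the Unix epoch. Returning a chrono
    // duration instead of a bare integer makes misuse a compile error.
    std::chrono::milliseconds unix_timestamp_from_ticks(int64_t ticks) {
        using namespace std::chrono;
        return duration_cast<milliseconds>(
                nanoseconds((ticks - 0x01B21DD213814000) * 100));
    }
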
Message-Id: <20210406130152.3237914-2-kostja@scylladb.com>
2021-04-06 17:12:54 +03:00
Michał Chojnowski
f23a47e365 utils: fragment_range: fix FragmentedView utils for views with empty fragments
The copying and comparing utilities for FragmentedView are not prepared to
deal with empty fragments in non-empty views, and will fall into an infinite
loop in such a case.
But data coming in a result_row_view can contain such fragments, so we need
to fix that.
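
The failure mode, sketched with approximated FragmentedView member names:

    // A copy loop that advances by the current fragment's size makes no
    // progress once it hits an empty fragment in a non-empty view, and
    // spins forever. The fix has to skip zero-length fragments explicitly.
    template <typename View, typename Output>
    void copy_fragmented(View view, Output& out) {
        while (view.size_bytes()) {
            auto frag = view.current_fragment();
            out.write(frag.data(), frag.size());
            view.remove_prefix(frag.size());  // frag.size() == 0 => no progress
        }
    }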

Fixes #8398.

Closes #8397
2021-04-04 15:31:51 +03:00
Avi Kivity
4739df2cb1 Merge 'cql3: remove linearizations in the write path' from Michał Chojnowski
As a part of the effort of removing big, contiguous buffers from the codebase,
cql3::raw_value should be made fragmented. Unfortunately a straightforward
rewrite to a fragmented buffer type is not possible, because we want
cql3::raw_value to be compatible with cql3::raw_value_view, and we want that
view to be based on fragmented_temporary_buffer::view, so that it can be
used to view data coming directly from seastar without copying.

This patch makes cql3::raw_value fragmented by making cql3::raw_value_view
a `variant` of managed_bytes_view and fragmented_temporary_buffer::view.

Code users which depended on `cql3::raw_value` being `bytes`,
and cql::raw_value_view being `fragmented_temporary_buffer::view` underneath
were adjusted to the new, dual representation, mainly through the
`cql3::raw_value_view::with_value` visitor and deserialization/validation
helpers added to `cql3::raw_value_view`.

The second part of this series gets rid of linearizations occurring when
processing compound types in the CQL layer. This is achieved by storing
their elements in `managed_bytes` instead of `bytes` in the partially
deserialized form (`lists::value`, `tuples::value`, etc.), outputting
`managed_bytes` instead of `bytes` in functions which go from the partially
deserialized form to the atomic cell format (for frozen types), and avoiding
calling deserialize/serialize on individual elements when it's not necessary.
(It's only necessary for CQLv2, because since CQLv3 the format on the wire
is the same as our internal one.)

The above also forces some changes to `expression.cc`, and `restrictions`, mainly because
`IN` clauses store their arguments as `lists` and `tuples`, and the code which handled
this clause expected `bytes`.

After this series, the path from prepared CQL statements to `atomic_cell_or_collection`
is almost completely linearization-free. The last remaining place is `collection_mutation_description`,
where map keys are linearized to `bytes`.

Closes #8160

* github.com:scylladb/scylla:
  cql3: update_parameters: remove unused version of make_cell for bytes_view
  types: collection: remove an unused version of pack_fragmented
  cql3: optimize the deserialization of collections
  cql3: maps, sets: switch the element type from bytes to managed_bytes
  cql3: expression: use managed_bytes instead of bytes where possible
  cql3: expr: expression: make the argument of to_range a forwarding reference
  cql3: don't linearize elements of lists, tuples, and user types
  cql3: values: add const managed_bytes& constructor to raw_value_view
  cql3: output managed_bytes instead of bytes in get_with_protocol_version
  types: collection: add versions of pack for fragmented buffers
  types: add write_collection_{value,size} for managed_bytes_mutable_view
  cql3: tuples, user_types: avoid linearization in from_serialized() and get()
  types: tuple: add build_value_fragmented
  cql3: update_parameters: add make_cell version for managed_bytes_view
  cql3: remove operation::make_*cell
  cql3: values: make raw_value fragmented
  cql3: values: remove raw_value_view::operator==
  cql3: switch users of cql3::raw_value_view to internals-independent API
  cql3: values: add an internals-independent API to raw_value_view
  utils: managed_bytes: add a managed_bytes constructor from FragmentedView
  utils: managed_bytes: add operator<< and to_hex for managed_bytes
  utils: fragment_range: add to_hex
  configure: remove unused link dependencies from UUID_test
2021-04-01 15:21:32 +03:00
Pavel Emelyanov
8bbe2eae5e btree: Convert comparator to <=>
It turned out that all the users of btree can already be converted
to use the safer std::strong_ordering. The only meaningful change here
is in the btree code itself -- no more ints there.
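
A minimal sketch of a tri-comparator in the new style:

    #include <compare>

    struct key { int v; };

    // Returns std::strong_ordering instead of a raw int, so the comparator
    // cannot be mistaken for a less-than predicate.
    std::strong_ordering tri_compare(const key& a, const key& b) {
        return a.v <=> b.v;
    }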

tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210330153648.27049-1-xemul@scylladb.com>
2021-04-01 12:56:08 +03:00
Michał Chojnowski
45e0ef26d3 utils: managed_bytes: add a managed_bytes constructor from FragmentedView
Just for convenience. We will use it in an upcoming patch where we switch
the inner representation of cql3::raw_value from bytes to managed_bytes, and we
will want to construct managed_bytes from fragmented_temporary_buffer::view.
2021-04-01 10:39:42 +02:00
Michał Chojnowski
4715268e30 utils: managed_bytes: add operator<< and to_hex for managed_bytes
We will need them to replace bytes with managed_bytes in some places in an
upcoming patch.

The change to configure.py is necessary because operator<< links to to_hex
in bytes.cc.
2021-04-01 10:39:42 +02:00
Michał Chojnowski
14c4639994 utils: fragment_range: add to_hex 2021-04-01 10:39:42 +02:00
Avi Kivity
7c953f33d5 utils: disk-error-handler: replace enable_if with concepts
Simpler, cleaner. We also replace the deprecated std::result_of_t
with std::invoke_result_t.
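
Roughly the shape of the change (the function below is illustrative):

    #include <concepts>
    #include <type_traits>
    #include <utility>

    // requires-clause instead of enable_if; invoke_result_t instead of the
    // deprecated result_of_t.
    template <typename Func>
    requires std::invocable<Func>
    std::invoke_result_t<Func> do_io_check(Func&& func) {
        return std::forward<Func>(func)();
    }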

Closes #8305
2021-03-30 09:29:46 +02:00
Avi Kivity
3c292e31af utils: utf8: fix validate_partial() on non-SIMD-optimized architectures
validate_partial() is declared in the internal namespace, but defined
outside it. This causes calls to validate_partial() to be ambiguous
on architectures that haven't been SIMD-optimized yet (e.g. s390x).

Fix by defining it in the internal namespace.

Closes #8268
2021-03-23 09:21:14 +02:00
Avi Kivity
4dae434f69 utils: crc: fix build with big-endian architectures and 1-byte objects
crc has some code to reverse endianness on big-endian machines, but does
not handle the case of a 1-byte object (which doesn't need any adjustment).
This causes clang to complain that the switch statement doesn't handle that
case.

Fix by adding a no-op case.
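
The fix in miniature (sketched with if constexpr; the real code uses a
switch statement, per the message above):

    #include <cstdint>
    #include <type_traits>

    template <typename T>
    T reverse_endianness(T v) {
        static_assert(std::is_integral_v<T>);
        if constexpr (sizeof(T) == 1) { return v; }  // the missing no-op case
        else if constexpr (sizeof(T) == 2) { return __builtin_bswap16(v); }
        else if constexpr (sizeof(T) == 4) { return __builtin_bswap32(v); }
        else { return __builtin_bswap64(v); }
    }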

Closes #8269
2021-03-23 09:16:20 +02:00
Avi Kivity
29a5047982 utils: error_injection: convert enable_if to concepts
Constrain inject() with a requires clause rather than enable_if,
simplifying the code and compiler diagnostics.

Note that the second instance could not have been called, since
the template argument does not appear in the function parameter
list and thus could not be deduced. This is corrected here.
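
The deduction pitfall in miniature (names are illustrative):

    #include <string_view>

    // T appears only in the body, never in the parameter list, so it cannot
    // be deduced: inject("live_update") doesn't compile, and without an
    // explicit inject<int>(...) the overload is effectively dead code.
    template <typename T>
    void inject(std::string_view name) { /* ... */ }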

Closes #8322
2021-03-21 09:28:23 +02:00
Piotr Sarna
2509b7dbde Merge 'dht: convert ring_position and decorated_key to std::strong_ordering' from Avi Kivity
As #1449 notes, trichotomic comparators returning int are dangerous, as they
can be mistaken for less-than comparators. This series converts dht::ring_position
and dht::decorated_key, as well as a few closely related downstream types, to
return std::strong_ordering.

Closes #8225

* github.com:scylladb/scylla:
  dht: ring_position, decorated_key: convert tri_comparators to std::strong_ordering
  pager: rephrase misleading comparison check
  test: total_order_checks: prepare for std::strong_ordering
  test: mutation_test: prepare merge_container for std::strong_ordering
  intrusive_array: prepare for std::strong_ordering
  utils: collection-concepts: prepare for std::strong_ordering
2021-03-18 11:51:54 +01:00
Avi Kivity
fe0f983dfb intrusive_array: prepare for std::strong_ordering
Newer comparators can return std::strong_ordering, so don't
expect an int.
2021-03-18 12:40:05 +02:00
Avi Kivity
9fbe4850c9 utils: collection-concepts: prepare for std::strong_ordering
collection-concepts includes a Comparable concept for a trichotomic
comparator function, used in intrusive btree and double_decker. Prepare
for std::strong_ordering by also allowing std::strong_ordering as a
return type. Once we've cleaned the code base, we can tighten it to
only allow std::strong_ordering.
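
A sketch of such a transitional concept (the template parameters are
approximated, not the exact collection-concepts definition):

    #include <compare>
    #include <concepts>

    // Accept either legacy int or std::strong_ordering as the comparator's
    // return type for the duration of the migration.
    template <typename Func, typename T>
    concept Comparable = requires (Func f, const T& a, const T& b) {
        { f(a, b) } -> std::same_as<int>;
    } || requires (Func f, const T& a, const T& b) {
        { f(a, b) } -> std::same_as<std::strong_ordering>;
    };
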
2021-03-18 12:40:03 +02:00
Michał Chojnowski
5c3385730b treewide: get rid of unaligned_cast
unaligned_cast violates strict aliasing rules. Replace it with
safe equivalents.
2021-03-17 17:00:41 +01:00
Michał Chojnowski
4e35befcf2 treewide: get rid of incorrect reinterpret casts
In some places we use the `*reinterpret_cast<const net::packed<T>*>(&x)`
pattern to reinterpret memory. This is a violation of C++'s aliasing rules,
which invokes undefined behaviour.

The blessed way to correctly reinterpret memory is to copy it into a new
object. Let's do that.

Note: the reinterpret_cast way has no performance advantage. Compilers
recognize the memory copy pattern and optimize it away.
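
The blessed pattern, sketched:

    #include <cstring>
    #include <type_traits>

    // Copy into a fresh object instead of casting pointers; compilers
    // recognize the memcpy and emit a plain (unaligned) load.
    template <typename T>
    T read_unaligned(const void* p) {
        static_assert(std::is_trivially_copyable_v<T>);
        T v;
        std::memcpy(&v, p, sizeof(T));
        return v;
    }
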
2021-03-17 17:00:38 +01:00
Avi Kivity
290897ddbc logalloc: background reclaim: use default scheduling group for adjusting shares
If the shares are currently low, we might not get enough CPU time to
adjust the shares in time.

This is currently a no-op, since Seastar runs the callback outside
scheduling groups (and only uses the scheduling group for inherited
continuations); but it's better to be insulated against such details.
2021-03-15 13:54:49 +02:00
Avi Kivity
a87f6498c3 logalloc: background reclaim: log shares adjustment under trace level
Useful when debugging, but too noisy at any other time.
2021-03-15 13:54:49 +02:00
Avi Kivity
ce1b1d6ec4 logalloc: background reclaim: fix shares not updated by periodic timer
adjust_shares() thinks it needs to do nothing if the main loop
is running, but in reality it can only avoid waking the main loop;
it still needs to adjust the shares unconditionally. Otherwise,
the background reclaim shares can get locked into a low value.

Fix by splitting the conditional into two.
2021-03-15 13:54:37 +02:00
Nadav Har'El
f41dac2a3a alternator: avoid large contiguous allocation for request body
Alternator request sizes can be up to 16 MB, but the current implementation
had the Seastar HTTP server read the entire request as a contiguous string,
and then process it. We can't avoid reading the entire request up-front --
we want to verify its integrity before doing any additional processing on it.
But there is no reason why the entire request needs to be stored in one big
*contiguous* allocation. This is always a bad idea. We should use a
non-contiguous buffer, and that's the goal of this patch.

We use a new Seastar HTTPD feature where we can ask for an input stream,
instead of a string, for the request's body. We then begin the request
handling by reading the content of this stream into a
vector<temporary_buffer<char>> (which we alias "chunked_content"). We then
use this non-contiguous buffer to verify the request's signature and
if successful - parse the request JSON and finally execute it.
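
A sketch of that reading step in Seastar coroutine style (names
approximated):

    #include <seastar/core/coroutine.hh>
    #include <seastar/core/iostream.hh>
    #include <seastar/core/temporary_buffer.hh>
    #include <vector>

    using chunked_content = std::vector<seastar::temporary_buffer<char>>;

    // Accumulate the body fragment by fragment -- no single big allocation.
    seastar::future<chunked_content> read_body(seastar::input_stream<char>& in) {
        chunked_content body;
        while (true) {
            auto buf = co_await in.read();  // next chunk; empty buffer at EOF
            if (buf.empty()) {
                co_return body;
            }
            body.push_back(std::move(buf));
        }
    }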

Beyond avoiding contiguous allocations, another benefit of this patch is
that while parsing a long request composed of chunks, we free each chunk
as soon as its parsing has completed. This reduces the peak amount of memory
used by the query - we no longer need to store both unparsed and parsed
versions of the request at the same time.

Although we already had tests with requests of different lengths, most
of them were short enough to only have one chunk, and only a few had
2 or 3 chunks. So we also add a test which makes a much longer request
(a BatchWriteItem with large items), which in my experiment had 17 chunks.
The goal of this test is to verify that the new signature and JSON parsing
code, which needs to cross chunk boundaries, works as expected.

Fixes #7213.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210309222525.1628234-1-nyh@scylladb.com>
2021-03-10 09:22:34 +01:00
Avi Kivity
9038a81317 treewide: drop SEASTAR_CONCEPT
Since Scylla requires C++20, there is no need to protect
concept definitions or usages with SEASTAR_CONCEPT; it just
clutters the code. This patch therefore removes all uses.

Closes #8236
2021-03-08 16:04:20 +01:00
Tomasz Grabiec
ecb6c56a2a Merge 'lsa: background reclaim' from Avi Kivity
This series adds background reclaim to lsa, with the goal
that most large allocations can be satisfied from available
free memory, and reclaim work can be done from a preemptible
context.

If the workload has free cpu, then background reclaim will
utilize that free cpu, reducing latency for the main workload.
Otherwise, background reclaim will compete with the main
workload, but since that work needs to happen anyway,
throughput will not be reduced.

A unit test is added to verify it works.

Fixes #1634.

Closes #8044

* github.com:scylladb/scylla:
  test: logalloc_test: test background reclaim
  logalloc: reduce gap between std min_free and logalloc min_free
  logalloc: background reclaim
  logalloc: preemptible reclaim
2021-02-24 13:23:30 +01:00
Avi Kivity
78d1afeabd Merge "Use radix tree to store cells on a row" from Pavel E
"
Current storage of cells in a row is a union of a vector and a set. The
vector holds 5 cell_and_hash's inline, up to 32 in external storage, and
is then switched to std::set. Once switched, the whole union becomes a
waste of space, as its size is
   sizeof(vector head) + 5 * sizeof(cell and hash) = 90+ bytes

and only 3 pointers of it are used (the std::set header). Also the
overhead of keeping cell_and_hash as a set entry is more than the size
of the structure itself.

Column ids are 32-bit integers that most likely come sequentially.
For this kind of search key a radix tree (with some care for
non-sequential cases) can be beneficial.

This set introduces a compact radix tree that uses 7-bit sub-values
of the search key to index each node and compacts the nodes
themselves for better memory usage. Then row::_storage is replaced
with the new tree.

The most notable result is the memory footprint decrease, for wide
rows down to 2x. The performance of micro-benchmarks is a bit
lower for small rows and (!) higher for longer ones (8+ cells). The
numbers are in patch #12 (spoiler: they are better than for v2).

v3:
- trimmed size of radix down to 7 bits
- simplified the nodes layouts, now there are 2 of them (was 4)
- enhanced perf_mutation to test N-cells schema
- added AVX intra-node search for medium-sized nodes
- added .clone_from() method that helped to improve perf_mutation
- minor
  - changed functions not to return values via refs-arguments
  - fixed nested classes to properly use language constructors
  - renamed index_to to key_t to distinguish from node_index_t
  - improved recursive variadic templates not to use a sentinel argument
  - use standard concepts

v2:
- fixed potential mis-compilation due to strict-aliasing violation
- added oracle test (radix tree is compared with std::map)
- added radix to perf_collection
- cosmetic changes (concepts, comments, names)

A note on item 1 from the v2 changelog. The nodes are no longer packed
perfectly; each has grown by 3 bytes. But it turned out that when used
as the cells container, most of this growth drowned in lsa alignments.

next todo:
- aarch64 version of 16-keys node search

tests: unit(dev), unit(debug for radix*), perf(dev)
"

* 'br-radix-tree-for-cells-3' of https://github.com/xemul/scylla:
  test/memory_footpring: Print radix tree node sizes
  row: Remove old storages
  row: Prepare row::equal for switch
  row: Prepare row::difference for switch
  row: Introduce radix tree storage type
  row-equal: Re-declare the cells_equal lambda
  test: Add tests for radix tree
  utils: Compact radix tree
  array-search: Add helpers to search for a byte in array
  test/perf_collection: Add callback to check the speed of clone
  test/perf_mutation: Add option to run with more than 1 columns
  test/perf_mutation: Prepare to have several regular columns
  test/perf_mutation: Use builder to build schema
2021-02-18 21:19:14 +02:00
Botond Dénes
ba7a9d2ac3 imr: switch back to open-coded description of structures
Commit aab6b0ee27 introduced the
controversial new IMR format, which relied on a very template-heavy
infrastructure to generate serialization and deserialization code via
template meta-programming. The promise was that this new format, beyond
solving the problems the previous open-coded representation had (working
on linearized buffers), would speed up migrating other components to the
IMR format, as the IMR infrastructure reduces code bloat and makes the
code more readable and safer via declarative type descriptions.
However, the results were almost the opposite. The template
meta-programming used by the IMR infrastructure proved very hard to
understand. Developers don't want to read or modify it. Maintainers
don't want to see it being used anywhere else. In short, nobody wants to
touch it.

This commit does a conceptual revert of
aab6b0ee27. A verbatim revert is not
possible because the related code has evolved a lot since the merge. Also,
going back to the previous code would mean we regress, as we'd revert the
move to fragmented buffers. So this revert is only conceptual: it changes
the underlying infrastructure back to the previous open-coded one, but
keeps the fragmented buffers, as well as the interface of the related
components (to the extent possible).

Fixes: #5578
2021-02-16 23:43:07 +01:00
Michał Chojnowski
25a9569cc4 utils: managed_bytes: add a few trivial helper methods
We will use them in the upcoming IMR removal patch.
2021-02-16 23:43:07 +01:00
Michał Chojnowski
3f248ca7cc utils: fragment_range: move FragmentedView helpers to fragment_range.hh
In the upcoming IMR removal patch we will need read_simple() and similar helpers
for FragmentedView outside of types.hh. For now, let's move them to
fragment_range.hh, where FragmentedView is defined. Since it's a widely included
header, we should consider moving them to a more specialized header later.
2021-02-16 21:35:15 +01:00
Michał Chojnowski
8a06a576aa utils: fragment_range: add single_fragmented_mutable_view
We will use it later in the upcoming IMR removal patch.
2021-02-16 21:35:15 +01:00
Michał Chojnowski
7b662b9315 utils: fragment_range: implement FragmentRange for fragment_range
This will allow us to pass FragmentedView instances to places where
FragmentRange is expected.
2021-02-16 21:35:15 +01:00
Michał Chojnowski
f972f90193 utils: mutable_view: add front()
We will use it in the upcoming patches.
2021-02-16 21:35:14 +01:00
Pavel Emelyanov
9baf1226dc test/memory_footpring: Print radix tree node sizes
After switching the cells storage onto the compact radix tree it
becomes useful to know the tree nodes' sizes.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-15 20:41:09 +03:00
Pavel Emelyanov
a5bd68ae5d utils: Compact radix tree
The tree uses an integral type as the search key. On each level the local
index is the next 7 bits of the key, so for a 32-bit key we have 5 levels.
The tree uses 2 memory packing techniques -- prefix compaction and growing
node layouts.

The prefix compaction is used when a node has only one child. In this case
such a node is replaced in its parent with that only child, and the child in
question keeps a "prefix:length" pair on board that's used to check whether
the short-cut lookup took the correct path.

The growing node layouts make the nodes occupy only as much memory as needed
to keep the _present_ keys, and there are 2 kinds of layouts.

The direct layout is an array; intra-node search is plain indexing. The
layout storage grows in a vector-like manner, but there's a special case for
the maximum-sized layout that helps avoid some boundary checks.

The indirect layout keeps two arrays on board -- one with values and one
with indices. Intra-node search is thus a lookup in the latter array first.
This layout is used to save memory for sparse keys. Lookup is optimized
with SIMD instructions.

Inner nodes use direct layouts, as they occupy ~1% of memory and thus
need not be memory-efficient. At the same time, looking up a key in the
tree potentially walks several inner nodes, so speeding up search in them
is beneficial.

Leaf nodes are indirect, since they are 99% of memory and thus need to
be packed well. The single indirect lookup when searching the tree
doesn't slow things down notably, even in an insertion stress test.

That said,
 * inner nodes are: header + 4 / 8 / 16 / 32 / 64 / 128 pointers
 * leaf nodes are : header + 4 / 8 / 16 / 32 bytes + <same nr> objects
                 or header + 16 bytes bitmap + 128 objects

The header is
 - backreference (8 bytes)
 - prefix (4 bytes)
 - size, layout, capacity (1 byte each)
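
As a struct, the header might look like this (illustrative only; field
names are guesses based on the description above):

    #include <cstdint>

    struct node_header {
        void*    backref;   // 8 bytes: link back to the parent's slot
        uint32_t prefix;    // 4 bytes: compacted prefix for short-cut checks
        uint8_t  size;      // number of present keys
        uint8_t  layout;    // direct or indirect
        uint8_t  capacity;  // capacity of the current layout
    };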

The iterator is forward-only (for simplicity), but that's enough for its
main target -- the sparse array of cells on a row. The iterator also has an
.index() method that reports back the index of the entry at which it points.
This greatly simplifies further tree scans by the row class.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-15 19:25:10 +03:00
Pavel Emelyanov
d43ad8738c array-search: Add helpers to search for a byte in array
The radix tree code will need to find an 8-bit value in an array of
some fixed size, so here are the helpers.
Those that allow for SIMD implementations are such for x86_64

TODO: Add aarch64

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-15 17:47:59 +03:00
Avi Kivity
cb4e1bb0b9 logalloc: reduce gap between std min_free and logalloc min_free
With the larger gap, logalloc reserved more memory for std than the
threshold at which background reclaim runs, so background reclaim was
triggered rarely.

With the gap reduced, background reclaim is constantly running in
an allocating workload (e.g. cache misses).
2021-02-14 19:09:29 +02:00
Avi Kivity
ca0c006b37 logalloc: background reclaim
Set up a coroutine in a new scheduling group to ensure there is
a "cushion" of free memory. It reclaims in preemptible mode in
order to reduce reactor stalls (contrast with synchronous reclaim,
which cannot preempt until it achieves its goal).

The free memory target is arbitrarily set at 60MB. The reclaimer's
shares are proportional to the distance from the free memory target;
so a workload that allocates memory rapidly will have the background
reclaimer working harder.
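
The shares curve might be sketched like this (constants are illustrative):

    #include <cstddef>

    // The further free memory is below the target, the more shares the
    // reclaimer's scheduling group gets.
    float shares_for(std::size_t free_memory, std::size_t target) {
        if (free_memory >= target) {
            return 1;  // at or above the target: minimal shares
        }
        float pressure = float(target - free_memory) / target;  // 0..1
        return 1 + pressure * 1000;
    }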

I rolled my own condition variable here, mostly as an experiment.
seastar::condition_variable requires several allocations, while
the one here requires none. We should formalize it after we gain
more experience with it.
2021-02-14 19:09:29 +02:00
Avi Kivity
35076dd2d3 logalloc: preemptible reclaim
Add an option (currently unused by all callers) to preempt
reclaim. If reclaim is preempted, it just stops what it is
doing, even if it reclaimed nothing. This is useful for background
reclaim.
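
A sketch of the control flow; reclaim_one_segment() is a hypothetical
stand-in for the per-segment work:

    #include <seastar/core/preempt.hh>
    #include <cstddef>

    std::size_t reclaim_one_segment();  // hypothetical per-segment work

    std::size_t reclaim(std::size_t goal, bool preemptible) {
        std::size_t reclaimed = 0;
        while (reclaimed < goal) {
            reclaimed += reclaim_one_segment();
            if (preemptible && seastar::need_preempt()) {
                break;  // stop right away, even short of the goal
            }
        }
        return reclaimed;
    }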

Currently, preemption checks are at segment granularity. This is
probably too coarse and should be refined later, but it is already
better than the previous behavior, which did not allow preemption
until the entire requested memory size was reclaimed.
2021-02-14 19:09:29 +02:00
Piotr Sarna
aa39130a20 bounded_stats_queue: add missing const qualifiers
Most of the methods of this utility are effectively const.
Message-Id: <ed376ab74b6323cf770cc0a1314edbae0b16111e.1612953601.git.sarna@scylladb.com>
2021-02-10 13:04:35 +02:00
Pavel Emelyanov
2f7c03d84c utils: Intrusive B-tree (with tests)
The design of the tree goes from the row-cache needs, which are

1. Insert/Remove do not invalidate iterators
2. Elements are LSA-manageable
3. Low key overhead
4. External tri-comparator
5. As little actions on insert/remove as possible

With the above the design is

Two types of nodes -- inner and leaf. Both types keep a pointer to the
parent node and N pointers to keys (not the keys themselves). There are two
differences: inner nodes have an array of pointers to children, while leaf
nodes keep a pointer to the tree (to update the left- and rightmost tree
pointers on node move).

Nodes do not keep pointers/references to the tree, thus we have O(1) move
of any object, but O(logN) to get the tree size. Fortunately, with a big
keys-per-node value this won't result in too many steps.

In turn, the tree has 3 pointers -- root, leftmost and rightmost leaves.
The latter two are for constant-time begin() and end().

Keys are managed by the user with the help of an embeddable member_hook
instance, which is 1 pointer in size.
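
Hypothetical usage of the hook (illustrative stand-in types, not the
actual API):

    // Illustrative stand-in for the real hook: one pointer in size.
    struct member_hook { void* link = nullptr; };

    struct row_entry {
        int key;
        member_hook hook;  // embeds the tree linkage inside the user object
    };
    // The tree stores pointers to keys rather than copies, so insert and
    // erase never move or invalidate the user's objects.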

The code was copied from the B+ tree one, then heavily reworked; the
internal algorithms turned out to differ quite significantly.

For the sake of mutation_partition::apply_monotonically(), which needs to
move an element from one tree into another, there's a key_grabber helper
wrapper that allows doing this move while respecting the exception-safety
requirement.

As measured by the perf_collections test, the B-tree with 8 keys is faster
than std::set, but slower than the B+tree:

            vs set        vs b+tree
   fill:     +13%           -6%
   find:     +23%          -35%

Another neat thing is that 1-key insertion-removal is ~40% faster than
for a BST (the same number of allocations, but the key object is smaller,
with fewer pointers to set up and fewer instructions to execute when
linking a node with the root).

v4:
- equip insertion methods with on_alloc_point() calls to catch
  potential exception-guarantee violations earlier

- add unlink_leftmost_without_rebalance. The method is borrowed from
  boost intrusive set, and is added to kill two birds -- provide it,
  as it turns out to be popular, and use a somewhat faster step-by-step
  tree destruction than a plain begin+erase loop

v3:
- introduce an "inline" root node that is embedded into the tree object and
  into which the 1st key is inserted. This greatly improves the 1-key-tree
  performance, which is a pretty common case for the rows cache

v2:
- introduce a "linear" root leaf that grows on demand

  This improves the memory consumption for small trees. This linear node may
  and should over-grow the NodeSize parameter. This comes from the fact that
  there are two big per-key memory spikes on small trees -- the 1-key root
  leaf and the first split, when the tree becomes a 1-key root with two
  half-filled leaves. If the linear extension grows above NodeSize it can
  flatten even the 2nd peak

- mitigate the keys indirection a bit

  Prefetching the keys while doing the intra-node linear scan, and the
  nodes while descending the tree, gives ~+5% on fill and find

- generalize stress tests for B and B+ trees

- cosmetic changes

TODO:

- fix a few inefficiencies in the core code (it sometimes walks the
  sub-tree twice)
- try to optimize the leaf nodes that are not left-/rightmost not to
  carry an unused tree pointer on board

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:30:29 +03:00
Raphael S. Carvalho
298d54ceb0 utils/fragment_temporary_buffer: don't push empty fragment if data size is fragment-aligned
The last fragment is unconditionally pushed to the set of fragments, so if
the data size is fragment-aligned, an empty fragment will be needlessly
pushed to the back of the fragment set.

Note: I haven't tested whether an empty fragment at the back of the set
causes issues; I think it won't, but it should be avoided anyway.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210129231532.871405-3-raphaelsc@scylladb.com>
2021-01-30 20:54:20 +02:00
Raphael S. Carvalho
e745f1e697 utils/fragmented_temporary_buffer: avoid reallocations by reserving upfront
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210129231532.871405-2-raphaelsc@scylladb.com>
2021-01-30 20:54:20 +02:00
Raphael S. Carvalho
08e838d4b5 utils/fragmented_temporary_buffer: simplify allocate_to_fit()
1) reuse default_fragment_size for knowledge of the max fragment size
2) fragments_count is not a good name, as it doesn't include the last
   non-full fragment (if present), so rename it
3) simplify the calculation of the last fragment size

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210129231532.871405-1-raphaelsc@scylladb.com>
2021-01-30 20:54:20 +02:00
Pavel Solodovnikov
d14dc030ac utils: add fragmented_temporary_buffer::allocate_to_fit
Introduce the `fragmented_temporary_buffer::allocate_to_fit` static
function, returning an instance of the buffer of a specified size.

The allocated buffer fragments have a size of at most 128 KiB.
`bytes_ostream` has the same hard-coded limit, so just use the
same here.

This patch will later be needed for `raft::log_entry` raw data
serialization when writing to the underlying persistent storage.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-01-29 01:59:16 +03:00
Nadav Har'El
49440d67ad Merge: Fix multiple issues with timeuuid type
Merged patch series by Konstantin Osipov:

"These series improve uniqueness of generated timeuuids and change
list append/prepend logic to use client/LWT timestamp in timeuuids
generated for list keys. Timeuuid compare functions are
optimized.

The test coverage is extended for all of the above."

  uuid: add a comment warning against UUID::operator<
  uuid: replace slow versions of timeuuid compare with optimized/tested versions.
  test: add tests for legacy uuid compare & msb monotonicity
  test: add a test case for append/prepend limit
  test: add a test case for monotonicity of timeuuid least significant bits
  uuid: implement optimized timeuuid compare
  test: add a test case for list prepend/append with custom timestamp
  lists: rewrite list prepend to use append machinery
  lists: use query timestamp for list cell values during append
  uuid: fill in UUID node identifier part of UUID
  test: add a CQL test for list append/prepend operations
2021-01-21 13:20:07 +02:00
Konstantin Osipov
e18e2cb9f2 uuid: add a comment warning against UUID::operator< 2021-01-21 13:03:59 +03:00
Konstantin Osipov
56d8d166cb test: add tests for legacy uuid compare & msb monotonicity 2021-01-21 13:03:59 +03:00
Konstantin Osipov
0af3758aff uuid: implement optimized timeuuid compare
Introduce a uint64_t-based comparator for serialized timeuuids.

Respect the Cassandra legacy timeuuid compare order.

Scylla uses two versions of timeuuid compare:
- one for timeuuid values stored in uuid columns
- a different one for timeuuid values stored in timeuuid columns.

This commit provides new implementations of these comparators and
deprecates the respective implementations in types.cc. They will be
removed in a following patch.

A micro-benchmark at https://github.com/alecco/timeuuid-bench/
shows a 2-4x speed-up of the new comparators.
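
For reference, the legacy ordering compares the least-significant half
byte-wise as signed int8 values; a sketch:

    #include <compare>
    #include <cstdint>

    // Compare the low 8 bytes of two serialized timeuuids the Cassandra way:
    // byte by byte, as *signed* 8-bit integers.
    std::strong_ordering compare_lsb(const int8_t* a, const int8_t* b) {
        for (int i = 0; i < 8; ++i) {
            if (a[i] != b[i]) {
                return a[i] <=> b[i];
            }
        }
        return std::strong_ordering::equal;
    }
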
2021-01-21 13:03:59 +03:00
Konstantin Osipov
2b8ce83eea lists: use query timestamp for list cell values during append
Scylla list cells are represented internally as a map of
timeuuid => value. To append a new value to a list,
the coordinator generates a timeuuid reflecting the current time as the key
and adds the value to the map using this key.

Before this patch, Scylla always generated a timeuuid for a new
value, even if the query had a user-supplied or LWT timestamp.
This could break LWT linearizability. User-supplied timestamps were
ignored.

This is reported as https://github.com/scylladb/scylla/issues/7611

A statement which appended multiple values to a list, or a BATCH,
generated its own microsecond-resolution timeuuid for each value:

BEGIN BATCH
  UPDATE ... SET a = a + [3]
  UPDATE ... SET a = a + [4]
APPLY BATCH

UPDATE ... SET a = a + [3, 4]

To fix the bug, it's necessary to preserve the monotonicity of
timeuuids within a batch or multi-value append, but make sure
they all use the microsecond time, as set by LWT or the user.

To explain the fix, it's first necessary to recall the structure
of time-based UUIDs:

60 bits: time since start of GMT epoch, year 1582, represented
         in 100-nanosecond units
4 bits:  version
14 bits: clock sequence, a random number to avoid duplicates
         in case system clock is adjusted
2 bits:  type
48 bits: MAC address (or other hardware address)
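
The most-significant half is assembled from those time bits as follows
(standard UUID v1 layout; a sketch):

    #include <cstdint>

    // ts is 100ns ticks since 1582-10-15; version 1 sits in bits 12..15.
    uint64_t make_msb(uint64_t ts) {
        uint64_t time_low = ts & 0xFFFFFFFF;
        uint64_t time_mid = (ts >> 32) & 0xFFFF;
        uint64_t time_hi  = (ts >> 48) & 0x0FFF;
        return (time_low << 32) | (time_mid << 16) | (1u << 12) | time_hi;
    }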

The purpose of the clockseq bits, as defined in
https://tools.ietf.org/html/rfc4122#section-4.1.5,
is to reduce the probability of UUID collision in case the clock
goes back in time or the node id changes. The implementation should
reset it whenever one of these events may occur.

Since LWT microsecond time is guaranteed to be
unique by Paxos, the RFC provisioning for clockseq and MAC
slots becomes excessive.

The fix thus changes timeuuid slot content in the following way:
- time component now contains the same microsecond time for all
  values of a statement or a batch. The time is unique and monotonic in
  case of LWT. Otherwise it's almost always monotonic, but may not be
  unique if two timestamps are created on different coordinators.
- clockseq component is used to store a sequence number which is
  unique and monotonic for all values within the statement/batch.
- to protect against time back-adjustments and duplicates
  if time is auto-generated, MAC component contains a random (spoof)
  MAC address, re-created on each restart. The address is different
  at each shard.

The change is made for all sources of time: user, generated, LWT.
Conditioning the list key generation algorithm on the source of
time would unnecessarily complicate the code while not increasing the
quality (uniqueness) of the created list keys.

Since 14 bits of clockseq provide us with only 16383 distinct slots
per statement or batch, 3 extra bits in the nanosecond part of the time
are used to extend the range to 131071 values per statement/batch.
If the range is exceeded, an exception is produced.

A twist in the use of clockseq to extend timeuuid uniqueness is
that Scylla, like Cassandra, uses an int8 compare for the lower
bits of a timeuuid when ordering. The patch takes this into account
and sign-complements the clockseq value to make it monotonic
according to the legacy compare function.

Fixes #7611

test: unit (dev)
2021-01-21 13:03:59 +03:00
Konstantin Osipov
6d1781be36 uuid: fill in UUID node identifier part of UUID
Before this patch, UUID generation code was not creating
sufficiently unique IDs: the 6-byte node identifier was mostly
empty, i.e. only containing the shard id. This could lead to
collisions between queries executed concurrently at different
coordinators and, since timeuuid is used as the key in list append
and prepend operations, to lost updates.

To generate a unique node id, the patch uses a combination of the
hardware MAC address (or a random number if no hardware address is
available) and the current shard id.

The shard id is mixed into the higher bits of the MAC to reduce the
chance of colliding with a real NIC on the same network.
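
Sketched below; the exact mixing is an assumption based on the "higher
bits" wording above:

    #include <array>
    #include <cstdint>

    // Fold the shard id into the MAC's high-order byte so distinct shards
    // produce distinct node ids while staying unlikely to match a real NIC.
    std::array<uint8_t, 6> make_node_id(std::array<uint8_t, 6> mac,
                                        unsigned shard) {
        mac[0] ^= static_cast<uint8_t>(shard);
        return mac;
    }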

With sufficiently unique timeuuids as list cell keys, such updates
are no longer lost, but a multi-value update can still be "merged"
with another multi-value update.

E.g. if node A executes SET l = l + [4, 5] and node B executes
SET l = l + [6, 7], the list value could be any of [4, 5, 6, 7],
[4, 6, 5, 7], [6, 4, 5, 7] and so on.

At least we are now less likely to lose any value.

Fixes #6208.

@todo: initialize UUID subsystem explicitly in main()
and switch to using seastar::engine().net().network_interfaces()

test: unit (dev)
2021-01-21 13:03:53 +03:00
Avi Kivity
4cfaab208e allocation_strategy: set preferred max contiguous allocation to 128k for standard allocations
Now that managed_bytes and its users do not assume that a managed_bytes
instance allocated using standard_allocation_strategy is non-fragmented,
we can set the preferred max contiguous allocation to 128k. This causes
managed_bytes to fragment instances that are larger than this size.

Note that managed_bytes is the only user.

Closes #7943
2021-01-21 11:15:13 +02:00