Commit Graph

219 Commits

Author SHA1 Message Date
Pavel Emelyanov
b3c89787be mutation_partition: Return immutable collection for range tombstones
Patch the .row_tombstones() to return the range_tombstone_list
wrapped into the immutable_collection<> so that callers are
guaranteed not to touch the collection itself, but still can
modify the tombstones.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-27 20:06:53 +03:00
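The idea behind `immutable_collection<>` in the commit above can be illustrated with a minimal sketch. This is not the actual Scylla class: the wrapper below, `tombstone_like` and `bump_all()` are hypothetical names, and a `std::list` stands in for `range_tombstone_list`. It only shows how a view can hand out mutable elements while hiding every operation that changes the collection itself.

```cpp
#include <cassert>
#include <list>

// Hypothetical wrapper: exposes iteration over a container, but none of the
// operations that would add, remove or reorder its elements.
template <typename Container>
class immutable_collection {
    Container* _col;
public:
    explicit immutable_collection(Container& c) noexcept : _col(&c) {}

    // Iterators are mutable, so elements can still be changed in place.
    auto begin() const noexcept { return _col->begin(); }
    auto end() const noexcept { return _col->end(); }
    auto size() const noexcept { return _col->size(); }
    bool empty() const noexcept { return _col->empty(); }
    // Deliberately no insert(), erase(), clear(), ...
};

// Stand-in for an element type such as a range tombstone (assumption).
struct tombstone_like {
    long timestamp;
};

// A caller that modifies elements but cannot touch the list structure.
// Returns the sum of the updated timestamps.
inline long bump_all(immutable_collection<std::list<tombstone_like>> c) {
    long sum = 0;
    for (auto& t : c) {
        t.timestamp += 1;
        sum += t.timestamp;
    }
    return sum;
}
```

A caller holding only the wrapper cannot compile code that inserts or erases, so the places that need structural changes have to ask for mutable access explicitly.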
Pavel Emelyanov
1bf643d4fd mutation_partition: Pin mutable access to range tombstones
Some callers of mutation_partition::row_tombstones() don't want to
(and shouldn't) modify the list itself, while they may want to
modify the tombstones. This patch explicitly locates those that
need to modify the collection, because the next patch will
return immutable collection for the others.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-27 20:06:53 +03:00
Pavel Emelyanov
05b8cdfd24 mutation_partition: Return immutable collection for rows
Patch the .clustered_rows() method to return the btree of rows
wrapped into the immutable_collection<> so that callers are
guaranteed not to touch the collection itself, but still can
modify the elements in it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-27 20:06:53 +03:00
Pavel Emelyanov
ad27bf40e6 mutation_partition: Pin mutable access to rows
Some callers of mutation_partition::clustered_rows() don't want to
(and shouldn't) modify the tree of rows, while they may want to
modify the rows themselves. This patch explicitly locates those
that need to modify the collection, because the next patch will
return immutable collection for the others.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-27 20:06:53 +03:00
Pavel Emelyanov
a9b4fa9db3 mutation_partition: Shuffle declarations
The methods that provide access to the enclosed collections of rows
and range tombstones are intermixed, so group them to make the next
patches smoother, and mark them noexcept while at it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-27 20:06:53 +03:00
Tomasz Grabiec
7fa4e10aa0 row_cache: Use generic LRU for eviction
In preparation for tracking different kinds of objects, not just
rows_entry, in the LRU, switch to the LRU implementation from
utils/lru.hh which can hold arbitrary element type.
2021-07-02 10:25:58 +02:00
Pavel Solodovnikov
76bea23174 treewide: reduce header interdependencies
Use forward declarations wherever possible.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>

Closes #8813
2021-06-07 15:58:35 +03:00
Avi Kivity
a55b434a2b treewide: extend copyright statements to present day 2021-06-06 19:18:49 +03:00
Pavel Solodovnikov
fff7ef1fc2 treewide: reduce boost headers usage in scylla header files
`dev-headers` target is also ensured to build successfully.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-05-20 01:33:18 +03:00
Pavel Emelyanov
64074f45ce code: Relax position_in_partition::tri_compare users
There are some pieces left doing res <=> 0 with the
res now being a strong_ordering itself. All these can
be just dropped.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-09 18:20:39 +03:00
Pavel Emelyanov
8bbe2eae5e btree: Convert comparator to <=>
It turned out that all the users of btree can already be converted
to use safer std::strong_ordering. The only meaningful change here
is the btree code itself -- no more ints there.

tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210330153648.27049-1-xemul@scylladb.com>
2021-04-01 12:56:08 +03:00
Pavel Emelyanov
9baf1226dc test/memory_footprint: Print radix tree node sizes
After switching cells storage onto compact radix tree it
becomes useful to know the tree nodes' sizes.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-15 20:41:09 +03:00
Pavel Emelyanov
1bdfa355ea row: Remove old storages
Now that the 3rd storage type (radix tree) is all in, the old
storage can be safely removed.  The result is:

1. memory footprint

sizeof(class row):  112 => 16 bytes
sizeof(rows_entry): 126 => 120 bytes

the "in cache" value depends on the number of cells:

num of cells     master       patch
         1       752         656
         2       808         712
         3       864         768
         4       920         824
         5       968         936
         6      1136         992
         ...
         16     1840        1672
         17     1904        1992  (+88)
         18     1976        2048  (+72)
         19     2048        2104  (+56)
         20     2120        2160  (+40)
         21     2184        2208  (+24)
         22     2256        2264  ( +8)
         23     2328        2320
         ...
         32     2960        2808

After 32 cells the storage switches to an rbtree with
24 bytes of per-cell overhead, and the radix tree improvement
takes off:

           64     7872        6056
           128   15040        9512
           256   29376       18568

2. perf_mutation test is enhanced by this series and the
   results differ depending on the number of columns used

                    tps value
--column-count    master   patch
          1       59.9k    57.6k  (-3.8%)
          2       59.9k    57.5k
          4       59.8k    57.6k
          8       57.6k    57.7k  <- eq
         16       56.3k    57.6k
         32       53.2k    57.4k  (+7.9%)

A note on this. Last time 1-column test was ~5% worse which
was explained by inline storage of 5 cells that's present on
current implementation and was absent in radix tree.

An attempt to make inline storage for small radix trees
resulted in complete loss of memory footprint gain, but gave
fraction of percent to perf_mutation performance. So this
version doesn't have inline nodes.

The 1.2% improvement from v2 surprisingly came from the
tree::clone_from() which in v2 was work-around-ed by slow
walk+emplace sequence while this version has the optimized
API call for cloning.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-15 20:35:06 +03:00
Pavel Emelyanov
f006acc853 row: Introduce radix tree storage type
Currently class row uses a union of a vector and a set to keep
the cells and switches between them. Add the 3rd type with the
radix tree, but never switch to it, just to show what the operations
would look like. Later on vector and set will be removed and the
whole row will be immediately switched to the radix tree storage.

NB: All the added places have indentation deliberately broken, so
that next patch will just remove the surrounding (old) code away
and (most of) the new one will happen in its place instantly.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-15 20:27:00 +03:00
Pavel Emelyanov
5c0f9a8180 mutation_partition: Switch cache of rows onto B-tree
The switch is pretty straightforward, and consists of

- change less-compare into tri-compare

- rename insert/insert_check into insert_before_hint

- use tree::key_grabber in mutation_partition::apply_monotonically to
  exception-safely transfer a row from one tree to another

- explicitly erase the row from tree in rows_entry::on_evicted, there's
  a O(1) tree::iterator method for this

- rewrite rows_entry -> cache_entry transformation in the on_evicted to
  fit the B-tree API

- include the B-tree's external memory usage into stats

That's it. The number of keys per node is set to 12 with linear search
and linear extension of 20 because

- experimenting with tree shows that numbers 8 through 10 keys with linear
  search show the best performance on stress tests for insert/find-s of
  keys that are memcmp-able arrays of bytes (which is an approximation of
  current clustering key compare). More keys work slower, but still better
  than any bigger value with any type of search up to 64 keys per node

- having 12 keys per node is the threshold at which the memory footprint
  for B-tree becomes smaller than for boost::intrusive::set for partitions
  with 32+ keys

- 20 keys for linear root eats the first-split peak and still performs
  well in linear search

As a result the footprint for B-tree is bigger than the one for BST only for
trees filled with 21...32 keys by 0.1...0.7 bytes per key.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:30:30 +03:00
Pavel Emelyanov
306c40939b rows_entry: Generalize compare
Turn the rows_entry less-comparator's calls into a template as
they are nothing but wrappers on top of rows_entry tri-comparator.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:30:30 +03:00
Botond Dénes
9c96d74b72 mutation: remove now unused query() and query_compacted() 2021-01-22 15:36:37 +02:00
Tomasz Grabiec
15b5b286d9 Merge "frozen_mutation: better diagnostics for out-of-order and duplicate rows" from Botond
Currently, frozen mutations that contain partitions with out-of-order
or duplicate rows will trigger (if anything at all) an assert in
`row::append_cell()`. However, this results in poor diagnostics (if
any), as the context doesn't contain enough information on what exactly
went wrong. This results in a cryptic error message and an investigation
that can only start after looking at a coredump.

This series remedies this problem by explicitly checking for
out-of-order and duplicate rows, as early as possible, when the
supposedly empty row is created. If the row already existed (is a
duplicate) or it is not the last row in the partition (out-of-order row)
an exception is thrown and the deserialization is aborted. To further
improve diagnostics, the partition context is also added to the
exception.

Tests: unit(release)

* botond/frozen-mutation-bad-row-diagnostics/v3:
  frozen_mutation: add partition context to errors coming from deserializing
  partition_builder: accept_row(): use append_clustering_row()
  mutation_partition: add append_clustered_row()
2021-01-10 19:30:12 +02:00
Pavel Emelyanov
72c2482f73 mutation-partition: Construct rows_entry directly from clustering_row
When a rows_entry is added to row_cache it's constructed from
clustering_row  by unpacking all its internals and putting
them into the rows_entry's deletable_row. There's a shorter
way -- the clustering_row already has the deletable_row onboard
from which rows_entry can copy-construct its own.

This keeps the rows_entry and deletable_row sets of
constructors a bit shorter.

tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20201224161112.20394-1-xemul@scylladb.com>
2020-12-24 18:13:44 +02:00
Botond Dénes
63ea36e277 mutation_partition: add append_clustered_row()
A variant of `clustered_row()` which throws if the row already exists, or
if any greater row already exists.
2020-12-02 15:08:32 +02:00
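The append-only check described above ("throws if the row already exists, or if any greater row already exists") can be sketched with a sorted map. This is an illustration only: `partition_sketch`, the `int` keys and the error text are assumptions and do not reflect the real `mutation_partition` types.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical partition with rows ordered by an integer clustering key.
struct partition_sketch {
    std::map<int, std::string> rows;

    // Creates a new row that must sort after every existing row.
    // Throws if the key is a duplicate or is out of order.
    std::string& append_clustered_row(int key) {
        if (!rows.empty() && rows.rbegin()->first >= key) {
            throw std::runtime_error(
                "row " + std::to_string(key) + " is a duplicate or out of order");
        }
        // The new row is known to be the last one, so hint at end().
        return rows.emplace_hint(rows.end(), key, std::string())->second;
    }
};
```

Comparing only against the current last row is enough, because any existing row that is equal to or greater than the new key makes the last row equal or greater as well.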
Tomasz Grabiec
a22645b7dd Merge "Unfriend rows_entry, cache_tracker and mutation_partition" from Pavel Emelyanov
The classes touch private data of each other for no real
reason. Putting the interaction behind API makes it easier
to track the usage.

* xemul/br-unfriends-in-row-cache-2:
  row cache: Unfriend classes from each other
  rows_entry: Move container/hooks types declarations
  rows_entry: Simplify LRU unlink
  mutation_partition: Define .replace_with method for rows_entry
  mutation_partition: Use rows_entry::apply_monotonically
2020-09-22 21:18:14 +02:00
Tomasz Grabiec
1f6c4f945e mutation_partition: Fix typo
drien -> driven

Message-Id: <1600103287-4948-1-git-send-email-tgrabiec@scylladb.com>
2020-09-15 10:09:15 +02:00
Pavel Emelyanov
bf4063d78e row cache: Unfriend classes from each other
Now cache_tracker, mutation_partition and rows_entry do not
need to be friends.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-09-11 16:35:51 +03:00
Pavel Emelyanov
7a1265a338 rows_entry: Move container/hooks types declarations
Define container types near the containing elements' hook
members, so that they could be private without the need
to friend classes with each other.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-09-11 16:35:51 +03:00
Pavel Emelyanov
7ed1e18a13 rows_entry: Simplify LRU unlink
The cache_tracker tries to access private member of the
rows_entry to unlink it, but the lru_type is auto_unlink
and can unlink itself.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-09-11 16:35:51 +03:00
Pavel Emelyanov
7f2c6aed50 mutation_partition: Define .replace_with method for rows_entry
This method is needed to hide the guts of rows_entry from mutation_partition.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-09-11 16:35:51 +03:00
Piotr Sarna
7b329f7102 digest: add null values to row digest
With the new hashing routine, null values are taken into account
when computing row digest. Previous behavior had a regression
which stopped computing the hash after the first null value
is encountered, but the original behavior was also prone
to errors - e.g. row [1, NULL, 2] was not distinguishable
from [1, 2, NULL], because their hashes were identical.
This hashing is not yet active - it will only be used after
the next commit introduces a proper cluster feature for it.
2020-09-10 13:16:44 +02:00
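The `[1, NULL, 2]` versus `[1, 2, NULL]` problem above can be reproduced with a toy digest. The hash below is a deliberately simple multiplicative hash chosen for illustration. It is not the hashing routine used by Scylla, and all names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// A row as a list of optional cell values; nullopt models a NULL cell.
using row_sketch = std::vector<std::optional<std::int64_t>>;

// Toy hash step (assumption, for illustration only).
inline void feed(std::uint64_t& h, std::uint64_t v) {
    h = h * 31 + v;
}

// Nulls contribute nothing, so their position is lost.
inline std::uint64_t digest_skipping_nulls(const row_sketch& r) {
    std::uint64_t h = 7;
    for (const auto& cell : r) {
        if (cell) {
            feed(h, static_cast<std::uint64_t>(*cell));
        }
    }
    return h;
}

// Every cell feeds a presence marker first, so nulls are part of the digest.
inline std::uint64_t digest_with_nulls(const row_sketch& r) {
    std::uint64_t h = 7;
    for (const auto& cell : r) {
        feed(h, cell ? 1 : 0);
        if (cell) {
            feed(h, static_cast<std::uint64_t>(*cell));
        }
    }
    return h;
}
```

With the two rows from the commit message, the first function returns the same value for both, while the second one tells them apart.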
Paweł Dziepak
6f46010235 appending_hash<row>: make publicly visible
appending_hash<row> specialisation is declared and defined in a *.cc file
which means it cannot have a dedicated unit test. This patch moves the
declaration to the corresponding *.hh file.
2020-09-10 12:20:32 +02:00
Pavel Emelyanov
4e264b9e4f clustering_row: Do not re-implement deletable_row
The clustering_row is deletable_row + clustering_key, all
its internals work exactly as the relevant deletable_row's
ones.

The similar relation is between static_row and row, and
the former wraps the latter, so here's the same trick
for the non-static row classes.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-09-08 22:21:15 +03:00
Pavel Emelyanov
ca148acbf9 deletable_row: Do not mess with clustering_row
The deletable_row accepts clustering_row in constructor and
.apply() method. The next patch will make clustering_row
embed the deletable_row inside, so those two methods will
violate layering and should be fixed in advance.

The fix is in providing a clustering_row method to convert
itself into a deletable_row. There are two places that need
this: mutation_fragment_applier and partition_snapshot_row_cursor.
Both methods pass temporary clustering_row value, so the
method in question is also move-converter.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-09-08 22:18:15 +03:00
Wojciech Mitros
45215746fe increase the maximum size of query results to 2^64
Currently, we cannot select more than 2^32 rows from a table because we are limited by types of
variables containing the numbers of rows. This patch changes these types and sets new limits.

The new limits take effect while selecting all rows from a table - custom limits of rows in a result
stay the same (2^32-1).

In classes which are being serialized and used in messaging, in order to be able to process queries
originating from older nodes, the top 32 bits of new integers are optional and stay at the end
of the class - if they're absent we assume they equal 0.

The backward compatibility was tested by querying an older node for a paged selection, using the
received paging_state with the same select statement on an upgraded node, and comparing the returned
rows with the result generated for the same query by the older node, additionally checking if the
paging_state returned by the upgraded node contained new fields with correct values. Also verified
if the older node simply ignores the top 32 bits of the remaining rows number when handling a query
with a paging_state originating from an upgraded node by generating and sending such a query to
an older node and checking the paging_state in the reply (using the Python driver).

Fixes #5101.
2020-08-03 17:32:49 +02:00
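The compatibility scheme above (the top 32 bits are optional, sit at the end of the class, and are assumed to be 0 when absent) can be sketched like this. The struct and function names are hypothetical. The real serialized classes and their field names are not shown in the commit message.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical wire representation of a 64-bit row count.
struct wire_row_count {
    std::uint32_t low_bits;                  // field that older nodes already know
    std::optional<std::uint32_t> high_bits;  // trailing field, absent from older nodes
};

// An upgraded node always sends both halves.
inline wire_row_count encode(std::uint64_t n) {
    return wire_row_count{
        static_cast<std::uint32_t>(n & 0xffffffffu),
        static_cast<std::uint32_t>(n >> 32)};
}

// An upgraded node treats a missing upper half as zero.
inline std::uint64_t decode(const wire_row_count& w) {
    return (static_cast<std::uint64_t>(w.high_bits.value_or(0)) << 32) | w.low_bits;
}

// An older node only ever looks at the lower half.
inline std::uint32_t decode_on_older_node(const wire_row_count& w) {
    return w.low_bits;
}
```

Values below 2^32 round-trip identically on both kinds of nodes, which is what keeps mixed clusters working for result sizes that fit the old limit.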
Juliusz Stasiewicz
9e4247090f cdc: Implementations of delta_mode::off/keys
At the stage of `finish`ing CDC mutation, deltas are removed (mode
`off`) or edited to keep only PK+CK of the base table (mode `keys`).

Fixes #6838
2020-07-27 19:05:47 +02:00
Avi Kivity
6f394e8e90 tombstone: use comparison operator instead of ad-hoc compare() function and with_relational_operators
The comparison operator (<=>) default implementation happens to exactly
match tombstone::compare(), so use the compiler-generated defaults. Also
default operator== and operator!= (these are not brought in by operator<=>).
These become slightly faster as they perform just an equality comparison,
not three-way compare.

shadowable_tombstone and row_tombstone depend on tombstone::compare(),
so convert them too in a similar way.

with_relational_operations.hh becomes unused, so delete it.

Tests: unit (dev)
Message-Id: <20200602055626.2874801-1-avi@scylladb.com>
2020-06-02 09:28:52 +03:00
Pavel Emelyanov
4fa12f2fb8 header: De-bloat schema.hh
The header sits in many other headers, but there's a handy
schema_fwd.hh that's tiny and contains needed declarations
for other headers. So replace schema.hh with schema_fwd.hh
in most of the headers (and remove completely from some).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200303102050.18462-1-xemul@scylladb.com>
2020-03-03 11:34:00 +01:00
Piotr Dulikowski
589313a110 row_marker: correct expiration condition
This change corrects the condition under which a row was considered
expired by its TTL.

The logic that decides when a row becomes expired was inconsistent with
the logic that decides if a single cell is expired. A single cell
becomes expired when `expiry_timestamp <= now`, while a row became
expired when `expiry_timestamp < now` (notice the strict inequality).
For rows inserted with TTL, this caused non-key cells to expire (change
their values to null) one second before the row disappeared. Now, row
expiry logic uses non-strict inequality.

Fixes: #4263, #5290.

Tests:
- unit(dev)
- python test described in issue #5290
2019-11-19 11:46:59 +01:00
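The off-by-one described above comes down to a strict versus a non-strict inequality. The sketch below uses hypothetical free functions over integer seconds. The real code works on `row_marker` and cell objects with clock types.

```cpp
#include <cassert>
#include <cstdint>

// Time in whole seconds (assumption for the sketch).
using seconds_since_epoch = std::int64_t;

// A cell is expired as soon as its expiry time is reached.
inline bool cell_is_expired(seconds_since_epoch expiry, seconds_since_epoch now) {
    return expiry <= now;
}

// Before the fix: the row stays alive for the exact expiry second.
inline bool row_is_expired_before_fix(seconds_since_epoch expiry, seconds_since_epoch now) {
    return expiry < now;
}

// After the fix: same non-strict condition as for cells.
inline bool row_is_expired_after_fix(seconds_since_epoch expiry, seconds_since_epoch now) {
    return expiry <= now;
}
```

At `now == expiry` the old row check disagrees with the cell check, which is the one-second window in which the cells were already null but the row was still visible.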
Piotr Dulikowski
59fbbb993f memtables: add partition/row hit/miss counters
Adds per-table metrics for counting partition and row reuse
in memtables. New metrics are as follows:
    - memtable_partition_writes - number of write operations performed
          on partitions in memtables,
    - memtable_partition_hits - number of write operations performed
          on partitions that previously existed in a memtable,
    - memtable_row_writes - number of row write operations performed
          in memtables,
    - memtable_row_hits - number of row write operations that overwrote
          rows previously present in a memtable.

Tests: unit(release)
2019-11-12 13:35:41 +01:00
Vladimir Davydov
e0b31dd273 query: add flag to return static row on partition with no rows
A SELECT statement that has clustering key restrictions isn't supposed
to return static content if no regular row matches the restrictions,
see #589. However, for the CAS statement we do need to return static
content on failure so this patch adds a flag that allows the caller to
override this behavior.
2019-10-28 21:50:44 +03:00
Kamil Braun
c90ea1056b Remove mutation_partition_applier.
It had been replaced by partition_builder
in commit dc290f0af7.
2019-10-25 10:19:45 +02:00
Avi Kivity
acc433b286 mutation_partition: make static_row optional to reduce memory footprint
The static row can be rare: many tables don't have them, and tables
that do will often have mutations without them (if the static row
is rarely updated, it may be present in the cache and in readers,
but absent in memtable mutations). However, it always consumes ~100
bytes of memory, even if it is not present, due to row's overhead.

Change it to be optional by using lazy_row instead of row. Some call
sites treewide were adjusted to deal with the extra indirection.

perf_simple_query appears to improve by 2%, from 163krps to 165 krps,
though it's hard to be sure due to noisy measurements.

memory_footprint comparisons (before/after):

mutation footprint:		       mutation footprint:
 - in cache:	 1096		        - in cache:	992
 - in memtable:	 854		        - in memtable:	750
 - in sstable:	 351		        - in sstable:	351
 - frozen:	 540		        - frozen:	540
 - canonical:	 827		        - canonical:	827
 - query result: 342		        - query result: 342

 sizeof(cache_entry) = 112	        sizeof(cache_entry) = 112
 -- sizeof(decorated_key) = 36	        -- sizeof(decorated_key) = 36
 -- sizeof(cache_link_type) = 32        -- sizeof(cache_link_type) = 32
 -- sizeof(mutation_partition) = 200    -- sizeof(mutation_partition) = 96
 -- -- sizeof(_static_row) = 112        -- -- sizeof(_static_row) = 8
 -- -- sizeof(_rows) = 24	        -- -- sizeof(_rows) = 24
 -- -- sizeof(_row_tombstones) = 40     -- -- sizeof(_row_tombstones) = 40

 sizeof(rows_entry) = 232	        sizeof(rows_entry) = 232
 sizeof(lru_link_type) = 16	        sizeof(lru_link_type) = 16
 sizeof(deletable_row) = 168	        sizeof(deletable_row) = 168
 sizeof(row) = 112		        sizeof(row) = 112
 sizeof(atomic_cell_or_collection) = 8  sizeof(atomic_cell_or_collection) = 8

Tests: unit (dev)
2019-10-15 15:42:05 +03:00
Avi Kivity
88613e6882 mutation_partition: introduce lazy_row
lazy_row adds indirection to the row class, in order to reduce storage requirements
when the row is not present. The intent is to use it for the static row, which is
not present in many schemas, and is often not present in writes even in schemas that
have a static row.

Indirection is done using managed_ref, which is lsa-compatible.

lazy_row implements most of row's methods, and a few more:
 - get(), get_existing(), and maybe_create(): bypass the abstraction and the
   underlying row
 - some methods that accept a row parameter also have an overload with a lazy_row
   parameter
2019-10-15 15:42:05 +03:00
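The indirection that lazy_row adds can be sketched as follows. This is a simplified illustration, not the real class: `std::unique_ptr` stands in for `managed_ref`, `row_sketch` stands in for `row`, and the behaviour given to `maybe_create()` is an assumption based on its name.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>

// Stand-in for the full row class, which is comparatively large (assumption).
struct row_sketch {
    std::map<int, std::string> cells;
};

// Holds only a pointer; the row itself is allocated on first use.
class lazy_row_sketch {
    std::unique_ptr<row_sketch> _row;   // stand-in for managed_ref (assumption)
public:
    // Row-like methods treat a missing row as an empty one.
    bool empty() const noexcept { return !_row || _row->cells.empty(); }
    std::size_t size() const noexcept { return _row ? _row->cells.size() : 0; }

    // Bypasses the abstraction: returns the underlying row, creating it if needed.
    row_sketch& maybe_create() {
        if (!_row) {
            _row = std::make_unique<row_sketch>();
        }
        return *_row;
    }
};
```

An object that never gets a static row pays only for the pointer, which is the kind of saving the before/after numbers in acc433b286 show for `_static_row`.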
Tomasz Grabiec
bce0dac751 mutation_partition: Track and validate schema version in debug builds
This patch makes mutation_partition validate the invariant that it's
supposed to be accessed only with the schema version which it conforms
to.

Refs #5095
2019-09-25 10:27:06 +02:00
Botond Dénes
4c2781edaa row_marker: add garbage_collector
The new collector parameter is a pointer to a
`compaction_garbage_collector` implementation. This collector is passed
the row_marker when it has expired and would be discarded.
The collector param is optional and defaults to nullptr.
2019-07-15 17:38:00 +03:00
Botond Dénes
7db2006162 row_marker: de-inline compact_and_expire() 2019-07-15 17:38:00 +03:00
Botond Dénes
4c7a7ffe8f row: add garbage_collector
The new collector parameter is a pointer to a
`compaction_garbage_collector` implementation. This collector is passed
all atoms that are expired and would be discarded. The body of
`compact_and_expire()` was changed so that it checks cells' tombstone
coverage before it checks their expiry, so that cells that are both
covered by a tombstone and also expired are not passed to the collector.
The collector is forwarded to
`collection_type_impl::mutation::compact_and_expire()` as well.
The collector param is optional and defaults to nullptr
2019-07-15 17:38:00 +03:00
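The ordering of checks described above (tombstone coverage first, expiry second, with an optional collector) can be sketched like this. Everything here is a hypothetical simplification: the real `compact_and_expire()` and `compaction_garbage_collector` have different signatures and operate on real cells.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified cell: expiry == 0 means "no TTL" (assumption).
struct cell_sketch {
    std::int64_t timestamp;
    std::int64_t expiry;
};

// Stand-in for a garbage collector that records what it is given.
struct garbage_collector_sketch {
    std::vector<cell_sketch> collected;
    void collect(const cell_sketch& c) { collected.push_back(c); }
};

// Returns the cells that survive compaction.
inline std::vector<cell_sketch> compact_and_expire(
        const std::vector<cell_sketch>& cells,
        std::int64_t tombstone_timestamp,
        std::int64_t now,
        garbage_collector_sketch* collector = nullptr) {
    std::vector<cell_sketch> kept;
    for (const auto& c : cells) {
        // 1. Tombstone coverage is checked first: covered cells are dropped
        //    and never reach the collector, even if they are also expired.
        if (c.timestamp <= tombstone_timestamp) {
            continue;
        }
        // 2. Expired but not covered: hand over to the collector, if any.
        if (c.expiry != 0 && c.expiry <= now) {
            if (collector) {
                collector->collect(c);
            }
            continue;
        }
        kept.push_back(c);
    }
    return kept;
}
```

Because the collector defaults to `nullptr`, existing call sites keep their behaviour and only callers that care about discarded data need to pass one.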
Benny Halevy
16dda033a5 sstables: row_marker: initialize _expiry
compare_row_marker_for_merge compares deletion_time also for row markers
that have missing timestamps.  This happened to succeed due to implicit
initialization to 0. However, we prefer the initialization to be explicit
and allow calling row_marker::deletion_time() in all states.

Fixes #4068

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190110102949.17896-1-bhalevy@scylladb.com>
2019-01-10 12:45:07 +01:00
Asias He
4e55d22a8f position_in_partition: Switch _bound_weight to use enum
The _bound_weight in position_in_partition will be sent on wire in rpc.
Make it enum instead of int.
2018-12-12 16:49:01 +08:00
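A value that goes over the wire benefits from a closed set of named states. The sketch below shows the general shape of such an enum. The type name, the enumerator names and values, and the `to_wire()`/`from_wire()` helpers are all assumptions made for illustration and are not taken from the commit.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical weight of a bound relative to the keys sharing its prefix.
enum class bound_weight_sketch : std::int8_t {
    before_all_prefixed = -1,
    equal = 0,
    after_all_prefixed = 1,
};

// Serialize as the underlying integer.
inline std::int8_t to_wire(bound_weight_sketch w) {
    return static_cast<std::int8_t>(w);
}

// Deserialize, rejecting values that do not name an enumerator.
inline bound_weight_sketch from_wire(std::int8_t v) {
    switch (v) {
    case -1: return bound_weight_sketch::before_all_prefixed;
    case 0:  return bound_weight_sketch::equal;
    case 1:  return bound_weight_sketch::after_all_prefixed;
    default: throw std::invalid_argument("unknown bound weight");
    }
}
```

Compared with a bare `int`, the enum makes invalid values detectable at the deserialization boundary instead of deep inside comparison code.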
Paweł Dziepak
637b9a7b3b atomic_cell_or_collection: make operator<< show cell content
After the new in-memory representation of cells was introduced there was
a regression in atomic_cell_or_collection::operator<< which stopped
printing the content of the cell. This makes debugging more inconvenient
and time-consuming. This patch fixes the problem. Schema is propagated
to the atomic_cell_or_collection printer and the full content of the
cell is printed.

Fixes #3571.

Message-Id: <20181024095413.10736-1-pdziepak@scylladb.com>
2018-10-24 13:29:51 +03:00
Tomasz Grabiec
024b3c9fd9 mutation_partition: Fix exception safety of row::apply_monotonically()
When emplace_back() fails, value is already moved-from into a
temporary, which breaks monotonicity expected from
apply_monotonically(). As a result, writes to that cell will be lost.

The fix is to avoid the temporary by in-place construction of
cell_and_hash. To do that, appropriate cell_and_hash constructor was
added.

Found by mutation_test.cc::test_apply_monotonically_is_monotonic with
some modifications to the random mutation generator.

Introduced in 99a3e3a.

Fixes #3678.

Message-Id: <1533816965-27328-1-git-send-email-tgrabiec@scylladb.com>
2018-08-09 15:29:10 +03:00
Tomasz Grabiec
6b1fe6cbe5 mutation_partition: Introduce set_continuity() 2018-07-17 16:30:01 +02:00
Tomasz Grabiec
4d3cc2867a mutation_partition: Make merging preemptable 2018-06-27 12:48:30 +02:00