Commit Graph

293 Commits

Author SHA1 Message Date
Botond Dénes
3fab83b3a1 flat_mutation_reader: impl: add reader_permit parameter
Not used yet, this patch does all the churn of propagating a permit
to each impl.

In the next patch we will use it to track to track the memory
consumption of `_buffer`.
2020-09-28 10:53:48 +03:00
Tomasz Grabiec
a22645b7dd Merge "Unfriend rows_entry, cache_tracker and mutation_partition" from Pavel Emelyanov
The classes touche private data of each other for no real
reason. Putting the interaction behind API makes it easier
to track the usage.

* xemul/br-unfriends-in-row-cache-2:
  row cache: Unfriend classes from each other
  rows_entry: Move container/hooks types declarations
  rows_entry: Simplify LRU unlink
  mutation_partition: Define .replace_with method for rows_entry
  mutation_partition: Use rows_entry::apply_monotonically
2020-09-22 21:18:14 +02:00
Pavel Emelyanov
7ed1e18a13 rows_entry: Simplify LRU unlink
The cache_tracker tries to access private member of the
rows_entry to unlink it, but the lru_type is auto_unlink
and can unlink itself.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-09-11 16:35:51 +03:00
Pavel Emelyanov
fabf849fcb row_cache: Save one key compare on direct hit
The partitions_type::lower_bound() method can return a hint that saves
info about the "lower-ness of the bound", in particular when the search
key is found, this can be guessed from the hint without comparison.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-09-03 21:13:21 +03:00
Pavel Emelyanov
ada174c932 row_cache: Kill incomplete_tag
The incomplete entry is created in one place.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-09-03 21:13:21 +03:00
Pavel Emelyanov
240b966695 row_cache: Do not copy partition tombstone when creating cache entry
The row_cache::find_or_create is only used to put (or touch) an entry in cache
having the partition_start mutation at hands. Thus, theres no point in carrying
key reference and tombstone value through the calls, just the partition_start
reference is enough.

Since the new cache entry is created incomplete, rename the creation method
to reflect this.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-09-03 21:13:21 +03:00
Pavel Emelyanov
84a6d439ad test: Lookup an existing entry with its own helper
The only caller of find_or_create() in tests works on already existing (.populate()-d) entry,
so patch this place for explicity and for the sake of next patching.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-09-03 21:13:21 +03:00
Pavel Emelyanov
3f33a71c0c row_cache: Move missing entry creation into helper
No functional changes, just move the code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-09-03 21:13:21 +03:00
Pavel Emelyanov
4662082748 populating reader: Fix indentation after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-09-03 21:13:21 +03:00
Pavel Emelyanov
e680bdc59c populating reader: Less allocator switching on population
Now when the key for new partition is copied inside do_find_or_create_entry we may call
this function without allocator set, as it sets the allocator inside.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-09-03 21:13:21 +03:00
Pavel Emelyanov
449f9e1218 populating reader: Do not copy decorated key too early
When the missing partition is created in cache the decorated key is copied from
the ring position view too early -- to do the lookup. However, the read context
had been already entered the partition and already has the decorated key on board,
so for lookup we can use the reference.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-09-03 21:13:21 +03:00
Pavel Emelyanov
5a29e17a5f row_cache: Revive do_find_or_create_entry concepts
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-09-03 21:13:21 +03:00
Botond Dénes
5e9a7d2608 row_cache: remove unnecessary includes of partition_snapshot_reader.hh
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200820124447.2561477-1-bdenes@scylladb.com>
2020-08-20 15:19:42 +02:00
Piotr Sarna
29e2dc242a row_cache: add tracing
In order to improve tracing for the read path, cache is now
also actively adding basic trace information.
Example:
select * from t where token(p) >= 42 and token(p) < 112;
 activity                                                                                | timestamp                  | source    | source_elapsed | client
-----------------------------------------------------------------------------------------+----------------------------+-----------+----------------+-----------
                                                                      Execute CQL3 query | 2020-08-07 13:10:34.694000 | 127.0.0.1 |              0 | 127.0.0.1
                                                           Parsing a statement [shard 0] | 2020-08-07 13:10:34.694307 | 127.0.0.1 |             -- | 127.0.0.1
                                                        Processing a statement [shard 0] | 2020-08-07 13:10:34.694377 | 127.0.0.1 |             70 | 127.0.0.1
                                                   read_data: querying locally [shard 0] | 2020-08-07 13:10:34.694425 | 127.0.0.1 |            118 | 127.0.0.1
                        Start querying token range [{42, start}, {112, start}] [shard 0] | 2020-08-07 13:10:34.694432 | 127.0.0.1 |            125 | 127.0.0.1
                                             Creating shard reader on shard: 0 [shard 0] | 2020-08-07 13:10:34.694446 | 127.0.0.1 |            139 | 127.0.0.1
 Scanning cache for range [{42, start}, {112, start}] and slice {(-inf, +inf)} [shard 0] | 2020-08-07 13:10:34.694454 | 127.0.0.1 |            147 | 127.0.0.1
                                                              Querying is done [shard 0] | 2020-08-07 13:10:34.694494 | 127.0.0.1 |            187 | 127.0.0.1
                                          Done processing - preparing a result [shard 0] | 2020-08-07 13:10:34.694520 | 127.0.0.1 |            213 | 127.0.0.1
                                                                        Request complete | 2020-08-07 13:10:34.694221 | 127.0.0.1 |            221 | 127.0.0.1

Example with cache miss:
select * from t where p = 7;
 activity                                                                                                                                                                          | timestamp                  | source    | source_elapsed | client
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------+-----------+----------------+-----------
                                                                                                                                                                Execute CQL3 query | 2020-08-07 13:25:04.363000 | 127.0.0.1 |              0 | 127.0.0.1
                                                                                                                                                     Parsing a statement [shard 0] | 2020-08-07 13:25:04.363310 | 127.0.0.1 |             -- | 127.0.0.1
                                                                                                                                                  Processing a statement [shard 0] | 2020-08-07 13:25:04.363384 | 127.0.0.1 |             74 | 127.0.0.1
                                                   Creating read executor for token 1634052884888577606 with all: {127.0.0.1} targets: {127.0.0.1} repair decision: NONE [shard 0] | 2020-08-07 13:25:04.363450 | 127.0.0.1 |            139 | 127.0.0.1
                                                                                                                                             read_data: querying locally [shard 0] | 2020-08-07 13:25:04.363455 | 127.0.0.1 |            145 | 127.0.0.1
                                                                                                 Start querying singular range {{1634052884888577606, pk{000400000007}}} [shard 0] | 2020-08-07 13:25:04.363461 | 127.0.0.1 |            151 | 127.0.0.1
                                                                             Querying cache for range {{1634052884888577606, pk{000400000007}}} and slice {(-inf, +inf)} [shard 0] | 2020-08-07 13:25:04.363490 | 127.0.0.1 |            180 | 127.0.0.1
                                                                                                      Range {{1634052884888577606, pk{000400000007}}} not found in cache [shard 0] | 2020-08-07 13:25:04.363494 | 127.0.0.1 |            183 | 127.0.0.1
          Reading key {{1634052884888577606, pk{000400000007}}} from sstable /home/sarna/.ccm/scylla-1/node1/data/ks/t-f7b7a9b0d89f11eab650000000000000/mc-1-big-Data.db [shard 0] | 2020-08-07 13:25:04.363522 | 127.0.0.1 |            211 | 127.0.0.1
                           /home/sarna/.ccm/scylla-1/node1/data/ks/t-f7b7a9b0d89f11eab650000000000000/mc-1-big-Index.db: scheduling bulk DMA read of size 16 at offset 0 [shard 0] | 2020-08-07 13:25:04.363546 | 127.0.0.1 |            235 | 127.0.0.1
 /home/sarna/.ccm/scylla-1/node1/data/ks/t-f7b7a9b0d89f11eab650000000000000/mc-1-big-Index.db: finished bulk DMA read of size 16 at offset 0, successfully read 16 bytes [shard 0] | 2020-08-07 13:25:04.364406 | 127.0.0.1 |           1095 | 127.0.0.1
                            /home/sarna/.ccm/scylla-1/node1/data/ks/t-f7b7a9b0d89f11eab650000000000000/mc-1-big-Data.db: scheduling bulk DMA read of size 56 at offset 0 [shard 0] | 2020-08-07 13:25:04.364445 | 127.0.0.1 |           1134 | 127.0.0.1
  /home/sarna/.ccm/scylla-1/node1/data/ks/t-f7b7a9b0d89f11eab650000000000000/mc-1-big-Data.db: finished bulk DMA read of size 56 at offset 0, successfully read 56 bytes [shard 0] | 2020-08-07 13:25:04.364599 | 127.0.0.1 |           1288 | 127.0.0.1
                                                                                                                                                        Querying is done [shard 0] | 2020-08-07 13:25:04.364685 | 127.0.0.1 |           1375 | 127.0.0.1
                                                                                                                                    Done processing - preparing a result [shard 0] | 2020-08-07 13:25:04.364719 | 127.0.0.1 |           1408 | 127.0.0.1
                                                                                                                                                                  Request complete | 2020-08-07 13:25:04.364421 | 127.0.0.1 |           1421 | 127.0.0.1
Example without cache for verification:
select * from t where token(p) >= 42 and token(p) < 112 bypass cache;
 activity                                                         | timestamp                  | source    | source_elapsed | client
------------------------------------------------------------------+----------------------------+-----------+----------------+-----------
                                               Execute CQL3 query | 2020-08-07 13:11:16.122000 | 127.0.0.1 |              0 | 127.0.0.1
                                    Parsing a statement [shard 0] | 2020-08-07 13:11:16.122657 | 127.0.0.1 |             -- | 127.0.0.1
                                 Processing a statement [shard 0] | 2020-08-07 13:11:16.122742 | 127.0.0.1 |             85 | 127.0.0.1
                            read_data: querying locally [shard 0] | 2020-08-07 13:11:16.122806 | 127.0.0.1 |            149 | 127.0.0.1
 Start querying token range [{42, start}, {112, start}] [shard 0] | 2020-08-07 13:11:16.122814 | 127.0.0.1 |            158 | 127.0.0.1
                      Creating shard reader on shard: 0 [shard 0] | 2020-08-07 13:11:16.122829 | 127.0.0.1 |            172 | 127.0.0.1
                                       Querying is done [shard 0] | 2020-08-07 13:11:16.122895 | 127.0.0.1 |            239 | 127.0.0.1
                   Done processing - preparing a result [shard 0] | 2020-08-07 13:11:16.122928 | 127.0.0.1 |            271 | 127.0.0.1
                                                 Request complete | 2020-08-07 13:11:16.122280 | 127.0.0.1 |            280 | 127.0.0.1
Message-Id: <3b31584c13f23f84af35660d0aa73ba56c30cf13.1596799589.git.sarna@scylladb.com>
2020-08-09 12:53:04 +03:00
Pavel Emelyanov
4d2f5f93a4 memtable: Switch onto B+ rails
The change is the same as with row-cache -- use B+ with int64_t token
as key and array of memtable_entry-s inside it.

The changes are:

Similar to those for row_cache:

- compare() goes away, new collection uses ring_position_comparator

- insertion and removal happens with the help of double_decker, most
  of the places are about slightly changed semantics of it

- flags are added to memtable_entry, this makes its size larger than
  it could be, but still smaller than it was before

Memtable-specific:

- when the new entry is inserted into tree iterators _might_ get
  invalidated by double-decker inner array. This is easy to check
  when it happens, so the invalidation is avoided when possible

- the size_in_allocator_without_rows() is now not very precise. This
  is because after the patch memtable_entries are not allocated
  individually as they used to. They can be squashed together with
  those having token conflict and asking allocator for the occupied
  memory slot is not possible. As the closest (lower) estimate the
  size of enclosing B+ data node is used

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-07-14 16:30:02 +03:00
Pavel Emelyanov
174b101a49 row_cache: Switch partition tree onto B+ rails
The row_cache::partitions_type is replaced from boost::intrusive::set
to bplus::tree<Key = int64_t, T = array_trusted_bounds<cache_entry>>

Where token is used to quickly locate the partition by its token and
the internal array -- to resolve hashing conflicts.

Summary of changes in cache_entry:

- compare's goes away as the new collection needs tri-compare one which
  is provided by ring_position_comparator

- when initialized the dummy entry is added with "after_all_keys" kind,
  not "before_all_keys" as it was by default. This is to make tree
  entries sorted by token

- insertion and removing of cache_entries happens inside double_decker,
  most of the changes in row_cache.cc are about passing constructor args
  from current_allocator.construct into double_decker.empace_before()

- the _flags is extended to keep array head/tail bits. There's a room
  for it, sizeof(cache_entry) remains unchanged

The rest fits smothly into the double_decker API.

Also, as was told in the previous patch, insertion and removal _may_
invalidate iterators, but may leave them intact. However, currently
this doesn't seem to be a problem as the cache_tracker ::insert() and
::on_partition_erase do invalidate iterators unconditionally.

Later this can be otimized, as iterators are invalidated by double-decker
only in case of hash conflict, otherwise it doesn't change arrays and
B+ tree doesn't invalidate its.

tests: unit(dev), perf(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-07-14 16:30:02 +03:00
Pavel Emelyanov
dff5eb6f25 memtable: Count partitions separately
The B+ will not have constant-time .size() call, so do it by hands

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-07-14 16:30:02 +03:00
Pavel Emelyanov
7b2754cf5f row-cache: Use ring_position_comparator in some places
The row cache (and memtable) code uses own comparators built on top
of the ring_position_comparator for collections of partitions. These
collections will be switched from the key less-compare to the pair
of token less-compare + key tri-compare.

Prepare for the switch by generalizing the ring_partition_comparator
and by patching all the non-collections usage of less-compare to use
one.

The memtable code doesn't use it outside of collections, but patch it
anyway as a part of preparations.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-07-14 16:30:02 +03:00
Pavel Emelyanov
bb32cff23d row_cache: Mark invalidation lambda as noexcept
It calls noexcept functions inside and handles the exception from throwing one itself

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-07-09 14:46:38 +03:00
Pavel Emelyanov
1346289151 cache_tracker: Mark methods noexcept
All but few are trivially such.

The clear_continuity() calls cache_entry::set_continuous() that had become noexcept
a patch ago.

The allocator() calls region.allocator() which had been marked noexcept few patches
back.

The on_partition_erase() calls allocator().invalidate_references(), both had
been marked noexcept few patches back.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-07-09 14:44:17 +03:00
Tomasz Grabiec
e81fc1f095 row_cache: Fix undefined behavior on key linearization
This is relevant only when using partition or clustering keys which
have a representation in memory which is larger than 12.8 KB (10% of
LSA segment size).

There are several places in code (cache, background garbage
collection) which may need to linearize keys because of performing key
comparison, but it's not done safely:

 1) the code does not run with the LSA region locked, so pointers may
get invalidated on linearization if it needs to reclaim memory. This
is fixed by running the code inside an allocating section.

 2) LSA region is locked, but the scope of
with_linearized_managed_bytes() encloses the allocating section. If
allocating section needs to reclaim, linearization context will
contain invalidated pointers. The fix is to reorder the scopes so
that linearization context lives within an allocating section.

Example of 1 can be found in
range_populating_reader::handle_end_of_stream() where it performs a
lookup:

  auto prev = std::prev(it);
  if (prev->key().equal(*_cache._schema, *_last_key->_key)) {
     it->set_continuous(true);

but handle_end_of_stream() is not invoked under allocating section.

Example of 2 can be found in mutation_cleaner_impl::merge_some() where
it does:

  return with_linearized_managed_bytes([&] {
  ...
    return _worker_state->alloc_section(region, [&] {

Fixes #6637.
Refs #6108.

Tests:

  - unit (all)

Message-Id: <1592218544-9435-1-git-send-email-tgrabiec@scylladb.com>
2020-06-15 16:03:33 +03:00
Botond Dénes
fe024cecdc row_cache: pass a valid permit to underlying read
All reader are soon going to require a valid permit, so make sure we
have a valid permit which we can pass to the underlying reader when
creating it. This means `row_cache::make_reader()` now also requires
a permit to be passed to it.
2020-05-28 11:34:35 +03:00
Pavel Emelyanov
2ac24d38fa row-cache: Remove variadic future from range_populating_reader
Replace it with std::tuple, introduce range_populating_reader::read_result
type alias for less keystrokes.

This makes row_cache.o compilation warn-less.

tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200518160511.26984-1-xemul@scylladb.com>
2020-05-21 19:29:39 +02:00
Pavel Emelyanov
d3b6f66f50 row_cache: Remove unused invalidate_unwrapped()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200423133557.27053-1-xemul@scylladb.com>
2020-04-23 17:04:31 +03:00
Botond Dénes
196dd5fa9b treewide: throw std::bad_function_call with backtraces
We typically use `std::bad_function_call` to throw from
mandatory-to-implement virtual functions, that cannot have a meaningful
implementation in the derived class. The problem with
`std::bad_function_call` is that it carries absolutely no information
w.r.t. where was it thrown from.

I originally wanted to replace `std::bad_function_call` in our codebase
with a custom exception type that would allow passing in the name of the
function it is thrown from to be included in the exception message.
However after I ended up also including a backtrace, Benny Halevy
pointed out that I might as well just throw `std:bad_function_call` with
a backtrace instead. So this is what this patch does.

All users are various unimplemented methods of the
`flat_mutation_reader::impl` interface.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200408075801.701416-1-bdenes@scylladb.com>
2020-04-08 13:54:06 +02:00
Rafael Ávila de Espíndola
eca0ac5772 everywhere: Update for deprecated apply functions
Now apply is only for tuples, for varargs use invoke.

This depends on the seastar changes adding invoke.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200324163809.93648-1-espindola@scylladb.com>
2020-03-25 08:49:53 +02:00
Botond Dénes
dfc8b2fc45 treewide: replace reader_resource_tracer with reader_permit
The former was never really more than a reader_permit with one
additional method. Currently using it doesn't even save one from any
includes. Now that readers will be using reader_permit we would have to
pass down both to mutation_source. Instead get rid of
reader_resource_tracker and just use reader_permit. Instead of making it
a last and optional parameter that is easy to ignore, make it a
first class parameter, right after schema, to signify that permits are
now a prominent part of the reader API.

This -- mostly mechanical -- patch essentially refactors mutation_source
to ask for the reader_permit instead of reader_resource_tracking and
updates all usage sites.
2020-01-28 08:13:16 +02:00
Tomasz Grabiec
e3d025d014 row_cache: Fix abort on bad_alloc during cache update
Since 90d6c0b, cache will abort when trying to detach partition
entries while they're updated. This should never happen. It can happen
though, when the update fails on bad_alloc, because the cleanup guard
invalidates the cache before it releases partition snapshots (held by
"update" coroutine).

Fix by destroying the coroutine first.

Fixes #5327.

Tests:
  - row_cache_test (dev)

Message-Id: <1574360259-10132-1-git-send-email-tgrabiec@scylladb.com>
2019-11-24 12:06:51 +02:00
Piotr Dulikowski
59fbbb993f memtables: add partition/row hit/miss counters
Adds per-table metrics for counting partition and row reuse
in memtables. New metrics are as follows:
    - memtable_partition_writes - number of write operations performed
          on partitions in memtables,
    - memtable_partition_hits - number of write operations performed
          on partitions that previously existed in a memtable,
    - memtable_row_writes - number of row write operations performed
          in memtables,
    - memtable_row_hits - number of row write operations that ovewrote
          rows previously present in a memtable.

Tests: unit(release)
2019-11-12 13:35:41 +01:00
Tomasz Grabiec
e6afc89735 row_cache: Record upgraded schema in memtable entries during update
Cache update may defer in the middle of moving of partition entry
from a flushed memtable to the cache. If the schema was changed since
the entry was written, it upgrades the schema of the partition_entry
first but doesn't update the schema_ptr in memtable_entry. The entry
is removed from the memtable afterward. If a memtable reader
encounters such an entry, it will try to upgrade it assuming it's
still at the old schema.

That is undefined behavior in general, which may include:

 - read failures due to bad_alloc, if fixed-size cells are interpreted
   as variable-sized cells, and we misinterpret a value for a huge
   size

 - wrong read results

 - node crash

This doesn't result in a permanent corruption, restarting the node
should help.

It's the more likely to happen the more rows there are in a
partition. It's unlikely to happen with single-row partitions.

Introduced in 70c7277.

Fixes #5128.
2019-10-03 22:03:29 +02:00
Tomasz Grabiec
90d6c0b9a2 row_cache, mvcc: Prevent locked snapshots from being evicted
If the whole partition entry is evicted while being updated from the
memtable, a subsequent read may populate the partition using the old
version of data if it attempts to do it before cache update advances
past that partition. Partial eviction is not affected because
populating reads will notice that there is a newer snapshot
corresponding to the updater.

This can happen only in OOM situations where the whole cache gets evicted.

Affects only tables with multi-row partitions, which are the only ones
that can experience the update of partition entry being preempted.

Introduced in 70c7277.

Fixes #5134.
2019-10-03 22:03:29 +02:00
Tomasz Grabiec
57a93513bd row_cache: Make evict() not use invalidate_unwrapped()
invalidate_unwrapped() calls cache_entry::evict(), which cannot be
called concurrently with cache update. invalidate() serializes it
properly by calling do_update(), but evict() doesn't. The purpose of
evict() is to stress eviction in tests, which can happen concurrently
with cache update. Switch it to use memory reclaimer, so that it's
both correct and more realistic.

evict() is used only in tests.
2019-10-03 22:03:28 +02:00
Tomasz Grabiec
25e2f87a37 row_cache, mvcc: Do not upgrade schema of entries which are being updated
When a read enters a partition entry in the cache, it first upgrades
it to the current schema of the cache. The same happens when an entry
is updated after a memtable flush. Upgrading the entry is currently
performed by squashing all versions and replacing them with a single
upgraded version. That has a side effect of detaching all snapshots
from the partition entry. Partition entry update on memtable flush is
writing into a snapshot. If that snapshot is detached by a schema
upgrade, the entry will be missing writes from the memtable which fall
into continuous ranges in that entry which have not yet been updated.

This can happen only if the update of the entry is preempted and the
schema was altered during that, and a read hit that partition before
the update went past it.

Affects only tables with multi-row partitions, which are the only ones
that can experience the update of partition entry being preempted.

The problem is fixed by locking updated entries and not upgrading
schema of locked entries. cache_entry::read() is prepared for this,
and will upgrade on-the-fly to the cache's schema.

Fixes #5135
2019-10-03 22:03:28 +02:00
Tomasz Grabiec
aad1307b14 row_cache, memtable: Use upgrade_schema() 2019-10-03 13:28:33 +02:00
Tomasz Grabiec
77fb34821b row_cache: Make invalidate() preemptible
This change inserts preemption points between removal of partitions.

The main complication is in maintaining consitency in the face of
concurrent population or eviction. We use the same mechanism which is
used by memtable updates. _prev_snapshot_pos is the ring position
which partitions the ring into the part which is already updated in
cache and the one which is yet to be updated. That position should be
set accordingly on preemption.

In case of invalidation, updating means removing all entries in the
range and marking the range as discontinuous.  When resuming
invalidation of a range we continue from _prev_snapshot_pos as the
lower bound.

This affects high-level operations like nodetool refresh, table
truncation, repair and streaming.

Fixes #2683

The improvement on stalls was measured using tests/perf_row_cache_update:

Before

Small partitions, no overwrites:
invalidation: 339.420624 [ms], preemption: {count: 2, 99%: 0.008239 [ms], max: 339.422144 [ms]}
Small partition with a few rows:
invalidation: 191.855331 [ms], preemption: {count: 2, 99%: 0.008239 [ms], max: 191.856816 [ms]}
Large partition, lots of small rows:
invalidation: 0.959328 [ms], preemption: {count: 2, 99%: 0.008239 [ms], max: 0.961453 [ms]}

After:

Small partitions, no overwrites:
invalidation: 400.505554 [ms], preemption: {count: 843, 99%: 0.545791 [ms], max: 0.502340 [ms]}
Small partition with a few rows:
invalidation: 306.352600 [ms], preemption: {count: 644, 99%: 0.545791 [ms], max: 0.506464 [ms]}
Large partition, lots of small rows:
invalidation: 0.963660 [ms], preemption: {count: 2, 99%: 0.009887 [ms], max: 0.963264 [ms]}

The maximum scheduling latency went down form 339 ms to 0.5 ms (task quota).
2019-05-13 19:32:00 +02:00
Tomasz Grabiec
595e1a540e row_cache: Switch _prev_snapshot_pos to be a ring_position_ext
dht::ring_position cannot represent all ring_position_view instances,
in particular those obtained from
dht::ring_position_view::for_range_start(). To allow using the latter,
switch to views.
2019-05-13 19:30:50 +02:00
Tomasz Grabiec
32f711ce56 row_cache: Fix crash on memtable flush with LCS
Presence checker is constructed and destroyed in the standard
allocator context, but the presence check was invoked in the LSA
context. If the presence checker allocates and caches some managed
objects, there will be alloc-dealloc mismatch.

That is the case with LeveledCompactionStrategy, which uses
incremental_selector.

Fix by invoking the presence check in the standard allocator context.

Fixes #4063.

Message-Id: <1547547700-16599-1-git-send-email-tgrabiec@scylladb.com>
2019-01-15 16:53:36 +02:00
Duarte Nunes
fa2b0384d2 Replace std::experimental types with C++17 std version.
Replace stdx::optional and stdx::string_view with the C++ std
counterparts.

Some instances of boost::variant were also replaced with std::variant,
namely those that called seastar::visit.

Scylla now requires GCC 8 to compile.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20190108111141.5369-1-duarte@scylladb.com>
2019-01-08 13:16:36 +02:00
Paweł Dziepak
df1d438fcd row_cache: drop support for streamed_mutation::forwarding::yes entirely 2018-12-20 13:27:25 +00:00
Paweł Dziepak
adcb3ec20c row_cache: read is not single-partition if inter-partition forwarding is enabled 2018-12-20 13:27:25 +00:00
Paweł Dziepak
7ecee197c4 row_cache: use make_forwardable() to implement streamed_mutation::forwarding
Implementing intra-partition fast-forwarding adds more complexity to
already very-much-not-trivial cache readers and isn't really critical in
any way since it is not used outside of the tests. Let's use the generic
adapter instead of natively implementing it.
2018-12-20 13:27:25 +00:00
Avi Kivity
775b7e41f4 Update seastar submodule
* seastar d59fcef...b924495 (2):
  > build: Fix protobuf generation rules
  > Merge "Restructure files" from Jesse

Includes fixup patch from Jesse:

"
Update Seastar `#include`s to reflect restructure

All Seastar header files are now prefixed with "seastar" and the
configure script reflects the new locations of files.

Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <5d22d964a7735696fb6bb7606ed88f35dde31413.1542731639.git.jhaberku@scylladb.com>
"
2018-11-21 00:01:44 +02:00
Avi Kivity
8cca3b2879 row_cache: fix bad format string syntax
Some sprint() calls use the fmt language instead of the printf syntax. Convert
them all the way to format().
2018-11-01 13:16:17 +00:00
Paweł Dziepak
637b9a7b3b atomic_cell_or_collection: make operator<< show cell content
After the new in-memory representation of cells was introduced there was
a regression in atomic_cell_or_collection::operator<< which stopped
printing the content of the cell. This makes debugging more incovenient
are time-consuming. This patch fixes the problem. Schema is propagated
to the atomic_cell_or_collection printer and the full content of the
cell is printed.

Fixes #3571.

Message-Id: <20181024095413.10736-1-pdziepak@scylladb.com>
2018-10-24 13:29:51 +03:00
Botond Dénes
eb357a385d flat_mutation_reader: make timeout opt-out rather than opt-in
Currently timeout is opt-in, that is, all methods that even have it
default it to `db::no_timeout`. This means that ensuring timeout is used
where it should be is completely up to the author and the reviewrs of
the code. As humans are notoriously prone to mistakes this has resulted
in a very inconsistent usage of timeout, many clients of
`flat_mutation_reader` passing the timeout only to some members and only
on certain call sites. This is small wonder considering that some core
operations like `operator()()` only recently received a timeout
parameter and others like `peek()` didn't even have one until this
patch. Both of these methods call `fill_buffer()` which potentially
talks to the lower layers and is supposed to propagate the timeout.
All this makes the `flat_mutation_reader`'s timeout effectively useless.

To make order in this chaos make the timeout parameter a mandatory one
on all `flat_mutation_reader` methods that need it. This ensures that
humans now get a reminder from the compiler when they forget to pass the
timeout. Clients can still opt-out from passing a timeout by passing
`db::no_timeout` (the previous default value) but this will be now
explicit and developers should think before typing it.

There were suprisingly few core call sites to fix up. Where a timeout
was available nearby I propagated it to be able to pass it to the
reader, where I couldn't I passed `db::no_timeout`. Authors of the
latter kind of code (view, streaming and repair are some of the notable
examples) should maybe consider propagating down a timeout if needed.
In the test code (the wast majority of the changes) I just used
`db::no_timeout` everywhere.

Tests: unit(release, debug)

Signed-off-by: Botond Dénes <bdenes@scylladb.com>

Message-Id: <1edc10802d5eb23de8af28c9f48b8d3be0f1a468.1536744563.git.bdenes@scylladb.com>
2018-09-20 11:31:24 +02:00
Tomasz Grabiec
567da3e063 memtable, cache: Fix exception safety of partition entry insertions
boost::intrusive::set::insert() may throw if keys require
linearization and that fails, in which case we will leak the entry.

When this happens in cache, we will also violate the invariant for
entry eviction, which assumes all tracked entries are linked, and
cause a SEGFAULT.

Use the non-throwing and faster insert_before() instead. Where we
can't use insert_before(), use alloc_strategy_unique_ptr<> to ensure
that entry is deallocated on insert failure.

Fixes #3585.
2018-07-17 16:30:01 +02:00
Tomasz Grabiec
074be4d4e8 memtable, cache: Run mutation_cleaner worker in its own scheduling group
The worker is responsible for merging MVCC snapshots, which is similar
to merging sstables, but in memory. The new scheduling group will be
therefore called "memory compaction".

We should run it in a separate scheduling group instead of
main/memtables, so that it doesn't disrupt writes and other system
activities. It's also nice for monitoring how much CPU time we spend
on this.
2018-06-27 21:51:04 +02:00
Tomasz Grabiec
6c6ffaee71 mutation_cleaner: Make merge() redirect old instance to the new one
If memtable snapshot goes away after memtable started merging to
cache, it would enqueue the snapshots for cleaning on the memtable's
cleaner, which will have to clean without deferrring when the memtable
is destroyed. That may stall the reactor. To avoid this, make merge()
cause the old instance of the cleaner to redirect to the new instance
(owned by cache), like we do for regions. This way the snapshots
mentioned earlier can be cleaned after memtable is destroyed,
gracefully.
2018-06-27 21:51:04 +02:00
Paweł Dziepak
96b0577343 row_cache: deglobalise row cache tracker
Row cache tracker has numerous implicit dependencies on ohter objects
(e.g. LSA migrators for data held by mutation_cleaner). The fact that
both cache tracker and some of those dependencies are thread local
objects makes it hard to guarantee correct destruction order.

Let's deglobalise cache tracker and put in in the database class.
2018-06-25 09:37:43 +01:00
Tomasz Grabiec
78274276f5 row_cache: Use the memtable cleaner to create memtable snapshot during update
Memtable entries should be cleaned using memtable cleaner, which
unlike the cache' cleaner is not associated with the cache
tracker. It's an error to clean a snapshot using tracker which doesn't
own the entries. This will corrupt cache tracker's row counter.

Fixes failure of test_exception_safety_of_update_from_memtable from
row_cache.cc in debug mode and with allocation failure injection
enabled.

Introduce in "cache: Defer during partition merging"
(70c72773be).
Message-Id: <1528988256-20578-1-git-send-email-tgrabiec@scylladb.com>
2018-06-14 18:03:02 +03:00