Commit Graph

158 Commits

Author SHA1 Message Date
Duarte Nunes
fa2b0384d2 Replace std::experimental types with C++17 std version.
Replace stdx::optional and stdx::string_view with the C++ std
counterparts.

Some instances of boost::variant were also replaced with std::variant,
namely those that called seastar::visit.

Scylla now requires GCC 8 to compile.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20190108111141.5369-1-duarte@scylladb.com>
2019-01-08 13:16:36 +02:00
Avi Kivity
775b7e41f4 Update seastar submodule
* seastar d59fcef...b924495 (2):
  > build: Fix protobuf generation rules
  > Merge "Restructure files" from Jesse

Includes fixup patch from Jesse:

"
Update Seastar `#include`s to reflect restructure

All Seastar header files are now prefixed with "seastar" and the
configure script reflects the new locations of files.

Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <5d22d964a7735696fb6bb7606ed88f35dde31413.1542731639.git.jhaberku@scylladb.com>
"
2018-11-21 00:01:44 +02:00
Tomasz Grabiec
074be4d4e8 memtable, cache: Run mutation_cleaner worker in its own scheduling group
The worker is responsible for merging MVCC snapshots, which is similar
to merging sstables, but in memory. The new scheduling group will be
therefore called "memory compaction".

We should run it in a separate scheduling group instead of
main/memtables, so that it doesn't disrupt writes and other system
activities. It's also nice for monitoring how much CPU time we spend
on this.
2018-06-27 21:51:04 +02:00
Paweł Dziepak
96b0577343 row_cache: deglobalise row cache tracker
Row cache tracker has numerous implicit dependencies on ohter objects
(e.g. LSA migrators for data held by mutation_cleaner). The fact that
both cache tracker and some of those dependencies are thread local
objects makes it hard to guarantee correct destruction order.

Let's deglobalise cache tracker and put in in the database class.
2018-06-25 09:37:43 +01:00
Paweł Dziepak
27014a23d7 treewide: require type info for copying atomic_cell_or_collection 2018-05-31 15:51:11 +01:00
Tomasz Grabiec
70c72773be cache: Defer during partition merging 2018-05-30 14:41:41 +02:00
Tomasz Grabiec
3f19f76c67 mvcc: Destroy memtable partition versions gently
Now all snapshots will have a mutation_cleaner which they will use to
gently destroy freed partition_version objects.

Destruction of memtable entries during cache update is also using the
gentle cleaner now. We need to have a separate cleaner for memtable
objects even though they're owned by cache's region, because memtable
versions must be cleared without a cache_tracker.

Each memtable will have its own cleaner, which will be merged with the
cache's cleaner when memtable is merged into cache.

Fixes some sources of reactor stalls on cache update when there are
large partition entries in memtables.
2018-05-30 14:41:40 +02:00
Tomasz Grabiec
f0c1edd672 cache: Destroy partition versions incrementally
Instead of destroying whole partition_versions at once, we will do that
gently using mutation_cleaner to avoid reactor stalls.

Large deletions could happen when large partition gets invalidated,
upgraded to a new schema, or when it's abandaned by a detached snapshot.

Refs #3289.
2018-05-30 14:41:40 +02:00
Tomasz Grabiec
40cc766cf2 database: Add API for incremental clearing of partition entries
Partitions can get very large. Destroying them all at once can stall
the reactor for significant amount of time. We want to avoid that by
doing destruction incrementally, deferring in between. A new API is
added for that at various levels:

  stop_iteration clear_gently() noexcept;

It returns stop_iteration::yes when the object is fully cleared and
can be now destroyed quickly. So a deferring destruction can look like
this:

  return repeat([this] { return clear_gently(); });

The reason why clear_gently() doesn't return a future<> itself is that some
contexts cannot defer, like memory reclamation.
2018-05-30 12:18:56 +02:00
Tomasz Grabiec
2f75212ca4 cache: Define trivial methods inline
They have users in a different compilation unit, in partition_version.cc
2018-05-30 12:18:56 +02:00
Tomasz Grabiec
641bcd0b35 cache: Introduce unlink_from_lru()
Will be used in row_cache_alloc_stress to unlink partitions which we
don't want to get evicted, instead of reapeatedly calling touch() on
them after each subsequent population. After switching to row-level
LRU, doing so greatly increases run time of the test due to quadratic
behavior.
2018-03-07 16:52:59 +01:00
Tomasz Grabiec
b9d22584bb cache: Add row-level stats about cache update from memtable 2018-03-07 16:52:58 +01:00
Tomasz Grabiec
da901b93fc cache: Track number of rows and row invalidations 2018-03-06 11:50:29 +01:00
Tomasz Grabiec
381bf02f55 cache: Evict with row granularity
Instead of evicting whole partitions, evicts whole rows.

As part of this, invalidation of partition entries was changed to not
evict from snapshots right away, but unlink them and let them be
evicted by the reclaimer.
2018-03-06 11:50:29 +01:00
Tomasz Grabiec
dce9185fc9 cache: Track static row insertions separately from regular rows
So that row eviction counter, which doesn't look at the static row,
can be in sync with row insertion counter.
2018-03-06 11:50:28 +01:00
Tomasz Grabiec
5320705300 cache: Propagate cache_tracker to places manipulating evictable entries
cache_tracker reference will be needed to link/unlink row entries.

No change of behavior in this patch.
2018-03-06 11:50:27 +01:00
Tomasz Grabiec
30df3ddd7d cache: Do not evict from cache_entry destructor
We will need to propagate a cache_tracker reference to evict(). Instead
of evicting from destructor, do so before cache_entry gets unlinked
from the tree. Entries which are not linked, don't need to be
explicitly evicted.
2018-03-06 11:50:27 +01:00
Tomasz Grabiec
4efab6f6a6 cache: Use on_evicted() in cache_tracker::clear()
In preparation for switching LRU to row level.
2018-03-06 11:50:27 +01:00
Tomasz Grabiec
2118bdce01 cache: Extract cache_entry::on_evicted() 2018-03-06 11:50:27 +01:00
Tomasz Grabiec
24c5949518 cache: cache_tracker: Rename on_merge() to on_partition_merge() 2018-03-06 11:50:27 +01:00
Tomasz Grabiec
d66e864310 cache: cache_tracer: Rename on_erase() to on_partition_erase() 2018-03-06 11:50:27 +01:00
Tomasz Grabiec
d9a38c1c85 mutation_partition: Add API to walk from rows_entry to cache_entry
Will be needed on row eviction, to unlink containers when they become
fully evicted.
2018-03-06 11:50:26 +01:00
Tomasz Grabiec
b0b57b8143 mvcc: Do not move unevictable snapshots to cache
Commit 6ccd317 introduced a bug in partition_entry::evict() where a
partition entry may be partially evicted if there are non-evictable
snapshots in it. Partially evicting some of the versions may violate
consistency of a snapshot which includes evicted versions. For one,
continuity flags are interpreted realtive to the merged view, not
within a version, so evicting from some of the versions may mark
reanges as continuous when before they were discontinuous. Also, range
tombtsones of the snapshot are taken from all versions, so we can't
partially evict some of them without marking all affected ranges as
discontinuous.

The fix is to revert back to full eviciton, and avoid moving
non-evictable snapshots to cache. When moving whole partition entry to
cache, we first create a neutral empty partition entry and then merge
the memtable entry into it just like we would if the entry already
existed.

Fixes #3215.

Tests: unit (release)
Message-Id: <1518710592-21925-2-git-send-email-tgrabiec@scylladb.com>
2018-02-15 16:48:07 +00:00
Tomasz Grabiec
27b114fe45 cache: Handle exceptions from make_evictable()
cache_entry constructor was marked noexcept, yet make_evictable() may
fail in rare cases due to allocation in add_version(). Lift the
annotation and make sure that construction has strong exception
guarantees for the moved-in state so that it can be retried without
data loss inside allocating section.
2018-02-14 16:42:49 +01:00
Tomasz Grabiec
cce1a2bce8 Merge "Use the CPU scheduler" from Glauber & Avi
In this patchset I am resubmitting Avi's enablement of the CPU scheduler
in his behalf. I've done a ton of testing in the series and there are
some improvements / changes that I had previously sent as a separate series.

What you see here is the result of merging that work.

After this patchset is applied, workloads are smoother and we are able to
uphold the pre-defined shares among the various actors.

We also finally have everything we need to merge the CPU and I/O controllers.
After that is done the code is now much simpler. But also, as a bonus,
controllers that were previously available for I/O only (compactions) are
enabled for CPU as well.

* git@github.com:glommer/scylla.git cpusched-v7:

Avi Kivity (4):
  database, sstables, compaction: convert use of thread_scheduling_group
    to seastar cpu scheduler
  memtable, database: make memtable::clear_gently() inherit
    scheduling_group
  config: mark background_writer_scheduling_quota as Unused
  database: place data_query execution stage into scheduling_group

Glauber Costa (9):
  database, main: set up scheduling_groups for our main tasks
  row_cache: actually use the scheduling group for update_cache
  allow update_cache and clear_gently to use the entire task quota.
  database: remove cpu_flush_quota metric
  controllers: retire auto_adjust_flush_quota
  controllers: allow memtable I/O controller to have shares statically
    set
  controllers: update control points for memtable I/O controller
  controllers: allow a static priority to override the controller output
  controllers: unify the I/O and CPU controllers
2018-02-08 15:58:40 +01:00
Glauber Costa
a3a4d0a17a row_cache: actually use the scheduling group for update_cache
We have moved clear_gently from using a seastar::thread's scheduling_group to
using the CPU scheduler's. However, update_cache was forgotten.

This patch fixes that and gets rid of the old group just in case.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-02-07 17:19:29 -05:00
Tomasz Grabiec
d398aa913e cache: Fix calculation of active_reads()
Message-Id: <1518023341-27855-1-git-send-email-tgrabiec@scylladb.com>
2018-02-07 17:20:00 +00:00
Tomasz Grabiec
d899ae0f02 mvcc: Encapsulate construction of evictable entries
Internal invariants of MVCC are better preserved by partition_entry
methods, so move construction of partition entries out of cache_entry
constructors.
2018-02-05 17:54:03 +01:00
Piotr Jastrzebski
39ec13133f row_cache: rename make_flat_reader to make_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
0f45df96ca row_cache: Delete unused make_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Duarte Nunes
a66c8d7973 row_cache: Don't require external_updater to be copyable
No good reason to copy it around, and even less reason to impose that
constraint on callers.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180118181142.15408-1-duarte@scylladb.com>
2018-01-19 13:00:49 +01:00
Piotr Jastrzebski
14d98aaa0b Rename row_cache::create_underlying_flat_reader to
create_underlying_reader

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 16:37:57 +01:00
Piotr Jastrzebski
49993e56a9 Remove unused row_cache::create_underlying_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 16:37:57 +01:00
Piotr Jastrzebski
1457a3d771 Rename cache_entry::*read_flat to cache_entry::*read
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 16:37:57 +01:00
Piotr Jastrzebski
8d37b71843 Rename autoupdating_underlying_flat_reader to autoupdating_underlying_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 16:37:57 +01:00
Piotr Jastrzebski
9789c37e9d Remove autoupdating_underlying_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 16:37:56 +01:00
Piotr Jastrzebski
df17bad13b Remove unused cache_entry::read and do_read
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 16:37:56 +01:00
Piotr Jastrzebski
a9b6551584 Add cache_entry::read_flat
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 13:28:33 +01:00
Piotr Jastrzebski
82c603069b Make cache_flat_mutation_reader a friend of row_cache and cache_tracker
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 13:28:33 +01:00
Piotr Jastrzebski
072fc2a309 Move lsa_manager to row_cache.hh
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 13:28:33 +01:00
Piotr Jastrzebski
77b6f7c599 read_context: create a copy of autoupdating_underlying_reader
called autoupdating_underlying_flat_reader. It will be modified
in the next patch to use flat reader to underlying.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 13:28:09 +01:00
Piotr Jastrzebski
bf4e1c0c54 Add row_cache::create_underlying_flat_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 13:28:09 +01:00
Piotr Jastrzebski
16a0d306fd Turn scanning_and_populating_reader into flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-18 13:28:09 +01:00
Piotr Jastrzebski
ceaf0dee99 Introduce row_cache::make_flat_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-14 12:49:39 +01:00
Glauber Costa
1d7617723d row cache: pin real dirty during cache updates.
Right now, once a region is moved to the cache is no longer visible to
the dirty memory system. Not as real dirty nor virtual dirty.

The problem is that until a particular partition is moved to the cache
it is not evictable. As a result we can OOM the system if we have a lot
of pending cache updates as the writes will not be throttled and memory
won't be made available.

This patch pins the memory used by the region as real dirty before the
cache update starts, and unpins it when it is over. In the mean time it
gradually releases memory of the partitions that are being moved to
cache.

I have verified in a couple of workloads that the amount of memory
accounted through this is the same amount of memory accounted through
the memtable flush procedure.

Fixes #1942

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-11-08 19:46:36 -05:00
Duarte Nunes
baeec0935f Replace query::full_slice with schema::full_slice()
query::full_slice doesn't select any regular or static columns, which
is at odds with the expectations of its users. This patch replaces it
with the schema::full_slice() version.

Refs #2885

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1507732800-9448-2-git-send-email-duarte@scylladb.com>
2017-10-17 11:25:53 +02:00
Tomasz Grabiec
c78047fa5b row_cache: Evict partition snapshots
If snapshots are not evicted, they may pin unbouned amount of memory
for a long time in cache, which may lead to OOM. Evict snapshots
together with the entry.

Fixes #2775.
Fixes #2730.
2017-09-13 17:47:03 +02:00
Tomasz Grabiec
adb159d51b row_cache: Reuse allocation_strategy::invalidate_references()
Modification count in the tracker is redundant, we can rely on
allocator's invalidation counter.
2017-09-13 17:38:08 +02:00
Tomasz Grabiec
27a3b4bca9 row_cache: Don't invalidate references on insertion
modification_count is currently only used to detect invalidation of
references, intended to be incremented on erasure.

Insertion into intrusive set doesn't invalidate references, so no
need to increment the counter.
2017-09-13 17:38:08 +02:00
Tomasz Grabiec
d22fdf4261 row_cache: Improve safety of cache updates
Cache imposes requirements on how updates to the on-disk mutation source
are made:
  1) each change to the on-disk muation source must be followed
     by cache synchronization reflecting that change
  2) The two must be serialized with other synchronizations
  3) must have strong failure guarantees (atomicity)

Because of that, sstable list update and cache synchronization must be
done under a lock, and cache synchronization cannot fail to synchronize.

Normally cache synchronization achieves no-failure thing by wiping the
cache (which is noexcept) in case failure is detect. There are some
setup steps hoever which cannot be skipped, e.g. taking a lock
followed by switching cache to use the new snapshot. That truly cannot
fail.  The lock inside cache synchronizers is redundant, since the
user needs to take it anyway around the combined operation.

In order to make ensuring strong exception guarantees easier, and
making the cache interface easier to use correctly, this patch moves
the control of the combined update into the cache. This is done by
having cache::update() et al accept a callback (external_updater)
which is supposed to perform modiciation of the underlying mutation
source when invoked.

This is in-line with the layering. Cache is layered on top of the
on-disk mutation source (it wraps it) and reading has to go through
cache. After the patch, modification also goes through cache. This way
more of cache's requirements can be confined to its implementation.

The failure semantics of update() and other synchronizers needed to
change due to strong exception guaratnees. Now if it fails, it means
the update was not performed, neither to the cache nor to the
underlying mutation source.

The database::_cache_update_sem goes away, serialization is done
internally by the cache.

The external_updater needs to have strong exception guarantees. This
requirement is not new. It is however currently violated in some
places. This patch marks those callbacks as noexcept and leaves a
FIXME. Those should be fixed, but that's not in the scope of this
patch. Aborting is still better than corrupting the state.

Fixes #2754.

Also fixes the following test failure:

  tests/row_cache_test.cc(949): fatal error: in "test_update_failure": critical check it->second.equal(*s, mopt->partition()) has failed

which started to trigger after commit 318423d50b. Thread stack
allocation may fail, in which case we did not do the necessary
invalidation.
2017-09-04 10:04:29 +02:00