Commit Graph

302 Commits

Author SHA1 Message Date
Piotr Dulikowski
589313a110 row_marker: correct expiration condition
This change corrects condition on which a row was considered expired by
its TTL.

The logic that decides when a row becomes expired was inconsistent with
the logic that decides if a single cell is expired. A single cell
becomes expired when `expiry_timestamp <= now`, while a row became
expired when `expiry_timestamp < now` (notice the strict inequality).
For rows inserted with TTL, this caused non-key cells to expire (change
their values to null) one second before the row disappeared. Now, row
expiry logic uses non-strict inequality.

Fixes: #4263, #5290.

Tests:
- unit(dev)
- python test described in issue #5290
2019-11-19 11:46:59 +01:00
Piotr Dulikowski
59fbbb993f memtables: add partition/row hit/miss counters
Adds per-table metrics for counting partition and row reuse
in memtables. New metrics are as follows:
    - memtable_partition_writes - number of write operations performed
          on partitions in memtables,
    - memtable_partition_hits - number of write operations performed
          on partitions that previously existed in a memtable,
    - memtable_row_writes - number of row write operations performed
          in memtables,
    - memtable_row_hits - number of row write operations that ovewrote
          rows previously present in a memtable.

Tests: unit(release)
2019-11-12 13:35:41 +01:00
Vladimir Davydov
e0b31dd273 query: add flag to return static row on partition with no rows
A SELECT statement that has clustering key restrictions isn't supposed
to return static content if no regular rows matches the restrictions,
see #589. However, for the CAS statement we do need to return static
content on failure so this patch adds a flag that allows the caller to
override this behavior.
2019-10-28 21:50:44 +03:00
Kamil Braun
4374982de0 types: collection_type_impl::to_value becomes serialize_for_cql.
The purpose of collection_type_impl::to_value was to serialize a
collection for sending over CQL. The corresponding function in origin
is called serializeForNativeProtocol, but the name is a bit lengthy,
so I settled for serialize_for_cql.

The method now became a free-standing function, using the visit
function to perform a dispatch on the collection type instead
of a virtual call. This also makes it easier to generalize it to UDTs
in future commits.

Remove the old serialize_for_native_protocol with a FIXME: implement
inside. It was already implemented (to_value), just called differently.

remove dead methods: enforce_limit and serialized_values. The
corresponding methods in C* are auxiliary methods used inside
serializeForNativeProtocol. In our case, the entire algorithm
is wholly written in serialize_for_cql.
2019-10-25 10:49:19 +02:00
Kamil Braun
d83ebe1092 collection_mutation: move collection_type_impl::difference to collection_mutation.hh. 2019-10-25 10:42:58 +02:00
Kamil Braun
7e3bbe548c collection_mutation: move collection_type_impl::merge to collection_mutation.hh. 2019-10-25 10:42:58 +02:00
Kamil Braun
a41277a7cd collection_mutation: move collection_type_impl::last_update to collection_mutation_view 2019-10-25 10:42:58 +02:00
Kamil Braun
30802f5814 collection_mutation: move collection_type_impl::is_any_live to collection_mutation_view 2019-10-25 10:42:58 +02:00
Kamil Braun
e16ba76c2e collection_mutation: move collection_type_impl::is_empty to collection_mutation_view. 2019-10-25 10:42:58 +02:00
Kamil Braun
bbdb438d89 collection_mutation: easier (de)serialization of collection_mutation(s).
`collection_type_impl::serialize_mutation_form`
became `collection_mutation(_view)_description::serialize`.

Previously callers had to cast their data_type down to collection_type
to use serialize_mutation_form. Now it's done inside `serialize`.
In the future `serialize` will be generalized to handle UDTs.

`collection_type_impl::deserialize_mutation_form`
became a free standing function `deserialize_collection_mutation`
with similiar benefits. Actually, noone needs to call this function
manually because of the next paragraph.

A common pattern consisting of linearizing data inside a `collection_mutation_view`
followed by calling `deserialize_mutation_form` has been abstracted out
as a `with_deserialized` method inside collection_mutation_view.

serialize_mutation_form_only_live was removed,
because it hadn't been used anywhere.
2019-10-25 10:42:58 +02:00
Kamil Braun
b1d16c1601 types: move collection_type_impl::mutation(_view) out of collection_type_impl.
collection_type_impl::mutation became collection_mutation_description.
collection_type_impl::mutation_view became collection_mutation_view_description.
These classes now reside inside collection_mutation.hh.

Additional documentation has been written for these classes.

Related function implementations were moved to collection_mutation.cc.

This makes it easier to generalize these classes to non-frozen UDTs in future commits.
The new names (together with documentation) better describe their purpose.
2019-10-25 10:19:45 +02:00
Kamil Braun
c90ea1056b Remove mutation_partition_applier.
It had been replaced by partition_builder
in commit dc290f0af7.
2019-10-25 10:19:45 +02:00
Avi Kivity
acc433b286 mutation_partition: make static_row optional to reduce memory footprint
The static row can be rare: many tables don't have them, and tables
that do will often have mutations without them (if the static row
is rarely updated, it may be present in the cache and in readers,
but absent in memtable mutations). However, it always consumes ~100
bytes of memory, even if it not present, due to row's overhead.

Change it to be optional by using lazy_row instead of row. Some call
sites treewide were adjusted to deal with the extra indirection.

perf_simple_query appears to improve by 2%, from 163krps to 165 krps,
though it's hard to be sure due to noisy measurements.

memory_footprint comparisons (before/after):

mutation footprint:		       mutation footprint:
 - in cache:	 1096		        - in cache:	992
 - in memtable:	 854		        - in memtable:	750
 - in sstable:	 351		        - in sstable:	351
 - frozen:	 540		        - frozen:	540
 - canonical:	 827		        - canonical:	827
 - query result: 342		        - query result: 342

 sizeof(cache_entry) = 112	        sizeof(cache_entry) = 112
 -- sizeof(decorated_key) = 36	        -- sizeof(decorated_key) = 36
 -- sizeof(cache_link_type) = 32        -- sizeof(cache_link_type) = 32
 -- sizeof(mutation_partition) = 200    -- sizeof(mutation_partition) = 96
 -- -- sizeof(_static_row) = 112        -- -- sizeof(_static_row) = 8
 -- -- sizeof(_rows) = 24	        -- -- sizeof(_rows) = 24
 -- -- sizeof(_row_tombstones) = 40     -- -- sizeof(_row_tombstones) = 40

 sizeof(rows_entry) = 232	        sizeof(rows_entry) = 232
 sizeof(lru_link_type) = 16	        sizeof(lru_link_type) = 16
 sizeof(deletable_row) = 168	        sizeof(deletable_row) = 168
 sizeof(row) = 112		        sizeof(row) = 112
 sizeof(atomic_cell_or_collection) = 8  sizeof(atomic_cell_or_collection) = 8

Tests: unit (dev)
2019-10-15 15:42:05 +03:00
Avi Kivity
88613e6882 mutation_partition: introduce lazy_row
lazy_row adds indirection to the row class, in order to reduce storage requirements
when the row is not present. The intent is to use it for the static row, which is
not present in many schemas, and is often not present in writes even in schemas that
have a static row.

Indirection is done using managed_ref, which is lsa-compatible.

lazy_row implements most of row's methods, and a few more:
 - get(), get_existing(), and maybe_create(): bypass the abstraction and the
   underlying row
 - some methods that accept a row parameter also have an overload with a lazy_row
   parameter
2019-10-15 15:42:05 +03:00
Tomasz Grabiec
bce0dac751 mutation_partition: Track and validate schema version in debug builds
This patch makes mutation_partition validate the invariant that it's
supposed to be accessed only with the schema version which it conforms
to.

Refs #5095
2019-09-25 10:27:06 +02:00
Rafael Ávila de Espíndola
096de10eee types: Remove abstract_type::equals
All types are interned, so we can just compare the pointers.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2019-08-14 10:02:00 -07:00
Tomasz Grabiec
3af8431a40 Merge "compaction: allow collecting purged data" from Botond
compaction: allow collecting purged data

Allow the compaction initiator to pass an additional consumer that will
consume any data that is purged during the compaction process. This
allows the separate retention of these dead cells and tombstone until
some long-running process like compaction safely finishes. If the
process fails or is interrupted the purged data can be used to prevent
data resurrection.

This patch was developed to serve as the basis for a solution to #4531
but it is not a complete solution in and on itself.

This series is a continuation of the patch: "[PATCH v1 1/3] Introduce
Garbage Collected Consumer to Mutation Compactor" by Raphael S.
Carvalho <raphaelsc@scylladb.com>.

Refs: #4531

* https://github.com/denesb/scylla.git compaction_collect_purged_data/v8:
  Introduce compaction_garbage_collector interface
  collection_type_impl::mutation: compact_and_expire() add collector
    parameter
  row: add garbage_collector
  row_marker: de-inline compact_and_expire()
  row_marker: add garbage_collector
  Introduce Garbage Collected Consumer to Mutation Compactor
  tests: mutation_writer_test.cc/generate_mutations() ->
    random_schema.hh/generate_random_mutations()
  tests/random_schema: generate_random_mutations(): remove `engine`
    parameter
  tests/random_schema: add assert to make_clustering_key()
  tests/random_schema: generate_random_mutations(): allow customizing
    generated data
  tests: random_schema: futurize generate_random_mutations()
  random_schema: generate_random_mutations(): restore indentation
  data_model: extend ttl and expiry support
  tests/random_schema: generate_random_mutations(): generate partition
    tombstone
  random_schema: add ttl and expiry support
  tests/random: add get_bool() overload with random engine param
  random_schema: generate_random_mutations(): ensure partitions are
    unique
  tests: add unit tests for the data stream split in compaction
2019-07-23 17:12:28 +02:00
Paweł Dziepak
060e3f8ac2 mutation_partition: verify row::append_cell() precondition
row::append_cell() has a precondition that the new cell column id needs
to be larger than that of any other already existing cell. If this
precondition is violated the row will end up in an invalid state. This
patch adds assertion to make sure we fail early in such cases.
2019-07-15 23:25:06 +02:00
Botond Dénes
4c2781edaa row_marker: add garbage_collector
The new collector parameter is a pointer to a
`compaction_garbage_collector` implementation. This collector is passed
the row_marker when it expired and would be discarded.
The collector param is optional and defaults to nullptr.
2019-07-15 17:38:00 +03:00
Botond Dénes
7db2006162 row_marker: de-inline compact_and_expire() 2019-07-15 17:38:00 +03:00
Botond Dénes
4c7a7ffe8f row: add garbage_collector
The new collector parameter is a pointer to a
`compaction_garbage_collector` implementation. This collector is passed
all atoms that are expired and can would be discarded. The body of
`compact_and_expire()` was changed so that it checks cells' tombstone
coverage before it checks their expiry, so that cells that are both
covered by a tombstone and also expired are not passed to the collector.
The collector is forwarded to
`collection_type_impl::mutation::compact_and_expire()` as well.
The collector param is optional and defaults to nullptr
2019-07-15 17:38:00 +03:00
Botond Dénes
307b48794d collection_type_impl::mutation: compact_and_expire() add collector parameter
The new collector parameter is a pointer to a
`compaction_garbage_collector` implementation. This collector is passed
all atoms that are expired and would be discarded. The body of
`compact_and_expire()` was changed so that it checks cells' tombstone
coverage before it checks their expiry, so that cells that are both
covered by a tombstone and also expired are not passed to the collector.
The collector param is optional and defaults to nullptr. To accommodate
the collector, which needs to know the column id, a new `column_id`
parameter was added as well.
2019-07-15 17:37:55 +03:00
Piotr Jastrzebski
147cc031db Move map_type_impl out of types.hh to types/map.hh
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2019-01-24 09:56:38 +01:00
Benny Halevy
93270dd8e0 gc_clock: make 64 bit
Fixes: #3353

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2019-01-22 15:34:32 +02:00
Benny Halevy
16dda033a5 sstables: row_marker: initialize _expiry
compare_row_marker_for_merge compares deletion_time also for row markers
that have missing timestamps.  This happened to succeed due to implicit
initialization to 0. However, we prefer the initialization to be explicit
and allow calling row_marker::deletion_time() in all states.

Fixes #4068

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190110102949.17896-1-bhalevy@scylladb.com>
2019-01-10 12:45:07 +01:00
Duarte Nunes
fa2b0384d2 Replace std::experimental types with C++17 std version.
Replace stdx::optional and stdx::string_view with the C++ std
counterparts.

Some instances of boost::variant were also replaced with std::variant,
namely those that called seastar::visit.

Scylla now requires GCC 8 to compile.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20190108111141.5369-1-duarte@scylladb.com>
2019-01-08 13:16:36 +02:00
Tomasz Grabiec
ac49b1def0 mutation_cleaner: Migrate partition_snapshots when queueing for background cleanup
partition_snapshots created in the memtable will keep a reference to
the memtable (as region*) and to memtable::_cleaner. As long as the
reader is alive the memtable will be kept alive by
partition_snapshot_flat_reader::_container_guard. But after that,
nothing prevents it from being destroyed. The snapshot can outlive the
read if mutation_cleaner::merge_and_destroy() defers its destruction
for later. When the read ends after memtable was flushed, the snapshot
will be queued in the cache's cleaner, but internally will reference
memtable's region and cleaner. This will result in a use-after-free
when the snapshot resumses destruction.

The fix is to update snapshots's region and cleaner references at the
time of queueing to point to the cache's region and cleaner.

When memtable is destroyed without being moved to cache there is no
problem, because the snapshot would be queued into memtable's cleaner,
which will be drained on destruction from all snapshots.

Introduced in f3da043.

Fixes #4030.
2018-12-27 18:08:50 +01:00
Paweł Dziepak
9024187222 partition_slice: use small_vector for column_ids 2018-12-06 14:21:04 +00:00
Piotr Jastrzebski
411437f320 Fix format string in mutation_partition::operator<<
fmt does not allow bool values for :d and previous
format string was resulting in:

fmt::v5::format_error: invalid type specifier

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <3980a3cdb903263e29689b1c6cd24e3592826fe0.1542284205.git.piotr@scylladb.com>
2018-11-15 12:22:10 +00:00
Avi Kivity
a71ab365e3 toplevel: convert sprint() to format()
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().

Mechanically converted with https://github.com/avikivity/unsprint.
2018-11-01 13:16:17 +00:00
Paweł Dziepak
637b9a7b3b atomic_cell_or_collection: make operator<< show cell content
After the new in-memory representation of cells was introduced there was
a regression in atomic_cell_or_collection::operator<< which stopped
printing the content of the cell. This makes debugging more incovenient
are time-consuming. This patch fixes the problem. Schema is propagated
to the atomic_cell_or_collection printer and the full content of the
cell is printed.

Fixes #3571.

Message-Id: <20181024095413.10736-1-pdziepak@scylladb.com>
2018-10-24 13:29:51 +03:00
Botond Dénes
eb357a385d flat_mutation_reader: make timeout opt-out rather than opt-in
Currently timeout is opt-in, that is, all methods that even have it
default it to `db::no_timeout`. This means that ensuring timeout is used
where it should be is completely up to the author and the reviewrs of
the code. As humans are notoriously prone to mistakes this has resulted
in a very inconsistent usage of timeout, many clients of
`flat_mutation_reader` passing the timeout only to some members and only
on certain call sites. This is small wonder considering that some core
operations like `operator()()` only recently received a timeout
parameter and others like `peek()` didn't even have one until this
patch. Both of these methods call `fill_buffer()` which potentially
talks to the lower layers and is supposed to propagate the timeout.
All this makes the `flat_mutation_reader`'s timeout effectively useless.

To make order in this chaos make the timeout parameter a mandatory one
on all `flat_mutation_reader` methods that need it. This ensures that
humans now get a reminder from the compiler when they forget to pass the
timeout. Clients can still opt-out from passing a timeout by passing
`db::no_timeout` (the previous default value) but this will be now
explicit and developers should think before typing it.

There were suprisingly few core call sites to fix up. Where a timeout
was available nearby I propagated it to be able to pass it to the
reader, where I couldn't I passed `db::no_timeout`. Authors of the
latter kind of code (view, streaming and repair are some of the notable
examples) should maybe consider propagating down a timeout if needed.
In the test code (the wast majority of the changes) I just used
`db::no_timeout` everywhere.

Tests: unit(release, debug)

Signed-off-by: Botond Dénes <bdenes@scylladb.com>

Message-Id: <1edc10802d5eb23de8af28c9f48b8d3be0f1a468.1536744563.git.bdenes@scylladb.com>
2018-09-20 11:31:24 +02:00
Botond Dénes
d9a2ffad84 mutation_partition: don't move tracing_state early
Currently the `trace_state` is moved into the `querier` object's
constructor when one has to be created. Since the trace_state is used
below this lines this had the effect that on the first page of the
query, when a querier object has to be created, tracing would not work
inside the `querier_cache` which received a move-from `trace_state` (a
nullptr effectively).
Change the move to a copy so the other half of the function doesn't use
a moved-from `trace_state`.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <4987419781aa287141aa9dc8ce99c5068b564c84.1536739052.git.bdenes@scylladb.com>
2018-09-12 11:32:08 +02:00
Gleb Natapov
d7674288a9 mutation_partition: accurately account for result size in digest only queries
When measuring_output_stream is used to calculate result's element size
it incorrectly takes into account not only serialized element size, but
a placeholder that ser::qr_partition__rows/qr_partition__static_row__cells
constructors puts in the beginning. Fix it by taking starting point in a
stream before element serialization and subtracting it afterwords.

Fixes #3755

Message-Id: <20180906153609.GJ2326@scylladb.com>
2018-09-06 20:52:44 +03:00
Gleb Natapov
98092353df mutation_partition: correctly measure static row size when doing digest calculation
The code uses incorrect output stream in case only digest is requested
and thus getting incorrect data size. Failing to correctly account
for static row size while calculating digest may cause digest mismatch
between digest and data query.

Fixes #3753.

Message-Id: <20180905131219.GD2326@scylladb.com>
2018-09-06 13:09:41 +03:00
Tomasz Grabiec
4fb3f7e8eb managed_vector: Make external_memory_usage() ignore reserved space
This ensures that row::external_memory_usage() is invariant to
insertion order of cells.

It should be so, so that accounting of a clustering_row, merged from
multiple MVCC versions by the partition_snapshot_flat_reader on behalf
of a memtable flush, doesn't give a greater result than what is used
by the memtable region. Overaccounting leads to assertion failure in
~flush_memory_accounter.

Fixes #3625 (hopefully).

Message-Id: <1535982513-19922-1-git-send-email-tgrabiec@scylladb.com>
2018-09-03 17:09:54 +03:00
Botond Dénes
3bcd577907 Move reconcilable_result_builder declaration to mutation_query.hh
It will be used by code outside of mutation_partition.cc so it needs to
be public. The definition remains in mutation_partition.cc.
2018-09-03 10:31:44 +03:00
Botond Dénes
c12008b8cb querier: split querier into separate data and mutation querier types
Instead of hiding what compaction method the querier uses (and only
expose it via rejecting 'can_be_used_for_page()`) make it very explicit
that these are really two different queriers. This allows using
different indexes for the two queriers in `querier_cache` and
eliminating the possibility of picking up a querier with the wrong
compaction method (read kind).
This also makes it possible to add new querier type(s) that suit the
multishard-query's needs without making a confusing mess of `querier` by
making it a union of all querying logic.

Splitting the queriers this way changes what happens when a lookup finds
a querier of the wrong kind (e.g. emit_only_live::yes for an
emit_only_live::no command). As opposed to dropping the found (but
wrong) querier the querier will now simply not be found by the lookup.
This is a result of using separate search indexes for the different
mutation kinds. This change should have no practical implications.

Splitting is done by making querier templated on `emit_only_live_rows`.
It doesn't make sense to duplicate the entire querier as the two share
99% of the code.
2018-09-03 10:31:44 +03:00
Botond Dénes
cded477b94 querier: return std::optional<querier> instead of using create_fun()
Requiring the caller of lookup() to pass in a `create_fun()` was not
such a good idea in hindsight. It leads to awkward call sites and even
more awkward code when trying to find out whether the lookup was
successfull or not.
Returning an optional gives calling code much more flexibility and makes
the code cleaner.
2018-09-03 10:31:44 +03:00
Botond Dénes
5f726e9a89 querier: move all to query namespace
To avoid name clashes.
2018-09-03 10:31:44 +03:00
Avi Kivity
ebff1cfc37 database: make database::_mutation_query_stage inherit the scheduling group
Like the preceeding patch and for the same reasons, adjust
database::_mutation_query_stage to inherit the scheduling group from its
caller.
2018-08-24 19:04:49 +03:00
Tomasz Grabiec
024b3c9fd9 mutation_partition: Fix exception safety of row::apply_monotonically()
When emplace_back() fails, value is already moved-from into a
temporary, which breaks monotonicity expected from
apply_monotonically(). As a result, writes to that cell will be lost.

The fix is to avoid the temporary by in-place construction of
cell_and_hash. To do that, appropriate cell_and_hash constructor was
added.

Found by mutation_test.cc::test_apply_monotonically_is_monotonic with
some modifications to the random mutation generator.

Introduced in 99a3e3a.

Fixes #3678.

Message-Id: <1533816965-27328-1-git-send-email-tgrabiec@scylladb.com>
2018-08-09 15:29:10 +03:00
Tomasz Grabiec
6b1fe6cbe5 mutation_partition: Introduce set_continuity() 2018-07-17 16:30:01 +02:00
Tomasz Grabiec
c6c54021a8 mutation_partition: Fix merging to not leave the source with broader continuity on bad_alloc
When clustering keys are larger than 12.8 KiB they may get fragmented
and key comparator will need to linearize them on comparison. This may
cause lookups in the rows tree to fail with bad_alloc. Partition
version merging (mutation_partition::apply_monotonically()) was not
taking this into account. If we fail on lookup, the partition which is
being applied may be incorrectly left with the clustering range since
the begging up to the current row marked as continuous, if the current
row has the continuity flag set, because we've moved all of the
preceding rows into the target, and the correct lower bound row is no
longer there in the source. This may mark some discontinuous ranges as
continuous.

Merging is retried by allocating_section, and there will be no problem
if it eventually suceeds, original continity will be reflected in the
sum. The problem will persist if it doesn't eventually succeed, when
we're really out of memory.

To protect against this, we could reset the continuity flag of the
current row in the source when exiting on exception.

Fixes #3583
2018-07-17 16:30:01 +02:00
Tomasz Grabiec
de5c52f422 mutation_partition: Preserve continuity in case row merging with no tracker throws
Example:

 p:      row{key=A, cont=0} row{key=C, cont=1}
 this:                      row{key=C, cont=0}

When we get to processing key=C, key=A was already moved to this, so p
has stale continuity on key=C, which marks (-inf,C) as continuous,
whereas it should mark only (A, C). That's not a problem if merging
succeeds, but if exception happens at this point, we will violate the
invariant which says that the sum of p and this should yield the same
logical partition. It wouldn't because continuity of the sum is
calculated as a set union, and (-inf, A) would be incorrectly turned
into a continuous range.

This is not a problem currently because continuity is always full when
there is no tracker (memtables), so won't change anyway, and when
there is a tracker (cache) we never merge but overwrite instead, so
there is no memory allocation and thus no possibility for failure. But
better be safe.
2018-07-17 16:30:01 +02:00
Tomasz Grabiec
567da3e063 memtable, cache: Fix exception safety of partition entry insertions
boost::intrusive::set::insert() may throw if keys require
linearization and that fails, in which case we will leak the entry.

When this happens in cache, we will also violate the invariant for
entry eviction, which assumes all tracked entries are linked, and
cause a SEGFAULT.

Use the non-throwing and faster insert_before() instead. Where we
can't use insert_before(), use alloc_strategy_unique_ptr<> to ensure
that entry is deallocated on insert failure.

Fixes #3585.
2018-07-17 16:30:01 +02:00
Tomasz Grabiec
3f509ee3a2 mutation_partition: Fix exception-safety of row copy constructor
In case population of the vector throws, the vector object would not
be destroyed. It's a managed object, so in addition to causing a leak,
it would corrupt memory if later moved by the LSA, because it would
try to fixup forward references to itself.

Caused sporadic failures and crashes of row_cache_test, especially
with allocation failure injector enabled.

Introduced in 27014a23d7.
Message-Id: <1531757764-7638-1-git-send-email-tgrabiec@scylladb.com>
2018-07-17 13:21:21 +01:00
Tomasz Grabiec
074be4d4e8 memtable, cache: Run mutation_cleaner worker in its own scheduling group
The worker is responsible for merging MVCC snapshots, which is similar
to merging sstables, but in memory. The new scheduling group will be
therefore called "memory compaction".

We should run it in a separate scheduling group instead of
main/memtables, so that it doesn't disrupt writes and other system
activities. It's also nice for monitoring how much CPU time we spend
on this.
2018-06-27 21:51:04 +02:00
Tomasz Grabiec
6c6ffaee71 mutation_cleaner: Make merge() redirect old instance to the new one
If memtable snapshot goes away after memtable started merging to
cache, it would enqueue the snapshots for cleaning on the memtable's
cleaner, which will have to clean without deferrring when the memtable
is destroyed. That may stall the reactor. To avoid this, make merge()
cause the old instance of the cleaner to redirect to the new instance
(owned by cache), like we do for regions. This way the snapshots
mentioned earlier can be cleaned after memtable is destroyed,
gracefully.
2018-06-27 21:51:04 +02:00
Tomasz Grabiec
c26a304fbb mvcc: Merge partition version versions gradually in the background
When snapshots go away, typically when the last reader is destroyed,
we used to merge adjacent versions atomically. This could induce
reactor stalls if partitions were large. This is especially true for
versions created on cache update from memtables.

The solution is to allow this process to be preempted and move to the
background. mutation_cleaner keeps a linked list of such unmerged
snapshots and has a worker fiber which merges them incrementally and
asynchronously with regards to reads.

This reduces scheduling latency spikes in tests/perf_row_cache_update
for the case of large partition with many rows. For -c1 -m1G I saw
them dropping from 23ms to 2ms.
2018-06-27 12:48:30 +02:00