79 Commits

Author SHA1 Message Date
Tomasz Grabiec
78274276f5 row_cache: Use the memtable cleaner to create memtable snapshot during update
Memtable entries should be cleaned using the memtable cleaner, which,
unlike the cache's cleaner, is not associated with the cache
tracker. It's an error to clean a snapshot using a tracker which doesn't
own the entries. Doing so will corrupt the cache tracker's row counter.

Fixes failure of test_exception_safety_of_update_from_memtable from
row_cache.cc in debug mode and with allocation failure injection
enabled.

Introduced in "cache: Defer during partition merging"
(70c72773be).
Message-Id: <1528988256-20578-1-git-send-email-tgrabiec@scylladb.com>
2018-06-14 18:03:02 +03:00
Tomasz Grabiec
f775fc2e4c mvcc: Fix partition_entry::open_version()
After 70c72773be it's possible that
open_version() is called with a phase which is smaller than the phase
of the latest version, because latest version belongs to the
in-progress cache update. In such case we must return the existing
non-latest snapshot and not create a new version on top of the
in-progress update. Not doing this violates several invariants, and
may lead to inconsistencies, including violation of write atomicity or
temporary loss of writes.

partition_entry::read() was already adjusted by the aforementioned
commit. Do a similar adjustment for open_version().

Fixes sporadic failures of row_cache_test.cc::test_concurrent_reads_and_eviction
Message-Id: <1528211847-22825-1-git-send-email-tgrabiec@scylladb.com>
2018-06-05 18:22:38 +03:00
Paweł Dziepak
ec9d166a4f treewide: require type to compute cell memory usage 2018-05-31 15:51:11 +01:00
Tomasz Grabiec
5bc201df10 cache: Release dirty memory with row granularity 2018-05-30 14:41:41 +02:00
Tomasz Grabiec
70c72773be cache: Defer during partition merging 2018-05-30 14:41:41 +02:00
Tomasz Grabiec
c653137b2b mvcc: Make apply_to_incomplete() work with attached versions
Needed before making it preemptible. We cannot steal the entry since
we may need to resume merging later.
2018-05-30 14:41:40 +02:00
Tomasz Grabiec
1792be3697 cache: Propagate phase to apply_to_incomplete()
It will be needed to create snapshots with appropriate phase markers.
2018-05-30 14:41:40 +02:00
Tomasz Grabiec
494cb3f3da cache: Prepare for incremental apply_to_incomplete()
Incremental merging will be implemented by means of resumable
functions, which return stop_iteration::no when not yet
finished. We're not using futures, so that the caller can do work
around preemption points as well.
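
A minimal sketch of the resumable-function idea, under stated assumptions: `stop_iteration` is a local stand-in for the Seastar type, and `resumable_merge` with its `budget` parameter is a hypothetical helper, not code from this commit. Each call does a bounded amount of work and reports whether it has finished.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Local stand-in for seastar::stop_iteration (assumption for this sketch).
enum class stop_iteration { no, yes };

// Hypothetical resumable merge: appends src to dst, at most `budget`
// elements per call, remembering its position between calls.
class resumable_merge {
    std::vector<int>& _dst;
    const std::vector<int>& _src;
    std::size_t _pos = 0;
    std::size_t _budget;
public:
    resumable_merge(std::vector<int>& dst, const std::vector<int>& src, std::size_t budget)
        : _dst(dst), _src(src), _budget(budget) {}

    // Returns stop_iteration::no while there is still work left.
    stop_iteration operator()() {
        for (std::size_t i = 0; i < _budget && _pos < _src.size(); ++i) {
            _dst.push_back(_src[_pos++]);
        }
        return _pos == _src.size() ? stop_iteration::yes : stop_iteration::no;
    }
};
```

Because the function returns a plain value rather than a future, the caller owns the loop and can run its own code between steps, for example `while (merge() == stop_iteration::no) { /* caller's work, then yield */ }`.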
2018-05-30 14:41:40 +02:00
Tomasz Grabiec
3f19f76c67 mvcc: Destroy memtable partition versions gently
Now all snapshots will have a mutation_cleaner which they will use to
gently destroy freed partition_version objects.

Destruction of memtable entries during cache update is also using the
gentle cleaner now. We need to have a separate cleaner for memtable
objects even though they're owned by cache's region, because memtable
versions must be cleared without a cache_tracker.

Each memtable will have its own cleaner, which will be merged with the
cache's cleaner when memtable is merged into cache.

Fixes some sources of reactor stalls on cache update when there are
large partition entries in memtables.
2018-05-30 14:41:40 +02:00
Tomasz Grabiec
81d231f35b mvcc: Remove rows from tracker gently
Some partitions may have a lot of rows. Better to iterate over them
incrementally as part of clear_gently() to avoid stalls.
2018-05-30 14:41:40 +02:00
Tomasz Grabiec
e5aa02efeb mvcc: Introduce partition_version_list 2018-05-30 12:18:56 +02:00
Tomasz Grabiec
ca1ee93577 mvcc: Fix move constructor of partition_version_ref() not preserving _unique_owner
We didn't rely on that yet, it seems, but will.

(cherry picked from commit 21a744337de01f699d5c5c340483ad23cabab2ee)
2018-05-30 12:18:56 +02:00
Tomasz Grabiec
40cc766cf2 database: Add API for incremental clearing of partition entries
Partitions can get very large. Destroying them all at once can stall
the reactor for a significant amount of time. We want to avoid that by
doing destruction incrementally, deferring in between. A new API is
added for that at various levels:

  stop_iteration clear_gently() noexcept;

It returns stop_iteration::yes when the object is fully cleared and
can now be destroyed quickly. So a deferring destruction can look like
this:

  return repeat([this] { return clear_gently(); });

The reason why clear_gently() doesn't return a future<> itself is that some
contexts cannot defer, like memory reclamation.
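
As an illustration only, the contract can be sketched without Seastar. `stop_iteration` below is a local stand-in for the Seastar type, and `big_partition` and its `batch` size are hypothetical names chosen for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Local stand-in for seastar::stop_iteration (assumption for this sketch).
enum class stop_iteration { no, yes };

// Hypothetical container that frees its rows in bounded batches.
class big_partition {
    std::vector<int> _rows;
    static constexpr std::size_t batch = 128;  // illustrative batch size
public:
    explicit big_partition(std::size_t n) : _rows(n) {}

    std::size_t size() const noexcept { return _rows.size(); }

    // Frees at most `batch` rows per call. Returns stop_iteration::yes once
    // the object is empty and can be destroyed cheaply.
    stop_iteration clear_gently() noexcept {
        const std::size_t n = std::min(batch, _rows.size());
        _rows.erase(_rows.end() - static_cast<std::ptrdiff_t>(n), _rows.end());
        return _rows.empty() ? stop_iteration::yes : stop_iteration::no;
    }
};
```

A caller that can defer would yield between calls; a caller that cannot, such as memory reclamation, can simply call `clear_gently()` in a plain loop or stop after a single step.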
2018-05-30 12:18:56 +02:00
Vladimir Krivopalov
e1ee833861 Always pass mutation_partitions to partition_entry::apply()
Previously it was also possible to pass a frozen_mutation to it.
Now we de-serialize frozen mutations at the calling side.

This is a pre-requisite for collecting memtable statistics needed for
writing into the SSTables 3.0 format.

For #1969.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-04-25 14:58:47 -07:00
Tomasz Grabiec
381bf02f55 cache: Evict with row granularity
Instead of evicting whole partitions, evict whole rows.

As part of this, invalidation of partition entries was changed to not
evict from snapshots right away, but unlink them and let them be
evicted by the reclaimer.
2018-03-06 11:50:29 +01:00
Tomasz Grabiec
bee875fa7d cache: Ensure all evictable partition_versions have a dummy after all rows
Every evictable version will have a dummy entry at the end so that it can be
tracked in the LRU.

It is also needed to allow old versions to stay around (with
tombstones and static rows) after all rows are evicted. Such versions
must be fully discontinuous, and we need some entry to mark that.
2018-03-06 11:50:27 +01:00
Tomasz Grabiec
5320705300 cache: Propagate cache_tracker to places manipulating evictable entries
cache_tracker reference will be needed to link/unlink row entries.

No change of behavior in this patch.
2018-03-06 11:50:27 +01:00
Tomasz Grabiec
e571bd5a2e mvcc: Add partition_entry::versions_from_oldest() 2018-03-06 11:50:26 +01:00
Tomasz Grabiec
d9a38c1c85 mutation_partition: Add API to walk from rows_entry to cache_entry
Will be needed on row eviction, to unlink containers when they become
fully evicted.
2018-03-06 11:50:26 +01:00
Tomasz Grabiec
9893e8e5f7 mvcc: Make each version have independent continuity
This change is a preparation for introducing row-level eviction, such that entries
can be evicted from older versions without having to touch other versions.

Currently continuity flags on entries are interpreted relative to the
combined view merged from all entries. For example:

 v2:                  <key=2, cont=1>
 v1: <key=1, cont=1>

In v2, the flag on entry key=2 marks the range (1, 2) as
continuous. This is problematic because if the old version is evicted, continuity
will change in an incorrect way:

   v2:                  <key=2, cont=1>

Here, the range (-inf, 1) would be marked as continuous, which is not true.

To solve this problem, we change the rules for continuity
interpretation in MVCC. Each version will have its own continuity,
fully specified in that version, independent of continuity of other
versions. Continuity of the snapshot will be a union of continuous
ranges in each version.

It is assumed that continuous intervals in different versions are non-
overlapping, except for points corresponding to complete rows, in
which case a later version may overlap with an older version
(overwrite). We make use of this assumption to make calculation of the
union of intervals on merging easier; in particular,
mutation_partition::apply_monotonically() relies on it.

MVCC population of incomplete entries already almost maintains the
non-overlapping invariant, because population intervals correspond to
intervals which are incomplete in the old snapshot. The only change
needed is to ensure that both population bounds will have entries in
the latest version. Population from memtables doesn't mark any
intervals as continuous, so also conforms. The only change needed
there is to not inherit continuity flags from the old snapshot,
effectively making the new version internally discontinuous except for
row points.

The example from the beginning will become:

 v2: <key=1, cont=0>  <key=2, cont=1>
 v1: <key=1, cont=1>

When marking a range as continuous with some rows present only in
older versions, we need to insert entries in the latest version, so
that we can mark the range as continuous. The easiest solution is to
copy the entry from the old version. Another option would be to add
support for incomplete rows and insert such instead. This way we would
avoid duplicating row contents. This optimization is deferred.
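
The per-version rule can be illustrated with a small sketch. It is a simplified model, not the actual implementation: keys are integers, `entry`, `version`, `version_continuity()` and `snapshot_continuity()` are hypothetical names, and a set flag on an entry is taken to mean that the open range between the previous entry of the same version (or -inf) and this entry is continuous.

```cpp
#include <cassert>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical, simplified entry: a key plus a continuity flag.
struct entry {
    int key;
    bool continuous;
};

using version = std::vector<entry>;      // entries sorted by key
using interval = std::pair<long, long>;  // open interval (first, second)

constexpr long minus_inf = std::numeric_limits<long>::min();

// Continuous open intervals of one version, derived from that version alone.
std::vector<interval> version_continuity(const version& v) {
    std::vector<interval> out;
    long prev = minus_inf;
    for (const entry& e : v) {
        if (e.continuous) {
            out.emplace_back(prev, e.key);
        }
        prev = e.key;
    }
    return out;
}

// Snapshot continuity as the union of per-version intervals. The sketch
// only concatenates them, relying on the stated assumption that intervals
// from different versions do not overlap.
std::vector<interval> snapshot_continuity(const std::vector<version>& versions) {
    std::vector<interval> out;
    for (const version& v : versions) {
        const auto c = version_continuity(v);
        out.insert(out.end(), c.begin(), c.end());
    }
    return out;
}
```

In this model, with v1 = {<key=1, cont=1>} and v2 = {<key=1, cont=0>, <key=2, cont=1>}, the snapshot is continuous on (-inf, 1) and (1, 2). Dropping v1 leaves only (1, 2), so no range becomes continuous by accident.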
2018-03-06 11:50:25 +01:00
Tomasz Grabiec
2f956499a7 mvcc: Drop unused _evictable flag from partition_version_ref 2018-03-06 11:32:09 +01:00
Paweł Dziepak
6b66e4833b mvcc: avoid ubsan warning about uninitialised boolean
Message-Id: <20180223160133.21383-1-pdziepak@scylladb.com>
2018-02-23 16:54:23 +00:00
Tomasz Grabiec
b0b57b8143 mvcc: Do not move unevictable snapshots to cache
Commit 6ccd317 introduced a bug in partition_entry::evict() where a
partition entry may be partially evicted if there are non-evictable
snapshots in it. Partially evicting some of the versions may violate
consistency of a snapshot which includes evicted versions. For one,
continuity flags are interpreted relative to the merged view, not
within a version, so evicting from some of the versions may mark
ranges as continuous when before they were discontinuous. Also, range
tombstones of the snapshot are taken from all versions, so we can't
partially evict some of them without marking all affected ranges as
discontinuous.

The fix is to revert to full eviction, and avoid moving
non-evictable snapshots to cache. When moving whole partition entry to
cache, we first create a neutral empty partition entry and then merge
the memtable entry into it just like we would if the entry already
existed.

Fixes #3215.

Tests: unit (release)
Message-Id: <1518710592-21925-2-git-send-email-tgrabiec@scylladb.com>
2018-02-15 16:48:07 +00:00
Tomasz Grabiec
27b114fe45 cache: Handle exceptions from make_evictable()
cache_entry constructor was marked noexcept, yet make_evictable() may
fail in rare cases due to allocation in add_version(). Lift the
annotation and make sure that construction has strong exception
guarantees for the moved-in state so that it can be retried without
data loss inside allocating section.
2018-02-14 16:42:49 +01:00
Avi Kivity
404172652e Merge "Use xxHash for digest instead of MD5" from Duarte
"This series changes digest calculation to use a faster algorithm
(xxHash) and to also cache calculated cell hashes that can be kept in
memory to speed up subsequent digest requests.

The MD5 hash function has proved to be slow for large cell values:

size = 256; elapsed = 4us
size = 512; elapsed = 8us
size = 1024; elapsed = 14us
size = 2048; elapsed = 21us
size = 4096; elapsed = 33us
size = 8192; elapsed = 51us
size = 16384; elapsed = 86us
size = 32768; elapsed = 150us
size = 65536; elapsed = 278us
size = 131072; elapsed = 531us
size = 262144; elapsed = 1032us
size = 524288; elapsed = 2026us
size = 1048576; elapsed = 4004us
size = 2097152; elapsed = 7943us
size = 4194304; elapsed = 15800us
size = 8388608; elapsed = 31731us
size = 16777216; elapsed = 64681us
size = 33554432; elapsed = 130752us
size = 67108864; elapsed = 263154us

The xxHash is a non-cryptographic, 64bit (there's work in progress on
the 128 version) hash that can be used to replace MD5. It performs much
better:

size = 256; elapsed = 2us
size = 512; elapsed = 1us
size = 1024; elapsed = 1us
size = 2048; elapsed = 2us
size = 4096; elapsed = 2us
size = 8192; elapsed = 3us
size = 16384; elapsed = 5us
size = 32768; elapsed = 8us
size = 65536; elapsed = 14us
size = 131072; elapsed = 28us
size = 262144; elapsed = 59us
size = 524288; elapsed = 116us
size = 1048576; elapsed = 226us
size = 2097152; elapsed = 456us
size = 4194304; elapsed = 935us
size = 8388608; elapsed = 1848us
size = 16777216; elapsed = 4723us
size = 33554432; elapsed = 10507us
size = 67108864; elapsed = 21622us

Performance was tested using a 3 node cluster with 1 cpu and 8GB,
and with the following cassandra-stress loaders. Measurements are for
the read workload.

sudo taskset -c 4-15 ./cassandra-stress write cl=ALL n=5000000 -schema 'replication(factor=3)' -col 'size=FIXED(1024) n=FIXED(4)' -mode native cql3 -rate threads=100
sudo taskset -c 4-15 ./cassandra-stress mixed cl=ALL 'ratio(read=1)' n=10000000 -pop 'dist=gauss(1..5000000,5000000,500000)' -col 'size=FIXED(1024) n=FIXED(4)' -mode native cql3 -rate threads=100

xxhash + caching:

Results:
op rate                   : 32699 [READ:32699]
partition rate            : 32699 [READ:32699]
row rate                  : 32699 [READ:32699]
latency mean              : 3.0 [READ:3.0]
latency median            : 3.0 [READ:3.0]
latency 95th percentile   : 3.9 [READ:3.9]
latency 99th percentile   : 4.5 [READ:4.5]
latency 99.9th percentile : 6.6 [READ:6.6]
latency max               : 24.0 [READ:24.0]
Total partitions          : 10000000 [READ:10000000]
Total errors              : 0 [READ:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:05:05
END

md5:

Results:
op rate                   : 25241 [READ:25241]
partition rate            : 25241 [READ:25241]
row rate                  : 25241 [READ:25241]
latency mean              : 3.9 [READ:3.9]
latency median            : 3.9 [READ:3.9]
latency 95th percentile   : 5.1 [READ:5.1]
latency 99th percentile   : 5.8 [READ:5.8]
latency 99.9th percentile : 8.0 [READ:8.0]
latency max               : 24.8 [READ:24.8]
Total partitions          : 10000000 [READ:10000000]
Total errors              : 0 [READ:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:06:36
END

This translates into a 21% improvement for this workload.

Bigger cell values were also tested:

sudo taskset -c 4-15 ./cassandra-stress write cl=ALL n=1000000 -schema 'replication(factor=3)' -col 'size=FIXED(4096) n=FIXED(4)' -mode native cql3 -rate threads=100
sudo taskset -c 4-15 ./cassandra-stress mixed cl=ALL 'ratio(read=1)' n=10000000 -pop 'dist=gauss(1..1000000,500000,100000)' -col 'size=FIXED(4096) n=FIXED(4)' -mode native cql3 -rate threads=100

xxhash + caching:

Results:
op rate                   : 19964 [READ:19964]
partition rate            : 19964 [READ:19964]
row rate                  : 19964 [READ:19964]
latency mean              : 4.9 [READ:4.9]
latency median            : 4.6 [READ:4.6]
latency 95th percentile   : 7.2 [READ:7.2]
latency 99th percentile   : 11.5 [READ:11.5]
latency 99.9th percentile : 13.6 [READ:13.6]
latency max               : 29.2 [READ:29.2]
Total partitions          : 10000000 [READ:10000000]
Total errors              : 0 [READ:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:08:20
END

md5:

Results:
op rate                   : 12773 [READ:12773]
partition rate            : 12773 [READ:12773]
row rate                  : 12773 [READ:12773]
latency mean              : 7.7 [READ:7.7]
latency median            : 7.3 [READ:7.3]
latency 95th percentile   : 10.2 [READ:10.2]
latency 99th percentile   : 16.8 [READ:16.8]
latency 99.9th percentile : 19.2 [READ:19.2]
latency max               : 71.5 [READ:71.5]
Total partitions          : 10000000 [READ:10000000]
Total errors              : 0 [READ:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:13:02
END

This translates into a 37% improvement for this workload.

Fixes #2884

Tests: unit-tests (release), dtests (smp=2)

Note: dtests are kinda broken in master (> 30 failures), so take the
tests tag with a grain of himalayan salt."

* 'xxhash/v5' of https://github.com/duarten/scylla: (29 commits)
  tests/row_cache_test: Test hash caching
  tests/memtable_test: Test hash caching
  tests/mutation_test: Use xxHash instead of MD5 for some tests
  tests/mutation_test: Test xx_hasher alongside md5_hasher
  schema: Remove unneeded include
  service/storage_proxy: Enable hash caching
  service/storage_service: Add and use xxhash feature
  message/messaging_service: Specify algorithm when requesting digest
  storage_proxy: Extract decision about digest algorithm to use
  cache_flat_mutation_reader: Pre-calculate cell hash
  partition_snapshot_reader: Pre-calculate cell hash
  query::partition_slice: Add option to specify when digest is requested
  row: Use cached hash for hash calculation
  mutation_partition: Replace hash_row_slice with appending_hash
  mutation_partition: Allow caching cell hashes
  mutation_partition: Force vector_storage internal storage size
  test.py: Increase memory for row_cache_stress_test
  atomic_cell_hash: Add specialization for atomic_cell_or_collection
  query-result: Use digester instead of md5_hasher
  range_tombstone: Replace feed_hash() member function with appending_hash
  ...
2018-02-08 18:24:58 +02:00
Tomasz Grabiec
06b7b54c3d mvcc: Take partition_entry by const ref in operator<<()
Some users will only have const&.
2018-02-06 14:24:19 +01:00
Tomasz Grabiec
50f5bee12e mvcc: Do not evict from non-evictable snapshots
When moving whole partition entries from memtable to cache, we move
snapshots as well. It is incorrect to evict from such snapshots
though, because associated readers would miss data.

The solution is to record evictability of partition version references (snapshots)
and avoid eviction from non-evictable snapshots.

Could affect scanning reads, if the reader uses partition entry from
memtable, and the partition is too large to fit in reader's buffer,
and that entry gets moved to cache (was absent in cache), and then
gets evicted (memory pressure). The reader will not see the remainder
of that entry.

Introduced in ca8e3c4, so affects 2.1+

Fixes #3186.
2018-02-06 14:24:19 +01:00
Tomasz Grabiec
d899ae0f02 mvcc: Encapsulate construction of evictable entries
Internal invariants of MVCC are better preserved by partition_entry
methods, so move construction of partition entries out of cache_entry
constructors.
2018-02-05 17:54:03 +01:00
Duarte Nunes
712c051de6 cache_flat_mutation_reader: Pre-calculate cell hash
When digest is requested, pre-calculate the cell's hash. We consider
the case when the cell is already in the cache, and the case when it
is added by the underlying reader.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 01:02:50 +00:00
Duarte Nunes
ec5b7fb553 partition_snapshot_reader: Pre-calculate cell hash
When digest is requested, pre-calculate the cell's hash. A downside of
this approach is that more work will be done when there are multiple
versions of a row that contain values for the same cell, but we expect
these cases to be rare and the upside of caching a cell's hash to
compensate for the extra work.
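
A rough sketch of the caching idea, under loud assumptions: `cell`, `cell_hash()` and the `computations` counter are hypothetical, and `std::hash` stands in for xxHash so that the example needs no external library. The hash is computed on first request and reused afterwards.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <optional>
#include <string>

// Hypothetical cell that remembers its hash once computed.
struct cell {
    std::string value;
    mutable std::optional<std::uint64_t> cached_hash;
    mutable int computations = 0;  // counts real hash computations, for illustration
};

// Returns the cell's hash, computing it only on the first request.
// std::hash is a stand-in here; it is not the algorithm used by the series.
std::uint64_t cell_hash(const cell& c) {
    if (!c.cached_hash) {
        c.cached_hash = std::hash<std::string>{}(c.value);
        ++c.computations;
    }
    return *c.cached_hash;
}
```

The trade-off described above shows up here as well: a cached hash costs extra memory per cell and is wasted if the cell is never digested again, but repeated digest requests skip the hashing work.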

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 01:02:50 +00:00
Piotr Jastrzebski
96c97ad1db Rename streamed_mutation* files to mutation_fragment*
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:49 +01:00
Tomasz Grabiec
88aff526df mvcc: Extract version merging functions 2018-01-18 11:32:49 +01:00
Duarte Nunes
16c975edcc partition_version: Return static_row fragment from static_row()
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180109162815.5811-1-duarte@scylladb.com>
2018-01-09 19:17:02 +01:00
Tomasz Grabiec
12704fd679 mvcc: Propagate region reference to partition_entry::apply_to_incomplete() 2017-12-08 17:50:48 +01:00
Tomasz Grabiec
a6e083ef6f mvcc: Add const-qualified partition_version_ref::operator*() 2017-12-08 17:50:48 +01:00
Tomasz Grabiec
b26ce36d4b mvcc: Introduce partition_snapshot::static_row_continuous() 2017-12-08 17:50:47 +01:00
Tomasz Grabiec
c283744fcb mvcc: Introduce partition_snapshot::range_tombstones() for full range 2017-12-08 17:50:47 +01:00
Tomasz Grabiec
df964c70f8 mvcc: Don't require external schema in partition_snapshot::range_tombstones() 2017-12-08 17:50:47 +01:00
Tomasz Grabiec
49c0705409 mvcc: partition_version: Use apply_monotonically() to provide atomicity
This patch drops the use of apply_reversibly(). We move the mutation
to be applied into a new version and then use apply_monotonically() to
merge it (if no snapshot) with the current version. This guarantees
that apply() is atomic even if apply_monotonically() throws.
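
One way to picture why this ordering gives atomicity is the toy model below. Everything in it is an assumption made for illustration: `partition` is a plain map, `partition_entry_sketch` is hypothetical, the direction of the merge and the `fail_after` failure injection are invented, and none of it mirrors the real data structures. The point is only that each entry is always reachable from some version, so a failed merge loses nothing.

```cpp
#include <cassert>
#include <iterator>
#include <list>
#include <map>
#include <stdexcept>
#include <utility>

using partition = std::map<int, int>;  // key -> value, toy stand-in

// Moves entries from src into dst one at a time. An entry is written to dst
// before it is erased from src, so a failure never drops data.
// fail_after injects an exception after that many entries (hypothetical).
void apply_monotonically(partition& dst, partition& src, int fail_after = -1) {
    int moved = 0;
    while (!src.empty()) {
        if (moved == fail_after) {
            throw std::runtime_error("injected failure");
        }
        auto it = src.begin();
        dst[it->first] = it->second;
        src.erase(it);
        ++moved;
    }
}

struct partition_entry_sketch {
    std::list<partition> versions;  // newest first

    void apply(partition m, int fail_after = -1) {
        // After this line the whole write is visible to read().
        versions.push_front(std::move(m));
        if (versions.size() == 1) {
            return;
        }
        auto newest = versions.begin();
        auto older = std::next(newest);
        try {
            apply_monotonically(*older, *newest, fail_after);
            versions.erase(newest);
        } catch (const std::exception&) {
            // Merge interrupted: keep both versions, nothing is lost.
        }
    }

    // Combined view: older versions first, newer ones overwrite.
    partition read() const {
        partition result;
        for (auto v = versions.rbegin(); v != versions.rend(); ++v) {
            for (const auto& kv : *v) {
                result[kv.first] = kv.second;
            }
        }
        return result;
    }
};
```

In this model a merge that fails halfway leaves the write split across two versions, yet `read()` still returns all of it.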

Fixes #2012.
2017-11-28 12:38:28 +01:00
Tomasz Grabiec
52cabe343c mvcc: Extract partition_entry::add_version() 2017-11-28 12:38:27 +01:00
Glauber Costa
c2f49da609 partition: add method to calculate memory size of a partition
Once that is added, also add a method to a memtable entry to calculate
the entire size of a memtable entry. Right now we only have one method
to calculate the size minus rows.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2017-11-08 16:21:44 -05:00
Tomasz Grabiec
967cabcaf2 mvcc: Make the null state of partition_snapshot::change_mark explicit 2017-11-02 11:05:19 +01:00
Tomasz Grabiec
4b7933543d mvcc: Add partition_snapshot::region() getter 2017-11-02 11:05:19 +01:00
Tomasz Grabiec
9cf30f19ae mvcc: Add partition_snapshot::schema() getter 2017-11-02 11:05:19 +01:00
Tomasz Grabiec
b6ae5783cd mvcc: Introduce partition_entry::evict()
The operation frees as much memory as possible, marking affected
mutation elements as discontinuous.
2017-09-13 17:47:03 +02:00
Tomasz Grabiec
cda86abdbc mvcc: Encapsulate reference stability check in partition_snapshot 2017-09-13 17:38:08 +02:00
Tomasz Grabiec
2df6f356b1 mvcc: Store LSA region reference in partition_snapshot
Will be useful for improving encapsulation.
2017-09-13 17:38:08 +02:00
Piotr Jastrzebski
896bf2e5de Remove unused methods from MVCC
Some apply methods were replaced by apply_to_incomplete().

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
6ebfb730ee partition_entry: Introduce partition_tombstone() getter 2017-06-24 18:06:11 +02:00
Piotr Jastrzebski
b680de930c partition_entry: Introduce apply_to_incomplete()
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>

[tgrabiec:
  - extracted from a larger commit
  - fix heap comparator in apply_incomplete_target to order versions properly
  - extracted partition_version detaching into
    partition_entry::with_detached_versions()
  - dropped unnecessary rows_iterator::_version field
  - dropped unnecessary allocation of rows_entry and key copies
    in rows_iterator
  - dropped row_pointer
  - replaced apply_reversibly() with weaker and faster apply()
  - added handling of dummy entries at any position
  - fixed exception safety issue in apply_to_incomplete() which may
    result in data loss. We cannot move data out of applied versions
    into a new synthetic row and then apply it, because if exception
    happens in the middle, the data which was moved from the source
    will be lost. To fix that, row_iterator::consume_row() is
    introduced which allows in-place consumption of data without
    construction of temporary deletable_row.
  ]
2017-06-24 18:06:11 +02:00