Commit Graph

175 Commits

Author SHA1 Message Date
Avi Kivity
f70ece9f88 tests: convert sprint() to format()
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().

Mechanically converted with https://github.com/avikivity/unsprint.
2018-11-01 13:16:17 +00:00
Botond Dénes
eb357a385d flat_mutation_reader: make timeout opt-out rather than opt-in
Currently timeout is opt-in, that is, all methods that even have it
default it to `db::no_timeout`. This means that ensuring timeout is used
where it should be is completely up to the author and the reviewers of
the code. As humans are notoriously prone to mistakes this has resulted
in a very inconsistent usage of timeout, many clients of
`flat_mutation_reader` passing the timeout only to some members and only
on certain call sites. This is small wonder considering that some core
operations like `operator()()` only recently received a timeout
parameter and others like `peek()` didn't even have one until this
patch. Both of these methods call `fill_buffer()` which potentially
talks to the lower layers and is supposed to propagate the timeout.
All this makes the `flat_mutation_reader`'s timeout effectively useless.

To bring order to this chaos, make the timeout parameter a mandatory one
on all `flat_mutation_reader` methods that need it. This ensures that
humans now get a reminder from the compiler when they forget to pass the
timeout. Clients can still opt out of passing a timeout by passing
`db::no_timeout` (the previous default value), but this will now be
explicit and developers should think before typing it.

There were surprisingly few core call sites to fix up. Where a timeout
was available nearby I propagated it to be able to pass it to the
reader, where I couldn't I passed `db::no_timeout`. Authors of the
latter kind of code (view, streaming and repair are some of the notable
examples) should maybe consider propagating down a timeout if needed.
In the test code (the vast majority of the changes) I just used
`db::no_timeout` everywhere.

Tests: unit(release, debug)

Signed-off-by: Botond Dénes <bdenes@scylladb.com>

Message-Id: <1edc10802d5eb23de8af28c9f48b8d3be0f1a468.1536744563.git.bdenes@scylladb.com>
2018-09-20 11:31:24 +02:00
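The opt-out design described in eb357a385d can be illustrated with a minimal sketch. It is not the real `flat_mutation_reader` interface: the `sketch` namespace, the `reader` class, `timed_out`, and the `no_timeout` constant (a stand-in for `db::no_timeout`) are all hypothetical names. The point it tries to show is that a parameter without a default value turns a forgotten timeout into a compile error.

```cpp
#include <chrono>
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

namespace sketch {

using clock_type = std::chrono::steady_clock;
using timeout_point = clock_type::time_point;

// Hypothetical stand-in for db::no_timeout: a deadline that never expires.
constexpr timeout_point no_timeout = timeout_point::max();

struct timed_out : std::runtime_error {
    timed_out() : std::runtime_error("read timed out") {}
};

class reader {
    std::vector<int> _items;
    std::size_t _pos = 0;

    static void check(timeout_point timeout) {
        if (clock_type::now() >= timeout) {
            throw timed_out();
        }
    }
public:
    explicit reader(std::vector<int> items) : _items(std::move(items)) {}

    // No default argument: every call site has to state a deadline,
    // even if that deadline is the explicit no_timeout.
    const int* peek(timeout_point timeout) const {
        check(timeout);
        return _pos < _items.size() ? &_items[_pos] : nullptr;
    }

    // Returns the next item and consumes it, or nullptr at end of stream.
    const int* next(timeout_point timeout) {
        const int* item = peek(timeout);
        if (item) {
            ++_pos;
        }
        return item;
    }
};

} // namespace sketch
```

With this shape, `r.next()` does not compile, while `r.next(sketch::no_timeout)` does and documents the decision at the call site.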
Tomasz Grabiec
9a0548397c tests: row_cache: Add test for eviction from invalidated partitions
Message-Id: <1531933216-28026-1-git-send-email-tgrabiec@scylladb.com>
2018-07-18 21:06:36 +03:00
Tomasz Grabiec
1de5177175 tests: row_cache: Fix use-after-scope on partition_range passed to readers
The partition_range must outlive the reader.

Message-Id: <1531301583-15476-1-git-send-email-tgrabiec@scylladb.com>
2018-07-11 12:39:30 +03:00
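The lifetime rule behind 1de5177175 can be sketched in a few lines. The types below are hypothetical simplifications, not Scylla's `partition_range` or its readers; the sketch only assumes that the reader keeps a reference to the range rather than a copy, and shows one possible way to keep the range alive for as long as the reader.

```cpp
#include <memory>
#include <utility>

namespace sketch {

struct range {
    int start;
    int end; // exclusive
};

// The reader stores a reference, so the range has to outlive it.
class reader {
    const range& _range;
    int _pos;
public:
    explicit reader(const range& r) : _range(r), _pos(r.start) {}

    bool next(int& out) {
        if (_pos >= _range.end) {
            return false;
        }
        out = _pos++;
        return true;
    }
};

// Use-after-scope (left commented out on purpose): `r` dies at the closing
// brace, so the returned reader would hold a dangling reference.
//
//   reader make_dangling() { range r{0, 3}; return reader(r); }

// One possible fix: keep the range on the heap next to the reader. Members
// are initialized in declaration order, so `r` exists before `rd` binds to it,
// and moving the holder does not move the heap-allocated range.
struct owning_reader {
    std::unique_ptr<range> r;
    reader rd;

    explicit owning_reader(range value) : r(new range(value)), rd(*r) {}
};

} // namespace sketch
```

In test code the simpler fix is usually to declare the range in the enclosing scope, before the reader, so that it is destroyed after the reader.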
Tomasz Grabiec
a91974af7a tests: row_cache: Reduce concurrency limit to avoid bad_alloc
The test uses random mutations. We saw it failing with bad_alloc from time to time.
Reduce concurrency to reduce memory footprint.

Message-Id: <20180611090304.16681-1-tgrabiec@scylladb.com>
2018-06-11 10:06:56 +01:00
Tomasz Grabiec
9975135110 row_cache: Make sure reader makes forward progress after each fill_buffer()
If reader's buffer is small enough, or preemption happens often
enough, fill_buffer() may not make enough progress to advance
_lower_bound. If iterators are also constantly invalidated across
fill_buffer() calls, the reader will not be able to make progress.

See row_cache_test.cc::test_reading_progress_with_small_buffer_and_invalidation()
for an example scenario.

Also reproduced in debug-mode row_cache_test.cc::test_concurrent_reads_and_eviction

Message-Id: <1528283957-16696-1-git-send-email-tgrabiec@scylladb.com>
2018-06-06 16:01:52 +03:00
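The livelock described in 9975135110 can be modelled with a toy reader. This is a sketch under stated assumptions, not the actual row_cache fix: the `reader` class, its `std::set` backing store, and the `need_preempt` callback are hypothetical. It assumes iterators are invalid across calls, so each `fill_buffer()` restarts from `_lower_bound`, and it only honors preemption after at least one row has been emitted.

```cpp
#include <limits>
#include <set>
#include <vector>

namespace sketch {

class reader {
    const std::set<int>& _rows; // stand-in for cached rows; must outlive the reader
    int _lower_bound = std::numeric_limits<int>::min(); // first key not yet emitted
    std::vector<int> _buffer;
public:
    explicit reader(const std::set<int>& rows) : _rows(rows) {}

    // Emits rows until need_preempt() asks to yield. Returns true when
    // rows remain to be read by a later call.
    template <typename Preempt>
    bool fill_buffer(Preempt need_preempt) {
        // Iterators are assumed invalid across calls: look the position up again.
        auto it = _rows.lower_bound(_lower_bound);
        bool progressed = false;
        while (it != _rows.end()) {
            // Yield only after at least one row was emitted, so that
            // _lower_bound always advances between calls.
            if (progressed && need_preempt()) {
                break;
            }
            _buffer.push_back(*it);
            _lower_bound = *it + 1; // keys are assumed to be well below INT_MAX
            progressed = true;
            ++it;
        }
        return it != _rows.end();
    }

    const std::vector<int>& buffer() const { return _buffer; }
};

} // namespace sketch
```

Even with a callback that always requests preemption, each call emits exactly one row, so a set of N rows is drained in N calls instead of never.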
Avi Kivity
aab6b0ee27 Merge "Introduce new in-memory representation for cells" from Paweł
"
This is the first part of the first step of switching Scylla. It covers
converting cells to the new serialisation format. The actual structure
of the cells doesn't differ much from the original one with a notable
exception of the fact that large values are now fragmented and
linearisation needs to be explicit. Counters and collections still
partially rely on their old, custom serialisation code and their
handling is not optimal (although not significantly worse than it used
to be).

The new in-memory representation allows objects to be of varying size
and makes it possible to provide deserialisation context so that we
don't need to keep in each instance of an IMR type all the information
needed to interpret it. The structure of IMR types is described in C++
using some metaprogramming with the hopes of making it much easier to
modify the serialisation format than it would be in case of open-coded
serialisation functions.

Moreover, IMR types can own memory thanks to a limited support for
destructors and movers (the latter are not exactly the same thing as C++
move constructors hence a different name). This makes it (relatively)
easy to ensure that there is an upper bound on the size of all allocations.

For now the only thing that is converted to the IMR are atomic_cells
and collections which means that the reduction in the memory footprint
is not as big as it can be, but introducing the IMR is a big step on its
own and also paves the way towards complete elimination of unbounded
memory allocations.

The first part of this patchset contains miscellaneous preparatory
changes to various parts of the Scylla codebase. They are followed by
introduction of the IMR infrastructure. Then structure of cells is
defined and all helper functions are implemented. Next are several
treewide patches that mostly deal with propagating type information to
the cell-related operations. Finally, atomic_cell and collections are
switched to use the new IMR-based cell implementation.

The IMR is described in much more detail in imr/IMR.md added in "imr:
add IMR documentation".

Refs #2031.
Refs #2409.

perf_simple_query -c4, medians of 30 results:

        ./perf_base  ./perf_imr   diff
 read     308790.08   309775.35   0.3%
 write    402127.32   417729.18   3.9%

The same with 1 byte values:
        ./perf_base1  ./perf_imr1   diff
 read      314107.26    314648.96   0.2%
 write     463801.40    433255.96  -6.6%

The memory footprint is reduced, but that is partially due to removal of
small buffer optimisation (whether it will be restored depends on the
exact measurements of the performance impact). Generally, this series was
not expected to make a huge difference as this would require converting
whole rows to the IMR.

Memory footprint:
Before:
mutation footprint:
 - in cache: 1264
 - in memtable: 986

After:
mutation footprint:
 - in cache: 1104
 - in memtable: 866

Tests: unit (release, debug)
"

* tag 'imr-cells/v3' of https://github.com/pdziepak/scylla: (37 commits)
  tests/mutation: add test for changing column type
  atomic_cell: switch to new IMR-based cell representation
  atomic_cell: explicitly state when atomic_cell is a collection member
  treewide: require type for creating collection_mutation_view
  treewide: require type for comparing cells
  atomic_cell: introduce fragmented buffer value interface
  treewide: require type to compute cell memory usage
  treewide: require type to copy atomic_cell
  treewide: require type info for copying atomic_cell_or_collection
  treewide: require type for creating atomic_cell
  atomic_cell: require column_definition for creating atomic_cell views
  tests: test imr representation of cells
  types: provide information for IMR
  data: introduce cell
  data: introduce type_info
  imr/utils: add imr object holder
  imr: introduce concepts
  imr: add helper for allocating objects
  imr: allow creating lsa migrators for IMR objects
  imr: introduce placeholders
  ...
2018-05-31 19:21:15 +03:00
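The merge message for aab6b0ee27 mentions that large values are now fragmented and that linearisation has to be explicit. The sketch below illustrates that idea only; it is not the IMR API documented in imr/IMR.md, and `fragmented_value`, `max_fragment`, and `linearize()` are hypothetical names. Each fragment is a separate allocation of bounded size, and a contiguous copy is made only when the caller asks for one.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

namespace sketch {

class fragmented_value {
    std::vector<std::string> _fragments; // each holds at most max_fragment bytes
public:
    fragmented_value(const std::string& data, std::size_t max_fragment) {
        assert(max_fragment > 0);
        for (std::size_t i = 0; i < data.size(); i += max_fragment) {
            _fragments.push_back(data.substr(i, max_fragment));
        }
    }

    std::size_t fragment_count() const { return _fragments.size(); }

    // Upper bound on the size of any single fragment allocation.
    std::size_t largest_fragment() const {
        std::size_t largest = 0;
        for (const auto& f : _fragments) {
            largest = std::max(largest, f.size());
        }
        return largest;
    }

    // Explicit linearisation: the only place where one large contiguous
    // buffer is allocated, and only because the caller requested it.
    std::string linearize() const {
        std::string out;
        for (const auto& f : _fragments) {
            out += f;
        }
        return out;
    }
};

} // namespace sketch
```

Code that can work fragment by fragment (hashing, comparing, writing to disk) never needs to call `linearize()`, which is what keeps allocation sizes bounded.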
Tomasz Grabiec
b5e42bc6a0 tests: row_cache: Do not hang when only one of the readers throws
Message-Id: <20180531122729.3314-1-tgrabiec@scylladb.com>
2018-05-31 18:00:22 +03:00
Paweł Dziepak
27014a23d7 treewide: require type info for copying atomic_cell_or_collection
2018-05-31 15:51:11 +01:00
Tomasz Grabiec
f6e21accc7 tests: cache: Take into account that update() may defer
The test incorrectly assumed that once update() is started the
cache will return only versions from last_generation. This will not
hold once we start to defer during partition merging.
2018-05-30 14:41:40 +02:00
Tomasz Grabiec
f0c1edd672 cache: Destroy partition versions incrementally
Instead of destroying whole partition_versions at once, we will do that
gently using mutation_cleaner to avoid reactor stalls.

Large deletions could happen when large partition gets invalidated,
upgraded to a new schema, or when it's abandoned by a detached snapshot.

Refs #3289.
2018-05-30 14:41:40 +02:00
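The incremental destruction described in f0c1edd672 can be sketched as a deferred-cleanup queue. This is an illustration under assumptions, not the real `mutation_cleaner`: the `cleaner` class, the `row` type with its live-object counter, and the per-call `budget` are hypothetical. The idea shown is that garbage is queued and destroyed a bounded amount at a time, so that no single call runs long enough to stall the reactor.

```cpp
#include <cstddef>
#include <deque>
#include <memory>
#include <utility>

namespace sketch {

// Toy object that counts live instances so destruction can be observed.
struct row {
    static int live;
    row() { ++live; }
    ~row() { --live; }
    row(const row&) = delete;
    row& operator=(const row&) = delete;
};
int row::live = 0;

class cleaner {
    std::deque<std::unique_ptr<row>> _pending;
public:
    // Hand an object over for destruction later instead of destroying it now.
    void defer(std::unique_ptr<row> r) { _pending.push_back(std::move(r)); }

    // Destroys at most `budget` objects. Returns true when work remains,
    // in which case the caller is expected to yield and call again.
    bool clean_some(std::size_t budget) {
        for (std::size_t i = 0; i < budget && !_pending.empty(); ++i) {
            _pending.pop_front();
        }
        return !_pending.empty();
    }
};

} // namespace sketch
```

A caller would typically loop on `clean_some()` from a background task, yielding to the scheduler between calls.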
Avi Kivity
7161244130 Merge seastar upstream
* seastar 70aecca...ac02df7 (5):
  > Merge "Prefix preprocessor definitions" from Jesse
  > cmake: Do not enable warnings transitively
  > posix: prevent unused variable warning
  > build: Adjust DPDK options to fix compilation
  > io_scheduler: adjust property names

DEBUG, DEFAULT_ALLOCATOR, and HAVE_LZ4_COMPRESS_DEFAULT macro
references prefixed with SEASTAR_. Some may need to become
Scylla macros.
2018-04-29 11:03:21 +03:00
Tomasz Grabiec
180a877db3 tests: cache: Add tests for row-level eviction
2018-03-07 16:52:59 +01:00
Tomasz Grabiec
9fab5068c6 tests: cache: Check that data is evictable after schema change
2018-03-07 16:52:59 +01:00
Tomasz Grabiec
f0e0c79a70 tests: cache: Move definitions to the top
2018-03-07 16:52:59 +01:00
Tomasz Grabiec
da901b93fc cache: Track number of rows and row invalidations
2018-03-06 11:50:29 +01:00
Tomasz Grabiec
381bf02f55 cache: Evict with row granularity
Instead of evicting whole partitions, evict whole rows.

As part of this, invalidation of partition entries was changed to not
evict from snapshots right away, but unlink them and let them be
evicted by the reclaimer.
2018-03-06 11:50:29 +01:00
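Row-granularity eviction as described in 381bf02f55 can be modelled with a single LRU list whose entries are rows rather than partitions. The sketch is a simplification with hypothetical names (`cache`, `touch`, `evict_one`) and integer keys; it does not model snapshots or the reclaimer. It also shows why a reader has to touch rows, not just be created, for the LRU order to change.

```cpp
#include <list>
#include <map>
#include <utility>

namespace sketch {

class cache {
    using key = std::pair<int, int>; // (partition, row)
    std::list<key> _lru;             // front = least recently used row
    std::map<key, std::list<key>::iterator> _rows;
public:
    // Inserts a row as most recently used; re-inserting just touches it.
    void insert(int partition, int row) {
        key k(partition, row);
        if (_rows.count(k)) {
            touch(partition, row);
            return;
        }
        _lru.push_back(k);
        _rows[k] = std::prev(_lru.end());
    }

    // Marks a row as recently used by moving it to the back of the list.
    void touch(int partition, int row) {
        auto it = _rows.find(key(partition, row));
        if (it != _rows.end()) {
            _lru.splice(_lru.end(), _lru, it->second);
        }
    }

    // Evicts exactly one row, the least recently used one, whichever
    // partition it belongs to. Returns false when the cache is empty.
    bool evict_one() {
        if (_lru.empty()) {
            return false;
        }
        _rows.erase(_lru.front());
        _lru.pop_front();
        return true;
    }

    bool contains(int partition, int row) const {
        return _rows.count(key(partition, row)) != 0;
    }
};

} // namespace sketch
```

After one eviction a partition can be left partially cached, which is why tests written against this behavior should not assume a particular eviction granularity.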
Tomasz Grabiec
f2bdac2874 tests: cache: Do not depend on particular granularity of eviction
2018-03-06 11:50:28 +01:00
Tomasz Grabiec
c306c1050e tests: cache: Make sure readers touch rows in test_eviction()
With row-level eviction just creating a reader won't necessarily
update the LRU.
2018-03-06 11:50:28 +01:00
Tomasz Grabiec
fb2107416b tests: cache: Invoke partial eviction in test_concurrent_reads_and_eviction
In hope of catching more issues.
2018-03-06 11:50:27 +01:00
Tomasz Grabiec
bd1e730053 tests: cache: Add test for merging and reading randomly populated versions
2018-03-06 11:32:09 +01:00
Tomasz Grabiec
1b959cb6e9 tests: cache: Take parameters by const&
2018-03-06 11:32:09 +01:00
Tomasz Grabiec
d9f0c1f097 tests: cache: Fix invalidate() not being waited for
Probably responsible for occasional failures of subsequent assertion.
Didn't manage to reproduce.

Message-Id: <1520330967-584-1-git-send-email-tgrabiec@scylladb.com>
2018-03-06 12:14:04 +02:00
Tomasz Grabiec
9c3e56fb16 tests: row_cache: Improve test for snapshot consistency on eviction
Reproduces https://github.com/scylladb/scylla/issues/3215.
Message-Id: <1518710592-21925-1-git-send-email-tgrabiec@scylladb.com>
2018-02-15 16:48:23 +00:00
Tomasz Grabiec
b3415880b2 tests: row_cache: Add test for exception safety of updates from memtable
2018-02-15 10:13:02 +01:00
Avi Kivity
404172652e Merge "Use xxHash for digest instead of MD5" from Duarte
"This series changes digest calculation to use a faster algorithm
(xxHash) and to also cache calculated cell hashes that can be kept in
memory to speed up subsequent digest requests.

The MD5 hash function has proved to be slow for large cell values:

size = 256; elapsed = 4us
size = 512; elapsed = 8us
size = 1024; elapsed = 14us
size = 2048; elapsed = 21us
size = 4096; elapsed = 33us
size = 8192; elapsed = 51us
size = 16384; elapsed = 86us
size = 32768; elapsed = 150us
size = 65536; elapsed = 278us
size = 131072; elapsed = 531us
size = 262144; elapsed = 1032us
size = 524288; elapsed = 2026us
size = 1048576; elapsed = 4004us
size = 2097152; elapsed = 7943us
size = 4194304; elapsed = 15800us
size = 8388608; elapsed = 31731us
size = 16777216; elapsed = 64681us
size = 33554432; elapsed = 130752us
size = 67108864; elapsed = 263154us

The xxHash is a non-cryptographic, 64bit (there's work in progress on
the 128 version) hash that can be used to replace MD5. It performs much
better:

size = 256; elapsed = 2us
size = 512; elapsed = 1us
size = 1024; elapsed = 1us
size = 2048; elapsed = 2us
size = 4096; elapsed = 2us
size = 8192; elapsed = 3us
size = 16384; elapsed = 5us
size = 32768; elapsed = 8us
size = 65536; elapsed = 14us
size = 131072; elapsed = 28us
size = 262144; elapsed = 59us
size = 524288; elapsed = 116us
size = 1048576; elapsed = 226us
size = 2097152; elapsed = 456us
size = 4194304; elapsed = 935us
size = 8388608; elapsed = 1848us
size = 16777216; elapsed = 4723us
size = 33554432; elapsed = 10507us
size = 67108864; elapsed = 21622us

Performance was tested using a 3 node cluster with 1 cpu and 8GB,
and with the following cassandra-stress loaders. Measurements are for
the read workload.

sudo taskset -c 4-15 ./cassandra-stress write cl=ALL n=5000000 -schema 'replication(factor=3)' -col 'size=FIXED(1024) n=FIXED(4)' -mode native cql3 -rate threads=100
sudo taskset -c 4-15 ./cassandra-stress mixed cl=ALL 'ratio(read=1)' n=10000000 -pop 'dist=gauss(1..5000000,5000000,500000)' -col 'size=FIXED(1024) n=FIXED(4)' -mode native cql3 -rate threads=100

xxhash + caching:

Results:
op rate                   : 32699 [READ:32699]
partition rate            : 32699 [READ:32699]
row rate                  : 32699 [READ:32699]
latency mean              : 3.0 [READ:3.0]
latency median            : 3.0 [READ:3.0]
latency 95th percentile   : 3.9 [READ:3.9]
latency 99th percentile   : 4.5 [READ:4.5]
latency 99.9th percentile : 6.6 [READ:6.6]
latency max               : 24.0 [READ:24.0]
Total partitions          : 10000000 [READ:10000000]
Total errors              : 0 [READ:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:05:05
END

md5:

Results:
op rate                   : 25241 [READ:25241]
partition rate            : 25241 [READ:25241]
row rate                  : 25241 [READ:25241]
latency mean              : 3.9 [READ:3.9]
latency median            : 3.9 [READ:3.9]
latency 95th percentile   : 5.1 [READ:5.1]
latency 99th percentile   : 5.8 [READ:5.8]
latency 99.9th percentile : 8.0 [READ:8.0]
latency max               : 24.8 [READ:24.8]
Total partitions          : 10000000 [READ:10000000]
Total errors              : 0 [READ:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:06:36
END

This translates into a 21% improvement for this workload.

Bigger cell values were also tested:

sudo taskset -c 4-15 ./cassandra-stress write cl=ALL n=1000000 -schema 'replication(factor=3)' -col 'size=FIXED(4096) n=FIXED(4)' -mode native cql3 -rate threads=100
sudo taskset -c 4-15 ./cassandra-stress mixed cl=ALL 'ratio(read=1)' n=10000000 -pop 'dist=gauss(1..1000000,500000,100000)' -col 'size=FIXED(4096) n=FIXED(4)' -mode native cql3 -rate threads=100

xxhash + caching:

Results:
op rate                   : 19964 [READ:19964]
partition rate            : 19964 [READ:19964]
row rate                  : 19964 [READ:19964]
latency mean              : 4.9 [READ:4.9]
latency median            : 4.6 [READ:4.6]
latency 95th percentile   : 7.2 [READ:7.2]
latency 99th percentile   : 11.5 [READ:11.5]
latency 99.9th percentile : 13.6 [READ:13.6]
latency max               : 29.2 [READ:29.2]
Total partitions          : 10000000 [READ:10000000]
Total errors              : 0 [READ:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:08:20
END

md5:

Results:
op rate                   : 12773 [READ:12773]
partition rate            : 12773 [READ:12773]
row rate                  : 12773 [READ:12773]
latency mean              : 7.7 [READ:7.7]
latency median            : 7.3 [READ:7.3]
latency 95th percentile   : 10.2 [READ:10.2]
latency 99th percentile   : 16.8 [READ:16.8]
latency 99.9th percentile : 19.2 [READ:19.2]
latency max               : 71.5 [READ:71.5]
Total partitions          : 10000000 [READ:10000000]
Total errors              : 0 [READ:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:13:02
END

This translates into a 37% improvement for this workload.

Fixes #2884

Tests: unit-tests (release), dtests (smp=2)

Note: dtests are kinda broken in master (> 30 failures), so take the
tests tag with a grain of himalayan salt."

* 'xxhash/v5' of https://github.com/duarten/scylla: (29 commits)
  tests/row_cache_test: Test hash caching
  tests/memtable_test: Test hash caching
  tests/mutation_test: Use xxHash instead of MD5 for some tests
  tests/mutation_test: Test xx_hasher alongside md5_hasher
  schema: Remove unneeded include
  service/storage_proxy: Enable hash caching
  service/storage_service: Add and use xxhash feature
  message/messaging_service: Specify algorithm when requesting digest
  storage_proxy: Extract decision about digest algorithm to use
  cache_flat_mutation_reader: Pre-calculate cell hash
  partition_snapshot_reader: Pre-calculate cell hash
  query::partition_slice: Add option to specify when digest is requested
  row: Use cached hash for hash calculation
  mutation_partition: Replace hash_row_slice with appending_hash
  mutation_partition: Allow caching cell hashes
  mutation_partition: Force vector_storage internal storage size
  test.py: Increase memory for row_cache_stress_test
  atomic_cell_hash: Add specialization for atomic_cell_or_collection
  query-result: Use digester instead of md5_hasher
  range_tombstone: Replace feed_hash() member function with appending_hash
  ...
2018-02-08 18:24:58 +02:00
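The cell hash caching introduced by 404172652e can be sketched as follows. Two things here are assumptions rather than Scylla code: the hash is 64-bit FNV-1a, swapped in as a dependency-free stand-in for xxHash, and the `cell`, `digest`, and `hash_computations` names are hypothetical. The sketch shows only the caching idea: a cell computes its hash once and later digest requests reuse it.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

namespace sketch {

// Counts how many times a cell value was actually hashed.
static int hash_computations = 0;

// 64-bit FNV-1a, used here only as a stand-in for xxHash.
inline std::uint64_t fnv1a(const std::string& data) {
    std::uint64_t h = 14695981039346656037ULL;
    for (unsigned char c : data) {
        h ^= c;
        h *= 1099511628211ULL;
    }
    return h;
}

class cell {
    std::string _value;
    mutable bool _has_hash = false;
    mutable std::uint64_t _hash = 0;
public:
    explicit cell(std::string value) : _value(std::move(value)) {}

    // Hashes the value on first use and returns the cached result afterwards.
    std::uint64_t hash() const {
        if (!_has_hash) {
            _hash = fnv1a(_value);
            _has_hash = true;
            ++hash_computations;
        }
        return _hash;
    }
};

// Combines per-cell hashes into one digest without touching cell values again.
inline std::uint64_t digest(const std::vector<cell>& cells) {
    std::uint64_t h = 14695981039346656037ULL;
    for (const auto& c : cells) {
        h = (h ^ c.hash()) * 1099511628211ULL;
    }
    return h;
}

} // namespace sketch
```

With large cell values the second and later digest requests cost one multiply and one XOR per cell instead of a pass over every byte, which is the effect the cached hashes are meant to achieve for repeated reads of the same data.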
Tomasz Grabiec
c1b82e60e3 tests: row_cache: Add test for memtable readers surviving flush and eviction
Reproduces https://github.com/scylladb/scylla/issues/3186
2018-02-06 14:24:19 +01:00
Duarte Nunes
992de302a2 tests/row_cache_test: Test hash caching
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 01:02:50 +00:00
Piotr Jastrzebski
7729bc5e7b Remove unused mutation_reader_assertions
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Piotr Jastrzebski
39ec13133f row_cache: rename make_flat_reader to make_reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
0d76091a28 test_mvcc: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
425c1624cd test_cache_population_and_clear_race: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
dc97acb778 test_cache_population_and_update_race: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
1bead9747a test_continuity_flag_and_invalidate_race: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
4266b9759e test_update_failure: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
d5366026b1 row_cache_test: use flat reader in verify_has
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
56b0157831 row_cache_test: use flat reader in has_key
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
06bca9f4d5 test_sliced_read_row_presence: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
6c3d9cdb9f test_lru: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
a979869a15 test_update_invalidating: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
781d9a324d test_scan_with_partial_partitions: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
f199aab1ad test_cache_populates_partition_tombstone: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
9755f7677c test_tombstone_merging_in_partial_partition: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
2e1b12b6ce consume_all,populate_range: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
d08f4a40b2 test_readers_get_all_data_after_eviction: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
f99992261f test_tombstones_are_not_missed_when_range_is_invalidated: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
50fb2a57b6 test_exception_safety_of_reads: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
f0af5a1321 test_exception_safety_of_transitioning_from_underlying_read_to_read_from_cache: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
98b97be19a test_exception_safety_of_partition_scan: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:45 +01:00
Piotr Jastrzebski
5010c082f6 test_concurrent_population_before_latest_version_iterator: use flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:54:44 +01:00