Commit Graph

37 Commits

Author SHA1 Message Date
Piotr Dulikowski
59fbbb993f memtables: add partition/row hit/miss counters
Adds per-table metrics for counting partition and row reuse
in memtables. New metrics are as follows:
    - memtable_partition_writes - number of write operations performed
          on partitions in memtables,
    - memtable_partition_hits - number of write operations performed
          on partitions that previously existed in a memtable,
    - memtable_row_writes - number of row write operations performed
          in memtables,
    - memtable_row_hits - number of row write operations that ovewrote
          rows previously present in a memtable.

Tests: unit(release)
2019-11-12 13:35:41 +01:00
Botond Dénes
01e913397a tests: memtable_test: flush_reader_test: compare compacted mutations
To filter out artificial differences due to different representation of
an equivalent set of writes.

Fixes: #5207

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20191024103718.29266-1-bdenes@scylladb.com>
2019-10-26 18:14:18 +03:00
Botond Dénes
4d9f3e5705 data_model: extend ttl and expiry support 2019-07-15 17:38:00 +03:00
Paweł Dziepak
e315448d0a tests/memtable: test column tracking for encoding stats 2019-02-07 10:16:50 +00:00
Jesse Haber-Kucharsky
b39eac653d Switch to the the CMake-ified Seastar
Committer: Avi Kivity <avi@scylladb.com>
Branch: next

Switch to the the CMake-ified Seastar

This change allows Scylla to be compiled against the `master` branch of
Seastar.

The necessary changes:

- Add `-Wno-error` to prevent a Seastar warning from terminating the
  build

- The new Seastar build system generates the pkg-config files (for
  example, `seastar.pc`) at configure time, so we don't need to invoke
  Ninja to generate them

- The `-march` argument is no longer inherited from Seastar (correctly),
  so it needs to be provided independently

- Define `SEASTAR_TESTING_MAIN` so that the definition of an entry
  point is included for all unit test compilation units

- Independently link Scylla against Seastar's compiled copy of fmt in
  its build directory

- All test files use the (now public) Seastar testing headers

- Add some missing Seastar headers to source files

[avi: regenerate frozen toolchain, adjust seastar submoule]
Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <02141f2e1ecff5cbcd56b32768356c3bf62750c4.1548820547.git.jhaberku@scylladb.com>
2019-01-30 11:17:38 +02:00
Duarte Nunes
fa2b0384d2 Replace std::experimental types with C++17 std version.
Replace stdx::optional and stdx::string_view with the C++ std
counterparts.

Some instances of boost::variant were also replaced with std::variant,
namely those that called seastar::visit.

Scylla now requires GCC 8 to compile.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20190108111141.5369-1-duarte@scylladb.com>
2019-01-08 13:16:36 +02:00
Paweł Dziepak
e91165d929 tests/memtable: fix partition_range use-after-free 2018-12-20 13:27:25 +00:00
Avi Kivity
775b7e41f4 Update seastar submodule
* seastar d59fcef...b924495 (2):
  > build: Fix protobuf generation rules
  > Merge "Restructure files" from Jesse

Includes fixup patch from Jesse:

"
Update Seastar `#include`s to reflect restructure

All Seastar header files are now prefixed with "seastar" and the
configure script reflects the new locations of files.

Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <5d22d964a7735696fb6bb7606ed88f35dde31413.1542731639.git.jhaberku@scylladb.com>
"
2018-11-21 00:01:44 +02:00
Botond Dénes
eb357a385d flat_mutation_reader: make timeout opt-out rather than opt-in
Currently timeout is opt-in, that is, all methods that even have it
default it to `db::no_timeout`. This means that ensuring timeout is used
where it should be is completely up to the author and the reviewrs of
the code. As humans are notoriously prone to mistakes this has resulted
in a very inconsistent usage of timeout, many clients of
`flat_mutation_reader` passing the timeout only to some members and only
on certain call sites. This is small wonder considering that some core
operations like `operator()()` only recently received a timeout
parameter and others like `peek()` didn't even have one until this
patch. Both of these methods call `fill_buffer()` which potentially
talks to the lower layers and is supposed to propagate the timeout.
All this makes the `flat_mutation_reader`'s timeout effectively useless.

To make order in this chaos make the timeout parameter a mandatory one
on all `flat_mutation_reader` methods that need it. This ensures that
humans now get a reminder from the compiler when they forget to pass the
timeout. Clients can still opt-out from passing a timeout by passing
`db::no_timeout` (the previous default value) but this will be now
explicit and developers should think before typing it.

There were suprisingly few core call sites to fix up. Where a timeout
was available nearby I propagated it to be able to pass it to the
reader, where I couldn't I passed `db::no_timeout`. Authors of the
latter kind of code (view, streaming and repair are some of the notable
examples) should maybe consider propagating down a timeout if needed.
In the test code (the wast majority of the changes) I just used
`db::no_timeout` everywhere.

Tests: unit(release, debug)

Signed-off-by: Botond Dénes <bdenes@scylladb.com>

Message-Id: <1edc10802d5eb23de8af28c9f48b8d3be0f1a468.1536744563.git.bdenes@scylladb.com>
2018-09-20 11:31:24 +02:00
Paweł Dziepak
27014a23d7 treewide: require type info for copying atomic_cell_or_collection 2018-05-31 15:51:11 +01:00
Avi Kivity
404172652e Merge "Use xxHash for digest instead of MD5" from Duarte
"This series changes digest calculation to use a faster algorithm
(xxHash) and to also cache calculated cell hashes that can be kept in
memory to speed up subsequent digest requests.

The MD5 hash function has proved to be slow for large cell values:

size = 256; elapsed = 4us
size = 512; elapsed = 8us
size = 1024; elapsed = 14us
size = 2048; elapsed = 21us
size = 4096; elapsed = 33us
size = 8192; elapsed = 51us
size = 16384; elapsed = 86us
size = 32768; elapsed = 150us
size = 65536; elapsed = 278us
size = 131072; elapsed = 531us
size = 262144; elapsed = 1032us
size = 524288; elapsed = 2026us
size = 1048576; elapsed = 4004us
size = 2097152; elapsed = 7943us
size = 4194304; elapsed = 15800us
size = 8388608; elapsed = 31731us
size = 16777216; elapsed = 64681us
size = 33554432; elapsed = 130752us
size = 67108864; elapsed = 263154us

The xxHash is a non-cryptographic, 64bit (there's work in progress on
the 128 version) hash that can be used to replace MD5. It performs much
better:

size = 256; elapsed = 2us
size = 512; elapsed = 1us
size = 1024; elapsed = 1us
size = 2048; elapsed = 2us
size = 4096; elapsed = 2us
size = 8192; elapsed = 3us
size = 16384; elapsed = 5us
size = 32768; elapsed = 8us
size = 65536; elapsed = 14us
size = 131072; elapsed = 28us
size = 262144; elapsed = 59us
size = 524288; elapsed = 116us
size = 1048576; elapsed = 226us
size = 2097152; elapsed = 456us
size = 4194304; elapsed = 935us
size = 8388608; elapsed = 1848us
size = 16777216; elapsed = 4723us
size = 33554432; elapsed = 10507us
size = 67108864; elapsed = 21622us

Performance was tested using a 3 node cluster with 1 cpu and 8GB,
and with the following cassandra-stress loaders. Measurements are for
the read workload.

sudo taskset -c 4-15 ./cassandra-stress write cl=ALL n=5000000 -schema 'replication(factor=3)' -col 'size=FIXED(1024) n=FIXED(4)' -mode native cql3 -rate threads=100
sudo taskset -c 4-15 ./cassandra-stress mixed cl=ALL 'ratio(read=1)' n=10000000 -pop 'dist=gauss(1..5000000,5000000,500000)' -col 'size=FIXED(1024) n=FIXED(4)' -mode native cql3 -rate threads=100

xxhash + caching:

Results:
op rate                   : 32699 [READ:32699]
partition rate            : 32699 [READ:32699]
row rate                  : 32699 [READ:32699]
latency mean              : 3.0 [READ:3.0]
latency median            : 3.0 [READ:3.0]
latency 95th percentile   : 3.9 [READ:3.9]
latency 99th percentile   : 4.5 [READ:4.5]
latency 99.9th percentile : 6.6 [READ:6.6]
latency max               : 24.0 [READ:24.0]
Total partitions          : 10000000 [READ:10000000]
Total errors              : 0 [READ:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:05:05
END

md5:

Results:
op rate                   : 25241 [READ:25241]
partition rate            : 25241 [READ:25241]
row rate                  : 25241 [READ:25241]
latency mean              : 3.9 [READ:3.9]
latency median            : 3.9 [READ:3.9]
latency 95th percentile   : 5.1 [READ:5.1]
latency 99th percentile   : 5.8 [READ:5.8]
latency 99.9th percentile : 8.0 [READ:8.0]
latency max               : 24.8 [READ:24.8]
Total partitions          : 10000000 [READ:10000000]
Total errors              : 0 [READ:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:06:36
END

This translates into a 21% improvoment for this workload.

Bigger cell values were also tested:

sudo taskset -c 4-15 ./cassandra-stress write cl=ALL n=1000000 -schema 'replication(factor=3)' -col 'size=FIXED(4096) n=FIXED(4)' -mode native cql3 -rate threads=100
sudo taskset -c 4-15 ./cassandra-stress mixed cl=ALL 'ratio(read=1)' n=10000000 -pop 'dist=gauss(1..1000000,500000,100000)' -col 'size=FIXED(4096) n=FIXED(4)' -mode native cql3 -rate threads=100

xxhash + caching:

Results:
op rate                   : 19964 [READ:19964]
partition rate            : 19964 [READ:19964]
row rate                  : 19964 [READ:19964]
latency mean              : 4.9 [READ:4.9]
latency median            : 4.6 [READ:4.6]
latency 95th percentile   : 7.2 [READ:7.2]
latency 99th percentile   : 11.5 [READ:11.5]
latency 99.9th percentile : 13.6 [READ:13.6]
latency max               : 29.2 [READ:29.2]
Total partitions          : 10000000 [READ:10000000]
Total errors              : 0 [READ:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:08:20
END

md5:

Results:
op rate                   : 12773 [READ:12773]
partition rate            : 12773 [READ:12773]
row rate                  : 12773 [READ:12773]
latency mean              : 7.7 [READ:7.7]
latency median            : 7.3 [READ:7.3]
latency 95th percentile   : 10.2 [READ:10.2]
latency 99th percentile   : 16.8 [READ:16.8]
latency 99.9th percentile : 19.2 [READ:19.2]
latency max               : 71.5 [READ:71.5]
Total partitions          : 10000000 [READ:10000000]
Total errors              : 0 [READ:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:13:02
END

This translates into a 37% improvoment for this workload.

Fixes #2884

Tests: unit-tests (release), dtests (smp=2)

Note: dtests are kinda broken in master (> 30 failures), so take the
tests tag with a grain of himalayan salt."

* 'xxhash/v5' of https://github.com/duarten/scylla: (29 commits)
  tests/row_cache_test: Test hash caching
  tests/memtable_test: Test hash caching
  tests/mutation_test: Use xxHash instead of MD5 for some tests
  tests/mutation_test: Test xx_hasher alongside md5_hasher
  schema: Remove unneeded include
  service/storage_proxy: Enable hash caching
  service/storage_service: Add and use xxhash feature
  message/messaging_service: Specify algorithm when requesting digest
  storage_proxy: Extract decision about digest algorithm to use
  cache_flat_mutation_reader: Pre-calculate cell hash
  partition_snapshot_reader: Pre-calculate cell hash
  query::partition_slice: Add option to specify when digest is requested
  row: Use cached hash for hash calculation
  mutation_partition: Replace hash_row_slice with appending_hash
  mutation_partition: Allow caching cell hashes
  mutation_partition: Force vector_storage internal storage size
  test.py: Increase memory for row_cache_stress_test
  atomic_cell_hash: Add specialization for atomic_cell_or_collection
  query-result: Use digester instead of md5_hasher
  range_tombstone: Replace feed_hash() member function with appending_hash
  ...
2018-02-08 18:24:58 +02:00
Duarte Nunes
d28bdb25c5 tests/memtable_test: Test hash caching
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 01:02:50 +00:00
Paweł Dziepak
20c460d8f0 tests/memtable: add more reader exception safety tests 2018-01-31 16:05:35 +00:00
Paweł Dziepak
1406ac5088 tests/memtable: add test for reader exception safety 2018-01-30 18:33:26 +01:00
Piotr Jastrzebski
7729bc5e7b Remove unused mutation_reader_assertions
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
José Guilherme Vanz
380bc0aa0d Swap arguments order of mutation constructor
Swap arguments in the mutation constructor keeping the same standard
from the constructor variants. Refs #3084

Signed-off-by: José Guilherme Vanz <guilherme.sft@gmail.com>
Message-Id: <20180120000154.3823-1-guilherme.sft@gmail.com>
2018-01-21 12:58:42 +02:00
Tomasz Grabiec
7ce02bc22e tests: memtable: Test that memtable with many versions is a mutation source 2017-12-22 11:06:34 +01:00
Piotr Jastrzebski
b1676db658 Migrate test_virtual_dirty_accounting_on_flush to flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 11:47:07 +01:00
Piotr Jastrzebski
b90677272f Migrate test_adding_a_column_during_reading_doesnt_affect_read_result
to flat reader

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 11:47:07 +01:00
Piotr Jastrzebski
ddecd385c1 Migrate test_partition_version_consistency_after_lsa_compaction_happens
to flat reader

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 11:47:07 +01:00
Piotr Jastrzebski
681dc26dd1 Migrate test_fast_forward_to_after_memtable_is_flushed to flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-12-21 11:47:07 +01:00
Jesse Haber-Kucharsky
fb0866ca20 Move thread_local declarations out of main.cc
Since `disk-error-handler.hh` defines these global variables `extern`,
it makes sense to declare them in the `disk-error-handler.cc` instead of
`main.cc`.

This means that test files don't have to declare them.

Fixes #2735.

Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <1eed120bfd9bb3647e03fe05b60c871de2df2a86.1511810004.git.jhaberku@scylladb.com>
2017-11-27 20:27:42 +01:00
Paweł Dziepak
87b600cad8 tests/memtable: add test for flush reader 2017-11-27 20:07:23 +01:00
Paweł Dziepak
32eb6437fd memtable: make make_flush_reader() return flat_mutation_reader 2017-11-27 20:07:22 +01:00
Piotr Jastrzebski
83fd22face Add test to reproduce #2854
When memtable gets flushed, existing mutation_readers created
for it stop handling fast_forward_to correctly.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <f580ac59f3fcec53e7c78ad7a8b6374eb36958c6.1506690042.git.piotr@scylladb.com>
2017-09-29 15:17:53 +02:00
Glauber Costa
80440c0d79 database: rework dirty memory hierarchy
Issue #1918 describes a problem, in which we are generating smaller
memtables than we could, and therefore not respecting the flush
criteria.

That happens because group sizes (and limits) for pressure purposes, and
the the soft threshold is currently at 40 %. This causes system group's
soft threshold to be way below regular's virtual dirty limit and close
to regular group's soft threshold. The system group was very likely to
become under soft pressure when regular was because writes to regular
group are not yet throttled when they cross both soft thresholds.

This is a direct consequence of the linear hierarchy between the regions
and to guarantee that it won't happen we would have acqire the semaphore
of all ancestor regions when flushing from a child region. While that
works, it can lead to problems on its own, like priority inversion if
the regions have different priorities - like streaming and regular, and
groups lower in the hierarchy, like user, blocking explicit flushes
from their ancestors

To fix that, this patch reorganizes the dirty memory region groups so
that groups are now completely independent. As a disadvantage, when
streaming happen we will draw some memory from the cache, but we will
live with it for the time being.

Fixes #1918

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-12-13 14:07:53 -05:00
Tomasz Grabiec
c3768fe4de memtable: Pass dirty_memory_manager& to memtable constructor
The implementation assumes that memtable's region group is owned by
dirty_memory_manager, and tries to obtain a reference to it like this:

  boost::intrusive::get_parent_from_member(_region.group(), &dirty_memory_manager::_region_group));

This is undefined behavior when the region's group does not come from
dirty manager. It's safer to be explicit about this dependency by
taking a reference to dirty_memory_manager in the constructor.
2016-12-05 12:59:09 +01:00
Avi Kivity
7faf2eed2f build: support for linking statically with boost
Remove assumptions in the build system about dynamically linked boost unit
tests.  Includes seastar update which would have otherwise broken the
build.
2016-10-26 08:51:21 +03:00
Tomasz Grabiec
308434f891 tests: memtable: Add test for partition version list consistency after compaction 2016-10-18 11:57:14 +02:00
Tomasz Grabiec
d836e8f64b tests: memtable: Add tests for flushing reader
Message-Id: <1476454187-11462-1-git-send-email-tgrabiec@scylladb.com>
2016-10-14 15:11:06 +01:00
Duarte Nunes
dc8319ed91 keys: Remove schema argument from make_empty
An empty key is independent of the schema.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-06-02 16:21:36 +02:00
Pekka Enberg
38a54df863 Fix pre-ScyllaDB copyright statements
People keep tripping over the old copyrights and copy-pasting them to
new files. Search and replace "Cloudius Systems" with "ScyllaDB".

Message-Id: <1460013664-25966-1-git-send-email-penberg@scylladb.com>
2016-04-08 08:12:47 +03:00
Benoît Canet
1fb9a48ac5 exception: Optionally shutdown communication on I/O errors.
I/O errors cannot be fixed by Scylla the only solution
is to shutdown the database communications.

Signed-off-by: Benoît Canet <benoit@scylladb.com>
Message-Id: <1458154098-9977-1-git-send-email-benoit@scylladb.com>
2016-03-17 15:02:52 +02:00
Tomasz Grabiec
a63971ee4c tests: memtable_test: Add test for concurrent reading and schema changes 2016-01-11 10:34:52 +01:00
Avi Kivity
d5cf0fb2b1 Add license notices 2015-09-20 10:43:39 +03:00
Tomasz Grabiec
8978d0ba1a tests: Test memtable data survives full compaction 2015-08-06 14:05:16 +02:00
Avi Kivity
c720cddc5c tests: mv tests/urchin/* -> tests/
Now that seastar is in a separate repository, we can use the tests/
directory.
2015-08-05 14:16:52 +03:00