Commit Graph

1967 Commits

Author SHA1 Message Date
Kamil Braun
774ef653b1 test: raft: randomized_nemesis_test: move ticker to its own header 2021-07-13 11:15:25 +02:00
Kamil Braun
a45e8e0db0 test: raft: randomized_nemesis_test: ticker: take logger as a constructor parameter
Remove the global dependency on `tlogger`.
2021-07-13 11:15:25 +02:00
Kamil Braun
21b5a6d9f7 test: raft: logical_timer: handle immediate timeout
If the user calls `with_timeout` with a time point that's already been
reached, we return `timed_out_error` immediately.
2021-07-13 11:15:25 +02:00
Kamil Braun
ed8e9a564a test: raft: logical_timer: on timeout, return the original future in the exception
More specifically, return a future which is equivalent to the original
future (when the original future resolves, this future will contain its
result).

Thus we don't discard the future, the user gets it back.
Let them decide what to do with it.
2021-07-13 11:15:25 +02:00
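The two `logical_timer` behaviours described above (failing immediately for an already-reached deadline, and handing the original future back to the caller inside the timeout exception) can be sketched outside Seastar with `std::future`. The names `with_timeout` and `timed_out_error` mirror the commits, but this is a standalone assumption, not the test suite's actual code:

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <stdexcept>
#include <utility>

// Sketch only: the timeout error carries the original future, so the
// caller gets it back and can decide what to do with it (keep waiting,
// or drop it) instead of the result being silently discarded.
template <typename T>
struct timed_out_error : std::runtime_error {
    std::future<T> original;
    explicit timed_out_error(std::future<T> f)
        : std::runtime_error("timed out"), original(std::move(f)) {}
};

// Wait up to `timeout` for `f`. A deadline that has already been
// reached (timeout of zero) fails immediately, matching the
// "handle immediate timeout" commit above.
template <typename T>
T with_timeout(std::future<T> f, std::chrono::milliseconds timeout) {
    if (f.wait_for(timeout) != std::future_status::ready) {
        throw timed_out_error<T>(std::move(f));
    }
    return f.get();
}
```

A caller that catches `timed_out_error` can later call `get()` on the recovered future once the value eventually arrives.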
Kamil Braun
c86ff1eb7c test: raft: logical_timer: add schedule member function
It allows scheduling the given function to be called at the given logical
time point.
2021-07-13 11:15:25 +02:00
Kamil Braun
cf0d503a92 test: raft: randomized_nemesis_test: move logical_timer to its own header 2021-07-13 11:15:25 +02:00
Avi Kivity
8fb4fe2f24 Merge "reader_concurrency_semaphore: relax on destroy stop checks" from Botond
"
Currently we `assert(_stopped)` in the destructor, but this is too
harsh, especially on freshly created semaphore instances that weren't
even used yet. This basically mandates that semaphores be initialized at
the end of the constructor body, which is very cumbersome.
Further to that, this series relaxes the checks on destroying an
unstopped, previously (but not currently) used semaphore. As destroying
such a semaphore without stop() is risky, an error is still logged.

Tests: unit(dev)
"

* 'reader-concurrency-semaphore-relax-stop-check/v1' of https://github.com/denesb/scylla:
  reader_concurrency_semaphore: relax _stopped check when destroying a used semaphore
  reader_concurrency_semaphore: allow destroying without stop() when not used yet
  reader_concurrency_semaphore: add permit-stats
2021-07-12 20:07:01 +02:00
Botond Dénes
750b20fd85 reader_concurrency_semaphore: allow destroying without stop() when not used yet
To make it easier to construct objects with semaphore members. When the
construction of such an object fails, it can now just destroy its
semaphore member as usual, without having to employ tricks to make sure
it is stopped first.
2021-07-12 15:53:00 +03:00
Nadav Har'El
3fda13e20e cql-pytest: fix sporadic failure in over-zealous TTL test
We have been seeing rare failures of the cql-pytest test (translated from
Cassandra's unit tests) that tests TTL in secondary indexes:
cassandra_tests/validation/entities/secondary_index_test.py::testIndexOnRegularColumnInsertExpiringColumn

The problem is that the test writes an item with a 1-second TTL, then
sleeps *exactly* 1.0 seconds and expects the item to have disappeared
by that time, which doesn't always happen:

The problem with that assumption stems from Scylla's TTL clock ("gc_clock")
being based on Seastar's lowres_clock, which only has a 10ms
granularity: the time Scylla sees when deciding whether an item has expired
may be up to 10ms in the past - the arbitrary point when the lowres timer
happened to last run. In rare overload cases, the inaccuracy may be even
greater than 10ms (if the timer got delayed by other things running).

So when Scylla is asked to expire an item in 1 second, we cannot be
sure it will expire in exactly 1 second or less - the expiration
can also happen around 10ms later.

So in this patch we change the test to sleep with more than enough
margin - 1.1 seconds (i.e., 100ms more than 1 second). By that time
we're sure the item must have expired.

Before this patch, I saw the test fail once every few hundred runs;
after this patch I ran it 2,000 times without a single failure.

Fixes #9008

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210712100655.953969-1-nyh@scylladb.com>
2021-07-12 13:48:21 +03:00
Piotr Jastrzebski
c010cefc4d cdc: Handle compact storage tables correctly
When a table with compact storage has no regular columns (only primary
key columns), an artificial column of type empty is added. Such a
column type can't be returned via CQL, so the CDC Log shouldn't contain
a column that reflects this artificial column.

This patch does two things:
1. Make sure that the CDC Log schema does not contain columns that reflect
   the artificial column from a base table.
2. When composing a mutation for the CDC Log, omit the artificial column.

Fixes #8410

Test: unit(dev)

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>

Closes #8988
2021-07-12 12:17:35 +03:00
Nadav Har'El
2cc8c40c07 Merge 'Fix some issues found by gcc 11' from Avi Kivity
This series fixes some issues that gcc 11 complains about. I believe all
are genuine errors from the standard's point of view. Clang accepts the changed code.

Note that this is not enough to build with gcc 11, but it's a start.

Closes #9007

* github.com:scylladb/scylla:
  utils: compact-radix-tree: detemplate array_of<>
  utils: compact-radix-tree: don't redefine type as member
  raft: avoid changing meaning of a symbol inside a class
  cql3: lists: catch polymorphic exceptions by reference
2021-07-12 11:17:57 +03:00
Avi Kivity
332b5c395f raft: avoid changing meaning of a symbol inside a class
The construct

struct q {
    a a;
};

changes the meaning of `a` from a type to a data member. gcc dislikes
it and I agree. Fully qualify the type name to avoid the error.
2021-07-11 18:16:21 +03:00
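A minimal standalone reproduction of the fix described above (the member names here are illustrative, not the actual raft code): qualifying the type keeps a member named `a` legal even though `a` is also a type.

```cpp
#include <cassert>

// gcc rejects a plain `a a;` member because the declaration changes the
// meaning of `a` from a type to a data member within the class scope.
struct a {
    int value = 0;
};

struct q {
    ::a a;  // fully qualified: unambiguously names the type at global scope
};
```

With the qualification, `q::a` is the data member and `::a` remains the type, so both compilers accept the code.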
Avi Kivity
222ef17305 build, treewide: enable -Wredundant-move
Returning a function parameter guarantees copy elision and does not
require a std::move().  Enable -Wredundant-move to warn us that the
move is unneeded, and gain slightly more readable code. A few violations
are trivially adjusted.

Closes #9004
2021-07-11 12:53:02 +03:00
Avi Kivity
9059514335 build, treewide: enable -Wpessimizing-move warning
This warning prevents using std::move() where it can hurt
- on an unnamed temporary or a named automatic variable being
returned from a function. In both cases the value could be
constructed directly in its final destination, but std::move()
prevents it.

Fix the handful of cases (all trivial), and enable the warning.

Closes #8992
2021-07-08 17:52:34 +03:00
Avi Kivity
f756f34392 Merge "Add scylla-bench datasets to perf_fast_forward" from Tomasz
"
After this series one can use perf_fast_forward to generate the data set.
It takes a lot less time this way than to use scylla-bench.
"

* 'perf-fast-forward-scylla-bench-dataset' of github.com:tgrabiec/scylla:
  tests: perf_fast_forward: Use data_source::make_ck()
  tests: perf_fast_forward: Move declaration of clustered_ds up
  tests: perf_fast_forward: Make scylla_bench_small_part_ds1 not included by default
  tests: perf_fast_forward: Add data sets which conform to scylla-bench schema
2021-07-08 17:33:30 +03:00
Nadav Har'El
d0546a9bb5 cql-pytest: improve README
This patch adds to cql-pytest/README.md a paragraph on where run /
run-cassandra expect to find Scylla or Cassandra, and how to override
that choice.

Also make a couple of trivial formatting changes.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210708142730.813660-1-nyh@scylladb.com>
2021-07-08 17:29:20 +03:00
Avi Kivity
4f1e21ceac Merge "reader_concurrency_semaphore: get rid of global semaphores" from Botond
"
When obtaining a valid permit was made mandatory, code which now had to
create reader permits but didn't have a semaphore handy suddenly found
itself in a difficult situation. Many places and most prominently tests
solved the problem by creating a thread-local semaphore to source
permits from. This was fine at the time but as usual, globals came back
to haunt us when `reader_concurrency_semaphore::stop()` was
introduced, as these global semaphores had no easy way to be stopped
before being destroyed. This patch-set cleans up this wart by getting
rid of all global semaphores, replacing them with appropriately scoped
local semaphores that are stopped after use. With that, the
FIXME in `~reader_concurrency_semaphore()` can be resolved and we can
finally `assert()` that the semaphore was stopped before being
destroyed.

This series is another preparatory one for the series which moves the
semaphore in front of the cache.

tests: unit(dev)
"

* 'reader-concurrency-semaphore-mandatory-stop/v2' of https://github.com/denesb/scylla: (26 commits)
  reader_concurrency_semaphore: assert(_stopped) in the destructor
  test/lib: remove now unused reader_permit.{hh,cc}
  test/boost: migrate off the global test reader semaphore
  test/manual: migrate off the global test reader semaphore
  test/unit: migrate off the global test reader semaphore
  test/perf: migrate off the global test reader semaphore
  test/perf: perf.hh: add reader_concurrency_semaphore_wrapper
  test/lib: migrate off the global test reader semaphore
  test/lib/simple_schema: migrate off the global test reader semaphore
  test/lib/sstable_utils: migrate off the global test reader semaphore
  test/lib/test_services: migrate off the global test reader semaphore
  test/lib/sstable_test_env: add reader_concurrency_semaphore member
  test/lib/cql_test_env: add make_reader_permit()
  test/lib: add reader_concurrency_semaphore.hh
  test/boost/sstable_test: migrate row counting tests to seastar thread
  test/boost/sstable_test: test_using_reusable_sst(): pass env to func
  test/lib/reader_lifecycle_policy: add permit parameter to factory function
  test/boost/mutation_reader_test: share permit between readers in a read
  memtable: migrate off the global reader concurrency semaphore
  mutation_writer: multishard_writer: migrate off the global reader concurrency semaphore
  ...
2021-07-08 17:28:13 +03:00
Botond Dénes
6b941c4d34 test/lib: remove now unused reader_permit.{hh,cc}
Finally getting rid of the global test reader concurrency semaphore.
2021-07-08 16:53:38 +03:00
Botond Dénes
2d2b9e7b36 test/boost: migrate off the global test reader semaphore 2021-07-08 16:53:38 +03:00
Botond Dénes
0bf07cde7b test/manual: migrate off the global test reader semaphore 2021-07-08 16:53:38 +03:00
Botond Dénes
18e0c40c5d test/unit: migrate off the global test reader semaphore 2021-07-08 16:53:38 +03:00
Botond Dénes
37a1e506b1 test/perf: migrate off the global test reader semaphore 2021-07-08 16:53:38 +03:00
Botond Dénes
2454811dd6 test/perf: perf.hh: add reader_concurrency_semaphore_wrapper
A convenience, self-stopping wrapper for those perf tests that have no
way to stop the semaphore and wait for it, too.
2021-07-08 16:53:38 +03:00
Nadav Har'El
e22a52e69c cql-pytest: fix tests on Cassandra 3
After commit 76227fa ("cql-pytest: use NetworkTopologyStrategy, not
SimpleStrategy"), the cql-pytest tests now use NetworkTopologyStrategy
instead of SimpleStrategy for the test keyspaces. The tests continued to
use the "replication_factor" option. Support for this option is
relatively recent - it was only added to Cassandra in the 4.0 release
series (see https://issues.apache.org/jira/browse/CASSANDRA-14303). So
users who happen to have Cassandra 3 installed and want to run cql-pytest
against it will see the tests failing when they can't create a keyspace.

This patch trivially fixes the problem by using the name of the current
DC (automatically determined) instead of the word 'replication_factor'.

Almost all tests are fixed by a single fix to the test_keyspace fixture
which creates one keyspace used by most tests. Additional changes were
needed in test_keyspace.py, for tests which explicitly create keyspaces.

I tested the result on Cassandra 3.11.10, Cassandra 4 (git master) and
Scylla.

Fixes #8990

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210708123428.811184-1-nyh@scylladb.com>
2021-07-08 15:35:21 +02:00
Nadav Har'El
eb11ce046c cql-pytest: add reproducer for concurrent DROP KEYSPACE bug
We know that today in Scylla concurrent schema changes done on different
coordinators are not safe - and we plan to address this problem with Raft.
However, the test in this patch - reproducing issue #8968 - demonstrates
that even on a single node concurrent schema changes are not safe:

The test involves one thread which constantly creates a keyspace and
then a table in it - and a second thread which constantly deletes this
keyspace. After doing this for a while, the schema reaches an inconsistent
state: the keyspace is in a state of limbo where it cannot be dropped
(dropping it succeeds, but doesn't actually drop it), and a new keyspace
cannot be created under the same name.

Note that to reproduce this bug, it was important that the test create
both a keyspace and a table. Were the test to just create an empty keyspace,
without a table in it, the bug would not be reproduced.

Refs #8968.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210704121049.662169-1-nyh@scylladb.com>
2021-07-08 15:35:03 +02:00
Botond Dénes
0e78399051 test/lib: migrate off the global test reader semaphore 2021-07-08 15:28:39 +03:00
Botond Dénes
5fff314739 test/lib/simple_schema: migrate off the global test reader semaphore 2021-07-08 15:28:39 +03:00
Botond Dénes
d520655730 test/lib/sstable_utils: migrate off the global test reader semaphore 2021-07-08 15:28:39 +03:00
Botond Dénes
3679418e62 test/lib/test_services: migrate off the global test reader semaphore 2021-07-08 15:28:39 +03:00
Botond Dénes
0acc4d63da test/lib/sstable_test_env: add reader_concurrency_semaphore member
To enable tests using the test env to conveniently create permits for
themselves, reducing the pain of migrating to local semaphores.
2021-07-08 15:28:39 +03:00
Botond Dénes
7174d1beee test/lib/cql_test_env: add make_reader_permit()
A convenience method, allowing tests using the cql test env to
conveniently create a permit, reducing the pain of migrating to local
semaphores.
2021-07-08 15:28:39 +03:00
Botond Dénes
b739525fb6 test/lib: add reader_concurrency_semaphore.hh
Supplying a convenience semaphore wrapper, which stops the contained
semaphore when destroyed. It also provides a more convenient
`make_permit()`.  This class is intended to make the migration to local
semaphores less painful.
2021-07-08 15:28:36 +03:00
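The stop-on-destroy wrapper introduced above can be sketched outside Seastar; the `semaphore` type here is a hypothetical stand-in (with a call counter so the effect is observable), and in the real code `stop()` returns a future that the wrapper also waits on, which is elided here:

```cpp
#include <cassert>

// Hypothetical stand-in for the real semaphore: it must be stopped
// before destruction; `stops` counts stop() calls for observability.
struct semaphore {
    static inline int stops = 0;  // C++17 inline static
    bool stopped = false;
    void stop() { stopped = true; ++stops; }
};

// Sketch of the test-library wrapper: owns a semaphore and stops it on
// destruction, so a test can use a local semaphore without remembering
// to call stop() on every exit path.
class semaphore_wrapper {
    semaphore _sem;
public:
    ~semaphore_wrapper() { _sem.stop(); }
    semaphore& get() { return _sem; }
};
```

Each test simply declares a local `semaphore_wrapper`; scope exit guarantees the contained semaphore is stopped.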
Nadav Har'El
814c4ad4ce cql-pytest: fix run-cassandra for older versions of Cassandra
In older versions of Cassandra (such as 3.11.10, which I tried), the
CQL server is not turned on by default unless the configuration file
explicitly has "start_native_transport: true" - without it, only the
Thrift server is started.

So fix the cql-pytest/run-cassandra to pass this option. It also
works correctly in Cassandra 4.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210708113423.804980-1-nyh@scylladb.com>
2021-07-08 14:59:09 +03:00
Piotr Sarna
bc0038913c cql-pytest: add a test case for base range deletion
The test case checks that deleting a base table clustering range
works fine. This operation is potentially heavy, as it involves
generating a view update for every row. With large enough ranges,
the number can reach millions and beyond.
2021-07-08 11:43:08 +02:00
Piotr Sarna
ef47b4565c cql-pytest: add a test case for base partition deletion
The test case checks that deleting a whole base table partition
works fine. This operation is potentially heavy, as it involves
generating a view update for every row. With large enough partitions,
the number can reach millions and beyond.
2021-07-08 11:42:54 +02:00
Botond Dénes
b9a5fd57bf test/boost/sstable_test: migrate row counting tests to seastar thread
To facilitate further patching.
2021-07-08 12:38:21 +03:00
Botond Dénes
fb310ec6e7 test/boost/sstable_test: test_using_reusable_sst(): pass env to func
To facilitate further patching.
2021-07-08 12:38:19 +03:00
Botond Dénes
46d21e842d test/lib/reader_lifecycle_policy: add permit parameter to factory function
The factory method doesn't match the signature of
`reader_lifecycle_policy::make_reader()`; notably, the permit is missing.
Add it, as it is important that the wrapping evictable reader and the
underlying reader share the same permit.
2021-07-08 12:31:36 +03:00
Botond Dénes
2a45d643b6 test/boost/mutation_reader_test: share permit between readers in a read
Permits were designed such that there is one permit per read, being
shared by all readers in that read. Make sure readers created by tests
adhere to this.
2021-07-08 12:31:36 +03:00
Botond Dénes
0f36e5c498 memtable: migrate off the global reader concurrency semaphore
Require the caller of `create_flush_reader()` to pass a permit instead.
2021-07-08 12:31:36 +03:00
Botond Dénes
c4e71fb9b8 reader_concurrency_semaphore: remove default name parameter
Naming the concurrency semaphore is currently optional, with unnamed
semaphores defaulting to "Unnamed semaphore". Although the most
important semaphores are named, many still aren't, which makes for a
poor debugging experience when one of them times out.
To prevent this, remove the name parameter defaults from those
constructors that have it and require a unique name to be passed in.
Also update all sites creating a semaphore and make sure they use a
unique name.
2021-07-08 12:31:36 +03:00
Raphael S. Carvalho
1924e8d2b6 treewide: Move compaction code into a new top-level compaction dir
Since compaction is layered on top of sstables, let's move all compaction code
into a new top-level directory.
This change will give me extra motivation to remove all layer violations, like
sstables calling compaction-specific code, and compaction's entanglement with
other components like table and storage service.

Next steps:
- remove all layer violations
- move the compaction code in the sstables namespace into a new compaction namespace
- move the compaction unit tests into their own file

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210707194058.87060-1-raphaelsc@scylladb.com>
2021-07-07 23:21:51 +03:00
Tomasz Grabiec
33cba08735 tests: perf_fast_forward: Use data_source::make_ck()
Data sources differ in clustering key type. Make sure to use the right
data_value instance to produce correct keys.
2021-07-07 20:27:44 +02:00
Tomasz Grabiec
fa481e92c1 tests: perf_fast_forward: Move declaration of clustered_ds up 2021-07-07 20:27:44 +02:00
Tomasz Grabiec
407e42f5d8 tests: perf_fast_forward: Make scylla_bench_small_part_ds1 not included by default
This dataset exists for convenience, to be able to run scylla-bench
against the data set generated by perf_fast_forward. It doesn't
increase coverage, so do not include it by default, to avoid wasting
resources on it.
2021-07-07 20:27:44 +02:00
Tomasz Grabiec
d7250a12fd tests: perf_fast_forward: Add data sets which conform to scylla-bench schema
Useful for fast generation of test data.
2021-07-07 20:27:44 +02:00
Avi Kivity
5571ef0d6d compression: define 'class' attribute for compression and deprecate 'sstable_compression'
Cassandra 3.0 deprecated the 'sstable_compression' attribute and added
'class' as a replacement. Follow suit by supporting both.

The SSTABLE_COMPRESSION variable is renamed to SSTABLE_COMPRESSION_DEPRECATED
to detect all uses and prevent future misuse.

To prevent old-version nodes from seeing the new name, the
compression_parameters class preserves the key name when it is
constructed from an options map, and emits the same key name when
asked to generate an options map.

Existing unit tests are modified to use the new name, and a test
is added to ensure the old name is still supported.

Fixes #8948.

Closes #8949
2021-07-07 19:15:20 +02:00
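The key-preserving round-trip behaviour described above can be sketched in a standalone class; `compression_params` here is a simplified, hypothetical stand-in for the real `compression_parameters`, not its actual interface:

```cpp
#include <cassert>
#include <map>
#include <string>

// Sketch: accept either key on input and remember which one was used,
// so the emitted options map round-trips with the same key. This keeps
// old-version nodes from ever seeing the new 'class' name unless the
// user supplied it.
class compression_params {
    std::string _key;         // key name this instance was constructed with
    std::string _compressor;
public:
    explicit compression_params(const std::map<std::string, std::string>& opts) {
        if (auto it = opts.find("class"); it != opts.end()) {
            _key = "class";
            _compressor = it->second;
        } else if (auto it = opts.find("sstable_compression"); it != opts.end()) {
            _key = "sstable_compression";  // deprecated spelling, still accepted
            _compressor = it->second;
        }
    }
    std::map<std::string, std::string> to_options() const {
        return {{_key, _compressor}};  // emit the same key we were given
    }
};
```
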
Avi Kivity
99d5355007 Merge "Cache sstable indexes in memory" from Tomasz
"
The main goal of this series is to improve the efficiency of reads from large partitions by
reducing the amount of I/O needed to read the sstable index. This is achieved by caching
index file pages and partition index entries in memory.

Currently, the pages are cached by individual reads only for the duration of the read.
This was done to facilitate binary search in the promoted index (intra-partition index).
After this series, all reads share the index file page cache, which stays around even after reads stop.

The page cache is subject to eviction. It uses the same region as the current row cache and shares
the LRU with row cache entries. This means that LRU objects need to be virtualized. This series takes
an easy approach and does so by introducing a virtual base class. This adds overhead to each row cache
entry to store the vtable pointer.

SSTable indexes form a hierarchy. There is a summary, which is a sparse partition key index into the
full partition index. This one is already kept in memory. The partition index is divided by the summary
into pages. Each entry in the partition index contains a promoted index, which is a sparse index into atoms
identified by the clustering key (rows, tombstones).

In order to read the promoted index, the reader needs to read the partition index entry first.
To speed this up, this series also adds caching of partition index entries. This cache survives
reads and is subject to eviction, just like the index file page cache. The unit of caching is
the partition index page. Without this cache, each access to the promoted index would have to be
preceded by parsing the partition index page containing the partition key.

Performance testing results follow.

1) scylla-bench large partition reads

  Populated with:

        perf_fast_forward --run-tests=large-partition-skips --datasets=sb-large-part-ds1 \
            -c1 -m1G --populate --value-size=1024 --rows=10000000

  Single partition, 9G data file, 4MB index file

  Test execution:

    build/release/scylla -c1 -m4G
    scylla-bench -workload uniform -mode read -limit 1 -concurrency 100 -partition-count 1 \
       -clustering-row-count 10000000 -duration 60m

  TL;DR: after: 2x throughput, 0.5x median latency

    Before (c1daf2bb24):

    Results
    Time (avg):	 5m21.033180213s
    Total ops:	 966951
    Total rows:	 966951
    Operations/s:	 3011.997048812112
    Rows/s:		 3011.997048812112
    Latency:
      max:		 74.055679ms
      99.9th:	 63.569919ms
      99th:		 41.320447ms
      95th:		 38.076415ms
      90th:		 37.158911ms
      median:	 34.537471ms
      mean:		 33.195994ms

    After:

    Results
    Time (avg):	 5m14.706669345s
    Total ops:	 2042831
    Total rows:	 2042831
    Operations/s:	 6491.22243800942
    Rows/s:		 6491.22243800942
    Latency:
      max:		 60.096511ms
      99.9th:	 35.520511ms
      99th:		 27.000831ms
      95th:		 23.986175ms
      90th:		 21.659647ms
      median:	 15.040511ms
      mean:		 15.402076ms

2) scylla-bench small partitions

  I tested several scenarios with varying data set sizes, e.g. data fully fitting in memory,
  half fitting, and much larger than memory. The improvement varied a bit, but in all cases the "after"
  code performed slightly better.

  Below is a representative run over data set which does not fit in memory.

  scylla -c1 -m4G
  scylla-bench -workload uniform -mode read  -concurrency 400 -partition-count 10000000 \
      -clustering-row-count 1 -duration 60m -no-lower-bound

  Before:

    Time (avg):	 51.072411913s
    Total ops:	 3165885
    Total rows:	 3165885
    Operations/s:	 61988.164024260645
    Rows/s:		 61988.164024260645
    Latency:
      max:		 34.045951ms
      99.9th:	 25.985023ms
      99th:		 23.298047ms
      95th:		 19.070975ms
      90th:		 17.530879ms
      median:	 3.899391ms
      mean:		 6.450616ms

  After:

    Time (avg):	 50.232410679s
    Total ops:	 3778863
    Total rows:	 3778863
    Operations/s:	 75227.58014424688
    Rows/s:		 75227.58014424688
    Latency:
      max:		 37.027839ms
      99.9th:	 24.805375ms
      99th:		 18.219007ms
      95th:		 14.090239ms
      90th:		 12.124159ms
      median:	 4.030463ms
      mean:		 5.315111ms

  The results include the warmup phase which populates the partition index cache, so the hot-cache
  effect is dampened in the statistics; see the 99th percentile. Latency improves after the cache
  warms up, which pulls the aggregate numbers lower.

3) perf_fast_forward --run-tests=large-partition-skips

    Caching is not used here; this test is included to show there are no regressions in the cold-cache case.

    TL;DR: No significant change

    perf_fast_forward --run-tests=large-partition-skips --datasets=large-part-ds1 -c1 -m1G

    Config: rows: 10000000, value size: 2000

    Before:

    read    skip      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
    1       0        36.429822            4  10000000     274500         62     274521     274429   153889.2 153883   19696986  153853       0        0        0        0        0        0        0  22.5%
    1       1        36.856236            4   5000000     135662          7     135670     135650   155652.0 155652   19704117  139326       1        0        1        1        0        0        0  38.1%
    1       8        36.347667            4   1111112      30569          0      30570      30569   155652.0 155652   19704117  139071       1        0        1        1        0        0        0  19.5%
    1       16       36.278866            4    588236      16214          1      16215      16213   155652.0 155652   19704117  139073       1        0        1        1        0        0        0  16.6%
    1       32       36.174784            4    303031       8377          0       8377       8376   155652.0 155652   19704117  139056       1        0        1        1        0        0        0  12.3%
    1       64       36.147104            4    153847       4256          0       4256       4256   155652.0 155652   19704117  139109       1        0        1        1        0        0        0  11.1%
    1       256       9.895288            4     38911       3932          1       3933       3930   100869.2 100868    3178298   59944   38912        0        1        1        0        0        0  14.3%
    1       1024      2.599921            4      9757       3753          0       3753       3753    26604.0  26604     801850   15071    9758        0        1        1        0        0        0  14.6%
    1       4096      0.784568            4      2441       3111          1       3111       3109     7982.0   7982     205946    3772    2442        0        1        1        0        0        0  13.8%

    64      1        36.553975            4   9846154     269359         10     269369     269337   155663.8 155652   19704117  139230       1        0        1        1        0        0        0  28.2%
    64      8        36.509694            4   8888896     243467          8     243475     243449   155652.0 155652   19704117  139120       1        0        1        1        0        0        0  26.5%
    64      16       36.466282            4   8000000     219381          4     219385     219374   155652.0 155652   19704117  139232       1        0        1        1        0        0        0  24.8%
    64      32       36.395926            4   6666688     183171          6     183180     183165   155652.0 155652   19704117  139158       1        0        1        1        0        0        0  21.8%
    64      64       36.296856            4   5000000     137753          4     137757     137737   155652.0 155652   19704117  139105       1        0        1        1        0        0        0  17.7%
    64      256      20.590392            4   2000000      97133         18      97151      94996   135248.8 131395    7877402   98335   31282        0        1        1        0        0        0  15.7%
    64      1024      6.225773            4    588288      94492       1436      95434      88748    46066.5  41321    2324378   30360    9193        0        1        1        0        0        0  15.8%
    64      4096      1.856069            4    153856      82893         54      82948      82721    16115.0  16043     583674   11574    2675        0        1        1        0        0        0  16.3%

    After:

    read    skip      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
    1       0        36.429240            4  10000000     274505         38     274515     274417   153887.8 153883   19696986  153849       0        0        0        0        0        0        0  22.4%
    1       1        36.933806            4   5000000     135377         15     135385     135354   155658.0 155658   19704085  139398       1        0        1        1        0        0        0  40.0%
    1       8        36.419187            4   1111112      30509          2      30510      30507   155658.0 155658   19704085  139233       1        0        1        1        0        0        0  22.0%
    1       16       36.353475            4    588236      16181          0      16182      16181   155658.0 155658   19704085  139183       1        0        1        1        0        0        0  19.2%
    1       32       36.251356            4    303031       8359          0       8359       8359   155658.0 155658   19704085  139120       1        0        1        1        0        0        0  14.8%
    1       64       36.203692            4    153847       4249          0       4250       4249   155658.0 155658   19704085  139071       1        0        1        1        0        0        0  13.0%
    1       256       9.965876            4     38911       3904          0       3906       3904   100875.2 100874    3178266   60108   38912        0        1        1        0        0        0  17.9%
    1       1024      2.637501            4      9757       3699          1       3700       3697    26610.0  26610     801818   15071    9758        0        1        1        0        0        0  19.5%
    1       4096      0.806745            4      2441       3026          1       3027       3024     7988.0   7988     205914    3773    2442        0        1        1        0        0        0  18.3%

    64      1        36.611243            4   9846154     268938          5     268942     268921   155669.8 155705   19704085  139330       2        0        1        1        0        0        0  29.9%
    64      8        36.559471            4   8888896     243135         11     243156     243124   155658.0 155658   19704085  139261       1        0        1        1        0        0        0  28.1%
    64      16       36.510319            4   8000000     219116         15     219126     219101   155658.0 155658   19704085  139173       1        0        1        1        0        0        0  26.3%
    64      32       36.439069            4   6666688     182954          9     182964     182943   155658.0 155658   19704085  139274       1        0        1        1        0        0        0  23.2%
    64      64       36.334808            4   5000000     137609         11     137612     137596   155658.0 155658   19704085  139258       2        0        1        1        0        0        0  19.1%
    64      256      20.624759            4   2000000      96971         88      97059      92717   138296.0 131401    7877370   98332   31282        0        1        1        0        0        0  17.2%
    64      1024      6.260598            4    588288      93967       1429      94905      88051    45939.5  41327    2324346   30361    9193        0        1        1        0        0        0  17.8%
    64      4096      1.881338            4    153856      81780        140      81920      81520    16109.8  16092     582714   11617    2678        0        1        1        0        0        0  18.2%

4) perf_fast_forward --run-tests=large-partition-slicing

    Caching enabled, each line shows the median run from many iterations

    TL;DR: We can observe a reduction in IO, which translates to a reduction in execution time,
           especially for slicing in the middle of the partition.

    perf_fast_forward --run-tests=large-partition-slicing --datasets=large-part-ds1 -c1 -m1G --keep-cache-across-test-cases

    Config: rows: 10000000, value size: 2000

    Before:

    offset  read      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    allocs   tasks insns/f    cpu
    0       1         0.000491          127         1       2037         24       2109        127        4.0      4        128       2       2        0        1        1        0        0        0       157      80 3058208  15.0%
    0       32        0.000561         1740        32      56995        410      60031      47208        5.0      5        160       3       2        0        1        1        0        0        0       386     111  113353  17.5%
    0       256       0.002052          488       256     124736       7111     144762      89053       16.6     17        672      14       2        0        1        1        0        0        0      2113     446   52669  18.6%
    0       4096      0.016437           61      4096     249199        692     252389     244995       69.4     69       8640      57       5        0        1        1        0        0        0     26638    1717   23321  22.4%
    5000000 1         0.002171          221         1        461          2        466        221       25.0     25        268       3       3        0        1        1        0        0        0       638     376 14311524  10.2%
    5000000 32        0.002392          404        32      13376         48      13528      13015       27.0     27        332       5       3        0        1        1        0        0        0       931     432  489691  11.9%
    5000000 256       0.003659          279       256      69967        764      73130      52563       39.5     41        780      19       3        0        1        1        0        0        0      2689     825   93756  15.8%
    5000000 4096      0.018592           55      4096     220313        433     234214     218803       94.2     94       9484      62       9        0        1        1        0        0        0     27349    2213   26562  21.0%

    After:

    offset  read      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    allocs   tasks insns/f    cpu
    0       1         0.000229          115         1       4371         85       4585        115        2.1      2         64       1       1        1        0        0        0        0        0        90      31 1314749  22.2%
    0       32        0.000277         2174        32     115674       1015     128109      14144        3.0      3         96       2       1        1        0        0        0        0        0       319      62   52508  26.1%
    0       256       0.001786          576       256     143298       5534     179142     113715       14.7     17        544      15       1        1        0        0        0        0        0      2110     453   45419  21.4%
    0       4096      0.015498           61      4096     264289       2006     268850     259342       67.4     67       8576      59       4        1        0        0        0        0        0     26657    1738   22897  23.7%
    5000000 1         0.000415          233         1       2411         15       2456        234        4.1      4        128       2       2        1        0        0        0        0        0       199      72 2644719  16.8%
    5000000 32        0.000635         1413        32      50398        349      51149      46439        6.0      6        192       4       2        1        0        0        0        0        0       458     128  125893  18.6%
    5000000 256       0.002028          486       256     126228       3024     146327      82559       17.8     18       1024      13       4        1        0        0        0        0        0      2123     385   51787  19.6%
    5000000 4096      0.016836           61      4096     243294        814     263434     241660       73.0     73       9344      62       8        1        0        0        0        0        0     26922    1920   24389  22.4%

Future work:

 - Check the impact on non-uniform workloads. Caching sstable indexes takes space away from the row cache,
   which may reduce its hit ratio.

 - Reduce memory footprint of partition index cache. Currently, about 8x bloat over the on-disk size.

 - Disable cache population for "bypass cache" reads

 - Add a switch to disable sstable index caching, per-node, maybe per-table

 - Better sstable index format. The current format leads to inefficient caching, since only some elements of a cached
   page can be hot; a B-tree index would be more efficient. The same applies to the partition index page, of which
   only some elements can be hot.

 - Add a heuristic for reducing index file IO size when large partitions are anticipated. If we're bound by the
   disk's bandwidth, it's wasteful to read the front of the promoted index using 32K IOs; better to use 4K, which
   should cover the partition entry, and then let binary search read the rest.

In V2:

 - Fixed a perf_fast_forward regression in the number of IOs used to read a partition index page.
   The reader uses 32K reads, which were split by the page cache into 4K reads.
   Fixed by propagating IO size hints to the page cache and using a single IO to populate it.
   New patch: "cached_file: Issue single I/O for the whole read range on miss"

 - Avoid large allocations to store partition index page entries (due to managed_vector storage).
   There is a unit test which detects this and fails.
   Fixed by implementing chunked_managed_vector, based on chunked_vector.

 - Fixed a bug in cached_file::evict_gently() where the wrong allocation strategy was used to free btree chunks

 - Simplify region_impl::free_buf() according to Avi's suggestions

 - Fit segment_kind into segment_descriptor::_free_space and lift the requirement that _buf_pointers emptiness determines the kind

 - Worked around a SIGSEGV, most likely due to coroutine miscompilation, by manipulating local object scope.

 - Wire up system/drop_sstable_caches RESTful API

 - Fix use-after-move on permit for the old scanning ka/la index reader

 - Fixed more cases of double open_data() in tests leading to assert failure

 - Adjusted cached_file class doc to account for changes in behavior.

 - Rebased

Fixes #7079.
Refs #363.
"

* tag 'sstable-index-caching-v2' of github.com:tgrabiec/scylla: (39 commits)
  api: Drop sstable index caches on system/drop_sstable_caches
  cached_file: Issue single I/O for the whole read range on miss
  row_cache: cache_tracker: Do not register metrics when constructed for tests
  sstables, cached_file: Evict cache gently when sstable is destroyed
  sstables: Hide partition_index_cache implementation away from sstables.hh
  sstables: Drop shared_index_lists alias
  sstables: Destroy partition index cache gently
  sstables: Cache partition index pages in LSA and link to LRU
  utils: Introduce lsa::weak_ptr<>
  sstables: Rename index_list to partition_index_page and shared_index_lists to partition_index_cache
  sstables, cached_file: Avoid copying buffers from cache when parsing promoted index
  cached_file: Introduce get_page_units()
  sstables: read: Document that primitive_consumer::read_32() is alloc-free
  sstables: read: Count partition index page evictions
  sstables: Drop the _use_binary_search flag from index entries
  sstables: index_reader: Keep index objects under LSA
  lsa: chunked_managed_vector: Adapt more to managed_vector
  utils: lsa: chunked_managed_vector: Make LSA-aware
  test: chunked_managed_vector_test: Make exception_safe_class standard layout
  lsa: Copy chunked_vector to chunked_managed_vector
  ...
2021-07-07 18:17:10 +03:00
Pavel Solodovnikov
b959f5d394 test: lib: copy query_options in single_node_cql_env::execute_cql()
`query_processor::execute_direct()` takes a non-const ref
to query options, meaning it's not safe to pass the same
instance to subsequent invocations of `execute_direct()`
in the tests.

Copy the default query options at each invocation of `execute_cql()`
so that no side effects can occur.

Tests: unit(dev, debug)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210705094824.243573-2-pa.solodovnikov@scylladb.com>
2021-07-07 11:46:50 +03:00
Nadav Har'El
775a64b003 test/alternator: test for change in CDC preimage
In pull request #8568, the CDC API changed slightly, with preimage data
gaining extra "delete$k" values for columns whose preimage was missing.
In this new test, we verify that this change did not break Alternator.
We didn't expect it to break Alternator, because Alternator just outputs the
known base-table columns and ignores columns which aren't real base-table
columns - like this "delete$k".

In the test we set up a stream with preimages, ensure that a real column
(note that an LSI key is a real column, as opposed to a map element) has a
null preimage, and see that the preimage is returned as expected,
without fake columns like "delete$k".

The test passes, showing that PR #8568 was ok.
The test also passes, as expected, on DynamoDB.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210504120121.915829-1-nyh@scylladb.com>
2021-07-06 14:53:42 +02:00