Commit Graph

863 Commits

Avi Kivity
3c44445c07 Merge "Introduce off-strategy compaction for repair-based bootstrap and replace" from Raphael
"
Scylla suffers from aggressive compaction after a repair-based operation has been initiated. This translates into bad latency and slowness for the operation itself.

This aggressiveness comes from the fact that:
1) new sstables are immediately added to the compaction backlog, thus reducing the bandwidth available for the operation.
2) new sstables are in bad shape when integrated into the main sstable set, not conforming to the strategy invariant.

To solve this problem, new sstables will be incrementally reshaped, off the compaction strategy, until finally integrated into the main set.

The solution takes advantage of the fact that there's only one sstable per vnode range, meaning sstables generated by repair-based operations are disjoint.

NOTE: off-strategy for repair-based decommission and removenode will follow this series and require little work, as the needed infrastructure is introduced here.

Refs #5226.
"

* 'offstrategy_v7' of github.com:raphaelsc/scylla:
  tests: Add unit test for off-strategy sstable compaction
  table: Wire up off-strategy compaction on repair-based bootstrap and replace
  table: extend add_sstable_and_update_cache() for off-strategy
  sstables/compaction_manager: Add function to submit off-strategy work
  table: Introduce off-strategy compaction on maintenance sstable set
  table: change build_new_sstable_list() to accept other sstable sets
  table: change non_staging_sstables() to filter out off-strategy sstables
  table: Introduce maintenance sstable set
  table: Wire compound sstable set
  table: prepare make_reader_excluding_sstables() to work with compound sstable set
  table: prepare discard_sstables() to work with compound sstable set
  table: extract add_sstable() common code into a function
  sstable_set: Introduce compound sstable set
  reshape: STCS: preserve token contiguity when reshaping disjoint sstables
2021-03-22 10:43:13 +02:00
Benny Halevy
f562c9c2f3 test: sstable_datafile_test: tombstone_purge_test: use a longer ttl
As seen in next-3319 unit testing on jenkins,
the cell ttl may expire during the test (presumably
because the test machine was overloaded), leading to:
```
INFO  2021-03-21 10:05:23,048 [shard 0] compaction - [Compact tests.tombstone_purge 2fcaf680-8a1c-11eb-b1b9-97020c5d261e] Compacting [/jenkins/workspace/scylla-master/next/scylla/testlog/release/scylla-af8644ec-7f07-4ffe-80bf-6703a942e435/la-17-big-Data.db:level=0:origin=, ]
INFO  2021-03-21 10:05:23,048 [shard 0] compaction - [Compact tests.tombstone_purge 2fcaf680-8a1c-11eb-b1b9-97020c5d261e] Compacted 1 sstables to []. 4kB to 0 bytes (~0% of original) in 0ms = 0 bytes/s. ~128 total partitions merged to 0.
./test/lib/mutation_assertions.hh(108): fatal error: in "tombstone_purge_test": Mutations differ, expected {table: 'tests.tombstone_purge', key: {'id': alpha, token: -7531858254489963}, mutation_partition: {
  rows: [
    {
      cont: true,
      dummy: false,
      position: {
        bound_weight: 0,
      },
      'value': { atomic_cell{1,ts=1616313953,expiry=1616313958,ttl=5} },
    },
  ]
}
}
 ...but got: {table: 'tests.tombstone_purge', key: {'id': alpha, token: -7531858254489963}, mutation_partition: {
  rows: [
    {
      cont: true,
      dummy: false,
      position: {
        bound_weight: 0,
      },
      'value': { atomic_cell{DEAD,ts=1616313953,deletion_time=1616313953} },
    },
  ]
}
}
```

This corresponds to:
```
2395            auto mut2 = make_expiring(alpha, ttl);
2396            auto mut3 = make_insert(beta);
...
2399            auto sst2 = make_sstable_containing(sst_gen, {mut2, mut3});
```

Extend the (logical) ttl to 10 seconds to reduce flakiness
due to real-time timing.

Test: sstable_datafile_test(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210321142931.1226850-1-bhalevy@scylladb.com>
2021-03-21 16:42:00 +02:00
Avi Kivity
1e820687eb Merge "reader_concurrency_semaphore: limit non-admitted inactive reads" from Botond
"
Due to a bad interaction of recent changes (913d970 and 4c8ab10), inactive
readers that are not admitted have managed to completely fly under the
radar, avoiding any sort of limitation. The reason is that pre-admission
the permits don't forward their resource cost to the semaphore, to
prevent them possibly blocking their own admission later. However, this
meant that if such a reader is registered as inactive, it completely
avoids the normal resource based eviction mechanism and can accumulate
without bounds.
The real solution to this is to move the semaphore before the cache and
make all reads pass admission before they get started (#4758). Although
work has been started towards this, it is still a while until it lands.
In the meanwhile this patchset provides a workaround in the form of a
new inactive state, which -- like admitted -- causes the permit to
forward its cost to the semaphore, making sure these un-admitted
inactive reads are accounted for and evicted if there are too many of
them.

Fixes: #8258

Tests: unit(release), dtest(toppartitions_test.py:TestTopPartitions.test_read_by_gause_key_distribution_for_compound_primary_key_and_large_rows_number)
"

* 'reader-concurrency-semaphore-limit-inactive-reads/v4' of https://github.com/denesb/scylla:
  test: mutation_reader_test: add test for permit cleanup
  test: querier_cache_test: add memory based cache eviction test
  reader_permit: add inactive state
  querier: insert(): account immediately evicted querier as resource based eviction
  reader_concurrency_semaphore: fix clear_inactive_reads()
  reader_concurrency_semaphore: make inactive_read_handle a weak reference
  reader_concurrency_semaphore: make evict() noexcept
  reader_concurrency_semaphore: update out-of-date comments
2021-03-21 16:24:54 +02:00
Avi Kivity
a78f43b071 Merge 'tracing: fast slow query tracing' from Ivan Prisyazhnyy
The set of patches introduces a new tracing mode - `fast slow query tracing`. In this mode, Scylla tracks only tracing sessions and omits all tracing events if the tracing context does not have a `full_tracing` state set.

Fixes #2572

Motivation
---

We want to run production systems with that option always enabled so we can always catch slow queries without overhead. The next step is to further optimize the costs of having tracing enabled, minimizing session context handling overhead so that it is as transparent to the end user as possible.

Fast tracing mode
---

To read the status do

    $ curl -v http://localhost:10000/storage_service/slow_query

To enable fast slow-query tracing

    $ curl -v --request POST http://localhost:10000/storage_service/slow_query\?fast=true\&enable=true

Potential optimizations
---

- remove tracing::begin(lazy_eval)
- replace tracing::begin(string) with an enum to remove copying and memory allocations
- merge parameters allocations
- group parameters check for trace context
- delay formatting
- reuse prepared statement shared_ptr instead of both copying it and copying its query

Performance
---

100% cache hits
---

1 Core:

```
$ SCYLLA_HOME=/home/sitano.public/Projects/scylla build/release/scylla --smp 1 --cpuset 7 --log-to-syslog 0 --log-to-stdout 1 --default-log-level info --network-stack posix --workdir /home/sitano.public/Projects/scylla --developer-mode 1 --listen-address 0.0.0.0 --api-address 0.0.0.0 --rpc-address 0.0.0.0 --broadcast-rpc-address 172.18.0.1 --broadcast-address 127.0.0.1

./cassandra-stress write n=100000 no-warmup -pop seq=1..100000 -node 127.0.0.1 -log level=verbose -rate threads=1 -mode native cql3

curl --request POST http://localhost:10000/storage_service/slow_query\?fast\=false\&enable\=false
for i in $(seq 5); do
  taskset -c 2,3,4,5 ./cassandra-stress read duration=5m -pop seq=1..100000 -node 127.0.0.1 -log level=verbose -rate threads=4 throttle=30000/s -mode native cql3
done

curl --request POST http://localhost:10000/storage_service/slow_query\?fast\=true\&enable\=true
for i in $(seq 5); do
  taskset -c 2,3,4,5 ./cassandra-stress read duration=5m -pop seq=1..100000 -node 127.0.0.1 -log level=verbose -rate threads=4 throttle=30000/s -mode native cql3
done

curl --request POST http://localhost:10000/storage_service/slow_query\?fast\=false\&enable\=true
for i in $(seq 5); do
  taskset -c 2,3,4,5 ./cassandra-stress read duration=5m -pop seq=1..100000 -node 127.0.0.1 -log level=verbose -rate threads=4 throttle=30000/s -mode native cql3
done
```

| qps | baseline | fast, slow | nofast, slow | %[1-fastslow/baseline] |
| -- | -- | -- | -- | -- |
| | 29,018 | 26,468 | 23,591 | 8.79% |
| | 28,909 | 26,274 | 23,584 | 9.11% |
| | 28,900 | 26,547 | 23,598 | 8.14% |
| | 28,921 | 26,669 | 23,596 | 7.79% |
| | 28,821 | 26,385 | 23,601 | 8.45% |
| stdev | 70.24030182 | 150.9678774 | 6.670832032 | |
| avg | 28,914 | 26,469 | 23,594 | |
| stderr | 0.24% | 0.57% | 0.03% | |
| %[avg/baseline] | | **8.46%** | 18.40% | |

8.46% performance degradation in `fast slow query mode` for a pure in-memory workload with minimum traces.
18.40% performance degradation in `original slow query mode` for a pure in-memory workload with minimum traces.

0% cache hits
---

1GB memory, 1 Core:

    $ SCYLLA_HOME=/home/sitano.public/Projects/scylla build/release/scylla --memory 1G --smp 1 --cpuset 7 --log-to-syslog 0 --log-to-stdout 1 --default-log-level info --network-stack posix --workdir /home/sitano.public/Projects/scylla --developer-mode 1 --listen-address 0.0.0.0 --api-address 0.0.0.0 --rpc-address 0.0.0.0 --broadcast-rpc-address 172.18.0.1 --broadcast-address 127.0.0.1

2.4GB, 10000000 keys data:

    $ ./cassandra-stress write n=10000000 no-warmup -pop seq=1..10000000 -node 127.0.0.1 -log level=verbose -rate threads=4 -mode native cql3
    $ curl --request POST http://localhost:10000/storage_service/slow_query\?fast\=true\&enable\=true

CASSANDRA_STRESS prepared statements with BYPASS CACHE

    $ taskset -c 2,3,4,5 ./cassandra-stress read duration=5m -pop seq=1..10000000 -node 127.0.0.1 -log level=verbose -rate threads=4 throttle=30000/s -mode native cql3

20000 reads IOPS, 100MB/s from disk

| qps | baseline reads | fast, slow reads | %[1-fastslow/baseline] |
| -- | -- | -- | -- |
| | 9,575 | 9,054 | 5.44% |
| | 9,614 | 9,065 | 5.71% |
| | 9,610 | 9,066 | 5.66% |
| | 9,611 | 9,062 | 5.71% |
| | 9,614 | 9,073 | 5.63% |
| stdev | 16.75410397 | 6.892024376 | |
| avg | 9,605 | 9,064 | |
| stderr | 0.17% | 0.08% | |
| %[avg/baseline] | | **5.63%** | |

5.63% performance degradation in `fast slow query mode` for a pure on-disk workload with minimum traces.

Closes #8314

* github.com:scylladb/scylla:
  tracing: fast mode unit test
  tracing: rest api for lightweight slow query tracing
  tracing: omit tracing session events and subsessions in fast mode
2021-03-21 12:15:17 +02:00
Avi Kivity
58b7f225ab keys: convert trichotomic comparators to return std::strong_ordering
A trichotomic comparator returning an int can easily be mistaken
for a less comparator, as the return types are convertible.

Use the new std::strong_ordering instead.

A caller in cql3's update_parameters.hh is also converted, following
the path of least resistance.

Ref #1449.

Test: unit (dev)

Closes #8323
2021-03-21 09:30:43 +02:00
Raphael S. Carvalho
64d78eae6a tests: Add unit test for off-strategy sstable compaction
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 16:56:00 -03:00
Raphael S. Carvalho
439e9b6fab table: change build_new_sstable_list() to accept other sstable sets
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:47:49 -03:00
Raphael S. Carvalho
6e95860e09 table: change non_staging_sstables() to filter out off-strategy sstables
SSTables that are off-strategy should be excluded by this function as
it's used to select candidates for regular compaction.
So in addition to only returning candidates from the main set, let's
also rename it to precisely reflect its behavior.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:47:49 -03:00
Raphael S. Carvalho
1e7a444a8b table: Wire compound sstable set
From now on, _sstables becomes the compound set, and _main_sstables refers
only to the main sstables of the table. In the near future, the maintenance
set will be introduced and will also be managed by the compound set.

So add_sstable() and on_compaction_completion() are changed to
explicitly insert and remove sstables from the main set.

By storing the compound set in _sstables, functions which used _sstables
for creating readers, computing statistics, etc., will not have to be
changed when we introduce the maintenance set, so this approach greatly
minimizes the required code changes.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:46:06 -03:00
Raphael S. Carvalho
e4b5f5ba33 sstable_set: Introduce compound sstable set
This new sstable set implementation is useful for combining the operations
of multiple sstable sets, each of which can still be referenced individually
via its shared ptr reference.
It will be used when the maintenance set is introduced in table, where a
compound set is required to allow both sets to have their operations
efficiently combined.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-18 11:42:49 -03:00
Botond Dénes
ad02f313dd test: mutation_reader_test: add test for permit cleanup
Check that a permit correctly restores the units on the semaphore in
each state it can be destroyed in.
2021-03-18 16:18:22 +02:00
Ivan Prisyazhnyy
f00391af8b tracing: fast mode unit test
Signed-off-by: Ivan Prisyazhnyy <ivan@scylladb.com>
2021-03-18 15:05:09 +02:00
Botond Dénes
c822f0d02a test: querier_cache_test: add memory based cache eviction test
Ensure that the memory consumption of querier cache entries is kept
under the limit.
2021-03-18 14:58:21 +02:00
Botond Dénes
594636ebbf querier: insert(): account immediately evicted querier as resource based eviction
`reader_concurrency_semaphore::register_inactive_read()` drops the
registered inactive read immediately if there is a resource shortage.
This is in effect a resource based eviction, so account it as such in
`querier::insert()`.
2021-03-18 14:57:57 +02:00
Botond Dénes
1a337d0ec1 reader_concurrency_semaphore: fix clear_inactive_reads()
Broken by the move to an intrusive container (9cbbf40), which caused
said method to only clear the container but not destroy the inactive
reads contained therein. This patch restores the previous behaviour and
also adds a call in the destructor (to ensure inactive reads are cleaned
up under any circumstances), as well as a unit test.
2021-03-18 14:57:57 +02:00
Piotr Sarna
2509b7dbde Merge 'dht: convert ring_position and decorated_key to std::strong_ordering' from Avi Kivity
As #1449 notes, trichotomic comparators returning int are dangerous as they
can be mistaken for less comparators. This series converts dht::ring_position
and dht::decorated_key, as well as a few closely related downstream types, to
return std::strong_ordering.

Closes #8225

* github.com:scylladb/scylla:
  dht: ring_position, decorated_key: convert tri_comparators to std::strong_ordering
  pager: rephrase misleading comparison check
  test: total_order_checks: prepare for std::strong_ordering
  test: mutation_test: prepare merge_container for std::strong_ordering
  intrusive_array: prepare for std::strong_ordering
  utils: collection-concepts: prepare for std::strong_ordering
2021-03-18 11:51:54 +01:00
Avi Kivity
a5f17b9a2d test: total_order_checks: prepare for std::strong_ordering
Adjust the total_order_check template to work with comparators
returning either int (as a temporary compatibility measure) or
std::strong_ordering (for #1449 safety).
2021-03-18 12:40:05 +02:00
Avi Kivity
f0092ae475 test: mutation_test: prepare merge_container for std::strong_ordering
The function merge_container() accepts a trichotomic comparator returning
an int. As #1449 explains, this is dangerous as it could be mistaken for
a less comparator. Switch to std::strong_ordering, but leave a compatible
merge_container() in place as it is still needed (even after this series).
2021-03-18 12:40:05 +02:00
Benny Halevy
7862cad669 sstable_set: partitioned_sstable_set: clone: do clone all sstables
The existing implementation wrongfully shares the _all sstables list
rather than cloning it. This caused a use-after-free
in `repair_meta::do_estimate_partitions_on_local_shard`
when traversing a shared sstable_set, during which
`table::make_reader_excluding_sstables` erased an entry.
The erase should have happened on a cloned copy
of the sstable_list, not on a shared copy.

The regression was introduced in
c3b8757fa1.

Added a unit test that reproduces the share-on-copy issue
for partitioned_sstable_set (sstables::sstable_set).

Fixes #8274

Test: unit(release, debug)
DTest: materialized_views_test.py:TestMaterializedViews.simple_repair_test(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Reviewed-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210317145552.701559-1-bhalevy@scylladb.com>
2021-03-18 11:15:59 +02:00
Avi Kivity
972ea9900c Merge 'commitlog: Make pre-allocation drop O_DSYNC while pre-filling' from Calle Wilund
Refs #7794

Iff we need to pre-fill a segment file in O_DSYNC mode, we should
drop this for the pre-fill, to avoid issuing flushes until the file
is filled. This is done by temporarily closing, re-opening in "normal" mode,
filling, then re-opening.

Closes #8250

* github.com:scylladb/scylla:
  commitlog: Make pre-allocation drop O_DSYNC while pre-filling
  commitlog: coroutinize allocate_segment_ex
2021-03-17 09:59:22 +02:00
Nadav Har'El
e344f74858 Merge 'logalloc: improve background reclaim shares management' from Avi Kivity
The log structured allocator's background reclaimer tries to
allocate CPU power proportional to memory demand, but a
bug made that not happen. Fix the bug, add some logging,
and future-proof the timer. Also, harden the test against
overcommitted test machines.

Fixes #8234.

Test: logalloc_test(dev), 20 concurrent runs on 2 cores (1 hyperthread each)

Closes #8281

* github.com:scylladb/scylla:
  test: logalloc_test: harden background reclaim test against cpu overcommit
  logalloc: background reclaim: use default scheduling group for adjusting shares
  logalloc: background reclaim: log shares adjustment under trace level
  logalloc: background reclaim: fix shares not updated by periodic timer
2021-03-17 09:59:21 +02:00
Avi Kivity
65fea203d2 test: logalloc_test: harden background reclaim test against cpu overcommit
Use thread CPU time instead of real time to avoid an overcommitted
machine from not being able to supply enough CPU for the test.
2021-03-15 13:54:49 +02:00
Alejo Sanchez
6139ad6337 raft: tests: move boost tests to tests/raft
Move raft boost tests to test/raft directory.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2021-03-15 06:16:58 -04:00
Calle Wilund
48ca01c3ab commitlog: Make pre-allocation drop O_DSYNC while pre-filling
Refs #7794

Iff we need to pre-fill a segment file in O_DSYNC mode, we should
drop this for the pre-fill, to avoid issuing flushes until the file
is filled. This is done by temporarily closing, re-opening in "normal" mode,
filling, then re-opening.

v2:
* More comment
v3:
* Add missing flush
v4:
* comment
v5:
* Split coroutine and fix into separate patches
2021-03-15 09:35:45 +00:00
Tomasz Grabiec
f2ecb4617e Merge "raft: implement prevoting stage in leader election" from Gleb
This is how the Raft PhD dissertation explains the need for the prevoting stage:

  One downside of Raft's leader election algorithm is that a server that
  has been partitioned from the cluster is likely to cause a disruption
  when it regains connectivity. When a server is partitioned, it will
  not receive heartbeats. It will soon increment its term to start
  an election, although it won't be able to collect enough votes to
  become leader. When the server regains connectivity sometime later, its
  larger term number will propagate to the rest of the cluster (either
  through the server's RequestVote requests or through its AppendEntries
  response). This will force the cluster leader to step down, and a new
  election will have to take place to select a new leader.

  The prevoting stage addresses that. In the Prevote algorithm, a
  candidate only increments its term if it first learns from a majority of
  the cluster that they would be willing to grant the candidate their votes
  (if the candidate's log is sufficiently up-to-date, and the voters have
  not received heartbeats from a valid leader for at least a baseline
  election timeout).

  The Prevote algorithm solves the issue of a partitioned server disrupting
  the cluster when it rejoins. While a server is partitioned, it won't
  be able to increment its term, since it can't receive permission
  from a majority of the cluster. Then, when it rejoins the cluster, it
  still won't be able to increment its term, since the other servers
  will have been receiving regular heartbeats from the leader. Once the
  server receives a heartbeat from the leader itself, it will return to
  the follower state (in the same term).

In our implementation we have a "stable leader" extension that prevents
a spurious RequestVote from deposing an active leader, but an AppendEntries
with a higher term will still do that, so the prevoting extension is also
required.

* scylla-dev/raft-prevote-v5:
  raft: store leader and candidate state in state variant
  raft: add boost tests for prevoting
  raft: implement prevoting stage in leader election
  raft: reset the leader on entering candidate state
  raft: use modern unordered_set::contains instead of find in become_candidate
2021-03-12 11:15:51 +01:00
Gleb Natapov
e231186a7b raft: store leader and candidate state in state variant
We already have server-state-dependent state in the fsm, so there is no need
to maintain the "voters" and "tracker" optionals as well. The upside is that
the optional and variant states cannot drift apart now.
2021-03-12 11:12:57 +02:00
Gleb Natapov
e17e7d57bd raft: add boost tests for prevoting 2021-03-12 11:12:57 +02:00
Avi Kivity
486f6bf29c Merge "sstables: move format specific reader code to kl/, mx/" from Botond
"
Currently the sstable reader code is scattered across several source
files as follows (paths are relative to sstables/):
* partition.cc - generic reader code;
* row.hh - format specific code related to building mutation fragments
  from cells;
* mp_row_consumer.hh - format specific code related to parsing the raw
  byte stream;

This is a strange organization scheme given that the generic sstable
reader is a template and as such it doesn't itself depend on the other
headers where the consumer and context implementations live. Yet these
are all included in partition.cc just so the reader factory function can
instantiate the sstable reader template with the format specific
objects.

This patchset reorganizes this code such that the generic sstable reader
is exposed in a header. Furthermore, format specific code is moved to
the kl/ and mx/ directories respectively. Each directory has a
reader.hh with a single factory function which creates the reader, all
the format specific code is hidden from sight. The added benefit is that
now reader code specific to a format is centralized in the format
specific folder, just like the writer code.

This patchset only moves code around, no logical changes are made.

Tests: unit(dev)
"

* 'sstable-reader-separation/v1' of https://github.com/denesb/scylla:
  sstables: get rid of mp_row_consumer.{hh,cc}
  sstables: get rid of row.hh
  sstables/mp_row_consumer.hh: remove unused struct new_mutation
  sstables: move mx specific context and consumer to mx/reader.cc
  sstables: move kl specific context and consumer to kl/reader.cc
  sstables: mv partition.cc sstable_mutation_reader.hh
2021-03-11 16:57:54 +02:00
Botond Dénes
3ba782bddd sstables: get rid of row.hh
Move stuff contained therein to `sstable_mutation_reader.{hh,cc}` which
will serve as the collection point of utility stuff needed by all reader
implementations.
2021-03-11 12:17:13 +02:00
Botond Dénes
4e3ae9d913 sstables: move kl specific context and consumer to kl/reader.cc
Move all the kl format specific context and consumer code to
kl/reader* and add a factory function `kl::make_reader()` which takes
over the job of instantiating the `sstable_mutation_reader` with the kl
specific context and consumer. Code which is used by test is moved to
kl/reader_impl.hh, while code that can be hidden us moved to
kl/reader.cc. Users who just want to create a reader only have to
include kl/reader.hh.
2021-03-11 12:17:13 +02:00
Avi Kivity
c8f692e526 Merge 'cql3: Rewrite get_clustering_bounds() using expressions' from Dejan Mircevski
Instead of using the `restrictions` class hierarchy, calculate the clustering slice using the `expr::expression` representation of the WHERE clause.  This will allow us to eventually drop the `restrictions` hierarchy altogether.

Tests: unit (dev, debug)

Closes #8227

* github.com:scylladb/scylla:
  cql3: Make get_clustering_bounds() use expressions
  cql3/expr: Add is_multi_column()
  cql3/expr: Add more operators to needs_filtering
  cql3: Replace CK-bound mode with comparison_order
  cql3/expr: Make to_range globally visible
  cql3: Gather slice-defining WHERE expressions
  cql3: Add statement_restrictions::_where
  test: Add unit tests for get_clustering_bounds
2021-03-11 11:46:52 +02:00
Dejan Mircevski
990de02d28 cql3: Make get_clustering_bounds() use expressions
Use expressions instead of _clustering_columns_restrictions.  This is
a step towards replacing the entire restrictions class hierarchy with
expressions.

Update some expected results in unit tests to reflect the new code.
These new results are equivalent to the old ones in how
storage_proxy::query() will process them (details:
bound_view::from_range() returns the same result for an empty-prefix
singular as for (-inf,+inf)).

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-03-10 21:25:43 -05:00
Dejan Mircevski
2525759027 test: Add unit tests for get_clustering_bounds
... as guardrails for the upcoming rewrite.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-03-10 21:17:26 -05:00
Benny Halevy
ff5b42a0fa bytes_ostream: max_chunk_size: account for chunk header
Currently, if the data_size is greater than
max_chunk_size - sizeof(chunk), we end up
allocating up to max_chunk_size + sizeof(chunk) bytes,
exceeding buf.max_chunk_size().

This may lead to allocation failures, as seen in
https://github.com/scylladb/scylla/issues/7950,
where we couldn't allocate 131088 (= 128K + 16) bytes.

This change adjusts the exposed max_chunk_size()
to be max_alloc_size (128KB) - sizeof(chunk)
so that chunks in the write() path are normally
allocated in 128KB units.

Added a unit test - test_large_placeholder that
stresses the chunk allocation path from the
write_place_holder(size) entry point to make
sure it handles large chunk allocations correctly.

Refs #7950
Refs #8081

Test: unit(release), bytes_ostream_test(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210303143413.902968-1-bhalevy@scylladb.com>
2021-03-10 19:54:12 +02:00
Avi Kivity
5342d79461 Merge "Preparatory work in sstable_set for the upcoming compound_sstable_set_impl" from Raphael
* 'preparatory_work_for_compound_set' of github.com:raphaelsc/scylla:
  sstable_set: move all() implementation into sstable_set_impl
  sstable_set: preparatory work to change sstable_set::all() api
  sstables: remove bag_sstable_set
2021-03-10 19:19:26 +02:00
Botond Dénes
cf28552357 mutation_test: test_mutation_diff_with_random_generator: compact input mutations
This test checks that `mutation_partition::difference()` works correctly.
One of the checks it does is: m1 + m2 == m1 + (m2 - m1).
If the two mutations are identical but have compactable data, e.g. a
shadowable tombstone shadowed by a row marker, the apply will collapse
these, causing the above equality check to fail (as m2 - m1 is null).
To prevent this, compact the two input mutations.

Fixes: #8221
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210310141118.212538-1-bdenes@scylladb.com>
2021-03-10 16:28:14 +01:00
Raphael S. Carvalho
05b07c7161 sstable_set: preparatory work to change sstable_set::all() api
Users of sstable_set::all() rely on the set itself keeping a reference
to the returned list, so users can iterate through the list assuming
that it is alive all the way through.

This will change in the future, though, because there will be a
compound set impl which will have to merge the all() of multiple
managed sets, and the result is a temporary value.

So even range-based loops over all() have to keep a ref to the returned
list, to prevent the list from being prematurely destroyed.

So the following code
	for (auto& sst : *sstable_set.all()) { ... }
becomes
	for (auto sstables = sstable_set.all(); auto& sst : *sstables) { ... }

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-03-10 12:02:12 -03:00
Avi Kivity
746798fd56 Merge "sstables: get rid of data_consume_context" from Botond
"
This class is basically a wrapper around a unique pointer and a few
short convenience methods, but is otherwise a distraction in trying to
untangle the maze that is the sstable reader class hierarchy.
So this patchset folds it into its only real user: the sstable reader.
"

* 'data_consume_context_bye' of https://github.com/denesb/scylla:
  sstable: move data_consume_* factory methods to row.hh
  sstables: fold data_consume_context: into its users
  sstables: partition.cc: remove data_consume_* forward declarations
2021-03-10 16:45:32 +02:00
Botond Dénes
1aa2424dcf sstable: move data_consume_* factory methods to row.hh 2021-03-10 15:40:50 +02:00
Botond Dénes
a06465a8f3 sstables: fold data_consume_context: into its users
`data_consume_context` is a thin wrapper over the real context object
and it does little more than forward method calls to it. The few
methods doing more than mere forwarding can be folded into its single
real user: `sstable_reader`.
2021-03-10 15:38:58 +02:00
Pavel Emelyanov
096e452db9 test: Fix exit condition of row_cache_test::test_eviction_from_invalidated
The test populates the cache, then invalidates it, then tries to push
huge (10x times the segment size) chunks into seastar memory hoping that
the invalid entries will be evicted. The exit condition on the last
stage is -- total memory of the region (sum of both -- used and free)
becomes less than the size of one chunk.

However, the condition is wrong, because the cache usually contains a dummy
entry that's not necessarily on the lru, and on some test iteration it may
happen that

  evictable size < chunk size < evictable size + dummy size

In this case test fails with bad_alloc being unable to evict the memory
from under the dummy.

fixes: #7959
tests: unit(row_cache_test), unit(the failing case with the triggering
       seed from the issue + 200 times more with random seeds)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210309134138.28099-1-xemul@scylladb.com>
2021-03-09 17:57:52 +01:00
Gleb Natapov
2a41ad0b57 raft: add testing for non-voting members
Add tests to check if quorum (for leader election and commit index
purposes) is calculated correctly in the presence of non-voting members.
Message-Id: <20210304101158.1237480-3-gleb@scylladb.com>
2021-03-09 13:51:09 +01:00
Gleb Natapov
dd6ba3d507 raft: add non-voting member support
This patch adds support for non-voting members. A non-voting member is a
member whose vote is not counted for leader election and commit
index calculation purposes, and which cannot become a leader. It is
otherwise a normal raft node. The state is needed to let new nodes catch
up their log without disturbing the cluster.

All kinds of transitions are allowed. A node may be added as a voting
member directly, or it may be added as non-voting and later changed to
a voting one through an additional configuration change. A node can
likewise be demoted from voting to non-voting member through a
configuration change.
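The quorum rule described above can be sketched as follows (a minimal illustration, not Scylla's actual raft API):

```python
# Quorum is computed from voting members only; non-voters replicate
# the log but do not raise the majority threshold.
def quorum(members):
    voters = [m for m in members if m["can_vote"]]
    return len(voters) // 2 + 1

cluster = [
    {"id": "A", "can_vote": True},
    {"id": "B", "can_vote": True},
    {"id": "C", "can_vote": True},
    {"id": "D", "can_vote": False},  # new node catching up its log
]
print(quorum(cluster))  # 2 (majority of the 3 voters)
```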
Message-Id: <20210304101158.1237480-2-gleb@scylladb.com>
2021-03-09 13:47:48 +01:00
Tomasz Grabiec
3cb01f218f Merge "raft: add unit tests for log, tracker, votes and fix found bugs" from Kostja
Test log consistency after apply_snapshot() is called.
Ensure log::last_term(), log::last_conf_index() and log::size()
work as expected.

Misc cleanups.

* scylla-dev.git/raft-confchange-test-v4:
  raft: fix spelling
  raft: add a unit test for voting
  raft: do not account for the same vote twice
  raft: remove fsm::set_configuration()
  raft: consistently use configuration from the log
  raft: add ostream serialization for enum vote_result
  raft: advance commit index right after leaving joint configuration
  raft: add tracker test
  raft: tidy up follower_progress API
  raft: update raft::log::apply_snapshot() assert
  raft: add a unit test for raft::log
  raft: rename log::non_snapshoted_length() to log::in_memory_size()
  raft: inline raft::log::truncate_tail()
  raft: ignore AppendEntries RPC with a very old term
  raft: remove log::start_idx()
  raft: return a correct last term on an empty log
  raft: do not use raft::log::start_idx() outside raft::log()
  raft: rename progress.hh to tracker.hh
  raft: extend single_node_is_quiet test
2021-03-03 16:29:40 +01:00
Tomasz Grabiec
0dc57db248 Revert "Merge "raft: add unit tests for log, tracker, votes and fix found bugs" from Kostja"
This reverts commit f94f70cda8, reversing
changes made to 5206a97915.

The merged commit was not the latest version of the series. Revert
prior to merging the latest one.
2021-03-03 16:29:02 +01:00
Avi Kivity
5f4bf18387 Revert "Merge 'sstables: add versioning to the sstable_set ' from Wojciech Mitros"
This reverts commit 31909515b3, reversing
changes made to ef97adc72a. It shows many
serious regressions in dtest.

Fixes #8197.
2021-03-02 13:21:22 +02:00
Botond Dénes
257c295cff cql_query_test: add unit test for the more efficient range scan result format
The most user-visible aspect of this change is range scans which select
a small subset of the columns. These queries now work as the user
expects: unselected columns are not included when determining the size
of the result (or that of the page). This is the aspect this test
checks. While at it, also test single-partition queries.
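The size accounting being tested can be sketched as follows (hypothetical column sizes, not the actual result-format code):

```python
# Page/result size is accumulated from the selected columns only, so
# unselected columns no longer inflate the accounted result size.
row = {"pk": 8, "a": 100, "b": 4000}   # column -> serialized size in bytes
selected = {"pk", "a"}                 # the query's selection
row_weight = sum(size for col, size in row.items() if col in selected)
print(row_weight)  # 108: the 4000-byte unselected column is ignored
```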
2021-03-02 08:01:53 +02:00
Botond Dénes
fe280271a6 cql_query_test: test_query_limit: clean up scheduling groups
Destroy scheduling groups created for this test, so other tests can
create scheduling groups with the same name, without conflicts.
2021-03-02 07:53:53 +02:00
Avi Kivity
8747c684e0 Merge 'Move timeouts to client state' from Piotr Sarna
This series is extracted from #7913 as it may prove useful to other series as well, and #7913 might take a while until it's merged, given that it also depends on other unmerged pull requests.

The idea of this series is to move timeouts to the client state, which will allow changing them independently for each session - e.g. by setting per-service-level timeouts and initializing the values from attached service levels (see #7867).

Closes #8140

* github.com:scylladb/scylla:
  treewide: remove timeout config from query options
  cql3: use timeout config from client state instead of query options
  cql3: use timeout config from client state instead of query options
  cql3: use timeout config from client state instead of query options
  service: add timeout config to client state
2021-03-01 20:34:35 +02:00
Tomasz Grabiec
cb0b8d1903 row_cache: Zap dummy entries when populating or reading a range
This will prevent accumulation of unnecessary dummy entries.

A single-partition populating scan with clustering key restrictions
will insert dummy entries positioned at the boundaries of the
clustering query range to mark the newly populated range as
continuous.

Those dummy entries may accumulate with time, increasing the cost of
the scan, which needs to walk over them.

In some workloads we could prevent this. If a populating query
overlaps with dummy entries, we could erase the old dummy entry, since
it will not be needed: it will fall inside a broader continuous
range. This is the case for time-series workloads which scan with a
decreasing (newest-first) lower bound.

Refs #8153.

_last_row is now updated atomically with _next_row. Before, _last_row
was moved first. If an exception was thrown and the section was
retried, this could cause the wrong entry to be removed (the new next
instead of the old last) by the new algorithm. I don't think this was
causing problems before this patch.

The problem is not solved for all cases. After this patch, we remove
dummies only when there is a single MVCC version. We could patch
apply_monotonically() to also do it, so that dummies which are inside
continuous ranges are eventually removed, but this is left for
later.
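The zapping rule above can be sketched as follows (a simplified model that treats the cache as a sorted list of dummy positions, not Scylla's actual row-cache structures):

```python
# When a populating scan makes [lo, hi] continuous, any dummy entry
# strictly inside that range marks a boundary that no longer exists,
# so it can be erased ("zapped").
def zap_dummies(dummies, lo, hi):
    return [d for d in dummies if not (lo < d < hi)]

dummies = [10, 20, 30, 40]
# A populating scan covering [15, 35] subsumes the dummies at 20 and 30.
print(zap_dummies(dummies, 15, 35))  # [10, 40]
```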

perf_row_cache_reads output after that patch shows that the second
scan touches no dummies:

$ build/release/test/perf/perf_row_cache_reads_g -c1 -m200M
Rows in cache: 0
Populating with dummy rows
Rows in cache: 265320
Scanning
read: 142.621613 [ms], preemption: {count: 639, 99%: 0.545791 [ms], max: 0.526929 [ms]}, cache: 0/0 [MB]
read: 0.023197 [ms], preemption: {count: 1, 99%: 0.035425 [ms], max: 0.032736 [ms]}, cache: 0/0 [MB]

Message-Id: <20210226172801.800264-1-tgrabiec@scylladb.com>
2021-03-01 20:34:35 +02:00