Commit Graph

606 Commits

Author SHA1 Message Date
Nadav Har'El
9ff9cd37c3 alternator test: tests for the number type
We had some tests for the number type in Alternator and how it can be
stored, retrieved, calculated and sorted, but only had rudementary tests
for the allowed magnitude and precision of numbers.

This patch creates a new test file, test_number.py, with tests aiming to
check exactly the supported magnitudes and precision of numbers.

These tests verify two things:

1. That Alternator's number type supports the full precision and magnitude
   that DynamoDB's number type supports. We don't want to see precision
   or magnitude lost when storing and retrieving numbers, or when doing
   calculations on them.

2. That Alternator's number type does not have *better* precision or
   magnitude than DynamoDB does. If it did, users may be tempted to rely
   on that implementation detail.

The three tests of the first type pass; But all four tests of the second
type xfail: Alternator currently stores numbers using big_decimal which
has unlimited precision and almost-unlimited magnitude, and is not yet
limited by the precision and magnitude allowed by DynamoDB.
This is a known issue - Refs #6794 - and these four new xfailing tests
will can be used to reproduce that issue.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200707204824.504877-1-nyh@scylladb.com>
2020-07-09 07:38:36 +02:00
Rafael Ávila de Espíndola
b10beead61 memtable_snapshot_source: Avoid a std::bad_alloc crash
_should_compact is a condition_variable and condition_variable::wait()
allocates memory.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200706223201.903072-1-espindola@scylladb.com>
2020-07-08 15:21:50 +02:00
Avi Kivity
7ea9ee27dd Merge 'aggregates: Use type-specific comparators in min/max' from Juliusz
"
For collections and UDTs the `MIN()` and `MAX()` functions are
generated on the fly. Until now they worked by comparing just the
byte representations of their arguments.

This patch employs specific per-type comparators to provide semantically
sensible, dynamically created aggregates.

Fixes #6768
"

* jul-stas-6768-use-type-comparators-for-minmax:
  tests: Test min/max on set
  aggregate_fcts: Use per-type comparators for dynamic types
2020-07-08 15:07:57 +03:00
Juliusz Stasiewicz
f08e0e10be tests: Test min/max on set
Expected behavior is the lexicographical comparison of sets
(element by element), so this test was failing when raw byte
representations were compared.
2020-07-08 13:39:15 +02:00
Avi Kivity
b0698dfb38 Merge 'Rewrite CQL3 restriction representation' from dekimir
"
This is the first stage of replacing the existing restrictions code with a new representation. It adds a new class `expression` to replace the existing class `restriction`. Lots of the old code is deleted, though not all -- that will come in subsequent stages.

Tests: unit (dev, debug restrictions_test), dtest (next-gating)
"

* dekimir-restrictions-rewrite:
  cql3/restrictions: Drop dead code
  cql3/restrictions: Use free functions instead of methods
  cql3/restrictions: Create expression objects
  cql3/restrictions: Add free functions over new classes
  cql3/restrictions: Add new representation
2020-07-08 10:22:17 +03:00
Dejan Mircevski
37ebe521e3 cql3/restrictions: Use free functions instead of methods
Instead of `restriction` class methods, use the new free functions.
Specific replacement actions are listed below.

Note that class `restrictions` (plural) remains intact -- both its
methods and its type hierarchy remain intact for now.

Ensure full test coverage of the replacement code with new file
test/boost/restrictions_test.cc and some extra testcases in
test/cql/*.

Drop some existing tests because they codify buggy behaviour
(reference #6369, #6382).  Drop others because they forbid relation
combinations that are now allowed (eg, mixing equality and
inequality, comparing to NULL, etc.).

Here are some specific categories of what was replaced:

- restriction::is_foo predicates are replaced by using the free
  function find_if; sometimes it is used transitively (see, eg,
  has_slice)

- restriction::is_multi_column is replaced by dynamic casts (recall
  that the `restrictions` class hierarchy still exists)

- utility methods is_satisfied_by, is_supported_by, to_string, and
  uses_function are replaced by eponymous free functions; note that
  restrictions::uses_function still exists

- restriction::apply_to is replaced by free function
  replace_column_def

- when checking infinite_bound_range_deletions, the has_bound is
  replaced by local free function bounded_ck

- restriction::bounds and restriction::value are replaced by the more
  general free function possible_lhs_values

- using free functions allows us to simplify the
  multi_column_restriction and token_restriction hierarchies; their
  methods merge_with and uses_function became identical in all
  subclasses, so they were moved to the base class

- single_column_primary_key_restrictions<clustering_key>::needs_filtering
  was changed to reuse num_prefix_columns_that_need_not_be_filtered,
  which uses free functions

Fixes #5799.
Fixes #6369.
Fixes #6371.
Fixes #6372.
Fixes #6382.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-07-07 23:08:09 +02:00
Nadav Har'El
0143aaa5a8 merge: Forbid internal schema changes for distributed tables
Merged patch set from Piotr Sarna:

This series addresses issue #6700 again (it was reopened),
by forbidding all non-local schema changes to be performed
from within the database via CQL interface. These changes
are dangerous since they are not directly propagated to other
nodes.

Tests: unit(dev)
Fixes #6700

Piotr Sarna (4):
  test: make schema changes in query_processor_test global
  cql3: refuse to change schema internally for distributed tables
  test: expand testing internal schema changes
  cql3: add explanatory comments to execute_internal

 cql3/query_processor.hh                      | 13 ++++++++++++-
 cql3/statements/alter_table_statement.cc     |  6 ------
 cql3/statements/schema_altering_statement.cc | 15 +++++++++++++++
 test/boost/cql_query_test.cc                 |  8 ++++++--
 test/boost/query_processor_test.cc           | 16 ++++++++--------
 5 files changed, 41 insertions(+), 17 deletions(-)
2020-07-07 18:27:16 +03:00
Piotr Sarna
8ecae38d6b test: expand testing internal schema changes
... in order to ensure that not only ALTER TABLE, but also other
schema altering statements are not allowed for distributed
tables/keyspaces.
2020-07-07 10:02:58 +02:00
Piotr Sarna
9bdf17a804 test: make schema changes in query_processor_test global
Now that schema changes are going to be forbidden for non-local tables,
query_processor_test is updated accordingly.
2020-07-07 09:09:40 +02:00
Botond Dénes
5ebe2c28d1 db/view: view_update_generator: re-balance wait/signal on the register semaphore
The view update generator has a semaphore to limit concurrency. This
semaphore is waited on in `register_staging_sstable()` and later the
unit is returned after the sstable is processed in the loop inside
`start()`.
This was broken by 4e64002, which changed the loop inside `start()` to
process sstables in per table batches, however didn't change the
`signal()` call to return the amount of units according to the number of
sstables processed. This can cause the semaphore units to dry up, as the
loop can process multiple sstables per table but return just a single
unit. This can also block callers of `register_staging_sstable()`
indefinitely as some waiters will never be released as under the right
circumstances the units on the semaphore can permanently go below 0.
In addition to this, 4e64002 introduced another bug: table entries from
the `_sstables_with_tables` are never removed, so they are processed
every turn. If the sstable list is empty, there won't be any update
generated but due to the unconditional `signal()` described above, this
can cause the units on the semaphore to grow to infinity, allowing
future staging sstables producers to register a huge amount of sstables,
causing memory problems due to the amount of sstable readers that have
to be opened (#6603, #6707).
Both outcomes are equally bad. This patch fixes both issues and modifies
the `test_view_update_generator` unit test to reproduce them and hence
to verify that this doesn't happen in the future.

Fixes: #6774
Refs: #6707
Refs: #6603

Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200706135108.116134-1-bdenes@scylladb.com>
2020-07-07 08:53:00 +02:00
Dejan Mircevski
921dbd0978 cql/restrictions: Handle WHERE a>0 AND a<0
WHERE clauses with start point above the end point were handled
incorrectly.  When the slice bounds are transformed to interval
bounds, the resulting interval is interpreted as wrap-around (because
start > end), so it contains all values above 0 and all values below
0.  This is clearly incorrect, as the user's intent was to filter out
all possible values of a.

Fix it by explicitly short-circuiting to false when start > end.  Add
a test case.

Fixes #5799.

Tests: unit (dev)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-07-06 19:11:20 +03:00
Piotr Sarna
446b89f408 test: move json tests from manual/ to boost/
Manual tests are, as the name suggests, not run automatically,
which makes them more prone to regressions. JSON tests are
fast and correct, so there's no reason for them to be marked
as manual.

Message-Id: <dea75b0a0d1c238d12382a28840978884ac6ec2c.1594023481.git.sarna@scylladb.com>
2020-07-06 11:24:12 +03:00
Piotr Sarna
83ab41c76d test: add json test for parsing from map
Our JSON legacy helper functions for parsing documents to/from
string maps are indirectly tested by several unit tests, e.g.
caching_options_test.cc. They however lacked one corner case
detected only by dtest - parsing an empty map from a null JSON document.
This case is hereby added in order to prevent future regressions.

Message-Id: <df8243bd083b2ba198df665aeb944c8710834736.1594020411.git.sarna@scylladb.com>
2020-07-06 10:28:55 +03:00
Avi Kivity
cc891a5de8 Merge "Convert a few uses of sstring to std::string_view" from Rafael
"
This series converts an API to use std::string_view and then converts
a few sstring variables to be constexpr std::string_view. This has the
advantage that a constexpr variables cannot be part of any
initialization order problem.
"

* 'espindola/convert-to-constexpr' of https://github.com/espindola/scylla:
  auth: Convert sstring variables in common.hh to constexpr std::string_view
  auth: Convert sstring variables in default_authorizer to constexpr std::string_view
  cql_test_env: Make ks_name a constexpr std::string_view
  class_registry: Use std::string_view in (un)?qualified_name
2020-07-05 17:08:54 +03:00
Rafael Ávila de Espíndola
33af0c293f cql_test_env: Make ks_name a constexpr std::string_view
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-07-03 12:28:20 -07:00
Nadav Har'El
8e3ecc30a9 merge: Migrate from libjsoncpp to rjson
Merged patch series by Piotr Sarna:

The alternator project was in need of a more optimized
JSON library, which resulted in creating "rjson" helper functions.
Scylla generally used libjsoncpp for its JSON handling, but in  order
to reduce the dependency hell, the usage is now migrated
to rjson, which is faster and offers the same functionality.

The original plan was to be able to drop the dependency
on libjsoncpp-lib altogether and remove it from install-dependencies.sh,
but one last usage of it remains in our test suite,
namely cql_repl. The tool compares its output JSON textually,
so it depends on how a library presents JSON - what are the delimeters,
indentation, etc. It's possible to provide a layer of translation
to force rjson to print in an identical format, but the other issue
is that libjsoncpp keeps subobjects sorted by their name,
while rjson uses an unordered structure.
There are two possible solutions for the last remaining usage
of libjsoncpp:
 1. change our test suite to compare JSON documents with a JSON parser,
    so that we don't rely on internal library details
 2. provide a layer of translation which forces rjson to print
    its objects in a format idential to libjsoncpp.
(1.) would be preferred, since now we're also vulnerable for changes
inside libjsoncpp itself - if they change anything in their output
format, tests would start failing. The issue is not critical however,
so it's left for later.

Tests: unit(dev), manual(json_test),
       dtest(partitioner_tests.TestPartitioner.murmur3_partitioner_test)

Piotr Sarna (8):
  alternator,utils: move rjson.hh to utils/
  alternator: remove ambiguous string overloads in rjson
  rjson: add parse_to_map helper function
  rjson: add from_string_map function
  rjson: add non-throwing parsing
  rjson: move quote_json_string to rjson
  treewide: replace libjsoncpp usage with rjson
  configure: drop json.cc and json.hh helpers

 alternator/base64.hh                |   2 +-
 alternator/conditions.cc            |   2 +-
 alternator/executor.hh              |   2 +-
 alternator/expressions.hh           |   2 +-
 alternator/expressions_types.hh     |   2 +-
 alternator/rmw_operation.hh         |   2 +-
 alternator/serialization.cc         |   2 +-
 alternator/serialization.hh         |   2 +-
 alternator/server.cc                |   2 +-
 caching_options.hh                  |   9 +-
 cdc/log.cc                          |   4 +-
 column_computation.hh               |   5 +-
 configure.py                        |   3 +-
 cql3/functions/functions.cc         |   4 +-
 cql3/statements/update_statement.cc |  24 ++--
 cql3/type_json.cc                   | 212 ++++++++++++++++++----------
 cql3/type_json.hh                   |   7 +-
 db/legacy_schema_migrator.cc        |  12 +-
 db/schema_tables.cc                 |   1 -
 flat_mutation_reader.cc             |   1 +
 index/secondary_index.cc            |  80 +++++------
 json.cc                             |  80 -----------
 json.hh                             | 113 ---------------
 schema.cc                           |  25 ++--
 test/boost/cql_query_test.cc        |   9 +-
 test/manual/json_test.cc            |   4 +-
 test/tools/cql_repl.cc              |   1 +
 {alternator => utils}/rjson.cc      |  75 +++++++++-
 {alternator => utils}/rjson.hh      |  40 +++++-
 29 files changed, 344 insertions(+), 383 deletions(-)
 delete mode 100644 json.cc
 delete mode 100644 json.hh
 rename {alternator => utils}/rjson.cc (86%)
 rename {alternator => utils}/rjson.hh (81%)
2020-07-03 18:23:56 +02:00
Piotr Sarna
4cb79f04b0 treewide: replace libjsoncpp usage with rjson
In order to eventually switch to a single JSON library,
most of the libjsoncpp usage is dropped in favor of rjson.
Unfortunately, one usage still remains:
test/utils/test_repl utility heavily depends on the *exact textual*
format of its output JSON files, so replacing a library results
in all tests failing because of differences in formatting.
It is possible to force rjson to print its documents in the exact
matching format, but that's left for later, since the issue is not
critical. It would be nice though if our test suite compared
JSON documents with a real JSON parser, since there are more
differences - e.g. libjsoncpp keeps children of the object
sorted, while rapidjson uses an unordered data structure.
This change should cause no change in semantics, it strives
just to replace all usage of libjsoncpp with rjson.
2020-07-03 10:27:23 +02:00
Piotr Sarna
1b37517aab rjson: move quote_json_string to rjson
This utility function is used for type serialization,
but it also has a dedicated unit test, so it needs to be globally
reachable.
2020-07-03 10:27:23 +02:00
Rafael Ávila de Espíndola
6fe7706fce mutation_reader_test: Wait for a future
Nothing was waiting for this future. Found while testing another
patch.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200630183929.1704908-1-espindola@scylladb.com>
2020-07-03 08:24:41 +02:00
Rafael Ávila de Espíndola
b7f5e2e0dd big_decimal: Add more tests
It looks like an order version of my patch series was merged. The only
difference is that the new one had more tests. This patch adds the
missing ones.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200630141150.1286893-1-espindola@scylladb.com>
2020-07-03 08:24:41 +02:00
Botond Dénes
cb69406f6c test/boost/mutation_reader_test: multishard reader: add tests for fast-forwarding with read-ahead 2020-07-01 10:15:49 +03:00
Botond Dénes
d6e2033d8a test/boost/mutation_reader_test: extract multishard read-ahead test setup
Testing the multishard reader's various read-ahead related corner cases
requires a non-trivial setup. Currently there is just one such test,
but we plan to add more so in this patch we extract this setup code to a
free function to allow reuse across multiple tests.
2020-07-01 10:15:49 +03:00
Botond Dénes
851ae8c650 test/boost/mutation_reader_test: puppet_reader: fast-forward-support
A fast-forwarded puppet reader goes immediately to EOS. A counter is
added to the remote control to allow tests to check which readers were
actually fast forwarded.
2020-07-01 10:15:49 +03:00
Botond Dénes
741f0c276d mutation_reader_test: puppet_reader: make interface more predictable
Currently the puppet reader will do an automatic (half) buffer-fill in
the constructor. This makes it very hard to reason about when and how
the action that was passed to it will be executed. Refactor it to take a
list of actions and only execute those, no hidden buffer-fill anymore.
No better proof is needed for this than the fact that the test which is
supposed to test the multishard reader being destroyed with a pending
read-ahead was silently broken (not testing what it should).
This patch fixes this test too.

Also fixed in this patch is the `pending` and `destroyed` fields of the
remote control, tests can now rely on these to be correct and add
additional checkpoints to ensure the test is indeed doing what it was
intended to do.
2020-07-01 10:15:49 +03:00
Raphael S. Carvalho
cf352e7c14 sstables: optimize procedure that checks if a sstable needs cleanup
needs_cleanup() returns true if a sstable needs cleanup.

Turns out it's very slow because it iterates through all the local
ranges for all sstables in the set, making its complexity:
	O(num_sstables * local_ranges)

We can optimize it by taking into account that abstract_replication_strategy
documents that get_ranges() will return a list of ranges that is sorted
and non-overlapping. Compaction for cleanup already takes advantage of that
when checking if a given partition can be actually purged.

So needs_cleanup() can be optimized into O(num_sstables * log(local_ranges)).

With num_sstables=1000, RF=3, then local_ranges=256(num_tokens)*3, it means
the max # of checks performed will go from 768000 to ~9584.

Fixes #6730.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200629171355.45118-2-raphaelsc@scylladb.com>
2020-06-30 12:58:43 +03:00
Nadav Har'El
23ce6864a3 alternator test: ProjectionExpression test for BatchGetItem
The tests in test_projection_expression.py test that ProjectionExpression
works - including attribute paths - for the GetItem, Query and Scan
operations.

There is a fourth read operation - BatchGetItem, and it supports
ProjectionExpression too. We tested BatchGetItem + ProjectionExpression in
test_batch.py, but this only tests the basic feature, with top-level
attributes, and we were missing a test for nested document paths.

This patch adds such a test. It is still xfailing on Alternator (and passing
on DynamoDB), because attribute paths are still not supported (this is
issue #5024).

Refs #5024.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200629063244.287571-1-nyh@scylladb.com>
2020-06-29 08:51:05 +02:00
Nadav Har'El
b6fdd956bd alternator test: ProjectionExpression tests for document paths
This patch adds three more tests for the ProjectionExpression parameter
of GetItem. They are tests for nested document paths like a.b[2].c.

We don't support nested paths in Alternator yet (this is issue #5024),
so the new tests all xfail (and pass on DynamoDB).

We already had similar tests for UpdateExpression, which also needs to
support document paths, but the tests were missing for ProjectionExpression.
I am planning to start the implementation of document paths with
ProjectionExpression (which is the simplest use of document paths), so I
want the tests for this expression to be as complete as possible.

Refs #5024.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200628213208.275050-1-nyh@scylladb.com>
2020-06-29 08:50:55 +02:00
Avi Kivity
509442b128 Merge "Move snapshot code from storage_service into independent component" from Pavel E
"
The snapshotting code is already well isolated from the rest of
the storage_service, so it's relatively easy to move it into
independent component, thus de-bloating the storage_service.

As a side effect this allows painless removal of calls to global
get_storage_service() from schema::describe code.

Test: unit(debug), dtest.snapshot_test(dev), manual start-stop
"

* 'br-snapshot-controller-4' of https://github.com/xemul/scylla:
  snap: Get rid of storage_service reference in schema.cc
  main: Stop http server
  snapshot: Make check_snapshot_not_exist a method
  snapshots: Move ops gate from storage_service
  snapshot: Move lock from storage_service
  snapshot: Move all code into db::snapshot_ctl class
  storage_service: Move all snapshot code into snapshot-ctl.cc
  snapshots: Initial skeleton
  snapshots: Properly shutdown API endpoints
  api: Rewrap set_server_snapshot lambda
2020-06-28 13:17:32 +03:00
Pavel Emelyanov
f045cec586 snap: Get rid of storage_service reference in schema.cc
Now when the snapshot stopping is correctly handled, we may pull the database
reference all the way down to the schema::describe().

One tricky place is in table::napshot() -- the local db reference is pulled
through an smp::submit_to call, but thanks to the shard checks in the place
where it is needed the db is still "local"

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-26 20:28:25 +03:00
Rafael Ávila de Espíndola
85bb7ff743 big_decimal: Add a test for a corner case
This behavior is different from cassandra, but without arithmetic
operations it doesn't seem possible to notice the difference from
CQL. Using avg produces the same results, since we use an initial
value of 0 (scale = 0).

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-06-25 15:37:23 -07:00
Rafael Ávila de Espíndola
684f32c862 big_decimal: Correctly handle negative scales
A negative scale was being passed an a positive value to
boost::multiprecision::pow, which would never finish.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-06-25 15:34:10 -07:00
Avi Kivity
a9c7a1a86c Merge "repair: row_level: prevent deadlocks when repairing homogenous nodes" from Botond
"
Row level repair, when using a local reader, is prone to deadlocking on
the streaming reader concurrency semaphore. This has been observed to
happen with at least two participating nodes, running more concurrent
repairs than the maximum allowed amount of reads by the concurrency
semaphore. In this situation, it is possible that two repair instances,
competing for the last available permits on both nodes, get a permit on
one of the nodes and get queued on the other one respectively. As
neither will let go of the permit it already acquired, nor give up
waiting on the failed-to-acquired permit, a deadlock happens.

To prevent this, we make the local repair reader evictable. For this we
reuse the already existing evictable reader mechanism of the multishard
combining reader. This patchset refactors this evictable reader
mechanism into a standalone flat mutation reader, then exposes it to the
outside world.
The repair reader is paused after the repair buffer is filled, which is
currently 32MB, so the cost of a possible reader recreation is amortized
over 32MB read.

The repair reader is said to be local, when it can use the shard-local
partitioner. This is the case if the participating nodes are homogenous
(their shard configuration is identical), that is the repair instance
has to read just from one shard. A non-local reader uses the multishard
reader, which already makes its shard readers evictable and hence is not
prone to the deadlock described here.

Fixes: #6272

Tests: unit(dev, release, debug)
"

* 'repair-row-level-evictable-local-reader/v3' of https://github.com/denesb/scylla:
  repair: row_level: destroy reader on EOS or error
  repair: row_level: use evictable_reader for local reads
  mutation_reader: expose evictable_reader
  mutation_reader: evictable_reader: add auto_pause flag
  mutation_reader: make evictable_reader a flat_mutation_reader
  mutation_reader: s/inactive_shard_read/inactive_evictable_reader/
  mutation_reader: move inactive_shard_reader code up
  mutation_reader: fix indentation
  mutation_reader: shard_reader: extract remote_reader as evictable_reader
  mutation_reader: reader_lifecycle_policy: make semaphore() available early
2020-06-24 12:55:34 +03:00
Piotr Sarna
c2939c67b2 test: add a case for local altering of distributed tables
Local altering, which does not propagate the change to other nodes,
should not be allowed for a non-local table.

Refs #6700
Message-Id: <34a2b191c0e827f296e6d720dc31bf8bda0fd160.1592990796.git.sarna@scylladb.com>
2020-06-24 12:51:41 +03:00
Botond Dénes
542d9c3711 mutation_reader: expose evictable_reader
Expose functions for the outside world to create evictable readers. We
expose two functions, which create an evictable reader with
`auto_pause::yes` and `auto_pause::no` respectively. The function
creating the latter also returns a handle in addition to the reader,
which can be used to pause the reader.
2020-06-23 21:08:21 +03:00
Avi Kivity
8d67537178 Update seastar submodule
* seastar a6c8105443...7664f991b9 (13):
  > gate: add try_enter and try_with_gate
  > Merge "Manage reference counts in the file API" from Rafael
  > cmake: Refactor a bit of duplicated code
  > stream: Delete _sub
  > future: Add a rethrow_exception to future_state_base
  > future: Use a new seastar::nested_exception in finally
  > cmake: only apply C++ compile options to C++ language
  > testing: Enable fail-on-abandoned-failed-futures by default
  > future: Correct a few hypercorrect uses of std::forward
  > futures_test: Test using future::then with functions
  > Merge "io-queue: A set of cleanups collected so far" from Pavel E
  > tmp_file: Replace futurize_apply with futurize_invoke
  > future: Replace promise::set_coroutine with forward_state_and_schedule

Contains update to tests from Rafael:

tests: Update for fail-on-abandoned-failed-futures's new default

This depends on the corresponding change in seastar.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-06-23 19:39:54 +03:00
Botond Dénes
63309f925c mutation_reader: reader_lifecycle_policy: make semaphore() available early
Currently all reader lifecycle policy implementations assume that
`semaphore()` will only be called after at least one call to
`make_reader()`. This assumption will soon not hold, so make sure
`semaphore()` can be called at any time, including before any calls are
made to `make_reader()`.
2020-06-23 10:01:38 +03:00
Avi Kivity
de38091827 priority_manager: merge streaming_read and streaming_write classes into one class
Streaming is handled by just once group for CPU scheduling, so
separating it into read and write classes for I/O is artificial, and
inflates the resources we allow for streaming if both reads and writes
happen at the same time.

Merge both classes into one class ("streaming") and adjust callers. The
merged class has 200 shares, so it reduces streaming bandwidth if both
directions are active at the same time (which is rare; I think it only
happens in view building).
2020-06-22 15:09:04 +03:00
Rafael Ávila de Espíndola
a67f5b2de1 sstable_3_x_test: Call await_background_jobs on every test
Now every tests starts by deferring a call to
await_background_jobs. That can be verified with:

$ git grep -B 1 await_background test/boost/sstable_3_x_test.cc  | grep THREAD | wc -l
90
$ git grep -A 1 SEASTAR_THREAD_TEST_CASE test/boost/sstable_3_x_test.cc  | grep await_background | wc -l
90

Thanks to Raphael Carvalho for noticing it.

Refs #6624

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Reviewed-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200619220048.1091630-1-espindola@scylladb.com>
2020-06-22 14:03:13 +03:00
Raphael S. Carvalho
a82afa68aa test/lib/cql_test_env: reenable auto compaction
after e40aa042a7, auto compaction is explicitly disabled on all
tables being populated and only enabled later on in the boot
process. we forgot to update cql_test_env to also reenable
auto compaction, so unit tests based on cql_test_env were not
compacting at all.
database_test, for example, was running out of file descriptors
because the number kept growing unboundly due to lack of compaction.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200618225621.15937-1-raphaelsc@scylladb.com>
2020-06-22 14:03:13 +03:00
Avi Kivity
7351db7cab Merge "Reshape upload files and reshard+reshape at boot" from Glauber
"

This patchset adds a reshape operation to each compaction strategy;
that is a strategy-specific way of detecting if SSTables are in-strategy
or off-strategy, and in case they are offstrategy moving them to in-strategy.

Often times the number of SSTables in a particular slice of the sstable set
matters for that decision (number of SSTables in the same time window for TWCS,
number of SSTables per tier for STCS, number of L0 SSTables for LCS). We want
to be more lenient for operations that keep the node offline, like reshape at
boot, but more forgiving for operations like upload, which run in maintenance
mode. To accomodate for that the threshold for considering a slice of the SSTable
set offstrategy is passed as a parameter

Once this patchset is applied, the upload directory will reshape the SSTables
before moving them to the main directory (if needed). One side effect of it
is that it is no longer necessary to take locks for the refresh operation nor
disable writes in the table.

With the infrastructure that we have built in the upload directory, we can
apply the same set of steps to populate_column_family. Using the sstable_directory
to scan the files we can reshard and reshape (usually if we resharded a reshape
will be necessary) with the node still offline. This has the benefit of never
adding shared SSTables to the table.

Applying this patchset will unlock a host of cleanups:
- we can get rid of all testing for shared sstables, sstable_need_rewrite, etc.
- we can remove the resharding backlog tracker.

and many others. Most cleanups are deferred for a later patchset, though.
"

* 'reshard-reshape-v4' of github.com:glommer/scylla:
  distributed_loader: reshard before the node is made online
  distributed_loader: rework uploading of SSTables
  sstable_directory: add helper to reshape existing unshared sstables
  compaction_strategy: add method to reshape SSTables
  compaction: add a new compaction type, Reshape
  compaction: add a size and throught pretty printer.
  compaction: add default implementation for some pure functions
  tests: fix fragile database tests
  distributed_loader.cc: add a helper function to extract the highest SSTable version found
  distributed_loader.cc : extract highest_generation_seen code
  compaction_manager: rename run_resharding_job
  distributed_loader: assume populate_column_families is run in shard 0
  api: do not allow user to meddle with auto compaction too early
  upload: use custom error handler for upload directory
  sstable_directory: fix debug message
2020-06-18 17:04:53 +03:00
Glauber Costa
e40aa042a7 distributed_loader: reshard before the node is made online
This patch moves the resharding process to use the new
directory_with_sstables_handler infrastructure. There is no longer
a clear reshard step, and that just becomes a natural part of
populate_column_family.

In main.cc, a couple of changes are necessary to make that happen.
The first one obviously is to stop calling reshard. We also need to
make sure that:
 - The compaction manager is started much earlier, so we can register
   resharding jobs with it.
 - auto compactions are disabled in the populate method, so resharding
   doesn't have to fight for bandwidth with auto compactions.

Now that we are resharding through the sstable_directory, the old
resharding code can be deleted. There is also no need to deal with
the resharding backlog either, because the SSTables are not yet
added to the sstable set at this point.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2020-06-18 09:37:18 -04:00
Glauber Costa
96abf80c5e tests: fix fragile database tests
This test wants to make sure that an SSTable with generation number 4,
which is incomplete, gets deleted.

While that works today, the way the test verifies that is fragile
because new SSTables can and will be created, especially in the local
directory that sees a lot of activity on startup.

It works if generations don't go that far, but with SMP, even a single
SSTable in the right shard can end up having generation 4. In practice
this isn't an issue today because the code calls
cf.update_sstables_known_generation() as soon as it sees a file, before
deciding whether or not the file has to be deleted. However this
behavior is not guaranteed and is changing.

The best way to fix this would be to check if the file is the same,
including its inode. But given that this is just a unit test (which
is almost always if not always single node), I am just moving to use
the peers table instead. Again, we could have created a user table,
but it's just not worth the hassle.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2020-06-18 09:00:28 -04:00
Rafael Ávila de Espíndola
f6e407ecd2 everywhere: Prepare for seastar api v4 (when_all_succeed return value)
The seastar api v4 changes the return type of when_all_succeed. This
patch adds discard_result when that is best solution to handle the
change.

This doesn't do the actual update to v4 since there are still a few
issues left to fix in seastar. A patch doing just the update will
follow.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200617233150.918110-1-espindola@scylladb.com>
2020-06-18 15:13:56 +03:00
Amnon Heiman
bc854342e7 approx_exponential_histogram: Makes the implementation clearer
This patch aim to make the implementation and usage of the
approx_exponential_histogram clearer.

The approx_exponential_histogram Uses a combination of Min, Max,
Precision and number of buckets where the user needs to pick 3.

Most of the changes in the patch are about documenting the class and
method, but following the review there are two functionality changes:

1. The user would pick: Min, Max and Precision and the number of buckets
   will be calculated from these values.
2. The template restrictions are now state in a requires so voiolation
   will be stop at compile time.
2020-06-18 14:18:21 +03:00
Dejan Mircevski
aec1acd1d5 range_test: Add cases for singular intersection
Intersection was previously not tested for singular ranges.  This
ensures it will always work for singular ranges, too.

Tests: unit(dev)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-06-18 12:38:31 +03:00
Avi Kivity
9322c07c71 Merge "Use binary search in sstable promoted index" from Tomasz
"
The "promoted index" is how the sstable format calls the clustering key index within a given partition.
Large partitions with many rows have it. It's embedded in the partition index entry.

Currently, lookups in the promoted index are done by scanning the index linearly so the lookup
is O(N). For large partitions that's inefficient. It consumes both a lot of CPU and I/O.

We could do better and use binary search in the index. This patch series switches the mc-format
index reader to do that. Other formats use the old way.

The "mc" format promoted index has an extra structure at the end of the index called "offset map".
It's a vector of offsets of consecutive promoted index entries. This allows us to access random
entries in the index without reading the whole index.

The location of the offset entry for a given promoted index entry can be derived by knowing where
the offset vector ends in the index file, so the offset map also doesn't have to be read completely
into the memory.

The most tricky part is caching. We need to cache blocks read from the index file to amortize the
cost of binary search:

  - if the promoted index fits in the 32 KiB which was read from the index when looking for
    the partition entry, we don't want to issue any additional I/O to search the promoted index.

  - with large promoted indexes, the last few bisections will fall into the same I/O block and we
    want to reuse that block.

  - we don't want the cache to grow too big, we don't want to cache the whole promoted index
    as the read progresses over the index. Scanning reads may skip multiple times.

This series implements a rather simple approach which meets all the
above requirements and is not worse than the current state of affairs:

   - Each index cursor has its own cache of the index file area which corresponds to promoted index
     This is managed by the cached_file class.

   - Each index cursor has its own cache of parsed blocks. This allows the upper bound estimation to
     reuse information obtained during lower bound lookup. This estimation is used to limit
     read-aheads in the data file.

   - Each cursor drops entries that it walked past so that memory footprint stays O(log N)

   - Cached buffers are accounted to read's reader_permit.

Later, we could have a single cache shared by many readers. For that, we need to come up with eviction
policy.

Fixes #4007.

TESTING RESULTS

 * Point reads, large promoted index:

  Config: rows: 10000000, value size: 2000
  Partition size: 20 GB
  Index size: 7 MB

  Notes:

    - Slicing read into the middle of partition (offset=5000000, read=1) is a clear win for the binary search:

      time: 1.9ms vs 22.9ms
      CPU utilization: 8.9% vs 92.3%
      I/O: 21 reqs / 172 KiB vs 29 reqs / 3'520 KiB

      It's 12x faster, CPU utilization is 10x times smaller, disk utilization is 20x smaller.

    - Slicing at the front (offset=0) is a mixed bag.

      time is similar: 1.8ms
      CPU utilization is 6.7x smaller for bsearch: 8.5% vs 57.7%
      disk bandwidth utilization is smaller for bsearch but uses more IOs: 4 reqs / 320 KiB (scan) vs 17 reqs / 188 KiB (bsearch)

      bsearch uses less bandwidth because the series reduces buffer size used for index file I/O.

      scan is issuing:

         2 * 128 KB (index page)
         2 * 32 KB (data file)

      bsearch is issuing:

         1 * 64 KB (index page)
         15 * 4 KB (promoted index)
         1 * 64 KB (data file)

      The 1 * 64 KB is chosen dynamically by seastar. Sometimes it chooses 2 * 32 KB (with read-ahead).
      32 KB is the minimum I/O currently.

      Disk utilization could be further improved by changing the way seastar's dynamic I/O adjustments work
      so that it uses 1 * 4 KB when it suffices. This is left for the follow-up.

  Command:

        perf_fast_forward --datasets=large-part-ds1 \
         --run-tests=large-partition-slicing-clustering-keys -c1 --test-case-duration=1

  Before:

    offset  read      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
    0       1         0.001836          172         1        545          9        563        175        4.0      4        320       2       2        0        1        1        0        0        0  57.7%      0
    0       32        0.001858          502        32      17220        126      17776      11526        3.2      3        324       2       1        0        1        1        0        0        0  56.4%      0
    0       256       0.002833          339       256      90374        427      91757      85931        7.0      7        776       3       1        0        1        1        0        0        0  41.1%      0
    0       4096      0.017211           58      4096     237984       2011     241802     233870       66.1     66       8376      59       2        0        1        1        0        0        0  21.4%      0
    5000000 1         0.022952           42         1         44          1         45         41       29.2     29       3520      22       2        0        1        1        0        0        0  92.3%      0
    5000000 32        0.023052           43        32       1388         14       1414       1331       31.1     32       3588      26       2        0        1        1        0        0        0  91.7%      0
    5000000 256       0.024795           41       256      10325        129      10721       9993       43.1     39       4544      29       2        0        1        1        0        0        0  86.4%      0
    5000000 4096      0.038856           27      4096     105414        398     106918     103162       95.2     95      12160      78       5        0        1        1        0        0        0  61.4%      0

 After (v2):

    offset  read      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
    0       1         0.001831          248         1        546         21        581        252       17.6     17        188       2       0        0        1        1        0        0        0   8.5%      0
    0       32        0.001910          535        32      16751        626      17770      13896       17.9     19        160       3       0        0        1        1        0        0        0   8.8%      0
    0       256       0.003545          266       256      72207       2333      89076      62852       26.9     24        764       7       0        0        1        1        0        0        0   9.7%      0
    0       4096      0.016800           56      4096     243812        524     245430     239736       83.6     83       8700      64       0        0        1        1        0        0        0  16.6%      0
    5000000 1         0.001968          351         1        508         19        538        380       21.3     21        172       2       0        0        1        1        0        0        0   8.9%      0
    5000000 32        0.002273          431        32      14077        436      15503      11551       22.7     22        268       3       0        0        1        1        0        0        0   8.9%      0
    5000000 256       0.003889          257       256      65824       2197      81833      57813       34.0     37        652      18       0        0        1        1        0        0        0  11.2%      0
    5000000 4096      0.017115           54      4096     239324        834     241310     231993       88.3     88       8844      65       0        0        1        1        0        0        0  16.8%      0

 After (v1):

    offset  read      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
    0       1         0.001886          259         1        530          4        545        261       18.0     18        376       2       2        0        1        1        0        0        0   9.1%      0
    0       32        0.001954          513        32      16381         93      16844      15618       19.0     19        408       3       2        0        1        1        0        0        0   9.3%      0
    0       256       0.003266          318       256      78393       1820      81567      61663       30.8     26       1272       7       2        0        1        1        0        0        0  10.4%      0
    0       4096      0.017991           57      4096     227666        855     231915     225781       83.1     83       8888      55       5        0        1        1        0        0        0  15.5%      0
    5000000 1         0.002353          232         1        425          2        432        232       23.0     23        396       2       2        0        1        1        0        0        0   8.7%      0
    5000000 32        0.002573          384        32      12437         47      12571        429       25.0     25        460       4       2        0        1        1        0        0        0   8.5%      0
    5000000 256       0.003994          259       256      64101       2904      67924      51427       37.0     35       1484      11       2        0        1        1        0        0        0  10.6%      0
    5000000 4096      0.018567           56      4096     220609        448     227395     219029       89.8     89       9036      59       5        0        1        1        0        0        0  15.1%      0

 * Point reads, small promoted index (two blocks):

  Config: rows: 400, value size: 200
  Partition size: 84 KiB
  Index size: 65 B

  Notes:
     - No significant difference in time
     - the same disk utilization
     - similar CPU utilization

  Command:

      perf_fast_forward --datasets=large-part-ds1 \
         --run-tests=large-partition-slicing-clustering-keys -c1 --test-case-duration=1

  Before:

    offset  read      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
    0       1         0.000279          470         1       3587         31       3829        478        3.0      3         68       2       1        0        1        1        0        0        0  21.1%      0
    0       32        0.000276         3498        32     116038        811     122756     104033        3.0      3         68       2       1        0        1        1        0        0        0  24.0%      0
    0       256       0.000412         2554       256     621044       1778     732150     559221        2.0      2         72       2       0        0        1        1        0        0        0  32.6%      0
    0       4096      0.000510         1901       400     783883       4078     819058     665616        2.0      2         88       2       0        0        1        1        0        0        0  36.4%      0
    200     1         0.000339         2712         1       2951          8       3001       2569        2.0      2         72       2       0        0        1        1        0        0        0  17.8%      0
    200     32        0.000352         2586        32      91019        266      92427      83411        2.0      2         72       2       0        0        1        1        0        0        0  20.8%      0
    200     256       0.000458         2073       200     436503       1618     453945     385501        2.0      2         88       2       0        0        1        1        0        0        0  29.4%      0
    200     4096      0.000458         2097       200     436475       1676     458349     381558        2.0      2         88       2       0        0        1        1        0        0        0  29.0%      0

  After (v1):

    Testing slicing of large partition using clustering keys:
    offset  read      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
    0       1         0.000278          492         1       3598         30       3831        500        3.0      3         68       2       1        0        1        1        0        0        0  19.4%      0
    0       32        0.000275         3433        32     116153        753     122915      92559        3.0      3         68       2       1        0        1        1        0        0        0  22.5%      0
    0       256       0.000458         2576       256     559437       2978     728075     504375        2.1      2         88       2       0        0        1        1        0        0        0  29.0%      0
    0       4096      0.000506         1888       400     790064       3306     822360     623109        2.0      2         88       2       0        0        1        1        0        0        0  36.6%      0
    200     1         0.000382         2493         1       2619         10       2675       2268        2.0      2         88       2       0        0        1        1        0        0        0  16.3%      0
    200     32        0.000398         2393        32      80422        333      84759      22281        2.0      2         88       2       0        0        1        1        0        0        0  19.0%      0
    200     256       0.000459         2096       200     435943       1608     453989     380749        2.0      2         88       2       0        0        1        1        0        0        0  30.5%      0
    200     4096      0.000458         2097       200     436410       1651     455779     382485        2.0      2         88       2       0        0        1        1        0        0        0  29.2%      0

 * Scan with skips, large index:

  Config: rows: 10000000, value size: 2000
  Partition size: 20 GB
  Index size: 7 MB

  Notes:

    - Similar time, slightly worse for binary search: 36.1 s (scan) vs 36.4 (bsearch)

    - Slightly more I/O for bsearch: 153'932 reqs / 19'703'260 KiB (scan) vs 155'651 reqs / 19'704'088 KiB (bsearch)

      Binary search reads more by 828 KB and by 1719 IOs.
      It does more I/O to read the the promoted index offset map.

    - similar (low) memory footprint. The danger here is that by caching index blocks which we touch as we scan
      we would end up caching the whole index. But this is protected against by eviction as demonstrated by the
      last "mem" column.

  Command:

    perf_fast_forward --datasets=large-part-ds1 \
       --run-tests=large-partition-skips -c1 --test-case-duration=1

  Before:

      read    skip      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
      1       1        36.103451            4   5000000     138491         38     138601     138453   153932.0 153932   19703260  153561       1        0        1        1        0        0        0  31.5% 502690

  After (v2):

    read    skip      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
    1       1        37.000145            4   5000000     135135          6     135146     135128   155651.0 155651   19704088  138968       0        0        1        1        0        0        0  34.2%      0

  After (v1):

    read    skip      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
    1       1        36.965520            4   5000000     135261         30     135311     135231   155628.0 155628   19704216  139133       1        0        1        1        0        0        0  33.9% 248738

Also in:

  git@github.com:tgrabiec/scylla.git sstable-use-index-offset-map-v2

Tests:

  - unit (all modes)
  - manual using perf_fast_forward
"

* tag 'sstable-use-index-offset-map-v2' of github.com:tgrabiec/scylla:
  sstables: Add promoted index cache metrics
  position_in_partition: Introduce external_memory_usage()
  cached_file, sstables: Add tracing to index binary search and page cache
  sstables: Dynamically adjust I/O size for index reads
  sstables, tests: Allow disabling binary search in promoted index from perf tests
  sstables: mc: Use binary search over the promoted index
  utils: Introduce cached_file
  sstables: clustered_index: Relax scope of validity of entry_info
  sstables: index_entry: Introduce owning promoted_index_block_position
  compound_compat: Allow constructing composite from a view
  sstables: index_entry: Rename promoted_index_block_position to promoted_index_block_position_view
  sstables: mc: Extract parser for promoted index block
  sstables: mc: Extract parser for clustering out of the promoted index block parser
  sstables: consumer: Extract primitive_consumer
  sstables: Abstract the clustering index cursor behavior
  sstables: index_reader: Rearrange to reduce branching and optionals
2020-06-18 12:09:39 +03:00
Juliusz Stasiewicz
8628ede009 cdc: Fix segfault when stream ID key is too short
When a token is calculated for stream_id, we check that the key is
exactly 16 bytes long. If it's not - `minimum_token` is returned
and client receives empty result.

This used to be the expected behavior for empty keys; now it's
extended to keys of any incorrect length.

Fixes #6570
2020-06-17 18:19:37 +03:00
Nadav Har'El
095ddf0d41 alternator test: use ConsistentRead=True where missing
All tests that write some data and then read it back need to use
ConsistentRead=True, otherwise the test may sporadically fail on a multi-
node cluster.

In the previous patch we fixed the full_query()/full_scan() convenience
functions. In this patch, I audited the calls to the boto3 read methods -
get_item(), batch_get_item(), query(), scan(), and although most of them
did use ConsistentRead=True as needed, I found some missing and this patch
fixes them.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200616080334.825893-1-nyh@scylladb.com>
2020-06-17 14:57:45 +02:00
Nadav Har'El
c298088375 alternator test: use ConsistentRead=True for full_query/scan
Many of the Alternator tests use the convenience functions full_query()/
full_scan() to read from the table. Almost all these tests need to be able
to read their own writes, i.e., want ConsistentRead=True, but none of them
explicitly specified this parameter. Such tests may sporadically fail when
running on cluster with multiple nodes.

So this patch follows a TODO in the code, and makes ConsistentRead=True
the default for the full_*() functions. The caller can still override it
with ConsistentRead=False - and this is necessary in the GSI tests, because
ConsistentRead=True is not allowed in GSIs.

Note that while ConsistentRead=True is now the default for the full_*()
convenience functions, but it is still not the default for the lower level
boto3 functions scan(), query() and get_item() - so usages of those should
be evaluated as well and missing ConsistentRead=True, if any, should be
added.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200616073821.824784-1-nyh@scylladb.com>
2020-06-17 14:57:45 +02:00
Tomasz Grabiec
19501d9ef2 sstables, tests: Allow disabling binary search in promoted index from perf tests 2020-06-16 16:15:23 +02:00