Commit Graph

1482 Commits

Author SHA1 Message Date
Pavel Emelyanov
49c5d5b7e8 Merge 'lister: add directory_lister' from Benny Halevy
directory_lister provides a simpler interface compared to lister.

After creating the directory_lister,
its async get() method should be called repeatedly,
returning a std::optional<directory_entry> each call,
until it returns a disengaged entry or an error.

This is especially suitable for coroutines
as demonstrated in the unit tests that were added.

For example:
```c++
        auto dl = directory_lister(path);
        while (auto de = co_await dl.get()) {
            co_await process(*de);
        }
```

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #9835

* github.com:scylladb/scylla:
  sstable_directory: process_sstable_dir: use directory_lister
  sstable_directory: process_sstable_dir: fixup indentation
  sstable_directory: coroutinize process_sstable_dir
  lister: add directory_lister
2022-02-21 12:24:28 +03:00
Nadav Har'El
d3ac9a5790 Merge 'cql3: expr: Fix expr::visit so that it works with references' from Jan Ciołek
There is a bug in `expr::visit`. When trying to return a reference from a visitor, it actually returns a reference to some temporary location.
So trying to do something like:
```c++
const expression e = new_bind_variable(123);

const bind_variable& ref = visit(overloaded_functor {
    [](const bind_variable& bv) -> const bind_variable& { return bv; },
    [](const auto&) -> const bind_variable& { throw std::runtime_error("Unreachable"); }
}, e);

std::cout << ref << std::endl;
```
would actually print a random stack location instead of the value inside `e`.
Additionally, trying to return a non-const reference doesn't compile.

The current implementation of `expr::visit` is:
```c++
auto visit(invocable_on_expression auto&& visitor, const expression& e) {
    return std::visit(visitor, e._v->v);
}
```

For reference, `std::visit` looks like this:
```c++
template<typename _Res, typename _Visitor, typename... _Variants>
constexpr _Res
visit(_Visitor&& __visitor, _Variants&&... __variants)
{
  return std::__do_visit<_Res>(std::forward<_Visitor>(__visitor),
                               std::forward<_Variants>(__variants)...);
}
```

The problem is that `auto` can deduce to `int` or `float`, but never to `int&`.
It has now been changed to `decltype(auto)`, which can deduce reference types.
I also added a missing `std::forward` on the visitor argument.
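
For illustration, here is a minimal standalone example (not Scylla code) of the difference in deduction:
```c++
// `auto` deduces a value type and returns a copy; `decltype(auto)` applied
// to a parenthesized id-expression deduces a reference.
int global = 42;

auto get_copy() { return (global); }          // deduces int: returns a copy
decltype(auto) get_ref() { return (global); } // deduces int&: returns a reference

int main() {
    get_ref() = 7;     // writes through the returned reference
    // get_copy() = 7; // would not compile: cannot assign to a prvalue int
    return global == 7 ? 0 : 1; // exits 0: global was modified via get_ref()
}
```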

The new version looks like this:
```c++
template <invocable_on_expression Visitor>
decltype(auto) visit(Visitor&& visitor, const expression& e) {
    return std::visit(std::forward<Visitor>(visitor), e._v->v);
}
```

I added some tests of `expr::visit` in `boost/expr_test`, but sadly they are not as thorough as they could be. Ideally, I would return a reference from both `std::visit` and `expr::visit` and then check that they point to the same address in memory.
I can't do this because it would require access to a private field of `expression`.
Some tests pass before the fix, even though they shouldn't, but I'm not sure how to make them better without making the field of `expression` public.

I played around with some code; it can be found here: https://github.com/cvybhu/attached-files/blob/main/visit/visit_playground.cpp

Closes #10073

* github.com:scylladb/scylla:
  cql3: expr: Add a test to show that std::forward is needed in expr::visit
  cql3: expr: add std::forward in expr::visit
  cql3: expr: Add tests for expr::visit
  cql3: expr: Fix expr::visit so that it works with references
2022-02-20 12:09:57 +02:00
Jan Ciolek
353ab8f438 cql3: expr: Add a test to show that std::forward is needed in expr::visit
Adds a test with a visitor that can only be used as an rvalue.
Without std::forward in expr::visit, this test doesn't compile.
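
A minimal standalone sketch (not the actual test) of an rvalue-only visitor, and why the forward matters:
```c++
#include <utility>
#include <variant>

// The call operator is rvalue-ref-qualified, so it can only be invoked
// on an rvalue visitor - mimicking the visitor used by the test.
struct rvalue_only_visitor {
    int operator()(int x) && { return x; }
};

template <typename Visitor, typename Variant>
decltype(auto) visit_fwd(Visitor&& visitor, Variant&& v) {
    // Without std::forward, `visitor` is an lvalue inside this function
    // body and the &&-qualified operator() could not be called.
    return std::visit(std::forward<Visitor>(visitor), std::forward<Variant>(v));
}

int main() {
    std::variant<int> v{5};
    return visit_fwd(rvalue_only_visitor{}, v) == 5 ? 0 : 1;
}
```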

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-02-18 14:19:49 +01:00
Jan Ciolek
46367eec55 cql3: expr: Add tests for expr::visit
Add tests for new expr::visit to ensure that it is working correctly.

expr::visit had a hidden bug where trying to return a reference
actually returned a reference to freed location on the stack,
so now there are tests to ensure that everything works.

Sadly, the test `expr_visit_const_ref` also passes
before the fix, but at least `expr_visit_ref` doesn't compile before the fix.
It would be better to test this by taking references returned
by std::visit and expr::visit and checking that they point
to the same address in memory, but I can't do this
because I would have to access private field of expression.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2022-02-18 14:16:55 +01:00
Botond Dénes
96082631c8 tools/schema_loader: auto-create the keyspace for all statements
Currently, the keyspace is only auto-created for CREATE TYPE statements.
However the keyspace is needed even without UDTs being involved: for
example if the table contains a collection type. So auto-create the
keyspace unconditionally before preparing the first statement.

Also add a test-case with a create table statement which requires the
keyspace to be present at prepare time.
2022-02-17 15:24:24 +02:00
Botond Dénes
948bc359c2 Merge "ME sstable format support" from Michael Livshin
"
This series implements support for the ME sstable format (introduced
in C* 3.11.11).

Tests: unit(dev)
"

* tag 'me-sstable-format-v5' of https://github.com/cmm/scylla:
  sstables: validate originating host id
  sstable: add is_uploaded() predicate
  config: make the ME sstable format default
  scylla-gdb.py: recognize ME sstables
  sstables: store originating host id in stats metadata
  system_keyspace: cache local host id before flushing
  database_test: ensure host id continuity
  sstables_manager: add get_local_host_id() method and support
  sstables_manager: formalize inheritability
  system_keyspace, main: load (or create) local host id earlier
  sstable_3_x_test: test ME sstable format too
  add "ME_SSTABLE" cluster feature
  add "sstable_format" config
  add support for the ME sstable format
  scylla-sstable: add ability to dump optionals and utils::UUID
  sstables: add ability to write and parse optionals
  globalize sstables::write(..., utils::UUID)
2022-02-16 18:28:16 +02:00
Michael Livshin
3bf1e137fc config: make the ME sstable format default
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-16 18:21:24 +02:00
Michael Livshin
d8cc535297 database_test: ensure host id continuity
The "populate_from_quarantine_works" test case creates sstables with
one db config, then reads them with another.  Ensure that both configs
have the same host id so the sstables pass validation.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-16 18:21:24 +02:00
Michael Livshin
3fef604075 sstables_manager: add get_local_host_id() method and support
Since the ME sstable format includes the originating host id in stats
metadata, the local host id needs to be made available for writing and
validation.

Both the Scylla server (where the local host id comes from the `system.local`
table) and unit tests (where it is fabricated) must be accommodated.
Regardless of how the host id is obtained, it is stored in the db
config instance and accessed through `sstables_manager`.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-16 18:21:24 +02:00
Michael Livshin
387c882dc7 sstable_3_x_test: test ME sstable format too
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-16 18:21:24 +02:00
Michael Livshin
0b1447c702 add "sstable_format" config
Initialize it to "md" until ME format support is
complete (i.e. storing originating host id in sstable stats metadata
is implemented), so at present there is no observable change by
default.

Also declare "enable_sstables_md_format" unused -- the idea, going
forward, being that only "sstable_format" controls the written sstable
file format and that no more per-format enablement config options
shall be added.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-16 18:21:24 +02:00
Michael Livshin
c96708d262 add support for the ME sstable format
The ME format has been introduced in Cassandra 3.11.11:

11952fae77/src/java/org/apache/cassandra/io/sstable/format/big/BigFormat.java (L123)
d84c6e9810

It adds originating host id to sstable metadata in support of fixing
loss of commit log data when moving sstables between nodes:

https://issues.apache.org/jira/browse/CASSANDRA-16619

In Scylla:

* The supported way to ingest sstables is via upload/, where stored
  commit log replay position should be disregarded (but see
  https://github.com/scylladb/scylla/issues/10080).

* A later commit in this series implements originating host id
  validation for native ME sstables.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-16 18:21:24 +02:00
Benny Halevy
b7b0c19fdc test: uuid: cement the assumption that default and null uuid are equal
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220216081623.830627-2-bhalevy@scylladb.com>
2022-02-16 10:19:47 +02:00
Piotr Dulikowski
742f2abfd8 exception_container: do not throw in accept
This commit changes the behavior of `exception_container::accept`. Now,
instead of throwing an `utils::bad_exception_container_access` exception
when the container is empty, the provided visitor is invoked with that
exception instead. There are two reasons for this change:

- The exception_container is supposed to allow handling exceptions
  without using the costly C++ exception runtime. Although an empty
  container is an edge case, I think the new behavior is more aligned
  with the class's purpose. The old behavior can be simulated by
  providing a visitor which throws when called with the bad access
  exception.

- The new behavior fixes a bug in `result_try`/`result_futurize_try`.
  Before the change, if the `try` block returned a failed result with an
  empty exception container, a bad access exception would either be
  thrown or returned as an exceptional future without being handled by
  the `catch` clauses. Although nobody is supposed to return such
  result<>s on purpose, a moved out result can be returned by accident
  and it's important for the exception handling logic to be correct in
  such a situation.

Tests: unit(dev)

Closes #10086
2022-02-16 10:06:10 +02:00
Nadav Har'El
7be3129458 cdc: don't need current keyspace to create the log table
CDC registers to the table-creation hook (before_create_column_family)
to add a second table - the CDC log table - to the same keyspace.
The handler function (on_before_update_column_family() in cdc/log.cc)
wants to retrieve the keyspace's definition, but that does NOT WORK if
we create the keyspace and table in one operation (which is exactly what
we intend to do in Alternator to solve issue #9868) - because at the
time of the hook, the keyspace does not yet exist in the schema.

It turns out that on_before_update_column_family() does not REALLY need
the keyspace. It needed it to pass it on to make_create_table_mutations()
but that function doesn't use the keyspace parameter passed to it! All
it needs is the keyspace's name - which is in the schema anyway and
doesn't need to be looked up.

So in this patch we fix make_create_table_mutations() to not require the
unused keyspace parameter - and fix the CDC code not to look for the
keyspace that is no longer needed.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220215162342.622509-1-nyh@scylladb.com>
2022-02-16 08:38:56 +02:00
Benny Halevy
69fcc053bb utils: uuid: add null_uuid
and a respective bool predicate and operator,
and a unit test.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220215113438.473400-1-bhalevy@scylladb.com>
2022-02-15 18:02:54 +02:00
Avi Kivity
7cc43f8aa8 Merge 'utils: add result_try and result_futurize_try' from Piotr Dulikowski
Adds `utils::result_try` and `utils::result_futurize_try` - functions which allow converting existing try..catch blocks into a version which handles C++ exceptions, failed results with exception containers, and, depending on the function variant, exceptional futures, all using the same exception handling logic.

For example, you can convert the following try..catch block:

    try {
        return a_function_that_may_throw();
    } catch (const my_exception& ex) {
        return 123;
    } catch (...) {
        throw;
    }

...to this:

    return utils::result_try([&] {
        return a_function_that_may_throw_or_return_a_failed_result();
    },  utils::result_catch<my_exception>([&] (const my_exception&) {
        return 123;
    }), utils::result_catch_dots([&] (auto&& handle) {
        return handle.into_result();
    });

Similarly, `utils::result_futurize_try` can be used to migrate `then_wrapped` or `f.handle_exception()` constructs.

As an example of the usability of the new constructs, two places in the current code which need to simultaneously handle exceptions and failed results are converted to use `result_try` and `result_futurize_try`.

Results of `perf_simple_query --smp 1 --operations-per-shard 1000000 --write`:

```
127041.61 tps ( 67.2 allocs/op,  14.2 tasks/op,   52422 insns/op)
126958.60 tps ( 67.2 allocs/op,  14.2 tasks/op,   52409 insns/op)
127088.37 tps ( 67.2 allocs/op,  14.2 tasks/op,   52411 insns/op)
127560.84 tps ( 67.2 allocs/op,  14.2 tasks/op,   52424 insns/op)
127826.61 tps ( 67.2 allocs/op,  14.2 tasks/op,   52406 insns/op)

126801.02 tps ( 67.2 allocs/op,  14.2 tasks/op,   52420 insns/op)
125371.51 tps ( 67.2 allocs/op,  14.2 tasks/op,   52425 insns/op)
126498.51 tps ( 67.2 allocs/op,  14.2 tasks/op,   52427 insns/op)
126359.41 tps ( 67.2 allocs/op,  14.2 tasks/op,   52423 insns/op)
126298.27 tps ( 67.2 allocs/op,  14.2 tasks/op,   52410 insns/op)
```

The number of tasks and allocations is unchanged. The number of instructions per operation seems similar; it may have increased slightly (by 10-20), but it's hard to tell for sure because of the noisiness of the results.

Tests: unit(dev)

Closes #10045

* github.com:scylladb/scylla:
  transport: use result_try in process_request_one
  storage_proxy: use result_futurize_try in mutate_end
  storage_proxy: temporarily throw exception from result in mutate_end
  utils: add result_try and result_futurize_try
2022-02-13 19:38:13 +02:00
Avi Kivity
6572b297a2 treewide: clean up stray license blurbs
After the mechanical change in fcb8d040e8
("treewide: use Software Package Data Exchange (SPDX) license identifiers"),
a few stray license blurbs or fragments thereof remain. In two cases
these were extra blurbs in code generators intended for the generated code,
in others they were just missed by the script.

Clean them up, adding an SPDX license identifier where needed.

Closes #10072
2022-02-13 14:16:16 +02:00
Piotr Dulikowski
dd3284ec38 utils/result: optimize result_parallel_for_each
It now resembles the original parallel_for_each more, but uses a
coroutine instead of a custom `task` to collect not-ready futures.
Although the usage of a coroutine saves on allocations, the drawback is
that there is currently no way to co_await on a future and handle its
exception without throwing or without unconditionally allocating a
then_wrapped or handle_exception continuation - so it introduces a
rethrow.

Furthermore, failed results and exceptions are now treated equally.
Previously, if one parallel invocation returned a failed result and
another returned an exception, the exception would always be returned.
Now, the failed result/exception of the invocation with the lowest index
is always preferred, regardless of the failure type.
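
A seastar-free toy model of those semantics (an illustration, not the actual implementation):
```c++
#include <optional>
#include <string>
#include <vector>

// Each parallel invocation either succeeds or fails; the failure kind
// (failed result vs. exception) no longer matters for the outcome.
struct invocation_outcome {
    std::optional<std::string> failure; // disengaged means success
};

std::optional<std::string> first_failure(const std::vector<invocation_outcome>& outcomes) {
    for (const auto& o : outcomes) { // scan in invocation (index) order
        if (o.failure) {
            return o.failure;        // lowest-index failure wins
        }
    }
    return std::nullopt;             // all invocations succeeded
}
```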

The reimplementation manages to save about 350-400 instructions, one
task and one allocation in the perf_simple_query benchmark in write
mode.

Results from `perf_simple_query --smp 1 --operations-per-shard 1000000
--write` (before vs. after):

```
126872.54 tps ( 67.2 allocs/op,  14.2 tasks/op,   52404 insns/op)
126532.13 tps ( 67.2 allocs/op,  14.2 tasks/op,   52408 insns/op)
126864.99 tps ( 67.2 allocs/op,  14.2 tasks/op,   52428 insns/op)
127073.10 tps ( 67.2 allocs/op,  14.2 tasks/op,   52404 insns/op)
126895.85 tps ( 67.2 allocs/op,  14.2 tasks/op,   52411 insns/op)

127894.02 tps ( 66.2 allocs/op,  13.2 tasks/op,   52036 insns/op)
127671.51 tps ( 66.2 allocs/op,  13.2 tasks/op,   52042 insns/op)
127541.42 tps ( 66.2 allocs/op,  13.2 tasks/op,   52044 insns/op)
127409.10 tps ( 66.2 allocs/op,  13.2 tasks/op,   52052 insns/op)
127831.30 tps ( 66.2 allocs/op,  13.2 tasks/op,   52043 insns/op)
```

Test: unit(dev), unit(result_utils_test, debug)
2022-02-10 18:19:08 +01:00
Piotr Dulikowski
6abeec6299 utils/result: split into combinators and loop file
Segregates result utilities into:

- result.hh - basic definitions related to results with exception
  containers,
- result_combinators.hh - combinators for working with results in
  conjunction with futures,
- result_loop.hh - loop-like combinators, currently has only
  result_parallel_for_each.

The motivation for the split is:

1. In headers, usually only result.hh will be needed, so no need to
   force most .cc files to compile definitions from other files,
2. Fewer files need to be recompiled when a combinator is added to
   result_combinators or result_loop.

As a bonus, `result_with_exception` was moved from `utils::internal` to
just `utils`.
2022-02-10 18:19:05 +01:00
Piotr Dulikowski
8d52ceca50 utils: add result_try and result_futurize_try
Adds result_try and result_futurize_try - functions which allow
converting existing try..catch blocks into a version which handles C++
exceptions, failed results with exception containers and, depending on
the function variant, exceptional futures.
2022-02-10 17:35:32 +01:00
Benny Halevy
207174c692 lister: add directory_lister
directory_lister provides a simpler interface
compared to lister.

After creating the directory_lister,
its async get() method should be called repeatedly,
returning a std::optional<directory_entry> each call,
until it returns a disengaged entry or an error.

This is especially suitable for coroutines
as demonstrated in the unit tests that were added.

For example:

    auto dl = directory_lister(path);
    while (auto de = co_await dl.get()) {
        co_await process(*de);
    }

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-02-10 11:41:50 +02:00
Avi Kivity
5099b1e272 Merge 'Propagate coordinator timeouts for regular writes and batches without throwing' from Piotr Dulikowski
Currently, most of the failures that occur during CQL reads or writes are reported using C++ exceptions. Although the seastar framework avoids most of the cost of unwinding by keeping exceptions in futures as `std::exception_ptr`s, the exceptions need to be inspected at various points for the purposes of accounting metrics or converting them to a CQL error response. Analyzing the value and type of an exception held by a `std::exception_ptr` cannot be done without rethrowing the exception, and that can be very costly even if the exception is immediately caught. Because of that, exceptions are not a good fit for reporting failures which happen frequently during overload, especially if the CPU is the bottleneck.

This PR introduces facilities for reporting exceptions as values using the boost::outcome library. As a first step, the need to use exceptions for reporting timeouts was eliminated for regular and batch writes, and no exceptions are thrown between creation of a `mutation_write_timeout_exception` and its serialization as a CQL response in the `cql_server`.

The types and helpers introduced here can be reused in order to migrate more exceptions and exception paths in a similar fashion.

Results of `perf_simple_query --smp 1 --operations-per-shard 1000000`:

    Master (00a9326ae7)
    128789.53 tps ( 82.2 allocs/op,  12.2 tasks/op,   49245 insns/op)

    This PR
    127072.93 tps ( 82.2 allocs/op,  12.2 tasks/op,   49356 insns/op)

The new version seems to be slower by about 100 insns/op, fortunately not by much (about 0.2%).

Tests: unit(dev), unit(result_utils_test, debug)

Closes #10014

* github.com:scylladb/scylla:
  cql_test_env: optimize handling result_message::exception
  transport/server: handle exceptions from coordinator_result without throwing
  transport/server: propagate coordinator_result to the error handling code
  transport/server: unwrap the exception result_message in process_xyz_internal
  query_processor: add exception-returning variants of execute_ methods
  modification_statement: propagate failed result through result_message::exception
  batch_statement: propagate failed result through result_message::exception
  cql_statement: add `execute_without_checking_exception_message`
  result_message: add result_message::exception
  storage_proxy: change mutate_with_triggers to return future<result<>>
  storage_proxy: add mutate_atomically_result
  storage_proxy: return result<> from mutate_result
  storage_proxy: return result<> from mutate_internal
  storage_proxy: properly propagate future from mutate_begin to mutate_end
  storage_proxy: handle exceptions as values in mutate_end
  storage_proxy: let mutate_end take a future<result<>>
  storage_proxy: resultify mutate_begin
  storage_proxy: use result in the _ready future of write handlers
  storage_proxy: introduce helpers for dealing with results
  exceptions: add coordinator_exception_container and coordinator_result
  utils: add result utils
  utils: add exception_container
2022-02-08 14:27:09 +02:00
Piotr Dulikowski
e4ff22b4ca result_message: add result_message::exception
In order to propagate exceptions as values through the CQL layer with
minimal modifications to the interfaces, a new result_message type is
introduced: result_message::exception. Similarly to
result_message::bounce_to_shard, this is an internal type which is
supposed to be handled before being returned to the client.
2022-02-08 11:08:42 +01:00
Piotr Dulikowski
11cb670881 utils: add result utils
Adds a number of utilities for working with boost::outcome::result
combined with exception_container. The utilities are meant to help with
migration of the existing code to use the boost::outcome::result:

- `exception_container_throw_policy` - a NoValuePolicy meant to be used
  as a template parameter for the boost::outcome::result. It protects
  the caller of `result::value()` and `result::error()` methods - if the
  caller wishes to get a value but the result has an error
  (exception_container in our case), the exception in the container will
  be thrown instead. In case it's the other way around,
  boost::outcome::bad_result_access is thrown.
- `result_parallel_for_each` - a version of `parallel_for_each` which is
  aware of results and returns a failed result in case any of the
  parallel invocations return a failed result.
- `result_into_future` - converts a result into a future. If the result
  holds a value, converts it into make_ready_future; if it holds an
  exception, the exception is returned as make_exception_future.
- `then_ok_result` takes a `future<T>` and converts it into
  a `future<result<T>>`.
- `result_wrap` adapts a callable of type `T -> future<result<T>>` and
  returns a callable of type `result<T> -> future<result<T>>` (see the
  toy sketch after this list).
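
A seastar-free toy of the `result_wrap` idea, with C++23 `std::expected` standing in for `boost::outcome::result` (an assumption made for illustration; the real combinator also deals in futures):
```c++
#include <expected>
#include <string>

using result_int = std::expected<int, std::string>;

// Lift `f : int -> result_int` into `result_int -> result_int`:
// a failed input short-circuits past f.
template <typename F>
auto result_wrap(F f) {
    return [f = std::move(f)] (result_int r) -> result_int {
        if (!r) {
            return std::unexpected(r.error()); // propagate the failure untouched
        }
        return f(*r);                          // success: call the wrapped function
    };
}
```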
2022-02-08 11:08:42 +01:00
Nadav Har'El
cc57ac8c1c cql3: add a cql3::util::quote() function
The function cql3::util::maybe_quote() is used throughout Scylla to
convert identifier names (column names, table names, etc.) into strings
that can be embedded in CQL commands. maybe_quote() sometimes needs to
quote these identifier names, but when the identifier name is lowercase,
and not a CQL keyword, it is not quoted.

Not quoting identifier names when not needed is nice and pretty, but has
a forward-compatibility problem: If some CQL command with an unquoted
identifier is saved somewhere, and a new version of Scylla adds this
identifier as a new reserved keyword - the CQL command will break.

So this patch introduces a new function, cql3::util::quote(), which
unconditionally quotes the given identifier.
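
A rough standalone sketch of what unconditional quoting entails (illustrative only, not the Scylla implementation; CQL escapes an embedded double quote by doubling it):
```c++
#include <string>

std::string quote(const std::string& identifier) {
    std::string out = "\"";
    for (char c : identifier) {
        if (c == '"') {
            out += '"'; // escape embedded quotes by doubling them
        }
        out += c;
    }
    out += '"';
    return out; // e.g. quote("to") yields "to" in double quotes, even for lowercase names
}
```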

The new function is not yet used in Scylla, but we add a unit test
(based on the test of maybe_quote()) to confirm it behaves correctly.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220118161217.231811-2-nyh@scylladb.com>
2022-02-07 11:33:57 +02:00
Nadav Har'El
5d2f694a90 cql3: fix cql3::util::maybe_quote() for keywords
cql3::util::maybe_quote() is a utility function formatting an identifier
name (table name, column name, etc.) that needs to be embedded in a CQL
statement - and might require quoting if it contains non-alphanumeric
characters, uppercase characters, or a CQL keyword.

maybe_quote() made an effort to only quote the identifier name if necessary,
e.g., a lowercase name usually does not need quoting. But lowercase names
that are CQL keywords - e.g., "to" or "where" - cannot be used as identifiers
without quoting. This can cause problems for code that wants to generate
CQL statements, such as the materialized-view problem in issue #9450 - where
a user had a column called "to" and wanted to create a materialized view
for it.

So in this patch we fix maybe_quote() to recognize invalid identifiers by
using the CQL parser, and quote them. This will quote reserved keywords,
but not so-called unreserved keywords, which *are* allowed as identifiers
and don't need quoting. This addition slows down maybe_quote(), but
maybe_quote() is anyway only used in heavy operations which need to
generate CQL.

This patch also adds two tests that reproduce the bug and verify its
fix:

1. Add to the low-level maybe_quote() test (a C++ unit test) also tests
   that maybe_quote() quotes reserved keywords like "to", but doesn't
   quote unreserved keywords like "int".

2. Add a test reproducing issue #9450 - creating a materialized view
   whose key column is a keyword. This new test passes on Cassandra,
   failed on Scylla before this patch, and passes after this patch.

It is worth noting that maybe_quote() now has a "forward compatibility"
problem: If we save CQL statements generated by maybe_quote(), and a
future version introduces a new reserved keyword, the parser of the
future version may not be able to parse the saved CQL statement that
was generated with the old maybe_quote() and didn't quote what is now
a keyword. This problem can be solved in two ways:

1. Try hard not to introduce new reserved keywords. Instead, introduce
   unreserved keywords. We've been doing this even before recognizing
   this maybe_quote() future-compatibility problem.

2. In the next patch we will introduce quote() - which unconditionally
   quotes identifier names, even if lowercase. These quoted names will
   be uglier for lowercase names - but will be safe from future
   introduction of new keywords. So we can consider switching some or
   all uses of maybe_quote() to quote().

Fixes #9450

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220118161217.231811-1-nyh@scylladb.com>
2022-02-07 11:33:56 +02:00
Piotr Dulikowski
80f6224959 utils: add exception_container
Adds `exception_container` - a helper type used to hold exceptions as
values, without involving std::exception_ptr.

The motivation behind this type is that it allows inspecting an exception's
type and value without having to rethrow that exception and catch it,
unlike std::exception_ptr. In our current codebase, some exception
handling paths need to rethrow the exception multiple times in order to
account for it in metrics or encode it as an error response to the CQL
client. Some types of exceptions can be thrown very frequently in case
of overload (e.g. timeouts) and inspecting those exceptions with
rethrows can make the overload even worse. For those kinds of exceptions
it is important to handle them as cheaply as possible, and
exception_container used in conjunction with boost::outcome::result
can help achieve that.
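
A minimal sketch of the idea (an assumed shape for illustration; the actual `exception_container` differs): a variant holds the exception by value, so it can be inspected by visitation instead of by rethrow:
```c++
#include <utility>
#include <variant>

template <typename... Exceptions>
class exception_container {
    std::variant<std::monostate, Exceptions...> _ex; // monostate = empty
public:
    exception_container() = default;
    template <typename Ex>
    explicit exception_container(Ex ex) : _ex(std::move(ex)) {}

    bool empty() const { return std::holds_alternative<std::monostate>(_ex); }

    // Inspect the held exception without rethrowing it; the visitor must
    // also handle std::monostate (the empty case).
    template <typename Visitor>
    decltype(auto) accept(Visitor&& visitor) const {
        return std::visit(std::forward<Visitor>(visitor), _ex);
    }
};
```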
2022-02-04 20:18:00 +01:00
Avi Kivity
fe65122ccd Merge 'Distribute select count(*) queries' from Michał Sala
This pull request speeds up execution of `count(*)` queries. It does so by splitting given query into sub-queries and distributing them across some group of nodes for parallel execution.

A new level of coordination was added. A node called the super-coordinator splits an aggregation query into sub-queries and distributes them across some group of coordinators. The super-coordinator is also responsible for merging results.

To develop a mechanism for speeding up `count(*)` queries, there was a need to detect which queries have a `count(*)` selector. Since this pull request is a proof of concept, detection is implemented rather crudely: it only catches the simplest cases of `count(*)` queries (a single selector with no column name specified).

After a query is detected as a `count(*)`, it is split into sub-queries which are sent to other coordinators. The splitting part wasn't that difficult; it was achieved by limiting the original query's partition ranges. Sending the modified query to another node was much harder. The easiest scenario would be to send the whole `cql3::statements::select_statement`. Unfortunately `cql3::statements::select_statement` can't be [de]serialized, so sending it was out of the question. Even more unfortunately, some non-[de]serializable members of `cql3::statements::select_statement` are required to start the execution process of this statement. Finally, I decided to send a `query::read_command` paired with the required [de]serializable members. Objects that cannot be [de]serialized (such as the query's selector) are mocked on the receiving end.

When a super-coordinator receives a `count(*)` query, it splits it into sub-queries. It does so by splitting the original query's partition ranges into a list of vnodes, grouping them by their owner, and creating sub-queries with partition ranges set to successive results of that grouping. After creation, each sub-query is sent to the owner of its partition ranges. The owner dispatches the received sub-query to all of its shards. Shards slice the partition ranges of the received sub-query so that they only query data owned by them. Each shard becomes a coordinator and executes its prepared sub-query.
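
A simplified sketch of the splitting step (toy types for illustration; the real code operates on dht token ranges and node addresses):
```c++
#include <map>
#include <string>
#include <utility>
#include <vector>

struct token_range { long start, end; };

// Group vnodes by their owning node; each group becomes one sub-query
// whose partition ranges are the ranges owned by that node.
std::map<std::string, std::vector<token_range>>
group_vnodes_by_owner(const std::vector<std::pair<std::string, token_range>>& vnodes) {
    std::map<std::string, std::vector<token_range>> sub_queries;
    for (const auto& [owner, range] : vnodes) {
        sub_queries[owner].push_back(range);
    }
    return sub_queries;
}
```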

A 3-node cluster was set up on powerful desktops located in the office (3x32 cores).
The cluster was filled with ~2 * 10^8 rows using scylla-bench, and we ran:
```
time cqlsh <ip> <port> --request-timeout=3600 -e "select count(*) from scylla_bench.test using timeout 1h;"
```

* master: 68s
* this branch: 2s

A 3-node cluster (each node had 2 shards, `murmur3_ignore_msb_bits` was set to 1, `num_tokens` was set to 3):

```
>  cqlsh -e 'tracing on; select count(*) from ks.t;'
Now Tracing is enabled

 count
-------
  1000

(1 rows)

Tracing session: e5852020-7fc3-11ec-8600-4c4c210dd657

 activity                                                                                                                                    | timestamp                  | source    | source_elapsed | client
---------------------------------------------------------------------------------------------------------------------------------------------+----------------------------+-----------+----------------+-----------
                                                                                                                          Execute CQL3 query | 2022-01-27 22:53:08.770000 | 127.0.0.1 |              0 | 127.0.0.1
                                                                                                               Parsing a statement [shard 1] | 2022-01-27 22:53:08.770451 | 127.0.0.1 |             -- | 127.0.0.1
                                                                                                            Processing a statement [shard 1] | 2022-01-27 22:53:08.770487 | 127.0.0.1 |             36 | 127.0.0.1
                                                                                        Dispatching forward_request to 3 endpoints [shard 1] | 2022-01-27 22:53:08.770509 | 127.0.0.1 |             58 | 127.0.0.1
                                                                                            Sending forward_request to 127.0.0.1:0 [shard 1] | 2022-01-27 22:53:08.770516 | 127.0.0.1 |             64 | 127.0.0.1
                                                                                                         Executing forward_request [shard 1] | 2022-01-27 22:53:08.770519 | 127.0.0.1 |             -- | 127.0.0.1
                                                                                                       read_data: querying locally [shard 1] | 2022-01-27 22:53:08.770528 | 127.0.0.1 |              9 | 127.0.0.1
                                             Start querying token range ({-4242912715832118944, end}, {-4075408479358018994, end}] [shard 1] | 2022-01-27 22:53:08.770531 | 127.0.0.1 |             12 | 127.0.0.1
                                                                                                 Creating shard reader on shard: 1 [shard 1] | 2022-01-27 22:53:08.770537 | 127.0.0.1 |             18 | 127.0.0.1
                      Scanning cache for range ({-4242912715832118944, end}, {-4075408479358018994, end}] and slice {(-inf, +inf)} [shard 1] | 2022-01-27 22:53:08.770541 | 127.0.0.1 |             22 | 127.0.0.1
    Page stats: 12 partition(s), 0 static row(s) (0 live, 0 dead), 12 clustering row(s) (12 live, 0 dead) and 0 range tombstone(s) [shard 1] | 2022-01-27 22:53:08.770589 | 127.0.0.1 |             70 | 127.0.0.1
                                                                                            Sending forward_request to 127.0.0.2:0 [shard 1] | 2022-01-27 22:53:08.770600 | 127.0.0.1 |            149 | 127.0.0.1
                                                                                            Sending forward_request to 127.0.0.3:0 [shard 1] | 2022-01-27 22:53:08.770608 | 127.0.0.1 |            157 | 127.0.0.1
                                                                                                         Executing forward_request [shard 0] | 2022-01-27 22:53:08.770627 | 127.0.0.1 |             -- | 127.0.0.1
                                                                                                       read_data: querying locally [shard 0] | 2022-01-27 22:53:08.770639 | 127.0.0.1 |             11 | 127.0.0.1
                                               Start querying token range ({2507462623645193091, end}, {3897266736829642805, end}] [shard 0] | 2022-01-27 22:53:08.770643 | 127.0.0.1 |             15 | 127.0.0.1
                                                                                                 Creating shard reader on shard: 0 [shard 0] | 2022-01-27 22:53:08.770646 | 127.0.0.1 |             19 | 127.0.0.1
                        Scanning cache for range ({2507462623645193091, end}, {3897266736829642805, end}] and slice {(-inf, +inf)} [shard 0] | 2022-01-27 22:53:08.770649 | 127.0.0.1 |             22 | 127.0.0.1
                                                                                                         Executing forward_request [shard 1] | 2022-01-27 22:53:08.770658 | 127.0.0.2 |             -- | 127.0.0.1
                                                                                                         Executing forward_request [shard 1] | 2022-01-27 22:53:08.770674 | 127.0.0.3 |              5 | 127.0.0.1
                                                                                                       read_data: querying locally [shard 1] | 2022-01-27 22:53:08.770698 | 127.0.0.2 |             40 | 127.0.0.1
                                             Start querying token range [{4611686018427387904, start}, {5592106830937975806, end}] [shard 1] | 2022-01-27 22:53:08.770704 | 127.0.0.2 |             46 | 127.0.0.1
                                                                                                 Creating shard reader on shard: 1 [shard 1] | 2022-01-27 22:53:08.770710 | 127.0.0.2 |             52 | 127.0.0.1
                                                                                                       read_data: querying locally [shard 1] | 2022-01-27 22:53:08.770712 | 127.0.0.3 |             43 | 127.0.0.1
                      Scanning cache for range [{4611686018427387904, start}, {5592106830937975806, end}] and slice {(-inf, +inf)} [shard 1] | 2022-01-27 22:53:08.770714 | 127.0.0.2 |             56 | 127.0.0.1
                                           Start querying token range [{-4611686018427387904, start}, {-4242912715832118944, end}] [shard 1] | 2022-01-27 22:53:08.770718 | 127.0.0.3 |             49 | 127.0.0.1
                                                                                                 Creating shard reader on shard: 1 [shard 1] | 2022-01-27 22:53:08.770739 | 127.0.0.3 |             70 | 127.0.0.1
                    Scanning cache for range [{-4611686018427387904, start}, {-4242912715832118944, end}] and slice {(-inf, +inf)} [shard 1] | 2022-01-27 22:53:08.770743 | 127.0.0.3 |             73 | 127.0.0.1
    Page stats: 17 partition(s), 0 static row(s) (0 live, 0 dead), 17 clustering row(s) (17 live, 0 dead) and 0 range tombstone(s) [shard 1] | 2022-01-27 22:53:08.770814 | 127.0.0.3 |            145 | 127.0.0.1
                                                                                                         Executing forward_request [shard 0] | 2022-01-27 22:53:08.770846 | 127.0.0.3 |             -- | 127.0.0.1
                                                                                                       read_data: querying locally [shard 0] | 2022-01-27 22:53:08.770862 | 127.0.0.3 |             16 | 127.0.0.1
    Page stats: 71 partition(s), 0 static row(s) (0 live, 0 dead), 71 clustering row(s) (71 live, 0 dead) and 0 range tombstone(s) [shard 0] | 2022-01-27 22:53:08.770865 | 127.0.0.1 |            238 | 127.0.0.1
                                             Start querying token range ({-6683686776653114062, end}, {-6473446911791631266, end}] [shard 0] | 2022-01-27 22:53:08.770867 | 127.0.0.3 |             21 | 127.0.0.1
                                                                                                 Creating shard reader on shard: 0 [shard 0] | 2022-01-27 22:53:08.770874 | 127.0.0.3 |             28 | 127.0.0.1
                      Scanning cache for range ({-6683686776653114062, end}, {-6473446911791631266, end}] and slice {(-inf, +inf)} [shard 0] | 2022-01-27 22:53:08.770879 | 127.0.0.3 |             33 | 127.0.0.1
    Page stats: 48 partition(s), 0 static row(s) (0 live, 0 dead), 48 clustering row(s) (48 live, 0 dead) and 0 range tombstone(s) [shard 1] | 2022-01-27 22:53:08.770880 | 127.0.0.2 |            222 | 127.0.0.1
                                                                                                                  Querying is done [shard 1] | 2022-01-27 22:53:08.770888 | 127.0.0.1 |            369 | 127.0.0.1
                                                                                                       read_data: querying locally [shard 1] | 2022-01-27 22:53:08.770909 | 127.0.0.1 |            390 | 127.0.0.1
                                             Start querying token range ({-4075408479358018994, end}, {-3391415989210253693, end}] [shard 1] | 2022-01-27 22:53:08.770911 | 127.0.0.1 |            392 | 127.0.0.1
                                                                                                 Creating shard reader on shard: 1 [shard 1] | 2022-01-27 22:53:08.770914 | 127.0.0.1 |            395 | 127.0.0.1
                      Scanning cache for range ({-4075408479358018994, end}, {-3391415989210253693, end}] and slice {(-inf, +inf)} [shard 1] | 2022-01-27 22:53:08.770936 | 127.0.0.1 |            418 | 127.0.0.1
                                                                                                         Executing forward_request [shard 0] | 2022-01-27 22:53:08.770951 | 127.0.0.2 |             -- | 127.0.0.1
                                                                                                       read_data: querying locally [shard 0] | 2022-01-27 22:53:08.770966 | 127.0.0.2 |             15 | 127.0.0.1
    Page stats: 12 partition(s), 0 static row(s) (0 live, 0 dead), 12 clustering row(s) (12 live, 0 dead) and 0 range tombstone(s) [shard 0] | 2022-01-27 22:53:08.770969 | 127.0.0.3 |            123 | 127.0.0.1
                                                                    Start querying token range (-inf, {-6683686776653114062, end}] [shard 0] | 2022-01-27 22:53:08.770969 | 127.0.0.2 |             18 | 127.0.0.1
                                                                                                 Creating shard reader on shard: 0 [shard 0] | 2022-01-27 22:53:08.770974 | 127.0.0.2 |             23 | 127.0.0.1
                                             Scanning cache for range (-inf, {-6683686776653114062, end}] and slice {(-inf, +inf)} [shard 0] | 2022-01-27 22:53:08.770977 | 127.0.0.2 |             26 | 127.0.0.1
                                                                                                                  Querying is done [shard 1] | 2022-01-27 22:53:08.770993 | 127.0.0.3 |            324 | 127.0.0.1
                                                                                                       read_data: querying locally [shard 1] | 2022-01-27 22:53:08.770998 | 127.0.0.3 |            329 | 127.0.0.1
                                                              Start querying token range ({-3391415989210253693, end}, {0, start}) [shard 1] | 2022-01-27 22:53:08.771001 | 127.0.0.3 |            332 | 127.0.0.1
                                                                                                 Creating shard reader on shard: 1 [shard 1] | 2022-01-27 22:53:08.771004 | 127.0.0.3 |            335 | 127.0.0.1
                                       Scanning cache for range ({-3391415989210253693, end}, {0, start}) and slice {(-inf, +inf)} [shard 1] | 2022-01-27 22:53:08.771007 | 127.0.0.3 |            338 | 127.0.0.1
    Page stats: 48 partition(s), 0 static row(s) (0 live, 0 dead), 48 clustering row(s) (48 live, 0 dead) and 0 range tombstone(s) [shard 1] | 2022-01-27 22:53:08.771044 | 127.0.0.1 |            525 | 127.0.0.1
                                                                                                                  Querying is done [shard 0] | 2022-01-27 22:53:08.771069 | 127.0.0.1 |            442 | 127.0.0.1
                                                                                                 On shard execution result is [71] [shard 0] | 2022-01-27 22:53:08.771145 | 127.0.0.1 |            518 | 127.0.0.1
                                                                                                                  Querying is done [shard 1] | 2022-01-27 22:53:08.771308 | 127.0.0.1 |            789 | 127.0.0.1
                                                                                                 On shard execution result is [60] [shard 1] | 2022-01-27 22:53:08.771351 | 127.0.0.1 |            832 | 127.0.0.1
 Page stats: 127 partition(s), 0 static row(s) (0 live, 0 dead), 127 clustering row(s) (127 live, 0 dead) and 0 range tombstone(s) [shard 0] | 2022-01-27 22:53:08.771379 | 127.0.0.2 |            427 | 127.0.0.1
 Page stats: 183 partition(s), 0 static row(s) (0 live, 0 dead), 183 clustering row(s) (183 live, 0 dead) and 0 range tombstone(s) [shard 1] | 2022-01-27 22:53:08.771385 | 127.0.0.3 |            716 | 127.0.0.1
                                                                                                                  Querying is done [shard 0] | 2022-01-27 22:53:08.771402 | 127.0.0.3 |            556 | 127.0.0.1
                                                                                                                  Querying is done [shard 1] | 2022-01-27 22:53:08.771403 | 127.0.0.2 |            745 | 127.0.0.1
                                                                                                       read_data: querying locally [shard 1] | 2022-01-27 22:53:08.771408 | 127.0.0.2 |            750 | 127.0.0.1
                                                                                                       read_data: querying locally [shard 0] | 2022-01-27 22:53:08.771409 | 127.0.0.3 |            563 | 127.0.0.1
                                                                     Start querying token range ({5592106830937975806, end}, +inf) [shard 1] | 2022-01-27 22:53:08.771411 | 127.0.0.2 |            754 | 127.0.0.1
                                           Start querying token range ({-6272011798787969456, end}, {-4611686018427387904, start}) [shard 0] | 2022-01-27 22:53:08.771412 | 127.0.0.3 |            566 | 127.0.0.1
                                                                                                 Creating shard reader on shard: 0 [shard 0] | 2022-01-27 22:53:08.771415 | 127.0.0.3 |            569 | 127.0.0.1
                                                                                                 Creating shard reader on shard: 1 [shard 1] | 2022-01-27 22:53:08.771415 | 127.0.0.2 |            757 | 127.0.0.1
                                              Scanning cache for range ({5592106830937975806, end}, +inf) and slice {(-inf, +inf)} [shard 1] | 2022-01-27 22:53:08.771419 | 127.0.0.2 |            761 | 127.0.0.1
                    Scanning cache for range ({-6272011798787969456, end}, {-4611686018427387904, start}) and slice {(-inf, +inf)} [shard 0] | 2022-01-27 22:53:08.771419 | 127.0.0.3 |            573 | 127.0.0.1
                                                                                    Received forward_result=[131] from 127.0.0.1:0 [shard 1] | 2022-01-27 22:53:08.771454 | 127.0.0.1 |           1003 | 127.0.0.1
    Page stats: 74 partition(s), 0 static row(s) (0 live, 0 dead), 74 clustering row(s) (74 live, 0 dead) and 0 range tombstone(s) [shard 0] | 2022-01-27 22:53:08.771764 | 127.0.0.3 |            918 | 127.0.0.1
                                                                                                       read_data: querying locally [shard 0] | 2022-01-27 22:53:08.771768 | 127.0.0.3 |            922 | 127.0.0.1
                                                               Start querying token range [{0, start}, {2507462623645193091, end}] [shard 0] | 2022-01-27 22:53:08.771771 | 127.0.0.3 |            925 | 127.0.0.1
                                                                                                 Creating shard reader on shard: 0 [shard 0] | 2022-01-27 22:53:08.771775 | 127.0.0.3 |            929 | 127.0.0.1
                                        Scanning cache for range [{0, start}, {2507462623645193091, end}] and slice {(-inf, +inf)} [shard 0] | 2022-01-27 22:53:08.771779 | 127.0.0.3 |            933 | 127.0.0.1
                                                                                                                  Querying is done [shard 1] | 2022-01-27 22:53:08.771935 | 127.0.0.3 |           1265 | 127.0.0.1
                                                                                                                  Querying is done [shard 0] | 2022-01-27 22:53:08.771950 | 127.0.0.2 |            998 | 127.0.0.1
                                                                                                       read_data: querying locally [shard 0] | 2022-01-27 22:53:08.771956 | 127.0.0.2 |           1004 | 127.0.0.1
                                             Start querying token range ({-6473446911791631266, end}, {-6272011798787969456, end}] [shard 0] | 2022-01-27 22:53:08.771959 | 127.0.0.2 |           1008 | 127.0.0.1
                                                                                                 Creating shard reader on shard: 0 [shard 0] | 2022-01-27 22:53:08.771963 | 127.0.0.2 |           1011 | 127.0.0.1
                      Scanning cache for range ({-6473446911791631266, end}, {-6272011798787969456, end}] and slice {(-inf, +inf)} [shard 0] | 2022-01-27 22:53:08.771966 | 127.0.0.2 |           1014 | 127.0.0.1
    Page stats: 13 partition(s), 0 static row(s) (0 live, 0 dead), 13 clustering row(s) (13 live, 0 dead) and 0 range tombstone(s) [shard 0] | 2022-01-27 22:53:08.772008 | 127.0.0.2 |           1057 | 127.0.0.1
                                                                                                       read_data: querying locally [shard 0] | 2022-01-27 22:53:08.772012 | 127.0.0.2 |           1061 | 127.0.0.1
                                             Start querying token range ({3897266736829642805, end}, {4611686018427387904, start}) [shard 0] | 2022-01-27 22:53:08.772014 | 127.0.0.2 |           1063 | 127.0.0.1
                                                                                                 Creating shard reader on shard: 0 [shard 0] | 2022-01-27 22:53:08.772016 | 127.0.0.2 |           1065 | 127.0.0.1
                      Scanning cache for range ({3897266736829642805, end}, {4611686018427387904, start}) and slice {(-inf, +inf)} [shard 0] | 2022-01-27 22:53:08.772019 | 127.0.0.2 |           1067 | 127.0.0.1
                                                                                                On shard execution result is [200] [shard 1] | 2022-01-27 22:53:08.772053 | 127.0.0.3 |           1384 | 127.0.0.1
    Page stats: 56 partition(s), 0 static row(s) (0 live, 0 dead), 56 clustering row(s) (56 live, 0 dead) and 0 range tombstone(s) [shard 0] | 2022-01-27 22:53:08.772138 | 127.0.0.2 |           1186 | 127.0.0.1
 Page stats: 190 partition(s), 0 static row(s) (0 live, 0 dead), 190 clustering row(s) (190 live, 0 dead) and 0 range tombstone(s) [shard 1] | 2022-01-27 22:53:08.772364 | 127.0.0.2 |           1706 | 127.0.0.1
 Page stats: 149 partition(s), 0 static row(s) (0 live, 0 dead), 149 clustering row(s) (149 live, 0 dead) and 0 range tombstone(s) [shard 0] | 2022-01-27 22:53:08.772407 | 127.0.0.3 |           1561 | 127.0.0.1
                                                                                                                  Querying is done [shard 0] | 2022-01-27 22:53:08.772417 | 127.0.0.3 |           1571 | 127.0.0.1
                                                                                                                  Querying is done [shard 1] | 2022-01-27 22:53:08.772418 | 127.0.0.2 |           1760 | 127.0.0.1
                                                                                                                  Querying is done [shard 0] | 2022-01-27 22:53:08.772426 | 127.0.0.2 |           1475 | 127.0.0.1
                                                                                                                  Querying is done [shard 0] | 2022-01-27 22:53:08.772428 | 127.0.0.2 |           1476 | 127.0.0.1
                                                                                                                  Querying is done [shard 0] | 2022-01-27 22:53:08.772449 | 127.0.0.3 |           1604 | 127.0.0.1
                                                                                                On shard execution result is [196] [shard 0] | 2022-01-27 22:53:08.772555 | 127.0.0.2 |           1603 | 127.0.0.1
                                                                                                On shard execution result is [238] [shard 1] | 2022-01-27 22:53:08.772674 | 127.0.0.2 |           2016 | 127.0.0.1
                                                                                                On shard execution result is [235] [shard 0] | 2022-01-27 22:53:08.772770 | 127.0.0.3 |           1924 | 127.0.0.1
                                                                                    Received forward_result=[435] from 127.0.0.3:0 [shard 1] | 2022-01-27 22:53:08.772933 | 127.0.0.1 |           2482 | 127.0.0.1
                                                                                    Received forward_result=[434] from 127.0.0.2:0 [shard 1] | 2022-01-27 22:53:08.773110 | 127.0.0.1 |           2658 | 127.0.0.1
                                                                                                           Merged result is [1000] [shard 1] | 2022-01-27 22:53:08.773111 | 127.0.0.1 |           2660 | 127.0.0.1
                                                                                              Done processing - preparing a result [shard 1] | 2022-01-27 22:53:08.773114 | 127.0.0.1 |           2663 | 127.0.0.1
                                                                                                                            Request complete | 2022-01-27 22:53:08.772666 | 127.0.0.1 |           2666 | 127.0.0.1
```

Fixes #1385

Closes #9209

* github.com:scylladb/scylla:
  docs: add parallel aggregations design doc
  db: config: add a flag to disable new parallelized aggregation algorithm
  test: add parallelized select count test
  forward_service: add metrics
  forward_service: parallelize execution across shards
  forward_service: add tracing
  cql3: statements: introduce parallelized_select_statement
  cql3: query_processor: add forward_service reference to query_processor
  gms: add PARALLELIZED_AGGREGATION feature
  service: introduce forward_service
  storage_proxy: extract query_ranges_to_vnodes_generator to a separate file
  messaging_service: add verb for count(*) request forwarding
  cql3: selection: detect if a selection represents count(*)
2022-02-04 12:34:19 +02:00
Piotr Wojtczak
0dd7739716 snapshots: Fix snapshot-ctl to include snapshots of dropped tables
Snapshot-ctl methods fetch information about snapshots from
column family objects. The problem with this is that we get rid
of these objects once the table gets dropped, while the snapshots
might still be present (the auto_snapshot option is specifically
made to create this kind of situation). This commit switches from
relying on the column family interface to scanning every datadir
that the database knows of in search of "snapshots" folders.
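
A rough illustration of the scanning approach using std::filesystem (the actual code uses seastar's asynchronous directory listing; the names here are hypothetical):
```c++
#include <filesystem>
#include <system_error>
#include <vector>

// Walk each known data directory and collect every "snapshots" folder,
// regardless of whether the owning table still exists.
std::vector<std::filesystem::path> find_snapshot_dirs(
        const std::vector<std::filesystem::path>& datadirs) {
    std::vector<std::filesystem::path> found;
    for (const auto& dir : datadirs) {
        std::error_code ec;
        for (auto it = std::filesystem::recursive_directory_iterator(dir, ec);
             it != std::filesystem::recursive_directory_iterator(); ++it) {
            if (it->is_directory(ec) && it->path().filename() == "snapshots") {
                found.push_back(it->path());
            }
        }
    }
    return found;
}
```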

Fixes #3463
Closes #7122

Closes #9884

Signed-off-by: Piotr Wojtczak <piotr.m.wojtczak@gmail.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-02-01 22:31:43 +02:00
Michał Sala
140bab279c test: add parallelized select count test
Added a test that checks whether a SELECT COUNT(*) query was transformed and
processed in parallel. Checking is done by looking at the CQL
statistics and comparing subsequent counts of parallelized aggregation
SELECT query executions.
2022-02-01 21:14:41 +01:00
Michał Sala
0fe59082ec storage_proxy: extract query_ranges_to_vnodes_generator to a separate file
Such separation allows other services to use query_ranges_to_vnodes_generator
without needing a storage_proxy dependency.
2022-02-01 21:14:41 +01:00
Tomasz Grabiec
8297ae531d Merge "Automatically retry CQL DDL statements in presence of concurrent changes" from Kamil
Schema changes on top of Raft do not allow concurrent changes.
If two changes are attempted concurrently, one of them gets
`group0_concurrent_modification` exception.

Catch the exception in CQL DDL statement execution function and retry.

In addition, improve the description of CQL DDL statements
in group 0 history table.

Add a test which checks that group 0 history grows iff a schema change does
not throw `group0_concurrent_modification`. Also check that the retry
mechanism works as expected.

* kbr/ddl-retry-v1:
  test: unit test for group 0 concurrent change protection and CQL DDL retries
  cql3: statements: schema_altering_statement: automatically retry in presence of concurrent changes
2022-01-31 14:12:35 +01:00
Mikołaj Sielużycki
93d6eb6d51 compacting_reader: Support fast_forward_to position range.
Fast forwarding is delegated to the underlying reader and assumes the
it's supported. The only corner case requiring special handling that has
shown up in the tests is producing partition start mutation in the
forwarding case if there are no other fragments.

The compacting state keeps track of the uncompacted partition start, but doesn't
emit it by default. If end of stream is reached without producing a
mutation fragment, the partition start is not emitted. This is invalid
behaviour in the forwarding case, so I've added a public method to the
compacting state to force marking the partition as non-empty. I don't like
this solution, as it feels like breaking an abstraction, but I didn't
come across a better idea.

Tests: unit(dev, debug, release)

Message-Id: <20220128131021.93743-1-mikolaj.sieluzycki@scylladb.com>
2022-01-31 13:37:36 +02:00
Tomasz Grabiec
b734615f51 util: cached_file: Fix corruption after memory reclamation was triggered from population
If memory reclamation is triggered inside _cache.emplace(), the _cache
btree can get corrupted. Reclaimers erase from it, and emplace()
assumes that the tree is not modified during its execution. It first
locates the target node and then does memory allocation.

Fix by running emplace() under an allocating section, which
disables memory reclamation.
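
A minimal sketch of the fix, assuming a logalloc::allocating_section; the member names are illustrative, not the exact ones in cached_file:

```c++
// hedged sketch: emplace() first locates the target node and then
// allocates; with reclaim disabled, reclaimers cannot erase from the
// btree between those two steps
_as(_region, [&] {
    _cache.emplace(page_idx, std::move(page));
});
```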

The bug manifests with assert failures, e.g:

./utils/bptree.hh:1699: void bplus::node<unsigned long, cached_file::cached_page, cached_file::page_idx_less_comparator, 12, bplus::key_search::linear, bplus::with_debug::no>::refill(Less) [Key = unsigned long, T = cached_file::cached_page, Less = cached_file::page_idx_less_comparator, NodeSize = 12, Search = bplus::key_search::linear, Debug = bplus::with_debug::no]: Assertion `p._kids[i].n == this' failed.

Fixes #9915

Message-Id: <20220130175639.15258-1-tgrabiec@scylladb.com>
2022-01-30 19:57:35 +02:00
Kamil Braun
4a52b802ac test: unit test for group 0 concurrent change protection and CQL DDL retries
Check that group 0 history grows iff a schema change does not throw
`group0_concurrent_modification`. Check that the CQL DDL statement retry
mechanism works as expected.
2022-01-27 11:26:15 +01:00
Kamil Braun
b863a63b08 test: unit test for clearing old entries in group0 history
We perform a bunch of schema changes with different values of
`migration_manager::_group0_history_gc_duration` and check if entries
are cleared according to this setting.
2022-01-25 13:13:35 +01:00
Botond Dénes
eb42213db4 compact_mutation: close active range tombstone on page end
The compactor recently acquired the ability to consume a v2 stream. The
v2 spec requires that all streams end with a null tombstone.
`range_tombstone_assembler`, the component the compactor uses
for converting the v2 input into its v1 output, enforces this
with a check in `consume_end_of_partition()`. Normally the
producer of the stream the compactor is consuming takes care of
closing the active tombstone before the stream ends. The
compactor (or its consumer), however, can decide to end
consumption early, e.g. to cut the current page. When this
happens, the compactor must take care of closing the tombstone
itself. Furthermore, it has to keep this tombstone around to
re-open it on the next page.
This patch implements this mechanism, which was left out of 134601a15e.
It also adds a unit test which reproduces the problems caused by the
missing mechanism.
The compactor now tracks the last clustering position emitted. When the
page ends, this position will be used as the position of the closing
range tombstone change. This ensures the range tombstone only covers the
actually emitted range.
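
A hedged sketch of the page-cut handling; the member and helper names are illustrative, not the exact ones in the patch:

```c++
// hedged sketch: close the active tombstone at the last emitted position
// when the page is cut, and remember it for the next page
if (page_cut && _active_tombstone) {
    emit(range_tombstone_change(
            position_in_partition::after_key(_last_position), tombstone()));
    _tombstone_to_reopen = _active_tombstone; // re-opened when the next page starts
}
```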

Fixes: #9907

Tests: unit(dev), dtest(paging_test.py, paging_additional_test.py)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20220114053215.481860-1-bdenes@scylladb.com>
2022-01-25 09:52:30 +02:00
Kamil Braun
a664ac7ba5 treewide: require group0_guard when performing schema changes
`announce` now takes a `group0_guard` by value. A `group0_guard`
can only be obtained through
`migration_manager::start_group0_operation` and moved; it cannot
be constructed outside `migration_manager`.

The guard will be a method of ensuring linearizability for group 0
operations.
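
A hedged usage sketch; `prepare_changes` is a stand-in and the exact signatures are assumptions based on the description above:

```c++
// hedged sketch: the guard can only come from start_group0_operation
// and is moved into announce(), which takes it by value
auto guard = co_await _mm.start_group0_operation();
auto muts = co_await prepare_changes(guard);
co_await _mm.announce(std::move(muts), std::move(guard));
```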
2022-01-24 15:20:35 +01:00
Kamil Braun
283ac7fefe treewide: pass mutation timestamp from call sites into migration_manager::prepare_* functions
The functions which prepare schema change mutations (such as
`prepare_new_column_family_announcement`) would use internally
generated timestamps for these mutations. When schema changes are
managed by group 0 we want to ensure that timestamps of mutations
applied through Raft are monotonic. We will generate these timestamps at
call sites and pass them into the `prepare_` functions. This commit
prepares the APIs.
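
A hedged sketch of the new call pattern; the `prepare_` signature shown is an assumption:

```c++
// hedged sketch: the timestamp is generated at the call site and
// passed into the prepare_ function instead of being generated inside it
auto ts = api::new_timestamp();
auto muts = co_await _mm.prepare_new_column_family_announcement(schema, ts);
```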
2022-01-24 15:12:50 +01:00
Benny Halevy
188cedd533 test: lister_test: test_lister_abort: generate at least one entry
Without this fix, generate_random_content could generate 0 entries
and the expected exception would never be injected.

With it, we generate at least 1 entry and the test passes
with the offending random-seed:

```
random-seed=1898914316
Generated 1 dir entries
Aborting lister after 1 dir entries
test/boost/lister_test.cc(96): info: check 'exception "expected_exception" raised as expected' has passed
```

Fixes #9953

Test: lister_test.test_lister_abort --random-seed=1898914316(dev)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220123122921.14017-1-bhalevy@scylladb.com>
2022-01-23 17:52:44 +02:00
Benny Halevy
f439edca35 test: sstable_compaction_test: twcs_reshape_with_disjoint_set_test: take min_threshold into consideration
Take into account that get_reshaping_job selects only
buckets that have more than min_threshold sstables in them.

Therefore, with 256 disjoint sstables in different windows,
allow the first or last windows not to be selected by
get_reshaping_job, which will return at least
disjoint_sstable_count - min_threshold + 1 sstables, and not
more than disjoint_sstable_count.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220123090044.38449-2-bhalevy@scylladb.com>
2022-01-23 17:52:44 +02:00
Nadav Har'El
7cb6250c40 Merge 'snapshot_ctl: true_snapshots_size: fix space accounting' from Benny Halevy
This pull request fixes two preexisting issues related to snapshot_ctl::true_snapshots_size

https://github.com/scylladb/scylla/issues/9897
https://github.com/scylladb/scylla/issues/9898

It also adds a couple of unit tests for the snapshot_ctl functionality.

Test: unit(dev), database_test.{test_snapshot_ctl_details,test_snapshot_ctl_true_snapshots_size}(debug)

Closes #9899

* github.com:scylladb/scylla:
  table: get_snapshot_details: count allocated_size
  snapshot_ctl: cleanup true_snapshots_size
  snapshot_ctl: true_snapshots_size: do not map_reduce across all shards
2022-01-19 11:57:15 +02:00
Benny Halevy
5db3cbe1e4 snapshot_ctl: true_snapshots_size: do not map_reduce across all shards
snapshot_ctl uses map_reduce over all database shards,
each counting the size of the snapshots directory,
which is shared, not per-shard.

So the total live size it returns is multiplied by the number of shards.

Add a unit test to cover this.
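
A hedged sketch of the fix; assuming snapshot_ctl is a sharded service, the size is computed on a single shard only:

```c++
// hedged sketch: the snapshots directory is shared between shards,
// so compute its size once instead of map_reduce over all shards
co_return co_await container().invoke_on(0, [] (snapshot_ctl& ctl) {
    return ctl.true_snapshots_size_on_this_shard(); // hypothetical helper
});
```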

Fixes #9897

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-01-19 07:50:53 +02:00
Nadav Har'El
1ce73c2ab3 Merge 'utils::is_timeout_exception: Ensure we handle nested exception types' from Calle Wilund
Fixes #9922

The storage proxy uses is_timeout_exception to choose between
different code paths. Commit a6202ae079 broke this (through bit
rot and intermixing) by wrapping exceptions for informational
purposes.

This adds a check for nested exception types to the exception
handling, as well as a test for the routine itself.
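
A minimal sketch of the nested-type handling; the exact set of timeout types checked here is an assumption:

```c++
// hedged sketch: unwrap std::nested_exception and recurse
bool is_timeout_exception(std::exception_ptr ep) {
    try {
        std::rethrow_exception(std::move(ep));
    } catch (const seastar::rpc::timeout_error&) {
        return true;
    } catch (const seastar::semaphore_timed_out&) {
        return true;
    } catch (const std::nested_exception& e) {
        // the wrapper is not a timeout itself; look at what it wraps
        if (auto nested = e.nested_ptr()) {
            return is_timeout_exception(nested);
        }
    } catch (...) {
    }
    return false;
}
```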

Closes #9932

* github.com:scylladb/scylla:
  database/storage_proxy: Use "is_timeout_exception" instead of catch match
  utils::is_timeout_exception: Ensure we handle nested exception types
2022-01-18 23:49:41 +02:00
Avi Kivity
fcb8d040e8 treewide: use Software Package Data Exchange (SPDX) license identifiers
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.

Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.

The changes were applied mechanically with a script, except for
licenses/README.md.
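
For illustration, the dual-licensed case reduces to a single header line of this form (using the expression quoted above):

```c++
/*
 * SPDX-License-Identifier: (AGPL-3.0-or-later and Apache-2.0)
 */
```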

Closes #9937
2022-01-18 12:15:18 +01:00
Raphael S. Carvalho
299ffb1e1a compaction: make TWCS reshape on a time bucket with tons of files much more efficient
Currently, when TWCS reshape finds a bucket containing more than 32
files, it blindly resizes that bucket to 32.
That's very bad because it ignores the fact that compaction
efficiency depends on the relative sizes of the files being
compacted together, meaning that a huge file can be compacted with
a tiny one, producing lots of write amplification.

To solve this problem, STCS reshape logic will now be reused
within each time bucket, so that only similar-sized files are
compacted together, and a time bucket is considered reshaped once
its size tiers are properly compacted, according to the reshape mode.
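
A hedged sketch of the per-window logic; the helper names are illustrative:

```c++
// hedged sketch: reuse STCS reshape inside each time window so that
// only similar-sized sstables are compacted together
for (auto& [window, bucket] : group_by_time_window(candidates)) {
    auto job = stcs_reshape(bucket, mode); // hypothetical per-bucket helper
    if (!job.empty()) {
        return job; // this window's size tiers still need reshaping
    }
}
return {}; // all time buckets are considered reshaped
```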

Fixes #9938.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220117205000.121614-1-raphaelsc@scylladb.com>
2022-01-18 12:33:54 +02:00
Avi Kivity
7260d8abed Merge "index_reader: improve verify_end_state()" from Botond
"
Said method should take care of checking that parsing stopped in a valid
state. This patch-set expands the existing, very minimal
implementation by improving its error message and adding an
additional check for prematurely exiting the parser in the middle
of parsing an index entry, something we've seen recently in #9446.
To help in debugging such issues, some additional information is added
to the trace messages.
The series also fixes a bug in the error handling code of the partition
index cache.

Refs: #9446

Tests: unit(dev)
"

* 'index-reader-better-verify-end-state/v2.1' of https://github.com/denesb/scylla:
  sstables/index_reader: process_state(): add additional information to trace logging
  sstables/index_reader: verify_end_state(): add check for premature EOS
  sstables/index_reader: convert exception in verify_end_state() to malformed sstable exception
  sstables/index_reader: add const sstable& to index_consume_entry_context
  sstables/index_reader: remove unused members from index_consume_entry_context
2022-01-18 12:13:08 +02:00
Botond Dénes
afb14508c4 sstables/index_reader: verify_end_state(): add check for premature EOS
Add a check which ensures that parsing ended in a valid state and not in
the middle of a half-parsed entry.
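
A hedged sketch of the check; the parser state names are assumptions:

```c++
// hedged sketch: parsing must stop on an entry boundary, not mid-entry
void verify_end_state() {
    if (_state != state::START) { // any other state means a half-parsed entry
        throw sstables::malformed_sstable_exception(
                "index parsing ended prematurely, in the middle of an entry");
    }
}
```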
2022-01-18 10:38:11 +02:00
Benny Halevy
25977db7b4 token_metadata: remove update_normal_token entry point
It's currently used only by unit tests,
and it is dangerous to use on a populated token_metadata,
as update_normal_tokens assumes that the set of tokens
owned by the given endpoint is complete, i.e. tokens
previously owned by the endpoint are no longer owned by it,
while the single-token update_normal_token interface
seems cumulative (and has no documentation whatsoever).

It is better to remove this interface and calculate a
complete map of endpoint->tokens in the tests.
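
A hedged sketch of the test-side replacement; the exact container types and signature are assumptions:

```c++
// hedged sketch: build the complete token set per endpoint first,
// then hand each endpoint's full set to update_normal_tokens()
std::unordered_map<gms::inet_address, std::unordered_set<dht::token>> endpoint_tokens;
for (auto& [endpoint, token] : assignments) {
    endpoint_tokens[endpoint].insert(token);
}
for (auto& [endpoint, tokens] : endpoint_tokens) {
    co_await tm.update_normal_tokens(tokens, endpoint);
}
```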

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220117101242.122512-1-bhalevy@scylladb.com>
2022-01-17 12:18:42 +02:00