Commit Graph

28555 Commits

Author SHA1 Message Date
Avi Kivity
15ffd84473 build: silence some gcc 11 warnings
These warnings are valuable, but limit the noise for now by disabling
them.
2021-10-10 18:16:50 +03:00
Avi Kivity
029560c232 sstables: processing_result_generator: make coroutine support palatable for C++20 compilers
clang implement the coroutine technical specification, in namespace
std::experimental. gcc implements C++20 coroutines, in namespace std.
Detect which one is in use and select the namespace accordingly.
2021-10-10 18:16:50 +03:00
Avi Kivity
c38f18163e managed_bytes: avoid compile-time loop in converting constructor
managed_bytes_basic_view is a template with a constructor that
converts from one instantiation of the template to another.
Unfortunately when gcc encounters the associated constraint, it
instantiates the template which forces it to evaluate the constraint
again, sending it into a loop.

Fix that by making the converting constructor a template itself,
delaying instantiation. The constraint is strengthened so the set
of types on which the constructor works is unchanged.
2021-10-10 18:16:50 +03:00
Avi Kivity
f6d59c33ff service: service_level_controller: drop unused variable sl_compare
Reported by gcc 11.
2021-10-10 18:16:50 +03:00
Avi Kivity
cd4af0c722 raft: disambiguate promise name in raft::active_read
gcc complains tha the name 'promise' changes meaning (from type
to variable) within active_read. Help it by disambiguating the
use as type.
2021-10-10 18:16:50 +03:00
Avi Kivity
3f9ec5302a locator: azure_snitch: use full type name in definition of globals
Some globals in azure_snitch use std::string in the declaration
and auto in the definition. gcc 11 complains. I don't know if it's
correct, but it's easy to use the type in both declaration and
definition.
2021-10-10 18:16:50 +03:00
Avi Kivity
d83b565938 cql3: statements: create_service_level_statement: don't ignore replace_defaults()
We call replace_defaults on an object named 'slo', but then ignore it.

Use the new object that replace_defaults() returned.

Reported by gcc 11.
2021-10-10 18:16:50 +03:00
Avi Kivity
25f8e9c078 cql3: statement_restrictions: adjust call to std::vector deduction guide
gcc 11 has a hard time parsing a deduction guide use with
braced initializer. The bug [1] was already fixed in gcc 12,
and I've requested a backport, but reduce friction meanwhile
by switching to a form that works in gcc 11.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89062
2021-10-10 18:16:50 +03:00
Avi Kivity
df73d12272 types: remove recursive constraint in deserialize_value
deserialize_value() has a constraint that depends on another
deserialize_value() implementation. Apprently gcc wants to
instantiate the deserialize_value() instance we're constraining
while evaluating the constraint, leading to a loop.

Since this deserialize_value() is just an internal helper, drop
the constraint rather than fighting it.
2021-10-10 18:16:50 +03:00
Avi Kivity
58a0e80021 cql3: restrictions: relax constraint on visitor_with_binary_operator_content
We require that v.current_binary_operator is a 'const binary_operator*',
but it's really a 'const binary_operator*&'. Relax the constraint so it
works with both gcc and clang.
2021-10-10 18:16:50 +03:00
Avi Kivity
fd8beeaea9 treewide: handle switch statements that return
A switch statement where every case returns triggers a gcc
warning if the surrounding function doesn't return/abort.

Fix by adding an abort(). The abort() will never trigger since we
have a warning on unhandled switch cases.
2021-10-10 18:16:50 +03:00
Avi Kivity
b08c299713 cql3: expr: correct type of captured map value_type
A map's value_type has const key, but in two places we omitted
the const. This causes construction of a new value, plus gcc
complaining that we're refering to a temporary.

Fix by using the correct type.
2021-10-06 14:57:43 +03:00
Avi Kivity
eac95e2370 cdc: adjust type of streams_count
streams_count has signed type, but it's compared against an unsigned
type, annoying gcc. Since a count should be positive, convert it to
an unsigned type.
2021-10-06 14:56:00 +03:00
Avi Kivity
5a5a47c4c7 alternator: disambiguate attrs_to_get in table_requests
There is a table_requests::attrs_to_get type, and also a type
named attrs_to_get used in the same struct, and gcc doesn't
like this. Disambiguate the type by fully qualifying it.
2021-10-06 14:55:48 +03:00
Avi Kivity
0ea79559a6 Merge 'IDL: support generating boilerplate code for RPC verbs' from Pavel Solodovnikov
Introduce new syntax in IDL compiler to allow generating
registration/sending code for RPC verbs:

```
        verb [[attr1, attr2...] my_verb (args...) -> return_type;
```

`my_verb` RPC verb declaration corresponds to the
`netw::messaging_verb::MY_VERB` enumeration value to identify the
new RPC verb.

For a given `idl_module.idl.hh` file, a registrator class named
`idl_module_rpc_verbs` will be created if there are any RPC verbs
registered within the IDL module file.

These are the methods being created for each RPC verb:

```
        static void register_my_verb(netw::messaging_service* ms, std::function<return_type(args...)>&&);
        static future<> unregister_my_verb(netw::messaging_service* ms);
        static future<> send_my_verb(netw::messaging_service* ms, netw::msg_addr id, args...);
```

Each method accepts a pointer to an instance of `messaging_service`
object, which contains the underlying seastar RPC protocol
implementation, that is used to register verbs and pass messages.

There is also a method to unregister all verbs at once:

```
        static future<> unregister(netw::messaging_service* ms);
```

The following attributes are supported when declaring an RPC verb
in the IDL:
* `[[with_client_info]]` - the handler will contain a const reference to
  an `rpc::client_info` as the first argument.
* `[[with_timeout]]` - an additional `time_point` parameter is supplied
  to the handler function and `send*` method uses `send_message_*_timeout`
  variant of internal function to actually send the message.
* `[[one_way]]` - the handler function is annotated by
  `future<rpc::no_wait_type>` return type to designate that a client
  doesn't need to wait for an answer.

The `-> return_type` clause is optional for two-way messages. If omitted,
the return type is set to be `future<>`.
For one-way verbs, the use of return clause is prohibited and the
signature of `send*` function always returns `future<>`.

No existing code is affected.

Ref: #1456

Closes #9359

* github.com:scylladb/scylla:
  idl: support generating boilerplate code for RPC verbs
  idl: allow specifying multiple attributes in the grammar
  message: messaging_service: extract RPC protocol details and helpers into a separate header
2021-10-05 18:05:24 +03:00
Tzach Livyatan
bd87c7d362 Update docker-hub text
Mention aarch64 support

Closes #9436
2021-10-05 17:35:02 +03:00
Raphael S. Carvalho
342bfbd65a compaction: Make major compaction on keyspace resilient if low on space
Let's major compact the smallest tables first, increasing chances of
success if low on disk space.

parallel_for_each() didn't have any effect on space requirement as
compaction_manager serializes major compaction in a shard.
As parallel_for_each() is no longer used, find_column_family() is now
used before each compact_all_sstables() to avoid a race with table drop.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211005135257.31931-1-raphaelsc@scylladb.com>
2021-10-05 17:04:34 +03:00
Nadav Har'El
77bd4afda7 test/alternator: avoid client-side validation
Ever since we started testing Alternator with tests written in Python
and using Amazon's "boto3" library, one limitation kept annoying us:

Boto3 verifies the validity of the request parameters before passing
them on to the server. It verifies that mandatory parameters are not
missing, that parameters have the right types, and sometimes even the
right ranges - all in the library before ever sending the request.
This meant that in many cases, we couldn't get good test coverage for
Alternator's server-side handling of *wrong* parameters.

As it turns out, it is trivial to tell boto3 to *not* do its client-side
request validation, with the `parameter_validation=False` config flag.
We just never noticed that such a flag existed :-)

So this patch adds this flag. It then fixes a few tests which expected
ParameterValidationError - this error is the client-side validation
failure, but should now be replaced by checking the server-side error.

The patch also adds a couple of invalid parameter checks that we
couldn't do before because of boto3's eagerness to check them on the
client side. We can add a lot more of these error tests in the future,
now that we got rid of client-side validation.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211005095514.537226-1-nyh@scylladb.com>
2021-10-05 13:26:51 +02:00
Nadav Har'El
6dee86eade test/alternator: another test for adding a GSI to an existing table
This patch adds yet another test for Alternator's unimplemented feature
of adding a GSI to an already existing table (issue #5022), but this
test is for a very specific corner case - tables which contain string
attributes with an empty value - the corner case described in
issue #9424:

DynamoDB used to forbid any string attributes from being set to an empty
string, but this changed in May 2020, and since then empty strings are
allowed - but NOT as keys. So although it is legal to set a string
attribute to an empty string, if this table has a GSI whose key is that
specific attribute, the update command is refused. We already had a
test for this - test_gsi_empty_value.

However, the case in this patch is the case where a GSI is added to a
table *after* the table already has data. In this case (as this test
demonstrates), we are supposed to drop the items which have the empty
string key from the GSI.

Even when #5022 (the ability to add GSIs to existing tables) will be done,
this test will continue to fail. The unique problem of this test is that
Scylla's materialized views *do* allow empty strings as clustering keys
(right now) and even partition keys (after #9375 will be solved), while
we don't want them to enter the GSI. We will probably need to add to the
view's filter, which right now contains (as required) "x IS NOT NULL"
also the filter "x != ''" (when x's type is a string or binary) so that
items with empty-string keys will be dropped.

Refs #5022
Refs #9375
Refs #9424

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211003170636.477582-1-nyh@scylladb.com>
2021-10-05 13:26:43 +02:00
Nadav Har'El
b136104298 alternator/test: test for invalid numeric values
DynamoDB has a rather baroque definition of numbers, and in particular
it does *not* allow numeric attributes to be set to infinity or NaN.

Although I did check invalid numbers in the past, manually, I was never
able to write a unit test for this in the past - because the boto3
library catches such errors on the client side, and prevents the test from
sending broken requests to the server. So in this patch, I finally came up
with a solution - a context manager client_no_transform() which
yields a client which does NOT do any transformation or validation
on the request's parameters, allowing us to use boto3 to create
improper requests - and test the server's handling of them.

The test in this patch passes - it did not discover a new bug, but
it is a useful regression test and the client_no_transform() trick
can be used in more error-case tests which until now we were unable
to write.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211004161809.520236-1-nyh@scylladb.com>
2021-10-05 13:13:45 +02:00
Avi Kivity
2d25705db0 cql3: deinline non-trivial methods in selection.hh
This allows us to forward-declare raw_selector, which in turn reduces
indirect inclusions of expression.hh from 147 to 58, reducing rebuilds
when anything in that area changes.

Includes that were lost due to the change are restored in individual
translation units.

Closes #9434
2021-10-05 12:58:55 +02:00
Avi Kivity
d3f8148807 utils: untie rjson.hh from base64.hh
base64.hh pulls in the huge rjson.hh, so if someone just wants
a base64 codec they have to pull in the entire rapidjson library.

Move the json related parts of base64.hh to rjson.hh and adjust
includes and namespaces.

In practice it doesn't make much difference, as all users of base64
appear to want json too. But it's cleaner not to mix the two.

Closes #9433
2021-10-05 12:57:54 +02:00
Avi Kivity
3a67c661d4 Merge "Improve parallelizm of mutation source tests" from Pavel E
"
There's a run_mutation_source_tests lib helper that runs a bunch of
tests sequentially. The problem is that it does 4 different flavors
of it each being a certain decoration over provided reader. This
amplification makes some test cases run enormous amount of time
without any chance for parallelizm.

The simplest way to help running those cases in parallel is to teach
the slowest cases to run different flavors of mutation source tests
in dedicated cases. This patch makes it so.

The resulting timings are
                                  dev   debug
                 sequential run:  2m1s  53m50s
--parallel-cases (+ this patch):  1m3s  31m15s

tests: unit(dev, debug)
"

* 'br-parallel-mutation-source-tests' of https://github.com/xemul/scylla:
  test: Split multishard combining reader case
  test: Split database test case
  test: Split run_mutation_source_tests
2021-10-05 12:22:52 +03:00
Tomasz Grabiec
17430795e8 Merge "test: raft: randomized_nemesis_test: handle missing snapshot in rpc::send_snapshot" from Kamil
It's possible that the server drops the snapshot in the same iteration
of `io_fiber` loop as it tries to send it (the sending of messages
happens after snapshot dropping). Handle this case by throwing an
exception.

As a preparation we also fix the code in `server_impl::send_snapshot` so
it works correctly when `rpc::send_snapshot` throws or returns a ready
future.

Refs #9407.

* kbr/snapshot-handle-errors:
  test: raft: randomized_nemesis_test: remove an obsolete comment
  test: raft: randomized_nemesis_test: handle missing snapshot in `rpc::send_snapshot`
  raft: server: handle `rpc::send_snapshot` returning instantly
2021-10-05 11:19:14 +02:00
Kamil Braun
c9a7778497 test: raft: randomized_nemesis_test: remove an obsolete comment 2021-10-05 11:04:11 +02:00
Kamil Braun
961f5a904c test: raft: randomized_nemesis_test: handle missing snapshot in rpc::send_snapshot
It's possible that the server drops the snapshot in the same iteration
of `io_fiber` loop as it tries to send it (the sending of messages
happens after snapshot dropping). Handle this case.

Refs #9407.
2021-10-05 11:04:11 +02:00
Kamil Braun
36f3e26374 raft: server: handle rpc::send_snapshot returning instantly
If `rpc::send_snapshot` returned immediately with a ready future, or if
it threw, the code in `server_impl::send_snapshot` would not update
`_snapshot_transfers` correctly.

The code assumed that the continuation attached to `rpc::send_snapshot`
(with `then_wrapped`) was executed after `_snapshot_transfer` below the
`rpc::send_snapshot` call was updated. That would not necessarily be true
(the continuation may even not have been executed at all if
`rpc::send_snapshot` threw).

Fix that by wrapping the `rpc::send_snapshot` call into a continuation
attached to `later()`.

Originally authored by Gleb <gleb@scylladb.com>, I added a comment.
2021-10-05 11:04:11 +02:00
Pavel Emelyanov
b742e6cbb6 test: Split multishard combining reader case
All the cases in this test also run mutation source tests and the
case with single-fragment buffer takes times more time to execute
than the others.

Splitting this single case so that it runs mutation source tests
flavours in different cases improves the test parallelizm.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-10-05 11:57:02 +03:00
Pavel Emelyanov
30075094ac test: Split database test case
The test_database_with_data_in_sstables_is_a_mutation_source case runs
the mutation source tests in one go. The problem is that on each step
a whole new ks:cf is created which takes the majority of the tests time.
In the end of the day this case is the slowest one in the suite being
up to two times longer (depending on mode) than the #2 on this list.

This patch splits the case into 4 so that each mutation source flavor
is run in separate case.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-10-05 11:53:18 +03:00
Pavel Emelyanov
1e09a2c925 test: Split run_mutation_source_tests
There are 4 flavours of mutation source tests that are all ran
sequentially -- plain, reversed and upgrade/downgrade ones that
check v1<->v2 conversions.

This patch splits them all into individual calls so that some
tests may want to have dedicated cases for each. "By default" they
are all run as they were.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-10-05 11:51:43 +03:00
Pavel Emelyanov
4b4ce015aa system-keyspace: Keep UUID value when saving
The set_local_host_id() accepts UUID references and starts
to save it in local keyspace and in all shards' local cache.
Before it was coroutinized the UUID was copied on captures
and survived, after it it remains references. The problem is
that callers pass local variables as arguments that go away
"really soon".

Fix it to accept UUID as value, it's short enough for safe
and painless copy.

fixes: #9425
tests: dtest.ReplaceAddress_rbo_enabled.replace_node_diff_ip(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20211004145421.32137-1-xemul@scylladb.com>
2021-10-04 18:21:44 +03:00
Tomasz Grabiec
cd9b4d95fc Merge "test: raft: randomized_nemesis_test: better liveness check at the end of generator test" from Kamil
The previous check would find a leader once and assume that it does not
change, and that the first attempt at sending a request to this leader
succeeds. In reality the leader may change at the end of the test (e.g.
it may be in the middle of stepping down when we find it) and in general
it may take some time for the cluster to stabilize. The new check tries
a few times to find a leader and perform a request - until a time limit
is reached.

* kbr/nemesis-liveness-check:
  test: raft: randomized_nemesis_test: better liveness check at the end of generator test
  test: raft: randomized_nemesis_test: take `time_point` instead of `duration` in `wait_for_leader`
2021-10-04 16:05:37 +02:00
Kamil Braun
17e771c5f5 test: raft: randomized_nemesis_test: better liveness check at the end of generator test
The previous check would find a leader once and assume that it does not
change, and that the first attempt at sending a request to this leader
succeeds. In reality the leader may change at the end of the test (e.g.
it may be in the middle of stepping down when we find it) and in general
it may take some time for the cluster to stabilize. The new check tries
a few times to find a leader and perform a request - until a time limit
is reached.

The commit also removes an incorrect assertion inside in `wait_for_leader`.
2021-10-04 15:57:54 +02:00
Kamil Braun
478a58e86d test: raft: randomized_nemesis_test: take time_point instead of duration in wait_for_leader
To be used in the next commit, where we call `wait_for_leader` in a loop
with the same deadline `time_point`.
2021-10-04 15:56:54 +02:00
Tomasz Grabiec
e89b9799b8 Merge 'sstable mx reader: implement reverse single-partition reads' from Kamil Braun
Until now reversed queries were implemented inside
`querier::consume_page` (more precisely, inside the free function
`consume_page` used by `querier::consume_page`) by wrapping the
passed-in reader into `make_reversing_reader` and then consuming
fragments from the resulting reversed reader.

The first couple of commits change that by pushing the reversing down below
the `make_combined_reader` call in `table::query`. This allows
working on improving reversing for memtables independently from
reversing for sstables.

We then extend the `index_reader` with functions that allow
reading the promoted index in reverse.

We introduce `partition_reversing_data_source`, which wraps an sstable data
file and returns data buffers with contents of a single chosen partition
as if the rows were stored in reverse order.

We use the reversing source and the extended index reader in
`mx_sstable_mutation_reader` to implement efficient (at least in theory)
reversed single-partition reads.

The patchset disables cache for reversed reads. Fast-forwarding
is not supported in the mx reader for reversed queries at this point.

Details in commit messages. Read the commits in topological order
for best review experience.

Refs: #9134
(not saying "Fixes" because it's only for single-partition queries
without forwarding)

Closes #9281

* github.com:scylladb/scylla:
  table: add option to automatically bypass cache for reversed queries
  test: reverse sstable reader with random schema and random mutations
  sstables: mx: implement reversed single-partition reads
  sstables: mx: introduce partition_reversing_data_source
  sstables: index_reader: add support for iterating over clustering ranges in reverse
  clustering_key_filter: clustering_key_filter_ranges owning constructor
  flat_mutation_reader: mention reversed schema in make_reversing_reader docstring
  clustering_key_filter: document clustering_key_filter_ranges::get_ranges
2021-10-04 15:37:34 +02:00
Kamil Braun
703aed3277 table: add option to automatically bypass cache for reversed queries
Currently the new reversing sstable algorithms do not support fast
forwarding and the cache does not yet handle reversed results. This
forced us to disable the cache for reversed queries if we want to
guarantee bounded memory. We introduce an option that does this
automatically (without specifying `bypass cache` in the query) and turn
it on by default.

If the user decides that they prefer to keep the cache at the
cost of fetching entire partitions into memory (which may be viable
if their partitions are small) during reversed queries, the option can
be turned off. It is live-updateable.
2021-10-04 15:24:12 +02:00
Kamil Braun
9bf6be5509 test: reverse sstable reader with random schema and random mutations
The test generates a random set of mutations and creates two readers:
- one by reversing the mutations, creating an sstable out of the result,
  and querying it in reverse,
- one by creating an sstable directly from the mutations and querying it
  in forward mode.

It checks that the readers give equal results.

The test already managed to find a bug where offsets returned by the
sstable index were interpreted incorrectly as absolute instead of
relative. It also helped find another bug unrelated to reversing (#9352).

Surprisingly few tests use the random schema and random mutation
utilities which seem to be quite powerful.
2021-10-04 15:24:12 +02:00
Kamil Braun
27238eaa0f sstables: mx: implement reversed single-partition reads
We use partition_reversing_data_source and the new `index_reader` methods
to implement single-partition reads in `mx_sstable_mutation_reader`.

The parsing logic does not need to change: the buffers returned by the
source already contain rows in reversed clustering order.

Some changes were required in `mp_row_consumer_m` which processes the
parsed rows and emits appropriate mutation fragments. The consumer uses
`mutation_fragment_filter` underneath to decide whether a fragment
should be ignored or not (e.g. the parsed fragment may come from outside
the requested clustering range), among other things. Previously
`mutation_fragment_filter` was provided a `partition_slice`. If the
slice was reversed, the filter would use
`clustering_key_filter_ranges::get_ranges` to obtain the clustering
ranges from the slice in unreversed order (they were reversed in the
slice) since we didn't perform any reversing in the reader. Now the
reader provides the ranges directly instead of the slice; furthermore,
the ranges are provided in native-reversed format (the order of ranges
is reversed and the ranges themselves are also reversed), and the schema
provided to the filter is also reversed. Thus to the filter everything
appears as if it was used during a non-reversed query but on a table
with reversed schema, which works correctly given the fact that the
reader is feeding parsed rows into the consumer in reversed order.

During reversed queries the reader uses alternative logic for skipping
to a later range (or, speaking in non-reversed terms, to an earlier range),
which happens in `advance_context`. It asks the index to advance its
upper bound in reverse so that the reversing_data_source notices the
change of the index end position and returns following buffers with rows
from the new range.

There is a slight difference in behavior of the reader from
`mp_row_consumer_m`'s point of view. For non-reversed reads, after
the consumer obtains the beginning of a row (`consume_row_start`)
- which contains the row's position but not the columns - and tells the
reader that the row won't be emitted because we need to skip to a later
range, the reader would tell the data source (the 'context') immediately
to skip to a later range by calling `skip_to`. This caused the source
not to return the rest of the row, and the rest of the row would not
be fed to the consumer (`consume_row_end`). However, for reversed reads,
the data source performs skipping 'on its own', after it notices that
the index end position has changed. This may happen 'too late', causing
the rest of the row to be returned anyway. We are prepared for this
situation inside `mp_row_consumer` by consulting the mutation fragment
filter again when the rest of the row arrives.

Fast forwarding is not supported at this point, which is fine given that
the cache is disabled for reversed queries for now (and the cache is the
only user of fast forwarding).

The `partition_slice` provided by callers is provided in 'half-reversed'
format for reversed queries, where the order of clustering ranges is
reversed, but the ranges themselves are not. This means we need to modify
the slice sometimes: for non-single-partition queries the mx reader must
use a non-reversed slice, and for single-partition queries the mx reader
must use a native-reversed slice (where the clustering ranges themselves
are reversed as well). The modified slice must be stored somewhere; we
store it inside the mx reader itself so we don't need to allocate more
intermediate readers at the call sites.  This causes the interface of
`mx::make_reader` to be a bit weird: for non-single-partition queries
where the provided slice is reversed the reader will actually return a
non-reversed stream of fragments, telling the user to reverse the stream
on their own. The interface has been documented in detail with
appropriate comments.
2021-10-04 15:24:12 +02:00
Wojciech Mitros
64e703bb54 sstables: mx: introduce partition_reversing_data_source
This patch adds an implementation of a data source that wraps an sstable
data file and returns data buffers with contents of one partition in the
sstable as if the rows of the partition were present in a reversed
order. In other words, to the user of the source the partition appears
to be reversed. We shall call this an 'intermediary' data source.

As part of the interface of the intermediary source the user is also
given read access to the source's current position over the data file,
and the constructor of the source takes a reference to `index_reader`.
This is necessary because the index operates directly on data file
offsets and we want the user to be able to use the index to skip
sequences of rows.

In order to ask the source to skip a sequence of rows - e.g. when jumping
between clustering ranges - the user must advance the index' upper bound
in reverse (to an earlier position). The source will then notice that
the end position of the index has changed and take appropriate action.

An alternative would be to translate the data positions of
`index_reader` to 'reversed positions' of the intermediary and then use
`skip_to` for skipping, as we do for forward reads. However this
solution would introduce more complexity to `index_reader` and the
intermediary source. One reason for the complexity in the input stream
is that we would have two kinds of skips: a single row skip,
and a skip to a clustering range. We know the offset of the next row,
so we could check that to differentiate them. We would also need to add
an information about the position of first clustering row and end of
the last one in the index_reader. Skipping by checking the index seems
to be overall simpler.

For simplicity, the intermediary stream always starts with
parsing the partition header and (if present) the static row,
and returning the corresponding bytes as a result of the first
read.

After partition header and static row we must find the last row entry of
the requested range. If the range ends before the partition end (i.e.
there are more row entries after the range) we can use the 'previous
unfiltered size' of the row following the range; otherwise we must scan
the last promoted index block and take its last row.

After finding the data range of the last row, we parse rows
consecutively in reversed order.  We must parse the rows partially
to learn their lengths and the positions of previous rows. We're
using similar constructs as in the sstable parser, but it only
contains a small part of the parsing coroutine and doesn't perform
any correctness checks.  The parser for rows still turned out rather
big mostly because we can't always deduce the size of the clustering
blocks without reading the block header.

The parser allows reading rows while skipping their bodies also in
non-reversed order, which we are making use of while reading the
last promoted index block.

The intermediary data source has one more utility: reversing range
tombstones.  When we read a tombstone bound/boundary, we modify
the data buffer so that the resulting bound/boundary has the reversed
kind (so we don't read ends before starts) and the boundaries have their
before/after timestamps swapped.
2021-10-04 15:24:12 +02:00
Wojciech Mitros
8385f3eb21 sstables: index_reader: add support for iterating over clustering ranges in reverse
In the sstable reader, we iterate over clustering ranges using the
index_reader, which normally only accepts advancing to increasing
positions. In this patch we add methods for advancing the index
reader in reverse.

To simplify our job we restrict our attention to a single implementation
of the promoted index block cursor: `bsearch_clustered_cursor`. The
`index_reader` methods for advancing in reverse will thus assume that
this implementation is used. The assumption is correct given that we're
working only with sstables of versions >= mc, which is indeed the
intended use case. We add some documentation in appropriate places to
make this obvious.

We extend `bsearch_clustered_cursor` with two methods:
`advance_past(pos)`, which advances the cursor to the first block after
`pos` (or to the end if there is no such block), and
`last_block_offset()`, which returns the data file offset of the first
row from the last promoted index block.

To efficiently find the position in the data file of the last row
of the partition (which we need when performing a reversed query)
the sstable reader may need to read the span of the entire last promoted
index block in the data file. To learn where the block starts it can use
`index_reader::last_block_offset()`, which is implemented in terms of
`bsearch_clustered_cursor::last_block_offset()`.

When performing a single partition read in forward order, the reader
asks the index to position its lower bound at the start of the partition
and its upper bound after the end of the slice. It starts by reading the
first range. After exhausting a range it jumps to the next one by asking
the index to advance the lower bound.

For reverse single partition reads we'll take a similar approach: the
initial bound positions are as in the forward case. However, we start
with the last range and after exhausting a range we want to jump to a
previous one; we will do it by advancing the upper bound in reverse
(i.e. moving it closer to the beginning of the partition).  For this
we introduce the `index_reader::advance_reverse` function.
2021-10-04 15:24:12 +02:00
Avi Kivity
3cb865103d Update seastar submodule
* seastar 0ba6c36cc3...994b4b5a0c (2):
  > Merge "Improve smp::invoke_on_others" from Pavel E
  > file: keep hint pointer alive when calling fcntl()
2021-10-04 15:36:45 +03:00
Avi Kivity
148a12f3da Merge "Keep storage_service less aware of cdc internals" from Pavel E
"
The storage_service is involved in the cdc_generation_service guts
more than needed.

 - the bool _for_testing bit is cdc-only
 - there's API-only cdc_generation_service getter
 - cdc_g._s. startup code partially sits in s._s. one

This patch cleans most of the above leaving only the startup
_cdc_gen_id on board.

tests: unit(dev)
refs: #2795

"

* 'br-storage-service-vs-cdc-2' of https://github.com/xemul/scylla:
  api: Use local sharded<cdc::generation_service> reference
  main: Push cdc::generation_service via API
  storage_service: Ditch for_testing boolean
  cdc: Replace db::config with generation_service::config
  cdc: Drop db::config from description_generator
  cdc: Remove all arguments from maybe_rewrite_streams_descriptions
  cdc: Move maybe_rewrite_streams_descriptions into after_join
  cdc: Squash two methods into one
  cdc: Turn make_new_cdc_generation a service method
  cdc: Remove ring-delay arg from make_new_cdc_generation
  cdc: Keep database reference on generation_service
2021-10-04 14:56:05 +03:00
Piotr Dulikowski
6093c2378b hints: assign _last_written_rp in ep manager's move constructor
The end_point_hints_manager's field _last_written_rp is initialized in
its regular constructor, but is not copied in the move constructor.
Because the move constructor is always involved when creating a new
endpoint manager, the _last_written_rp field is effectively always
initialized with the zero-argument constructor, and is set to the zero
value.

This can cause the following erroneous situation to occur:

- Node A accumulates hints towards B.
- Sync point is created at A.
  It will be used later to wait for currently accumulated hints.
- Node A is restarted.
  The endpoint manager A->B is created which has bogus value in the
  _last_written_rp (it is set to zero).
- Node A replays its hints but does not write any new ones.
- A hint flush occurs.
  If there are no hint segments on disk after flush, the endpoint
  manager sets its last sent position to the last written position,
  which is by design. However, the last written position has incorrect
  value, so the last sent position also becomes incorrect and too low.
- Try to wait for the sync point created earlier.
  The sync point waiting mechanism waits until last sent hint position
  reaches or goes past the position encoded in the sync point, but it
  will not happen because the last sent position is incorrect.

The above bug can be (sometimes) reproduced in
hintedhandoff_sync_point_api_test dtest.

Now, the _last_written_rp field is properly initialized in the move
constructor, which prevents the bug described above.

Fixes: #9320
Closes #9426
2021-10-04 13:21:34 +02:00
Kamil Braun
b2f33b3e0b test: raft: randomized_nemesis_test: abort environment before ticker
We must abort the environment before the ticker as the environment may
require time to keep advancing during abort in order for all operations
to finish, e.g. operations that can finish only due to timeout.
Currently such operations may cause the test to hang indefinitely
at the end.

The test requires a small modification to ensure that
`delivery_queue::push` is not called after the queue was aborted.
Message-Id: <20210930143539.157727-1-kbraun@scylladb.com>
2021-10-04 12:31:26 +02:00
Avi Kivity
1bac93e075 Merge "simplifications and layer violation fix for compaction manager" from Raphael
"This series removes layer violation in compaction, and also
simplifies compaction manager and how it interacts with compaction
procedure."

* 'compaction_manager_layer_violation_fix/v4' of github.com:raphaelsc/scylla:
  compaction: split compaction info and data for control
  compaction_manager: use task when stopping a given compaction type
  compaction: remove start_size and end_size from compaction_info
  compaction_manager: introduce helpers for task
  compaction_manager: introduce explicit ctor for task
  compaction: kill sstables field in compaction_info
  compaction: kill table pointer in compaction_info
  compaction: simplify procedure to stop ongoing compactions
  compaction: move management of compaction_info to compaction_manager
  compaction: move output run id from compaction_info into task
2021-10-04 13:09:31 +03:00
Avi Kivity
93b765f655 scripts/pull_github_pr.sh: don't guess git remote name
The script assumes the remote name is "origin", a fair
assumption, but not universally true. Read it from configuration
instead of guessing it.

Closes #9423
2021-10-04 12:32:39 +03:00
Nadav Har'El
414b672e22 test/alternator: verify that empty-string keys are NOT allowed
Since May 2020 empty strings are allowed in DynamoDB as attribute values
(see announcment in [1]). However, they are still not allowed as keys.

We had tests that they are not allowed in keys of LSI or GSI, but missed
tests that they are not allowed as keys (partition or sort key) of base
tables. This patch add these missing tests.

These tests pass - we already had code that checked for empty keys and
generated an appropriate error.

Note that for compatibility with DynamoDB, Alternator will forbid empty
strings as keys even though Scylla *does* support this possibility
(Scylla always supported empty strings as clustering key, and empty
partition keys will become possible with issue #9352).

[1] https://aws.amazon.com/about-aws/whats-new/2020/05/amazon-dynamodb-now-supports-empty-values-for-non-key-string-and-binary-attributes-in-dynamodb-tables/

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211003122842.471001-1-nyh@scylladb.com>
2021-10-04 08:40:43 +02:00
Botond Dénes
61e7d3de90 Merge 'Cleanup compaction_stop_exception' from Benny Halevy
The gist of this series is splitting `compaction_abort_exception` from `compaction_stop_exception`
and their respective error messages to differentiate between compaction being stopped due to e.g. shutdown
or api event vs. compaction aborting due to scrub validation error.

While at it, cleanup the existing retry logic related to `compaction_stop_exception`.

Test: unit(dev)
Dtest: nodetool_additional_test.py:TestNodetool.{{scrub,validate}_sstable_with_invalid_fragment_test,{scrub,validate}_ks_sstable_with_invalid_fragment_test,{scrub,validate}_with_one_node_expect_data_loss_test} (dev, w/ https://github.com/scylladb/scylla-dtest/pull/2267)

Closes #9321

* github.com:scylladb/scylla:
  compaction: split compaction_aborted_exception from compaction_stopped_exception
  compaction_manager: maybe_stop_on_error: rely on retry=false default
  compaction_manager: maybe_stop_on_error: sync return value with error message.
  compaction: drop retry parameter from compaction_stop_exception
  compaction_manager: move errors stats accounting to maybe_stop_on_error
2021-10-04 07:27:11 +03:00
Takuya ASADA
9c830297ac scylla_util.py: add persistent disk support for GCE
Just like EBS disks for EC2, we want to use persistent disk on GCE.
We won't recommend to use it, but still need to support it.

Related scylladb/scylla-machine-image#215

Closes #9395
2021-10-03 17:58:18 +03:00
Takuya ASADA
d87b80ad14 scylla_util.py: add persistent disk support for Azure Just like EBS disks for EC2, we want to use persistent disk on Azure. We won't recommend to use it, but still need to support it.
Related https://github.com/scylladb/scylla-machine-image/issues/218

Closes #9417
2021-10-03 17:56:31 +03:00