Documentation was extracted from abstract_replication_strategy::get_ranges(),
which says:
// get_ranges() returns the list of ranges held by the given endpoint.
// The list is sorted, and its elements are non overlapping and non wrap-around.
That's important because users of get_keyspace_local_ranges() expect
that the returned list is both sorted and non overlapping, so let's
document it to prevent someone from removing any of these properties.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210805140628.537368-1-raphaelsc@scylladb.com>
Currently, the view will not be updated because the streaming reason is set
to streaming::stream_reason::rebuild. On the receiver side, only
streaming with the reason streaming::stream_reason::repair will trigger
a view update.
Change the stream reason to repair to trigger view updates for load and
stream. This makes load_and_stream behave the same as nodetool refresh.
Note: however, this is not very efficient.
Consider RF = 3, sst1, sst2, sst3 from the older cluster. When sst1 is
loaded, it streams to 3 replica nodes, if we generate view updates, we
will have 3 view updates for this replica (each of the peer nodes finds
its peer and writes the view update to peer). After loading sst2 and
sst3, we will have 9 view updates in total for a single partition.
If we create the view after the load and stream process, we will only
have 3 view updates for a single partition.
Fixes #9205
Closes #9213
Operations and generators can be composed to create more complex
operations and generators. There are certain composition patterns useful
for many different test scenarios.
We implement a couple of such patterns. For example:
- Given multiple different operation types, we can create a new
operation type - `either_of` - which is a "union" of the original
operation types. Executing an `either_of` operation means executing an
operation of one of the original types, but the specific type
can be chosen at runtime.
- Given a generator `g`, `op_limit(n, g)` is a new generator which
limits the number of operations produced by `g`.
- Given a generator `g` and a time duration of `d` ticks, `stagger(g, d)` is a
new generator which spreads the operations from `g` roughly every `d`
ticks. (The actual definition in code is more general and complex but
the idea is similar.)
Some of these patterns have corresponding notions in Jepsen, e.g. our
`stagger` has a corresponding `stagger` in Jepsen (although our
`stagger` is more general).
Finally, we implement a test that uses this new infrastructure.
Two `Executable` operations are implemented:
- `raft_call` is for making a call to a Raft cluster with a given state
machine command,
- `network_majority_grudge` partitions the network in half,
putting the leader in the minority.
We run a workload of these operations against a cluster of 5 nodes with
6 threads for executing the operations: one "nemesis thread" for
`network_majority_grudge` and 5 "client threads" for `raft_call`.
Each client thread randomly chooses a contact point which it tries first
when executing a `raft_call`, but it can also "bounce" - call a
different server when the previous returned "not_a_leader" (we use the
generic "bouncing" wrapper to do this).
For now we only print the resulting history. In a follow-up patchset
we will analyze it for consistency anomalies.
* kbr/raft-test-generator-v4:
test: raft: randomized_nemesis_test: a basic generator test
test: raft: generator: a library of basic generators
test: raft: introduce generators
test: raft: introduce `future_set`
test: raft: randomized_nemesis_test: handle `raft::stopped_error` in timeout futures
The previous commits introduced the basic generator concept and a
library of the most common composition patterns.
In this commit we implement a test that uses this new infrastructure.
Two `Executable` operations are implemented:
- `raft_call` is for making a call to a Raft cluster with a given state
machine command,
- `network_majority_grudge` partitions the network in half,
putting the leader in the minority.
We run a workload of these operations against a cluster of 5 nodes with
6 threads for executing the operations: one "nemesis thread" for
`network_majority_grudge` and 5 "client threads" for `raft_call`.
Each client thread randomly chooses a contact point which it tries first
when executing a `raft_call`, but it can also "bounce" - call a
different server when the previous returned "not_a_leader" (we use the
generic "bouncing" wrapper to do this).
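The "bouncing" behavior described above can be sketched in plain C++. This is an illustrative model only: `not_a_leader`, `bouncing_call`, the hint field, and the bounce limit are assumptions for the sketch, not the test's actual API.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <optional>
#include <variant>
#include <vector>

// A reply is either a successful result (an int stands in for it here) or
// a "not a leader" response, possibly hinting at who the real leader is.
struct not_a_leader {
    std::optional<std::size_t> leader_hint;
};
using reply = std::variant<int, not_a_leader>;

// Try `first`; whenever the contacted server answers not_a_leader, bounce
// the call to the hinted leader (or the next server), up to max_bounces.
std::optional<int> bouncing_call(
        std::size_t first, std::size_t num_servers, std::size_t max_bounces,
        const std::function<reply(std::size_t)>& call) {
    std::size_t target = first;
    for (std::size_t i = 0; i <= max_bounces; ++i) {
        reply r = call(target);
        if (auto* v = std::get_if<int>(&r)) {
            return *v;  // success
        }
        auto& nal = std::get<not_a_leader>(r);
        // Follow the hint if we got one, otherwise try the next server.
        target = nal.leader_hint ? *nal.leader_hint : (target + 1) % num_servers;
    }
    return std::nullopt;  // bounce budget exhausted
}
```

The real wrapper works with futures and timeouts, but the control flow is the same: retry elsewhere instead of failing on "not_a_leader".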
For now we only print the resulting history. In a follow-up patchset
we will analyze it for consistency anomalies.
Operations and generators can be composed to create more complex
operations and generators. There are certain composition patterns useful
for many different test scenarios.
This commit introduces a couple of such patterns. For example:
- Given multiple different operation types, we can create a new
operation type - `either_of` - which is a "union" of the original
operation types. Executing an `either_of` operation means executing an
operation of one of the original types, but the specific type
can be chosen at runtime.
- Given a generator `g`, `op_limit(n, g)` is a new generator which
limits the number of operations produced by `g`.
- Given a generator `g` and a time duration of `d` ticks, `stagger(g, d)` is a
new generator which spreads the operations from `g` roughly every `d`
ticks. (The actual definition in code is more general and complex but
the idea is similar.)
And so on.
Some of these patterns have corresponding notions in Jepsen, e.g. our
`stagger` has a corresponding `stagger` in Jepsen (although our
`stagger` is more general).
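The `op_limit` pattern can be sketched in a simplified, purely functional model. Everything below is illustrative: ints stand in for operations, and the `gen`/`limited` types and `op` overloads are assumed names, not the actual test code.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// A generator is a purely functional value: fetching an operation with op()
// returns the operation together with the generator's next state.
struct gen {
    std::vector<int> ops;   // operations still to be produced
    std::size_t pos = 0;
};

std::optional<std::pair<int, gen>> op(const gen& g) {
    if (g.pos == g.ops.size()) {
        return std::nullopt;        // exhausted
    }
    return std::pair{g.ops[g.pos], gen{g.ops, g.pos + 1}};
}

// op_limit(n, g): a new generator yielding at most n operations from g.
struct limited {
    std::size_t remaining;
    gen inner;
};

limited op_limit(std::size_t n, gen g) {
    return limited{n, std::move(g)};
}

std::optional<std::pair<int, limited>> op(const limited& l) {
    if (l.remaining == 0) {
        return std::nullopt;        // limit reached
    }
    auto next = op(l.inner);
    if (!next) {
        return std::nullopt;        // inner generator exhausted
    }
    return std::pair{next->first, limited{l.remaining - 1, std::move(next->second)}};
}
```

`stagger` and `either_of` compose in the same style: each wraps an inner generator (or several) and returns a new generator state alongside each produced operation.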
We introduce the concepts of "operations" and "generators", basic
building blocks that will allow us to declaratively write randomized
tests for torturing simulated Raft clusters.
An "operation" is a data structure representing a computation which
may cause side effects such as calling a Raft cluster or partitioning
the network, represented in the code with the `Executable` concept.
It has an `execute` function performing the computation and returning
a result of type `result_type`. Different computations of the same type
share state of type `state_type`. The state can, for example, contain
database handles.
Each execution is performed on an abstract `thread` (represented by a `thread_id`)
and has a logical starting time point. The thread and start point together form
the execution's `context` which is passed as a reference to `execute`.
Two operations may be called in parallel only if they are on different threads.
A generator, represented through the `Generator` concept, produces a
sequence of operations. An operation can be fetched from a generator
using the `op` function, which also returns the next state of the
generator (generators are purely functional data structures).
The generator concept is inspired by the generators in the Jepsen
testing library for distributed systems.
We also implement `interpreter` which "interprets", or "runs", a given
generator, by fetching operations from the generator and executing them
with concurrency controlled by the abstract threads.
The algorithm used in the interpreter is also similar to the interpreter
algorithm in Jepsen, although there are differences. Most notably we don't
have a "worker" concept - everything runs on a single shard; but we use
"abstract threads" combined with futures for concurrency.
There is also no notion of "process". Finally, the interpreter doesn't
keep an explicit history, but instead uses a callback `Recorder` to notify
the user about operation invocations and completions. The user can
decide to save these events in a history, or perhaps they can analyze
them on the fly using constant memory.
A set of futures that can be polled.
Polling the set (`poll` function) returns the value of one of
the futures which became available or `std::nullopt` if the given
logical duration passes (according to the given timer), whichever
event happens first. The current implementation assumes sequential
polling.
New futures can be added to the set with `add`.
All futures can be removed from the set with `release`.
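The interface can be illustrated with a minimal standalone sketch built on `std::future`. The real `future_set` uses seastar futures and a logical timer; this model only mirrors the `add`/`poll`/`release` interface under the same sequential-polling assumption.

```cpp
#include <chrono>
#include <future>
#include <optional>
#include <thread>
#include <utility>
#include <vector>

template <typename T>
class future_set {
    std::vector<std::future<T>> _futures;
public:
    // Add a new future to the set.
    void add(std::future<T> f) { _futures.push_back(std::move(f)); }

    // Return the value of one future that became ready within `timeout`,
    // or std::nullopt if the timeout passes first. Sequential polling.
    std::optional<T> poll(std::chrono::milliseconds timeout) {
        auto deadline = std::chrono::steady_clock::now() + timeout;
        while (std::chrono::steady_clock::now() < deadline) {
            for (auto it = _futures.begin(); it != _futures.end(); ++it) {
                if (it->wait_for(std::chrono::milliseconds(0)) == std::future_status::ready) {
                    T v = it->get();
                    _futures.erase(it);     // a resolved future leaves the set
                    return v;
                }
            }
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
        return std::nullopt;
    }

    // Remove (and hand back) all futures from the set.
    std::vector<std::future<T>> release() { return std::move(_futures); }
};
```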
The timeout futures in `call` and `reconfigure` may be discarded after
Raft servers were `abort()`ed, which would result in
`raft::stopped_error` and the test complained about discarded
exceptional futures. Discard these errors explicitly.
The supervisor scripts for Docker and the supervisor scripts for the
offline installer are almost the same; drop the Docker one and share the
same code to deduplicate them.
Closes #9143
Fixes #9194
In order to decouple the service level controller from the systems logic, we introduce an API for subscribing to configuration changes. The timing of the calls was determined with resource creation and destruction in mind. An API subscriber can create
resources that will be available from the very start of the service level's existence; it can also destroy them, since the service level
is guaranteed not to exist anymore at the time of the call to the deletion notification callback.
Testing:
unit tests - all + a newly added one.
dtests - next-gating (dev mode)
Closes #9097
* github.com:scylladb/scylla:
service level controller: Subscriber API unit test
Service Level Controller: Add a listener API for service level config changes
This change adds an API for registering a listener for service_level
configuration changes. It notifies about removal, addition, and change of
service levels.
The hidden assumption is that some listeners are going to create and/or
manage service-level-specific resources, and this is what guided the
timing of the calls to the subscriber.
Addition and change of a service level are called before the actual
change takes place; this guarantees that resource creation can take
place before the service level or the new config starts to be used.
The deletion notification is called only after the deletion took place,
and this guarantees that the service level can't be active and the
resources created can be safely destroyed.
Refs #9053
Flips the default for commitlog disk footprint hard limit enforcement to off due
to observed latency stalls with stress runs. Instead adds an optional flag
"commitlog_use_hard_size_limit" which can be turned on to enforce it.
Sort of a tape-and-string fix until we can properly tweak the balance between
cl & sstable flush rate.
Closes #9195
We decided to enable repair based node operations by default for replace
node operations.
To do that, a new option --allowed-repair-based-node-ops is added. It
lists the node operations that are allowed to enable repair based node
operations.
The operations can be bootstrap, replace, removenode, decommission and rebuild.
By default, --allowed-repair-based-node-ops is set to contain "replace".
Note, the existing option --enable-repair-based-node-ops is still in
play. It is the global switch to enable or disable the feature.
Examples:
- To enable bootstrap and replace node ops:
```
scylla --enable-repair-based-node-ops true --allowed-repair-based-node-ops replace,bootstrap
```
- To disable any repair based node ops:
```
scylla --enable-repair-based-node-ops false
```
Closes #9197
It turns out that user-defined aggregates did not need any elaborate coding to be exposed to users. The whole infrastructure is already there, including system schema tables and support for running aggregate queries, so this series simply adds lots and lots of boilerplate glue code to make UDA usable.
It also comes with a simple test which shows that it's possible to define and use such an aggregate.
Performance was not tested: user-defined functions are still experimental, so nothing really changes in this matter.
Tests: unit(release)
Fixes #7201
Closes #9165
* github.com:scylladb/scylla:
cql-pytest: add a test suite for user-defined aggregates
cql-pytest: add context managers for functions and aggregates
cql3: enable user-defined aggregates in CQL grammar
cql3: add statements for user-defined aggregates
cql3,functions: add checking if a function is used in UDA
gms: add UDA feature
migration_manager: add migrating user-defined aggregates
db,schema_tables: add handling user-defined aggregates
pagers: make a lambda mutable in fetch_page
cql3: wrap handling paging result with with_thread_if_needed
cql3: correctly mark function selectors as needing threads
cql3: add user-define aggregate representation
The test suite now consists of a single user aggregate:
a custom implementation for existing avg() built-in function,
as well as a couple of cases for catching incorrect operations,
like using wrong function signatures or dropping used functions.
Aggregates are propagated, created, and dropped very similarly
to user-defined functions - a set of helper functions
for aggregates is added based on the UDF implementation.
Function call selectors correctly checked whether their arguments
are required to run in a threaded context, but forgot to check
the function itself - which is now done.
A user-defined aggregate is represented as an aggregate
which calls its state function on each input row
and then finalizes its execution by calling its final function
on the final state, after all rows were already processed.
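The state-function / final-function split can be sketched in plain C++, using the test suite's custom avg() as the example. The function and type names here are illustrative, not Scylla's actual UDA machinery, which evaluates CQL-level functions.

```cpp
#include <cassert>
#include <vector>

// State carried between rows by the (hypothetical) avg aggregate.
struct avg_state {
    double sum = 0;
    long count = 0;
};

// The UDA's state function: fold one input row into the state.
avg_state avg_row(avg_state s, double v) {
    s.sum += v;
    s.count += 1;
    return s;
}

// The UDA's final function: produce the result from the final state.
double avg_final(const avg_state& s) {
    return s.count ? s.sum / s.count : 0;
}

// How the aggregate runs: state function per row, finalize once at the end.
double run_aggregate(const std::vector<double>& rows) {
    avg_state s{};
    for (double v : rows) {
        s = avg_row(s, v);
    }
    return avg_final(s);
}
```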
The code for handling background view updates used to propagate
exceptions unconditionally, which led to "exceptional future
ignored" warnings if the update was put in the background.
From now on, the exception is only propagated if its future
is actually waited on.
Fixes #6187
Tested manually, the warning was not observed after the patch
Closes #9179
If the wasmtime library is available for download, it will be
set up by install-dependencies and prepared for linking.
Closes #9151
[avi: regenerate toolchain, which also updates clang to 12.0.1]
I initially tried to use a noncopyable_function to avoid the unnecessary
template usage.
However, since database::apply_in_memory is a hot function, it is better
to use with_gate directly. The run_async function does nothing but call
with_gate anyway.
Closes #9160
On some environments /sys/devices/system/cpu/cpufreq/policy0/scaling_governor
does not exist even if CPU scaling is supported.
Instead, /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor is
available in both environments, so we should switch to it.
Fixes #9191
Closes #9193
What should the following pair of statements do?
CREATE INDEX xyz ON tbl(a)
CREATE INDEX IF NOT EXISTS xyz ON tbl(b)
There are two reasonable choices:
1. An index with the name xyz already exists, so the second command should
do nothing, because of the "IF NOT EXISTS".
2. The index on tbl(b) does *not* yet exist, so the command should try to
create it. And when it can't (because the name xyz is already taken),
it should produce an error message.
Cassandra went with choice 1, and Scylla went with choice 2.
After some discussions on the mailing list, we agreed that Scylla's
choice is the better one and Cassandra's choice could be considered a
bug: The "IF NOT EXISTS" feature is meant to allow idempotent creation of
an index - and not to make it easy to make mistakes without noticing.
The second command listed above is most likely a mistake by the user,
not anything intentional: The command intended to ensure that an index
on column b exists, but after the silent success of the command, no such
index exists.
So this patch doesn't change any Scylla code (it just adds a comment),
and rather it adds a test which "enshrines" the current behavior.
The test passes on Scylla and fails on Cassandra so we tag it
"cassandra_bug", meaning that we consider this difference to be
intentional and we consider Cassandra's behavior in this case to be wrong.
Fixes #9182.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210811113906.2105644-1-nyh@scylladb.com>
The extra status print is not needed in the log.
Fixes the following error:
ERROR 2021-08-10 10:54:21,088 [shard 0] storage_service -
service/storage_service.cc:3150 @do_receive: failed to log message:
fmt='send_meta_data: got error code={}, from node={}, status={}':
fmt::v7::format_error (argument not found)
Fixes #9183
Closes #9189
The reader is used by load and stream to read sstables from the upload
directory which are not guaranteed to belong to the local shard.
Use make_range_sstable_reader instead of
make_local_shard_sstable_reader.
Tests:
backup_restore_tests.py:TestBackupRestore.load_and_stream_using_snapshot_test
backup_restore_tests.py:TestBackupRestore.load_and_stream_to_new_cluster_2_test
backup_restore_tests.py:TestBackupRestore.load_and_stream_to_new_cluster_1_test
migration_test.py:TestLoadAndStream.load_and_stream_asymmetric_cluster_test
migration_test.py:TestLoadAndStream.load_and_stream_decrease_cluster_test
migration_test.py:TestLoadAndStream.load_and_stream_frozen_pk_test
migration_test.py:TestLoadAndStream.load_and_stream_increase_cluster_test
migration_test.py:TestLoadAndStream.load_and_stream_primary_replica_only_test
Fixes #9173
Closes #9185
Those methods are never used, so it's better not to keep dead code
around.
Tests: unit(dev)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Closes #9188
Adds a `hinted_handoff_pause_hint_replay` error injection point. When
enabled, the hint replay logic behaves as if it were running, but gets stuck in
a loop and no hints are actually sent until the point is disabled again.
This injection point will be useful in dtests - it will simulate
infinitely slow hint replay and will make it possible to test how some
operations behave while hint replay logic is running.
The first intended use case of this injection point is testing the HTTP
API for waiting for hints (#8728).
Refs: #6649
Closes #8801
* github.com:scylladb/scylla:
hints: fix indentation after previous patch
hints: error injection for pausing hint replay
hints: coroutinize lambda inside send_one_file
Each hint sender runs an asynchronous loop which tries to flush and then
send hints. Between each attempt, it sleeps at most 10 seconds using
sleep_abortable. However, an overload of sleep_abortable is used which
does not take an abort_source; it is supposed to abort the sleep in case
Seastar handles a SIGINT or SIGTERM signal. However, in order for that
to work, the application must not prevent default handling of those
signals in Seastar - but Scylla explicitly does so by disabling the
`auto_handle_sigint_sigterm` option in the reactor config. As a result,
those sleeps are never aborted, and - because we wait for the async
loops to stop - they can delay shutdown by up to 10 seconds.
To fix that, an abort_source is added to the hints sender, and the
abort_source is triggered when the corresponding sender is requested to
stop.
Fixes: #9176
Closes #9177
Since key_compare does not conform to SimpleLessCompare, the benchmark
tests the non-optimized version of bptree (without SIMD key search).
We want to test the optimized version.
Closes #9180
The DynamoDB documentation
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Limits.html
describes several hard limits on the size of expressions
(ProjectionExpression, ConditionExpression, UpdateExpression,
FilterExpression) and various elements they contain.
In this patch we begin testing those limits with a comprehensive test for
the *length* of each of these four expressions: we test that lengths up to
(and including) 4096 bytes are allowed but longer expressions are rejected.
We also add TODOs for additional documented limits that should be tested
in the future.
Currently, this test passes on DynamoDB but xfails on Alternator because
Alternator does *not* enforce any limits on the expression length. I don't
think this is a real problem, and we may consider keeping it this way,
but we should at least be aware that this difference exists and an
xfailing test will remind us.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210810081948.2012120-2-nyh@scylladb.com>
DynamoDB limits attribute names in items to lengths of up to 65535 bytes,
but in some cases (such as key attributes) the limit is lower - 255.
This patch adds tests for many of these cases.
All the new tests pass on DynamoDB, but some still xfail on Alternator
because Alternator is too lenient - sometimes allowing longer attribute
names than DynamoDB allows. While this may sound great, it also has
downsides: The oversized attribute names perform badly, and as they
grow, Alternator's internal limits will be reached as well, and result
in an unsightly "internal server error" being reported instead of the
expected user-friendly error.
Refs #9169.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210810081948.2012120-1-nyh@scylladb.com>
"
* Make `sstable::make_reader()` return `flat_mutation_reader_v2`,
retain the old one as `sstable::make_reader_v1()`
* Start weaning tests off `sstable::make_reader_v1()` (done all the
easy ones, i.e. those not involving range tombstones)
"
* tag 'sstable-make-reader-v2-v1' of github.com:cmm/scylla:
tests: use flat_mutation_reader_v2 in the easier part of sstable_3_x_test
tests: upgrade the "buffer_overflow" test to flat_mutation_reader_v2
tests: get rid of sstable::make_reader_v1() in broken_sstable_test
tests: get rid of sstable::make_reader_v1() in the trivial cases
sstables: make sstable::make_reader() return flat_mutation_reader_v2
"
Validation compaction -- although I still maintain that it is a good
descriptive name -- was an unfortunate choice for the underlying
functionality because Origin has burned the name already as it uses it
for a compaction type used during repair. This opens the door for
confusion for users coming from Cassandra who will associate Validation
compaction with the purpose it is used for in Origin.
Additionally, since Origin's validation compaction was not user
initiated, it didn't have a corresponding `nodetool` command to start
it. Adding such a command would create an operational difference between
us and Origin.
To avoid all this we fold validation compaction into scrub compaction,
under a new "validation" mode. I decided against the also-suggested
`--dry-mode` flag as I feel that a new mode is a more natural
choice; we don't have to define how it interacts with all the other
modes, unlike with a `--dry-mode` flag.
Fixes: #7736
Tests: unit(dev), manual(REST API)
"
* 'scrub-validation-mode/v2' of https://github.com/denesb/scylla:
compaction/compaction_descriptor: add comment to Validation compaction type
compaction/compaction_descriptor: compaction_options: remove validate
api: storage_service: validate_keyspace -> scrub_keyspace (validate mode)
compaction/compaction_manager: hide perform_sstable_validation()
compaction: validation compaction -> scrub compaction (validate mode)
compaction/compaction_descriptor: compaction_options: add options() accessor
compaction/compaction_descriptor: compaction_options::scrub::mode: add validate