"
Currently database start and stop code is quite disperse and
exists in two slightly different forms -- one in main and the
other one in cql_test_env. This set unifies both and makes
them look almost the perfect way:
sharded<database> db;
db.start(<dependencies>);
auto stop = defer([&db] { db.stop().get(); });
db.invoke_on_all(&database::start).get();
with all (well, most) other mentionings of the "db" variable
being arguments for other services' dependencies.
tests: unit(dev, release), unit.cross_shard_barrier(debug)
dtest.simple_boot_shutdown(dev)
refs: #2737
refs: #2795
refs: #5489
"
* 'br-database-teardown-unification-2' of https://github.com/xemul/scylla: (26 commits)
main: Log when database starts
view_update_generator: Register staging sstables in constructor
database, messaging: Delete old connection drop notification
database, proxy: Relocate connection-drop activity
messaging, proxy: Notify connection drops with boost signal
database, tests: Rework recommended format setting
database, sstables_manager: Sow some noexcepts
database: Eliminate unused helpers
database: Merge the stop_database() into database::stop()
database: Flatten stop_database()
database: Equip with cross-shard-barrier
database: Move starting bits into start()
database: Add .start() method
main: Initialize directories before database
main, api: Detach set_server_config from database and move up
main: Shorten commitlog creation
database: Extract commitlog initialization from init_system_keyspace
repair: Shutdown without database help
main: Shift iosched verification upward
database: Remove unused mm arg from init_non_system_keyspaces()
...
First, it's to fix the discarded future during the register. The
future is not actually such, as it's always the no-op ready one as
at that stage the view_update_generator is neither aborted nor is
in throttling state.
Second, this change is to keep database start-up code in main
shorter and cleaner. Registering staging sstables belongs to the
view_update_generator start code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All the large data handler methods rely on global qctx thing to
write down its notes. This creates circular dependency:
query processor -> database -> large_data_handler -> qctx -> qp
In scylla this is not a technical problem, neither qctx nor the
query processor are stopped. It is a problem in cql_test_env
that stops everything, including resetting qctx to null. To avoid
tests stepping on nullptr qctx add the explicit check.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This series of commits fixes a small number of bugs with current implementation of HTTP API which allows to wait until hints are replayed, found by running the `hintedhandoff_sync_point_api_test` dtest in debug mode.
Refs: #9320Closes#9346
* github.com:scylladb/scylla:
commitlog: make it possible to provide base segment ID
hints: fill up missing shards with zeros in decoded sync points
hints: propagate abort signal correctly in wait_for_sync_point
hints: fix use-after-free when dismissing replay waiters
This warning can catch a virtual function that thinks it
overrides another, but doesn't, because the two functions
have different signatures. This isn't very likely since most
of our virtual functions override pure virtuals, but it's
still worth having.
Enable the warning and fix numerous violations.
Closes#9347
Adds a configuration option to the commitlog: base_segment_id. When
provided, the commitlog uses this ID as a base of its segment IDs
instead of calculating it based on the number of milliseconds between
the epoch and boot time.
This is needed in order for the feature which allows to wait for hints
to be replayed to work - it relies on the replay positions monotonically
increasing. Endpoint managers periodically re-creates its commitlog
instance - if it is re-created when there are no segments on disk,
currently it will choose the number of milliseconds between the epoch
and boot time, which might result in segments being generated with the
same IDs as some segments previously created and deleted during the same
runtime.
Between encoding and decoding of a sync point, the node might have been
restarted and resharded with increased shard count. During resharding,
existing hints segments might have been moved to new shards. Because of
that, we need to make sure that we wait for foreign segments to be
replayed on the new shards too.
This commit modifies the sync point decoding logic so that it places a
zero replay position for new shards. Additionally, a (incorrect) shard
count check is removed from `storage_proxy::wait_for_hint_sync_point`
because now the shard count in decoded sync point is guaranteed to be
not less than the node's current shard count.
When `manager::wait_for_sync_point` is called, the abort source from the
arguments (`as`) might have already been triggered. In such case, the
subscription which was supposed to trigger the `local_as` abort source
won't be run, and the code will wait indefinitely for hints to be
replayed instead of checking the replay status and returning
immediately.
This commit fixes the problem by manually triggering `local_as` if `as`
have been triggered.
When the promise waited on in the `wait_until_hints_are_replayed_up_to`
function is resolved, a continuation runs which prints a log line with
information about this event. The continuation captures a pointer to the
hints sender and uses it to get information about the endpoint whose
hints are waited for. However, at this point the sender might have been
deleted - for example, when the node is being stopped and everybody
waiting for hints is dismissed.
This commit fixes the use-after-free by getting all necessary
information while the sender is guaranteed to be alive and captures it
in the continuation's capture list.
This series adds very basic support for WebAssembly-based user-defined functions.
This series comes with a basic set of tests which were used to designate a minimal goal for this initial implementation.
Example usage:
```cql
CREATE FUNCTION ks.fibonacci (str text)
RETURNS NULL ON NULL INPUT
RETURNS boolean
LANGUAGE xwasm
AS ' (module
(func $fibonacci (param $n i32) (result i32)
(if
(i32.lt_s (local.get $n) (i32.const 2))
(return (local.get $n))
)
(i32.add
(call $fibonacci (i32.sub (local.get $n) (i32.const 1)))
(call $fibonacci (i32.sub (local.get $n) (i32.const 2)))
)
)
(export "fibonacci" (func $fibonacci))
) '
```
Note that the language is currently called "xwasm" as in "experimental wasm", because its interface is still subject to change in the future.
Closes#9108
* github.com:scylladb/scylla:
docs: add a WebAssembly entry
cql-pytest: add wasm-based tests for user-defined functions
main: add wasm engine instantiation
treewide: add initial WebAssembly support to UDF
wasm: add initial WebAssembly runtime implementation
db: add wasm_engine pointer to database
lang: add wasm_engine service
import wasmtime.hh
lua: move to lang/ directory
cql3: generalize user-defined functions for more languages
This commit adds a very basic support for user-defined functions
coded in wasm. The support is very limited (only a few types work)
and was not tested against reactor stalls and performance in general.
All the namespace scope functions in system_keyspace have no place
to store context, so they must store their context in global
variables. This prevents conversion of those global variables
to constructor-provided depdendencies.
Take the first step towards providing a place to store the
context by converting system_keyspace to a class. All the functions
are static, so no context is yet available, but we can de-static-ify
them incrementally in the future and store the context in class members.
Indentation is a mess, but can be easily fixed later.
In anticipation of making system_keyspace a class instead of a
namespace, rename any member that is currently forward-declared,
since one can't forward-declare a class member. Each member
is taken out of the system_keyspace namespace and gains a
system_keyspace prefix. Aliases are added to reduce code churn.
The result isn't lovely, but can be adjusted later.
Merge two fragments together, in anticipation of making 'legacy'
s struct instead of a namespace (when system_keyspace is a class,
we can't nest a namespace inside it).
Flushing schema tables is important for crash recovery (without a flush,
we might have sstables using a new schema before the commitlog entry
noting the schema change has been replayed), but not important for tests
that do not test crash recovery. Avoiding those flushes reduces system,
user, and real time on tests running on a consumer-level SSD.
before:
real 8m51.347s
user 7m5.743s
sys 5m11.185s
after:
real 7m4.249s
user 5m14.085s
sys 2m11.197s
Note real time is higher that user+sys time divided by the number
of hardware threads, indicating that there is still idle time due
to the disk flushing, so more work is needed.
Closes#9319
Previously, the layout for storing raft snapshot
descriptors contained a `config` field, which had `blob`
data type.
That means `raft::configuration` for the snapshot was serialized
as a whole in binary form. It's convenient to implement and
is the most compact form of representing the data, but:
1. Hard to debug due to the need to de-serialize the data.
2. Plants a time bomb wrt. changing data layout and also the
documentation in the future.
Remove the `config` field from `system.raft_snapshots` and
extract it to a separate `system.raft_config` table to store
the data in exploded form.
Also, modify the schema of `system.raft_snapshots` table in
the following way: add a `server_id` field as a part of
composite partition key ((group_id, server_id)) to
be able to start multiple raft servers belonging to one raft
group on the same scylla node.
Rename `id` field in `raft_snapshots` to `snapshot_id` so
it's self-documenting.
Rename `snapshot_id` from clustering key since a given server
can have only one snapshot installed at a time.
Note that the `raft::server_address` stucture contains an opaque
`info` member, which is `bytes`, but in the `raft_config` table
we use `ip_addr inet` field, instead. We always know that the
corresponding member field is going to contain an IP address (either v4
or v6) of a given raft server.
So, now the snapshots schema looks like this:
CREATE TABLE raft_snapshots (
group_id timeuuid,
server_id uuid,
snapshot_id uuid,
idx int,
term int,
-- no `config` field here, moved to `raft_config` table
PRIMARY KEY ((group_id, server_id))
)
CREATE TABLE raft_config (
group_id timeuuid,
my_server_id uuid,
server_id uuid,
disposition text, -- can be either 'CURRENT` or `PREVIOUS'
can_vote bool,
ip_addr inet,
PRIMARY KEY ((group_id, my_server_id), server_id, disposition)
);
This way it's much easier to extend the schema with new fields,
very easy to debug and inspect via CQL, and it's much more descriptive
in terms of self-documentation.
Tests: unit(dev)
* manmanson/raft_snapshots_new_schema_v2:
test: adjust `schema_change_test` to include new `system.raft_config` table
raft: new schema for storing raft snapshots
raft: pass server id to `raft_sys_table_storage` instance
The interval template member functions mostly accept
tri-comparators but a few functions accept less-comparators.
To reduce the chance of error, and to provide better error
messages, constrain comparator parameters to the expected
signature.
In one case (db/size_estimates_virtual_reader.cc) the caller
had to be adjusted. The comparator supported comparisons
of the interval value type against other types, but not
against itself. To simplify things, we add that signature too,
even though it will never be called.
Closes#9291
We define the native reverse format as a reversed mutation fragment
stream that is identical to one that would be emitted by a table with
the same schema but with reversed clustering order. The main difference
to the current format is how range tombstones are handled: instead of
looking at their start or end bound depending on the order, we always
use them as-usual and the reversing reader swaps their bounds to
facilitate this. This allows us to treat reversed streams completely
transparently: just pass along them a reversed schema and all the
reader, compacting and result building code is happily ignorant about
the fact that it is a reversed stream.
This series is the first step towards implementing efficient reverse
reads. It allows us to remove all the special casing we have in various
places for reverse reads and thus treating reverse streams transparently
in all the middle layers. The only layers that have to know about the
actual reversing are mutation sources proper. The plan is that when
reading in reverse we create a reversed schema in the top layer then
pass this down as the schema for the read. There are two layers that
will need to act on this reversed schema:
* The layer sitting on top of the first layer which still can't handle
reversed streams, this layer will create a reversed reader to handle
the transition.
* The mutation source proper: which will obtain the underlying schema
and will emit the data in reverse order.
Once all the mutation sources are able to handle reverse reads, we can
get rid of the reverse reader entirely.
Refs: #1413
Tests: unit(dev)
TODO:
* v2
* more testing
Also on: https://github.com/denesb/scylla.git reverse-reads/v3
Changelog
v3:
* Drop the entire schema transformation mechanism;
* Drop reversing from `schema_builder()`;
* Don't keep any information about whether the schema is reversed or not
in the schema itself, instead make reversing deterministic w.r.t.
schema version, such that:
`s.version() == s.make_reversed().make_reversed().version()`;
* Re-reverse range tombstones in `streaming_mutation_freezer`, so
`reconcilable_results` sent to the coordinator during read repair
still use the old reverse format;
v2:
* Add `data_type reversed(data_type)`;
* Add `bound_kind reverse_kind(bound_kind)`;
* Make new API safer to use:
- `schema::underlying_type()`: return this when unengaged;
- `schema::make_transformed()`: noop when applying the same
transformation again;
* Generalize reversed into transformation. Add support to transferring
to remote nodes and shards by way of making `schema_tables` aware of
the transformation;
* Use reverse schema everywhere in reverse reader;
Closes#9184
* github.com:scylladb/scylla:
range_tombstone_accumulator: drop _reversed flag
test/boost/mutation_test: add test for mutation::consume() monotonicity
test/boost/flat_mutation_reader_test: more reversed reader tests
flat_mutation_reader: make_reversing_reader(): implement fast_forward_to(partition_range)
flat_mutation_reader: make_reversing_reader(): take ownership of the reader
test/lib/mutation_source_test: add consistent log to all methods
mutation: introduce reverse()
mutation_rebuilder: make it standalone
mutation: make copy constructor compatible with mutation_opt
treewide: switch to native reversed format for reverse reads
mutation: consume(): add native reverse order
mutation: consume(): don't include dummy rows
query: add slice reversing functions
partition_slice_builder: add range mutating methods
partition_slice_builder: add constructor with slice
query: specific_ranges: add non-const ranges accessor
range_tombstone: add reverse()
clustering_bounds_comparator: add reverse_kind()
schema: introduce make_reversed()
schema: add a transforming copy constructor
utils: UUID_gen: introduce negate()
types: add reversed(data_type)
docs: design-notes: add reverse-reads.md
In order to be able to avoid a deadlock when CQL server cannot be started,
the view builder shutdown procedure is now split to two parts -
- drain and stop. Drain is performed before storage proxy shutdown,
but stop() will be called even before drain is scheduled.
The deadlock is as follows:
- view builder creates a reader permit in order to be able
to read from system tables
- CQL server fails to start, shutdown procedure begins
- view builder stop() is not called (because it was not scheduled
yet), so it holds onto its reader permit
- database shutdown procedure waits for all permits to be destroyed,
and it hangs indefinitely because view builder keeps holding
its permit.
"
There are 4 places out there that do the same steps parsing
"client_|server_encryption_options" and configuring the
seastar::tls::creds_builder with the values (messaging, redis,
alternator and transport).
Also to make redis and transport look slimmer main() cleans
the client_encryption_options by ... parsing it too.
This set introduces a (coroutinized) helper to configure the
creds_builder with map<string, string> and removes the options
beautification from main.
tests: unit(dev), dtest.internode_ssl_test(dev)
"
* 'br-generalize-tls-creds-builder-configuration' of https://github.com/xemul/scylla:
code: Generalize tls::credentials_builder configuration
transport, redis: Do not assume fixed encryption options
messaging: Move encryption options parsing to ms
main: Open-code internode encryption misconfig warning
main, config: Move options parsing helpers
Return the pre- 6773563d3 behavior of demanding ALLOW FILTERING when partition slice is requested but on potentially unlimited number of partitions. Put it on a flag defaulting to "off" for now.
Fixes#7608; see comments there for justification.
Tests: unit (debug, dev), dtest (cql_additional_test, paging_test)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Closes#9126
* github.com:scylladb/scylla:
cql3: Demand ALLOW FILTERING for unlimited, sliced partitions
cql3: Track warnings in prepared_statement
test: Use ALLOW FILTERING more strictly
cql3: Add statement_restrictions::to_string
When a query requests a partition slice but doesn't limit the number
of partitions, require that it also says ALLOW FILTERING. Although
do_filter() isn't invoked for such queries, the performance can still
be unexpectedly slow, and we want to signal that to the user by
demanding they explicitly say ALLOW FILTERING.
Because we now reject queries that worked fine before, existing
applications can break. Therefore, the behavior is controlled by a
flag currently defaulting to off. We will default to "on" in the next
Scylla version.
Fixes#7608; see comments there for justification.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Previously, the layout for storing raft snapshot
descriptors contained a `config` field, which had `blob`
data type.
That means `raft::configuration` for the snapshot was serialized
as a whole in binary form. It's convenient to implement and
is the most compact form of representing the data, but:
1. Hard to debug due to the need to de-serialize the data.
2. Plants a time bomb wrt. changing data layout and also the
documentation in the future.
Remove the `config` field from `system.raft_snapshots` and
extract it to a separate `system.raft_config` table to store
the data in exploded form.
Also, modify the schema of `system.raft_snapshots` table in
the following way: add a `server_id` field as a part of
composite partition key ((group_id, server_id)) to
be able to start multiple raft servers belonging to one raft
group on the same scylla node.
Rename `id` field in `raft_snapshots` to `snapshot_id` so
it's self-documenting.
Rename `snapshot_id` from clustering key since a given server
can have only one snapshot installed at a time.
Note that the `raft::server_address` stucture contains an opaque
`info` member, which is `bytes`, but in the `raft_config` table
we use `ip_addr inet` field, instead. We always know that the
corresponding member field is going to contain an IP address (either v4
or v6) of a given raft server.
So, now the snapshots schema looks like this:
CREATE TABLE raft_snapshots (
group_id timeuuid,
server_id uuid,
snapshot_id uuid,
idx int,
term int,
-- no `config` field here, moved to `raft_config` table
PRIMARY KEY ((group_id, server_id))
)
CREATE TABLE raft_config (
group_id timeuuid,
my_server_id uuid,
server_id uuid,
disposition text, -- can be either 'CURRENT` or `PREVIOUS'
can_vote bool,
ip_addr inet,
PRIMARY KEY ((group_id, my_server_id), server_id, disposition)
);
This way it's much easier to extend the schema with new fields,
very easy to debug and inspect via CQL, and it's much more descriptive
in terms of self-documentation.
Tests: unit(dev)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Introduce `raft` experimental option.
Adjust the tests accordingly to accomodate the new option.
It's not enabled by default when providing
`--experimental=true` config option and should be
requested explicitly via `--experimental-options=raft`
config option.
Hide the code related to `raft_group_registry` behind
the switch. The service object is still constructed
but no initialization is performed (`init()` is not
called) if the flag is not set.
Later, other raft-related things, such as raft schema
changes, will also use this flag.
Also, don't introduce a corresponding gossiper feature
just yet, because again, it should be done after the
raft schema changes API contract is stabilized.
This will be done in a separate series, probably related to
implementing the feature itself.
Tests: unit(dev)
Ref #9239.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210823121956.167682-1-pa.solodovnikov@scylladb.com>
Prepare for updating seastar submodule to a change
that requires deferred actions to be noexcept
(and return void).
Test: unit(dev, debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Get rid of unused includes of seastar/util/{defer,closeable}.hh
and add a few that are missing from source files.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
All the places in code that configure the mentioned creds builder
from client_|server_encryption_options now do it the same way.
This patch generalizes it all in the utils:: helper.
The alternator code "ignores" require_client_auth and truststore
keys, but it's easy to make the generalized helper be compatible.
Also make the new helper coroutinized from the beginning.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The get_or_default and is_true are two aux bits that are used
to parse the config options. The former is duplicated in the
alternator code as well.
Put both in utils namespace for future.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Refs #9053
Flips default for commitlog disk footprint hard limit enforcement to off due
to observed latency stalls with stress runs. Instead adds an optional flag
"commitlog_use_hard_size_limit" which can be turned on to in fact do enforce it.
Sort of tape and string fix until we can properly tweak the balance between
cl & sstable flush rate.
Closes#9195
We decided to enable repair based node operations by default for replace
node operations.
To do that, a new option --allowed-repair-based-node-ops is added. It
lists the node operations that are allowed to enable repair based node
operations.
The operations can be bootstrap, replace, removenode, decommission and rebuild.
By default, --allowed-repair-based-node-ops is set to contain "replace".
Note, the existing option --enable-repair-based-node-ops is still in
play. It is the global switch to enable or disable the feature.
Examples:
- To enable bootstrap and replace node ops:
```
scylla --enable-repair-based-node-ops true --allowed-repair-based-node-ops replace,bootstrap
```
- To disable any repair based node ops:
```
scylla --enable-repair-based-node-ops false
```
Closes#9197
Aggregates are propagated, created and dropped very similarly
to user-defined functions - a set of helper functions
for aggregates are added based on the UDF implementation.
The code for handling background view updates used to propagate
exceptions unconditionally, which leads to "exceptional future
ignored" warnings if the update was put to background.
From now on, the exception is only propagated if its future
is actually waited on.
Fixes#6187
Tested manually, the warning was not observed after the patch
Closes#9179
Adds a `hinted_handoff_pause_hint_replay` error injection point. When
enabled, hint replay logic behaves as if it is run, but it gets stuck in
a loop and no hints are actually sent until the point is disabled again.
This injection point will be useful in dtests - it will simulate
infinitely slow hint replay and will make it possible to test how some
operations behave while hint replay logic is running.
The first intended use case of this injection point is testing the HTTP
API for waiting for hints (#8728).
Refs: #6649Closes#8801
* github.com:scylladb/scylla:
hints: fix indentation after previous patch
hints: error injection for pausing hint replay
hints: coroutinize lambda inside send_one_file
Each hint sender runs an asynchronous loop with tries to flush and then
send hints. Between each attempt, it sleeps at most 10 seconds using
sleep_abortable. However, an overload of sleep_abortable is used which
does not take an abort_source - it should abort the sleep in case
Seastar handles a SIGINT or SIGTERM signal. However, in order for that
to work, the application must not prevent default handling of those
signals in Seastar - but Scylla explicitly does it by disabling the
`auto_handle_sigint_sigterm` option in reactor config. As a result,
those sleeps are never aborted, and - because we wait for the async
loops to stop - they can delay shutdown by at most 10 seconds.
To fix that, an abort_source is added to the hints sender, and the
abort_source is triggered when the corresponding sender is requested to
stop.
Fixes: #9176Closes#9177
Adds a `hinted_handoff_pause_hint_replay` error injection point. When
enabled, hint replay logic behaves as if it is run, but it gets stuck in
a loop and no hints are actually sent until the point is disabled again.
This injection point will be useful in dtests - it will simulate
infinitely slow hint replay and will make it possible to test how some
operations behave while hint replay logic is running.
The first intended use case of this injection point is testing the HTTP
API for waiting for hints (#8728).
Refs: #6649
Adds a sync_point structure. A sync point is a (possibly incomplete)
mapping from hint queues to a replay position in it. Users will be able
to create sync points consisting of the last written positions of some
hint queues, so then they can wait until hint replay in all of the
queues reach that point.
The sync point supports serialization - first it is serialized with the
help of IDL to a binary form, and then converted to a hexadecimal
string. Deserialization is also possible.
Adds necessary infrastructure which allows, for a given endpoint
manager, to wait until hints are replayed up to a specified position. An
abort source must be specified which, if triggered, cancels waiting for
hint replay.
If the endpoint manager is stopped, current waiters are dismissed with
an exception.
Keeps track of a position which serves as an upper bound for positions
of already replayed hints - i.e. all hints with replay positions
strictly lower than it are considered replayed.
In order to accurately track this bound during hint replay, a std::map
is introduced which contains positions of hints which are currently
being sent.