Introduce `raft` experimental option.
Adjust the tests accordingly to accomodate the new option.
It's not enabled by default when providing
`--experimental=true` config option and should be
requested explicitly via `--experimental-options=raft`
config option.
Hide the code related to `raft_group_registry` behind
the switch. The service object is still constructed
but no initialization is performed (`init()` is not
called) if the flag is not set.
Later, other raft-related things, such as raft schema
changes, will also use this flag.
Also, don't introduce a corresponding gossiper feature
just yet, because again, it should be done after the
raft schema changes API contract is stabilized.
This will be done in a separate series, probably related to
implementing the feature itself.
Tests: unit(dev)
Ref #9239.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210823121956.167682-1-pa.solodovnikov@scylladb.com>
Prepare for updating seastar submodule to a change
that requires deferred actions to be noexcept
(and return void).
Test: unit(dev, debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Get rid of unused includes of seastar/util/{defer,closeable}.hh
and add a few that are missing from source files.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Refs #9053
Flips default for commitlog disk footprint hard limit enforcement to off due
to observed latency stalls with stress runs. Instead adds an optional flag
"commitlog_use_hard_size_limit" which can be turned on to in fact do enforce it.
Sort of tape and string fix until we can properly tweak the balance between
cl & sstable flush rate.
Closes#9195
We decided to enable repair based node operations by default for replace
node operations.
To do that, a new option --allowed-repair-based-node-ops is added. It
lists the node operations that are allowed to enable repair based node
operations.
The operations can be bootstrap, replace, removenode, decommission and rebuild.
By default, --allowed-repair-based-node-ops is set to contain "replace".
Note, the existing option --enable-repair-based-node-ops is still in
play. It is the global switch to enable or disable the feature.
Examples:
- To enable bootstrap and replace node ops:
```
scylla --enable-repair-based-node-ops true --allowed-repair-based-node-ops replace,bootstrap
```
- To disable any repair based node ops:
```
scylla --enable-repair-based-node-ops false
```
Closes#9197
Aggregates are propagated, created and dropped very similarly
to user-defined functions - a set of helper functions
for aggregates are added based on the UDF implementation.
The code for handling background view updates used to propagate
exceptions unconditionally, which leads to "exceptional future
ignored" warnings if the update was put to background.
From now on, the exception is only propagated if its future
is actually waited on.
Fixes#6187
Tested manually, the warning was not observed after the patch
Closes#9179
Adds a `hinted_handoff_pause_hint_replay` error injection point. When
enabled, hint replay logic behaves as if it is run, but it gets stuck in
a loop and no hints are actually sent until the point is disabled again.
This injection point will be useful in dtests - it will simulate
infinitely slow hint replay and will make it possible to test how some
operations behave while hint replay logic is running.
The first intended use case of this injection point is testing the HTTP
API for waiting for hints (#8728).
Refs: #6649Closes#8801
* github.com:scylladb/scylla:
hints: fix indentation after previous patch
hints: error injection for pausing hint replay
hints: coroutinize lambda inside send_one_file
Each hint sender runs an asynchronous loop with tries to flush and then
send hints. Between each attempt, it sleeps at most 10 seconds using
sleep_abortable. However, an overload of sleep_abortable is used which
does not take an abort_source - it should abort the sleep in case
Seastar handles a SIGINT or SIGTERM signal. However, in order for that
to work, the application must not prevent default handling of those
signals in Seastar - but Scylla explicitly does it by disabling the
`auto_handle_sigint_sigterm` option in reactor config. As a result,
those sleeps are never aborted, and - because we wait for the async
loops to stop - they can delay shutdown by at most 10 seconds.
To fix that, an abort_source is added to the hints sender, and the
abort_source is triggered when the corresponding sender is requested to
stop.
Fixes: #9176Closes#9177
Adds a `hinted_handoff_pause_hint_replay` error injection point. When
enabled, hint replay logic behaves as if it is run, but it gets stuck in
a loop and no hints are actually sent until the point is disabled again.
This injection point will be useful in dtests - it will simulate
infinitely slow hint replay and will make it possible to test how some
operations behave while hint replay logic is running.
The first intended use case of this injection point is testing the HTTP
API for waiting for hints (#8728).
Refs: #6649
Adds a sync_point structure. A sync point is a (possibly incomplete)
mapping from hint queues to a replay position in it. Users will be able
to create sync points consisting of the last written positions of some
hint queues, so then they can wait until hint replay in all of the
queues reach that point.
The sync point supports serialization - first it is serialized with the
help of IDL to a binary form, and then converted to a hexadecimal
string. Deserialization is also possible.
Adds necessary infrastructure which allows, for a given endpoint
manager, to wait until hints are replayed up to a specified position. An
abort source must be specified which, if triggered, cancels waiting for
hint replay.
If the endpoint manager is stopped, current waiters are dismissed with
an exception.
Keeps track of a position which serves as an upper bound for positions
of already replayed hints - i.e. all hints with replay positions
strictly lower than it are considered replayed.
In order to accurately track this bound during hint replay, a std::map
is introduced which contains positions of hints which are currently
being sent.
The position of the last written hint is now tracked by the endpoint
hints manager.
When manager is constructed and no hints are replayed yet, the last
written hint position is initialized to the beginning of a fake segment
with ID corresponding to the current number of milliseconds since the
epoch. This choice makes sure that, in case a new hint sync point is
created before any hints are written, the position recorded for that
hint queue will be larger than all replay positions in segments
currently stored on disk.
Instead of tracking the last position for which hint sending is
attempted, the last successfully replayed position is tracked.
The previous variable was used to calculate the position from which hint
replay should restart in case of an error, in the following way:
_last_not_complete_rp = ctx_ptr->first_failed_rp.value_or(
ctx_ptr->last_attempted_rp.value_or(_last_not_complete_rp));
Now, this formula uses the last_succeeded_rp in place of
last_attempted_rp. This change does not have an effect on the choice of
the starting position of the next retry:
- If the hint at `last_attempted_rp` has succeeded, in the new algorithm
the same position will be recorded in `last_succeeded_rp`, and the
formula will yield the same result.
- If the hint at `last_attempted_rp` has failed, it will be accounted
into `first_failed_rp`, so the formula will yield the same result.
The motivation for this change is that in the next commits of this PR we
will start tracking the position of the last replayed hint per hint
queue, and the meaning of the new variable makes it more useful - when
there are no failed hints in the hint sending attempt, last_succeeded_rp
gives us information that hints _up to this position_ were replayed; the
last_attempted_rp variable can only tell us that hints _before that
position_ were replayed successfully.
Instead of calling the `on_hint_send_failure` method inside the hint
sending task in places where an error occurs, we now let the exceptions
be returned and handle them inside a single `then_wrapped` attached to
the hint sending task.
Apart from the `then_wrapped`, there is one more place which calls
`on_hint_send_failure` - in the exception handler for the future which
spawns the asynchronous hint sending task. It needs to be kept separate
because it is a part of a separate task.
Endpoint hints manager keeps a commitlog instance which is used to write
hints into new segments. This instance is re-created every 10 seconds,
which causes the previous instance to leave its segments on disk.
On the other hand, hints sender keeps a list of segments to replay which
is updated only after it becomes empty. The list is repopulated with
segments returned by the commitlog::get_segments_to_replay() method
which does not specify the order of the segments returned.
As a preparation for the upcoming hint sync points feature, this commit
changes the order in which segments are replayed:
- First, segments written by other shards are replayed. Such segments
may appear in the queue because of segment rebalancing which is done
at startup.
The purpose of replaying "foreign" segments first is that they are
problematic for hint sync points. For each hint queue, a hint sync
point encodes a replay position of the last written hint on the local
shard. Accounting foreign segments precisely would make the
implementation more complicated. To make things simpler, waiting for
sync points will always make sure that all foreign segments are
replayed. This might sometimes cause more hints to be waited on than
necessary if a restart occurs in the meantime.
- Segments written by the local shard are replayed later, in order of
their IDs. This makes sure that local hints are replayed in the order
they were written to segments, and will make it possible to use replay
positions to track progress of hint replay.
This reverts commit e48739a6da.
This commit removes the functionality from endpoint hints manager which
allowed to flush hints immediately and forcefully update the list of
segments to replay.
The new implementation of waiting for hints will be based on replay
positions returned by the commitlog API and it won't be necessary to
forcefully update the segment list when creating a sync point.
This reverts commit 5a49fe74bb.
This commit removes a metric which tracks how many segments were
replayed during current runtime. It was necessary for current "wait for
hints" mechanism which is being replaced with a different one -
therefore we can remove the metric.
This reverts commit 427bbf6d86.
This commit removes the infrastructure which allows to wait until
current hints are replayed in a given hint queue.
It will be replaced with a different mechanism in later commits.
This reverts commit 244738b0d5.
This commit removes create_hint_queue_sync_point and
check_hint_queue_sync_point functions from storage_proxy, which were
used to wait until local hints are sent out to particular nodes.
Similar methods will be reintroduced later in this PR, with a completely
different implementation.
This reverts commit 82c419870a.
This commit removes the HINT_SYNC_POINT_CREATE and HINT_SYNC_POINT_CHECK
rpc verbs.
The upcoming HTTP API for waiting for hint replay will be restricted
to waiting for hints on the node handling the request, so there is no
need for new verbs.
This reverts commit 86d831b319.
This commit removes the wait_for_hints_before_repair option. Because a
previous commit in this series removes the logic from repair which
caused it to wait for hints to be replayed, this option is now useless.
We can safely remove this option because it is not present in any
release yet.
This reverts commit 9d68824327.
First, we are reverting existing infrastructure for waiting for hints in
order to replace it with a different one, therefore this commit needs to
be reverted as well.
Second, errors during hint replay can occur naturally and don't
necessarily indicate that no progress can be made - for example, the
target node is heavily loaded and some hints time out. The "waiting for
hints" operation becomes a user-issued command, so it's not as vital to
ensure liveness.
read_schema_partition_for_keyspace() copies some parameters to capture them
in a coroutine, but the same can be achieved more cleanly by changing the
reference parameters to value parameters, so do that.
Test: unit (dev)
Closes#9154
Following conversion to corotuines in fc91e90c59, remove extra
indents and braces left to make the change clearer.
One variable had to be renamed since without the braces it
duplicated another variable in the same block.
Test: unit (dev)
Closes#9125
schema_tables is quite hairy, but can be easily simplified with coroutines.
In addition to switching future-returning functions to coroutines, we also
switch Seastar threads to coroutines. This is less of a clear-cut win; the
motivation is to reduce the chances of someone calling a function that
expects to run in a thread from a non-thread context. This sometimes works
by accident, but when it doesn't, it's pretty bad. So a uniform calling convention
has some benefit.
I left the extra indents in, since the indent-fixing patch is hard to rebase in case
a rebase is needed. I will follow up with an indent fix post merge.
Test: unit (dev, debug, release)
Closes#9118
* github.com:scylladb/scylla:
db: schema_tables: drop now redundant #includes
db: schema_tables: coroutinize drop_column_mapping()
db: schema_tables: coroutinize column_mapping_exists()
db: schema_tables: coroutinize get_column_mapping()
db: schema_tables: coroutinize read_table_mutations()
db: schema_tables: coroutinize create_views_from_schema_partition()
db: schema_tables: coroutinize create_views_from_table_row()
db: schema_tables: unpeel lw_shared_ptr in create_Tables_from_tables_partition()
db: schema_tables: coroutinize create_tables_from_tables_partition()
db: schema_tables: coroutinize create_table_from_name()
db: schema_tables: coroutinize read_table_mutations()
db: schema_tables: coroutinize merge_keyspaces()
db: schema_tables: coroutinize do_merge_schema()
db: schema_tables: futurize and coroutinize merge_functions()
db: schema_tables: futurize and coroutinize user_types_to_drop::drop
db: schema_tables: futurize and coroutinize merge_types()
db: schema_tables: futurize and coroutinize merge_tables_and_views()
db: schema_tables: coroutinize store_column_mapping()
db: schema_tables: futurize and coroutinize read_tables_for_keyspaces()
db: schema_tables: coroutinize read_table_names_of_keyspace()
db: schema_tables: coroutinize recalculate_schema_version()
db: schema_tables: coroutinize merge_schema()
db: schema_tables: introduce and use with_merge_lock()
db: schema_tables: coroutinize update_schema_version_and_announce()
db: schema_tables: coroutinize read_keyspace_mutation()
db: schema_tables: coroutinize read_schema_partition_for_table()
db: schema_tables: coroutinize read_schema_partition_for_keyspace()
db: schema_tables: coroutinize query_partition_mutation()
db: schema_tables: coroutinize read_schema_for_keyspaces()
db: schema_tables: coroutinize convert_schema_to_mutations()
db: schema_tables: coroutinize calculate_schema_digest()
db: schema_tables: coroutinize save_system_schema()
The tables local is a lw_shared_ptr which is created and then refeferenced
before returning. It can be unpeeled to the pointed-to type, resulting in
one less allocation.
Right now, merge_functions() expects to be called in a thread.
Remove that requirement by converting it into a coroutine and returning
a future.
De-threading helps reduce errors where something expects to be called
in a thread, but isn't.