There's nothing specific to scylla in the lister
classes, they could (and maybe should) be part of
the seastar library.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.
Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.
The changes we applied mechanically with a script, except to
licenses/README.md.
Closes#9937
Move replica-oriented classes to the replica namespace. The main
classes moved are ::database, ::keyspace, and ::table, but a few
ancillary classes are also moved. There are certainly classes that
should be moved but aren't (like distributed_loader) but we have
to start somewhere.
References are adjusted treewide. In many cases, it is obvious that
a call site should not access the replica (but the data_dictionary
instead), but that is left for separate work.
scylla-gdb.py is adjusted to look for both the new and old names.
The database, keyspace, and table classes represent the replica-only
part of the objects after which they are named. Reading from a table
doesn't give you the full data, just the replica's view, and it is not
consistent since reconciliation is applied on the coordinator.
As a first step in acknowledging this, move the related files to
a replica/ subdirectory.
The end_point_hints_manager's field _last_written_rp is initialized in
its regular constructor, but is not copied in the move constructor.
Because the move constructor is always involved when creating a new
endpoint manager, the _last_written_rp field is effectively always
initialized with the zero-argument constructor, and is set to the zero
value.
This can cause the following erroneous situation to occur:
- Node A accumulates hints towards B.
- Sync point is created at A.
It will be used later to wait for currently accumulated hints.
- Node A is restarted.
The endpoint manager A->B is created which has bogus value in the
_last_written_rp (it is set to zero).
- Node A replays its hints but does not write any new ones.
- A hint flush occurs.
If there are no hint segments on disk after flush, the endpoint
manager sets its last sent position to the last written position,
which is by design. However, the last written position has incorrect
value, so the last sent position also becomes incorrect and too low.
- Try to wait for the sync point created earlier.
The sync point waiting mechanism waits until last sent hint position
reaches or goes past the position encoded in the sync point, but it
will not happen because the last sent position is incorrect.
The above bug can be (sometimes) reproduced in
hintedhandoff_sync_point_api_test dtest.
Now, the _last_written_rp field is properly initialized in the move
constructor, which prevents the bug described above.
Fixes: #9320Closes#9426
This needs to add forward declarations of the gossiper class and
re-include some other headers here and there.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Adds a configuration option to the commitlog: base_segment_id. When
provided, the commitlog uses this ID as a base of its segment IDs
instead of calculating it based on the number of milliseconds between
the epoch and boot time.
This is needed in order for the feature which allows to wait for hints
to be replayed to work - it relies on the replay positions monotonically
increasing. Endpoint managers periodically re-creates its commitlog
instance - if it is re-created when there are no segments on disk,
currently it will choose the number of milliseconds between the epoch
and boot time, which might result in segments being generated with the
same IDs as some segments previously created and deleted during the same
runtime.
When `manager::wait_for_sync_point` is called, the abort source from the
arguments (`as`) might have already been triggered. In such case, the
subscription which was supposed to trigger the `local_as` abort source
won't be run, and the code will wait indefinitely for hints to be
replayed instead of checking the replay status and returning
immediately.
This commit fixes the problem by manually triggering `local_as` if `as`
have been triggered.
When the promise waited on in the `wait_until_hints_are_replayed_up_to`
function is resolved, a continuation runs which prints a log line with
information about this event. The continuation captures a pointer to the
hints sender and uses it to get information about the endpoint whose
hints are waited for. However, at this point the sender might have been
deleted - for example, when the node is being stopped and everybody
waiting for hints is dismissed.
This commit fixes the use-after-free by getting all necessary
information while the sender is guaranteed to be alive and captures it
in the continuation's capture list.
Adds a `hinted_handoff_pause_hint_replay` error injection point. When
enabled, hint replay logic behaves as if it is run, but it gets stuck in
a loop and no hints are actually sent until the point is disabled again.
This injection point will be useful in dtests - it will simulate
infinitely slow hint replay and will make it possible to test how some
operations behave while hint replay logic is running.
The first intended use case of this injection point is testing the HTTP
API for waiting for hints (#8728).
Refs: #6649Closes#8801
* github.com:scylladb/scylla:
hints: fix indentation after previous patch
hints: error injection for pausing hint replay
hints: coroutinize lambda inside send_one_file
Each hint sender runs an asynchronous loop with tries to flush and then
send hints. Between each attempt, it sleeps at most 10 seconds using
sleep_abortable. However, an overload of sleep_abortable is used which
does not take an abort_source - it should abort the sleep in case
Seastar handles a SIGINT or SIGTERM signal. However, in order for that
to work, the application must not prevent default handling of those
signals in Seastar - but Scylla explicitly does it by disabling the
`auto_handle_sigint_sigterm` option in reactor config. As a result,
those sleeps are never aborted, and - because we wait for the async
loops to stop - they can delay shutdown by at most 10 seconds.
To fix that, an abort_source is added to the hints sender, and the
abort_source is triggered when the corresponding sender is requested to
stop.
Fixes: #9176Closes#9177
Adds a `hinted_handoff_pause_hint_replay` error injection point. When
enabled, hint replay logic behaves as if it is run, but it gets stuck in
a loop and no hints are actually sent until the point is disabled again.
This injection point will be useful in dtests - it will simulate
infinitely slow hint replay and will make it possible to test how some
operations behave while hint replay logic is running.
The first intended use case of this injection point is testing the HTTP
API for waiting for hints (#8728).
Refs: #6649
Adds necessary infrastructure which allows, for a given endpoint
manager, to wait until hints are replayed up to a specified position. An
abort source must be specified which, if triggered, cancels waiting for
hint replay.
If the endpoint manager is stopped, current waiters are dismissed with
an exception.
Keeps track of a position which serves as an upper bound for positions
of already replayed hints - i.e. all hints with replay positions
strictly lower than it are considered replayed.
In order to accurately track this bound during hint replay, a std::map
is introduced which contains positions of hints which are currently
being sent.
The position of the last written hint is now tracked by the endpoint
hints manager.
When manager is constructed and no hints are replayed yet, the last
written hint position is initialized to the beginning of a fake segment
with ID corresponding to the current number of milliseconds since the
epoch. This choice makes sure that, in case a new hint sync point is
created before any hints are written, the position recorded for that
hint queue will be larger than all replay positions in segments
currently stored on disk.
Instead of tracking the last position for which hint sending is
attempted, the last successfully replayed position is tracked.
The previous variable was used to calculate the position from which hint
replay should restart in case of an error, in the following way:
_last_not_complete_rp = ctx_ptr->first_failed_rp.value_or(
ctx_ptr->last_attempted_rp.value_or(_last_not_complete_rp));
Now, this formula uses the last_succeeded_rp in place of
last_attempted_rp. This change does not have an effect on the choice of
the starting position of the next retry:
- If the hint at `last_attempted_rp` has succeeded, in the new algorithm
the same position will be recorded in `last_succeeded_rp`, and the
formula will yield the same result.
- If the hint at `last_attempted_rp` has failed, it will be accounted
into `first_failed_rp`, so the formula will yield the same result.
The motivation for this change is that in the next commits of this PR we
will start tracking the position of the last replayed hint per hint
queue, and the meaning of the new variable makes it more useful - when
there are no failed hints in the hint sending attempt, last_succeeded_rp
gives us information that hints _up to this position_ were replayed; the
last_attempted_rp variable can only tell us that hints _before that
position_ were replayed successfully.
Instead of calling the `on_hint_send_failure` method inside the hint
sending task in places where an error occurs, we now let the exceptions
be returned and handle them inside a single `then_wrapped` attached to
the hint sending task.
Apart from the `then_wrapped`, there is one more place which calls
`on_hint_send_failure` - in the exception handler for the future which
spawns the asynchronous hint sending task. It needs to be kept separate
because it is a part of a separate task.
Endpoint hints manager keeps a commitlog instance which is used to write
hints into new segments. This instance is re-created every 10 seconds,
which causes the previous instance to leave its segments on disk.
On the other hand, hints sender keeps a list of segments to replay which
is updated only after it becomes empty. The list is repopulated with
segments returned by the commitlog::get_segments_to_replay() method
which does not specify the order of the segments returned.
As a preparation for the upcoming hint sync points feature, this commit
changes the order in which segments are replayed:
- First, segments written by other shards are replayed. Such segments
may appear in the queue because of segment rebalancing which is done
at startup.
The purpose of replaying "foreign" segments first is that they are
problematic for hint sync points. For each hint queue, a hint sync
point encodes a replay position of the last written hint on the local
shard. Accounting foreign segments precisely would make the
implementation more complicated. To make things simpler, waiting for
sync points will always make sure that all foreign segments are
replayed. This might sometimes cause more hints to be waited on than
necessary if a restart occurs in the meantime.
- Segments written by the local shard are replayed later, in order of
their IDs. This makes sure that local hints are replayed in the order
they were written to segments, and will make it possible to use replay
positions to track progress of hint replay.
This reverts commit e48739a6da.
This commit removes the functionality from endpoint hints manager which
allowed to flush hints immediately and forcefully update the list of
segments to replay.
The new implementation of waiting for hints will be based on replay
positions returned by the commitlog API and it won't be necessary to
forcefully update the segment list when creating a sync point.
This reverts commit 5a49fe74bb.
This commit removes a metric which tracks how many segments were
replayed during current runtime. It was necessary for current "wait for
hints" mechanism which is being replaced with a different one -
therefore we can remove the metric.
This reverts commit 427bbf6d86.
This commit removes the infrastructure which allows to wait until
current hints are replayed in a given hint queue.
It will be replaced with a different mechanism in later commits.
This reverts commit 244738b0d5.
This commit removes create_hint_queue_sync_point and
check_hint_queue_sync_point functions from storage_proxy, which were
used to wait until local hints are sent out to particular nodes.
Similar methods will be reintroduced later in this PR, with a completely
different implementation.
This reverts commit 9d68824327.
First, we are reverting existing infrastructure for waiting for hints in
order to replace it with a different one, therefore this commit needs to
be reverted as well.
Second, errors during hint replay can occur naturally and don't
necessarily indicate that no progress can be made - for example, the
target node is heavily loaded and some hints time out. The "waiting for
hints" operation becomes a user-issued command, so it's not as vital to
ensure liveness.
The storage service pointer is only used so (un)subscribe
to (from) lifecycle events. Now the subscription is gone,
so can the storage service pointer.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Managers sit on storage proxy which is already subscribed on
lifecycle events, so it can "notify" hints managers directly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Eliminate not used includes and replace some more includes
with forward declarations where appropriate.
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
storage_proxy works with vectors of inet_addresses for replica sets
and for topology changes (pending endpoints, dead nodes). This patch
introduces new names for these (without changing the underlying
type - it's still std::vector<gms::inet_address>). This is so that
the following patch, that changes those types to utils::small_vector,
will be less noisy and highlight the real changes that take place.
When a hint queue becomes stuck due to not being able to send to its
destination (e.g. destination node is no longer UP, or we failed to send
some hints from a file), then it's better to immediately dismiss anybody
who waits for hint replay instead of letting them wait until timeout.
Adds two methods to `storage_proxy`:
- `create_hint_queue_sync_point` - creates a "hint sync point" which
is kept present in storage_proxy until all hint queues on the local
node reach their curent end. It will also disappear if given deadline
is reached first.
- `check_hint_queue_sync_point` - checks if given hint sync point still
exists.
The created sync point waits for hint queues in all hint managers, on
all shards.
Implements `wait_until_hints_are_replayed` method returning a future
which blocks until either all current hint segments are replayed
(returns success in this case), or when provided timeout is reached
(returns a timeout exception in this case).
Adds a field to `end_point_hints_manager::sender`:
`_total_replayed_segments_count` which keeps track of how many segments
were replayed so far. This metric will be used to calculate the
sequence number of the last current hint segments in the queue - so
that we can implement waiting for current segments to be replayed.
Endpoint hints manager keeps a list of segments to replay. New segments
are appended to it lazily - only when a hint flush occurs (hints
commitlog instance is re-created) and the list is empty. Because of
that, this list cannot be currently used to tell how many segments are
on disk.
This commit allows to trigger hints flush and forcefully update the list
of segments to replay. In later commits, a mechanism will be implemented
which will allow to wait until a given number of hint segments is
replayed. Triggering a hints flush with segment list update will allow
us to properly synchronize and determine up to which segment we need to
wait.
Now, instead of looking at the gossiper state to check if the
destination node is still in the ring, we are using token_metadata as a
source of truth. This results in much simpler code in can_send() as
token_metadata has an is_member method which does exactly what we want.
A recent change to the commitlog (4082f57) caused its configurable size limit to
be strictly enforced - after reaching the limit, new segments wouldn't be
allocated until some of the previous segments are freed. This flow can work for
the regular commitlog, however the hints commitlog does not delete the segments
itself - instead, hints manager recreates its commitlog every 10 seconds, picks
up segments left by the previous instance and deletes each segment manually only
after all hints are sent out from a segment.
Because of the non-standard flow, it is possible that the hints commitlog fills
up and stops accepting more hints. Hints manager uses a relatively low limit for
each commitlog instance (128MB divided by shard count), so it's not hard to fill
it up. What's worse, hints manager tries to acquire file_update_mutex in
exclusive mode before re-creating the commitlog, while hints waiting to be
written acquire this lock in shared mode - which causes hints flushing to
completely deadlock and no more hints be admitted to the commitlog. The queue of
hints waiting to be admitted grows very quickly and soon all writes which could
result in a hint being generated are rejected with OverloadedException.
To solve this problem, it is now possible to bring back the soft disk space
limit by setting a flag in commitlog's configuration.
Tests:
- unit(dev)
- wrote hints for 15 minutes in order to see if it gets stuck again
Fixes#8137Closes#8206
* github.com:scylladb/scylla:
hints_manager: don't use commitlog hard space limit
commitlog: add an option to allow going over size limit
This commit disables the hard space limit applied by commitlogs created
to store hints. The hard limit causes problems for hints because they
use small-sized commitlogs to store hints (128MB, currently). Instead of
letting the commitlog delete the segments itself, it recreates the
commitlog every 10 seconds and manually deletes old segments after all
hints are sent out from them.
If the 128MB limit is reached, the hints manager will get stuck. A
future which puts hint into commitlog holds a shared lock, and commitlog
recreation needs to get an exclusive lock, which results in a deadlock.
No more hints will be admitted, and eventually we will start rejecting
writes with OverloadedException due to too many hints waiting to be
admitted to the commitlog.
By disabling the hard limit for hints commitlog, the old behavior is
brought back - commitlog becomes more conservative with the space used
after going over its size limit, but does not block until some of its
segments are deleted.
So it can be modified while walked to dispatch
subscribed event notifications.
In #8143, there is a race between scylla shutdown and
notify_down(), causing use-after-free of cql_server.
Using an atomic vector itstead and futurizing
unregister_subscriber allows deleting from _lifecycle_subscribers
while walked using atomic_vector::for_each.
Fixes#8143
Test: unit(release)
DTest:
update_cluster_layout_tests:TestUpdateClusterLayout.add_node_with_large_partition4_test(release)
materialized_views_test.py:TestMaterializedViews.double_node_failure_during_mv_insert_4_nodes_test(release)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210224164647.561493-2-bhalevy@scylladb.com>
Implements a function which is responsible for changing hints manager
configuration while it is running.
It first starts new endpoint managers for endpoints which weren't
allowed by previous filter but are now, and then stops endpoint managers
which are rejected by the new filter.
The function is blocking and waits until all relevant ep managers are
started or stopped.
Uses db::hints::host_filter as the type of hinted_handoff_enabled
configuration option.
Previously, hinted_handoff_enabled used to be a string option, and it
was parsed later in a separate function during startup. The function
returned a std::optional<std::unordered_set<sstring>>, whose meaning in
the context of hints is rather enigmatic for an observer not familiar
with hints.
Now, hinted_handoff_enabled has type of db::hints::host_filter, and it
is plugged into the config parsing framework, so there is no need for
later post-processing.
Introduces a db::hints::directory_initializer object, which encapsulates
the logic of initializing directories for hints (creating/validating
directories, segment rebalancing). It will be useful for lazy
initialization of hints manager.
This use-after move was apprently exposed after switching to clang
in commit eb861e68e9.
The directory_entry is required for std::stoi(de.name.c_str())
and later in the catch{} clause.
This shows in the node logs as a "Ignore invalid directory" debug
log message with an empty name, and caused the hintedhandoff_rebalance_test
to fail when hints files aren't rebalanced.
Test: unit(dev)
DTest: hintedhandoff_additional_test.py:TestHintedHandoff.hintedhandoff_rebalance_test (dev, debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20201106172017.823577-1-bhalevy@scylladb.com>
Clang does not yet implement p1091r3, which allows lambdas
to capture structured bindings. To accomodate it, don't
use structured bindings for variables that are later
captured.
When there are hint files to be sent and the target endpoint is DOWN,
end_point_hints_manager works in the following loop:
- It reads the first hint file in the queue,
- For each hint in the file it decides that it won't be sent because the
target endpoint is DOWN,
- After realizing that there are some unsent hints, it decides to retry
this operation after sleeping 1 second.
This causes the first segment to be wholly read over and over again,
with 1 second pauses, until the target endpoint becomes UP or leaves the
cluster. This causes unnecessary I/O load in the streaming scheduling
group.
This patch adds a check which prevents end_point_hints_manager from
reading the first hint file at all when it is not allowed to send hints.
First observed in #6964
Tests:
- unit(dev)
- hinted handoff dtests
Closes#7407