This patch fixes a race between two methods in hints manager: drain_for
and store_hint.
The first method is called when a node leaves the cluster, and it
'drains' end point hints manager for that node (sends out all hints for
that node). If this method is called when the local node is being
decomissioned or removed, it instead drains hints managers for all
endpoints.
In the case of decomission/remove, drain_for first calls
parallel_for_each on all current ep managers and tells them to drain
their hints. Then, after all of them complete, _ep_managers.clear() is
called.
End point hints managers are created lazily and inserted into
_ep_managers map the first time a hint is stored for that node. If
this happens between parallel_for_each and _ep_managers.clear()
described above, the clear operation will destroy the new ep manager
without draining it first. This is a bug and will trigger an assert in
ep manager's destructor.
To solve this, a new flag for the hints manager is added which is set
when it drains all ep managers on removenode/decommission, and prevents
further hints from being written.
Fixes#7257Closes#7278
C++20 introduced `contains` member functions for maps and sets for
checking whether an element is present in the collection. Previously
`count` function was often used in various ways.
`contains` does not only express the intend of the code better but also
does it in more unified way.
This commit replaces all the occurences of the `count` with the
`contains`.
Tests: unit(dev)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <b4ef3b4bc24f49abe04a2aba0ddd946009c9fcb2.1597314640.git.piotr@scylladb.com>
After restart_segment was removed from send_state enum, send_state_set
now has only one possible element: segment_replay_failed.
This patch removes send_state_set and uses bool in its place instead.
Hints manager uses commitlog framework to store and replay hints.
The commitlog::read_log_file function is used for replaying hints. It
reads commitlog entries and passes them to a callback. In case of hints
manager, the callback calls manager::send_one_hint function.
In case something goes wrong during this process, sending of that file
is attempted again later. If the error was caused by hints that failed
to be sent (e.g. due to network error), then we also advance
_last_not_complete_rp field to the position of the first hint that
failed. In the next retry, we will start reading from the commitlog from
that position.
However, current logic does not account for the case when an error
occurs in the commitlog::read_log_file function itself. If,
coincidentally, all hints sent by send_one_hint succeed, then we won't
advance the _last_not_complete_rp field and we may unnecessarily repeat
sending some of the hints that succeeded.
This patch adds the send_one_file_ctx::last_sent_rp field, which keeps
track of the last commitlog position for which a hint was attempted to
be sent. In case read_log_file throws an error but all send_one_hint
calls succeed, then it will be used to update _last_not_complete_rp.
This will reduce the amount of hints that are resent in this case to
only one.
Tests:
- unit(dev)
- dtest(hintedhandoff_additional_test, dev)
When sending hints from one file, rps_set is used to keep track of
positions of hints that are currently sent. If sending of a hint fails,
its position is not removed from rps_set. If some hints fail to be sent
while handling a hints file, the lowest position from rps_set is used
to calculate the position from where to start when sending of the file
is retried.
Keeping track of commitlog positions this way isn't necessary to
calculate this position. This patch removes rps_set and replaces it
with first_failed_rp - which is just a single
std::optional<db::replay_position>. This value is updated when a hint
send failure is detected.
This simplifies calculation of starting position for the next retry, and
allowed to remove some error handling logic related to an edge case when
inserting to rps_set fails.
- unit(dev)
- dtest(hintedhandoff_additional_test, dev)
By default, semaphore exceptions bring along very little context:
either that a semaphore was broken or that it timed out.
In order to make debugging easier without introducing significant
runtime costs, a notion of named semaphore is added.
A named semaphore is simply a semaphore with statically defined
name, which is present in its errors, bringing valuable context.
A semaphore defined as:
auto sem = semaphore(0);
will present the following message when it breaks:
"Semaphore broken"
However, a named semaphore:
auto named_sem = named_semaphore(0, named_semaphore_exception_factory{"io_concurrency_sem"});
will present a message with at least some debugging context:
"Semaphore broken: io_concurrency_sem"
It's not much, but it would really help in pinpointing bugs
without having to inspect core dumps.
At the same time, it does not incur any costs for normal
semaphore operations (except for its creation), but instead
only uses more CPU in case an error is actually thrown,
which is considered rare and not to be on the hot path.
Refs #4999
Tests: unit(dev), manual: hardcoding a failure in view building code
end_point_hints_manager::file_update_mutex is taken by space_watchdog
but while space_watchdog is waiting for it the corresponding
end_point_hints_manager instance may get destroyed by manager::drain_for()
or by manager::stop().
This will end up in a use-after-free event.
Let's change the end_point_hints_manager's API in a way that would prevent
such an unsafe locking:
- Introduce the with_file_update_mutex().
- Make end_point_hints_manager::file_update_mutex() method private.
Fixes#4685Fixes#4836
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
If drain_for() is running together with itself: one instance for the local
node and one for some other node, erasing of elements from the _ep_managers
map may lead to a use-after-free event.
Let's serialize drain_for() calls with a semaphore.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
The resource manager is used to manage common resources between
various hints managers. In-flight hints used to be one of the shared
resources, but it proves to cause starvation, when one manager eats
the whole limit - which may be especially painful if the background
materialized views hints manager starves the regular hints manager,
which can in turn start failing user writes because of admission control.
This patch makes the limit per-manager again,
which effectively reverts the limit to its original behavior.
Fixes#4483
Message-Id: <8498768e8bccbfa238e6a021f51ec0fa0bf3f7f9.1559649491.git.sarna@scylladb.com>
Rename this state value to better reflect the reality:
state::ep_state_is_not_normal -> state::ep_state_left_the_ring
The manager gets to this state when the destination Node has left the ring.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
If we discover that a current segment is corrupted there is nothing we
can do about it.
This patch does the following:
1) Drops the corrupted segment and moves to the next one.
2) Logs such events as ERRORs.
3) Introduces a new metrics that accounts such event.
Fixes#4364
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
When a view replica becomes unavailable, updates to it are stored as
hints at the paired based replica. This on-disk queue of pending view
updates grows as long as there are view updated and the view replica
remains unavailable. Currently, we take that relative queue size into
account when calculating the delay for new base writes, in the context
of the backpressure algorithm for materialized views.
However, the way we're calculating that on-disk backlog is wrong,
since we calculate it per-device and then feed it to all the hints
managers for that device. This means that normal hints will show up as
backlog for the view hints manager, which in turn introduces delays.
This can make the view backpressure mechanism kick-in even if the
cluster uses no materialized views.
There's yet another way in which considering the view hints backlog is
wrong: a view replica that is unavailable for some period of time can
cause the backlog to grow to a point where all base writes are applied
the maximum delay of 1 second. This turns a single-node failure into
cluster unavailability.
The fix to both issues is to simply not take this on-disk backlog into
account for the backpressure algorithm.
Fixes#4351Fixes#4352
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190321170418.25953-1-duarte@scylladb.com>
We will try to send a particular segment later (in 1s) from the place
where we left off if it wasn't sent out in full before. However we may miss
some of column family mappings when we get back to sending this file and
start sending from some entry in the middle of it (where we left off)
if we didn't save column family mappings we cached while reading this segment
from its begining.
This happens because commitlog doesn't save a column family information
in every entry but rather once for each uniq column family (version) per
"cycle" (see commitlog::segment description for more info).
Therefore we have to assume that a particular column family mapping
appears once in the whole segment (worst case). And therefore, when we
decide to resume sending a segment we need to keep the column family
mappings we accumulated so far and drop them only after we are done with
this particular segment (sent it out in full).
Fixes#4122
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Account the amount of hints that were discarded in the send path.
This may happen for instance due to a schema change or because a hint
being to old.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
We would like to get rid of boost::filesystem and gradually replace it with
std::experimental::filesystem.
TODO: using namespace fs = std::experimental::filesystem,
use fs::path directly, rather than lister::path
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
"
Hinted handoff should not overpower regular flows like READs, WRITEs or
background activities like memtable flushes or compactions.
In order to achieve this put its sending in the STEAMING CPU scheduling
group and its commitlog object into the STREAMING I/O scheduling group.
Fixes#3817
"
* 'hinted_handoff_scheduling_groups-v2' of https://github.com/vladzcloudius/scylla:
db::hints::manager: use "streaming" I/O scheduling class for reads
commitlog::read_log_file(): set the a read I/O priority class explicitly
db::hints::manager: add hints sender to the "streaming" CPU scheduling group
When messaging_service is started we may immediately receive a mutation
from another node (e.g. in the MV update context). If hinted handoff is not
ready to store hints at that point we may fail some of MV updates.
We are going to resolve this by start()ing hints::managers before we
start messaging_service and blocking hints replaying until all relevant
objects are initialized.
Refs #3828
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Hinting is allowed after "started" before "stopping".
Hints that attempted to be stored outside this time frame are going to
be dropped.
Refs #3828
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Introduce a multi-bit state field. In this patch it replaces the _stopping
boolean. We are going to add more states in the following patches.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
The space_watchdog enables or disables hints for the managers
associated with a particular device. We encapsulate this decision
inside the hints::managers by introducing the update_backlog()
function.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Disable the copy and move ctors and assignment operators for both the
hints::manager and the hints::resource_manager.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Instead of unfreezing a mutation from the commitlog and then freezing
it again to send, just keep the read frozen mutation.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
The moving operation changes a node's token to a new token. It is
supported only when a node has one token. The legacy moving operation is
useful in the early days before the vnode is introduced where a node has
only one token. I don't think it is useful anymore.
In the future, we might support adjusting the number of vnodes to reblance
the token range each node owns.
Removing it simplifies the cluster operation logic and code.
Fixes#3475
Message-Id: <144d3bea4140eda550770b866ec30e961933401d.1533111227.git.asias@scylladb.com>
Rebalance hints segments that need to be sent among all present shards.
Ensure that after rebalancing the difference between the number of segments
of any two shards is not greater than 1.
Try to minimize the amount of "file rename" operations in order to achieve the needed result.
Note: "Resharding" is a particular case of rebalancing.
Tests: dtest: hintedhandoff_additional_test.py:TestHintedHandoff.hintedhandoff_rebalance_test
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
In order to make space_watchdog device-aware, device_id field
is added to hints manager. It's an equivalent of stat.st_dev
and it identifies the disk that contains manager's root directory.
Signed-off-by: Piotr Sarna <sarna@scylladb.com>
Now that more than one instance of hints manager can be present
at the same time, registering metrics is moved out of the constructor
to prevent 'registering metrics twice' errors.
Constants related to managing resources are moved to newly created
resource_manager class. Later, this class will be used to manage
(potentially shared) resources of hints managers.
When node is decommissioned/removed it will drain all its hints and all
remote nodes that have hints to it will drain their hints to this node.
What "drain" means? - The node that "drains" hints to a specific
destination will ignore failures and will continue sending hints till the end
of the current segment, erase it and move to the next one till there are
no more segments left.
After all hints are drained the corresponding hints directory is removed.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>