Commit Graph

40 Commits

Author SHA1 Message Date
Piotr Dulikowski
39771967bb hinted handoff: fix race - decomission vs. endpoint mgr init
This patch fixes a race between two methods in hints manager: drain_for
and store_hint.

The first method is called when a node leaves the cluster, and it
'drains' end point hints manager for that node (sends out all hints for
that node). If this method is called when the local node is being
decomissioned or removed, it instead drains hints managers for all
endpoints.

In the case of decomission/remove, drain_for first calls
parallel_for_each on all current ep managers and tells them to drain
their hints. Then, after all of them complete, _ep_managers.clear() is
called.

End point hints managers are created lazily and inserted into
_ep_managers map the first time a hint is stored for that node. If
this happens between parallel_for_each and _ep_managers.clear()
described above, the clear operation will destroy the new ep manager
without draining it first. This is a bug and will trigger an assert in
ep manager's destructor.

To solve this, a new flag for the hints manager is added which is set
when it drains all ep managers on removenode/decommission, and prevents
further hints from being written.

Fixes #7257

Closes #7278
2020-09-24 14:51:24 +03:00
Piotr Jastrzebski
c001374636 codebase wide: replace count with contains
C++20 introduced `contains` member functions for maps and sets for
checking whether an element is present in the collection. Previously
`count` function was often used in various ways.

`contains` does not only express the intend of the code better but also
does it in more unified way.

This commit replaces all the occurences of the `count` with the
`contains`.

Tests: unit(dev)

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <b4ef3b4bc24f49abe04a2aba0ddd946009c9fcb2.1597314640.git.piotr@scylladb.com>
2020-08-15 20:26:02 +03:00
Piotr Dulikowski
e5b2218ad4 hinted handoff: use bool instead of send_state_set
After restart_segment was removed from send_state enum, send_state_set
now has only one possible element: segment_replay_failed.

This patch removes send_state_set and uses bool in its place instead.
2020-06-12 16:10:20 +02:00
Piotr Dulikowski
6b34bb1a43 hinted handoff: update replay position on commitlog failure
Hints manager uses commitlog framework to store and replay hints.

The commitlog::read_log_file function is used for replaying hints. It
reads commitlog entries and passes them to a callback. In case of hints
manager, the callback calls manager::send_one_hint function.

In case something goes wrong during this process, sending of that file
is attempted again later. If the error was caused by hints that failed
to be sent (e.g. due to network error), then we also advance
_last_not_complete_rp field to the position of the first hint that
failed. In the next retry, we will start reading from the commitlog from
that position.

However, current logic does not account for the case when an error
occurs in the commitlog::read_log_file function itself. If,
coincidentally, all hints sent by send_one_hint succeed, then we won't
advance the _last_not_complete_rp field and we may unnecessarily repeat
sending some of the hints that succeeded.

This patch adds the send_one_file_ctx::last_sent_rp field, which keeps
track of the last commitlog position for which a hint was attempted to
be sent. In case read_log_file throws an error but all send_one_hint
calls succeed, then it will be used to update _last_not_complete_rp.
This will reduce the amount of hints that are resent in this case to
only one.

Tests:

- unit(dev)
- dtest(hintedhandoff_additional_test, dev)
2020-06-12 16:10:20 +02:00
Piotr Dulikowski
d369b538f0 hinted handoff: remove rps_set, use first_failed_rp instead
When sending hints from one file, rps_set is used to keep track of
positions of hints that are currently sent. If sending of a hint fails,
its position is not removed from rps_set. If some hints fail to be sent
while handling a hints file, the lowest position from rps_set is used
to calculate the position from where to start when sending of the file
is retried.

Keeping track of commitlog positions this way isn't necessary to
calculate this position. This patch removes rps_set and replaces it
with first_failed_rp - which is just a single
std::optional<db::replay_position>. This value is updated when a hint
send failure is detected.

This simplifies calculation of starting position for the next retry, and
allowed to remove some error handling logic related to an edge case when
inserting to rps_set fails.

- unit(dev)
- dtest(hintedhandoff_additional_test, dev)
2020-06-12 16:10:19 +02:00
Vlad Zolotarov
b83e84b467 db::hints:: optimize with_file_update_mutex()
Avoid extra shared_ptr copy.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Reviewed-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200311214313.2988-1-vladz@scylladb.com>
2020-04-16 09:01:40 +03:00
Piotr Sarna
9c5a5a5ac2 treewide: add names to semaphores
By default, semaphore exceptions bring along very little context:
either that a semaphore was broken or that it timed out.
In order to make debugging easier without introducing significant
runtime costs, a notion of named semaphore is added.
A named semaphore is simply a semaphore with statically defined
name, which is present in its errors, bringing valuable context.
A semaphore defined as:

  auto sem = semaphore(0);

will present the following message when it breaks:
"Semaphore broken"
However, a named semaphore:

  auto named_sem = named_semaphore(0, named_semaphore_exception_factory{"io_concurrency_sem"});

will present a message with at least some debugging context:

  "Semaphore broken: io_concurrency_sem"

It's not much, but it would really help in pinpointing bugs
without having to inspect core dumps.

At the same time, it does not incur any costs for normal
semaphore operations (except for its creation), but instead
only uses more CPU in case an error is actually thrown,
which is considered rare and not to be on the hot path.

Refs #4999

Tests: unit(dev), manual: hardcoding a failure in view building code
2019-11-26 15:14:21 +02:00
Vlad Zolotarov
b34c36baa2 hinted handoff: make taking file_update_mutex safe
end_point_hints_manager::file_update_mutex is taken by space_watchdog
but while space_watchdog is waiting for it the corresponding
end_point_hints_manager instance may get destroyed by manager::drain_for()
or by manager::stop().

This will end up in a use-after-free event.

Let's change the end_point_hints_manager's API in a way that would prevent
such an unsafe locking:

   - Introduce the with_file_update_mutex().
   - Make end_point_hints_manager::file_update_mutex() method private.

Fixes #4685
Fixes #4836

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2019-08-20 11:26:19 -04:00
Vlad Zolotarov
7a12b46fc9 db::hints::manager: serialize calls to drain_for()
If drain_for() is running together with itself: one instance for the local
node and one for some other node, erasing of elements from the _ep_managers
map may lead to a use-after-free event.

Let's serialize drain_for() calls with a semaphore.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2019-08-20 10:58:36 -04:00
Vlad Zolotarov
09600f1779 db::hints: cosmetics: identation and missing method qualifier
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2019-08-20 10:58:36 -04:00
Piotr Sarna
ac7531d8d9 db,hints: decouple in-flight hints limits from resource manager
The resource manager is used to manage common resources between
various hints managers. In-flight hints used to be one of the shared
resources, but it proves to cause starvation, when one manager eats
the whole limit - which may be especially painful if the background
materialized views hints manager starves the regular hints manager,
which can in turn start failing user writes because of admission control.
This patch makes the limit per-manager again,
which effectively reverts the limit to its original behavior.

Fixes #4483
Message-Id: <8498768e8bccbfa238e6a021f51ec0fa0bf3f7f9.1559649491.git.sarna@scylladb.com>
2019-07-12 19:21:26 +03:00
Vlad Zolotarov
f07c341efc hints_manager: rename the state::ep_state_is_not_normal enum value
Rename this state value to better reflect the reality:
state::ep_state_is_not_normal -> state::ep_state_left_the_ring

The manager gets to this state when the destination Node has left the ring.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2019-05-08 15:46:47 -04:00
Vlad Zolotarov
db2ba0df61 hinted handoff: discard corrupted segments
If we discover that a current segment is corrupted there is nothing we
can do about it.

This patch does the following:
1) Drops the corrupted segment and moves to the next one.
2) Logs such events as ERRORs.
3) Introduces a new metrics that accounts such event.

Fixes #4364

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2019-04-09 15:54:20 -04:00
Duarte Nunes
93a1c27b31 service/storage_proxy: Don't consider view hints for MV backpressure
When a view replica becomes unavailable, updates to it are stored as
hints at the paired based replica. This on-disk queue of pending view
updates grows as long as there are view updated and the view replica
remains unavailable. Currently, we take that relative queue size into
account when calculating the delay for new base writes, in the context
of the backpressure algorithm for materialized views.

However, the way we're calculating that on-disk backlog is wrong,
since we calculate it per-device and then feed it to all the hints
managers for that device. This means that normal hints will show up as
backlog for the view hints manager, which in turn introduces delays.
This can make the view backpressure mechanism kick-in even if the
cluster uses no materialized views.

There's yet another way in which considering the view hints backlog is
wrong: a view replica that is unavailable for some period of time can
cause the backlog to grow to a point where all base writes are applied
the maximum delay of 1 second. This turns a single-node failure into
cluster unavailability.

The fix to both issues is to simply not take this on-disk backlog into
account for the backpressure algorithm.

Fixes #4351
Fixes #4352

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190321170418.25953-1-duarte@scylladb.com>
2019-03-24 20:29:56 +02:00
Vlad Zolotarov
34829b8f81 hinted handoff: cache column family mappings for segments that were not sent out in full
We will try to send a particular segment later (in 1s) from the place
where we left off if it wasn't sent out in full before. However we may miss
some of column family mappings when we get back to sending this file and
start sending from some entry in the middle of it (where we left off)
if we didn't save column family mappings we cached while reading this segment
from its begining.

This happens because commitlog doesn't save a column family information
in every entry but rather once for each uniq column family (version) per
"cycle" (see commitlog::segment description for more info).

Therefore we have to assume that a particular column family mapping
appears once in the whole segment (worst case). And therefore, when we
decide to resume sending a segment we need to keep the column family
mappings we accumulated so far and drop them only after we are done with
this particular segment (sent it out in full).

Fixes #4122

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2019-01-22 15:24:22 -05:00
Vlad Zolotarov
4516a8cfc4 hinted handoff: add a "discarded" metric
Account the amount of hints that were discarded in the send path.
This may happen for instance due to a schema change or because a hint
being to old.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2019-01-22 14:11:09 -05:00
Duarte Nunes
b7517183fa db/commitlog: Use fragmented buffers to read entries
Leverage fragmented_temporary_buffer when reading commit log
entries, avoiding large allocations.

Refs #4020

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-12-31 13:20:37 +00:00
Duarte Nunes
6afbec4685 db/hints: Initialize current backlog
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-12-19 22:38:30 +00:00
Benny Halevy
857ff4f59a database: directly use std::experimental::filesystem::path for lister::path
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2018-12-02 22:02:10 +02:00
Benny Halevy
585ac6e641 database: use std::experimental::filesystem::path for lister::path
We would like to get rid of boost::filesystem and gradually replace it with
std::experimental::filesystem.

TODO: using namespace fs = std::experimental::filesystem,
use fs::path directly, rather than lister::path

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2018-12-02 22:02:10 +02:00
Avi Kivity
1533487ba8 Merge "hinted handoff: give a sender a low priority" from Vlad
"
Hinted handoff should not overpower regular flows like READs, WRITEs or
background activities like memtable flushes or compactions.

In order to achieve this put its sending in the STEAMING CPU scheduling
group and its commitlog object into the STREAMING I/O scheduling group.

Fixes #3817
"

* 'hinted_handoff_scheduling_groups-v2' of https://github.com/vladzcloudius/scylla:
  db::hints::manager: use "streaming" I/O scheduling class for reads
  commitlog::read_log_file(): set the a read I/O priority class explicitly
  db::hints::manager: add hints sender to the "streaming" CPU scheduling group
2018-10-23 16:55:05 +00:00
Vlad Zolotarov
aca0882a3f hinted handoff: enable storing hints before starting messaging_service
When messaging_service is started we may immediately receive a mutation
from another node (e.g. in the MV update context). If hinted handoff is not
ready to store hints at that point we may fail some of MV updates.

We are going to resolve this by start()ing hints::managers before we
start messaging_service and blocking hints replaying until all relevant
objects are initialized.

Refs #3828

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-10-18 16:49:58 -04:00
Vlad Zolotarov
cff4186517 db::hints::manager: add a "started" state
Hinting is allowed after "started" before "stopping".
Hints that attempted to be stored outside this time frame are going to
be dropped.

Refs #3828

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-10-18 16:41:36 -04:00
Vlad Zolotarov
fb513a4b23 db::hints::manager: introduce a _state
Introduce a multi-bit state field. In this patch it replaces the _stopping
boolean. We are going to add more states in the following patches.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-10-18 16:41:33 -04:00
Duarte Nunes
624472d16a db/hints/manager: Expose current backlog
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-10-16 20:35:00 +01:00
Duarte Nunes
6dcb7a39d4 db/hints/manager: Move decision about blocking hints to the manager
The space_watchdog enables or disables hints for the managers
associated with a particular device. We encapsulate this decision
inside the hints::managers by introducing the update_backlog()
function.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-10-16 20:35:00 +01:00
Duarte Nunes
622ac734da db/hints: Disallow moving or copying the managers
Disable the copy and move ctors and assignment operators for both the
hints::manager and the hints::resource_manager.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-10-16 20:32:16 +01:00
Vlad Zolotarov
629972d586 db::hints::manager: add hints sender to the "streaming" CPU scheduling group
Make sure that HH sends do not overpower (CPU wise) regular WRITEs flow.

Fixes #3817

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-10-10 15:22:43 -04:00
Duarte Nunes
74d809f8be db/hints/manager: Use frozen_mutation instead of mutation
Instead of unfreezing a mutation from the commitlog and then freezing
it again to send, just keep the read frozen mutation.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-10-07 19:57:30 +01:00
Asias He
4a0b561376 storage_service: Get rid of moving operation
The moving operation changes a node's token to a new token. It is
supported only when a node has one token. The legacy moving operation is
useful in the early days before the vnode is introduced where a node has
only one token. I don't think it is useful anymore.

In the future, we might support adjusting the number of vnodes to reblance
the token range each node owns.

Removing it simplifies the cluster operation logic and code.

Fixes #3475

Message-Id: <144d3bea4140eda550770b866ec30e961933401d.1533111227.git.asias@scylladb.com>
2018-08-01 11:18:17 +03:00
Vlad Zolotarov
83ba6d84a1 db::hints::manager: implement rebalance() method
Rebalance hints segments that need to be sent among all present shards.

Ensure that after rebalancing the difference between the number of segments
of any two shards is not greater than 1.

Try to minimize the amount of "file rename" operations in order to achieve the needed result.

Note: "Resharding" is a particular case of rebalancing.

Tests: dtest: hintedhandoff_additional_test.py:TestHintedHandoff.hintedhandoff_rebalance_test

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-07-06 19:18:46 -04:00
Piotr Sarna
d22668de04 hints: add device_id to manager
In order to make space_watchdog device-aware, device_id field
is added to hints manager. It's an equivalent of stat.st_dev
and it identifies the disk that contains manager's root directory.

Signed-off-by: Piotr Sarna <sarna@scylladb.com>
2018-06-22 10:26:37 +02:00
Piotr Sarna
204bc17bd7 hints: decouple hints manager metrics from constructor
Now that more than one instance of hints manager can be present
at the same time, registering metrics is moved out of the constructor
to prevent 'registering metrics twice' errors.
2018-06-04 09:46:06 +02:00
Piotr Sarna
f345efc79a hints: move space_watchdog to resource manager
Space watchdog is decoupled from hints manager and moved to resource
manager, so it can be shared among different hints manager instances.
2018-06-04 09:46:01 +02:00
Piotr Sarna
ef40f7e628 hints: move send limiter to resource manager
Send limiting semaphore is moved from hints manager to resource manager.
In consequence, hints manager now keeps a reference to its resource
manager.
2018-06-04 09:35:58 +02:00
Piotr Sarna
2315937854 hints: move constants to resource_manager
Constants related to managing resources are moved to newly created
resource_manager class. Later, this class will be used to manage
(potentially shared) resources of hints managers.
2018-06-04 09:35:58 +02:00
Vlad Zolotarov
48c96d09d6 db::hints::manager: drain hints when the node is decommissioned/removed
When node is decommissioned/removed it will drain all its hints and all
remote nodes that have hints to it will drain their hints to this node.

What "drain" means? - The node that "drains" hints to a specific
destination will ignore failures and will continue sending hints till the end
of the current segment, erase it and move to the next one till there are
no more segments left.

After all hints are drained the corresponding hints directory is removed.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-05-08 22:29:21 +01:00
Vlad Zolotarov
6ede32156f db::hints::manager::end_point_hints_manager::sender: add set_stopping()/stopping() methods
It's nicer to have access methods instead of working directly with enum_set methods and values.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-05-08 22:29:21 +01:00
Vlad Zolotarov
8aedbf9d18 db::hints: manager.hh: cleanup: fix the comments
Fix the comments that went out of sync with the current implementation.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-05-08 22:29:21 +01:00
Vlad Zolotarov
51bbf18c08 db::hints::manager: initial commit
Curently implemented:
   - Hints generation: db::hints::manager::store_hint(...).
   - Sending: db::hints::manager::on_timer().

TODO:
   - Resharding.
   - Node decommission.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-12-14 15:08:07 -05:00