Changes the name of storage_proxy::mutate_hint_from_scratch function to
another name, whose meaning is more clear: send_hint_to_all_replicas.
Tests: unit(dev)
When replaying a hint with a destination node that is no longer in the
cluster, it will be sent with cl=ALL to all its new replicas. Before
this patch, the MUTATION verb was used, which causes such hints to be
handled on the same connection and with the same priority as regular
writes. This can cause problems when a large number of hints is
orphaned and they are scheduled to be sent at once. Such situation
may happen when replacing a dead node - all nodes that accumulated hints
for the dead node will now send them with cl=ALL to their new replicas.
This patch changes the verb used to send such hints to HINT_MUTATION.
This verb is handled on a separate connection and with streaming
scheduling group, which gives them similar priority to non-orphaned
hints.
Refs: #4712
Tests: unit(dev)
and replace all calls to dht::global_partitioner().get_token
dht::get_token is better because it takes schema and uses it
to obtain partitioner instead of using a global partitioner.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
The disk-error-handler is purely auxiliary thing that helps
propagating IO errors to the rest of the code. It well
deserves not sitting in the root namespace.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200207112443.18475-1-xemul@scylladb.com>
When a node does not have gossip STATUS application_state, we currently
use an empty string to present such state in get_gossip_status.
It is better to use an explicit "UNKNOWN" to present it. It makes the
log easier to understand when the status is unknown.
Before:
'gossip - InetAddress n2 is now UP, status ='
After:
'gossip - InetAddress n2 is now UP, status = UNKNOWN'
This patch is safe because the STATUS_UNKNOWN is never sent over the
cluster. So the presentation is only internal to the node.
Fixes#5520
To be used by dtest as an indicator that endpoint's hints
were drained and hints directory is removed.
Refs #5354
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
By default, semaphore exceptions bring along very little context:
either that a semaphore was broken or that it timed out.
In order to make debugging easier without introducing significant
runtime costs, a notion of named semaphore is added.
A named semaphore is simply a semaphore with statically defined
name, which is present in its errors, bringing valuable context.
A semaphore defined as:
auto sem = semaphore(0);
will present the following message when it breaks:
"Semaphore broken"
However, a named semaphore:
auto named_sem = named_semaphore(0, named_semaphore_exception_factory{"io_concurrency_sem"});
will present a message with at least some debugging context:
"Semaphore broken: io_concurrency_sem"
It's not much, but it would really help in pinpointing bugs
without having to inspect core dumps.
At the same time, it does not incur any costs for normal
semaphore operations (except for its creation), but instead
only uses more CPU in case an error is actually thrown,
which is considered rare and not to be on the hot path.
Refs #4999
Tests: unit(dev), manual: hardcoding a failure in view building code
Since seastar::streams are based on future/promise, variadic streams
suffer the same fate as variadic futures - deprecation and eventual
removal.
This patch therefore replaces a variadic stream in commitlog::read_log_file()
with a non-variadic stream, via a helper struct.
Tests: unit (dev)
"
Fix races that may lead to use-after-free events and file system level exceptions
during shutdown and drain.
The root cause of use-after-free events in question is that space_watchdog blocks on
end_point_hints_manager::file_update_mutex() and we need to make sure this mutex is alive as long as
it's accessed even if the corresponding end_point_hints_manager instance
is destroyed in the context of manager::drain_for().
File system exceptions may occur when space_watchdog attempts to scan a
directory while it's being deleted from the drain_for() context.
In case of such an exception new hints generation is going to be blocked
- including for materialized views, till the next space_watchdog round (in 1s).
Issues that are fixed are #4685 and #4836.
Tested as follows:
1) Patched the code in order to trigger the race with (a lot) higher
probability and running slightly modified hinted handoff replace
dtest with a debug binary for 100 times. Side effect of this
testing was discovering of #4836.
2) Using the same patch as above tested that there are no crashes and
nodes survive stop/start sequences (they were not without this series)
in the context of all hinted handoff dtests. Ran the whole set of
tests with dev binary for 10 times.
"
* 'hinted_handoff_race_between_drain_for_and_space_watchdog_no_global_lock-v2' of https://github.com/vladzcloudius/scylla:
hinted handoff: fix a race on a directory removal between space_watchdog and drain_for()
hinted handoff: make taking file_update_mutex safe
db::hints::manager::drain_for(): fix alignment
db::hints::manager: serialize calls to drain_for()
db::hints: cosmetics: identation and missing method qualifier
This patch silences those future discard warnings where it is clear that
discarding the future was actually the intent of the original author,
*and* they did the necessary precautions (handling errors). The patch
also adds some trivial error handling (logging the error) in some
places, which were lacking this, but otherwise look ok. No functional
changes.
The endpoint directories scanned by space_watchdog may get deleted
by the manager::drain_for().
If a deleted directory is given to a lister::scan_dir() this will end up
in an exception and as a result a space_watchdog will skip this round
and hinted handoff is going to be disabled (for all agents including MVs)
for the whole space_watchdog round.
Let's make sure this doesn't happen by serializing the scanning and deletion
using end_point_hints_manager::file_update_mutex.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
end_point_hints_manager::file_update_mutex is taken by space_watchdog
but while space_watchdog is waiting for it the corresponding
end_point_hints_manager instance may get destroyed by manager::drain_for()
or by manager::stop().
This will end up in a use-after-free event.
Let's change the end_point_hints_manager's API in a way that would prevent
such an unsafe locking:
- Introduce the with_file_update_mutex().
- Make end_point_hints_manager::file_update_mutex() method private.
Fixes#4685Fixes#4836
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
If drain_for() is running together with itself: one instance for the local
node and one for some other node, erasing of elements from the _ep_managers
map may lead to a use-after-free event.
Let's serialize drain_for() calls with a semaphore.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Current cql transport code acquire a permit before processing a query and
release it when the query gets a reply, but some quires leave work behind.
If the work is allowed to accumulate without any limit a server may
eventually run out of memory. To prevent that the permit system should
account for the background work as well. The patch is a first step in
this direction. It passes a permit down to storage proxy where it will
be later hold by background work.
The resource manager is used to manage common resources between
various hints managers. In-flight hints used to be one of the shared
resources, but it proves to cause starvation, when one manager eats
the whole limit - which may be especially painful if the background
materialized views hints manager starves the regular hints manager,
which can in turn start failing user writes because of admission control.
This patch makes the limit per-manager again,
which effectively reverts the limit to its original behavior.
Fixes#4483
Message-Id: <8498768e8bccbfa238e6a021f51ec0fa0bf3f7f9.1559649491.git.sarna@scylladb.com>
Rename this state value to better reflect the reality:
state::ep_state_is_not_normal -> state::ep_state_left_the_ring
The manager gets to this state when the destination Node has left the ring.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
When node is in a DN state its gossiper state may be NORMAL, SHUTDOWN
or "" depending on the use case.
In addition to that if node has been removed from the ring its state is
also going to be removed from the gossiper_state map.
Let's consider the above when deciding if node is in the DN state.
Fixes#4461
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
gossiper::is_alive() has a lot of not needed checks (e.g. is_me(ep)) that
are irrelevant for HH use case and we may safely skip them.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
To prepare for a seastar change that adds an optional file_permissions
parameter to touch_directory and recursive_touch_directory.
This change messes up the call to io_check since the compiler can't
derive the Func&& argument. Therefore, use a lambda function instead
to wrap the call to {recursive_,}touch_directory.
Ref #4395
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190421085502.24729-1-bhalevy@scylladb.com>
If we discover that a current segment is corrupted there is nothing we
can do about it.
This patch does the following:
1) Drops the corrupted segment and moves to the next one.
2) Logs such events as ERRORs.
3) Introduces a new metrics that accounts such event.
Fixes#4364
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Hinted handoff doesn't utilize this feature (which was developed with a
commitlog in mind).
Since it's enabled by default we need to explicitly disable it.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Rather than {std::experimental,boost,seastar::compat}::filesystem
On Sat, 2019-03-23 at 01:44 +0200, Avi Kivity wrote:
> The intent for seastar::compat was to allow the application to choose
> the C++ dialect and have seastar follow, rather than have seastar choose
> the types and have the application follow (as in your patch).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When a view replica becomes unavailable, updates to it are stored as
hints at the paired based replica. This on-disk queue of pending view
updates grows as long as there are view updated and the view replica
remains unavailable. Currently, we take that relative queue size into
account when calculating the delay for new base writes, in the context
of the backpressure algorithm for materialized views.
However, the way we're calculating that on-disk backlog is wrong,
since we calculate it per-device and then feed it to all the hints
managers for that device. This means that normal hints will show up as
backlog for the view hints manager, which in turn introduces delays.
This can make the view backpressure mechanism kick-in even if the
cluster uses no materialized views.
There's yet another way in which considering the view hints backlog is
wrong: a view replica that is unavailable for some period of time can
cause the backlog to grow to a point where all base writes are applied
the maximum delay of 1 second. This turns a single-node failure into
cluster unavailability.
The fix to both issues is to simply not take this on-disk backlog into
account for the backpressure algorithm.
Fixes#4351Fixes#4352
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190321170418.25953-1-duarte@scylladb.com>
We will try to send a particular segment later (in 1s) from the place
where we left off if it wasn't sent out in full before. However we may miss
some of column family mappings when we get back to sending this file and
start sending from some entry in the middle of it (where we left off)
if we didn't save column family mappings we cached while reading this segment
from its begining.
This happens because commitlog doesn't save a column family information
in every entry but rather once for each uniq column family (version) per
"cycle" (see commitlog::segment description for more info).
Therefore we have to assume that a particular column family mapping
appears once in the whole segment (worst case). And therefore, when we
decide to resume sending a segment we need to keep the column family
mappings we accumulated so far and drop them only after we are done with
this particular segment (sent it out in full).
Fixes#4122
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Account the amount of hints that were discarded in the send path.
This may happen for instance due to a schema change or because a hint
being to old.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
This header, which is easily replaced with a forward declaration,
introduces a dependency on database.hh everywhere. Remove it and scatter
includes of database.hh in source files that really need it.
When reading the header chunk of a commitlog file, check the stored id
value against the id derived from the file name, and ignore if
mismatched. This is a prerequisite for re-using renamed commitlog files,
as we can then fail-fast should one such be left on disk, instead of
trying to replay it.
We also check said id via the CRC check for each chunk parsed. If we
find a chunk with
mismatched id, we will get a CRC error for the chunk, and replay will
terminate (albeit not gracefully).
We would like to get rid of boost::filesystem and gradually replace it with
std::experimental::filesystem.
TODO: using namespace fs = std::experimental::filesystem,
use fs::path directly, rather than lister::path
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently if hints directory contains unexpected directories Scylla fails to
start with unhandled std::invalid_argument exception. Make the manager
ignore malformed files instead and try to proceed anyway.
Message-Id: <20181121134618.29936-2-gleb@scylladb.com>
We scan hints directory in two places: to search for files to replay and
to search for directories to remove after resharding. The code that
translates directory name to a shard is duplicated. It is simple now, so
not a bit issue but in case it grows better have it in one place.
Message-Id: <20181121134618.29936-1-gleb@scylladb.com>
"
Hinted handoff should not overpower regular flows like READs, WRITEs or
background activities like memtable flushes or compactions.
In order to achieve this put its sending in the STEAMING CPU scheduling
group and its commitlog object into the STREAMING I/O scheduling group.
Fixes#3817
"
* 'hinted_handoff_scheduling_groups-v2' of https://github.com/vladzcloudius/scylla:
db::hints::manager: use "streaming" I/O scheduling class for reads
commitlog::read_log_file(): set the a read I/O priority class explicitly
db::hints::manager: add hints sender to the "streaming" CPU scheduling group
When messaging_service is started we may immediately receive a mutation
from another node (e.g. in the MV update context). If hinted handoff is not
ready to store hints at that point we may fail some of MV updates.
We are going to resolve this by start()ing hints::managers before we
start messaging_service and blocking hints replaying until all relevant
objects are initialized.
Refs #3828
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Hinting is allowed after "started" before "stopping".
Hints that attempted to be stored outside this time frame are going to
be dropped.
Refs #3828
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Introduce a multi-bit state field. In this patch it replaces the _stopping
boolean. We are going to add more states in the following patches.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>