Currently it grabs one from database, but it's not nice to use database
as config/sched-groups provider.
This PR passes the scheduling group to use for sending hints via manager
which, in turn, gets one from proxy via its config (proxy config already
carries configuration for hints manager). The group is initialized in
main.cc code and is set to the maintenance one (nowadays it's the same
as streaming group).
This will help splitting the streaming scheduling group into more
elaborated groups under the maintenance supergroup: SCYLLADB-351
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#28358
Context
-------
The procedure of hint draining boils down to the following steps:
1. Drain a hint sender. That should get rid of all hints stored
for the corresponding endpoint.
2. Remove the hint directory corresponding to that endpoint.
Obviously, it gets more complex than this high-level perspective.
Without blurring the view, the relevant information is that step 1
in the algorithm above may not be executed.
Breaking it down, it comprises of two calls to
`hint_sender::send_hints_maybe()`. The function is responsible for
sending out hints, but it's not unconditional and will not be performed
if any of the following bullets is not satisfied:
* `hint_sender::replay_allowed()` is not `true`. This can happen when
hint replay hasn't been turned on yet.
* `hint_sender::can_send()` is not `true`. This can happen if the
corresponding endpoint is not alive AND it hasn't left the cluster
AND it's still a normal token owner.
There is one more relevant point: sending hints can be stopped if
replaying hints fails and `hint_sender::send_hints_maybe()` returns
`false`. However, that's not not possible in the case of draining.
In that case, if Scylla comes across any failure, it'll simply delete
the corresponding hint segment. Because of that, we ignore it and
only focus on the two bullets.
---
Why is it a problem?
--------------------
If a hint directory is not purged of all hint segments in it,
any attempt to remove it will fail and we'll observe an error like this:
```
Exception when draining <host ID>: std::filesystem::__cxx11::filesystem_error
(error system:39, filesystem error: remove failed: Directory not empty [<path>])
```
The folder with the remaining hints will also stay on disk, which is, of
course, undesired.
---
When can it happen?
-------------------
As highlighted in the Context section of this commit message, the
key part of the code that can lead to a dangerous situation like that
is `hint_sender::send_hints_maybe()`. The function is called twice when
draining a hint endpoint manager: once to purge all of the existing
hints, and another time after flushing all hints stored in a commitlog
instances, but not listed by `hint_sender` yet. If any of those calls
misbehaves, we may end up with a problem. That's why it's crucial to
ensure that the function always goes through ALL of the hints.
Dangerous situations:
1. We try to drain hints before hint replay is allowed. That will
violate the first bullet above.
2. The node we're draining is dead, but it hasn't left the cluster,
and it still possesses some tokens.
---
How do we solve that?
---------------------
Hint replay is turned on in `main.cc`. Once enabled, it cannot be
disabled. So to address the first bullet above, it suffices to ensure
that no draining occurs beforehand. It's perfectly fine to prevent it.
Soon after hint replay is allowed, `main.cc` also asks the hint manager
to drain all of the endpoint managers whose endpoints are no longer
normal token owners (cf. `db::hints::manager::drain_left_nodes()`).
The other bullet is more tricky. It's important here to know that
draining only initiated in three situations:
1. As part of the call to `storage_service::notify_left()`.
2. As part of the call to `storage_service::notify_released()`.
3. As part of the call to `db::hints::manager::drain_left_nodes()`.
The last one is trivially non-problematic. The nodes that it'll try to
drain are no longer normal token owners, so `can_send()` must always
return `true`.
The second situation is similar. As we read in the commit message of
scylladb/scylladb@eb92f50413, which
introduced the notion of released nodes, the nodes are no longer
normal token owners:
> In this patch we postpone the hint draining for the "left" nodes to
> the time when we know that the target nodes no longer hold ownership
> of any tokens - so they're no longer referenced in topology. I'm
> calling such nodes "released".
I suggest reading the full commit message there because the problems
there are somewhat similar these changes try to solve.
Finally, the first situation: unfortunately, it's more tricky. The same
commit message says:
> When a node is being replaced, it enters a "left" state while still
> owning tokens. Before this patch, this is also the time when we start
> draining hints targeted to this node, so the hints may get sent before
> the token ownership gets migrated to another replica, and these hints
> may get lost.
This suggests that `storage_service::notify_left()` may be called when
the corresponding node still has some tokens! That's something that may
prevent properly draining hints.
Fortunately, no hope is lost. We only drain hints via `notify_left()`
when hinted handoff hasn't been upgraded to being host-ID-based yet.
If it has, draining always happens via `notify_released()`.
When I write this commit message, all of the supported versions of
Scylla 2025.1+ use host-ID-based hinted handoff. That means that
problems can only arise when upgrading from an older version of Scylla
(2024.1 downwards). Because of that, we don't cover it. It would most
likely require more extensive changes.
---
Non-issues
----------
There are notions that are closely related to sending hints. One of them
is the host filter that hinted handoff uses. It decides which endpoints
are eligible for receiving hints, and which are not. Fortunately, all
endpoints rejected by the host filter lose their hint endpoint managers
-- they're stopped as part of that procedure. What's more, draining
hints and changing the host filter cannot be happening at the same time,
so it cannot lead to any problems.
The solution
------------
To solve the described issue, we simply prevent draining hints before
hint replay is allowed. No reproducer test is attached because it's not
feasible to write one.
Fixesscylladb/scylladb#27693Closesscylladb/scylladb#27713
Draining hints may occur in one of the two scenarios:
* a node leaves the cluster and the local node drains all of the hints
saved for that node,
* the local node is being decommissioned.
Draining may take some time and the hint manager won't stop until it
finishes. It's not a problem when decommissioning a node, especially
because we want the cluster to retain the data stored in the hints.
However, it may become a problem when the local node started draining
hints saved for another node and now it's being shut down.
There are two reasons for that:
* Generally, in situations like that, we'd like to be able to shut down
nodes as fast as possible. The data stored in the hints won't
disappear from the cluster yet since we can restart the local node.
* Draining hints may introduce flakiness in tests. Replaying hints doesn't
have the highest priority and it's reflected in the scheduling groups we
use as well as the explicitly enforced throughput. If there are a large
number of hints to be replayed, it might affect our tests.
It's already happened, see: scylladb/scylladb#21949.
To solve those problems, we change the semantics of draining. It will behave
as before when the local node is being decommissioned. However, when the
local node is only being stopped, we will immediately cancel all ongoing
draining processes and stop the hint manager. To amend for that, when we
start a node and it initializes a hint endpoint manager corresponding to
a node that's already left the cluster, we will begin the draining process
of that endpoint manager right away.
That should ensure all data is retained, while possibly speeding up
the shutdown process.
There's a small trade-off to it, though. If we stop a node, we can then
remove it. It won't have a chance to replay hints it might've before
these changes, but that's an edge case. We expect this commit to bring
more benefit than harm.
We also provide tests verifying that the implementation works as intended.
Fixesscylladb/scylladb#21949Closesscylladb/scylladb#22811
One of its caller is in the RESTful API which gets ips from the user, so
we convert ips to ids inside the API handler using gossiper before
calling the function. We need to deprecate ip based API and move to host
id based.
these unused includes are identified by clang-include-cleaner.
after auditing the source files, all of the reports have been
confirmed.
please note, since we have `using seastar::shared_ptr` in
`seastarx.h`, this renders `#include <seastar/core/shared_ptr.hh>`
unnecessary if we don't need the full definition of `seastar::shared_ptr`.
so, in this change, all the unused includes are removed. but there are
some headers which are actually used, while still being identified by
this tool. these includes are marked with "IWYU pragma: keep".
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Currently, parsers work with temporary_buffer<char>. This is unsafe
when invoked by bsearch_clustered_cursor, which reuses some of the
parsers, and passes temporary_buffer<char> which is a view onto LSA
buffer which comes from the index file page cache. This view is stable
only around consume(). If parsing requires more than one page, it will
continue with a different input buffer. The old buffer will be
invalid, and it's unsafe for the parser to store and access
it. Unfortunetly, the temporary_buffer API allows sharing the buffer
via the share() method, which shares the underlying memory area. This
is not correct when the underlying is managed by LSA, because storage
may move. Parser uses this sharing when parsing blobs, e.g. clustering
key components. When parsing resumes in the next page, parser will try
to access the stored shared buffers pointing to the previous page,
which may result in use-after-free on the memory area.
In prearation for fixing the problem, parametrize parsers to work with
different kinds of buffers. This will allow us to instantiate them
with a buffer kind which supports sharing of LSA buffers properly in a
safe way.
It's not purely mechanical work. Some parts of the parsing state
machine still works with temporary_buffer<char>, and allocate buffers
internally, when reading into linearized destination buffer. They used
to store this destination in _read_bytes vector, same field which is
used to store the shared buffers. Now it's not possible, since shared
buffer type may be different than temporary_buffer<char>. So those
paths were changed to use a new field: _read_bytes_buf.
Sync points are created, via POST HTTP requests, for a subset of nodes
in the cluster. Those nodes are specified in a request's parameter
`target_hosts`. When the parameter is empty, Scylla should assume
the user wants to create a sync point for ALL nodes.
Before these changes, sync points were created only for LIVE nodes.
If a node was dead but still part of the cluster and the user
requested creating a sync point leaving the parameter `target_hosts`
empty, the dead node was skipped during the creation of the sync point.
That was inconsistent with the guarantees the sync point API provides.
In this commit, we fix that issue and add a test verifying that
the changes have made the implementation compliant with the design
of the sync point API -- the test only passes after this commit.
Fixesscylladb/scylladb#9413Closesscylladb/scylladb#19750
assert() is traditionally disabled in release builds, but not in
scylladb. This hasn't caused problems so far, but the latest abseil
release includes a commit [1] that causes a 1000 insn/op regression when
NDEBUG is not defined.
Clearly, we must move towards a build system where NDEBUG is defined in
release builds. But we can't just define it blindly without vetting
all the assert() calls, as some were written with the expectation that
they are enabled in release mode.
To solve the conundrum, change all assert() calls to a new SCYLLA_ASSERT()
macro in utils/assert.hh. This macro is always defined and is not conditional
on NDEBUG, so we can later (after vetting Seastar) enable NDEBUG in release
mode.
[1] 66ef711d68Closesscylladb/scylladb#20006
There are two places in hints code that need gossiper: hist_sender
calling gossiper::is_alive() and endpoint_downtime_not_bigger_than()
helper in manager. Both can live with const gossiper, so the dependency
references and anchor pointers can be restricted to const too.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In 6e79d64, the behavior of `manager::too_many_in_flight_hints_for()`
was accidentally modified. It remained unnoticed for some time
and then fixed. In this commit, we add a test verifying that
the concurrency of hints being written to disk is indeed limited
and the limitations are imposed properly.
Until now, the constant `HINT_FILE_WRITE_TIMEOUT` was
declared as a static member of `db::hints::manager`.
However, the constant is only ever used in one
translation unit, so it makes more sense to move it
there and not include boilerplate in a header.
Before these changes, it could happen that Scylla initialized
endpoint managers for hint directories representing
* host IDs before migrating hinted handoff to using host IDs,
* IP addresses after the migration.
One scenario looked like this:
1. Start Scylla and upgrade the cluster to using host IDs.
2. Create, by hand, a hint directory representing an IP address.
3. Trigger changing the host filter in hinted handoff; it could
be achieved by, for example, restricting the set of data
centers Scylla is allowed to save hints for.
When changing the host filter, we browse the hint directories
and create endpoint managers if we can send hints towards
the node corresponding to a given hint directory. We only
accepted hint directories representing IP addresses
and host IDs. However, we didn't check whether the local node
has already been upgraded to host-ID-based hinted handoff
or not. As a result, endpoint managers were created for
both IP addresses and host IDs, no matter whether we were
before or after the migration.
These changes make sure that any time we browse the hint
directories, we take that into account.
Fixesscylladb/scylladb#19172Closesscylladb/scylladb#19173
Before hinted handoff is migrated to using host IDs
to identify nodes in the cluster, we keep track
of mappings between hint endpoint managers
identified by host IDs and the hint directories
managed by them and represented by IP addresses.
As a consequence, it may happen that one hint
directory corresponds to multiple nodes
-- it's intended. See 64ba620 for more details.
Before these changes, we only started the draining
process of a hint directory if the node leaving
the cluster corresponded to that hint directory
AND was identified by the same host ID as
the hint endpoint manager managing that directory.
As a result, the draining did not always happen
when it was supposed to.
Draining should start no matter which of the nodes
corresponding to a hint directory is leaving
the cluster. This commit ensures that it happens.
These changes migrate hinted handoff to using
host ID as soon as the corresponding cluster
feature is enabled.
When a node starts, it defaults to creating
directories naming them after IP addresses.
When the whole cluster has upgraded
to a version of Scylla that can handle
directories representing host IDs,
we perform a migration of the IP folders,
i.e. we try to rename them to host IDs.
Invalid directories, i.e. those that
represent neither an IP address, nor a host
ID, are removed.
During the migration, hinted handoff is
disabled. It is necessary because we have
to modify the disk's contents, so new hints
cannot be saved until the migration finishes.
Before these changes, resource manager only handled
the case when directories it browsed represented
valid host IDs. However, since before migrating
hinted handoff to using host IDs we still name
directories after IP addresses, that would lead
to exceptins that shouldn't happen.
We make resource manager handle directories
of arbitrary names correctly.
We start keeping track of mappings IP - host ID.
The mappings are between endpoint managers
(identified by host IDs) and the hint directories
managed by them (represented by IP addresses).
This is a prelude to handling IP directories
by the hint shard manager.
The structure should only be used by the hint
manager before it's migrated to using host IDs.
The reason for that is that we rely on the
information obtained from the structure, but
it might not make sense later on.
When we start creating directories named after
host IDs and there are no longer directories
representing IP addresses, there is no relation
between host IDs and IPs -- just because
the structure is supposed to keep track between
endpoint managers and hint directories that
represent IP addresses. If they represent
host IDs, the connection between the two
is lost.
Still using the data structure could lead
to bugs, e.g. if we tried to associate
a given endpoint manager's host ID with its
corresponding IP address from
locator::token_metadata, it could happen that
two different host IDs would be bound to
the same IP address by the data structure:
node A has IP I1, node A changes its IP to I2,
node B changes its IP to I1. Though nodes
A and B have different host IDs (because they
are unique), the code would try to save hints
towards node B in node A's hint directory,
which should NOT happen.
Relying on the data structure is thus only
safe before migrating hinted handoff to using
host IDs. It may happen that we save a hint
in the hint directory of the wrong node indeed,
but since migration to using host IDs is
a process that only happens once, it's a price
we are ready to pay. It's only imperative to
prevent it from happening in normal
circumstances.
We add a function that will be used while
migrating hinted handoff to using host IDs.
It iterates over existing hint directories
and tries to rename them to the corresponding
host IDs. In case of a failure, we remove
it so that at the end of its execution
the only remaining directories are those
that represent host IDs.
The store_hint() method starts taking both an IP
and a host ID as its arguments. The rationale
for the change is depending on the stage of
the cluster (before an upgrade to the
host-ID-based hinted handdof and after it),
we might need to create a directory representing
either an IP address, or a host ID.
Because locator::topology can change in the
before obtaining the host ID we pass
and when the function is being executed,
we need to pass both parameters explicitly
to ensure the consistency between them.
We extract the initialization of endpoint managers
from the start method of the hint manager
to a separate function and make it handle directories
that represent either IP addresses, or host IDs;
other directories are ignored.
It's necessary because before Scylla is upgraded
to a version that uses host-ID-based hinted handoff,
we need to continue only managing IP directories.
When Scylla has been upgraded, we will need to handle
host ID directories.
It may also happen that after an upgrade (but not
before it), Scylla fails while renaming
the directories, so we end up with some of them
representing IP address, and some representing
host IDs. After these changes, the code handles
that scenario as well.
We change the type of node identifiers
used within the module and fix compilation.
Directories storing hints to specific nodes
are now represented by host IDs instead of
IPs.
In this commit, we postpone the start-up
of the hint manager until we obtain information
about other nodes in the cluster.
When we start the hint managers, one of the
things that happen is creating endpoint
managers -- structures managed by
db::hints::manager. Whether we create
an instance of endpoint manager depends on
the value returned by host_filter::can_hint_for,
which, in turn, may depend on the current state
of locator::topology.
If locator::topology is incomplete, some endpoint
managers may not be started even though they
should (because the target node IS part of the
cluster and we SHOULD send hints to it if there
are some).
The situation like that can happen because we
start the hint managers too early. This commit
aims to solve that problem. We only start
the hint managers when we've gathered information
about the other nodes in the cluster and created
the locator::topology using it.
Hinted Handoff is not negatively affected by these
changes since in between the previous point of
starting the hint managers and the current one,
all of the mutations performed by
service::storage_proxy target the local node, so
no hints would need to be generated anyway.
Fixesscylladb/scylladb#11870Closesscylladb/scylladb#16511
Fixes some typos as found by codespell run on the code.
In this commit, I was hoping to fix only comments, not user-visible alerts, output, etc.
Follow-up commits will take care of them.
Refs: https://github.com/scylladb/scylladb/issues/16255
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
The current comments should be clearer to someone
not familiar with the module. This commit also makes
them abide by the limit of 120 characters per line.
space_watchdog is a friend of shard hint manager just to
be able to execute one of its functions. This commit changes
that by unfriending the class and exposing the function.
Currently, the function doesn't return anything.
However, if the futurue doesn't need to be awaited,
the caller can decide that. There is no reason
to make that decision in the function itself.
This commit makes with_file_update_mutex() a method of hint_endpoint_manager
and introduces db::hints::manager::with_file_update_mutex_for() for accessing
it from the outside. This way, hint_endpoint_manager is hidden and no one
needs to know about its existence.
This commit makes db::hints::manager store service::storage_proxy
as a reference instead of a seastar::shared_ptr. The manager is
owned by storage proxy, so it only lives as long as storage proxy
does. Hence, it makes little sense to store the latter as a shared
pointer; in fact, it's very confusing and may be error-prone.
The field never changes, so it's safe to keep it as a reference
(especially because copy and move constructors of db::hints::manager
are both deleted). What's more, we ensure that the hint manager
has access to storage proxy as soon as it's created.
The same changes were applied to db::hints::resource_manager.
The rationale is the same.
If the variables are accessible from the outside, it makes
sense to also expose their initial values to the user.
This commit moves them to the header and marks as inline.