This function retrieves the persisted timestamp of the last known CDC
generation (which this node is currently gossiping to other nodes).
It checks that the timestamp is present; if not, it throws an error.
The check is unnecessary. It's used only in a quite esoteric place
(start_gossiping, which implements an almost-never-used API call),
and it's fine if the timestamp is gone - in start_gossiping,
we can start gossiping the tokens without the CDC generation timestamp
(well, if the timestamp is not present in system tables, something
weird must have happened, but that doesn't mean we can't resume
gossiping - fixing CDC generation management in such a case is
a separate problem).
The comment mentioned tables that no longer exist: their names have
changed some time ago. Update the comment to be name-agnostic.
Furthemore, the second part of the comment related to a case of "joining
a node without bootstrapping". Fortunately this operation is no longer
possible (after #6848 which became part of Scylla 4.3) so we can shorten
the comment.
Previously, when node A replaced node B, it would obtain B's
generation timestamp from its application state (gossiped by other
nodes) and start gossiping it immediately on bootstrap.
But that's not necessary:
1. if this is the timestamp of the last (current) generation, we would
obtain it from other nodes anyway (every node gossips the last known
timestamp),
2. if this is the timestamp of an earlier generation, we would forget
it immediately and start gossiping the last timestamp (obtained from
other nodes).
This commit simplifies the bootstrap code (in node-replace case) a bit:
the replacing node no longer attempts to retrieve the CDC generation
timestamp from the node being replaced.
I wanted to move caching_option.hh's content to .cc (so that it doesn't
pull in rjson.hh everywhere), but for that, we must first make from_map()
a non-template function.
Luckily, it is only called with one parameter type, so just substitute
that type for the template parameter.
Closes#8406
has_empty is a textbook example of a concept: it checks whether a type
has an empty() method that returns bool. It is now implemented with
enable_if, simplify it to a concept.
I verified that the debug build doesn't contain any incorrect
emtpyable<T> (e.g. for strings).
Closes#8404
The copying and comparing utilities for FragmentedView are not prepared to
deal with empty fragments in non-empty views, and will fall into an infinite
loop in such case.
But data coming in result_row_view can contain such fragments, so we need to
fix that.
Fixes#8398.
Closes#8397
This just causes unneeded and slower recompliations. Instead replace
with forward declarations, or includes of smaller headers that were
incidentally brought in by the one removed. The .cc files that really
need it gain the include, but they are few.
Ref #1.
Closes#8403
These functions were not used anywhere but had to be maintained anyway.
When (if) the expiration algorithm actually gets implemented (see issue #7300),
the functions can be added back (perhaps they will need to look differently
at that time, and it's likely that the `expire` column won't be used in the
expiration algorithm in the end anyway).
In commit c82250e0cf (gossip: Allow deferring
advertise of local node to be up), the replacing node is changed to postpone
the responding of gossip echo message to avoid other nodes sending read
requests to the replacing node. It works as following:
1) replacing node does not respond echo message to avoid other nodes to
mark replacing node as alive
2) replacing node advertises hibernate state so other nodes knows
replacing node is replacing
3) replacing node responds echo message so other nodes can mark
replacing node as alive
This is problematic because after step 2, the existing nodes in the
cluster will start to send writes to the replacing node, but at this
time it is possible that existing nodes haven't marked the replacing
node as alive, thus failing the write request unnecessarily.
For instance, we saw the following errors in issue #8013 (Cassandra
stress fails to achieve consistency when only one of the nodes is down)
```
scylla:
[shard 1] consistency - Live nodes 2 do not satisfy ConsistencyLevel (2
required, 1 pending, live_endpoints={127.0.0.2, 127.0.0.1},
pending_endpoints={127.0.0.3}) [shard 0] gossip - Fail to send
EchoMessage to 127.0.0.3: std::runtime_error (Not ready to respond
gossip echo message)
c-s:
java.io.IOException: Operation x10 on key(s) [4c4f4d37324c35304c30]:
Error executing: (UnavailableException): Not enough replicas available
for query at consistency QUORUM (2 required but only 1 alive
```
To solve this problem, we can do the replacing operation in multiple stages.
One solution is to introduce a new gossip status state as proposed
here: gossip: Introduce STATUS_PREPARE_REPLACE #7416
1) replacing node does not respond echo message
2) replacing node advertises prepare_replace state (Remove replacing
node from natural endpoint, but do not put in pending list yet)
3) replacing node responds echo message
4) replacing node advertises hibernate state (Put replacing node in
pending list)
Since we now have the node ops verb introduced in
829b4c1438 (repair: Make removenode safe
by default), we can do the multiple stage without introducing a new
gossip status state.
This patch uses the NODE_OPS_CMD infrastructure to implement replace
operation.
Improvements:
1) It solves the race between marking replacing node alive and sending
writes to replacing node
2) The cluster reverts to a state before the replace operation
automatically in case of error. As a result, it solves when the
replacing node fails in the middle of the operation, the repacing
node will be in HIBERNATE status forever issue.
3) The gossip status of the node to be replaced is not changed until the
replace operation is successful. HIBERNATE gossip status is not used
anymore.
4) Users can now pass a list of dead nodes to ignore explicitly.
Fixes#8013Closes#8330
* github.com:scylladb/scylla:
repair: Switch to use NODE_OPS_CMD for replace operation
gossip: Add advertise_to_nodes
gossip: Add helper to wait for a node to be up
gossip: Add is_normal_ring_member helper
In `storage_service::removenode`, in "Step 5", services which implement
`endpoint_lifecycle_subscriber` are first notified about the node
leaving the cluster, and only after that the gossiper state is updated
(comments added by me):
// This function indirectly notifies subscribers
ss.excise(std::move(tmp), endpoint);
// This function updates the gossiper state
ss._gossiper.advertise_token_removed(endpoint, host_id).get();
This order is confusing for those subscribers which expect the fact
that the node is leaving to be reflected in the gossiper state - more
specifically, for hints manager.
The hints manager has a function `can_send()` which determines if it is
OK for it to try send hints. More specifically, it looks at the
gossiper state to see if the destination node is ALIVE or if it has
left the ring. The first case is obvious as the destination node will
be able to receive the hints as writes, while the other means that the
hints will be sent with CL=ALL to its new replicas.
When a node leaves the cluster, all hint queues either to or from that
node enter the "drain" mode - the queue will attempt to send out all
hints and will drop those hints which failed to be sent. This mode is
triggered by a notification from the storage_service (hints manager is
a lifecycle subscriber).
The core drain logic for a queue looks as follows:
manager_logger.trace("Draining for {}: start", end_point_key());
set_draining();
send_hints_maybe();
_ep_manager.flush_current_hints().handle_exception([] (auto e) {
manager_logger.error("Failed to flush pending hints: {}. Ignoring...", e);
}).get();
send_hints_maybe();
manager_logger.trace("Draining for {}: end", end_point_key());
And `send_hints_maybe` contains the following loop:
while (replay_allowed() && have_segments() && can_send()) {
if (!send_one_file(*_segments_to_replay.begin())) {
break;
}
_segments_to_replay.pop_front();
++replayed_segments_count;
}
Coming back to the `storage_service::removenode` - because of the order
of `excise` and `advertise_token_removed`, draining starts before the
node which is being removed is removed from gossiper state. In turn,
it might happen that the drain logic calls `send_hints_maybe` twice and
does not send any hints - the loop in that function will immediately
stop because `can_send()` is false because the gossiper state still
reports that the target node is not alive. The logic expects `can_send`
to be true here because the node has left the ring.
This patch changes the order of `excise` and `advertise_token_removed`
in `storage_service::removenode` - now, the first one is called after
the other. This ensures that the gossiper state is updated before
listeners are called, and the race descrbed in the commit message does
not happen anymore - `can_send` is true when the node is being drained.
The race described here was exposed by the following commit:
77a0f1a153Fixes: #5087
Tests:
- unit(dev)
- dtest(hintedhandoff_additional_test.py)
- dtest(topology_test.py)
Closes#8284
This PR changes the `can_send` function so that it looks at the `token_metadata` in order to tell if the destination node is in the ring. Previously, gossiper state was used for that purpose and required a relatively complicated condition to check. The new logic just uses `token_metadata::is_member` which reduces complexity of the `can_send` function.
Additionally, `storage_service` is slightly modified so that during a removenode operation the `token_metadata` is first updated and only then endpoint lifecycle subscribers are notified. This was done in order to prevent a race just like the one which happened in #5087 - hints manager is a lifecycle subscriber and starts a draining operation when a node is removed, and in order for draining to work correctly, `can_send` should keep returning true for that node.
Tests:
- unit(dev)
- dtest(hintedhandoff_additional_test.py)
- dtest(topology_test.py)
Closes#8387
* github.com:scylladb/scylla:
hints: clarify docstring comment for can_send
hints: use token_metadata to tell if node is in the ring
hints: slightly reogranize "if" statement in can_send
storage_service: release token_metadata lock before notify_left
storage_service: notify_left after token_metadata is replicated
As a part of the effort of removing big, contiguous buffers from the codebase,
cql3::raw_value should be made fragmented. Unfortunately a straightforward
rewrite to a fragmented buffer type is not possible, because we want
cql3::raw_value to be compatible with cql3::raw_value_view, and we want that
view to be based on fragmented_temporary_buffer::view, so that it can be
used to view data coming directly from seastar without copying.
This patch makes cql3::raw_value fragmented by making cql3::raw_value_view
a `variant` of managed_bytes_view and fragmented_temporary_buffer::view.
Code users which depended on `cql3::raw_value` being `bytes`,
and cql::raw_value_view being `fragmented_temporary_buffer::view` underneath
were adjusted to the new, dual representation, mainly through the
`cql3::raw_value_view::with_value` visitor and deserialization/validation
helpers added to `cql3::raw_value_view`.
The second part of this series gets rid of linearizations occuring when processing
compound types in the CQL layer. This is achieved by storing their elements in
`managed_bytes` instead of `bytes` in the partially deserialized form (`lists::value`
`tuples::value`, etc.) outputting `managed_bytes` instead of `bytes` in functions
which go from the partially deserialized form to the atomic cell format (for frozen
types), and avoiding calling deserialize/serialize on individual elements when
it's not necessary. (It's only necessary for CQLv2, because since CQLv3 the format
on the wire is the same as our internal one).
The above also forces some changes to `expression.cc`, and `restrictions`, mainly because
`IN` clauses store their arguments as `lists` and `tuples`, and the code which handled
this clause expected `bytes`.
After this series, the path from prepared CQL statements to `atomic_cell_or_collection`
is almost completely linearization-free. The last remaining place is `collection_mutation_description`,
where map keys are linearized to `bytes`.
Closes#8160
* github.com:scylladb/scylla:
cql3: update_parameters: remove unused version of make_cell for bytes_view
types: collection: remove an unused version of pack_fragmented
cql3: optimize the deserialization of collections
cql3: maps, sets: switch the element type from bytes to managed_bytes
cql3: expression: use managed_bytes instead of bytes where possible
cql3: expr: expression: make the argument of to_range a forwarding reference
cql3: don't linearize elements of lists, tuples, and user types
cql3: values: add const managed_bytes& constructor to raw_value_view
cql3: output managed_bytes instead of bytes in get_with_protocol_version
types: collection: add versions of pack for fragmented buffers
types: add write_collection_{value,size} for managed_bytes_mutable_view
cql3: tuples, user_types: avoid linearization in from_serialized() and get()
types: tuple: add build_value_fragmented
cql3: update_parameters: add make_cell version for managed_bytes_view
cql3: remove operation::make_*cell
cql3: values: make raw_value fragmented
cql3: values: remove raw_value_view::operator==
cql3: switch users of cql3::raw_value_view to internals-independent API
cql3: values: add an internals-independent API to raw_value_view
utils: managed_bytes: add a managed_bytes constructor from FragmentedView
utils: managed_bytes: add operator<< and to_hex for managed_bytes
utils: fragment_range: add to_hex
configure: remove unused link dependencies from UUID_test
"
This small patchset restructures the do_wait_admission() to facilitate
future changes to the wait/admission conditions. The changes we want to
facilitate are Benny's flat mutation reader close series and my stalled
readers series. As an added benefit the code is more readable and a
small theoretical corner-case bug is fixed. No logical changes (besides
the small bug-fix).
Tests: unit(dev)
"
* 'reader-concurrency-semaphore-refactor/v1' of https://github.com/denesb/scylla:
reader_concurrency_semaphore: remove now unused may_proceed()
reader_concurrency_semaphore: restructure do_wait_admission()
reader_concurrency_semaphore: extract enqueueing logic into enqueue_waiter()
reader_concurrency_semaphore: make admission conditions consistent
It turned out that all the users of btree can already be converted
to use safer std::strong_ordering. The only meaningful change here
is the btree code itself -- no more ints there.
tests: unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210330153648.27049-1-xemul@scylladb.com>
Before this patch, deserializing a collection from a (prepared) CQL request
involved deserializing every element and serializing it again. Originally this
was a hacky method of validation, and it was also needed to reserialize nested
frozen collections from the CQLv2 format to the CQLv3 format.
But since then we started doing validation separately (before calls to
from_serialized) and CQLv2 became irrelevant, making reserialization of
elements (which, among other things, involves a memory alocation for every
element) pure waste.
This patch adds a faster path for collections in the v3 format, which does not
involve linearizing or reserializing the elements (since v3 is the same as
our internal format).
After this patch, the path from prepared CQL statements to
atomic_cell_or_collection is almost completely linearization-free. The last
remaining place is collection_mutation_description, where map keys are
linearized.
Make to_range able to handle rvalues. We will pass managed_bytes&& to it
in the next patch to avoid pointless copying.
The public declaration of to_range is changed to a concrete function to avoid
having to explicitly instantiate to_range for all possible reference types of
clustering_key_prefix.
This patch switches the type used to store collection elements inside the
intermediate form used in lists::value, tuples::value etc. from bytes
to managed_bytes. After this patch, tuple and list elements are only linearized
in from_serialized, which will be corrected soon.
This commit introduces some additional copies in expression.cc, which
will be dealt with in a future commit.
We will need them to port the representation of collection types
in cql3/ from bytes to managed_bytes.
The version which takes an iterator of `bytes` as an argument will
be removed after that transition is complete.
We will use them to avoid linearization when going from the intermediate
std::vector<bytes> form in cql3/ to the atomic_cell format, by outputting
managed_bytes instead of bytes in get_with_protocol_version.
We will use it to port the representation of collections in cql3/
from bytes to managed_bytes.
The duplicate version for bytes_view will be removed after that transition
is complete.
The operation::make_*cell functions are useless aliases to methods of
update_parameters, and are used interchangeably with them throughout the code.
Remove them.
Also, remove the now-unused update_parameters::make_cell version for
fragmented_temporary_buffer::view.
As a part of the effort of removing big, contiguous buffers from the codebase,
cql3::raw_value should be made fragmented. Unfortunately the change involves
some nontrivial work, because raw_value must be viewable with raw_value_view,
and raw_value_view must accomodate both raw_value (that's where we store
values in prepared queries) and fragmented_temporary_buffer::view
(because that's the type of values coming from the wire).
This patch makes raw_value fragmented, by changing the backing type from
bytes to managed_bytes. raw_value_view is modified accordingly by changing
the backing type from fragmented_temporary_buffer::view to a variant of
fragmented_temporary_buffer::view and managed_bytes_view.
We have prepared the users of raw_value{_view} for this change in preceding
commits.
It's only used in a single test, and there is no reason why it should ever
be used anywhere else. So let's remove it from the public header and move
it to that test.
We want to change the internals of cql3::raw_value{_view}.
However, users of cql3::raw_value and cql3::raw_value_view often
use them by extracting the internal representation, which will be different
after the planned change.
This commit prepares us for the change by making all accesses to the value
inside cql3::raw_value(_view) be done through helper methods which don't expose
the internal representation publicly.
After this commit we are free to change the internal representation of
raw_value_{view} without messing up their users.
Currently, raw_value_view is backed by a fragmented_temporary_buffer::view,
and many users of this type use it by extracting that internal representation.
However, we want to change raw_value_view so that it can be created both
from fragmented_temporary_buffer and from managed_bytes, so that we can switch
the internals of raw_value from bytes to managed_bytes. To do that we need
to prepare all users for that more general representation.
This commit adds an API which allow using raw_value_view without accessing its
internal representation. In the next commits of this series we will switch all
callers who currently depend on that representation to the new API,
and then we will remove the old accessors and change the internals.
Just for convenience. We will use it in an upcoming patch where we switch
the inner representation of cql3::raw_value from bytes to managed_bytes, and we
will want to construct managed_bytes from fragmented_temporary_buffer::view.
We will need them to replace bytes with managed_bytes in some places in an
upcoming patch.
The change to configure.py is necessary because opearator<< links to to_hex
in bytes.cc.
Now, the docstring comment next to can_send better represents the
condition that is checked inside that function. The statement about
returning true when destination left the NORMAL state is replaced with a
statement about returning true when the destination has left the ring.
Now, instead of looking at the gossiper state to check if the
destination node is still in the ring, we are using token_metadata as a
source of truth. This results in much simpler code in can_send() as
token_metadata has an is_member method which does exactly what we want.