Modeled after get_live_members_synchronized,
get_unreachable_members_synchronized calls
replicate_live_endpoints_on_change to synchronize
the state of unreachable_members on all shards.
Fixes scylladb/scylladb#15088
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently the gossiper marks endpoint_state objects as alive/dead.
In some cases the endpoint_state::is_alive function is checked, but in many other cases
gossiper::is_alive(endpoint) is used to determine if the endpoint is alive.
This series removes the endpoint_state::is_alive state and moves all the logic to gossiper::is_alive,
which bases its decision on the endpoint having an endpoint_state and being in the _live_endpoints set.
For that, _live_endpoints is now guaranteed to be replicated to all shards when changed,
endpoint_state changes are serialized under lock_endpoint, and the
endpoint_state in the _endpoint_states_map is never updated in place; rather, a temporary copy is changed
and then safely replicated using gossiper::replicate.
Refs https://github.com/scylladb/scylladb/issues/14794
Closes #14801
* github.com:scylladb/scylladb:
gossiper: mark_alive: remove local_state param
endpoint_state: get rid of _is_alive member and methods
gossiper: is_alive: use _live_endpoints
gossiper: evict_from_membership: erase endpoint from _live_endpoints
gossiper: replicate_live_endpoints_on_change: use _live_endpoints_version to detect change
gossiper: run: no need to replicate live_endpoints
gossiper: fold update_live_endpoints_version into replicate_live_endpoints_on_change
gossiper: add mutate_live_and_unreachable_endpoints
gossiper: reset_endpoint_state_map: clear also shadow endpoint sets
gossiper: reset_endpoint_state_map: clear live/unreachable endpoints on all shards
gossiper: functions that change _live_endpoints must be called on shard 0
gossiper: add lock_endpoint_update_semaphore
gossiper: make _live_endpoints an unordered_set
endpoint_state: use gossiper::is_alive externally
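To make the end state concrete, here is a minimal, self-contained model of the scheme described above (hypothetical code, not the actual ScyllaDB definitions): liveness is derived from set membership plus the presence of an endpoint_state, rather than from a flag stored on endpoint_state.
```cpp
#include <memory>
#include <string>
#include <unordered_map>
#include <unordered_set>

// Hypothetical model: endpoint_state no longer carries an _is_alive flag.
struct endpoint_state {};

struct gossiper {
    // Authoritative on shard 0; replicated to every shard on change.
    std::unordered_set<std::string> _live_endpoints;
    std::unordered_map<std::string, std::shared_ptr<endpoint_state>> _endpoint_states_map;

    // Alive == the endpoint has an endpoint_state AND is in _live_endpoints.
    bool is_alive(const std::string& endpoint) const {
        return _endpoint_states_map.contains(endpoint)
            && _live_endpoints.contains(endpoint);
    }
};
```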
The schema version is updated by group0, so if group0 starts before
the schema version observer is registered, some updates may be missed. Since
the observer is used to update the node's gossiper state, the gossiper may
contain a wrong schema version.
Fix by registering the observer before starting group0, and even before
starting the gossiper, to avoid the theoretical case where something pulls
the schema after gossiping has started but before the observer is registered.
Fixes: #15078
Message-Id: <ZOYZWhEh6Zyb+FaN@scylladb.com>
We make every alternative type in the request_param variant
a named struct to make the code more readable. Additionally, this
change will make extending request parameters easier if we decide
to do so in the future.
Closes #15132
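As an illustration of the pattern (a sketch only; the alternative names match the series later in this log, but the field layouts here are invented, not the actual service::request_param definition), named structs keep alternatives distinguishable even when their underlying data is similar, and each can grow new fields independently:
```cpp
#include <string>
#include <unordered_set>
#include <variant>

// Hypothetical alternatives: two of them may hold the same kind of data
// (a set of node ids) yet remain distinct types, and std::visit or
// std::get_if reads naturally at the call sites.
struct removenode_param { std::unordered_set<std::string> ignored_ids; };
struct replace_param    { std::unordered_set<std::string> ignored_ids; };

using request_param = std::variant<std::monostate, removenode_param, replace_param>;
```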
Some runtime errors thrown in storage_service::raft_removenode
start with the "raft topology " prefix. Since "raft topology" is
an implementation detail, we don't want to expose this information
through the user API. Only logs should contain it.
Closes #15136
When casting a float or double column to a string with `CAST(f AS TEXT)`,
Scylla is expected to print the number with enough digits so that reading
that string back to a float or double restores the original number
exactly. This expectation isn't documented anywhere, but makes sense,
and is what Cassandra does.
Before commit 71bbd7475c, this wasn't the
case in Scylla: `CAST(f AS TEXT)` always printed 6 digits of precision,
which was slightly too few for a float (which can have 7 decimal digits
of precision), but very much not enough for a double (which can need 15
digits). The origin of this magic "6 digits" number was that Scylla uses
seastar::to_sstring() to print the float and double values, and before
the aforementioned commit those functions used sprintf with the "%g"
format - which always prints 6 decimal digits of precision! After that
commit, to_sstring() now uses a different approach (based on fmt) to
print the float and double values, that prints all significant digits.
This patch adds a regression test for this bug: We write float and double
values to the database, cast them to text, and then recover the float
or double number from that text - and check that we get back exactly the
same float or double object. The test *fails* before the aforementioned
commit, and passes after it. It also passes on Cassandra.
Refs #15127
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes #15131
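The effect is easy to reproduce outside Scylla; a standalone sketch (plain C formatting, not Scylla's actual code path) contrasting "%g"'s 6 significant digits with an explicit round-trip precision:
```cpp
#include <cstdio>
#include <cstdlib>

int main() {
    double d = 0.1234567890123456;   // needs ~16-17 significant digits
    char buf[64];

    std::snprintf(buf, sizeof buf, "%g", d);     // prints "0.123457" - 6 digits
    std::printf("%s round-trips: %d\n", buf, std::strtod(buf, nullptr) == d);   // 0

    std::snprintf(buf, sizeof buf, "%.17g", d);  // enough digits for any double
    std::printf("%s round-trips: %d\n", buf, std::strtod(buf, nullptr) == d);   // 1
}
```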
And verify that the returned host_id isn't null.
Call on_internal_error_noexcept in that case
since all token owners are expected to have their
host_id set. Aborting in testing would help fix
issues in this area.
Fixes scylladb/scylladb#14843
Refs scylladb/scylladb#14793
Closes #14844
* github.com:scylladb/scylladb:
api: storage_service: improve description of /storage_service/host_id
token_metadata: get_endpoint_to_host_id_map_for_reading: restrict to token owners
In this series, we also print the sstable name and pk when writing a tombstone whose local_deletion_time (ldt for short) is greater than INT32_MAX and cannot be represented by a uint32_t.
Fixes #15015
Closes #15107
* github.com:scylladb/scylladb:
sstable/writer: log sstable name and pk when capping ldt
test: sstable_compaction_test: add a test for capped tombstone ldt
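A sketch of the capping-with-context logging described above (hypothetical helper and types; the actual sstable writer code differs):
```cpp
#include <cstdint>
#include <iostream>
#include <limits>
#include <string>

// Cap a 64-bit local_deletion_time that the on-disk 32-bit field cannot
// hold, and log the sstable name and partition key so the offending
// data can be located.
int64_t maybe_cap_ldt(int64_t ldt, const std::string& sstable_name, const std::string& pk) {
    constexpr int64_t max_ldt = std::numeric_limits<int32_t>::max();
    if (ldt > max_ldt) {
        std::cerr << "Capping tombstone local_deletion_time " << ldt
                  << " in sstable " << sstable_name << ", pk " << pk << "\n";
        return max_ldt;
    }
    return ldt;
}
```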
scylla-gdb.py has two methods for iterating over all tables:
* all_tables()
* for_each_table()
Despite this, many places in the code iterate over the column family map
directly. This patch leaves just a single method (for_each_table()) and
migrates the whole codebase to use it, instead of iterating over the raw
map. While at it, the access to the map is made backward compatible with
code predating 52afd9d42d, the commit that wrapped database::_column_families in
a tables_metadata object; that change had broken scylla-gdb.py for older versions.
Closes #15121
We add support for `--ignore-dead-nodes` in `raft_removenode` and
`--ignore-dead-nodes-for-replace` in `raft_replace`. For now, we allow
passing only host ids of the ignored nodes. Supporting IPs is currently
impossible because `raft_address_map` doesn't provide a mapping from IP
to a host id.
The main steps of the implementation are as follows:
- add the `ignore_nodes` column to `system.topology`,
- set the `ignore_nodes` value of the topology mutation in `raft_removenode` and `raft_replace`,
- extend `service::request_param` with alternative types that allow storing a set of ids of the ignored nodes,
- load `ignore_nodes` from `system.topology` into `request_param` in `system_keyspace::load_topology_state`,
- add `ignore_nodes` to `exclude_nodes` in `topology_coordinator::exec_global_command`,
- pass `ignore_nodes` to `replace_with_repair` and `remove_with_repair` in `storage_service::raft_topology_cmd_handler`.
Additionally, we add `test_raft_ignore_nodes.py` with two tests that verify the added changes.
Fixes #15025
Closes #15113
* github.com:scylladb/scylladb:
test: add test_raft_ignore_nodes
test: ManagerClient.remove_node: allow List[HostId] for ignore_dead
raft topology: pass ignore_nodes to {replace, remove}_with_repair
raft topology: exec_global_command: add ignore_nodes to exclude_nodes
raft topology: exec_global_command: change type of exclude_nodes
topology_state_machine: extend request_param with a set of raft ids
raft topology: set ignore_nodes in raft_removenode and raft_replace
utils: introduce split_comma_separated_list
raft topology: add the ignore_nodes column to system.topology
In this PR a simple test for fencing is added. It exercises the data
plane, meaning that if a node somehow ends up with a stale topology
version, requests from this node will get a 'stale
topology' error. The test just decrements the node version manually through
CQL, so it's quite artificial. To test a more real-world scenario we
need to allow the topology change fiber to sometimes skip unavailable
nodes. Now the algorithm fails and retries indefinitely in this case.
The PR also adds some logs, and removes one seemingly redundant topology
version increment, see the commit messages for details.
Closes #14901
* github.com:scylladb/scylladb:
test_fencing: add test_fence_hints
test.py: output the skipped tests
test.py: add skip_mode decorator and fixture
test.py: add mode fixture
hints: add debug log for dropped hints
hints: send_one_hint: extend the scope of file_send_gate holder
pylib: add ScyllaMetrics
hints manager: add send_errors counter
token_metadata: add debug logs
fencing: add simple data plane test
random_tables.py: add counter column type
raft topology: don't increment version when transitioning to node_state::normal
The default reconnection policy in Python Driver is an exponential
backoff (with jitter) policy, which starts at 1 second reconnection
interval and ramps up to 600 seconds.
This is a problem in tests (refs #15104), especially in tests that restart
or replace nodes. In such a scenario, a node can be unavailable for an
extended period of time and the driver will try to reconnect to it
multiple times, eventually reaching very long reconnection interval
values, exceeding the timeout of a test.
Fix the issue by using an exponential reconnection policy with a maximum
interval of 4 seconds. A smaller value was not chosen, as each retry
clutters the logs with a reconnection exception stack trace.
Fixes #15104
Closes #15112
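For reference, the shape of such a policy (exponential growth with jitter, capped at a maximum interval), sketched generically in C++; this is not the Python driver's actual implementation:
```cpp
#include <algorithm>
#include <cmath>
#include <random>

// Delay before the n-th reconnection attempt: base * 2^n with +/-15%
// jitter, never exceeding max_delay (4s in the fix above, 600s by default).
double reconnect_delay(int attempt, double base = 1.0, double max_delay = 4.0) {
    static std::mt19937 rng{std::random_device{}()};
    std::uniform_real_distribution<double> jitter(0.85, 1.15);
    double delay = std::min(base * std::pow(2.0, attempt), max_delay);
    return std::min(delay * jitter(rng), max_delay);
}
```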
This short series deals with two stall sources in row-level repair `to_repair_rows_list`:
1. Freeing the input `repair_rows_on_wire` in one shot on return (as seen in https://github.com/scylladb/scylladb/issues/14537)
2. Freeing the result `row_list` in one shot on error. This hasn't been seen in testing, but there is no reason to believe it isn't susceptible to exactly the same stalls as `repair_rows_on_wire`, given the same number of rows and mutations.
Fixes https://github.com/scylladb/scylladb/issues/14537
Closes #15102
* github.com:scylladb/scylladb:
repair: reindent to_repair_rows_list
repair: to_repair_rows_list: clear_gently on error
repair: to_repair_rows_list: consume frozen rows gently
We add two tests verifying that --ignore-dead-nodes in
raft_removenode and --ignore-dead-nodes-for-replace in
raft_replace are handled correctly.
We need a 7-node cluster to keep a Raft majority. Therefore, these
tests are quite slow, and we want to run them only in the dev mode.
ManagerClient.remove_node allows passing ignore_dead only as
List[IPAddress]. However, raft_removenode currently supports
only host ids. To write a test that passes ignore_dead to
ManagerClient.remove_node in the Raft topology mode, we allow
passing ignore_dead as List[HostId].
Note that we don't want to use List[IPAddress | HostId] because
mixing IP addresses and host ids fails anyway. See
ss::remove_node.set(...) in api::set_storage_service.
To properly stream ranges during the removenode or replace
operation in the Raft topology mode, we pass IPs of the ignored
nodes to replace_with_repair and remove_with_repair in
storage_service::raft_topology_cmd_handler.
We add ignore_nodes to exclude_nodes in exec_global_command
to ignore nodes marked as dead by --ignore-dead-nodes for
raft_removenode and --ignore-dead-nodes-for-replace for
raft_replace.
We extend exclude_nodes in exec_global_command with ignore_nodes
in the next commit. Since we already use std::unordered_set to
store ids of the ignored nodes and their number is unknown, we
change the type of exclude_nodes from utils::small_vector to
std::unordered_set.
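A two-line illustration of why the set type helps here (hypothetical ids): merging ignore_nodes into exclude_nodes deduplicates by construction.
```cpp
#include <unordered_set>

int main() {
    std::unordered_set<int> exclude_nodes{1, 2};
    std::unordered_set<int> ignore_nodes{2, 3};
    // Average O(1) per element; duplicates collapse automatically.
    exclude_nodes.insert(ignore_nodes.begin(), ignore_nodes.end());
    // exclude_nodes is now {1, 2, 3}.
}
```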
We add two new alternative types to service::request_param:
removenode_param and replace_param. They allow storing the list
of ignored nodes loaded from the ignore_nodes column of
system.topology. We also remove the raft::server_id type because
it was only used by the replace operation.
To handle --ignore-dead-nodes in raft_removenode and
--ignore-dead-nodes-for-replace in raft_replace, we set the
ignore_nodes value of the topology mutation in these functions. In
the following commits, we ensure that the topology coordinator
properly makes use of it.
The test makes a write through the first node with
the third node down; this causes a hint to be stored on the
first node for the third. We increment the version
and fence_version on the third node, restart it,
and expect to see a hint delivery failure
because of the version mismatch. Then we update the versions
of the first node and expect the hint to be successfully
delivered.
The pytest option -rs forces it to print
all the skipped tests along with
the reasons. Without this option we
can't tell why certain tests were skipped;
maybe some of them shouldn't be skipped anymore.
Syntactic sugar for marking tests to be
skipped in a particular mode.
There are skip_in_debug/skip_in_release options in suite.yaml,
but they can be applied only to the entire file,
which is unnatural and inconvenient. Also, they
don't allow specifying a reason why the test is skipped.
A separate dictionary, skipped_funcs, is needed since
we can't use pytest fixtures in decorators.
The problem was that the holder in the with_gate
call was released too early. This happened
before the possible call to on_hint_send_failure
in then_wrapped. As a result, the effects of
on_hint_send_failure (the segment_replay_failed flag)
were not visible in send_one_file after
ctx_ptr->file_send_gate.close(), so we could decide
that the segment was sent in full and delete
it even if sending some hints led to errors.
Fixes #15110
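The bug pattern, sketched with a simplified seastar gate (do_send and handle_failure are hypothetical stand-ins for the hint-sending code):
```cpp
#include <seastar/core/future.hh>
#include <seastar/core/gate.hh>

seastar::future<> do_send();              // hypothetical hint send
void handle_failure(std::exception_ptr);  // hypothetical: sets segment_replay_failed

// Buggy: with_gate() leaves the gate once the inner future resolves, so
// the then_wrapped continuation - which records failures - may run after
// gate.close() has already completed, and its effects go unobserved.
seastar::future<> send_buggy(seastar::gate& g) {
    return seastar::with_gate(g, [] { return do_send(); })
        .then_wrapped([] (seastar::future<> f) {
            if (f.failed()) { handle_failure(std::move(f).get_exception()); }
        });
}

// Fixed: the failure handling runs inside the gated scope, so closing the
// gate waits for it too.
seastar::future<> send_fixed(seastar::gate& g) {
    return seastar::with_gate(g, [] {
        return do_send().then_wrapped([] (seastar::future<> f) {
            if (f.failed()) { handle_failure(std::move(f).get_exception()); }
        });
    });
}
```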
This patch adds facilities to work
with Scylla metrics from test.py tests.
A new metrics property is added to
ManagerClient; its query method
sends a request to the Scylla metrics
endpoint and returns an object
for conveniently accessing the result.
ScyllaMetrics is copy-pasted from
test_shedding.py. It's difficult
to reuse code between the 'new' and 'old'
styles of tests; we can't just import
pylib in 'old' tests because of some
problems with Python search directories.
A past commit of mine that attempted
to solve this problem was rejected in review.
There was no indication of problems
in the hints manager metrics before.
We need this counter for fencing tests
in a later commit, but it seems to be
useful on its own.
We log the new version when the new token
metadata is set.
Also, the fence_version log is moved
from storage_service into shared_token_metadata
for uniformity.
The test starts a three node cluster
and manually decrements the version on
the last node. It then tries to write
some data through the last node and
expects to get a 'stale topology' exception.
Use the presence of the endpoint in _live_endpoints
as the authoritative source for is_alive,
rather than the endpoint_state::is_alive status.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than keeping expensive copies of `_live_endpoints`
and `_unreachable_endpoints` in shadow members (which aren't
currently used for their content anyhow) just to detect when
the corresponding members change, use `_live_endpoints_version`
to detect the change.
With that, the function is renamed to replicate_live_and_unreachable_endpoints.
This still doesn't provide strong exception safety guarantees,
but at least we don't "cheat" about shard state
and we don't mark shard 0 state as "replicated" by
updating the shadow members.
Also, we save some unneeded allocations.
Refs scylladb/scylladb#15089
Refs scylladb/scylladb#15088
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
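The detection logic reduces to a counter comparison; a minimal sketch (names mirror the commit message, the real code differs):
```cpp
#include <cstdint>

struct gossiper_replication_state {
    uint64_t _live_endpoints_version = 0;  // bumped on every mutation (shard 0)
    uint64_t _replicated_version = 0;      // version last pushed to all shards

    // Replication is needed iff a mutation happened since the last push -
    // no shadow copies of the sets are required to answer this.
    bool needs_replication() const {
        return _replicated_version != _live_endpoints_version;
    }
};
```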
As asias@scylladb.com noticed, after the previous
patch that calls replicate_live_endpoints_on_change
in mutate_live_and_unreachable_endpoints, _live_endpoints
are always updated on all shards when they change,
so there's no need to replicate them here anymore.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We want to propagate any change to _live_endpoints
to all shards. Currently we just update the `_live_endpoints_version`
and `replicate_live_endpoints_on_change` propagates the
change at some undetermined time in the future.
To rely on `_live_endpoints` for gossiper::is_alive,
that may be called on any shard, we want to propagate
the change to all shards as soon as it happens.
Use `mutate_live_and_unreachable_endpoints` to update
_live_endpoints and/or _unreachable_endpoints safely,
under `lock_endpoint_update_semaphore`. It is responsible
for incrementing _live_endpoints_version and
calling `replicate_live_endpoints_on_change` to
propagate the change to all shards.
Refs scylladb/scylladb#15089
Refs scylladb/scylladb#15088
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
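A simplified, self-contained model of the flow described above (hypothetical names and stubbed replication; the real member function and its types differ):
```cpp
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <seastar/core/semaphore.hh>
#include <functional>
#include <unordered_set>

struct endpoint_sets {
    std::unordered_set<int> live;
    std::unordered_set<int> unreachable;
};

seastar::semaphore update_sem{1};   // stands in for lock_endpoint_update_semaphore
uint64_t live_endpoints_version = 0;
endpoint_sets authoritative;        // the shard-0 copy

// Stub: the real code copies the sets to every other shard.
seastar::future<> replicate_live_and_unreachable_endpoints() {
    return seastar::make_ready_future<>();
}

// All writers funnel through here: mutate under the semaphore, bump the
// version, and replicate before returning, so every shard sees the change
// as soon as it happens.
seastar::future<> mutate_live_and_unreachable_endpoints(std::function<void(endpoint_sets&)> func) {
    auto units = co_await seastar::get_units(update_sem, 1);
    func(authoritative);
    ++live_endpoints_version;
    co_await replicate_live_and_unreachable_endpoints();
}
```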
To be used for safely modifying _live_endpoints
and/or _unreachable_endpoints and replicating the
new version to all shards.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Three places handle comma-separated lists similarly:
- ss::remove_node.set(...) in api::set_storage_service,
- storage_service::parse_node_list,
- storage_service::is_repair_based_node_ops_enabled.
In the next commit, the fourth place that needs the same logic
appears -- storage_service::raft_replace. It needs to load
and parse the --ignore-dead-nodes-for-replace param from config.
Moreover, the code in is_repair_based_node_ops_enabled is
different and doesn't seem right. We replace '\"' and '\'' with ' '
but don't do anything with the result afterward.
To avoid code duplication and fix is_repair_based_node_ops_enabled,
we introduce the new function utils::split_comma_separated_list.
This change has a small side effect on logging. For example,
ignore_nodes_strs in storage_service::parse_node_list might be
printed in a slightly different form.
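A sketch of the behavior such a helper needs per the description above: split on commas, strip whitespace and quote characters, and drop empty items (the real utils::split_comma_separated_list may differ in details):
```cpp
#include <string>
#include <string_view>
#include <vector>

std::vector<std::string> split_comma_separated_list(std::string_view str) {
    std::vector<std::string> out;
    size_t pos = 0;
    while (pos <= str.size()) {
        size_t comma = str.find(',', pos);
        auto len = (comma == std::string_view::npos) ? std::string_view::npos
                                                     : comma - pos;
        std::string_view item = str.substr(pos, len);
        // Trim whitespace and quote characters from both ends, instead of
        // substituting them in place and ignoring the result.
        size_t b = item.find_first_not_of(" \t\"'");
        size_t e = item.find_last_not_of(" \t\"'");
        if (b != std::string_view::npos) {
            out.emplace_back(item.substr(b, e - b + 1));
        }
        if (comma == std::string_view::npos) {
            break;
        }
        pos = comma + 1;
    }
    return out;
}
```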
In the following commits, we add support for --ignore-dead-nodes
in raft_removenode and --ignore-dead-nodes-for-replace in
raft_replace. To make these request parameters accessible for the
topology coordinator, we store them in the new ignore_nodes
column of system.topology.
If we don't clear them, there is a slight chance
that the next update will make `_live_endpoints` or `_unreachable_endpoints`
equal to their shadow counterparts and prevent an update
in `replicate_live_endpoints_on_change`.
Fixes#15003
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Not only on the calling shard (shard 0).
Essentially this change folds `update_live_endpoints_version`
into `reset_endpoint_state_map`.
Acquire the _endpoint_update_semaphore to serialize
this with `replicate_live_endpoints_on_change`.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
`update_live_endpoints_version` and functions that call it
must be called on shard 0, since it updates the authoritative
`_live_endpoints` and `_live_endpoints_version`.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Add a private helper to acquire the _endpoint_update_semaphore
before calling replicate_live_endpoints_on_change.
Must be called on shard 0.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It is more efficient to maintain as an unordered_set,
and it will be used in a following patch
to determine is_alive(endpoint) in O(1) on average.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Prevent destroying the potentially large `rows` and `row_list`
in one shot on error, as that might cause a reactor stall.
Instead, use utils::clear_gently on the error return path.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Although to_repair_rows_list may yield if needed
between rows and mutation fragments, the input
`repair_rows_on_wire` is freed in one shot
and that may cause stalls, as seen in QA:
```
| bytes_ostream::free_chain at ././bytes_ostream.hh:163
++ - addr=0x4103be0:
| bytes_ostream::~bytes_ostream at ././bytes_ostream.hh:199
| (inlined by) frozen_mutation_fragment::~frozen_mutation_fragment at ././mutation/frozen_mutation.hh:273
| (inlined by) std::destroy_at<frozen_mutation_fragment> at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/stl_construct.h:88
| (inlined by) ?? at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/alloc_traits.h:537
| (inlined by) ?? at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/list.tcc:77
| (inlined by) std::__cxx11::_List_base<frozen_mutation_fragment, std::allocator<frozen_mutation_fragment> >::~_List_base at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/stl_list.h:575
| (inlined by) partition_key_and_mutation_fragments::~partition_key_and_mutation_fragments at ././repair/repair.hh:203
| (inlined by) std::destroy_at<partition_key_and_mutation_fragments> at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/stl_construct.h:88
| (inlined by) ?? at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/alloc_traits.h:537
| (inlined by) ?? at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/list.tcc:77
| (inlined by) std::__cxx11::_List_base<partition_key_and_mutation_fragments, std::allocator<partition_key_and_mutation_fragments> >::~_List_base at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/stl_list.h:575
| (inlined by) to_repair_rows_list at ./repair/row_level.cc:597
```
This change consumes the rows and frozen mutation fragments
incrementally, freeing each after being processed.
Fixes scylladb/scylladb#14537
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
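The pattern, reduced to its essentials (a sketch; the real to_repair_rows_list operates on repair's frozen mutation fragments, not a generic list):
```cpp
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <seastar/coroutine/maybe_yield.hh>
#include <list>

// Destroy a potentially huge list one element at a time, yielding to the
// reactor between elements, instead of letting ~list() free the whole
// chain in one shot at end of scope.
template <typename T>
seastar::future<> consume_gently(std::list<T> rows) {
    while (!rows.empty()) {
        // process(rows.front()) would go here in the real code
        rows.pop_front();                            // frees just this element
        co_await seastar::coroutine::maybe_yield();  // preempt if needed
    }
}
```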