After calling filter_for_query() the extra_replica to speculate to may
be left default-initialized, which is the :0 IPv6 address. Later this
address is used as-is to check whether it belongs to the same DC, which
is wrong, as :0 is not the address of any existing endpoint.
The recent move of dc/rack data onto the topology made this place
reveal itself by emitting an internal error, since :0 is not present in
the topology's collection of endpoints. Prior to this move the dc filter
would count :0 as belonging to the "default_dc" datacenter, which may or
may not match the dc of the local node.
The fix is to explicitly distinguish a set extra_replica from an unset one.
fixes: #11825
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes #11833
The storage_service::stop() calls repair_service::abort_repair_node_ops(), but at that time the sharded<repair_service> is already stopped, and calling .local() on it just crashes.
The suggested fix is to remove the explicit storage_service -> repair_service kick. Instead, the repair_infos generated for the sake of node ops subscribe to the node_ops_meta_data's abort source and abort themselves automatically.
fixes: #10284
Closes #11797
* github.com:scylladb/scylladb:
repair: Remove ops_uuid
repair: Remove abort_repair_node_ops() altogether
repair: Subscribe on node_ops_info::as abortion
repair: Keep abort source on node_ops_info
repair: Pass node_ops_info arg to do_sync_data_using_repair()
repair: Mark repair_info::abort() noexcept
node_ops: Remove _aborted bit
node_ops: Simplify construction of node_ops_metadata
main: Fix message about repair service starting
The gossiper makes sure the local snitch name is the same as that of
other nodes in the ring. It currently uses the global snitch to get the
name; this patch passes the name as an argument instead, because the
caller (storage_service) holds a local reference to the snitch instance
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Storage service uses snitch in several places:
- boot
- snitch-reconfigured subscription
- preferred IP reconnection
At this point it's worth adding an explicit storage_service -> snitch
dependency and patching the above to use a local reference
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Until now, authentication in alternator served only two purposes:
- refusing clients without proper credentials
- printing user information with logs
After this series, this user information is passed to the lower layers, which also means that users can attach service levels to roles, and this service level configuration takes effect for alternator requests.
tests: manually, by adding more debug logs and inspecting that the per-service-level timeout value was properly applied for an authenticated alternator user
Fixes #11379
Closes #11380
* github.com:scylladb/scylladb:
alternator: propagate authenticated user in client state
client_state: add internal constructor with auth_service
alternator: pass auth_service and sl_controller to server
There's an ongoing effort to move the endpoint -> {dc/rack} mappings
from snitch onto the topology object, and this set finalizes it. After it
the snitch service stops depending on the gossiper and system keyspace
and is ready for de-globalization. As a nice side effect, the system
keyspace no longer needs to maintain the dc/rack info cache, and its
startup code gets simplified.
refs: #2737
refs: #2795
* 'br-snitch-dont-mess-with-topology-data-2' of https://github.com/xemul/scylla: (23 commits)
system_keyspace: Dont maintain dc/rack cache
system_keyspace: Indentation fix after previous patch
system_keyspace: Coroutinize build_dc_rack_info()
topology: Move all post-configuration to topology::config
snitch: Start early
gossiper: Do not export system keyspace
snitch: Remove gossiper reference
snitch: Mark get_datacenter/_rack methods const
snitch: Drop some dead dependency knots
snitch, code: Make get_datacenter() report local dc only
snitch, code: Make get_rack() report local rack only
storage_service: Populate pending endpoint in on_alive()
code: Populate pending locations
topology: Put local dc/rack on topology early
topology: Add pending locations collection
topology: Make get_location() errors more verbose
token_metadata: Add config, spread everywhere
token_metadata: Hide token_metadata_impl copy constructor
gossiper: Remove messaging service getter
snitch: Get local address to gossip via config
...
When node_ops_meta_data aborts, it also kicks repair to find and abort
all relevant repair_infos. This can now be simplified by subscribing
repair_meta to the abort source and aborting it without an explicit kick
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The next patches will need to subscribe to the node_ops_meta_data's abort
source inside the repair code, so keep a pointer to it on node_ops_info
too. At the same time, node_ops_info::abort becomes obsolete, because the
same check can be performed via abort_source->abort_requested()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
A short cleanup "while at it": node_ops_meta_data doesn't need to
carry a dedicated _aborted boolean, since the abort source that sets it
is always available
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Following up on 69aea59d97, which added fencing support
for simple reads and writes, this series does the same for the
complex ops:
- partition scan
- counter mutation
- paxos
With this done, the coordinator knows about all in-flight requests and
can delay topology changes until they are retired.
Closes #11296
* github.com:scylladb/scylladb:
storage_proxy: hold effective_replication_map for the duration of a paxos transaction
storage_proxy: move paxos_response_handler class to .cc file
storage_proxy: deinline paxos_response_handler constructor/destructor
storage_proxy: use consistent effective_replication_map for counter coordinator
storage_proxy: improve consistency in query_partition_key_range{,_concurrent}
storage_proxy: query_partition_key_range_concurrent: reduce smart pointer use
storage_proxy: query_partition_key_range_concurrent: improve token_metadata consistency
storage_proxy: query_singular: use fewer smart pointers
storage_proxy: query_singular: simplify lambda captures
locator: effective_replication_map: provide non-smart-pointer accessor to token_metadata
storage_proxy: use consistent token_metadata with rest of singular read
The `raft_address_map` code was "clever": it used two intrusive data structures and did a lot of manual lifetime management; raw pointer manipulation, manual deletion of objects... It wasn't clear who owns which object, who is responsible for deleting what. And there was a lot of code.
In this PR we replace one of the intrusive data structures with a good old `std::unordered_map` and make ownership clear by replacing the raw pointers with `std::unique_ptr`. Furthermore, some invariants which were not clear and enforced in runtime are now encoded in the type system.
The code also became shorter: we reduced its length from ~360 LOC to ~260 LOC.
Closes#11763
* github.com:scylladb/scylladb:
service/raft: raft_address_map: get rid of `is_linked` checks
service/raft: raft_address_map: get rid of `to_list_iterator`
service/raft: raft_address_map: simplify ownership of `expiring_entry_ptr`
service/raft: raft_address_map: move _last_accessed field from timestamped_entry to expiring_entry_ptr
service/raft: raft_address_map: don't use intrusive set for timestamped entries
service/raft: raft_address_map: store reference to `timestamped_entry` in `expiring_entry_ptr`
The owner of `expiring_entry_ptr` was almost uniquely its corresponding
`timestamped_entry`; it would delete the expiring entry when it itself got
destroyed. There was one explicit call to `unlink_and_dispose`, which
made the picture unclear.
Make the picture clear: `timestamped_entry` now contains a `unique_ptr`
to its `expiring_entry_ptr`. The `unlink_and_dispose` was replaced with
`_lru_entry = nullptr`.
We can also get rid of the back-reference from `expiring_entry_ptr` to
`timestamped_entry`.
The code becomes shorter and simpler.
If a removenode is run for a recently stopped node,
the gossiper may not yet know that the node is down,
and the removenode will fail with a Stream failed error
trying to stream data from that node.
In this patch we explicitly reject removenode operation
if the gossiper considers the leaving node up.
Closes #11704
- Start n1, n2, n3 (n3 = 127.0.0.3)
- Stop n3
- Change ip address of n3 to 127.0.0.33 and restart n3
- Decommission n3
- Start new node n4
The node n4 will learn from the gossip entry for 127.0.0.3 that node
127.0.0.3 is in shutdown status which means 127.0.0.3 is still part of
the ring.
This patch prevents this by checking the status for the host id on all
the entries. If any of the entries shows the node with this host id in
LEFT status, refuse to put the node in NORMAL status.
Fixes #11355
Closes #11361
Luckily, all topology calculations are done in get_paxos_participants(),
so all we have to do is hold the effective_replication_map for the
duration of the transaction and pass it to get_paxos_participants().
This ensures that the coordinator knows about all in-flight requests
and can fence them from topology changes.
Hold the effective_replication_map while talking to the counter leader,
to allow for fencing in the future. The code is somewhat awkward because
the API allows for multiple keyspaces to be in use.
The error code generation, already broken as it doesn't use the correct
table, continues to be broken in that it doesn't use the correct
effective_replication_map, for the same reason.
query_partition_key_range captures a token_metadata_ptr and uses
it consistently in sequential calls to query_partition_key_range_concurrent
(via tail recursion), but each invocation of
query_partition_key_range_concurrent captures its own
effective_replication_map_ptr. Since these are captured at different times,
they can be inconsistent after the first iteration.
Fix by capturing it once in the caller and propagating it everywhere.
Derive the token_metadata from the effective_replication_map rather than
getting it independently. Not a real bug since these were in the same
continuation, but safer this way.
Capture token_metadata by reference since we're protecting it with
the mighty effective_replication_map_ptr. This saves a few instructions
to manage smart pointers.
The lambdas in query_singular do not outlive the enclosing coroutine,
so they can capture everything by reference. This simplifies life
for a future update of the lambda, since there's one less thing to
worry about.
query_singular() uses get_token_metadata_ptr() and later, in
get_read_executor(), captures the effective_replication_map(). This
isn't a bug, since the two are captured in the same continuation and
are therefore consistent, but a way to ensure it stays so is to capture
the effective_replication_map earlier and derive the token_metadata from
it.
`timestamped_entry` had two fields:
```
optional<clock_time_point> _last_accessed
expiring_entry_ptr* _lru_entry
```
The `raft_address_map` data structure maintained an invariant:
`_last_accessed` is set if and only if `_lru_entry` is not null.
This invariant could be broken for a while when constructing an expiring
`timestamped_entry`: the constructor was given an `expiring = true`
flag, which set the `_last_accessed` field; this was redundant, because
immediately after a corresponding `expiring_entry_ptr` was constructed
which again reset the `_last_accessed` field and set `_lru_entry`.
The code becomes simpler and shorter when we move `_last_accessed` field
into `expiring_entry_ptr`. The invariant is now guaranteed by the type
system: `_last_accessed` is no longer `optional`.
Intrusive data structures are harder to reason about. In
`raft_address_map` there's a good reason to use an intrusive list for
storing `expiring_entry_ptr`s: we move the entries around in the list
(when their expiration times change) but we want the objects to stay
in place because `timestamped_entry`s may point to them (although we
could simply update the pointers using the existing back-reference...)
However, there's not much reason to store `timestamped_entry` in an
intrusive set. It was basically used in one place: when dropping expired
entries, we iterate over the list of `expiring_entry_ptr`s and we want
to drop the corresponding `timestamped_entry` as well, which is easy
when we have a pointer to the entry and it's a member of an intrusive
container. But we can deal with it when using non-intrusive containers:
just `find` the element in the container to erase it.
The code becomes shorter with this change.
I also use a map instead of a set because we need to modify the
`timestamped_entry` which wouldn't be possible if it was used as an
`unordered_set` key. In fact, using a map here makes more sense: we were
using the intrusive set similarly to a map anyway, because all lookups
were performed using the `_id` field of `timestamped_entry` (the field
has now been moved outside the struct and is used as the map's key).
- Start a cluster with n1, n2, n3
- Full cluster shutdown n1, n2, n3
- Start n1, n2 and keep n3 shut down
- Add n4
Node n4 will learn the ip and uuid of n3 but it does not know the gossip
status of n3 since gossip status is published only by the node itself.
After full cluster shutdown, gossip status of n3 will not be present
until n3 is restarted again. So n4 will not think n3 is part of the
ring.
In this case, it is better to reject the bootstrap.
With this patch, one would see the following when adding n4:
```
ERROR 2022-09-01 13:53:14,480 [shard 0] init - Startup failed:
std::runtime_error (Node 127.0.0.3 has gossip status=UNKNOWN. Try fixing it
before adding new node to the cluster.)
```
The user needs to perform either of the following before adding a new node:
1) Run nodetool removenode to remove n3
2) Restart n3 to get it back to the cluster
Fixes #6088
Closes #11425
Refactor the existing upgrade tests, extracting some common functionality to
helper functions.
Add more tests. They are checking the upgrade procedure and recovery from
failure in scenarios like when a node fails causing the procedure to get stuck
or when we lose a majority in a fully upgraded cluster.
Add some new functionality to `ScyllaRESTAPIClient`, like injecting errors
and obtaining gossip generation numbers.
Extend the removenode function to allow ignoring dead nodes.
Improve checking for CQL availability when starting nodes to speed up testing.
Closes #11725
* github.com:scylladb/scylladb:
test/topology_raft_disabled: more Raft upgrade tests
test/topology_raft_disabled: refactor `test_raft_upgrade`
test/pylib: scylla_cluster: pass a list of ignored nodes to removenode
test/pylib: rest_client: propagate errors from put_json
test/pylib: fix some type hints
test/pylib: scylla_cluster: don't create and drop keyspaces to check if cql is up
In preparation for supporting IP address changes of Raft Group 0:
1) Always use start_server_for_group0() to start a server for group 0.
This will provide a single extension point when it's necessary to
prompt raft_address_map with gossip data.
2) Don't use raft::server_address in discovery, since going forward
discovery won't store raft::server_address. By the same token, stop
using discovery::peer_set anywhere outside discovery (for persistence);
use a peer_list instead, which is easier to marshal.
Closes #11676
* github.com:scylladb/scylladb:
raft: (discovery) do not use raft::server_address to carry IP data
raft: (group0) API refactoring to avoid raft::server_address
raft: rename group0_upgrade.hh to group0_fwd.hh
raft: (group0) move the code around
raft: (discovery) persist a list of discovered peers, not a set
raft: (group0) always start group0 using start_server_for_group0()
- Start n1, n2, n3
- Apply network nemesis as below:
+ Block gossip traffic going from nodes 1 and 2 to node 3.
+ All the other rpc traffic flows normally, including gossip traffic
from node 3 to nodes 1 and 2 and responses to node_ops commands from
nodes 1 and 2 to node 3.
- Decommission n3
Currently, the decommission will be successful because all the network
traffic is ok. But n3 cannot advertise status STATUS_LEFT to the rest
of the cluster due to the network nemesis applied. As a result, n1 and
n2 cannot move n3 from STATUS_LEAVING to STATUS_LEFT, so n3 will
stay in DL (Down/Leaving) forever.
The reason the node stays DL forever is that with node_ops_cmd based
node operations, we still rely on the gossip status of STATUS_LEFT from
the node being decommissioned to notify other nodes that this node has
finished decommissioning and can be moved from STATUS_LEAVING to
STATUS_LEFT.
This patch fixes this by checking gossip liveness before running the
decommission, and rejecting it if required peer nodes are down.
With the fix, the decommission of n3 will fail like this:
$ nodetool decommission -p 7300
nodetool: Scylla API server HTTP POST to URL
'/storage_service/decommission' failed: std::runtime_error
(decommission[adb3950e-a937-4424-9bc9-6a75d880f23d]: Rejected
decommission operation, removing node=127.0.0.3, sync_nodes=[127.0.0.2,
127.0.0.3, 127.0.0.1], ignore_nodes=[], nodes_down={127.0.0.1})
Fixes #11302
Closes #11362
Some good news finally. The saved dc/rack info about the ring is now
only loaded once on start. So the whole cache is not needed and the
loading code in storage_service can be greatly simplified
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The continuation of the previous patch -- all the code uses
topology::get_datacenter(endpoint) to get peers' dc string. The topology
still uses snitch for that, but it already contains the needed data.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All the code out there now calls snitch::get_rack() to get the rack for
the local node. For other nodes, topology::get_rack(endpoint) is used.
Since the topology is now properly populated with endpoints, it can
finally be patched to stop using the snitch and get the rack from its
internal collections
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
A special-purpose add-on to the previous patch.
When the messaging service accepts a new connection, it sometimes may
want to drop it early based on whether the client is from the same
dc/rack or not. However, at this stage the information may not yet have
had a chance to propagate via the storage service's pending-tokens
update paths, so here's one more place: the on_alive() callback
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Previous patches added the concept of pending endpoints in the topology;
this patch populates endpoints in this state.
Also, set_pending_ranges() is patched to make sure that the tokens
added for the endpoint(s) are added for something that's known by the
topology. The same check exists in update_normal_tokens()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
No functional changes, just keep some conditions from if()s as local
variables. This is churn-reducing preparation for one of the
next patches
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
We plan to remove IP information from Raft addresses.
raft::server_address is used in the Raft configuration and
also in discovery (which is a separate algorithm) as a handy data
structure, to avoid introducing new entities in RPC.
Since we plan to remove IP addresses from Raft configuration,
using raft::server_address in discovery and still storing
IPs in it would create ambiguity: in some uses raft::server_address
would store an IP, and in others it would not.
So switch to a dedicated data structure for the purposes of discovery,
discovery_peer, which contains an (ip, raft server id) pair.
Note to reviewers: ideally we should switch to URIs
in discovery_peer right away. Otherwise we may have to
deal with incompatible changes in discovery when adding URI
support to Scylla.
Replace raft::server_address in a few raft_group0 API
calls with raft::server_id.
These API calls do not need raft::server_address, i.e. the
address part, anyway, and since going forward raft::server_address
will not contain the IP address, stop using it in these calls.
This is a beginning of a multi-patch series to reduce
raft::server_address usage to core raft only.
Move the load/store functions for discovered peers up,
since going forward they'll be used in start_server_for_group0()
to extend the address map prior to start (and thus speed up
bootstrap).
We plan to reuse the discovery table to store the peers
after discovery is over, so the load/store API must be generalized
for use outside discovery. This includes sending
the list of persisted peers over to a new member of the cluster.
When IP addresses are removed from raft::configuration, it's key
to initialize raft_address_map with IP addresses before we start group
0. Best place to put this initialization is start_server_for_group0(),
so make sure all paths which create group 0 use
start_server_for_group0().
The tests are checking the upgrade procedure and recovery from failure
in scenarios like when a node fails causing the procedure to get stuck
or when we lose a majority in a fully upgraded cluster.
Added some new functionality to `ScyllaRESTAPIClient`, like injecting
errors and obtaining gossip generation numbers.
Many services out there have a method (sometimes called .drain()) that's
called early on stop and is responsible for preparing the service for
stopping: aborting pending/in-flight fibers and the like.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There is a flaw in how the raft rpc endpoints are
currently managed. The io_fiber in raft::server
is supposed to first add new servers to rpc, then
send all the messages and then remove the servers
which have been excluded from the configuration.
The problem is that the send_messages function
isn't synchronous: it schedules send_append_entries
to run after all the current requests to the
target server, which can happen
after we have already removed the server from the address_map.
In this patch the remove_server function is changed to mark
the server_id as expiring rather than synchronously dropping it.
This means all currently scheduled requests to
that server will still be able to resolve
the ip address for that server_id.
Fixes: #11228
Closes #11748