The `raft_address_map` code was "clever": it used two intrusive data structures and did a lot of manual lifetime management; raw pointer manipulation, manual deletion of objects... It wasn't clear who owns which object, who is responsible for deleting what. And there was a lot of code.
In this PR we replace one of the intrusive data structures with a good old `std::unordered_map` and make ownership clear by replacing the raw pointers with `std::unique_ptr`. Furthermore, some invariants which were not clear and enforced in runtime are now encoded in the type system.
The code also became shorter: we reduced its length from ~360 LOC to ~260 LOC.
Closes#11763
* github.com:scylladb/scylladb:
service/raft: raft_address_map: get rid of `is_linked` checks
service/raft: raft_address_map: get rid of `to_list_iterator`
service/raft: raft_address_map: simplify ownership of `expiring_entry_ptr`
service/raft: raft_address_map: move _last_accessed field from timestamped_entry to expiring_entry_ptr
service/raft: raft_address_map: don't use intrusive set for timestamped entries
service/raft: raft_address_map: store reference to `timestamped_entry` in `expiring_entry_ptr`
The owner of `expiring_entry_ptr` was almost uniquely its corresponding
`timestamp_entry`; it would delete the expiring entry when it itself got
destroyed. There was one call to explicit `unlink_and_dispose`, which
made the picture unclear.
Make the picture clear: `timestamped_entry` now contains a `unique_ptr`
to its `expiring_entry_ptr`. The `unlink_and_dispose` was replaced with
`_lru_entry = nullptr`.
We can also get rid of the back-reference from `expiring_entry_ptr` to
`timestamped_entry`.
The code becomes shorter and simpler.
If a removenode is run for a recently stopped node,
the gossiper may not yet know that the node is down,
and the removenode will fail with a Stream failed error
trying to stream data from that node.
In this patch we explicitly reject removenode operation
if the gossiper considers the leaving node up.
Closes#11704
This add support stripped binary installation for relocatable package.
After this change, scylla and unified packages only contain stripped binary,
and introduce "scylla-debuginfo" package for debug symbol.
On scylla-debuginfo package, install.sh script will extract debug symbol
at /opt/scylladb/<dir>/.debug.
Note that we need to keep unstripped version of relocatable package for rpm/deb,
otherwise rpmbuild/debuild fails to create debug symbol package.
This version is renamed to scylla-unstripped-$version-$release.$arch.tar.gz.
See #8918
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Closes#9005
- Start n1, n2, n3 (127.0.0.3)
- Stop n3
- Change ip address of n3 to 127.0.0.33 and restart n3
- Decommission n3
- Start new node n4
The node n4 will learn from the gossip entry for 127.0.0.3 that node
127.0.0.3 is in shutdown status which means 127.0.0.3 is still part of
the ring.
This patch prevents this by checking the status for the host id on all
the entries. If any of the entries shows the node with the host id is in
LEFT status, reject to put the node in NORMAL status.
Fixes#11355Closes#11361
Requests like `col IN NULL` used to cause
an error - Invalid null value for colum col.
We would like to allow NULLs everywhere.
When a NULL occurs on either side
of a binary operator, the whole operation
should just evaluate to NULL.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
Closes#11775
Today, we're completely blind about the progress of view updating
on Staging files. We don't know how long it will take, nor how much
progress we've made.
This patch adds visibility with a new metric that will inform
the number of bytes to be processed from Staging files.
Before any work is done, the metric tell us the total size to be
processed. As view updating progresses, the metric value is
expected to decrease, unless work is being produced faster than
we can consume them.
We're piggybacking on sstables::read_monitor, which allows the
progress metric to be updated whenever the SSTable reader makes
progress.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#11751
std::experimental::source_location is provided by <experimental/source_location>,
not <source_location>. libstdc++ 12 insists, so change the header.
Closes#11766
EC2 instance metadata service can be busy, ret's retry to connect with
interval, just like we do in scylla-machine-image.
Fixes#10250
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Closes#11688
`timestamped_entry` had two fields:
```
optional<clock_time_point> _last_accessed
expiring_entry_ptr* _lru_entry
```
The `raft_address_map` data structure maintained an invariant:
`_last_accessed` is set if and only if `_lru_entry` is not null.
This invariant could be broken for a while when constructing an expiring
`timestamped_entry`: the constructor was given an `expiring = true`
flag, which set the `_last_accessed` field; this was redundant, because
immediately after a corresponding `expiring_entry_ptr` was constructed
which again reset the `_last_accessed` field and set `_lru_entry`.
The code becomes simpler and shorter when we move `_last_accessed` field
into `expiring_entry_ptr`. The invariant is now guaranteed by the type
system: `_last_accessed` is no longer `optional`.
Intrusive data structures are harder to reason about. In
`raft_address_map` there's a good reason to use an intrusive list for
storing `expiring_entry_ptr`s: we move the entries around in the list
(when their expiration times change) but we want for the objects to stay
in place because `timestamped_entry`s may point to them (although we
could simply update the pointers using the existing back-reference...)
However, there's not much reason to store `timestamped_entry` in an
intrusive set. It was basically used in one place: when dropping expired
entries, we iterate over the list of `expiring_entry_ptr`s and we want
to drop the corresponding `timestamped_entry` as well, which is easy
when we have a pointer to the entry and it's a member of an intrusive
container. But we can deal with it when using non-intrusive containers:
just `find` the element in the container to erase it.
The code becomes shorter with this change.
I also use a map instead of a set because we need to modify the
`timestamped_entry` which wouldn't be possible if it was used as an
`unordered_set` key. In fact using map here makes more sense: we were
using the intrusive set similarly to a map anyway because all lookups
were performed using the `_id` field of `timestamped_entry` (now the
field was moved outside the struct, it's used as the map's key).
When code was moved to the new directory, a bug was reintroduced with `ssl` local hiding `ssl` module. Fix again.
Closes#11755
* github.com:scylladb/scylladb:
test.py: improve pylint score for conftest
test.py: fix variable name collision with ssl
fmt 9 deprecates automatic fallback to std::ostream formatting.
We should migrate, but in order to do so incrementally, first enable
the deprecated fallback so the code continues to compile.
Closes#11768
Abseil is not under our control, so if a header generates a
warning, we can do nothing about it. So far this wasn't a problem,
but under clang 15 it spews a harmless deprecation warning. Silence
the warning by treating the header as a system header (which it is,
for us).
Closes#11767
- Start a cluster with n1, n2, n3
- Full cluster shutdown n1, n2, n3
- Start n1, n2 and keep n3 as shutdown
- Add n4
Node n4 will learn the ip and uuid of n3 but it does not know the gossip
status of n3 since gossip status is published only by the node itself.
After full cluster shutdown, gossip status of n3 will not be present
until n3 is restarted again. So n4 will not think n3 is part of the
ring.
In this case, it is better to reject the bootstrap.
With this patch, one would see the following when adding n4:
```
ERROR 2022-09-01 13:53:14,480 [shard 0] init - Startup failed:
std::runtime_error (Node 127.0.0.3 has gossip status=UNKNOWN. Try fixing it
before adding new node to the cluster.)
```
The user needs to perform either of the following before adding a new node:
1) Run nodetool removenode to remove n3
2) Restart n3 to get it back to the cluster
Fixes#6088Closes#11425
Refactor the existing upgrade tests, extracting some common functionality to
helper functions.
Add more tests. They are checking the upgrade procedure and recovery from
failure in scenarios like when a node fails causing the procedure to get stuck
or when we lose a majority in a fully upgraded cluster.
Add some new functionalities to `ScyllaRESTAPIClient` like injecting errors and
obtaining gossip generation numbers.
Extend the removenode function to allow ignoring dead nodes.
Improve checking for CQL availability when starting nodes to speed up testing.
Closes#11725
* github.com:scylladb/scylladb:
test/topology_raft_disabled: more Raft upgrade tests
test/topology_raft_disabled: refactor `test_raft_upgrade`
test/pylib: scylla_cluster: pass a list of ignored nodes to removenode
test/pylib: rest_client: propagate errors from put_json
test/pylib: fix some type hints
test/pylib: scylla_cluster: don't create and drop keyspaces to check if cql is up
In preparation for supporting IP address changes of Raft Group 0:
1) Always use start_server_for_group0() to start a server for group 0.
This will provide a single extension point when it's necessary to
prompt raft_address_map with gossip data.
2) Don't use raft::server_address in discovery, since going forward
discovery won't store raft::server_address. On the same token stop
using discovery::peer_set anywhere outside discovery (for persistence),
use a peer_list instead, which is easier to marshal.
Closes#11676
* github.com:scylladb/scylladb:
raft: (discovery) do not use raft::server_address to carry IP data
raft: (group0) API refactoring to avoid raft::server_address
raft: rename group0_upgrade.hh to group0_fwd.hh
raft: (group0) move the code around
raft: (discovery) persist a list of discovered peers, not a set
raft: (group0) always start group0 using start_server_for_group0()
- Start n1, n2, n3
- Apply network nemesis as below:
+ Block gossip traffic going from nodes 1 and 2 to node 3.
+ All the other rpc traffic flows normally, including gossip traffic
from node 3 to nodes 1 and 2 and responses to node_ops commands from
nodes 1 and 2 to node 3.
- Decommission n3
Currently, the decommission will be successful because all the network
traffic is ok. But n3 could not advertise status STATUS_LEFT to the rest
of the cluster due to the network nemesis applied. As a result, n1 and
n3 could not move the n3 from STATUS_LEAVING to STATUS_LEFT, so n3 will
stay in DL forever.
I know why the node stays DL forever. The problem is that with
node_ops_cmd based node operation, we still rely on the gossip status of
STATUS_LEFT from the node being decommissioned to notify other nodes
this node has finished decommission and can be moved from STATUS_LEAVING
to STATUS_LEFT.
This patch fixes by checking gossip liveness before running
decommission. Reject if required peer nodes are down.
With the fix, the decommission of n3 will fail like this:
$ nodetool decommission -p 7300
nodetool: Scylla API server HTTP POST to URL
'/storage_service/decommission' failed: std::runtime_error
(decommission[adb3950e-a937-4424-9bc9-6a75d880f23d]: Rejected
decommission operation, removing node=127.0.0.3, sync_nodes=[127.0.0.2,
127.0.0.3, 127.0.0.1], ignore_nodes=[], nodes_down={127.0.0.1})
Fixes#11302Closes#11362
"
There's one via the database's compaction manager and large data handler
sub-services. Both need system keyspace to put their info into, but the
latter needs database naturally via query_processor->storage_proxy link.
The solution is to make c.m. | l.d.h. -> sys.ks. dependency be weak with
the help of shared_from_this(), described in details in patch #2 commit
message.
As a (not-that-)side effect this set removes a bunch of global qctx
calls.
refs: #11684 (this set seem to increase the chance of stepping on it)
"
* 'br-sysks-async-users' of https://github.com/xemul/scylla:
large_data_handler: Use local system_keyspace to update entries
system_keyspace: De-static compaction history update
compaction_manager: Relax history paths
database: Plug/unplug system_keyspace
system_keyspace: Add .shutdown() method
This patch adds a couple of simple tests for the USE statement: that
without USE one cannot create a table without explicitly specifying
a keyspace name, and with USE, it is possible.
Beyond testing these specific feature, this patch also serves as an
example of how to write more tests that need to control the effective USE
setting. Specifically, it adds a "new_cql" function that can be used to
create a new connection with a fresh USE setting. This is necessary
in such tests, because if multiple tests use the same cql fixture
and its single connection, they will share their USE setting and there
is no way to undo or reset it after being set.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#11741
We plan to remove IP information from Raft addresses.
raft::server_address is used in Raft configuration and
also in discovery, which is a separate algorithm, as a handy data
structure, to avoid having new entities in RPC.
Since we plan to remove IP addresses from Raft configuration,
using raft::server_address in discovery and still storing
IPs in it would create ambiguity: in some uses raft::server_address
would store an IP, and in others - would not.
So switch to an own data structure for the purposes of discovery,
discovery_peer, which contains a pair ip, raft server id.
Note to reviewers: ideally we should switch to URIs
in discovery_peer right away. Otherwise we may have to
deal with incompatible changes in discovery when adding URI
support to Scylla.
The l._d._h.'s way to update system keyspace is not like in other code.
Instead of a dedicated helper on the system_keyspace's side it executes
the insertion query directly with the help of qctx.
Now when the l._d._h. has the weak system keyspace reference it can
execute queries on _it_ rather than on the qctx.
Just like in previous patch, it needs to keep the sys._k.s. weak
reference alive until the query's future resolves.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Compaction manager now has the weak reference on the system keyspace
object and can use it to update its stats. It only needs to take care
and keep the shared pointer until the respective future resolves.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a virtual method on table_state to update the entry in system
keyspace. It's an overkill to facilitate tests that don't want this.
With new system_keyspace weak referencing it can be made simpled by
moving the updating call to the compaction_manager itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a circular dependency between system_keyspace and database. The
former needs the latter because it needs to execula local requests via
query_processor. The latter needs the former via compaction manager and
large data handler, database depends on both and these too need to
insert their entries into system keyspace.
To cut this loop the compaction manager and large data handler both get
a weak reference on the system keysace. Once system keyspace starts is
activcates this reference via the database call. When system keyspace is
shutdown-ed on stop, it deactivates the reference.
Technically the weak reference is implemented by marking the system_k.s.
object as async_sharded_service, and the "reference" in question is the
shared_from_this() pointer. When compaction manager or large data
handler need to update a system keyspace's table, they both hold an
extra reference on the system keyspace until the entry is committed,
thus making sure that sys._k.s. doesn't stop from under their feet. At
the same time, unplugging the reference on shutdown makes sure that no
new entries update will appear and the system_k.s. will eventually be
released.
It's not a C++ classical reference, because system_keyspace starts after
and stops before database.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Replace raft::server_address in a few raft_group0 API
calls with raft::server_id.
These API calls do not need raft::server_address, i.e. the
address part, anyway, and since going forward raft::server_address
will not contain the IP address, stop using it in these calls.
This is a beginning of a multi-patch series to reduce
raft::server_address usage to core raft only.
Move load/store functions for discovered peers up,
since going forward they'll be used to in start_server_for_group0(),
to extend the address map prior to start (and thus speed up
bootstrap).
We plan to reuse the discovery table to store the peers
after discovery is over, so load/store API must be generalized
to use outside discovery. This includes sending
the list of persisted peers over to a new member of the cluster.
When IP addresses are removed from raft::configuration, it's key
to initialize raft_address_map with IP addresses before we start group
0. Best place to put this initialization is start_server_for_group0(),
so make sure all paths which create group 0 use
start_server_for_group0().
The tests are checking the upgrade procedure and recovery from failure
in scenarios like when a node fails causing the procedure to get stuck
or when we lose a majority in a fully upgraded cluster.
Added some new functionalities to `ScyllaRESTAPIClient` like injecting
errors and obtaining gossip generation numbers.
Many services out there have one (sometimes called .drain()) that's
called early on stop and that's responsible for prearing the service for
stop -- aborting pending/in-flight fibers and alike.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The `removenode` operation normally requires the removing node to
contact every node in the cluster except the one that is being removed.
But if more than 1 node is down it's possible to specify a list of nodes
to ignore for the operation; the `/storage_service/remove_node` endpoint
accepts an `ignore_nodes` param which is a comma-separated list of IPs.
Extend `ScyllaRESTAPIClient`, `ScyllaClusterManager` and `ManagerClient`
so it's possible to pass the list of ignored nodes.
We also modify the `/cluster/remove-node` Manager endpoint to use
`put_json` instead of `get_text` and pass all parameters except the
initiator IP (the IP of the node who coordinates the `removenode`
operation) through JSON. This simplifies the URL greatly (it was already
messy with 3 parameters) and more closely resembles Scylla's endpoint.
Change variable name to avoid collision with module ssl.
This bug was reintroduced when moving code.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
There is a flaw in how the raft rpc endpoints are
currently managed. The io_fiber in raft::server
is supposed to first add new servers to rpc, then
send all the messages and then remove the servers
which have been excluded from the configuration.
The problem is that the send_messages function
isn't synchronous, it schedules send_append_entries
to run after all the current requests to the
target server, which can happen
after we have already removed the server from address_map.
In this patch the remove_server function is changed to mark
the server_id as expiring rather than synchronously dropping it.
This means all currently scheduled requests to
that server will still be able to resolve
the ip address for that server_id.
Fixes: #11228Closes#11748
The `add_entry` and `modify_config` methods sometimes do an rpc to
execute the request on the current leader. If the tcp connection
was broken, a `seastar::rpc::closed_error` would be thrown to the client.
This exception was not documented in the method comments and the
client could have missed handling it. For example, this exception
was not handled when calling `modify_config` in `raft_group0`,
which sometimes broke the `removenode` command.
An `intermittent_connection_error` exception was added earlier to
solve a similar problem with the `read_barrier` method. In this patch it
is renamed to `transport_error`, as it seems to better describe the
situation, and an explicit specification for this exception
was added - the rpc implementation can throw it if it is not known
whether the call reached the destination and whether any mutations were made.
In case of `read_barrier` it does not matter and we just retry, in case
of `add_entry` and `modify_config` we cannot retry because of possible mutations,
so we convert this exception to `commit_status_unknown`, which
the client has to handle.
Explicit comments have also been added to `raft::server` methods
describing all possible exceptions.
Closes#11691
* github.com:scylladb/scylladb:
raft_group0: retry modify_config on commit_status_unknown
raft: convert raft::transport_error to raft::commit_status_unknown
modify_config can throw commit_status_unknown in case
of a leader change or when the leader is unavailable,
but the information about it has not yet reached the
current node. In this patch modify_config is run
again after some time in this case.
The add_entry and modify_config methods sometimes do an rpc to
execute the request on the current leader. If the tcp connection
was broken, a seastar::rpc::closed_error would be thrown to the client.
This exception was not documented in the method comments and the
client could have missed handling it. For example, this exception
was not handled when calling modify_config in raft_group0,
which sometimes broke the removenode command.
An intermittent_connection_error exception was added earlier to
solve a similar problem with the read_barrier method. In this patch it
is renamed to transport_error, as it seems to better describe the
situation, and an explicit specification for this exception
was added - the rpc implementation can throw it if it is not known
whether the call reached the target node and whether any
actions were performed on it.
In case of read_barrier it does not matter and we just retry. In case
of add_entry and modify_config we cannot retry
because the rpc calls are not idempotent, so we convert this
exception to commit_status_unknown, which the client has to handle.
Explicit comments have also been added to raft::server methods
describing all possible exceptions.