Whenever a Raft configuration change is performed, `raft::server` calls
`raft_rpc::add_server`/`raft_rpc::remove_server`. Our `raft_rpc`
implementation has a function, `_on_server_update`, passed in the
constructor, which it called in `add_server`/`remove_server`;
that function would update the set of endpoints detected by the
direct failure detector. `_on_server_update` was passed an IP address
and that address was added to / removed from the failure detector set
(there's another translation layer between the IP addresses and internal
failure detector 'endpoint ID's; but we can ignore it for the purposes
of this commit).
Therefore: the failure detector was pinging a certain set of IP
addresses. These IP addresses were updated during Raft configuration
changes.
To implement the `is_alive(raft::server_id)` function (required by
`raft::failure_detector` interface), we would translate the ID using
the Raft address map, which is currently also updated during
configuration changes, to an IP address, and check if that IP address is
alive according to the direct failure detector (which maintained an
`_alive_set` of type `unordered_set<gms::inet_address>`).
This all works well but it assumes that servers can be identified using
IP addresses - it doesn't play well with the fact that servers may
change their IP addresses. The only immutable identifier we have for a
server is `raft::server_id`. In the future, Raft configurations will not
associate IP addresses with Raft servers; instead we will assume that IP
addresses can change at any time, and there will be a different
mechanism that eventually updates the Raft address map with the latest
IP address for each `raft::server_id`.
To prepare us for that future, in this commit we no longer operate in
terms of IP addresses in the failure detector, but in terms of
`raft::server_id`s. Most of the commit is boilerplate, changing
`gms::inet_address` to `raft::server_id` and function/variable names.
The interesting changes are:
- in `is_alive`, we no longer need to translate the `raft::server_id` to
an IP address, because now the stored `_alive_set` already contains
`raft::server_id`s instead of `gms::inet_address`es.
- the `ping` function now takes a `raft::server_id` instead of
`gms::inet_address`. To send the ping message, we need to translate
this to IP address; we do it by the `raft_address_map` pointer
introduced in an earlier commit.
Thus, there is still a point where we have to translate between
`raft::server_id` and `gms::inet_address`; but observe we now do it at
the last possible moment - just before sending the message. If we
have no translation, we consider the `ping` to have failed - it's
equivalent to a network failure where no route to a given address was
found.
In later commit `direct_fd_pinger` will operate in terms of
`raft::server_id`s. Decouple it from `gossiper` since we don't want to
entangle `gossiper` with Raft-specific stuff.
`gms::gossiper::direct_fd_pinger` serves multiple purposes: one of them
is to maintain a mapping between `gms::inet_address`es and
`direct_failure_detector::pinger::endpoint_id`s, another is to cache the
last known gossiper's generation number to use it for sending gossip
echo messages. The latter is the only gossiper-specific thing in this
class.
We want to move `direct_fd_pinger` utside `gossiper`. To do that, split the
gossiper-specific thing -- the generation number management -- to a
smaller class, `echo_pinger`.
`echo_pinger` is a top-level class (not a nested one like
`direct_fd_pinger` was) so we can forward-declare it and pass references
to it without including gms/gossiper.hh header.
Replicating `raft_address_map` entries is needed for the following use
cases:
- the direct failure detector - currently it assumes a static mapping of
`raft::server_id`s to `gms::inet_address`es, which is obtained on Raft
group 0 configuration changes. To handle dynamic mappings we need to
modify the failure detector so it pings `raft::server_id`s and obtains
the `gms::inet_address` before sending the message from
`raft_address_map`. The failure detector is sharded, so we need the
mappings to be available on all shards.
- in the future we'll have multiple Raft groups running on different
shards. To send messages they'll need `raft_address_map`.
Initially I tried to replicate all entries - expiring and non-expiring.
The implementation turned out to be very complex - we need to handle
dropping expired entries and refreshing expiring entries' timestamps
across shards, and doing this correctly while accounting for possible
races is quite problematic.
Eventually I arrived at the conclusion that replicating only
non-expiring entries, and furthermore allowing non-expiring entries to
be added only on shard 0, is good enough for our use cases:
- The direct failure detector is pinging group 0 members only; group
0 members correspond exactly to the non-expiring entries.
- Group 0 configuration changes are handled on shard 0, so non-expiring
entries are added/removed on shard 0.
- When we have multiple Raft groups, we can reuse a single Raft server
ID for all Raft servers running on a single node belonging to
different groups; they are 'namespaced' by the group IDs. Furthermore,
every node has a server that belongs to group 0. Thus for every Raft
server in every group, it has a corresponding server in group 0 with
the same ID, which has a non-expiring entry in `raft_address_map`,
which is replicated to all shards; so every group will be able to
deliver its messages.
With these assumptions the implementation is short and simple.
We can always complicate it in the future if we find that the
assumptions are too strong.
Closes#11791
* github.com:scylladb/scylladb:
test/raft: raft_address_map_test: add replication test
service/raft: raft_address_map: replicate non-expiring entries to other shards
service/raft: raft_address_map: assert when entry is missing in drop_expired_entries
service/raft: turn raft_address_map into a service
We capture `key` by reference, but it is in a another continuation.
Capture it by value, and avoid the default capture specification.
Found by clang 15 + asan + aarch64.
Closes#11884
To fix CVE-2022-24675, we need to a binary compiled in <= golang 1.18.1.
Only released version which compiled <= golang 1.18.1 is node_exporter
1.4.0, so we need to update to it.
See scylladb/scylla-enterprise#2317
Closes#11400
[avi: regenerated frozen toolchain]
Closes#11879
Starting from https://github.com/scylladb/scylla-pkg/pull/3035 we
removed all old tar.gz prefix from uploading to S3 or been used by
downstream jobs.
Hence, there is no point building those tar.gz files anymore
Closes#11865
Fix https://github.com/scylladb/scylla-doc-issues/issues/864
This PR:
- updates the introduction to add information about AArch64 and rewrite the content.
- replaces "Scylla" with "ScyllaDB".
Closes#11778
* github.com:scylladb/scylladb:
Update docs/getting-started/system-requirements.rst
doc: fix the link to the OS Support page
doc: replace Scylla with ScyllaDB
doc: update the info about supported architecture and rewrite the introduction
This patch adds a reproducing test for issue #11588, which is still open
so the test is expected to fail on Scylla ("xfail), and passes on Cassandra.
The test shows that Scylla allows an out-of-range value to be written to
timestamp column, but then it can't be read back.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes#11864
The PR prepares repair for task manager integration:
- Creates repair_module
- Keeps repair_module in repair_service
- Moves tracker methods to repair_module
- Changes UUID to task_id in repair module
Closes#11851
* github.com:scylladb/scylladb:
repair: check shutdown with abort source in repair module
repair: use generic module gate for repair module operations
repair: move tracker to repair module
repair: move next_repair_command to repair_module
repair: generate repair id in repair module
repair: keep shard number in repair_uniq_id
repair: change UUID to task_id
repair: add task_manager::module to repair_service
repair: create repair module and task
Repair module uses a gate to prevent starting new tasks on shutdown.
Generic module's gate serves the same purpose, thus we can
use it also in repair specific context.
Since both tracker and repair_module serve similar purpose,
it is confusing where we should seek for methods connected to them.
Thus, to make it more transparent, tracker class is deleted and all
its attributes and methods are moved to repair_module.
Number of the repair operation was counted both with
next_repair_command from tracer and sequence number
from task_manager::module.
To get rid of redundancy next_repair_command was deleted and all
methods using its value were moved to repair_module.
Execution shard is one of the traits specific to repair tasks.
Child task should freely access shard id of its parent. Thus,
the shard number is kept in a repair_uniq_id struct.
Create repair_task_impl and repair_module inheriting from respectively
task manager task_impl and module to integrate repair operations with
task manager.
We currently avoid compiling C code in "pip3 install scylla-driver", but
we actually providing portable binary distributions of the package,
so we should use it by "pip3 install --only-binary=:all: scylla-driver".
The binary distribution contains dependency libraries, so we won't have
problem loading it on relocatable python3.
Closes#11852
Replicating `raft_address_map` entries is needed for the following use
cases:
- the direct failure detector - currently it assumes a static mapping of
`raft::server_id`s to `gms::inet_address`es, which is obtained on Raft
group 0 configuration changes. To handle dynamic mappings we need to
modify the failure detector so it pings `raft::server_id`s and obtains
the `gms::inet_address` before sending the message from
`raft_address_map`. The failure detector is sharded, so we need the
mappings to be available on all shards.
- in the future we'll have multiple Raft groups running on different
shards. To send messages they'll need `raft_address_map`.
Initially I tried to replicate all entries - expiring and non-expiring.
The implementation turned out to be very complex - we need to handle
dropping expired entries and refreshing expiring entries' timestamps
across shards, and doing this correctly while accounting for possible
races is quite problematic.
Eventually I arrived at the conclusion that replicating only
non-expiring entries, and furthermore allowing non-expiring entries to
be added only on shard 0, is good enough for our use cases:
- The direct failure detector is pinging group 0 members only; group
0 members correspond exactly to the non-expiring entries.
- Group 0 configuration changes are handled on shard 0, so non-expiring
entries are added/removed on shard 0.
- When we have multiple Raft groups, we can reuse a single Raft server
ID for all Raft servers running on a single node belonging to
different groups; they are 'namespaced' by the group IDs. Furthermore,
every node has a server that belongs to group 0. Thus for every Raft
server in every group, it has a corresponding server in group 0 with
the same ID, which has a non-expiring entry in `raft_address_map`,
which is replicated to all shards; so every group will be able to
deliver its messages.
With these assumptions the implementation is short and simple.
We can always complicate it in the future if we find that the
assumptions are too strong.
The PR adds changes to task manager that allow more convenient integration with modules.
Introduced changes:
- adds internal flag in task::impl that allows user to filter too specific tasks
- renames `parent_data` to more appropriate name `task_info`
- creates `tasks/types.hh` which allows using some types connected with task manager without the necessity to include whole task manager
- adds more flexible version of `make_task` method
Closes#11821
* github.com:scylladb/scylladb:
tasks: add alternative make_task method
tasks: rename parent_data to task_info and move it
tasks: move task_id to tasks/types.hh
tasks: add internal flag for task_manager::task::impl
Prevent copying shared_ptr across shards
in do_sync_data_using_repair by allocating
a shared_ptr<abort_source> per shard in
node_ops_meta_data and respectively in node_ops_info.
Fixes#11826
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#11827
* github.com:scylladb/scylladb:
repair: use sharded abort_source to abort repair_info
repair: node_ops_info: add start and stop methods
storage_service: node_ops_abort_thread: abort all node ops on shutdown
storage_service: node_ops_abort_thread: co_return only after printing log message
storage_service: node_ops_meta_data: add start and stop methods
repair: node_ops_info: prevent accidental copy
Currently, to_string() recursively calls itself for engaged optionals.
Eliminate it. Also, use the std_optional wrapper instead of accessing
std::optional internals directly.
Scylla fiber uses a crude method of scanning inbound and outbound
references to/from other task objects of recognized type. This method
cannot detect user instantiated promise<> objects. Add a note about this
to the printout, so users are beware of this.
We collect already seen tasks in a set to be able to detect perceived
task loops and stop when one is seen. Initialize this set with the
starting task, so if it forms a loop, we won't repeat it in the trace
before cutting the loop.
Currently there is two loops and a separate line printing the starting
task, all duplicating the formatting logic. Define a method for it and
use it in all 3 places instead.
Shard boundaries can be crossed in one direction currently: when looking
for waiters on a task, but not in the other direction (looking for
waited-on tasks). This patch fixes that.
Currently seastar threads end any attempt to follow waited-on-futures.
Seastar threads need special handling because it allocates the wake up
task on its stack. This patch adds this special handling.
scylla_ptr.analyze() switches to the thread the analyzed object lives
on, but forgets to switch back. This was very annoying as any commands
using it (which is a bunch of them) were prone to suddenly and
unexpectedly switching threads.
This patch makes sure that the original thread context is switched back
to after analyzing the pointer.
Rename to scylla tables. Less typing and more up-to-date.
By default it now only lists tables from local shard. Added flag -a
which brings back old behaviour (lists on all shards).
Added -u (only list user tables) and -k (list tables of provided
keyspace only) filtering options.