The direct failure detector operates on abstract `endpoint_id`s for
pinging. The `pigner` interface is responsible for translating these IDs
to 'real' addresses.
Earlier we used two types of addresses: IP addresses in 'production'
code (`gms::gossiper::direct_fd_pinger`) and `raft::server_id`s in test
code (in `randomized_nemesis_test`). For each of these use cases we
would maintain mappings between `endpoint_id`s and the address type.
In recent commits we switched the 'production' code to also operate on
Raft server IDs, which are UUIDs underneath.
In this commit we switch `endpoint_id`s from `unsigned` type to
`utils::UUID`. Because each use case operates in Raft server IDs, we can
perform a simple translation: `raft_id.uuid()` to get an `endpoint_id`
from a Raft ID, `raft::server_id{ep_id}` to obtain a Raft ID from
an `endpoint_id`. We no longer have to maintain complex sharded data
structures to store the mappings.
In later commit `direct_fd_pinger` will operate in terms of
`raft::server_id`s. Decouple it from `gossiper` since we don't want to
entangle `gossiper` with Raft-specific stuff.
Replicating `raft_address_map` entries is needed for the following use
cases:
- the direct failure detector - currently it assumes a static mapping of
`raft::server_id`s to `gms::inet_address`es, which is obtained on Raft
group 0 configuration changes. To handle dynamic mappings we need to
modify the failure detector so it pings `raft::server_id`s and obtains
the `gms::inet_address` before sending the message from
`raft_address_map`. The failure detector is sharded, so we need the
mappings to be available on all shards.
- in the future we'll have multiple Raft groups running on different
shards. To send messages they'll need `raft_address_map`.
Initially I tried to replicate all entries - expiring and non-expiring.
The implementation turned out to be very complex - we need to handle
dropping expired entries and refreshing expiring entries' timestamps
across shards, and doing this correctly while accounting for possible
races is quite problematic.
Eventually I arrived at the conclusion that replicating only
non-expiring entries, and furthermore allowing non-expiring entries to
be added only on shard 0, is good enough for our use cases:
- The direct failure detector is pinging group 0 members only; group
0 members correspond exactly to the non-expiring entries.
- Group 0 configuration changes are handled on shard 0, so non-expiring
entries are added/removed on shard 0.
- When we have multiple Raft groups, we can reuse a single Raft server
ID for all Raft servers running on a single node belonging to
different groups; they are 'namespaced' by the group IDs. Furthermore,
every node has a server that belongs to group 0. Thus for every Raft
server in every group, it has a corresponding server in group 0 with
the same ID, which has a non-expiring entry in `raft_address_map`,
which is replicated to all shards; so every group will be able to
deliver its messages.
With these assumptions the implementation is short and simple.
We can always complicate it in the future if we find that the
assumptions are too strong.
Closes#11791
* github.com:scylladb/scylladb:
test/raft: raft_address_map_test: add replication test
service/raft: raft_address_map: replicate non-expiring entries to other shards
service/raft: raft_address_map: assert when entry is missing in drop_expired_entries
service/raft: turn raft_address_map into a service
The storage_service::stop() calls repair_service::abort_repair_node_ops() but at that time the sharded<repair_service> is already stopped and call .local() on it just crashes.
The suggested fix is to remove explicit storage_service -> repair_service kick. Instead, the repair_infos generated for the sake of node-ops are subscribed on the node_ops_meta_data's abort source and abort themselves automatically.
fixes: #10284Closes#11797
* github.com:scylladb/scylladb:
repair: Remove ops_uuid
repair: Remove abort_repair_node_ops() altogether
repair: Subscribe on node_ops_info::as abortion
repair: Keep abort source on node_ops_info
repair: Pass node_ops_info arg to do_sync_data_using_repair()
repair: Mark repair_info::abort() noexcept
node_ops: Remove _aborted bit
node_ops: Simplify construction of node_ops_metadata
main: Fix message about repair service starting
All uses of snitch not have their own local referece. The global
instance can now be replaced with the one living in main (and tests)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The snitch/name endpoint needs snitch instance to get the name from.
Also the storage_service/reset_snitch endpoint will also need snitch
instance to call reset on.
This patch carries local snitch reference all thw way through API setup
and patches the get_name() call. The reset_snitch() will come in the
next patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Some time soon snitch API handlers will operate on local snitch
reference capture, so those need to be unset before the target local
variable variable goes away
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Storage service uses snitch in several places:
- boot
- snitch-reconfigured subscription
- preferred IP reconnection
At this point it's worth adding storage_service->snitch explicit
dependency and patch the above to use local reference
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two places to patch: .start() and .setup() and both only need
snitch to get local dc/rack from, nothing more. Thus both can live with
the explicit argument for now
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Until now, authentication in alternator served only two purposes:
- refusing clients without proper credentials
- printing user information with logs
After this series, this user information is passed to lower layers, which also means that users are capable of attaching service levels to roles, and this service level configuration will be effective with alternator requests.
tests: manually by adding more debug logs and inspecting that per-service-level timeout value was properly applied for an authenticated alternator user
Fixes#11379Closes#11380
* github.com:scylladb/scylladb:
alternator: propagate authenticated user in client state
client_state: add internal constructor with auth_service
alternator: pass auth_service and sl_controller to server
"
There's an ongoing effort to move the endpoint -> {dc/rack} mappings
from snitch onto topology object and this set finalizes it. After it the
snitch service stops depending on gossiper and system keyspace and is
ready for de-globalization. As a nice side-effect the system keyspace no
longer needs to maintain the dc/rack info cache and its starting code gets
relaxed.
refs: #2737
refs: #2795
"
* 'br-snitch-dont-mess-with-topology-data-2' of https://github.com/xemul/scylla: (23 commits)
system_keyspace: Dont maintain dc/rack cache
system_keyspace: Indentation fix after previous patch
system_keyspace: Coroutinuze build_dc_rack_info()
topology: Move all post-configuration to topology::config
snitch: Start early
gossiper: Do not export system keyspace
snitch: Remove gossiper reference
snitch: Mark get_datacenter/_rack methods const
snitch: Drop some dead dependency knots
snitch, code: Make get_datacenter() report local dc only
snitch, code: Make get_rack() report local rack only
storage_service: Populate pending endpoint in on_alive()
code: Populate pending locations
topology: Put local dc/rack on topology early
topology: Add pending locations collection
topology: Make get_location() errors more verbose
token_metadata: Add config, spread everywhere
token_metadata: Hide token_metadata_impl copy constructor
gosspier: Remove messaging service getter
snitch: Get local address to gossip via config
...
Because of snitch ex-dependencies some bits on topology were initialized
with nasty post-start calls. Now it all can be removed and the initial
topology information can be provided by topology::config
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Snitch code doesn't need anything to start working, but it is needed by
the low-level token-metadata, so move the snitch to start early (and to
stop late)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It doesn't need gossiper any longer. This change will allow starting
snitch early by the next patch, and eventually improving the
token-metadata start-up sequence
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The continuation of the previous patch -- all the code uses
topology::get_datacenter(endpoint) to get peers' dc string. The topology
still uses snitch for that, but it already contains the needed data.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All the code out there now calls snitch::get_rack() to get rack for the
local node. For other nodes the topology::get_rack(endpoint) is used.
Since now the topology is properly populated with endpoints, it can
finally be patched to stop using snitch and get rack from its internal
collections
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Startup code needs to know the dc/rack of the local node early, way
before nodes starts any communication with the ring. This information is
available when snitch activates, but it starts _after_ token-metadata,
so the only way to put local dc/rack in topology is via a startup-time
special API call. This new init_local_endpoint() is temporary and will
be removed later in this set
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Next patches will need to provide some early-start data for topology.
The standard way of doing it is via service config, so this patch adds
one. The new config is empty in this patch, to be filled later
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The property-file snitch gossips listen_address as internal-IP state. To
get this value it gets it from snitch->gossiper->messaging_service
chain. This change provides the needed value via config thus cutting yet
another snitch->gossiper dependency and allowing gossiper not to export
messaging service in the future
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Yet another user of global qctx object. Making the method(s) non-static requires pushing the system_keyspace all the way down to size_estimate_virtual_reader and a small update of the cql_test_env
Closes#11738
* github.com:scylladb/scylladb:
system_keyspace: Make get_{local|saved}_tokens non static
size_estimates_virtual_reader: Pass sys_ks argument to get_local_ranges()
cql_test_env: Keep sharded<system_keyspace> reference
size_estimate_virtual_reader: Keep system_keyspace reference
system_keyspace: Pass sys_ks argument to install_virtual_readers()
system_keyspace: Make make() non-static
distributed_loader: Pass sys_ks argument to init_system_keyspace()
system_keyspace: Remove dangling forward declaration
It's final destination is virtual tabls registration code called from
init_system_keyspace() eventually
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Long-term index caching in the global cache, as introduced in 4.6, is a major
pessimization for workloads where accesses to the index are (spacially) sparse.
We want to have a way to disable it for the affected workloads.
There is already infrastructure in place for disabling it for BYPASS CACHE
queries. One way of solving the issue is hijacking that infrastructure.
This patch adds a global flag (and a corresponding CLI option) which controls
index caching. Setting the flag to `false` causes all index reads to behave
like they would in BYPASS CACHE queries.
Consequences of this choice:
- The per-SSTable partition_index_cache is unused. Every index_reader has
its own, and they die together. Independent reads can no longer reuse the
work of other reads which hit the same index pages. This is not crucial,
since partition accesses have no (natural) spatial locality. Note that
the original reason for partition_index_cache -- the ability to share
reads for the lower and upper bound of the query -- is unaffected.
- The per-SSTable cached_file is unused. Every index_reader has its own
(uncached) input stream from the index file, and every
bsearch_clustered_cursor has its own cached_file, which dies together with
the cursor. Note that the cursor still can perform its binary search with
caching. However, it won't be able to reuse the file pages read by
index_reader. In particular, if the promoted index is small, and fits inside
the same file page as its index_entry, that page will be re-read.
It can also happen that index_reader will read the same index file page
multiple times. When the summary is so dense that multiple index pages fit in
one index file page, advancing the upper bound, which reads the next index
page, will read the same index file page. Since summary:disk ratio is 1:2000,
this is expected to happen for partitions with size greater than 2000
partition keys.
Fixes#11202
These two are just getting in the way when touching inter-components
dependencies around messaging service. Without it m.-s. start/stop
just looks like any other service out there
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#11535
Task manager for observing and managing long-running, asynchronous tasks in Scylla
with the interface for the user. It will allow listing of tasks, getting detailed
task status and progression, waiting for their completion, and aborting them.
The task manager will be configured with a “task ttl” that determines how long
the task status is kept in memory after the task completes.
At first it will support repair and compaction tasks, and possibly more in the future.
Currently:
Sharded `task_manager` is started in `main.cc` where it is further passed
to `http_context` for the purpose of user interface.
Task manager's tasks are implemented in two two layers: the abstract
and the implementation one. The latter is a pure virtual class which needs
to be overriden by each module. Abstract layer provides the methods that
are shared by all modules and the access to module-specific methods.
Each module can access task manager, create and manage its tasks through
`task_manager::module` object. This way data specific to a module can be
separated from the other modules.
User can access task manager rest api interface to track asynchronous tasks.
The available options consist of:
- getting a list of modules
- getting a list of basic stats of all tasks in the requested module
- getting the detailed status of the requested task
- aborting the requested task
- waiting for the requested task to finish
To enable testing of the provided api, test specific task implementation and module
are provided. Their lifetime can be simulated with the standalone test api.
These components are compiled and the tests are run in all but release build modes.
Fixes: #9809Closes#11216
* github.com:scylladb/scylladb:
test: task manager api test
task_manager: test api layer implementation
task_manager: add test specific classes
task_manager: test api layer
task_manager: api layer implementation
task_manager: api layer
task_manager: keep task_manager reference in http_context
start sharded task manager
task_manager: create task manager object
Broadcast tables are tables for which all statements are strongly
consistent (linearizable), replicated to every node in the cluster and
available as long as a majority of the cluster is available. If a user
wants to store a “small” volume of metadata that is not modified “too
often” but provides high resiliency against failures and strong
consistency of operations, they can use broadcast tables.
The main goal of the broadcast tables project is to solve problems which
need to be solved when we eventually implement general-purpose strongly
consistent tables: designing the data structure for the Raft command,
ensuring that the commands are idempotent, handling snapshots correctly,
and so on.
In this MVP (Minimum Viable Product), statements are limited to simple
SELECT and UPDATE operations on the built-in table. In the future, other
statements and data types will be available but with this PR we can
already work on features like idempotent commands or snapshotting.
Snapshotting is not handled yet which means that restarting a node or
performing too many operations (which would cause a snapshot to be
created) will give incorrect results.
In a follow-up, we plan to add end-to-end Jepsen tests
(https://jepsen.io/). With this PR we can already simulate operations on
lists and test linearizability in linear complexity. This can also test
Scylla's implementation of persistent storage, failure detector, RPC,
etc.
Design doc: https://docs.google.com/document/d/1m1IW320hXtsGulzSTSHXkfcBKaG5UlsxOpm6LN7vWOc/edit?usp=sharingCloses#11164
* github.com:scylladb/scylladb:
raft: broadcast_tables: add broadcast_kv_store test
raft: broadcast_tables: add returning query result
raft: broadcast_tables: add execution of intermediate language
raft: broadcast_tables: add compilation of cql to intermediate language
raft: broadcast_tables: add definition of intermediate language
db: system_keyspace: add broadcast_kv_store table
db: config: add BROADCAST_TABLES feature flag
The implementation of a test api that helps testing task manager
api. It provides methods to simulate the operations that can happen
on modules and theirs task. Through the api user can: register
and unregister the test module and the tasks belonging to the module,
and finish the tasks with success or custom error.
The implementation of a task manager api layer. It provides
methods to list the modules registered in task_manager, list
tasks belonging to the given module, abort, wait for or retrieve
a status of the given task.
We decided to extend `cql_statement` hierarchy with `strongly_consistent_modification_statement`
and `strongly_consistent_select_statement`. Statements operating on
system.broadcast_kv_store will be compiled to these new subclasses if
BROADCAST_TABLES flag is enabled.
If the query is executed on a shard other than 0 it's bounced to that shard.
There's one corner case in nodes sorting by snitch. The simple snitch
code overloads the call and doesn't sort anything. The same behavior
should be preserved by (future) topology implementation, but it doesn't
know the snitch name. To address that the patch adds a boolean switch on
topology that's turned off by main code when it sees the snitch is
"simple" one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Add experimental flag 'broadcast-tables' for enabling BROADCAST_TABLES feature.
This feature requires raft group0, thus enabling it without RAFT will cause an error.
Messaging will need to call topology methods to compare DC/RACK of peers
with local node. Topology now resides on token metadata, so messaging
needs to get the dependency reference.
However, messaging only needs the topology when it's up and running, so
instead of producing a life-time reference, add a pointer, that's set up
on .start_listen(), before any client pops up, and is cleared on
.shutdown() after all connections are dropped.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
this setting was removed back in
dcdd207349, so despite that we are still
passing `storage_service_config` to the ctor of `storage_service`,
`storage_service::storage_service()` just drops it on the floor.
in this change, `storage_service_config` class is removed, and all
places referencing it are updated accordingly.
Signed-off-by: Kefu Chai <tchaikov@gmail.com>
Closes#11415
A listener is created inside `raft_group0` for acting when the
SUPPORTS_RAFT feature is enabled. The listener is established after the
node enters NORMAL status (in `raft_group0::finish_setup_after_join()`,
called at the end of `storage_service::join_cluster()`).
The listener starts the `upgrade_to_group0` procedure.
The procedure, on a high level, works as follows:
- join group 0
- wait until every peer joined group 0 (peers are taken from
`system.peers` table)
- enter `synchronize` upgrade state, in which group 0 operations are
disabled (see earlier commit which implemented this logic)
- wait until all members of group 0 entered `synchronize` state or some
member entered the final state
- synchronize schema by comparing versions and pulling if necessary
- enter the final state (`use_new_procedures`), in which group 0 is used
for schema operations (only those for now).
The devil lies in the details, and the implementation is ugly compared
to this nice description; for example there are many retry loops for
handling intermittent network failures. Read the code.
`leave_group0` and `remove_group0` were adjusted to handle the upgrade
procedure being run correctly; if necessary, they will wait for the
procedure to finish.
If the upgrade procedure gets stuck (and it may, since it requires all
nodes to be available to contact them to correctly establish a single
group 0 raft cluster); or if a running cluster permanently loses a
majority of nodes, causing group 0 unavailability; the cluster admin
is not left without help.
We introduce a recovery mode, which allows the admin to
completely get rid of traces of existing group 0 and restart the
upgrade procedure - which will establish a new group 0. This works even
in clusters that never upgraded but were bootstrapped using group 0 from
scratch.
To do that, the admin does the following on every node:
- writes 'recovery' under 'group0_upgrade_state' key
in `system.scylla_local` table,
- truncates the `system.discovery` table,
- truncates the `system.group0_history` table,
- deletes group 0 ID and group 0 server ID from `system.scylla_local`
(the keys are `raft_group0_id` and `raft_server_id`
then the admin performs a rolling restart of their cluster. The nodes
restart in a "group 0 recovery mode", which simply means that the nodes
won't try to perform any group 0 operations. Then the admin calls
`removenode` to remove the nodes that are down. Finally, the admin
removes the `group0_upgrade_state` key from `system.scylla_local`,
rolling-restarts the cluster, and the cluster should establish group 0
anew.
Note that this recovery procedure will have to be extended when new
stuff is added to group 0 - like topology change state. Indeed, observe
that a minority of nodes aren't able to receive committed entries from a
leader, so they may end up in inconsistent group 0 states. It wouldn't
be safe to simply create group 0 on those nodes without first ensuring
that they have the same state from which group 0 will start.
Right now the state only consist of schema tables, and the upgrade
procedure ensures to synchronize them, so even if the nodes started in
inconsistent schema states, group 0 will correctly be established.
(TODO: create a tracking issue? something needs to remind us of this
whenever we extend group 0 with new stuff...)
Now, whether an 'group 0 operation' (today it means schema change) is
performed using the old or new methods, doesn't depend on the local RAFT
fature being enabled, but on the state of the upgrade procedure.
In this commit the state of the upgrade is always
`use_pre_raft_procedures` because the upgrade procedure is not
implemented yet. But stay tuned.
The upgrade procedure will need certain guarantees: at some point it
switches from `use_pre_raft_procedures` to `synchronize` state. During
`synchronize` schema changes must be disabled, so the procedure can
ensure that schema is in sync across the entire cluster before
establishing group 0. Thus, when the switch happens, no schema change
can be in progress.
To handle all this weirdness we introduce `_upgrade_lock` and
`get_group0_upgrade_state` which takes this lock whenever it returns
`use_pre_raft_procedures`. Creating a `group0_guard` - which happens at
the start of every group 0 operation - will take this lock, and the lock
holder shall be stored inside the guard (note: the holder only holds the
lock if `use_pre_raft_procedures` was returned, no need to hold it for
other cases). Because `group0_guard` is held for the entire duration of
a group 0 operation, and because the upgrade procedure will also have to
take this lock whenever it wants to change the upgrade state (it's an
rwlock), this ensures that no group 0 operation that uses the old ways
is happening when we change the state.
We also implement `wait_until_group0_upgraded` using a condition
variable. It will be used by certain methods during upgrade (later
commits; stay tuned).
Some additional comments were written.
Refs #11237
Don't store segments found on init scan in all shard instances,
instead retrieve (based on low time-pos for current gen) when
required. This changes very little, but we at last don't store
pointless string lists in shards 1 to X, and also we can potentially
ask for the list twice. More to the point, goes better hand-in-hand
with the semantics of "delete_segments", where any file sent in is
considered candidate for recycling, and included in footprint.
Introduce a `remote` class that handles all remote communication in `storage_proxy`: sending and receiving RPCs, checking the state of other nodes by accessing the gossiper, and fetching schema.
The `remote` object lives inside `storage_proxy` and right now it's initialized and destroyed together with `storage_proxy`.
The long game here is to split the initialization of `storage_proxy` into two steps:
- the first step, which constructs `storage_proxy`, initializes it "locally" and does not require references to `messaging_service` and `gossiper`.
- the second step will take those references and add the `remote` part to `storage_proxy`.
This will allow us to remove some cycles from the service (de)initialization order and in general clean it up a bit. We'll be able to start `storage_proxy` right after the `database` (without messaging/gossiper). Similar refactors are planned for `query_processor`.
Closes#11088
* github.com:scylladb/scylladb:
service: storage_proxy: pass `migration_manager*` to `init_messaging_service`
service: storage_proxy: `remote`: make `_gossiper` a const reference
gms: gossiper: mark some member functions const
db: consistency_level: `filter_for_query`: take `const gossiper&`
replica: table: `get_hit_rate`: take `const gossiper&`
gms: gossiper: move `endpoint_filter` to `storage_proxy` module
service: storage_proxy: pass `shared_ptr<gossiper>` to `start_hints_manager`
service: storage_proxy: establish private section in `remote`
service: storage_proxy: remove `migration_manager` pointer
service: storage_proxy: remove calls to `storage_proxy::remote()` from `remote`
service: storage_proxy: remove `_gossiper` field
alternator: ttl: pass `gossiper&` to `expiration_service`
service: storage_proxy: move `truncate_blocking` implementation to `remote`
service: storage_proxy: introduce `is_alive` helper
service: storage_proxy: remove `_messaging` reference
service: storage_proxy: move `connection_dropped` to `remote`
service: storage_proxy: make `encode_replica_exception_for_rpc` a static function
service: storage_proxy: move `handle_write` to `remote`
service: storage_proxy: move `handle_paxos_prune` to `remote`
service: storage_proxy: move `handle_paxos_accept` to `remote`
service: storage_proxy: move `handle_paxos_prepare` to `remote`
service: storage_proxy: move `handle_truncate` to `remote`
service: storage_proxy: move `handle_read_digest` to `remote`
service: storage_proxy: move `handle_read_mutation_data` to `remote`
service: storage_proxy: move `handle_read_data` to `remote`
service: storage_proxy: move `handle_mutation_failed` to `remote`
service: storage_proxy: move `handle_mutation_done` to `remote`
service: storage_proxy: move `handle_paxos_learn` to `remote`
service: storage_proxy: move `receive_mutation_handler` to `remote`
service: storage_proxy: move `handle_counter_mutation` to `remote`
service: storage_proxy: remove `get_local_shared_storage_proxy`
service: storage_proxy: (de)register RPC handlers in `remote`
service: storage_proxy: introduce `remote`
`migration_manager` lifetime is longer than the lifetime of "storage
proxy's messaging service part" - that is, `init_messaging_service` is
called after `migration_manager` is started, and `uninit_messaging_service`
is called before `migration_manager` is stopped. Thus we don't need to
hold an owning pointer to `migration_manager` here.
Later, when `init_messaging_service` will actually construct `remote`,
this will be a reference, not a pointer.
Also observe that `_mm` in `remote` is only used in handlers, and
handlers are unregistered before `_mm` is nullified, which ensures that
handlers are not running when `_mm` is nullified. (This argument shows
why the code made sense regardless of our switch from shared_ptr to raw
ptr).
Even on the environment which causes error during initalize Scylla,
"scylla --version" should be able to run without error.
To do so, we need to parse and execute these options before
initializing Scylla/Seastar classes.
Fixes#11117Closes#11179