864 Commits

Author SHA1 Message Date
Kamil Braun
e086521c1a direct_failure_detector: get rid of complex endpoint_id translations
The direct failure detector operates on abstract `endpoint_id`s for
pinging. The `pinger` interface is responsible for translating these IDs
to 'real' addresses.

Earlier we used two types of addresses: IP addresses in 'production'
code (`gms::gossiper::direct_fd_pinger`) and `raft::server_id`s in test
code (in `randomized_nemesis_test`). For each of these use cases we
would maintain mappings between `endpoint_id`s and the address type.

In recent commits we switched the 'production' code to also operate on
Raft server IDs, which are UUIDs underneath.

In this commit we switch `endpoint_id`s from `unsigned` type to
`utils::UUID`. Because each use case operates on Raft server IDs, we can
perform a simple translation: `raft_id.uuid()` to get an `endpoint_id`
from a Raft ID, `raft::server_id{ep_id}` to obtain a Raft ID from
an `endpoint_id`. We no longer have to maintain complex sharded data
structures to store the mappings.
2022-11-04 09:38:08 +01:00
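The translation described in e086521c1a can be illustrated with a minimal sketch. The `UUID` and `server_id` types below are simplified stand-ins written for this example (they are not Scylla's `utils::UUID` or `raft::server_id`), and the helper names `to_endpoint_id`, `to_server_id` and `check_round_trip` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-in for utils::UUID (assumption: two 64-bit halves).
struct UUID {
    std::uint64_t msb = 0;
    std::uint64_t lsb = 0;
    bool operator==(const UUID& o) const { return msb == o.msb && lsb == o.lsb; }
};

// After the change an endpoint_id is a UUID rather than an unsigned integer.
using endpoint_id = UUID;

// Simplified stand-in for raft::server_id: a typed wrapper around a UUID.
struct server_id {
    UUID id;
    explicit server_id(UUID u) : id(u) {}
    UUID uuid() const { return id; }
};

// Raft ID -> endpoint ID: unwrap the UUID.
inline endpoint_id to_endpoint_id(const server_id& s) { return s.uuid(); }

// Endpoint ID -> Raft ID: wrap the UUID again.
inline server_id to_server_id(const endpoint_id& ep) { return server_id{ep}; }

// The round trip is lossless, so no mapping table has to be maintained.
inline void check_round_trip(const server_id& s) {
    assert(to_server_id(to_endpoint_id(s)).uuid() == s.uuid());
}
```

Because both directions are pure functions of the ID itself, nothing has to be stored per shard, which is the simplification the commit message points at.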
Kamil Braun
ac70a05c7e service/raft: store raft_address_map reference in direct_fd_pinger
The pinger will use the map to translate `raft::server_id`s to
`gms::inet_address`es when pinging.
2022-11-04 09:38:08 +01:00
Kamil Braun
2c20f2ab9d gms: gossiper: move direct_fd_pinger out to a separate service
In a later commit `direct_fd_pinger` will operate in terms of
`raft::server_id`s. Decouple it from `gossiper` since we don't want to
entangle `gossiper` with Raft-specific stuff.
2022-11-04 09:38:08 +01:00
Pavel Emelyanov
efbfcdb97e Merge 'Replicate raft_address_map non-expiring entries to other shards' from Kamil Braun
Replicating `raft_address_map` entries is needed for the following use
cases:
- the direct failure detector - currently it assumes a static mapping of
  `raft::server_id`s to `gms::inet_address`es, which is obtained on Raft
  group 0 configuration changes. To handle dynamic mappings we need to
  modify the failure detector so it pings `raft::server_id`s and obtains
  the `gms::inet_address` before sending the message from
  `raft_address_map`. The failure detector is sharded, so we need the
  mappings to be available on all shards.
- in the future we'll have multiple Raft groups running on different
  shards. To send messages they'll need `raft_address_map`.

Initially I tried to replicate all entries - expiring and non-expiring.
The implementation turned out to be very complex - we need to handle
dropping expired entries and refreshing expiring entries' timestamps
across shards, and doing this correctly while accounting for possible
races is quite problematic.

Eventually I arrived at the conclusion that replicating only
non-expiring entries, and furthermore allowing non-expiring entries to
be added only on shard 0, is good enough for our use cases:
- The direct failure detector is pinging group 0 members only; group
  0 members correspond exactly to the non-expiring entries.
- Group 0 configuration changes are handled on shard 0, so non-expiring
  entries are added/removed on shard 0.
- When we have multiple Raft groups, we can reuse a single Raft server
  ID for all Raft servers running on a single node belonging to
  different groups; they are 'namespaced' by the group IDs. Furthermore,
  every node has a server that belongs to group 0. Thus for every Raft
  server in every group, it has a corresponding server in group 0 with
  the same ID, which has a non-expiring entry in `raft_address_map`,
  which is replicated to all shards; so every group will be able to
  deliver its messages.

With these assumptions the implementation is short and simple.
We can always complicate it in the future if we find that the
assumptions are too strong.

Closes #11791

* github.com:scylladb/scylladb:
  test/raft: raft_address_map_test: add replication test
  service/raft: raft_address_map: replicate non-expiring entries to other shards
  service/raft: raft_address_map: assert when entry is missing in drop_expired_entries
  service/raft: turn raft_address_map into a service
2022-11-03 18:34:42 +03:00
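The replication rule from efbfcdb97e (non-expiring entries are added only on shard 0 and copied to every shard, expiring entries stay local) can be modelled in a single process. This is a sketch under stated assumptions: shards are plain vector slots, server IDs are `int`s, addresses are strings, and the class name `sharded_address_map_sketch` is hypothetical. It does not model expiry timers or cross-shard calls.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical single-process model of a per-shard ID -> address map.
class sharded_address_map_sketch {
    struct entry {
        std::string addr;
        bool expiring;
    };
    std::vector<std::unordered_map<int, entry>> _shards;

public:
    explicit sharded_address_map_sketch(std::size_t shard_count) : _shards(shard_count) {
        assert(shard_count > 0);
    }

    // Non-expiring entries may only be added on shard 0; they are copied to all shards.
    void add_non_expiring(std::size_t shard, int id, const std::string& addr) {
        if (shard != 0) {
            throw std::logic_error("non-expiring entries must be added on shard 0");
        }
        for (auto& m : _shards) {
            m[id] = entry{addr, false};
        }
    }

    // Expiring entries stay on the shard that learned them and never
    // overwrite a non-expiring entry.
    void add_expiring(std::size_t shard, int id, const std::string& addr) {
        auto& m = _shards.at(shard);
        auto it = m.find(id);
        if (it == m.end()) {
            m.emplace(id, entry{addr, true});
        } else if (it->second.expiring) {
            it->second.addr = addr;
        }
    }

    // Returns nullptr when the given shard has no mapping for the ID.
    const std::string* find(std::size_t shard, int id) const {
        const auto& m = _shards.at(shard);
        auto it = m.find(id);
        return it == m.end() ? nullptr : &it->second.addr;
    }
};
```

Restricting writes of replicated entries to one shard is what avoids the cross-shard races mentioned in the message: every other shard only ever receives copies.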
Aleksandra Martyniuk
dc80af33bc repair: add task_manager::module to repair_service
repair_service keeps a shared pointer to repair_module.
2022-10-31 10:04:50 +01:00
Kamil Braun
159bb32309 service/raft: turn raft_address_map into a service 2022-10-31 09:17:10 +01:00
Botond Dénes
396d9e6a46 Merge 'Subscribe repair_info::abort on node_ops_meta_data::abort_source' from Pavel Emelyanov
The storage_service::stop() calls repair_service::abort_repair_node_ops() but at that time the sharded<repair_service> is already stopped and calling .local() on it just crashes.

The suggested fix is to remove explicit storage_service -> repair_service kick. Instead, the repair_infos generated for the sake of node-ops are subscribed on the node_ops_meta_data's abort source and abort themselves automatically.

fixes: #10284

Closes #11797

* github.com:scylladb/scylladb:
  repair: Remove ops_uuid
  repair: Remove abort_repair_node_ops() altogether
  repair: Subscribe on node_ops_info::as abortion
  repair: Keep abort source on node_ops_info
  repair: Pass node_ops_info arg to do_sync_data_using_repair()
  repair: Mark repair_info::abort() noexcept
  node_ops: Remove _aborted bit
  node_ops: Simplify construction of node_ops_metadata
  main: Fix message about repair service starting
2022-10-21 10:08:43 +03:00
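The fix in 396d9e6a46 replaces an explicit cross-service call with a subscription on an abort source. The sketch below shows the pattern only. `simple_abort_source` and `repair_job_sketch` are hypothetical stand-ins written for this example; Seastar's real `abort_source` has a different, richer interface, and this sketch assumes every job outlives the abort source it subscribes to (real code has to unsubscribe on destruction).

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Minimal stand-in for an abort source (not Seastar's abort_source).
class simple_abort_source {
    std::vector<std::function<void()>> _subscribers;
    bool _aborted = false;

public:
    // Register a callback to run on abort. If the abort was already
    // requested, the callback runs immediately.
    void subscribe(std::function<void()> cb) {
        assert(cb != nullptr);
        if (_aborted) {
            cb();
            return;
        }
        _subscribers.push_back(std::move(cb));
    }

    // Notify every subscriber once; repeated requests are ignored.
    void request_abort() {
        if (_aborted) {
            return;
        }
        _aborted = true;
        auto subs = std::move(_subscribers);
        _subscribers.clear();
        for (auto& cb : subs) {
            cb();
        }
    }

    bool abort_requested() const { return _aborted; }
};

// Stand-in for a repair_info created for a node operation: it aborts
// itself when the node operation's abort source fires.
struct repair_job_sketch {
    bool aborted = false;

    explicit repair_job_sketch(simple_abort_source& as) {
        as.subscribe([this] { abort(); });
    }
    repair_job_sketch(const repair_job_sketch&) = delete;
    repair_job_sketch& operator=(const repair_job_sketch&) = delete;

    void abort() noexcept { aborted = true; }
};
```

With this shape the owner of the node operation only has to call `request_abort()`; it never needs a reference to the repair service, so the shutdown-order problem described above does not arise.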
Pavel Emelyanov
01b1f56bd7 code: Deglobalize snitch
All uses of snitch now have their own local reference. The global
instance can now be replaced with the one living in main (and tests)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-20 12:33:41 +03:00
Pavel Emelyanov
0d49b0e24a api: Use local snitch reference
The snitch/name endpoint needs a snitch instance to get the name from.
The storage_service/reset_snitch endpoint will also need a snitch
instance to call reset on.

This patch carries a local snitch reference all the way through API setup
and patches the get_name() call. The reset_snitch() will come in the
next patch

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-20 12:31:45 +03:00
Pavel Emelyanov
c175ea33e2 api: Unset snitch endpoints on stop
Some time soon snitch API handlers will operate on a captured local
snitch reference, so they need to be unset before the target local
variable goes away

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-20 12:31:12 +03:00
Pavel Emelyanov
ea8bfc4844 storage_service: Keep local snitch reference
Storage service uses snitch in several places:
- boot
- snitch-reconfigured subscription
- preferred IP reconnection

At this point it's worth adding an explicit storage_service->snitch
dependency and patching the above to use the local reference

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-20 12:30:00 +03:00
Pavel Emelyanov
52d6e56a10 system_keyspace: Don't use global snitch instance
There are two places to patch: .start() and .setup() and both only need
snitch to get local dc/rack from, nothing more. Thus both can live with
the explicit argument for now

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-20 12:29:26 +03:00
Nadav Har'El
264f453b9d Merge 'Associate alternator user with its service level configuration' from Piotr Sarna
Until now, authentication in alternator served only two purposes:
 - refusing clients without proper credentials
 - printing user information with logs

After this series, this user information is passed to lower layers, which also means that users are capable of attaching service levels to roles, and this service level configuration will be effective with alternator requests.

tests: manually by adding more debug logs and inspecting that per-service-level timeout value was properly applied for an authenticated alternator user

Fixes #11379

Closes #11380

* github.com:scylladb/scylladb:
  alternator: propagate authenticated user in client state
  client_state: add internal constructor with auth_service
  alternator: pass auth_service and sl_controller to server
2022-10-19 23:27:48 +03:00
Botond Dénes
2d581e9e8f Merge "Maintain dc/rack by topology" from Pavel Emelyanov
"
There's an ongoing effort to move the endpoint -> {dc/rack} mappings
from snitch onto topology object and this set finalizes it. After it the
snitch service stops depending on gossiper and system keyspace and is
ready for de-globalization. As a nice side-effect the system keyspace no
longer needs to maintain the dc/rack info cache and its starting code gets
relaxed.

refs: #2737
refs: #2795
"

* 'br-snitch-dont-mess-with-topology-data-2' of https://github.com/xemul/scylla: (23 commits)
  system_keyspace: Dont maintain dc/rack cache
  system_keyspace: Indentation fix after previous patch
  system_keyspace: Coroutinuze build_dc_rack_info()
  topology: Move all post-configuration to topology::config
  snitch: Start early
  gossiper: Do not export system keyspace
  snitch: Remove gossiper reference
  snitch: Mark get_datacenter/_rack methods const
  snitch: Drop some dead dependency knots
  snitch, code: Make get_datacenter() report local dc only
  snitch, code: Make get_rack() report local rack only
  storage_service: Populate pending endpoint in on_alive()
  code: Populate pending locations
  topology: Put local dc/rack on topology early
  topology: Add pending locations collection
  topology: Make get_location() errors more verbose
  token_metadata: Add config, spread everywhere
  token_metadata: Hide token_metadata_impl copy constructor
  gosspier: Remove messaging service getter
  snitch: Get local address to gossip via config
  ...
2022-10-19 06:50:21 +03:00
Pavel Emelyanov
2fa58632b3 main: Fix message about repair service starting
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-18 17:23:17 +03:00
Pavel Emelyanov
5c8a61ace2 tracing: Dismantle trace-backend registry
It's not used any longer

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-13 17:57:24 +03:00
Pavel Emelyanov
b6061bb97d topology: Move all post-configuration to topology::config
Because of snitch ex-dependencies some bits on topology were initialized
with nasty post-start calls. Now it all can be removed and the initial
topology information can be provided by topology::config

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:18:31 +03:00
Pavel Emelyanov
56d4863eb6 snitch: Start early
Snitch code doesn't need anything to start working, but it is needed by
the low-level token-metadata, so move the snitch to start early (and to
stop late)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:18:31 +03:00
Pavel Emelyanov
2bb354b2e7 snitch: Remove gossiper reference
It doesn't need gossiper any longer. This change will allow starting
snitch early by the next patch, and eventually improving the
token-metadata start-up sequence

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:17:08 +03:00
Pavel Emelyanov
4206b1f98f snitch, code: Make get_datacenter() report local dc only
The continuation of the previous patch -- all the code uses
topology::get_datacenter(endpoint) to get peers' dc string. The topology
still uses snitch for that, but it already contains the needed data.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:17:08 +03:00
Pavel Emelyanov
6c6711404f snitch, code: Make get_rack() report local rack only
All the code out there now calls snitch::get_rack() to get rack for the
local node. For other nodes the topology::get_rack(endpoint) is used.
Since now the topology is properly populated with endpoints, it can
finally be patched to stop using snitch and get rack from its internal
collections

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:17:08 +03:00
Pavel Emelyanov
b61bd6cf56 topology: Put local dc/rack on topology early
Startup code needs to know the dc/rack of the local node early, way
before the node starts any communication with the ring. This information is
available when snitch activates, but it starts _after_ token-metadata,
so the only way to put local dc/rack in topology is via a startup-time
special API call. This new init_local_endpoint() is temporary and will
be removed later in this set

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:17:08 +03:00
Pavel Emelyanov
d60ebc5ace token_metadata: Add config, spread everywhere
Next patches will need to provide some early-start data for topology.
The standard way of doing it is via service config, so this patch adds
one. The new config is empty in this patch, to be filled later

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:17:08 +03:00
Pavel Emelyanov
66bc84d217 snitch: Get local address to gossip via config
The property-file snitch gossips listen_address as internal-IP state. It
obtains this value through the snitch->gossiper->messaging_service
chain. This change provides the needed value via config thus cutting yet
another snitch->gossiper dependency and allowing gossiper not to export
messaging service in the future

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-11 05:17:08 +03:00
Botond Dénes
b247f29881 Merge 'De-static system_keyspace::get_{saved|local}_tokens()' from Pavel Emelyanov
Yet another user of global qctx object. Making the method(s) non-static requires pushing the system_keyspace all the way down to size_estimate_virtual_reader and a small update of the cql_test_env

Closes #11738

* github.com:scylladb/scylladb:
  system_keyspace: Make get_{local|saved}_tokens non static
  size_estimates_virtual_reader: Pass sys_ks argument to get_local_ranges()
  cql_test_env: Keep sharded<system_keyspace> reference
  size_estimate_virtual_reader: Keep system_keyspace reference
  system_keyspace: Pass sys_ks argument to install_virtual_readers()
  system_keyspace: Make make() non-static
  distributed_loader: Pass sys_ks argument to init_system_keyspace()
  system_keyspace: Remove dangling forward declaration
2022-10-07 11:28:32 +03:00
Pavel Emelyanov
9f79525f8e distributed_loader: Pass sys_ks argument to init_system_keyspace()
Its final destination is the virtual tables registration code called from
init_system_keyspace() eventually

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-06 17:55:03 +03:00
Pavel Emelyanov
8570fe3c30 raft_group0: Add system keyspace reference
The sharded<system_keyspace> is already started by the time raft_group0
is created

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-10-05 17:35:13 +03:00
Michał Chojnowski
cdb3e71045 sstables: add a flag for disabling long-term index caching
Long-term index caching in the global cache, as introduced in 4.6, is a major
pessimization for workloads where accesses to the index are (spatially) sparse.
We want to have a way to disable it for the affected workloads.

There is already infrastructure in place for disabling it for BYPASS CACHE
queries. One way of solving the issue is hijacking that infrastructure.

This patch adds a global flag (and a corresponding CLI option) which controls
index caching. Setting the flag to `false` causes all index reads to behave
like they would in BYPASS CACHE queries.

Consequences of this choice:

- The per-SSTable partition_index_cache is unused. Every index_reader has
  its own, and they die together. Independent reads can no longer reuse the
  work of other reads which hit the same index pages. This is not crucial,
  since partition accesses have no (natural) spatial locality. Note that
  the original reason for partition_index_cache -- the ability to share
  reads for the lower and upper bound of the query -- is unaffected.
- The per-SSTable cached_file is unused. Every index_reader has its own
  (uncached) input stream from the index file, and every
  bsearch_clustered_cursor has its own cached_file, which dies together with
  the cursor. Note that the cursor still can perform its binary search with
  caching. However, it won't be able to reuse the file pages read by
  index_reader. In particular, if the promoted index is small, and fits inside
  the same file page as its index_entry, that page will be re-read.
  It can also happen that index_reader will read the same index file page
  multiple times. When the summary is so dense that multiple index pages fit in
  one index file page, advancing the upper bound, which reads the next index
  page, will read the same index file page. Since summary:disk ratio is 1:2000,
  this is expected to happen for partitions with size greater than 2000
  partition keys.

Fixes #11202
2022-09-15 17:16:26 +03:00
Pavel Emelyanov
82162be1f1 messaging_service: Remove init/uninit helpers
These two are just getting in the way when touching inter-component
dependencies around messaging service. Without them m.-s. start/stop
just looks like any other service out there

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #11535
2022-09-15 11:54:46 +03:00
Botond Dénes
5374f0edbf Merge 'Task manager' from Aleksandra Martyniuk
Task manager for observing and managing long-running, asynchronous tasks in Scylla
with the interface for the user. It will allow listing of tasks, getting detailed
task status and progression, waiting for their completion, and aborting them.
The task manager will be configured with a “task ttl” that determines how long
the task status is kept in memory after the task completes.

At first it will support repair and compaction tasks, and possibly more in the future.

Currently:
Sharded `task_manager` is started in `main.cc` where it is further passed
to `http_context` for the purpose of user interface.

Task manager's tasks are implemented in two layers: the abstract
and the implementation one. The latter is a pure virtual class which needs
to be overridden by each module. Abstract layer provides the methods that
are shared by all modules and the access to module-specific methods.

Each module can access task manager, create and manage its tasks through
`task_manager::module` object. This way data specific to a module can be
separated from the other modules.

Users can access the task manager REST API to track asynchronous tasks.
The available options consist of:
- getting a list of modules
- getting a list of basic stats of all tasks in the requested module
- getting the detailed status of the requested task
- aborting the requested task
- waiting for the requested task to finish

To enable testing of the provided api, test specific task implementation and module
are provided. Their lifetime can be simulated with the standalone test api.
These components are compiled and the tests are run in all but release build modes.

Fixes: #9809

Closes #11216

* github.com:scylladb/scylladb:
  test: task manager api test
  task_manager: test api layer implementation
  task_manager: add test specific classes
  task_manager: test api layer
  task_manager: api layer implementation
  task_manager: api layer
  task_manager: keep task_manager reference in http_context
  start sharded task manager
  task_manager: create task manager object
2022-09-12 09:26:46 +03:00
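The operations listed in 5374f0edbf (list modules, list tasks of a module, get a task's status, abort a task) can be sketched with an in-memory registry. Everything here is hypothetical and written for illustration: `task_registry_sketch`, `task_status` and `task_state` are not Scylla's `task_manager` types, the sketch is not sharded, and the "task ttl" and waiting for completion are not modelled.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <stdexcept>
#include <string>
#include <vector>

enum class task_state { running, done, failed, aborted };

struct task_status {
    std::uint64_t id;
    std::string module;
    task_state state;
};

// Hypothetical in-memory registry of modules and their tasks.
class task_registry_sketch {
    std::set<std::string> _modules;
    std::map<std::uint64_t, task_status> _tasks;
    std::uint64_t _next_id = 1;

public:
    void register_module(const std::string& name) { _modules.insert(name); }

    // Creates a running task in a registered module and returns its ID.
    std::uint64_t make_task(const std::string& module) {
        if (_modules.count(module) == 0) {
            throw std::invalid_argument("unknown module");
        }
        const std::uint64_t id = _next_id++;
        const bool inserted =
            _tasks.emplace(id, task_status{id, module, task_state::running}).second;
        assert(inserted);  // IDs are never reused
        (void)inserted;
        return id;
    }

    std::vector<std::string> list_modules() const {
        return std::vector<std::string>(_modules.begin(), _modules.end());
    }

    std::vector<task_status> list_tasks(const std::string& module) const {
        std::vector<task_status> out;
        for (const auto& kv : _tasks) {
            if (kv.second.module == module) {
                out.push_back(kv.second);
            }
        }
        return out;
    }

    // Throws std::out_of_range for an unknown task ID.
    const task_status& status(std::uint64_t id) const { return _tasks.at(id); }

    void abort(std::uint64_t id) { set_final(id, task_state::aborted); }

    void finish(std::uint64_t id, bool success) {
        set_final(id, success ? task_state::done : task_state::failed);
    }

private:
    // Only a running task can move to a final state.
    void set_final(std::uint64_t id, task_state s) {
        task_status& t = _tasks.at(id);
        if (t.state == task_state::running) {
            t.state = s;
        }
    }
};
```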
Kamil Braun
dba595d347 Merge 'Minimal implementation of Broadcast Tables' from Mikołaj Grzebieluch
Broadcast tables are tables for which all statements are strongly
consistent (linearizable), replicated to every node in the cluster and
available as long as a majority of the cluster is available. If a user
wants to store a “small” volume of metadata that is not modified “too
often” but provides high resiliency against failures and strong
consistency of operations, they can use broadcast tables.

The main goal of the broadcast tables project is to solve problems which
need to be solved when we eventually implement general-purpose strongly
consistent tables: designing the data structure for the Raft command,
ensuring that the commands are idempotent, handling snapshots correctly,
and so on.

In this MVP (Minimum Viable Product), statements are limited to simple
SELECT and UPDATE operations on the built-in table. In the future, other
statements and data types will be available but with this PR we can
already work on features like idempotent commands or snapshotting.
Snapshotting is not handled yet which means that restarting a node or
performing too many operations (which would cause a snapshot to be
created) will give incorrect results.

In a follow-up, we plan to add end-to-end Jepsen tests
(https://jepsen.io/). With this PR we can already simulate operations on
lists and test linearizability in linear complexity. This can also test
Scylla's implementation of persistent storage, failure detector, RPC,
etc.

Design doc: https://docs.google.com/document/d/1m1IW320hXtsGulzSTSHXkfcBKaG5UlsxOpm6LN7vWOc/edit?usp=sharing

Closes #11164

* github.com:scylladb/scylladb:
  raft: broadcast_tables: add broadcast_kv_store test
  raft: broadcast_tables: add returning query result
  raft: broadcast_tables: add execution of intermediate language
  raft: broadcast_tables: add compilation of cql to intermediate language
  raft: broadcast_tables: add definition of intermediate language
  db: system_keyspace: add broadcast_kv_store table
  db: config: add BROADCAST_TABLES feature flag
2022-09-09 18:05:37 +02:00
Aleksandra Martyniuk
ec86410094 task_manager: test api layer implementation
The implementation of a test api that helps testing task manager
api. It provides methods to simulate the operations that can happen
on modules and their tasks. Through the api user can: register
and unregister the test module and the tasks belonging to the module,
and finish the tasks with success or custom error.
2022-09-09 14:29:28 +02:00
Aleksandra Martyniuk
c9637705a6 task_manager: api layer implementation
The implementation of a task manager api layer. It provides
methods to list the modules registered in task_manager, list
tasks belonging to the given module, abort, wait for or retrieve
a status of the given task.
2022-09-09 14:29:28 +02:00
Aleksandra Martyniuk
b87a0a74ab task_manager: keep task_manager reference in http_context
Keep a reference to sharded<task_manager> as a member
of http_context so it can be reached from rest api.
2022-09-09 14:29:28 +02:00
Aleksandra Martyniuk
9e68c8d445 start sharded task manager
Sharded task manager object is started in main.cc.
2022-09-09 14:29:28 +02:00
Mikołaj Grzebieluch
82df8a9905 raft: broadcast_tables: add compilation of cql to intermediate language
We decided to extend `cql_statement` hierarchy with `strongly_consistent_modification_statement`
and `strongly_consistent_select_statement`. Statements operating on
system.broadcast_kv_store will be compiled to these new subclasses if
BROADCAST_TABLES flag is enabled.

If the query is executed on a shard other than 0, it is bounced to shard 0.
2022-09-08 15:25:36 +02:00
Pavel Emelyanov
41973c5bf7 topology: Add "enable proximity sorting" bit
There's one corner case in nodes sorting by snitch. The simple snitch
code overloads the call and doesn't sort anything. The same behavior
should be preserved by (future) topology implementation, but it doesn't
know the snitch name. To address that the patch adds a boolean switch on
topology that's turned off by main code when it sees the snitch is
"simple" one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-05 15:15:07 +03:00
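The switch described in 41973c5bf7 can be sketched as a boolean on a topology-like object that turns sorting into a no-op. The class `topology_sketch`, the `node` struct and the three-level distance (same rack, same DC, other DC) are assumptions made for this example and are not taken from Scylla's `topology` class.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

struct node {
    std::string address;
    std::string dc;
    std::string rack;
};

// Hypothetical topology with an "enable proximity sorting" bit.
class topology_sketch {
    node _local;
    bool _sort_by_proximity = true;

public:
    explicit topology_sketch(node local) : _local(std::move(local)) {}

    // Called by startup code when the configured snitch is the "simple" one.
    void disable_proximity_sorting() { _sort_by_proximity = false; }

    // Lower means closer: 0 = same rack, 1 = same DC, 2 = other DC.
    int distance(const node& n) const {
        if (n.dc != _local.dc) {
            return 2;
        }
        return n.rack == _local.rack ? 0 : 1;
    }

    // Orders nodes from closest to farthest, or leaves them untouched
    // when proximity sorting is disabled.
    void sort_by_proximity(std::vector<node>& nodes) const {
        if (!_sort_by_proximity) {
            return;
        }
        auto closer = [this](const node& a, const node& b) {
            return distance(a) < distance(b);
        };
        std::stable_sort(nodes.begin(), nodes.end(), closer);
        assert(std::is_sorted(nodes.begin(), nodes.end(), closer));
    }
};
```

Keeping the decision in a flag means the topology code does not need to know the snitch's name; the caller that does know it flips the bit once at startup.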
Mikołaj Grzebieluch
5b1421cc33 db: config: add BROADCAST_TABLES feature flag
Add experimental flag 'broadcast-tables' for enabling BROADCAST_TABLES feature.
This feature requires raft group0, thus enabling it without RAFT will cause an error.
2022-09-05 11:11:08 +02:00
Piotr Sarna
9511c21686 alternator: pass auth_service and sl_controller to server
It's going to be needed to recreate a client state for an authenticated
user.
2022-09-05 10:03:00 +02:00
Pavel Emelyanov
e147681d85 messaging, topology: Keep shared_token_metadata* on messaging
Messaging will need to call topology methods to compare DC/RACK of peers
with local node. Topology now resides on token metadata, so messaging
needs to get the dependency reference.

However, messaging only needs the topology when it's up and running, so
instead of producing a life-time reference, add a pointer, that's set up
on .start_listen(), before any client pops up, and is cleared on
.shutdown() after all connections are dropped.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-09-01 11:32:34 +03:00
Kefu Chai
a5e696fab8 storage_service, test: drop unused storage_service_config
this setting was removed back in
dcdd207349, so despite that we are still
passing `storage_service_config` to the ctor of `storage_service`,
`storage_service::storage_service()` just drops it on the floor.

in this change, `storage_service_config` class is removed, and all
places referencing it are updated accordingly.

Signed-off-by: Kefu Chai <tchaikov@gmail.com>

Closes #11415
2022-08-31 19:49:13 +03:00
Kamil Braun
e350e37605 service/raft: raft_group0: implement upgrade procedure
A listener is created inside `raft_group0` for acting when the
SUPPORTS_RAFT feature is enabled. The listener is established after the
node enters NORMAL status (in `raft_group0::finish_setup_after_join()`,
called at the end of `storage_service::join_cluster()`).

The listener starts the `upgrade_to_group0` procedure.

The procedure, on a high level, works as follows:
- join group 0
- wait until every peer joined group 0 (peers are taken from
  `system.peers` table)
- enter `synchronize` upgrade state, in which group 0 operations are
  disabled (see earlier commit which implemented this logic)
- wait until all members of group 0 entered `synchronize` state or some
  member entered the final state
- synchronize schema by comparing versions and pulling if necessary
- enter the final state (`use_new_procedures`), in which group 0 is used
  for schema operations (only those for now).

The devil lies in the details, and the implementation is ugly compared
to this nice description; for example there are many retry loops for
handling intermittent network failures. Read the code.

`leave_group0` and `remove_group0` were adjusted to handle the upgrade
procedure being run correctly; if necessary, they will wait for the
procedure to finish.

If the upgrade procedure gets stuck (and it may, since it requires all
nodes to be available to contact them to correctly establish a single
group 0 raft cluster); or if a running cluster permanently loses a
majority of nodes, causing group 0 unavailability; the cluster admin
is not left without help.

We introduce a recovery mode, which allows the admin to
completely get rid of traces of existing group 0 and restart the
upgrade procedure - which will establish a new group 0. This works even
in clusters that never upgraded but were bootstrapped using group 0 from
scratch.

To do that, the admin does the following on every node:
- writes 'recovery' under 'group0_upgrade_state' key
  in `system.scylla_local` table,
- truncates the `system.discovery` table,
- truncates the `system.group0_history` table,
- deletes group 0 ID and group 0 server ID from `system.scylla_local`
  (the keys are `raft_group0_id` and `raft_server_id`);
then the admin performs a rolling restart of their cluster. The nodes
restart in a "group 0 recovery mode", which simply means that the nodes
won't try to perform any group 0 operations. Then the admin calls
`removenode` to remove the nodes that are down. Finally, the admin
removes the `group0_upgrade_state` key from `system.scylla_local`,
rolling-restarts the cluster, and the cluster should establish group 0
anew.

Note that this recovery procedure will have to be extended when new
stuff is added to group 0 - like topology change state. Indeed, observe
that a minority of nodes aren't able to receive committed entries from a
leader, so they may end up in inconsistent group 0 states. It wouldn't
be safe to simply create group 0 on those nodes without first ensuring
that they have the same state from which group 0 will start.
Right now the state only consists of schema tables, and the upgrade
procedure makes sure to synchronize them, so even if the nodes started in
inconsistent schema states, group 0 will correctly be established.
(TODO: create a tracking issue? something needs to remind us of this
 whenever we extend group 0 with new stuff...)
2022-08-23 13:51:01 +02:00
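The exit condition of the `synchronize` step in e350e37605 ("all members of group 0 entered `synchronize` state or some member entered the final state") can be written as a small predicate. This is a sketch: the enum mirrors the state names used in the message, while the function names `uses_group0` and `can_leave_synchronize` are hypothetical and the real procedure also involves retries and RPCs that are not shown.

```cpp
#include <cassert>
#include <vector>

// State names follow the commit message; the enum itself is illustrative.
enum class group0_upgrade_state {
    recovery,
    use_pre_raft_procedures,
    synchronize,
    use_new_procedures,
};

// Group 0 is used for schema operations only in the final state.
inline bool uses_group0(group0_upgrade_state s) {
    return s == group0_upgrade_state::use_new_procedures;
}

// A node waiting in `synchronize` may proceed when every group 0 member
// is in `synchronize`, or when some member already reached the final state.
inline bool can_leave_synchronize(const std::vector<group0_upgrade_state>& members) {
    assert(!members.empty());  // the local node is always a member
    bool all_synchronized = true;
    for (auto s : members) {
        if (s == group0_upgrade_state::use_new_procedures) {
            return true;
        }
        if (s != group0_upgrade_state::synchronize) {
            all_synchronized = false;
        }
    }
    return all_synchronized;
}
```

The second clause matters for nodes that were slow: once any member has reached the final state, the schema was already synchronized, so a straggler does not need to wait for the others again.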
Kamil Braun
43687be1f1 service/raft: raft_group0_client: prepare for upgrade procedure
Now, whether a 'group 0 operation' (today it means schema change) is
performed using the old or new methods doesn't depend on the local RAFT
feature being enabled, but on the state of the upgrade procedure.

In this commit the state of the upgrade is always
`use_pre_raft_procedures` because the upgrade procedure is not
implemented yet. But stay tuned.

The upgrade procedure will need certain guarantees: at some point it
switches from `use_pre_raft_procedures` to `synchronize` state. During
`synchronize` schema changes must be disabled, so the procedure can
ensure that schema is in sync across the entire cluster before
establishing group 0. Thus, when the switch happens, no schema change
can be in progress.

To handle all this weirdness we introduce `_upgrade_lock` and
`get_group0_upgrade_state` which takes this lock whenever it returns
`use_pre_raft_procedures`. Creating a `group0_guard` - which happens at
the start of every group 0 operation - will take this lock, and the lock
holder shall be stored inside the guard (note: the holder only holds the
lock if `use_pre_raft_procedures` was returned, no need to hold it for
other cases). Because `group0_guard` is held for the entire duration of
a group 0 operation, and because the upgrade procedure will also have to
take this lock whenever it wants to change the upgrade state (it's an
rwlock), this ensures that no group 0 operation that uses the old ways
is happening when we change the state.

We also implement `wait_until_group0_upgraded` using a condition
variable. It will be used by certain methods during upgrade (later
commits; stay tuned).

Some additional comments were written.
2022-08-19 19:15:19 +02:00
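The locking scheme from 43687be1f1 can be sketched with a toy read/write lock: an operation that starts in `use_pre_raft_procedures` keeps a read lock inside its guard, and a state change needs the write lock. All types below are hypothetical stand-ins. The lock is a non-blocking, single-threaded model (a cooperative lock would wait instead of failing), and the sketch does not model operations being rejected while in `synchronize`.

```cpp
#include <cassert>

enum class upgrade_state { use_pre_raft_procedures, synchronize, use_new_procedures };

// Toy non-blocking read/write lock: many readers or one writer.
class toy_rwlock {
    int _readers = 0;
    bool _writer = false;

public:
    bool try_read_lock() {
        if (_writer) return false;
        ++_readers;
        return true;
    }
    void read_unlock() {
        assert(_readers > 0);
        --_readers;
    }
    bool try_write_lock() {
        if (_writer || _readers > 0) return false;
        _writer = true;
        return true;
    }
    void write_unlock() {
        assert(_writer);
        _writer = false;
    }
};

// Stand-in for group0_guard: releases the read lock, if any, on destruction.
class group0_guard_sketch {
    toy_rwlock* _lock;  // non-null only while the read lock is held

public:
    explicit group0_guard_sketch(toy_rwlock* held) : _lock(held) {}
    group0_guard_sketch(group0_guard_sketch&& o) noexcept : _lock(o._lock) { o._lock = nullptr; }
    group0_guard_sketch(const group0_guard_sketch&) = delete;
    group0_guard_sketch& operator=(const group0_guard_sketch&) = delete;
    ~group0_guard_sketch() {
        if (_lock) _lock->read_unlock();
    }
    bool holds_upgrade_lock() const { return _lock != nullptr; }
};

class group0_client_sketch {
    toy_rwlock _upgrade_lock;
    upgrade_state _state = upgrade_state::use_pre_raft_procedures;

public:
    // The guard holds the lock only when the old procedures are in use.
    group0_guard_sketch start_operation() {
        if (_state == upgrade_state::use_pre_raft_procedures && _upgrade_lock.try_read_lock()) {
            return group0_guard_sketch(&_upgrade_lock);
        }
        return group0_guard_sketch(nullptr);
    }

    // Changing the state needs the write lock, so it fails while an
    // old-style operation is still in progress.
    bool try_set_state(upgrade_state s) {
        if (!_upgrade_lock.try_write_lock()) return false;
        _state = s;
        _upgrade_lock.write_unlock();
        return true;
    }

    upgrade_state state() const { return _state; }
};
```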
Calle Wilund
a729c2438e commitlog: Make get_segments_to_replay on-demand
Refs #11237

Don't store segments found on init scan in all shard instances,
instead retrieve (based on low time-pos for current gen) when
required. This changes very little, but we at least don't store
pointless string lists in shards 1 to X, and also we can potentially
ask for the list twice. More to the point, goes better hand-in-hand
with the semantics of "delete_segments", where any file sent in is
considered candidate for recycling, and included in footprint.
2022-08-11 06:41:23 +00:00
Takuya ASADA
3ffc978166 main: move preinit_description to main()
We don't need to wait until scylla_main() is called to handle the
version options; we can handle them in main() instead.

Closes #11221
2022-08-08 18:31:43 +03:00
Pavel Emelyanov
527b345079 Merge 'storage_proxy: introduce a remote "subservice"' from Kamil Braun
Introduce a `remote` class that handles all remote communication in `storage_proxy`: sending and receiving RPCs, checking the state of other nodes by accessing the gossiper, and fetching schema.

The `remote` object lives inside `storage_proxy` and right now it's initialized and destroyed together with `storage_proxy`.

The long game here is to split the initialization of `storage_proxy` into two steps:
- the first step, which constructs `storage_proxy`, initializes it "locally" and does not require references to `messaging_service` and `gossiper`.
- the second step will take those references and add the `remote` part to `storage_proxy`.

This will allow us to remove some cycles from the service (de)initialization order and in general clean it up a bit. We'll be able to start `storage_proxy` right after the `database` (without messaging/gossiper). Similar refactors are planned for `query_processor`.

Closes #11088

* github.com:scylladb/scylladb:
  service: storage_proxy: pass `migration_manager*` to `init_messaging_service`
  service: storage_proxy: `remote`: make `_gossiper` a const reference
  gms: gossiper: mark some member functions const
  db: consistency_level: `filter_for_query`: take `const gossiper&`
  replica: table: `get_hit_rate`: take `const gossiper&`
  gms: gossiper: move `endpoint_filter` to `storage_proxy` module
  service: storage_proxy: pass `shared_ptr<gossiper>` to `start_hints_manager`
  service: storage_proxy: establish private section in `remote`
  service: storage_proxy: remove `migration_manager` pointer
  service: storage_proxy: remove calls to `storage_proxy::remote()` from `remote`
  service: storage_proxy: remove `_gossiper` field
  alternator: ttl: pass `gossiper&` to `expiration_service`
  service: storage_proxy: move `truncate_blocking` implementation to `remote`
  service: storage_proxy: introduce `is_alive` helper
  service: storage_proxy: remove `_messaging` reference
  service: storage_proxy: move `connection_dropped` to `remote`
  service: storage_proxy: make `encode_replica_exception_for_rpc` a static function
  service: storage_proxy: move `handle_write` to `remote`
  service: storage_proxy: move `handle_paxos_prune` to `remote`
  service: storage_proxy: move `handle_paxos_accept` to `remote`
  service: storage_proxy: move `handle_paxos_prepare` to `remote`
  service: storage_proxy: move `handle_truncate` to `remote`
  service: storage_proxy: move `handle_read_digest` to `remote`
  service: storage_proxy: move `handle_read_mutation_data` to `remote`
  service: storage_proxy: move `handle_read_data` to `remote`
  service: storage_proxy: move `handle_mutation_failed` to `remote`
  service: storage_proxy: move `handle_mutation_done` to `remote`
  service: storage_proxy: move `handle_paxos_learn` to `remote`
  service: storage_proxy: move `receive_mutation_handler` to `remote`
  service: storage_proxy: move `handle_counter_mutation` to `remote`
  service: storage_proxy: remove `get_local_shared_storage_proxy`
  service: storage_proxy: (de)register RPC handlers in `remote`
  service: storage_proxy: introduce `remote`
2022-08-04 17:50:20 +03:00
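The two-step initialization planned in 527b345079 can be sketched as an object that is usable locally right after construction and gains its remote part later. The names `proxy_sketch`, `messaging_stub`, `start_remote` and `stop_remote` are hypothetical and chosen for this example; they are not `storage_proxy`'s actual interface.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>

// Stand-in for a messaging dependency; it only counts sent messages.
struct messaging_stub {
    int sent = 0;
};

class proxy_sketch {
    // Everything that talks to other nodes lives in this nested class.
    class remote {
        messaging_stub& _ms;

    public:
        explicit remote(messaging_stub& ms) : _ms(ms) {}
        void send(const std::string&) { ++_ms.sent; }
    };

    std::unique_ptr<remote> _remote;  // empty until step 2

public:
    // Step 1: construct without any messaging/gossiper references.
    proxy_sketch() = default;

    // Step 2: attach the remote part once its dependencies exist.
    void start_remote(messaging_stub& ms) {
        assert(!_remote);
        _remote = std::make_unique<remote>(ms);
    }

    // Detach the remote part before its dependencies are stopped.
    void stop_remote() { _remote.reset(); }

    bool has_remote() const { return static_cast<bool>(_remote); }

    // Local operations work in both phases.
    std::string read_local(const std::string& key) const { return "local:" + key; }

    // Remote operations require step 2 to have happened.
    void write_remote(const std::string& key) {
        if (!_remote) {
            throw std::runtime_error("remote part not initialized");
        }
        _remote->send(key);
    }
};
```

Splitting the object this way lets the local part start early and stop late, while the remote part's lifetime is nested inside the lifetime of the services it depends on.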
Kamil Braun
0a4e701b50 service: storage_proxy: pass migration_manager* to init_messaging_service
`migration_manager` lifetime is longer than the lifetime of "storage
proxy's messaging service part" - that is, `init_messaging_service` is
called after `migration_manager` is started, and `uninit_messaging_service`
is called before `migration_manager` is stopped. Thus we don't need to
hold an owning pointer to `migration_manager` here.

Later, when `init_messaging_service` will actually construct `remote`,
this will be a reference, not a pointer.

Also observe that `_mm` in `remote` is only used in handlers, and
handlers are unregistered before `_mm` is nullified, which ensures that
handlers are not running when `_mm` is nullified. (This argument shows
why the code made sense regardless of our switch from shared_ptr to raw
ptr).
2022-08-04 12:19:43 +02:00
Kamil Braun
078900042f service: storage_proxy: pass shared_ptr<gossiper> to start_hints_manager
No need to call `_remote->gossiper().shared_from_this()` from within
storage_proxy now.
2022-08-04 12:16:09 +02:00
Kamil Braun
ab946e392f alternator: ttl: pass gossiper& to expiration_service
This allows us to remove the `gossiper()` getter from `storage_proxy`.
2022-08-04 12:12:43 +02:00
Takuya ASADA
d7dfd0a696 main: run --version before app_template initialize
Even in an environment which causes an error while initializing Scylla,
"scylla --version" should be able to run without error.
To do so, we need to parse and execute these options before
initializing Scylla/Seastar classes.

Fixes #11117

Closes #11179
2022-08-03 11:25:28 +03:00
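The ordering that d7dfd0a696 asks for (handle `--version` before any heavy initialization) can be sketched as follows. The helpers `wants_version` and `run` are hypothetical, the printed version string is a placeholder, and `heavy_init` stands for whatever constructs the application framework; none of this is Scylla's actual `main()`.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>

// Returns true if "--version" appears among the command-line arguments.
inline bool wants_version(int argc, const char* const* argv) {
    assert(argv != nullptr);
    for (int i = 1; i < argc; ++i) {
        if (std::strcmp(argv[i], "--version") == 0) {
            return true;
        }
    }
    return false;
}

// Returns the process exit code. `heavy_init` is only invoked when the
// version was not requested, so a failing initialization cannot break
// "--version".
template <typename HeavyInit>
int run(int argc, const char* const* argv, HeavyInit heavy_init) {
    if (wants_version(argc, argv)) {
        std::puts("0.0.0-sketch");  // placeholder version string
        return 0;
    }
    return heavy_init();
}
```

A real `main(int argc, char** argv)` would forward its arguments to `run` and pass a callable that performs the framework setup.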