Commit Graph

3849 Commits

Gleb Natapov
f80fff3484 gossip: remove unused STATUS_LEAVING gossiper status
The status is no longer used. The function that referenced it was
removed by 5a96751534, and it had already been unused
for a while before that.

Message-Id: <ZS92mcGE9Ke5DfXB@scylladb.com>
2023-10-18 11:13:14 +02:00
Tomasz Grabiec
0aef0f900b Merge 'truncation records refactorings' from Petr Gusev
This PR contains several refactorings related to truncation-record handling in the `system_keyspace`, `commitlog_replayer` and `table` classes:
* drop map_reduce from `commitlog_replayer`; it's sufficient to load truncation records from the null shard;
* add a check that `table::_truncated_at` is properly initialized before it's accessed;
* move its initialization after `init_non_system_keyspaces`
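The initialization check on `table::_truncated_at` can be sketched as follows. This is a toy Python model of the idea, not the actual ScyllaDB C++ code; all names besides `_truncated_at` are illustrative.

```python
from typing import Optional

class Table:
    """Toy model of the truncation-time bookkeeping; the real code is
    ScyllaDB C++, so this is a sketch of the guard, not the real API."""

    def __init__(self) -> None:
        # Not yet loaded; reading it in this state is a bug.
        self._truncated_at: Optional[int] = None

    def set_truncation_time(self, ts: int) -> None:
        self._truncated_at = ts

    def get_truncation_time(self) -> int:
        # The refactoring adds exactly this kind of check: fail loudly
        # if the value is accessed before it has been initialized.
        if self._truncated_at is None:
            raise RuntimeError("_truncated_at accessed before initialization")
        return self._truncated_at
```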

Closes scylladb/scylladb#15583

* github.com:scylladb/scylladb:
  system_keyspace: drop truncation_record
  system_keyspace: remove get_truncated_at method
  table: get_truncation_time: check _truncated_at is initialized
  database: add_column_family: initialize truncation_time for new tables
  database: add_column_family: rename readonly parameter to is_new
  system_keyspace: move load_truncation_times into distributed_loader::populate_keyspace
  commitlog_replayer: refactor commitlog_replayer::impl::init
  system_keyspace: drop redundant typedef
  system_keyspace: drop redundant save_truncation_record overload
  table: rename cache_truncation_record -> set_truncation_time
  system_keyspace: get_truncated_position -> get_truncated_positions
2023-10-17 10:55:30 +02:00
Avi Kivity
35849fc901 Revert "Merge 'Don't calculate hashes for schema versions in Raft mode' from Kamil Braun"
This reverts commit 3d4398d1b2, reversing
changes made to 45dfce6632. The commit
causes some schema changes to be lost due to incorrect timestamps
in some mutations. More information is available in [1].

Reopens: scylladb/scylladb#7620
Reopens: scylladb/scylladb#13957

Fixes scylladb/scylladb#15530.

[1] https://github.com/scylladb/scylladb/pull/15687
2023-10-11 00:32:05 +03:00
Avi Kivity
765e193122 Merge 'db/hints: Modernize manager' from Dawid Mędrek
This PR is another step in refactoring the Hinted Handoff module. It aims at modernizing the code by moving to coroutines, using `std::ranges` instead of Boost's ranges where possible, and adopting other features from the newer C++ standards.

It also tries to make the code clearer and to get rid of confusing elements, e.g. shared pointers used where they shouldn't be, or methods marked virtual even though nothing derives from the class. It also stops `manager.hh` from giving direct access to internal structures (`hint_endpoint_manager` in this case).

Refs #15358

Closes scylladb/scylladb#15631

* github.com:scylladb/scylladb:
  db/hints/manager: Reword comments about state
  db/hints/manager: Unfriend space_watchdog
  db/hints: Remove a redundant alias
  db/hints: Remove an unused namespace
  db/hints: Coroutinize change_host_filter()
  db/hints: Coroutinize drain_for()
  db/hints: Clean up can_hint_for()
  db/hints: Clean up store_hint()
  db/hints: Clean up too_many_in_flight_hints_for()
  db/hints: Refactor get_ep_manager()
  db/hints: Coroutinize wait_for_sync_point()
  db/hints: Use std::span in calculate_current_sync_point
  db/hints: Clean up manager::forbid_hints_for_eps_with_pending_hints()
  db/hints: Clean up manager::forbid_hints()
  db/hints: Clean up manager::allow_hints()
  db/hints: Coroutinize compute_hints_dir_device_id()
  db/hints: Clean up manager::stop()
  db/hints: Clean up manager::start()
  db/hints/manager: Clean up the constructor
  db/hints: Remove boilerplate drain_lock()
  db/hints: Let drain_for() return a future
  db/hints: Remove ep_managers_end
  db/hints: Remove find_ep_manager
  db/hints: Use manager as API for hint_endpoint_manager
  db/hints: Don't mark have_ep_manager()'s definition as inline
  db/hints: Remove make_directory_initializer()
  db/hints/manager: Order constructors
  db/hints: Move ~manager() and mark it as noexcept
  db/hints: Use reference for storage proxy
  db/hints/manager: Explicitly delete copy constructor
  db/hints: Capitalize constants
  db/hints/manager: Hide declarations
  db/hints/manager: Move the definitions of static members to the header
  db/hints: Move make_dummy() to the header
  db/hints: Don't explicitly define ~directory_initializer()
  db/hints: Change the order of logging in ensure_created_and_verified()
  db/hints: Coroutinize ensure_rebalanced()
  db/hints: Coroutinize ensure_created_and_verified()
  db/hints: Improve formatting of directory_initializer::impl
  db/hints: Do not rely on the values of enums
  db/hints: Move the implementation of directory_initializer
  db/hints: Prefer nested namespaces
  db/hints: Remove an unused alias from manager.hh
  db/hints: Reorder includes in manager.hh and .cc
2023-10-06 17:20:33 +03:00
Dawid Medrek
f1f35ba819 db/hints: Let drain_for() return a future
Currently, the function doesn't return anything.
However, if the future doesn't need to be awaited,
the caller can decide that for itself. There is no
reason to make that decision inside the function.
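The idea, return the future and let the caller decide, can be sketched in Python with `concurrent.futures` (the executor and the `drain_for` shape here are illustrative; the real code returns a Seastar future from C++):

```python
from concurrent.futures import Future, ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=1)

def drain_for(endpoint: str) -> Future:
    # Return the future instead of swallowing it; whether to await it
    # is now the caller's decision, not this function's.
    return _executor.submit(lambda: f"drained hints for {endpoint}")

# A caller that needs completion awaits the result...
synchronous_result = drain_for("10.0.0.1").result()
# ...while a fire-and-forget caller may simply drop the future.
drain_for("10.0.0.2")
```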
2023-10-06 12:18:25 +02:00
Dawid Medrek
18a2831186 db/hints: Use reference for storage proxy
This commit makes db::hints::manager store service::storage_proxy
as a reference instead of a seastar::shared_ptr. The manager is
owned by storage proxy, so it only lives as long as storage proxy
does. Hence, it makes little sense to store the latter as a shared
pointer; in fact, it's very confusing and may be error-prone.
The field never changes, so it's safe to keep it as a reference
(especially because copy and move constructors of db::hints::manager
are both deleted). What's more, we ensure that the hint manager
has access to storage proxy as soon as it's created.

The same changes were applied to db::hints::resource_manager.
The rationale is the same.
2023-10-06 11:54:15 +02:00
Botond Dénes
8c03eeb85d Merge 'Sanitize hints API handlers and remove proxy from http context' from Pavel Emelyanov
This is the continuation of 3e74432dbf.

Registering API handlers for a service needs to:
 - happen next to the corresponding service's start;
 - use only the provided service, not any other ones (if needed, the handler's service can use its internal dependencies to do its job);
 - get the service to handle requests via an argument, not from the http context (the http context, in turn, is going _not_ to depend on anything).

Hints API handlers want to use the proxy, but they also reference the gossiper and capture the proxy via the http context. This PR fixes both and removes the http_context -> proxy dependency, which is no longer needed.
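A minimal sketch of that rule in plain Python (`HintsService` and the handler shape are purely illustrative, not the real API):

```python
class HintsService:
    """Stand-in for the one service a group of handlers belongs to."""
    def __init__(self) -> None:
        self.calls = 0

    def create_sync_point(self) -> str:
        self.calls += 1
        return f"sync-point-{self.calls}"

def make_sync_point_handler(service: HintsService):
    # The handler receives its service via an argument and uses only
    # that service - no global http context, no unrelated dependencies.
    def handler(request: dict) -> str:
        return service.create_sync_point()
    return handler
```

Anything else the operation needs (the gossiper, in the hints case) is reached through the service's own internal dependencies rather than handed to the handler directly.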

Closes scylladb/scylladb#15644

* github.com:scylladb/scylladb:
  api: Remove proxy reference from http context
  api,hints: Use proxy instead of ctx
  api,hints: Pass sharded<proxy>& instead of gossiper&
  api,hints: Fix indentation after previous patch
  api,hints: Move gossiper access to proxy
2023-10-06 11:04:27 +03:00
Avi Kivity
854188a486 Merge 'database, storage_proxy: Reconcile pages with dead rows and partitions incrementally' from Botond Dénes
Currently, a mutation query on the replica side will not respond with a result which doesn't have at least one live row. This causes problems if there are a lot of dead rows or partitions before we reach a live row; they stem from the fact that the resulting reconcilable_result will be large:

1. Large allocations. Serialization of reconcilable_result causes large allocations for storing result rows in std::deque.
2. Reactor stalls. Serialization of reconcilable_result on the replica side and on the coordinator side causes reactor stalls. This impacts not only the query at hand. For 1M dead rows, freezing takes 130ms, unfreezing takes 500ms. The coordinator does multiple freezes and unfreezes. The reactor stall on the coordinator side is >5s.
3. Too-large repair mutations. If reconciliation works on large pages, repair may fail due to too large a mutation size. 1M dead rows is already too much: Refs https://github.com/scylladb/scylladb/issues/9111.

This patch fixes all of the above by making mutation reads respect the memory accounter's limit for the page size, even for dead rows.

This patch also addresses the problem of client-side timeouts during paging. Reconciling queries processing long strings of tombstones will now properly page tombstones, like regular queries do.

My testing shows that this solution even increases efficiency. I tested with a cluster of 2 nodes, and a table of RF=2. The data layout was as follows (1 partition):
* Node1: 1 live row, 1M dead rows
* Node2: 1M dead rows, 1 live row

This was designed to trigger reconciliation right from the very start of the query.

Before:
```
Running query (node2, CL=ONE, cold cache)
Query done, duration: 140.0633503ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (node2, CL=ONE, hot cache)
Query done, duration: 66.7195275ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (all-nodes, CL=ALL, reconcile, cold-cache)
Query done, duration: 873.5400742ms, pages: 2, result: [Row(pk=0, ck=0, v=0), Row(pk=0, ck=3000000, v=0)]
```

After:
```
Running query (node2, CL=ONE, cold cache)
Query done, duration: 136.9035122ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (node2, CL=ONE, hot cache)
Query done, duration: 69.5286021ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (all-nodes, CL=ALL, reconcile, cold-cache)
Query done, duration: 162.6239498ms, pages: 100, result: [Row(pk=0, ck=0, v=0), Row(pk=0, ck=3000000, v=0)]
```

Non-reconciling queries have almost identical duration (a few ms of variation can be observed between runs). Note how in the after case, the reconciling read also produces 100 pages, vs. just 2 pages in the before case, leading to a much lower duration (less than 1/4 of the before).

Refs https://github.com/scylladb/scylladb/issues/7929
Refs https://github.com/scylladb/scylladb/issues/3672
Refs https://github.com/scylladb/scylladb/issues/7933
Fixes https://github.com/scylladb/scylladb/issues/9111

Closes scylladb/scylladb#15414

* github.com:scylladb/scylladb:
  test/topology_custom: add test_read_repair.py
  replica/mutation_dump: detect end-of-page in range-scans
  tools/scylla-sstable: write: abort parser thread if writing fails
  test/pylib: add REST methods to get node exe and workdir paths
  test/pylib/rest_client: add load_new_sstables, keyspace_{flush,compaction}
  service/storage_proxy: add trace points for the actual read executor type
  service/storage_proxy: add trace points for read-repair
  storage_proxy: Add more trace-level logging to read-repair
  database: Fix accounting of small partitions in mutation query
  database, storage_proxy: Reconcile pages with no live rows incrementally
2023-10-05 22:39:34 +03:00
Pavel Emelyanov
967faa97e4 proxy: Coroutinize start_hints_manager()
All the other calls managing hints are coroutinized

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#15641
2023-10-05 16:16:27 +02:00
Pavel Emelyanov
53891dd9cc api,hints: Move gossiper access to proxy
API handlers should try to avoid using any service other than the "main"
one. For hints API this service is going to be proxy, so no gossiper
access in the handler itself.

(indentation is left broken)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-05 16:14:26 +03:00
Piotr Dulikowski
4340e46c66 storage_service: increase timeout during join procedure to 3 minutes
When joining the cluster in raft topology mode, the new node asks some
existing node in the cluster to put its information to the
`system.topology` table. Later, the topology coordinator is supposed to
contact the joining node back, telling it that it was added to group 0
and accepted, or rejected. Due to the fact that the topology coordinator
might not manage to successfully contact the joining node, in order not
to get stuck it might decide to give up and move the node to left state
and forget about it (this not always happens as of now, but will in the
future). Because of that, the joining node must use a timeout when
waiting for a response because it's not guaranteed that it will ever
receive it.

There is an additional complication: the topology coordinator might be
busy and not notice the request to join for a long time. For example, it
might be migrating tablets or joining other nodes which are in the queue
before it. Therefore, it's difficult to choose a timeout which is long
enough for every case and still not too long.

Such a failure was observed to happen in ARM tests in debug mode. In
order to unblock the CI the timeout is increased from 30 seconds to 3
minutes. As a proper solution, the procedure will most likely have to be
adjusted in a more significant way.
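The bounded wait can be sketched like this (Python `threading`; the constant and function names are illustrative, as the real code is C++ in storage_service):

```python
import threading

JOIN_RESPONSE_TIMEOUT = 180.0  # raised from 30 s to 3 minutes by this commit

def wait_for_join_response(response_received: threading.Event,
                           timeout: float = JOIN_RESPONSE_TIMEOUT) -> bool:
    # The coordinator may have moved the node to `left` and forgotten it,
    # so the joining node must not wait forever. True means a response
    # arrived in time; False means the join attempt should be aborted.
    return response_received.wait(timeout)
```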

Fixes: #15600

Closes scylladb/scylladb#15618
2023-10-05 10:29:03 +02:00
Petr Gusev
da1e6751e9 table: rename cache_truncation_record -> set_truncation_time
This is a refactoring commit without observable
changes in behaviour.

There is a truncation_record struct, but in this method we
only care about time, so rename it (and other related methods)
appropriately to avoid confusion.
2023-10-03 17:11:35 +04:00
Michael Huang
1640f83fdc raft: Store snapshot update and truncate log atomically
In case the snapshot update fails, we don't truncate commit log.

Fixes scylladb/scylladb#9603

Closes scylladb/scylladb#15540
2023-09-29 17:57:49 +02:00
Botond Dénes
5d8384eff0 Merge 'Fix test_fencing.py::test_fence_hints flakiness' from Kamil Braun
Add a REST API to reload the Raft topology state without having to restart a node, and use it in `test_fence_hints`. Restarting the node has undesired side effects which cause test flakiness; more details are provided in the commit messages.

Refactor the test a bit while at it.

Fixes: #15285

Closes scylladb/scylladb#15523

* github.com:scylladb/scylladb:
  test: test_fencing.py: enable hints_manager=trace logs in `test_fence_hints`
  test: test_fencing.py: reload topology through REST API in `test_fence_hints`
  test: refactor test_fencing.py
  api: storage_service: add REST API to reload topology state
2023-09-28 16:30:23 +03:00
Kamil Braun
992f1327d3 api: storage_service: add REST API to reload topology state
Some tests may want to modify the system.topology table directly. Add a REST
API to reload the state into memory. An alternative would be restarting
the server, but that's slower and may have other side effects undesired
in the test.

The API can also be called outside tests; it should not have any
observable effects unless the user modifies `system.topology` table
directly (which they should never do, outside perhaps some disaster
recovery scenarios).
2023-09-28 11:59:16 +02:00
Kamil Braun
060f2de14e Merge 'Cluster features on raft: new procedure for joining group 0' from Piotr Dulikowski
This PR implements a new procedure for joining nodes to group 0, based on the description in the "Cluster features on Raft (v2)" document. This is a continuation of the previous PRs related to cluster features on raft (https://github.com/scylladb/scylladb/pull/14722, https://github.com/scylladb/scylladb/pull/14232), and the last piece necessary to replace cluster feature checks in gossip.

The current implementation relies on a gossip shadow round to fetch the set of enabled features, determine whether the node supports all of the enabled features, and join only if it is safe. As we are moving the management of cluster features to group 0, we encounter a problem: the contents of group 0 itself may depend on features, hence it is not safe to join it unless we perform the feature check, which in turn depends on information in group 0. Hence, we have a dependency cycle.

In order to solve this problem, the algorithm for joining group 0 is modified, and verification of features and other parameters is offloaded to an existing node in group 0. Instead of directly asking the discovery leader to unconditionally add the node to the configuration with `GROUP0_MODIFY_CONFIG`, two different RPCs are added: `JOIN_NODE_REQUEST` and `JOIN_NODE_RESPONSE`. The main idea is as follows:

- The new node sends `JOIN_NODE_REQUEST` to the discovery leader. It sends a bunch of information describing the node, including supported cluster features. The discovery leader verifies some of the parameters and adds the node in the `none` state to `system.topology`.
- The topology coordinator picks up the request for the node to be joined (i.e. the node in `none` state), verifies its properties - including cluster features - and then:
	- If the node is accepted, the coordinator transitions it to the `bootstrap`/`replace` state and transitions the topology to the `join_group0` state. The node is added to group 0 and then `JOIN_NODE_RESPONSE` is sent to it with information that it was accepted.
	- Otherwise, the node is moved to the `left` state, is told by the coordinator via `JOIN_NODE_RESPONSE` that it was rejected, and shuts down.

The procedure is not retryable - if a node fails to complete it from start to end and crashes in between, it will not be allowed to retry with the same host_id - `JOIN_NODE_REQUEST` will fail. The data directory must be cleared before attempting to add the node again (so that a new host_id is generated).

More details about the procedure and the RPC are described in `topology-over-raft.md`.
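The accept/reject step can be modeled as a tiny state machine. This Python sketch follows the state names in the description above; the `coordinator_decide` function itself is hypothetical:

```python
from enum import Enum, auto

class NodeState(Enum):
    NONE = auto()       # join request recorded in system.topology
    BOOTSTRAP = auto()  # accepted as a new node
    REPLACE = auto()    # accepted as a replacement node
    LEFT = auto()       # rejected; data directory must be wiped to retry

def coordinator_decide(state: NodeState, features_ok: bool,
                       replacing: bool) -> NodeState:
    # Only nodes in `none` state are awaiting a decision.
    if state is not NodeState.NONE:
        raise ValueError("node is not awaiting a join decision")
    if not features_ok:
        # JOIN_NODE_RESPONSE will tell the node it was rejected.
        return NodeState.LEFT
    # JOIN_NODE_RESPONSE will tell the node it was accepted, and the
    # topology itself moves through the new join_group0 state.
    return NodeState.REPLACE if replacing else NodeState.BOOTSTRAP
```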

Fixes: #15152

Closes scylladb/scylladb#15196

* github.com:scylladb/scylladb:
  tests: mark test_blocked_bootstrap as skipped
  storage_service: do not check features in shadow round
  storage_service: remove raft_{boostrap,replace}
  topology_coordinator: relax the check in enable_features
  raft_group0: insert replaced node info before server setup
  storage_service: use join node rpc to join the cluster
  topology_coordinator: handle joining nodes
  topology_state_machine: add join_group0 state
  storage_service: add join node RPC handlers
  raft: expose current_leader in raft::server
  storage_service: extract wait_for_live_nodes_timeout constant
  raft_group0: abstract out node joining handshake
  storage_service: pass raft_topology_change_enabled on rpc init
  rpc: add new join handshake verbs
  docs: document the new join procedure
  topology_state_machine: add supported_features to replica_state
  storage_service: check destination host ID in raft verbs
  group_state_machine: take reference to raft address map
  raft_group0: expose joined_group0
2023-09-28 11:45:09 +02:00
Piotr Dulikowski
11ab7c3853 storage_service: do not check features in shadow round
The new joining procedure safely checks compatibility of
supported/enabled features, therefore there is no longer any need to do
it in the gossip shadow round.
2023-09-27 15:53:15 +02:00
Piotr Dulikowski
bf5059e83c storage_service: remove raft_{boostrap,replace}
The functionality of `raft_bootstrap` and `raft_replace` is handled by
the new handshake, so those functions can be removed.
2023-09-27 15:53:15 +02:00
Piotr Dulikowski
9a829ddf97 topology_coordinator: relax the check in enable_features
Currently, `enable_features` requires that there is no topology
operation in progress and that there are no nodes waiting to be joined.
Now that the new handshake is implemented, we can drop the second
condition, because nodes in the `none` state are not a part of group 0 yet.

Additionally, the comments inside `enable_features` are clarified so
that they explain why it's safe to only include normal features when
doing the barrier and calculating features to enable.
2023-09-27 15:53:15 +02:00
Piotr Dulikowski
3ee3699a9c raft_group0: insert replaced node info before server setup
Currently, information about replaced node is put into the raft address
map after joining group 0 via `join_group0`. However, the new handshake
which happens when joining group 0 needs to read the group 0 state (so
that it can wait until it sees all normal nodes as UP). Loading the
topology state to memory involves resolving IP addresses of the normal
nodes, so the information about replaced node needs to be inserted
before the handshake happens.

This commit moves the insertion of the replaced node's data before the call
to `join_group0`.
2023-09-27 15:53:15 +02:00
Piotr Dulikowski
41a22f6e3b storage_service: use join node rpc to join the cluster
Now, the storage_service uses new RPCs to join the cluster. A new
handshaker is implemented and passed to group0 in order to make it
happen.
2023-09-27 15:53:15 +02:00
Piotr Dulikowski
862b6e61a4 topology_coordinator: handle joining nodes
The topology coordinator is updated to perform verification of joining
nodes and to send `JOIN_NODE_RESPONSE` RPC back to the joining node.
2023-09-27 15:53:15 +02:00
Piotr Dulikowski
5ba2bfa015 topology_state_machine: add join_group0 state
Currently, when the topology coordinator notices a request to join or
replace a node, the node is transitioned to an appropriate state and the
topology is moved to commit_cdc_generation/write_both_read_old, in a
single group 0 operation. In later commits, the topology coordinator
will accept/reject nodes based on the request, so we would like to have
a separate step - topology coordinator accepts, transitions to bootstrap
state, tells the node that it is accepted, and only then continues with
the topology transition.

This commit adds a new `join_group0` transition state that precedes
`commit_cdc_generation`.
2023-09-27 15:53:15 +02:00
Piotr Dulikowski
bb40c2a8b8 storage_service: add join node RPC handlers 2023-09-27 15:53:13 +02:00
Botond Dénes
2cc37eb89b Merge 'Sanitize storage_service API maintenance' from Pavel Emelyanov
Storage service API set/unset has two flaws.

First, unsetting doesn't happen, so after the storage service is stopped its handlers become "local is not initialized" assertion and use-after-free landmines.

Second, setting the storage service API carries gossiper and system keyspace references, thus duplicating the knowledge about the storage service's dependencies.

This PR fixes both by adding the storage service API unsetting and by making the handlers use _only_ storage service instance, not any externally provided references.

Closes scylladb/scylladb#15547

* github.com:scylladb/scylladb:
  main, api: Set/Unset storage_service API in proper place
  api/storage_service: Remove gossiper arg from API
  api/storage_service: Remove system keyspace arg from API
  api/storage_service: Get gossiper from storage service
  api/storage_service: Get token_metadata from storage service
2023-09-27 10:00:54 +03:00
Nadav Har'El
9dea20539d Merge 'Sanitize forward-service shutdown' from Pavel Emelyanov
There's a dedicated forward_service::shutdown() method that's defer-scheduled in main for very early invocation. That's not nice; the forward service start-shutdown-stop sequence can be made "canonical" by moving the shutting-down code into an abort source subscription. A similar thing was done for the view updates generator in 3b95f4f107

refs: #2737
refs: #4384

Closes scylladb/scylladb#15545

* github.com:scylladb/scylladb:
  forward_service: Remove .shutdown() method
  forward_service: Set _shutdown in abort-source subscription
  forward_service: Add abort_source to constructor
2023-09-26 18:36:52 +03:00
Piotr Dulikowski
74b01730b4 storage_service: extract wait_for_live_nodes_timeout constant
Like in the non-raft topology path, during the new handshake the
joining node will wait until all normal nodes are alive. The timeout
used during the wait is extracted to a constant so that it can be
reused in the handshake code, to be introduced in later commits.
2023-09-26 15:56:52 +02:00
Piotr Dulikowski
4f82f9fe50 raft_group0: abstract out node joining handshake
Currently, raft_group0 uses the GROUP0_MODIFY_CONFIG RPC to ask an
existing group 0 member to add this node to the group, in case the
joining node is not the discovery leader. The new handshake verbs
(JOIN_NODE_REQUEST + JOIN_NODE_RESPONSE) will replace the old RPC. As a
preparation, this commit abstracts away the handshake process.
2023-09-26 15:56:52 +02:00
Piotr Dulikowski
c24daf7e88 storage_service: pass raft_topology_change_enabled on rpc init
We will want to conditionally register some verbs based on whether we
are using raft topology or not. This commit serves as a preparation,
passing `raft_topology_change_enabled` to the function which
initializes the verbs (although the `_raft_topology_change_enabled`
field already exists, it's only initialized on shard 0, and only later).
2023-09-26 15:56:52 +02:00
Piotr Dulikowski
7cbe5e3af8 rpc: add new join handshake verbs
The `join_node_request` and `join_node_response` RPCs are added:

- `join_node_request` is sent from the joining node to any node in the
  cluster. It contains some initial parameters that will be verified by
  the receiving node, or the topology coordinator - notably, it contains
  a list of cluster features supported by the joining node.
- `join_node_response` is sent from the topology coordinator to the
  joining node to tell it about the outcome of the verification.
2023-09-26 15:56:52 +02:00
Piotr Dulikowski
caf1d4938e topology_state_machine: add supported_features to replica_state
The `service::topology_features` struct was introduced in #14955. Its
purpose was to make it possible to load cluster features from
`system.topology` before schema commitlog replay. It contains a map from
host ID to supported feature set for every normal node.

In order not to duplicate the logic for loading features,
the `service::topology`'s `replica_state`s do not hold a set of
supported features; users are supposed to refer to the features
in `topology_features`, which is a field in the `topology` struct.
However, accessing features this way is quite awkward.

This commit adds `supported_features` field back to the `replica_state`
struct and the `load_topology_state` function initializes them properly.
The logic duplication needed to initialize them is quite small and the
drawbacks that come with it are outweighed by the fact that we can now
refer to a node's supported features in a more natural way.

The `topology_features` struct is no longer a field of `topology`, but
it still exists for the purpose of the feature check that happens before
commitlog replay.
2023-09-26 15:56:52 +02:00
Piotr Dulikowski
51b0e4d44f storage_service: check destination host ID in raft verbs
In unlucky but possible circumstances where a node is being replaced
very quickly, RPC requests using raft-related verbs from storage_service
might be sent to it - even before the node starts its group 0 server.
In that case, this triggers on_internal_error.

This commit adds protection to the existing verbs in storage_service:
they check whether group 0 is running and whether the received
host_id matches the actual recipient's host_id.

None of the verbs that are modified are in any existing release, so the
added parameter does not have to be wrapped in rpc::optional.
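The guard can be sketched as follows (Python; the function name and error strings are illustrative of the check, not the real C++):

```python
def check_raft_rpc_destination(group0_running: bool,
                               local_host_id: str,
                               dst_host_id: str) -> None:
    # Reject requests that raced with a node replacement: either group 0
    # isn't up yet on this node, or the verb was addressed to the host ID
    # of the node that previously held this address.
    if not group0_running:
        raise RuntimeError("group 0 server not started yet")
    if dst_host_id != local_host_id:
        raise RuntimeError(
            f"verb addressed to {dst_host_id}, but this node is {local_host_id}")
```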
2023-09-26 15:56:51 +02:00
Piotr Dulikowski
0317705f5a group_state_machine: take reference to raft address map
It will be needed to translate host ids to addresses.
2023-09-26 15:46:25 +02:00
Piotr Dulikowski
193e8eba26 raft_group0: expose joined_group0
It will be needed in the next commit to check whether the group 0 server
has been started.
2023-09-26 15:46:25 +02:00
Pavel Emelyanov
27eaff9d44 api/storage_service: Get gossiper from storage service
Some handlers in set_storage_service() have an implicit dependency on the
gossiper. It's not the API that should track it, but the storage service
itself, so get the gossiper from the service, not from the external
argument (it will be removed soon)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-26 12:20:27 +03:00
Tomasz Grabiec
0f22e8d196 storage_service: Fixed missed notification on tablet metadata update
There can be 2 waiters now (the coordinator and the CDC generation
publisher), so signal() is not enough.

The change made in c416c9ff33 missed
updating this site.
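The failure mode - signal() waking only one of two waiters - can be reproduced with a plain Python condition variable (a toy model; the real code uses Seastar's condition variable in C++):

```python
import threading

def run(wake_all: bool) -> int:
    """Start two waiters, publish an update, wake with notify() or
    notify_all(), and count how many waiters actually observed the wakeup."""
    cond = threading.Condition()
    started = threading.Semaphore(0)
    updated = [False]
    woken = []

    def waiter(i: int) -> None:
        with cond:
            started.release()           # announce "about to wait"
            if cond.wait(timeout=1.0):  # False -> we were never woken
                if updated[0]:
                    woken.append(i)

    threads = [threading.Thread(target=waiter, args=(i,)) for i in (1, 2)]
    for t in threads:
        t.start()
    started.acquire(); started.acquire()  # both waiters are now waiting
    with cond:
        updated[0] = True
        (cond.notify_all if wake_all else cond.notify)()
    for t in threads:
        t.join()
    return len(woken)
```

run(False) leaves one waiter stranded until its timeout - the missed-notification bug; run(True) wakes both, which is what the fix needs.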

Closes scylladb/scylladb#15527
2023-09-26 10:37:57 +02:00
Pavel Emelyanov
0e0f9a57c6 forward_service: Remove .shutdown() method
It's now empty and has no value

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-26 10:39:22 +03:00
Pavel Emelyanov
a251b9893f forward_service: Set _shutdown in abort-source subscription
Currently the bit is set in the .shutdown() method, which is called early on
stop. After the patch the bit is set in the abort-source subscription
callback, which is also called early on stop.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-26 10:38:34 +03:00
Pavel Emelyanov
b18c54f56c forward_service: Add abort_source to constructor
It will be used by the next patch

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-26 10:38:26 +03:00
Tomasz Grabiec
19ff4b730f storage_service: Avoid SIGSEGV when tablet cleanup is invoked on non-0 shard
We access group0, which is only set on shard 0.

Closes scylladb/scylladb#15469
2023-09-25 20:59:27 +03:00
Botond Dénes
d62a83683e service/storage_proxy: add trace points for the actual read executor type
There is currently a trace point for when the read executor is created,
but this only contains the initial replica set and doesn't mention which
read executor is created in the end. This patch adds trace points for
each different return path, so it is clear from the trace whether
speculative read can happen or not.
2023-09-22 02:53:15 -04:00
Botond Dénes
d3aabf7896 service/storage_proxy: add trace points for read-repair
Currently the fact that read-repair was triggered can only be inferred
from seeing mutation reads in the trace. This patch adds an explicit
trace point for when read repair is triggered and also when it is
finished or retried.
2023-09-22 02:53:14 -04:00
Tomasz Grabiec
1bcac74976 storage_proxy: Add more trace-level logging to read-repair
Extremely helpful in debugging.
2023-09-22 02:53:14 -04:00
Tomasz Grabiec
17c1cad4b4 database, storage_proxy: Reconcile pages with no live rows incrementally
Currently, a mutation query on the replica side will not respond with a
result which doesn't have at least one live row. This causes problems if
there are a lot of dead rows or partitions before we reach a live row;
they stem from the fact that the resulting reconcilable_result will be
large:

* Large allocations. Serialization of reconcilable_result causes large
  allocations for storing result rows in std::deque
* Reactor stalls. Serialization of reconcilable_result on the replica
  side and on the coordinator side causes reactor stalls. This impacts
  not only the query at hand. For 1M dead rows, freezing takes 130ms,
  unfreezing takes 500ms. Coordinator does multiple freezes and
  unfreezes. The reactor stall on the coordinator side is >5s.
* Large repair mutations. If reconciliation works on large pages, repair
  may fail due to too large mutation size. 1M dead rows is already too
  much: Refs #9111.

This patch fixes all of the above by making mutation reads respect the
memory accounter's limit for the page size, even for dead rows.
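The core of the fix can be sketched as a page-fill loop that charges every fragment, live or dead, against the memory accounter (a Python toy model; all names are illustrative):

```python
def fill_page(rows, page_byte_limit: int):
    """`rows` yields (size_in_bytes, is_live) tuples in clustering order.
    Before the fix, a page was only cut once it held a live row; now the
    accounter's byte limit cuts it even if every row is a tombstone."""
    page, used = [], 0
    for size, is_live in rows:
        page.append((size, is_live))
        used += size
        if used >= page_byte_limit:
            break  # respect the limit; the next page resumes from here
    return page

# A long run of dead rows no longer forces one huge page:
dead_rows = [(100, False)] * 10_000 + [(100, True)]
first_page = fill_page(iter(dead_rows), page_byte_limit=4096)
```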

This patch also addresses the problem of client-side timeouts during
paging. Reconciling queries processing long strings of tombstones will
now properly page tombstones, like regular queries do.

My testing shows that this solution even increases efficiency. I tested
with a cluster of 2 nodes, and a table of RF=2. The data layout was as
follows (1 partition):

    Node1: 1 live row, 1M dead rows
    Node2: 1M dead rows, 1 live row

This was designed to trigger reconciliation right from the very start of
the query.

Before:

Running query (node2, CL=ONE, cold cache)
Query done, duration: 140.0633503ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (node2, CL=ONE, hot cache)
Query done, duration: 66.7195275ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (all-nodes, CL=ALL, reconcile, cold-cache)
Query done, duration: 873.5400742ms, pages: 2, result: [Row(pk=0, ck=0, v=0), Row(pk=0, ck=3000000, v=0)]

After:

Running query (node2, CL=ONE, cold cache)
Query done, duration: 136.9035122ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (node2, CL=ONE, hot cache)
Query done, duration: 69.5286021ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (all-nodes, CL=ALL, reconcile, cold-cache)
Query done, duration: 162.6239498ms, pages: 100, result: [Row(pk=0, ck=0, v=0), Row(pk=0, ck=3000000, v=0)]

Non-reconciling queries have almost identical durations (a few ms of
variation can be observed between runs). Note how in the after case the
reconciling read also produces 100 pages, vs. just 2 pages in the before
case, leading to a much lower duration (less than 1/4 of the before one).

Refs #7929
Refs #3672
Refs #7933
Fixes #9111
2023-09-22 02:53:14 -04:00
Gleb Natapov
c94a9cf731 storage_service: raft topology: fence off write from old topology coordinator before starting a new one
Make sure that all writes started by the old coordinator are completed or
will eventually fail before starting a new coordinator.
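The idea resembles a standard fencing-token scheme. Below is a toy sketch in Python with hypothetical names (`Replica`, `bump_fence`); the real mechanism in Scylla is implemented via Raft topology state in storage_service, not a class like this. Each write carries the coordinator generation it was started under, and replicas reject writes from an older generation, so after the fence is bumped the new coordinator knows stale writes have either completed or will fail.

```python
# Fencing-token sketch with made-up names; illustrates the concept only.

class Replica:
    def __init__(self):
        self.fence_version = 0
        self.data = {}

    def bump_fence(self, new_version):
        # Called when a new topology coordinator takes over.
        self.fence_version = max(self.fence_version, new_version)

    def write(self, key, value, coordinator_version):
        # Writes started under an old coordinator are rejected.
        if coordinator_version < self.fence_version:
            raise RuntimeError("stale coordinator, write fenced off")
        self.data[key] = value

r = Replica()
r.write("k", 1, coordinator_version=0)  # old coordinator's write lands
r.bump_fence(1)                         # new coordinator fences off gen 0
try:
    r.write("k", 2, coordinator_version=0)
    fenced = False
except RuntimeError:
    fenced = True
```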

Message-ID: <ZQv+OCrHl+KyAnvv@scylladb.com>
2023-09-21 17:26:45 +02:00
Pavel Emelyanov
6e972f8505 repair: Shutdown repair on nodetool drain too
Currently repair shutdown only happens on stop, but nodetool drain can
call shutdown too, to abort repair tasks that are no longer relevant, if
any. This also makes main()'s deferred shutdown/stop paths a little
cleaner.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#15438
2023-09-21 16:58:23 +03:00
Tomasz Grabiec
3d4398d1b2 Merge 'Don't calculate hashes for schema versions in Raft mode' from Kamil Braun
When performing a schema change through group 0, extend the schema mutations with a version that's persisted and then used by the nodes in the cluster in place of the old schema digest, which becomes horribly slow as we perform more and more schema changes (#7620).

If the change is a table create or alter, also extend the mutations with a version for this table to be used for `schema::version()`s instead of having each node calculate a hash which is susceptible to bugs (#13957).

When performing a schema change in Raft RECOVERY mode we also extend schema mutations, which forces nodes to revert to the old way of calculating schema versions when necessary.

We can only introduce these extensions if all of the cluster understands them, so protect this code by a new cluster/schema feature, `GROUP0_SCHEMA_VERSIONING`.
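The version-selection logic described above can be sketched in a few lines of Python (hypothetical function and parameter names; the real code lives in schema_tables and migration_manager, and the fallback digest is not literally an MD5 of one buffer): prefer the version persisted through group 0 when the feature is enabled, otherwise fall back to hashing the schema mutations, i.e. the old, slow path.

```python
import hashlib

# Illustrative sketch only; names and the digest choice are assumptions.

def schema_version(mutations_bytes, persisted_version, feature_enabled):
    """Prefer the version persisted through group 0; fall back to
    computing a digest of the schema mutations (the old, slow path)."""
    if feature_enabled and persisted_version is not None:
        return persisted_version
    return hashlib.md5(mutations_bytes).hexdigest()
```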

Fixes: #7620
Fixes: #13957

Closes scylladb/scylladb#15331

* github.com:scylladb/scylladb:
  test: add test for group 0 schema versioning
  test/pylib: log_browsing: fix type hint
  feature_service: enable `GROUP0_SCHEMA_VERSIONING` in Raft mode
  schema_tables: don't delete `version` cell from `scylla_tables` mutations from group 0
  migration_manager: add `committed_by_group0` flag to `system.scylla_tables` mutations
  schema_tables: use schema version from group 0 if present
  migration_manager: store `group0_schema_version` in `scylla_local` during schema changes
  migration_manager: migration_request handler: assume `canonical_mutation` support
  system_keyspace: make `get/set_scylla_local_param` public
  feature_service: add `GROUP0_SCHEMA_VERSIONING` feature
  schema_tables: refactor `scylla_tables(schema_features)`
  migration_manager: add `std::move` to avoid a copy
  schema_tables: remove default value for `reload` in `merge_schema`
  schema_tables: pass `reload` flag when calling `merge_schema` cross-shard
  system_keyspace: fix outdated comment
2023-09-20 10:43:40 +02:00
Botond Dénes
844a0e426f Merge 'Mark counters with skip when empty' from Amnon Heiman
This series marks multiple high-cardinality counters with the
skip_when_empty flag. After this patch, the following counters will not
be reported if they were never used:
```
scylla_transport_cql_errors_total
scylla_storage_proxy_coordinator_reads_local_node
scylla_storage_proxy_coordinator_completed_reads_local_node
```
Also marked are the CAS-related CQL operation counters.
Fixes #12751
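The effect of the flag can be shown with a toy metrics registry (plain Python with made-up names; this is not the Seastar metrics API, which is where the real skip_when_empty flag lives): a counter marked this way is simply omitted from the exported report for as long as its value is still zero.

```python
# Toy registry illustrating the skip-when-empty idea; names are made up.

class Counter:
    def __init__(self, name, skip_when_empty=False):
        self.name = name
        self.value = 0
        self.skip_when_empty = skip_when_empty

    def inc(self, n=1):
        self.value += n

def report(counters):
    """Export only counters that are non-zero or not marked
    skip-when-empty, cutting cardinality for never-used counters."""
    return {c.name: c.value
            for c in counters
            if c.value != 0 or not c.skip_when_empty}

errors = Counter("cql_errors_total", skip_when_empty=True)
reads = Counter("reads_total")
out = report([errors, reads])    # errors never used -> omitted
errors.inc()
out2 = report([errors, reads])   # now it is reported
```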

Closes scylladb/scylladb#13558

* github.com:scylladb/scylladb:
  service/storage_proxy.cc: mark counters with skip_when_empty
  cql3/query_processor.cc: mark cas related metrics with skip_when_empty
  transport/server.cc: mark metric counter with skip_when_empty
2023-09-19 15:02:39 +03:00
Benny Halevy
e784930dd7 storage_service: fix comment about when group0 is set
Since 8598cebb11
it is set earlier, before join_cluster.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-ID: <20230919063951.1424924-1-bhalevy@scylladb.com>
2023-09-19 13:20:58 +03:00
Kefu Chai
9de00c1c5a build: cmake: add node_ops
node_ops source files were extracted into the node_ops/ directory in
d0d0ad7aa4, so let's update the build system accordingly.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#15442
2023-09-18 16:27:02 +03:00