Commit Graph

3849 Commits

Gleb Natapov
f80fff3484 gossip: remove unused STATUS_LEAVING gossiper status
The status is no longer used. The function that referenced it was
removed by 5a96751534, and it had already been unused
for a while before that.

Message-Id: <ZS92mcGE9Ke5DfXB@scylladb.com>
2023-10-18 11:13:14 +02:00
Tomasz Grabiec
0aef0f900b Merge 'truncation records refactorings' from Petr Gusev
This PR contains several refactorings related to truncation-record handling in the `system_keyspace`, `commitlog_replayer` and `table` classes:
* drop map_reduce from `commitlog_replayer`; it's sufficient to load truncation records from the null shard;
* add a check that `table::_truncated_at` is properly initialized before it's accessed;
* move its initialization after `init_non_system_keyspaces`
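The initialization check on `table::_truncated_at` can be sketched as follows. This is a toy Python model of the idea, not the actual ScyllaDB C++ code; all names besides `_truncated_at` are illustrative.

```python
from typing import Optional

class Table:
    """Toy model of the truncation-time bookkeeping; the real code is
    ScyllaDB C++, so this is a sketch of the guard, not the real API."""

    def __init__(self) -> None:
        # Not yet loaded; reading it in this state is a bug.
        self._truncated_at: Optional[int] = None

    def set_truncation_time(self, ts: int) -> None:
        self._truncated_at = ts

    def get_truncation_time(self) -> int:
        # The refactoring adds exactly this kind of check: fail loudly
        # if the value is accessed before it has been initialized.
        if self._truncated_at is None:
            raise RuntimeError("_truncated_at accessed before initialization")
        return self._truncated_at
```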

Closes scylladb/scylladb#15583

* github.com:scylladb/scylladb:
  system_keyspace: drop truncation_record
  system_keyspace: remove get_truncated_at method
  table: get_truncation_time: check _truncated_at is initialized
  database: add_column_family: initialize truncation_time for new tables
  database: add_column_family: rename readonly parameter to is_new
  system_keyspace: move load_truncation_times into distributed_loader::populate_keyspace
  commitlog_replayer: refactor commitlog_replayer::impl::init
  system_keyspace: drop redundant typedef
  system_keyspace: drop redundant save_truncation_record overload
  table: rename cache_truncation_record -> set_truncation_time
  system_keyspace: get_truncated_position -> get_truncated_positions
2023-10-17 10:55:30 +02:00
Avi Kivity
35849fc901 Revert "Merge 'Don't calculate hashes for schema versions in Raft mode' from Kamil Braun"
This reverts commit 3d4398d1b2, reversing
changes made to 45dfce6632. The commit
causes some schema changes to be lost due to incorrect timestamps
in some mutations. More information is available in [1].

Reopens: scylladb/scylladb#7620
Reopens: scylladb/scylladb#13957

Fixes scylladb/scylladb#15530.

[1] https://github.com/scylladb/scylladb/pull/15687
2023-10-11 00:32:05 +03:00
Avi Kivity
765e193122 Merge 'db/hints: Modernize manager' from Dawid Mędrek
This PR is another step in refactoring the Hinted Handoff module. It aims at modernizing the code by moving to coroutines, using `std::ranges` instead of Boost's ranges where possible, and adopting other features from the newer C++ standards.

It also tries to make the code clearer and to get rid of confusing elements, e.g. shared pointers used where they shouldn't be, or methods marked virtual even though nothing derives from the class. It also stops `manager.hh` from giving direct access to internal structures (`hint_endpoint_manager` in this case).

Refs #15358

Closes scylladb/scylladb#15631

* github.com:scylladb/scylladb:
  db/hints/manager: Reword comments about state
  db/hints/manager: Unfriend space_watchdog
  db/hints: Remove a redundant alias
  db/hints: Remove an unused namespace
  db/hints: Coroutinize change_host_filter()
  db/hints: Coroutinize drain_for()
  db/hints: Clean up can_hint_for()
  db/hints: Clean up store_hint()
  db/hints: Clean up too_many_in_flight_hints_for()
  db/hints: Refactor get_ep_manager()
  db/hints: Coroutinize wait_for_sync_point()
  db/hints: Use std::span in calculate_current_sync_point
  db/hints: Clean up manager::forbid_hints_for_eps_with_pending_hints()
  db/hints: Clean up manager::forbid_hints()
  db/hints: Clean up manager::allow_hints()
  db/hints: Coroutinize compute_hints_dir_device_id()
  db/hints: Clean up manager::stop()
  db/hints: Clean up manager::start()
  db/hints/manager: Clean up the constructor
  db/hints: Remove boilerplate drain_lock()
  db/hints: Let drain_for() return a future
  db/hints: Remove ep_managers_end
  db/hints: Remove find_ep_manager
  db/hints: Use manager as API for hint_endpoint_manager
  db/hints: Don't mark have_ep_manager()'s definition as inline
  db/hints: Remove make_directory_initializer()
  db/hints/manager: Order constructors
  db/hints: Move ~manager() and mark it as noexcept
  db/hints: Use reference for storage proxy
  db/hints/manager: Explicitly delete copy constructor
  db/hints: Capitalize constants
  db/hints/manager: Hide declarations
  db/hints/manager: Move the definitions of static members to the header
  db/hints: Move make_dummy() to the header
  db/hints: Don't explicitly define ~directory_initializer()
  db/hints: Change the order of logging in ensure_created_and_verified()
  db/hints: Coroutinize ensure_rebalanced()
  db/hints: Coroutinize ensure_created_and_verified()
  db/hints: Improve formatting of directory_initializer::impl
  db/hints: Do not rely on the values of enums
  db/hints: Move the implementation of directory_initializer
  db/hints: Prefer nested namespaces
  db/hints: Remove an unused alias from manager.hh
  db/hints: Reorder includes in manager.hh and .cc
2023-10-06 17:20:33 +03:00
Dawid Medrek
f1f35ba819 db/hints: Let drain_for() return a future
Currently, the function doesn't return anything.
However, if the future doesn't need to be awaited,
the caller can decide that for itself. There is no
reason to make that decision inside the function.
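The idea, return the future and let the caller decide, can be sketched in Python with `concurrent.futures` (the executor and the `drain_for` shape here are illustrative; the real code returns a Seastar future from C++):

```python
from concurrent.futures import Future, ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=1)

def drain_for(endpoint: str) -> Future:
    # Return the future instead of swallowing it; whether to await it
    # is now the caller's decision, not this function's.
    return _executor.submit(lambda: f"drained hints for {endpoint}")

# A caller that needs completion awaits the result...
synchronous_result = drain_for("10.0.0.1").result()
# ...while a fire-and-forget caller may simply drop the future.
drain_for("10.0.0.2")
```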
2023-10-06 12:18:25 +02:00
Dawid Medrek
18a2831186 db/hints: Use reference for storage proxy
This commit makes db::hints::manager store service::storage_proxy
as a reference instead of a seastar::shared_ptr. The manager is
owned by storage proxy, so it only lives as long as storage proxy
does. Hence, it makes little sense to store the latter as a shared
pointer; in fact, it's very confusing and may be error-prone.
The field never changes, so it's safe to keep it as a reference
(especially because copy and move constructors of db::hints::manager
are both deleted). What's more, we ensure that the hint manager
has access to storage proxy as soon as it's created.

The same changes were applied to db::hints::resource_manager.
The rationale is the same.
2023-10-06 11:54:15 +02:00
Botond Dénes
8c03eeb85d Merge 'Sanitize hints API handlers and remove proxy from http context' from Pavel Emelyanov
This is the continuation of 3e74432dbf.

Registering API handlers for a service needs to:
 - happen next to the corresponding service's start;
 - use only the provided service, not any other ones (if needed, the handler's service can use its internal dependencies to do its job);
 - get the service to handle requests via an argument, not from the http context (the http context, in turn, is going _not_ to depend on anything).

Hints API handlers want to use the proxy, but they also reference the gossiper and capture the proxy via the http context. This PR fixes both and removes the http_context -> proxy dependency, which is no longer needed.
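A minimal sketch of that rule in plain Python (`HintsService` and the handler shape are purely illustrative, not the real API):

```python
class HintsService:
    """Stand-in for the one service a group of handlers belongs to."""
    def __init__(self) -> None:
        self.calls = 0

    def create_sync_point(self) -> str:
        self.calls += 1
        return f"sync-point-{self.calls}"

def make_sync_point_handler(service: HintsService):
    # The handler receives its service via an argument and uses only
    # that service - no global http context, no unrelated dependencies.
    def handler(request: dict) -> str:
        return service.create_sync_point()
    return handler
```

Anything else the operation needs (the gossiper, in the hints case) is reached through the service's own internal dependencies rather than handed to the handler directly.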

Closes scylladb/scylladb#15644

* github.com:scylladb/scylladb:
  api: Remove proxy reference from http context
  api,hints: Use proxy instead of ctx
  api,hints: Pass sharded<proxy>& instead of gossiper&
  api,hints: Fix indentation after previous patch
  api,hints: Move gossiper access to proxy
2023-10-06 11:04:27 +03:00
Avi Kivity
854188a486 Merge 'database, storage_proxy: Reconcile pages with dead rows and partitions incrementally' from Botond Dénes
Currently, a mutation query on the replica side will not respond with a result which doesn't have at least one live row. This causes problems if there are a lot of dead rows or partitions before we reach a live row; they stem from the fact that the resulting reconcilable_result will be large:

1. Large allocations. Serialization of reconcilable_result causes large allocations for storing result rows in std::deque.
2. Reactor stalls. Serialization of reconcilable_result on the replica side and on the coordinator side causes reactor stalls. This impacts not only the query at hand. For 1M dead rows, freezing takes 130ms, unfreezing takes 500ms. The coordinator does multiple freezes and unfreezes. The reactor stall on the coordinator side is >5s.
3. Too-large repair mutations. If reconciliation works on large pages, repair may fail due to too large a mutation size. 1M dead rows is already too much: Refs https://github.com/scylladb/scylladb/issues/9111.

This patch fixes all of the above by making mutation reads respect the memory accounter's limit for the page size, even for dead rows.

This patch also addresses the problem of client-side timeouts during paging. Reconciling queries processing long strings of tombstones will now properly page tombstones, like regular queries do.

My testing shows that this solution even increases efficiency. I tested with a cluster of 2 nodes, and a table of RF=2. The data layout was as follows (1 partition):
* Node1: 1 live row, 1M dead rows
* Node2: 1M dead rows, 1 live row

This was designed to trigger reconciliation right from the very start of the query.

Before:
```
Running query (node2, CL=ONE, cold cache)
Query done, duration: 140.0633503ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (node2, CL=ONE, hot cache)
Query done, duration: 66.7195275ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (all-nodes, CL=ALL, reconcile, cold-cache)
Query done, duration: 873.5400742ms, pages: 2, result: [Row(pk=0, ck=0, v=0), Row(pk=0, ck=3000000, v=0)]
```

After:
```
Running query (node2, CL=ONE, cold cache)
Query done, duration: 136.9035122ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (node2, CL=ONE, hot cache)
Query done, duration: 69.5286021ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (all-nodes, CL=ALL, reconcile, cold-cache)
Query done, duration: 162.6239498ms, pages: 100, result: [Row(pk=0, ck=0, v=0), Row(pk=0, ck=3000000, v=0)]
```

Non-reconciling queries have almost identical duration (a few ms of variation can be observed between runs). Note how in the after case, the reconciling read also produces 100 pages, vs. just 2 pages in the before case, leading to a much lower duration (less than 1/4 of the before).

Refs https://github.com/scylladb/scylladb/issues/7929
Refs https://github.com/scylladb/scylladb/issues/3672
Refs https://github.com/scylladb/scylladb/issues/7933
Fixes https://github.com/scylladb/scylladb/issues/9111

Closes scylladb/scylladb#15414

* github.com:scylladb/scylladb:
  test/topology_custom: add test_read_repair.py
  replica/mutation_dump: detect end-of-page in range-scans
  tools/scylla-sstable: write: abort parser thread if writing fails
  test/pylib: add REST methods to get node exe and workdir paths
  test/pylib/rest_client: add load_new_sstables, keyspace_{flush,compaction}
  service/storage_proxy: add trace points for the actual read executor type
  service/storage_proxy: add trace points for read-repair
  storage_proxy: Add more trace-level logging to read-repair
  database: Fix accounting of small partitions in mutation query
  database, storage_proxy: Reconcile pages with no live rows incrementally
2023-10-05 22:39:34 +03:00
Pavel Emelyanov
967faa97e4 proxy: Coroutinize start_hints_manager()
All the other calls managing hints are coroutinized

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#15641
2023-10-05 16:16:27 +02:00
Pavel Emelyanov
53891dd9cc api,hints: Move gossiper access to proxy
API handlers should try to avoid using any service other than the "main"
one. For hints API this service is going to be proxy, so no gossiper
access in the handler itself.

(indentation is left broken)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-10-05 16:14:26 +03:00
Piotr Dulikowski
4340e46c66 storage_service: increase timeout during join procedure to 3 minutes
When joining the cluster in raft topology mode, the new node asks some
existing node in the cluster to put its information to the
`system.topology` table. Later, the topology coordinator is supposed to
contact the joining node back, telling it that it was added to group 0
and accepted, or rejected. Due to the fact that the topology coordinator
might not manage to successfully contact the joining node, in order not
to get stuck it might decide to give up and move the node to left state
and forget about it (this not always happens as of now, but will in the
future). Because of that, the joining node must use a timeout when
waiting for a response because it's not guaranteed that it will ever
receive it.

There is an additional complication: the topology coordinator might be
busy and not notice the request to join for a long time. For example, it
might be migrating tablets or joining other nodes which are in the queue
before it. Therefore, it's difficult to choose a timeout which is long
enough for every case and still not too long.

Such a failure was observed to happen in ARM tests in debug mode. In
order to unblock the CI the timeout is increased from 30 seconds to 3
minutes. As a proper solution, the procedure will most likely have to be
adjusted in a more significant way.
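The bounded wait can be sketched like this (Python `threading`; the constant and function names are illustrative, as the real code is C++ in storage_service):

```python
import threading

JOIN_RESPONSE_TIMEOUT = 180.0  # raised from 30 s to 3 minutes by this commit

def wait_for_join_response(response_received: threading.Event,
                           timeout: float = JOIN_RESPONSE_TIMEOUT) -> bool:
    # The coordinator may have moved the node to `left` and forgotten it,
    # so the joining node must not wait forever. True means a response
    # arrived in time; False means the join attempt should be aborted.
    return response_received.wait(timeout)
```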

Fixes: #15600

Closes scylladb/scylladb#15618
2023-10-05 10:29:03 +02:00
Petr Gusev
da1e6751e9 table: rename cache_truncation_record -> set_truncation_time
This is a refactoring commit without observable
changes in behaviour.

There is a truncation_record struct, but in this method we
only care about time, so rename it (and other related methods)
appropriately to avoid confusion.
2023-10-03 17:11:35 +04:00
Michael Huang
1640f83fdc raft: Store snapshot update and truncate log atomically
In case the snapshot update fails, we don't truncate commit log.

Fixes scylladb/scylladb#9603

Closes scylladb/scylladb#15540
2023-09-29 17:57:49 +02:00
Botond Dénes
5d8384eff0 Merge 'Fix test_fencing.py::test_fence_hints flakiness' from Kamil Braun
Add a REST API to reload the Raft topology state without having to restart a node, and use it in `test_fence_hints`. Restarting the node has undesired side effects which cause test flakiness; more details are provided in the commit messages.

Refactor the test a bit while at it.

Fixes: #15285

Closes scylladb/scylladb#15523

* github.com:scylladb/scylladb:
  test: test_fencing.py: enable hints_manager=trace logs in `test_fence_hints`
  test: test_fencing.py: reload topology through REST API in `test_fence_hints`
  test: refactor test_fencing.py
  api: storage_service: add REST API to reload topology state
2023-09-28 16:30:23 +03:00
Kamil Braun
992f1327d3 api: storage_service: add REST API to reload topology state
Some tests may want to modify the system.topology table directly. Add a REST
API to reload the state into memory. An alternative would be restarting
the server, but that's slower and may have other side effects undesired
in the test.

The API can also be called outside tests; it should not have any
observable effects unless the user modifies `system.topology` table
directly (which they should never do, outside perhaps some disaster
recovery scenarios).
2023-09-28 11:59:16 +02:00
Kamil Braun
060f2de14e Merge 'Cluster features on raft: new procedure for joining group 0' from Piotr Dulikowski
This PR implements a new procedure for joining nodes to group 0, based on the description in the "Cluster features on Raft (v2)" document. This is a continuation of the previous PRs related to cluster features on raft (https://github.com/scylladb/scylladb/pull/14722, https://github.com/scylladb/scylladb/pull/14232), and the last piece necessary to replace cluster feature checks in gossip.

The current implementation relies on a gossip shadow round to fetch the set of enabled features, determine whether the node supports all of the enabled features, and join only if it is safe. As we are moving the management of cluster features to group 0, we encounter a problem: the contents of group 0 itself may depend on features, hence it is not safe to join it unless we perform the feature check, which in turn depends on information in group 0. Hence, we have a dependency cycle.

In order to solve this problem, the algorithm for joining group 0 is modified, and verification of features and other parameters is offloaded to an existing node in group 0. Instead of directly asking the discovery leader to unconditionally add the node to the configuration with `GROUP0_MODIFY_CONFIG`, two different RPCs are added: `JOIN_NODE_REQUEST` and `JOIN_NODE_RESPONSE`. The main idea is as follows:

- The new node sends `JOIN_NODE_REQUEST` to the discovery leader. It sends a bunch of information describing the node, including supported cluster features. The discovery leader verifies some of the parameters and adds the node in the `none` state to `system.topology`.
- The topology coordinator picks up the request for the node to be joined (i.e. the node in `none` state), verifies its properties - including cluster features - and then:
	- If the node is accepted, the coordinator transitions it to the `bootstrap`/`replace` state and transitions the topology to the `join_group0` state. The node is added to group 0 and then `JOIN_NODE_RESPONSE` is sent to it with information that it was accepted.
	- Otherwise, the node is moved to the `left` state, is told by the coordinator via `JOIN_NODE_RESPONSE` that it was rejected, and shuts down.

The procedure is not retryable - if a node fails to complete it from start to end and crashes in between, it will not be allowed to retry with the same host_id - `JOIN_NODE_REQUEST` will fail. The data directory must be cleared before attempting to add the node again (so that a new host_id is generated).

More details about the procedure and the RPC are described in `topology-over-raft.md`.
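The accept/reject step can be modeled as a tiny state machine. This Python sketch follows the state names in the description above; the `coordinator_decide` function itself is hypothetical:

```python
from enum import Enum, auto

class NodeState(Enum):
    NONE = auto()       # join request recorded in system.topology
    BOOTSTRAP = auto()  # accepted as a new node
    REPLACE = auto()    # accepted as a replacement node
    LEFT = auto()       # rejected; data directory must be wiped to retry

def coordinator_decide(state: NodeState, features_ok: bool,
                       replacing: bool) -> NodeState:
    # Only nodes in `none` state are awaiting a decision.
    if state is not NodeState.NONE:
        raise ValueError("node is not awaiting a join decision")
    if not features_ok:
        # JOIN_NODE_RESPONSE will tell the node it was rejected.
        return NodeState.LEFT
    # JOIN_NODE_RESPONSE will tell the node it was accepted, and the
    # topology itself moves through the new join_group0 state.
    return NodeState.REPLACE if replacing else NodeState.BOOTSTRAP
```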

Fixes: #15152

Closes scylladb/scylladb#15196

* github.com:scylladb/scylladb:
  tests: mark test_blocked_bootstrap as skipped
  storage_service: do not check features in shadow round
  storage_service: remove raft_{boostrap,replace}
  topology_coordinator: relax the check in enable_features
  raft_group0: insert replaced node info before server setup
  storage_service: use join node rpc to join the cluster
  topology_coordinator: handle joining nodes
  topology_state_machine: add join_group0 state
  storage_service: add join node RPC handlers
  raft: expose current_leader in raft::server
  storage_service: extract wait_for_live_nodes_timeout constant
  raft_group0: abstract out node joining handshake
  storage_service: pass raft_topology_change_enabled on rpc init
  rpc: add new join handshake verbs
  docs: document the new join procedure
  topology_state_machine: add supported_features to replica_state
  storage_service: check destination host ID in raft verbs
  group_state_machine: take reference to raft address map
  raft_group0: expose joined_group0
2023-09-28 11:45:09 +02:00
Piotr Dulikowski
11ab7c3853 storage_service: do not check features in shadow round
The new joining procedure safely checks compatibility of
supported/enabled features, therefore there is no longer any need to do
it in the gossip shadow round.
2023-09-27 15:53:15 +02:00
Piotr Dulikowski
bf5059e83c storage_service: remove raft_{boostrap,replace}
The functionality of `raft_bootstrap` and `raft_replace` is handled by
the new handshake, so those functions can be removed.
2023-09-27 15:53:15 +02:00
Piotr Dulikowski
9a829ddf97 topology_coordinator: relax the check in enable_features
Currently, `enable_features` requires that there is no topology
operation in progress and that there are no nodes waiting to be joined.
Now that the new handshake is implemented, we can drop the second
condition, because nodes in the `none` state are not a part of group 0 yet.

Additionally, the comments inside `enable_features` are clarified so
that they explain why it's safe to only include normal features when
doing the barrier and calculating features to enable.
2023-09-27 15:53:15 +02:00
Piotr Dulikowski
3ee3699a9c raft_group0: insert replaced node info before server setup
Currently, information about replaced node is put into the raft address
map after joining group 0 via `join_group0`. However, the new handshake
which happens when joining group 0 needs to read the group 0 state (so
that it can wait until it sees all normal nodes as UP). Loading the
topology state to memory involves resolving IP addresses of the normal
nodes, so the information about replaced node needs to be inserted
before the handshake happens.

This commit moves the insertion of the replaced node's data before the call
to `join_group0`.
2023-09-27 15:53:15 +02:00
Piotr Dulikowski
41a22f6e3b storage_service: use join node rpc to join the cluster
Now, the storage_service uses new RPCs to join the cluster. A new
handshaker is implemented and passed to group0 in order to make it
happen.
2023-09-27 15:53:15 +02:00
Piotr Dulikowski
862b6e61a4 topology_coordinator: handle joining nodes
The topology coordinator is updated to perform verification of joining
nodes and to send `JOIN_NODE_RESPONSE` RPC back to the joining node.
2023-09-27 15:53:15 +02:00
Piotr Dulikowski
5ba2bfa015 topology_state_machine: add join_group0 state
Currently, when the topology coordinator notices a request to join or
replace a node, the node is transitioned to an appropriate state and the
topology is moved to commit_cdc_generation/write_both_read_old, in a
single group 0 operation. In later commits, the topology coordinator
will accept/reject nodes based on the request, so we would like to have
a separate step - topology coordinator accepts, transitions to bootstrap
state, tells the node that it is accepted, and only then continues with
the topology transition.

This commit adds a new `join_group0` transition state that precedes
`commit_cdc_generation`.
2023-09-27 15:53:15 +02:00
Piotr Dulikowski
bb40c2a8b8 storage_service: add join node RPC handlers 2023-09-27 15:53:13 +02:00
Botond Dénes
2cc37eb89b Merge 'Sanitize storage_service API maintenance' from Pavel Emelyanov
Storage service API set/unset has two flaws.

First, unsetting doesn't happen, so after the storage service is stopped its handlers become "local is not initialized" assertion and use-after-free landmines.

Second, setting the storage service API carries gossiper and system keyspace references, thus duplicating the knowledge about the storage service's dependencies.

This PR fixes both by adding the storage service API unsetting and by making the handlers use _only_ storage service instance, not any externally provided references.

Closes scylladb/scylladb#15547

* github.com:scylladb/scylladb:
  main, api: Set/Unset storage_service API in proper place
  api/storage_service: Remove gossiper arg from API
  api/storage_service: Remove system keyspace arg from API
  api/storage_service: Get gossiper from storage service
  api/storage_service: Get token_metadata from storage service
2023-09-27 10:00:54 +03:00
Nadav Har'El
9dea20539d Merge 'Sanitize forward-service shutdown' from Pavel Emelyanov
There's a dedicated forward_service::shutdown() method that's defer-scheduled in main for very early invocation. That's not nice; the forward service start-shutdown-stop sequence can be made "canonical" by moving the shutting-down code into an abort source subscription. A similar thing was done for the view updates generator in 3b95f4f107

refs: #2737
refs: #4384

Closes scylladb/scylladb#15545

* github.com:scylladb/scylladb:
  forward_service: Remove .shutdown() method
  forward_service: Set _shutdown in abort-source subscription
  forward_service: Add abort_source to constructor
2023-09-26 18:36:52 +03:00
Piotr Dulikowski
74b01730b4 storage_service: extract wait_for_live_nodes_timeout constant
Like in the non-raft topology path, during the new handshake the
joining node will wait until all normal nodes are alive. The timeout
used during the wait is extracted to a constant so that it can be
reused in the handshake code, to be introduced in later commits.
2023-09-26 15:56:52 +02:00
Piotr Dulikowski
4f82f9fe50 raft_group0: abstract out node joining handshake
Currently, raft_group0 uses the GROUP0_MODIFY_CONFIG RPC to ask an
existing group 0 member to add this node to the group, in case the
joining node is not the discovery leader. The new handshake verbs
(JOIN_NODE_REQUEST + JOIN_NODE_RESPONSE) will replace the old RPC. As a
preparation, this commit abstracts away the handshake process.
2023-09-26 15:56:52 +02:00
Piotr Dulikowski
c24daf7e88 storage_service: pass raft_topology_change_enabled on rpc init
We will want to conditionally register some verbs based on whether we
are using raft topology or not. This commit serves as a preparation,
passing `raft_topology_change_enabled` to the function which
initializes the verbs (although the `_raft_topology_change_enabled`
field already exists, it's only initialized on shard 0, and only later).
2023-09-26 15:56:52 +02:00
Piotr Dulikowski
7cbe5e3af8 rpc: add new join handshake verbs
The `join_node_request` and `join_node_response` RPCs are added:

- `join_node_request` is sent from the joining node to any node in the
  cluster. It contains some initial parameters that will be verified by
  the receiving node, or the topology coordinator - notably, it contains
  a list of cluster features supported by the joining node.
- `join_node_response` is sent from the topology coordinator to the
  joining node to tell it about the outcome of the verification.
2023-09-26 15:56:52 +02:00
Piotr Dulikowski
caf1d4938e topology_state_machine: add supported_features to replica_state
The `service::topology_features` struct was introduced in #14955. Its
purpose was to make it possible to load cluster features from
`system.topology` before schema commitlog replay. It contains a map from
host ID to supported feature set for every normal node.

In order not to duplicate the logic for loading features,
the `service::topology`'s `replica_state`s do not hold a set of
supported features; users are supposed to refer to the features
in `topology_features`, which is a field in the `topology` struct.
However, accessing features this way is quite awkward.

This commit adds `supported_features` field back to the `replica_state`
struct and the `load_topology_state` function initializes them properly.
The logic duplication needed to initialize them is quite small and the
drawbacks that come with it are outweighed by the fact that we can now
refer to a node's supported features in a more natural way.

The `topology_features` struct is no longer a field of `topology`, but
it still exists for the purpose of the feature check that happens before
commitlog replay.
2023-09-26 15:56:52 +02:00
Piotr Dulikowski
51b0e4d44f storage_service: check destination host ID in raft verbs
In unlucky but possible circumstances where a node is being replaced
very quickly, RPC requests using raft-related verbs from storage_service
might be sent to it - even before the node starts its group 0 server.
In that case, this triggers on_internal_error.

This commit adds protection to the existing verbs in storage_service:
they check whether group 0 is running and whether the received
host_id matches the actual recipient's host_id.

None of the verbs that are modified are in any existing release, so the
added parameter does not have to be wrapped in rpc::optional.
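The guard can be sketched as follows (Python; the function name and error strings are illustrative of the check, not the real C++):

```python
def check_raft_rpc_destination(group0_running: bool,
                               local_host_id: str,
                               dst_host_id: str) -> None:
    # Reject requests that raced with a node replacement: either group 0
    # isn't up yet on this node, or the verb was addressed to the host ID
    # of the node that previously held this address.
    if not group0_running:
        raise RuntimeError("group 0 server not started yet")
    if dst_host_id != local_host_id:
        raise RuntimeError(
            f"verb addressed to {dst_host_id}, but this node is {local_host_id}")
```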
2023-09-26 15:56:51 +02:00
Piotr Dulikowski
0317705f5a group_state_machine: take reference to raft address map
It will be needed to translate host ids to addresses.
2023-09-26 15:46:25 +02:00
Piotr Dulikowski
193e8eba26 raft_group0: expose joined_group0
It will be needed in the next commit to check whether the group 0 server
has been started.
2023-09-26 15:46:25 +02:00
Pavel Emelyanov
27eaff9d44 api/storage_service: Get gossiper from storage service
Some handlers in set_storage_service() have an implicit dependency on the
gossiper. It's not the API that should track it, but the storage service
itself, so get the gossiper from the service, not from the external
argument (it will be removed soon)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-26 12:20:27 +03:00
Tomasz Grabiec
0f22e8d196 storage_service: Fixed missed notification on tablet metadata update
There can be 2 waiters now (the coordinator and the CDC generation
publisher), so signal() is not enough.

The change made in c416c9ff33 missed
updating this site.
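The failure mode - signal() waking only one of two waiters - can be reproduced with a plain Python condition variable (a toy model; the real code uses Seastar's condition variable in C++):

```python
import threading

def run(wake_all: bool) -> int:
    """Start two waiters, publish an update, wake with notify() or
    notify_all(), and count how many waiters actually observed the wakeup."""
    cond = threading.Condition()
    started = threading.Semaphore(0)
    updated = [False]
    woken = []

    def waiter(i: int) -> None:
        with cond:
            started.release()           # announce "about to wait"
            if cond.wait(timeout=1.0):  # False -> we were never woken
                if updated[0]:
                    woken.append(i)

    threads = [threading.Thread(target=waiter, args=(i,)) for i in (1, 2)]
    for t in threads:
        t.start()
    started.acquire(); started.acquire()  # both waiters are now waiting
    with cond:
        updated[0] = True
        (cond.notify_all if wake_all else cond.notify)()
    for t in threads:
        t.join()
    return len(woken)
```

run(False) leaves one waiter stranded until its timeout - the missed-notification bug; run(True) wakes both, which is what the fix needs.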

Closes scylladb/scylladb#15527
2023-09-26 10:37:57 +02:00
Pavel Emelyanov
0e0f9a57c6 forward_service: Remove .shutdown() method
It's now empty and has no value

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-26 10:39:22 +03:00
Pavel Emelyanov
a251b9893f forward_service: Set _shutdown in abort-source subscription
Currently the bit is set in the .shutdown() method, which is called early on
stop. After the patch the bit is set in the abort-source subscription
callback, which is also called early on stop.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-26 10:38:34 +03:00
Pavel Emelyanov
b18c54f56c forward_service: Add abort_source to constructor
It will be used by the next patch

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-09-26 10:38:26 +03:00
Tomasz Grabiec
19ff4b730f storage_service: Avoid SIGSEGV when tablet cleanup is invoked on non-0 shard
We access group0, which is only set on shard 0.

Closes scylladb/scylladb#15469
2023-09-25 20:59:27 +03:00
Botond Dénes
d62a83683e service/storage_proxy: add trace points for the actual read executor type
There is currently a trace point for when the read executor is created,
but this only contains the initial replica set and doesn't mention which
read executor is created in the end. This patch adds trace points for
each different return path, so it is clear from the trace whether
speculative read can happen or not.
2023-09-22 02:53:15 -04:00
Botond Dénes
d3aabf7896 service/storage_proxy: add trace points for read-repair
Currently the fact that read-repair was triggered can only be inferred
from seeing mutation reads in the trace. This patch adds an explicit
trace point for when read repair is triggered and also when it is
finished or retried.
2023-09-22 02:53:14 -04:00
Tomasz Grabiec
1bcac74976 storage_proxy: Add more trace-level logging to read-repair
Extremely helpful in debugging.
2023-09-22 02:53:14 -04:00
Tomasz Grabiec
17c1cad4b4 database, storage_proxy: Reconcile pages with no live rows incrementally
Currently, a mutation query on the replica side will not respond with a
result which doesn't have at least one live row. This causes problems if
there are a lot of dead rows or partitions before we reach a live row;
they stem from the fact that the resulting reconcilable_result will be
large:

* Large allocations. Serialization of reconcilable_result causes large
  allocations for storing result rows in std::deque
* Reactor stalls. Serialization of reconcilable_result on the replica
  side and on the coordinator side causes reactor stalls. This impacts
  not only the query at hand. For 1M dead rows, freezing takes 130ms,
  unfreezing takes 500ms. Coordinator does multiple freezes and
  unfreezes. The reactor stall on the coordinator side is >5s.
* Large repair mutations. If reconciliation works on large pages, repair
  may fail due to too large mutation size. 1M dead rows is already too
  much: Refs #9111.

This patch fixes all of the above by making mutation reads respect the
memory accounter's limit for the page size, even for dead rows.
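The core of the fix can be sketched as a page-fill loop that charges every fragment, live or dead, against the memory accounter (a Python toy model; all names are illustrative):

```python
def fill_page(rows, page_byte_limit: int):
    """`rows` yields (size_in_bytes, is_live) tuples in clustering order.
    Before the fix, a page was only cut once it held a live row; now the
    accounter's byte limit cuts it even if every row is a tombstone."""
    page, used = [], 0
    for size, is_live in rows:
        page.append((size, is_live))
        used += size
        if used >= page_byte_limit:
            break  # respect the limit; the next page resumes from here
    return page

# A long run of dead rows no longer forces one huge page:
dead_rows = [(100, False)] * 10_000 + [(100, True)]
first_page = fill_page(iter(dead_rows), page_byte_limit=4096)
```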

This patch also addresses the problem of client-side timeouts during
paging. Reconciling queries processing long strings of tombstones will
now properly page tombstones, like regular queries do.

My testing shows that this solution even increases efficiency. I tested
with a cluster of 2 nodes, and a table of RF=2. The data layout was as
follows (1 partition):

    Node1: 1 live row, 1M dead rows
    Node2: 1M dead rows, 1 live row

This was designed to trigger reconciliation right from the very start of
the query.

Before:

Running query (node2, CL=ONE, cold cache)
Query done, duration: 140.0633503ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (node2, CL=ONE, hot cache)
Query done, duration: 66.7195275ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (all-nodes, CL=ALL, reconcile, cold-cache)
Query done, duration: 873.5400742ms, pages: 2, result: [Row(pk=0, ck=0, v=0), Row(pk=0, ck=3000000, v=0)]

After:

Running query (node2, CL=ONE, cold cache)
Query done, duration: 136.9035122ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (node2, CL=ONE, hot cache)
Query done, duration: 69.5286021ms, pages: 101, result: [Row(pk=0, ck=3000000, v=0)]
Running query (all-nodes, CL=ALL, reconcile, cold-cache)
Query done, duration: 162.6239498ms, pages: 100, result: [Row(pk=0, ck=0, v=0), Row(pk=0, ck=3000000, v=0)]

Non-reconciling queries have almost identical durations (a few ms of
variation can be observed between runs). Note how in the after case the
reconciling read also produces 100 pages, vs. just 2 pages in the before
case, leading to a much lower duration (less than 1/4 of the before one).

Refs #7929
Refs #3672
Refs #7933
Fixes #9111
2023-09-22 02:53:14 -04:00
Gleb Natapov
c94a9cf731 storage_service: raft topology: fence off write from old topology coordinator before starting a new one
Make sure that all writes started by the old coordinator are completed or
will eventually fail before starting a new coordinator.
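The idea resembles a standard fencing-token scheme. Below is a toy sketch in Python with hypothetical names (`Replica`, `bump_fence`); the real mechanism in Scylla is implemented via Raft topology state in storage_service, not a class like this. Each write carries the coordinator generation it was started under, and replicas reject writes from an older generation, so after the fence is bumped the new coordinator knows stale writes have either completed or will fail.

```python
# Fencing-token sketch with made-up names; illustrates the concept only.

class Replica:
    def __init__(self):
        self.fence_version = 0
        self.data = {}

    def bump_fence(self, new_version):
        # Called when a new topology coordinator takes over.
        self.fence_version = max(self.fence_version, new_version)

    def write(self, key, value, coordinator_version):
        # Writes started under an old coordinator are rejected.
        if coordinator_version < self.fence_version:
            raise RuntimeError("stale coordinator, write fenced off")
        self.data[key] = value

r = Replica()
r.write("k", 1, coordinator_version=0)  # old coordinator's write lands
r.bump_fence(1)                         # new coordinator fences off gen 0
try:
    r.write("k", 2, coordinator_version=0)
    fenced = False
except RuntimeError:
    fenced = True
```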

Message-ID: <ZQv+OCrHl+KyAnvv@scylladb.com>
2023-09-21 17:26:45 +02:00
Pavel Emelyanov
6e972f8505 repair: Shutdown repair on nodetool drain too
Currently repair shutdown only happens on stop, but nodetool drain can
call shutdown too, to abort repair tasks that are no longer relevant, if
any. This also makes main()'s deferred shutdown/stop paths a little
cleaner.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#15438
2023-09-21 16:58:23 +03:00
Tomasz Grabiec
3d4398d1b2 Merge 'Don't calculate hashes for schema versions in Raft mode' from Kamil Braun
When performing a schema change through group 0, extend the schema mutations with a version that's persisted and then used by the nodes in the cluster in place of the old schema digest, which becomes horribly slow as we perform more and more schema changes (#7620).

If the change is a table create or alter, also extend the mutations with a version for this table to be used for `schema::version()`s instead of having each node calculate a hash which is susceptible to bugs (#13957).

When performing a schema change in Raft RECOVERY mode we also extend schema mutations, which forces nodes to revert to the old way of calculating schema versions when necessary.

We can only introduce these extensions if all of the cluster understands them, so protect this code by a new cluster/schema feature, `GROUP0_SCHEMA_VERSIONING`.
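The version-selection logic described above can be sketched in a few lines of Python (hypothetical function and parameter names; the real code lives in schema_tables and migration_manager, and the fallback digest is not literally an MD5 of one buffer): prefer the version persisted through group 0 when the feature is enabled, otherwise fall back to hashing the schema mutations, i.e. the old, slow path.

```python
import hashlib

# Illustrative sketch only; names and the digest choice are assumptions.

def schema_version(mutations_bytes, persisted_version, feature_enabled):
    """Prefer the version persisted through group 0; fall back to
    computing a digest of the schema mutations (the old, slow path)."""
    if feature_enabled and persisted_version is not None:
        return persisted_version
    return hashlib.md5(mutations_bytes).hexdigest()
```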

Fixes: #7620
Fixes: #13957

Closes scylladb/scylladb#15331

* github.com:scylladb/scylladb:
  test: add test for group 0 schema versioning
  test/pylib: log_browsing: fix type hint
  feature_service: enable `GROUP0_SCHEMA_VERSIONING` in Raft mode
  schema_tables: don't delete `version` cell from `scylla_tables` mutations from group 0
  migration_manager: add `committed_by_group0` flag to `system.scylla_tables` mutations
  schema_tables: use schema version from group 0 if present
  migration_manager: store `group0_schema_version` in `scylla_local` during schema changes
  migration_manager: migration_request handler: assume `canonical_mutation` support
  system_keyspace: make `get/set_scylla_local_param` public
  feature_service: add `GROUP0_SCHEMA_VERSIONING` feature
  schema_tables: refactor `scylla_tables(schema_features)`
  migration_manager: add `std::move` to avoid a copy
  schema_tables: remove default value for `reload` in `merge_schema`
  schema_tables: pass `reload` flag when calling `merge_schema` cross-shard
  system_keyspace: fix outdated comment
2023-09-20 10:43:40 +02:00
Botond Dénes
844a0e426f Merge 'Mark counters with skip when empty' from Amnon Heiman
This series marks multiple high-cardinality counters with the
skip_when_empty flag. After this patch, the following counters will not
be reported if they were never used:
```
scylla_transport_cql_errors_total
scylla_storage_proxy_coordinator_reads_local_node
scylla_storage_proxy_coordinator_completed_reads_local_node
```
Also marked are the CAS-related CQL operation counters.
Fixes #12751
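The effect of the flag can be shown with a toy metrics registry (plain Python with made-up names; this is not the Seastar metrics API, which is where the real skip_when_empty flag lives): a counter marked this way is simply omitted from the exported report for as long as its value is still zero.

```python
# Toy registry illustrating the skip-when-empty idea; names are made up.

class Counter:
    def __init__(self, name, skip_when_empty=False):
        self.name = name
        self.value = 0
        self.skip_when_empty = skip_when_empty

    def inc(self, n=1):
        self.value += n

def report(counters):
    """Export only counters that are non-zero or not marked
    skip-when-empty, cutting cardinality for never-used counters."""
    return {c.name: c.value
            for c in counters
            if c.value != 0 or not c.skip_when_empty}

errors = Counter("cql_errors_total", skip_when_empty=True)
reads = Counter("reads_total")
out = report([errors, reads])    # errors never used -> omitted
errors.inc()
out2 = report([errors, reads])   # now it is reported
```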

Closes scylladb/scylladb#13558

* github.com:scylladb/scylladb:
  service/storage_proxy.cc: mark counters with skip_when_empty
  cql3/query_processor.cc: mark cas related metrics with skip_when_empty
  transport/server.cc: mark metric counter with skip_when_empty
2023-09-19 15:02:39 +03:00
Benny Halevy
e784930dd7 storage_service: fix comment about when group0 is set
Since 8598cebb11
it is set earlier, before join_cluster.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-ID: <20230919063951.1424924-1-bhalevy@scylladb.com>
2023-09-19 13:20:58 +03:00
Kefu Chai
9de00c1c5a build: cmake: add node_ops
node_ops source files were extracted into the node_ops/ directory in
d0d0ad7aa4, so let's update the build system accordingly.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#15442
2023-09-18 16:27:02 +03:00