scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-09 08:23:29 +00:00

Author	SHA1	Message	Date
Piotr Jastrzebski	831a60a6cd	priority_manager: Fix warnings about deprecated register_one_priority_class usage This patch fixes following warnings: service/priority_manager.cc:30:36: warning: 'register_one_priority_class' is deprecated: Use io_priority_class::register_one [-Wdeprecated-declarations] : _commitlog_priority(engine().register_one_priority_class("commitlog", 1000)) service/priority_manager.cc:31:35: warning: 'register_one_priority_class' is deprecated: Use io_priority_class::register_one [-Wdeprecated-declarations] , _mt_flush_priority(engine().register_one_priority_class("memtable_flush", 1000)) service/priority_manager.cc:32:36: warning: 'register_one_priority_class' is deprecated: Use io_priority_class::register_one [-Wdeprecated-declarations] , _streaming_priority(engine().register_one_priority_class("streaming", 200)) service/priority_manager.cc:33:36: warning: 'register_one_priority_class' is deprecated: Use io_priority_class::register_one [-Wdeprecated-declarations] , _sstable_query_read(engine().register_one_priority_class("query", 1000)) service/priority_manager.cc:34:37: warning: 'register_one_priority_class' is deprecated: Use io_priority_class::register_one [-Wdeprecated-declarations] , _compaction_priority(engine().register_one_priority_class("compaction", 1000)) Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2021-06-14 08:49:46 +02:00
Kamil Braun	9e85921006	storage_proxy: remove a feedback loop from the speculative retry latency metric To handle a read request from a client, the coordinator node must send data and digest requests to replicas, reconcile the obtained results (by merging the obtained mutations and comparing digests), and possibly send more requests to replicas if the digests turned out to be different in order to perform read repair and preserve consistency of observed reads. In contrast to writes, where coordinators send their mutation write requests to all replicas in the replica set, for reads the coordinators send their requests only to as many replicas as is required to achieve the desired CL. For example, consider RF=3 and a CL=QUORUM read. Then the coordinator sends its request to a subset of 2 nodes out of the 3 possible replicas. The choice of the 2-node subset is random; the distribution used for the random roll is affected by certain things such as the "cache hitrate" metric. The details are not that relevant for this discussion. If not all of the initially chosen replicas answer within a certain time period, the coordinator may send an additional request to one more replica, hoping that this replica helps achieve the desired CL so the entire client request succeeds. This mechanism is called "speculative retry" and is enabled by default. This time period - call it `T` - is chosen based on keyspace configuration. The default value is "99.0PERCENTILE", which means that `T` is roughly equal to the 99th percentile of the latency distribution of previous requests (or at least the most recent requests; the algorithm uses an exponential decay strategy to make old requests less relevant for the metric). 
The latencies used are the durations of whole coordinator read requests: each such duration measurement starts before the first replica request is sent and ends after the last replica request is answered, among the replica requests whose results were used for the reconciled result returned to the client (there may be more requests sent later "in the background" - they don't affect the client result and are not taken into account for the latency measurement). This strategy, however, gives an undesired effect which appears when a significant part of all requests require a speculative retry to succeed. To explain this effect it's best to consider a scenario which takes this to the extreme - where all requests require a speculative retry. Consider RF=3 and CL=QUORUM so each read request initially uses 2 replicas. Let {A, B, C} be the set of replicas. We run a uniformly distributed read workload. Initially the cluster operates normally. Roughly 1/3 of all requests go to replicas {A, B}, 1/3 go to {A, C}, and 1/3 go to {B, C}. The 99th percentile of read request latencies is 50ms. Suppose that the average round-trip latency between a coordinator and any replica is 10ms. Suddenly replica C is hard-killed: non-graceful shutdown, e.g. power outage. This means that other nodes are initially not aware that C is down, they must wait for the failure detector to convict C as unavailable which happens after a configurable amount of time. The current default is 20s, meaning that by default coordinators will still attempt to send requests to C for 20s after it is hard-killed. During this period the following happens: - About 2/3 of all requests - the ones which were routed to {A, C} and {B, C} - do not finish within 50ms because C does not answer. For these requests to finish, the coordinator performs a speculative retry to the third replica which finishes after ~10ms (the average round-trip latency). Thus the entire request, from the coordinator's POV, takes ~60ms. 
- Eventually (very quickly in fact - assuming there are many concurrent requests) the P99 latency rises to 60ms. - Furthermore, the requests which initially use {A, C} and {B, C} start taking more than 2/3 of all requests because they are stuck in the foreground longer than the {A, B} requests (since their latencies are higher). - These requests do not finish within 60ms. Thus coordinators perform speculative retries. Thus they finish after ~70ms. - Eventually the P99 latency rises to 70ms. - These bad requests take an even longer portion of all requests. - These requests do not finish within 70ms. They finish after ~80ms. - Eventually the P99 latency rises to 80ms. - And so on. In metrics, we observe the following: - Latencies rise roughly linearly. They rise until they hit a certain limit; this limit comes from the fact that `T` is upper-bounded by the read request timeout parameter divided by 2. Thus if the read request timeout is `5s` and P99 latencies are `3s`, `T` will be `2.5s`, not `3s`. Thus eventually all requests will take about `2.5s + 10ms` to finish (`2.5s` until speculative retry happens, `10ms` for the last round-trip), unless the node is marked as DOWN before we reach that limit. - Throughput decreases roughly proportionally to the y = 1/x function, as expected from Little's law. Everything goes back to normal when nodes mark C as DOWN, which happens after ~20s by default as explained above. Then coordinators start routing all requests to {A, B} only. This does not happen for graceful shutdowns, where C announces to the cluster that it's shutting down before shutting down, causing other nodes to mark it as DOWN almost immediately. The root cause of the issue is a feedback loop in the metric used to calculate `T`: we perform a speculative retry after `T` -> P99 request latencies rise above `T + 10ms` -> `T` rises above `T + 10ms` -> etc. We fix the problem by changing the measurements used for calculating `T`. 
Instead of measuring the entire coordinator read latency, we measure each replica request separately and take the maximum over these measurements. We only take into account the measurements for requests that actually contributed to the request's result. The previous statistic would also measure failed requests latencies. Now we measure only latencies of successful replica requests. Indeed this makes sense for the speculative retry use case; the idea behind speculative retry is that we assume that requests usually succeed within a certain time period, and we should perform the retry if they take longer than that. To measure this time period, taking failed requests into account doesn't make much sense. In the scenario above, for a request that initially goes to {A, C}, the following would happen after applying the fix: - We send the requests to A and C. - After ~10ms A responds. We record the ~10ms measurement. - After ~50ms we perform speculative retry, sending a request to B. - After ~10ms B responds. We record the ~10ms measurement. The maximum over recorded measurements is ~10ms, not ~60ms. The feedback loop is removed. Experiments show that the solution is effective: in scenarios like above, after C is killed, latencies only rise slightly by a constant amount and then maintain their level, as expected. Throughput also drops by a constant amount and maintains its level instead of continuously dropping with an asymptote at 0. Fixes #3746. Fixes #7342. Closes #8783	2021-06-13 16:19:11 +03:00
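The feedback loop described in the commit above can be illustrated with a toy model. The sketch below is not Scylla code: the function names, the per-round update rule, and the `replica_p99_ms` parameter are assumptions made for illustration. Only the 50ms / 10ms / 5s figures and the cap of `T` at half the read request timeout come from the commit message.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy model, not Scylla code. Each "round" stands for the period it takes
// the P99 estimate to catch up with the latencies currently being observed.

// Old metric: T is derived from whole-coordinator read latencies. A read
// that needs a speculative retry takes about T + rtt, so the next T becomes
// T + rtt, capped at half the read request timeout.
std::vector<double> simulate_old_metric(double initial_t_ms, double rtt_ms,
                                        double read_timeout_ms, int rounds) {
    std::vector<double> history;
    double t = initial_t_ms;
    for (int i = 0; i < rounds; ++i) {
        history.push_back(t);
        const double observed_p99 = t + rtt_ms;  // wait T, then one round trip
        t = std::min(observed_p99, read_timeout_ms / 2);
    }
    return history;
}

// New metric: T is derived from the maximum latency among the successful
// replica requests of each read, which does not depend on T itself.
// replica_p99_ms is an assumed input describing that distribution.
std::vector<double> simulate_new_metric(double replica_p99_ms,
                                        double read_timeout_ms, int rounds) {
    std::vector<double> history;
    for (int i = 0; i < rounds; ++i) {
        history.push_back(std::min(replica_p99_ms, read_timeout_ms / 2));
    }
    return history;
}
```

With the figures from the commit message (`T` = 50ms, 10ms round trip, 5s timeout), the old model yields 50, 60, 70, 80, ... and flattens at 2500ms, while the new model stays constant because its input no longer depends on `T`.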
Tomasz Grabiec	ce7a404f17	Merge "Cleanups/refactoring for Raft Group 0" from Kostja * scylla-dev/raft-group-0-part-1-rebase: raft: (service) pass Raft service into storage_service raft: (service) add comments for boot steps raft: add ordering for raft::server_address based on id raft: (internal) simplify construction of tagged_id raft: (internal) tagged_id minor improvements	2021-06-09 10:48:05 +02:00
Avi Kivity	d2157dfea7	Merge 'locator: token_metadata: simplify `tokens_iterator`' from Michał Chojnowski `ring_range()`/`tokens_iterator` are more complicated than they need to be. The `include_min` parameter is not used anywhere, and `tokens_iterator` is pimplified without a good reason. Simplify that. Closes #8805 * github.com:scylladb/scylla: locator: token_metadata: depimplify tokens_iterator locator: token_metadata: remove _ring_pos from tokens_iterator_impl locator: token_metadata: remove tokens_end() locator: token_metadata: remove `include_min` from tokens_iterator_impl locator: token_metadata: remove the `include_min` parameter from `ring_range()`	2021-06-08 15:42:41 +03:00
Konstantin Osipov	267a8e99ad	raft: (service) pass Raft service into storage_service Raft group 0 initialization and configuration changes should be integrated with Scylla cluster assembly, happening when starting the storage service and joining the cluster. Prepare for this. Since the Raft service depends on the query processor, and the query processor depends on the storage service, break the dependency loop by splitting Raft initialization into two steps: first, start an under-constructed instance of the "sharded" Raft service, which accepts an under-constructed instance of the "sharded" query_processor and is then passed into the storage service start function; second, load the local state of Raft groups from system tables once the query processor starts. Consistently abbreviate the raft_services instance as raft_svcs, as is the convention at Scylla. Update the tests.	2021-06-08 14:52:32 +03:00
Konstantin Osipov	959bd21cdb	raft: (service) add comments for boot steps	2021-06-08 14:52:32 +03:00
Konstantin Osipov	d42d5aee8c	raft: (internal) simplify construction of tagged_id Make it easy to construct tagged_id from UUID.	2021-06-08 14:52:32 +03:00
Konstantin Osipov	c9a23e9b8a	raft: (internal) tagged_id minor improvements Introduce a syntax helper tagged_id::create_random_id(), used to create a new Raft server or group id. Provide a default ordering for tagged ids, for use in Raft leader discovery, which selects the smallest id for leader.	2021-06-08 14:52:32 +03:00
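A minimal sketch of the idea behind the `tagged_id` changes above: a tag-parameterized id type with a default ordering, a random-id helper, and leader selection by smallest id. This is an illustration only. The real type wraps a UUID; here `std::uint64_t` stands in for it, and `server_tag` and `pick_smallest_id` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>

// Illustrative stand-in: a 64-bit integer replaces the UUID payload.
template <typename Tag>
struct tagged_id {
    std::uint64_t id;

    explicit tagged_id(std::uint64_t v = 0) : id(v) {}

    // Syntax helper for creating a fresh server or group id.
    static tagged_id create_random_id() {
        static std::mt19937_64 gen{std::random_device{}()};
        return tagged_id(gen());
    }

    friend bool operator==(const tagged_id& a, const tagged_id& b) {
        return a.id == b.id;
    }
    // Default ordering, so that ids of the same tag can be compared.
    friend bool operator<(const tagged_id& a, const tagged_id& b) {
        return a.id < b.id;
    }
};

struct server_tag {};  // hypothetical tag type
using server_id = tagged_id<server_tag>;

// Leader discovery picks the smallest id. Precondition: peers is not empty.
server_id pick_smallest_id(const std::vector<server_id>& peers) {
    return *std::min_element(peers.begin(), peers.end());
}
```

Because the tag is part of the type, a server id and a group id cannot be compared or mixed by accident even though both hold the same kind of payload.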
Kamil Braun	3778a816c1	storage_proxy: abstract_read_executor: make certain methods private The methods `make_mutation_data_request`, `make_data_request` and `make_digest_request` were marked as protected, but weren't used by deriving classes. The "API" for deriving classes is encapsulated through plural versions of these functions, such as `make_mutation_data_requests` (note the "s" at the end), which send a request to a set of replicas (rather than a single replica) but also do other important things - like gathering statistics - hence we don't want the deriving classes to use them directly. Marking these singular methods as private communicates the intent more clearly.	2021-06-08 12:32:47 +03:00
Pavel Solodovnikov	76bea23174	treewide: reduce header interdependencies Use forward declarations wherever possible. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com> Closes #8813	2021-06-07 15:58:35 +03:00
Avi Kivity	a55b434a2b	treewide: extend copyright statements to present day	2021-06-06 19:18:49 +03:00
Pavel Solodovnikov	2187a59089	treewide: move `service::cas_request` out from `storage_proxy.hh` And remove all remaining inclusions of `storage_proxy.hh` in the headers. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-06-06 19:18:49 +03:00
Pavel Solodovnikov	e0749d6264	treewide: some random header cleanups Eliminate not used includes and replace some more includes with forward declarations where appropriate. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-06-06 19:18:49 +03:00
Gleb Natapov	bb822c92ab	raft: change raft::rpc api to return void for most sending functions Most RAFT packets are sent very rarely during special phases of the protocol (like election or leader stepdown). The protocol itself does not care if a packet is sent or dropped, so returning futures from their send function does not serve any purpose. Change the raft's rpc interface to return void for all packet types but append_request. We still want to get a future from sending append_request for backpressure purposes since replication protocol is more efficient if there is no packet loss, so it is better to pause a sender than dropping packets inside the rpc. Rpc is still allowed to drop append_requests if overloaded.	2021-06-06 19:18:49 +03:00
Michał Chojnowski	2a3bd2babe	locator: token_metadata: remove the `include_min` parameter from `ring_range()` `include_min` is always set to the default value. Remove it.	2021-06-05 17:40:35 +02:00
Piotr Sarna	389a0a52c9	treewide: revamp workload type for service levels This patch is not backward compatible with its original, but it's considered fine, since the original workload types were not yet part of any release. The changes include: - instead of using 'unspecified' for declaring that there's no workload type for a particular service level, NULL is used for that purpose; NULL is the standard way of representing lack of data - introducing a delete marker, which accompanies NULL and makes it possible to distinguish between wanting to forcibly reset a workload type to unspecified and not wanting to change the previous value - updating the tests accordingly These changes come in as a single patch, because they're intertwined with each other and the tests for workload types are already in place; an attempt to split them proved to be more complicated than it's worth. Tests: unit(release) Closes #8763	2021-05-31 18:18:33 +03:00
Avi Kivity	d23bebf5c2	Merge "Unexport storage service dependencies" from Pavel E " Right now storage service is used as a "provider" of other services -- database, feature service and tokens. This set unexports the first pair. This drops a bunch of calls for global storage service instances from the places that don't really need it. tests: unit(dev), start-stop " * 'br-pupate-storage-service' of https://github.com/xemul/scylla: storage-service: Don't export features api: Get features from proxy storage-service: Don't export database storage-service: Turn some global helpers into methods storage-service: Open-code simple config getters view: Get database from stprage_proxy main: Use local database instance api: Use database from http_ctx	2021-05-29 20:52:47 +03:00
Pavel Emelyanov	598bbfab15	storage-service: Don't export features Now storage service uses the feature service instance internally and doesn't need to provide getter for it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-05-28 18:16:12 +03:00
Pavel Emelyanov	651568318d	api: Get features from proxy The reset_local_schema call needs proxy and feature service to do its job. Right now the features are retrieved from the global storage service, but they are present on the proxy as well. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-05-28 18:15:15 +03:00
Pavel Emelyanov	b990b764ca	storage-service: Don't export database Now storage service uses the database instance internally and doesn't need to provide getter for it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-05-28 18:13:27 +03:00
Pavel Emelyanov	0651038f29	storage-service: Turn some global helpers into methods There are two static helpers used by storage service that grab global storage service. To simplify these two turn both into storage service methods and use 'this' inside. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-05-28 18:12:25 +03:00
Pavel Emelyanov	5ae8accfed	storage-service: Open-code simple config getters There are two db::config getters in storage_service.cc that are used only once. Both call for global storage service, but since they are called from storage service it's simpler to break this loop and make storage service get needed config options directly. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-05-28 18:11:24 +03:00
Asias He	e86d39faf0	storage_service: Update peer table only if the peer is part of the ring Consider the following procedure: - n1, n2, n3 - n3 is network partitioned from the cluster - n4 replaces n3 - n3 has the network partition fixed - n1 learns that n3 has NORMAL status and calls storage_service::handle_state_normal, which in turn calls update_peer_info; all columns except the tokens column in system.peers are written - n1 restarts before it can figure out that n4 is the new owner and delete the entry for n3 in system.peers - n3 is removed from gossip by all the nodes in the cluster automatically because they detect the collision and remove n3 - n1 restarts, leaving the entry in system.peers for n3 forever To fix, we can update peer tables only if the node is part of the ring. Fixes #8729 Closes #8742	2021-05-28 15:03:26 +02:00
Avi Kivity	5f8484897b	Merge 'cdc: use a new internal table for exchanging generations' from Kamil Braun Reopening #8286 since the token metadata fix that allows `Everywhere` strategy tables to work with RBO (#8536) has been merged. --- Currently when a node wants to create and broadcast a new CDC generation it performs the following steps: 1. choose the generation's stream IDs and mapping (how this is done is irrelevant for the current discussion) 2. choose the generation's timestamp by taking the current time (according to its local clock) and adding 2 * ring_delay 3. insert the generation's data (mapping and stream IDs) into system_distributed.cdc_generation_descriptions, using the generation's timestamp as the partition key (we call this table the "old internal table" below) 4. insert the generation's timestamp into the "CDC_STREAMS_TIMESTAMP" application state. The timestamp spreads epidemically through the gossip protocol. When nodes see the timestamp, they retrieve the generation data from the old internal table. Unfortunately, due to the schema of the old internal table, where the entire generation data is stored in a single cell, step 3 may fail for sufficiently large generations (there is a size threshold for which step 3 will always fail - retrying the operation won't help). Also the old internal table lies in the system_distributed keyspace that uses SimpleStrategy with replication factor 3, which is also problematic; for example, when nodes restart, they must reach at least 2 out of these 3 specific replicas in order to retrieve the current generation (we write and read the generation data with QUORUM, unless we're a single-node cluster, where we use ONE). Until this happens, a restarting node can't coordinate writes to CDC-enabled tables. It would be better if the node could access the last known generation locally. 
The commit introduces a new table for broadcasting generation data with the following properties: - it uses a better schema that stores the data in multiple rows, each of manageable size - it resides in a new keyspace that uses EverywhereStrategy so the data will be written to every node in the cluster that has a token in the token ring - the data will be written using CL=ALL and read using CL=ONE; thanks to this, restarting node won't have to communicate with other nodes to retrieve the data of the last known generation. Note that writing with CL=ALL does not reduce availability: creating a new generation requires all nodes to be available anyway, because they must learn about the generation before their clocks go past the generation's timestamp; if they don't, partitions won't be mapped to stream IDs consistently across the cluster - the partition key is no longer the generation's timestamp. Because it was that way in the old internal table, it forced the algorithm to choose the timestamp before the generation data was inserted into the table. What if the inserting took a long time? It increased the chance that nodes would learn about the generation too late (after their clocks moved past its timestamp). With the new schema we will first insert the generation data using a randomly generated UUID as the partition key, then choose the timestamp, then gossip both the timestamp and the UUID. Observe that after a node learns about a generation broadcasted using this new method through gossip it will retrieve its data very quickly since it's one of the replicas and it can use CL=ONE as it was written using CL=ALL. The generation's timestamp and the UUID mentioned in the last point form a "generation identifier" for this new generation. For passing these new identifiers around, we introduce the cdc::generation_id_v2 type. Fixes #7961. 
--- For optimal review experience it is best to first read the updated design notes (you can read them rendered here: https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md), specifically the ["Generation switching"](https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md#generation-switching) section followed by the ["Internal generation descriptions table V1 and upgrade procedure"](https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md#internal-generation-descriptions-table-v1-and-upgrade-procedure) section, then read the commits in topological order. dtest gating run (dev): https://jenkins.scylladb.com/job/scylla-master/job/byo/job/byo_build_tests_dtest/1160/ unit tests (dev) passed locally Closes #8643 * github.com:scylladb/scylla: docs: update cdc.md with info about the new internal table sys_dist_ks: don't create old CDC generations table on service initialization sys_dist_ks: rename all_tables() to ensured_tables() cdc: when creating new generations, use format v2 if possible main: pass feature_service to cdc::generation_service gms: introduce CDC_GENERATIONS_V2 feature cdc: introduce retrieve_generation_data test: cdc: include new generations table in permissions test sys_dist_ks: increase timeout for create_cdc_desc sys_dist_ks: new table for exchanging CDC generations tree-wide: introduce cdc::generation_id_v2	2021-05-27 17:13:44 +03:00
Avi Kivity	e8e4456ec7	Merge 'Introduce per-service-level workload types and their first use-case - shedding in interactive workloads' from Piotr Sarna This draft extends and obsoletes #8123 by introducing a way of determining the workload type from service level parameters, and then using this context to qualify requests for shedding. The rough idea is that when the admission queue in the CQL server is hit, it might make more sense to start shedding surplus requests instead of accumulating them on the semaphore. The assumption is that interactive workloads are more interested in the success rate of as many requests as possible, and hanging on a semaphore reduces the chances for a request to succeed. Thus, it may make sense to shed some requests to reduce the load on this coordinator and let the existing requests finish. It's a draft, because I only performed local guided tests. #8123 was followed by some experiments on a multinode cluster which I want to rerun first. Closes #8680 * github.com:scylladb/scylla: test: add a case for conflicting workload types cql-pytest: add basic tests for service level workload types docs: describe workload types for service levels sys_dist_ks: fix redundant parsing in get_service_level sys_dist_ks: make get_service_level exception-safe transport: start shedding requests during potential overload client_state: hook workload type from service levels cql3: add listing service level workload type cql3: add persisting service level workload type qos: add workload_type service level parameter	2021-05-27 17:01:56 +03:00
Piotr Sarna	409c67b1b4	client_state: hook workload type from service levels The client state is now aware of its workload type derived from its attached service level.	2021-05-27 13:02:22 +02:00
Piotr Sarna	578543603d	qos: add workload_type service level parameter The workload type is currently one of three values: - unspecified - interactive - batch By defining the workload type, the service level makes it easier for other components to decide what to do in overload scenarios. E.g. if the workload is interactive, requests can be shed earlier, while if it's batched (or unspecified), shedding does not take place. Conversely, batch workloads could accept long full scan operations.	2021-05-27 13:02:22 +02:00
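The three workload types and the shedding rule described above could be sketched as follows. The enum values come from the commit message; the `should_shed_request` helper and its `admission_queue_full` parameter are hypothetical and only illustrate the stated rule (shed for interactive workloads, never for batch or unspecified ones).

```cpp
#include <cassert>

// The three workload types named in the commit message.
enum class workload_type { unspecified, interactive, batch };

// Hypothetical decision helper: when the admission queue is full, only
// interactive workloads have their surplus requests shed; batch and
// unspecified workloads keep waiting instead.
bool should_shed_request(workload_type type, bool admission_queue_full) {
    return admission_queue_full && type == workload_type::interactive;
}
```

Keeping the decision in one small predicate means other overload policies could later branch on the same workload type without touching the service level definition.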
Dejan Mircevski	b54872fd95	auth: Remove `const` from role_manager methods Some subclasses want to maintain state, which constness needlessly precludes. Tests: unit (dev) Signed-off-by: Dejan Mircevski <dejan@scylladb.com> Closes #8721	2021-05-27 11:27:38 +03:00
Asias He	72cc596842	repair: Wire off-strategy compaction for regular repair We have enabled off-strategy compaction for bootstrap, replace, decommission and removenode operations when repair based node operation is enabled. Unlike node operations like replace or decommission, it is harder to know when the repair of a table is finished because users can send multiple repair requests one after another, each request repairing a few token ranges. This patch wires off-strategy compaction for regular repair by adding a timeout based automatic off-strategy compaction trigger mechanism. If there is no repair activity for some time, off-strategy compaction will be triggered for that table automatically. Fixes #8677 Closes #8678	2021-05-26 11:41:27 +03:00
Avi Kivity	3896e35897	Merge 'storage_service: Respect --enable-repair-based-node-ops flag during removenode' from Asias He In commit `829b4c1` (repair: Make removenode safe by default), removenode was changed to use repair based node operations unconditionally. Since repair based node operations are not enabled by default, we should respect the flag and use streaming to sync data if the flag is false. Fixes #8700 Closes #8701 * github.com:scylladb/scylla: storage_service: Add removenode_add_ranges helper storage_service: Respect --enable-repair-based-node-ops flag during removenode	2021-05-26 10:32:56 +03:00
Kamil Braun	337a4ef8ad	cdc: when creating new generations, use format v2 if possible A node with this commit, when creating a new CDC generation (during bootstrap, upgrade, or when running checkAndRepairCdcStreams command) will check for the CDC_GENERATIONS_V2 feature and: - If the feature is enabled create the generation in the v2 format and insert it into the new internal table. This is safe because a node joins the feature only if it understands the new format. - Otherwise create it in the v1 format, limiting its size as before, and insert it into the old table. The second case should only happen if we perform bootstrap or run checkAndRepairCdcStreams in the middle of an upgrade procedure. On fully upgraded clusters the feature shall be enabled, causing all new generations to use the new format.	2021-05-25 16:07:23 +02:00
Kamil Braun	3155cde9c8	sys_dist_ks: new table for exchanging CDC generations Currently when a node wants to create and broadcast a new CDC generation it performs the following steps: 1. choose the generation's stream IDs and mapping (how this is done is irrelevant for the current discussion) 2. choose the generation's timestamp by taking the current time (according to its local clock) and adding 2 * ring_delay 3. insert the generation's data (mapping and stream IDs) into system_distributed.cdc_generation_descriptions, using the generation's timestamp as the partition key (we call this table the "old internal table" below) 4. insert the generation's timestamp into the "CDC_STREAMS_TIMESTAMP" application state. The timestamp spreads epidemically through the gossip protocol. When nodes see the timestamp, they retrieve the generation data from the old internal table. Unfortunately, due to the schema of the old internal table, where the entire generation data is stored in a single cell, step 3 may fail for sufficiently large generations (there is a size threshold for which step 3 will always fail - retrying the operation won't help). Also the old internal table lies in the system_distributed keyspace that uses SimpleStrategy with replication factor 3, which is also problematic; for example, when nodes restart, they must reach at least 2 out of these 3 specific replicas in order to retrieve the current generation (we write and read the generation data with QUORUM, unless we're a single-node cluster, where we use ONE). Until this happens, a restarting node can't coordinate writes to CDC-enabled tables. It would be better if the node could access the last known generation locally. 
The commit introduces a new table for broadcasting generation data with the following properties: - it uses a better schema that stores the data in multiple rows, each of manageable size - it resides in the `system_distributed_everywhere` keyspace so the data will be written to every node in the cluster that has a token in the token ring - the data will be written using CL=ALL and read using CL=ONE; thanks to this, restarting node won't have to communicate with other nodes to retrieve the data of the last known generation. Note that writing with CL=ALL does not reduce availability: creating a new generation requires all nodes to be available anyway, because they must learn about the generation before their clocks go past the generation's timestamp; if they don't, partitions won't be mapped to stream IDs consistently across the cluster - the partition key is no longer the generation's timestamp. Because it was that way in the old internal table, it forced the algorithm to choose the timestamp before the generation data was inserted into the table. What if the inserting took a long time? It increased the chance that nodes would learn about the generation too late (after their clocks moved past its timestamp). With the new schema we will first insert the generation data using a randomly generated UUID as the partition key, then choose the timestamp, then gossip both the timestamp and the UUID. The timestamp and the UUID form the "generation identifier" of this new generation; this should explain why we introduced the generation_id_v2 type in previous commits. Observe that after a node learns about a generation broadcasted using this new method through gossip it will retrieve its data very quickly since it's one of the replicas and it can use CL=ONE as it was written using CL=ALL. Note that the node is still using the old method - the actual switch will be done in a later commit.	2021-05-25 16:07:23 +02:00
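The reordering described above (insert the generation data first, choose the timestamp afterwards) can be sketched with an injected clock. This is an illustration under stated assumptions, not the actual implementation: `create_generation_v2`, its callback parameters, and the use of `std::uint64_t` in place of a UUID are all hypothetical; only the "timestamp = now + 2 * ring_delay" rule and the (timestamp, UUID) pair come from the commit message.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>

using millis = std::chrono::milliseconds;

// Hypothetical stand-in for cdc::generation_id_v2: a timestamp plus a
// random identifier (a 64-bit integer replaces the UUID here).
struct generation_id_v2 {
    millis timestamp;
    std::uint64_t id;
};

// v2 ordering: the data is inserted under the random id first, and the
// timestamp is chosen only after the (possibly slow) insert has finished.
generation_id_v2 create_generation_v2(
        std::uint64_t random_id,
        const std::function<void(std::uint64_t)>& insert_generation_data,
        const std::function<millis()>& now,
        millis ring_delay) {
    insert_generation_data(random_id);
    const millis timestamp = now() + 2 * ring_delay;
    return generation_id_v2{timestamp, random_id};
}
```

With a fake clock that jumps during the insert, the returned timestamp is computed from the post-insert time, so a slow insert no longer eats into the window nodes have to learn about the generation.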
Asias He	70147dcb5a	storage_service: Add removenode_add_ranges helper Share the code between restore_replica_count and removenode_with_stream to reduce duplication. Refs #8700	2021-05-25 10:44:31 +08:00
Asias He	a285bd28e2	storage_service: Respect --enable-repair-based-node-ops flag during removenode In commit `829b4c1` (repair: Make removenode safe by default), removenode was changed to use repair based node operations unconditionally. Since repair based node operations are not enabled by default, we should respect the flag and use streaming to sync data if the flag is false. Fixes #8700	2021-05-25 10:42:58 +08:00
Avi Kivity	50f3bbc359	Merge "treewide: various header cleanups" from Pavel S " The patch set is an assorted collection of header cleanups, e.g: * Reduce number of boost includes in header files * Switch to forward declarations in some places A quick measurement was performed to see if these changes provide any improvement in build times (ccache cleaned and existing build products wiped out). The results are posted below (`/usr/bin/time -v ninja dev-build`) for 24 cores/48 threads CPU setup (AMD Threadripper 2970WX). Before: Command being timed: "ninja dev-build" User time (seconds): 28262.47 System time (seconds): 824.85 Percent of CPU this job got: 3979% Elapsed (wall clock) time (h:mm:ss or m:ss): 12:10.97 Average shared text size (kbytes): 0 Average unshared data size (kbytes): 0 Average stack size (kbytes): 0 Average total size (kbytes): 0 Maximum resident set size (kbytes): 2129888 Average resident set size (kbytes): 0 Major (requiring I/O) page faults: 1402838 Minor (reclaiming a frame) page faults: 124265412 Voluntary context switches: 1879279 Involuntary context switches: 1159999 Swaps: 0 File system inputs: 0 File system outputs: 11806272 Socket messages sent: 0 Socket messages received: 0 Signals delivered: 0 Page size (bytes): 4096 Exit status: 0 After: Command being timed: "ninja dev-build" User time (seconds): 26270.81 System time (seconds): 767.01 Percent of CPU this job got: 3905% Elapsed (wall clock) time (h:mm:ss or m:ss): 11:32.36 Average shared text size (kbytes): 0 Average unshared data size (kbytes): 0 Average stack size (kbytes): 0 Average total size (kbytes): 0 Maximum resident set size (kbytes): 2117608 Average resident set size (kbytes): 0 Major (requiring I/O) page faults: 1400189 Minor (reclaiming a frame) page faults: 117570335 Voluntary context switches: 1870631 Involuntary context switches: 1154535 Swaps: 0 File system inputs: 0 File system outputs: 11777280 Socket messages sent: 0 Socket messages received: 0 Signals delivered: 0 Page 
size (bytes): 4096 Exit status: 0 The observed improvement is about 5% of total wall clock time for `dev-build` target. Also, all commits make sure that headers stay self-sufficient, which would help to further improve the situation in the future. " * 'feature/header_cleanups_v1' of https://github.com/ManManson/scylla: transport: remove extraneous `qos/service_level_controller` includes from headers treewide: remove evidently unneded storage_proxy includes from some places service_level_controller: remove extraneous `service/storage_service.hh` include sstables/writer: remove extraneous `service/storage_service.hh` include treewide: remove extraneous database.hh includes from headers treewide: reduce boost headers usage in scylla header files cql3: remove extraneous includes from some headers cql3: various forward declaration cleanups utils: add missing <limits> header in `extremum_tracking.hh`	2021-05-24 14:24:20 +03:00
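The "about 5%" figure in the merge message above can be re-derived from the two elapsed wall-clock times it quotes (12:10.97 before, 11:32.36 after). A minimal sketch of that arithmetic follows; the helper names `to_seconds` and `improvement_percent` are illustrative and not part of the Scylla tree.

```cpp
// Convert an m:ss.xx wall-clock reading into seconds.
constexpr double to_seconds(int minutes, double seconds) {
    return minutes * 60.0 + seconds;
}

// Relative reduction in elapsed time, in percent of the "before" value.
constexpr double improvement_percent(double before, double after) {
    return (before - after) / before * 100.0;
}

// Values taken from the `/usr/bin/time -v ninja dev-build` output quoted above.
constexpr double before_s = to_seconds(12, 10.97); // 730.97 s
constexpr double after_s  = to_seconds(11, 32.36); // 692.36 s
constexpr double gain     = improvement_percent(before_s, after_s);
```

With these inputs the saving is 38.61 seconds, which is a little over 5% of the original build time, consistent with the claim in the message.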
Eliran Sinvani	f2091bb227	workload prioritization: Reduce the logging sensitivity to "glitches" in availability Before this patch, every failure to pull the configuration was reported as a warning. However, this is confusing for users for two reasons: 1. It pollutes the logs if the configuration is polled, which is Scylla's mode of operation. Such a line is logged on every failed iteration. 2. It confuses users because, even though the level is warning, it logs an exception and the log message contains the word "failed". We see it a lot during QA runs and in customer questions from the field. Point 2 is only solvable by reducing the verbosity of the logged information, which would make debugging harder. Point 1 is addressed here in the following manner. First, the one-shot configuration pull function no longer handles the exception itself. This is OK because it is harmless to fail once or twice in a row when pulling the configuration, as with any other query; the caller is now responsible for handling the exception and logging the information. Second, the polling loop captures the exceptions thrown by the configuration pulling function and only reports an error with the latest exception if the polling has failed in consecutive iterations over the last 90 seconds. This value was chosen because it is about the empirical worst-case time that it takes a node to notice that one of the other nodes in the cluster is down (and hence stop querying it). It is not important for the user or for us to be notified of temporary glitches in availability (through this error at least), and since we are eventually consistent it is OK that some nodes will catch up with the configuration later than others. We also set a threshold after which, if the configuration still cannot be retrieved, the logging level is bumped to ERROR. Closes #8574	2021-05-24 10:51:47 +02:00
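The reporting rule described in the commit above (stay quiet on isolated failures, report only once polling has failed in consecutive iterations for 90 seconds) can be sketched as a small state tracker. This is an illustration under stated assumptions, not the code from the patch: the class name `consecutive_failure_tracker` and its methods are hypothetical, and time is passed in explicitly so the logic can be exercised without a real clock.

```cpp
#include <cassert>
#include <chrono>
#include <optional>

using clock_type = std::chrono::steady_clock;

// Hypothetical helper: remembers when the current streak of failed polls
// started and tells the caller when the streak is old enough to report.
class consecutive_failure_tracker {
    std::optional<clock_type::time_point> _first_failure; // start of the current streak
    clock_type::duration _threshold;                      // e.g. 90 seconds
public:
    explicit consecutive_failure_tracker(clock_type::duration threshold)
        : _threshold(threshold) {
        assert(threshold > clock_type::duration::zero());
    }

    // A successful poll ends the streak.
    void on_success() { _first_failure.reset(); }

    // Records a failed poll at time `now`. Returns true once failures have
    // been consecutive for at least `_threshold`, i.e. it is time to log.
    bool on_failure(clock_type::time_point now) {
        if (!_first_failure) {
            _first_failure = now;
        }
        return now - *_first_failure >= _threshold;
    }
};
```

A polling loop would call `on_success()` after each successful pull and log the latest exception only when `on_failure(now)` returns true, so a single glitch followed by a success never reaches the log.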
Piotr Sarna	17f4a55664	qos: remove unused with_user_service_level helper This helper function is an artifact of forward-porting service levels, and it wouldn't even compile when used because of mismatched function declarations. It's not used anywhere in the open-source code, so it's removed to avoid future merge conflicts. Message-Id: <c9f421d0c4c1a807626775d324fd35b4c72505fe.1621845335.git.sarna@scylladb.com>	2021-05-24 11:42:51 +03:00
Avi Kivity	b8137986e6	raft: raft_services: drop unused _gossiper field	2021-05-21 21:00:04 +03:00
Asias He	2ec1f719de	repair: Always use run_replace_ops Currently, the new NODE_OPS_CMD for the replace operation is used only when repair-based node operations are enabled. However, we can also use NODE_OPS_CMD to run the replace operation with streaming instead of repair to sync data. After this patch, we will use streaming inside run_replace_ops if repair-based node ops are not enabled. This way we get the benefits that NODE_OPS_CMD brought in commit `323f72e48a` (repair: Switch to use NODE_OPS_CMD for replace operation). Fixes #8013	2021-05-20 20:14:15 +03:00
Avi Kivity	30034371e7	Merge "Remove most of global pointers from repair" from Pavel " There is a lot of global state in repair: a bunch of pointers to sharded services, the tracker, a map of metas (maybe more). This set removes the first group; all those services had become main-local recently. Along the way a call to the global storage proxy is dropped. To get there the repair_service is turned into a "classical" sharded<> service, gets all the needed dependencies by references from main and spreads them internally where needed. The tracker and other stuff is left global, but the tracker is now a candidate for merging with the now sharded repair_service, since it emulates the sharded concept internally. Overall the change is - make repair_service sharded and put all dependencies on it at start - have sharded<repair_service> in API and storage service - carry the service reference down to repair_info and repair_meta constructions to give them the dependencies - use needed services in _info and _meta methods tests: unit(dev), dtest.repair(dev) " * 'br-repair-service' of https://github.com/xemul/scylla: (29 commits) repair: Drop most of globals from repair repair: Use local references in messaging handler checks repair: Use local references in create_writer() repair: Construct repair_meta with local references repair: Keep more stuff on repair_info repair: Kill bunch of global usages from insert_repair_meta repair: Pass repair service down to meta insertion repair: Keep local migration manager on repair_info repair: Move unused db captures repair: Remove unused ms captures repair: Construct repair_info with service repair: Loop over repair sharded container repair: Make sync_data_using_repair a method repair: Use repair from storage service repair: Keep repair on storage service repair: Make do_repair_start a method repair: Pass repair_service through the API until do_repair_start repair: Fix indentation after previous patch repair: Split sync_data_using_repair repair: Turn repair_range a repair_info method ...	2021-05-20 10:57:48 +03:00
Pavel Solodovnikov	0663aa6ca1	service_level_controller: remove extraneous `service/storage_service.hh` include Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-05-20 02:18:41 +03:00
Pavel Solodovnikov	fff7ef1fc2	treewide: reduce boost headers usage in scylla header files `dev-headers` target is also ensured to build successfully. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-05-20 01:33:18 +03:00
Asias He	0858619cba	storage_service: Abort restore_replica_count when node is removed from the cluster Consider the following procedure: - n1, n2, n3 - n3 is down - n1 runs nodetool removenode uuid_of_n3 to remove n3 from the cluster - n1 goes down in the middle of the removenode operation Node n1 will set n3 to removing gossip status during the removenode operation. Whenever existing nodes learn that a node is in removing gossip status, they will call restore_replica_count to stream data from other nodes for the ranges n3 would lose if it were removed from the cluster. If the streaming fails, the streaming will sleep and retry. The current max number of retry attempts is 5. The sleep interval starts at 60 seconds and increases 1.5 times per sleep. This can leave the cluster in a bad state. For example, nodes can run out of disk space if the streaming continues. We need a way to abort such streaming attempts. To abort the removenode operation and forcibly remove the node, users can run `nodetool removenode force` on any existing node to move the node from removing gossip status to removed gossip status. However, restore_replica_count will not be aborted. In this patch, a status checker is added in restore_replica_count, so that once a node is in removed gossip status, restore_replica_count will be aborted. This patch is for older releases without the new NODE_OPS_CMD infrastructure, where such an abort happens automatically in case of error. Fixes #8651 Closes #8655	2021-05-18 14:55:18 +02:00
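The retry behaviour described in the commit above (up to 5 attempts, a sleep that starts at 60 seconds and grows 1.5 times per sleep, plus the new abort check) can be sketched as a plain synchronous loop. This is only an illustration, not the Seastar-based code in storage_service: `stream_with_retries`, `outcome`, and the three callbacks are hypothetical names, and the assumption that no sleep follows the final attempt is mine.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

enum class outcome { done, aborted, gave_up };

using seconds_d = std::chrono::duration<double>;

// attempt():      runs one streaming attempt, returns true on success.
// node_removed(): the status check; true once the node is in "removed" status.
// sleep_for(d):   waits for d (injected so the loop can be tested without sleeping).
outcome stream_with_retries(const std::function<bool()>& attempt,
                            const std::function<bool()>& node_removed,
                            const std::function<void(seconds_d)>& sleep_for,
                            int max_attempts = 5,
                            seconds_d initial_sleep = std::chrono::seconds(60),
                            double factor = 1.5) {
    assert(max_attempts > 0 && factor >= 1.0);
    seconds_d sleep = initial_sleep;
    for (int i = 0; i < max_attempts; ++i) {
        if (node_removed()) {
            return outcome::aborted;   // stop retrying once the node is gone
        }
        if (attempt()) {
            return outcome::done;
        }
        if (i + 1 < max_attempts) {    // assumption: no sleep after the last attempt
            sleep_for(sleep);
            sleep *= factor;           // 60 s, 90 s, 135 s, 202.5 s, ...
        }
    }
    return outcome::gave_up;
}
```

Without the `node_removed()` check, a failing stream keeps the node busy for the whole schedule (over eight minutes of sleeping alone with these defaults); with it, the loop exits at the next iteration after `nodetool removenode force` changes the gossip status.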
Kamil Braun	03ad111beb	tree-wide: comments on deprecated functions to access global variables Closes #8665	2021-05-18 11:31:10 +03:00
Pavel Emelyanov	5c020880f9	repair: Use repair from storage service This is the continuation of the previous patch -- the do_..._with_repair functions become repair_service methods and will get local repair service reference as "this". Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-05-14 18:44:02 +03:00
Pavel Emelyanov	23e8e60ec0	repair: Keep repair on storage service Storage service calls a bunch of do_something_with_repair() methods. All of them need the local repair_service and the only way to get it is by keeping it on storage service. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-05-14 18:44:02 +03:00
Asias He	e4872a78b5	storage_service: Delay update pending ranges for replacing node In commit `c82250e0cf` (gossip: Allow deferring advertise of local node to be up), the replacing node was changed to postpone responding to gossip echo messages, to avoid other nodes sending read requests to the replacing node. It works as follows: 1) the replacing node does not respond to echo messages, so that other nodes do not mark it as alive 2) the replacing node advertises hibernate state so other nodes know it is replacing 3) the replacing node responds to echo messages so other nodes can mark it as alive This is problematic because after step 2, the existing nodes in the cluster will start to send writes to the replacing node, but at this time it is possible that existing nodes haven't marked the replacing node as alive, thus failing the write request unnecessarily. For instance, we saw the following errors in issue #8013 (Cassandra stress fails to achieve consistency when only one of the nodes is down) ``` scylla: [shard 1] consistency - Live nodes 2 do not satisfy ConsistencyLevel (2 required, 1 pending, live_endpoints={127.0.0.2, 127.0.0.1}, pending_endpoints={127.0.0.3}) [shard 0] gossip - Fail to send EchoMessage to 127.0.0.3: std::runtime_error (Not ready to respond gossip echo message) c-s: java.io.IOException: Operation x10 on key(s) [4c4f4d37324c35304c30]: Error executing: (UnavailableException): Not enough replicas available for query at consistency QUORUM (2 required but only 1 alive ``` To solve this problem for older releases without the patch "repair: Switch to use NODE_OPS_CMD for replace operation", a minimal fix is implemented in this patch. Once existing nodes learn that the replacing node is in HIBERNATE state, they record it as replacing, but add it to the pending list only after it is marked as alive. With this patch, when the existing nodes start to write to the replacing node, the replacing node is already alive. Tests: replace_address_test.py:TestReplaceAddress.replace_node_same_ip_test + manual test Fixes: #8013 Closes #8614	2021-05-14 17:24:28 +02:00
Avi Kivity	61c7f874cc	Merge 'Add per-service-level timeouts' from Piotr Sarna Ref: #7617 This series adds timeout parameters to service levels. Per-service-level timeouts can be set up in the form of service level parameters, which can in turn be attached to roles. Setting up and modifying role-specific timeouts can be achieved like this: ```cql CREATE SERVICE LEVEL sl2 WITH read_timeout = 500ms AND write_timeout = 200ms AND cas_timeout = 2s; ATTACH SERVICE LEVEL sl2 TO cassandra; ALTER SERVICE LEVEL sl2 WITH write_timeout = null; ``` Per-service-level timeouts take precedence over default timeout values from scylla.yaml, but can still be overridden for a specific query by per-query timeouts (e.g. `SELECT * from t USING TIMEOUT 50ms`). Closes #7913 * github.com:scylladb/scylla: docs: add a paragraph describing service level timeouts test: add per-service-level timeout tests test: add refreshing client state transport: add updating per-service-level params client_state: allow updating per service level params qos: allow returning combined service level options qos: add a way of merging service level options cql3: add preserving default values for per-sl timeouts qos: make getting service level public qos: make finding service level public treewide: remove service level controller from query state treewide: propagate service level to client state sstables: disambiguate boost::find cql3: add a timeout column to LIST SERVICE LEVEL statement db: add extracting service level info via CQL types: add a missing translation for cql_duration cql3: allow unsetting service level timeouts cql3: add validating service level timeout values db: add setting service level params via system_distributed cql3: add fetching service level attrs in ALTER and CREATE cql3: add timeout to service level params qos: add timeout to service level info db,sys_dist_ks: add timeout to the service level table migration_manager: allow table updates with timestamp cql3: allow a null keyword for CQL 
properties	2021-05-11 18:39:10 +03:00
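The precedence stated in the merge message above (a per-query `USING TIMEOUT` overrides the service-level timeout, which in turn overrides the default from scylla.yaml) amounts to a first-set-wins lookup. A minimal sketch, assuming unset values are modelled as empty optionals; `effective_timeout` and `timeout_t` are illustrative names, not the API added by the series.

```cpp
#include <chrono>
#include <optional>

using timeout_t = std::chrono::milliseconds;

// Picks the timeout that applies to a single request:
// per-query value first, then the service-level value, then the config default.
timeout_t effective_timeout(std::optional<timeout_t> per_query,
                            std::optional<timeout_t> per_service_level,
                            timeout_t config_default) {
    if (per_query) {
        return *per_query;          // e.g. SELECT ... USING TIMEOUT 50ms
    }
    if (per_service_level) {
        return *per_service_level;  // e.g. read_timeout = 500ms on sl2
    }
    return config_default;          // value configured in scylla.yaml
}
```

In this model, `ALTER SERVICE LEVEL sl2 WITH write_timeout = null` corresponds to passing `std::nullopt` for the service-level argument, so writes fall back to the configured default unless the query carries its own timeout.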
Avi Kivity	b1f9df279a	Merge "Untie cdc, storage service and migration notifier knot" from Pavel E " Storage service needs migration notifier reference to pass it to cdc service via get_local_storage_service(). This set removes - get_local_storage_service from cdc - migration notifier from storage service - db_context::builder from cdc (released nuclear binding energy) tests: unit(dev) " * 'br-cdc-no-storage-service' of https://github.com/xemul/scylla: storage_service: Remove migration notifier dependency cdc: Remove db_context::builder cdc: Provide migration notifier right at once cdc: Remove db_context::builder::with_migration_notifier	2021-05-11 18:39:10 +03:00
Asias He	4f0a1cbca3	repair: Wire off-strategy compaction for decommission When decommission is done, all nodes that receive data from the decommission node will run node_ops_cmd::decommission_done handler. Trigger off-strategy compaction inside the handler to wire off-strategy for decommission. Refs #5226 Closes #8607	2021-05-11 18:39:10 +03:00


2223 Commits