scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-28 20:27:03 +00:00

Author	SHA1	Message	Date
Calle Wilund	cbddcf46aa	commitlog: Flush all segments if we only have one. Handle test cases with borked config so we don't deadlock in cases where we only have one segment in a commitlog	2021-06-16 15:35:56 +00:00
Calle Wilund	a0f559a44c	commitlog: Always force flush if segment allocation is waiting Refs #8270 If segment allocation is blocked, we should bypass all thresholds and issue a flush of as much as possible.	2021-06-16 15:35:56 +00:00
Calle Wilund	bcf4d07f0b	commitlog: Include segment wasted (slack) size in footprint check Refs #8270 Since segment allocation looks at actual disk footprint, not active, the threshold check in timer task should include slack space so we don't mistake sparse usage for space left.	2021-06-16 15:35:56 +00:00
Calle Wilund	1187f5c181	commitlog: Adjust (lower) usage threshold Refs #8270 Try to ensure we issue a flush as soon as we are allocating in the last allowable segment, instead of "half through". This will make flushing a little more eager, but should reduce latencies created by waiting for segment delete/recycle on heavy usage.	2021-06-16 15:35:56 +00:00
Piotr Sarna	8a049c9116	view: fix use-after-move when handling view update failures The code was susceptible to use-after-move if both local and remote updates were going to be sent. The whole routine for sending view updates is now rewritten to avoid use-after-move. Refs #8830 Tests: unit(release), dtest(secondary_indexes_test.py:TestSecondaryIndexes.test_remove_node_during_index_build)	2021-06-14 09:36:10 +02:00
Piotr Sarna	7cdbb7951a	db,view: explicitly move the mutation to its helper function The `apply_to_remote_endpoints` helper function used to take its `mut` parameter by reference, but then moved the value from it, which is confusing and prone to errors. Since the value is moved-from, let's pass it to the helper function as rvalue ref explicitly.	2021-06-14 09:34:40 +02:00
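The calling convention described in the two view commits above can be shown with a minimal C++ sketch. The names `mutation_like`, `send_remote` and `send_both` are hypothetical stand-ins and not Scylla's actual types or functions; the sketch only illustrates why an rvalue-reference parameter makes the ownership transfer explicit and how ordering the copy before the move avoids use-after-move.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a mutation; not Scylla's real type.
struct mutation_like {
    std::string payload;
};

// Taking the parameter as an rvalue reference makes the ownership transfer
// visible in the signature: callers must write std::move(m) to call this.
inline void send_remote(mutation_like&& mut, std::vector<mutation_like>& outbox) {
    outbox.push_back(std::move(mut));
}

// When both a local and a remote update are needed, copy first and move last,
// so nothing reads from `mut` after it has been moved from.
inline void send_both(mutation_like mut,
                      std::vector<mutation_like>& local,
                      std::vector<mutation_like>& remote) {
    local.push_back(mut);                 // copy: `mut` is still needed below
    send_remote(std::move(mut), remote);  // final use: safe to move
}
```

With a plain lvalue-reference parameter, the call site `send_remote(m, outbox)` would give no hint that `m` is left in a moved-from state afterwards.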
Piotr Sarna	88d4a66e90	db,view: pass base token by value to mutate_MV The base token is passed cross-continuations, so the current way of passing it by const reference probably only works because the token copying is cheap enough to optimize the reference out. Fix by explicitly taking the token by value.	2021-06-14 09:30:38 +02:00
Nadav Har'El	8a4ac6914a	config: add configuration option restrict_replication_simplestrategy This patch adds a configuration option to choose whether the SimpleStrategy replication strategy is restricted. It is a tri_mode_restriction, allowing to restrict this strategy (true), to allow it (false), or to just warn when it is used (warn). After this patch, the option exists but doesn't yet do anything. It will be used in the following two patches to restrict the CREATE KEYSPACE and ALTER KEYSPACE operations, respectively. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-06-13 14:45:16 +03:00
Nadav Har'El	a3d6f502ad	config: add "tri_mode_restriction" type of configurable value This patch adds a new type of configurable value for our command-line and YAML parsers - a "tri_mode_restriction" - which can be set to three values: "true", "false", or "warn". We will use this value type for many (but not all) of the restriction options that we plan to start adding in the following patches. Restriction options will allow users to ask Scylla to restrict (true), to not restrict (false) or to warn about (warn) certain dangerous or undesirable operations. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2021-06-13 14:44:20 +03:00
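As a rough illustration of the three-valued option described above, the following C++ sketch maps the three textual values to a mode. The enum and function names are assumptions made for this example; the actual `tri_mode_restriction` type in Scylla's config code may look quite different.

```cpp
#include <optional>
#include <string_view>

// Hypothetical sketch of a three-valued restriction setting.
enum class tri_mode { allow, restrict_, warn };

// Maps the textual values named in the commit message to a mode:
// "true" restricts, "false" allows, "warn" only warns.
// Returns std::nullopt for anything else.
inline std::optional<tri_mode> parse_tri_mode(std::string_view s) {
    if (s == "true")  return tri_mode::restrict_;
    if (s == "false") return tri_mode::allow;
    if (s == "warn")  return tri_mode::warn;
    return std::nullopt;
}
```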
Pavel Solodovnikov	76bea23174	treewide: reduce header interdependencies Use forward declarations wherever possible. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com> Closes #8813	2021-06-07 15:58:35 +03:00
Avi Kivity	a55b434a2b	treewide: extend copyright statements to present day	2021-06-06 19:18:49 +03:00
Nadav Har'El	48ff641f67	Merge 'commitlog: make_checked_file for segments, report and ignore other errors on shutdown' from Benny Halevy Shutdown must never fail, otherwise it may cause hangs as seen in https://github.com/scylladb/scylla/issues/8577. This change wraps the file created in `allocate_segment_ex` in `make_checked_file` so that scylla will abort when failing to write to the commitlog files. In case other errors are seen during shutdown, just log them and continue with shutting down to prevent scylla from hanging. Fixes #8577 Test: unit(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #8578 * github.com:scylladb/scylla: commitlog: segment_manager::shutdown: abort on errors commitlog: allocate_segment_ex: make_checked_file	2021-06-06 19:18:49 +03:00
Pavel Solodovnikov	e0749d6264	treewide: some random header cleanups Eliminate not used includes and replace some more includes with forward declarations where appropriate. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-06-06 19:18:49 +03:00
Piotr Sarna	389a0a52c9	treewide: revamp workload type for service levels This patch is not backward compatible with its original, but it's considered fine, since the original workload types were not yet part of any release. The changes include: - instead of using 'unspecified' for declaring that there's no workload type for a particular service level, NULL is used for that purpose; NULL is the standard way of representing lack of data - introducing a delete marker, which accompanies NULL and makes it possible to distinguish between wanting to forcibly reset a workload type to unspecified and not wanting to change the previous value - updating the tests accordingly These changes come in as a single patch, because they're intertwined with each other and the tests for workload types are already in place; an attempt to split them proved to be more complicated than it's worth. Tests: unit(release) Closes #8763	2021-05-31 18:18:33 +03:00
Piotr Jastrzebski	76d7c761d1	schema: Stop using deprecated constructor This is another boring patch. One of schema constructors has been deprecated for many years now but was used in several places anyway. Usage of this constructor could lead to data corruption when using MX sstables because this constructor does not set schema version. MX reading/writing code depends on schema version. This patch replaces all the places the deprecated constructor is used with schema_builder equivalent. The schema_builder sets the schema version correctly. Fixes #8507 Test: unit(dev) Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <4beabc8c942ebf2c1f9b09cfab7668777ce5b384.1622357125.git.piotr@scylladb.com>	2021-05-30 11:58:27 +03:00
Pavel Emelyanov	1ce0682821	view: Get database from storage_proxy The db::view code already uses proxy rather actively, so instead of depending on the storage service to be at hand it's better to make db::view require the proxy. For now -- via global instance. There's one dependency on storage service left after this patch -- to get the tokens. This piece is to be fixed later. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-05-28 18:09:32 +03:00
Avi Kivity	0acf5bfca6	build: enable -Wreturn-std-move Clang warns when "return std::move(x)" is needed to elide a copy, but the call to std::move() is missing. We disabled the warning during the migration to clang. This patch re-enables the warning and fixes the places it points out, usually by adding std::move() and in one place by converting the returned variable from a reference to a local, so normal copy elision can take place. Closes #8739	2021-05-27 21:16:26 +03:00
Avi Kivity	d3e5b37059	Revert "Merge 'Commitlog: Handle disk usage and disk footprint discrepancies, ensuring we flush when needed' from Calle Wilund" This reverts commit `e9c940dbbc`, reversing changes made to `6144656b25`. Since it was merged commitlog_test consistently times out in debug mode.	2021-05-27 21:16:26 +03:00
Avi Kivity	5f8484897b	Merge 'cdc: use a new internal table for exchanging generations' from Kamil Braun Reopening #8286 since the token metadata fix that allows `Everywhere` strategy tables to work with RBO (#8536) has been merged. --- Currently when a node wants to create and broadcast a new CDC generation it performs the following steps: 1. choose the generation's stream IDs and mapping (how this is done is irrelevant for the current discussion) 2. choose the generation's timestamp by taking the current time (according to its local clock) and adding 2 * ring_delay 3. insert the generation's data (mapping and stream IDs) into system_distributed.cdc_generation_descriptions, using the generation's timestamp as the partition key (we call this table the "old internal table" below) 4. insert the generation's timestamp into the "CDC_STREAMS_TIMESTAMP" application state. The timestamp spreads epidemically through the gossip protocol. When nodes see the timestamp, they retrieve the generation data from the old internal table. Unfortunately, due to the schema of the old internal table, where the entire generation data is stored in a single cell, step 3 may fail for sufficiently large generations (there is a size threshold for which step 3 will always fail - retrying the operation won't help). Also the old internal table lies in the system_distributed keyspace that uses SimpleStrategy with replication factor 3, which is also problematic; for example, when nodes restart, they must reach at least 2 out of these 3 specific replicas in order to retrieve the current generation (we write and read the generation data with QUORUM, unless we're a single-node cluster, where we use ONE). Until this happens, a restarting node can't coordinate writes to CDC-enabled tables. It would be better if the node could access the last known generation locally. 
The commit introduces a new table for broadcasting generation data with the following properties: - it uses a better schema that stores the data in multiple rows, each of manageable size - it resides in a new keyspace that uses EverywhereStrategy so the data will be written to every node in the cluster that has a token in the token ring - the data will be written using CL=ALL and read using CL=ONE; thanks to this, restarting node won't have to communicate with other nodes to retrieve the data of the last known generation. Note that writing with CL=ALL does not reduce availability: creating a new generation requires all nodes to be available anyway, because they must learn about the generation before their clocks go past the generation's timestamp; if they don't, partitions won't be mapped to stream IDs consistently across the cluster - the partition key is no longer the generation's timestamp. Because it was that way in the old internal table, it forced the algorithm to choose the timestamp before the generation data was inserted into the table. What if the inserting took a long time? It increased the chance that nodes would learn about the generation too late (after their clocks moved past its timestamp). With the new schema we will first insert the generation data using a randomly generated UUID as the partition key, then choose the timestamp, then gossip both the timestamp and the UUID. Observe that after a node learns about a generation broadcasted using this new method through gossip it will retrieve its data very quickly since it's one of the replicas and it can use CL=ONE as it was written using CL=ALL. The generation's timestamp and the UUID mentioned in the last point form a "generation identifier" for this new generation. For passing these new identifiers around, we introduce the cdc::generation_id_v2 type. Fixes #7961. 
--- For optimal review experience it is best to first read the updated design notes (you can read them rendered here: https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md), specifically the ["Generation switching"](https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md#generation-switching) section followed by the ["Internal generation descriptions table V1 and upgrade procedure"](https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md#internal-generation-descriptions-table-v1-and-upgrade-procedure) section, then read the commits in topological order. dtest gating run (dev): https://jenkins.scylladb.com/job/scylla-master/job/byo/job/byo_build_tests_dtest/1160/ unit tests (dev) passed locally Closes #8643 * github.com:scylladb/scylla: docs: update cdc.md with info about the new internal table sys_dist_ks: don't create old CDC generations table on service initialization sys_dist_ks: rename all_tables() to ensured_tables() cdc: when creating new generations, use format v2 if possible main: pass feature_service to cdc::generation_service gms: introduce CDC_GENERATIONS_V2 feature cdc: introduce retrieve_generation_data test: cdc: include new generations table in permissions test sys_dist_ks: increase timeout for create_cdc_desc sys_dist_ks: new table for exchanging CDC generations tree-wide: introduce cdc::generation_id_v2	2021-05-27 17:13:44 +03:00
Piotr Sarna	d45574ed28	sys_dist_ks: fix redundant parsing in get_service_level The routine used for getting service level information already operates on the service level name, but the same information is also parsed once more from a row from an internal table. This parsing is redundant, so it's hereby removed.	2021-05-27 14:31:26 +02:00
Piotr Sarna	7faba19605	sys_dist_ks: make get_service_level exception-safe In order to avoid killing the node if a parsing error occurs, the routine which fetches service level information is made exception-safe.	2021-05-27 14:31:25 +02:00
Piotr Sarna	4816678eb6	cql3: add persisting service level workload type The workload type information can now be set via CQL and it's persisted in the distributed system table.	2021-05-27 13:02:22 +02:00
Konstantin Osipov	7ca4ffc309	system_keyspace: coroutinize db::system_keyspace::setup() Message-Id: <20210525183919.1395607-19-kostja@scylladb.com>	2021-05-26 11:06:21 +03:00
Avi Kivity	e9c940dbbc	Merge 'Commitlog: Handle disk usage and disk footprint discrepancies, ensuring we flush when needed' from Calle Wilund Fixes #8270 If we have an allocation pattern where we leave large parts of segments "wasted" (typically because the segment has empty space, but cannot hold the mutation being added), we can have a disk usage that is below threshold, yet still get a disk _footprint_ that is over limit causing new segment allocation to stall. We need to take a few things into account: 1.) Need to include wasted space in the threshold check. Whether or not disk is actually used does not matter here. 2.) If we stall a segment alloc, we should just flush immediately. No point in waiting for the timer task. 3.) Need to adjust the thresholds a bit. Depending on sizes, we should probably consider starting to flush once we've used up enough space to be in the last available segment, so a new one is hopefully available by the time we hit the limit. Also fix edge case (for tests), when we have too few segments to have an active one (i.e. need to flush everything). Closes #8695 * github.com:scylladb/scylla: commitlog_test: Add test case for usage/disk size threshold mismatch commitlog: Flush all segments if we only have one. commitlog: Always force flush if segment allocation is waiting commitlog: Include segment wasted (slack) size in footprint check commitlog: Adjust (lower) usage threshold	2021-05-25 18:34:29 +03:00
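The first two points of the merge description above can be sketched as a single predicate. This is a simplified illustration under assumed names (`segment_usage`, `should_flush`); the real commitlog keeps its counters and thresholds in a different form.

```cpp
#include <cstdint>

// Hypothetical counters; the real commitlog tracks these differently.
struct segment_usage {
    std::uint64_t active_bytes;  // bytes holding live mutations
    std::uint64_t wasted_bytes;  // slack left unused inside segments
};

// Decide whether to flush. The footprint counts slack as well as active data,
// and a stalled segment allocation bypasses the threshold entirely.
inline bool should_flush(const segment_usage& u,
                         std::uint64_t threshold_bytes,
                         bool allocation_waiting) {
    if (allocation_waiting) {
        return true;
    }
    return u.active_bytes + u.wasted_bytes >= threshold_bytes;
}
```

Checking only `active_bytes` against the threshold is what lets sparse usage be mistaken for free space.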
Kamil Braun	c948573398	sys_dist_ks: don't create old CDC generations table on service initialization The old table won't be created in clusters that are bootstrapped after this commit. It will stay in clusters that were upgraded from a version before this commit. Note that a fully upgraded cluster doesn't automatically create a new generation in the new format. Even if the last generation was created before the upgrade, the cluster will keep using it. A new generation will be created in the new format when either: 1. a new node bootstraps (in the new version), 2. or the user runs checkAndRepairCdcStreams, which has a new check: if the current generation uses the old format, the command will decide that repair is needed, even if the generation is completely fine otherwise (also in the new version). During upgrade, while the CDC_GENERATIONS_V2 feature is still not enabled, the user may still bootstrap a node in the old version of Scylla or run checkAndRepairCdcStreams on a not-yet-upgraded node. In that case a new generation will be created in the old format, using the old table definitions.	2021-05-25 16:07:23 +02:00
Kamil Braun	2835697ac1	sys_dist_ks: rename all_tables() to ensured_tables() The static function `all_tables` in system_distributed_keyspace.cc was used by the `system_distributed_keyspace` service initialization function (`start()`) to ensure that a certain set of tables - which the service provides accessors to - exist in the cluster. For each table in the vector returned by `all_tables()` the function would try to create the table, ignoring the "table already exists" error if it is thrown. The commit renames `all_tables` to `ensured_tables` to better convey the intention of this function and documents its purpose in a comment. We do this because in the future the service may provide accessors to tables which it does not actually create. The example - coming in a later commit - is a table which was created in a previous version of Scylla, and for which we still have to provide accessors for backward compatibility / correct handling of the upgrade procedure, but which we do not want to create in clusters that were freshly created using the new version of Scylla, since in that case these tables would be just unnecessary garbage. We mention this use case in the comment.	2021-05-25 16:07:23 +02:00
Kamil Braun	1c25b9df56	sys_dist_ks: increase timeout for create_cdc_desc If we want to allow larger generations, we may want to give this operation a bit more time.	2021-05-25 16:07:23 +02:00
Kamil Braun	3155cde9c8	sys_dist_ks: new table for exchanging CDC generations Currently when a node wants to create and broadcast a new CDC generation it performs the following steps: 1. choose the generation's stream IDs and mapping (how this is done is irrelevant for the current discussion) 2. choose the generation's timestamp by taking the current time (according to its local clock) and adding 2 * ring_delay 3. insert the generation's data (mapping and stream IDs) into system_distributed.cdc_generation_descriptions, using the generation's timestamp as the partition key (we call this table the "old internal table" below) 4. insert the generation's timestamp into the "CDC_STREAMS_TIMESTAMP" application state. The timestamp spreads epidemically through the gossip protocol. When nodes see the timestamp, they retrieve the generation data from the old internal table. Unfortunately, due to the schema of the old internal table, where the entire generation data is stored in a single cell, step 3 may fail for sufficiently large generations (there is a size threshold for which step 3 will always fail - retrying the operation won't help). Also the old internal table lies in the system_distributed keyspace that uses SimpleStrategy with replication factor 3, which is also problematic; for example, when nodes restart, they must reach at least 2 out of these 3 specific replicas in order to retrieve the current generation (we write and read the generation data with QUORUM, unless we're a single-node cluster, where we use ONE). Until this happens, a restarting node can't coordinate writes to CDC-enabled tables. It would be better if the node could access the last known generation locally. 
The commit introduces a new table for broadcasting generation data with the following properties: - it uses a better schema that stores the data in multiple rows, each of manageable size - it resides in the `system_distributed_everywhere` keyspace so the data will be written to every node in the cluster that has a token in the token ring - the data will be written using CL=ALL and read using CL=ONE; thanks to this, restarting node won't have to communicate with other nodes to retrieve the data of the last known generation. Note that writing with CL=ALL does not reduce availability: creating a new generation requires all nodes to be available anyway, because they must learn about the generation before their clocks go past the generation's timestamp; if they don't, partitions won't be mapped to stream IDs consistently across the cluster - the partition key is no longer the generation's timestamp. Because it was that way in the old internal table, it forced the algorithm to choose the timestamp before the generation data was inserted into the table. What if the inserting took a long time? It increased the chance that nodes would learn about the generation too late (after their clocks moved past its timestamp). With the new schema we will first insert the generation data using a randomly generated UUID as the partition key, then choose the timestamp, then gossip both the timestamp and the UUID. The timestamp and the UUID form the "generation identifier" of this new generation; this should explain why we introduced the generation_id_v2 type in previous commits. Observe that after a node learns about a generation broadcasted using this new method through gossip it will retrieve its data very quickly since it's one of the replicas and it can use CL=ONE as it was written using CL=ALL. Note that the node is still using the old method - the actual switch will be done in a later commit.	2021-05-25 16:07:23 +02:00
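The reordering described above (insert the data first, choose the timestamp afterwards) can be sketched as follows. Everything here is an assumption made for illustration: the struct layout, the use of a string in place of a UUID type, the `publish_generation` helper, and the premise that the timestamp is still computed as the current time plus 2 * ring_delay as in the old procedure.

```cpp
#include <cassert>
#include <chrono>
#include <string>

using clock_type = std::chrono::system_clock;

// Hypothetical layout: the commits only say that the v2 identifier carries
// a timestamp and a UUID.
struct generation_id_v2 {
    clock_type::time_point ts;
    std::string uuid;  // placeholder for a real UUID type
};

// The data is written first, keyed by the UUID. Only afterwards is the clock
// read and the timestamp chosen, so a slow insert does not shrink the margin
// nodes have to learn about the generation.
template <typename InsertFn, typename NowFn>
generation_id_v2 publish_generation(const std::string& uuid,
                                    InsertFn&& insert_data,
                                    NowFn&& now,
                                    std::chrono::milliseconds ring_delay) {
    insert_data(uuid);                           // potentially slow step
    const clock_type::time_point t = now();      // clock read after the insert
    return generation_id_v2{t + 2 * ring_delay, uuid};
}
```

In the old scheme the timestamp was the partition key, so it had to be fixed before the insert even started.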
Calle Wilund	bf0a91b566	commitlog: Flush all segments if we only have one. Handle test cases with borked config so we don't deadlock in cases where we only have one segment in a commitlog	2021-05-25 12:43:12 +00:00
Calle Wilund	8ce836209b	commitlog: Always force flush if segment allocation is waiting Refs #8270 If segment allocation is blocked, we should bypass all thresholds and issue a flush of as much as possible.	2021-05-25 12:43:12 +00:00
Calle Wilund	e34ed30178	commitlog: Include segment wasted (slack) size in footprint check Refs #8270 Since segment allocation looks at actual disk footprint, not active, the threshold check in timer task should include slack space so we don't mistake sparse usage for space left.	2021-05-25 12:43:12 +00:00
Calle Wilund	ec40207e7f	commitlog: Adjust (lower) usage threshold Refs #8270 Try to ensure we issue a flush as soon as we are allocating in the last allowable segment, instead of "half through". This will make flushing a little more eager, but should reduce latencies created by waiting for segment delete/recycle on heavy usage.	2021-05-25 12:43:12 +00:00
Kamil Braun	4658adbe18	tree-wide: introduce cdc::generation_id_v2 This is a new type of CDC generation identifiers. Compared to old IDs, it contains a UUID in addition to the timestamp. These new identifiers will allow a safer and more efficient algorithm of introducing new generations into a cluster (introduced in a later commit). For now, nodes keep using the old identifier format when creating new generations and whenever they learn about a new CDC generation from gossip they assume that it also is stored in the v1 format. But they do know how to (de)serialize the second format and how to persist new identifiers in local tables.	2021-05-24 17:50:21 +02:00
Avi Kivity	50f3bbc359	Merge "treewide: various header cleanups" from Pavel S " The patch set is an assorted collection of header cleanups, e.g: * Reduce number of boost includes in header files * Switch to forward declarations in some places A quick measurement was performed to see if these changes provide any improvement in build times (ccache cleaned and existing build products wiped out). The results are posted below (`/usr/bin/time -v ninja dev-build`) for 24 cores/48 threads CPU setup (AMD Threadripper 2970WX). Before: Command being timed: "ninja dev-build" User time (seconds): 28262.47 System time (seconds): 824.85 Percent of CPU this job got: 3979% Elapsed (wall clock) time (h:mm:ss or m:ss): 12:10.97 Average shared text size (kbytes): 0 Average unshared data size (kbytes): 0 Average stack size (kbytes): 0 Average total size (kbytes): 0 Maximum resident set size (kbytes): 2129888 Average resident set size (kbytes): 0 Major (requiring I/O) page faults: 1402838 Minor (reclaiming a frame) page faults: 124265412 Voluntary context switches: 1879279 Involuntary context switches: 1159999 Swaps: 0 File system inputs: 0 File system outputs: 11806272 Socket messages sent: 0 Socket messages received: 0 Signals delivered: 0 Page size (bytes): 4096 Exit status: 0 After: Command being timed: "ninja dev-build" User time (seconds): 26270.81 System time (seconds): 767.01 Percent of CPU this job got: 3905% Elapsed (wall clock) time (h:mm:ss or m:ss): 11:32.36 Average shared text size (kbytes): 0 Average unshared data size (kbytes): 0 Average stack size (kbytes): 0 Average total size (kbytes): 0 Maximum resident set size (kbytes): 2117608 Average resident set size (kbytes): 0 Major (requiring I/O) page faults: 1400189 Minor (reclaiming a frame) page faults: 117570335 Voluntary context switches: 1870631 Involuntary context switches: 1154535 Swaps: 0 File system inputs: 0 File system outputs: 11777280 Socket messages sent: 0 Socket messages received: 0 Signals delivered: 0 Page size (bytes): 4096 Exit status: 0 The observed improvement is about 5% of total wall clock time for `dev-build` target. Also, all commits make sure that headers stay self-sufficient, which would help to further improve the situation in the future. " * 'feature/header_cleanups_v1' of https://github.com/ManManson/scylla: transport: remove extraneous `qos/service_level_controller` includes from headers treewide: remove evidently unneeded storage_proxy includes from some places service_level_controller: remove extraneous `service/storage_service.hh` include sstables/writer: remove extraneous `service/storage_service.hh` include treewide: remove extraneous database.hh includes from headers treewide: reduce boost headers usage in scylla header files cql3: remove extraneous includes from some headers cql3: various forward declaration cleanups utils: add missing <limits> header in `extremum_tracking.hh`	2021-05-24 14:24:20 +03:00
Asias He	425e3b1182	gossip: Introduce direct failure detector Currently, gossip uses the updates of the gossip heartbeat from gossip messages to decide if a node is up or down. This means if a node is actually down but the gossip messages are delayed in the network, marking the node down can be delayed. For example, a node sends 20 gossip messages in 20 seconds before it is dead. Each message is delayed 15 seconds by the network for some reason. A node receives those delayed messages one after another. Those delayed messages will prevent the dead node from being marked as down, because a heartbeat update is received just before the threshold to mark a node down (around 20 seconds by default) is triggered. As a result, this node will not be marked as down in 20 * 15 seconds = 300 seconds, much longer than the ~20 seconds node down detection time in normal cases. In this patch, a new failure detector is implemented. - Direct detection The existing failure detector can get gossip heartbeat updates indirectly. For example: Node A can talk to Node B; Node B can talk to Node C; Node A cannot talk to Node C, due to network issues. Node A will not mark Node C as down because Node A can get the heartbeat of Node C from Node B indirectly. This indirect detection is not very useful because when Node A decides if it should send requests to Node C, the requests from Node A to C will fail while Node A thinks it can communicate with Node C. This patch changes the failure detection to be direct. It uses the existing gossip echo message to detect directly. Gossip echo messages will be sent to peer nodes periodically. A peer node will be marked as down if a timeout threshold has been met. Since the failure detection is peer to peer, it avoids the delayed message issue mentioned above. - Parallel detection The old failure detector uses shard zero only. This new failure detector utilizes all the shards to perform the failure detection, each shard handling a subset of live nodes. 
For example, if the cluster has 32 nodes and each node has 16 shards, each shard will handle only 2 nodes. In a 16-node cluster where each node has 16 shards, each shard will handle only one peer node. A gossip message will be sent to peer nodes every 2 seconds. The extra echo message traffic produced compared to the old failure detector is negligible. - Deterministic detection Users can configure the failure_detector_timeout_in_ms to set the threshold to mark a node down. It is the maximum time between two successful echo messages before gossip marks a node down. It is easier to understand than the old phi_convict_threshold. - Compatible This patch only uses the existing gossip echo message. Nodes with or without this patch can work together. Fixes #8488 Closes #8036	2021-05-24 10:47:06 +03:00
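The parallel and deterministic parts of the detector described above can be sketched in a few lines of C++. The round-robin split and both function names are assumptions for illustration; the commit message only states that each shard handles a subset of live nodes and that a node is marked down once a configured timeout between successful echo messages is exceeded.

```cpp
#include <chrono>
#include <vector>

// Hypothetical round-robin split of peers across shards.
// Peers are identified by index 0..n_peers-1 for simplicity.
inline std::vector<std::vector<int>> assign_peers(int n_peers, int n_shards) {
    std::vector<std::vector<int>> per_shard(n_shards);
    for (int peer = 0; peer < n_peers; ++peer) {
        per_shard[peer % n_shards].push_back(peer);
    }
    return per_shard;
}

// A peer is considered down once the time since its last successful echo
// exceeds the configured timeout.
inline bool is_down(std::chrono::milliseconds since_last_echo,
                    std::chrono::milliseconds timeout) {
    return since_last_echo > timeout;
}
```

With 32 nodes and 16 shards this split gives each shard 2 nodes, matching the figure in the commit message.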
Avi Kivity	924f93028a	db: data_listeners: remove unused field _db Remove the unused field and the constructor that populated it.	2021-05-21 20:56:42 +03:00
Pavel Solodovnikov	fff7ef1fc2	treewide: reduce boost headers usage in scylla header files `dev-headers` target is also ensured to build successfully. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>	2021-05-20 01:33:18 +03:00
Avi Kivity	c71d007797	consistency_level: deinline assure_sufficient_live_nodes() assure_sufficient_live_nodes() is a huge template calling other huge templates, and requires "network_topology_strategy.hh". It is inlined in consistency_level.hh. This increases compile time and recompiles. Move the template out-of-line and use "extern template" to instantiate it. This is not ideal as new callers would require updates to the instantiated signatures, but I think our goal should be to de-template it completely instead. Meanwhile, this reduces some pain. Ref #1. Closes #8637	2021-05-19 15:03:51 +03:00
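The de-inlining technique named above (moving a template body out of the header and using `extern template`) can be illustrated with a small single-file sketch. The function `count_live` is a hypothetical example, not the actual `assure_sufficient_live_nodes()` signature; the comments mark which part would live in a header and which in a `.cc` file.

```cpp
#include <cstddef>
#include <vector>

// --- header part (hypothetical names): declaration only, no body ---
template <typename Range>
std::size_t count_live(const Range& nodes);

// Tells every includer that this instantiation is provided elsewhere,
// so includers do not need to see or compile the template body.
extern template std::size_t count_live<std::vector<int>>(const std::vector<int>&);

// --- .cc part: the body and the single explicit instantiation ---
template <typename Range>
std::size_t count_live(const Range& nodes) {
    std::size_t n = 0;
    for (const auto& alive : nodes) {
        if (alive) {
            ++n;
        }
    }
    return n;
}

template std::size_t count_live<std::vector<int>>(const std::vector<int>&);
```

The drawback the commit message mentions is visible here: a caller using a new `Range` type would need another explicit instantiation added to the `.cc` part.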
Juliusz Stasiewicz	874f4de60c	db/system_keyspace: Add system.status virtual table This change uses the previously introduced memtable_filling_virtual_table to expose nodetool status as a virtual table.	2021-05-12 17:05:35 +02:00
Tomasz Grabiec	57ed93bf44	db/virtual_table: Add a way to specify a range of partitions for virtual table queries. This change introduces a query_restrictions object into the virtual table infrastructure, for now only holding a restriction on partition ranges. That partition range is then implemented into memtable_filling_virtual_table.	2021-05-12 17:05:35 +02:00
Piotr Wojtczak	38720847f2	db/virtual_table: Introduce memtable_filling_virtual_table This change adds a more specific implementation of the virtual table called memtable_filling_virtual_table. It produces results by filling a memtable on each read.	2021-05-12 17:05:34 +02:00
Juliusz Stasiewicz	61a0314952	db: Add virtual tables interface This change introduces the basic interface we expect each virtual table to implement. More specific implementations will then expand upon it if needed.	2021-05-12 17:05:34 +02:00
Juliusz Stasiewicz	8333d66d4e	db: Introduce chained_delegating_reader This change adds a new type of mutation reader whose purpose is to allow inserting operations before an invocation of the proper reader. It takes a future to wait on and only after it resolves will it forward the execution to the underlying flat_mutation_reader implementation.	2021-05-12 17:05:34 +02:00
Avi Kivity	61c7f874cc	Merge 'Add per-service-level timeouts' from Piotr Sarna Ref: #7617 This series adds timeout parameters to service levels. Per-service-level timeouts can be set up in the form of service level parameters, which can in turn be attached to roles. Setting up and modifying role-specific timeouts can be achieved like this: ```cql CREATE SERVICE LEVEL sl2 WITH read_timeout = 500ms AND write_timeout = 200ms AND cas_timeout = 2s; ATTACH SERVICE LEVEL sl2 TO cassandra; ALTER SERVICE LEVEL sl2 WITH write_timeout = null; ``` Per-service-level timeouts take precedence over default timeout values from scylla.yaml, but can still be overridden for a specific query by per-query timeouts (e.g. `SELECT * from t USING TIMEOUT 50ms`). Closes #7913 * github.com:scylladb/scylla: docs: add a paragraph describing service level timeouts test: add per-service-level timeout tests test: add refreshing client state transport: add updating per-service-level params client_state: allow updating per service level params qos: allow returning combined service level options qos: add a way of merging service level options cql3: add preserving default values for per-sl timeouts qos: make getting service level public qos: make finding service level public treewide: remove service level controller from query state treewide: propagate service level to client state sstables: disambiguate boost::find cql3: add a timeout column to LIST SERVICE LEVEL statement db: add extracting service level info via CQL types: add a missing translation for cql_duration cql3: allow unsetting service level timeouts cql3: add validating service level timeout values db: add setting service level params via system_distributed cql3: add fetching service level attrs in ALTER and CREATE cql3: add timeout to service level params qos: add timeout to service level info db,sys_dist_ks: add timeout to the service level table migration_manager: allow table updates with timestamp cql3: allow a null keyword for CQL properties	2021-05-11 18:39:10 +03:00
Nadav Har'El	fb0c4e469a	Merge 'token_metadata: Fix get_all_endpoints to return nodes in the ring' from Asias He The get_all_endpoints() should return the nodes that are part of the ring. A node inside _endpoint_to_host_id_map does not guarantee that the node is part of the ring. To fix, return from _token_to_endpoint_map. Fixes #8534 Closes #8536 * github.com:scylladb/scylla: token_metadata: Get rid of get_all_endpoints_count range_streamer: Handle everywhere_topology range_streamer: Adjust use_strict_sources_for_ranges token_metadata: Fix get_all_endpoints to return nodes in the ring	2021-05-11 18:39:10 +03:00
Piotr Sarna	e8d271fea9	db: add extracting service level info via CQL	2021-05-10 11:45:09 +02:00
Piotr Sarna	6e83054497	cql3: add validating service level timeout values The checks cover proper granularity (1ms) and not using negative values.	2021-05-10 11:00:51 +02:00
Piotr Sarna	7bb34fdede	db: add setting service level params via system_distributed Service level params (various timeout values) are now properly stored in system_distributed.service_levels table.	2021-05-10 10:43:23 +02:00
Piotr Sarna	ef8da7930f	db,sys_dist_ks: add timeout to the service level table In order to be able to store timeouts in the service level table, an appropriate column is added.	2021-05-10 10:10:38 +02:00
Tomasz Grabiec	abe3d7d7d3	Merge 'storage_proxy: use small_vector for vectors of inet_address' from Avi Kivity storage_proxy uses std::vector<inet_address> for small lists of nodes - for replication (often 2-3 replicas per operation) and for pending operations (usually 0-1). These vectors require an allocation, sometimes more than one if reserve() is not used correctly. This series switches storage_proxy to use utils::small_vector instead, removing the allocations in the common case. Test results (perf_simple_query --smp 1 --task-quota-ms 10): ``` before: median 184810.98 tps ( 91.1 allocs/op, 20.1 tasks/op, 54564 insns/op) after: median 192125.99 tps ( 87.1 allocs/op, 20.1 tasks/op, 53673 insns/op) ``` 4 allocations and ~900 instructions are removed (the tps figure is also improved, but it is less reliable due to cpu frequency changes). The type change is unfortunately not contained in storage_proxy - the abstraction leaks to providers of replica sets and topology change vectors. This is sad but IMO the benefits make it worthwhile. I expect more such changes can be applied in storage_proxy, specifically std::unordered_set<gms::inet_address> and vectors of response handles. Closes #8592 * github.com:scylladb/scylla: storage_proxy, treewide: use utils::small_vector inet_address_vector:s storage_proxy, treewide: introduce names for vectors of inet_address utils: small_vector: add print operator for std::ostream hints: messages.hh: add missing #include	2021-05-06 18:00:54 +02:00

2080 Commits