scylladb

Author	SHA1	Message	Date
Eliran Sinvani	a16b4e407d	internal queries: add caching to some queries Some of the internal queries didn't have caching enabled even though the query may execute in large bursts or relatively often; an example of the former is `default_authorized::authorize` and of the latter is `system_distributed_keyspace::get_service_levels`. Fixes #10335 Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>	2022-05-01 13:30:02 +03:00
Eliran Sinvani	e0c7178e75	query_processor: remove default internal query caching behavior When executing internal queries, it is important that the developer decides whether or not to cache the query, since internal queries are cached indefinitely. It is also important that the programmer is aware of whether caching is going to happen. The code contained two "groups" of `query_processor::execute_internal`: one group has caching by default and the other doesn't. Here we add overloads to eliminate default values for caching behaviour, forcing an explicit parameter for the caching values. All the call sites were changed to reflect the original caching default that was there. Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>	2022-05-01 08:33:55 +03:00
Eliran Sinvani	38b7ebf526	query_processor: make execute_internal caching parameter more verbose `execute_internal` has a parameter to indicate if caching a prepared statement is needed for a specific call. However, this parameter was a boolean, so it was easy to miss its meaning in the various call sites. This replaces the parameter type with a more verbose one so it is clear from the call site what decision was made.	2022-05-01 08:33:55 +03:00
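The effect of the three `execute_internal` commits above can be illustrated with a small, language-neutral sketch. It is written in Python rather than the project's C++, and the names `CacheInternal` and `QueryProcessor` as modeled here are hypothetical stand-ins, not the actual Scylla API. It only shows the idea: the caching decision is a dedicated type with no default, so every call site has to spell it out.

```python
from enum import Enum


class CacheInternal(Enum):
    """Hypothetical stand-in for the explicit caching parameter."""
    NO = False
    YES = True


class QueryProcessor:
    """Toy processor that counts how often a statement is prepared."""

    def __init__(self):
        self._prepared = {}      # query text -> prepared statement, kept indefinitely
        self.prepare_calls = 0

    def _prepare(self, query):
        self.prepare_calls += 1
        return ("prepared", query)

    def execute_internal(self, query, cache):
        # No default value: a bare bool (or a missing argument) is rejected,
        # so the caching decision is always visible at the call site.
        if not isinstance(cache, CacheInternal):
            raise TypeError("cache must be a CacheInternal value")
        if cache is CacheInternal.YES:
            stmt = self._prepared.get(query)
            if stmt is None:
                stmt = self._prepared[query] = self._prepare(query)
        else:
            stmt = self._prepare(query)
        return stmt
```

A call then reads `qp.execute_internal("SELECT ...", CacheInternal.YES)`, which is harder to misread than a trailing `True`.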
Botond Dénes	53b00ecefe	db/system_distributed_keyspace: add all tables methods Add methods to get the schema of all distributed and distributed everywhere tables respectively.	2022-04-01 10:10:31 +03:00
Pavel Emelyanov	fa4d4beaf1	system_distributed_keyspace: Indentation fix after previous patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-25 13:25:55 +03:00
Pavel Emelyanov	b8d3048104	code,system_keyspace: Relax system_keyspace::load_local_host_id() usage The method is nowadays called from several places: - API - sys.dist.ks. (to update view building info) - storage service prepare_to_join() - set up in main They all, but the last, can use db::config cached value, because it's loaded earlier than any of them (but the last -- that's the loading part itself). Once patched, the load_local_host_id() can avoid checking the cache for that value -- it will not be there for sure. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-03-25 13:23:30 +03:00
Kamil Braun	044e05b0d9	service: migration_manager: `announce`: take a description parameter The description parameter is used for the group 0 history mutation. The default is empty, in which case the mutation will leave the description column as `null`. I filled the parameter in some easy places as an example and left the rest for a follow-up. This is how it looks now in a fresh cluster with a single statement performed by the user: cqlsh> select * from system.group0_history ; key \| state_id \| description ---------+--------------------------------------+------------------------------------------------------ history \| 9ec29cac-7547-11ec-cfd6-77bb9e31c952 \| CQL DDL statement history \| 9beb2526-7547-11ec-7b3e-3b198c757ef2 \| null history \| 9be937b6-7547-11ec-3b19-97e88bd1ca6f \| null history \| 9be784ca-7547-11ec-f297-f40f0073038e \| null history \| 9be52e14-7547-11ec-f7c5-af15a1a2de8c \| null history \| 9be335dc-7547-11ec-0b6d-f9798d005fb0 \| null history \| 9be160c2-7547-11ec-e0ea-29f4272345de \| null history \| 9bdf300e-7547-11ec-3d3f-e577a2e31ffd \| null history \| 9bdd2ea8-7547-11ec-c25d-8e297b77380e \| null history \| 9bdb925a-7547-11ec-d754-aa2cc394a22c \| null history \| 9bd8d830-7547-11ec-1550-5fd155e6cd86 \| null history \| 9bd36666-7547-11ec-230c-8702bc785cb9 \| Add new columns to system_distributed.service_levels history \| 9bd0a156-7547-11ec-a834-85eac94fd3b8 \| Create system_distributed(_everywhere) tables history \| 9bcfef18-7547-11ec-76d9-c23dfa1b3e6a \| Create system_distributed_everywhere keyspace history \| 9bcec89a-7547-11ec-e1b4-34e0010b4183 \| Create system_distributed keyspace	2022-01-24 15:20:37 +01:00
Kamil Braun	a664ac7ba5	treewide: require `group0_guard` when performing schema changes `announce` now takes a `group0_guard` by value. `group0_guard` can only be obtained through `migration_manager::start_group0_operation` and moved, it cannot be constructed outside `migration_manager`. The guard will be a method of ensuring linearizability for group 0 operations.	2022-01-24 15:20:35 +01:00
Kamil Braun	86762a1dd9	service: migration_manager: rename `schema_read_barrier` to `start_group0_operation` 1. Generalize the name so it mentions group 0, which schema will be a strict subset of. 2. Remove the fact that it performs a "read barrier" from the name. The function will be used in general to ensure linearizability of group0 operations - both reads and writes. "Read barrier" is Raft-specific terminology, so it can be thought of as an implementation detail.	2022-01-24 15:12:50 +01:00
Kamil Braun	283ac7fefe	treewide: pass mutation timestamp from call sites into `migration_manager::prepare_*` functions The functions which prepare schema change mutations (such as `prepare_new_column_family_announcement`) would use internally generated timestamps for these mutations. When schema changes are managed by group 0 we want to ensure that timestamps of mutations applied through Raft are monotonic. We will generate these timestamps at call sites and pass them into the `prepare_` functions. This commit prepares the APIs.	2022-01-24 15:12:50 +01:00
Kamil Braun	0af5f74871	db: system_distributed_keyspace: use current time when creating mutations in `start()` When creating or updating internal distributed tables in `system_distributed_keyspace::start()`, hardcoded timestamps were used. There were two reasons for this: - to protect against issue #2129, where nodes would start without synchronizing schema with the existing cluster, creating the tables again, which would override any manual user changes to these tables. The solution was to use small timestamps (like api::min_timestamp) - the user-created schema mutations would always 'win' (because when they were created, they used current time). - to eliminate unnecessary schema sync. If two nodes created these tables concurrently with different timestamps, the schemas would formally be different and would need to merge. This could happen during upgrades when we upgraded from a version which doesn't have these tables or doesn't have some columns. The #2129 workaround is no longer necessary: when nodes start they always have to sync schema with existing nodes; we also don't allow bootstrapping nodes in parallel. The second problem would happen during parallel bootstrap, which we don't allow, or during parallel upgrade. The procedure we recommend is rolling upgrade - where nodes are upgraded one by one. In this case only one node is going to create/update the tables; following upgraded nodes will sync schema first and notice they don't need to do anything. So if procedures are followed correctly, the workaround is not needed. If someone doesn't follow the procedures and upgrades nodes in parallel, these additional schema synchronizations are not a big cost, so the workaround doesn't give us much in this case either. When schema changes are performed by Raft group 0, certain constraints are placed on the timestamps used for mutations. For this we'll need to be able to use timestamps which are generated based on current time.	2022-01-24 15:12:49 +01:00
Avi Kivity	fcb8d040e8	treewide: use Software Package Data Exchange (SPDX) license identifiers Instead of lengthy blurbs, switch to single-line, machine-readable standardized (https://spdx.dev) license identifiers. The Linux kernel switched long ago, so there is strong precedent. Three cases are handled: AGPL-only, Apache-only, and dual licensed. For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0), reasoning that our changes are extensive enough to apply our license. The changes were applied mechanically with a script, except to licenses/README.md. Closes #9937	2022-01-18 12:15:18 +01:00
Gleb Natapov	9ce62bcc33	system_distributed_keyspace: move schema creation code to use raft	2022-01-12 16:40:06 +02:00
Gleb Natapov	459539e812	migration_manager: do not allow creating keyspace with arbitrary timestamp This was needed to fix issue #2129, which only manifested itself with auto_bootstrap set to false. The option is ignored now and we always wait for schema to sync during boot.	2022-01-12 16:33:15 +02:00
Avi Kivity	bbad8f4677	replica: move ::database, ::keyspace, and ::table to replica namespace Move replica-oriented classes to the replica namespace. The main classes moved are ::database, ::keyspace, and ::table, but a few ancillary classes are also moved. There are certainly classes that should be moved but aren't (like distributed_loader) but we have to start somewhere. References are adjusted treewide. In many cases, it is obvious that a call site should not access the replica (but the data_dictionary instead), but that is left for separate work. scylla-gdb.py is adjusted to look for both the new and old names.	2022-01-07 12:04:38 +02:00
Avi Kivity	ae3a360725	database: Move database, keyspace, table classes to replica/ directory The database, keyspace, and table classes represent the replica-only part of the objects after which they are named. Reading from a table doesn't give you the full data, just the replica's view, and it is not consistent since reconciliation is applied on the coordinator. As a first step in acknowledging this, move the related files to a replica/ subdirectory.	2022-01-06 17:07:30 +02:00
Avi Kivity	d768e9fac5	cql3, related: switch to data_dictionary Stop using database (and including database.hh) for schema related purposes and use data_dictionary instead. data_dictionary::database::real_database() is called from several places, for these reasons: - calling yet-to-be-converted code - callers with a legitimate need to access data (e.g. system_keyspace) but with the ::database accessor removed from query_processor. We'll need to find another way to supply system_keyspace with data access. - to gain access to the wasm engine for testing whether user-defined functions compile. We'll have to find another way to do this as well. The change is a straightforward replacement. One case in modification_statement had to change a capture, but everything else was just a search-and-replace. Some files that lost "database.hh" gained "mutation.hh", which they previously had access to through "database.hh".	2021-12-15 13:54:23 +02:00
Gleb Natapov	38e1f85959	migration_manager: drop view_ptr array from announce_column_family_update() No users pass it any longer.	2021-12-11 12:31:07 +02:00
Pavel Emelyanov	beb345c00a	code: Rename get_local_host_id() into load_...() There will appear the future-less method which better deserves the get_ prefix, so give the existing method the load_ one. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-30 10:33:57 +03:00
Avi Kivity	369afe3124	treewide: use coroutine::maybe_yield() instead of co_await make_ready_future() The dedicated API shows the intent, and may be a tiny bit faster. Closes #9382	2021-09-23 12:28:56 +02:00
Kamil Braun	a3f3563828	storage_service: check for existing normal token owners before bootstrapping The bootstrap procedure starts by "waiting for range setup", which means waiting for a time interval specified by the `ring_delay` parameter (30s by default) so the node can receive the tokens of other nodes before introducing its own tokens. However it may sometimes happen that the node doesn't receive the tokens. There are no explicit checks for this. But the code may crash in weird ways if the tokens-received assumption is false, and we are lucky if it does crash (instead of, for example, allowing the node to incorrectly bootstrap, causing data loss in the process). Introduce an explicit check-and-throw-if-false: a bootstrapping node now checks that there's at least one NORMAL token in the token ring, which means that it had to have contacted at least one existing node in the cluster, which means that it received the gossip application states of all nodes from that node; in particular the tokens of all nodes. Also add an assert in CDC code which relies on that assumption (and would cause weird division-by-zero errors if the assumption was false; better to crash on assert than this). Ref #8889. Closes #8896	2021-06-24 13:19:08 +03:00
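The check-and-throw-if-false described in the bootstrap commit above can be sketched as follows. This is an illustrative Python model, not the actual `storage_service` code: the token ring is assumed to be a plain mapping from token to a state string, and `BootstrapError` and `check_normal_token_owners` are hypothetical names.

```python
class BootstrapError(RuntimeError):
    """Raised when the node has not learned about any existing token owner."""


def check_normal_token_owners(token_ring):
    """Return the number of NORMAL tokens, or raise if there are none.

    token_ring is assumed to map token -> state, e.g. "NORMAL" or "BOOTSTRAPPING".
    """
    normal = [token for token, state in token_ring.items() if state == "NORMAL"]
    if not normal:
        # Failing early is preferable to bootstrapping with an empty view of the ring.
        raise BootstrapError("no normal token owners found; refusing to bootstrap")
    return len(normal)
```

Code that later divides by the number of token owners can rely on the returned count being at least one.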
Pavel Solodovnikov	76bea23174	treewide: reduce header interdependencies Use forward declarations wherever possible. Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com> Closes #8813	2021-06-07 15:58:35 +03:00
Avi Kivity	a55b434a2b	treewide: extend copyright statements to present day	2021-06-06 19:18:49 +03:00
Piotr Sarna	389a0a52c9	treewide: revamp workload type for service levels This patch is not backward compatible with its original, but it's considered fine, since the original workload types were not yet part of any release. The changes include: - instead of using 'unspecified' for declaring that there's no workload type for a particular service level, NULL is used for that purpose; NULL is the standard way of representing lack of data - introducing a delete marker, which accompanies NULL and makes it possible to distinguish between wanting to forcibly reset a workload type to unspecified and not wanting to change the previous value - updating the tests accordingly These changes come in as a single patch, because they're intertwined with each other and the tests for workload types are already in place; an attempt to split them proved to be more complicated than it's worth. Tests: unit(release) Closes #8763	2021-05-31 18:18:33 +03:00
Avi Kivity	5f8484897b	Merge 'cdc: use a new internal table for exchanging generations' from Kamil Braun Reopening #8286 since the token metadata fix that allows `Everywhere` strategy tables to work with RBO (#8536) has been merged. --- Currently when a node wants to create and broadcast a new CDC generation it performs the following steps: 1. choose the generation's stream IDs and mapping (how this is done is irrelevant for the current discussion) 2. choose the generation's timestamp by taking the current time (according to its local clock) and adding 2 * ring_delay 3. insert the generation's data (mapping and stream IDs) into system_distributed.cdc_generation_descriptions, using the generation's timestamp as the partition key (we call this table the "old internal table" below) 4. insert the generation's timestamp into the "CDC_STREAMS_TIMESTAMP" application state. The timestamp spreads epidemically through the gossip protocol. When nodes see the timestamp, they retrieve the generation data from the old internal table. Unfortunately, due to the schema of the old internal table, where the entire generation data is stored in a single cell, step 3 may fail for sufficiently large generations (there is a size threshold for which step 3 will always fail - retrying the operation won't help). Also the old internal table lies in the system_distributed keyspace that uses SimpleStrategy with replication factor 3, which is also problematic; for example, when nodes restart, they must reach at least 2 out of these 3 specific replicas in order to retrieve the current generation (we write and read the generation data with QUORUM, unless we're a single-node cluster, where we use ONE). Until this happens, a restarting node can't coordinate writes to CDC-enabled tables. It would be better if the node could access the last known generation locally. 
The commit introduces a new table for broadcasting generation data with the following properties: - it uses a better schema that stores the data in multiple rows, each of manageable size - it resides in a new keyspace that uses EverywhereStrategy so the data will be written to every node in the cluster that has a token in the token ring - the data will be written using CL=ALL and read using CL=ONE; thanks to this, restarting node won't have to communicate with other nodes to retrieve the data of the last known generation. Note that writing with CL=ALL does not reduce availability: creating a new generation requires all nodes to be available anyway, because they must learn about the generation before their clocks go past the generation's timestamp; if they don't, partitions won't be mapped to stream IDs consistently across the cluster - the partition key is no longer the generation's timestamp. Because it was that way in the old internal table, it forced the algorithm to choose the timestamp before the generation data was inserted into the table. What if the inserting took a long time? It increased the chance that nodes would learn about the generation too late (after their clocks moved past its timestamp). With the new schema we will first insert the generation data using a randomly generated UUID as the partition key, then choose the timestamp, then gossip both the timestamp and the UUID. Observe that after a node learns about a generation broadcasted using this new method through gossip it will retrieve its data very quickly since it's one of the replicas and it can use CL=ONE as it was written using CL=ALL. The generation's timestamp and the UUID mentioned in the last point form a "generation identifier" for this new generation. For passing these new identifiers around, we introduce the cdc::generation_id_v2 type. Fixes #7961. 
--- For optimal review experience it is best to first read the updated design notes (you can read them rendered here: https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md), specifically the ["Generation switching"](https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md#generation-switching) section followed by the ["Internal generation descriptions table V1 and upgrade procedure"](https://github.com/kbr-/scylla/blob/cdc-gen-table/docs/design-notes/cdc.md#internal-generation-descriptions-table-v1-and-upgrade-procedure) section, then read the commits in topological order. dtest gating run (dev): https://jenkins.scylladb.com/job/scylla-master/job/byo/job/byo_build_tests_dtest/1160/ unit tests (dev) passed locally Closes #8643 * github.com:scylladb/scylla: docs: update cdc.md with info about the new internal table sys_dist_ks: don't create old CDC generations table on service initialization sys_dist_ks: rename all_tables() to ensured_tables() cdc: when creating new generations, use format v2 if possible main: pass feature_service to cdc::generation_service gms: introduce CDC_GENERATIONS_V2 feature cdc: introduce retrieve_generation_data test: cdc: include new generations table in permissions test sys_dist_ks: increase timeout for create_cdc_desc sys_dist_ks: new table for exchanging CDC generations tree-wide: introduce cdc::generation_id_v2	2021-05-27 17:13:44 +03:00
Piotr Sarna	d45574ed28	sys_dist_ks: fix redundant parsing in get_service_level The routine used for getting service level information already operates on the service level name, but the same information is also parsed once more from a row from an internal table. This parsing is redundant, so it's hereby removed.	2021-05-27 14:31:26 +02:00
Piotr Sarna	7faba19605	sys_dist_ks: make get_service_level exception-safe In order to avoid killing the node if a parsing error occurs, the routine which fetches service level information is made exception-safe.	2021-05-27 14:31:25 +02:00
Piotr Sarna	4816678eb6	cql3: add persisting service level workload type The workload type information can now be set via CQL and it's persisted in the distributed system table.	2021-05-27 13:02:22 +02:00
Kamil Braun	c948573398	sys_dist_ks: don't create old CDC generations table on service initialization The old table won't be created in clusters that are bootstrapped after this commit. It will stay in clusters that were upgraded from a version before this commit. Note that a fully upgraded cluster doesn't automatically create a new generation in the new format. Even if the last generation was created before the upgrade, the cluster will keep using it. A new generation will be created in the new format when either: 1. a new node bootstraps (in the new version), 2. or the user runs checkAndRepairCdcStreams, which has a new check: if the current generation uses the old format, the command will decide that repair is needed, even if the generation is completely fine otherwise (also in the new version). During upgrade, while the CDC_GENERATIONS_V2 feature is still not enabled, the user may still bootstrap a node in the old version of Scylla or run checkAndRepairCdcStreams on a not-yet-upgraded node. In that case a new generation will be created in the old format, using the old table definitions.	2021-05-25 16:07:23 +02:00
Kamil Braun	2835697ac1	sys_dist_ks: rename all_tables() to ensured_tables() The static function `all_tables` in system_distributed_keyspace.cc was used by the `system_distributed_keyspace` service initialization function (`start()`) to ensure that a certain set of tables - which the service provides accessors to - exist in the cluster. For each table in the vector returned by `all_tables()` the function would try to create the table, ignoring the "table already exists" error if it is thrown. The commit renames `all_tables` to `ensured_tables` to better convey the intention of this function and documents its purpose in a comment. We do this because in the future the service may provide accessors to tables which it does not actually create. The example - coming in a later commit - is a table which was created in a previous version of Scylla, and for which we still have to provide accessors for backward compatibility / correct handling of the upgrade procedure, but which we do not want to create in clusters that were freshly created using the new version of Scylla, since in that case these tables would be just unnecessary garbage. We mention this use case in the comment.	2021-05-25 16:07:23 +02:00
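The "create, ignoring the already-exists error" loop that the `ensured_tables` commit above describes might look roughly like this. The sketch is in Python with an in-memory stand-in for the cluster; `Cluster`, `AlreadyExists`, and the table names returned by `ensured_tables()` here are assumptions for illustration only.

```python
class AlreadyExists(Exception):
    """Stand-in for the "table already exists" error."""


class Cluster:
    """Minimal in-memory model of a cluster's set of tables."""

    def __init__(self, tables=()):
        self.tables = set(tables)

    def create_table(self, name):
        if name in self.tables:
            raise AlreadyExists(name)
        self.tables.add(name)


def ensured_tables():
    # Only the tables the service wants to exist; tables kept purely for
    # backward compatibility would have accessors but not appear here.
    return ["service_levels", "cdc_streams_descriptions_v2"]


def start(cluster):
    """Try to create every ensured table; return the ones actually created."""
    created = []
    for name in ensured_tables():
        try:
            cluster.create_table(name)
            created.append(name)
        except AlreadyExists:
            pass  # another node (or an earlier start) already created it
    return created
```

Running `start` twice against the same cluster is harmless: the second call creates nothing.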
Kamil Braun	1c25b9df56	sys_dist_ks: increase timeout for create_cdc_desc If we want to allow larger generations, we may want to give this operation a bit more time.	2021-05-25 16:07:23 +02:00
Kamil Braun	3155cde9c8	sys_dist_ks: new table for exchanging CDC generations Currently when a node wants to create and broadcast a new CDC generation it performs the following steps: 1. choose the generation's stream IDs and mapping (how this is done is irrelevant for the current discussion) 2. choose the generation's timestamp by taking the current time (according to its local clock) and adding 2 * ring_delay 3. insert the generation's data (mapping and stream IDs) into system_distributed.cdc_generation_descriptions, using the generation's timestamp as the partition key (we call this table the "old internal table" below) 4. insert the generation's timestamp into the "CDC_STREAMS_TIMESTAMP" application state. The timestamp spreads epidemically through the gossip protocol. When nodes see the timestamp, they retrieve the generation data from the old internal table. Unfortunately, due to the schema of the old internal table, where the entire generation data is stored in a single cell, step 3 may fail for sufficiently large generations (there is a size threshold for which step 3 will always fail - retrying the operation won't help). Also the old internal table lies in the system_distributed keyspace that uses SimpleStrategy with replication factor 3, which is also problematic; for example, when nodes restart, they must reach at least 2 out of these 3 specific replicas in order to retrieve the current generation (we write and read the generation data with QUORUM, unless we're a single-node cluster, where we use ONE). Until this happens, a restarting node can't coordinate writes to CDC-enabled tables. It would be better if the node could access the last known generation locally. 
The commit introduces a new table for broadcasting generation data with the following properties: - it uses a better schema that stores the data in multiple rows, each of manageable size - it resides in the `system_distributed_everywhere` keyspace so the data will be written to every node in the cluster that has a token in the token ring - the data will be written using CL=ALL and read using CL=ONE; thanks to this, restarting node won't have to communicate with other nodes to retrieve the data of the last known generation. Note that writing with CL=ALL does not reduce availability: creating a new generation requires all nodes to be available anyway, because they must learn about the generation before their clocks go past the generation's timestamp; if they don't, partitions won't be mapped to stream IDs consistently across the cluster - the partition key is no longer the generation's timestamp. Because it was that way in the old internal table, it forced the algorithm to choose the timestamp before the generation data was inserted into the table. What if the inserting took a long time? It increased the chance that nodes would learn about the generation too late (after their clocks moved past its timestamp). With the new schema we will first insert the generation data using a randomly generated UUID as the partition key, then choose the timestamp, then gossip both the timestamp and the UUID. The timestamp and the UUID form the "generation identifier" of this new generation; this should explain why we introduced the generation_id_v2 type in previous commits. Observe that after a node learns about a generation broadcasted using this new method through gossip it will retrieve its data very quickly since it's one of the replicas and it can use CL=ONE as it was written using CL=ALL. Note that the node is still using the old method - the actual switch will be done in a later commit.	2021-05-25 16:07:23 +02:00
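The ordering that the commit above relies on (insert the data first, choose the timestamp afterwards, then gossip both) can be sketched in a few lines. This is a simplified Python model: the table is a dict, gossip is a list, and `create_generation_v2` and its parameters are hypothetical names. The sketch also assumes the timestamp is still computed as current time plus `2 * ring_delay`, as in the old procedure; the commit message does not restate the formula.

```python
import uuid
from datetime import datetime, timedelta, timezone


def create_generation_v2(table, gossip, rows, ring_delay, clock):
    """Insert generation data, then pick the timestamp, then gossip the identifier."""
    # 1. Insert the data under a randomly generated UUID partition key
    #    (written with CL=ALL in the commit; modeled here as a dict write).
    gen_uuid = uuid.uuid4()
    table[gen_uuid] = list(rows)

    # 2. Only now choose the timestamp, so a slow insert cannot eat into the
    #    time nodes have to learn about the generation.
    ts = clock() + 2 * ring_delay

    # 3. Gossip both parts; together they form the generation identifier.
    gen_id = (ts, gen_uuid)
    gossip.append(gen_id)
    return gen_id
```

Passing the clock as a callable makes the ordering explicit: it is only consulted after the insert has completed.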
Kamil Braun	4658adbe18	tree-wide: introduce cdc::generation_id_v2 This is a new type of CDC generation identifier. Compared to old IDs, in addition to the timestamp it contains a UUID. These new identifiers will allow a safer and more efficient algorithm of introducing new generations into a cluster (introduced in a later commit). For now, nodes keep using the old identifier format when creating new generations and whenever they learn about a new CDC generation from gossip they assume that it also is stored in the v1 format. But they do know how to (de)serialize the second format and how to persist new identifiers in local tables.	2021-05-24 17:50:21 +02:00
Piotr Sarna	e8d271fea9	db: add extracting service level info via CQL	2021-05-10 11:45:09 +02:00
Piotr Sarna	6e83054497	cql3: add validating service level timeout values The checks cover proper granularity (1ms) and not using negative values.	2021-05-10 11:00:51 +02:00
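The two checks named in the commit above (1ms granularity, no negative values) amount to something like the following. This is a Python sketch using `datetime.timedelta`; the function name `validate_timeout` and the error messages are hypothetical, and the real validation lives in the CQL layer.

```python
from datetime import timedelta

ONE_MS = timedelta(milliseconds=1)


def validate_timeout(timeout):
    """Accept only non-negative timeouts that are whole multiples of 1ms."""
    if timeout < timedelta(0):
        raise ValueError("timeout must not be negative")
    if timeout % ONE_MS != timedelta(0):
        raise ValueError("timeout granularity is 1ms")
    return timeout
```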
Piotr Sarna	7bb34fdede	db: add setting service level params via system_distributed Service level params (various timeout values) are now properly stored in system_distributed.service_levels table.	2021-05-10 10:43:23 +02:00
Piotr Sarna	ef8da7930f	db,sys_dist_ks: add timeout to the service level table In order to be able to store timeouts in the service level table, an appropriate column is added.	2021-05-10 10:10:38 +02:00
Piotr Sarna	ad661561c8	db: stop using infinite timeout for service level updates Due to a porting bug, the routines for updating service levels used the default infinite timeout for internal CQL queries, which causes Scylla to hang on shutdown. The behavior is now fixed and the routines use the same timeout as the other similar functions - 10s at the time of writing this message.	2021-04-22 09:03:21 +02:00
Avi Kivity	daeddda7cc	treewide: remove inclusions of storage_proxy.hh from headers storage_proxy.hh is huge and includes many headers itself, so remove its inclusions from headers and re-add smaller headers where needed (and storage_proxy.hh itself in source files that need it). Ref #1.	2021-04-20 21:23:00 +03:00
Kamil Braun	617813ba66	sys_dist_ks: new keyspace for system tables with Everywhere strategy `system_distributed_everywhere` is a new keyspace that uses Everywhere replication strategy. This is useful, for example, when we want to store internal data that should be accessible by every node; the data can be written using CL=ALL (e.g. during node operations such as node bootstrap, which require all nodes to be alive - at least currently) and then read by each node locally using CL=ONE (e.g. during node restarts). Closes #8457	2021-04-19 11:22:57 +03:00
Eliran Sinvani	dd74556ad9	service/qos: adding service level table to the distributed keyspace This patch adds the service level table and functions to manipulate it to the distributed keyspace. Message-Id: <b6cb7f311ac1ee6802d8f3d78eac9cf40fe21f68.1609161341.git.sarna@scylladb.com>	2021-04-12 15:58:09 +02:00
Kamil Braun	99fd2244a3	tree-wide: introduce cdc::generation_id type This is a follow-up to the previous commit. Each CDC generation has a timestamp which denotes a logical point in time when this generation starts operating. That same timestamp is used to identify the CDC generation. We use this identification scheme to exchange CDC generations around the cluster. However, the fact that a generation's timestamp is used as an ID for this generation is an implementation detail of the currently used method of managing CDC generations. Places in the code that deal with the timestamp, e.g. functions which take it as an argument (such as handle_cdc_generation) are often interested in the ID aspect, not the "when does the generation start operating" aspect. They don't care that the ID is a `db_clock::time_point`. They may sometimes want to retrieve the time point given the ID (such as do_handle_cdc_generation when it calls `cdc::metadata::insert`), but they don't care about the fact that the time point actually IS the ID. In the future we may actually change the specific type of the ID if we modify the generation management algorithms. This commit is an intermediate step that will ease the transition in the future. It introduces a new type, `cdc::generation_id`. Inside it contains the timestamp, so: 1. if a piece of code doesn't care about the timestamp, it just passes the ID around 2. if it does care, it can simply access it using the `get_ts` function. The fact that `get_ts` simply accesses the ID's only field is an implementation detail. Using the occasion, we change the `do_handle_cdc_generation_intercept...` function to be a standard function, not a coroutine. It turns out that - depending on the shape of the passed-in argument - the function would sometimes miscompile (the compiled code would not copy the argument to the coroutine frame).	2021-04-07 13:47:13 +02:00
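The idea behind `cdc::generation_id` in the commit above, an opaque identifier that happens to wrap a timestamp, can be shown with a small Python analogue. `GenerationId` and `handle_cdc_generation` as written here are illustrative stand-ins, not the C++ definitions.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class GenerationId:
    """Opaque generation identifier; the wrapped timestamp is an implementation detail."""
    ts: datetime

    def get_ts(self):
        # Callers that need the start time ask for it explicitly.
        return self.ts


def handle_cdc_generation(known, gen_id):
    """Code that only cares about identity passes the ID around unchanged."""
    known.add(gen_id)
    return gen_id in known
```

If the identifier later gains more fields (as `generation_id_v2` does with a UUID), callers that only pass the ID around do not need to change.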
Kamil Braun	3cebe99613	sys_dist_ks: update comment at quorum_if_many The comment mentioned tables that no longer exist: their names have changed some time ago. Update the comment to be name-agnostic. Furthermore, the second part of the comment related to a case of "joining a node without bootstrapping". Fortunately this operation is no longer possible (after #6848 which became part of Scylla 4.3) so we can shorten the comment.	2021-04-06 13:15:31 +02:00
Kamil Braun	641040d465	sys_dist_ks: remove dead code (expire_cdc_* functions) These functions were not used anywhere but had to be maintained anyway. When (if) the expiration algorithm actually gets implemented (see issue #7300), the functions can be added back (perhaps they will need to look differently at that time, and it's likely that the `expire` column won't be used in the expiration algorithm in the end anyway).	2021-04-04 13:12:12 +03:00
Kamil Braun	4f3f245188	sys_dist_ks: coroutinize system_distributed_keyspace::start	2021-04-04 13:10:44 +03:00
Calle Wilund	5da0129775	system_distributed_keyspace: Add better routine to get latest cdc gen. timestamp Since we have a table of cdc version timestamps, conveniently sorted in reverse, we can just query this and get the latest known gen ts.	2021-03-03 15:44:54 +00:00
Calle Wilund	5a69250d7e	system_distributed_keyspace: Fix cdc_get_versioned_streams timestamp range With the new scheme for cdc generation management, one of the last changes was to make the time ordering of the stream timestamps reversed. However, cdc_get_versioned_streams forgot to take this into account when sifting out timestamp ranges for stream retrieval (based on low mark). Fixed by doing reverse iteration.	2021-03-03 15:41:42 +00:00
Piotr Sarna	c5214eb096	treewide: remove timeout config from query options Timeout config is now stored in each connection, so there's no point in tracking it inside each query as well. This patch removes timeout_config from query_options and follows by removing now unnecessary parameters of many functions and constructors.	2021-02-25 17:20:27 +01:00
Kamil Braun	9bdd000e97	cdc: rewrite streams to the new description table Nodes automatically ensure that the latest CDC generation's list of streams is present in the streams description table. When a new generation appears, we only need to update the table for this generation; old generations are already inserted. However, we've changed the description table (from `cdc_streams_descriptions` to `cdc_streams_descriptions_v2`). The existing mechanism only ensures that the latest generation appears in the new description table. This commit adds an additional procedure that rewrites the older generations as well, if we find that it is necessary to do so (i.e. when some CDC log tables may contain data in these generations).	2021-02-18 11:44:59 +01:00
Kamil Braun	67d4e5576d	sys_dist_ks: split CDC streams table partitions into clustered rows Until now, the lists of streams in the `cdc_streams_descriptions` table for a given generation were stored in a single collection. This solution has multiple problems when dealing with large clusters (which produce large lists of streams): 1. large allocations 2. reactor stalls 3. mutations too large to even fit in commitlog segments This commit changes the schema of the table as described in issue #7993. The streams are grouped according to token ranges, each token range being represented by a separate clustering row. Rows are inserted in reasonably large batches for efficiency. The table is renamed to enable easy upgrade. On upgrade, the latest CDC generation's list of streams will be (re-)inserted into the new table. Yet another table is added: one that contains only the generation timestamps clustered in a single partition. This makes it easy for CDC clients to learn about new generations. It also enables an elegant two-phase insertion procedure of the generation description: first we insert the streams; only after ensuring that a quorum of replicas contains them, we insert the timestamp. Thus, if any client observes a timestamp in the timestamps table (even using a ONE query), it means that a quorum of replicas must contain the list of streams.	2021-02-18 11:44:59 +01:00
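The two-phase insertion described in the commit above can be sketched as follows. The Python model uses a dict for the streams table and a list for the timestamps table; `publish_generation`, its parameters, and the batch size are assumptions for illustration, and the quorum write is modeled simply as a write that either succeeds or raises.

```python
def publish_generation(streams_table, timestamps_table, ts, streams_by_range, batch_size=2):
    """Insert all stream rows first; publish the timestamp only if that succeeded."""
    rows = list(streams_by_range.items())

    # Phase 1: one clustering row per token range, written in batches.
    for start in range(0, len(rows), batch_size):
        for range_end, streams in rows[start:start + batch_size]:
            streams_table[(ts, range_end)] = list(streams)

    # Phase 2: reached only if every write above succeeded, so a client that
    # sees the timestamp can expect the stream rows to be readable.
    timestamps_table.append(ts)
```

If any write in phase 1 raises, the function exits before phase 2 and the timestamp never becomes visible.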

77 Commits