mirror of
https://github.com/scylladb/scylladb.git
synced 2026-04-24 02:20:37 +00:00
Until now, the lists of streams in the `cdc_streams_descriptions` table for a given generation were stored in a single collection. This solution has multiple problems when dealing with large clusters (which produce large lists of streams): 1. large allocations 2. reactor stalls 3. mutations too large to even fit in commitlog segments This commit changes the schema of the table as described in issue #7993. The streams are grouped according to token ranges, each token range being represented by a separate clustering row. Rows are inserted in reasonably large batches for efficiency. The table is renamed to enable easy upgrade. On upgrade, the latest CDC generation's list of streams will be (re-)inserted into the new table. Yet another table is added: one that contains only the generation timestamps clustered in a single partition. This makes it easy for CDC clients to learn about new generations. It also enables an elegant two-phase insertion procedure of the generation description: first we insert the streams; only after ensuring that a quorum of replicas contains them, we insert the timestamp. Thus, if any client observes a timestamp in the timestamps table (even using a ONE query), it means that a quorum of replicas must contain the list of streams. --- Nodes automatically ensure that the latest CDC generation's list of streams is present in the streams description table. When a new generation appears, we only need to update the table for this generation; old generations are already inserted. However, we've changed the description table (from `cdc_streams_descriptions` to `cdc_streams_descriptions_v2`). The existing mechanism only ensures that the latest generation appears in the new description table. We add an additional procedure that rewrites the older generations as well, if we find that it is necessary to do so (i.e. when some CDC log tables may contain data in these generations). Closes #8116 * github.com:scylladb/scylla: tests: add a simple CDC cql pytest cdc: add config option to disable streams rewriting cdc: rewrite streams to the new description table cql3: query_processor: improve internal paged query API cdc: introduce no_generation_data_exception exception type docs: cdc: mention system.cdc_local table cdc: coroutinize do_update_streams_description sys_dist_ks: split CDC streams table partitions into clustered rows cdc: use chunked_vector for streams in streams_version cdc: remove `streams_version::expired` field system_distributed_keyspace: use mutation API to insert CDC streams storage_service: don't use `sys_dist_ks` before it is started