Files
scylladb/db
Kamil Braun 3155cde9c8 sys_dist_ks: new table for exchanging CDC generations
Currently when a node wants to create and broadcast a new CDC generation
it performs the following steps:
1. choose the generation's stream IDs and mapping (how this is done is
   irrelevant for the current discussion)
2. choose the generation's timestamp by taking the current time
   (according to its local clock) and adding 2 * ring_delay
3. insert the generation's data (mapping and stream IDs) into
   system_distributed.cdc_generation_descriptions, using the
   generation's timestamp as the partition key (we call this table
   the "old internal table" below)
4. insert the generation's timestamp into the "CDC_STREAMS_TIMESTAMP"
   application state.

The timestamp spreads epidemically through the gossip protocol. When
nodes see the timestamp, they retrieve the generation data from the
old internal table.

Unfortunately, due to the schema of the old internal table, where
the entire generation data is stored in a single cell, step 3 may fail for
sufficiently large generations (there is a size threshold for which step
3 will always fail - retrying the operation won't help). Also the old
internal table lies in the system_distributed keyspace that uses
SimpleStrategy with replication factor 3, which is also problematic; for
example, when nodes restart, they must reach at least 2 out of these 3
specific replicas in order to retrieve the current generation (we write
and read the generation data with QUORUM, unless we're a single-node
cluster, where we use ONE). Until this happens, a restarting
node can't coordinate writes to CDC-enabled tables. It would be better
if the node could access the last known generation locally.

The commit introduces a new table for broadcasting generation data with
the following properties:
-  it uses a better schema that stores the data in multiple rows, each
   of manageable size
-  it resides in the `system_distributed_everywhere` keyspace so the
   data will be written to every node in the cluster that has a token in
   the token ring
-  the data will be written using CL=ALL and read using CL=ONE; thanks
   to this, restarting node won't have to communicate with other nodes
   to retrieve the data of the last known generation. Note that writing
   with CL=ALL does not reduce availability: creating a new generation
   *requires* all nodes to be available anyway, because they must learn
   about the generation before their clocks go past the generation's
   timestamp; if they don't, partitions won't be mapped to stream IDs
   consistently across the cluster
-  the partition key is no longer the generation's timestamp. Because it
   was that way in the old internal table, it forced the algorithm to
   choose the timestamp *before* the generation data was inserted into
   the table. What if the inserting took a long time? It increased the
   chance that nodes would learn about the generation too late (after
   their clocks moved past its timestamp). With the new schema we will
   first insert the generation data using a randomly generated UUID as
   the partition key, *then* choose the timestamp, then gossip both the
   timestamp and the UUID. The timestamp and the UUID form the
   "generation identifier" of this new generation; this should explain
   why we introduced the generation_id_v2 type in previous commits.
   Observe that after a node learns about a generation broadcasted using
   this new method through gossip it will retrieve its data very quickly
   since it's one of the replicas and it can use CL=ONE as it was
   written using CL=ALL.

Note that the node is still using the old method - the actual switch
will be done in a later commit.
2021-05-25 16:07:23 +02:00
..
2019-10-25 12:04:44 +02:00