Files

Nadav Har'El 3de09042bb CDC topology change support

Merged pull request https://github.com/scylladb/scylla/pull/5485
by Kamil Braun:

This series introduces the notion of CDC generations: sets of CDC streams
used by the cluster to choose partition keys for CDC log writes.
Each CDC generation begins operating at a specific time point, called the
generation's timestamp (cdc_streams_timestamp in the code).
It continues being used by all nodes in the cluster to generate log writes
until superseded by a new generation.

Generations are chosen so that CDC log writes are colocated with their
corresponding base table writes, i.e. their partition keys (which are CDC
stream identifiers picked from the generation operating at time of making
the write) fall into the same vnode and shard as the corresponding base
table write partition keys. Currently this is probabilistic and not 100%
of log writes will be colocated - this will change in future commits,
after per-table partitioners are implemented.

CDC generations are a global property of the cluster -- they don't depend
on any particular table's configuration. Therefore the old "CDC stream
description tables", which were specific to each CDC-enabled table,
were removed and replaced by a new, global description table inside the
system_distributed keyspace.

A new generation is introduced and supersedes the previous one whenever
we insert new tokens into the token ring, which breaks the colocation
property of the previous generation. The new generation is chosen to
account for the new tokens and restore colocation. This happens when a
new node joins the cluster.

The joining node is responsible for creating the new CDC generation and
informing other nodes about it. It does that by serializing the generation
and inserting it into an internal distributed table ("CDC topology
description table"). If the insert fails, the joining process fails. The
node then announces the generation to other nodes through gossip using the
generation's timestamp, which is the partition key of the inserted
distributed table entry.

Nodes that learn about the new generation through gossip attempt to
retrieve it from the distributed table. This might fail - for example,
if the node is partitioned away from all replicas that hold this
generation's table entry. In that case the node might stop accepting
writes, since it knows that it should send log entries to a new generation
of streams, but it doesn't know what the generation is. The node will keep
trying to retrieve the data in the background until it succeeds or sees
that it is no longer necessary (e.g., because yet another generation
superseded this one). So we give up some availability to achieve safety.
However, this solution is not completely safe (might break consistency
properties): if a node learns about a new generation too late (if gossip
doesn't reach this node in time), the node might send writes to the wrong
(old) generation. In the future we will introduce a transaction-based
approach where we will always make sure that all nodes receive the new
generation before any of them starts using it (and if it's impossible
e.g. due to a network partition, we will fail the bootstrap attempt).
In practice, if the admin makes sure that the cluster works correctly
before bootstrapping a new node, and a network partition doesn't start
in the few seconds window where a new generation is announced, everything
will work as it should.
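
The background retrieval described above can be sketched as follows. This
is only an illustration in Python, not the actual implementation: the
function name and the `fetch`, `still_needed` and `sleep` callbacks are
hypothetical stand-ins for "read the entry from the distributed table" and
"check whether a newer generation has superseded this one".

```python
import time


def retrieve_generation(fetch, still_needed, retry_delay_s=1.0, sleep=time.sleep):
    """Keep trying to fetch a generation until it succeeds or is no longer needed.

    fetch        -- hypothetical callable; returns the generation or raises on failure
    still_needed -- hypothetical callable; False once the generation was superseded
    Returns the generation, or None if we gave up because it is not needed anymore.
    """
    while still_needed():
        try:
            return fetch()
        except Exception:
            # Replicas unreachable (e.g. network partition): wait and retry.
            sleep(retry_delay_s)
    return None
```

While this loop is running, the node cannot know which streams to use, which
is the availability cost mentioned above.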

After the learning node retrieves the generation, it inserts it into an
in-memory data structure called "CDC metadata". This structure is then
used when performing writes to the CDC log -- given the timestamp of the
written mutation, the data structure will return the CDC generation
operating at this time point. CDC metadata might reject the query for
two reasons: if the timestamp belongs to an earlier generation, which
most probably doesn't have the colocation property anymore, or if it
lies too far in the future, where we cannot yet know whether the current
generation will be superseded by a different one (so we don't yet know
the set of streams that this log write should be sent to). If the client
uses server-generated timestamps, the query will never be rejected.
Clients can also use client-generated timestamps, but they must make sure
that their clocks are not too desynchronized with the database --
otherwise some or all of their writes to CDC-enabled tables will be
rejected.
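
A minimal sketch of such a lookup, in Python, assuming generations are
keyed by their starting timestamp in microseconds. The class name, the
exception names and the `max_future_skew_us` parameter are hypothetical and
chosen for illustration only; the real structure lives in the C++ codebase
and may differ.

```python
import bisect


class StaleTimestamp(Exception):
    """The write timestamp belongs to a generation earlier than the current one."""


class TimestampTooFarInFuture(Exception):
    """The write timestamp is beyond the window in which streams are known."""


class CdcMetadata:
    def __init__(self, max_future_skew_us):
        self._timestamps = []   # sorted generation start timestamps
        self._streams = {}      # start timestamp -> list of stream ids
        self._max_future_skew_us = max_future_skew_us  # assumed tolerance

    def insert(self, generation_ts_us, streams):
        if generation_ts_us not in self._streams:
            bisect.insort(self._timestamps, generation_ts_us)
        self._streams[generation_ts_us] = list(streams)

    def get_streams(self, write_ts_us, now_us):
        if write_ts_us > now_us + self._max_future_skew_us:
            raise TimestampTooFarInFuture(write_ts_us)
        # Generation operating at a time point: latest one starting at or before it.
        idx = bisect.bisect_right(self._timestamps, write_ts_us) - 1
        current = bisect.bisect_right(self._timestamps, now_us) - 1
        if idx < 0 or idx < current:
            raise StaleTimestamp(write_ts_us)
        return self._streams[self._timestamps[idx]]
```

With generations starting at 100 and 200, a write stamped 150 is accepted
while the first generation is current, and rejected as stale once the second
one has taken over.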

In the case of a rolling upgrade, where we restart nodes that were
previously running without CDC, we act a bit differently - there is no
naturally selected joining node which must propose a new generation.
We have to select such a node using other means. For this we use a bully
approach: every node compares its host id with host ids of other nodes
and if it finds that it has the greatest host id, it becomes responsible
for creating the first generation.
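
The selection rule can be illustrated with a short Python sketch. It
assumes host ids are UUIDs compared by their integer value, which may not
match the ordering the real code uses; the function name is hypothetical.

```python
import uuid


def should_create_first_generation(my_host_id, other_host_ids):
    """Return True if this node has the greatest host id among all known nodes."""
    return all(my_host_id > other for other in other_host_ids)


# Example with three nodes; only the node holding the greatest id returns True.
node_a = uuid.UUID(int=1)
node_b = uuid.UUID(int=2)
node_c = uuid.UUID(int=3)
```

Each node evaluates the rule locally against the host ids it knows about, so
no extra coordination round is needed as long as all nodes see the same set
of ids.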

This change also fixes the way of choosing values of the "time" column
of CDC log writes: the timeuuid is chosen in a way which preserves
ordering of corresponding base table mutations (the timestamp of this
timeuuid is equal to the base table mutation timestamp).
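
To illustrate the idea, here is a Python sketch that builds a version 1
(time-based) UUID whose embedded timestamp equals a given mutation timestamp.
It assumes mutation timestamps are microseconds since the Unix epoch; the
function names and the zero defaults for `clock_seq` and `node` are
illustrative and do not reflect how the actual code fills those fields.

```python
import uuid

# Number of 100-ns intervals between 1582-10-15 (UUID epoch) and 1970-01-01.
UUID_EPOCH_OFFSET_100NS = 0x01B21DD213814000


def timeuuid_from_micros(ts_us, clock_seq=0, node=0):
    """Build a version 1 UUID whose timestamp field encodes ts_us."""
    t = ts_us * 10 + UUID_EPOCH_OFFSET_100NS   # microseconds -> 100-ns units
    time_low = t & 0xFFFFFFFF
    time_mid = (t >> 32) & 0xFFFF
    time_hi_version = ((t >> 48) & 0x0FFF) | (1 << 12)       # version 1
    clock_seq_low = clock_seq & 0xFF
    clock_seq_hi_variant = ((clock_seq >> 8) & 0x3F) | 0x80  # RFC 4122 variant
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version,
                             clock_seq_hi_variant, clock_seq_low, node))


def micros_from_timeuuid(u):
    """Recover the microsecond timestamp embedded in a version 1 UUID."""
    return (u.time - UUID_EPOCH_OFFSET_100NS) // 10
```

Because the timestamp is stored exactly, ordering log rows by the time
embedded in the "time" column reproduces the order of the base table
mutation timestamps. Note that Python's own comparison of `uuid.UUID`
objects is not time-based, so compare the extracted timestamps instead.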

Warning: if you were running a previous CDC version (without topology
change support), make sure to disable CDC on all tables before performing
the upgrade. This will drop the log data -- back it up if needed.

TODO in future patchset: expire CDC generations. Currently, each inserted
CDC generation will stay in the distributed tables forever (until
manually removed by the administrator). When a generation is superseded,
it should become "expired", and 24 hours after expiration, it should be
removed. The distributed tables (cdc_topology_description and
cdc_description) both have an "expired" column which can be used for
this purpose.

Unit tests: dev, debug, release
dtests (dev): https://jenkins.scylladb.com/job/scylla-master/job/byo/job/byo_build_tests_dtest/907/

2020-02-04 10:20:29 +02:00

alternator

docs: add entries for alternator tags and arn

2020-01-29 10:20:05 +01:00

redis

Redis API: Rename options related to Redis API, describe them clearly, and remove unnecessary one.

2019-12-03 17:13:35 +08:00

api_v2.md

Explanation about the API V2

2018-03-28 12:42:04 +03:00

backport.md

Document backport queue and procedure (#5282)

2019-11-17 01:45:24 -08:00

building-packages.md

building-packages doc: Update no specific el7 on path

2020-01-16 12:49:08 +02:00

cdc.md

docs: add documentation about CDC generations

2020-02-03 10:57:31 +01:00

coding-style.md

Merge "Misc documentation cleanup" from Botond

2019-10-15 12:53:49 +02:00

compaction_controller.md

document the compaction controller

2018-01-03 19:58:57 -05:00

cql-extensions.md

doc: fix BYPASS CACHE documentation

2018-11-26 13:04:52 +00:00

debugging.md

docs/debugging.md: fix anchor links

2019-12-29 16:26:26 +02:00

docker-hub.md

dist/docker: Add support for Alternator

2019-09-11 18:01:05 +03:00

hinted_handoff_design.md

docs: hinted_handoff_design.md: high level design of a Hinted Handoff feature

2017-12-14 15:05:47 -05:00

IDL.md

Merge "Misc documentation cleanup" from Botond

2019-10-15 12:53:49 +02:00

in_memory_representation.md

imr: move documentation to docs/

2019-11-28 16:47:52 +02:00

isolation.md

docs/isolation.md: copy-edit

2019-10-10 15:17:28 +03:00

logging.md

docs/logging.md: improvements

2019-01-06 13:20:53 +02:00

lua-type-mapping.md

Lua: Document the conversions between Lua and CQL

2019-11-07 08:41:08 -08:00

maintainer.md

docs: maintainer.md: use command line to merge multi-commit pull requests

2019-12-06 10:59:46 +01:00

metrics.md

docs/metrics.md: document additional "lables"

2019-09-09 15:15:57 +03:00

migrating-from-users-to-roles.md

auth: Migrate legacy data on boot

2018-02-14 14:15:59 -05:00

paged-queries.md

docs: add paged-queries.md design doc

2018-09-03 10:31:44 +03:00

protocol-extensions.md

doc: documented protocol extension for exposing sharding

2018-07-01 15:26:30 +03:00

README.md

docs: add README.md

2019-10-10 14:14:09 +03:00

review-checklist.md

Merge "Misc documentation cleanup" from Botond

2019-10-15 12:53:49 +02:00

row_cache.md

doc: Fix row_cache.md

2018-03-10 16:27:04 +02:00

row_level_repair.md

docs: Add RPC stream doc for row level repair

2019-07-03 08:09:57 +08:00

secondary_index.md

docs: document index target serialization

2019-03-20 09:51:46 +01:00

sstable-scylla-format.md

docs: add CorrectEmptyCounters to sstable-scylla-format

2019-05-23 10:10:24 +02:00

sstables-directory-structure.md

docs: add sstables-directory-structure.md

2019-02-22 11:08:22 +02:00

system_keyspace.md

Improve documentation on the system.large_* tables

2019-08-21 10:36:25 +03:00

system_schema_keyspace.md

docs: init system_schema_keyspace.md with column computations

2019-07-19 11:58:42 +02:00

testing.md

test.py: add a help file

2020-01-30 11:05:02 +02:00

tracing.md

docs: tracing.md: add a "how to get traces" chapter

2017-04-25 21:52:29 -04:00

README.md

Scylla developer documentation

This folder (and its subfolders) contains developer-oriented documentation concerning the Scylla codebase. We also have a wiki, which contains additional developer-oriented documentation. There is currently no clear definition of what goes where, so when looking for something, be sure to check both.

Seastar documentation can be found here.

User documentation can be found on docs.scylladb.com

For information on how to build Scylla and how to contribute, visit HACKING.md and CONTRIBUTING.md.