Use utils::chunked_vector instead of std::vector to store CDC stream
sets for tablets.
A CDC stream set usually represents all streams for a specific table and
timestamp, and has one stream id for each tablet of the table. Each
stream id is 16 bytes, so the vector could require quite large
contiguous allocations for a table that has many tablets. Change it to
chunked_vector to avoid large contiguous allocations.
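A minimal sketch of the idea (the stream_id type below is illustrative,
not the actual cdc type):

```cpp
#include "utils/chunked_vector.hh"
#include <array>
#include <cstddef>

// One 16-byte stream id per tablet of the table.
struct stream_id {
    std::array<std::byte, 16> bytes;
};

// Before: std::vector<stream_id> needs one contiguous allocation of
// 16 * tablet_count bytes (e.g. ~1 MiB for a 64k-tablet table).
// After: chunked_vector keeps every allocation bounded in size.
using stream_set = utils::chunked_vector<stream_id>;
```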
Fixes scylladb/scylladb#26791
Closes scylladb/scylladb#26792
Introduce helper functions that can be used for garbage collecting old
CDC streams for tablets-based keyspaces.
- get_new_base_for_gc: finds a new base timestamp given a TTL, such that
all older timestamps and streams can be removed.
- get_cdc_stream_gc_mutations: given the new base timestamp and streams,
builds mutations that update the internal CDC tables and remove the
older streams.
- garbage_collect_cdc_streams_for_table: combines the two functions
above to find a new base and build mutations to update it for a
specific table (sketched below).
- garbage_collect_cdc_streams: builds GC mutations for all CDC tables.
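A rough sketch of how they might compose (the signatures below are
assumptions for illustration, not the real declarations; future,
mutation, table_id, and db_clock are the scylla/seastar types):

```cpp
future<std::vector<mutation>>
garbage_collect_cdc_streams_for_table(table_id id, std::chrono::seconds ttl) {
    // Find a base timestamp such that everything older than it has
    // expired under the TTL and can be removed.
    std::optional<db_clock::time_point> new_base =
            co_await get_new_base_for_gc(id, ttl);
    if (!new_base) {
        co_return std::vector<mutation>{};  // nothing to collect yet
    }
    // Build mutations that install the new base in the internal cdc
    // tables and drop the stream sets older than it.
    co_return co_await get_cdc_stream_gc_mutations(id, *new_base);
}
```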
On tablet split/merge finalization, generate a new CDC timestamp and
stream set for the table, with a new stream for each tablet in the new
tablet map, in order to keep the CDC streams synchronized with the
tablets.
We pick a new timestamp for the streams with a small delay into the
future so that all nodes can learn about the new streams in time, the
same way it is done for vnodes.
The new timestamp and streams are published by adding a mutation to the
cdc_streams_history table that contains the timestamp and the sets of
closed and opened streams relative to the current timestamp.
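Roughly how the timestamp is picked (a sketch; on the vnode path the
slack is derived from ring_delay, and the multiplier below is
illustrative; db_clock is scylla's wall-clock type):

```cpp
// Leave enough slack for every node to learn about the new streams
// before they start operating.
db_clock::time_point new_streams_timestamp(std::chrono::milliseconds ring_delay) {
    return db_clock::now() + 2 * ring_delay;
}
```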
Define two new virtual tables in the system keyspace: cdc_timestamps and
cdc_streams. They expose the internal CDC metadata for tablets-enabled
keyspaces to users consuming the CDC log.
cdc_timestamps lists all timestamps for a table where a stream change
occurred.
cdc_streams additionally lists the current stream sets for each
table and timestamp, as well as the difference (closed and opened
streams) from the previous stream set.
Read the CDC stream metadata from the internal system tables, and store
it in the cdc metadata data structures.
The metadata is stored in the tables as diffs, which is more
storage-efficient, but in memory we store it as full stream sets for
each timestamp. This is more useful because we need to be able to find
a stream given a timestamp and token.
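A self-contained sketch of the reconstruction, with simplified types:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>

using stream_id = std::array<std::byte, 16>;  // simplified
using timestamp = int64_t;                    // stands in for db_clock::time_point

struct stream_diff {
    std::set<stream_id> opened;
    std::set<stream_id> closed;
};

// On disk: a diff per timestamp. In memory: the full stream set per
// timestamp, so a (timestamp, token) lookup needs no history walk.
std::map<timestamp, std::set<stream_id>>
materialize(const std::map<timestamp, stream_diff>& diffs) {
    std::map<timestamp, std::set<stream_id>> full;
    std::set<stream_id> current;
    for (const auto& [ts, d] : diffs) {  // ascending timestamp order
        for (const auto& s : d.closed) { current.erase(s); }
        for (const auto& s : d.opened) { current.insert(s); }
        full.emplace(ts, current);
    }
    return full;
}
```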
In 6.0, the `consistent-topology-changes` experimental feature is
unused and assumed to be true. We remove code branches that
executed when `consistent-topology-changes` was disabled.
When a node enters recovery after being in raft topology mode, topology
operations switch back to legacy mode. We want CDC to keep working when
that happens, so the legacy code needs to be able to access generations
created back in raft mode, so that the node can still properly serve
writes to CDC log tables.
To make this possible, modify the legacy logic to also look for a CDC
generation in the raft tables if it is not found in the legacy tables.
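The lookup order, sketched (the helper names below are hypothetical
stand-ins for the real system-table reads):

```cpp
future<std::optional<cdc::generation_id>> find_cdc_generation() {
    // The normal legacy path first.
    auto gen = co_await load_generation_from_legacy_tables();
    if (!gen) {
        // The node may have entered recovery after running in raft
        // topology mode; the generation then lives in the raft tables.
        gen = co_await load_generation_from_raft_tables();
    }
    co_return gen;
}
```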
In raft topology mode CDC information is propagated through group 0.
Prevent the generation service from reacting to gossiper notifications
after we made the switch to raft mode.
Rather than calling on_change for each particular
application_state, pass an endpoint_state::map_type
with all changed states, to be processed as a batch.
In particular, this allows storage_service::on_change
to call update_peer_info once for all changed states.
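The shape of the interface change, roughly (a sketch; exact parameter
lists elided):

```cpp
// before: one notification per changed application_state
// virtual future<> on_change(gms::inet_address ep,
//                            gms::application_state state,
//                            const gms::versioned_value& value);

// after: one notification carrying the whole batch of changed states
// virtual future<> on_change(gms::inet_address ep,
//                            const gms::endpoint_state::map_type& states);
```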
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
None of the subscribers does anything in before_change.
This is done before changing `on_change` in the following patch.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Now that the endpoint_state isn't changed in place,
we do not need to copy it to each subscriber.
We can rather just pass the lw_shared_ptr holding
a snapshot of it.
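For example (a sketch), a subscriber callback goes from taking a copy
to taking a shared snapshot:

```cpp
// before: virtual future<> on_join(gms::inet_address ep,
//                                  gms::endpoint_state state);
// after: one snapshot shared by all subscribers
// virtual future<> on_join(gms::inet_address ep,
//                          lw_shared_ptr<const gms::endpoint_state> state);
```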
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The legacy_handle_cdc_generation() checks whether the node had
bootstrapped, with the help of a system_keyspace method. It is called in
two cases -- on boot via cdc_generation_service::after_join() and via
gossiper on_...() notifications. The notifications, in turn, are set up
in the very same after_join().
The after_join(), in turn, is called from storage_service explicitly
after the bootstrap state is updated to be "complete", so the check for
the state in legacy_handle_...() seems unnecessary. However, there's
still the case when it may be stepped on -- decommission. When
performed, it calls storage_service::leave_ring(), which updates the
bootstrap state to be "needed", thus preventing the cdc gen. service
from doing anything inside gossiper's on_...() notifications.
It would be more correct to stop the cdc gen. service from handling
gossiper notifications by unsubscribing it, rather than by adding
fragile implicit dependencies on the bootstrap state.
Checks for sys.dist.ks in legacy_handle_...() are kept in the form
of an on-internal-error. The system distributed keyspace is activated by
the storage service even before the bootstrap state is updated and is
never deactivated, but it's good to have this assertion anyway.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Pass permit_id to subscribers when we acquire one
via lock_endpoint. The subscribers then pass it back to
gossiper for paths that acquire lock_endpoint for
the same endpoint, to detect nested locks when the endpoint
is locked with the same permit_id.
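The flow, sketched (the subscriber and the gossiper call below are
illustrative):

```cpp
future<> my_subscriber::on_change(gms::inet_address ep,
                                  /* ... changed states ... */
                                  gms::permit_id pid) {
    // Handing pid back: if this path takes lock_endpoint for the same
    // endpoint under the same permit, the gossiper recognizes the lock
    // as nested instead of deadlocking.
    co_await _gossiper.force_remove_endpoint(ep, pid);  // illustrative call
}
```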
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When a node notices that a new CDC generation was introduced in
`storage_service::topology_state_load`, it updates its internal data
structures that are used when coordinating writes to CDC log tables.
The service uses the system keyspace to, e.g., manage the generation
id, thus it depends on the system_keyspace instance and deserves an
explicit reference.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.
Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.
The changes were applied mechanically with a script, except to
licenses/README.md.
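For example, a dual-licensed file's header shrinks to roughly:

```cpp
// SPDX-License-Identifier: (AGPL-3.0-or-later and Apache-2.0)
```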
Closes #9937
Move replica-oriented classes to the replica namespace. The main
classes moved are ::database, ::keyspace, and ::table, but a few
ancillary classes are also moved. There are certainly classes that
should be moved but aren't (like distributed_loader) but we have
to start somewhere.
References are adjusted treewide. In many cases, it is obvious that
a call site should not access the replica (but the data_dictionary
instead), but that is left for separate work.
scylla-gdb.py is adjusted to look for both the new and old names.
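The shape of the move, illustrated:

```cpp
// before (global namespace): class database; class keyspace; class table;
namespace replica {
class database;
class keyspace;
class table;
}
// Call sites change accordingly, e.g. ::database& -> replica::database&.
```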
If we're upgrading from an older version with the previous CDC streams
format, we'll upgrade it in the background. Background update is needed
since we need the cluster to be available when performing the upgrade,
but at this point we're just starting a node, and may not succeed in
forming a cluster before we shut down.
However, running in the background is dangerous since the objects we
use may stop existing. The code is careful to use reference counting,
but this does not guarantee that other dependencies are still alive,
especially since not all dependencies are expressed via constructor
parameters.
Fix by waiting for the rewrite work in generation_service::stop(). As
long as generation_service is up, the required dependencies should be
working too.
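A sketch of the fix (the member name below is an assumption):

```cpp
future<> generation_service::stop() {
    // Wait for the background streams-format rewrite before tearing
    // down: as long as the service is alive, the dependencies the
    // rewrite uses should be alive too.
    co_await std::move(_cdc_streams_rewrite_complete);  // hypothetical member
    // ... the rest of the shutdown sequence ...
}
```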
Note that there is another change here besides limiting the background
work: checks that were previously done in the foreground (limited to
local tables) are now also done in the background. I don't think
this has any impact.
Note: I expect this to have no real impact. Any CDC users will have
long since upgraded. This is just preparing for other patches that
bring in other dependencies, which cannot be passed via reference
counted pointers, so they expose the existing problem.
This is to push the service towards the general idea that each
component should have its own config, with db::config staying
in main.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
All of them are references taken from 'this'; since the function is
a generation_service method, it can use 'this' directly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The generation service already has all it needs to do it. This
keeps storage_service smaller and less aware of cdc internals.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The recently introduced make_new_generation() method just calls
another one by passing more this->... stuff as arguments. Relax
the flow by teaching the latter to use 'this' directly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It has everything needed onboard. Only two arguments are required -- the
bootstrap tokens and whether or not to inject a delay.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is a follow-up to the previous commit.
Each CDC generation has a timestamp which denotes a logical point in time
when this generation starts operating. That same timestamp is
used to identify the CDC generation. We use this identification scheme
to exchange CDC generations around the cluster.
However, the fact that a generation's timestamp is used as an ID for
this generation is an implementation detail of the currently used method
of managing CDC generations.
Places in the code that deal with the timestamp, e.g. functions which
take it as an argument (such as handle_cdc_generation) are often
interested in the ID aspect, not the "when does the generation start
operating" aspect. They don't care that the ID is a `db_clock::time_point`.
They may sometimes want to retrieve the time point given the ID (such as
do_handle_cdc_generation when it calls `cdc::metadata::insert`),
but they don't care about the fact that the time point actually IS the ID.
In the future we may actually change the specific type of the ID if we
modify the generation management algorithms.
This commit is an intermediate step that will ease the transition in the
future. It introduces a new type, `cdc::generation_id`. Internally it
contains the timestamp, so:
1. if a piece of code doesn't care about the timestamp, it just passes
the ID around
2. if it does care, it can simply access it using the `get_ts` function.
The fact that `get_ts` simply accesses the ID's only field is an
implementation detail.
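Roughly the introduced type, simplified (db_clock is scylla's
wall-clock type):

```cpp
namespace cdc {

// The ID happens to contain exactly the timestamp today; that is an
// implementation detail of the current generation-management scheme.
struct generation_id {
    db_clock::time_point ts;
};

// Code that cares about the "when does the generation start operating"
// aspect goes through get_ts; everything else passes generation_id
// around opaquely.
inline db_clock::time_point get_ts(const generation_id& id) {
    return id.ts;
}

} // namespace cdc
```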
Taking the opportunity, we change the `do_handle_cdc_generation_intercept...`
function to be a standard function, not a coroutine. It turns out that -
depending on the shape of the passed-in argument - the function would
sometimes miscompile (the compiled code would not copy the argument to the
coroutine frame).
Each CDC generation always has a timestamp, but the fact that the
timestamp identifies the generation is an implementation detail.
We abstract away from this detail by using a more generic naming scheme:
a generation "identifier" (whatever that is - a timestamp or something
else).
It's possible that a CDC generation will be identified by more than a
timestamp in the (near) future.
The actual string gossiped by nodes in their application state is left
as "CDC_STREAMS_TIMESTAMP" for backward compatibility.
Some stale comments have been updated.
Currently all management of CDC generations happens in storage_service,
which is a big ball of mud that does many unrelated things.
Previous commits have introduced a new service for managing CDC
generations. This code moves most of the relevant code to this new
service.
However, some part still remains in storage_service: the bootstrap
procedure, which happens inside storage_service, must also do some
initialization regarding CDC generations, for example: on restart it
must retrieve the latest known generation timestamp from disk; on
bootstrap it must create a new generation and announce it to other
nodes. The order of these operations w.r.t. the rest of the startup
procedure is important, hence the startup procedure is the only right
place for them.
Still, what remains in storage_service is a small part of the entire
CDC generation management logic; most of it has been moved to the
new service. This includes listening for generation changes and
updating the data structures for performing CDC log writes (cdc::metadata).
Furthermore these functions now return futures (and are internally
coroutines), where previously they required a seastar::async context.
This commit introduces a new service crafted to handle CDC generation
management: listening and reacting to generation changes in the cluster.
The implementation is a stub for now: the service reacts to generation
changes by simply logging the event.
The commit plugs the service in, initializing it in main and test code,
passing a reference to storage_service and having storage_service start
the service (using the `after_join` method): the service only starts
doing its job after the node joins the token ring (either on bootstrap
or restart).