Commit Graph

467 Commits

Author SHA1 Message Date
Piotr Dulikowski
44c605e59c Merge 'Fix the types of change events in Alternator Streams' from Piotr Wieczorek
This patch increases the compatibility with DynamoDB Streams by integrating the DynamoDB's event type rules (described in https://github.com/scylladb/scylladb/issues/6918) into Alternator. The main changes are:
- introduce a new flag `alternator_streams_strict_compatibility`, meant as a guard of performance-intensive operations that increase the compatibility with DynamoDB Streams. If enabled, Alternator always performs a RBW before a data-modifying operation, and propagates its result to CDC. Then, the old item is compared to the new one, to determine the mutation type (INSERT vs MODIFY). This option is a no-op for tables with disabled Alternator Streams,
- reduce splitting of simple Alternator mutations,
- correctly distinguish event types described in #6918, except for item deletes. Deleting a missing item with DeleteItem, BatchWriteItem, or a missing field with UpdateItem still emit REMOVEs.

To summarize, the emitted events of the data manipulation operations should be as follows:
- DeleteItem/BatchWriteItem.DeleteItem of existing item: REMOVE (OK)
- DeleteItem of nonexistent item: nothing (OK)
- BatchWriteItem.DeleteItem of nonexistent item: nothing (OK)
- PutItem/UpdateItem/BatchWriteItem.PutItem of existing and not equal item: MODIFY (OK)
- PutItem/UpdateItem/BatchWriteItem.PutItem of existing and equal item: nothing (OK)
- PutItem/UpdateItem/BatchWriteItem.PutItem of nonexistent item: INSERT (OK)

No backport is necessary.

Refs https://github.com/scylladb/scylladb/pull/26149
Refs https://github.com/scylladb/scylladb/pull/26396
Refs https://github.com/scylladb/scylladb/issues/26382
Fixes https://github.com/scylladb/scylladb/issues/6918

Closes scylladb/scylladb#26121

* github.com:scylladb/scylladb:
  test/alternator: Enable the tests failing because of #6918
  alternator, cdc: Don't emit events for no-op removes
  alternator, cdc: Don't emit an event for equal items
  alternator/streams, cdc: Differentiate item replace and item update in CDC
  alternator: Change the return type of rmw_operation_return
  config: Add alternator_streams_strict_compatibility flag
  cdc: Don't split a row marker away from row cells
2025-11-30 07:20:22 +01:00
Michael Litvak
868ac42a8b tombstone_gc: don't use 'repair' mode for colocated tables
For tables of special types that can be located: MV, CDC, and paxos
table, we should not use tombstone_gc=repair mode because colocated
tablets are never repaired, hence they will not have repair_time set and
will never be GC'd using 'repair' mode.
2025-11-25 09:15:46 +01:00
Radosław Cybulski
d589e68642 Add precompiled headers to CMakeLists.txt
Add precompiled header support to CMakeLists.txt and configure.py -
it improves compilation time by approximately 10%.

New header `stdafx.hh` is added, don't include it manually -
the compiler will include it for you. The header contains includes from
external libraries used by Scylla - seastar, standard library,
linux headers and zlib.

The feature is enabled by default, use CMake option `Scylla_USE_PRECOMPILED_HEADER`
or configure.py --disable-precompiled-header to disable.

The feature should be disabled, when trying to check headers - otherwise
you might get false negatives on missing includes from seastar / abseil and so on.

Note: following configuration needs to be added to ccache.conf:

    sloppiness = pch_defines,time_macros,include_file_mtime,include_file_ctime

Closes scylladb/scylladb#26617
2025-11-21 12:27:41 +02:00
Dawid Mędrek
0602afc085 cdc: Preserve properties when reattaching log table
When we enable CDC on a table, Scylla creates a log table for it.
It has default properties, but the user may change them later on.
Furthermore, it's possible to detach that log table by simply
disabling CDC on the base table:

```cql
/* Create a table with CDC enabled. The log table is created. */
CREATE TABLE ks.t (pk int PRIMARY KEY) WITH cdc = {'enabled': true};

/* Detach the log table. */
ALTER TABLE ks.t WITH cdc = {'enabled': false};

/* Modify a property of the log table. */
ALTER TABLE ks.t_scylla_cdc_log WITH bloom_filter_fp_chance = 0.13;
```

The log table can also be reattached by enabling CDC on the base table
again:

```cql
/* Reattach the log table */
ALTER TABLE ks.t WITH cdc = {'enabled': true};
```

However, because the process of reattachment goes through the same
code that created it in the first place, the properties of the log
table are rolled back to their default values. This may be confusing
to the user and, if unnoticed, also have other consequences, e.g.
affecting performance.

To prevent that, we ensure that the properties are preserved.

A reproducer test,
`test_log_table_preserves_properties_after_reattachment`, has been
provided to verify that the changes are correct. It fails before this
commit.

Another test, `test_log_table_preserves_id_after_reattachment`, has
also been added because the current implementation sets properties
and the ID separately.

Fixes scylladb/scylladb#25523
2025-11-17 11:56:30 +01:00
Dawid Mędrek
10975bf65c cdc: Extract creating columns in CDC log table to dedicated function
We extract the portion of the code responsible for creating columns
in a CDC log table to a separate, dedicated function. This should
improve the overall readability of the function (and also making it
very short now).
2025-11-17 11:54:48 +01:00
Dawid Mędrek
8bf09ac6f7 cdc: Extract default properties of CDC log tables to dedicated function
We extract the portion of the code responsible for setting the default
properties of a CDC log table to a separate function. This should
improve the overall readability of the function. Also, it should be
helpful when modifying the code later on in this commit series.
2025-11-17 11:50:35 +01:00
Michael Litvak
039323d889 cdc: check if recreating a column too soon
When we drop a column from a CDC log table, we set the column drop
timestamp a few seconds into the future. This can cause unexpected
problems if a user tries to recreate a CDC column too soon, before
the drop timestamp has passed.

To prevent this issue, when creating a CDC column we check its
creation timestamp against the existing drop timestamp, if any, and
fail with an informative error if the recreation attempt is too soon.
2025-11-13 17:00:07 +01:00
Michael Litvak
48298e38ab cdc: set column drop timestamp in the future
When dropping a column from a CDC log table, set the column drop
timestamp several seconds into the future.

If a value is written to a column concurrently with dropping that
column, the value's timestamp may be after the column drop timestamp. If
this value is also flushed to an SSTable, the SSTable would be
corrupted, because it considers the column missing after the drop
timestamp and doesn't allow values for it.

While this issue affects general tables, it especially impacts CDC tables
because this scenario can occur when writing to a table with CDC preimage
enabled while dropping a column from the base table. This happens even if
the base mutation doesn't write to the dropped column, because CDC log
mutations can generate values for a column even if the base mutation doesn't.
For general tables, this issue can be avoided by simply not writing to a
column while dropping it.

We fix this for the more problematic case of CDC log tables by setting
the column drop timestamp several seconds into the future, ensuring that
writes concurrent with column drops are much less likely to have
timestamps greater than the column drop timestamp.

Fixes scylladb/scylladb#26340
2025-11-13 16:59:43 +01:00
Michael Litvak
eefae4cc4e migration_manager: pass timestamp to pre_create
pass the write timestamp as parameter to the
on_pre_create_column_families notification.
2025-11-13 16:59:43 +01:00
Piotr Dulikowski
2e5eb92f21 Merge 'cdc: use CDC schema that is compatible with the base schema' from Michael Litvak
When generating CDC log mutations for some base mutation, use a CDC schema that is compatible with the base schema.

The compatible CDC schema has for every base column a corresponding CDC column with the same name. If using a non-compatible schema, we may encounter a situation, especially during ALTER, that we have a mutation with a base column set with some value, but the CDC schema doesn't have a column by that name. This would cause the user request to fail with an error.

We add to the schema object a schema_ptr that for CDC-enabled tables points to the schema object of the CDC table that is compatible with the schema. It is set by the schema merge algorithm when creating the schema for a table that is created or altered. We use the fact that a base table and its CDC table are created and altered in the same group0 operation, and this way we can find and set the cdc schema for a base table.

When transporting the base schema as a frozen schema between shards, we transport with it the frozen cdc schema as well.

The patch starts with a series of refactoring commits that make extending the frozen schema easier and cleans up some duplication in the code about the frozen schema. We combine the two types `frozen_schema_with_base_info` and `view_schema_and_base_info` to a single type `extended_frozen_schema` that holds a frozen schema with additional data that is not part of the schema mutations but needs to be transported with it to unfreeze it - base_info, and the frozen cdc schema which is added in a later commit.

Fixes https://github.com/scylladb/scylladb/issues/26405

backport not needed - enhancement

Closes scylladb/scylladb#24960

* github.com:scylladb/scylladb:
  test: cdc: test cdc compatible schema
  cdc: use compatiable cdc schema
  db: schema_applier: create schema with pointer to CDC schema
  db: schema_applier: extract cdc tables
  schema: add pointer to CDC schema
  schema_registry: remove base_info from global_schema_ptr
  schema_registry: use extended_frozen_schema in schema load
  schema_registry: replace frozen_schema+base_info with extended_frozen_schema
  frozen_schema: extract info from schema_ptr in the constructor
  frozen_schema: rename frozen_schema_with_base_info to extended_frozen_schema
2025-11-13 10:11:54 +01:00
Michael Litvak
e7dbccd59e cdc: use chunked_vector instead of vector for stream ids
use utils::chunked_vector instead of std::vector to store cdc stream
sets for tablets.

a cdc stream set usually represents all streams for a specific table and
timestamp, and has a stream id per each tablet of the table. each stream
id is represented by 16 bytes. thus the vector could require quite large
contiguous allocations for a table that has many tablets. change it to
chunked_vector to avoid large contiguous allocations.

Fixes scylladb/scylladb#26791

Closes scylladb/scylladb#26792
2025-10-31 13:02:34 +01:00
Piotr Wieczorek
66ac66178b alternator, cdc: Don't emit events for no-op removes
Deletes that don't change the state of the database visible to the user
(e.g. an attempt to delete a missing item) shouldn't produce a cdc log.
This commit addresses this DynamoDB compatibility issue if the delete is
a partition delete, a row delete, or a cell delete. It works under the
assumption that the change was produced by Alternator. This means that
it doesn't support range deletes, static row deletes, deletes of
collection cells other than a map, etc. See also its parent commit,
which introduces the methods that this commit extends.

This commit handles the following cases:
- `DeleteItem of nonexistent item: nothing`,
- `BatchWriteItem.DeleteItem of nonexistent item: nothing`.

Refs https://github.com/scylladb/scylladb/pull/26121
2025-10-30 08:38:30 +01:00
Piotr Wieczorek
a32e8091a9 alternator, cdc: Don't emit an event for equal items
This commit adds a function that compares split mutations with the
`row_state`, that was selected as a preimage or propagated through
cdc options by a caller. If the items are equal, the corresponding log
row isn't generated. The result being that creating an item with
BatchWriteItem, PutItem, or UpdateItem doesn't emit an INSERT/MODIFY
event if exactly identical item already exists.

Comparing the items may be costly, so this logic is controlled by
`alternator_streams_compabitiblity` flag.

This commit handles the following cases:
- `PutItem/UpdateItem/BatchWriteItem.PutItem of an existing and equal
  item: nothing`
2025-10-30 08:38:30 +01:00
Piotr Wieczorek
8c2f60f111 alternator/streams, cdc: Differentiate item replace and item update in CDC
This commit improves compatibility with DynamoDB streams by changing the
emitted events when creating/updating an item. Replace/update operations
of an existing item emit a MODIFY, whereas replacing/updating a missing
item results in an INSERT. If the state of the item doesn't change after
applying the operation, no event is emitted.

This commit handles the following cases:
- `PutItem/UpdateItem/BatchWriteItem.PutItem of an existing and not equal item: MODIFY`
- `PutItem/UpdateItem/BatchWriteItem.PutItem of a nonexistent item: INSERT`

Refs https://github.com/scylladb/scylladb/issues/6918
2025-10-30 07:40:31 +01:00
Piotr Wieczorek
e3fde8087a cdc: Don't split a row marker away from row cells
CDC log table records a mutation as a sequence of log rows that record
an atomic change (i.e. a row marker, tombstones, etc.), whereas a
mutation in Alternator Streams always appears as a single log row. The
type of operation is determined based on the type of the last log row in
CDC.

As a result, updates that create a row always appeared to Alternator
Streams as an update (row marker + data), rather than an insert. This
commit makes them a single log row. Its operation type is insert if it
contains a row marker, and an update otherwise, which gives results
consistent with DynamoDB Streams.
2025-10-30 07:40:31 +01:00
Piotr Wieczorek
2812e67f47 cdc: Emit a preimage for non-clustered tables
Until this patch, CDC haven't fetched a preimage for mutations
containing only a partition tombstone. Therefore, single-row deletions
in a table witout a clustering key didn't include a preimage, which was
inconsistent with single-row clustered deletions. This commit addresses
this inconsistency.

Second reason is compatibility with DynamoDB Streams, which doesn't
support entire-partition deletes. Alternator uses partition tombstones
for single-row deletions, though, and in these cases the 'OldImage' was
missing from REMOVE records.

Fixes https://github.com/scylladb/scylladb/issues/26382

Closes scylladb/scylladb#26578
2025-10-29 17:54:58 +02:00
Michael Litvak
8743422241 cdc: improve cdc metadata loading
when loading CDC streams metadata for tablets from the tables, read only
new entries from the history table instead of reading all entries. This
improves the CDC metadata reloading, making it more efficient and
predictable.

the CDC metadata is loaded as part of group0 reload whenever the
internal CDC tables are modified. on tablet split / merge, we create a
new CDC timestamp and streams by writing them to the cdc_streams_history
table by group0 operation, and when it's applied we reload the in-memory
CDC streams map by reading from the tables and constructing the updated map.

Previously, on every update, we would read the entire
cdc_streams_history entries for the changed table, constructing all its
streams and creating a new map from scratch.

We improve this now by reading only new entries from cdc_streams_history
and append them to the existing map. we can do this because we only
append new entries to cdc_streams_history with higher timestamp than all
previous entries.

This makes this reloading more efficient and predictable, because
previously we would read a number of entries that depends on the number
of tablets splits and merges, which increases over time and is
unbounded, whereas now we read only a single stream set on each update.

Fixes scylladb/scylladb#26732
2025-10-28 08:54:09 +01:00
Michael Litvak
440caeabcb cdc: helpers for garbage collecting old streams for tablets
introduce helper functions that can be used for garbage collecting old
cdc streams for tablets-based keyspaces.

- get_new_base_for_gc: finds a new base timestamp given a TTL, such that
  all older timestamps and streams can be removed.
- get_cdc_stream_gc_mutations: given new base timestamp and streams,
  builds mutations that update the internal cdc tables and remove the
  older streams.
- garbage_collect_cdc_streams_for_table: combines the two functions
  above to find a new base and build mutations to update it for a
  specific table
- garbage_collect_cdc_streams: builds gc mutations for all cdc tables
2025-10-26 11:01:20 +01:00
Radosław Cybulski
621e88ce52 Fix spelling errors
Closes scylladb/scylladb#26652
2025-10-22 16:46:31 +02:00
Michael Litvak
448e14a3b7 cdc: use compatiable cdc schema
in the CDC log transformer, when augmenting a base mutation, use the CDC
log schema that is compatible with the base schema, if set.

Now that the base schema has a pointer to its CDC schema, we can use it
instead of getting the current schema from the db, which may not be
compatible with the base schema.

The compatible CDC schema may not be set if the cluster is not using
raft mode for schema. In this case, we maintain the previous behavior.
2025-10-21 14:14:33 +02:00
Piotr Wieczorek
a3ec6c7d1d alternator/streams: Support userIdentity field for TTL deletions
UserIdentity is a map of two fields in GetRecords responses, which
always has the same value. It may be missing, or contain a constant
object with value `{"type": "Service", "principalId":
"dynamodb.amazonaws.com"}`. Currently, the latter is set only for
`REMOVE`s triggered by TTL.

This commit introduces two new CDC operation types: `service_row_delete`
and `service_partition_delete`, emitted in place of `row_delete` and
`partition_delete`. Alternator Streams treats them as regular `REMOVE`s,
but in addition adds the `userIdentity` field to the record.

This change may break existing Scylla libraries for reading raw CDC
tables, but we doubt that anybody has this use case.

Refs https://github.com/scylladb/scylladb/pull/26149
Refs https://github.com/scylladb/scylladb/pull/26121
Fixes https://github.com/scylladb/scylladb/issues/11523

Closes scylladb/scylladb#26460
2025-10-20 17:15:59 +02:00
Tomasz Grabiec
c4a87453a2 Merge 'Add experimental feature flag for strongly consistent tables and extend kesypace creation syntax to allow specifying consistency mode.' from Gleb Natapov
The series adds an experimental flag for strongly consistent tables  and extends "CREATE KEYSPACE" ddl with `consistency` option that allows specifying the consistency mode for the keyspace.

Closes scylladb/scylladb#26116

* github.com:scylladb/scylladb:
  schema: Allow configuring consistency setting for a keyspace
  db: experimental consistent-tablets option
2025-10-16 21:48:06 +02:00
Gleb Natapov
c255740989 schema: Allow configuring consistency setting for a keyspace
We want to add strongly consistent tables as an option. We will have
two kind of strongly consistent tables: globally consistent and locally
consistent. The former means that requests from all DCs will be globally
linearisable while the later - only requests to the same DCs will be
linearisable.  To allow configuring all the possibilities the patch
adds new parameter to a keyspace definition "consistency" that can be
configured to be `eventual`, `global` or `local`. Non eventual setting
is supported for tablets enabled keyspaces only. Since we want to start
with implementing local consistency configuring global consistency will
result in an error for now.
2025-10-16 13:34:49 +03:00
Piotr Wieczorek
d4581cc442 cdc: Support prefetched preimages
This commit adds support to pass a preimage selected by an upper layer
to CDC. The responsibility for the correctness of the preimage (i.e. the
selected columns, whether it's up to date, etc.) lies with the caller.
It may be improved in the future by validating the preimage, e.g. by
"slicing" the received preimage to the necessary columns.

The motivation behind this change was to reduce the number of
read-before-writes and avoid reading the row twice for Alternator
Streams in an increased compatibility mode with DynamoDB. This is to be
added in a following commit. Until now, this commit should be a no-op.
2025-10-14 07:29:07 +02:00
Piotr Wieczorek
2c1e699864 cdc, storage: Add a struct to pass per-mutation options to CDC
This will allow us to communicate with CDC from higher layers. We plan
to use it to reduce the number of read-before-writes with preimages by
passing the row selected in upper layers.
2025-10-09 12:28:10 +02:00
Piotr Wieczorek
66935bedac cdc: Move operations enum to the top of the namespace 2025-10-09 12:28:10 +02:00
Benny Halevy
da6e2fdb1b locator: Pass topology to replication strategy constructor 2025-10-01 16:06:28 +02:00
Botond Dénes
86ed627fc4 compaction: move code to namespace compaction
The namespace usage in this directory is very inconsistent, with files
and classes scattered in:
* global namespace
* namespace compaction
* namespace sstables

With cases, where all three used in the same file. This code used to
live in sstables/ and some of it still retains namespace sstables as a
heritage of that time. The mismatch between the dir (future module) and
the namespace used is confusing, so finish the migration and move all
code in compaction/ to namespace compaction too.

This patch, although large, is mechanic and only the following kind of
changes are made:
* replace namespace sstable {} with namespace compaction {}
* add namespace compaction {}
* drop/add sstables::
* drop/add compaction::
* move around forward-declarations so they are in the correct namespace
  context

This refactoring revealed some awkward leftover coupling between
sstables and compaction, in sstables/sstable_set.cc, where the
make_sstable_set() methods of compaction strategies are implemented.
2025-09-25 15:03:56 +03:00
Ernest Zaslavsky
5ba5aec1f8 treewide: Move mutation related files to a mutation directory
As requested in #22104, moved the files and fixed other includes and build system.

Moved files:
 - combine.hh
 - collection_mutation.hh
 - collection_mutation.cc
 - converting_mutation_partition_applier.hh
 - converting_mutation_partition_applier.cc
 - counters.hh
 - counters.cc
 - timestamp.hh

Fixes: #22104

This is a cleanup, no need to backport

Closes scylladb/scylladb#25085
2025-09-24 13:23:38 +03:00
Piotr Dulikowski
5f55787e50 Merge 'CDC with tablets' from Michael Litvak
initial implementation to support CDC in tablets-enabled keyspaces.

The design is described in https://docs.google.com/document/d/1qO5f2q5QoN5z1-rYOQFu6tqVLD3Ha6pphXKEqbtSNiU/edit?usp=sharing
It is followed closely for the most part except "Deciding when to change streams" - instead, streams are changed synchronously with tablet split / merge.
Instead of the stream switching algorithm with the double writes, we use a scheme similar to the previous method for vnodes - we add the new streams with timestamp that is sufficiently far into the future.

In this PR we:
* add new group0-based internal system tables for tablet stream metadata and loading it into in-memory CDC metadata
* add virtual tables for CDC consumers
* the write coordinator chooses a stream by looking up the appropriate stream in the CDC metadata
* enable creating tables with CDC enabled in tablets-enabled keyspaces. tablets are allocated for the CDC table, and a stream is created per each tablet.
* on tablet resize (split / merge), the topology coordinator creates a new stream set with a new stream for each new tablet.
* the cdc tablets are co-located with the base tablets

Fixes https://github.com/scylladb/scylladb/issues/22576

backport not needed - new feature

update dtests: https://github.com/scylladb/scylla-dtest/pull/5897
update java cdc library: https://github.com/scylladb/scylla-cdc-java/pull/102
update rust cdc library: https://github.com/scylladb/scylla-cdc-rust/pull/136

Closes scylladb/scylladb#23795

* github.com:scylladb/scylladb:
  docs/dev: update CDC dev docs for tablets
  doc: update CDC docs for tablets
  test: cluster_events: enable add_cdc and drop_cdc
  test/cql: enable cql cdc tests to run with tablets
  test: test_cdc_with_alter: adjust for cdc with tablets
  test/cqlpy: adjust cdc tests for tablets
  test/cluster/test_cdc_with_tablets: introduce cdc with tablets tests
  cdc: enable cdc with tablets
  topology coordinator: change streams on tablet split/merge
  cdc: virtual tables for cdc with tablets
  cdc: generate_stream_diff helper function
  cdc: choose stream in tablets enabled keyspaces
  cdc: rename get_stream to get_vnode_stream
  cdc: load tablet streams metadata from tables
  cdc: helper functions for reading metadata from tables
  cdc: colocate cdc table with base
  cdc: remove streams when dropping CDC table
  cdc: create streams when allocating tablets
  migration_listener: add on_before_allocate_tablet_map notification
  cdc: notify when creating or dropping cdc table
  cdc: move cdc table creation to pre_create
  cdc: add internal tables for cdc with tablets
  cdc: add cdc_with_tablets feature flag
  cdc: add is_log_schema helper
2025-09-18 13:39:37 +02:00
Ernest Zaslavsky
54aa552af7 treewide: Move type related files to a type directory As requested in #22110, moved the files and fixed other includes and build system.
Moved files:
- duration.hh
- duration.cc
- concrete_types.hh

Fixes: #22110

This is a cleanup, no need to backport

Closes scylladb/scylladb#25088
2025-09-17 17:32:19 +03:00
Michael Litvak
1fc3273b27 cdc: enable cdc with tablets
Allow to create CDC tables in a tablets-enabled keyspace when all nodes
in the cluster support the cdc_with_tablets feature.

Fixes scylladb/scylladb#22576
2025-09-17 14:47:12 +02:00
Michael Litvak
98de3f0e86 topology coordinator: change streams on tablet split/merge
on tablet split/merge finalization, generate a new CDC timestamp and
stream set for the table with a new stream for each tablet in the new
tablet map, in order to maintain synchronization of the CDC streams with
the tablets.

We pick a new timestamp for the streams with a small delay into the
future so that all nodes can learn about the new streams in time, in the
same way it's done for vnodes.

the new timestamp and streams are published by adding a mutation to the
cdc_streams_history table that contains the timestamp and the sets of
closed and opened streams from the current timestamp.
2025-09-17 14:47:12 +02:00
Michael Litvak
b8442c6087 cdc: virtual tables for cdc with tablets
Define two new virtual tables in system keyspace: cdc_timestamps and
cdc_streams. They expose the internal cdc metadata for tablets-enabled
keyspace to be consumed by users consuming the CDC log.

cdc_timestamps lists all timestamps for a table where a stream change
occured.

cdc_streams list additionally the current streams sets for each
table and timestamp, as well as difference - closed and opened streams -
from the previous stream set.
2025-09-17 14:47:12 +02:00
Michael Litvak
67410cac4d cdc: generate_stream_diff helper function
This helper functions receives two sets of streams and constructs their
difference - closed and opened streams.
2025-09-17 14:47:12 +02:00
Michael Litvak
5f6bb0af9d cdc: choose stream in tablets enabled keyspaces
When choosing a CDC stream to generate CDC log writes to, if the
keyspace uses tablets, we need to choose a stream according to the
relevant metadata which is specific to tablets-enabled keyspaces.

We define the method get_tablet_stream that given a table, write
timestamp, and token, returns the stream that the log entry should
be written to.

The method works by looking up the stream metadata of the table, then
finding the relevant stream set by timestamp, and finally finding
the stream that covers the token range that contains the token.
2025-09-17 14:47:12 +02:00
Michael Litvak
28cdd81ef0 cdc: rename get_stream to get_vnode_stream
the get_stream method is relevant only for vnode-based keyspaces. next
we will introduce a new method to get a stream in a tablets-based
keyspace. prepare for this by renaming get_stream to get_vnode_stream.
2025-09-17 14:47:12 +02:00
Michael Litvak
9ec4b6ccb1 cdc: load tablet streams metadata from tables
Read the CDC stream metadata from the internal system tables, and store
it in the cdc metadata data structures.

The metadata is stored in the tables as diffs which is more storage
efficient, but when in-memory we store it as full stream sets for each
timestamp. This is more useful because we need to be able to find a
stream given timestamp and token.
2025-09-17 14:47:12 +02:00
Michael Litvak
aa61f074b5 cdc: helper functions for reading metadata from tables
Define functions in system_keyspace that read from the internal cdc
tables and construct the data into internal cdc data structures.
2025-09-17 14:47:12 +02:00
Michael Litvak
9ef0862155 cdc: remove streams when dropping CDC table
When dropping a CDC log table in a tablets-enabled keyspace, remove all
metadata about the table's CDC streams from the internal CDC tables,
since the streams can't be read anymore.

Similarly, when dropping a tablets-enabled keyspace, remove metadata of
all streams belonging to tables in the keyspace.
2025-09-17 14:47:12 +02:00
Michael Litvak
ed25e420f8 cdc: create streams when allocating tablets
When allocating tablets for a CDC table, create the initial CDC stream
set. We create one stream per each tablet, each stream covering the
corresponding token range.
2025-09-17 14:47:12 +02:00
Michael Litvak
fdfe9ebb4c cdc: notify when creating or dropping cdc table
When creating a CDC table by updating an existing base table and
enabling CDC, notify about the table creation so subscribers can act on
it.  This is needed in particular for notifying the tablet allocator
when creating a CDC table so that it will allocate tablets for the CDC
table.

Also, when dropping a CDC table, notifying about the dropped table. This
is needed for the tablet allocator to remove the tablet map of the CDC
table.
2025-09-17 14:47:11 +02:00
Michael Litvak
fed1048059 cdc: move cdc table creation to pre_create
When creating a new table with CDC enabled, we create also a CDC log
table by adding the CDC table's mutations in the same operation.

Previously, it works by the CDC log service subscribing to
on_before_create_column_family and adding the CDC table's mutations
there when being notified about a new created table.

The problem is that when we create the tables we also create their
tablet maps in the tablet allocator, and we want to created the two
tables as co-located tables: we allocate a tablet map for the base
table, and the CDC table is co-located with the base table.

This doesn't work well with the previous approach because the
notification that creates the CDC table is the same notification that
the tablet allocator creates the base tablet map, so the two operations
are independent, but really we want the tablet allocator to work on both
tables together, so that we have the base table's schema and tablet map
when we create the CDC table's co-located tablet map.

In order to achieve this, we want to create and add the CDC table's
schema, and only after that notify using before_create_column_families
with a vector that contains both the base table and CDC table. The
tablet allocator will then have all the information it needs to create
the co-located tablet map.

We move the creation of the CDC log table - instead of adding the
table's mutations in on_before_create_column_family, we create the table
schema and add it to the new tables vector in
on_pre_create_column_families, which is called by the migration manager
in do_prepare_new_column_families_announcement. The migration manager
will then create and add all mutations for creating the tables, and
notify about the tables being created together.
2025-09-17 14:47:11 +02:00
Michael Litvak
daf200facb cdc: add is_log_schema helper
In few places we need to check whether a schema represents a CDC log
table, and we do so by checking whether the table's partitioner is the
CDC partitioner.

Extract this logic to a new utility function to reduce code duplication
and allow reuse.
2025-09-17 14:47:11 +02:00
Piotr Dulikowski
f95808cbe7 Merge 'cdc/generation: Clone topology_description asynchronously' from Dawid Mędrek
An instance of `cdc::topology_description` can be quite big. The vector
it consists of stores as many `token_range_description`s as there are
vnodes, and the size of each `token_range_description` is O(#shards).

Because of that, copying an instance of the type can lead to reactor
stalls. To prevent that, we introduce an asynchronous function copying
the contents on the object.

Reactor stalls were detected in the call to `map_reduce` in
`generation_service::legacy_do_handle_cdc_generation`, so let's start
using the new function there.

A similar scenario occurs in `generation_service::handle_cdc_generation`,
so we modify it too.

Unfortunately, it doesn't seem viable to provide a reproducer of said
problem.

Fixes scylladb/scylladb#24522

Backport: none. Reactor stalls are not critical.

Closes scylladb/scylladb#25730

* github.com:scylladb/scylladb:
  cdc/generation: Delete copy constructors of topology_description
  cdc/generation: Clone topology_description asynchronously
2025-09-03 13:41:58 +02:00
Radosław Cybulski
c242234552 Revert "build: add precompiled headers to CMakeLists.txt"
This reverts commit 01bb7b629a.

Closes scylladb/scylladb#25735
2025-09-03 09:46:00 +03:00
Piotr Dulikowski
762d9ef68f Merge 'cdc: Set tombstone_gc when creating log table' from Dawid Mędrek
Normally, when we create a table, MV, etc., we apply `cf_prop_defs` to the
schema builder via the function `cf_prop_defs::apply_to_builder`. Unfortunately,
that didn't happen when creating CDC log tables, and so we might have missed
some of the properties that would normally be set to some value, even if the
default one.

One particular example of that phenomenon was `tombstone_gc`. For better or
worse, it's not a "standalone property" of a table, but rather part of
`extensions`. [Somewhat related issue: scylladb/scylladb#9722]

That may have and did cause trouble. Consider this scenario:

1. A CDC log table is created.
2. The table does NOT have any value of `tombstone_gc` set.
3. The user edits the table via `ALTER TABLE`. That statement treats the log
   table just like any other one (at least as far as the relevant portion of the
   logic is concerned). Among other things, it uses
   `cf_prop_defs::apply_to_builder`, and as a result, the `tombstone_gc`
   property is set to some value:
   * the default one if the user doesn't specify it in the statement,
   * a custom one if they do.

Why is that a problem?

First of all, it's confusing. When we perform a schema backup and a table uses
CDC, we include an ALTER statement for its corresponding CDC log table (for more
context, see issue scylladb/scylladb#18467 or commit
scylladb/scylladb@f12edbdd95).

There are two consequences for the user here:
1. If the log table had NOT been altered ever since it was created, the
   statement will miss the `tombstone_gc` property as if it couldn't be set for
   it at all. That's confusing!
2. If the log table HAD in fact been altered after its creation, the statement
   will include the `tombstone_gc` property. That's even more confusing (why was
   it not present the first time, but it is now?).

The `tombstone_gc` property should always be set to avoid confusion and
problematic edge cases in tests and to simply be consistent with how other
schema entities work.

The solution we employ is that we always set the property to the default
value. That includes the case when we reattach the log table to the base;
consider the following scenario:

1. Create a table with CDC enabled.
2. Detach the log table by performing `ALTER TABLE ... WITH cdc = {'enabled': false}`.
3. Change the `tombstone_gc` property of the log table.
4. Reattach the log table to the base in the same way as in step 2.

The expected result would be that the new value of `tombstone_gc` would be
preserved after reattaching the log table. However, that's not what will
happen. We decide to stay consistent with how other properties of a log
table behave, and we reset them after every reattachment. We might change that
in the future: see issue scylladb/scylladb#25523.

Two reproducer tests of scylladb/scylladb#25187 are included in the changes.

Backport: The problem is not critical, so it may not be necessary to backport the changes.
That's to be discussed.

Closes scylladb/scylladb#25521

* github.com:scylladb/scylladb:
  cdc: Set tombstone_gc when creating log table
  tombstone_gc: Add overload of get_default_tombstone_gc_mode
  tombstone_gc: Rename get_default_tombstonesonte_gc_mode
2025-09-02 10:20:11 +02:00
Piotr Dulikowski
7ccb50514d Merge 'Introduce view building coordinator' from Michał Jadwiszczak
This patch introduces `view_building_coordinator`, a single entity within whole cluster responsible for building tablet-based views.

The view building coordinator takes slightly different approach than the existing node-local view builder. The whole process is split into smaller view building tasks, one per each tablet replica of the base table.
The coordinator builds one base table at a time and it can choose another when all views of currently processing base table are built.
The tasks are started by setting `STARTED` state and they are executed by node-local view building worker. The tasks are scheduled in a way, that each shard processes only one tablet at a time (multiple tasks can be started for a shard on a node because a table can have multiple views but then all tasks have the same base table and tablet (last_token)). Once the coordinator starts the tasks, it sends `work_on_view_building_tasks` RPC to start the tasks and receive their results.
This RPC is resilient to RPC failure or raft leader change, meaning if one RPC call started a batch of tasks but then failed (for instance the raft leader was changed and caller aborted waiting for the response), next RPC call will attach itself to the already started batch.

The coordinator plugs into handling tablet operations (migration/resize/RF change) and adjusts its tasks accordingly. At the start of each tablet operation, the coordinator aborts necessary view building tasks to prevent https://github.com/scylladb/scylladb/issues/21564. Then, new adjusted tasks are created at the end of the operation.
If the operation fails at any moment, aborted tasks are rollback.

The view building coordinator can also handle staging sstables using process_staging view building tasks. We do this because we don't want to start generating view updates from a staging sstable prematurely, before the writes are directed to the new replica (https://github.com/scylladb/scylladb/issues/19149).

For detailed description check: `docs/dev/view-building-coordinator.md`

Fixes https://github.com/scylladb/scylladb/issues/22288
Fixes https://github.com/scylladb/scylladb/issues/19149
Fixes https://github.com/scylladb/scylladb/issues/21564
Fixes https://github.com/scylladb/scylladb/issues/17603
Fixes https://github.com/scylladb/scylladb/issues/22586
Fixes https://github.com/scylladb/scylladb/issues/18826
Fixes https://github.com/scylladb/scylladb/issues/23930

---

This PR is reimplementation of https://github.com/scylladb/scylladb/pull/21942

Closes scylladb/scylladb#23760

* github.com:scylladb/scylladb:
  test/cluster: add view build status tests
  test/cluster: add view building coordinator tests
  utils/error_injection: allow to abort `injection_handler::wait_for_message()`
  test: adjust existing tests
  utils/error_injection: add injection with `sleep_abortable()`
  db/view/view_builder: ignore `no_such_keyspace` exception
  docs/dev: add view building coordinator documentation
  db/view/view_building_worker: work on `process_staging` tasks
  db/view/view_building_worker: register staging sstable to view building coordinator when needed
  db/view/view_building_worker: discover staging sstables
  db/view/view_building_worker: add method to register staging sstable
  db/view/view_update_generator: add method to process staging sstables instantly
  db/view/view_update_generator: extract generating updates from staging sstables to a method
  db/view/view_update_generator: ignore tablet-based sstables
  db/view/view_building_coordinator: update view build status on node join/left
  db/view/view_building_coordinator: handle tablet operations
  db/view: add view building task mutation builder
  service/topology_coordinator: run view building coordinator
  db/view: introduce `view_building_coordinator`
  db/view/view_building_worker: update built views locally
  db/view: introduce `view_building_worker`
  db/view: extract common view building functionalities
  db/view: prepare to create abstract `view_consumer`
  message/messaging_service: add `work_on_view_building_tasks` RPC
  service/topology_coordinator: make `term_changed_error` public
  db/schema_tables: create/cleanup tasks when an index is created/dropped
  service/migration_manager: cleanup view building state on drop keyspace
  service/migration_manager: cleanup view building state on drop view
  service/migration_manager: create view building tasks on create view
  test/boost: enable proxy remote in some tests
  service/migration_manager: pass `storage_proxy` to `prepare_keyspace_drop_announcement()`
  service/migration_manager: coroutinize `prepare_new_view_announcement()`
  service/storage_proxy: expose references to `system_keyspace` and `view_building_state_machine`
  service: reload `view_building_state_machine` on group0 apply()
  service/vb_coordinator: add currently processing base
  db/system_keyspace: move `get_scylla_local_mutation()` up
  db/system_keyspace: add `view_building_tasks` table
  db/view: add view_building_state and views_state
  db/system_keyspace: add method to get view build status map
  db/view: extract `system.view_build_status_v2` cql statements to system_keyspace
  db/system_keyspace: move `internal_system_query_state()` function earlier
  db/view: ignore tablet-based views in `view_builder`
  gms/feature_service: add VIEW_BUILDING_COORDINATOR feature
2025-08-29 17:28:44 +02:00
Dawid Mędrek
90a2e0d1cc cdc/generation: Delete copy constructors of topology_description
The object might be quite big and lead to reactor stalls. Using its
copy constructor is asking for trouble, so let's explicitly delete it.
2025-08-29 13:54:16 +02:00
Dawid Mędrek
508e00319b cdc/generation: Clone topology_description asynchronously
An instance of `cdc::topology_description` can be quite big. The vector
it consists of stores as many `token_range_description`s as there are
vnodes, and the size of each `token_range_description` is O(#shards).

Because of that, copying an instance of the type can lead to reactor
stalls. To prevent that, we introduce an asynchronous function copying
the contents on the object.

Reactor stalls were detected in the call to `map_reduce` in
`generation_service::legacy_do_handle_cdc_generation`, so let's start
using the new function there.

A similar scenario occurs in `generation_service::handle_cdc_generation`,
so we modify it too.

Unfortunately, it doesn't seem viable to provide a reproducer of said
problem.

Fixes scylladb/scylladb#24522
2025-08-29 13:54:00 +02:00