Reformat indentation, brace placement, lambda formatting, and
line wrapping for consistency.
The seastar logger already checks is_enabled() before formatting
arguments, so explicit guards around trace calls with simple
variable arguments are unnecessary.
AI-assisted: OpenCode / Claude Opus 4.6
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Modify the methods which calculate the default gc mode as well as that
which validates whether repair-mode can be used at all, so both accepts
use of repair-mode on RF=1 tables.
This de-facto changes the default tombstone-gc to repair-mode for all
tables. Documentation is updated accordingly.
Some tests need adjusting:
* cqlpy/test_select_from_mutation_fragments.py: disable GC for some test
cases because this patch makes tombstones they write subject to GC
when using defaults.
* test/cluster/test_mv.py::test_mv_tombstone_gc_not_inherited used
repair-mode as a non-default for the base table and expected the MV to
revert to default. Another mode has to be used as the non-default
(immediate).
* test/cqlpy/test_tools.py::test_scylla_sstable_dump_schema: don't
compare tombstone_gc schema extension when comparing dumped schema vs.
original. The tool's schema loader doesn't have access to the keyspace
definition so it will come up with different defaults for
tombstone-gc.
* test/boost/row_cache_test.cc::test_populating_cache_with_expired_and_nonexpired_tombstones
sets tombstone expiry assuming the tombstone-gc timeout-mode default.
Change the CREATE TABLE statement to set the expected mode.
The patch marks force-gossip-topology-changes as deprecated and removes
tests that use it. There is one test (test_different_group0_ids) which
is marked as xfail instead since it looks like gossiper mode was used
there as a way to easily achieve a certain state, so more investigation
is needed if the tests can be fixed to use raft mode instead.
Closesscylladb/scylladb#28383
This patch increases the compatibility with DynamoDB Streams by integrating the DynamoDB's event type rules (described in https://github.com/scylladb/scylladb/issues/6918) into Alternator. The main changes are:
- introduce a new flag `alternator_streams_strict_compatibility`, meant as a guard of performance-intensive operations that increase the compatibility with DynamoDB Streams. If enabled, Alternator always performs a RBW before a data-modifying operation, and propagates its result to CDC. Then, the old item is compared to the new one, to determine the mutation type (INSERT vs MODIFY). This option is a no-op for tables with disabled Alternator Streams,
- reduce splitting of simple Alternator mutations,
- correctly distinguish event types described in #6918, except for item deletes. Deleting a missing item with DeleteItem, BatchWriteItem, or a missing field with UpdateItem still emit REMOVEs.
To summarize, the emitted events of the data manipulation operations should be as follows:
- DeleteItem/BatchWriteItem.DeleteItem of existing item: REMOVE (OK)
- DeleteItem of nonexistent item: nothing (OK)
- BatchWriteItem.DeleteItem of nonexistent item: nothing (OK)
- PutItem/UpdateItem/BatchWriteItem.PutItem of existing and not equal item: MODIFY (OK)
- PutItem/UpdateItem/BatchWriteItem.PutItem of existing and equal item: nothing (OK)
- PutItem/UpdateItem/BatchWriteItem.PutItem of nonexistent item: INSERT (OK)
No backport is necessary.
Refs https://github.com/scylladb/scylladb/pull/26149
Refs https://github.com/scylladb/scylladb/pull/26396
Refs https://github.com/scylladb/scylladb/issues/26382
Fixes https://github.com/scylladb/scylladb/issues/6918
Closes scylladb/scylladb#26121
* github.com:scylladb/scylladb:
test/alternator: Enable the tests failing because of #6918
alternator, cdc: Don't emit events for no-op removes
alternator, cdc: Don't emit an event for equal items
alternator/streams, cdc: Differentiate item replace and item update in CDC
alternator: Change the return type of rmw_operation_return
config: Add alternator_streams_strict_compatibility flag
cdc: Don't split a row marker away from row cells
For tables of special types that can be located: MV, CDC, and paxos
table, we should not use tombstone_gc=repair mode because colocated
tablets are never repaired, hence they will not have repair_time set and
will never be GC'd using 'repair' mode.
Add precompiled header support to CMakeLists.txt and configure.py -
it improves compilation time by approximately 10%.
New header `stdafx.hh` is added, don't include it manually -
the compiler will include it for you. The header contains includes from
external libraries used by Scylla - seastar, standard library,
linux headers and zlib.
The feature is enabled by default, use CMake option `Scylla_USE_PRECOMPILED_HEADER`
or configure.py --disable-precompiled-header to disable.
The feature should be disabled, when trying to check headers - otherwise
you might get false negatives on missing includes from seastar / abseil and so on.
Note: following configuration needs to be added to ccache.conf:
sloppiness = pch_defines,time_macros,include_file_mtime,include_file_ctime
Closesscylladb/scylladb#26617
When we enable CDC on a table, Scylla creates a log table for it.
It has default properties, but the user may change them later on.
Furthermore, it's possible to detach that log table by simply
disabling CDC on the base table:
```cql
/* Create a table with CDC enabled. The log table is created. */
CREATE TABLE ks.t (pk int PRIMARY KEY) WITH cdc = {'enabled': true};
/* Detach the log table. */
ALTER TABLE ks.t WITH cdc = {'enabled': false};
/* Modify a property of the log table. */
ALTER TABLE ks.t_scylla_cdc_log WITH bloom_filter_fp_chance = 0.13;
```
The log table can also be reattached by enabling CDC on the base table
again:
```cql
/* Reattach the log table */
ALTER TABLE ks.t WITH cdc = {'enabled': true};
```
However, because the process of reattachment goes through the same
code that created it in the first place, the properties of the log
table are rolled back to their default values. This may be confusing
to the user and, if unnoticed, also have other consequences, e.g.
affecting performance.
To prevent that, we ensure that the properties are preserved.
A reproducer test,
`test_log_table_preserves_properties_after_reattachment`, has been
provided to verify that the changes are correct. It fails before this
commit.
Another test, `test_log_table_preserves_id_after_reattachment`, has
also been added because the current implementation sets properties
and the ID separately.
Fixesscylladb/scylladb#25523
We extract the portion of the code responsible for creating columns
in a CDC log table to a separate, dedicated function. This should
improve the overall readability of the function (and also making it
very short now).
We extract the portion of the code responsible for setting the default
properties of a CDC log table to a separate function. This should
improve the overall readability of the function. Also, it should be
helpful when modifying the code later on in this commit series.
When we drop a column from a CDC log table, we set the column drop
timestamp a few seconds into the future. This can cause unexpected
problems if a user tries to recreate a CDC column too soon, before
the drop timestamp has passed.
To prevent this issue, when creating a CDC column we check its
creation timestamp against the existing drop timestamp, if any, and
fail with an informative error if the recreation attempt is too soon.
When dropping a column from a CDC log table, set the column drop
timestamp several seconds into the future.
If a value is written to a column concurrently with dropping that
column, the value's timestamp may be after the column drop timestamp. If
this value is also flushed to an SSTable, the SSTable would be
corrupted, because it considers the column missing after the drop
timestamp and doesn't allow values for it.
While this issue affects general tables, it especially impacts CDC tables
because this scenario can occur when writing to a table with CDC preimage
enabled while dropping a column from the base table. This happens even if
the base mutation doesn't write to the dropped column, because CDC log
mutations can generate values for a column even if the base mutation doesn't.
For general tables, this issue can be avoided by simply not writing to a
column while dropping it.
We fix this for the more problematic case of CDC log tables by setting
the column drop timestamp several seconds into the future, ensuring that
writes concurrent with column drops are much less likely to have
timestamps greater than the column drop timestamp.
Fixesscylladb/scylladb#26340
When generating CDC log mutations for some base mutation, use a CDC schema that is compatible with the base schema.
The compatible CDC schema has for every base column a corresponding CDC column with the same name. If using a non-compatible schema, we may encounter a situation, especially during ALTER, that we have a mutation with a base column set with some value, but the CDC schema doesn't have a column by that name. This would cause the user request to fail with an error.
We add to the schema object a schema_ptr that for CDC-enabled tables points to the schema object of the CDC table that is compatible with the schema. It is set by the schema merge algorithm when creating the schema for a table that is created or altered. We use the fact that a base table and its CDC table are created and altered in the same group0 operation, and this way we can find and set the cdc schema for a base table.
When transporting the base schema as a frozen schema between shards, we transport with it the frozen cdc schema as well.
The patch starts with a series of refactoring commits that make extending the frozen schema easier and cleans up some duplication in the code about the frozen schema. We combine the two types `frozen_schema_with_base_info` and `view_schema_and_base_info` to a single type `extended_frozen_schema` that holds a frozen schema with additional data that is not part of the schema mutations but needs to be transported with it to unfreeze it - base_info, and the frozen cdc schema which is added in a later commit.
Fixes https://github.com/scylladb/scylladb/issues/26405
backport not needed - enhancement
Closesscylladb/scylladb#24960
* github.com:scylladb/scylladb:
test: cdc: test cdc compatible schema
cdc: use compatiable cdc schema
db: schema_applier: create schema with pointer to CDC schema
db: schema_applier: extract cdc tables
schema: add pointer to CDC schema
schema_registry: remove base_info from global_schema_ptr
schema_registry: use extended_frozen_schema in schema load
schema_registry: replace frozen_schema+base_info with extended_frozen_schema
frozen_schema: extract info from schema_ptr in the constructor
frozen_schema: rename frozen_schema_with_base_info to extended_frozen_schema
use utils::chunked_vector instead of std::vector to store cdc stream
sets for tablets.
a cdc stream set usually represents all streams for a specific table and
timestamp, and has a stream id per each tablet of the table. each stream
id is represented by 16 bytes. thus the vector could require quite large
contiguous allocations for a table that has many tablets. change it to
chunked_vector to avoid large contiguous allocations.
Fixesscylladb/scylladb#26791Closesscylladb/scylladb#26792
Deletes that don't change the state of the database visible to the user
(e.g. an attempt to delete a missing item) shouldn't produce a cdc log.
This commit addresses this DynamoDB compatibility issue if the delete is
a partition delete, a row delete, or a cell delete. It works under the
assumption that the change was produced by Alternator. This means that
it doesn't support range deletes, static row deletes, deletes of
collection cells other than a map, etc. See also its parent commit,
which introduces the methods that this commit extends.
This commit handles the following cases:
- `DeleteItem of nonexistent item: nothing`,
- `BatchWriteItem.DeleteItem of nonexistent item: nothing`.
Refs https://github.com/scylladb/scylladb/pull/26121
This commit adds a function that compares split mutations with the
`row_state`, that was selected as a preimage or propagated through
cdc options by a caller. If the items are equal, the corresponding log
row isn't generated. The result being that creating an item with
BatchWriteItem, PutItem, or UpdateItem doesn't emit an INSERT/MODIFY
event if exactly identical item already exists.
Comparing the items may be costly, so this logic is controlled by
`alternator_streams_compabitiblity` flag.
This commit handles the following cases:
- `PutItem/UpdateItem/BatchWriteItem.PutItem of an existing and equal
item: nothing`
This commit improves compatibility with DynamoDB streams by changing the
emitted events when creating/updating an item. Replace/update operations
of an existing item emit a MODIFY, whereas replacing/updating a missing
item results in an INSERT. If the state of the item doesn't change after
applying the operation, no event is emitted.
This commit handles the following cases:
- `PutItem/UpdateItem/BatchWriteItem.PutItem of an existing and not equal item: MODIFY`
- `PutItem/UpdateItem/BatchWriteItem.PutItem of a nonexistent item: INSERT`
Refs https://github.com/scylladb/scylladb/issues/6918
CDC log table records a mutation as a sequence of log rows that record
an atomic change (i.e. a row marker, tombstones, etc.), whereas a
mutation in Alternator Streams always appears as a single log row. The
type of operation is determined based on the type of the last log row in
CDC.
As a result, updates that create a row always appeared to Alternator
Streams as an update (row marker + data), rather than an insert. This
commit makes them a single log row. Its operation type is insert if it
contains a row marker, and an update otherwise, which gives results
consistent with DynamoDB Streams.
Until this patch, CDC haven't fetched a preimage for mutations
containing only a partition tombstone. Therefore, single-row deletions
in a table witout a clustering key didn't include a preimage, which was
inconsistent with single-row clustered deletions. This commit addresses
this inconsistency.
Second reason is compatibility with DynamoDB Streams, which doesn't
support entire-partition deletes. Alternator uses partition tombstones
for single-row deletions, though, and in these cases the 'OldImage' was
missing from REMOVE records.
Fixes https://github.com/scylladb/scylladb/issues/26382Closesscylladb/scylladb#26578
when loading CDC streams metadata for tablets from the tables, read only
new entries from the history table instead of reading all entries. This
improves the CDC metadata reloading, making it more efficient and
predictable.
the CDC metadata is loaded as part of group0 reload whenever the
internal CDC tables are modified. on tablet split / merge, we create a
new CDC timestamp and streams by writing them to the cdc_streams_history
table by group0 operation, and when it's applied we reload the in-memory
CDC streams map by reading from the tables and constructing the updated map.
Previously, on every update, we would read the entire
cdc_streams_history entries for the changed table, constructing all its
streams and creating a new map from scratch.
We improve this now by reading only new entries from cdc_streams_history
and append them to the existing map. we can do this because we only
append new entries to cdc_streams_history with higher timestamp than all
previous entries.
This makes this reloading more efficient and predictable, because
previously we would read a number of entries that depends on the number
of tablets splits and merges, which increases over time and is
unbounded, whereas now we read only a single stream set on each update.
Fixesscylladb/scylladb#26732
introduce helper functions that can be used for garbage collecting old
cdc streams for tablets-based keyspaces.
- get_new_base_for_gc: finds a new base timestamp given a TTL, such that
all older timestamps and streams can be removed.
- get_cdc_stream_gc_mutations: given new base timestamp and streams,
builds mutations that update the internal cdc tables and remove the
older streams.
- garbage_collect_cdc_streams_for_table: combines the two functions
above to find a new base and build mutations to update it for a
specific table
- garbage_collect_cdc_streams: builds gc mutations for all cdc tables
in the CDC log transformer, when augmenting a base mutation, use the CDC
log schema that is compatible with the base schema, if set.
Now that the base schema has a pointer to its CDC schema, we can use it
instead of getting the current schema from the db, which may not be
compatible with the base schema.
The compatible CDC schema may not be set if the cluster is not using
raft mode for schema. In this case, we maintain the previous behavior.
UserIdentity is a map of two fields in GetRecords responses, which
always has the same value. It may be missing, or contain a constant
object with value `{"type": "Service", "principalId":
"dynamodb.amazonaws.com"}`. Currently, the latter is set only for
`REMOVE`s triggered by TTL.
This commit introduces two new CDC operation types: `service_row_delete`
and `service_partition_delete`, emitted in place of `row_delete` and
`partition_delete`. Alternator Streams treats them as regular `REMOVE`s,
but in addition adds the `userIdentity` field to the record.
This change may break existing Scylla libraries for reading raw CDC
tables, but we doubt that anybody has this use case.
Refs https://github.com/scylladb/scylladb/pull/26149
Refs https://github.com/scylladb/scylladb/pull/26121
Fixes https://github.com/scylladb/scylladb/issues/11523Closesscylladb/scylladb#26460
The series adds an experimental flag for strongly consistent tables and extends "CREATE KEYSPACE" ddl with `consistency` option that allows specifying the consistency mode for the keyspace.
Closesscylladb/scylladb#26116
* github.com:scylladb/scylladb:
schema: Allow configuring consistency setting for a keyspace
db: experimental consistent-tablets option
We want to add strongly consistent tables as an option. We will have
two kind of strongly consistent tables: globally consistent and locally
consistent. The former means that requests from all DCs will be globally
linearisable while the later - only requests to the same DCs will be
linearisable. To allow configuring all the possibilities the patch
adds new parameter to a keyspace definition "consistency" that can be
configured to be `eventual`, `global` or `local`. Non eventual setting
is supported for tablets enabled keyspaces only. Since we want to start
with implementing local consistency configuring global consistency will
result in an error for now.
This commit adds support to pass a preimage selected by an upper layer
to CDC. The responsibility for the correctness of the preimage (i.e. the
selected columns, whether it's up to date, etc.) lies with the caller.
It may be improved in the future by validating the preimage, e.g. by
"slicing" the received preimage to the necessary columns.
The motivation behind this change was to reduce the number of
read-before-writes and avoid reading the row twice for Alternator
Streams in an increased compatibility mode with DynamoDB. This is to be
added in a following commit. Until now, this commit should be a no-op.
This will allow us to communicate with CDC from higher layers. We plan
to use it to reduce the number of read-before-writes with preimages by
passing the row selected in upper layers.
The namespace usage in this directory is very inconsistent, with files
and classes scattered in:
* global namespace
* namespace compaction
* namespace sstables
With cases, where all three used in the same file. This code used to
live in sstables/ and some of it still retains namespace sstables as a
heritage of that time. The mismatch between the dir (future module) and
the namespace used is confusing, so finish the migration and move all
code in compaction/ to namespace compaction too.
This patch, although large, is mechanic and only the following kind of
changes are made:
* replace namespace sstable {} with namespace compaction {}
* add namespace compaction {}
* drop/add sstables::
* drop/add compaction::
* move around forward-declarations so they are in the correct namespace
context
This refactoring revealed some awkward leftover coupling between
sstables and compaction, in sstables/sstable_set.cc, where the
make_sstable_set() methods of compaction strategies are implemented.
As requested in #22104, moved the files and fixed other includes and build system.
Moved files:
- combine.hh
- collection_mutation.hh
- collection_mutation.cc
- converting_mutation_partition_applier.hh
- converting_mutation_partition_applier.cc
- counters.hh
- counters.cc
- timestamp.hh
Fixes: #22104
This is a cleanup, no need to backport
Closesscylladb/scylladb#25085
initial implementation to support CDC in tablets-enabled keyspaces.
The design is described in https://docs.google.com/document/d/1qO5f2q5QoN5z1-rYOQFu6tqVLD3Ha6pphXKEqbtSNiU/edit?usp=sharing
It is followed closely for the most part except "Deciding when to change streams" - instead, streams are changed synchronously with tablet split / merge.
Instead of the stream switching algorithm with the double writes, we use a scheme similar to the previous method for vnodes - we add the new streams with timestamp that is sufficiently far into the future.
In this PR we:
* add new group0-based internal system tables for tablet stream metadata and loading it into in-memory CDC metadata
* add virtual tables for CDC consumers
* the write coordinator chooses a stream by looking up the appropriate stream in the CDC metadata
* enable creating tables with CDC enabled in tablets-enabled keyspaces. tablets are allocated for the CDC table, and a stream is created per each tablet.
* on tablet resize (split / merge), the topology coordinator creates a new stream set with a new stream for each new tablet.
* the cdc tablets are co-located with the base tablets
Fixes https://github.com/scylladb/scylladb/issues/22576
backport not needed - new feature
update dtests: https://github.com/scylladb/scylla-dtest/pull/5897
update java cdc library: https://github.com/scylladb/scylla-cdc-java/pull/102
update rust cdc library: https://github.com/scylladb/scylla-cdc-rust/pull/136Closesscylladb/scylladb#23795
* github.com:scylladb/scylladb:
docs/dev: update CDC dev docs for tablets
doc: update CDC docs for tablets
test: cluster_events: enable add_cdc and drop_cdc
test/cql: enable cql cdc tests to run with tablets
test: test_cdc_with_alter: adjust for cdc with tablets
test/cqlpy: adjust cdc tests for tablets
test/cluster/test_cdc_with_tablets: introduce cdc with tablets tests
cdc: enable cdc with tablets
topology coordinator: change streams on tablet split/merge
cdc: virtual tables for cdc with tablets
cdc: generate_stream_diff helper function
cdc: choose stream in tablets enabled keyspaces
cdc: rename get_stream to get_vnode_stream
cdc: load tablet streams metadata from tables
cdc: helper functions for reading metadata from tables
cdc: colocate cdc table with base
cdc: remove streams when dropping CDC table
cdc: create streams when allocating tablets
migration_listener: add on_before_allocate_tablet_map notification
cdc: notify when creating or dropping cdc table
cdc: move cdc table creation to pre_create
cdc: add internal tables for cdc with tablets
cdc: add cdc_with_tablets feature flag
cdc: add is_log_schema helper
Allow to create CDC tables in a tablets-enabled keyspace when all nodes
in the cluster support the cdc_with_tablets feature.
Fixesscylladb/scylladb#22576
on tablet split/merge finalization, generate a new CDC timestamp and
stream set for the table with a new stream for each tablet in the new
tablet map, in order to maintain synchronization of the CDC streams with
the tablets.
We pick a new timestamp for the streams with a small delay into the
future so that all nodes can learn about the new streams in time, in the
same way it's done for vnodes.
the new timestamp and streams are published by adding a mutation to the
cdc_streams_history table that contains the timestamp and the sets of
closed and opened streams from the current timestamp.
Define two new virtual tables in system keyspace: cdc_timestamps and
cdc_streams. They expose the internal cdc metadata for tablets-enabled
keyspace to be consumed by users consuming the CDC log.
cdc_timestamps lists all timestamps for a table where a stream change
occured.
cdc_streams list additionally the current streams sets for each
table and timestamp, as well as difference - closed and opened streams -
from the previous stream set.
When choosing a CDC stream to generate CDC log writes to, if the
keyspace uses tablets, we need to choose a stream according to the
relevant metadata which is specific to tablets-enabled keyspaces.
We define the method get_tablet_stream that given a table, write
timestamp, and token, returns the stream that the log entry should
be written to.
The method works by looking up the stream metadata of the table, then
finding the relevant stream set by timestamp, and finally finding
the stream that covers the token range that contains the token.
the get_stream method is relevant only for vnode-based keyspaces. next
we will introduce a new method to get a stream in a tablets-based
keyspace. prepare for this by renaming get_stream to get_vnode_stream.
Read the CDC stream metadata from the internal system tables, and store
it in the cdc metadata data structures.
The metadata is stored in the tables as diffs which is more storage
efficient, but when in-memory we store it as full stream sets for each
timestamp. This is more useful because we need to be able to find a
stream given timestamp and token.
When dropping a CDC log table in a tablets-enabled keyspace, remove all
metadata about the table's CDC streams from the internal CDC tables,
since the streams can't be read anymore.
Similarly, when dropping a tablets-enabled keyspace, remove metadata of
all streams belonging to tables in the keyspace.
When allocating tablets for a CDC table, create the initial CDC stream
set. We create one stream per each tablet, each stream covering the
corresponding token range.
When creating a CDC table by updating an existing base table and
enabling CDC, notify about the table creation so subscribers can act on
it. This is needed in particular for notifying the tablet allocator
when creating a CDC table so that it will allocate tablets for the CDC
table.
Also, when dropping a CDC table, notifying about the dropped table. This
is needed for the tablet allocator to remove the tablet map of the CDC
table.
When creating a new table with CDC enabled, we create also a CDC log
table by adding the CDC table's mutations in the same operation.
Previously, it works by the CDC log service subscribing to
on_before_create_column_family and adding the CDC table's mutations
there when being notified about a new created table.
The problem is that when we create the tables we also create their
tablet maps in the tablet allocator, and we want to created the two
tables as co-located tables: we allocate a tablet map for the base
table, and the CDC table is co-located with the base table.
This doesn't work well with the previous approach because the
notification that creates the CDC table is the same notification that
the tablet allocator creates the base tablet map, so the two operations
are independent, but really we want the tablet allocator to work on both
tables together, so that we have the base table's schema and tablet map
when we create the CDC table's co-located tablet map.
In order to achieve this, we want to create and add the CDC table's
schema, and only after that notify using before_create_column_families
with a vector that contains both the base table and CDC table. The
tablet allocator will then have all the information it needs to create
the co-located tablet map.
We move the creation of the CDC log table - instead of adding the
table's mutations in on_before_create_column_family, we create the table
schema and add it to the new tables vector in
on_pre_create_column_families, which is called by the migration manager
in do_prepare_new_column_families_announcement. The migration manager
will then create and add all mutations for creating the tables, and
notify about the tables being created together.