23 Commits

Author SHA1 Message Date
Piotr Wieczorek
a32e8091a9 alternator, cdc: Don't emit an event for equal items
This commit adds a function that compares split mutations with the
`row_state`, which was selected as a preimage or propagated through
cdc options by a caller. If the items are equal, the corresponding log
row isn't generated. As a result, creating an item with
BatchWriteItem, PutItem, or UpdateItem doesn't emit an INSERT/MODIFY
event if an exactly identical item already exists.

Comparing the items may be costly, so this logic is controlled by the
`alternator_streams_compabitiblity` flag.

This commit handles the following cases:
- `PutItem`/`UpdateItem`/`BatchWriteItem.PutItem` of an existing and equal
  item: nothing
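
The comparison described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `item`, `event_for_write_sketch`, and the `compat_enabled` parameter are hypothetical names, items are reduced to string maps, and the choice of MODIFY for every overwrite is a simplification rather than the actual Alternator logic.

```cpp
#include <map>
#include <optional>
#include <string>

// Hypothetical stand-in for an Alternator item: attribute name -> value.
using item = std::map<std::string, std::string>;

// Returns the event name to log, or std::nullopt when no log row should
// be generated. `existing` plays the role of the preimage (`row_state`).
// `compat_enabled` is a hypothetical parameter standing in for the
// compatibility flag that gates the comparison.
std::optional<std::string> event_for_write_sketch(const std::optional<item>& existing,
                                                  const item& written,
                                                  bool compat_enabled) {
    if (!existing) {
        return std::string("INSERT");   // nothing was there before
    }
    if (compat_enabled && *existing == written) {
        return std::nullopt;            // identical item: emit nothing
    }
    return std::string("MODIFY");       // simplification: any other overwrite
}
```

With the flag off, the potentially costly comparison is skipped and an event is always produced.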
2025-10-30 08:38:30 +01:00
Piotr Wieczorek
e3fde8087a cdc: Don't split a row marker away from row cells
The CDC log table records a mutation as a sequence of log rows, each of
which records an atomic change (i.e. a row marker, tombstones, etc.),
whereas a mutation in Alternator Streams always appears as a single log
row. The type of operation is determined based on the type of the last
log row in CDC.

As a result, updates that create a row always appeared to Alternator
Streams as an update (row marker + data), rather than an insert. This
commit makes them a single log row. Its operation type is insert if it
contains a row marker, and an update otherwise, which gives results
consistent with DynamoDB Streams.
2025-10-30 07:40:31 +01:00
Ernest Zaslavsky
5ba5aec1f8 treewide: Move mutation related files to a mutation directory
As requested in #22104, moved the files and fixed other includes and build system.

Moved files:
 - combine.hh
 - collection_mutation.hh
 - collection_mutation.cc
 - converting_mutation_partition_applier.hh
 - converting_mutation_partition_applier.cc
 - counters.hh
 - counters.cc
 - timestamp.hh

Fixes: #22104

This is a cleanup, no need to backport

Closes scylladb/scylladb#25085
2025-09-24 13:23:38 +03:00
Avi Kivity
f3eade2f62 treewide: relicense to ScyllaDB-Source-Available-1.0
Drop the AGPL license in favor of a source-available license.
See the blog post [1] for details.

[1] https://www.scylladb.com/2024/12/18/why-were-moving-to-a-source-available-license/
2024-12-18 17:45:13 +02:00
Kefu Chai
94e36d4af4 auth: do not include unused headers
these unused includes were identified by clangd. see
https://clangd.llvm.org/guides/include-cleaner#unused-include-warning
for more details on the "Unused include" warning.

this change addresses the leftover of 850ee7e170a.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#19467
2024-06-25 12:11:28 +03:00
Kefu Chai
6c06751640 cdc: not include unused headers
these unused includes were identified by clangd. see
https://clangd.llvm.org/guides/include-cleaner#unused-include-warning
for more details on the "Unused include" warning.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16725
2024-01-11 09:13:37 +02:00
Avi Kivity
69a385fd9d Introduce schema/ module
Schema related files are moved there. This excludes schema files that
also interact with mutations, because the mutation module depends on
the schema. Those files will have to go into a separate module.

Closes #12858
2023-02-15 11:01:50 +02:00
Avi Kivity
fcb8d040e8 treewide: use Software Package Data Exchange (SPDX) license identifiers
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.

Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.

The changes were applied mechanically with a script, except to
licenses/README.md.

Closes #9937
2022-01-18 12:15:18 +01:00
Avi Kivity
ae3a360725 database: Move database, keyspace, table classes to replica/ directory
The database, keyspace, and table classes represent the replica-only
part of the objects after which they are named. Reading from a table
doesn't give you the full data, just the replica's view, and it is not
consistent since reconciliation is applied on the coordinator.

As a first step in acknowledging this, move the related files to
a replica/ subdirectory.
2022-01-06 17:07:30 +02:00
Avi Kivity
a55b434a2b treewide: extend copyright statements to present day 2021-06-06 19:18:49 +03:00
Avi Kivity
14a4173f50 treewide: make headers self-sufficient
In preparation for some large header changes, fix up any headers
that aren't self-sufficient by adding needed includes or forward
declarations.
2021-04-20 21:23:00 +03:00
Calle Wilund
46ea8c9b8b cdc: Add an "end-of-record" column to
Fixes #7435

Adds an "eor" (end-of-record) column to the cdc log. It is non-null only
on the last row of each timestamp group, i.e. at the end of a single
source "event".

A client can use this as a shortcut to know whether or not it has a
full cdc "record" for a given source mutation (single row change).
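
As a rough illustration of the rule above, the sketch below marks the last row of every timestamp group. The `log_row` struct and `mark_end_of_record_sketch` are hypothetical names, not the real CDC log schema, and the rows are assumed to be already ordered by time.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical, simplified CDC log row.
struct log_row {
    std::int64_t time;                  // stands in for the cdc$time timeuuid
    int batch_seq_no;
    std::optional<bool> end_of_record;  // non-null only on the last row of a group
};

// Assumes `rows` is sorted by `time`. Sets end_of_record on the last row
// of every group that shares a time value and clears it elsewhere.
void mark_end_of_record_sketch(std::vector<log_row>& rows) {
    for (std::size_t i = 0; i < rows.size(); ++i) {
        const bool last_in_group =
            (i + 1 == rows.size()) || (rows[i + 1].time != rows[i].time);
        if (last_in_group) {
            rows[i].end_of_record = true;
        } else {
            rows[i].end_of_record.reset();
        }
    }
}
```

A reader that has seen a row with a non-null `end_of_record` knows it holds every row of that record.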

Closes #7436
2020-10-26 09:39:27 +02:00
Piotr Dulikowski
20b236d27d cdc: don't update partition state when not needed
In some cases, tracking the state of processed rows inside `transformer`
is not needed at all. We don't need to do it if either:

- Preimage and postimage are disabled for the table,
- Only preimage is enabled and we are processing the last timestamp.

This commit disables updating the state in the cases listed above.
2020-07-08 15:36:41 +02:00
Piotr Dulikowski
24b50ffbc8 cdc: add interface for producing pre/postimages
Introduces new methods to the change_processor interface that will cause
it to produce pre/postimage rows for requested clustering key, or for
static row.

Introduces logic in split.cc responsible for calling pre/postimage
methods of the change_processor interface. This does not have any effect
on generated CDC log mutations yet, because the transformer class has
empty implementations in place of those methods.
2020-07-08 15:36:41 +02:00
Piotr Dulikowski
82ddeb1992 cdc: track batch_no inside transformer
Move tracking of batch_no inside the transformer.
2020-07-08 15:36:41 +02:00
Piotr Dulikowski
7b47f84965 cdc: move cdc$time generation to transformer
Generate the timeuuid on the transformer side, which allows simplifying
the change_processor interface.
2020-07-08 15:36:41 +02:00
Piotr Dulikowski
51d97be0b3 cdc: introduce change_processor interface
This allows for a more refined use of the transformer by the
for_each_change function (now named "process_changes_with_splitting").

The change_processor interface exposes two methods so far:
begin_timestamp, and process_change (previously named "transform").
By separating those two and exposing them,
process_changes_with_splitting can cause the transformer to generate
fewer CDC log mutations: only one for each timestamp in the batch.
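
A minimal sketch of how such an interface might be driven is shown below. Only the method names `begin_timestamp` and `process_change` come from the commit message; the signatures, the `_sketch` types, and `counting_processor` are assumptions made for illustration.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical shape of the interface; the real signatures may differ.
class change_processor_sketch {
public:
    virtual ~change_processor_sketch() = default;
    // Called once before the first change of each timestamp.
    virtual void begin_timestamp(std::int64_t ts) = 0;
    // Called for every individual change.
    virtual void process_change(const std::string& change) = 0;
};

struct timestamped_change {
    std::int64_t ts;
    std::string change;
};

// Assumes `changes` is sorted by timestamp.
void process_changes_sketch(const std::vector<timestamped_change>& changes,
                            change_processor_sketch& processor) {
    bool first = true;
    std::int64_t current = 0;
    for (const auto& c : changes) {
        if (first || c.ts != current) {
            processor.begin_timestamp(c.ts);  // start one log mutation per timestamp
            current = c.ts;
            first = false;
        }
        processor.process_change(c.change);
    }
}

// Toy processor that only counts calls.
class counting_processor : public change_processor_sketch {
public:
    int log_mutations = 0;
    int changes = 0;
    void begin_timestamp(std::int64_t) override { ++log_mutations; }
    void process_change(const std::string&) override { ++changes; }
};
```

Feeding three changes with two distinct timestamps to `counting_processor` yields two `begin_timestamp` calls and three `process_change` calls.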
2020-07-08 15:36:40 +02:00
Piotr Dulikowski
f907cab156 cdc: remove redundant schema arguments from cdc functions
A `mutation` object already has a reference to its schema. It does not
make sense to call functions changed in this commit with a different
schema.
2020-07-08 15:36:40 +02:00
Botond Dénes
e0284bb9ee treewide: add missing headers and/or forward declarations 2020-03-23 09:29:45 +02:00
Kamil Braun
3200d415da cdc: use a single timeuuid value for a batch of changes
If a batch update is performed with a sequence of changes with a single
timestamp, they will now show up in CDC with a single timeuuid in the
`time` column, distinguished by different `batch_seq_no` values.
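
The resulting row layout could look roughly like this. The sketch assumes a plain integer `time_id` in place of a real timeuuid, and `log_row` and `log_batch_sketch` are hypothetical names.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified CDC log row.
struct log_row {
    std::uint64_t time_id;  // stands in for the timeuuid in the `time` column
    int batch_seq_no;       // distinguishes rows that share one time value
    std::string payload;
};

// All changes that share one write timestamp receive the same time_id and
// consecutive batch_seq_no values, starting from 0.
std::vector<log_row> log_batch_sketch(std::uint64_t time_id,
                                      const std::vector<std::string>& changes) {
    std::vector<log_row> rows;
    int batch_seq_no = 0;
    for (const auto& change : changes) {
        rows.push_back({time_id, batch_seq_no++, change});
    }
    return rows;
}
```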

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-03-05 12:32:57 +01:00
Kamil Braun
292eba9da0 cdc: replace split with for_each_change
`for_each_change` is like `split` but it doesn't return a vector of
mutations representing each change; instead, it takes as a parameter
a function which gets called on each mutation.

This reduces memory usage and allows preserving common context
when handling each change (will be useful in the next commits).
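
The callback style can be sketched as follows. This is an illustration only: `cell_write`, `change`, and `for_each_change_sketch` are hypothetical stand-ins, and the grouping uses timestamps alone for brevity.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one atomic write inside a mutation.
struct cell_write {
    std::string column;
    std::int64_t timestamp;
};

using change = std::vector<cell_write>;

// Calls `on_change` once per group of writes sharing a timestamp, in
// ascending timestamp order. Only one `change` is alive at a time, and
// the callback can capture whatever context it needs across calls.
template <typename Func>
void for_each_change_sketch(std::vector<cell_write> writes, Func&& on_change) {
    std::stable_sort(writes.begin(), writes.end(),
                     [](const cell_write& a, const cell_write& b) {
                         return a.timestamp < b.timestamp;
                     });
    std::size_t i = 0;
    while (i < writes.size()) {
        const std::int64_t ts = writes[i].timestamp;
        change current;
        while (i < writes.size() && writes[i].timestamp == ts) {
            current.push_back(std::move(writes[i]));
            ++i;
        }
        on_change(current);
    }
}
```

A caller can pass a lambda that captures shared state by reference, which is how the common context would be kept between changes in this sketch.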

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-03-05 12:05:08 +01:00
Kamil Braun
529d30ef66 cdc: add split function
This function takes a mutation and returns a set of mutations, each
representing a separate change with a single timestamp and ttl.
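
One way to picture the splitting is grouping writes by their (timestamp, ttl) pair. The types and `split_sketch` below are hypothetical simplifications and do not reflect the real `mutation` class.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one atomic write inside a mutation.
struct cell_write {
    std::string column;
    std::int64_t timestamp;
    std::optional<std::int32_t> ttl;  // empty when the write has no ttl
};

using change = std::vector<cell_write>;

// Groups writes so that every returned change carries exactly one
// (timestamp, ttl) pair. Changes come out ordered by that pair.
std::vector<change> split_sketch(const std::vector<cell_write>& writes) {
    using key = std::pair<std::int64_t, std::optional<std::int32_t>>;
    std::map<key, change> groups;
    for (const auto& w : writes) {
        groups[key{w.timestamp, w.ttl}].push_back(w);
    }
    std::vector<change> result;
    for (auto& entry : groups) {
        result.push_back(std::move(entry.second));
    }
    return result;
}
```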
2020-03-03 13:17:51 +01:00
Kamil Braun
b5c944370e cdc: add should_split function
The function checks if there are multiple timestamps and/or ttls inside
a mutation, which means separate changes should be created for this
mutation in CDC.
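
The check amounts to asking whether every write shares a single (timestamp, ttl) pair. A minimal sketch, assuming a flattened list of writes and the hypothetical names `cell_write` and `should_split_sketch`:

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in for one atomic write inside a mutation.
struct cell_write {
    std::int64_t timestamp;
    std::optional<std::int32_t> ttl;  // empty when the write has no ttl
};

// Returns true when the writes do not all share one (timestamp, ttl)
// pair, meaning several separate CDC changes would be needed.
bool should_split_sketch(const std::vector<cell_write>& writes) {
    for (std::size_t i = 1; i < writes.size(); ++i) {
        if (writes[i].timestamp != writes[0].timestamp ||
            writes[i].ttl != writes[0].ttl) {
            return true;
        }
    }
    return false;
}
```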
2020-03-03 13:17:50 +01:00