scylladb/docs/dev/timestamp-conflict-resolution.md
Tomasz Grabiec 1a6f4389ae Merge 'atomic_cell: compare value last' from Benny Halevy
Currently, when two cells have the same write timestamp
and both are alive or expiring, we compare their value first,
before checking if either of them is expiring
and if both are expiring, comparing their expiration time
and ttl value to determine which of them will expire
later or was written later.

This was based on an early version of Cassandra.
However, the Cassandra implementation rightfully changed in
e225c88a65 ([CASSANDRA-14592](https://issues.apache.org/jira/browse/CASSANDRA-14592)),
where the cell expiration is considered before the cell value.

To summarize, the motivation for this change is threefold:
1. Cassandra compatibility
2. Prevent an edge case where a null value is returned by a select query when an expired cell has a larger value than a cell with a later expiration.
3. A generalization of the above: value-based reconciliation may cause a select query to return a mixture of upserts, if multiple upserts use the same timestamp but have different expiration times.  If the cell value is considered before expiration, the select result may contain cells from different inserts, while reconciling based on the expiration times will choose cells consistently from one upsert or the other, as all cells in the respective upsert will carry the same expiration time.

\Fixes scylladb/scylladb#14182

Also, this series:
- updates dml documentation
- updates internal documentation
- updates and adds unit tests and cql pytest reproducing #14182

\Closes scylladb/scylladb#14183

* github.com:scylladb/scylladb:
  docs: dml: add update ordering section
  cql-pytest: test_using_timestamp: add tests for rewrites using same timestamp
  mutation_partition: compare_row_marker_for_merge: consider ttl in case expiry is the same
  atomic_cell: compare_atomic_cell_for_merge: update and add documentation
  compare_atomic_cell_for_merge: compare value last for live cells
  mutation_test: test_cell_ordering: improve debuggability

(cherry picked from commit 87b4606cd6)

Closes #14649
2023-07-12 10:09:56 +03:00


Timestamp conflict resolution

The fundamental rule for ordering cells that insert, update, or delete data in a given row and column is that the cell with the highest timestamp wins.

However, multiple such cells may carry the same TIMESTAMP. In that case, all nodes must resolve the conflict in the same way. Otherwise, if each node picked an arbitrary cell and different nodes reached different results, a read from different replicas would detect the inconsistency and trigger read-repair, which would generate yet another cell that still conflicts with the existing cells, with no guarantee of convergence.

The first tie-breaking rule when two cells have the same write timestamp is that dead cells win over live cells; and if both cells are deleted, the one with the later deletion time prevails.

If both cells are alive, their expiration time is examined. Cells that are written with a non-zero TTL (either implicit, as determined by the table's default TTL, or explicit, USING TTL) are due to expire TTL seconds after the time they were written (as determined by the coordinator, and rounded to 1 second resolution). That time is the cell's expiration time. When cells expire, they become tombstones, shadowing any data written with a write timestamp less than or equal to the timestamp of the expiring cell. Therefore, cells that have an expiration time win over cells with no expiration time.

If both cells have an expiration time, the one with the later expiration time wins. If they have the same expiration time (in whole-second resolution), each cell's write time is derived as its expiration time minus its original time-to-live value, and the cell that was written later prevails.

Finally, if both cells are live and have no expiration, or have the same expiration time and time-to-live, the cell with the lexicographically bigger value prevails.
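The rules above can be read as a single comparison function. The following is a minimal Python sketch of that ordering, not the actual ScyllaDB implementation: the `Cell` class, its field names, and `compare_for_merge` are hypothetical names chosen for illustration, and values are assumed to compare as plain byte strings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cell:
    """Hypothetical cell model; field names are illustrative only."""
    timestamp: int                       # write timestamp
    value: bytes = b""                   # serialized value (live cells)
    deletion_time: Optional[int] = None  # seconds; set only for dead cells
    expiry: Optional[int] = None         # expiration time in seconds, if any
    ttl: Optional[int] = None            # original TTL in seconds, if any

    @property
    def is_live(self) -> bool:
        return self.deletion_time is None


def _cmp(x, y) -> int:
    """Three-way comparison: 1 if x > y, -1 if x < y, 0 if equal."""
    return (x > y) - (x < y)


def compare_for_merge(a: Cell, b: Cell) -> int:
    """Return > 0 if `a` wins, < 0 if `b` wins, 0 if they are equivalent."""
    # Fundamental rule: the highest timestamp wins.
    if a.timestamp != b.timestamp:
        return _cmp(a.timestamp, b.timestamp)
    # Dead cells win over live cells.
    if a.is_live != b.is_live:
        return -1 if a.is_live else 1
    # Both dead: the later deletion time prevails.
    if not a.is_live:
        return _cmp(a.deletion_time, b.deletion_time)
    # Both live: an expiring cell wins over a non-expiring one.
    a_expiring = a.expiry is not None
    b_expiring = b.expiry is not None
    if a_expiring != b_expiring:
        return 1 if a_expiring else -1
    if a_expiring:
        # Both expiring: the later expiration time wins.
        if a.expiry != b.expiry:
            return _cmp(a.expiry, b.expiry)
        # Same expiration: derive write time as expiry - ttl; later wins.
        if a.ttl != b.ttl:
            return _cmp(a.expiry - a.ttl, b.expiry - b.ttl)
    # Last resort: the lexicographically greater value wins.
    return _cmp(a.value, b.value)
```

Because every step depends only on the contents of the two cells, any node applying this function to the same pair reaches the same winner regardless of the order in which the cells arrived.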

Note that when multiple columns are written by INSERT or UPDATE statements using the same timestamp, a SELECT of those columns might return a result that mixes cells from both upserts. This may happen when neither upsert has an expiration time, or when both have the same expiration time and TTL (in whole-second resolution). In such a case, the winning cell is chosen by value in each column, independently of the other columns.
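A small sketch can make this mixing effect concrete. It assumes two hypothetical upserts to the same row with identical timestamps and no TTL, and it models value comparison as plain byte-wise ordering; the column names and values are invented for illustration.

```python
# Two hypothetical upserts to the same row, written with the same timestamp
# and no TTL. Keys are column names, values are serialized cell values.
upsert_a = {"c1": b"apple", "c2": b"zebra"}
upsert_b = {"c1": b"berry", "c2": b"yak"}

# With no expiration to compare, each column is reconciled on its own:
# the lexicographically greater value wins (Python orders bytes that way).
merged = {col: max(upsert_a[col], upsert_b[col]) for col in upsert_a}

print(merged)
```

Here `c1` is taken from `upsert_b` and `c2` from `upsert_a`, so the merged row matches neither upsert as a whole.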