Currently, when two cells have the same write timestamp and both are alive or expiring, we compare their values first, before checking whether either of them is expiring and, if both are expiring, comparing their expiration time and TTL value to determine which of them will expire later or was written later.

This was based on an early version of Cassandra. However, the Cassandra implementation rightfully changed in e225c88a65 ([CASSANDRA-14592](https://issues.apache.org/jira/browse/CASSANDRA-14592)), where the cell expiration is considered before the cell value.

To summarize, the motivation for this change is threefold:
1. Cassandra compatibility
2. Prevent an edge case where a null value is returned by a select query when an expired cell has a larger value than a cell with a later expiration.
3. A generalization of the above: value-based reconciliation may cause a select query to return a mixture of upserts, if multiple upserts use the same timestamp but have different expiration times. If the cell value is considered before expiration, the select result may contain cells from different inserts, while reconciling based on the expiration times will choose cells consistently from either upsert, as all cells in the respective upsert will carry the same expiration time.

Fixes scylladb/scylladb#14182

Also, this series:
- updates dml documentation
- updates internal documentation
- updates and adds unit tests and cql pytest reproducing #14182

Closes scylladb/scylladb#14183

* github.com:scylladb/scylladb:
  docs: dml: add update ordering section
  cql-pytest: test_using_timestamp: add tests for rewrites using same timestamp
  mutation_partition: compare_row_marker_for_merge: consider ttl in case expiry is the same
  atomic_cell: compare_atomic_cell_for_merge: update and add documentation
  compare_atomic_cell_for_merge: compare value last for live cells
  mutation_test: test_cell_ordering: improve debuggability

(cherry picked from commit 87b4606cd6)

Closes #14649
Timestamp conflict resolution
The fundamental rule for ordering cells that insert, update, or delete data in a given row and column is that the cell with the highest timestamp wins.
However, it is possible that multiple such cells will carry the same TIMESTAMP.
In this case, conflicts must be resolved in a consistent way by all nodes.
Otherwise, if nodes picked an arbitrary cell in case of a conflict and
reached different results, reading from different replicas would detect the inconsistency and trigger
a read repair that generates yet another cell, which would still conflict with the existing cells,
with no guarantee of convergence.
The first tie-breaking rule when two cells have the same write timestamp is that dead cells win over live cells; and if both cells are deleted, the one with the later deletion time prevails.
If both cells are alive, their expiration time is examined.
Cells that are written with a non-zero TTL (either implicit, as determined by
the table's default TTL, or explicit, USING TTL) are due to expire
TTL seconds after the time they were written (as determined by the coordinator,
and rounded to 1 second resolution). That time is the cell's expiration time.
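
The computation of the expiration time can be sketched as follows. This is only an illustration: the function name `expiration_time` is hypothetical, and truncating the coordinator's clock to whole seconds is an assumption about how the rounding is done.

```python
from typing import Optional


def expiration_time(write_time: float, ttl: int) -> Optional[int]:
    """Return the expiration time in whole seconds, or None if the cell never expires.

    write_time: coordinator wall-clock time in seconds (hypothetical input).
    ttl: time-to-live in seconds; 0 means no expiration.
    """
    if ttl == 0:
        return None
    # Assumption: the write time is truncated to 1 second resolution.
    return int(write_time) + ttl


# Two writes within the same second and with the same TTL
# end up with the same expiration time.
first = expiration_time(1000.2, 60)
second = expiration_time(1000.9, 60)
```

Because of the 1 second resolution, writes that land in the same second with the same TTL cannot be told apart by their expiration time alone.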
When cells expire, they become tombstones, shadowing any data written with a write timestamp
less than or equal to the timestamp of the expiring cell.
Therefore, cells that have an expiration time win over cells with no expiration time.
If both cells have an expiration time, the one with the later expiration time wins. If they have the same expiration time (in whole-second resolution), their write time is derived as the expiration time less the original time-to-live value, and the one that was written later (that is, the one with the smaller time-to-live) prevails.
Finally, if both cells are live and have no expiration, or have the same expiration time and time-to-live, the cell with the lexicographically greater value prevails.
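
The full sequence of tie-breaking rules described above can be summarized in a minimal Python sketch. It is not the actual implementation: the `Cell` class, its field names, and the `compare_for_merge` and `reconcile` helpers are hypothetical and exist only to make the ordering of the checks explicit.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cell:
    # Hypothetical model of a cell, for illustration only.
    timestamp: int
    value: bytes = b""
    live: bool = True
    deletion_time: int = 0        # meaningful for dead cells only
    expiry: Optional[int] = None  # whole seconds; None means no expiration
    ttl: int = 0


def _cmp(x, y) -> int:
    return (x > y) - (x < y)


def compare_for_merge(a: Cell, b: Cell) -> int:
    """Return > 0 if a wins, < 0 if b wins, 0 if the cells are equivalent."""
    if a.timestamp != b.timestamp:
        return _cmp(a.timestamp, b.timestamp)       # highest timestamp wins
    if a.live != b.live:
        return -1 if a.live else 1                  # dead wins over live
    if not a.live:
        return _cmp(a.deletion_time, b.deletion_time)
    a_expiring, b_expiring = a.expiry is not None, b.expiry is not None
    if a_expiring != b_expiring:
        return 1 if a_expiring else -1              # expiring wins over non-expiring
    if a_expiring:
        if a.expiry != b.expiry:
            return _cmp(a.expiry, b.expiry)         # later expiration wins
        if a.ttl != b.ttl:
            # Derived write time is expiry - ttl; the later write wins.
            return _cmp(a.expiry - a.ttl, b.expiry - b.ttl)
    return _cmp(a.value, b.value)                   # bytes compare lexicographically


def reconcile(a: Cell, b: Cell) -> Cell:
    return a if compare_for_merge(a, b) >= 0 else b
```

Since every step compares properties of the cells themselves, `reconcile(a, b)` and `reconcile(b, a)` select the same cell, which is what allows all replicas to converge regardless of the order in which they see the writes.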
Note that when multiple columns are INSERTed or UPDATEd using the same timestamp, SELECTing those columns might return a result that mixes cells from both upserts. This may happen when neither upsert has an expiration time, or when both have the same expiration time and the same TTL (in whole-second resolution). In such a case, the winning cell is chosen based on the cell values in each column, independently of the other columns.
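
The mixing effect can be illustrated with a small sketch. It assumes two upserts with the same timestamp and no expiration, so that reconciliation falls through to the value comparison. The row representation (a dict mapping column name to a `(timestamp, value)` tuple) and the `merge_rows` helper are hypothetical.

```python
def merge_rows(x: dict, y: dict) -> dict:
    """Merge two rows column by column.

    Each cell is a (timestamp, value) tuple. With no deletion or expiration
    involved, the winner is the cell with the higher timestamp and, on a tie,
    the lexicographically greater value, which is how Python orders tuples.
    """
    merged = {}
    for column in x.keys() | y.keys():
        candidates = [row[column] for row in (x, y) if column in row]
        merged[column] = max(candidates)
    return merged


# Two upserts to the same row using the same timestamp (100).
upsert_a = {"c1": (100, b"1"), "c2": (100, b"9")}
upsert_b = {"c1": (100, b"5"), "c2": (100, b"2")}

result = merge_rows(upsert_a, upsert_b)
# c1 is taken from upsert_b and c2 from upsert_a.
```

Here the merged row matches neither upsert as a whole, because each column is reconciled on its own.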