From 3bad46a6e2870ebd4905bd1d7aa59c8a368f9db7 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Botond=20D=C3=A9nes?= Date: Wed, 26 Mar 2025 11:31:49 -0400 Subject: [PATCH] docs/dev: add tombstone.md An exhaustive document on the tombstone related internal logic as well as the user-facing aspects. Closes scylladb/scylladb#23454 --- docs/dev/tombstone.md | 613 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 613 insertions(+) create mode 100644 docs/dev/tombstone.md diff --git a/docs/dev/tombstone.md b/docs/dev/tombstone.md new file mode 100644 index 0000000000..d254dca778 --- /dev/null +++ b/docs/dev/tombstone.md @@ -0,0 +1,613 @@ +# Tombstones + +This document explains what tombstones are, how they are used, and how they are created in ScyllaDB. + +## Introduction + +In an LSM Tree data structure, deleting data directly is not possible. +Instead, deleting a key is translated to a write of a special deletion marker for said key. +When the key and its deletion marker are compacted together, the key and any data it may have are dropped as a result. + +In ScyllaDB these deletion markers are called "tombstones". + +A tombstone has the following fields: + +```c++ +struct tombstone { + api::timestamp_type timestamp; + gc_clock::time_point deletion_time; +}; +``` + +The `timestamp` field is used to determine whether the tombstone covers some live data. +If `tombstone.timestamp >= live_data.timestamp`, the data is covered by the tombstone and it is considered dead. +In ScyllaDB, every cell has its own timestamp, so this coverage check happens on a cell level. +It is possible that in a given row, some cells are covered, while others are not and they remain live. + +The `deletion_time` is used to determine whether the tombstone is eligible for garbage collection. +Once a tombstone has been compacted together with *all* data it could possibly cover, and therefore all such data has been dropped, it is no longer useful and can be garbage collected (dropped). 
+Tombstone garbage collection involves complex rules; for more details, see the [Tombstone Garbage Collection](./tombstone.md#tombstone-garbage-collection) chapter. + +## Tombstone Hierarchy + +Tombstones and data form a hierarchical structure, where tombstones can cover data on the same level and below, but not above. +The tombstone hierarchy follows the data hierarchy, which looks like this: +```mermaid +graph TD; + partition-->row; + row-->cell; + row-->collection; + collection-->cell; +``` + +Tombstone hierarchy: +```mermaid +graph TD; + P[partition tombstone]-->RT[range tombstone]; + RT[range tombstone]-->R[row tombstone]; + R[row tombstone]-->C[cell tombstone]; + R-->CT[collection tombstone]; + CT-->C; +``` + +A partition tombstone applies to everything in the partition, including the static row. + +A range tombstone applies to a clustering range, so it can apply to multiple rows. + +Collections have a collection tombstone which applies to all elements in the collection. +Collection members are similar to regular cells: they can be live or dead (cell tombstone). +In the ScyllaDB type system, in addition to the `list`, `set` and `map` collection types, `tuple` and `UDT` are also stored as collections so the same rules apply to them as well. + +When determining whether some entity is live or not, one has to consider all tombstones that are above this entity in the tombstone hierarchy. +For example, to determine whether a regular cell is live or not, one has to calculate the following tombstone: `partition tombstone + range tombstone + row tombstone`. +Addition of tombstones is simply choosing the one with the higher timestamp: +```c++ +tombstone operator+(tombstone a, tombstone b) { + return a.timestamp > b.timestamp ? a : b; +} +``` + +Missing tombstones (there is no deletion) have the special `timestamp` value of `api::missing_timestamp`, which compares less than any other timestamp value. + +## How Are Tombstones Created? 
+ +Creating the higher-level tombstones (row or above) is fairly intuitive and involves a `DELETE FROM` statement as one would expect. +But when it comes to cells and collections, tombstones can be created as a result of many different statements. + +Given the following table as an example: +```CQL +CREATE TABLE ks.tbl (pk text, ck1 int, ck2 int, v1 int, v2 map<int, int>, PRIMARY KEY (pk, ck1, ck2)); +``` + +Below we will examine how each kind of tombstone can be created, with a concrete example for each. +Each example will have a CQL statement which creates the tombstone followed by a [SELECT * FROM MUTATION_FRAGMENTS()](https://opensource.docs.scylladb.com/stable/operating-scylla/admin-tools/select-from-mutation-fragments.html) query showing the created tombstone. + +### Partition Tombstone + +```CQL +DELETE FROM ks.tbl WHERE pk = 'partition tombstone'; + +SELECT * FROM MUTATION_FRAGMENTS(ks.tbl) WHERE pk = 'partition tombstone'; + + pk | mutation_source | partition_region | ck1 | ck2 | position_weight | metadata | mutation_fragment_kind | value +---------------------+-----------------+------------------+-----+-----+-----------------+-------------------------------------------------------------------------------------+------------------------+------- + partition tombstone | memtable:0 | 0 | | | | {"tombstone":{"timestamp":1743054972857790,"deletion_time":"2025-03-27 05:56:12z"}} | partition start | null + partition tombstone | memtable:0 | 3 | | | | null | partition end | null + +(2 rows) +``` + +### Range Tombstone + +Delete a range: +```CQL +DELETE FROM ks.tbl WHERE pk = 'range tombstone 1' AND ck1 = 0 AND ck2 > 100 AND ck2 < 200; + +SELECT * FROM MUTATION_FRAGMENTS(ks.tbl) WHERE pk = 'range tombstone 1'; + + pk | mutation_source | partition_region | ck1 | ck2 | position_weight | metadata | mutation_fragment_kind | value 
+-------------------+-----------------+------------------+-----+-----+-----------------+-------------------------------------------------------------------------------------+------------------------+------- + range tombstone 1 | memtable:0 | 0 | | | | {"tombstone":{}} | partition start | null + range tombstone 1 | memtable:0 | 2 | 0 | 100 | 1 | {"tombstone":{"timestamp":1743055013006807,"deletion_time":"2025-03-27 05:56:53z"}} | range tombstone change | null + range tombstone 1 | memtable:0 | 2 | 0 | 200 | -1 | {"tombstone":{}} | range tombstone change | null + range tombstone 1 | memtable:0 | 3 | | | | null | partition end | null + +(4 rows) +``` + +Delete a prefix -- only a prefix of all the clustering columns is restricted: +```CQL +DELETE FROM ks.tbl WHERE pk = 'range tombstone 2' AND ck1 = 1; + +SELECT * FROM MUTATION_FRAGMENTS(ks.tbl) WHERE pk = 'range tombstone 2'; + + pk | mutation_source | partition_region | ck1 | ck2 | position_weight | metadata | mutation_fragment_kind | value +-------------------+-----------------+------------------+-----+-----+-----------------+-------------------------------------------------------------------------------------+------------------------+------- + range tombstone 2 | memtable:0 | 0 | | | | {"tombstone":{}} | partition start | null + range tombstone 2 | memtable:0 | 2 | 1 | | -1 | {"tombstone":{"timestamp":1743055505954714,"deletion_time":"2025-03-27 06:05:05z"}} | range tombstone change | null + range tombstone 2 | memtable:0 | 2 | 1 | | 1 | {"tombstone":{}} | range tombstone change | null + range tombstone 2 | memtable:0 | 3 | | | | null | partition end | null + +(4 rows) +``` + +Internally, range tombstones are represented by so-called "range tombstone change" objects. +There are two such objects: one at the start of the range tombstone (the lower bound key) and one at the end of the range tombstone (the higher bound key). +A range tombstone takes effect between its start and end bounds. 
+A partition can have any number of range tombstone change objects, but if it has any, there are always at least two: a starting one and a last one. +The last range tombstone change in a partition always has an empty tombstone; this marks the remaining range of the partition as having no range tombstone, or in other words, it "resets" the current tombstone. +The range tombstone end object resets the tombstone to the null-tombstone (no deletion). +Despite every range tombstone having a start and an end range tombstone change object, it is possible for a partition to have an odd number of such objects. +This happens when range tombstones overlap: +```CQL +DELETE FROM ks.tbl WHERE pk = 'range tombstone 3' AND ck1 = 0 AND ck2 > 100 AND ck2 < 200; + +DELETE FROM ks.tbl WHERE pk = 'range tombstone 3' AND ck1 = 0 AND ck2 > 150 AND ck2 < 300; + +SELECT * FROM MUTATION_FRAGMENTS(ks.tbl) WHERE pk = 'range tombstone 3'; + + pk | mutation_source | partition_region | ck1 | ck2 | position_weight | metadata | mutation_fragment_kind | value +-------------------+-----------------+------------------+-----+-----+-----------------+-------------------------------------------------------------------------------------+------------------------+------- + range tombstone 3 | memtable:0 | 0 | | | | {"tombstone":{}} | partition start | null + range tombstone 3 | memtable:0 | 2 | 0 | 100 | 1 | {"tombstone":{"timestamp":1743164183543439,"deletion_time":"2025-03-28 12:16:23z"}} | range tombstone change | null + range tombstone 3 | memtable:0 | 2 | 0 | 150 | 1 | {"tombstone":{"timestamp":1743164186551458,"deletion_time":"2025-03-28 12:16:26z"}} | range tombstone change | null + range tombstone 3 | memtable:0 | 2 | 0 | 300 | -1 | {"tombstone":{}} | range tombstone change | null + range tombstone 3 | memtable:0 | 3 | | | | null | partition end | null + +(5 rows) +``` + +In the range [(0, 150), (0, 200)] the two range tombstones overlap, so the one with the higher timestamp takes precedence (overwrites the other); here, that is the later one. 
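How a stream of range tombstone change objects determines the active tombstone for a given row can be sketched as below. This is a simplified model under stated assumptions, not ScyllaDB's implementation: `position`, `rt_change`, `active_tombstone_at` and `range_tombstone_example` are hypothetical names, the clustering key is reduced to a single `int` (the `ck2` column), and a tombstone is reduced to an optional timestamp. Positions carry a weight as in the `position_weight` column of the dump: `-1` is before the key, `1` is after it, and rows sit at weight `0`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>
#include <tuple>
#include <vector>

// Hypothetical, simplified position: one int clustering key plus a weight.
// weight -1 = before the key, 0 = at the key (rows), +1 = after the key.
struct position {
    int key;
    int weight;
    bool operator<(const position& o) const {
        return std::tie(key, weight) < std::tie(o.key, o.weight);
    }
};

// Hypothetical range tombstone change: from `pos` onwards, `timestamp` is the
// active range tombstone; std::nullopt stands for the empty tombstone.
struct rt_change {
    position pos;
    std::optional<int64_t> timestamp;
};

// Returns the range tombstone timestamp active at the row with the given key.
// `changes` must be sorted by position; the last change before the row wins.
std::optional<int64_t> active_tombstone_at(const std::vector<rt_change>& changes, int key) {
    assert(std::is_sorted(changes.begin(), changes.end(),
        [](const rt_change& a, const rt_change& b) { return a.pos < b.pos; }));
    const position row{key, 0};
    std::optional<int64_t> active;
    for (const auto& c : changes) {
        if (row < c.pos) {
            break; // this change and all later ones start after the row
        }
        active = c.timestamp;
    }
    return active;
}

// Usage: the three changes from the 'range tombstone 3' dump (ck2 only).
void range_tombstone_example() {
    const std::vector<rt_change> changes = {
        {{100, 1}, 1743164183543439},
        {{150, 1}, 1743164186551458},
        {{300, -1}, std::nullopt},
    };
    assert(!active_tombstone_at(changes, 100));                    // ck2 > 100 excludes 100
    assert(active_tombstone_at(changes, 120) == 1743164183543439); // first deletion only
    assert(active_tombstone_at(changes, 200) == 1743164186551458); // later deletion
    assert(!active_tombstone_at(changes, 300));                    // ck2 < 300 excludes 300
}
```

Because a row position and a change position never compare equal (their weights differ), a single `<` comparison is enough to decide on which side of a change a row falls.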
+ +Range tombstone change objects always have a position which is *before* or *after* a certain key, never a position where they are *at* a key. +This becomes evident when looking at the `position_weight` column in the mutation dump. +Range tombstone changes always have a position weight of either `1` or `-1`. +Regular rows always have a position weight of `0`. +Position weight is relevant when comparing two positions which have the same clustering key. +In this case the position weight is the tie-breaker: comparing the position weight of the two respective positions will determine the comparison result. + +### Row Tombstone + +```CQL +DELETE FROM ks.tbl WHERE pk = 'row tombstone' AND ck1 = 5 AND ck2 = 5; + +SELECT * FROM MUTATION_FRAGMENTS(ks.tbl) WHERE pk = 'row tombstone'; + + pk | mutation_source | partition_region | ck1 | ck2 | position_weight | metadata | mutation_fragment_kind | value +---------------+-----------------+------------------+-----+-----+-----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------+------- + row tombstone | memtable:0 | 0 | | | | {"tombstone":{}} | partition start | null + row tombstone | memtable:0 | 2 | 5 | 5 | 0 | {"tombstone":{"timestamp":1743055543508176,"deletion_time":"2025-03-27 06:05:43z"},"shadowable_tombstone":{"timestamp":1743055543508176,"deletion_time":"2025-03-27 06:05:43z"},"columns":{}} | clustering row | {} + row tombstone | memtable:0 | 3 | | | | null | partition end | null + +(3 rows) +``` + +The row tombstone also has a `shadowable_tombstone` object included in it. More on this in the [Shadowable Tombstone](./tombstone.md#shadowable-tombstone) chapter. 
+ +### Regular Cell Tombstone + +Delete regular cell with `DELETE` statement: +```CQL +DELETE v1 FROM ks.tbl WHERE pk = 'regular cell tombstone 1' AND ck1 = 0 AND ck2 = 0; + +SELECT * FROM MUTATION_FRAGMENTS(ks.tbl) WHERE pk = 'regular cell tombstone 1'; + + pk | mutation_source | partition_region | ck1 | ck2 | position_weight | metadata | mutation_fragment_kind | value +--------------------------+-----------------+------------------+-----+-----+-----------------+---------------------------------------------------------------------------------------------------------------------------+------------------------+------------- + regular cell tombstone 1 | memtable:0 | 0 | | | | {"tombstone":{}} | partition start | null + regular cell tombstone 1 | memtable:0 | 2 | 0 | 0 | 0 | {"columns":{"v1":{"is_live":false,"type":"regular","timestamp":1743056112215870,"deletion_time":"2025-03-27 06:15:12z"}}} | clustering row | {"v1":null} + regular cell tombstone 1 | memtable:0 | 3 | | | | null | partition end | null + +(3 rows) +``` + +When it comes to cell tombstones (be that regular or collection cells), there is no longer a separate tombstone object. +Instead, a cell is either live or dead (see `is_live` in the `metadata` column). If `is_live=false`, the cell is a dead cell -- also called a cell tombstone. + +Deleting a cell and setting it to `null` have the same effect. 
+ +Delete regular cell with `UPDATE` statement: +```CQL +UPDATE ks.tbl SET v1 = null WHERE pk = 'regular cell tombstone 2' AND ck1 = 0 AND ck2 = 0; + +SELECT * FROM MUTATION_FRAGMENTS(ks.tbl) WHERE pk = 'regular cell tombstone 2'; + + pk | mutation_source | partition_region | ck1 | ck2 | position_weight | metadata | mutation_fragment_kind | value +--------------------------+-----------------+------------------+-----+-----+-----------------+---------------------------------------------------------------------------------------------------------------------------+------------------------+------------- + regular cell tombstone 2 | memtable:0 | 0 | | | | {"tombstone":{}} | partition start | null + regular cell tombstone 2 | memtable:0 | 2 | 0 | 0 | 0 | {"columns":{"v1":{"is_live":false,"type":"regular","timestamp":1743056276318904,"deletion_time":"2025-03-27 06:17:56z"}}} | clustering row | {"v1":null} + regular cell tombstone 2 | memtable:0 | 3 | | | | null | partition end | null + +(3 rows) +``` + +Delete regular cell with `INSERT` statement: +```CQL +INSERT INTO ks.tbl (pk, ck1, ck2, v1) VALUES ('regular cell tombstone 3', 0, 0, null); + +SELECT * FROM MUTATION_FRAGMENTS(ks.tbl) WHERE pk = 'regular cell tombstone 3'; + + pk | mutation_source | partition_region | ck1 | ck2 | position_weight | metadata | mutation_fragment_kind | value +--------------------------+-----------------+------------------+-----+-----+-----------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------+------------- + regular cell tombstone 3 | memtable:0 | 0 | | | | {"tombstone":{}} | partition start | null + regular cell tombstone 3 | memtable:0 | 2 | 0 | 0 | 0 | {"marker":{"timestamp":1743056345463644},"columns":{"v1":{"is_live":false,"type":"regular","timestamp":1743056345463644,"deletion_time":"2025-03-27 06:19:05z"}}} | clustering row | {"v1":null} + 
regular cell tombstone 3 | memtable:0 | 3 | | | | null | partition end | null + +(3 rows) +``` + +### Collection Tombstone + +Delete collection with `DELETE` statement: +```CQL +DELETE v2 FROM ks.tbl WHERE pk = 'collection tombstone 1' AND ck1 = 0 AND ck2 = 0; + +SELECT * FROM MUTATION_FRAGMENTS(ks.tbl) WHERE pk = 'collection tombstone 1'; + + pk | mutation_source | partition_region | ck1 | ck2 | position_weight | metadata | mutation_fragment_kind | value +------------------------+-----------------+------------------+-----+-----+-----------------+-------------------------------------------------------------------------------------------------------------------+------------------------+----------- + collection tombstone 1 | memtable:0 | 0 | | | | {"tombstone":{}} | partition start | null + collection tombstone 1 | memtable:0 | 2 | 0 | 0 | 0 | {"columns":{"v2":{"tombstone":{"timestamp":1743056564590040,"deletion_time":"2025-03-27 06:22:44z"},"cells":[]}}} | clustering row | {"v2":[]} + collection tombstone 1 | memtable:0 | 3 | | | | null | partition end | null + +(3 rows) +``` + +Setting the collection to `null` has the same effect, just like for regular cells. 
+ +Delete collection with `UPDATE` statement: +```CQL +UPDATE ks.tbl SET v2 = null WHERE pk = 'collection tombstone 2' AND ck1 = 0 AND ck2 = 0; + +SELECT * FROM MUTATION_FRAGMENTS(ks.tbl) WHERE pk = 'collection tombstone 2'; + + pk | mutation_source | partition_region | ck1 | ck2 | position_weight | metadata | mutation_fragment_kind | value +------------------------+-----------------+------------------+-----+-----+-----------------+-------------------------------------------------------------------------------------------------------------------+------------------------+----------- + collection tombstone 2 | memtable:0 | 0 | | | | {"tombstone":{}} | partition start | null + collection tombstone 2 | memtable:0 | 2 | 0 | 0 | 0 | {"columns":{"v2":{"tombstone":{"timestamp":1743056668558340,"deletion_time":"2025-03-27 06:24:28z"},"cells":[]}}} | clustering row | {"v2":[]} + collection tombstone 2 | memtable:0 | 3 | | | | null | partition end | null + +(3 rows) +``` + +Delete collection with `INSERT` statement: +```CQL +INSERT INTO ks.tbl (pk, ck1, ck2, v2) VALUES ('collection tombstone 3', 0, 0, null); + +SELECT * FROM MUTATION_FRAGMENTS(ks.tbl) WHERE pk = 'collection tombstone 3'; + + pk | mutation_source | partition_region | ck1 | ck2 | position_weight | metadata | mutation_fragment_kind | value +------------------------+-----------------+------------------+-----+-----+-----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------+----------- + collection tombstone 3 | memtable:0 | 0 | | | | {"tombstone":{}} | partition start | null + collection tombstone 3 | memtable:0 | 2 | 0 | 0 | 0 | {"marker":{"timestamp":1743056946866432},"columns":{"v2":{"tombstone":{"timestamp":1743056946866431,"deletion_time":"2025-03-27 06:29:06z"},"cells":[]}}} | clustering row | {"v2":[]} + collection tombstone 3 | memtable:0 | 3 | | | | null | partition 
end | null + +(3 rows) +``` + +Collection tombstones are also generated when a collection is fully overwritten. + +Collection tombstone generated by full overwrite, using the `UPDATE` statement: +```CQL +UPDATE ks.tbl SET v2 = {1: 12, 2: 44} WHERE pk = 'collection tombstone 4' AND ck1 = 0 AND ck2 = 0; + +SELECT * FROM MUTATION_FRAGMENTS(ks.tbl) WHERE pk = 'collection tombstone 4'; + + pk | mutation_source | partition_region | ck1 | ck2 | position_weight | metadata | mutation_fragment_kind | value +------------------------+-----------------+------------------+-----+-----+-----------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------+------------------------------------------------------------ + collection tombstone 4 | memtable:0 | 0 | | | | {"tombstone":{}} | partition start | null + collection tombstone 4 | memtable:0 | 2 | 0 | 0 | 0 | {"columns":{"v2":{"tombstone":{"timestamp":1743057841587097,"deletion_time":"2025-03-27 06:44:01z"},"cells":[{"key":"1","value":{"is_live":true,"type":"regular","timestamp":1743057841587098}},{"key":"2","value":{"is_live":true,"type":"regular","timestamp":1743057841587098}}]}}} | clustering row | {"v2":[{"key":"1","value":"12"},{"key":"2","value":"44"}]} + collection tombstone 4 | memtable:0 | 3 | | | | null | partition end | null + +(3 rows) +``` + +Collection tombstone generated by full overwrite, using the `INSERT` statement: +```CQL +INSERT INTO ks.tbl (pk, ck1, ck2, v2) VALUES ('collection tombstone 5', 0, 0, {1: 12, 2: 44}); + +SELECT * FROM MUTATION_FRAGMENTS(ks.tbl) WHERE pk = 'collection tombstone 5'; + + pk | mutation_source | partition_region | ck1 | ck2 | position_weight | metadata | mutation_fragment_kind | value 
+------------------------+-----------------+------------------+-----+-----+-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------+------------------------------------------------------------ + collection tombstone 5 | memtable:0 | 0 | | | | {"tombstone":{}} | partition start | null + collection tombstone 5 | memtable:0 | 2 | 0 | 0 | 0 | {"marker":{"timestamp":1743057913516603},"columns":{"v2":{"tombstone":{"timestamp":1743057913516602,"deletion_time":"2025-03-27 06:45:13z"},"cells":[{"key":"1","value":{"is_live":true,"type":"regular","timestamp":1743057913516603}},{"key":"2","value":{"is_live":true,"type":"regular","timestamp":1743057913516603}}]}}} | clustering row | {"v2":[{"key":"1","value":"12"},{"key":"2","value":"44"}]} + collection tombstone 5 | memtable:0 | 3 | | | | null | partition end | null + +(3 rows) +``` + +### Collection Cell Tombstone + +Collection cell tombstone behave like regular cell tombstones for the most part. 
+ +Delete collection cell with `DELETE` statement: +```CQL +DELETE v2[1] FROM ks.tbl WHERE pk = 'collection cell tombstone 1' AND ck1 = 0 AND ck2 = 0; + +SELECT * FROM MUTATION_FRAGMENTS(ks.tbl) WHERE pk = 'collection cell tombstone 1'; + + pk | mutation_source | partition_region | ck1 | ck2 | position_weight | metadata | mutation_fragment_kind | value +-----------------------------+-----------------+------------------+-----+-----+-----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------+----------------------------------- + collection cell tombstone 1 | memtable:0 | 0 | | | | {"tombstone":{}} | partition start | null + collection cell tombstone 1 | memtable:0 | 2 | 0 | 0 | 0 | {"columns":{"v2":{"cells":[{"key":"1","value":{"is_live":false,"type":"regular","timestamp":1743057941371233,"deletion_time":"2025-03-27 06:45:41z"}}]}}} | clustering row | {"v2":[{"key":"1","value":null}]} + collection cell tombstone 1 | memtable:0 | 3 | | | | null | partition end | null + +(3 rows) +``` + +Delete collection cell with `UPDATE` statement: +```CQL +UPDATE ks.tbl SET v2[1] = null WHERE pk = 'collection cell tombstone 2' AND ck1 = 0 AND ck2 = 0; + +SELECT * FROM MUTATION_FRAGMENTS(ks.tbl) WHERE pk = 'collection cell tombstone 2'; + + pk | mutation_source | partition_region | ck1 | ck2 | position_weight | metadata | mutation_fragment_kind | value +-----------------------------+-----------------+------------------+-----+-----+-----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------+----------------------------------- + collection cell tombstone 2 | memtable:0 | 0 | | | | {"tombstone":{}} | partition start | null + collection cell tombstone 2 | memtable:0 | 2 | 0 | 0 | 0 | 
{"columns":{"v2":{"cells":[{"key":"1","value":{"is_live":false,"type":"regular","timestamp":1743058010855333,"deletion_time":"2025-03-27 06:46:50z"}}]}}} | clustering row | {"v2":[{"key":"1","value":null}]} + collection cell tombstone 2 | memtable:0 | 3 | | | | null | partition end | null + +(3 rows) +``` + +Individual collection cells cannot be deleted (or overwritten) by `INSERT` statement. + +## TTL and Tombstones + +Cells which have [TTL](https://opensource.docs.scylladb.com/stable/cql/time-to-live.html) become cell tombstones after they expire. + +```CQL +INSERT INTO ks.tbl (pk, ck1, ck2, v1) VALUES ('expired cell', 0, 0, 1) USING TTL 1; + +SELECT * FROM MUTATION_FRAGMENTS(ks.tbl) WHERE pk = 'expired cell'; + + pk | mutation_source | partition_region | ck1 | ck2 | position_weight | metadata | mutation_fragment_kind | value +--------------+-----------------+------------------+-----+-----+-----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------+------------ + expired cell | memtable:0 | 0 | | | | {"tombstone":{}} | partition start | null + expired cell | memtable:0 | 2 | 0 | 0 | 0 | {"marker":{"timestamp":1743058565262883,"ttl":"1s","expiry":"2025-03-27 06:56:06z"},"columns":{"v1":{"is_live":true,"type":"regular","timestamp":1743058565262883,"ttl":"1s","expiry":"2025-03-27 06:56:06z"}}} | clustering row | {"v1":"1"} + expired cell | memtable:0 | 3 | | | | null | partition end | null + +(3 rows) +``` + +Even though the cell has already expired, it still appears as live, because expired cells are converted to cell tombstones by compaction. +To determine whether an expired cell is really live or not, one has to look at the `expiry` and compare it with the current time. +If `expiry <= now()`, the cell is expired and it will be treated as a dead cell by compaction. 
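The liveness rule for TTL'd cells described above can be sketched as a small predicate. This is an illustration only: the `cell` struct, `gc_time_point` alias, `is_effectively_live` and `expiry_example` are hypothetical names, and `std::chrono::system_clock` is used as a stand-in for ScyllaDB's `gc_clock`.

```cpp
#include <cassert>
#include <chrono>
#include <optional>

// Stand-in for gc_clock::time_point (an assumption made for this sketch).
using gc_time_point = std::chrono::system_clock::time_point;

// Hypothetical, minimal view of a cell's liveness metadata.
struct cell {
    bool is_live;                        // the `is_live` flag from the dump
    std::optional<gc_time_point> expiry; // set only for cells written with a TTL
};

// A cell counts as live only if it is marked live and has not expired yet.
// An expired cell may still be stored with is_live=true until compaction
// rewrites it as a dead cell.
bool is_effectively_live(const cell& c, gc_time_point now) {
    if (!c.is_live) {
        return false;
    }
    if (c.expiry && *c.expiry <= now) {
        return false; // expiry <= now(): treated as a dead cell
    }
    return true;
}

// Usage: a cell written with a TTL of one second.
void expiry_example() {
    const gc_time_point written = gc_time_point{} + std::chrono::seconds(1000);
    const cell c{true, written + std::chrono::seconds(1)};
    assert(is_effectively_live(c, written));                            // not yet expired
    assert(!is_effectively_live(c, written + std::chrono::seconds(1))); // expiry == now
}
```

Taking `now` as a parameter rather than reading a clock inside the predicate keeps the check deterministic, which makes it easy to reason about the boundary case `expiry == now`.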
+ +Reads also compact the data, so if we read this partition, the cell will not show up in the results: +```CQL +SELECT * FROM ks.tbl WHERE pk = 'expired cell'; + + pk | ck1 | ck2 | v1 | v2 +----+-----+-----+----+---- + +(0 rows) +``` + +After flushing and compacting the table, the expired cell is now converted to a cell tombstone: +```CQL +SELECT * FROM MUTATION_FRAGMENTS(ks.tbl) WHERE pk = 'expired cell'; + + pk | mutation_source | partition_region | ck1 | ck2 | position_weight | metadata | mutation_fragment_kind | value +--------------+------------------------------------------------------------------------------------------------------------------+------------------+-----+-----+-----------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------+------------- + expired cell | sstable:/var/lib/scylla/data/ks/tbl-b0f65c600ad611f0811d8550c417dfcd/me-3gox_0jgn_1gq5c20k31rxg89le0-big-Data.db | 0 | | | | {"tombstone":{}} | partition start | null + expired cell | sstable:/var/lib/scylla/data/ks/tbl-b0f65c600ad611f0811d8550c417dfcd/me-3gox_0jgn_1gq5c20k31rxg89le0-big-Data.db | 2 | 0 | 0 | 0 | {"marker":{"timestamp":1743058565262883},"columns":{"v1":{"is_live":false,"type":"regular","timestamp":1743058565262883,"deletion_time":"2025-03-27 06:56:05z"}}} | clustering row | {"v1":null} + expired cell | sstable:/var/lib/scylla/data/ks/tbl-b0f65c600ad611f0811d8550c417dfcd/me-3gox_0jgn_1gq5c20k31rxg89le0-big-Data.db | 3 | | | | null | partition end | null + +(3 rows) +``` + +## Row Marker + +The row marker is an object that is part of the row. +It stores no data, but it has a timestamp, it can have a TTL and it also interacts with tombstones, just like data. +Row markers are considered when determining whether a row is empty: a row is considered non-empty even when it has no cells, as long as it has a live row marker. 
+Such rows will show up in `SELECT` statement results as empty rows, with only key columns having values. +The row marker has a special interaction with the [shadowable tombstone](./tombstone.md#shadowable-tombstone). +This special interaction will be examined in detail in the next chapter. + +Row markers are created by `INSERT` statements: +```CQL +INSERT INTO ks.tbl (pk, ck1, ck2, v1) VALUES ('row marker 1', 0, 0, 1); + +SELECT * FROM MUTATION_FRAGMENTS(ks.tbl) WHERE pk = 'row marker 1'; + + pk | mutation_source | partition_region | ck1 | ck2 | position_weight | metadata | mutation_fragment_kind | value +--------------+-----------------+------------------+-----+-----+-----------------+---------------------------------------------------------------------------------------------------------------------------+------------------------+------------ + row marker 1 | memtable:0 | 0 | | | | {"tombstone":{}} | partition start | null + row marker 1 | memtable:0 | 2 | 0 | 0 | 0 | {"marker":{"timestamp":1743060450523155},"columns":{"v1":{"is_live":true,"type":"regular","timestamp":1743060450523155}}} | clustering row | {"v1":"1"} + row marker 1 | memtable:0 | 3 | | | | null | partition end | null + +(3 rows) +``` + +Specifying only the key columns in the `INSERT` statement will only create a row marker, but no content for the row: +```CQL +INSERT INTO ks.tbl (pk, ck1, ck2) VALUES ('row marker 2', 0, 0); +SELECT * FROM MUTATION_FRAGMENTS(ks.tbl) WHERE pk = 'row marker 2'; + + pk | mutation_source | partition_region | ck1 | ck2 | position_weight | metadata | mutation_fragment_kind | value +--------------+-----------------+------------------+-----+-----+-----------------+--------------------------------------------------------+------------------------+------- + row marker 2 | memtable:0 | 0 | | | | {"tombstone":{}} | partition start | null + row marker 2 | memtable:0 | 2 | 0 | 0 | 0 | {"marker":{"timestamp":1743060548534072},"columns":{}} | clustering row | {} + row marker 2 | 
memtable:0 | 3 | | | | null | partition end | null + +(3 rows) +``` + +Such a row will appear as an empty row with only the keys having value: +```CQL +SELECT * FROM ks.tbl WHERE pk = 'row marker 2'; + + pk | ck1 | ck2 | v1 | v2 +--------------+-----+-----+------+------ + row marker 2 | 0 | 0 | null | null + +(1 rows) +``` + +The `UPDATE` statement doesn't create row markers: +```CQL +UPDATE ks.tbl SET v1 = 1 WHERE pk = 'no row marker' AND ck1 = 0 AND ck2 = 0; +SELECT * FROM MUTATION_FRAGMENTS(ks.tbl) WHERE pk = 'no row marker'; + + pk | mutation_source | partition_region | ck1 | ck2 | position_weight | metadata | mutation_fragment_kind | value +---------------+-----------------+------------------+-----+-----+-----------------+-----------------------------------------------------------------------------------+------------------------+------------ + no row marker | memtable:0 | 0 | | | | {"tombstone":{}} | partition start | null + no row marker | memtable:0 | 2 | 0 | 0 | 0 | {"columns":{"v1":{"is_live":true,"type":"regular","timestamp":1743060161838151}}} | clustering row | {"v1":"1"} + no row marker | memtable:0 | 3 | | | | null | partition end | null + +(3 rows) +``` + +Deleting a row with a row marker will remove the row marker completely: +```CQL +DELETE FROM ks.tbl WHERE pk = 'row marker 2' AND ck1 = 0 AND ck2 = 0; + +SELECT * FROM MUTATION_FRAGMENTS(ks.tbl) WHERE pk = 'row marker 2'; + + pk | mutation_source | partition_region | ck1 | ck2 | position_weight | metadata | mutation_fragment_kind | value +--------------+-----------------+------------------+-----+-----+-----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------+------- + row marker 2 | memtable:0 | 0 | | | | {"tombstone":{}} | partition start | null + row marker 2 | memtable:0 | 2 | 0 | 0 | 0 | 
{"tombstone":{"timestamp":1743060872181113,"deletion_time":"2025-03-27 07:34:32z"},"shadowable_tombstone":{"timestamp":1743060872181113,"deletion_time":"2025-03-27 07:34:32z"},"columns":{}} | clustering row | {} + row marker 2 | memtable:0 | 3 | | | | null | partition end | null + +(3 rows) + +SELECT * FROM ks.tbl WHERE pk = 'row marker 2'; + + pk | ck1 | ck2 | v1 | v2 +----+-----+-----+----+---- + +(0 rows) +``` + +## Shadowable Tombstone + +The shadowable tombstone is a special tombstone that is used by [materialized views](https://opensource.docs.scylladb.com/stable/features/materialized-views.html). +Just like the regular row tombstone, it applies to the content of the row, but it has a special interaction with the [row marker](./tombstone.md#row-marker). +The shadowable tombstone can be covered by row markers. +In other words: when a row has both a row marker and a shadowable tombstone and `row_marker.timestamp > shadowable_tombstone.timestamp`, the row marker covers the shadowable tombstone and the latter is dropped, just like data covered by tombstones is dropped. + +Notice that in the [row tombstone](./tombstone.md#row-tombstone) and other examples, whenever the row has an active tombstone, it also has a `shadowable_tombstone` object. +This is to keep things simple internally: the final row tombstone is always calculated as `row_tombstone + shadowable_tombstone`. Normally, the timestamps of the two match -- the shadowable tombstone is not set separately. + +Shadowable tombstones cannot be created directly via CQL; they are only created by materialized view updates. 
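The interaction between the row marker and the two row-level tombstones can be sketched as below. This is a minimal model of the rule stated above, not ScyllaDB's code: `row_deletion`, `effective_row_tombstone` and `shadowable_example` are hypothetical names, tombstones are reduced to their timestamps, and the `missing_timestamp` constant is a stand-in for `api::missing_timestamp`, assumed here to be the smallest representable value.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>

using timestamp_type = int64_t;

// Stand-in for api::missing_timestamp: compares less than any real timestamp.
constexpr timestamp_type missing_timestamp = std::numeric_limits<timestamp_type>::min();

// Hypothetical holder for the two row-level deletion timestamps.
struct row_deletion {
    timestamp_type regular = missing_timestamp;    // row tombstone
    timestamp_type shadowable = missing_timestamp; // shadowable tombstone
};

// Effective deletion timestamp of a row, given its row marker timestamp
// (pass missing_timestamp when the row has no marker).
timestamp_type effective_row_tombstone(const row_deletion& d, timestamp_type marker) {
    // A row marker strictly newer than the shadowable tombstone covers it.
    const timestamp_type shadowable = marker > d.shadowable ? missing_timestamp : d.shadowable;
    // row_tombstone + shadowable_tombstone: the higher timestamp wins.
    return std::max(d.regular, shadowable);
}

// Usage: small timestamps chosen for readability.
void shadowable_example() {
    // Marker with the same timestamp as the shadowable tombstone: not covered.
    assert(effective_row_tombstone(row_deletion{missing_timestamp, 10}, 10) == 10);
    // A newer marker covers the shadowable tombstone, so no deletion remains.
    assert(effective_row_tombstone(row_deletion{missing_timestamp, 10}, 11) == missing_timestamp);
    // A regular row tombstone is unaffected by the marker.
    assert(effective_row_tombstone(row_deletion{10, 10}, 11) == 10);
}
```

The first case corresponds to a view row whose marker and shadowable tombstone carry the same timestamp: the tombstone stays in effect, so the row remains deleted.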
+ +Given the following materialized view, created on the tombstone example table: +```CQL +CREATE MATERIALIZED VIEW ks.mv AS SELECT * FROM ks.tbl WHERE v1 IS NOT NULL AND ck1 IS NOT NULL AND ck2 IS NOT NULL PRIMARY KEY (pk, v1, ck1, ck2); +``` + +Inserting a row to the base table will create a corresponding entry in the materialized view: +```CQL +INSERT INTO ks.tbl (pk, ck1, ck2, v1) VALUES ('shadowable tombstone', 0, 0, 1); + +SELECT * FROM MUTATION_FRAGMENTS(ks.mv) WHERE pk = 'shadowable tombstone'; + + pk | mutation_source | partition_region | v1 | ck1 | ck2 | position_weight | metadata | mutation_fragment_kind | value +----------------------+-----------------+------------------+----+-----+-----+-----------------+--------------------------------------------------------+------------------------+------- + shadowable tombstone | memtable:0 | 0 | | | | | {"tombstone":{}} | partition start | null + shadowable tombstone | memtable:0 | 2 | 1 | 0 | 0 | 0 | {"marker":{"timestamp":1743061930471880},"columns":{}} | clustering row | {} + shadowable tombstone | memtable:0 | 3 | | | | | null | partition end | null + +(3 rows) +``` + +When changing the value of `v1` in the base table row, the materialized view has to first delete the old instance of the view row and then insert a new one. 
+The old instance with the now non-existent `v1=1` key is deleted with a shadowable tombstone: +```CQL +INSERT INTO ks.tbl (pk, ck1, ck2, v1) VALUES ('shadowable tombstone', 0, 0, 2); + +SELECT * FROM MUTATION_FRAGMENTS(ks.mv) WHERE pk = 'shadowable tombstone'; + + pk | mutation_source | partition_region | v1 | ck1 | ck2 | position_weight | metadata | mutation_fragment_kind | value +----------------------+-----------------+------------------+----+-----+-----+-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------+------- + shadowable tombstone | memtable:0 | 0 | | | | | {"tombstone":{}} | partition start | null + shadowable tombstone | memtable:0 | 2 | 1 | 0 | 0 | 0 | {"tombstone":{},"shadowable_tombstone":{"timestamp":1743061930471880,"deletion_time":"2025-03-27 07:53:00z"},"marker":{"timestamp":1743061930471880},"columns":{}} | clustering row | {} + shadowable tombstone | memtable:0 | 2 | 2 | 0 | 0 | 0 | {"marker":{"timestamp":1743061980019472},"columns":{}} | clustering row | {} + shadowable tombstone | memtable:0 | 3 | | | | | null | partition end | null + +(4 rows) +``` + +Note how the regular row tombstone is empty (`{}`) and only the shadowable tombstone is active. 
+ +Re-inserting `v1=1` into the base row will now delete the `v1=2` instance of the row with a shadowable tombstone and it will insert a new `v1=1` row into the view, with a fresh row marker, which eliminates the previously inserted shadowable tombstone: +```CQL +INSERT INTO ks.tbl (pk, ck1, ck2, v1) VALUES ('shadowable tombstone', 0, 0, 1); + +SELECT * FROM MUTATION_FRAGMENTS(ks.mv) WHERE pk = 'shadowable tombstone'; + + pk | mutation_source | partition_region | v1 | ck1 | ck2 | position_weight | metadata | mutation_fragment_kind | value +----------------------+-----------------+------------------+----+-----+-----+-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------+------- + shadowable tombstone | memtable:0 | 0 | | | | | {"tombstone":{}} | partition start | null + shadowable tombstone | memtable:0 | 2 | 1 | 0 | 0 | 0 | {"marker":{"timestamp":1743062162754870},"columns":{}} | clustering row | {} + shadowable tombstone | memtable:0 | 2 | 2 | 0 | 0 | 0 | {"tombstone":{},"shadowable_tombstone":{"timestamp":1743061980019472,"deletion_time":"2025-03-27 07:56:02z"},"marker":{"timestamp":1743061980019472},"columns":{}} | clustering row | {} + shadowable tombstone | memtable:0 | 3 | | | | | null | partition end | null + +(4 rows) +``` + +This special interaction of the shadowable tombstone and the row marker is important because view rows use the base row's timestamps. +So a row newly inserted into the view can carry old timestamps. +Having a regular row tombstone around, from a previous key change, could accidentally delete content from this newly inserted row. +So by having the row marker override the shadowable tombstone, we have a way to "undo" past deletions and "resurrect" old instances of a view row. 
+ +## Tombstone Garbage Collection + +Tombstones are only valuable as long as there exists data that they cover. +After compaction has merged all instances of the covered data with the tombstone, the tombstone stops being useful. +It is now garbage and ScyllaDB wants to get rid of it. This is called tombstone garbage collection (tombstone GC for short) or tombstone purging. +Garbage collecting a tombstone can cause data resurrection if not done properly: +* If data that the tombstone covers still exists on the local replica, this data will be resurrected if the tombstone is garbage collected before it could take effect on that data. +* Some other replica in the cluster may have missed the delete. If the tombstone is garbage collected before a repair could propagate it to such replicas, these replicas will resurrect the data. + +Data resurrection means that some data that was deleted by the user reappears later, as if it wasn't deleted. +Data resurrection is considered a serious failure of the database, as serious as data loss. + +Based on the two different ways improper tombstone garbage collection can cause data resurrection, there are two conditions that have to be fulfilled for a tombstone to become eligible for garbage collection: +1) The database has to make sure the tombstone was compacted together with **all** data it covers and that all such data was consequently dropped. In other words, there is no data entity such that `entity.timestamp <= tombstone.timestamp`. +2) The tombstone was replicated to **all** replicas for the partition it is part of. In other words, in a table with RF=N, all N replicas have seen this tombstone and applied it to their data. + +Once these two conditions are met, the tombstone is eligible for garbage collection and can be purged (dropped). + +### Overlap Check + +To ensure condition (1), compaction performs so-called "overlap checks" for every tombstone.
The overlap check is done per partition: the result of the overlap check for a given partition is applied to all tombstones in this partition. +An overlap check consists of checking every other data source and getting the smallest timestamp of live data each of them has for the checked partition. +Once this `min_timestamp` is obtained from all other data sources, the smallest of these is compared against the timestamp of tombstones in this partition. +A tombstone is considered to have passed the overlap check if `min_timestamp > tombstone.timestamp`. +In other words, we know that **all** other data sources only have data which is not covered by this tombstone -- because that data has higher timestamps. + +In ScyllaDB the following data sources exist: +* SSTable(s) +* Row Cache +* Memtable(s) +* Commitlog + +When compacting in any of the data sources, overlap checks have to be done against all other data sources. +Furthermore, when compacting an SSTable, overlap checks have to be done against other SSTables that don't participate in the current compaction. +There is one exception here: the row cache represents the content of the SSTables, so when compacting the cache, no overlap checks are required against SSTables. + +### Expiry + +To ensure condition (2), ScyllaDB uses the `deletion_time` field of tombstones, which is the wall-clock time at which the tombstone was written. +This deletion time is used to check whether enough time has passed for this tombstone to be considered safe to garbage collect, or in other words, whether it has "expired". +For tombstones which are created from TTL'd data, the `deletion_time` is the time the TTL'd data was written, not the time when the data expired. +This is because TTL'd data starts being replicated in the cluster as soon as it is written, while it is still in its live form. + +There are different methods to check whether a tombstone is "old enough" to be considered expired.
+ +The most advanced method involves keeping a record of repairs. +When a tombstone is considered for garbage collection, the time of the last repair is obtained from this record. +This happens on a partition granularity. +This last repair time is then used to determine whether the tombstone is eligible for garbage collection: if the tombstone was written *before* the last repair, it is considered eligible for garbage collection. + +A much simpler method, the one used historically by both Apache Cassandra and ScyllaDB, is a simple timeout-based one. +Each table has a `gc_grace_seconds` attribute, which is a fixed time window during which the user is expected to perform at least one full repair. +For example, with `gc_grace_seconds=864000` (10 days), it is assumed that the user repairs their cluster thoroughly at least once every 10 days, so that any tombstone that is 10 days old is guaranteed to have been propagated to all replicas already. +In this system, checking a tombstone's eligibility for garbage collection simply means checking that it is at least `gc_grace_seconds` old. If so, it is considered expired. + +See [Tombstones GC options](https://opensource.docs.scylladb.com/stable/cql#tombstones-gc-options) for more details on different modes of expiry.
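
To tie the two conditions together, here is a minimal C++ sketch of the purge decision. It is only an illustration under stated assumptions, not ScyllaDB's implementation: the function names (`passes_overlap_check`, `expired_by_timeout`, `expired_by_repair`, `can_purge`) are hypothetical, timestamps are plain integers, `std::chrono::system_clock` stands in for the GC clock, and the per-source minimum live timestamps and the last repair time are assumed to be supplied by the caller.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical, simplified stand-ins for the real types.
using timestamp_type = std::int64_t;
using gc_time_point = std::chrono::system_clock::time_point;

struct tombstone {
    timestamp_type timestamp;
    gc_time_point deletion_time;
};

// Condition (1): overlap check. `min_live_timestamps` holds, for the
// tombstone's partition, the smallest live timestamp found in each other
// data source. With no other data sources, the check passes trivially.
inline bool passes_overlap_check(const tombstone& t, const std::vector<timestamp_type>& min_live_timestamps) {
    timestamp_type min_timestamp = std::numeric_limits<timestamp_type>::max();
    for (timestamp_type ts : min_live_timestamps) {
        min_timestamp = std::min(min_timestamp, ts);
    }
    return min_timestamp > t.timestamp;
}

// Condition (2), timeout mode: the tombstone is at least gc_grace_seconds old.
inline bool expired_by_timeout(const tombstone& t, gc_time_point now, std::chrono::seconds gc_grace_seconds) {
    return t.deletion_time + gc_grace_seconds <= now;
}

// Condition (2), repair mode: the tombstone was written before the last repair.
inline bool expired_by_repair(const tombstone& t, gc_time_point last_repair) {
    return t.deletion_time < last_repair;
}

// A tombstone can be purged only when both conditions hold.
inline bool can_purge(const tombstone& t, const std::vector<timestamp_type>& min_live_timestamps, bool expired) {
    return expired && passes_overlap_check(t, min_live_timestamps);
}
```

The caller picks one of the two expiry checks according to the table's configured mode and passes the result to `can_purge`. Note that the overlap check uses a strict comparison: data with a timestamp equal to the tombstone's is still covered, so it blocks the purge.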