From 3bad46a6e2870ebd4905bd1d7aa59c8a368f9db7 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Botond=20D=C3=A9nes?= <bdenes@scylladb.com>
Date: Wed, 26 Mar 2025 11:31:49 -0400
Subject: [PATCH] docs/dev: add tombstone.md

An exhaustive document on the tombstone related internal logic as well
as the user-facing aspects.

Closes scylladb/scylladb#23454
---
 docs/dev/tombstone.md | 613 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 613 insertions(+)
 create mode 100644 docs/dev/tombstone.md

diff --git a/docs/dev/tombstone.md b/docs/dev/tombstone.md
new file mode 100644
index 0000000000..d254dca778
--- /dev/null
+++ b/docs/dev/tombstone.md
@@ -0,0 +1,613 @@
+# Tombstones
+
+This document explains what tombstones are, how they are used, and how they are created in ScyllaDB.
+
+## Introduction
+
+In an LSM Tree data structure, deleting data directly is not possible.
+Instead, deleting a key is translated to a write of a special deletion marker for said key.
+When the key and its deletion marker are compacted together, the key and any data it may have is dropped as a result.
+
+In ScyllaDB these deletion markers are called "tombstones".
+
+A tombstone has the following fields:
+
+```c++
+struct tombstone {
+    api::timestamp_type timestamp;
+    gc_clock::time_point deletion_time;
+};
+```
+
+The `timestamp` field is used to determine whether the tombstone covers some live data.
+If `tombstone.timestamp >= live_data.timestamp`, the data is covered by the tombstone and it is considered dead.
+In ScyllaDB, every cell has its own timestamp, so this coverage check happens on a cell level.
+It is possible that in a given row, some cells are covered, while others are not and they remain live.
+
+The `deletion_time` is used to determine whether the tombstone is eligible for garbage collection.
+Once a tombstone has been compacted together with *all* data it could possibly cover, and all such data has therefore been dropped, it is no longer useful and can be garbage collected (dropped).
+Tombstone garbage collection involves complex rules; for more details see the [Tombstone Garbage Collection](./tombstone.md#tombstone-garbage-collection) chapter.
+
+## Tombstone Hierarchy
+
+Tombstones and data form a hierarchical structure, where tombstones can cover data on the same level and below, but not above.
+The tombstone hierarchy follows the data hierarchy, which looks like this:
+```mermaid
+graph TD;
+    partition-->row;
+    row-->cell;
+    row-->collection;
+    collection-->cell;
+```
+
+Tombstone hierarchy:
+```mermaid
+graph TD;
+    P[partition tombstone]-->RT[range tombstone];
+    RT[range tombstone]-->R[row tombstone];
+    R[row tombstone]-->C[cell tombstone];
+    R-->CT[collection tombstone];
+    CT-->C;
+```
+
+A partition tombstone applies to everything in the partition, including the static row.
+
+The range tombstone applies to a clustering range, so it can apply to multiple rows.
+
+Collections have a collection tombstone which applies to all elements in the collection.
+Collection members are similar to regular cells: they can be live or dead (cell tombstone).
+In the ScyllaDB type system, in addition to the `list`, `set` and `map` collection types, `tuple` and `UDT` are also stored as collections so the same rules apply to them as well.
+
+When determining whether some entity is live or not, one has to consider all tombstones that are above this entity in the tombstone hierarchy.
+For example, to determine whether a regular cell is live or not, one has to calculate the following tombstone: `partition tombstone + range tombstone + row tombstone`.
+Adding two tombstones simply means choosing the one with the higher timestamp:
+```c++
+tombstone operator+(tombstone a, tombstone b) {
+    return a.timestamp > b.timestamp ? a : b;
+}
+```
+
+Missing tombstones (there is no deletion) have the special `timestamp` value of `api::missing_timestamp`, which compares less than any other timestamp value.
+
+## How Are Tombstones Created?
+
+Creating the higher-level tombstones (row or above) is fairly intuitive and involves a `DELETE FROM` statement as one would expect.
+But when it comes to cells and collections, tombstones can be created as a result of many different statements.
+
+Given the following table as an example:
+```CQL
+CREATE TABLE ks.tbl (pk text, ck1 int, ck2 int, v1 int, v2 map<int, int>, PRIMARY KEY (pk, ck1, ck2));
+```
+
+Below we will examine how each kind of tombstone can be created, with concrete examples.
+Each example will have a CQL statement which creates the tombstone followed by a [SELECT * FROM MUTATION_FRAGMENTS()](https://opensource.docs.scylladb.com/stable/operating-scylla/admin-tools/select-from-mutation-fragments.html) query showing the created tombstone.
+
+### Partition Tombstone
+
+```CQL
+DELETE FROM ks.tbl WHERE pk = 'partition tombstone';
+
+SELECT * FROM MUTATION_FRAGMENTS(ks.tbl) WHERE pk = 'partition tombstone';
+
+ pk                  | mutation_source | partition_region | ck1 | ck2 | position_weight | metadata                                                                            | mutation_fragment_kind | value
+---------------------+-----------------+------------------+-----+-----+-----------------+-------------------------------------------------------------------------------------+------------------------+-------
+ partition tombstone |      memtable:0 |                0 |     |     |                 | {"tombstone":{"timestamp":1743054972857790,"deletion_time":"2025-03-27 05:56:12z"}} |        partition start |  null
+ partition tombstone |      memtable:0 |                3 |     |     |                 |                                                                                null |          partition end |  null
+
+(2 rows)
+```
+
+### Range Tombstone
+
+Delete a range:
+```CQL
+DELETE FROM ks.tbl WHERE pk = 'range tombstone 1' AND ck1 = 0 AND ck2 > 100 AND ck2 < 200;
+
+SELECT * FROM MUTATION_FRAGMENTS(ks.tbl) WHERE pk = 'range tombstone 1';
+
+ pk                | mutation_source | partition_region | ck1 | ck2 | position_weight | metadata                                                                            | mutation_fragment_kind | value
+-------------------+-----------------+------------------+-----+-----+-----------------+-------------------------------------------------------------------------------------+------------------------+-------
+ range tombstone 1 |      memtable:0 |                0 |     |     |                 |                                                                    {"tombstone":{}} |        partition start |  null
+ range tombstone 1 |      memtable:0 |                2 |   0 | 100 |               1 | {"tombstone":{"timestamp":1743055013006807,"deletion_time":"2025-03-27 05:56:53z"}} | range tombstone change |  null
+ range tombstone 1 |      memtable:0 |                2 |   0 | 200 |              -1 |                                                                    {"tombstone":{}} | range tombstone change |  null
+ range tombstone 1 |      memtable:0 |                3 |     |     |                 |                                                                                null |          partition end |  null
+
+(4 rows)
+```
+
+Delete a prefix -- only a prefix of all the clustering columns is restricted:
+```CQL
+DELETE FROM ks.tbl WHERE pk = 'range tombstone 2' AND ck1 = 1;
+
+SELECT * FROM MUTATION_FRAGMENTS(ks.tbl) WHERE pk = 'range tombstone 2';
+
+ pk                | mutation_source | partition_region | ck1 | ck2 | position_weight | metadata                                                                            | mutation_fragment_kind | value
+-------------------+-----------------+------------------+-----+-----+-----------------+-------------------------------------------------------------------------------------+------------------------+-------
+ range tombstone 2 |      memtable:0 |                0 |     |     |                 |                                                                    {"tombstone":{}} |        partition start |  null
+ range tombstone 2 |      memtable:0 |                2 |   1 |     |              -1 | {"tombstone":{"timestamp":1743055505954714,"deletion_time":"2025-03-27 06:05:05z"}} | range tombstone change |  null
+ range tombstone 2 |      memtable:0 |                2 |   1 |     |               1 |                                                                    {"tombstone":{}} | range tombstone change |  null
+ range tombstone 2 |      memtable:0 |                3 |     |     |                 |                                                                                null |          partition end |  null
+
+(4 rows)
+```
+
+Internally, range tombstones are represented by so-called "range tombstone change" objects.
+There are two such objects: one at the start of the range tombstone (the lower bound key) and one at the end of the range tombstone (the higher bound key).
+A range tombstone takes effect between its start and end bounds.
+A partition can have any number of range tombstone change objects, but if it has any, there are always at least two: a first one and a last one.
+The last range tombstone change in a partition always has an empty tombstone; this marks the remaining range of the partition as having no range tombstone, or in other words it "resets" the current tombstone.
+The range tombstone end object resets the tombstone to the null-tombstone (no deletion).
+Despite every range tombstone having a start and an end range tombstone change object, it is possible for a partition to have an odd number of such objects.
+This happens when range tombstones overlap:
+```CQL
+DELETE FROM ks.tbl WHERE pk = 'range tombstone 3' AND ck1 = 0 AND ck2 > 100 AND ck2 < 200;
+
+DELETE FROM ks.tbl WHERE pk = 'range tombstone 3' AND ck1 = 0 AND ck2 > 150 AND ck2 < 300;
+
+SELECT * FROM MUTATION_FRAGMENTS(ks.tbl) WHERE pk = 'range tombstone 3';
+
+ pk                | mutation_source | partition_region | ck1 | ck2 | position_weight | metadata                                                                            | mutation_fragment_kind | value
+-------------------+-----------------+------------------+-----+-----+-----------------+-------------------------------------------------------------------------------------+------------------------+-------
+ range tombstone 3 |      memtable:0 |                0 |     |     |                 |                                                                    {"tombstone":{}} |        partition start |  null
+ range tombstone 3 |      memtable:0 |                2 |   0 | 100 |               1 | {"tombstone":{"timestamp":1743164183543439,"deletion_time":"2025-03-28 12:16:23z"}} | range tombstone change |  null
+ range tombstone 3 |      memtable:0 |                2 |   0 | 150 |               1 | {"tombstone":{"timestamp":1743164186551458,"deletion_time":"2025-03-28 12:16:26z"}} | range tombstone change |  null
+ range tombstone 3 |      memtable:0 |                2 |   0 | 300 |              -1 |                                                                    {"tombstone":{}} | range tombstone change |  null
+ range tombstone 3 |      memtable:0 |                3 |     |     |                 |                                                                                null |          partition end |  null
+
+(5 rows)
+```
+
+Between `(0, 150)` and `(0, 200)` (both bounds exclusive) the two range tombstones overlap, so the one with the higher timestamp, here the later one, takes precedence and overwrites the other.
+
+Range tombstone change objects always have a position which is *before* or *after* a certain key, never a position where they are *at* a key.
+This becomes evident when looking at the `position_weight` column in the mutation dump.
+Range tombstone changes always have a position weight of either `1` or `-1`.
+Regular rows always have a position weight of `0`.
+Position weight is relevant when comparing two positions which have the same clustering key.
+In this case the position weight is the tie-breaker: comparing the position weight of the two respective positions will determine the comparison result.
+
+### Row Tombstone
+
+```CQL
+DELETE FROM ks.tbl WHERE pk = 'row tombstone' AND ck1 = 5 AND ck2 = 5;
+
+SELECT * FROM MUTATION_FRAGMENTS(ks.tbl) WHERE pk = 'row tombstone';
+
+ pk            | mutation_source | partition_region | ck1 | ck2 | position_weight | metadata                                                                                                                                                                                      | mutation_fragment_kind | value
+---------------+-----------------+------------------+-----+-----+-----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------+-------
+ row tombstone |      memtable:0 |                0 |     |     |                 |                                                                                                                                                                              {"tombstone":{}} |        partition start |  null
+ row tombstone |      memtable:0 |                2 |   5 |   5 |               0 | {"tombstone":{"timestamp":1743055543508176,"deletion_time":"2025-03-27 06:05:43z"},"shadowable_tombstone":{"timestamp":1743055543508176,"deletion_time":"2025-03-27 06:05:43z"},"columns":{}} |         clustering row |    {}
+ row tombstone |      memtable:0 |                3 |     |     |                 |                                                                                                                                                                                          null |          partition end |  null
+
+(3 rows)
+```
+
+The row tombstone also has a `shadowable_tombstone` object included in it. More on this in the [Shadowable Tombstone](./tombstone.md#shadowable-tombstone) chapter.
+
+### Regular Cell Tombstone
+
+Delete regular cell with `DELETE` statement:
+```CQL
+DELETE v1 FROM ks.tbl WHERE pk = 'regular cell tombstone 1' AND ck1 = 0 AND ck2 = 0;
+
+SELECT * FROM MUTATION_FRAGMENTS(ks.tbl) WHERE pk = 'regular cell tombstone 1';
+
+ pk                       | mutation_source | partition_region | ck1 | ck2 | position_weight | metadata                                                                                                                  | mutation_fragment_kind | value
+--------------------------+-----------------+------------------+-----+-----+-----------------+---------------------------------------------------------------------------------------------------------------------------+------------------------+-------------
+ regular cell tombstone 1 |      memtable:0 |                0 |     |     |                 |                                                                                                          {"tombstone":{}} |        partition start |        null
+ regular cell tombstone 1 |      memtable:0 |                2 |   0 |   0 |               0 | {"columns":{"v1":{"is_live":false,"type":"regular","timestamp":1743056112215870,"deletion_time":"2025-03-27 06:15:12z"}}} |         clustering row | {"v1":null}
+ regular cell tombstone 1 |      memtable:0 |                3 |     |     |                 |                                                                                                                      null |          partition end |        null
+
+(3 rows)
+```
+
+When it comes to cell tombstones (be that regular or collection cells), there is no longer a separate tombstone object.
+Instead, a cell is either live or dead (see `is_live` in the `metadata` column). If `is_live=false`, the cell is a dead cell -- also called a cell tombstone.
+
+Deleting a cell and setting it to `null` have the same effect.
+
+Delete regular cell with `UPDATE` statement:
+```CQL
+UPDATE ks.tbl SET v1 = null WHERE pk = 'regular cell tombstone 2' AND ck1 = 0 AND ck2 = 0;
+
+SELECT * FROM MUTATION_FRAGMENTS(ks.tbl) WHERE pk = 'regular cell tombstone 2';
+
+ pk                       | mutation_source | partition_region | ck1 | ck2 | position_weight | metadata                                                                                                                  | mutation_fragment_kind | value
+--------------------------+-----------------+------------------+-----+-----+-----------------+---------------------------------------------------------------------------------------------------------------------------+------------------------+-------------
+ regular cell tombstone 2 |      memtable:0 |                0 |     |     |                 |                                                                                                          {"tombstone":{}} |        partition start |        null
+ regular cell tombstone 2 |      memtable:0 |                2 |   0 |   0 |               0 | {"columns":{"v1":{"is_live":false,"type":"regular","timestamp":1743056276318904,"deletion_time":"2025-03-27 06:17:56z"}}} |         clustering row | {"v1":null}
+ regular cell tombstone 2 |      memtable:0 |                3 |     |     |                 |                                                                                                                      null |          partition end |        null
+
+(3 rows)
+```
+
+Delete regular cell with `INSERT` statement:
+```CQL
+INSERT INTO ks.tbl (pk, ck1, ck2, v1) VALUES ('regular cell tombstone 3', 0, 0, null);
+
+SELECT * FROM MUTATION_FRAGMENTS(ks.tbl) WHERE pk = 'regular cell tombstone 3';
+
+ pk                       | mutation_source | partition_region | ck1 | ck2 | position_weight | metadata                                                                                                                                                          | mutation_fragment_kind | value
+--------------------------+-----------------+------------------+-----+-----+-----------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------+-------------
+ regular cell tombstone 3 |      memtable:0 |                0 |     |     |                 |                                                                                                                                                  {"tombstone":{}} |        partition start |        null
+ regular cell tombstone 3 |      memtable:0 |                2 |   0 |   0 |               0 | {"marker":{"timestamp":1743056345463644},"columns":{"v1":{"is_live":false,"type":"regular","timestamp":1743056345463644,"deletion_time":"2025-03-27 06:19:05z"}}} |         clustering row | {"v1":null}
+ regular cell tombstone 3 |      memtable:0 |                3 |     |     |                 |                                                                                                                                                              null |          partition end |        null
+
+(3 rows)
+```
+
+### Collection Tombstone
+
+Delete collection with `DELETE` statement:
+```CQL
+DELETE v2 FROM ks.tbl WHERE pk = 'collection tombstone 1' AND ck1 = 0 AND ck2 = 0;
+
+SELECT * FROM MUTATION_FRAGMENTS(ks.tbl) WHERE pk = 'collection tombstone 1';
+
+ pk                     | mutation_source | partition_region | ck1 | ck2 | position_weight | metadata                                                                                                          | mutation_fragment_kind | value
+------------------------+-----------------+------------------+-----+-----+-----------------+-------------------------------------------------------------------------------------------------------------------+------------------------+-----------
+ collection tombstone 1 |      memtable:0 |                0 |     |     |                 |                                                                                                  {"tombstone":{}} |        partition start |      null
+ collection tombstone 1 |      memtable:0 |                2 |   0 |   0 |               0 | {"columns":{"v2":{"tombstone":{"timestamp":1743056564590040,"deletion_time":"2025-03-27 06:22:44z"},"cells":[]}}} |         clustering row | {"v2":[]}
+ collection tombstone 1 |      memtable:0 |                3 |     |     |                 |                                                                                                              null |          partition end |      null
+
+(3 rows)
+```
+
+Setting the collection to `null` has the same effect, just like for regular cells.
+
+Delete collection with `UPDATE` statement:
+```CQL
+UPDATE ks.tbl SET v2 = null WHERE pk = 'collection tombstone 2' AND ck1 = 0 AND ck2 = 0;
+
+SELECT * FROM MUTATION_FRAGMENTS(ks.tbl) WHERE pk = 'collection tombstone 2';
+
+ pk                     | mutation_source | partition_region | ck1 | ck2 | position_weight | metadata                                                                                                          | mutation_fragment_kind | value
+------------------------+-----------------+------------------+-----+-----+-----------------+-------------------------------------------------------------------------------------------------------------------+------------------------+-----------
+ collection tombstone 2 |      memtable:0 |                0 |     |     |                 |                                                                                                  {"tombstone":{}} |        partition start |      null
+ collection tombstone 2 |      memtable:0 |                2 |   0 |   0 |               0 | {"columns":{"v2":{"tombstone":{"timestamp":1743056668558340,"deletion_time":"2025-03-27 06:24:28z"},"cells":[]}}} |         clustering row | {"v2":[]}
+ collection tombstone 2 |      memtable:0 |                3 |     |     |                 |                                                                                                              null |          partition end |      null
+
+(3 rows)
+```
+
+Delete collection with `INSERT` statement:
+```CQL
+INSERT INTO ks.tbl (pk, ck1, ck2, v2) VALUES ('collection tombstone 3', 0, 0, null);
+
+SELECT * FROM MUTATION_FRAGMENTS(ks.tbl) WHERE pk = 'collection tombstone 3';
+
+ pk                     | mutation_source | partition_region | ck1 | ck2 | position_weight | metadata                                                                                                                                                  | mutation_fragment_kind | value
+------------------------+-----------------+------------------+-----+-----+-----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------+-----------
+ collection tombstone 3 |      memtable:0 |                0 |     |     |                 |                                                                                                                                          {"tombstone":{}} |        partition start |      null
+ collection tombstone 3 |      memtable:0 |                2 |   0 |   0 |               0 | {"marker":{"timestamp":1743056946866432},"columns":{"v2":{"tombstone":{"timestamp":1743056946866431,"deletion_time":"2025-03-27 06:29:06z"},"cells":[]}}} |         clustering row | {"v2":[]}
+ collection tombstone 3 |      memtable:0 |                3 |     |     |                 |                                                                                                                                                      null |          partition end |      null
+
+(3 rows)
+```
+
+Collection tombstones are also generated when a collection is fully overwritten.
+
+Collection tombstone generated by full overwrite, using the `UPDATE` statement:
+```CQL
+UPDATE ks.tbl SET v2 = {1: 12, 2: 44} WHERE pk = 'collection tombstone 4' AND ck1 = 0 AND ck2 = 0;
+
+SELECT * FROM MUTATION_FRAGMENTS(ks.tbl) WHERE pk = 'collection tombstone 4';
+
+ pk                     | mutation_source | partition_region | ck1 | ck2 | position_weight | metadata                                                                                                                                                                                                                                                                               | mutation_fragment_kind | value
+------------------------+-----------------+------------------+-----+-----+-----------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------+------------------------------------------------------------
+ collection tombstone 4 |      memtable:0 |                0 |     |     |                 |                                                                                                                                                                                                                                                                       {"tombstone":{}} |        partition start |                                                       null
+ collection tombstone 4 |      memtable:0 |                2 |   0 |   0 |               0 | {"columns":{"v2":{"tombstone":{"timestamp":1743057841587097,"deletion_time":"2025-03-27 06:44:01z"},"cells":[{"key":"1","value":{"is_live":true,"type":"regular","timestamp":1743057841587098}},{"key":"2","value":{"is_live":true,"type":"regular","timestamp":1743057841587098}}]}}} |         clustering row | {"v2":[{"key":"1","value":"12"},{"key":"2","value":"44"}]}
+ collection tombstone 4 |      memtable:0 |                3 |     |     |                 |                                                                                                                                                                                                                                                                                   null |          partition end |                                                       null
+
+(3 rows)
+```
+
+Collection tombstone generated by full overwrite, using the `INSERT` statement:
+```CQL
+INSERT INTO ks.tbl (pk, ck1, ck2, v2) VALUES ('collection tombstone 5', 0, 0, {1: 12, 2: 44});
+
+SELECT * FROM MUTATION_FRAGMENTS(ks.tbl) WHERE pk = 'collection tombstone 5';
+
+ pk                     | mutation_source | partition_region | ck1 | ck2 | position_weight | metadata                                                                                                                                                                                                                                                                                                                       | mutation_fragment_kind | value
+------------------------+-----------------+------------------+-----+-----+-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------+------------------------------------------------------------
+ collection tombstone 5 |      memtable:0 |                0 |     |     |                 |                                                                                                                                                                                                                                                                                                               {"tombstone":{}} |        partition start |                                                       null
+ collection tombstone 5 |      memtable:0 |                2 |   0 |   0 |               0 | {"marker":{"timestamp":1743057913516603},"columns":{"v2":{"tombstone":{"timestamp":1743057913516602,"deletion_time":"2025-03-27 06:45:13z"},"cells":[{"key":"1","value":{"is_live":true,"type":"regular","timestamp":1743057913516603}},{"key":"2","value":{"is_live":true,"type":"regular","timestamp":1743057913516603}}]}}} |         clustering row | {"v2":[{"key":"1","value":"12"},{"key":"2","value":"44"}]}
+ collection tombstone 5 |      memtable:0 |                3 |     |     |                 |                                                                                                                                                                                                                                                                                                                           null |          partition end |                                                       null
+
+(3 rows)
+```
+
+### Collection Cell Tombstone
+
+Collection cell tombstones behave like regular cell tombstones for the most part.
+
+Delete collection cell with `DELETE` statement:
+```CQL
+DELETE v2[1] FROM ks.tbl WHERE pk = 'collection cell tombstone 1' AND ck1 = 0 AND ck2 = 0;
+
+SELECT * FROM MUTATION_FRAGMENTS(ks.tbl) WHERE pk = 'collection cell tombstone 1';
+
+ pk                          | mutation_source | partition_region | ck1 | ck2 | position_weight | metadata                                                                                                                                                  | mutation_fragment_kind | value
+-----------------------------+-----------------+------------------+-----+-----+-----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------+-----------------------------------
+ collection cell tombstone 1 |      memtable:0 |                0 |     |     |                 |                                                                                                                                          {"tombstone":{}} |        partition start |                              null
+ collection cell tombstone 1 |      memtable:0 |                2 |   0 |   0 |               0 | {"columns":{"v2":{"cells":[{"key":"1","value":{"is_live":false,"type":"regular","timestamp":1743057941371233,"deletion_time":"2025-03-27 06:45:41z"}}]}}} |         clustering row | {"v2":[{"key":"1","value":null}]}
+ collection cell tombstone 1 |      memtable:0 |                3 |     |     |                 |                                                                                                                                                      null |          partition end |                              null
+
+(3 rows)
+```
+
+Delete collection cell with `UPDATE` statement:
+```CQL
+UPDATE ks.tbl SET v2[1] = null WHERE pk = 'collection cell tombstone 2' AND ck1 = 0 AND ck2 = 0;
+
+SELECT * FROM MUTATION_FRAGMENTS(ks.tbl) WHERE pk = 'collection cell tombstone 2';
+
+ pk                          | mutation_source | partition_region | ck1 | ck2 | position_weight | metadata                                                                                                                                                  | mutation_fragment_kind | value
+-----------------------------+-----------------+------------------+-----+-----+-----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------+-----------------------------------
+ collection cell tombstone 2 |      memtable:0 |                0 |     |     |                 |                                                                                                                                          {"tombstone":{}} |        partition start |                              null
+ collection cell tombstone 2 |      memtable:0 |                2 |   0 |   0 |               0 | {"columns":{"v2":{"cells":[{"key":"1","value":{"is_live":false,"type":"regular","timestamp":1743058010855333,"deletion_time":"2025-03-27 06:46:50z"}}]}}} |         clustering row | {"v2":[{"key":"1","value":null}]}
+ collection cell tombstone 2 |      memtable:0 |                3 |     |     |                 |                                                                                                                                                      null |          partition end |                              null
+
+(3 rows)
+```
+
+Individual collection cells cannot be deleted (or overwritten) by an `INSERT` statement.
+
+## TTL and Tombstones
+
+Cells which have a [TTL](https://opensource.docs.scylladb.com/stable/cql/time-to-live.html) become cell tombstones after they expire.
+
+```CQL
+INSERT INTO ks.tbl (pk, ck1, ck2, v1) VALUES ('expired cell', 0, 0, 1) USING TTL 1;
+
+SELECT * FROM MUTATION_FRAGMENTS(ks.tbl) WHERE pk = 'expired cell';
+
+ pk           | mutation_source | partition_region | ck1 | ck2 | position_weight | metadata                                                                                                                                                                                                        | mutation_fragment_kind | value
+--------------+-----------------+------------------+-----+-----+-----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------+------------
+ expired cell |      memtable:0 |                0 |     |     |                 |                                                                                                                                                                                                {"tombstone":{}} |        partition start |       null
+ expired cell |      memtable:0 |                2 |   0 |   0 |               0 | {"marker":{"timestamp":1743058565262883,"ttl":"1s","expiry":"2025-03-27 06:56:06z"},"columns":{"v1":{"is_live":true,"type":"regular","timestamp":1743058565262883,"ttl":"1s","expiry":"2025-03-27 06:56:06z"}}} |         clustering row | {"v1":"1"}
+ expired cell |      memtable:0 |                3 |     |     |                 |                                                                                                                                                                                                            null |          partition end |       null
+
+(3 rows)
+```
+
+Even though the cell has already expired, it still appears as live, because expired cells are only converted to cell tombstones by compaction.
+To determine whether a TTL'd cell is really live or not, one has to compare its `expiry` with the current time.
+If `expiry <= now()`, the cell is expired and it will be treated as a dead cell by compaction.
+
+Reads also compact the data, so if we read this partition, the cell will not show up in the results:
+```CQL
+SELECT * FROM ks.tbl WHERE pk = 'expired cell';
+
+ pk | ck1 | ck2 | v1 | v2
+----+-----+-----+----+----
+
+(0 rows)
+```
+
+After flushing and compacting the table, the expired cell is converted to a cell tombstone:
+```CQL
+SELECT * FROM MUTATION_FRAGMENTS(ks.tbl) WHERE pk = 'expired cell';
+
+ pk           | mutation_source                                                                                                  | partition_region | ck1 | ck2 | position_weight | metadata                                                                                                                                                          | mutation_fragment_kind | value
+--------------+------------------------------------------------------------------------------------------------------------------+------------------+-----+-----+-----------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------+-------------
+ expired cell | sstable:/var/lib/scylla/data/ks/tbl-b0f65c600ad611f0811d8550c417dfcd/me-3gox_0jgn_1gq5c20k31rxg89le0-big-Data.db |                0 |     |     |                 |                                                                                                                                                  {"tombstone":{}} |        partition start |        null
+ expired cell | sstable:/var/lib/scylla/data/ks/tbl-b0f65c600ad611f0811d8550c417dfcd/me-3gox_0jgn_1gq5c20k31rxg89le0-big-Data.db |                2 |   0 |   0 |               0 | {"marker":{"timestamp":1743058565262883},"columns":{"v1":{"is_live":false,"type":"regular","timestamp":1743058565262883,"deletion_time":"2025-03-27 06:56:05z"}}} |         clustering row | {"v1":null}
+ expired cell | sstable:/var/lib/scylla/data/ks/tbl-b0f65c600ad611f0811d8550c417dfcd/me-3gox_0jgn_1gq5c20k31rxg89le0-big-Data.db |                3 |     |     |                 |                                                                                                                                                              null |          partition end |        null
+
+(3 rows)
+```
+
+## Row Marker
+
+The row marker is an object that is part of the row.
+It stores no data, but it has a timestamp, can have a TTL, and interacts with tombstones just like data does.
+Row markers are considered when determining whether a row is empty: a row that has no cells but has a live row marker is considered non-empty.
+Such rows will show up in `SELECT` statement results as empty rows, with only key columns having values.
+The row marker has a special interaction with the [shadowable tombstone](./tombstone.md#shadowable-tombstone).
+This special interaction will be examined in detail in the next section.
+
+Row markers are created by `INSERT` statements:
+```CQL
+INSERT INTO ks.tbl (pk, ck1, ck2, v1) VALUES ('row marker 1', 0, 0, 1);
+
+SELECT * FROM MUTATION_FRAGMENTS(ks.tbl) WHERE pk = 'row marker 1';
+
+ pk           | mutation_source | partition_region | ck1 | ck2 | position_weight | metadata                                                                                                                  | mutation_fragment_kind | value
+--------------+-----------------+------------------+-----+-----+-----------------+---------------------------------------------------------------------------------------------------------------------------+------------------------+------------
+ row marker 1 |      memtable:0 |                0 |     |     |                 |                                                                                                          {"tombstone":{}} |        partition start |       null
+ row marker 1 |      memtable:0 |                2 |   0 |   0 |               0 | {"marker":{"timestamp":1743060450523155},"columns":{"v1":{"is_live":true,"type":"regular","timestamp":1743060450523155}}} |         clustering row | {"v1":"1"}
+ row marker 1 |      memtable:0 |                3 |     |     |                 |                                                                                                                      null |          partition end |       null
+
+(3 rows)
+```
+
+Specifying only the key columns in the `INSERT` statement creates only a row marker, with no content for the row:
+```CQL
+INSERT INTO ks.tbl (pk, ck1, ck2) VALUES ('row marker 2', 0, 0);
+SELECT * FROM MUTATION_FRAGMENTS(ks.tbl) WHERE pk = 'row marker 2';
+
+ pk           | mutation_source | partition_region | ck1 | ck2 | position_weight | metadata                                               | mutation_fragment_kind | value
+--------------+-----------------+------------------+-----+-----+-----------------+--------------------------------------------------------+------------------------+-------
+ row marker 2 |      memtable:0 |                0 |     |     |                 |                                       {"tombstone":{}} |        partition start |  null
+ row marker 2 |      memtable:0 |                2 |   0 |   0 |               0 | {"marker":{"timestamp":1743060548534072},"columns":{}} |         clustering row |    {}
+ row marker 2 |      memtable:0 |                3 |     |     |                 |                                                   null |          partition end |  null
+
+(3 rows)
+```
+
+Such a row will appear as an empty row with only the key columns having values:
+```CQL
+SELECT * FROM ks.tbl WHERE pk = 'row marker 2';
+
+ pk           | ck1 | ck2 | v1   | v2
+--------------+-----+-----+------+------
+ row marker 2 |   0 |   0 | null | null
+
+(1 rows)
+```
+
+The `UPDATE` statement doesn't create row markers:
+```CQL
+UPDATE ks.tbl SET v1 = 1 WHERE pk = 'no row marker' AND ck1 = 0 AND ck2 = 0;
+SELECT * FROM MUTATION_FRAGMENTS(ks.tbl) WHERE pk = 'no row marker';
+
+ pk            | mutation_source | partition_region | ck1 | ck2 | position_weight | metadata                                                                          | mutation_fragment_kind | value
+---------------+-----------------+------------------+-----+-----+-----------------+-----------------------------------------------------------------------------------+------------------------+------------
+ no row marker |      memtable:0 |                0 |     |     |                 |                                                                  {"tombstone":{}} |        partition start |       null
+ no row marker |      memtable:0 |                2 |   0 |   0 |               0 | {"columns":{"v1":{"is_live":true,"type":"regular","timestamp":1743060161838151}}} |         clustering row | {"v1":"1"}
+ no row marker |      memtable:0 |                3 |     |     |                 |                                                                              null |          partition end |       null
+
+(3 rows)
+```
+
+Deleting a row with a row marker will remove the row marker completely:
+```CQL
+DELETE FROM ks.tbl WHERE pk = 'row marker 2' AND ck1 = 0 AND ck2 = 0;
+
+SELECT * FROM MUTATION_FRAGMENTS(ks.tbl) WHERE pk = 'row marker 2';
+
+ pk           | mutation_source | partition_region | ck1 | ck2 | position_weight | metadata                                                                                                                                                                                      | mutation_fragment_kind | value
+--------------+-----------------+------------------+-----+-----+-----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------+-------
+ row marker 2 |      memtable:0 |                0 |     |     |                 |                                                                                                                                                                              {"tombstone":{}} |        partition start |  null
+ row marker 2 |      memtable:0 |                2 |   0 |   0 |               0 | {"tombstone":{"timestamp":1743060872181113,"deletion_time":"2025-03-27 07:34:32z"},"shadowable_tombstone":{"timestamp":1743060872181113,"deletion_time":"2025-03-27 07:34:32z"},"columns":{}} |         clustering row |    {}
+ row marker 2 |      memtable:0 |                3 |     |     |                 |                                                                                                                                                                                          null |          partition end |  null
+
+(3 rows)
+
+SELECT * FROM ks.tbl WHERE pk = 'row marker 2';
+
+ pk | ck1 | ck2 | v1 | v2
+----+-----+-----+----+----
+
+(0 rows)
+```
+
+## Shadowable Tombstone
+
+The shadowable tombstone is a special tombstone that is used by [materialized views](https://opensource.docs.scylladb.com/stable/features/materialized-views.html).
+Just like the regular row tombstone, it applies to the content of the row, but it has a special interaction with the [row marker](./tombstone.md#row-marker).
+The shadowable tombstone can be covered by row markers.
+In other words: when a row has both a row marker and a shadowable tombstone and `row_marker.timestamp > shadowable_tombstone.timestamp`, the row marker covers the shadowable tombstone and the latter is dropped, just like data covered by a tombstone is dropped.
+
+Notice that in the [row tombstone](./tombstone.md#row-tombstone) and other examples, whenever the row has an active tombstone, it also has a `shadowable_tombstone` object.
+This is to keep things simple internally: the final row tombstone is always calculated as `row_tombstone + shadowable_tombstone`. Normally, the timestamps of the two match, because the shadowable tombstone is not set separately.
+
+Shadowable tombstones cannot be created directly via CQL; they are only created by materialized view updates.
+
+Given the following materialized view, created on the tombstone example table:
+```CQL
+CREATE MATERIALIZED VIEW ks.mv AS SELECT * FROM ks.tbl WHERE v1 IS NOT NULL AND ck1 IS NOT NULL AND ck2 IS NOT NULL PRIMARY KEY (pk, v1, ck1, ck2);
+```
+
+Inserting a row into the base table will create a corresponding entry in the materialized view:
+```CQL
+INSERT INTO ks.tbl (pk, ck1, ck2, v1) VALUES ('shadowable tombstone', 0, 0, 1);
+
+SELECT * FROM MUTATION_FRAGMENTS(ks.mv) WHERE pk = 'shadowable tombstone';
+
+ pk                   | mutation_source | partition_region | v1 | ck1 | ck2 | position_weight | metadata                                               | mutation_fragment_kind | value
+----------------------+-----------------+------------------+----+-----+-----+-----------------+--------------------------------------------------------+------------------------+-------
+ shadowable tombstone |      memtable:0 |                0 |    |     |     |                 |                                       {"tombstone":{}} |        partition start |  null
+ shadowable tombstone |      memtable:0 |                2 |  1 |   0 |   0 |               0 | {"marker":{"timestamp":1743061930471880},"columns":{}} |         clustering row |    {}
+ shadowable tombstone |      memtable:0 |                3 |    |     |     |                 |                                                   null |          partition end |  null
+
+(3 rows)
+```
+
+When changing the value of `v1` in the base table row, the materialized view has to first delete the old instance of the view row and then insert a new one.
+The old instance with the now non-existent `v1=1` key is deleted with a shadowable tombstone:
+```CQL
+INSERT INTO ks.tbl (pk, ck1, ck2, v1) VALUES ('shadowable tombstone', 0, 0, 2);
+
+SELECT * FROM MUTATION_FRAGMENTS(ks.mv) WHERE pk = 'shadowable tombstone';
+
+ pk                   | mutation_source | partition_region | v1 | ck1 | ck2 | position_weight | metadata                                                                                                                                                           | mutation_fragment_kind | value
+----------------------+-----------------+------------------+----+-----+-----+-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------+-------
+ shadowable tombstone |      memtable:0 |                0 |    |     |     |                 |                                                                                                                                                   {"tombstone":{}} |        partition start |  null
+ shadowable tombstone |      memtable:0 |                2 |  1 |   0 |   0 |               0 | {"tombstone":{},"shadowable_tombstone":{"timestamp":1743061930471880,"deletion_time":"2025-03-27 07:53:00z"},"marker":{"timestamp":1743061930471880},"columns":{}} |         clustering row |    {}
+ shadowable tombstone |      memtable:0 |                2 |  2 |   0 |   0 |               0 |                                                                                                             {"marker":{"timestamp":1743061980019472},"columns":{}} |         clustering row |    {}
+ shadowable tombstone |      memtable:0 |                3 |    |     |     |                 |                                                                                                                                                               null |          partition end |  null
+
+(4 rows)
+```
+
+Note how the regular row tombstone is empty (`"tombstone":{}`) and only the shadowable tombstone is active.
+
+Re-inserting `v1=1` into the base row will now delete the `v1=2` instance of the view row with a shadowable tombstone and insert a new `v1=1` row into the view. The new row has a fresh row marker, which eliminates the previously inserted shadowable tombstone:
+```CQL
+INSERT INTO ks.tbl (pk, ck1, ck2, v1) VALUES ('shadowable tombstone', 0, 0, 1);
+
+SELECT * FROM MUTATION_FRAGMENTS(ks.mv) WHERE pk = 'shadowable tombstone';
+
+ pk                   | mutation_source | partition_region | v1 | ck1 | ck2 | position_weight | metadata                                                                                                                                                           | mutation_fragment_kind | value
+----------------------+-----------------+------------------+----+-----+-----+-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------+-------
+ shadowable tombstone |      memtable:0 |                0 |    |     |     |                 |                                                                                                                                                   {"tombstone":{}} |        partition start |  null
+ shadowable tombstone |      memtable:0 |                2 |  1 |   0 |   0 |               0 |                                                                                                             {"marker":{"timestamp":1743062162754870},"columns":{}} |         clustering row |    {}
+ shadowable tombstone |      memtable:0 |                2 |  2 |   0 |   0 |               0 | {"tombstone":{},"shadowable_tombstone":{"timestamp":1743061980019472,"deletion_time":"2025-03-27 07:56:02z"},"marker":{"timestamp":1743061980019472},"columns":{}} |         clustering row |    {}
+ shadowable tombstone |      memtable:0 |                3 |    |     |     |                 |                                                                                                                                                               null |          partition end |  null
+
+(4 rows)
+```
+
+This special interaction of the shadowable tombstone and the row marker is important because view rows use the base row's timestamps.
+As a result, a row newly inserted into the view can carry old timestamps.
+A regular row tombstone left around from a previous key change could accidentally delete content from this newly inserted row.
+By having the row marker override the shadowable tombstone, we have a way to "undo" past deletions and "resurrect" old instances of a view row.
+
+## Tombstone Garbage Collection
+
+Tombstones are only valuable as long as there exists data that they cover.
+After compaction has merged all instances of the covered data with the tombstone, the tombstone stops being useful.
+It is now garbage and ScyllaDB wants to get rid of it. This is called tombstone garbage collection (tombstone GC for short) or tombstone purging.
+Garbage collecting a tombstone can cause data resurrection if not done properly:
+* If data that the tombstone covers still exists on the local replica, this data will be resurrected if the tombstone is garbage collected before it could take effect on that data.
+* Some other replica in the cluster may have missed the delete. If the tombstone is garbage collected before a repair could propagate it to such replicas, these replicas will resurrect the data.
+
+Data resurrection means that some data that was deleted by the user reappears later, as if it had never been deleted.
+Data resurrection is considered a serious failure of the database, as serious as data loss.
+
+Based on the two different ways improper tombstone garbage collection can cause data resurrection, there are two conditions that have to be fulfilled for a tombstone to become eligible for garbage collection:
+1) The database has to make sure the tombstone was compacted together with **all** data it covers and that all such data was consequently dropped. In other words, there is no data entity left such that `entity.timestamp <= tombstone.timestamp`.
+2) The tombstone was replicated to **all** replicas for the partition it is part of. In other words, in a table with RF=N, all N replicas have seen this tombstone and applied it to their data.
+
+Once these two conditions are met, the tombstone is eligible for garbage collection and can be purged (dropped).
+
+### Overlap Check
+
+To ensure condition (1), compaction does so-called "overlap checks" for every tombstone. The overlap check is done per partition: the result of the overlap check for a given partition is applied to all tombstones in this partition.
+An overlap check consists of checking every other data source and getting the smallest timestamp of live data each of them has for the checked partition.
+Once this `min_timestamp` is obtained from all other data sources, the smallest of these is compared against the timestamp of each tombstone in this partition.
+A tombstone is considered to have passed the overlap check if `min_timestamp > tombstone.timestamp`.
+In other words, we know that **all** data in **all** other data sources is not covered by this tombstone, because it has a higher timestamp.
+
+In ScyllaDB the following data sources exist:
+* SSTable(s)
+* Row Cache
+* Memtable(s)
+* Commitlog
+
+When compacting in any of the data sources, overlap checks have to be done against all other data sources.
+Furthermore, when compacting an SSTable, overlap checks have to be done against other SSTables that don't participate in the current compaction.
+There is one exception here: the row cache represents the content of the SSTables, so when compacting the cache, no overlap checks are required against SSTables.
+
+### Expiry
+
+To ensure condition (2), ScyllaDB uses the `deletion_time` field of tombstones, which is a wall-clock time of when the tombstone was written.
+This deletion time is used to check whether enough time has passed for this tombstone to be considered safe to garbage collect, or in other words, whether it has "expired".
+For tombstones which are created from TTL'd data, the `deletion_time` is the time the TTL'd data was written, not the time when the data expired.
+This is because the TTL'd data can start to be replicated in the cluster as soon as it is written, while it is still in its live TTL'd form.
+
+There are different methods to check whether a tombstone is "old enough" to be considered expired.
+
+The most advanced method involves keeping a record of repairs.
+When a tombstone is considered for garbage collection, the time of the last repair is obtained from this record.
+This happens on a partition granularity.
+This last repair time is then used to determine whether the tombstone is eligible for garbage collection: if the tombstone was written *before* the last repair, it is considered eligible for garbage collection.
+
+A much simpler method, the one used historically by both Apache Cassandra and ScyllaDB, is a timeout-based one.
+Each table has a `gc_grace_seconds` table attribute, which is a fixed time window, during which the user is expected to perform at least one full repair.
+For example, with `gc_grace_seconds=864000` (10 days), it is assumed that the user repaired their cluster thoroughly at least once in that window, so that any tombstone that is 10 days old is guaranteed to have been propagated to all replicas already.
+In this system, a tombstone's eligibility for garbage collection is determined simply by checking that it is at least `gc_grace_seconds` old. If so, it is considered expired.
+
+See [Tombstones GC options](https://opensource.docs.scylladb.com/stable/cql#tombstones-gc-options) for more details on different modes of expiry.