replica: Fix truncate and drop table after tablet migration happens

When running these operations after a tablet replica has been migrated away from
a shard, an assert can fail, resulting in a crash.

Status quo (around the assert in the truncate procedure):

1) The highest RP seen by the table is saved in low_mark, and the current time in
low_mark_at.
2) Compaction is then disabled so that data written before the truncate does not
get mixed with data written after it.
3) The memtable is then flushed so that data written before the truncate is
available in sstables and can then be removed.
4) Finally, the current time is saved in truncated_at, which is taken as the
truncate time used to decide which sstables to remove.
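
A rough sketch of that sequence, in pseudo-code (the helper names below are
placeholders for illustration, not the actual ScyllaDB functions):

    // step 1: record the table-wide highest replay position and the current time
    db::replay_position low_mark = highest_rp_seen_by_table();  // placeholder helper
    auto low_mark_at = db_clock::now();
    // step 2: keep compaction from mixing pre- and post-truncate data
    co_await disable_compaction_for_truncate();                  // placeholder helper
    // step 3: flush so that pre-truncate data ends up in sstables
    co_await flush_memtable();                                   // placeholder helper
    // step 4: everything written up to this point is targeted for removal
    auto truncated_at = db_clock::now();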

Note: truncated_at is likely above low_mark_at due to steps 2 and 3.

The interesting part of the assert is:
    (truncated_at <= low_mark_at ? rp <= low_mark : low_mark <= rp)

Note: RP in the assert above is the highest RP among all sstables generated
before truncated_at. RP is retrieved by table::discard_sstables().

If truncated_at > low_mark_at, newer data may have been written during steps 2
and 3, pushing the memtable's RP above low_mark and resulting in an sstable with
RP > low_mark.
The assert's second condition is there to defend against that scenario.

truncated_at and low_mark_at use millisecond granularity, so even if
truncated_at == low_mark_at, data could have been written in steps 2 and 3
(within the same millisecond window), failing the assert. This is fragile.
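
For illustration, with made-up RP values and millisecond timestamps:

    step 1:    low_mark = 100, low_mark_at = 12:00:00.123
    steps 2-3: a write lands within the same millisecond and bumps the memtable RP to 101
    step 4:    truncated_at = 12:00:00.123, i.e. truncated_at == low_mark_at
    assert:    first branch is taken, rp <= low_mark becomes 101 <= 100, which fails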

Reproducer:

To reproduce the problem, truncated_at must be > low_mark_at, which can easily
happen with both drop table and truncate due to steps 2 and 3.

If a shard has two or more tablets, the table's highest RP refers to just one of
the tablets in that shard.
If the tablet holding the highest RP is migrated away, the sstables remaining in
that shard will have a lower RP than the recorded highest RP (the highest RP is
table-wide state, which makes sense since the commitlog is shared among tablets).

So when either drop table or truncate runs, low_mark can end up greater than the
highest RP retrieved from the sstables.
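
For illustration, with made-up RP values (shard 0 owns tablets A and B, which
share the commitlog and therefore the RP space):

    tablet A highest RP = 100, tablet B highest RP = 60
    table-wide highest RP = 100, so low_mark = 100 is recorded at step 1
    tablet A is migrated away; the sstables left on shard 0 only hold tablet B
    data, so table::discard_sstables() finds rp = 60
    assert: truncated_at > low_mark_at, so low_mark <= rp becomes 100 <= 60, which fails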

Proposed solution:

The current assert is hacked so that it does not fail if writes sneak in during
steps 2 and 3, but it is still fragile and does not really serve its purpose,
since it allows RP > low_mark.

We should be able to say that low_mark >= RP, as a way of asserting that we are
not leaving behind data targeted by the truncate (or that we are not removing
the wrong data).

But the problem is that we are saving low_mark in step 1, before the preparation
steps (2 and 3). Recording truncated_at in step 4 is a way of saying that all
data written so far is targeted for removal, yet as of today low_mark only
covers data written up to step 1. With this change, low_mark is set only right
before issuing the flush, so it also accounts for all data that can potentially
be flushed.
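
A rough sketch of the revised ordering described above (same pseudo-code
conventions and placeholder helper names as in the earlier sketch):

    co_await disable_compaction_for_truncate();                  // step 2
    // low_mark is now taken right before the flush, so it covers every write
    // that the flush below can possibly persist
    db::replay_position low_mark = highest_rp_seen_by_table();
    co_await flush_memtable();                                   // step 3
    auto truncated_at = db_clock::now();                         // step 4
    // with this ordering we can require rp <= low_mark for the sstables being discarded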

Fixes #18059.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#23560

@@ -9,7 +9,7 @@ from test.pylib.internal_types import HostID, ServerInfo, ServerNum
from test.pylib.manager_client import ManagerClient
from test.pylib.rest_client import inject_error_one_shot, HTTPError, read_barrier
from test.pylib.util import wait_for_cql_and_get_hosts, unique_name
-from test.pylib.tablets import get_tablet_replica, get_all_tablet_replicas
+from test.pylib.tablets import get_tablet_replica, get_all_tablet_replicas, TabletReplicas
from test.cluster.conftest import skip_mode
from test.cluster.util import reconnect_driver, create_new_test_keyspace, new_test_keyspace
@@ -1835,3 +1835,48 @@ async def test_tablet_cleanup_vs_snapshot_race(manager: ManagerClient):
    await s0_log.wait_for('Cleanup failed for tablet', from_mark=s0_mark)
    await manager.api.take_snapshot(servers[0].ip_addr, ks, "test_snapshot")
# Reproduces assert failure when truncating table, either triggered by DROP TABLE or TRUNCATE.
# See: https://github.com/scylladb/scylladb/issues/18059
# It's achieved by migrating a tablet away that contains the highest replay position of a shard,
# so when drop/truncate happens, the highest replay position will be greater than all the data
# found in the table (includes data in memtable).
@pytest.mark.asyncio
@pytest.mark.parametrize("operation", ['DROP TABLE', 'TRUNCATE'])
@skip_mode('release', 'error injections are not supported in release mode')
async def test_drop_table_and_truncate_after_migration(manager: ManagerClient, operation):
    cmdline = [ '--smp=2' ]
    cfg = { 'auto_snapshot': True }
    servers = [await manager.server_add(cmdline=cmdline, config=cfg)]
    await manager.api.disable_tablet_balancing(servers[0].ip_addr)
    cql = manager.get_cql()
    ks = await create_new_test_keyspace(cql, f"WITH replication = {{'class': 'NetworkTopologyStrategy', 'replication_factor': 1}} AND TABLETS = {{'initial': 4}}")
    await cql.run_async(f"CREATE TABLE {ks}.test (pk int PRIMARY KEY, c int);")
    await manager.api.disable_autocompaction(servers[0].ip_addr, ks)
    keys = range(100)
    await asyncio.gather(*[cql.run_async(f"INSERT INTO {ks}.test (pk, c) VALUES ({k}, {k});") for k in keys])
    tablet_replicas = await get_all_tablet_replicas(manager, servers[0], ks, 'test')
    tablet_replicas_in_s0 = list[TabletReplicas]()
    for replica in tablet_replicas:
        if replica.replicas[0][1] == 0:
            tablet_replicas_in_s0.append(replica)
    assert len(tablet_replicas_in_s0) == 2
    target_tablet = tablet_replicas_in_s0[0]
    s0_host_id = await manager.get_host_id(servers[0].server_id)
    logger.info("Migrating 1st tablet to shard 1")
    await manager.api.move_tablet(servers[0].ip_addr, ks, "test", *(s0_host_id, 0), *(s0_host_id, 1), target_tablet.last_token)
    await manager.api.enable_injection(servers[0].ip_addr, "truncate_disable_compaction_delay", one_shot=True)
    logger.info(f"Running {operation} {ks}.test")
    await cql.run_async(f"{operation} {ks}.test")