mirror of https://github.com/scylladb/scylladb.git, synced 2026-05-02 06:05:53 +00:00
replica: Fix truncate and drop table after tablet migration happens
When running those operations after a tablet replica is migrated away from
a shard, an assert can fail, resulting in a crash.
Status quo (around the assert in the truncate procedure):
1) Highest RP seen by table is saved in low_mark, and the current time in
low_mark_at.
2) Then compaction is disabled, so as not to mix data written before the
truncate with data written later.
3) Then the memtable is flushed, so that data written before the truncate
becomes available in sstables and can then be removed.
4) Now the current time is saved in truncated_at, which is supposedly the time
of the truncate, used to decide which sstables to remove.
Note: truncated_at is likely above low_mark_at due to steps 2 and 3.
The interesting part of the assert is:
(truncated_at <= low_mark_at ? rp <= low_mark : low_mark <= rp)
Note: RP in the assert above is the highest RP among all sstables generated
before truncated_at. RP is retrieved by table::discard_sstables().
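The steps and the assert above can be reduced to a toy model (hypothetical names, integer RPs, and a logical clock standing in for wall time; not the actual Scylla code):

```python
import itertools

clock = itertools.count()  # logical clock standing in for wall time

def truncate(table):
    low_mark = table["highest_rp"]   # step 1: highest RP seen by the table
    low_mark_at = next(clock)        # ... and the current time
    # steps 2 and 3: disable compaction, then flush the memtable to an sstable
    table["sstable_rps"].append(table["memtable_rp"])
    truncated_at = next(clock)       # step 4: time of truncate
    rp = max(table["sstable_rps"])   # highest RP among discarded sstables
    # the assert from the commit message, translated to Python
    assert (rp <= low_mark) if truncated_at <= low_mark_at else (low_mark <= rp)
    return rp, low_mark

# Healthy case: the memtable carries the table's highest RP, so the flushed
# sstable tops out exactly at low_mark and the assert holds.
rp, low_mark = truncate({"highest_rp": 10, "memtable_rp": 10, "sstable_rps": [5, 7]})
assert rp == low_mark == 10
```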
If truncated_at > low_mark_at, newer data may have been written during steps 2
and 3, and the memtable's RP becomes greater than low_mark, resulting in an
sstable with RP > low_mark.
So the assert's second condition is there to defend against the scenario above.
truncated_at and low_mark_at use millisecond granularity, so even if
truncated_at == low_mark_at, data could have been written during steps 2 and 3
(within the same millisecond window), failing the assert. This is fragile.
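A sketch of that fragile window, with made-up millisecond timestamps and integer RPs:

```python
# Hypothetical values: a write sneaks in during steps 2 and 3, within the
# same millisecond window, so truncated_at == low_mark_at.
low_mark, low_mark_at = 10, 1700
rp, truncated_at = 11, 1700      # rp bumped past low_mark by the late write

# The assert takes its first branch (truncated_at <= low_mark_at) and
# demands rp <= low_mark, i.e. 11 <= 10, which does not hold.
condition = (rp <= low_mark) if truncated_at <= low_mark_at else (low_mark <= rp)
assert condition is False
```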
Reproducer:
To reproduce the problem, truncated_at must be > low_mark_at, which can easily
happen with both drop table and truncate due to steps 2 and 3.
If a shard has 2 or more tablets, the table's highest RP may refer to just one
tablet on that shard.
If the tablet with the highest RP is migrated away, then the sstables on that
shard will have a lower RP than the recorded highest RP (it's table-wide state,
which makes sense since the commitlog is shared among tablets).
So when either drop table or truncate runs, low_mark will potentially be bigger
than the highest RP retrieved from the sstables.
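The reproducer scenario can be sketched like this (toy integer RPs, hypothetical values):

```python
# Shard 0 held two tablets; the tablet carrying the table-wide highest RP
# (10) was migrated away, leaving only the other tablet's sstables behind.
table_highest_rp = 10            # table-wide state, unaffected by migration
local_sstable_rps = [5, 7]       # sstables of the tablet left on shard 0

low_mark = table_highest_rp      # step 1
rp = max(local_sstable_rps)      # highest RP among sstables being discarded

# Steps 2 and 3 take time, so truncated_at > low_mark_at and the assert
# requires low_mark <= rp, i.e. 10 <= 7, which fails and crashes.
condition = low_mark <= rp
assert condition is False
```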
Proposed solution:
The current assert is hacked not to fail if writes sneak in during steps 2 and
3, but it's still fragile and doesn't seem to serve its real purpose, since it
allows RP > low_mark.
We should be able to say that low_mark >= RP, as a way of asserting we're not
leaving data targeted by truncate behind (or that we're not removing the wrong
data).
But the problem is that we save low_mark in step 1, before the preparation
steps (2 and 3). When truncated_at is recorded in step 4, it's a way of saying
that all data written so far is targeted for removal. But as of today, low_mark
refers to all data written up to step 1. So low_mark is now only set right
before issuing the flush, and thus also accounts for all potentially flushed data.
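Under that reading, a minimal sketch of the fixed ordering (hypothetical names and toy integer RPs, not the actual implementation):

```python
def disable_compaction(table):
    table["compaction_enabled"] = False

def flush_memtable(table):
    # flushing moves the memtable's data into an sstable and keeps the
    # table's highest RP up to date
    table["sstable_rps"].append(table["memtable_rp"])
    table["highest_rp"] = max(table["highest_rp"], table["memtable_rp"])

def truncate_fixed(table):
    disable_compaction(table)            # step 2
    flush_memtable(table)                # step 3
    low_mark = table["highest_rp"]       # recorded only after the flush
    rp = max(table["sstable_rps"])       # highest RP among discarded sstables
    assert rp <= low_mark                # simple invariant, robust to migration
    return low_mark, rp

# Migrated-tablet scenario: the table-wide highest RP (10) belonged to the
# migrated tablet; local sstables and memtable all have lower RPs, and the
# assert still holds.
low_mark, rp = truncate_fixed({"highest_rp": 10, "memtable_rp": 8, "sstable_rps": [5, 7]})
assert (low_mark, rp) == (10, 8)
```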
Fixes #18059.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#23560
committed by Botond Dénes
parent 8d2a41db82
commit 0f59deffaa
@@ -9,7 +9,7 @@ from test.pylib.internal_types import HostID, ServerInfo, ServerNum
 from test.pylib.manager_client import ManagerClient
 from test.pylib.rest_client import inject_error_one_shot, HTTPError, read_barrier
 from test.pylib.util import wait_for_cql_and_get_hosts, unique_name
-from test.pylib.tablets import get_tablet_replica, get_all_tablet_replicas
+from test.pylib.tablets import get_tablet_replica, get_all_tablet_replicas, TabletReplicas
 from test.cluster.conftest import skip_mode
 from test.cluster.util import reconnect_driver, create_new_test_keyspace, new_test_keyspace
 
@@ -1835,3 +1835,48 @@ async def test_tablet_cleanup_vs_snapshot_race(manager: ManagerClient):
     await s0_log.wait_for('Cleanup failed for tablet', from_mark=s0_mark)
 
     await manager.api.take_snapshot(servers[0].ip_addr, ks, "test_snapshot")
+
+# Reproduces assert failure when truncating table, either triggered by DROP TABLE or TRUNCATE.
+# See: https://github.com/scylladb/scylladb/issues/18059
+# It's achieved by migrating a tablet away that contains the highest replay position of a shard,
+# so when drop/truncate happens, the highest replay position will be greater than all the data
+# found in the table (includes data in memtable).
+@pytest.mark.asyncio
+@pytest.mark.parametrize("operation", ['DROP TABLE', 'TRUNCATE'])
+@skip_mode('release', 'error injections are not supported in release mode')
+async def test_drop_table_and_truncate_after_migration(manager: ManagerClient, operation):
+    cmdline = [ '--smp=2' ]
+    cfg = { 'auto_snapshot': True }
+    servers = [await manager.server_add(cmdline=cmdline, config=cfg)]
+
+    await manager.api.disable_tablet_balancing(servers[0].ip_addr)
+
+    cql = manager.get_cql()
+    ks = await create_new_test_keyspace(cql, f"WITH replication = {{'class': 'NetworkTopologyStrategy', 'replication_factor': 1}} AND TABLETS = {{'initial': 4}}")
+    await cql.run_async(f"CREATE TABLE {ks}.test (pk int PRIMARY KEY, c int);")
+
+    await manager.api.disable_autocompaction(servers[0].ip_addr, ks)
+
+    keys = range(100)
+    await asyncio.gather(*[cql.run_async(f"INSERT INTO {ks}.test (pk, c) VALUES ({k}, {k});") for k in keys])
+
+    tablet_replicas = await get_all_tablet_replicas(manager, servers[0], ks, 'test')
+    tablet_replicas_in_s0 = list[TabletReplicas]()
+
+    for replica in tablet_replicas:
+        if replica.replicas[0][1] == 0:
+            tablet_replicas_in_s0.append(replica)
+
+    assert len(tablet_replicas_in_s0) == 2
+
+    target_tablet = tablet_replicas_in_s0[0]
+
+    s0_host_id = await manager.get_host_id(servers[0].server_id)
+
+    logger.info("Migrating 1st tablet to shard 1")
+    await manager.api.move_tablet(servers[0].ip_addr, ks, "test", *(s0_host_id, 0), *(s0_host_id, 1), target_tablet.last_token)
+
+    await manager.api.enable_injection(servers[0].ip_addr, "truncate_disable_compaction_delay", one_shot=True)
+
+    logger.info(f"Running {operation} {ks}.test")
+    await cql.run_async(f"{operation} {ks}.test")