When running those operations after a tablet replica is migrated away from
a shard, an assert can fail resulting in a crash.
Status quo (around the assert in truncate procedure):
1) Highest RP seen by table is saved in low_mark, and the current time in
low_mark_at.
2) Then compaction is disabled in order to not mix data written before truncate,
and data written later.
3) Then memtable is flushed in order for the data written before truncate to be
available in sstables and then removed.
4) Now, current time is saved in truncated_at, which is supposedly the time of
truncate to decide which sstables to remove.
Note: truncated_at is likely above low_mark_at due to steps 2 and 3.
The interesting part of the assert is:
(truncated_at <= low_mark_at ? rp <= low_mark : low_mark <= rp)
Note: RP in the assert above is the highest RP among all sstables generated
before truncated_at. RP is retrieved by table::discard_sstables().
If truncated_at > low_mark_at, maybe newer data was written during steps 2 and
3, and memtable's RP becomes greater than low_mark, resulting in a SSTable with
RP > low_mark.
So assert's 2nd condition is there to defend against the scenario above.
truncated_at and low_mark_at uses millisecond granularity, so even if
truncated_at == low_mark_at, data could have been written in steps 2 and 3
(during same MS window), failing the assert. This is fragile.
Reproducer:
To reproduce the problem, truncated_at must be > low_mark_at, which can easily
happen with both drop table and truncate due to steps 2 and 3.
If a shard has 2 or more tablets, the table's highest RP refer to just one
tablet in that shard.
If the tablet with the highest RP is migrated away, then the sstables in that
shard will have lower RP than the recorded highest RP (it's a table wide state,
which makes sense since CL is shared among tablets).
So when either drop table or truncate runs, low_mark will be potentially bigger
than highest RP retrieved from sstables.
Proposed solution:
The current assert is hacked to not fail if writes sneak in, during steps 2 and
3, but it's still fragile and seems not to serve its real purpose, since it's
allowing for RP > low_mark.
We should be able to say that low_mark >= RP, as a way of asserting we're not
leaving data targeted by truncate behind (or that we're not removing the wrong
data).
But the problem is that we're saving low_mark in step 1, before preparation
steps (2 and 3). When truncated_at is recorded in step 4, it's a way of saying
all data written so far is targeted for removal. But as of today, low_mark
refers to all data written up to step 1. So low_mark is now only one set
before issuing flush, and also accounts for all potentially flushed data.
Fixes #18059.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#23560
(cherry picked from commit 0f59deffaa)
(cherry picked from commit 7554d4bbe09967f9b7a55575b5dfdde4f6616862)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#23649
Scylla in-source tests.
For details on how to run the tests, see docs/dev/testing.md
Shared C++ utils, libraries are in lib/, for Python - pylib/
alternator - Python tests which connect to a single server and use the DynamoDB API unit, boost, raft - unit tests in C++ cqlpy - Python tests which connect to a single server and use CQL topology* - tests that set up clusters and add/remove nodes cql - approval tests that use CQL and pre-recorded output rest_api - tests for Scylla REST API Port 9000 scylla-gdb - tests for scylla-gdb.py helper script nodetool - tests for C++ implementation of nodetool
If you can use an existing folder, consider adding your test to it. New folders should be used for new large categories/subsystems, or when the test environment is significantly different from some existing suite, e.g. you plan to start scylladb with different configuration, and you intend to add many tests and would like them to reuse an existing Scylla cluster (clusters can be reused for tests within the same folder).
To add a new folder, create a new directory, and then
copy & edit its suite.ini.