This fixes the problem of equal_continuity() being prone to false
positives due to redundant information (extra dummy rows) present in
one of the partitions. get_continuity() is minified, so is not prone
to this.
Will make it easy to represent and manipulate continuity in tests.
Could also replace clustering_row_ranges in the future, which is
currently a naked vector<> with no semantic methods.
"Fixes cache reader to not skip over data in some cases involving overlapping
range tombstones in different partition versions and discontinuous cache.
Introduced in 2.0
Fixes #3053."
* tag 'tgrabiec/fix-range-tombstone-slicing-v2' of github.com:scylladb/seastar-dev:
tests: row_cache: Add reproducer for issue #3053
tests: mvcc: Add test for partition_snapshot::range_tombstones()
mvcc: Optimize partition_snapshot::range_tombstones() for single version case
mvcc: Fix partition_snapshot::range_tombstones()
tests: random_mutation_generator: Do not emit dummy entries at clustering row positions
The issue is that partition_snapshot::range_tombstones() is
deoverlapping tombstones coming from different versions, and it may
happen that due to range tombstone splitting that function will return
a tombstone which starts after the requested range. This breaks
assumptions made by the cache reader. It keeps track of the maximum
fragment position, and if cache reader will then need to read from
sstables due to a miss, it would do so starting from the position
marked by that out of range tombstone, possibly skipping over some
rows.
partition_snapshot::range_tombstones() is deoverlapping tombstones
coming from different versions and it may happen that due to range
tombstone splitting the method will return a tombstone which starts
after the requested range. This would cause it to return a tombstone
which doesn't overlap with the requested range.
This breaks assumptions made by cache reader. It keeps track of the
maximum fragment position, and if cache reader will then need to read
from sstables due to a miss, it would do so starting from the position
marked by that out of range tombstone, possibly skipping over some
rows.
Exposed by a change in row_cache_test.cc::test_mvcc() which fills the
buffer of sm5 reader after it is created.
Fixes#3053.
It is assumed that dummy entries are only at !is_clustering_row() positions.
Causes cache_streamed_mutation to assert when trying to trim a range tombstone.
"Didn't affect any release. Regression introduced in 301358e.
Fixes#3041"
* 'resharding_fix_v4' of github.com:raphaelsc/scylla:
tests: add sstable resharding test to test.py
tests: fix sstable resharding test
sstables: Fix resharding by not filtering out mutation that belongs to other shard
db: introduce make_range_sstable_reader
rename make_range_sstable_reader to make_local_shard_sstable_reader
db: extract sstable reader creation from incremental_reader_selector
db: reuse make_range_sstable_reader in make_sstable_reader
"In time-series, it's common for tables in a given time window to be eventually
fully expired. The deletion of such tables is done by compaction, but there's
*no* need to *actually* compact such fully expired sstables *iff* their full
deletion will not cause older data to be ressurected. In other words, a fully
expired table can be actually skipped (but deleted in the end) by compaction
*iff* it doesn't contain newer data than its overlapping counterparts. So there
may be false negatives, but never false positives.
All that said, the goal behind this patchset is to save read bandwidth of disk
in such scenarios. Given that fully expired sstables will not be read by
compaction process anymore, read amplification will be greatly reduced too.
Fixes #2620."
* 'time_series_performance_improvement_v2_2' of github.com:raphaelsc/scylla:
tests: check sstable auto correct bad max deletion time
tests: add test for compaction with fully expired table
sstables/compaction: do not actually compact fully expired sstables
sstables: make sstable auto correct max_local_deletion_time
sstables: switch to const ref wherever possible
sstables: use gc_clock::time_point for gc_before
gc_clock: introduce operator<<(ostream&, gc_clock::time_point)
sstables: introduce sstable::get_max_local_deletion_time
sstables: remove unnecessary copy in time series strategies
sstables: change return value type of get_fully_expired_sstables
dtcs: make code to extract non expired tables faster
sstables: add has_correct_max_deletion_time to sstable
"Soon we will have resources beyond just keyspaces and table names. There
will be resources for roles, for user-defined functions (UDFs), and
possible resources for REST end-points. This change generalizes the
implementation of a `data_resource` to many different kinds of
resources, though there is still only one kind (`data`).
The most important patch is 2/5 ("auth/resource: Generalize to different
kinds"), which re-writes `auth::data_resource`. The patch message should
sufficiently explain the design decisions involved.
The other patches rename files and identifiers based on the expanded
role of this class, except for 5/5 ("auth/resource.hh: Rename
`resource_ids`"): this patch gives a more appropriate name to a type
alias.
Fixes #3027."
* 'jhk/generalize_resource/v3' of https://github.com/hakuch/scylla:
auth/resource.hh: Rename `resource_ids`
auth: Rename `data_resource` files
cql3/authorization_statement: Fix typo
auth/resource: Generalize to different kinds
auth: Rename `data_resource` to `resource`
wrong sstable was used when checking for content, and storage service
for test was missing.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
After 301358e, sstable resharding stopped work because shared sstables would
use a filtering reader, which excludes mutation that belong to other shards.
That completely breaks which relies on compaction of mutations that belong
to different shards. The fix is about using recently introduced non local
shard reader.
Fixes#3041.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
introduce reader variant that will allow its caller to read a range
in a given table without any filter applied.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Tomek says:
"I think that the least surprising behavior for a function named like this
is to read the sstables unfiltered (it just reads them), and the filtering
should be indicated specially in the name or by accepting a parameter."
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
There's no need to actually compact a sstable which is fully expired
and which deletion of all its data will not ressurect older data.
For that, a sstable will only be considered fully expired if it
doesn't contain data newer than its overlapping counterparts.
That way, there could be a false negative, but never a false positive.
Currently, a fully expired sstable would unnecessarily waste read
bandwidth of disk. This will help a lot time series workloads in
which data for a given time window is all deleted at once using TTL.
Fixes#2620.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
sstables created prior to cc6c383 can contain bad max deletion time stat,
which would make get_fully_expired_sstables return sstables that aren't
actually fully expired. Let's make sstable invalidate the stat if it
is potentially incorrect.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
unordered_set will allow us to quickly extract fully expired tables
from a set of compacting sstables.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
since it's O(n) and not O(n log n).
change also needed for change in interface of function to retrieve
fully expired tables, or sort lambda would need to be parametrized.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Commit cc6c38324 fixes the stat. It was only updated for range
tombstone prior to fix, so a sstable that had a regular cell with
no expiration time could be considered fully expired which can
lead to bad decisions in compaction for time series workloads.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>