Files
scylladb/sstables/sstable_mutation_reader.cc
Tomasz Grabiec a29501ed67 sstables: Reduce amount of I/O for clustering-key-bounded reads from large partitions
Single-row reads from a large partition issue 64 KiB reads to the data file,
which equals the default span of a promoted index block in the data file.
Even if users reduce the promoted index block span (e.g. by lowering
`column_index_size_in_kb`) to speed up single-row reads, this won't be effective.
The reason is that the reader uses the promoted index only to look up the
start position of the read in the data file; the end position will in
practice extend to the next partition, so the amount of I/O is determined
by the underlying file input stream implementation and its read-ahead
heuristics. By default, that results in at least 2 I/Os of 32 KiB each.

There is already infrastructure to look up the end position based on the
upper bound of the read, but it is not effective because it is a
non-populating lookup, and the upper bound cursor has its own private
cached_promoted_index, which is cold when positions are computed. It is
non-populating on purpose, to avoid extra index file I/O just to read the
upper bound. In case the upper bound is far enough from the lower bound,
that extra I/O would only increase the cost of the read.

The solution employed here is to warm up the lower bound cursor's
cache before positions are computed, and to use that cursor for the
non-populating lookup of the upper bound.
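
The distinction between a populating and a non-populating lookup can be sketched as follows (a minimal model; the names `index_block`, `warm_up`, and `nonpopulating_end_offset` are illustrative, not Scylla's API):

```cpp
#include <cstdint>
#include <map>
#include <optional>

// A promoted index block, keyed by the first position it covers.
struct index_block {
    uint64_t first_key;   // stand-in for the first clustering position in the block
    uint64_t data_offset; // start of the block's range in the data file
};

struct promoted_index_cursor {
    // cached_promoted_index: blocks already read from the index file.
    std::map<uint64_t, index_block> cache;

    // Populating lookup: may perform index file I/O (simulated here by
    // inserting a block) while advancing to the lower bound.
    void warm_up(const index_block& b) {
        cache.emplace(b.first_key, b);
    }

    // Non-populating lookup: consults only blocks already in the cache
    // and never touches the index file. Returns the data file offset of
    // the first cached block at or after `key`, if any.
    std::optional<uint64_t> nonpopulating_end_offset(uint64_t key) const {
        auto it = cache.lower_bound(key);
        if (it == cache.end()) {
            return std::nullopt; // cold cache: caller falls back to end of partition
        }
        return it->second.data_offset;
    }
};
```

In this model, running the non-populating lookup against a separate, cold cursor always returns `std::nullopt`, so the read extends to the next partition; warming the same cursor first makes the relevant block available.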

We use the lower bound cursor and the slice's lower bound so that we
read the same blocks as the later lower-bound slicing would, and thus
incur no extra I/O in the cases where looking up the upper bound is not
worth it, i.e. when the upper bound is far from the lower bound. If the
upper bound is near the lower bound, warming up using the lower bound
will populate cached_promoted_index with blocks which allow us to
locate the upper bound block accurately. This is especially important
for single-row reads, where both bounds are around the same key. In
that case we want to read the data file range which belongs to a
single promoted index block. It doesn't matter that the upper bound
is not exactly the same as the lower bound: both will likely lie in the
same block, and if not, binary search will bring adjacent blocks into
cache. Even if the upper bound is not near, the binary search will
populate the cache with blocks which can be used to narrow down the
data file range somewhat.
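
The single-block case can be sketched like this (the flat layout and the function name are illustrative, not the actual promoted index format):

```cpp
#include <algorithm>
#include <cstdint>
#include <cstddef>
#include <utility>
#include <vector>

// Given the first key covered by each promoted index block (keys[i])
// and the data file offset where each block starts (block_starts[i]),
// both sorted, compute the data file range to read for a [lower, upper]
// pair of clustering positions.
std::pair<uint64_t, uint64_t> narrowed_range(
        const std::vector<uint64_t>& keys,
        const std::vector<uint64_t>& block_starts,
        uint64_t data_file_end,
        uint64_t lower, uint64_t upper) {
    // Block containing `lower`: the last block whose first key is <= lower.
    auto lb = std::upper_bound(keys.begin(), keys.end(), lower);
    std::size_t lower_idx = (lb == keys.begin()) ? 0 : (lb - keys.begin() - 1);
    // Block past `upper`: the first block whose first key is > upper.
    auto ub = std::upper_bound(keys.begin(), keys.end(), upper);
    std::size_t end_idx = ub - keys.begin();
    uint64_t end = (end_idx < block_starts.size()) ? block_starts[end_idx]
                                                   : data_file_end;
    return {block_starts[lower_idx], end};
}
```

When the bounds fall inside the same block, the returned range spans exactly that block's slice of the data file, which is the single-block read the commit message describes for single-row reads.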

Fixes #10030.

The change was tested with perf-fast-forward.

I populated the data set with `column_index_size_in_kb` set to 1:

  scylla perf-fast-forward --populate --run-tests=large-partition-slicing --column-index-size-in-kb=1

Test run:

  build/release/scylla perf-fast-forward --run-tests=large-partition-select-few-rows -c1 --keep-cache-across-test-cases --test-case-duration=0

This test reads two rows with consecutive keys from the middle of a
large partition (1M rows). The first read misses in the index file
page cache; the second read hits.

Notice that before the change, the second read issued 2 aio requests worth 64 KiB in total.
After the change, the second read issued 1 aio request of 2 KiB. It is 2 KiB rather than 1 KiB because the promoted index block is larger than 1 KiB.
I verified using logging that the data file range matches a single promoted index block.

Also, the first read which misses in cache is still faster after the change.

Before:

running: large-partition-select-few-rows on dataset large-part-ds1
Testing selecting few rows from a large partition:
stride  rows      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    allocs   tasks insns/f    cpu
500000  1         0.009802            1         1        102          0        102        102       21.0     21        196       2       1        0        1        1        0        0        0       568     269 4716050  53.4%
500001  1         0.000321            1         1       3113          0       3113       3113        2.0      2         64       1       0        1        0        0        0        0        0       116      26  555110  45.0%

After:

running: large-partition-select-few-rows on dataset large-part-ds1
Testing selecting few rows from a large partition:
stride  rows      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    allocs   tasks insns/f    cpu
500000  1         0.009609            1         1        104          0        104        104       20.0     20        137       2       1        0        1        1        0        0        0       561     268 4633407  43.1%
500001  1         0.000217            1         1       4602          0       4602       4602        1.0      1          2       1       0        1        0        0        0        0        0       110      26  313882  64.1%

(cherry picked from commit dfb339376aff1ed961b26c4759b1604f7df35e54)
2024-10-01 18:40:34 +02:00


/*
 * Copyright (C) 2021-present ScyllaDB
 */

/*
 * SPDX-License-Identifier: AGPL-3.0-or-later
 */

#include "sstable_mutation_reader.hh"

#include "column_translation.hh"
#include "concrete_types.hh"
#include "utils/fragment_range.hh"
#include "utils/to_string.hh"

#include <boost/range/algorithm/stable_partition.hpp>
#include <boost/intrusive/list.hpp>

namespace sstables {

class reader_tracker {
    using list_type = boost::intrusive::list<mp_row_consumer_reader_base,
        boost::intrusive::member_hook<mp_row_consumer_reader_base,
            mp_row_consumer_reader_base::tracker_link_type,
            &mp_row_consumer_reader_base::_tracker_link>,
        boost::intrusive::constant_time_size<false>>;
public:
    list_type _readers;

    void add(mp_row_consumer_reader_base& reader) {
        _readers.push_back(reader);
    }
};

thread_local reader_tracker _reader_tracker;

mp_row_consumer_reader_base::mp_row_consumer_reader_base(shared_sstable sst)
    : _sst(std::move(sst))
{
    _reader_tracker.add(*this);
}

atomic_cell make_counter_cell(api::timestamp_type timestamp, fragmented_temporary_buffer::view cell_value) {
    static constexpr size_t shard_size = 32;

    if (cell_value.empty()) {
        // This will never happen in a correct MC sstable but
        // we had a bug #4363 that caused empty counters
        // to be incorrectly stored inside sstables.
        counter_cell_builder ccb;
        return ccb.build(timestamp);
    }

    auto cell_value_size = cell_value.size_bytes();

    auto header_size = read_simple<int16_t>(cell_value);
    for (auto i = 0; i < header_size; i++) {
        auto idx = read_simple<int16_t>(cell_value);
        if (idx >= 0) {
            throw marshal_exception("encountered a local shard in a counter cell");
        }
    }
    auto header_length = (size_t(header_size) + 1) * sizeof(int16_t);
    auto shard_count = (cell_value_size - header_length) / shard_size;
    if (shard_count != size_t(header_size)) {
        throw marshal_exception("encountered remote shards in a counter cell");
    }

    counter_cell_builder ccb(shard_count);
    for (auto i = 0u; i < shard_count; i++) {
        auto id_hi = read_simple<int64_t>(cell_value);
        auto id_lo = read_simple<int64_t>(cell_value);
        auto clock = read_simple<int64_t>(cell_value);
        auto value = read_simple<int64_t>(cell_value);
        ccb.add_maybe_unsorted_shard(counter_shard(counter_id(utils::UUID(id_hi, id_lo)), value, clock));
    }
    ccb.sort_and_remove_duplicates();
    return ccb.build(timestamp);
}

// See #6130.
static data_type freeze_types_in_collections(data_type t) {
    return ::visit(*t, make_visitor(
        [] (const map_type_impl& typ) -> data_type {
            return map_type_impl::get_instance(
                    freeze_types_in_collections(typ.get_keys_type()->freeze()),
                    freeze_types_in_collections(typ.get_values_type()->freeze()),
                    typ.is_multi_cell());
        },
        [] (const set_type_impl& typ) -> data_type {
            return set_type_impl::get_instance(
                    freeze_types_in_collections(typ.get_elements_type()->freeze()),
                    typ.is_multi_cell());
        },
        [] (const list_type_impl& typ) -> data_type {
            return list_type_impl::get_instance(
                    freeze_types_in_collections(typ.get_elements_type()->freeze()),
                    typ.is_multi_cell());
        },
        [&] (const abstract_type& typ) -> data_type {
            return std::move(t);
        }
    ));
}

/* If this function returns false, the caller cannot assume that the SSTable comes from Scylla.
 * It might, if for some reason a table was created using Scylla that didn't contain any feature bit,
 * but that should never happen. */
static bool is_certainly_scylla_sstable(const sstable_enabled_features& features) {
    return features.enabled_features;
}

std::vector<column_translation::column_info> column_translation::state::build(
        const schema& s,
        const utils::chunked_vector<serialization_header::column_desc>& src,
        const sstable_enabled_features& features,
        bool is_static) {
    std::vector<column_info> cols;
    if (s.is_dense()) {
        const column_definition& col = is_static ? *s.static_begin() : *s.regular_begin();
        cols.push_back(column_info{
            &col.name(),
            col.type,
            col.id,
            col.type->value_length_if_fixed(),
            col.is_multi_cell(),
            col.is_counter(),
            false
        });
    } else {
        cols.reserve(src.size());
        for (auto&& desc : src) {
            const bytes& type_name = desc.type_name.value;
            data_type type = db::marshal::type_parser::parse(to_sstring_view(type_name));
            if (!features.is_enabled(CorrectUDTsInCollections) && is_certainly_scylla_sstable(features)) {
                // See #6130.
                type = freeze_types_in_collections(std::move(type));
            }
            const column_definition* def = s.get_column_definition(desc.name.value);
            std::optional<column_id> id;
            bool schema_mismatch = false;
            if (def) {
                id = def->id;
                schema_mismatch = def->is_multi_cell() != type->is_multi_cell() ||
                                  def->is_counter() != type->is_counter() ||
                                  !def->type->is_value_compatible_with(*type);
            }
            cols.push_back(column_info{
                &desc.name.value,
                type,
                id,
                type->value_length_if_fixed(),
                type->is_multi_cell(),
                type->is_counter(),
                schema_mismatch
            });
        }
        boost::range::stable_partition(cols, [](const column_info& column) { return !column.is_collection; });
    }
    return cols;
}

position_in_partition_view get_slice_upper_bound(const schema& s, const query::partition_slice& slice, dht::ring_position_view key) {
    const auto& ranges = slice.row_ranges(s, *key.key());
    if (ranges.empty()) {
        return position_in_partition_view::for_static_row();
    }
    if (slice.is_reversed()) {
        return position_in_partition_view::for_range_end(ranges.front());
    }
    return position_in_partition_view::for_range_end(ranges.back());
}

position_in_partition_view get_slice_lower_bound(const schema& s, const query::partition_slice& slice, dht::ring_position_view key) {
    const auto& ranges = slice.row_ranges(s, *key.key());
    if (ranges.empty()) {
        return position_in_partition_view::for_static_row();
    }
    if (slice.is_reversed()) {
        return position_in_partition_view::for_range_start(ranges.back());
    }
    return position_in_partition_view::for_range_start(ranges.front());
}

} // namespace sstables