Commit Graph

178 Commits

Author SHA1 Message Date
Botond Dénes
d65c1523c2 sstables: migrate all malformed_sstable_exception throw sites to throw_malformed_sstable_exception()
Replace all direct 'throw malformed_sstable_exception(...)' call sites
with the new throw_malformed_sstable_exception() helper, which respects
the --abort-on-malformed-sstable-error flag.
2026-05-11 11:58:14 +03:00
Avi Kivity
0ae22a09d4 LICENSE: Update to version 1.1
Updated terms of non-commercial use (must be a never-customer).
2026-04-12 19:46:33 +03:00
Tomasz Grabiec
4410e9c61a sstables: mx: index_reader: Optimize parsing for no promoted index case
It's a common case with small partition workloads.
2026-03-18 16:25:21 +01:00
Tomasz Grabiec
f55bb154ec sstables: mx: index_reader: Amoritze partition key storage
This change reduces the cost of partition index page construction and
LSA migration. This is achieved by several things working together:

 - index entries don't store keys as separate small objects (managed_bytes)
   They are written into one managed_bytes fragmented storage, entries
   hold offset into it.

   Before, we paid 16 bytes for managed_bytes plus LSA descriptor for
   the storage (1 byte) plus back-reference in the storage (8 bytes),
   so 25 bytes. Now we only pay 4 bytes for the size offset. If keys are 16
   bytes, that's a reduction from 31 bytes to 20 bytes per key.

 - index entries and key storage are now trivially moveable, so LSA
   migration can use memcpy() which amortizes the cost per key.
   memcpy().

   LSA eviction is now trivial and constant time for the whole page
   regardless of the number of entries. Page eviction dropped from
   14 us to 1 us.

This improves throughput in a CPU-bound miss-heavy read workload where
the partition index doesn't fit in memory.

  scylla perf-simple-query -c1 -m200M --partitions=1000000

Before:

    15328.25 tps (150.0 allocs/op,  14.1 logallocs/op,  45.4 tasks/op,  286769 insns/op,  218134 cycles/op,        0 errors)
    15279.01 tps (149.9 allocs/op,  14.1 logallocs/op,  45.3 tasks/op,  287696 insns/op,  218637 cycles/op,        0 errors)
    15347.78 tps (149.7 allocs/op,  14.1 logallocs/op,  45.3 tasks/op,  285851 insns/op,  217795 cycles/op,        0 errors)
    15403.68 tps (149.6 allocs/op,  14.1 logallocs/op,  45.2 tasks/op,  285111 insns/op,  216984 cycles/op,        0 errors)
    15189.47 tps (150.0 allocs/op,  14.1 logallocs/op,  45.5 tasks/op,  289509 insns/op,  219602 cycles/op,        0 errors)
    15295.04 tps (149.8 allocs/op,  14.1 logallocs/op,  45.3 tasks/op,  288021 insns/op,  218545 cycles/op,        0 errors)
    15162.01 tps (149.8 allocs/op,  14.1 logallocs/op,  45.4 tasks/op,  291265 insns/op,  220451 cycles/op,        0 errors)

After:

    21620.18 tps (148.4 allocs/op,  13.4 logallocs/op,  43.7 tasks/op,  176817 insns/op,  153183 cycles/op,        0 errors)
    20644.03 tps (149.8 allocs/op,  13.5 logallocs/op,  44.3 tasks/op,  187941 insns/op,  160409 cycles/op,        0 errors)
    20588.06 tps (150.1 allocs/op,  13.5 logallocs/op,  44.5 tasks/op,  188090 insns/op,  160818 cycles/op,        0 errors)
    20789.29 tps (149.5 allocs/op,  13.5 logallocs/op,  44.2 tasks/op,  186495 insns/op,  159382 cycles/op,        0 errors)
    20977.89 tps (149.5 allocs/op,  13.4 logallocs/op,  44.2 tasks/op,  183969 insns/op,  158140 cycles/op,        0 errors)
    21125.34 tps (149.1 allocs/op,  13.4 logallocs/op,  44.1 tasks/op,  183204 insns/op,  156925 cycles/op,        0 errors)
    21244.42 tps (148.6 allocs/op,  13.4 logallocs/op,  43.8 tasks/op,  181276 insns/op,  155973 cycles/op,        0 errors)

Mostly because the index now fits in memory.

When it doesn't, the benefits are still visible due to lower LSA overhead.
2026-03-18 16:25:21 +01:00
Tomasz Grabiec
50dc7c6dd8 sstables: mx: index_reader: Keep promoted_index info next to index_entry
Densely populated pages have no promoted index (small partitions), so
we can save space in such workloads by keeping promoted index in a
separate vector.

For workloads which do have a promoted index, pages have only one
partition. There aren't many such pages and they are long-lived, so
the extra allocation of the vector is amortized.

promoted_index class is removed, and replaced with equivalent
parsed_promoted_index_entry for simplicity. Because it's removed,
make_cursor() is moved into the index_reader class.

Reducing the size of index_entry is important for performence if pages
are densly populated. It helps to reduce LSA allocator pressure and
compaction/eviction speed.

This change, combined with the earlier change "Shave-off 16 bytes from
index_entry by using raw_token", gives significant improvement in
throughput in perf_simple_query run where the index doesn't fit in
memory:

  scylla perf-simple-query -c1 -m200M --partitions=1000000

Before:

9714.78 tps (170.9 allocs/op,  16.9 logallocs/op,  55.3 tasks/op,  494788 insns/op,  343920 cycles/op,        0 errors)
9603.13 tps (171.6 allocs/op,  17.0 logallocs/op,  55.6 tasks/op,  502358 insns/op,  348344 cycles/op,        0 errors)
9621.43 tps (171.9 allocs/op,  17.0 logallocs/op,  55.8 tasks/op,  500612 insns/op,  347508 cycles/op,        0 errors)
9597.75 tps (171.6 allocs/op,  17.0 logallocs/op,  55.6 tasks/op,  501428 insns/op,  348604 cycles/op,        0 errors)
9615.54 tps (171.6 allocs/op,  16.9 logallocs/op,  55.6 tasks/op,  501313 insns/op,  347935 cycles/op,        0 errors)
9577.03 tps (171.8 allocs/op,  17.0 logallocs/op,  55.7 tasks/op,  503283 insns/op,  349251 cycles/op,        0 errors)

After:

15328.25 tps (150.0 allocs/op,  14.1 logallocs/op,  45.4 tasks/op,  286769 insns/op,  218134 cycles/op,        0 errors)
15279.01 tps (149.9 allocs/op,  14.1 logallocs/op,  45.3 tasks/op,  287696 insns/op,  218637 cycles/op,        0 errors)
15347.78 tps (149.7 allocs/op,  14.1 logallocs/op,  45.3 tasks/op,  285851 insns/op,  217795 cycles/op,        0 errors)
15403.68 tps (149.6 allocs/op,  14.1 logallocs/op,  45.2 tasks/op,  285111 insns/op,  216984 cycles/op,        0 errors)
15189.47 tps (150.0 allocs/op,  14.1 logallocs/op,  45.5 tasks/op,  289509 insns/op,  219602 cycles/op,        0 errors)
15295.04 tps (149.8 allocs/op,  14.1 logallocs/op,  45.3 tasks/op,  288021 insns/op,  218545 cycles/op,        0 errors)
15162.01 tps (149.8 allocs/op,  14.1 logallocs/op,  45.4 tasks/op,  291265 insns/op,  220451 cycles/op,        0 errors)
2026-03-18 16:25:20 +01:00
Tomasz Grabiec
e9c98274b5 sstables: mx: index_reader: Reduce allocation_section overhead during index page parsing by batching allocation
If the page has many entries, we continuously enter and leave the
allocating section for every key. This can be avoided by batching LSA
operations for the whole page, after collecting all the entries.

Later optimizations will also build on this, where we will allocate
fragmented storage for keys in LSA using a single managed_bytes
constructor.

This alone brings only a minor improvement, but it does reduce LSA
allocations, probably due to less frequent memory reclamation:

  scylla perf-simple-query -c1 -m200M --duration 6000 --partitions=1000000

Before:

  9560.42 tps (172.2 allocs/op,  19.6 logallocs/op,  57.7 tasks/op,  567741 insns/op,  345158 cycles/op,        0 errors)
  9445.95 tps (173.1 allocs/op,  19.7 logallocs/op,  58.1 tasks/op,  579075 insns/op,  352173 cycles/op,        0 errors)
  9576.75 tps (172.2 allocs/op,  19.6 logallocs/op,  57.6 tasks/op,  572004 insns/op,  347373 cycles/op,        0 errors)
  9597.16 tps (172.2 allocs/op,  19.6 logallocs/op,  57.6 tasks/op,  569615 insns/op,  346618 cycles/op,        0 errors)
  9454.07 tps (173.5 allocs/op,  19.8 logallocs/op,  58.3 tasks/op,  579213 insns/op,  351569 cycles/op,        0 errors)

After:

  9562.21 tps (172.0 allocs/op,  17.0 logallocs/op,  55.8 tasks/op,  499225 insns/op,  347832 cycles/op,        0 errors)
  9480.20 tps (172.3 allocs/op,  17.0 logallocs/op,  55.9 tasks/op,  507271 insns/op,  350640 cycles/op,        0 errors)
  9512.42 tps (172.1 allocs/op,  17.0 logallocs/op,  55.9 tasks/op,  504247 insns/op,  350392 cycles/op,        0 errors)
  9498.45 tps (172.4 allocs/op,  17.1 logallocs/op,  55.9 tasks/op,  505765 insns/op,  350320 cycles/op,        0 errors)
  9076.30 tps (173.5 allocs/op,  17.1 logallocs/op,  56.5 tasks/op,  512791 insns/op,  354792 cycles/op,        0 errors)
  9542.62 tps (171.9 allocs/op,  17.0 logallocs/op,  55.8 tasks/op,  502532 insns/op,  348922 cycles/op,        0 errors)
2026-03-18 16:25:20 +01:00
Tomasz Grabiec
0e0f9f41b3 sstables: mx: index_reader: Keep index_entry directly in the vector
Partition index entries are relatively small, and if the workload has
small partitions, index pages have a lot of elements. Currently, index
entries are indirected via managed_ref, which causes increased cost of
LSA eviction and compaction. This patch amortizes this cost by storing
them dierctly in the managed_chunked_vector.

This gives about 23% improvement in throughput in perf-simple-query
for a workload where the index doesn't fit in memory:

  scylla perf-simple-query -c1 -m200M --duration 6000 --partitions=1000000

Before:

  7774.96 tps (166.0 allocs/op, 521.7 logallocs/op,  54.0 tasks/op,  802428 insns/op,  430457 cycles/op,        0 errors)
  7511.08 tps (166.1 allocs/op, 527.2 logallocs/op,  54.0 tasks/op,  804185 insns/op,  430752 cycles/op,        0 errors)
  7740.44 tps (166.3 allocs/op, 526.2 logallocs/op,  54.2 tasks/op,  805347 insns/op,  432117 cycles/op,        0 errors)
  7818.72 tps (165.2 allocs/op, 517.6 logallocs/op,  53.7 tasks/op,  794965 insns/op,  427751 cycles/op,        0 errors)
  7865.49 tps (165.1 allocs/op, 513.3 logallocs/op,  53.6 tasks/op,  788898 insns/op,  425171 cycles/op,        0 errors)

After:

  9560.42 tps (172.2 allocs/op,  19.6 logallocs/op,  57.7 tasks/op,  567741 insns/op,  345158 cycles/op,        0 errors)
  9445.95 tps (173.1 allocs/op,  19.7 logallocs/op,  58.1 tasks/op,  579075 insns/op,  352173 cycles/op,        0 errors)
  9576.75 tps (172.2 allocs/op,  19.6 logallocs/op,  57.6 tasks/op,  572004 insns/op,  347373 cycles/op,        0 errors)
  9597.16 tps (172.2 allocs/op,  19.6 logallocs/op,  57.6 tasks/op,  569615 insns/op,  346618 cycles/op,        0 errors)
  9454.07 tps (173.5 allocs/op,  19.8 logallocs/op,  58.3 tasks/op,  579213 insns/op,  351569 cycles/op,        0 errors)

Disabling the partition index doesn't improve the throuhgput beyond
that.
2026-03-18 16:25:20 +01:00
Tomasz Grabiec
082342ecad Attach names to allocating sections for better debuggability
Large reserves in allocating_section can cause stalls. We already log
reserve increase, but we don't know which table it belongs to:

  lsa - LSA allocation failure, increasing reserve in section 0x600009f94590 to 128 segments;

Allocating sections used for updating row cache on memtable flush are
notoriously problematic. Each table has its own row_cache, so its own
allocating_section(s). If we attached table name to those sections, we
could identify which table is causing problems. In some issues we
suspected system.raft, but we can't be sure.

This patch allows naming allocating_sections for the purpose of
identifying them in such log messages. I use abstract_formatter for
this purpose to avoid the cost of formatting strings on the hot path
(e.g. index_reader). And also to avoid duplicating strings which are
already stored elsewhere.

Fixes #25799

Closes scylladb/scylladb#27470
2025-12-07 14:14:25 +02:00
Michał Chojnowski
420e215873 sstables/trie/bti_index_reader: allow the caller to passing a precalculated murmur hash
Partitions.db internally uses a piece of the partition key murmur
hash (the same hash which is used to compute the token and the
relevant bits in the bloom filter). Before this patch,
the Partitions.db reader computes the hash internally from the
`sstables::partition_key`.

That's a waste, because this hash is usually also computed
for bloom filter purposes just before that.

So in this patch we let the caller pass that hash instead.

The old index interface, without the hash, is kept for convenience.

In this patch we only add a new interface, we don't switch the callers
to it yet. That will happen in the next commit.
2025-09-29 13:01:21 +02:00
Michał Chojnowski
25e98f1ff7 sstables: extract abstract_index_reader from index_reader.hh to its own header
In later parts of the series, we add a second implementation
of `abstract_index_reader`. To do that, we want a header with
`abstract_index_reader`, but we don't need to pull in everything else
from `index_reader.hh`. So let's extract `abstract_index_reader`
to its own header.
2025-09-07 00:30:15 +02:00
Michał Chojnowski
b1da5f2d0f sstables/index_reader: weaken some exactness guarantees in abstract_index_reader
After making the sstable reader more permissive,
we can weaken the abstract_index_reader interface.
2025-07-25 11:00:18 +02:00
Michał Chojnowski
fe8ee34024 sstables/index_reader: remove advance_to
`advance_to` is unused now, so remove it.
2025-07-25 11:00:18 +02:00
Michał Chojnowski
03bf6347e2 sstables/mx/reader: handle inexact lookups in advance_context()
`advance_context()` needs an ability to advance the index to
the partition immediately following the reader's current partition.
For this, it uses `abstract_index_reader::advance_to(dht::ring_position_view)`

But BTI (and any index format which stores only the prefixes of keys
instead of whole keys) can't implement `advance_to` with its current
semantics. The Data position returned by the index for a generic
`advance_to` might be off by one partition.

E.g. if the index stores prefixes `a`, `b`, `c`,
the index has no way to know if the first entry after `bb`
is `b` (which might correspond to `ba` as well as `bc`), or `c`.

However, BTI can be used exactly if the partition is known to
be present in the sstable. (In the above example, if `bb` is known
to be present in the sstable, then it must correspond to `b`.
So the index can reliably advance to `bb` or the first partition after it).

And this is enough for `advance_context()`, because the
current partition is known to be present.
So we can replace the usage of `advance_to` with an equivalent API call
which only works with present keys, but in exchange is implementable
by BTI.

This makes `advance_to` unused, so we remove it.
2025-07-25 11:00:18 +02:00
Michał Chojnowski
11792850dd sstables/mx/reader: handle inexact lookups in advance_to_next_partition()
`advance_to_next_partition()` needs an ability to advance the index to
the partition immediately following the reader's current partition.
For this, it uses `abstract_index_reader::advance_to(dht::ring_position_view)`

But BTI (and any index format which stores only the prefixes of keys
instead of whole keys) can't implement `advance_to` with its current
semantics. The Data position returned by the index for a generic
`advance_to` might be off by one partition.

E.g. if the index stores prefixes `a`, `b`, `c`,
the index has no way to know if the first entry after `bb`
is `b` (which might correspond to `ba` as well as `bc`), or `c`.

However, BTI can be used exactly if the partition is known to
be present in the sstable. (In the above example, if `bb` is known
to be present in the sstable, then it must correspond to `b`.
So the index can reliably advance to `bb` or the first partition after it).

And this is enough for `advance_to_next_partition()`, because the
current partition is known to be present.
So we can replace the usage of `advance_to` with an equivalent API call
which only works with present keys, but in exchange is implementable
by BTI.
2025-07-25 11:00:18 +02:00
Michał Chojnowski
141895f9eb sstables/index_reader: make the return value of get_partition_key optional
BTI indexes only store encoded prefixes of partition keys,
not the whole keys. They can't reliably implement `get_partition_key`.
The index reader interface must be weakened and callers must
be adapted.
2025-07-25 11:00:18 +02:00
Botond Dénes
20693edb27 Merge 'sstables: put index_reader behind a virtual interface' from Michał Chojnowski
This is a refactoring patch in preparation for BTI indexes. It contains no functional changes (or at least it's not intended to).

In this patch, we modify the sstable readers to use index readers through a new virtual `abstract_index_readers` interface.
Later, we will add BTI indexes which will also implement this interface.

This interface contains the methods of `index_reader` which are needed by sstable readers, and leaves out all other methods, such as `current_clustered_cursor`.

Not all methods of this interface will be implementable by a trie-based index later. For example, a trie-based index can't provide a reliable `get_partition_key()`, because — unlike the current index — it only stores partition keys for partitions which have a row index. So the interface will have to be further restricted later. We don't do that in this patch because that will require changes to sstable reader logic, and this patch is supposed to only include cosmetic changes.

No backports needed, this is a preparation for new functionality.

Closes scylladb/scylladb#25000

* github.com:scylladb/scylladb:
  sstables: add sstable::make_index_reader() and use where appropriate
  sstables/mx: in readers, use abstract_index_reader instead of index_reader
  sstables: in validate(), use abstract_index_reader instead of index_reader where possible
  test/lib/index_reader_assertions: accept abstract_index_reader instead of index_reader
  sstables/index_reader: introduce abstract_index_reader
  sstables/index_reader: extract a prefetch_lower_bound() method
2025-07-17 14:32:08 +03:00
Michał Chojnowski
c052ccd081 sstables/index_reader: introduce abstract_index_reader
We want to implement BTI indexes in Scylla.
After we do that, some sstables will use a BTI index reader,
while others will use the old BIG index reader.
To handle that, we can expose a common virtual "index reader"
interface to sstable readers. This is what this patch does.

This interface can't be quite fully implemented by a BTI index,
because some methods returns keys which a BIG index stores,
but a BTI index doesn't. So it will be further restricted in future
patches. But for now, we only extract *all* methods currently
used by the readers to a virtual interface.
2025-07-17 10:32:56 +02:00
Michał Chojnowski
1e7a292ef4 sstables/index_reader: extract a prefetch_lower_bound() method
The sstable reader reaches directly for a `clustered_index_cursor`.
But a BTI index reader won't be able to implement
`clustered_index_cursor`, because a BTI index doesn't store
full clustering keys, only some trie-encoded prefixes.

So we want to weaken the dependency. Instead of reaching
for `clustered_index_cursor`, we add a method which expresses
our intent, and we let `index_reader` touch the cursor internally.
2025-07-16 00:13:20 +02:00
Ernest Zaslavsky
8d49bb8af2 sstables: Start using make_data_or_index_source in sstable
Convert all necessary methods to be awaitable. Start using `make_data_or_index_source`
when creating data_source for data and index components.

For proper working of compressed/checksummed input streams, start passing
stream creator functors to `make_(checksummed/compressed)_file_(k_l/m)_format_input_stream`.
2025-07-15 10:10:23 +03:00
Ernest Zaslavsky
dff9a229a7 sstables: refactor readers and sources to use coroutines
Refactor readers and sources to support coroutine usage in
preparation for integration with `make_data_or_index_source`.
Move coroutine-based member initialization out of constructors
where applicable, and defer initialization until first use.
2025-07-15 10:10:23 +03:00
Botond Dénes
bce89c0f5e sstables: replace SCYLLA_ASSERT() with parse_assert() on the read path
So parse errors on corrupt SSTables don't result in crashes, instead
just aborting the read in process.
There are a lot of SCYLLA_ASSERT() usages remaining in sstables/. This
patch tried to focus on those usages which are in the read path. Some
places not only used on the read path may have been converted too, where
the usage of said method is not clear.
2025-06-24 09:16:28 +03:00
Kefu Chai
7215d4bfe9 utils: do not include unused headers
these unused includes were identifier by clang-include-cleaner. after
auditing these source files, all of the reports have been confirmed.

please note, because quite a few source files relied on
`utils/to_string.hh` to pull in the specialization of
`fmt::formatter<std::optional<T>>`, after removing
`#include <fmt/std.h>` from `utils/to_string.hh`, we have to
include `fmt/std.h` directly.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2025-01-14 07:56:39 -05:00
Avi Kivity
f3eade2f62 treewide: relicense to ScyllaDB-Source-Available-1.0
Drop the AGPL license in favor of a source-available license.
See the blog post [1] for details.

[1] https://www.scylladb.com/2024/12/18/why-were-moving-to-a-source-available-license/
2024-12-18 17:45:13 +02:00
Botond Dénes
05246e123d Merge 'sstables: Avoid computing column_values_fixed_lengths on each read' from Tomasz Grabiec
Reads which need sstable index were computing
column_values_fixed_lengths each time. This showed up in perf profile
for a sstable-read heavy workload, and amounted to about 1-2% of time.

Computing it involves type name parsing.

Avoid by using cached per-sstable mapping. There is already
sstable::_column_translation which can be used for this. It caches the
mapping for the least-recently used schema. Since the cursor uses the
mapping only for primary key columns, which are stable, any schema
will do, so we can use the last _column_translation. We only need to
make sure that it's always armed, so sstable loading is augmented with
arming with sstable's schema.

Also, fixes a potential use-after-free on schema in column_translation.

Closes scylladb/scylladb#21347

* github.com:scylladb/scylladb:
  sstables: Fix potential use-after-free on column_translation::column_info::name
  sstables: Avoid computing column_values_fixed_lengths on each read
2024-12-12 12:22:32 +02:00
Kefu Chai
48c8d24345 treewide: drop support for fmt < v10
since fedora 38 is EOL. and fedora 39 comes with fmt v10.0.0, also,
we've switched to the build image based on fedora 40, which ships
fmt-devel v10.2.1, there is no need to support fmt < 10.

in this change, we drop the support fmt < 10.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21847
2024-12-09 20:42:38 +02:00
Tomasz Grabiec
b0a5bf8b4a sstables: Avoid computing column_values_fixed_lengths on each read
Reads which need clustering index cursor were computing
column_values_fixed_lengths each time. This showed up in perf profile
for a sstable-read heavy workload, and amounted to about 1%.

Avoid by using cached per-sstable mapping. There is already
sstable::_column_translation which can be used for this. It caches the
mapping for the most recently used schema. Since the cursor uses the
mapping only for primary key columns, which are stable, any schema
will do, so we can use the last _column_translation. We only need to
make sure that it's always armed, so sstable loading is augmented with
arming with sstable's schema.
2024-12-09 14:05:37 +01:00
Avi Kivity
b706e3e9e4 Merge 'sstables/index_reader: avoid unnecessary index page reads in single-partition reads' from Michał Chojnowski
Terminology note: in the context of this series, "index page" means an contiguous segment of the index file starting (inclusive) at a key corresponding to a summary entry and ending (exclusive) before the key corresponding to the next summary entry. "Index pages" are not related to filesystem pages.

---

In a single-partition read, if the searched partition key is the first key in its index page, we start scanning the index for that key starting at the previous index page (inclusive), even though we could start directly from the key's page. Similarly, if the searched partition key is absent from the sstable and lies after all other keys in its appropriate page, we additionally scan the next page, even though it's known from the summary that it can't possibly contain the key.

Those cases are wasteful. It's worse than it might seem at first glance. When partitions are small, only a small fraction of search keys fulfills those conditions (i.e. "first key in its page" or "an absent key greater than the last key in its page"), so the waste doesn't matter much. But when partitions are big enough, every index page contains only one partition key (and a promoted index for that partition), which directly means that *all* search keys fulfill the conditions, which means that total index reading work is two times bigger than what it should be.

In addition, there is a secondary performance bug which, when the aforementioned conditions are fulfilled, causes *additional* I/O to happen *past* the index reads which are actually parsed and used. In effect, the index I/O in single-partition reads might be not just doubled, but even tripled (that's for IOPS — throughput might be multiplied even more), all because of a slight inaccuracy in the edge cases.

This series fixes those inefficiencies by tightening the edge cases and ensuring that single-partition reads always read only a single index page.

Here's an example where we query the first row (i.e. `LIMIT 1`) of a certain partition key, in a table with large (1 MB) promoted indexes. Before the patch, the lookup of the lower bound involves 3 serialized disk reads (as described above) to subsequent index pages, and even the lookup of the upper bound involves 2 disk reads:

```
                                                                                                                                                                                     Execute CQL3 query
                                                                                                                                                                          Parsing a statement [shard 0]
                                                                                                                                     Processing a statement for authenticated user: anonymous [shard 0]
                                                                                                                                                        Executing read query (reversed false) [shard 0]
                                                                   Creating read executor for token -1297921881139976049 with all: [127.11.11.1] targets: [127.11.11.1] repair decision: NONE [shard 0]
                                                                    Creating never_speculating_read_executor - speculative retry is disabled or there are no extra replicas to speculate with [shard 0]
                                                                                                                                                                  read_data: querying locally [shard 0]
                                                                                                                         Start querying singular range {{-1297921881139976049, pk{00023130}}} [shard 0]
                                                                                                                                     [reader concurrency semaphore user] admitted immediately [shard 0]
                                                                                                                                           [reader concurrency semaphore user] executing read [shard 0]
                            Reading key {-1297921881139976049, pk{00023130}} from sstable ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Data.db [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 38359040 [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 38391808 [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 38359040, successfully read 32768 bytes [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 38391808, successfully read 32768 bytes [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 39370752 [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 39403520 [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 39370752, successfully read 32768 bytes [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 39403520, successfully read 32768 bytes [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 40378368 [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 40411136 [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 40378368, successfully read 32768 bytes [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 40411136, successfully read 32768 bytes [shard 0]
                                                                                                                      upper_bound_cache_only({position: clustered, ckp{}, 1}): no upper bound [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 40378368 [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 40411136 [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 40378368, successfully read 32768 bytes [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 40411136, successfully read 32768 bytes [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 41390080 [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 41422848 [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 41390080, successfully read 32768 bytes [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 41422848, successfully read 32768 bytes [shard 0]
                                 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Data.db: scheduling bulk DMA read of size 21926 at offset 819200 [shard 0]
    ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Data.db: finished bulk DMA read of size 21926 at offset 819200, successfully read 24576 bytes [shard 0]
                                      Page stats: 1 partition(s), 0 static row(s) (0 live, 0 dead), 1 clustering row(s) (1 live, 0 dead), 0 range tombstone(s) and 0 cell(s) (0 live, 0 dead) [shard 0]
                                                                                                                                                                             Querying is done [shard 0]
                                                                                                                                                         Done processing - preparing a result [shard 0]
                                                                                                                                                                                       Request complete
```

After the patch, the lookup of each bound involves 1 read:
```

                                                                                                                                                                                     Execute CQL3 query
                                                                                                                                                                          Parsing a statement [shard 0]
                                                                                                                                     Processing a statement for authenticated user: anonymous [shard 0]
                                                                                                                                                        Executing read query (reversed false) [shard 0]
                                                                   Creating read executor for token -1297921881139976049 with all: [127.11.11.1] targets: [127.11.11.1] repair decision: NONE [shard 0]
                                                                    Creating never_speculating_read_executor - speculative retry is disabled or there are no extra replicas to speculate with [shard 0]
                                                                                                                                                                  read_data: querying locally [shard 0]
                                                                                                                         Start querying singular range {{-1297921881139976049, pk{00023130}}} [shard 0]
                                                                                                                                     [reader concurrency semaphore user] admitted immediately [shard 0]
                                                                                                                                           [reader concurrency semaphore user] executing read [shard 0]
                            Reading key {-1297921881139976049, pk{00023130}} from sstable ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Data.db [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 39370752 [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 39403520 [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 39370752, successfully read 32768 bytes [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 39403520, successfully read 32768 bytes [shard 0]
                                                                                                                      upper_bound_cache_only({position: clustered, ckp{}, 1}): no upper bound [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 40378368 [shard 0]
                              ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 40411136 [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 40378368, successfully read 32768 bytes [shard 0]
 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 40411136, successfully read 32768 bytes [shard 0]
                                 ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Data.db: scheduling bulk DMA read of size 21926 at offset 819200 [shard 0]
    ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Data.db: finished bulk DMA read of size 21926 at offset 819200, successfully read 24576 bytes [shard 0]
                                      Page stats: 1 partition(s), 0 static row(s) (0 live, 0 dead), 1 clustering row(s) (1 live, 0 dead), 0 range tombstone(s) and 0 cell(s) (0 live, 0 dead) [shard 0]
                                                                                                                                                                             Querying is done [shard 0]
                                                                                                                                                         Done processing - preparing a result [shard 0]
                                                                                                                                                                                       Request complete
```

Doesn't have to be backported, since the problem only affects performance, not correctness, and it has been present since forever.

Closes scylladb/scylladb#20897

* github.com:scylladb/scylladb:
  index_reader: remove a piece of misguided code involved in single-partition reads
  index_reader: in single-partition reads, don't read more than one page
  index_reader: fix unnecessary reads of preceding index pages
2024-11-04 14:28:27 +02:00
Tomasz Grabiec
1b82d5117a sstables: bsearch_clustered_cursor: Narrow down range using "end" position of the block
This is optimization.

Example:

  block0: start=aaa, end=aaA
  block1: start=bbb, end=bbB
  block2: whatever

Before the patch, advance_to("aAA") would skip to block0, and upper
bound probe would skip to block1. This way, the reader would read the
range of block0 from the data file.

After the patch, "end" position is taken into account, so
advance_to("aAA") will notice that block0 doesn't contain the position
and will skip to block1. This is especially important for dense
indexes, as it allows us to skip accessing data file if the search key
is missing.

It also solves the edge case problem related to the fact that single
row reads are using a range which with positions which are not equal
to the key, but are before(key) and after(key) for the lower bound and
upper bound respectively. Before the patch, advance_to(before("bbb"))
would skip to block0, before the position is before the block1's
start. And upper bound probe for after("bbb") would point to
block2. This way the read would scan block0 needlessly. After the
patch, advance_to(before("bbb")) will skip to block1 because we notice
based on "end" that block0 doesn't contain the position.

This change also ensures that the start position of the upper bound
entry of the after_key(pos), where pos is the last advance_to()
position, is warm in cache. This is needed to optimize single-row
reads with a dense index so that they always read exactly one promoted
index block. For this to work, probe_upper_bound() for the
after_key(row) always needs to find the upper bound block in
cache.
2024-10-03 14:16:05 +02:00
Tomasz Grabiec
a29501ed67 sstables: Reduce amount of I/O for clustering-key-bounded reads from large partitions
Single-row reads from large partition issue 64 KiB reads to the data file,
which is equal to the default span of the promoted index block in the data file.
If users would want to reduce selectivity of the index to speed up single-row reads,
this won't be effective. The reason is that the reader uses promoted index
to look up the start position in the data file of the read, but end position
will in practice extend to the next partition, and amount of I/O will be
determined by the underlying file input stream implementation and its
read-ahead heuristics. By default, that results in at least 2 IOs 32KB each.

There is already infrastructure to lookup end position based on upper
bound of the read, but it's not effective becasue it's a
non-populating lookup and the upper bound cursor has its own private
cached_promoted_index, which is cold when positions are computed. It's
non-populating on purpose, to avoid extra index file IO to read upper
bound. In case upper bound is far-enough from the lower bound, this
will only increase the cost of the read.

The solution employed here is to warm up the lower bound cursor's
cache before positions are computed, and use that cursor for
non-populating lookup of the upper bound.

We use the lower bound cursor and the slice's lower bound so that we
read the same blocks as later lower-bound slicing would, so that we
don't incur extra IO for cases where looking up upper bound is not
worth it, that is when upper bound is far from the lower bound. If
upper bound is near lower bound, then warming up using lower bound
will populate cached_promoted_index with blocks which will allow us to
locate the upper bound block accurately.  This is especially important
for single-row reads, where the bounds are around the same key.  In
this case we want to read the data file range which belongs to a
single promoted index block.  It doesn't matter that the upper bound
is not exactly the same. They both will likely lie in the same block,
and if not, binary search will bring adjacent blocks into cache.  Even
if upper bound is not near, the binary search will populate the cache
with blocks which can be used to narrow down the data file range
somewhat.

Fixes #10030.

The change was tested with perf-fast-forward.

I populated the data set with `column_index_size_in_kb` set to 1

  scylla perf-fast-forward --populate --run-tests=large-partition-slicing --column-index-size-in-kb=1

Test run:

  build/release/scylla perf-fast-forward --run-tests=large-partition-select-few-rows -c1 --keep-cache-across-test-cases --test-case-duration=0

This test reads two rows from the middle of a large partition (1M
rows), of subsequent keys. The first read will miss in the index file
page cache, the second read will hit.

Notice that before the change, the second read issued 2 aio requests worth of 64KiB in total.
After the change, the second read issued 1 aio worth of 2 KiB. That's because promoted index block is larger than 1 KiB.
I verified using logging that the data file range matches a single promoted index block.

Also, the first read which misses in cache is still faster after the change.

Before:

running: large-partition-select-few-rows on dataset large-part-ds1
Testing selecting few rows from a large partition:
stride  rows      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    allocs   tasks insns/f    cpu
500000  1         0.009802            1         1        102          0        102        102       21.0     21        196       2       1        0        1        1        0        0        0       568     269 4716050  53.4%
500001  1         0.000321            1         1       3113          0       3113       3113        2.0      2         64       1       0        1        0        0        0        0        0       116      26  555110  45.0%

After:

running: large-partition-select-few-rows on dataset large-part-ds1
Testing selecting few rows from a large partition:
stride  rows      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    allocs   tasks insns/f    cpu
500000  1         0.009609            1         1        104          0        104        104       20.0     20        137       2       1        0        1        1        0        0        0       561     268 4633407  43.1%
500001  1         0.000217            1         1       4602          0       4602       4602        1.0      1          2       1       0        1        0        0        0        0        0       110      26  313882  64.1%

(cherry picked from commit dfb339376aff1ed961b26c4759b1604f7df35e54)
2024-10-01 18:40:34 +02:00
Michał Chojnowski
c77d00fd8d index_reader: remove a piece of misguided code involved in single-partition reads
This patch removes a piece of code which, according to the comment,
allows for forwarding the index reader even if it was created as a
single-partition reader.

For single-partition reads, the input_stream
used by the reader is limited to the single index page containing the partition,
since reading the index file past that point would be a waste.
Because of this limit, such an index reader can't be forwarded/advanced.

The dubious piece of code gets around that by unsetting the stream and ensuring
it will be re-created, this time without the limit, if the index is advanced.

But there is no use for this. The idea of a "single-partition reader"
exist as an optimization. It's illegal to forward single-partition readers,
and it doesn't make sense to attempt that. (If there's a need for forwarding,
just don't create a single-partition reader).

I suspect this piece of code was written due to a misunderstanding.
Before the previous patch in this series, when the searched partition key
was the first key in its page, the index reader would scan the preceding
page first, realize it made a mistake, and advance to the next, correct page.
I suspect this piece of code was written to make this work.

But this is, in fact, undesirable. The fact that the index reader was working
like this was a performance bug. In the single-partition case there's
never an inherent reason to start with the wrong page. The index logic can be
corrected to always start with the right page, and that's what the previous
patch in this series does. And with that, there is no need to support advancing
anymore, and the dubious piece of code can be erased.

We also add an assert to emphasize that advancing a single-partition reader is
illegal.
2024-09-30 23:43:36 +02:00
Michał Chojnowski
bc30509523 index_reader: in single-partition reads, don't read more than one page
When looking for a partition key in the index, we scan the index from the
first index page which can possibly contain the key.

In a single-partition read, there is never a reason to read beyond that
page. After the previous patch in this series, it's guaranteed that
the first key in the next page is strictly greater than the searched key.
So if the searched key is greater than the last key in the first page,
then it is neither in the first nor the second page -- it must be absent
from the sstable.

But with the current logic, we read the second index page anyway, and
the realization that the key is absent happens higher in the call chain.

This patch optimizes that inefficiency by immediately returning EOF
if a single-partition read doesn't find the key in the first page.

Returning "end of file" even though we didn't actually go beyond the
end of file is hacky, but I don't see any other non-invasive way
of communicating to the caller that the partition is absent.

Some caller of the index could possibly assume that returning EOF
proves that the searched key is greater than all keys in the sstable.
I don't think any such caller exists today, but it's a possible
place for confusion.

Together with the previous patch in this series, this patch guarantees
that a single-partition read only accesses a single index page.
This fixes a weird secondary performance bug.
Due to some misunderstanding in the logic, when during a single-partition read
we scan two index pages, the second index page is scanned via an input_stream
created without an upper I/O limit, which means that we additionally read
a full read-ahead (currently: 64 kiB) past the second index page for no reason
whatsoever.
After this and the previous patch, a single-partition read always reads exactly one
index page, so the above problem cannot occur.
2024-09-30 23:43:36 +02:00
Michał Chojnowski
6b8b7d962c index_reader: fix unnecessary reads of preceding index pages
When setting the index to position X, we first look for the first summary
entry N such that N >= X. Then we load the index page preceding N and
scan it for the first partition key P such that P >= X.
If there is no such key in this page, then we scan the next page
(starting with N) for such key. (In this case it's always the first key).

For example, assume we have:
summary:  A   C   E
index:    A B C D E F

If we look up "B" in the index, then we first locate summary entry "C",
then we scan the index for B, starting from "A". This is all fine.

But when we look for "C" in the index, then we do the exactly the same
-- we scan the index for "C" starting from "A". This is wasteful, because
we can start scanning from "C".

To avoid this inefficiency, we should be looking for N > X, not N >= X.
This patch fixes that.

In addition, this fixes a second, weirder performance bug.
Due to some misunderstanding in the logic, when during a single-partition read
we scan two index pages, the second index page is scanned via an input_stream
created without an upper I/O limit, which means that we additionally read
a full read-ahead (currently: 64 kiB) past the second index page for no reason
whatsoever. After this patch, a single-partition read always reads exactly one
index page, so the above problem cannot occur.
2024-09-30 23:43:36 +02:00
Avi Kivity
aa1270a00c treewide: change assert() to SCYLLA_ASSERT()
assert() is traditionally disabled in release builds, but not in
scylladb. This hasn't caused problems so far, but the latest abseil
release includes a commit [1] that causes a 1000 insn/op regression when
NDEBUG is not defined.

Clearly, we must move towards a build system where NDEBUG is defined in
release builds. But we can't just define it blindly without vetting
all the assert() calls, as some were written with the expectation that
they are enabled in release mode.

To solve the conundrum, change all assert() calls to a new SCYLLA_ASSERT()
macro in utils/assert.hh. This macro is always defined and is not conditional
on NDEBUG, so we can later (after vetting Seastar) enable NDEBUG in release
mode.

[1] 66ef711d68

Closes scylladb/scylladb#20006
2024-08-05 08:23:35 +03:00
Lakshmi Narayanan Sreethar
64dadd5ec2 sstables/index_reader: stop consuming index when abort has been requested
When an abort is requested, stop further reading of the index file and
throw and exception from index_consume_entry_context::process_state().

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-07-16 20:42:50 +05:30
Lakshmi Narayanan Sreethar
c2524337a2 sstables::index_consume_entry_context: store abort_source
Store abort source inside sstables::index_consume_entry_context, so that
the next patch can implement cancelling the index read when abort is
requested.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-07-16 20:42:50 +05:30
Aleksandra Martyniuk
4530be9e5b test: add test to check if reader is closed
Add test to check if reader is closed in sstable::has_partition_key.
2024-02-22 14:53:14 +01:00
Michał Chojnowski
5a3e4a1cc0 utils: managed_bytes: optimize memory usage for small buffers
managed_bytes is implemented as chain of blob_storage objects.
Each blob_storage contains 24 bytes of metadata. But in the most
common case -- when there is only a single element in the chain --
16 bytes of this metadata is trivial/unused.

This is regrettable waste because managed_bytes is used for every
database cell in the memtables and cache. It means that every value
of size >= 7 bytes (smaller ones fit in the inline storage of
managed_bytes) receives 16 bytes of useless overhead.

To correct that, this patch adds to managed_bytes an alternative storage
layout -- used for buffers small enough to fit in one contiguous
fragment -- which only stores the necessary minimum of metadata.
(That is: a pointer to the parent, to facilitate moving the storage during
memory defragmentation).
2024-02-09 20:56:20 +01:00
Kefu Chai
a6152cb87b sstables: do not include unused headers
these unused includes were identified by clangd. see
https://clangd.llvm.org/guides/include-cleaner#unused-include-warning
for more details on the "Unused include" warning.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16666
2024-01-09 11:45:44 +02:00
Kefu Chai
893f319004 sstables: add formatter for index_consume_entry_context_state
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.

in this change, in order to enable the code in the header to
access the formatter without being moved down after the full specialization's
definition, we

* move the enum definition out of the class and before the
  class,
* rename the enum's name from state to index_consume_entry_context_state
* define a formatter for index_consume_entry_context_state
* remove its operator<<().

as fmt v10 is able to use `format_as()` as a fallback, the formatter
full specialization is guarded with `#if FMT_VERSION < 10'00'00`. we
will remove it after we start build with fmt v10.

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16204
2023-12-08 12:45:38 +02:00
Michał Chojnowski
f00bed9429 sstables: partition_index_cache: deglobalize stats
Move partition_index_cache stats from a thread_local variable
to cache_tracker. After the change, partition_index_cache
receives a reference to the stats via constructor, instead of
referencing a global.

This is needed so that cache_tracker can know the memory usage
of index caches (for cache eviction purposes) without relying on
globals.

But it also makes sense even without that motive.
2023-09-01 22:34:41 +02:00
Michał Chojnowski
c7d9d35030 utils: cached_file: deglobalize cached_file metrics
Move cached_file metrics from a thread_local variable
to cache_tracker.

This is needed so that cache_tracker can know
the memory usage of index caches (for purposes
of cache eviction) without relying on globals.

But it also makes sense even without that motive.
2023-09-01 22:34:41 +02:00
Avi Kivity
0cabf4eeb9 build: disable implicit fallthrough
Prevent switch case statements from falling through without annotation
([[fallthrough]]) proving that this was intended.

Existing intended cases were annotated.

Closes #14607
2023-07-10 19:36:06 +02:00
Pavel Emelyanov
66e43912d6 code: Switch to seastar API level 7
In that level no io_priority_class-es exist. Instead, all the IO happens
in the context of current sched-group. File API no longer accepts prio
class argument (and makes io_intent arg mandatory to impls).

So the change consists of
- removing all usage of io_priority_class
- patching file_impl's inheritants to updated API
- priority manager goes away altogether
- IO bandwidth update is performed on respective sched group
- tune-up scylla-gdb.py io_queues command

The first change is huge and was made semi-autimatically by:
- grep io_priority_class | default_priority_class
- remove all calls, found methods' args and class' fields

Patching file_impl-s is smaller, but also mechanical:
- replace io_priority_class& argument with io_intent* one
- pass intent to lower file (if applicatble)

Dropping the priority manager is:
- git-rm .cc and .hh
- sed out all the #include-s
- fix configure.py and cmakefile

The scylla-gdb.py update is a bit hairry -- it needs to use task queues
list for IO classes names and shares, but to detect it should it checks
for the "commitlog" group is present.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #13963
2023-06-06 13:29:16 +03:00
Pavel Emelyanov
2bb024c948 index_reader: Introduce and use default arguments to constructor
Most of creators of index_reader construct it with default prio class,
null trace pointer and use_caching::yes. Assigning implicit defaults to
constructor arguments keeps the code shorter and easier to read.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-23 11:29:04 +03:00
Pavel Emelyanov
3fd5d3cc2b index_reader: Use _pc field in get_file_input_stream_options() directly
No need to pass this-> field into this-> call

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-23 11:18:14 +03:00
Pavel Emelyanov
21d24e8ea3 index_reader: Move index_reader::get_file_input_stream_options to private: block
A "while at it" cleanup. When pathing the method (next patch) it turned
out that there are no other callers other than local class, so it _is_
private.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-05-23 11:18:14 +03:00
Kefu Chai
0cb842797a treewide: do not define/capture unused variables
these warnings are found by Clang-17 after removing
`-Wno-unused-lambda-capture` and '-Wno-unused-variable' from
the list of disabled warnings in `configure.py`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-02-15 22:57:18 +02:00
Benny Halevy
63c2cdafe8 sstables: index_reader: close(index_bound&) reset current_list
When closing _lower_bound and *_upper_bound
in the final close() call, they are currently left with
an engaged current_list member.

If the index_reader uses a _local_index_cache,
it is evicted with evict_gently which will, rightfully,
see the respective pages as referenced, and they won't be
evicted gently (only later when the index_reader is destroyed).

Reset index_bound.current_list on close(index_bound&)
to free up the reference.

Ref #12271

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #12370
2023-01-02 16:42:33 +01:00
Michał Chojnowski
d9269abf5b sstables: index_reader: always evict the local cache gently
Due to an oversight, the local index cache isn't evicted gently
when _upper_bound existed. This is a source of reactor stalls.
Fix that.

Fixes #12271

Closes #12364
2022-12-20 18:23:27 +02:00
Pavel Emelyanov
5f579eb405 sstable: Introduce index_filename()
Currently the sstable::filename(Index) is used in several places that
get the filename as a printable or throwable string and don't treat is
as a real location of any file.

For those, add the index_filename() helper symmetrical to toc_filename()
and (in some sense) the get_filename() one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-15 10:14:49 +03:00