scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-22 15:52:13 +00:00

Author	SHA1	Message	Date
Botond Dénes	d65c1523c2	sstables: migrate all malformed_sstable_exception throw sites to throw_malformed_sstable_exception() Replace all direct 'throw malformed_sstable_exception(...)' call sites with the new throw_malformed_sstable_exception() helper, which respects the --abort-on-malformed-sstable-error flag.	2026-05-11 11:58:14 +03:00
Avi Kivity	0ae22a09d4	LICENSE: Update to version 1.1 Updated terms of non-commercial use (must be a never-customer).	2026-04-12 19:46:33 +03:00
Tomasz Grabiec	4410e9c61a	sstables: mx: index_reader: Optimize parsing for no promoted index case It's a common case with small partition workloads.	2026-03-18 16:25:21 +01:00
Tomasz Grabiec	f55bb154ec	sstables: mx: index_reader: Amortize partition key storage This change reduces the cost of partition index page construction and LSA migration. This is achieved by several things working together: - index entries don't store keys as separate small objects (managed_bytes). They are written into one fragmented managed_bytes storage, and entries hold an offset into it. Before, we paid 16 bytes for managed_bytes plus LSA descriptor for the storage (1 byte) plus back-reference in the storage (8 bytes), so 25 bytes. Now we only pay 4 bytes for the size offset. If keys are 16 bytes, that's a reduction from 31 bytes to 20 bytes per key. - index entries and key storage are now trivially moveable, so LSA migration can use memcpy(), which amortizes the cost per key. LSA eviction is now trivial and constant time for the whole page regardless of the number of entries. Page eviction dropped from 14 us to 1 us. This improves throughput in a CPU-bound miss-heavy read workload where the partition index doesn't fit in memory.
scylla perf-simple-query -c1 -m200M --partitions=1000000 Before: 15328.25 tps (150.0 allocs/op, 14.1 logallocs/op, 45.4 tasks/op, 286769 insns/op, 218134 cycles/op, 0 errors) 15279.01 tps (149.9 allocs/op, 14.1 logallocs/op, 45.3 tasks/op, 287696 insns/op, 218637 cycles/op, 0 errors) 15347.78 tps (149.7 allocs/op, 14.1 logallocs/op, 45.3 tasks/op, 285851 insns/op, 217795 cycles/op, 0 errors) 15403.68 tps (149.6 allocs/op, 14.1 logallocs/op, 45.2 tasks/op, 285111 insns/op, 216984 cycles/op, 0 errors) 15189.47 tps (150.0 allocs/op, 14.1 logallocs/op, 45.5 tasks/op, 289509 insns/op, 219602 cycles/op, 0 errors) 15295.04 tps (149.8 allocs/op, 14.1 logallocs/op, 45.3 tasks/op, 288021 insns/op, 218545 cycles/op, 0 errors) 15162.01 tps (149.8 allocs/op, 14.1 logallocs/op, 45.4 tasks/op, 291265 insns/op, 220451 cycles/op, 0 errors) After: 21620.18 tps (148.4 allocs/op, 13.4 logallocs/op, 43.7 tasks/op, 176817 insns/op, 153183 cycles/op, 0 errors) 20644.03 tps (149.8 allocs/op, 13.5 logallocs/op, 44.3 tasks/op, 187941 insns/op, 160409 cycles/op, 0 errors) 20588.06 tps (150.1 allocs/op, 13.5 logallocs/op, 44.5 tasks/op, 188090 insns/op, 160818 cycles/op, 0 errors) 20789.29 tps (149.5 allocs/op, 13.5 logallocs/op, 44.2 tasks/op, 186495 insns/op, 159382 cycles/op, 0 errors) 20977.89 tps (149.5 allocs/op, 13.4 logallocs/op, 44.2 tasks/op, 183969 insns/op, 158140 cycles/op, 0 errors) 21125.34 tps (149.1 allocs/op, 13.4 logallocs/op, 44.1 tasks/op, 183204 insns/op, 156925 cycles/op, 0 errors) 21244.42 tps (148.6 allocs/op, 13.4 logallocs/op, 43.8 tasks/op, 181276 insns/op, 155973 cycles/op, 0 errors) Mostly because the index now fits in memory. When it doesn't, the benefits are still visible due to lower LSA overhead.	2026-03-18 16:25:21 +01:00
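The key-storage layout described in f55bb154ec can be illustrated with a minimal sketch. It is not the actual `index_reader` code: `packed_entry`, `packed_page`, the `std::string` buffer (standing in for fragmented `managed_bytes`), and the choice to store each key's end offset are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <type_traits>
#include <vector>

// Hypothetical index entry: it records only where its key ends inside the
// shared buffer, so it owns no separate allocation for the key bytes.
struct packed_entry {
    std::uint32_t key_end;       // end offset of this key in packed_page::keys
    std::uint64_t data_position; // illustrative payload
};

// Entries with no owning pointers can be relocated with a plain memcpy().
static_assert(std::is_trivially_copyable_v<packed_entry>);

// Hypothetical page: one contiguous key buffer plus a flat vector of entries.
struct packed_page {
    std::string keys; // stand-in for a single fragmented key storage
    std::vector<packed_entry> entries;

    void add(std::string_view key, std::uint64_t data_position) {
        keys.append(key);
        assert(keys.size() <= UINT32_MAX); // offsets are 4 bytes in this sketch
        entries.push_back({static_cast<std::uint32_t>(keys.size()), data_position});
    }

    // A key starts where the previous key ends (or at 0 for the first entry).
    std::string_view key(std::size_t i) const {
        assert(i < entries.size());
        const std::uint32_t begin = i == 0 ? 0 : entries[i - 1].key_end;
        return std::string_view(keys).substr(begin, entries[i].key_end - begin);
    }
};
```

With this layout, dropping a page frees two buffers regardless of how many keys it holds, which is the intuition behind the constant-time eviction mentioned in the commit message.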
Tomasz Grabiec	50dc7c6dd8	sstables: mx: index_reader: Keep promoted_index info next to index_entry Densely populated pages have no promoted index (small partitions), so we can save space in such workloads by keeping promoted index in a separate vector. For workloads which do have a promoted index, pages have only one partition. There aren't many such pages and they are long-lived, so the extra allocation of the vector is amortized. promoted_index class is removed, and replaced with equivalent parsed_promoted_index_entry for simplicity. Because it's removed, make_cursor() is moved into the index_reader class. Reducing the size of index_entry is important for performance if pages are densely populated. It helps to reduce LSA allocator pressure and the cost of compaction/eviction. This change, combined with the earlier change "Shave-off 16 bytes from index_entry by using raw_token", gives significant improvement in throughput in perf_simple_query run where the index doesn't fit in memory: scylla perf-simple-query -c1 -m200M --partitions=1000000 Before: 9714.78 tps (170.9 allocs/op, 16.9 logallocs/op, 55.3 tasks/op, 494788 insns/op, 343920 cycles/op, 0 errors) 9603.13 tps (171.6 allocs/op, 17.0 logallocs/op, 55.6 tasks/op, 502358 insns/op, 348344 cycles/op, 0 errors) 9621.43 tps (171.9 allocs/op, 17.0 logallocs/op, 55.8 tasks/op, 500612 insns/op, 347508 cycles/op, 0 errors) 9597.75 tps (171.6 allocs/op, 17.0 logallocs/op, 55.6 tasks/op, 501428 insns/op, 348604 cycles/op, 0 errors) 9615.54 tps (171.6 allocs/op, 16.9 logallocs/op, 55.6 tasks/op, 501313 insns/op, 347935 cycles/op, 0 errors) 9577.03 tps (171.8 allocs/op, 17.0 logallocs/op, 55.7 tasks/op, 503283 insns/op, 349251 cycles/op, 0 errors) After: 15328.25 tps (150.0 allocs/op, 14.1 logallocs/op, 45.4 tasks/op, 286769 insns/op, 218134 cycles/op, 0 errors) 15279.01 tps (149.9 allocs/op, 14.1 logallocs/op, 45.3 tasks/op, 287696 insns/op, 218637 cycles/op, 0 errors) 15347.78 tps (149.7 allocs/op, 14.1 logallocs/op, 45.3
tasks/op, 285851 insns/op, 217795 cycles/op, 0 errors) 15403.68 tps (149.6 allocs/op, 14.1 logallocs/op, 45.2 tasks/op, 285111 insns/op, 216984 cycles/op, 0 errors) 15189.47 tps (150.0 allocs/op, 14.1 logallocs/op, 45.5 tasks/op, 289509 insns/op, 219602 cycles/op, 0 errors) 15295.04 tps (149.8 allocs/op, 14.1 logallocs/op, 45.3 tasks/op, 288021 insns/op, 218545 cycles/op, 0 errors) 15162.01 tps (149.8 allocs/op, 14.1 logallocs/op, 45.4 tasks/op, 291265 insns/op, 220451 cycles/op, 0 errors)	2026-03-18 16:25:20 +01:00
Tomasz Grabiec	e9c98274b5	sstables: mx: index_reader: Reduce allocation_section overhead during index page parsing by batching allocation If the page has many entries, we continuously enter and leave the allocating section for every key. This can be avoided by batching LSA operations for the whole page, after collecting all the entries. Later optimizations will also build on this, where we will allocate fragmented storage for keys in LSA using a single managed_bytes constructor. This alone brings only a minor improvement, but it does reduce LSA allocations, probably due to less frequent memory reclamation: scylla perf-simple-query -c1 -m200M --duration 6000 --partitions=1000000 Before: 9560.42 tps (172.2 allocs/op, 19.6 logallocs/op, 57.7 tasks/op, 567741 insns/op, 345158 cycles/op, 0 errors) 9445.95 tps (173.1 allocs/op, 19.7 logallocs/op, 58.1 tasks/op, 579075 insns/op, 352173 cycles/op, 0 errors) 9576.75 tps (172.2 allocs/op, 19.6 logallocs/op, 57.6 tasks/op, 572004 insns/op, 347373 cycles/op, 0 errors) 9597.16 tps (172.2 allocs/op, 19.6 logallocs/op, 57.6 tasks/op, 569615 insns/op, 346618 cycles/op, 0 errors) 9454.07 tps (173.5 allocs/op, 19.8 logallocs/op, 58.3 tasks/op, 579213 insns/op, 351569 cycles/op, 0 errors) After: 9562.21 tps (172.0 allocs/op, 17.0 logallocs/op, 55.8 tasks/op, 499225 insns/op, 347832 cycles/op, 0 errors) 9480.20 tps (172.3 allocs/op, 17.0 logallocs/op, 55.9 tasks/op, 507271 insns/op, 350640 cycles/op, 0 errors) 9512.42 tps (172.1 allocs/op, 17.0 logallocs/op, 55.9 tasks/op, 504247 insns/op, 350392 cycles/op, 0 errors) 9498.45 tps (172.4 allocs/op, 17.1 logallocs/op, 55.9 tasks/op, 505765 insns/op, 350320 cycles/op, 0 errors) 9076.30 tps (173.5 allocs/op, 17.1 logallocs/op, 56.5 tasks/op, 512791 insns/op, 354792 cycles/op, 0 errors) 9542.62 tps (171.9 allocs/op, 17.0 logallocs/op, 55.8 tasks/op, 502532 insns/op, 348922 cycles/op, 0 errors)	2026-03-18 16:25:20 +01:00
Tomasz Grabiec	0e0f9f41b3	sstables: mx: index_reader: Keep index_entry directly in the vector Partition index entries are relatively small, and if the workload has small partitions, index pages have a lot of elements. Currently, index entries are indirected via managed_ref, which causes increased cost of LSA eviction and compaction. This patch amortizes this cost by storing them directly in the managed_chunked_vector. This gives about 23% improvement in throughput in perf-simple-query for a workload where the index doesn't fit in memory: scylla perf-simple-query -c1 -m200M --duration 6000 --partitions=1000000 Before: 7774.96 tps (166.0 allocs/op, 521.7 logallocs/op, 54.0 tasks/op, 802428 insns/op, 430457 cycles/op, 0 errors) 7511.08 tps (166.1 allocs/op, 527.2 logallocs/op, 54.0 tasks/op, 804185 insns/op, 430752 cycles/op, 0 errors) 7740.44 tps (166.3 allocs/op, 526.2 logallocs/op, 54.2 tasks/op, 805347 insns/op, 432117 cycles/op, 0 errors) 7818.72 tps (165.2 allocs/op, 517.6 logallocs/op, 53.7 tasks/op, 794965 insns/op, 427751 cycles/op, 0 errors) 7865.49 tps (165.1 allocs/op, 513.3 logallocs/op, 53.6 tasks/op, 788898 insns/op, 425171 cycles/op, 0 errors) After: 9560.42 tps (172.2 allocs/op, 19.6 logallocs/op, 57.7 tasks/op, 567741 insns/op, 345158 cycles/op, 0 errors) 9445.95 tps (173.1 allocs/op, 19.7 logallocs/op, 58.1 tasks/op, 579075 insns/op, 352173 cycles/op, 0 errors) 9576.75 tps (172.2 allocs/op, 19.6 logallocs/op, 57.6 tasks/op, 572004 insns/op, 347373 cycles/op, 0 errors) 9597.16 tps (172.2 allocs/op, 19.6 logallocs/op, 57.6 tasks/op, 569615 insns/op, 346618 cycles/op, 0 errors) 9454.07 tps (173.5 allocs/op, 19.8 logallocs/op, 58.3 tasks/op, 579213 insns/op, 351569 cycles/op, 0 errors) Disabling the partition index doesn't improve the throughput beyond that.	2026-03-18 16:25:20 +01:00
Tomasz Grabiec	082342ecad	Attach names to allocating sections for better debuggability Large reserves in allocating_section can cause stalls. We already log reserve increase, but we don't know which table it belongs to: lsa - LSA allocation failure, increasing reserve in section 0x600009f94590 to 128 segments; Allocating sections used for updating row cache on memtable flush are notoriously problematic. Each table has its own row_cache, so its own allocating_section(s). If we attached table name to those sections, we could identify which table is causing problems. In some issues we suspected system.raft, but we can't be sure. This patch allows naming allocating_sections for the purpose of identifying them in such log messages. I use abstract_formatter for this purpose to avoid the cost of formatting strings on the hot path (e.g. index_reader). And also to avoid duplicating strings which are already stored elsewhere. Fixes #25799 Closes scylladb/scylladb#27470	2025-12-07 14:14:25 +02:00
Michał Chojnowski	420e215873	sstables/trie/bti_index_reader: allow the caller to pass a precalculated murmur hash Partitions.db internally uses a piece of the partition key murmur hash (the same hash which is used to compute the token and the relevant bits in the bloom filter). Before this patch, the Partitions.db reader computes the hash internally from the `sstables::partition_key`. That's a waste, because this hash is usually also computed for bloom filter purposes just before that. So in this patch we let the caller pass that hash instead. The old index interface, without the hash, is kept for convenience. In this patch we only add a new interface, we don't switch the callers to it yet. That will happen in the next commit.	2025-09-29 13:01:21 +02:00
Michał Chojnowski	25e98f1ff7	sstables: extract abstract_index_reader from index_reader.hh to its own header In later parts of the series, we add a second implementation of `abstract_index_reader`. To do that, we want a header with `abstract_index_reader`, but we don't need to pull in everything else from `index_reader.hh`. So let's extract `abstract_index_reader` to its own header.	2025-09-07 00:30:15 +02:00
Michał Chojnowski	b1da5f2d0f	sstables/index_reader: weaken some exactness guarantees in abstract_index_reader After making the sstable reader more permissive, we can weaken the abstract_index_reader interface.	2025-07-25 11:00:18 +02:00
Michał Chojnowski	fe8ee34024	sstables/index_reader: remove advance_to `advance_to` is unused now, so remove it.	2025-07-25 11:00:18 +02:00
Michał Chojnowski	03bf6347e2	sstables/mx/reader: handle inexact lookups in `advance_context()` `advance_context()` needs an ability to advance the index to the partition immediately following the reader's current partition. For this, it uses `abstract_index_reader::advance_to(dht::ring_position_view)` But BTI (and any index format which stores only the prefixes of keys instead of whole keys) can't implement `advance_to` with its current semantics. The Data position returned by the index for a generic `advance_to` might be off by one partition. E.g. if the index stores prefixes `a`, `b`, `c`, the index has no way to know if the first entry after `bb` is `b` (which might correspond to `ba` as well as `bc`), or `c`. However, BTI can be used exactly if the partition is known to be present in the sstable. (In the above example, if `bb` is known to be present in the sstable, then it must correspond to `b`. So the index can reliably advance to `bb` or the first partition after it). And this is enough for `advance_context()`, because the current partition is known to be present. So we can replace the usage of `advance_to` with an equivalent API call which only works with present keys, but in exchange is implementable by BTI. This makes `advance_to` unused, so we remove it.	2025-07-25 11:00:18 +02:00
Michał Chojnowski	11792850dd	sstables/mx/reader: handle inexact lookups in `advance_to_next_partition()` `advance_to_next_partition()` needs an ability to advance the index to the partition immediately following the reader's current partition. For this, it uses `abstract_index_reader::advance_to(dht::ring_position_view)` But BTI (and any index format which stores only the prefixes of keys instead of whole keys) can't implement `advance_to` with its current semantics. The Data position returned by the index for a generic `advance_to` might be off by one partition. E.g. if the index stores prefixes `a`, `b`, `c`, the index has no way to know if the first entry after `bb` is `b` (which might correspond to `ba` as well as `bc`), or `c`. However, BTI can be used exactly if the partition is known to be present in the sstable. (In the above example, if `bb` is known to be present in the sstable, then it must correspond to `b`. So the index can reliably advance to `bb` or the first partition after it). And this is enough for `advance_to_next_partition()`, because the current partition is known to be present. So we can replace the usage of `advance_to` with an equivalent API call which only works with present keys, but in exchange is implementable by BTI.	2025-07-25 11:00:18 +02:00
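The prefix ambiguity described in 03bf6347e2 and 11792850dd can be reproduced with a small sketch. It does not use the BTI format or any ScyllaDB API: the function names are hypothetical, keys are plain strings, and the "index" is just a sorted list of prefixes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Exact answer, computable only from full keys: position of the first
// partition whose key is >= `key`.
std::size_t exact_lower_bound(const std::vector<std::string>& keys,
                              const std::string& key) {
    return static_cast<std::size_t>(
        std::lower_bound(keys.begin(), keys.end(), key) - keys.begin());
}

// What a prefix-only index can answer: which stored prefix matches `key`.
// The result identifies the partition only if `key` is known to be present.
std::optional<std::size_t> find_by_prefix(const std::vector<std::string>& prefixes,
                                          const std::string& key) {
    for (std::size_t i = 0; i < prefixes.size(); ++i) {
        if (key.compare(0, prefixes[i].size(), prefixes[i]) == 0) {
            return i;
        }
    }
    return std::nullopt;
}
```

Two sstables holding `{aa, ba, cc}` and `{aa, bc, cc}` would both be indexed by the prefixes `a`, `b`, `c`, yet the first partition at or after `bb` is `cc` in the former and `bc` in the latter. The prefix lookup returns entry `b` in both cases, which is only correct when `bb` itself is present.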
Michał Chojnowski	141895f9eb	sstables/index_reader: make the return value of `get_partition_key` optional BTI indexes only store encoded prefixes of partition keys, not the whole keys. They can't reliably implement `get_partition_key`. The index reader interface must be weakened and callers must be adapted.	2025-07-25 11:00:18 +02:00
Botond Dénes	20693edb27	Merge 'sstables: put index_reader behind a virtual interface' from Michał Chojnowski This is a refactoring patch in preparation for BTI indexes. It contains no functional changes (or at least it's not intended to). In this patch, we modify the sstable readers to use index readers through a new virtual `abstract_index_reader` interface. Later, we will add BTI indexes which will also implement this interface. This interface contains the methods of `index_reader` which are needed by sstable readers, and leaves out all other methods, such as `current_clustered_cursor`. Not all methods of this interface will be implementable by a trie-based index later. For example, a trie-based index can't provide a reliable `get_partition_key()`, because, unlike the current index, it only stores partition keys for partitions which have a row index. So the interface will have to be further restricted later. We don't do that in this patch because that will require changes to sstable reader logic, and this patch is supposed to only include cosmetic changes. No backports needed, this is a preparation for new functionality. Closes scylladb/scylladb#25000 * github.com:scylladb/scylladb: sstables: add sstable::make_index_reader() and use where appropriate sstables/mx: in readers, use abstract_index_reader instead of index_reader sstables: in validate(), use abstract_index_reader instead of index_reader where possible test/lib/index_reader_assertions: accept abstract_index_reader instead of index_reader sstables/index_reader: introduce abstract_index_reader sstables/index_reader: extract a prefetch_lower_bound() method	2025-07-17 14:32:08 +03:00
Michał Chojnowski	c052ccd081	sstables/index_reader: introduce abstract_index_reader We want to implement BTI indexes in Scylla. After we do that, some sstables will use a BTI index reader, while others will use the old BIG index reader. To handle that, we can expose a common virtual "index reader" interface to sstable readers. This is what this patch does. This interface can't be quite fully implemented by a BTI index, because some methods return keys which a BIG index stores, but a BTI index doesn't. So it will be further restricted in future patches. But for now, we only extract all methods currently used by the readers to a virtual interface.	2025-07-17 10:32:56 +02:00
Michał Chojnowski	1e7a292ef4	sstables/index_reader: extract a prefetch_lower_bound() method The sstable reader reaches directly for a `clustered_index_cursor`. But a BTI index reader won't be able to implement `clustered_index_cursor`, because a BTI index doesn't store full clustering keys, only some trie-encoded prefixes. So we want to weaken the dependency. Instead of reaching for `clustered_index_cursor`, we add a method which expresses our intent, and we let `index_reader` touch the cursor internally.	2025-07-16 00:13:20 +02:00
Ernest Zaslavsky	8d49bb8af2	sstables: Start using `make_data_or_index_source` in `sstable` Convert all necessary methods to be awaitable. Start using `make_data_or_index_source` when creating data_source for data and index components. For proper working of compressed/checksummed input streams, start passing stream creator functors to `make_(checksummed/compressed)_file_(k_l/m)_format_input_stream`.	2025-07-15 10:10:23 +03:00
Ernest Zaslavsky	dff9a229a7	sstables: refactor readers and sources to use coroutines Refactor readers and sources to support coroutine usage in preparation for integration with `make_data_or_index_source`. Move coroutine-based member initialization out of constructors where applicable, and defer initialization until first use.	2025-07-15 10:10:23 +03:00
Botond Dénes	bce89c0f5e	sstables: replace SCYLLA_ASSERT() with parse_assert() on the read path So parse errors on corrupt SSTables don't result in crashes, instead just aborting the read in process. There are a lot of SCYLLA_ASSERT() usages remaining in sstables/. This patch tried to focus on those usages which are in the read path. Some places not only used on the read path may have been converted too, where the usage of said method is not clear.	2025-06-24 09:16:28 +03:00
Kefu Chai	7215d4bfe9	utils: do not include unused headers these unused includes were identified by clang-include-cleaner. after auditing these source files, all of the reports have been confirmed. please note, because quite a few source files relied on `utils/to_string.hh` to pull in the specialization of `fmt::formatter<std::optional<T>>`, after removing `#include <fmt/std.h>` from `utils/to_string.hh`, we have to include `fmt/std.h` directly. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2025-01-14 07:56:39 -05:00
Avi Kivity	f3eade2f62	treewide: relicense to ScyllaDB-Source-Available-1.0 Drop the AGPL license in favor of a source-available license. See the blog post [1] for details. [1] https://www.scylladb.com/2024/12/18/why-were-moving-to-a-source-available-license/	2024-12-18 17:45:13 +02:00
Botond Dénes	05246e123d	Merge 'sstables: Avoid computing column_values_fixed_lengths on each read' from Tomasz Grabiec Reads which need sstable index were computing column_values_fixed_lengths each time. This showed up in perf profile for a sstable-read heavy workload, and amounted to about 1-2% of time. Computing it involves type name parsing. Avoid by using cached per-sstable mapping. There is already sstable::_column_translation which can be used for this. It caches the mapping for the most recently used schema. Since the cursor uses the mapping only for primary key columns, which are stable, any schema will do, so we can use the last _column_translation. We only need to make sure that it's always armed, so sstable loading is augmented with arming with sstable's schema. Also, fixes a potential use-after-free on schema in column_translation. Closes scylladb/scylladb#21347 * github.com:scylladb/scylladb: sstables: Fix potential use-after-free on column_translation::column_info::name sstables: Avoid computing column_values_fixed_lengths on each read	2024-12-12 12:22:32 +02:00
Kefu Chai	48c8d24345	treewide: drop support for fmt < v10 since fedora 38 is EOL. and fedora 39 comes with fmt v10.0.0, also, we've switched to the build image based on fedora 40, which ships fmt-devel v10.2.1, there is no need to support fmt < 10. in this change, we drop the support fmt < 10. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#21847	2024-12-09 20:42:38 +02:00
Tomasz Grabiec	b0a5bf8b4a	sstables: Avoid computing column_values_fixed_lengths on each read Reads which need clustering index cursor were computing column_values_fixed_lengths each time. This showed up in perf profile for a sstable-read heavy workload, and amounted to about 1%. Avoid by using cached per-sstable mapping. There is already sstable::_column_translation which can be used for this. It caches the mapping for the most recently used schema. Since the cursor uses the mapping only for primary key columns, which are stable, any schema will do, so we can use the last _column_translation. We only need to make sure that it's always armed, so sstable loading is augmented with arming with sstable's schema.	2024-12-09 14:05:37 +01:00
Avi Kivity	b706e3e9e4	Merge 'sstables/index_reader: avoid unnecessary index page reads in single-partition reads' from Michał Chojnowski Terminology note: in the context of this series, "index page" means a contiguous segment of the index file starting (inclusive) at a key corresponding to a summary entry and ending (exclusive) before the key corresponding to the next summary entry. "Index pages" are not related to filesystem pages. --- In a single-partition read, if the searched partition key is the first key in its index page, we start scanning the index for that key starting at the previous index page (inclusive), even though we could start directly from the key's page. Similarly, if the searched partition key is absent from the sstable and lies after all other keys in its appropriate page, we additionally scan the next page, even though it's known from the summary that it can't possibly contain the key. Those cases are wasteful. It's worse than it might seem at first glance. When partitions are small, only a small fraction of search keys fulfills those conditions (i.e. "first key in its page" or "an absent key greater than the last key in its page"), so the waste doesn't matter much. But when partitions are big enough, every index page contains only one partition key (and a promoted index for that partition), which directly means that all search keys fulfill the conditions, which means that total index reading work is two times bigger than what it should be. In addition, there is a secondary performance bug which, when the aforementioned conditions are fulfilled, causes additional I/O to happen past the index reads which are actually parsed and used. In effect, the index I/O in single-partition reads might be not just doubled, but even tripled (that's for IOPS; throughput might be multiplied even more), all because of a slight inaccuracy in the edge cases.
This series fixes those inefficiencies by tightening the edge cases and ensuring that single-partition reads always read only a single index page. Here's an example where we query the first row (i.e. `LIMIT 1`) of a certain partition key, in a table with large (1 MB) promoted indexes. Before the patch, the lookup of the lower bound involves 3 serialized disk reads (as described above) to subsequent index pages, and even the lookup of the upper bound involves 2 disk reads: ``` Execute CQL3 query Parsing a statement [shard 0] Processing a statement for authenticated user: anonymous [shard 0] Executing read query (reversed false) [shard 0] Creating read executor for token -1297921881139976049 with all: [127.11.11.1] targets: [127.11.11.1] repair decision: NONE [shard 0] Creating never_speculating_read_executor - speculative retry is disabled or there are no extra replicas to speculate with [shard 0] read_data: querying locally [shard 0] Start querying singular range {{-1297921881139976049, pk{00023130}}} [shard 0] [reader concurrency semaphore user] admitted immediately [shard 0] [reader concurrency semaphore user] executing read [shard 0] Reading key {-1297921881139976049, pk{00023130}} from sstable ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Data.db [shard 0] ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 38359040 [shard 0] ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 38391808 [shard 0] ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 38359040, successfully read 32768 bytes [shard 0] ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 
32768 at offset 38391808, successfully read 32768 bytes [shard 0] ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 39370752 [shard 0] ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 39403520 [shard 0] ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 39370752, successfully read 32768 bytes [shard 0] ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 39403520, successfully read 32768 bytes [shard 0] ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 40378368 [shard 0] ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 40411136 [shard 0] ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 40378368, successfully read 32768 bytes [shard 0] ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 40411136, successfully read 32768 bytes [shard 0] upper_bound_cache_only({position: clustered, ckp{}, 1}): no upper bound [shard 0] ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 40378368 [shard 0] ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 40411136 [shard 0] 
./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 40378368, successfully read 32768 bytes [shard 0] ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 40411136, successfully read 32768 bytes [shard 0] ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 41390080 [shard 0] ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 41422848 [shard 0] ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 41390080, successfully read 32768 bytes [shard 0] ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 41422848, successfully read 32768 bytes [shard 0] ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Data.db: scheduling bulk DMA read of size 21926 at offset 819200 [shard 0] ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Data.db: finished bulk DMA read of size 21926 at offset 819200, successfully read 24576 bytes [shard 0] Page stats: 1 partition(s), 0 static row(s) (0 live, 0 dead), 1 clustering row(s) (1 live, 0 dead), 0 range tombstone(s) and 0 cell(s) (0 live, 0 dead) [shard 0] Querying is done [shard 0] Done processing - preparing a result [shard 0] Request complete ``` After the patch, the lookup of each bound involves 1 read: ``` Execute CQL3 query Parsing a statement [shard 0] Processing a statement for authenticated user: anonymous [shard 0] Executing read query (reversed false) [shard 0] Creating read 
executor for token -1297921881139976049 with all: [127.11.11.1] targets: [127.11.11.1] repair decision: NONE [shard 0] Creating never_speculating_read_executor - speculative retry is disabled or there are no extra replicas to speculate with [shard 0] read_data: querying locally [shard 0] Start querying singular range {{-1297921881139976049, pk{00023130}}} [shard 0] [reader concurrency semaphore user] admitted immediately [shard 0] [reader concurrency semaphore user] executing read [shard 0] Reading key {-1297921881139976049, pk{00023130}} from sstable ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Data.db [shard 0] ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 39370752 [shard 0] ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 39403520 [shard 0] ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 39370752, successfully read 32768 bytes [shard 0] ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 39403520, successfully read 32768 bytes [shard 0] upper_bound_cache_only({position: clustered, ckp{}, 1}): no upper bound [shard 0] ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 40378368 [shard 0] ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: scheduling bulk DMA read of size 32768 at offset 40411136 [shard 0] ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 40378368, successfully 
read 32768 bytes [shard 0] ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Index.db: finished bulk DMA read of size 32768 at offset 40411136, successfully read 32768 bytes [shard 0] ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Data.db: scheduling bulk DMA read of size 21926 at offset 819200 [shard 0] ./workdir_01/data/ks/t-536c31f09a9c11efbd5082a6aa3e8d0c/me-3gky_0v18_3rgjk2dsjae431s4uz-big-Data.db: finished bulk DMA read of size 21926 at offset 819200, successfully read 24576 bytes [shard 0] Page stats: 1 partition(s), 0 static row(s) (0 live, 0 dead), 1 clustering row(s) (1 live, 0 dead), 0 range tombstone(s) and 0 cell(s) (0 live, 0 dead) [shard 0] Querying is done [shard 0] Done processing - preparing a result [shard 0] Request complete ``` Doesn't have to be backported, since the problem only affects performance, not correctness, and it has been present since forever. Closes scylladb/scylladb#20897 * github.com:scylladb/scylladb: index_reader: remove a piece of misguided code involved in single-partition reads index_reader: in single-partition reads, don't read more than one page index_reader: fix unnecessary reads of preceding index pages	2024-11-04 14:28:27 +02:00
Tomasz Grabiec	1b82d5117a	sstables: bsearch_clustered_cursor: Narrow down range using "end" position of the block This is an optimization. Example: block0: start=aaa, end=aaA block1: start=bbb, end=bbB block2: whatever Before the patch, advance_to("aAA") would skip to block0, and upper bound probe would skip to block1. This way, the reader would read the range of block0 from the data file. After the patch, "end" position is taken into account, so advance_to("aAA") will notice that block0 doesn't contain the position and will skip to block1. This is especially important for dense indexes, as it allows us to skip accessing data file if the search key is missing. It also solves the edge case problem related to the fact that single row reads are using a range with positions which are not equal to the key, but are before(key) and after(key) for the lower bound and upper bound respectively. Before the patch, advance_to(before("bbb")) would skip to block0, because the position is before block1's start. And upper bound probe for after("bbb") would point to block2. This way the read would scan block0 needlessly. After the patch, advance_to(before("bbb")) will skip to block1 because we notice based on "end" that block0 doesn't contain the position. This change also ensures that the start position of the upper bound entry of the after_key(pos), where pos is the last advance_to() position, is warm in cache. This is needed to optimize single-row reads with a dense index so that they always read exactly one promoted index block. For this to work, probe_upper_bound() for the after_key(row) always needs to find the upper bound block in cache.	2024-10-03 14:16:05 +02:00
Tomasz Grabiec	a29501ed67	sstables: Reduce amount of I/O for clustering-key-bounded reads from large partitions Single-row reads from a large partition issue 64 KiB reads to the data file, which is equal to the default span of the promoted index block in the data file. If users want to reduce selectivity of the index to speed up single-row reads, this won't be effective. The reason is that the reader uses promoted index to look up the start position in the data file of the read, but end position will in practice extend to the next partition, and amount of I/O will be determined by the underlying file input stream implementation and its read-ahead heuristics. By default, that results in at least 2 IOs 32KB each. There is already infrastructure to look up end position based on upper bound of the read, but it's not effective because it's a non-populating lookup and the upper bound cursor has its own private cached_promoted_index, which is cold when positions are computed. It's non-populating on purpose, to avoid extra index file IO to read upper bound. In case upper bound is far enough from the lower bound, this will only increase the cost of the read. The solution employed here is to warm up the lower bound cursor's cache before positions are computed, and use that cursor for non-populating lookup of the upper bound. We use the lower bound cursor and the slice's lower bound so that we read the same blocks as later lower-bound slicing would, so that we don't incur extra IO for cases where looking up upper bound is not worth it, that is when upper bound is far from the lower bound. If upper bound is near lower bound, then warming up using lower bound will populate cached_promoted_index with blocks which will allow us to locate the upper bound block accurately. This is especially important for single-row reads, where the bounds are around the same key. In this case we want to read the data file range which belongs to a single promoted index block. 
It doesn't matter that the upper bound is not exactly the same. They both will likely lie in the same block, and if not, binary search will bring adjacent blocks into cache. Even if upper bound is not near, the binary search will populate the cache with blocks which can be used to narrow down the data file range somewhat. Fixes #10030. The change was tested with perf-fast-forward. I populated the data set with `column_index_size_in_kb` set to 1 scylla perf-fast-forward --populate --run-tests=large-partition-slicing --column-index-size-in-kb=1 Test run: build/release/scylla perf-fast-forward --run-tests=large-partition-select-few-rows -c1 --keep-cache-across-test-cases --test-case-duration=0 This test reads two rows from the middle of a large partition (1M rows), of subsequent keys. The first read will miss in the index file page cache, the second read will hit. Notice that before the change, the second read issued 2 aio requests worth of 64KiB in total. After the change, the second read issued 1 aio worth of 2 KiB. That's because promoted index block is larger than 1 KiB. I verified using logging that the data file range matches a single promoted index block. Also, the first read which misses in cache is still faster after the change. 
Before: running: large-partition-select-few-rows on dataset large-part-ds1 Testing selecting few rows from a large partition: stride rows time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk allocs tasks insns/f cpu 500000 1 0.009802 1 1 102 0 102 102 21.0 21 196 2 1 0 1 1 0 0 0 568 269 4716050 53.4% 500001 1 0.000321 1 1 3113 0 3113 3113 2.0 2 64 1 0 1 0 0 0 0 0 116 26 555110 45.0% After: running: large-partition-select-few-rows on dataset large-part-ds1 Testing selecting few rows from a large partition: stride rows time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk allocs tasks insns/f cpu 500000 1 0.009609 1 1 104 0 104 104 20.0 20 137 2 1 0 1 1 0 0 0 561 268 4633407 43.1% 500001 1 0.000217 1 1 4602 0 4602 4602 1.0 1 2 1 0 1 0 0 0 0 0 110 26 313882 64.1% (cherry picked from commit dfb339376aff1ed961b26c4759b1604f7df35e54)	2024-10-01 18:40:34 +02:00
Michał Chojnowski	c77d00fd8d	index_reader: remove a piece of misguided code involved in single-partition reads This patch removes a piece of code which, according to the comment, allows for forwarding the index reader even if it was created as a single-partition reader. For single-partition reads, the input_stream used by the reader is limited to the single index page containing the partition, since reading the index file past that point would be a waste. Because of this limit, such an index reader can't be forwarded/advanced. The dubious piece of code gets around that by unsetting the stream and ensuring it will be re-created, this time without the limit, if the index is advanced. But there is no use for this. The idea of a "single-partition reader" exists as an optimization. It's illegal to forward single-partition readers, and it doesn't make sense to attempt that. (If there's a need for forwarding, just don't create a single-partition reader). I suspect this piece of code was written due to a misunderstanding. Before the previous patch in this series, when the searched partition key was the first key in its page, the index reader would scan the preceding page first, realize it made a mistake, and advance to the next, correct page. I suspect this piece of code was written to make this work. But this is, in fact, undesirable. The fact that the index reader was working like this was a performance bug. In the single-partition case there's never an inherent reason to start with the wrong page. The index logic can be corrected to always start with the right page, and that's what the previous patch in this series does. And with that, there is no need to support advancing anymore, and the dubious piece of code can be erased. We also add an assert to emphasize that advancing a single-partition reader is illegal.	2024-09-30 23:43:36 +02:00
Michał Chojnowski	bc30509523	index_reader: in single-partition reads, don't read more than one page When looking for a partition key in the index, we scan the index from the first index page which can possibly contain the key. In a single-partition read, there is never a reason to read beyond that page. After the previous patch in this series, it's guaranteed that the first key in the next page is strictly greater than the searched key. So if the searched key is greater than the last key in the first page, then it is neither in the first nor the second page -- it must be absent from the sstable. But with the current logic, we read the second index page anyway, and the realization that the key is absent happens higher in the call chain. This patch optimizes that inefficiency by immediately returning EOF if a single-partition read doesn't find the key in the first page. Returning "end of file" even though we didn't actually go beyond the end of file is hacky, but I don't see any other non-invasive way of communicating to the caller that the partition is absent. Some caller of the index could possibly assume that returning EOF proves that the searched key is greater than all keys in the sstable. I don't think any such caller exists today, but it's a possible place for confusion. Together with the previous patch in this series, this patch guarantees that a single-partition read only accesses a single index page. This fixes a weird secondary performance bug. Due to some misunderstanding in the logic, when during a single-partition read we scan two index pages, the second index page is scanned via an input_stream created without an upper I/O limit, which means that we additionally read a full read-ahead (currently: 64 kiB) past the second index page for no reason whatsoever. After this and the previous patch, a single-partition read always reads exactly one index page, so the above problem cannot occur.	2024-09-30 23:43:36 +02:00
Michał Chojnowski	6b8b7d962c	index_reader: fix unnecessary reads of preceding index pages When setting the index to position X, we first look for the first summary entry N such that N >= X. Then we load the index page preceding N and scan it for the first partition key P such that P >= X. If there is no such key in this page, then we scan the next page (starting with N) for such a key. (In this case it's always the first key). For example, assume we have: summary: A C E index: A B C D E F If we look up "B" in the index, then we first locate summary entry "C", then we scan the index for B, starting from "A". This is all fine. But when we look for "C" in the index, then we do exactly the same -- we scan the index for "C" starting from "A". This is wasteful, because we can start scanning from "C". To avoid this inefficiency, we should be looking for N > X, not N >= X. This patch fixes that. In addition, this fixes a second, weirder performance bug. Due to some misunderstanding in the logic, when during a single-partition read we scan two index pages, the second index page is scanned via an input_stream created without an upper I/O limit, which means that we additionally read a full read-ahead (currently: 64 KiB) past the second index page for no reason whatsoever. After this patch, a single-partition read always reads exactly one index page, so the above problem cannot occur.	2024-09-30 23:43:36 +02:00
Avi Kivity	aa1270a00c	treewide: change assert() to SCYLLA_ASSERT() assert() is traditionally disabled in release builds, but not in scylladb. This hasn't caused problems so far, but the latest abseil release includes a commit [1] that causes a 1000 insn/op regression when NDEBUG is not defined. Clearly, we must move towards a build system where NDEBUG is defined in release builds. But we can't just define it blindly without vetting all the assert() calls, as some were written with the expectation that they are enabled in release mode. To solve the conundrum, change all assert() calls to a new SCYLLA_ASSERT() macro in utils/assert.hh. This macro is always defined and is not conditional on NDEBUG, so we can later (after vetting Seastar) enable NDEBUG in release mode. [1] `66ef711d68` Closes scylladb/scylladb#20006	2024-08-05 08:23:35 +03:00
Lakshmi Narayanan Sreethar	64dadd5ec2	sstables/index_reader: stop consuming index when abort has been requested When an abort is requested, stop further reading of the index file and throw an exception from index_consume_entry_context::process_state(). Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2024-07-16 20:42:50 +05:30
Lakshmi Narayanan Sreethar	c2524337a2	sstables::index_consume_entry_context: store abort_source Store abort source inside sstables::index_consume_entry_context, so that the next patch can implement cancelling the index read when abort is requested. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2024-07-16 20:42:50 +05:30
Aleksandra Martyniuk	4530be9e5b	test: add test to check if reader is closed Add test to check if reader is closed in sstable::has_partition_key.	2024-02-22 14:53:14 +01:00
Michał Chojnowski	5a3e4a1cc0	utils: managed_bytes: optimize memory usage for small buffers managed_bytes is implemented as a chain of blob_storage objects. Each blob_storage contains 24 bytes of metadata. But in the most common case -- when there is only a single element in the chain -- 16 bytes of this metadata is trivial/unused. This is a regrettable waste because managed_bytes is used for every database cell in the memtables and cache. It means that every value of size >= 7 bytes (smaller ones fit in the inline storage of managed_bytes) receives 16 bytes of useless overhead. To correct that, this patch adds to managed_bytes an alternative storage layout -- used for buffers small enough to fit in one contiguous fragment -- which only stores the necessary minimum of metadata. (That is: a pointer to the parent, to facilitate moving the storage during memory defragmentation).	2024-02-09 20:56:20 +01:00
Kefu Chai	a6152cb87b	sstables: do not include unused headers these unused includes were identified by clangd. see https://clangd.llvm.org/guides/include-cleaner#unused-include-warning for more details on the "Unused include" warning. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#16666	2024-01-09 11:45:44 +02:00
Kefu Chai	893f319004	sstables: add formatter for index_consume_entry_context_state before this change, we rely on the default-generated fmt::formatter created from operator<<, but fmt v10 dropped the default-generated formatter. in this change, in order to enable the code in the header to access the formatter without being moved down after the full specialization's definition, we * move the enum definition out of the class and before the class, * rename the enum's name from state to index_consume_entry_context_state * define a formatter for index_consume_entry_context_state * remove its operator<<(). as fmt v10 is able to use `format_as()` as a fallback, the formatter full specialization is guarded with `#if FMT_VERSION < 10'00'00`. we will remove it after we start build with fmt v10. Refs #13245 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#16204	2023-12-08 12:45:38 +02:00
Michał Chojnowski	f00bed9429	sstables: partition_index_cache: deglobalize stats Move partition_index_cache stats from a thread_local variable to cache_tracker. After the change, partition_index_cache receives a reference to the stats via constructor, instead of referencing a global. This is needed so that cache_tracker can know the memory usage of index caches (for cache eviction purposes) without relying on globals. But it also makes sense even without that motive.	2023-09-01 22:34:41 +02:00
Michał Chojnowski	c7d9d35030	utils: cached_file: deglobalize cached_file metrics Move cached_file metrics from a thread_local variable to cache_tracker. This is needed so that cache_tracker can know the memory usage of index caches (for purposes of cache eviction) without relying on globals. But it also makes sense even without that motive.	2023-09-01 22:34:41 +02:00
Avi Kivity	0cabf4eeb9	build: disable implicit fallthrough Prevent switch case statements from falling through without annotation ([[fallthrough]]) proving that this was intended. Existing intended cases were annotated. Closes #14607	2023-07-10 19:36:06 +02:00
Pavel Emelyanov	66e43912d6	code: Switch to seastar API level 7 In that level no io_priority_class-es exist. Instead, all the IO happens in the context of current sched-group. File API no longer accepts prio class argument (and makes io_intent arg mandatory to impls). So the change consists of - removing all usage of io_priority_class - patching file_impl's inheritors to the updated API - priority manager goes away altogether - IO bandwidth update is performed on respective sched group - tune-up scylla-gdb.py io_queues command The first change is huge and was made semi-automatically by: - grep io_priority_class \| default_priority_class - remove all calls, found methods' args and class' fields Patching file_impl-s is smaller, but also mechanical: - replace io_priority_class& argument with io_intent* one - pass intent to lower file (if applicable) Dropping the priority manager is: - git-rm .cc and .hh - sed out all the #include-s - fix configure.py and cmakefile The scylla-gdb.py update is a bit hairy -- it needs to use task queues list for IO classes names and shares, but to detect whether it should, it checks if the "commitlog" group is present. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #13963	2023-06-06 13:29:16 +03:00
Pavel Emelyanov	2bb024c948	index_reader: Introduce and use default arguments to constructor Most creators of index_reader construct it with default prio class, null trace pointer and use_caching::yes. Assigning implicit defaults to constructor arguments keeps the code shorter and easier to read. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-05-23 11:29:04 +03:00
Pavel Emelyanov	3fd5d3cc2b	index_reader: Use _pc field in get_file_input_stream_options() directly No need to pass this-> field into this-> call Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-05-23 11:18:14 +03:00
Pavel Emelyanov	21d24e8ea3	index_reader: Move index_reader::get_file_input_stream_options to private: block A "while at it" cleanup. When patching the method (next patch) it turned out that there are no callers other than the local class, so it _is_ private. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2023-05-23 11:18:14 +03:00
Kefu Chai	0cb842797a	treewide: do not define/capture unused variables these warnings are found by Clang-17 after removing `-Wno-unused-lambda-capture` and '-Wno-unused-variable' from the list of disabled warnings in `configure.py`. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-02-15 22:57:18 +02:00
Benny Halevy	63c2cdafe8	sstables: index_reader: close(index_bound&) reset current_list When closing _lower_bound and *_upper_bound in the final close() call, they are currently left with an engaged current_list member. If the index_reader uses a _local_index_cache, it is evicted with evict_gently which will, rightfully, see the respective pages as referenced, and they won't be evicted gently (only later when the index_reader is destroyed). Reset index_bound.current_list on close(index_bound&) to free up the reference. Ref #12271 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #12370	2023-01-02 16:42:33 +01:00
Michał Chojnowski	d9269abf5b	sstables: index_reader: always evict the local cache gently Due to an oversight, the local index cache isn't evicted gently when _upper_bound existed. This is a source of reactor stalls. Fix that. Fixes #12271 Closes #12364	2022-12-20 18:23:27 +02:00
Pavel Emelyanov	5f579eb405	sstable: Introduce index_filename() Currently the sstable::filename(Index) is used in several places that get the filename as a printable or throwable string and don't treat it as a real location of any file. For those, add the index_filename() helper symmetrical to toc_filename() and (in some sense) the get_filename() one. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-12-15 10:14:49 +03:00
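
The summary lookup change described in commit 6b8b7d962c (searching for N > X instead of N >= X) can be sketched roughly as below. This is a simplified model and not ScyllaDB's index_reader code: the function names `start_page_before_fix` and `start_page_after_fix`, and the representation of the summary as a sorted vector holding the first key of each index page, are assumptions made purely for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model: summary[i] is the first key of index page i,
// and the vector is sorted in ascending order.

// Behaviour described as "before": find the first summary entry N >= key,
// then start scanning from the page preceding N.
std::size_t start_page_before_fix(const std::vector<std::string>& summary,
                                  const std::string& key) {
    assert(!summary.empty());
    // std::lower_bound returns the first element that is >= key.
    auto it = std::lower_bound(summary.begin(), summary.end(), key);
    std::size_t n = static_cast<std::size_t>(it - summary.begin());
    return n == 0 ? 0 : n - 1;
}

// Behaviour described as "after": find the first summary entry N > key,
// then start scanning from the page preceding N.
std::size_t start_page_after_fix(const std::vector<std::string>& summary,
                                 const std::string& key) {
    assert(!summary.empty());
    // std::upper_bound returns the first element that is > key.
    auto it = std::upper_bound(summary.begin(), summary.end(), key);
    std::size_t n = static_cast<std::size_t>(it - summary.begin());
    return n == 0 ? 0 : n - 1;
}
```

With the commit's example (summary `A C E`, index `A B C D E F`, so pages `{A,B}`, `{C,D}`, `{E,F}` in this model), both functions start at page 0 for key "B". For key "C", the first function starts at page 0 and has to scan a page that cannot contain the key, while the second starts directly at page 1.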


178 Commits