Commit Graph

39 Commits

Author SHA1 Message Date
Tomasz Grabiec
082342ecad Attach names to allocating sections for better debuggability
Large reserves in allocating_section can cause stalls. We already log
reserve increase, but we don't know which table it belongs to:

  lsa - LSA allocation failure, increasing reserve in section 0x600009f94590 to 128 segments;

Allocating sections used for updating row cache on memtable flush are
notoriously problematic. Each table has its own row_cache, so its own
allocating_section(s). If we attached table name to those sections, we
could identify which table is causing problems. In some issues we
suspected system.raft, but we can't be sure.

This patch allows naming allocating_sections for the purpose of
identifying them in such log messages. I use abstract_formatter for
this purpose to avoid the cost of formatting strings on the hot path
(e.g. index_reader). And also to avoid duplicating strings which are
already stored elsewhere.

Fixes #25799

Closes scylladb/scylladb#27470
2025-12-07 14:14:25 +02:00
Michał Chojnowski
1f85069389 sstables/trie: support reader_permit and trace_state properly
Before this patch, `reader_permit` taken by `bti_index_reader`.
wasn't actually being passed down to disk reads. In this patch,
we fix this FIXME by propagating the permit down to the I/O
operations on the `cached_file`.

Also, it didn't take `trace_state_ptr` at all.
In this patch, we add a `trace_state_ptr` argument and propagate
it down to disk reads.

(We combine the two changes because the permit and the trace state
are passed together everywhere anyway).
2025-09-17 12:22:40 +02:00
Michał Chojnowski
6fe7dbaedc utils/cached_file: add get_shared_page()
BTI index is page-aware. It's designed to be read in page units.

Thus, we want a `cached_file` accessor which explicitly requests
a whole page, preferably without copying it.

`cached_file` already works in terms of reference-counted pages,
underneath. This commit only adds some accessors which lets
us request those reference-counting page pointers more directly.
2025-08-05 00:56:50 +02:00
Michał Chojnowski
58d768e383 utils/cached_file: replace a std::pair with a named struct
Cosmetic change. For clarity.
2025-08-05 00:55:32 +02:00
Kefu Chai
7215d4bfe9 utils: do not include unused headers
these unused includes were identifier by clang-include-cleaner. after
auditing these source files, all of the reports have been confirmed.

please note, because quite a few source files relied on
`utils/to_string.hh` to pull in the specialization of
`fmt::formatter<std::optional<T>>`, after removing
`#include <fmt/std.h>` from `utils/to_string.hh`, we have to
include `fmt/std.h` directly.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2025-01-14 07:56:39 -05:00
Avi Kivity
f3eade2f62 treewide: relicense to ScyllaDB-Source-Available-1.0
Drop the AGPL license in favor of a source-available license.
See the blog post [1] for details.

[1] https://www.scylladb.com/2024/12/18/why-were-moving-to-a-source-available-license/
2024-12-18 17:45:13 +02:00
Tomasz Grabiec
0f2101b055 utils: cached_file: Mark permit as awaiting on page miss
Otherwise, the read will be considered as on-cpu during promoted index
search, which will severely underutlize the disk because by default
on-cpu concurrency is 1.

I verified this patch on the worst case scenario, where the workload
reads missing rows from a large partition. So partition index is
cached (no IO) and there is no data file IO. But there is IO during
promoted index search (via cached_file). Before the patch this
workload was doing 4k req/s, after the patch it does 30k req/s.

The problem is much less pronounced if there is data file or index
file IO involved because that IO will signal read concurrency
semaphore to invite more concurrency.
2024-10-28 19:54:58 +01:00
Tomasz Grabiec
868f5b59c4 utils: cached_file: Push resource_unit management down to cached_file
It saves us permit operations on the hot path when we hit in cache.

Also, it will lay the ground for marking the permit as awaiting later.
2024-10-28 19:49:58 +01:00
Michał Chojnowski
882a3c60e4 utils/cached_file: reduce latency (and increase overhead) of partially-cached reads
Currently, `cached_file::stream` (currently used only by index_reader,
to read index pages), works as follows.

Assume that the caller requested a read of the range [pos, pos + size).
Then:

- If the first page of the requested range is uncached,
  the entire [pos, pos + size) range is read from disk (even if some
  later pieces of it are cached), the resulting pages are added to the cache,
  and the read completes (most likely) from the cached pages.
- If the first page of the read is cached, then the rest of the read
  is handled page-by-page, in a sequential loop, serving each page
  either from cache (if present) or from disk.

For example, assume that pages 0, 1, 2, 3, 4 are requested.

If exactly pages 1, 2 are cached, then `stream` will read the entire [0, 4] range
from disk and insert the missing 0, 3, 4, and then it will continue serving the
read from cache.

If exactly pages 0 and 3 are cached, then it will serve 0 from cache,
then it will read 1 from disk and insert it into cache,
then it will read 2 from disk and insert it into cache,
then it will serve 3 from cache,
then it will read 4 from disk and insert it into cache.

If exactly the first page is cached, a 128 kiB read turns
into 31 I/O sequential read ops.

This is weird, and doesn't look intended. In one case, we are reading even pages
we already have, just to avoid fragmenting the read, and in the other case
we are reading pages one-by-one (sequentially!) even if they are neighbours.

I'm not sure if cached_file should minimize IOPS or byte throughput,
but the current state is surely suboptimal. Even if its read strategy
is somehow optimal, it should still at least coalesce contiguous reads
and perform the non-contiguous reads in parallel.

This patch leans into minimizing IOPS. After the patch, we serve
as many front pages from the cache as we can, but when we see
an uncached page, we read the entire remainder of the read from disk.

As if we trimmed the read request by the longest cached prefix,
and then performed the rest using the logic from before the patch.

For example, if exactly pages 0 and 3 are cached,
then we serve 0 from cache,
then we read [1, 4] from disk and insert everything into cache.

For partially-cached files, this will result in more bytes read
from disk, but less IOPS. This might be a bad thing. But if so,
then we should lean the other way in a more explicit and efficient
way than we currently do.

Closes scylladb/scylladb#20935
2024-10-04 17:39:38 +02:00
Tomasz Grabiec
c09fa0cb98 test, sstables: Verify parsing stability when buffers cross page boundary 2024-09-27 01:25:15 +02:00
Tomasz Grabiec
7670ee701a sstables: bsearch_clustered_cursor: Switch parsers to work with page_view
This fixes a use-after-free bug when parsing clustering key across
pages.

Clustering key index lookup is based on the index file page cache. We
do a binary search within the index, which involves parsing index
blocks touched by the algorithm. Index file pages are 4 KB chunks
which are stored in LSA.

To parse the first key of the block, we reuse clustering_parser, which
is also used when parsing the data file. The parser is stateful and
accepts consecutive chunks as temporary_buffers. The parser is
supposed to keep its state across chunks.

In b1b5bda, the parser was changed to keep shared fragments of the
buffer passed to the parser in its internal state (across pages)
rather than copy the fragments into a new buffer. This is problematic
when buffers come from page cache because LSA buffers may be moved
around or evicted. So the temporary_buffer which is a view on the LSA
buffer is valid only around the duration of a single consume() call to
the parser.

If the blob which is parsed (e.g. variable-length clustering key
component) spans pages, the fragments stored in the parser may be
invalidated before the component is fully parsed. As a result, the
parsed clustering key may have incorrect component values. This never
causes parsing errors because the "length" field is always parsed from
the current buffer, which is valid, and component parsing will end at
the right place in the next (valid) buffer.

The problematic path for clustering_key parsing is the one which calls
primitive_consumer::read_bytes(), which is called for example for text
components. Fixed-size components are not parsed like this, they store
the intermediate state by copying data.

This may cause incorrect clustering keys to be parsed when doing
binary search in the index, diverting the search to an incorrect
block.

The solution is to use page_view instead of temporary_buffer, which
can be safely shared via share() and stored across allocating
section. The page_view maintains its hold to the LSA buffer even
across allocating sections.

Fixes #20766
2024-09-27 01:25:15 +02:00
Tomasz Grabiec
c15145b71d cached_file: Adapt page_view to ContiguousSharedBuffer 2024-09-27 01:25:15 +02:00
Tomasz Grabiec
29498a97ae cached_file: Change meaning of page_view::_size to be relative to _offset rather than page start
Will be easier to implement ContiguousSharedBuffer API as the buffer
size will be equal to _size.
2024-09-27 01:25:15 +02:00
Kefu Chai
151074240c utils: cached_file: use structured binding when appropriate
for better readability.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#20184
2024-08-19 14:01:42 +03:00
Avi Kivity
aa1270a00c treewide: change assert() to SCYLLA_ASSERT()
assert() is traditionally disabled in release builds, but not in
scylladb. This hasn't caused problems so far, but the latest abseil
release includes a commit [1] that causes a 1000 insn/op regression when
NDEBUG is not defined.

Clearly, we must move towards a build system where NDEBUG is defined in
release builds. But we can't just define it blindly without vetting
all the assert() calls, as some were written with the expectation that
they are enabled in release mode.

To solve the conundrum, change all assert() calls to a new SCYLLA_ASSERT()
macro in utils/assert.hh. This macro is always defined and is not conditional
on NDEBUG, so we can later (after vetting Seastar) enable NDEBUG in release
mode.

[1] 66ef711d68

Closes scylladb/scylladb#20006
2024-08-05 08:23:35 +03:00
Kefu Chai
a1dcddd300 utils: do not include unused headers
these unused includes were identified by clangd. see
https://clangd.llvm.org/guides/include-cleaner#unused-include-warning
for more details on the "Unused include" warning.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16833
2024-01-18 12:50:06 +02:00
Michał Chojnowski
c7d9d35030 utils: cached_file: deglobalize cached_file metrics
Move cached_file metrics from a thread_local variable
to cache_tracker.

This is needed so that cache_tracker can know
the memory usage of index caches (for purposes
of cache eviction) without relying on globals.

But it also makes sense even without that motive.
2023-09-01 22:34:41 +02:00
Michał Chojnowski
50b429f255 config: add index_cache_fraction
Adds a configurable upper limit to memory usage by index caches.
See the source code comments added in this patch for more details.

This patch shouldn't change visible behaviour, because the limit is set to 1.0
by default, so it is never triggerred. We will change the default in a future
patch.
2023-09-01 22:34:23 +02:00
Raphael S. Carvalho
050ce9ef1d cached_file: Evict unused pages that aren't linked to LRU yet
It was found that cached_file dtor can hit the following assert
after OOM

cached_file_test: utils/cached_file.hh:379: cached_file::~cached_file(): Assertion _cache.empty()' failed.`

cached_file's dtor iterates through all entries and evict those
that are linked to LRU, under the assumption that all unused
entries were linked to LRU.

That's partially correct. get_page_ptr() may fetch more than 1
page due to read ahead, but it will only call cached_page::share()
on the first page, the one that will be consumed now.

share() is responsible for automatically placing the page into
LRU once refcount drops to zero.

If the read is aborted midway, before cached_file has a chance
to hit the 2nd page (read ahead) in cache, it will remain there
with refcount 0 and unlinked to LRU, in hope that a subsequent
read will bring it out of that state.

Our main user of cached_file is per-sstable index caching.
If the scenario above happens, and the sstable and its associated
cached_file is destroyed, before the 2nd page is hit, cached_file
will not be able to clear all the cache because some of the
pages are unused and not linked.

A page read ahead will be linked into LRU so it doesn't sit in
memory indefinitely. Also allowing for cached_file dtor to
clear all cache if some of those pages brought in advance
aren't fetched later.

A reproducer was added.

Fixes #14814.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #14818
2023-07-27 00:01:46 +02:00
Pavel Emelyanov
66e43912d6 code: Switch to seastar API level 7
In that level no io_priority_class-es exist. Instead, all the IO happens
in the context of current sched-group. File API no longer accepts prio
class argument (and makes io_intent arg mandatory to impls).

So the change consists of
- removing all usage of io_priority_class
- patching file_impl's inheritants to updated API
- priority manager goes away altogether
- IO bandwidth update is performed on respective sched group
- tune-up scylla-gdb.py io_queues command

The first change is huge and was made semi-autimatically by:
- grep io_priority_class | default_priority_class
- remove all calls, found methods' args and class' fields

Patching file_impl-s is smaller, but also mechanical:
- replace io_priority_class& argument with io_intent* one
- pass intent to lower file (if applicatble)

Dropping the priority manager is:
- git-rm .cc and .hh
- sed out all the #include-s
- fix configure.py and cmakefile

The scylla-gdb.py update is a bit hairry -- it needs to use task queues
list for IO classes names and shares, but to detect it should it checks
for the "commitlog" group is present.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #13963
2023-06-06 13:29:16 +03:00
Michał Chojnowski
f340c9cca5 utils: lru: remove unlink_from_lru()
unlink_from_lru() allows for unlinking elements from cache without notifying
the cache. This messes up any potential cache bookkeeping.
Improved that by replacing all uses of unlink_from_lru() with calls to
lru::remove(), which does have access to cache's metadata.
2022-10-17 12:07:27 +02:00
Michał Chojnowski
d785364375 cache: make all cache unlinks explicit
Our LSA cache is implemented as an auto_unlink Boost intrusive list, meaning
that elements of the list unlink themselves from the list automatically on
destruction. Some parts of the code rely on that, and don't unlink them
manually.

However, this precludes accurate bookkeeping about the cache. Elements only have
access to themselves and their neighbours, not to any bookkeeping context.
Therefore, a destructor cannot update the relevant metadata.

In this patch, we fix this by adding explicit unlink calls to places where it
would be done by a destructor. In a following patch, we will add an assert to
the destructor to check that every element is unlinked before destruction.
2022-10-17 12:07:27 +02:00
Tomasz Grabiec
e68cf55514 utils: cached_file: Fix alloc-dealloc mismatch during eviction
on_evicted() is invoked in the LSA allocator context, set in the
reclaimer callback instaled by the cache_tracker. However,
cached_pages are allocated in the standard allocator context (note:
page content is allocated inside LSA via lsa_buffer). The LSA region
will happilly deallocate these, thinking that they these are large
objects which were delegated to the standard allocator. But the
_non_lsa_memory_in_use metric will underflow. When it underflows
enough, shard_segment_pool.total_memory() will become 0 and memory
reclamation will stop doing anything, leading to apparent OOM.

The fix is to switch to the standard allocator context inside
cached_page::on_evicted(). evict_range() was also given the same
treatment as a precaution, it currently is only invoked in the
standard allocator context.

Fixes #10056
2022-02-23 18:38:05 +01:00
Tomasz Grabiec
b734615f51 util: cached_file: Fix corruption after memory reclamation was triggered from population
If memory reclamation is triggered inside _cache.emplace(), the _cache
btree can get corrupted. Reclaimers erase from it, and emplace()
assumes that the tree is not modified during its execution. It first
locates the target node and then does memory allocation.

Fix by running emplace() under allocating section, which disables
memory reclamation.

The bug manifests with assert failures, e.g:

./utils/bptree.hh:1699: void bplus::node<unsigned long, cached_file::cached_page, cached_file::page_idx_less_comparator, 12, bplus::key_search::linear, bplus::with_debug::no>::refill(Less) [Key = unsigned long, T = cached_file::cached_page, Less = cached_file::page_idx_less_comparator, NodeSize = 12, Search = bplus::key_search::linear, Debug = bplus::with_debug::no]: Assertion `p._kids[i].n == this' failed.

Fixes #9915

Message-Id: <20220130175639.15258-1-tgrabiec@scylladb.com>
2022-01-30 19:57:35 +02:00
Avi Kivity
fcb8d040e8 treewide: use Software Package Data Exchange (SPDX) license identifiers
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.

Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.

The changes we applied mechanically with a script, except to
licenses/README.md.

Closes #9937
2022-01-18 12:15:18 +01:00
Avi Kivity
369afe3124 treewide: use coroutine::maybe_yield() instead of co_await make_ready_future()
The dedicated API shows the intent, and may be a tiny bit faster.

Closes #9382
2021-09-23 12:28:56 +02:00
Tomasz Grabiec
f553db69f7 cached_file: Issue single I/O for the whole read range on miss
Currently, reading a page range would issue I/O for each missing
page. This is inefficient, better to issue a single I/O for the whole
range and populate cache from that.

As an optimization, issue a single I/O if the first page is missing.

This is important for index reads which optimistically try to read
32KB of index file to read the partition index page.
2021-07-02 19:02:14 +02:00
Tomasz Grabiec
1f74863bf8 sstables, cached_file: Evict cache gently when sstable is destroyed
We must evict before the _cached_index_file associated with the
sstable goes away. Better to do it gently to avoid stalls.
2021-07-02 19:02:14 +02:00
Tomasz Grabiec
934824394a sstables, cached_file: Avoid copying buffers from cache when parsing promoted index 2021-07-02 19:02:14 +02:00
Tomasz Grabiec
7b6f18b4ed cached_file: Introduce get_page_units()
Will be needed later for reading a page view which cannot use
make_tracked_temporary_buffer(). Standardize on get_page_units(),
converting existing code to wrap the units in a deleter.
2021-07-02 19:02:14 +02:00
Tomasz Grabiec
484e06d69b cached_file: Always start at offset 0
All current uses start at offset 0, so simplify the code by assuming it.
2021-07-02 19:02:13 +02:00
Tomasz Grabiec
078a6e422b sstables: Cache all index file reads
After this patch, there is a singe index file page cache per
sstable, shared by index readers. The cache survives reads,
which reduces amount of I/O on subsequent reads.

As part of this, cached_file needed to be adjusted in the following ways.

The page cache may occupy a significant portion of memory. Keeping the
pages in the standard allocator could cause memory fragmentation
problems. To avoid them, the cache_file is changed to keep buffers in LSA
using lsa_buffer allocation method.

When a page is needed by the seastar I/O layer, it needs to be copied
to a temporary_buffer which is stable, so must be allocated in the
standard allocator space. We copy the page on-demand. Concurrent
requests for the same page will share the temporary_buffer. When page
is not used, it only lives in the LSA space.

In the subsequent patches cached_file::stream will be adjusted to also support
access via cached_page::ptr_type directly, to avoid materializating a
temporary_buffer.

While a page is used, it is not linked in the LRU so that it is not
freed. This ensures that the storage which is actively consumed
remains stable, either via temporary_buffer (kept alive by its
deleter), or by cached_page::ptr_type directly.
2021-07-02 19:02:13 +02:00
Tomasz Grabiec
019956739d cached_file: Switch to bplus::tree
In order to be able to move it to LSA later.
2021-07-02 10:25:58 +02:00
Tomasz Grabiec
8fbea0b5b7 utils: cached_file: Introduce file wrapper
It's an adpator between seastar::file and cached_file. It gives a
seastar::file which will serve reads using a given cached_file as a
read-through cache.
2021-07-02 10:25:58 +02:00
Tomasz Grabiec
8e2118069b sstables: cached_file: Account buffers returned by cached_file under read_permit
We want buffers to be accounted only when they are used outside
cached_file. Cached pages should not be accounted because they will
stay around for longer than the read after subsequent commits.
2021-07-02 10:25:58 +02:00
Avi Kivity
a55b434a2b treewide: extent copyright statements to present day 2021-06-06 19:18:49 +03:00
Tomasz Grabiec
17ee1a2eed utils: cached_file: Fix compilation error
Fix field initialization order problem.

In file included from ./sstables/mc/bsearch_clustered_cursor.hh:28,
                 from sstables/index_reader.hh:32,
                 from sstables/sstables.cc:49:
./utils/cached_file.hh: In constructor 'cached_file::stream::stream(cached_file&, const seastar::io_priority_class&, tracing::trace_state_ptr, cached_file::page_idx_type, cached_file::offset_type)':
./utils/cached_file.hh:119:34: error: 'cached_file::stream::_trace_state' will be initialized after [-Werror=reorder]
  119 |         tracing::trace_state_ptr _trace_state;
      |                                  ^~~~~~~~~~~~
./utils/cached_file.hh:117:23: error:   'cached_file::page_idx_type cached_file::stream::_page_idx' [-Werror=reorder]
  117 |         page_idx_type _page_idx;
      |                       ^~~~~~~~~
./utils/cached_file.hh:127:9: error:   when initialized here [-Werror=reorder]
  127 |         stream(cached_file& cf, const io_priority_class& pc, tracing::trace_state_ptr trace_state,
      |         ^~~~~~
Message-Id: <1592478082-22505-1-git-send-email-tgrabiec@scylladb.com>
2020-06-18 14:08:29 +03:00
Tomasz Grabiec
58532cdf11 cached_file, sstables: Add tracing to index binary search and page cache 2020-06-16 16:15:24 +02:00
Tomasz Grabiec
c95dd67d11 utils: Introduce cached_file
It is a read-through cache of a file.

Will be used to cache contents of the promoted index area from the
index file.

Currently, cached pages are evicted manually using the invalidate_*()
method family, or when the object is destroyed.

The cached_file represents a subset of the file. The reason for this
is to satisfy two requirements. One is that we have a page-aligned
caching, where pages are aligned relative to the start of the
underlying file. This matches requirements of the seastar I/O engine
on I/O requests.  Another requirement is to have an effective way to
populate the cache using an unaligned buffer which starts in the
middle of the file when we know that we won't need to access bytes
located before the buffer's position. See populate_front(). If we
couldn't assume that, we wouldn't be able to insert an unaligned
buffer into the cache.
2020-06-16 16:15:23 +02:00