assert() is traditionally disabled in release builds, but not in
scylladb. This hasn't caused problems so far, but the latest abseil
release includes a commit [1] that causes a 1000 insn/op regression when
NDEBUG is not defined.
Clearly, we must move towards a build system where NDEBUG is defined in
release builds. But we can't just define it blindly without vetting
all the assert() calls, as some were written with the expectation that
they are enabled in release mode.
To solve the conundrum, change all assert() calls to a new SCYLLA_ASSERT()
macro in utils/assert.hh. This macro is always defined and is not conditional
on NDEBUG, so we can later (after vetting Seastar) enable NDEBUG in release
mode.
[1] 66ef711d68Closesscylladb/scylladb#20006
When an abort is requested, stop further reading of the index file and
throw and exception from index_consume_entry_context::process_state().
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Store abort source inside sstables::index_consume_entry_context, so that
the next patch can implement cancelling the index read when abort is
requested.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
managed_bytes is implemented as chain of blob_storage objects.
Each blob_storage contains 24 bytes of metadata. But in the most
common case -- when there is only a single element in the chain --
16 bytes of this metadata is trivial/unused.
This is regrettable waste because managed_bytes is used for every
database cell in the memtables and cache. It means that every value
of size >= 7 bytes (smaller ones fit in the inline storage of
managed_bytes) receives 16 bytes of useless overhead.
To correct that, this patch adds to managed_bytes an alternative storage
layout -- used for buffers small enough to fit in one contiguous
fragment -- which only stores the necessary minimum of metadata.
(That is: a pointer to the parent, to facilitate moving the storage during
memory defragmentation).
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, in order to enable the code in the header to
access the formatter without being moved down after the full specialization's
definition, we
* move the enum definition out of the class and before the
class,
* rename the enum's name from state to index_consume_entry_context_state
* define a formatter for index_consume_entry_context_state
* remove its operator<<().
as fmt v10 is able to use `format_as()` as a fallback, the formatter
full specialization is guarded with `#if FMT_VERSION < 10'00'00`. we
will remove it after we start build with fmt v10.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#16204
Move partition_index_cache stats from a thread_local variable
to cache_tracker. After the change, partition_index_cache
receives a reference to the stats via constructor, instead of
referencing a global.
This is needed so that cache_tracker can know the memory usage
of index caches (for cache eviction purposes) without relying on
globals.
But it also makes sense even without that motive.
Move cached_file metrics from a thread_local variable
to cache_tracker.
This is needed so that cache_tracker can know
the memory usage of index caches (for purposes
of cache eviction) without relying on globals.
But it also makes sense even without that motive.
Prevent switch case statements from falling through without annotation
([[fallthrough]]) proving that this was intended.
Existing intended cases were annotated.
Closes#14607
In that level no io_priority_class-es exist. Instead, all the IO happens
in the context of current sched-group. File API no longer accepts prio
class argument (and makes io_intent arg mandatory to impls).
So the change consists of
- removing all usage of io_priority_class
- patching file_impl's inheritants to updated API
- priority manager goes away altogether
- IO bandwidth update is performed on respective sched group
- tune-up scylla-gdb.py io_queues command
The first change is huge and was made semi-autimatically by:
- grep io_priority_class | default_priority_class
- remove all calls, found methods' args and class' fields
Patching file_impl-s is smaller, but also mechanical:
- replace io_priority_class& argument with io_intent* one
- pass intent to lower file (if applicatble)
Dropping the priority manager is:
- git-rm .cc and .hh
- sed out all the #include-s
- fix configure.py and cmakefile
The scylla-gdb.py update is a bit hairry -- it needs to use task queues
list for IO classes names and shares, but to detect it should it checks
for the "commitlog" group is present.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#13963
Most of creators of index_reader construct it with default prio class,
null trace pointer and use_caching::yes. Assigning implicit defaults to
constructor arguments keeps the code shorter and easier to read.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
A "while at it" cleanup. When pathing the method (next patch) it turned
out that there are no other callers other than local class, so it _is_
private.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
these warnings are found by Clang-17 after removing
`-Wno-unused-lambda-capture` and '-Wno-unused-variable' from
the list of disabled warnings in `configure.py`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
When closing _lower_bound and *_upper_bound
in the final close() call, they are currently left with
an engaged current_list member.
If the index_reader uses a _local_index_cache,
it is evicted with evict_gently which will, rightfully,
see the respective pages as referenced, and they won't be
evicted gently (only later when the index_reader is destroyed).
Reset index_bound.current_list on close(index_bound&)
to free up the reference.
Ref #12271
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#12370
Due to an oversight, the local index cache isn't evicted gently
when _upper_bound existed. This is a source of reactor stalls.
Fix that.
Fixes#12271Closes#12364
Currently the sstable::filename(Index) is used in several places that
get the filename as a printable or throwable string and don't treat is
as a real location of any file.
For those, add the index_filename() helper symmetrical to toc_filename()
and (in some sense) the get_filename() one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
sstable::get_filename() constructs the filename from components, which
takes some work. It happens to be called on every
index_reader::index_reader() call even though it's only used for TRACE
logs. That's 1700 instructions (~1% of a full query) wasted on every
SSTable read. Fix that.
Closes#11485
Commit e8f3d7dd13 added eof() checks to public partition-level
advance_to() methods, to ensure we do not attempt to re-read the last
page of the index when at eof(). It was noted however that this check
would be safer in advance_to(index_bound&, dht::ring_position_view)
because that is the method that all these higher-level methods end up
calling. Placing the check there would guarantee safety for all such
operations. This path does exactly that: it pushes down the check to
said method. One change needed for this to work is to check eof on the
bound that is currently advanced, instead of unconditionally checking
the lower bound.
Closes#10531
Attempting to call advance_to() on the index, after it is positioned at
EOF, can result in an assert failure, because the operation results in
an attempt to move backwards in the index-file (to read the last index
page, which was already read). This only happens if the index cache
entry belonging to the last index page is evicted, otherwise the advance
operation just looks-up said entry and returns it.
To prevent this, we add an early return conditioned on eof() to all the
partition-level advance-to methods.
A regression unit test reproducing the above described crash is also
added.
Due to fd62fba985
scoped enums are not automatically converted to integers anymore,
this is the intended behavior, according to the fmtlib devs.
A bit nicer solution would be to use `std::to_underlying`
instead of a direct `static_cast`, but it's not available until
C++23 and some compilers are still missing the support for it.
Tests: unit(dev)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
All entries from a single partition can be found in a
single summary page.
Because of that, in cases when we know we want to read
only one partition, we can limit the underyling file
input_stream to the range of the page.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
Currently, when advancing one of index_reader's bounds,
we're creating a new index_consume_entry_context with a new
underlying file input_stream for each new page.
For either bound, the streams can be reused, because
the indexes of pages that we are reading are never
decreasing.
This patch adds a index_consume_entry_context to each of
index_reader's bounds, so that for each new page, the same
file input_stream is used.
As a result, when reading consecutive pages, the reads that
follow the first one can be satisfied by the input_stream's
read aheads, decreasing the number of blocking reads and
increasing the throughput of the index_reader.
Fixes#2388
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
The _file_name and _index_file fields in index_consume_entry_context
are no longer used anywhere in the class (_file_name isn't even set,
and _index_file was previously used when creating a promoted_index,
which doesn't store the file object anymore)
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.
Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.
The changes we applied mechanically with a script, except to
licenses/README.md.
Closes#9937
Errors during parsing are usually reported via malformed sstable
exception to signify their gravity of potentially being caused by
corrupt sstables. This patch converts the exception thrown in
`index_consume_entry_context::verify_end_state()`.
While at it the error message is improved as well. It currently suggests
that parsing was ended prematurely because data ran out, while in fact
the condition under which this error is thrown is the opposite: parsing
ended but there is unconsumed data left. The current state is also added
to the error message.
Add the summary index and the bound's address to the error message, so
it can be correlated with other trace level logging when investigating a
problem.
Refs: #9446
Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20211202124955.542293-2-bdenes@scylladb.com>
In the sstable reader, we iterate over clustering ranges using the
index_reader, which normally only accepts advancing to increasing
positions. In this patch we add methods for advancing the index
reader in reverse.
To simplify our job we restrict our attention to a single implementation
of the promoted index block cursor: `bsearch_clustered_cursor`. The
`index_reader` methods for advancing in reverse will thus assume that
this implementation is used. The assumption is correct given that we're
working only with sstables of versions >= mc, which is indeed the
intended use case. We add some documentation in appropriate places to
make this obvious.
We extend `bsearch_clustered_cursor` with two methods:
`advance_past(pos)`, which advances the cursor to the first block after
`pos` (or to the end if there is no such block), and
`last_block_offset()`, which returns the data file offset of the first
row from the last promoted index block.
To efficiently find the position in the data file of the last row
of the partition (which we need when performing a reversed query)
the sstable reader may need to read the span of the entire last promoted
index block in the data file. To learn where the block starts it can use
`index_reader::last_block_offset()`, which is implemented in terms of
`bsearch_clustered_cursor::last_block_offset()`.
When performing a single partition read in forward order, the reader
asks the index to position its lower bound at the start of the partition
and its upper bound after the end of the slice. It starts by reading the
first range. After exhausting a range it jumps to the next one by asking
the index to advance the lower bound.
For reverse single partition reads we'll take a similar approach: the
initial bound positions are as in the forward case. However, we start
with the last range and after exhausting a range we want to jump to a
previous one; we will do it by advancing the upper bound in reverse
(i.e. moving it closer to the beginning of the partition). For this
we introduce the `index_reader::advance_reverse` function.
This was a global variable that was potentially modified from a
performance benchmark. It would modify the behavior of `index_reader`
in certain scenarios.
Remove the variable so we can specify the behavior of `index_reader`
functions without relying on anything other than what's passed into the
constructor and the function parameters.
Reads which bypass cache will use a private temporary instance of cached_file
which dies together with the index cursor.
The cursor still needs a cached_file with cachig layer. Binary searching needs
caching for performance, some of the pages will be reused. Another reason to still
use cached_file is to work with a common interface, and reusing it requires minimal changes.
Index cursor for reads which bypass cache will use a private temporary
instance of the partition index cache.
Promoted index scanner (ka/la format) will not go through the page cache.
In preparation for caching index objects, manage them under LSA.
Implementation notes:
key_view was changed to be a view on managed_bytes_view instead of
bytes, so it now can be fragmented. Old users of key_view now have to
linearize it. Actual linearization should be rare since partition
keys are typically small.
Index parser is now not constructing the index_entry directly, but
produces value objects which live in the standard allocator space:
class parsed_promoted_index_entry;
calss parsed_partition_index_entry;
This change was needed to support consumers which don't populate the
partition index cache and don't use LSA,
e.g. sstable::generate_summary(). It's now consumer's responsibility
to allocate index_entry out of parsed_partition_index_entry.
index_entry will be an LSA-managed object. Those have to be accessed
with care, with the LSA region locked.
This patch hides most of direct index_entry accesses inside the
index_reader so that users are safe.
The file object which is currently stored there has per-request
tracing wrappers (permit, trace_state) attached to it. It doesn't make
sense once the entry is cached and shared. Annotate when the cursor is
created instead.
Index reads and promoted index reads are both using the same
cached_file now, so there's no need to pass the buffers between the
index parser and promoted index reader.
Makes the promoted_index structure easier to move to LSA.
After this patch, there is a singe index file page cache per
sstable, shared by index readers. The cache survives reads,
which reduces amount of I/O on subsequent reads.
As part of this, cached_file needed to be adjusted in the following ways.
The page cache may occupy a significant portion of memory. Keeping the
pages in the standard allocator could cause memory fragmentation
problems. To avoid them, the cache_file is changed to keep buffers in LSA
using lsa_buffer allocation method.
When a page is needed by the seastar I/O layer, it needs to be copied
to a temporary_buffer which is stable, so must be allocated in the
standard allocator space. We copy the page on-demand. Concurrent
requests for the same page will share the temporary_buffer. When page
is not used, it only lives in the LSA space.
In the subsequent patches cached_file::stream will be adjusted to also support
access via cached_page::ptr_type directly, to avoid materializating a
temporary_buffer.
While a page is used, it is not linked in the LRU so that it is not
freed. This ensures that the storage which is actively consumed
remains stable, either via temporary_buffer (kept alive by its
deleter), or by cached_page::ptr_type directly.
We want buffers to be accounted only when they are used outside
cached_file. Cached pages should not be accounted because they will
stay around for longer than the read after subsequent commits.
We'd like that to simplify the soon-to-be-introduced
sstable_mutation_reader::close error handling path.
close_index_list can be marked noexcept since parallel_for_each is,
with that index_reader::close can be marked noexcept too.
Note that since reader close can not fail
both lower and upper bounds are closed (since
closing lower_bound cannot fail).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>