Attempting to call advance_to() on the index, after it is positioned at
EOF, can result in an assert failure, because the operation results in
an attempt to move backwards in the index-file (to read the last index
page, which was already read). This only happens if the index cache
entry belonging to the last index page is evicted, otherwise the advance
operation just looks-up said entry and returns it.
To prevent this, we add an early return conditioned on eof() to all the
partition-level advance-to methods.
A regression unit test reproducing the above described crash is also
added.
Due to fd62fba985
scoped enums are not automatically converted to integers anymore,
this is the intended behavior, according to the fmtlib devs.
A bit nicer solution would be to use `std::to_underlying`
instead of a direct `static_cast`, but it's not available until
C++23 and some compilers are still missing the support for it.
Tests: unit(dev)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
All entries from a single partition can be found in a
single summary page.
Because of that, in cases when we know we want to read
only one partition, we can limit the underyling file
input_stream to the range of the page.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
Currently, when advancing one of index_reader's bounds,
we're creating a new index_consume_entry_context with a new
underlying file input_stream for each new page.
For either bound, the streams can be reused, because
the indexes of pages that we are reading are never
decreasing.
This patch adds a index_consume_entry_context to each of
index_reader's bounds, so that for each new page, the same
file input_stream is used.
As a result, when reading consecutive pages, the reads that
follow the first one can be satisfied by the input_stream's
read aheads, decreasing the number of blocking reads and
increasing the throughput of the index_reader.
Fixes#2388
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
The _file_name and _index_file fields in index_consume_entry_context
are no longer used anywhere in the class (_file_name isn't even set,
and _index_file was previously used when creating a promoted_index,
which doesn't store the file object anymore)
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.
Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.
The changes we applied mechanically with a script, except to
licenses/README.md.
Closes#9937
Errors during parsing are usually reported via malformed sstable
exception to signify their gravity of potentially being caused by
corrupt sstables. This patch converts the exception thrown in
`index_consume_entry_context::verify_end_state()`.
While at it the error message is improved as well. It currently suggests
that parsing was ended prematurely because data ran out, while in fact
the condition under which this error is thrown is the opposite: parsing
ended but there is unconsumed data left. The current state is also added
to the error message.
Add the summary index and the bound's address to the error message, so
it can be correlated with other trace level logging when investigating a
problem.
Refs: #9446
Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20211202124955.542293-2-bdenes@scylladb.com>
In the sstable reader, we iterate over clustering ranges using the
index_reader, which normally only accepts advancing to increasing
positions. In this patch we add methods for advancing the index
reader in reverse.
To simplify our job we restrict our attention to a single implementation
of the promoted index block cursor: `bsearch_clustered_cursor`. The
`index_reader` methods for advancing in reverse will thus assume that
this implementation is used. The assumption is correct given that we're
working only with sstables of versions >= mc, which is indeed the
intended use case. We add some documentation in appropriate places to
make this obvious.
We extend `bsearch_clustered_cursor` with two methods:
`advance_past(pos)`, which advances the cursor to the first block after
`pos` (or to the end if there is no such block), and
`last_block_offset()`, which returns the data file offset of the first
row from the last promoted index block.
To efficiently find the position in the data file of the last row
of the partition (which we need when performing a reversed query)
the sstable reader may need to read the span of the entire last promoted
index block in the data file. To learn where the block starts it can use
`index_reader::last_block_offset()`, which is implemented in terms of
`bsearch_clustered_cursor::last_block_offset()`.
When performing a single partition read in forward order, the reader
asks the index to position its lower bound at the start of the partition
and its upper bound after the end of the slice. It starts by reading the
first range. After exhausting a range it jumps to the next one by asking
the index to advance the lower bound.
For reverse single partition reads we'll take a similar approach: the
initial bound positions are as in the forward case. However, we start
with the last range and after exhausting a range we want to jump to a
previous one; we will do it by advancing the upper bound in reverse
(i.e. moving it closer to the beginning of the partition). For this
we introduce the `index_reader::advance_reverse` function.
This was a global variable that was potentially modified from a
performance benchmark. It would modify the behavior of `index_reader`
in certain scenarios.
Remove the variable so we can specify the behavior of `index_reader`
functions without relying on anything other than what's passed into the
constructor and the function parameters.
Reads which bypass cache will use a private temporary instance of cached_file
which dies together with the index cursor.
The cursor still needs a cached_file with cachig layer. Binary searching needs
caching for performance, some of the pages will be reused. Another reason to still
use cached_file is to work with a common interface, and reusing it requires minimal changes.
Index cursor for reads which bypass cache will use a private temporary
instance of the partition index cache.
Promoted index scanner (ka/la format) will not go through the page cache.
In preparation for caching index objects, manage them under LSA.
Implementation notes:
key_view was changed to be a view on managed_bytes_view instead of
bytes, so it now can be fragmented. Old users of key_view now have to
linearize it. Actual linearization should be rare since partition
keys are typically small.
Index parser is now not constructing the index_entry directly, but
produces value objects which live in the standard allocator space:
class parsed_promoted_index_entry;
calss parsed_partition_index_entry;
This change was needed to support consumers which don't populate the
partition index cache and don't use LSA,
e.g. sstable::generate_summary(). It's now consumer's responsibility
to allocate index_entry out of parsed_partition_index_entry.
index_entry will be an LSA-managed object. Those have to be accessed
with care, with the LSA region locked.
This patch hides most of direct index_entry accesses inside the
index_reader so that users are safe.
The file object which is currently stored there has per-request
tracing wrappers (permit, trace_state) attached to it. It doesn't make
sense once the entry is cached and shared. Annotate when the cursor is
created instead.
Index reads and promoted index reads are both using the same
cached_file now, so there's no need to pass the buffers between the
index parser and promoted index reader.
Makes the promoted_index structure easier to move to LSA.
After this patch, there is a singe index file page cache per
sstable, shared by index readers. The cache survives reads,
which reduces amount of I/O on subsequent reads.
As part of this, cached_file needed to be adjusted in the following ways.
The page cache may occupy a significant portion of memory. Keeping the
pages in the standard allocator could cause memory fragmentation
problems. To avoid them, the cache_file is changed to keep buffers in LSA
using lsa_buffer allocation method.
When a page is needed by the seastar I/O layer, it needs to be copied
to a temporary_buffer which is stable, so must be allocated in the
standard allocator space. We copy the page on-demand. Concurrent
requests for the same page will share the temporary_buffer. When page
is not used, it only lives in the LSA space.
In the subsequent patches cached_file::stream will be adjusted to also support
access via cached_page::ptr_type directly, to avoid materializating a
temporary_buffer.
While a page is used, it is not linked in the LRU so that it is not
freed. This ensures that the storage which is actively consumed
remains stable, either via temporary_buffer (kept alive by its
deleter), or by cached_page::ptr_type directly.
We want buffers to be accounted only when they are used outside
cached_file. Cached pages should not be accounted because they will
stay around for longer than the read after subsequent commits.
We'd like that to simplify the soon-to-be-introduced
sstable_mutation_reader::close error handling path.
close_index_list can be marked noexcept since parallel_for_each is,
with that index_reader::close can be marked noexcept too.
Note that since reader close can not fail
both lower and upper bounds are closed (since
closing lower_bound cannot fail).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, the primitive_consumer parses all values in contiguous buffers.
A string of bytes may be very long, so parsing it in a single buffer
can cause a big allocation. This patch allows parsing into
fragmented_temporary_buffers instead of temporary_buffers.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
Before this patch, each index reader had its own cache of partition
index pages. Now there is a shared cache, owned by the sstable object.
This allows concurrent reads to share partition index pages and thus
reduce the amount of I/O.
This change is also needed before we can implement promoted index caching.
That's because before the promoted index can be shared by readers, the
partition index entries, which hold the promoted index, must also be
shareable.
The pages live as long as there is at least one index reader
referencing them. So it only helps when there is concurrent access. In
the future we will keep them for longer and evict on memory pressure.
Promoted index cursor is no longer created when the partition index entry
is parsed, by it's created on-demand when the top-level cursor enters
the partition. The promoted index cursor is owned by the top-level cursor,
not by the partition index entry.
Currently, the partition index page parser will create and store
promoted index cursors for each entry. The assumption is that
partition index pages are not shared by readers so each promoted index
cursor will be used by a single index_reader (the top-level cursor).
In order to be able to share partition index entries we must make the
entries immutable and thus move the cursor outside. The promoted index
cursor is now created and owned by each index_reader. There is at most
one such active cursor per index_reader bound (lower/upper).
index_reader::index_bound must be constructible by non-friend classes
since it's used in std::optional (which isn't anyone's friend). This
now works in gcc because gcc's inter-template access checking is broken,
but clang correctly rejects it.
... and tests. Printin a pointer in logs is considered to be a bad practice,
so the proposal is to keep this explicit (with fmt::ptr) and allow it for
.debug and .trace cases.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Add the sstable_version_types::md enum value
and logically extend sstable_version_types comparisons to cover
also the > sstable_version_types::mc cases.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Turns out the fix f591c9c710 wasn't enough to make sure all input streams
are properly closed on failure.
It only closes the main input stream that belongs to context, but it misses
all the input streams that can be opened in the consumer for promote index
reading. Consumer stores a list of indexes, where each of them has its own
input stream. On failure, we need to make sure that every single one of
them is properly closed before destroying the indexes as that could cause
memory corruption due to read ahead.
Fixes#6924.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200727182214.377140-1-raphaelsc@scylladb.com>
Next patches will generalize ring_position_comparator with templates
to replace cache_entry's and memtable_entry's comparators. The overload
of operator() for sstables has its own implementation, that differs from
the "generic" one, for smoother generalization it's better to detach it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The seastar api v4 changes the return type of when_all_succeed. This
patch adds discard_result when that is best solution to handle the
change.
This doesn't do the actual update to v4 since there are still a few
issues left to fix in seastar. A patch doing just the update will
follow.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200617233150.918110-1-espindola@scylladb.com>
Currently, index reader uses 128 KiB I/O size with read-ahead. That is
a waste of bandwidth if index entries contain large promoted index and
binary search will be used within the promoted index, which may not
need to access as much.
The read-ahead is wasted both when using binary search and when using
the scanning cursor.
On the other hand, large I/O is optimal if there is no promoted index
and we're going to parse the whole page.
There is no way to predict which case it is up front before reading
the index.
Attaching dynamic adjustments (per-sstable) lets the system auto adjust
to the workload from past history.
The large promoted index workload will settle on reading 32 KiB (with
read-ahead). This is still not optimal, we should lower the buffer
size even more. But that requires a seastar change, so is deferred.
Currently, lookups in the promoted index are done by scanning the index linearly so the lookup
is O(N). For large partitions that's inefficient. It consumes both a lot of CPU and I/O.
We could do better and use binary search in the index. This patch series switches the mc-format
index reader to do that. Other formats use the old way.
The "mc" format promoted index has an extra structure at the end of the index called "offset map".
It's a vector of offsets of consecutive promoted index entries. This allows us to access random
entries in the index without reading the whole index.
The location of the offset entry for a given promoted index entry can be derived by knowing where
the offset vector ends in the index file, so the offset map also doesn't have to be read completely
into the memory.
The most tricky part is caching. We need to cache blocks read from the index file to amortize the
cost of binary search:
- if the promoted index fits in the 32 KiB which was read from the index when looking for
the partition entry, we don't want to issue any additional I/O to search the promoted index.
- with large promoted indexes, the last few bisections will fall into the same I/O block and we
want to reuse that block.
- we don't want the cache to grow too big, we don't want to cache the whole promoted index
as the read progresses over the index. Scanning reads may skip multiple times.
This patch implements a rather simple approach which meets all the
above requirements and is not worse than the current state of affairs:
- Each index cursor has its own cache of the index file area which corresponds to promoted index
This is managed by the cached_file class.
- Each index cursor has its own cache of parsed blocks. This allows the upper bound estimation to
reuse information obtained during lower bound lookup. This estimation is used to limit
read-aheads in the data file.
- Each cursor drops entries that it walked past so that memory footprint stays O(log N)
- Cached buffers are accounted to read's reader_permit.
In preparation for supporting more than one algorithm for lookups in
the promoted index, extract relevant logic out of the index_reader
(which is a partition index cursor).
The clustered index cursor implementation is now hidden behind
abstract interface called clustered_index_cursor.
The current implementation is put into the
scanning_clustered_index_cursor. It's mostly code movement with minor
adjustments.
In order to encapsulate iteration over promoted index entries,
clustered_index_cursor::next_entry() was introduced.
No change in behavior intended in this patch.
Tested against performance regression using:
build/release/test/perf/perf_fast_forward --run-test=small-partition-skips -c1
I get similar results before and after the patch.
Message-Id: <20200521213032.15286-1-tgrabiec@scylladb.com>