When a reverse slice is provided, it is given in the native reverse
format. Thus the ranges will be returned in the same order as stored
in the slice.
Therefore there is no need to distinguish between get_ranges and
get_native_ranges. The latter one gets dropped and get_ranges returns
ranges in the same order as stored in the slice.
assert() is traditionally disabled in release builds, but not in
scylladb. This hasn't caused problems so far, but the latest abseil
release includes a commit [1] that causes a 1000 insn/op regression when
NDEBUG is not defined.
Clearly, we must move towards a build system where NDEBUG is defined in
release builds. But we can't just define it blindly without vetting
all the assert() calls, as some were written with the expectation that
they are enabled in release mode.
To solve the conundrum, change all assert() calls to a new SCYLLA_ASSERT()
macro in utils/assert.hh. This macro is always defined and is not conditional
on NDEBUG, so we can later (after vetting Seastar) enable NDEBUG in release
mode.
[1] 66ef711d68Closesscylladb/scylladb#20006
flat_mutation_reader_v2 was introduced in a pair of commits in 2021:
e3309322c3 "Clone flat_mutation_reader related classes into v2 variants"
08b5773c12 "Adapt flat_mutation_reader_v2 to the new version of the API"
as a replacement for flat_mutation_reader, using range_tombstone_change
instead of range_tombstone to represent represent range tombstones. See
those commits for more information.
The transition was incremental; the last use of the original
flat_mutation_reader was removed in 2022 in commit
026f8cc1e7 "db: Use mutation_partition_v2 in mvcc"
In turn, flat_mutation_reader was introduced in 2017 in commit
748205ca75 "Introduce flat_mutation_reader"
To transition from a mutation_reader that nested rows within
a partition in a separate stream, to a flat reader that streamed
partitions and rows in the same stream.
Here, we reclaim the original name and rename the awkward
flat_mutation_reader_v2 to mutation_reader.
Note that mutation_fragment_v2 remains since we still use the original
for compatibilty, sometimes.
Some notes about the transition:
- files were also renamed. In one case (flat_mutation_reader_test.cc), the
rename target already existed, so we rename to
mutation_reader_another_test.cc.
- a namespace 'mutation_reader' with two definitions existed (in
mutation_reader_fwd.hh). Its contents was folded into the mutation_reader
class. As a result, a few #includes had to be adjusted.
Closesscylladb/scylladb#19356
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we include `fmt/ranges.h` and/or `fmt/std.h`
for formatting the container types, like vector, map
optional and variant using {fmt} instead of the homebrew
formatter based on operator<<.
with this change, the changes adding fmt::formatter and
the changes using ostream formatter explicitly, we are
allowed to drop `FMT_DEPRECATED_OSTREAM` macro.
Refs scylladb#13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we define formatters for cache_entry, and drop its
operator<<.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#17594
Adds a test reproducing https://github.com/scylladb/scylladb/issues/16759, and the instrumentation needed for it.
Closesscylladb/scylladb#17208
* github.com:scylladb/scylladb:
row_cache_test: test cache consistency during memtable-to-cache merge
row_cache: use preemption_source in update()
utils: preempt: add preemption_source
"fro" is the short of "from" but the value is an
`optimized_optional<flat_mutation_reader_v2>`. codespell considers
it a misspelling of "for" or "from". neither of them makes sense,
so let's change it to "reader" for better readability, also for
silencing the warning. so that the geniune warning can stands out,
this would help to make the codespell more useful.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#17221
To facilitate testing the state of cache after the update is preempted
at various points, pass a preemption_source& to update() instead of
calling the reactor directly.
In release builds, the calls to preemption_source methods should compile
to the same direct reactor calls as today. Only in dev mode they should
add an extra branch. (However, the `preemption_source&` argument has
to be shoveled in any mode).
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we define formatters for `dht::ring_position_ext` and
`dht::ring_position_view`, and drop their operator<<.
Refs #13245Closesscylladb/scylladb#17128
* github.com:scylladb/scylladb:
db: add formatter for dht::ring_position_ext
db: add formatter for dht::ring_position_view
Commit e81fc1f095 accidentally broke the control
flow of row_cache::do_update().
Before that commit, the body of the loop was wrapped in a lambda.
Thus, to break out of the loop, `return` was used.
The bad commit removed the lambda, but didn't update the `return` accordingly.
Thus, since the commit, the statement doesn't just break out of the loop as
intended, but also skips the code after the loop, which updates `_prev_snapshot_pos`
to reflect the work done by the loop.
As a result, whenever `apply_to_incomplete()` (the `updater`) is preempted,
`do_update()` fails to update `_prev_snapshot_pos`. It remains in a
stale state, until `do_update()` runs again and either finishes or
is preempted outside of `updater`.
If we read a partition processed by `do_update()` but not covered by
`_prev_snapshot_pos`, we will read stale data (from the previous snapshot),
which will be remembered in the cache as the current data.
This results in outdated data being returned by the replica.
(And perhaps in something worse if range tombstones are involved.
I didn't investigate this possibility in depth).
Note: for queries with CL>1, occurences of this bug are likely to be hidden
by reconciliation, because the reconciled query will only see stale data if
the queried partition is affected by the bug on on *all* queried replicas
at the time of the query.
Fixes#16759Closesscylladb/scylladb#17138
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we define formatters for `dht::ring_position_view`,
and drop its operator<<.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Currently the cache updaters aren't exception safe
yet they are intended to be.
Instead of allowing exceptions from
`external_updater::execute` escape `row_cache::update`,
abort using `on_fatal_internal_error`.
Future changes should harden all `execute` implementations
to effectively make them `noexcept`, then the pure virtual
definition can be made `noexcept` to cement that.
Fixesscylladb/scylladb#15576Closesscylladb/scylladb#15577
* github.com:scylladb/scylladb:
row_cache: abort on exteral_updater::execute errors
row_cache: do_update: simplify _prev_snapshot_pos setup
Said lambda currently moves the permit parameter, so on the second and
later calls it will possibly run into use-after-move. This can happen if
the allocating section below fails and is re-tried.
If the constructor of row_cache throws, `_partitions` is cleared in the
wrong allocator, possibly causing allocator corruption.
Fix that.
Fixes#15632Closesscylladb/scylladb#15633
Currently the cache updaters aren't exception safe
yet they are intended to be.
Instead of allowing exceptions from
`external_updater::execute` escape `row_cache::update`,
abort using `on_fatal_internal_error`.
Future changes should harden all `execute` implementations
to effectively make them `noexcept`, then the pure virtual
definition can be made `noexcept` to cement that.
Fixesscylladb/scylladb#15576
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
ring_position::min() is noexcept since 6d7ae4ead1
So no need to call it outside of the critical noexcept block.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Move partition_index_cache stats from a thread_local variable
to cache_tracker. After the change, partition_index_cache
receives a reference to the stats via constructor, instead of
referencing a global.
This is needed so that cache_tracker can know the memory usage
of index caches (for cache eviction purposes) without relying on
globals.
But it also makes sense even without that motive.
Move cached_file metrics from a thread_local variable
to cache_tracker.
This is needed so that cache_tracker can know
the memory usage of index caches (for purposes
of cache eviction) without relying on globals.
But it also makes sense even without that motive.
Adds a configurable upper limit to memory usage by index caches.
See the source code comments added in this patch for more details.
This patch shouldn't change visible behaviour, because the limit is set to 1.0
by default, so it is never triggerred. We will change the default in a future
patch.
when read from cache compact and expire row tombstones
remove expired empty rows from cache
do not expire range tombstones in this patch
Refs #2252, #6033Closes#12917
In that level no io_priority_class-es exist. Instead, all the IO happens
in the context of current sched-group. File API no longer accepts prio
class argument (and makes io_intent arg mandatory to impls).
So the change consists of
- removing all usage of io_priority_class
- patching file_impl's inheritants to updated API
- priority manager goes away altogether
- IO bandwidth update is performed on respective sched group
- tune-up scylla-gdb.py io_queues command
The first change is huge and was made semi-autimatically by:
- grep io_priority_class | default_priority_class
- remove all calls, found methods' args and class' fields
Patching file_impl-s is smaller, but also mechanical:
- replace io_priority_class& argument with io_intent* one
- pass intent to lower file (if applicatble)
Dropping the priority manager is:
- git-rm .cc and .hh
- sed out all the #include-s
- fix configure.py and cmakefile
The scylla-gdb.py update is a bit hairry -- it needs to use task queues
list for IO classes names and shares, but to detect it should it checks
for the "commitlog" group is present.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#13963
Adds a logalloc::region argument to upgrade_schema().
It's currently unused, but will be further propagated to
partition_entry::upgrade() in an upcoming patch.
operator<< accepts a schema& and a partition_entry&. But since the latter
now contains a reference to its schema inside, the former is redundant.
Remove it.
After adding a _schema field to each partition version,
the field in cache_entry is redundant. It can be always recovered
from the latest version. Remove it.
Courtersy of clang-tidy:
row_cache.cc:1191:28: warning: 'entry' used after it was moved [bugprone-use-after-move]
_partitions.insert(entry.position().token().raw(), std::move(entry), dht::ring_position_comparator{_schema});
^
row_cache.cc:1191:60: note: move occurred here
_partitions.insert(entry.position().token().raw(), std::move(entry), dht::ring_position_comparator{_schema});
^
row_cache.cc:1191:28: note: the use and move are unsequenced, i.e. there is no guarantee about the order in which they are evaluated
_partitions.insert(entry.position().token().raw(), std::move(entry), dht::ring_position_comparator{*_schema});
The use-after-move is UB, as for it to happen, depends on evaluation order.
We haven't hit it yet as clang is left-to-right.
Fixes#13400.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#13401
now that fmtlib provides fmt::join(). see
https://fmt.dev/latest/api.html#_CPPv4I0EN3fmt4joinE9join_viewIN6detail10iterator_tI5RangeEEN6detail10sentinel_tI5RangeEEERR5Range11string_view
there is not need to revent the wheel. so in this change, the homebrew
join() is replaced with fmt::join().
as fmt::join() returns an join_view(), this could improve the
performance under certain circumstances where the fully materialized
string is not needed.
please note, the goal of this change is to use fmt::join(), and this
change does not intend to improve the performance of existing
implementation based on "operator<<" unless the new implementation is
much more complicated. we will address the unnecessarily materialized
strings in a follow-up commit.
some noteworthy things related to this change:
* unlike the existing `join()`, `fmt::join()` returns a view. so we
have to materialize the view if what we expect is a `sstring`
* `fmt::format()` does not accept a view, so we cannot pass the
return value of `fmt::join()` to `fmt::format()`
* fmtlib does not format a typed pointer, i.e., it does not format,
for instance, a `const std::string*`. but operator<<() always print
a typed pointer. so if we want to format a typed pointer, we either
need to cast the pointer to `void*` or use `fmt::ptr()`.
* fmtlib is not able to pick up the overload of
`operator<<(std::ostream& os, const column_definition* cd)`, so we
have to use a wrapper class of `maybe_column_definition` for printing
a pointer to `column_definition`. since the overload is only used
by the two overloads of
`statement_restrictions::add_single_column_parition_key_restriction()`,
the operator<< for `const column_definition*` is dropped.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
operator<<(..) does not mutate the cache_entry parameter passed to it.
also, without this change fmtlib is not able to format given cache_entry
parameter, as the caller formatter has "const" specifier.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Now the nonforwardable reader unconditionally produces
a partition_end, even if the input reader was empty.
This is strange in itself, but it also hinders to
properly fix its next_partition() method, which is
our ultimate goal. So we are going to change this
and produce partition_end only if there were some
data in the stream. However, this makes a problem:
now we pop partition_start from the underlying reader
in autoupdating_underlying_reader::move_to_next_partition
and manually push it back to downstream readers
bypassing nonforwardable reader. This means if we
change the logic in nonforwardable reader as described
we will end up with partition_start without partition_end
in the downstream readers.
This patch rectifies this by making sure that
nonforwardable will see the initial partition_start.
We inject this partition_start just before the
nonforwardable reader, into delegating_reader.
This also makes the result type of
range_populating_reader::operator() a bit simpler,
we don't need to pass partition_start anymore.
Will be used by MVCC tests which don't want (can't) deal with the
row_cache as the container but work with the partition_entry directly.
Currently, rows_entry::on_evicted() assumes that it's embedded in
row_cache and would segfault when trying to evict the contining
partition entry which is not embedded in row_cache. The solution is to
call evict_shallow() from mvcc_tests, which does not attempt to evict
the containing partition_entry.
This patch switches memtable and cache to use mutation_partition_v2,
and all affected algorithms accordingly.
The memtable reader was changed to use the same cursor implementation
which cache uses, for improved code reuse and reducing risk of bugs
due to discrepancy of algorithms which deal with MVCC.
Range tombstone eviction in cache has now fine granularity, like with
rows.
Fixes#2578Fixes#3288Fixes#10587
Consider the following MVCC state of a partition:
v2: ==== <7> [entry2] ==== <9> ===== <last dummy>
v1: ================================ <last dummy> [entry1]
Where === means a continuous range and --- means a discontinuous range.
After two LRU items are evicted (entry1 and entry2), we will end up with:
v2: ---------------------- <9> ===== <last dummy>
v1: ================================ <last dummy> [entry1]
This will cause readers to incorrectly think there are no rows before
entry <9>, because the range is continuous in v1, and continuity of a
snapshot is a union of continuous intervals in all versions. The
cursor will see the interval before <9> as continuous and the reader
will produce no rows.
This is only temporary, because current MVCC merging rules are such
that the flag on the latest entry wins, so we'll end up with this once
v1 is no longer needed:
v2: ---------------------- <9> ===== <last dummy>
...and the reader will go to sstables to fetch the evicted rows before
entry <9>, as expected.
The bug is in rows_entry::on_evicted(), which treats the last dummy
entry in a special way, and doesn't evict it, and doesn't clear the
continuity by omission.
The situation is not easy to trigger because it requires certain
eviction pattern concurrent with multiple reads of the same partition
in different versions, so across memtable flushes.
Closes#12452
The reason is alloc-dealloc mismatch of position_in_partition objects
allocated by cursors inside coroutine object stored in the update
variable in row_cache::do_update()
It is allocated under cache region, but in case of exception it will
be destroyed under the standard allocator. If update is successful, it
will be cleared under region allocator, so there is not problem in the
normal case.
Fixes#12068Closes#12233
This series is a step towards non-LRU cache algorithms.
Our cache items are able to unlink themselves from the LRU list. (In other words, they can be unlinked solely via a pointer to the item, without access to the containing list head). Some places in the code make use of that, e.g. by relying on auto-unlink of items in their destructor.
However, to implement algorithms smarter than LRU, we might want to update some cache-wide metadata on item removal. But any cache-wide structures are unreachable through an item pointer, since items only have access to themselves and their immediate neighbours. Therefore, we don't want items to unlink themselves — we want `cache.remove(item)`, rather than `item.remove_self()`, because the former can update the metadata in `cache`.
This series inserts explicit item unlink calls in places that were previously relying on destructors, gets rid of other self-unlinks, and adds an assert which ensures that every item is explicitly unlinked before destruction.
Closes#11716
* github.com:scylladb/scylladb:
utils: lru: assert that evictables are unlinked before destruction
utils: lru: remove unlink_from_lru()
cache: make all cache unlinks explicit
unlink_from_lru() allows for unlinking elements from cache without notifying
the cache. This messes up any potential cache bookkeeping.
Improved that by replacing all uses of unlink_from_lru() with calls to
lru::remove(), which does have access to cache's metadata.
All calls in the try block have been noexcept for some time.
Remove the try...catch and the associated misleading comment to avoid confusing
source code readers.
Closes#11715
This reverts commit e0670f0bb5, reversing
changes made to 605ee74c39. It causes failures
in debug mode in
database_test.test_database_with_data_in_sstables_is_a_mutation_source_plain,
though with low probability.
Fixes#10780Reopens#652.