cache_flat_mutation_reader gets a native v2 implementation. The
underlying mutation representation is not changed: range deletions are
still stored as v1 range_tombstones in mutation_partition. These are
converted to range tombstone changes during reading.
This allows for separating the change of a native v2 reader
implementation and a native v2 in-memory storage format, enabling the
two to be done at separate times and incrementally.
In most files it was unused. We should move these to the patch which
moved out the last interesting reader from mutation_reader.hh (and added
the corresponding new header include) but its probably not worth the
effort.
Some other files still relied on mutation_reader.hh to provide reader
concurrency semaphore and some other misc reader related definitions.
The flat_mutation_reader files were conflated and contained multiple
readers, which were not strictly necessary. Splitting optimizes both
iterative compilation times, as touching rarely used readers doesn't
recompile large chunks of codebase. Total compilation times are also
improved, as the size of flat_mutation_reader.hh and
flat_mutation_reader_v2.hh have been reduced and those files are
included by many file in the codebase.
With changes
real 29m14.051s
user 168m39.071s
sys 5m13.443s
Without changes
real 30m36.203s
user 175m43.354s
sys 5m26.376s
Closes#10194
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.
Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.
The changes we applied mechanically with a script, except to
licenses/README.md.
Closes#9937
Print MVCC snapshot to help distinguish reads which use different snapshots.
Also, print the whole cursor, not just its position. This helps in
determining which MVCC version the iterator comes from.
Some implementation notes below.
When iterating in reverse, _last_row is after the current entry
(_next_row) in table schema order, not before like in the forward
mode.
Since there is no dummy row before all entries, reverse iteration must
be now prepared for the fact that advancing _next_row may land not
pointing at any row. The partition_snapshot_row_cursor maintains
continuity() correctly in this case, and positions the cursor before
all rows, so most of the code works unchanged. The only excpetion is
in move_to_next_entry(), which now cannot assume that failure to
advance to an entry means it can end a read.
maybe_drop_last_entry() is not implemented in reverse mode, which may
expose reverse-only workload to the problem of accumulating dummy
entries.
ensure_population_lower_bound() was not updating _last_row after
inserting the entry in latets version. This was not a problem for
forward reads because they do not modify the row in the partition
snapshot represented by _last_row. They only need the row to be there
in the latest version after the call. It's different for reveresed
reads, which change the continuity of the entry represented by
_last_row, hence _last_row needs to have the iterator updated to point
to the entry from the latest version, otherwise we'd set the
continuity of the previous version entry which would corrupt the
continuity.
The B-tree's insert_before() is throwing operation, its caller
must account for that. When the rows_entry's collection was
switched on B-tree all the risky places were fixed by ee9e1045,
but few places went under the radar.
In the cache_flat_mutation_reader there's a place where a C-pointer
is inserted into the tree, thus potentially leaking the entry.
In the partition_snapshot_row_cursor there are two places that not
only leak the entry, but also leave it in the LRU list. The latter
it quite nasty, because those entry can be evicted, eviction code
tries to get rows_entry iterator from "this", but the hook happens
to be unattached (because insertion threw) and fails the assert.
fixes: #9728
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This warning can catch a virtual function that thinks it
overrides another, but doesn't, because the two functions
have different signatures. This isn't very likely since most
of our virtual functions override pure virtuals, but it's
still worth having.
Enable the warning and fix numerous violations.
Closes#9347
Some callers of mutation_partition::row_tomstones() don't want
(and shouldn't) modify the list itself, while they may want to
modify the tombstones. This patch explicitly locates those that
need to modify the collection, because the next patch will
return immutable collection for the others.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Some callers of mutation_partition::clustered_rows() don't want
(and shouldn't) modify the tree of rows, while they may want to
modify the rows themselves. This patch explicitly locates those
that need to modify the collection, because the next patch will
return immutable collection for the others.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Before the patch, all range tombstones up to the next row were copied
into a vector, and then put into the buffer until it's full. This
would get quadratic if there is much more range tombstones than fit in
a buffer.
The fix is to avoid the accumulation of all tombstones in the vector
and invoke the callback instead, which stops the iteartion as soon as
the buffer is full.
Fixes#2581.
Cache populating reader was emitting the row entry which stands for
the upper bound of the population range, but did not emit range
tombstones for the clustering range corresponding to:
[ before(key), after(key) ).
This surfaces after sstable readers are changed to trim emitted range
tombstones to the fast-forwarding range. Before, it didn't cause
problems, because that range tombstone part would be emitted as part
of the sstable read.
The fix is to drop the optimization which pushes the row after
population is done, and let the regular handling for
copy_from_cache_to_buffer() take care of emitting the row and
tombstones for the remaining range.
A unit test is added which covers population from all sstable
versions.
Row cache reader can produce overlapping range tombstones in the
mutation fragment stream even if there is only a single range
tombstone in sstables, due to #2581. For every range between two rows,
the row cache reader queries for tombstones relevant for that
range. The result of the query is trimmed to the current position of
the reader (=position of the previous row) to satisfy key
monotonicity. The end position of range tombstones is left
unchanged. So cache reader will split a single range tombstone around
rows. Those range tombstones are transient, they will be only
materialized in the reader's stream, they are not persisted anywhere.
That is not a problem in itself, but it interacts badly with mutation
compactor due to #8625. The range_tombstone_accumulator which is used
to compact the mutation fragment stream needs to accumulate all
tombstones which are relevant for the current clustering position in
the stream. Adding a new range tombstone is O(N) in the number of
currently active tombstones. This means that producing N rows will be
O(N^2).
In a unit test, I saw reading 137'248 rows which overlap with a range
tombstone take 245 seconds. Almost all of CPU time is in
drop_unneeded_tombstones().
The solution is to make the cache reader trim range tombstone end to
the currently emited sub-range, so that it emits non-overlapping range
tombstones.
Fixes#8626.
This fixes a potential cause for reactor stalls during memory
reclamation. Applies only to schemas without clustering columns.
Every partition in cache has a dummy row at the end of the clustering
range (last dummy). That dummy must be evicted last, because MVCC
logic needs it to be there at all times. If LRU picks it for eviction
and it's not the last row, eviction does nothing and moves
on. Eventually, all other rows in this partition will be evicted too
and then the partition will go away with it.
Mutation reader updates the position of rows in the LRU (aka touching)
as it walks over them. However, it was not always touching the last
dummy row. If the partition was fully cached, and schema had no
clustering key, it would exit early before reaching the last dummy
row, here:
inline
void cache_flat_mutation_reader::move_to_next_entry() {
clogger.trace("csm {}: move_to_next_entry(), curr={}", fmt::ptr(this), _next_row.position());
if (no_clustering_row_between(*_schema, _next_row.position(), _upper_bound)) {
move_to_next_range();
That's because no_clustering_row_between() is always true for any key
in tables with no clustering columns, and the reader advances to
end-of-stream without advancing _next_row to the last dummy. This is
expected and desired, it means that the query range ends at the
current row and there is no need to move further. We would not take
this exit for tables with a non-singular clustering key domain and
open-ended query range, since there would be a key gap before the last
dummy row.
Refs #2972.
The effect of leaving the last dummy row not touched will be that such
scans will segregate rows in the LRU, bring all regular rows to the
front, and dummy rows at the tail. When eviction reaches the band of
dummy rows, it will have to walk over it, because evicting them
releases no memory. This can cause a reactor stall.
An easy fix for the scenario would be to always touch the dummy entry
when entering the partition. It's unlikely that the read will not
proceed to the regular rows. It would be best to avoid linking such
dummies in the LRU, but that's a much more complex change.
Discovered in perf_row_cache_update, test_small_partitions(). I saw
200ms stalls with -m8G.
Refs #8541.
Tests:
- row_cache_test (release)
- perf_simple_query [no change]
Message-Id: <20210427111619.296609-1-tgrabiec@scylladb.com>
Such that the holder, that is responsible for closing the
read_context before destroying it, holds it uniquely.
cache_flat_mutation_reader may be constructed either
with a read_context&, where it knows that the read_context
is owned externally, by the caller, or it could
be constructed with a std::unique_ptr<read_context> in
which case it assumes ownership of the read_context
and it is now responsible for closing it.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The range tombstone can be added-to-buffer from two places:
when it was found in cache and when it was read from the
underlying reader. Both adders can now be generalized.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When adding a range tombstone to the buffer the need
to stop stuffing the already full one is only done if
this particular range timbstone changes the lower_bound.
This check can be tuned -- if the lower bound changed
_at_ _all_ after a range tombstone was added, we may
still abort the loop.
This change will allow to generalize range tombstone
emission by the next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When inserting a rows_entry via unique_ptr the ptr inquestion
can be pushed as is, the intrusive btree code releases the
pointer (to be exception safe) itself. This makes the code
a bit shorter and simpler.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two alignment gaps in the middle of the
c_f_m_r -- one after the state and another one after
the set of bools. Keeping them togethers allows the
compiler to pack the c_f_m_r better.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The instance of position_in_partition::tri_compare sits
on the reader itself and just occupies memory. It can be
created on demand all the more so it's only one place that
needs it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Make sure that when a partition does not exist in underlying,
do_fill_buffer does not try to fast forward withing this nonexistent
partition.
Test: unit(dev)
Fixes#8435Fixes#8411
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Right now call to .row() method may create hash on row's cells.
It's counterintuitive to see a const method that transparently
changes something it points to. Since the only caller of a row()
who knows whether the hash creation is required is the cache
reader, it's better to move the call to prepare_hash() into it.
Other than making the .row() less surprising this also helps to
get rid of the whole method by the next patches.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This will prevent accumulation of unnecessary dummy entries.
A single-partition populating scan with clustering key restrictions
will insert dummy entries positioned at the boundaries of the
clustering query range to mark the newly populated range as
continuous.
Those dummy entries may accumulate with time, increasing the cost of
the scan, which needs to walk over them.
In some workloads we could prevent this. If a populating query
overlaps with dummy entries, we could erase the old dummy entry since
it will not be needed, it will fall inside a broader continuous
range. This will be the case for time series worklodas which scan with
a decreasing (newest) lower bound.
Refs #8153.
_last_row is now updated atomically with _next_row. Before, _last_row
was moved first. If exception was thrown and the section was retried,
this could cause the wrong entry to be removed (new next instead of
old last) by the new algorithm. I don't think this was causing
problems before this patch.
The problem is not solved for all the cases. After this patch, we
remove dummies only when there is a single MVCC version. We could
patch apply_monotonically() to also do it, so that dummies which are
inside continuous ranges are eventually removed, but this is left for
later.
perf_row_cache_reads output after that patch shows that the second
scan touches no dummies:
$ build/release/test/perf/perf_row_cache_reads_g -c1 -m200M
Rows in cache: 0
Populating with dummy rows
Rows in cache: 265320
Scanning
read: 142.621613 [ms], preemption: {count: 639, 99%: 0.545791 [ms], max: 0.526929 [ms]}, cache: 0/0 [MB]
read: 0.023197 [ms], preemption: {count: 1, 99%: 0.035425 [ms], max: 0.032736 [ms]}, cache: 0/0 [MB]
Message-Id: <20210226172801.800264-1-tgrabiec@scylladb.com>
fill_buffer() will keep scanning until _lower_bound_changed is true,
even if preemption is signaled, so that the reader makes forward
progress.
Before the patch, we did not update _lower_bound on touching a dummy
entry. The read will not respect preemption until we hit a non-dummy
row. If there is a lot of dummy rows, that can cause reactor stalls.
Fix that by updating _lower_bound on dummy entries as well.
Refs #8153.
Tested with perf_row_cache_reads:
```
$ build/release/test/perf/perf_row_cache_reads -c1 -m200M
Rows in cache: 0
Populating with dummy rows
Rows in cache: 373929
Scanning
read: 183.658966 [ms], preemption: {count: 848, 99%: 0.545791 [ms], max: 0.519343 [ms]}, cache: 99/100 [MB]
read: 120.951515 [ms], preemption: {count: 257, 99%: 0.545791 [ms], max: 0.518795 [ms]}, cache: 99/100 [MB]
```
Notice that max preemption latency is low in the second "read:" line.
Closes#8167
* github.com:scylladb/scylla:
row_cache: Make fill_buffer() preemptable when cursor leads with dummy rows
tests: perf: Introduce perf_row_cache_reads
row_cache: Add metric for dummy row hits
fill_buffer() will keep scanning until _lower_bound_chnaged is true,
even if preemption is signalled, so that the reader makes forward
progress.
Before the patch, we did not update _lower_bound on touching a dummy
entry. The read will not respect preemption until we hit a non-dummy
row. If there is a lot of dummy rows, that can cause reactor stalls.
Fix that by updating _lower_bound on dummy entries as well.
Refs #8153.
Tested with perf_row_cache_reads:
$ build/release/test/perf/perf_row_cache_reads -c1 -m200M
Rows in cache: 0
Populating with dummy rows
Rows in cache: 373929
Scanning
read: 183.658966 [ms], preemption: {count: 848, 99%: 0.545791 [ms], max: 0.519343 [ms]}, cache: 99/100 [MB]
read: 120.951515 [ms], preemption: {count: 257, 99%: 0.545791 [ms], max: 0.518795 [ms]}, cache: 99/100 [MB]
Notice that max preemption latency is low in the second "read:" line.
Unlike flat_mutation_reader_opt that is defined using
optimized_optional<flat_mutation_reader>, std::optional<T> does not evaluate
to `false` after being moved, only after it is explicitly reset.
Use flat_mutation_reader_opt rather than std::optional<flat_mutation_reader>
to make it easier to check if it was closed before it's destroyed
or being assigned-over.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210215101254.480228-6-bhalevy@scylladb.com>
The switch is pretty straightforward, and consists of
- change less-compare into tri-compare
- rename insert/insert_check into insert_before_hint
- use tree::key_grabber in mutation_partition::apply_monotonically to
exception-safely transfer a row from one tree to another
- explicitly erase the row from tree in rows_entry::on_evicted, there's
a O(1) tree::iterator method for this
- rewrite rows_entry -> cache_entry transofrmation in the on_evicted to
fit the B-tree API
- include the B-tree's external memory usage into stats
That's it. The number of keys per node was is set to 12 with linear search
and linear extention of 20 because
- experimenting with tree shows that numbers 8 through 10 keys with linear
search show the best performance on stress tests for insert/find-s of
keys that are memcmp-able arrays of bytes (which is an approximation of
current clustring key compare). More keys work slower, but still better
than any bigger value with any type of search up to 64 keys per node
- having 12 keys per nodes is the threshold at which the memory footprint
for B-tree becomes smaller than for boost::intrusive::set for partitions
with 32+ keys
- 20 keys for linear root eats the first-split peak and still performs
well in linear search
As a result the footpring for B tree is bigger than the one for BST only for
trees filled with 21...32 keys by 0.1...0.7 bytes per key.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When a rows_entry is added to row_cache it's constructed from
clustering_row by unpacking all its internals and putting
them into the rows_entry's deletable_row. There's a shorter
way -- the clustering_row already has the deletale_row onboard
from which rows_entry can copy-construct its.
This lets keeping the rows_entry and deletable_row set of
constructors a bit shorter.
tests: unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20201224161112.20394-1-xemul@scylladb.com>
We want to start tracking the memory consumption of mutation fragments.
For this we need schema and permit during construction, and on each
modification, so the memory consumption can be recalculated and pass to
the permit.
In this patch we just add the new parameters and go through the insane
churn of updating all call sites. They will be used in the next patch.
Not used yet, this patch does all the churn of propagating a permit
to each impl.
In the next patch we will use it to track to track the memory
consumption of `_buffer`.
... and tests. Printin a pointer in logs is considered to be a bad practice,
so the proposal is to keep this explicit (with fmt::ptr) and allow it for
.debug and .trace cases.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
We typically use `std::bad_function_call` to throw from
mandatory-to-implement virtual functions, that cannot have a meaningful
implementation in the derived class. The problem with
`std::bad_function_call` is that it carries absolutely no information
w.r.t. where was it thrown from.
I originally wanted to replace `std::bad_function_call` in our codebase
with a custom exception type that would allow passing in the name of the
function it is thrown from to be included in the exception message.
However after I ended up also including a backtrace, Benny Halevy
pointed out that I might as well just throw `std:bad_function_call` with
a backtrace instead. So this is what this patch does.
All users are various unimplemented methods of the
`flat_mutation_reader::impl` interface.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200408075801.701416-1-bdenes@scylladb.com>
The sstable reader which populates the partition entry in the cache is
using the schema of the partition entry snapshot, which will be the
schema of the cache at the time the partition was entered. If there
was a schema change after the cache reader entered the partition but
before it created the sstable reader, the cache populating reader will
interpret sstable fragments using the wrong schema version. That is
more likely if partitions have many rows, and the front of the
partition is populated. With single-row partitions that's unlikely to
happen.
That is undefined behavior in general, which may include:
- read failures due to bad_alloc, if fixed-size cells are
interpreted as variable-sized cells, and we misinterpret
a value for a huge size
- wrong read results
- node crash
This doesn't result in a permanent corruption, restarting the node
should help.
Fixes#5127.
After the new in-memory representation of cells was introduced there was
a regression in atomic_cell_or_collection::operator<< which stopped
printing the content of the cell. This makes debugging more incovenient
are time-consuming. This patch fixes the problem. Schema is propagated
to the atomic_cell_or_collection printer and the full content of the
cell is printed.
Fixes#3571.
Message-Id: <20181024095413.10736-1-pdziepak@scylladb.com>
Currently timeout is opt-in, that is, all methods that even have it
default it to `db::no_timeout`. This means that ensuring timeout is used
where it should be is completely up to the author and the reviewrs of
the code. As humans are notoriously prone to mistakes this has resulted
in a very inconsistent usage of timeout, many clients of
`flat_mutation_reader` passing the timeout only to some members and only
on certain call sites. This is small wonder considering that some core
operations like `operator()()` only recently received a timeout
parameter and others like `peek()` didn't even have one until this
patch. Both of these methods call `fill_buffer()` which potentially
talks to the lower layers and is supposed to propagate the timeout.
All this makes the `flat_mutation_reader`'s timeout effectively useless.
To make order in this chaos make the timeout parameter a mandatory one
on all `flat_mutation_reader` methods that need it. This ensures that
humans now get a reminder from the compiler when they forget to pass the
timeout. Clients can still opt-out from passing a timeout by passing
`db::no_timeout` (the previous default value) but this will be now
explicit and developers should think before typing it.
There were suprisingly few core call sites to fix up. Where a timeout
was available nearby I propagated it to be able to pass it to the
reader, where I couldn't I passed `db::no_timeout`. Authors of the
latter kind of code (view, streaming and repair are some of the notable
examples) should maybe consider propagating down a timeout if needed.
In the test code (the wast majority of the changes) I just used
`db::no_timeout` everywhere.
Tests: unit(release, debug)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <1edc10802d5eb23de8af28c9f48b8d3be0f1a468.1536744563.git.bdenes@scylladb.com>
ensure_population_lower_bound() returned true if current clustering
range covers all rows, which means that the populator has a right to
set continuity flag to true on the row it inserts. This is correct
only if the current population range actually starts since before all
clustering rows. Otherwise we're populating since _last_row, and
should consult it.
The fix introduces a new flag, set when starting to populte, which
indicates if we're populating from the beginning of the range or
not. We cannot simply check if _last_row is set in
ensure_population_lower_bound() because _last_row can be set and then
become empty again.
Fixes#3608