Use the querier_cache (represented by the passed-in
querier_cache_context) object to lookup saved queriers at the start of
the page and save them at the end of it if it is likely that there will
be more page requests.
Instead of evicting whole partitions, evicts whole rows.
As part of this, invalidation of partition entries was changed to not
evict from snapshots right away, but unlink them and let them be
evicted by the reclaimer.
For row-level eviction we need to ensure that each version has
complete rows so that eviction from older versions doesn't affect the
value of the row in newer snapshots.
This is achieved by copying the row from an older version before
applying the increment in the new version.
Only affects evictable entries, memtables are not affected.
Every evictable version will have a dummy entry at the end so that it can be
tracked in the LRU.
It is also needed to allow old versions to stay around (with
tombstones and static rows) after all rows are evicted. Such versions
must be fully discontinuous, and we need some entry to mark that.
This change is a preparation for introducing row-level eviction, such that entries
can be evicted from older versions without having to touch other versions.
Currently continuity flags on entries are interpreted relative to the
combined view merged from all entries. For example:
v2: <key=2, cont=1>
v1: <key=1, cont=1>
In v2, the flag on entry key=2 marks the range (1, 2) as
continuous. This is problematic because if the old version is evicted, continuity
will change in an incorrect way:
v2: <key=2, cont=1>
Here, the range (-inf, 1) would be marked as continuous, which is not true.
To solve this problem, we change the rules for continuity
interpretation in MVCC. Each version will have its own continuity,
fully specified in that version, independent of continuity of other
versions. Continuity of the snapshot will be a union of continuous
ranges in each version.
It is assumed that continuous intervals in different versions are non-
overlapping, except for points corresponding to complete rows, in
which case a later version may overlap with an older version
(overwrite). We make use of this assumption to make calculation of the
union of intervals on merging easier. I make use of the above
assumption in mutation_partition::apply_monotonically().
MVCC population of incomplete entries already almost maintains the
non-overlapping invariant, because population intervals correspond to
intervals which are incomplete in the old snapshot. The only change
needed is to ensure that both population bounds will have entries in
the latest version. Population from memtables doesn't mark any
intervals as continuous, so also conforms. The only change needed
there is to not inherit continuity flags from the old snapshot,
effectively making the new version internally discontinuous except for
row points.
The example from the beginning will become:
v2: <key=1, cont=0> <key=2, cont=1>
v1: <key=1, cont=1>
When marking a range as continuous with some rows present only in
older versions, we need to insert entries in the latest version, so
that we can mark the range as continuous. The easiest solution is to
copy the entry from the old version. Another option would be to add
support for incomplete rows and insert such instead. This way we would
avoid duplicating row contents. This optimization is deferred.
This entails doing the cell hash calculation slightly differently,
where the cell is hashed individually, the resulting hash being added
to the running one.
Instead of propagating a flag all through the call chain, we detect
whether we are in the new mode by the employed hash algorithm.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This enables us to only branch once per row on the actual hash
algorithm, instead of once per row data item.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
We add storage to a row to hold the cached hashes of each individual
cell. We don't store the hash in each cell because that would a)
change the cell equality function, and b) require us to change a cell
in a potentially fragmented buffer.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Use the digester class instead of md5_hasher to encapsulate the
decision of which hash algorithm to use.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
"Changes merging in MVCC to apply newer version to older instead of older to
newer.
Before (v0 = oldest):
(((v3 + v2) + v1) + v0)
After:
(v0 + (v1 + (v2 + v3)))
or:
(((v0 + v1) + v2) + v3)
There are several reasons to do this:
1) When continuity merging will change semantics to support eviction
from older versions, it will be easier to implement apply() if we
can assume that we merge newer to older instead of older to
newer, since newer version may have entries falling into a
continuous interval in older, but not the other way around. If we
didn't revert the order, apply() would have to keep track of
lower bound of a continuous interval in the right-hand side
argument (older version) as it is applied and update continuity
flags in the left hand side by scanning all entries overlapping
with it. If order is reversed, merging only needs to deal with
the current entry. Also, if we were to keep the old order, we
cannot simply move entries from the left hand side as we merge
because we need to keep track of the lower bound of a continuous
interval, and we need to provide monotonic exception
guarantees. So merging would be both more complicated and slower.
2) With large partitions older versions are typically larger than
newer versions, and since merging is O(N_right*(1 + log(N_left))),
it's better to merge newer into older.
This fixes latency spikes seen in perf_cache_eviction.
Fixes #2715."
* tag 'tgrabiec/reverse-order-of-mvcc-version-merging-v1' of github.com:scylladb/seastar-dev:
mvcc: Reverse order of version merging
anchorless_list: Introduce last()
mvcc: Implement partition_entry::upgrade() using squashed()
mvcc: Extract version merging functions
mutation_partition: Add rows_entry::set_dummy()
position_in_partition: Introduce after_key()
Change merging to apply newer version to older instead of older to
newer.
Before:
(((v3 + v2) + v1) + v0)
After:
(v0 + (v1 + (v2 + v3)))
or equivalent:
(((v0 + v1) + v2) + v3)
There are several reasons to do this:
1) When continuity merging will change semantics to support eviction
from older versions, it will be easier to implement apply() if we
can assume that we merge newer to older instead of older to
newer, since newer version may have entries falling into a
continuous interval in older, but not the other way around. If we
didn't revert the order, apply() would have to keep track of
lower bound of a continuous interval in the right-hand side
argument (older version) as it is applied and update continuity
flags in the left hand side by scanning all entries overlapping
with it. If order is reversed, merging only needs to deal with
the current entry. Also, if we were to keep the old order, we
cannot simply move entries from the left hand side as we merge
because we need to keep track of the lower bound of a continuous
interval, and we need to provide monotonic exception
guarantees. So merging would be both more complicated and slower.
2) With large partitions older versions are typically larger than
newer versions, and since merging is O(N_right*(1 + log(N_left))),
it's better to merge newer into older.
Fixes#2715.
We pass the timeout that we received from data_query/mutation_query
down to consume, which is responsible for actually reading the data.
To make those timeouts actionable, though, we'll have to patch
fill_buffer(). This will happen in the next patch.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
data_query and mutation_query are patched so that they start accepting a
per-query timeout. We will default to no timeout, and then no callers
will be changed yet.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
This fixes the problem of equal_continuity() being prone to false
positives due to redundant information (extra dummy rows) present in
one of the partitions. get_continuity() is minified, so is not prone
to this.
Convert to a multi line output, which is easier to read for a human.
After:
{ks.cf key {key: pk{000c706b30303030303030303030}, token:-2018791535786252460} data {mutation_partition: {tombstone: none},
range_tombstones: {},
static: cont=1 {row: },
clustered: {
{rows_entry: cont=true dummy=false {position: clustered,ckp{000c636b30303030303030303030},0} {deletable_row: {row: }}},
{rows_entry: cont=true dummy=true {position: clustered,ckp{000c636b30303030303030303031},0} {deletable_row: {row: }}}}}}
The uses which needed strong or weak exception guarantees were
switched to a solution involving apply_monotonically(). All remaining
uses don't need any exception guarantees.
Has weaker exception guarantees than apply(), which allows for simpler
implementation. Intended to replace the apply() with strong exception
guarantees.
In the memtable flusher, we account for the size of a partition as we
read them. However, there are other points in the architecture where we
would like to calculate the size of a partition in a point in which we
are not reading it. One such example is the cache update process.
This patch enhances the mutation_partition adding a method that returns
the total size for this partition.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
If exception is thrown from _row_tombstones.apply(), _rows will be
left uncleared. This will trigger assertion in bi::set_member_hook
destructor, which assrts that the hook is not linked.
Always clear _rows.
When compacting a partition for querying we would read an extra row,
to include any tombstones between that one and the previous row.
This is no longer needed since we have a general mechanism to detect
short reads in the storage_proxy.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170811103031.22866-1-duarte@scylladb.com>