Thread stack allocation may fail, in which case we did not do the
necessary invalidation. Fix by hoisting the scope of the cleanup function.
Also fixes the following test failure:
tests/row_cache_test.cc(949): fatal error: in "test_update_failure": critical check it->second.equal(*s, mopt->partition()) has failed
which started to trigger after commit 318423d50b.
Message-Id: <1504023113-30374-2-git-send-email-tgrabiec@scylladb.com>
Underlying data source in row cache holds a reference to sstable set
prior to compaction which isn't released until a memtable flush, which
means file descriptors of deleted sstables remains opened, wasting
disk space.
The fix is to refresh underlying data source in row cache.
Fixes#2570.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
If cache is missing given key, but the range is marked as continuous,
it means sstables don't have that entry and we can insert it without
asking the presence checker (bloom filter based). The latter is more
expensive and gives false positives. So this improves update
performance and hit ratio.
Another positive effect is that we don't have to clear continuity now.
Fixes#1999.
Message-Id: <1498643043-21117-1-git-send-email-tgrabiec@scylladb.com>
The interaction will be as follows:
- Before creating cache_streamed_mutation for given partition, cache
mutation reader sets up read_context for current partition (in one
of two ways) so that the matching underlying streamed_mutation can
be accessed at any time by cached_stream_mutation.
- cache_streamed_mutation assumes that read_context is set up for
current partition and invokes fast_forward_to() and
get_next_fragment() to access the underlying
streamed_mutation.
We will need to use this information later in yet another place, when
creating a reader for incomplete cache entry. This refactors the code
so that there is a single place which determines this fact.
Currently scanning_and_populating_reader asks
just_cache_scanning_reader for the next partition from cache, together
with information if the range is continuous. If it's not, it saves the
partition it got from it and moves on to reading from the underlying
reader up to that partition. When that's done, it emits the stored
partition.
This approach won't work well with upcoming changes for storing
partial partitions. We won't have whole partitions any more, so
streamed_mutation returned for the entry needs to be prepared for
reading from the underlying mutation source. We want to reuse the same
underlying reader as much as possible, so all streamed_mutations for
given read (read_context) will share the state of the underlying
reader. Construction of a streamed_mutation will depend on the fact
that the shared state is set up for it, so we cannot have two
streamed_mutations prepared at the same time (one for entry from
primary, and one for the earlier entry being populated). This change
defers the creation of a streamed_mutation for the entry present in
cache until the whole reader reaches it to avoid this problem.
This will also have antoher potentially beneficial effect. Since we
defer the decision about which snapshot to use until we reach the
entry, there is a higher chance that the current snapshot of the entry
will match the one used last by the populating read, and that we will
be able to reuse the reader.
It's implemented by utilizing a stable partition cursor which tracks
its current position so that it's possible to revisit the cache entry
(if it's still there) after population ends. The functionality of
just_cache_scanning_reader was inlined into
scanning_and_populating_reader.
It's needed before switching cache_entry ordering to rely solely on
cache_entry::position() so that invalidate_unwrapped() never removes
the dummy entry at the end. Currently if the range has upper bound
like this:
{ ring_position::max(), inclusive=true }
The code which selects entries for removal would include the dummy row
at the end. It uses upper_bound() to get the end iterator, and the
dummy entry has a position which is equal to the position in the
bound.
ring_position_view ranges are end-exclusive, so it's impossible to
create a partition range which would include a dummy entry.
The code is also simpler.
Currently readers are always using the latest snapshot. This is fine
for respecting write atomicity if partitions are fully continuous in
cache (now), but will break write atomicity once partial population is
allowed.
Consider the following case:
flush write(ck=1), write(ck=2) -> snapshot_1
cache reader 1 reads and inserts ck=1 @snapshot_1
flush write(ck=1), write(ck=2) -> snapshot_2
cache reader 2 reads and inserts ck=2 @snapshot_2
Because cache update is not atomic, it can happen that reader 2 will
complete while the partition hasn't been updated yet for snapshot_2.
In such case, after read 2 the partition would contain ck=1 from
snapshot_1 and ck=2 from snapshot_2. It will match neither of the
snapshots, and this could violate write atomicity.
To solve this problem we conceptually assign each partition key in the
ring to its current snapshot which it reflects. The update process
gradually converts entries in ring order to the new snapshot. Reads
will not be using the latest snapshot, but rather the current snapshot
for the position in the ring they are at.
There is a race between the update process and populating reads. Since
after the update all entries must reflect the new snapshot, reads
using the old snapshot cannot be allowed to insert data which can no
longer be reached by the update process. Before this patch this race
was prevented by the use of a phased_barrier, where readers would keep
phased_barrier::operation alive between starting a read of a partition
and inserting it into cache. Cache update was waiting for all prior
operations before starting the update. Any later read which was not
waited for would use the latest snapshot for reads, so the update
process didn't have to fix anything up for such reads.
After this change, later reads cannot always use the latest snapshot,
they have to use the snapshot corresponding to given entry. So it's
not enough for update() to wait for prior reads in order to prevent
stale populations. The (simple) solution implemented in this patch is
to detect the conflict and abandon population of given sub-range. In
general, reads are allowed to populate given range only if it belongs
to a single snapshot.
Note that the range here is not the whole query range. For population
of continuity, it is the range starting after the previous key and
ending after the key being inserted. When populating a partition
entry, the range is a singular range containing only the partition
key. Readers switch to new snapshots automatically as they move across
the ring. It's possible that the insertion of the partition doesn't
conflict, but continuity does. In such case the entry will be inserted
but continuity will not be set.
Currently every time cache needs to create reader for missing data it
obtains a reader which is most up to date. That reader includes writes
from later populate phases, for which update() was not yet
called. This will be problematic once we allow partitions to be
partially populated, because different parts of the partition could be
partially populated using readers using different sets of writes, and break
write atomicity.
The solution will be to always populate given partition using the same
set of writes, using reader created from the current snapshot. The
snapshot changes only on update(), with update() gradually converting
each partition to the new snapshot.
1) Reduce duplication by delegating to more general overloads
2) Improve documentation to not mention effects in terms of
population (detail) but rather write visibiliy
3) Rename clear() to invalidate() and merge with the range variant,
it has the same semantics
This object stores all read relevant context required all
over the place. This leads to a cleaner code.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
[tgrabiec:
- made read_context shareable to allow storing shared
mutable state later
- added range and cache getters
]
This is an abstraction that represents a reader
to the underlying source and auto updates itself
to make sure the reader reflects the latest state
of the underlying source.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
[tgrabiec: Add range getter to avoid friendships]
In commit c63e88d556, support was added for
fast_forward_to() in data_consume_rows(). Because an input stream's end
cannot be changed after creation, that patch ignores the specified end
byte, and uses the end of file as the end position of the stream.
As result of this, even when we want to read a specific byte range (e.g.,
in the repair code to checksum the partitions in a given range), the code
reads an entire 128K buffer around the end byte, or significantly more, with
read-ahead enabled. This causes repair to do more than 10 times the amount
of I/O it really has to do in the checksumming phase (which in the current
implementation, reads small ranges of partitions at a time).
This patch has two levels:
1. In the lower level, sstable::data_consume_rows(), which reads all
partitions in a given disk byte range, now gets another byte position,
"last_end". That can be the range's end, the end of the file, or anything
in between the two. It opens the disk stream until last_end, which means
1. we will never read-ahead beyond last_end, and 2. fast_fordward_to() is
not allowed beyond last_end.
2. In the upper level, we add to the various layers of sstable readers,
mutation readers, etc., a boolean flag mutation_reader::forwarding, which
says whether fast_forward_to() is allowed on the stream of mutations to
move the stream to a different partition range.
Note that this flag is separate from the existing boolean flag
streamed_mutation::fowarding - that one talks about skipping inside a
single partition, while the flag we are adding is about switching the
partition range being read. Most of the functions that previously
accepted streamed_mutation::forwarding now accept *also* the option
mutation_reader::forwarding. The exception are functions which are known
to read only a single partition, and not support fast_forward_to() a
different partition range.
We note that if mutation_reader::forwarding::no is requested, and
fast_forward_to() is forbidden, there is no point in reading anything
beyond the range's end, so data_consume_rows() is called with last_end as
the range's end. But if forwarding::yes is requested, we use the end of the
file as last_end, exactly like the code before this patch did.
Importantly, we note that the repair's partition reading code,
column_family::make_streaming_reader, uses mutation_reader::forwarding::no,
while the other existing reading code will use the default forwarding::yes.
In the future, we can further optimize the amount of bytes read from disk
by replacing forwarding::yes by an actual last partition that may ever be
read, and use its byte position as the last_end passed to data_consume_rows.
But we don't do this yet, and it's not a regression from the existing code,
which also opened the file input stream until the end of the file, and not
until the end of the range query. Moreover, such an improvement will not
improve of anything if the overall range is always very large, in which
case not over-reading at its end will not improve performance.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170619152629.11703-1-nyh@scylladb.com>
This reverts commit 317d7fc253 (and also the
related 2c57ab84b2). It causes crashes
during range scans, reported by Gleb:
"To reproduce I run SELECT * FROM keyspace1.standard1; on typical c-s
dataset and 3 node cluster.
Backtrace:
at /home/gleb/work/seastar/seastar/core/apply.hh:36
rvalue=<unknown type in /home/gleb/work/seastar/build/release/scylla, CU 0x54cf307, DIE 0x55ebf2a>) at /home/gleb/work/seastar/seastar/core/do_with.hh:57
range=std::vector of length 6, capacity 8 = {...}) at /home/gleb/work/seastar/seastar/core/future-util.hh:142
at ./seastar/core/future.hh:890
at /home/gleb/work/seastar/seastar/core/future-util.hh:119
at /home/gleb/work/seastar/seastar/core/future-util.hh:142
In commit c63e88d556, support was added for
fast_forward_to() in data_consume_rows(). Because an input stream's end
cannot be changed after creation, that patch ignores the specified end
byte, and uses the end of file as the end position of the stream.
As result of this, even when we want to read a specific byte range (e.g.,
in the repair code to checksum the partitions in a given range), the code
reads an entire 128K buffer around the end byte, or significantly more, with
read-ahead enabled. This causes repair to do more than 10 times the amount
of I/O it really has to do in the checksumming phase (which in the current
implementation, reads small ranges of partitions at a time).
This patch has two levels:
1. In the lower level, sstable::data_consume_rows(), which reads all
partitions in a given disk byte range, now gets another byte position,
"last_end". That can be the range's end, the end of the file, or anything
in between the two. It opens the disk stream until last_end, which means
1. we will never read-ahead beyond last_end, and 2. fast_fordward_to() is
not allowed beyond last_end.
2. In the upper level, we add to the various layers of sstable readers,
mutation readers, etc., a boolean flag mutation_reader::forwarding, which
says whether fast_forward_to() is allowed on the stream of mutations to
move the stream to a different partition range.
Note that this flag is separate from the existing boolean flag
streamed_mutation::fowarding - that one talks about skipping inside a
single partition, while the flag we are adding is about switching the
partition range being read. Most of the functions that previously
accepted streamed_mutation::forwarding now accept *also* the option
mutation_reader::forwarding. The exception are functions which are known
to read only a single partition, and not support fast_forward_to() a
different partition range.
We note that if mutation_reader::forwarding::no is requested, and
fast_forward_to() is forbidden, there is no point in reading anything
beyond the range's end, so data_consume_rows() is called with last_end as
the range's end. But if forwarding::yes is requested, we use the end of the
file as last_end, exactly like the code before this patch did.
Importantly, we note that the repair's partition reading code,
column_family::make_streaming_reader, uses mutation_reader::forwarding::no,
while the other existing reading code will use the default forwarding::yes.
In the future, we can further optimize the amount of bytes read from disk
by replacing forwarding::yes by an actual last partition that may ever be
read, and use its byte position as the last_end passed to data_consume_rows.
But we don't do this yet, and it's not a regression from the existing code,
which also opened the file input stream until the end of the file, and not
until the end of the range query. Moreover, such an improvement will not
improve of anything if the overall range is always very large, in which
case not over-reading at its end will not improve performance.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170614072122.13473-1-nyh@scylladb.com>
- introcduced "seastarx.hh" header, which does a "using namespace seastar";
- 'net' namespace conflicts with seastar::net, renamed to 'netw'.
- 'transport' namespace conflicts with seastar::transport, renamed to
cql_transport.
- "logger" global variables now conflict with logger global type, renamed
to xlogger.
- other minor changes
_underlying is created with _range, which is captured by
reference. But range_and_underlyig_reader is moved after being
constructed by do_with(), so _range reference is invalidated.
Fixes#2377.
Message-Id: <1494492025-18091-1-git-send-email-tgrabiec@scylladb.com>
need_preempt() is always true in debug mode. Because of that, this loop
will never be executed. Rewrite it as a do-while loop so we are sure
that it is executed at least once - or exactly once in debug mode.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <1485913079-1283-1-git-send-email-glauber@scylladb.com>