Wrapping ranges are a pain, so we are moving wrap handling to the edges.
Since cql can't generate wrapping ranges, this means thrift and the ring
maintenance code; also range->ring transformations need to merge the first
and last ranges.
Message-Id: <1478105905-31613-1-git-send-email-avi@scylladb.com>
Fast forwarding of memtable readers is needed only for unit tests which
often use memtables as underlying data source for cache and the cache
readers.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Memory accounting code was attaching partition_snapshot to
partition_entry in order to calculate the size of partition_version
object. However, it is only allowed if partition_entry doesn't have
any snapshot attached already. In this case it always has one, created
by the flushing reader.
Change the accounting code to reuse existing partition_snapshot reference.
Fixes#1746
Message-Id: <1476449160-9252-1-git-send-email-tgrabiec@scylladb.com>
The latest virtual dirty patches broke the SSTable tests. The reason for
this is that those tests will flush synthetic memtables that do not have
a region_group attached to it.
Normally in cases like this we would just give the flush_reader an empty
region group. However, the memtable class constructor takes a
region_group pointer and that can be null according to the interface.
So we must conditionally test it.
If there isn't a region_group involved, the virtual dirty accounting
should be disabled: after all, we won't even have the baseline memory
to begin with.
One of the approaches to fix this could be to just provide null
accounter classes to be used as a surrogate for the accounting classes
in this case. However, since this is mostly used for tests, a much
simpler way is to just revert back to the scanning reader in that case.
The scanning reader is similar enough to the flush_reader, except that
it can handle partial ranges, slices, and delegate accesses to an
sstable post-flush. We don't need any of that, but as argued above,
there is no need to remove it either.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
Message-Id: <1475667271-60806-1-git-send-email-glommer@scylladb.com>
Scylla currently suffers from a brick wall behavior of the request throttler.
Requests pile up until we reach the dirty memory limit, at which point we stop
serving them until we have freed enough memory to allow for more requests.
The problem is that freeing dirty memory means writing an SSTable to completion.
That can take a long time, even if we are blessed with great disks. Those long
waiting times can and will translate into timeouts. That is bad behavior.
What this patch does is introduce one form of virtual dirty memory accounting.
Instead of allowing 100 % of the dirty memory to be filled up until we stop
accepting requests, we will do that when we reach 50 % of memory. However,
instead of releasing requests only when an SSTable is fully written, we start
releasing them when some memory was written.
The practical effect of that is that once we reach 50 % occupancy in our dirty
memory region, we will bring the system from CPU speed to disk speed, and will
start accepting requests only at the rate we are able to write memory back.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The code that is common will live in its own reader, the iterator_reader. All
friendly private access to memtable attributes and methods happen through the
iterator reader.
After this patch, we are now left with the scanning_reader - same as always,
but now implemented on top of the iterator_reader, and a flush_reader, which
will be used by SSTable flushes only.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Right now the special reader doesn't do much, but the idea is that we will
soon replace it will a reader that specializes in flush, and is in turn able
to provide read-side on-flush functionality like virtual dirty.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Remove clustering_key_filter_factory and clustering_key_filtering_context.
Use partition_slice directly with a static get_ranges method.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
This fixes the problem of multiple concurrent get_ranges calls.
Previously each call was invalidating the result of the previous
call. Now they don't step on each other foot.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
The LSA memory pressure mechanism will let us know which region is the best
candidate for eviction when under pressure. We need to somehow then translate
region -> memtable -> column family.
The easiest way to convert from region to memtable, is having memtable inherit
from region. Despite the fact that this requires multiple inheritance, which
always raise a flag a bit, the other class we inherit from is
enable_shared_from_this, which has a very simple and well defined interface. So
I think it is worthy for us to do it.
Once we have the memtable, grabing the column family is easy provided we have a
database object. We can grab it from the schema.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Add additional parameters to mp_row_consumer to be able to fetch
only cells for given clustering key ranges
This will be used in row_cache when it will work on clustering key
level instead of partition key level.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Not all SSTable readers will end up getting the right tag for a priority
class. In particular, the range reader, also used for the memtables complete
ignores any priority class.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
There are situations when a memtable is already flushed but the memtable
reader will continue to be in place, relaying reads to the underlying
table.
For that reason, the "memtables don't need a priority class" argument
gets obviously broken. We need to pass a priority class for its reader
as well.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Its definition as a lambda function is inconvenient, because it does not allow
us to use default values for parameters.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Its definition as a lambda function is inconvenient, because it does not allow
us to use default values for parameters.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
There is one current schema for given column_family. Entries in
memtables and cache can be at any of the previous schemas, but they're
always upgraded to current schema on access.
The intent is to make data returned by queries always conform to a
single schema version, which is requested by the client. For CQL
queries, for example, we want to use the same schema which was used to
compile the query. The other node expects to receive data conforming
to the requested schema.
Interface on shard level accepts schema_ptr, across nodes we use
table_schema_version UUID. To transfer schema_ptr across shards, we
use global_schema_ptr.
Because schema is identified with UUID across nodes, requestors must
be prepared for being queried for the definition of the schema. They
must hold a live schema_ptr around the request. This guarantees that
schema_registry will always know about the requested version. This is
not an issue because for queries the requestor needs to hold on to the
schema anyway to be able to interpret the results. But care must be
taken to always use the same schema version for making the request and
parsing the results.
Schema requesting across nodes is currently stubbed (throws runtime
exception).
Schema is tracked in memtable and cache per-entry. Entries are
upgraded lazily on access. Incoming mutations are upgraded to table's
current schema on given shard.
Mutating nodes need to keep schema_ptr alive in case schema version is
requested by target node.
scanning_reader has a bug in its range support when it iterates over a
memtable which is still open, and thus might still be modified between
calls to the read function.
This caused, among other things, issue #368 - where repair was reading
a memtable which was still open and being written to (by a stream from a
a remote node).
The problem is that scanning_reader has an optimization so it can avoid
comparing the current partition with the range's end on every iteration:
It finds, once, a pointer to the element past the end of the range (the
so-called "upper bound"), and saves this pointer in _end. Then at every
iteration, we can just compare pointers.
But If partitions are added to the memtable, the _end we saved is no longer
relevant: It still points to a valid partition, but this partition which
was once the first partition *after* the range, may now be precedeed by
many new partitions, which may be now returned despite being after the
range's end.
The fix is to re-calculate "_end" if partitions were added to the memtable.
Moreover, we also need to re-calculate "_i" in this case - the current code
calculates in one iteration a pointer, _i, to the element to be returned in
the *next* iteration. If additional partitions were added in the meantime,
we may need to return them.
Because it's impossible to delete partitions from a memtable (just to
add new ones or modify existing ones), we can trivially figure out if
new partitions were added, using _memtable->partition_count(). Because
boost::intrusive::set defaults to constant_time_size(true), using this
count is efficient.
Fixes#368.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Fixes#309.
When scanning memtable readers detect is was flushed, which means that
it started to be moved to cache, they fall back to reading from
memtable's sstable.
Eventually what we should do is to combine memtable and cache contents
so that as long as data is not evicted we won't do IO. We do not
support scanning in cache yet though, so there is no point in doing
this now, and it is not trivial.
Disabling compaction of a region is currently done in order to keep
the references valid. But disabling only compaction is not enough, we
also need to disable eviction, as it also invalidates
references. Rather than introducing another type of lock, compaction
and eviction are controlled together, generalized as "reclaiming"
(hence the reclaim_lock).
The goal is to make allocation less likely to fail. With async
reclaimer there is an implicit bound on the amount of memory that can
be allocated between deferring points. This bound is difficult to
enforce though. Sync reclaimer lifts this limitation off.
Also, allocations which could not be satisfied before because of
fragmentation now will have higher chances of succeeding, although
depending on how much memory is fragmented, that could involve
evicting a lot of segments from cache, so we should still avoid them.
Downside of sync reclaiming is that now references into regions may be
invalidated not only across deferring points but at any allocation
site. compaction_lock can be used to pin data, preferably just
temporarily.
Using a lambda for implementing a mutation_reader is nifty, but does not
allow us to add methods.
Switch to a class-based implementation in anticipation of adding a close()
method.