The latest virtual dirty patches broke the SSTable tests. The reason for
this is that those tests will flush synthetic memtables that do not have
a region_group attached to it.
Normally in cases like this we would just give the flush_reader an empty
region group. However, the memtable class constructor takes a
region_group pointer and that can be null according to the interface.
So we must conditionally test it.
If there isn't a region_group involved, the virtual dirty accounting
should be disabled: after all, we won't even have the baseline memory
to begin with.
One of the approaches to fix this could be to just provide null
accounter classes to be used as a surrogate for the accounting classes
in this case. However, since this is mostly used for tests, a much
simpler way is to just revert back to the scanning reader in that case.
The scanning reader is similar enough to the flush_reader, except that
it can handle partial ranges, slices, and delegate accesses to an
sstable post-flush. We don't need any of that, but as argued above,
there is no need to remove it either.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
Message-Id: <1475667271-60806-1-git-send-email-glommer@scylladb.com>
Scylla currently suffers from a brick wall behavior of the request throttler.
Requests pile up until we reach the dirty memory limit, at which point we stop
serving them until we have freed enough memory to allow for more requests.
The problem is that freeing dirty memory means writing an SSTable to completion.
That can take a long time, even if we are blessed with great disks. Those long
waiting times can and will translate into timeouts. That is bad behavior.
What this patch does is introduce one form of virtual dirty memory accounting.
Instead of allowing 100 % of the dirty memory to be filled up until we stop
accepting requests, we will do that when we reach 50 % of memory. However,
instead of releasing requests only when an SSTable is fully written, we start
releasing them when some memory was written.
The practical effect of that is that once we reach 50 % occupancy in our dirty
memory region, we will bring the system from CPU speed to disk speed, and will
start accepting requests only at the rate we are able to write memory back.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The code that is common will live in its own reader, the iterator_reader. All
friendly private access to memtable attributes and methods happen through the
iterator reader.
After this patch, we are now left with the scanning_reader - same as always,
but now implemented on top of the iterator_reader, and a flush_reader, which
will be used by SSTable flushes only.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Right now the special reader doesn't do much, but the idea is that we will
soon replace it will a reader that specializes in flush, and is in turn able
to provide read-side on-flush functionality like virtual dirty.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Remove clustering_key_filter_factory and clustering_key_filtering_context.
Use partition_slice directly with a static get_ranges method.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
The LSA memory pressure mechanism will let us know which region is the best
candidate for eviction when under pressure. We need to somehow then translate
region -> memtable -> column family.
The easiest way to convert from region to memtable, is having memtable inherit
from region. Despite the fact that this requires multiple inheritance, which
always raise a flag a bit, the other class we inherit from is
enable_shared_from_this, which has a very simple and well defined interface. So
I think it is worthy for us to do it.
Once we have the memtable, grabing the column family is easy provided we have a
database object. We can grab it from the schema.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
There are situations when a memtable is already flushed but the memtable
reader will continue to be in place, relaying reads to the underlying
table.
For that reason, the "memtables don't need a priority class" argument
gets obviously broken. We need to pass a priority class for its reader
as well.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
There is one current schema for given column_family. Entries in
memtables and cache can be at any of the previous schemas, but they're
always upgraded to current schema on access.
The intent is to make data returned by queries always conform to a
single schema version, which is requested by the client. For CQL
queries, for example, we want to use the same schema which was used to
compile the query. The other node expects to receive data conforming
to the requested schema.
Interface on shard level accepts schema_ptr, across nodes we use
table_schema_version UUID. To transfer schema_ptr across shards, we
use global_schema_ptr.
Because schema is identified with UUID across nodes, requestors must
be prepared for being queried for the definition of the schema. They
must hold a live schema_ptr around the request. This guarantees that
schema_registry will always know about the requested version. This is
not an issue because for queries the requestor needs to hold on to the
schema anyway to be able to interpret the results. But care must be
taken to always use the same schema version for making the request and
parsing the results.
Schema requesting across nodes is currently stubbed (throws runtime
exception).
Schema is tracked in memtable and cache per-entry. Entries are
upgraded lazily on access. Incoming mutations are upgraded to table's
current schema on given shard.
Mutating nodes need to keep schema_ptr alive in case schema version is
requested by target node.
Fixes#309.
When scanning memtable readers detect is was flushed, which means that
it started to be moved to cache, they fall back to reading from
memtable's sstable.
Eventually what we should do is to combine memtable and cache contents
so that as long as data is not evicted we won't do IO. We do not
support scanning in cache yet though, so there is no point in doing
this now, and it is not trivial.
The goal is to make allocation less likely to fail. With async
reclaimer there is an implicit bound on the amount of memory that can
be allocated between deferring points. This bound is difficult to
enforce though. Sync reclaimer lifts this limitation off.
Also, allocations which could not be satisfied before because of
fragmentation now will have higher chances of succeeding, although
depending on how much memory is fragmented, that could involve
evicting a lot of segments from cache, so we should still avoid them.
Downside of sync reclaiming is that now references into regions may be
invalidated not only across deferring points but at any allocation
site. compaction_lock can be used to pin data, preferably just
temporarily.
This change abstracts reading from on-disk data sources behind a single
reader which is then composed with memtable readers. This change also
abstracts all data sources behind a single reader obtained via
column_family::make_reader(). That reader is then used by algorithms
like column_family::for_all_partitions() or
column_family::query(). Having those abstractions will make it easier
to add row cache, because it will be encapsulated in a single place.
The "mutation_reader" defined in database.cc is a convenient mechanism
for iterating over mutations. It can be useful for more than just
database.cc (I want to use it in the compaction code), so this patch moves
the type's definition to mutation.hh, and the make_memtable_reader()
function to memtable::make_reader() (in memtable.hh).
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
* Forward commitlog replay_position to column_family.memtable, updating
highest RP if needed
* When flushing memtable, signal back to commitlog that RP has been dealt with
to potentially remove finished segment(s)
Note: since memtable flushing right now is _not_ explicitly ordered,
this does not actually work, since we need to guarantee ordering with
regards to RP. I.e. if we flush N blocks, we must guarantee that:
a.) We report "flushed RP" in RP order
b.) For a given RP1, all RP* lower than RP1 must also have been flushed.
(The latter means that it is fine to say, flush X tables at the same time, as long as we report a single RP that is the highest, and no lower RP:s exist in non-flushed tables)
I am however letting someone else deal with ensuring MT->sstable flush order.
Signed-off-by: Calle Wilund <calle@cloudius-systems.com>
Following what happened to others: we can now include memtable.hh
without including database.hh
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>