Commit Graph

54 Commits

Author SHA1 Message Date
Tomasz Grabiec
892d4a2165 db: Enable creating forwardable readers via mutation_source
Right now all mutation source implementations will use
make_forwardable() wrapper.
2017-02-23 18:50:44 +01:00
Tomasz Grabiec
2cc27f72ca memtable: Accept all mutation_source parameters 2017-02-23 18:23:52 +01:00
Asias He
e5485f3ea6 Get rid of query::partition_range
Use dht::partition_range instead
2016-12-19 08:09:25 +08:00
Tomasz Grabiec
527ff6aa40 db: Clear memtable after flush when cache is disabled
So that memory is released gradually (impacting latency less) and
sooner than when memtable is destroyed. Active readers may keep the
memtable alive for unbounded amount of time.

Refs #1879
2016-12-05 12:59:09 +01:00
Tomasz Grabiec
1bba51319e memtable: Maintain virtual dirty on clear()
When memtable is flushing, it subtracts _flushed_memory from groups's
size to gradually allow more writes. Ideally _flushed_memory would be
equal to region's size when flush ends, so the group's size would
reach zero. When the memtable and its region are gone the group size
should remain the same as after the flush. This is ensured by adding
back _flushed_memory to group's size right before the region is
removed from the group.

Calling clear() before region is removed from the group breaks the
accounting because it will shrink the region, but will not affect the
amount of memory subtracted due to _flushed_memory. So group's size
would decrease more than we want (twice the region's size). The fix is
to change clear() so that it reverts _flushed_memory by the amount by
which the region size is reduced. This will keep the groups's size
constant as long as _flushed_memory > 0.
2016-12-05 12:59:09 +01:00
Tomasz Grabiec
1b5f338c17 memtable: Track flushed memory in memtable object 2016-12-05 12:59:09 +01:00
Tomasz Grabiec
c3768fe4de memtable: Pass dirty_memory_manager& to memtable constructor
The implementation assumes that memtable's region group is owned by
dirty_memory_manager, and tries to obtain a reference to it like this:

  boost::intrusive::get_parent_from_member(_region.group(), &dirty_memory_manager::_region_group));

This is undefined behavior when the region's group does not come from
dirty manager. It's safer to be explicit about this dependency by
taking a reference to dirty_memory_manager in the constructor.
2016-12-05 12:59:09 +01:00
Glauber Costa
0ca8c3f162 database: keep a pointer to the memtable list in a memtable
We current pass a region group to the memtable, but after so many recent
changes, that is a bit too low level. This patch changes that so we pass
a memtable list instead.

Doing that also has a couple of advantages. Mainly, during flush we must
get to a memtable to a memtable_list. Currently we do that by going to
the memtable to a column family through the schema, and from there to
the memtable_list.

That, however, involves calling virtual functions in a derived class,
because a single column family could have both streaming and normal
memtables. If we pass a memtable_list to the memtable, we can keep
pointer, and when needed get the memtable_list directly.

Not only that gets rid of the inheritance for aesthetic reasons, but
that inheritance is not even correct anymore. Since the introduction of
the big streaming memtables, we now have a plethora of lists per column
family and this transversal is totally wrong. We haven't noticed before
because we were flushing the memtables based on their individual sizes,
but it has been wrong all along for edge cases in which we would have to
resort to size-based flush. This could be the case, for instance, with
various plan_ids in flight at the same time.

At this point, there is no more reason to keep the derived classes for
the dirty_memory_manager. I'm only keeping them around to reduce
clutter, although they are useful for the specialized constructors and
to communicate to the reader exactly what they are. But those can be
removed in a follow up patch if we want.

The old memtable constructor signature is kept around for the benefit of
two tests in memtable_tests which have their own flush logic. In the
future we could do something like we do for the SSTable tests, and have
a proxy class that is friends with the memtable class. That too, is left
for the future.

Fixes #1870

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <811ec9e8e123dc5fc26eadbda82b0bae906657a9.1479743266.git.glauber@scylladb.com>
2016-11-21 18:18:27 +02:00
Paweł Dziepak
ef57b9a26f rename memory_usage() to external_memory_usage() where applicable
Renaming the function to external_memory_usage() makes it clear that
sizeof(T) is not included, something that was a source of confusion in
the past.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-11-18 11:25:36 +00:00
Glauber Costa
0b337dab14 memtable: add a method to expose the region_group
That is technically not needed because a memtable inherits from group. So
whenever we have a memtable, we can use it's group() method to obtain a
group for it, and then from there go to the region_group.

However, region() is a const method in the memtable, so we have to play
trick with the const_cast, or remove the constness from the region. An
alternative to that, which I prefer, is to expose a method for the
region_group directly from the memtable object that does the right thing
and bypasses all that.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-11-16 21:20:24 -05:00
Paweł Dziepak
6755a679f6 drop key readers
key_readers weren't used since introduction of continuity flag to cache
entries.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Glauber Costa
7146776d7c fix sstable tests by not using the flush_reader if no region_group
The latest virtual dirty patches broke the SSTable tests. The reason for
this is that those tests will flush synthetic memtables that do not have
a region_group attached to it.

Normally in cases like this we would just give the flush_reader an empty
region group. However, the memtable class constructor takes a
region_group pointer and that can be null according to the interface.
So we must conditionally test it.

If there isn't a region_group involved, the virtual dirty accounting
should be disabled: after all, we won't even have the baseline memory
to begin with.

One of the approaches to fix this could be to just provide null
accounter classes to be used as a surrogate for the accounting classes
in this case. However, since this is mostly used for tests, a much
simpler way is to just revert back to the scanning reader in that case.

The scanning reader is similar enough to the flush_reader, except that
it can handle partial ranges, slices, and delegate accesses to an
sstable post-flush. We don't need any of that, but as argued above,
there is no need to remove it either.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
Message-Id: <1475667271-60806-1-git-send-email-glommer@scylladb.com>
2016-10-05 12:44:21 +01:00
Glauber Costa
f89a67c75c database: allow virtual dirty memory management
Scylla currently suffers from a brick wall behavior of the request throttler.
Requests pile up until we reach the dirty memory limit, at which point we stop
serving them until we have freed enough memory to allow for more requests.

The problem is that freeing dirty memory means writing an SSTable to completion.
That can take a long time, even if we are blessed with great disks. Those long
waiting times can and will translate into timeouts. That is bad behavior.

What this patch does is introduce one form of virtual dirty memory accounting.
Instead of allowing 100 % of the dirty memory to be filled up until we stop
accepting requests, we will do that when we reach 50 % of memory. However,
instead of releasing requests only when an SSTable is fully written, we start
releasing them when some memory was written.

The practical effect of that is that once we reach 50 % occupancy in our dirty
memory region, we will bring the system from CPU speed to disk speed, and will
start accepting requests only at the rate we are able to write memory back.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-04 10:39:10 -04:00
Glauber Costa
eee15578fb memtables: split scanning reader in two
The code that is common will live in its own reader, the iterator_reader.  All
friendly private access to memtable attributes and methods happen through the
iterator reader.

After this patch, we are now left with the scanning_reader - same as always,
but now implemented on top of the iterator_reader, and a flush_reader, which
will be used by SSTable flushes only.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-04 10:39:10 -04:00
Glauber Costa
16886eeb96 sstables: use special reader for writing a memtable
Right now the special reader doesn't do much, but the idea is that we will
soon replace it will a reader that specializes in flush, and is in turn able
to provide read-side on-flush functionality like virtual dirty.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-04 10:39:10 -04:00
Piotr Jastrzebski
3607d99269 Remove clustering_key_filtering_context.
Remove clustering_key_filter_factory and clustering_key_filtering_context.
Use partition_slice directly with a static get_ranges method.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-08-30 20:31:55 +02:00
Glauber Costa
d41fcd45d1 memtables: make memtable inherit from region
The LSA memory pressure mechanism will let us know which region is the best
candidate for eviction when under pressure. We need to somehow then translate
region -> memtable -> column family.

The easiest way to convert from region to memtable, is having memtable inherit
from region. Despite the fact that this requires multiple inheritance, which
always raise a flag a bit, the other class we inherit from is
enable_shared_from_this, which has a very simple and well defined interface. So
I think it is worthy for us to do it.

Once we have the memtable, grabing the column family is easy provided we have a
database object. We can grab it from the schema.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-07-05 15:05:29 -04:00
Paweł Dziepak
6871bd5fa0 memtable: fully support streamed_mutations
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-06-20 21:29:52 +01:00
Paweł Dziepak
2ab1a73efa memtable: rename partition_entry to memtable_entry
partition_entry is going to be a more general object used by both
cache and memtable entries.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-06-20 21:29:51 +01:00
Piotr Jastrzebski
23c23abe53 Make memtable mutation_reader slice using clustering ranges.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-05-16 11:46:41 +02:00
Pekka Enberg
38a54df863 Fix pre-ScyllaDB copyright statements
People keep tripping over the old copyrights and copy-pasting them to
new files. Search and replace "Cloudius Systems" with "ScyllaDB".

Message-Id: <1460013664-25966-1-git-send-email-penberg@scylladb.com>
2016-04-08 08:12:47 +03:00
Glauber Costa
80ab41a715 memtable reader: also include a priority class
There are situations when a memtable is already flushed but the memtable
reader will continue to be in place, relaying reads to the underlying
table.

For that reason, the "memtables don't need a priority class" argument
gets obviously broken. We need to pass a priority class for its reader
as well.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-02-24 18:00:34 -05:00
Tomasz Grabiec
d81a46d7b5 column_family: Add schema setters
There is one current schema for given column_family. Entries in
memtables and cache can be at any of the previous schemas, but they're
always upgraded to current schema on access.
2016-01-11 10:34:52 +01:00
Tomasz Grabiec
4e5a52d6fa db: Make read interface schema version aware
The intent is to make data returned by queries always conform to a
single schema version, which is requested by the client. For CQL
queries, for example, we want to use the same schema which was used to
compile the query. The other node expects to receive data conforming
to the requested schema.

Interface on shard level accepts schema_ptr, across nodes we use
table_schema_version UUID. To transfer schema_ptr across shards, we
use global_schema_ptr.

Because schema is identified with UUID across nodes, requestors must
be prepared for being queried for the definition of the schema. They
must hold a live schema_ptr around the request. This guarantees that
schema_registry will always know about the requested version. This is
not an issue because for queries the requestor needs to hold on to the
schema anyway to be able to interpret the results. But care must be
taken to always use the same schema version for making the request and
parsing the results.

Schema requesting across nodes is currently stubbed (throws runtime
exception).
2016-01-11 10:34:52 +01:00
Tomasz Grabiec
036974e19b Make mutation interfaces support multiple versions
Schema is tracked in memtable and cache per-entry. Entries are
upgraded lazily on access. Incoming mutations are upgraded to table's
current schema on given shard.

Mutating nodes need to keep schema_ptr alive in case schema version is
requested by target node.
2016-01-11 10:34:51 +01:00
Tomasz Grabiec
8a05b61d68 memtable: Read under _read_section 2016-01-11 10:34:51 +01:00
Tomasz Grabiec
5184381a0b memtable: Deconstify memtable in readers
We want to upgrade entries on read and for that we need mutating
permission.
2016-01-11 10:34:51 +01:00
Tomasz Grabiec
32ac2ccc4a memtable: Introduce apply(memtable&) 2015-11-29 16:25:21 +01:00
Avi Kivity
a40a62d840 memtable: use allocating_section to guard allocations
Without this, an allocation can fail, and we may not be able to reclaim
memory.
2015-11-16 10:56:06 +02:00
Paweł Dziepak
b0edaa5bb7 memtable: add as_key_source()
Needed only for tests.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2015-10-20 20:27:53 +02:00
Avi Kivity
d5cf0fb2b1 Add license notices 2015-09-20 10:43:39 +03:00
Amnon Heiman
c102031fe2 memtable: Expose the region object
This adds  a const getter for the region object

Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
2015-09-10 00:20:26 +03:00
Tomasz Grabiec
a0c180ef49 memtable: Fix flush in the middle of scanning bug
Fixes #309.

When scanning memtable readers detect is was flushed, which means that
it started to be moved to cache, they fall back to reading from
memtable's sstable.

Eventually what we should do is to combine memtable and cache contents
so that as long as data is not evicted we won't do IO. We do not
support scanning in cache yet though, so there is no point in doing
this now, and it is not trivial.
2015-09-09 10:17:35 +02:00
Tomasz Grabiec
d20fae96a2 lsa: Make reclaimer run synchronously with allocations
The goal is to make allocation less likely to fail. With async
reclaimer there is an implicit bound on the amount of memory that can
be allocated between deferring points. This bound is difficult to
enforce though. Sync reclaimer lifts this limitation off.

Also, allocations which could not be satisfied before because of
fragmentation now will have higher chances of succeeding, although
depending on how much memory is fragmented, that could involve
evicting a lot of segments from cache, so we should still avoid them.

Downside of sync reclaiming is that now references into regions may be
invalidated not only across deferring points but at any allocation
site. compaction_lock can be used to pin data, preferably just
temporarily.
2015-08-31 21:50:18 +02:00
Tomasz Grabiec
ff8c81b25f memtable: Encapsulate unsafe accessors 2015-08-31 21:50:17 +02:00
Tomasz Grabiec
d9ce307c6a memtable: Add non-const partition_entry::key() variant
Helps moving from memtable to cache.
2015-08-31 14:54:26 +03:00
Avi Kivity
1579f86503 memtable: keep the lsa region alive while partitions are being destroyed
Or we get a use-after-free.  Reported by Pekka.
2015-08-20 15:32:30 +03:00
Avi Kivity
c175025bb6 db: place all memtables into a single region_group
We can use this to track the amount of unevictable memory in the
system.
2015-08-19 19:36:41 +03:00
Tomasz Grabiec
3b92ba2857 db: Add memtable flush logging 2015-08-06 14:05:16 +02:00
Tomasz Grabiec
cda31eccf7 db: Use LSA to allocate data inside memtable 2015-08-06 14:05:16 +02:00
Tomasz Grabiec
fe4c75dee6 memtable: Remove unused find_partition() 2015-08-06 14:05:16 +02:00
Tomasz Grabiec
1046ee6e80 memtable: Remove all_partitions()
Preferred way to access the memtable is via reader.
2015-08-06 14:05:16 +02:00
Avi Kivity
182d5ab798 memtable: fix memory leak
Since memtable::partitions is now an intrusive_set, it must be cleared
explicitly, or memory is leaked.
2015-07-26 20:01:50 +03:00
Tomasz Grabiec
9c4956c5dc memtable: Use boost::intrusive_set<> to store partition entries
So that we can use heterogenous comparators. For range queries we will
need to compare keys with ring_position.
2015-07-22 13:14:33 +02:00
Tomasz Grabiec
da937897cf memtable: Introduce as_data_source() 2015-07-09 19:46:29 +02:00
Tomasz Grabiec
51cae834e3 db: Put all sstables behind single reader
This change abstracts reading from on-disk data sources behind a single
reader which is then composed with memtable readers. This change also
abstracts all data sources behind a single reader obtained via
column_family::make_reader(). That reader is then used by algorithms
like column_family::for_all_partitions() or
column_family::query(). Having those abstractions will make it easier
to add row cache, because it will be encapsulated in a single place.
2015-06-18 16:33:33 +02:00
Tomasz Grabiec
bc468f9a0e memtable: Make memtable inherit from enable_lw_shared_from_this 2015-06-18 15:48:21 +02:00
Tomasz Grabiec
7f1ff0401e db: Move mutation_reader definition to separate header 2015-06-18 15:47:40 +02:00
Nadav Har'El
78a8ac8470 Make mutation_reader usable outside database.cc
The "mutation_reader" defined in database.cc is a convenient mechanism
for iterating over mutations. It can be useful for more than just
database.cc (I want to use it in the compaction code), so this patch moves
the type's definition to mutation.hh, and the make_memtable_reader()
function to memtable::make_reader() (in memtable.hh).

Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
2015-06-16 14:03:34 +02:00
Calle Wilund
293dbf66e3 Forward and use replay_position when applying mutation
* Forward commitlog replay_position to column_family.memtable, updating
  highest RP if needed
* When flushing memtable, signal back to commitlog that RP has been dealt with
  to potentially remove finished segment(s)

Note: since memtable flushing right now is _not_ explicitly ordered,
this does not actually work, since we need to guarantee ordering with
regards to RP. I.e. if we flush N blocks, we must guarantee that:
a.) We report "flushed RP" in RP order
b.) For a given RP1, all RP* lower than RP1 must also have been flushed.
(The latter means that it is fine to say, flush X tables at the same time, as long as we report a single RP that is the highest, and no lower RP:s exist in non-flushed tables)

I am however letting someone else deal with ensuring MT->sstable flush order.

Signed-off-by: Calle Wilund <calle@cloudius-systems.com>
2015-06-03 12:38:13 +03:00