Commit Graph

632 Commits

Author SHA1 Message Date
Glauber Costa
d2438059a7 database: keep a pointer to the memtable list in a memtable
We currently pass a region group to the memtable, but after so many recent
changes, that is a bit too low level. This patch changes that so we pass
a memtable list instead.

Doing that also has a couple of advantages. Mainly, during flush we must
get from a memtable to a memtable_list. Currently we do that by going from
the memtable to a column family through the schema, and from there to
the memtable_list.

That, however, involves calling virtual functions in a derived class,
because a single column family could have both streaming and normal
memtables. If we pass a memtable_list to the memtable, we can keep a
pointer to it, and when needed get the memtable_list directly.
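
Roughly, the shape of the change looks like this (illustrative standard C++,
not the actual Scylla classes): the memtable keeps a non-owning back-pointer
to the list that owns it, so the flush path can reach the memtable_list
directly, without going through the schema or any virtual dispatch.

    #include <memory>
    #include <vector>

    class memtable_list;   // forward declaration

    class memtable {
        memtable_list* _owner;   // non-owning back-pointer, set at construction
    public:
        explicit memtable(memtable_list* owner) : _owner(owner) {}
        memtable_list* get_memtable_list() const { return _owner; }
    };

    class memtable_list {
        std::vector<std::unique_ptr<memtable>> _memtables;
    public:
        memtable& add_memtable() {
            _memtables.push_back(std::make_unique<memtable>(this));
            return *_memtables.back();
        }
    };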

Not only does that get rid of the inheritance for aesthetic reasons, but
that inheritance is not even correct anymore. Since the introduction of
the big streaming memtables, we now have a plethora of lists per column
family and this traversal is totally wrong. We hadn't noticed before
because we were flushing the memtables based on their individual sizes,
but it has been wrong all along for edge cases in which we would have to
resort to a size-based flush. This could be the case, for instance, with
various plan_ids in flight at the same time.

At this point, there is no more reason to keep the derived classes for
the dirty_memory_manager. I'm only keeping them around to reduce
clutter, although they are useful for the specialized constructors and
to communicate to the reader exactly what they are. But those can be
removed in a follow-up patch if we want.

The old memtable constructor signature is kept around for the benefit of
two tests in memtable_tests which have their own flush logic. In the
future we could do something like we do for the SSTable tests, and have
a proxy class that is friends with the memtable class. That too, is left
for the future.

Fixes #1870

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <811ec9e8e123dc5fc26eadbda82b0bae906657a9.1479743266.git.glauber@scylladb.com>
(cherry picked from commit 0ca8c3f162)
2016-11-21 18:18:56 +02:00
Glauber Costa
4539b8403a database: fix direct flushes of non-durable column families.
If a Column Family is non-durable, then its flushes will never create a
memtable flush reader. Our current flush logic depends on that being
created and destroyed to release the semaphore permits on the flush.

We will remove the permits ourselves if there is an exception, but not
under normal circumstances. Given this issue, however, it would be more
adequate to always try to remove the permits after we flush. If the
permits were already removed by the flush reader, then this check will
just see that the permit is not in the map and return. But if it is
still there, then it is removed.
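
As a rough model of that idempotent release (hypothetical names, plain C++
rather than the real flush code): the flush path always tries to erase the
permit, and if the flush reader already returned it, the lookup simply misses.

    #include <cstdint>
    #include <unordered_map>

    struct semaphore {
        void signal(unsigned units) { (void)units; /* return units to the semaphore */ }
    };

    class flush_manager {
        semaphore _sem;
        std::unordered_map<uint64_t, unsigned> _permits;   // flush id -> units held
    public:
        // Safe to call more than once per flush: the second call is a no-op.
        void remove_permit(uint64_t flush_id) {
            auto it = _permits.find(flush_id);
            if (it == _permits.end()) {
                return;                  // already released by the flush reader
            }
            _sem.signal(it->second);     // give the units back
            _permits.erase(it);
        }
    };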

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <049334c3b4bef620af2c7c045e6c84347dcf9013.1479498026.git.glauber@scylladb.com>
(cherry picked from commit 1933349654)
2016-11-18 21:33:22 +01:00
Raphael S. Carvalho
558f535fcb db: do not leak deleted sstable when deletion triggers an exception
The leakage results in deleted sstables remaining open until shutdown, so their
disk space isn't released. That's because column_family::rebuild_sstable_list()
will not remove references to deleted sstables if an exception was triggered in
sstables::delete_atomically(). An sstable only has its files closed when its
object is destructed.

The exception happens when a major compaction is issued in parallel to a
regular one, and one of them will be unable to delete an sstable already deleted
by the other. That results in remove_by_toc_name() triggering
boost::filesystem::filesystem_error because the TOC and temporary TOC don't exist.

We wouldn't have seen this problem if major compaction were going through the
compaction manager, but remove_by_toc_name() and rebuild_sstable_list() should
be made resilient.

Fixes #1840.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <d43b2e78f9658e2c3c5bbb7f813756f18874bf92.1479390842.git.raphaelsc@scylladb.com>
(cherry picked from commit 3dc9294023)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <760f96d81de0bab7507bb4f52c06b30f21e82577.1479420770.git.raphaelsc@scylladb.com>
2016-11-18 13:10:46 +02:00
Glauber Costa
3d45d0d339 fix shutdown and exception conditions for flush logic
This patch addresses post-merge follow-up comments by Tomek.
Basically, what we do is:
- we don't need to signal() from remove_from_flush_manager(), because
  the explicit flushes no longer wait on the condition variable. So we
  don't.
- we now wait on the stop() flushes (regardless of their return status)
  so we can make sure that the _flush_queue is indeed done with.
- we acquire the semaphore before shutting down the dirty_memory_manager
  to make sure that there are no pending flushes.
- the flush manager that holds the semaphore has to match in the exception
  handler.
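
In other words, the shutdown ordering is roughly the following (a simplified,
synchronous stand-in in plain C++; the real code uses futures and a semaphore):

    #include <condition_variable>
    #include <mutex>

    // stop() waits for every in-flight flush, whatever its outcome, and
    // refuses new flushes, so the dirty_memory_manager can be torn down safely.
    class flush_tracker {
        std::mutex _m;
        std::condition_variable _cv;
        int _in_flight = 0;
        bool _stopping = false;
    public:
        bool try_start_flush() {
            std::lock_guard<std::mutex> l(_m);
            if (_stopping) { return false; }   // no new flushes during shutdown
            ++_in_flight;
            return true;
        }
        void flush_done() {                    // called on success *and* failure
            std::lock_guard<std::mutex> l(_m);
            --_in_flight;
            _cv.notify_all();
        }
        void stop() {
            std::unique_lock<std::mutex> l(_m);
            _stopping = true;
            _cv.wait(l, [this] { return _in_flight == 0; });
        }
    };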

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <a23ab5098934546c660a08de64cd9294bb3a2008.1479400239.git.glauber@scylladb.com>
(cherry picked from commit 461778918b)
2016-11-18 11:53:21 +02:00
Avi Kivity
affc0d9138 Merge "get rid of memtable size parameter and rework flush logic" from Glauber
"This patchset allows Scylla to determine the size of a memtable instead
of relying in the user-provided memtable_cleanup_threshold. It does that
by allowing the region_group to specify a soft limit which will trigger
the allocation as early as it is reached.

Given that, we'll keep the memtables in memory for as long as it takes
to reach that limit, regardless of the individual size of any single one
of them. That limit is set to 1/4 of dirty memory. That's the same as
the last submission, except this time I have run some experiments to gauge
the behavior of that versus 1/2 of dirty memory, which was the preferred
theoretical value.

After that is done, the flush logic is reworked to guarantee that
flushes are not initiated if we already have one memtable under flush.
That allows us to better take advantage of coalescing opportunities with
new requests and prevents the pending memtable explosion that is
ultimately responsible for Issue 1817.
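
A toy model of the trigger, assuming the 1/4 figure above (names and
thresholds are illustrative, not the actual implementation):

    #include <cstddef>

    // Decide whether to start a flush, given the current amount of dirty memory.
    struct flush_policy {
        std::size_t total_dirty_limit;        // hard dirty-memory limit
        bool flush_in_progress = false;       // at most one memtable flushing at a time

        bool should_start_flush(std::size_t dirty_bytes) const {
            std::size_t soft_limit = total_dirty_limit / 4;   // 1/4 of dirty memory
            return dirty_bytes >= soft_limit && !flush_in_progress;
        }
    };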

I have run mainly two workloads with this. The first one is a local RF=1
workload with large partitions sized 128kB, and 100 threads. The results
are:

Before:

op rate                   : 632 [WRITE:632]
partition rate            : 632 [WRITE:632]
row rate                  : 632 [WRITE:632]
latency mean              : 157.8 [WRITE:157.8]
latency median            : 115.5 [WRITE:115.5]
latency 95th percentile   : 486.7 [WRITE:486.7]
latency 99th percentile   : 534.8 [WRITE:534.8]
latency 99.9th percentile : 599.0 [WRITE:599.0]
latency max               : 722.6 [WRITE:722.6]
Total partitions          : 189667 [WRITE:189667]
Total errors              : 0 [WRITE:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:05:00
END

After:

op rate                   : 951 [WRITE:951]
partition rate            : 951 [WRITE:951]
row rate                  : 951 [WRITE:951]
latency mean              : 104.8 [WRITE:104.8]
latency median            : 102.5 [WRITE:102.5]
latency 95th percentile   : 155.8 [WRITE:155.8]
latency 99th percentile   : 177.8 [WRITE:177.8]
latency 99.9th percentile : 686.4 [WRITE:686.4]
latency max               : 1081.4 [WRITE:1081.4]
Total partitions          : 285324 [WRITE:285324]
Total errors              : 0 [WRITE:0]
total gc count            : 0
total gc mb               : 0
total gc time (s)         : 0
avg gc time(ms)           : NaN
stdev gc time(ms)         : 0
Total operation time      : 00:05:00
END

The other workload was the one described in #1817. The result
is that we now have a load that is very stable around 100k ops/s with
hardly any timeouts, instead of the 1.4 baseline of wild variations
around 100k ops/s with lots of timeouts, or the deep reduction of
1.5-rc1."

* 'issue-1817-v4' of github.com:glommer/scylla:
  database: rework memtable flush logic
  get rid of max_memtable_size
  pass a region to dirty_memory_manager accounting API
  memtable: add a method to expose the region_group
  logalloc: allow region group reclaimer to specify a soft limit
  database: remove outdated comment
  database: uphold virtual dirty for system tables.

(cherry picked from commit 5d067eebf2)
2016-11-17 14:41:23 +02:00
Raphael S. Carvalho
fa308c079c database: fix collectd metrics for clustering key filter
The same instance name was used for the exported metrics, which is
definitely wrong. Checked that it works properly now via the collectd
exporter.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <471a36706113af60aeba86fb56a365feb4dab31a.1477086706.git.raphaelsc@scylladb.com>
2016-10-22 09:51:18 +03:00
Paweł Dziepak
6755a679f6 drop key readers
key_readers haven't been used since the introduction of the continuity flag
in cache entries.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Paweł Dziepak
7bebfb851f database: enable fast forwarding of range_sstable_reader
When fast forwarding a reader that combines sstable readers, we must also
remember that the set of sstables for the new range may be different
from the set for the previous one. The reader introduced in this patch makes
sure that we read from the correct sstables.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-10-19 15:29:08 +01:00
Tomasz Grabiec
4357d0a6d9 db: Add counter for writes blocked on dirty memory
There is already queue_length-requests_blocked_memory, but it's a
gauge, so it does not reflect what happened between the sampling points.

total_operations-requests_blocked_memory will allow us to see whether there
were any (and how many) requests blocked by dirty memory.

Message-Id: <1476098616-12682-1-git-send-email-tgrabiec@scylladb.com>
2016-10-10 14:25:22 +03:00
Glauber Costa
33e9c2bbdd memtable: reduce sstable flush concurrency to one
Limiting the concurrency of memtable flushes to 4 was a temporary
workaround for the fact that we lacked good write-behind support. Now
that write-behind is properly merged, we can reduce the concurrency to
what it should be: one.

This means that memtable flushes will now be serialized, and only when
one of them ends will the next one begin. Disk parallelism is obtained
through the write-behind mechanism.
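
The serialization can be pictured with a single-unit semaphore (illustrative
standard C++; the real code uses a Seastar semaphore and futures):

    #include <semaphore>

    std::binary_semaphore flush_slot{1};   // concurrency of one: a single in-flight flush

    void flush_memtable_to_sstable(/* memtable&, sstable writer */) {
        flush_slot.acquire();              // wait until the previous flush finishes
        // ... write the memtable out; the write-behind mechanism in the sstable
        //     writer provides disk parallelism while we hold the single slot ...
        flush_slot.release();
    }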

Fixes #1373

Signed-off-by: Glauber Costa <glauber@scylladb.com>

Message-Id: <528f9ef928b5101bed952df600eb8555c275497a.1475881100.git.glauber@scylladb.com>
2016-10-09 10:48:57 +03:00
Tomasz Grabiec
2a5a90f391 db: Do not timeout streaming readers
There is a limit to the concurrency of sstable readers on each shard. When
this limit is exhausted (currently 100 readers), readers queue up. There
is a timeout after which queued readers are failed, equal to
read_request_timeout_in_ms (5s by default). The reason we have the
timeout here is primarily that the readers created for the purpose
of serving a CQL request no longer need to execute after waiting
longer than read_request_timeout_in_ms. The coordinator no longer
waits for the result, so there is no point in proceeding with the read.

This timeout should not apply to readers created for streaming. The
streaming client currently times out after 10 minutes, so we could
wait at least that long. Timing out sooner makes streaming unreliable,
which under high load may prevent streaming from completing.

The change sets no timeout for streaming readers at the replica level,
similarly to what we do for system table readers.
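
A simplified picture of the queueing policy (hypothetical helper, not the
actual reader-concurrency code): CQL readers wait with a deadline of
read_request_timeout_in_ms, while streaming and system readers wait without one.

    #include <chrono>
    #include <optional>

    enum class reader_kind { cql, streaming, system };

    // Pick the queue timeout for a reader waiting on the per-shard limit.
    std::optional<std::chrono::milliseconds>
    reader_queue_timeout(reader_kind kind, std::chrono::milliseconds read_request_timeout) {
        switch (kind) {
        case reader_kind::cql:
            return read_request_timeout;   // the coordinator gives up after this anyway
        case reader_kind::streaming:
        case reader_kind::system:
        default:
            return std::nullopt;           // no timeout: let them wait as long as needed
        }
    }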

Fixes #1741.

Message-Id: <1475840678-25606-1-git-send-email-tgrabiec@scylladb.com>
2016-10-07 15:41:04 +03:00
Raphael S. Carvalho
7ea4513595 database: trigger compaction after loading new sstables
Scylla wasn't trying to compact new sstables uploaded via 'nodetool
refresh'. Thus, all new sstables were left uncompacted until the user
issued 'nodetool flush' or a new sstable was written, which would
also trigger compaction.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <bbdf274c8bb49f4bedeefcb85da78a6fb61a1232.1475535203.git.raphaelsc@scylladb.com>
2016-10-06 18:26:49 +03:00
Avi Kivity
f8118d9fc2 Merge "Virtual dirty memory management" from Glauber
"Description:
============

Scylla currently suffers from a brick wall behavior of the request throttler.
Requests pile up until we reach the dirty memory limit, at which point we stop
serving them until we have freed enough memory to allow for more requests.

The problem is that freeing dirty memory means writing an SSTable to completion.
That can take a long time, even if we are blessed with great disks. Those long
waiting times can and will translate into timeouts. That is bad behavior.

What this patch does is introduce one form of virtual dirty memory accounting.
Instead of allowing 100 % of the dirty memory to fill up before we stop
accepting requests, we will do that when we reach 50 % of memory. However,
instead of releasing requests only when an SSTable is fully written, we start
releasing them as soon as some memory has been written.

The practical effect of that is that once we reach 50 % occupancy in our dirty
memory region, we will bring the system from CPU speed to disk speed, and will
start accepting requests only at the rate we are able to write memory back.

Results
=======

With this patchset running a load big enough to easily saturate the disk,
(commitlog disabled to highlight the effects of the memtable writer), I am able
to run scylla for many minutes, with timeouts occurring only when I run out of
disk space, whereas without this patch a swarm of timeouts would start merely 2
seconds after the load started - and would never get stable.

In V2, I have sent a set of graphs illustrating the performance of this solution.
This version does not have any significant differences in that front.

For details, please refer to
https://groups.google.com/d/msg/scylladb-dev/iCvD-3Z-QqY/EM8KUh_MAQAJ

Accuracy of the accounting:
---------------------------
It is important for us to be as accurate as possible when accounting freed
memory, since every byte we mark as freed may allow one or more requests to be
executed. I have measured the accuracy of this approach (ignoring padding and
object size for the mutation fragments) to be 99.83 % of used memory in the
test workload I have run (large, 65k mutations). Memtables under this circumstance
tend to have a very high occupancy ratio because throttle breeds idle, and idle
breeds compact-on-idle.

Known Issues:
-------------

A lot of time can elapse between destroying the flush_reader and actually
releasing memory. The release of memory only happens when the SSTable is fully
sealed, and we have to flush the files, as well as finish writing all SSTable
components at this point. This happened in practice with a buggy kernel that
would result in flushes taking a long time.

After that is fixed, this is just a theoretical problem and in practice it
shouldn't matter given the time we expect those operations to take."

* 'virtual-dirty-v6' of github.com:glommer/scylla:
  database: allow virtual dirty memory management
  streamed_mutation: make _buffer private
  add accounting of memory read to partition_snapshot_reader
  move partition_snapshot_reader code to header file
  LSA: allow a group to query its own region group
  memtables: split scanning reader in two
  sstables: use special reader for writing a memtable
  LSA: export information about object memory footprint
  LSA: export information about size of the throttle queue
  database: export virtual dirty bytes region group
2016-10-04 20:57:52 +03:00
Glauber Costa
f89a67c75c database: allow virtual dirty memory management
Scylla currently suffers from a brick wall behavior of the request throttler.
Requests pile up until we reach the dirty memory limit, at which point we stop
serving them until we have freed enough memory to allow for more requests.

The problem is that freeing dirty memory means writing an SSTable to completion.
That can take a long time, even if we are blessed with great disks. Those long
waiting times can and will translate into timeouts. That is bad behavior.

What this patch does is introduce one form of virtual dirty memory accounting.
Instead of allowing 100 % of the dirty memory to fill up before we stop
accepting requests, we will do that when we reach 50 % of memory. However,
instead of releasing requests only when an SSTable is fully written, we start
releasing them as soon as some memory has been written.

The practical effect of that is that once we reach 50 % occupancy in our dirty
memory region, we will bring the system from CPU speed to disk speed, and will
start accepting requests only at the rate we are able to write memory back.
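
A toy model of the accounting, assuming the 50 % figure above (names are
made up; the real accounting lives in the LSA region groups):

    #include <cstddef>

    // Requests are throttled on the *virtual* dirty figure, which shrinks as
    // memtable memory is written out, long before the sstable is fully sealed.
    struct dirty_tracker {
        std::size_t limit;                  // total dirty-memory budget
        std::size_t real_dirty = 0;         // bytes still held by unsealed memtables
        std::size_t written_back = 0;       // bytes already handed to the sstable writer

        std::size_t virtual_dirty() const { return real_dirty - written_back; }

        // Throttle at 50 % of the budget, per the description above.
        bool should_throttle() const { return virtual_dirty() >= limit / 2; }

        void on_write(std::size_t bytes)   { real_dirty += bytes; }
        void on_flushed(std::size_t bytes) { written_back += bytes; }   // unblocks waiters early
        void on_memtable_sealed(std::size_t bytes) {
            real_dirty -= bytes;            // memory actually released
            written_back -= bytes;
        }
    };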

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-10-04 10:39:10 -04:00
Raphael S. Carvalho
747b42299c database: remove unused code
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <95e1ed590c9e45d15f19a84824a4dce05aefdab8.1475528611.git.raphaelsc@scylladb.com>
2016-10-04 09:26:43 +03:00
Raphael S. Carvalho
a3bf7558f2 lcs: fix broken token range distribution at higher levels
Uniform token range distribution across sstables in a level > 1 was broken,
because we were only choosing the sstable with the lowest first key when
compacting a level > 0. This resulted in a performance problem because L1->L2
compactions may have a huge overlap over time, for example.
The last compacted key will now be stored for each level to ensure a sort of
"round robin" selection of sstables for compactions at levels >= 1.
That's also done by C*, which was once affected by the same issue, as described in
https://issues.apache.org/jira/browse/CASSANDRA-6284.
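
A sketch of the selection rule (simplified types; it assumes the level's
sstables are kept sorted by first key):

    #include <algorithm>
    #include <string>
    #include <vector>

    struct sstable_info {
        std::string first_key;   // smallest partition key in the sstable
    };

    // Pick the next sstable to compact in a level: the first one whose first key
    // is greater than the last compacted key, wrapping around to the beginning.
    // This spreads compactions across the whole token range instead of always
    // picking the sstable with the lowest first key.
    const sstable_info* pick_next(const std::vector<sstable_info>& level,
                                  const std::string& last_compacted_key) {
        if (level.empty()) {
            return nullptr;
        }
        auto it = std::find_if(level.begin(), level.end(), [&](const sstable_info& s) {
            return s.first_key > last_compacted_key;
        });
        if (it == level.end()) {
            it = level.begin();   // wrap around: the "round robin" part
        }
        return &*it;
    }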

Fixes #1719.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-09-30 14:09:16 -03:00
Glauber Costa
f5fd6bd714 LSA: export information about size of the throttle queue
Also add information about how long the oldest entry has been sitting in the
queue. This is part of the backpressure work to allow us to throttle incoming
requests if we won't have memory to process them. Shortages can happen in all
sorts of places, and it is useful when designing and testing the solutions to
know where they are, and how bad they are.

This counter is named for consistency after similar counters from transport/.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-09-27 12:09:08 -04:00
Glauber Costa
aa6a96d09b database: export virtual dirty bytes region group
Currently, we export the region group where memtables are placed as dirty bytes.
Upcoming patches will optimistically mark some bytes in this region as free, a
scheme we know as "virtual dirty".

We are still interested in knowing the real state of the dirty region, so we
will keep track of the bytes virtually freed and split the counters in two.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-09-27 12:09:08 -04:00
Asias He
b505e34062 database: Introduce make_streaming_reader
The make_streaming_reader returns a combined mutation reader that reads
mutations from sstables and memtables. The memtable reader handles
memtable flushing automatically, so no special handling is needed here.

It will be used by streaming soon.
2016-09-26 16:02:48 +08:00
Raphael S. Carvalho
67343798cf api: implement api to return sstable count per level
'nodetool cfstats' wasn't showing per-level sstable count because
the API wasn't implemented.

Fixes #1119.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <0dcdf9196eaec1692003fcc8ef18c77d0834b2c6.1474410770.git.raphaelsc@scylladb.com>
2016-09-21 09:13:40 +03:00
Raphael S. Carvalho
dffb41f9d8 sstables: remove schema parameter from some sstable methods
schema can now be found in the sstable object itself.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <0fa44fedbe784d924522d7eeca77c16294479c6e.1473959677.git.raphaelsc@scylladb.com>
2016-09-19 13:25:58 +02:00
Calle Wilund
f126cf769a column_family: Ensure flush() waits for all previous flushes + self
Fixes #1577
Message-Id: <1472569952-4066-1-git-send-email-calle@scylladb.com>
2016-09-14 11:00:41 +01:00
Tomasz Grabiec
a498da1987 database: Ignore spaces in initial_token list
Currently we get a boost::lexical_cast exception on startup if initial_token has a
list which contains spaces after commas, e.g.:

  initial_token: -1100081313741479381, -1104041856484663086, ...
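
The tolerant parsing amounts to trimming each entry before converting it
(an illustrative sketch in plain C++; the real code uses boost utilities):

    #include <cstdint>
    #include <sstream>
    #include <string>
    #include <vector>

    // Split a comma-separated token list, trimming whitespace around each entry,
    // so "a, b" parses the same as "a,b".
    std::vector<int64_t> parse_initial_tokens(const std::string& value) {
        std::vector<int64_t> tokens;
        std::stringstream ss(value);
        std::string item;
        while (std::getline(ss, item, ',')) {
            auto first = item.find_first_not_of(" \t");
            auto last = item.find_last_not_of(" \t");
            if (first == std::string::npos) {
                continue;                     // skip empty entries
            }
            tokens.push_back(std::stoll(item.substr(first, last - first + 1)));
        }
        return tokens;
    }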

Fixes #1664.
Message-Id: <1473840915-5682-1-git-send-email-tgrabiec@scylladb.com>
2016-09-14 11:58:13 +03:00
Raphael S. Carvalho
b9f67351da db: expose clustering filter info via collectd
That's needed to observe the behavior of the clustering filter, and to
check whether it's worthwhile for a specific workload.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-09-02 11:32:23 -03:00
Raphael S. Carvalho
a2dc88889d db: enable clustering optimization only on dtcs
Leveled strategy will not benefit from this optimization because
there are only a few sstables that will contain a given partition
key, which means that a clustering key that belongs to a specific
partition key can only be in a few sstables as well.

Date tiered strategy is the one that will actually benefit the
most from this optimization. Size tiered may benefit from it too
if the clustering key isn't overwritten, but it will not use the
clustering optimization.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-09-02 11:31:07 -03:00
Raphael S. Carvalho
8d03ccd604 sstables: optimize reads with clustering filter
If the user specifies a clustering filter, it's possible to filter out an
sstable based on its metadata, which tracks the min/max clustering values.

For example, if an sstable stores clustering keys from 'a' through 'c',
it's possible to filter out that sstable if the user asks for data
with a clustering key greater than 'c'.

That's done by comparing each component separately, because the
clustering key may be composite. Further information can be found
here: https://issues.apache.org/jira/browse/CASSANDRA-5514
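
A simplified, single-component version of the check (real clustering keys are
composite and each component is compared separately, as noted above):

    #include <string>

    struct sstable_clustering_bounds {
        std::string min_clustering;   // smallest clustering value stored
        std::string max_clustering;   // largest clustering value stored
    };

    // True if the sstable can be skipped because the requested clustering range
    // lies entirely outside the [min, max] range recorded in its metadata.
    bool can_skip_sstable(const sstable_clustering_bounds& meta,
                          const std::string& requested_min,
                          const std::string& requested_max) {
        return requested_min > meta.max_clustering || requested_max < meta.min_clustering;
    }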

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-09-02 10:51:50 -03:00
Raphael S. Carvalho
004617839d database: check bloom filter of all sstables earlier
All sstables will now have their bloom filter checked in a single pass
before the reader iterates through all candidates. It's possible that
we will need to futurize the procedure if it holds the cpu for too
long. This change is also a step towards the optimization that
will rule out sstables based on the clustering filter.
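
The single pre-pass boils down to something like this (hypothetical types,
with the bloom filter check reduced to a callable stand-in):

    #include <functional>
    #include <string>
    #include <vector>

    struct sstable_candidate {
        std::function<bool(const std::string&)> filter_has_key;   // stand-in for the bloom filter
    };

    // One pass over all candidates: keep only those whose bloom filter does not
    // rule the key out, before any reader is created for them.
    std::vector<const sstable_candidate*>
    filter_by_bloom(const std::vector<sstable_candidate>& candidates, const std::string& key) {
        std::vector<const sstable_candidate*> selected;
        for (const auto& sst : candidates) {
            if (sst.filter_has_key(key)) {
                selected.push_back(&sst);
            }
        }
        return selected;
    }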

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-09-02 10:50:08 -03:00
Raphael S. Carvalho
1f31223f32 sstables: store schema in sstable object
That will be needed for an optimization that will store decorated keys
in the sstable object, and also for subsequent work that will
detect wrong metadata (min/max column names) by looking at columns
in the schema. As the schema is stored in the sstable, there's no longer
a need to store ks and cf names in it.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-09-02 10:49:17 -03:00
Glauber Costa
dc5d8e33af Revert "row_cache: update sstable histograms on cache hits"
This reverts commit 1726b1d0cc.

Reverting this patch turns our SSTable access counter into a miss counter only.
The estimated histogram always starts its first bucket at 1, so by marking cache
accesses we will be wrongly feeding "1" into the buckets.

Notice that this is not yet ideal: nodetool is supposed to show a histogram of
all reads, and by doing this we are changing its meaning slightly. Workloads
that serve mostly from cache will be distorted towards their misses.

The real solution is to use a different histogram, but we will need to enforce
a newer version of nodetool for that: the current issue is that nodetool expects
an EstimatedHistogram in a specific format on the other side.

Conflicts:
	row_cache.hh

Message-Id: <a599fa9e949766e7c9697450ae34fc28e881e90a.1472742276.git.glauber@scylladb.com>
Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-09-01 18:07:31 +03:00
Duarte Nunes
ba374da043 database: Trace sstable accesses
This patch traces when we read from an sstable, be it for a key range or a
single key.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-09-01 12:04:32 +02:00
Duarte Nunes
030db65c62 database: Accept a trace_state_ptr
This patch changes the database and column_family types so a
trace_state_ptr can be passed in when querying. This enables tracing
of the inner components.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-09-01 12:04:28 +02:00
Glauber Costa
1726b1d0cc row_cache: update sstable histograms on cache hits
If we have a cache hit, we still need to update our sstable histogram - noting
that we have touched 0 SSTables.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-08-31 15:14:22 -04:00
Glauber Costa
ce24fd05fe database: keep statistics on SSTables touched per read
That is done for single-partition queries only - mimicking what
Cassandra does on that matter.

For this to be correct, we also need to update this histogram on cache
hits - in which case we record the read as having touched 0 SSTables. That
will be done in a separate patch.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-08-31 15:14:21 -04:00
Piotr Jastrzebski
3607d99269 Remove clustering_key_filtering_context.
Remove clustering_key_filter_factory and clustering_key_filtering_context.
Use partition_slice directly with a static get_ranges method.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-08-30 20:31:55 +02:00
Raphael S. Carvalho
108fd1fade database: close file in lister
After listing is done, let's close the file. This fixes no bug.
It's only an improvement.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <2f52d297bcf6a6b6e3429912c28f17e6b37f8842.1471381607.git.raphaelsc@scylladb.com>
2016-08-17 11:01:44 +03:00
Glauber Costa
b361dee488 database: memtables pending flushes tell us nothing
We have two counters that track how many memtable flushes are in progress and
how much memory they are pinning.

The problem is, after we have revamped the code to limit the number of flushes
in progress, those counters became useless: as they live on the semaphore
side, they will only be incremented once we have passed the semaphore.

One wouldn't notice if working with CPU-bound problems, where memtables don't
pile up. But as soon as they do, those counters will always show the same numbers:
the depth of the semaphore, which doesn't mean much. The problem is poised to
become much worse: once we enable write-behind in full and set the semaphore's
depth to one, that's the number we'll see here all the time.

The fix is to move the counters outside the semaphore, which will bring back
their old semantics.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <c5ae6903e170f3f356cdda7ed78a4c9ba8d5f024.1471370504.git.glauber@scylladb.com>
2016-08-17 10:54:15 +03:00
Paweł Dziepak
8a386a51bd Merge "Don't cache wide partitions" from Piotr
"When reading a partition try to read it all
but once more bytes are read than a given limit
we decide that partition is wide and we don't cache it.
Instead we retry the read with clustering key filtering
applied."
2016-07-21 10:24:25 +01:00
Piotr Jastrzebski
636a4acfd0 Add flag to configure
max size of a cached partition.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-07-21 09:47:20 +02:00
Tomasz Grabiec
0d26294fac database: Add table name to log message about sealing
Message-Id: <1468917744-2539-1-git-send-email-tgrabiec@scylladb.com>
2016-07-20 10:12:31 +03:00
Tomasz Grabiec
a0832f08d2 schema_tables: Add more logging
Message-Id: <1468917771-2592-1-git-send-email-tgrabiec@scylladb.com>
2016-07-20 10:12:00 +03:00
Duarte Nunes
3518db531e database: Get non-system column_families
This patch adds a utility function that allows fetching the set of
column_families that do not belong to the system keyspace.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-07-18 23:58:31 +00:00
Duarte Nunes
4bc00c2055 database: Expose selection of sstables by a range
This patch allows a set of a column_family's sstables to be
selected according to a range of ring_positions.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-07-18 23:58:31 +00:00
Avi Kivity
1048e1071b db: do not create column family directories belonging to foreign keyspaces
Currently, for any column family, we create a directory for it in all
keyspace directories.  This is incredibly awkward.

Fix by iterating over just the keyspace's column families, not all
column families in existence.

Fixes #1457.
Message-Id: <1468495182-18424-1-git-send-email-avi@scylladb.com>
2016-07-14 14:31:05 +03:00
Avi Kivity
23edc1861a db: estimate queued read size more conservatively
There are plenty of continuations involved, so don't assume it fits in 1k.
Message-Id: <1468429516-4591-1-git-send-email-avi@scylladb.com>
2016-07-14 11:42:24 +02:00
Avi Kivity
d3c87975b0 db: don't over-allocate memory for mutation_reader
column_family::make_reader() doesn't deal with sstables directly, so it
doesn't need to reserve memory for them.

Fixes #1453.
Message-Id: <1468429143-4354-1-git-send-email-avi@scylladb.com>
2016-07-14 10:01:42 +02:00
Avi Kivity
24e3026e32 Merge "compaction manager refactoring" from Raphael 2016-07-10 17:16:23 +03:00
Tomasz Grabiec
6a1f9a9b97 db: Improve logging
Message-Id: <1467997671-16570-1-git-send-email-tgrabiec@scylladb.com>
2016-07-10 16:15:03 +03:00
Tomasz Grabiec
c0233c877d db: Avoid out-of-memory when flushing cannot keep up
memtable_list::seal_on_overlflow() is called on each mutation to check
whether the current memtable should be flushed. It will call
memtable_list::seal_active_memtable() when that is the case.

The number of concurrent seals is guarded by a semaphore, starting
from commit 0f64eb7e7d, and allows
at most 4 of them.

If there are 4 flushes already pending, every incoming mutation will
enqueue a new flush task on the semaphore's wait list, without waiting
for it. The wait queue can grow without bounds, eventually leading to
out-of-memory.

The fix is to seal the memtable immediately to satisfy the should_flush()
condition, but limit the concurrency of actual flushes. This way the wait
queue size on the semaphore is limited by the number of memtables pending a
flush, which is fairly limited.
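
A sketch of the bounded scheme (illustrative; the real code is asynchronous):
sealing swaps in a fresh active memtable synchronously, while the actual
flush of sealed memtables is driven separately under the semaphore, so the
queue grows with sealed memtables rather than with incoming mutations.

    #include <deque>
    #include <memory>

    struct memtable_stub { /* mutations live here */ };

    class memtable_list_stub {
        std::unique_ptr<memtable_stub> _active = std::make_unique<memtable_stub>();
        std::deque<std::unique_ptr<memtable_stub>> _sealed;   // awaiting flush
    public:
        // Called on the write path: swap in a new active memtable immediately,
        // so should_flush() is satisfied without queueing one task per mutation.
        void seal_active_memtable() {
            _sealed.push_back(std::move(_active));
            _active = std::make_unique<memtable_stub>();
            // The flushes of _sealed entries run elsewhere, with their concurrency
            // capped by the semaphore; the semaphore's wait queue is therefore
            // bounded by the number of sealed memtables.
        }
    };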

Message-Id: <1467997652-16513-1-git-send-email-tgrabiec@scylladb.com>
2016-07-10 10:53:51 +03:00
Raphael S. Carvalho
e38f66c6fe database: make certain column family functions const qualified
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-07-08 15:05:22 -03:00
Vlad Zolotarov
f2bf453be2 database: revive mutation retry in case of replay_position_reordered_exception
The logic that would retry applying a mutation in case of
a replay_position_reordered_exception error was broken by
commit 0c31f3e626:
Author: Glauber Costa <glauber@scylladb.com>
Date:   Wed Apr 20 19:09:21 2016 -0400

    database: move memtable throttler to the LSA throttler

This patch makes it work again.

Fixes #1439

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1467893342-30559-1-git-send-email-vladz@cloudius-systems.com>
2016-07-07 15:00:35 +02:00