"Before, the logic for releasing writes blocked on dirty worked like this:
1) When region group size changes and it is not under pressure and there
are some requests blocked, then schedule request releasing task
2) request releasing task, if no pressure, runs one request and if there are
still blocked requests, schedules next request releasing task
If requests don't change the size of the region group, then either some request
executes or there is a request releasing task scheduled. The amount of scheduled
tasks is at most 1, there is a single releasing thread.
However, if requests themselves would change the size of the group, then each
such change would schedule yet another request releasing thread, growing the task
queue size by one.
The group size can also change when memory is reclaimed from the groups (e.g.
when contains sparse segments). Compaction may start many request releasing
threads due to group size updates.
Such behavior is detrimental for performance and stability if there are a lot
of blocked requests. This can happen on 1.5 even with modest concurrency
because timed out requests stay in the queue. This is less likely on 1.6 where
they are dropped from the queue.
The releasing of tasks may start to dominate over other processes in the
system. When the amount of scheduled tasks reaches 1000, polling stops and
server becomes unresponsive until all of the released requests are done, which
is either when they start to block on dirty memory again or run out of blocked
requests. It may take a while to reach pressure condition after memtable flush
if it brings virtual dirty much below the threshold, which is currently the
case for workloads with overwrites producing sparse regions.
I saw this happening in a write workload from issue #2021 where the number of
request releasing threads grew into thousands.
Fix by ensuring there is at most one request releasing thread at a time. There
will be one releasing fiber per region group which is woken up when pressure is
lifted. It executes blocked requests until pressure occurs."
* tag 'tgrabiec/lsa-single-threaded-releasing-v2' of github.com:cloudius-systems/seastar-dev:
tests: lsa: Add test for reclaimer starting and stopping
tests: lsa: Add request releasing stress test
lsa: Avoid avalanche releasing of requests
lsa: Move definitions to .cc
lsa: Simplify hard pressure notification management
lsa: Do not start or stop reclaiming on hard pressure
tests: lsa: Adjust to take into account that reclaimers are run synchronously
lsa: Document and annotate reclaimer notification callbacks
tests: lsa: Use with_timeout() in quiesce()
(cherry picked from commit 7a00dd6985)
Shared sstables will now be resharded in the same order to guarantee
that all shards owning a sstable will agree on its deletion nearly
the same time, therefore, reducing disk space requirement.
That's done by picking which column family to reshard in UUID order,
and each individual column family will reshard its shared sstables
in generation order.
Fixes#1952.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <87ff649ed24590c55c00cbb32bffd8fa2743e36e.1482342754.git.raphaelsc@scylladb.com>
(cherry picked from commit 27fb8ec512)
A case could be made that we should have counters for them no matter
what, since it can help us reason about the distribution of memory among
the groups. But with the hierarchy being broken in 1.5 it becomes even
more important. Now by looking solely at dirty, we will have no idea
about how much memory we are using in those groups.
After this patch, the dirty_memory_manager will register its metrics
for the 3 groups that we have, and the legacy names will be used to show
totals.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <0d04ca4c7e8472097f16a5dc950b77c73766049e.1481831644.git.glauber@scylladb.com>
(cherry picked from commit 7133583797)
Issue #1918 describes a problem, in which we are generating smaller
memtables than we could, and therefore not respecting the flush
criteria.
That happens because group sizes (and limits) for pressure purposes, and
the the soft threshold is currently at 40 %. This causes system group's
soft threshold to be way below regular's virtual dirty limit and close
to regular group's soft threshold. The system group was very likely to
become under soft pressure when regular was because writes to regular
group are not yet throttled when they cross both soft thresholds.
This is a direct consequence of the linear hierarchy between the regions
and to guarantee that it won't happen we would have acqire the semaphore
of all ancestor regions when flushing from a child region. While that
works, it can lead to problems on its own, like priority inversion if
the regions have different priorities - like streaming and regular, and
groups lower in the hierarchy, like user, blocking explicit flushes
from their ancestors
To fix that, this patch reorganizes the dirty memory region groups so
that groups are now completely independent. As a disadvantage, when
streaming happen we will draw some memory from the cache, but we will
live with it for the time being.
Fixes#1918
[ glauber: fix conflicts in memtable.cc due to lack of graceful clear ]
Signed-off-by: Glauber Costa <glauber@scylladb.com>
(cherry picked from commit 80440c0d79)
Signed-off-by: Glauber Costa <glauber@scylladb.com>
flush_one has to make sure that we're using the correct
dirty_memory_manager object, because we could be flushing from a region
group different than the one the flush request originated.
It's simpler to just assume flush_one will be dealing with the right
object, and use a different object instead of "this" when calling it.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
(cherry picked from commit bb1509c21e)
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Some of our CFs can't be flushed. Those are the ones who are not marked
as having durable writes. We treat them just the same from the point of
view of the flush logic, but they provide a function that doesn't do
anything and just returns right away.
We already had troubles with that in the past, and that also poses a
problem for an upcoming patch reworking the flush memtable pick
criteria.
It's easier, simpler, and cleaner, to just make the memtable_list aware
it can't flush. Achieving that is also not very complicated: we just
need a special constructor that doesn't take a seal function and then we
make sure that it is initialized to an empty std::function
Signed-off-by: Glauber Costa <glauber@scylladb.com>
(cherry picked from commit 8ab7c04caa)
Signed-off-by: Glauber Costa <glauber@scylladb.com>
When we finish writing a memtable, we revert the dirty memory charges
immediately. When we do that, dirty memory will grow back to what it
was, and soon (we hope) will go down again when we release the requests
for real.
During that time, we may not accept new requests. Sealing can take a
long time, specially in the face of Linux issues like the ones we have
seen in the past. It also will take proportionally more time if the
SSTables end up being small, which is a possibility in some scenarios.
This patch changes the dirty_memory_manager so that the charges won't be
reverted right after we finish the flush. Rather, we will hold on to it,
and revert it right before we update the cache. We don't need to do it
for all classes of memtable writes, because after we finish flushing,
flush_one() will destroy the hashed element anyway.
[tgrabiec: conflicts]
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <2d5a8f6ca57d5036f4850ac163557bca59b8063d.1480004384.git.glauber@scylladb.com>
(cherry picked from commit c32803f2f0)
We have a semaphore controlling the amount of background work generated
by the memtable flush process. However, because we are not moving it
inside the memtable post-flush continuation, the units are being
released when we star the flush and not when we finish it.
That's not the intended behavior and that can cause flushes to
accumulate.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <b7dc1866ed3473b9b1862c433d59c5ebd8575dbc.1479839600.git.glauber@scylladb.com>
(cherry picked from commit 13973e7f3b)
We current pass a region group to the memtable, but after so many recent
changes, that is a bit too low level. This patch changes that so we pass
a memtable list instead.
Doing that also has a couple of advantages. Mainly, during flush we must
get to a memtable to a memtable_list. Currently we do that by going to
the memtable to a column family through the schema, and from there to
the memtable_list.
That, however, involves calling virtual functions in a derived class,
because a single column family could have both streaming and normal
memtables. If we pass a memtable_list to the memtable, we can keep
pointer, and when needed get the memtable_list directly.
Not only that gets rid of the inheritance for aesthetic reasons, but
that inheritance is not even correct anymore. Since the introduction of
the big streaming memtables, we now have a plethora of lists per column
family and this transversal is totally wrong. We haven't noticed before
because we were flushing the memtables based on their individual sizes,
but it has been wrong all along for edge cases in which we would have to
resort to size-based flush. This could be the case, for instance, with
various plan_ids in flight at the same time.
At this point, there is no more reason to keep the derived classes for
the dirty_memory_manager. I'm only keeping them around to reduce
clutter, although they are useful for the specialized constructors and
to communicate to the reader exactly what they are. But those can be
removed in a follow up patch if we want.
The old memtable constructor signature is kept around for the benefit of
two tests in memtable_tests which have their own flush logic. In the
future we could do something like we do for the SSTable tests, and have
a proxy class that is friends with the memtable class. That too, is left
for the future.
Fixes#1870
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <811ec9e8e123dc5fc26eadbda82b0bae906657a9.1479743266.git.glauber@scylladb.com>
(cherry picked from commit 0ca8c3f162)
If a Column Family is non-durable, then its flushes will never create a
memtable flush reader. Our current flush logic depends on that being
created and destroyed to release the semaphore permits on the flush.
We will remove the permits ourselves it there is an exception, but not
under normal circumnstances. Given this issue, however, it would be more
adequate to always try to remove the permits after we flush. If the
permits were already removed by the flush reader, then this test will
just see that the permit is not in the map and return. But if it is
still there, then it is removed.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <049334c3b4bef620af2c7c045e6c84347dcf9013.1479498026.git.glauber@scylladb.com>
(cherry picked from commit 1933349654)
The leakage results in deleted sstables being opened until shutdown, and disk
space isn't released. That's because column_family::rebuild_sstable_list()
will not remove reference to deleted sstables if an exception was triggered in
sstables::delete_atomically(). A sstable only has its files closed when its
object is destructed.
The exception happens when a major compaction is issued in parallel to a
regular one, and one of them will be unable to delete a sstable already deleted
by the other. That results in remove_by_toc_name() triggering boost::filesystem
::filesystem_error because TOC and temporary TOC don't exist.
We wouldn't have seen this problem if major compaction were going through
compaction manager, but remove_by_toc_name() and rebuild_sstable_list() should
be made resilient.
Fixes#1840.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <d43b2e78f9658e2c3c5bbb7f813756f18874bf92.1479390842.git.raphaelsc@scylladb.com>
(cherry picked from commit 3dc9294023)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <760f96d81de0bab7507bb4f52c06b30f21e82577.1479420770.git.raphaelsc@scylladb.com>
This patch addresses post-merge follow up comments by Tomek.
Basically, what we do is:
- we don't need to signal() from remove_from_flush_manager(), because
the explicit flushes no longer wait on the condition variable. So we
don't.
- We now wait on the stop() flushes (regardless of their return status)
so we can make sure that the _flush_queue will indeed be done with.
- we acquire the semaphore before shutting down the dirty_memory_manager
to make sure that there are no pending flushes
- the flush manager that holds the semaphore has to match in the exception
handler
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <a23ab5098934546c660a08de64cd9294bb3a2008.1479400239.git.glauber@scylladb.com>
(cherry picked from commit 461778918b)
"This patchset allows Scylla to determine the size of a memtable instead
of relying in the user-provided memtable_cleanup_threshold. It does that
by allowing the region_group to specify a soft limit which will trigger
the allocation as early as it is reached.
Given that, we'll keep the memtables in memory for as long as it takes
to reach that limit, regardless of the individual size of any single one
of them. That limit is set to 1/4 of dirty memory. That's the same as
last submission, except this time I have run some experiments to gauge
behavior of that versus 1/2 of dirty memory, which was a preferred
theoretical value.
After that is done, the flush logic is reworked to guarantee that
flushes are not initiated if we already have one memtable under flush.
That allow us to better take advantage of coalescing opportunities with
new requests and prevents the pending memtable explosion that is
ultimately responsible for Issue 1817.
I have run mainly two workloads with this. The first one a local RF=1
workload with large partitions, sized 128kB and 100 threads. The results
are:
Before:
op rate : 632 [WRITE:632]
partition rate : 632 [WRITE:632]
row rate : 632 [WRITE:632]
latency mean : 157.8 [WRITE:157.8]
latency median : 115.5 [WRITE:115.5]
latency 95th percentile : 486.7 [WRITE:486.7]
latency 99th percentile : 534.8 [WRITE:534.8]
latency 99.9th percentile : 599.0 [WRITE:599.0]
latency max : 722.6 [WRITE:722.6]
Total partitions : 189667 [WRITE:189667]
Total errors : 0 [WRITE:0]
total gc count : 0
total gc mb : 0
total gc time (s) : 0
avg gc time(ms) : NaN
stdev gc time(ms) : 0
Total operation time : 00:05:00
END
After:
op rate : 951 [WRITE:951]
partition rate : 951 [WRITE:951]
row rate : 951 [WRITE:951]
latency mean : 104.8 [WRITE:104.8]
latency median : 102.5 [WRITE:102.5]
latency 95th percentile : 155.8 [WRITE:155.8]
latency 99th percentile : 177.8 [WRITE:177.8]
latency 99.9th percentile : 686.4 [WRITE:686.4]
latency max : 1081.4 [WRITE:1081.4]
Total partitions : 285324 [WRITE:285324]
Total errors : 0 [WRITE:0]
total gc count : 0
total gc mb : 0
total gc time (s) : 0
avg gc time(ms) : NaN
stdev gc time(ms) : 0
Total operation time : 00:05:00
END
The other workload was the workload described in #1817. And the result
is that we now have a load that is very stable around 100k ops/s and
hardly any timeouts, instead of the 1.4 baseline of wild variations
around 100k ops/s and lots of timeouts, or the deep reduction of
1.5-rc1."
* 'issue-1817-v4' of github.com:glommer/scylla:
database: rework memtable flush logic
get rid of max_memtable_size
pass a region to dirty_memory_manager accounting API
memtable: add a method to expose the region_group
logalloc: allow region group reclaimer to specify a soft limit
database: remove outdated comment
database: uphold virtual dirty for system tables.
(cherry picked from commit 5d067eebf2)
When fast forwarding a reader that combines sstable reader we must also
remember that the set of sstables for the new range may be different
than for the previous one. The reader introduced in this patch makes
sure that we read from correct sstables.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
There is already queue_length-requests_blocked_memory, but it's a
gauge so does not reflect what happened between the sampling points.
total_operations-requests_blocked_memory will allow to see if there
were any (and how many) requests which were blocked by dirty memory.
Message-Id: <1476098616-12682-1-git-send-email-tgrabiec@scylladb.com>
Limiting the concurrency of memtable flushes to 4 was a temporary
workaround for the fact that we lacked good write behind support. Now
that write behind is properly merged we can reduce the concurrency to
what it should be, one.
This means that memtable flushes will now be serialized, and only when
one of them ends will the next one begin. Disk parallelism is obtained
through the write-behind mechanism.
Fixes#1373
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <528f9ef928b5101bed952df600eb8555c275497a.1475881100.git.glauber@scylladb.com>
There is a limit to concurrency of sstable readers on each shard. When
this limit is exhausted (currently 100 readers) readers queue. There
is a timeout after which queued readers are failed, equal to
read_request_timeout_in_ms (5s by default). The reason we have the
timeout here is primarily because the readers created for the purpose
of serving a CQL request no longer need to execute after waiting
longer than read_request_timeout_in_ms. The coordinator no longer
waits for the result so there is no point in proceeding with the read.
This timeout should not apply for readers created for streaming. The
streaming client currently times out after 10 minutes, so we could
wait at least that long. Timing out sooner makes streaming unreliable,
which under high load may prevent streaming from completing.
The change sets no timeout for streaming readers at replica level,
similarly as we do for system tables readers.
Fixes#1741.
Message-Id: <1475840678-25606-1-git-send-email-tgrabiec@scylladb.com>
"Description:
============
Scylla currently suffers from a brick wall behavior of the request throttler.
Requests pile up until we reach the dirty memory limit, at which point we stop
serving them until we have freed enough memory to allow for more requests.
The problem is that freeing dirty memory means writing an SSTable to completion.
That can take a long time, even if we are blessed with great disks. Those long
waiting times can and will translate into timeouts. That is bad behavior.
What this patch does is introduce one form of virtual dirty memory accounting.
Instead of allowing 100 % of the dirty memory to be filled up until we stop
accepting requests, we will do that when we reach 50 % of memory. However,
instead of releasing requests only when an SSTable is fully written, we start
releasing them when some memory was written.
The practical effect of that, is that once we reach 50 % occupancy in our dirty
memory region, we will bring the system from CPU speed to disk speed, and will
start accepting requests only at the rate we are able to write memory back.
Results
=======
With this patchset running a load big enough to easily saturate the disk,
(commitlog disabled to highlight the effects of the memtable writer), I am able
to run scylla for many minutes, with timeouts occurring only when I run out of
disk space, whereas without this patch a swarm of timeouts would start merely 2
seconds after the load started - and would never get stable.
In V2, I have sent a set of graphs illustrating the performance of this solution.
This version does not have any significant differences in that front.
For details, please refer to
https://groups.google.com/d/msg/scylladb-dev/iCvD-3Z-QqY/EM8KUh_MAQAJ
Accuracy of the accounting:
---------------------------
It is important for us to be as accurate as possible when accounting freed
memory, since every byte we mark as freed may allow one or more requests to be
executed. I have measured the accuracy of this approach (ignoring padding,
object size for the mutation fragments) to be 99.83 % of used memory in the
test workload I have ran (large, 65k mutations). Memtables under this circumnstance
tend to have a very high occupancy ratio because throttle breeds idle, and idle
breeds compact-on-idle.
Known Issues:
-------------
A lot of time can be elapsed between destroying the flush_reader and actually
releasing memory. The release of memory only happens when the SSTable is fully
sealed, and we have to flush the files, as well as finish writing all SSTable
components at this point. This happened in practice with a buggy kernel that
would result in flushes taking a long time.
After that is fixed, this is just a theoretical problem and in practice it
shouldn't matter given the time we expect those operations to take."
* 'virtual-dirty-v6' of github.com:glommer/scylla:
database: allow virtual dirty memory management
streamed_mutation: make _buffer private
add accounting of memory read to partition_snapshot_reader
move partition_snapshot_reader code to header file
LSA: allow a group to query its own region group
memtables: split scanning reader in two
sstables: use special reader for writing a memtable
LSA: export information about object memory footprint
LSA: export information about size of the throttle queue
database: export virtual dirty bytes region group
Scylla currently suffers from a brick wall behavior of the request throttler.
Requests pile up until we reach the dirty memory limit, at which point we stop
serving them until we have freed enough memory to allow for more requests.
The problem is that freeing dirty memory means writing an SSTable to completion.
That can take a long time, even if we are blessed with great disks. Those long
waiting times can and will translate into timeouts. That is bad behavior.
What this patch does is introduce one form of virtual dirty memory accounting.
Instead of allowing 100 % of the dirty memory to be filled up until we stop
accepting requests, we will do that when we reach 50 % of memory. However,
instead of releasing requests only when an SSTable is fully written, we start
releasing them when some memory was written.
The practical effect of that is that once we reach 50 % occupancy in our dirty
memory region, we will bring the system from CPU speed to disk speed, and will
start accepting requests only at the rate we are able to write memory back.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Uniform token range distribution across sstables in a level > 1 was broken,
because we were only choosing sstable with lowest first key, when compacting
a level > 0. This resulted in performance problem because L1->L2 may have a
huge overlap over time, for example.
Last compacted key will now be stored for each level to ensure sort of
"round robin" selection of sstables for compactions at level >= 1.
That's also done by C*, and they were once affected by it as described in
https://issues.apache.org/jira/browse/CASSANDRA-6284.
Fixes#1719.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Also add information about for how long has the oldest been sitting in the
queue. This is part of the backpressure work to allow us to throttle incoming
requests if we won't have memory to process them. Shortages can happen in all
sorts of places, and it is useful when designing and testing the solutions to
know where they are, and how bad they are.
This counter is named for consistency after similar counters from transport/.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Currently, we export the region group where memtables are placed as dirty bytes.
Upcoming patches will optimistically mark some bytes in this region as free, a
scheme we know as "virtual dirty".
We are still interested in knowing the real state of the dirty region, so we
will keep track of the bytes virtually freed and split the counters in two.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The make_streaming_reader returns a combined mutation reader reads
mutations from sstables and memtable. The memtable reader handles
memtable flushing automatically so no special handling is needed here.
It will be used by streaming soon.
Currently we get boost::lexical_cast on startup if inital_token has a
list which contains spaces after commas, e.g.:
initial_token: -1100081313741479381, -1104041856484663086, ...
Fixes#1664.
Message-Id: <1473840915-5682-1-git-send-email-tgrabiec@scylladb.com>
That's needed to observe behavior of clustering filter, and to
check if it's worthwhile for a specific workload.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Leveled strategy will not benefit from this strategy because
there's only a few sstables that will contain a given partition
key, which means that a clustering key that belongs to a specific
partition key can only be in a few sstables as well.
Date tiered strategy is the one that will actually benefit the
most from this optimization. Size tiered may benefit from it too
if clustering key isn't overwritten, but it will not use the
clustering optimization.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
If user specifies a clustering filter, it's possible to filter out
sstable based on its metadata that tracks min/max clustering value.
For example, if sstable stores clustering key from 'a' through 'c',
it's possible to filter out that sstable if user asks for data
with clustering key greater than 'c'.
That's done by comparing each component separately because
clustering key may be composite. Further information can be found
here: https://issues.apache.org/jira/browse/CASSANDRA-5514
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
All sstables will now have bloom filter checked in a single pass
before reader iterate through all candidates. It's possible that
we will need to futurize the procedure if it holds cpu for too
long. This change is also a step towards the optimization that
will rule out sstables based on clustering filter.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That will be needed for optimization that will store decorated keys
in the sstable object, and also for a subsequent work that will
detect wrong metadata (min/max column names) by looking at columns
in the schema. As schema is stored in sstable, there's no longer
a need to store ks and cf names in it.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This reverts commit 1726b1d0cc.
Reverting this patch turns our SSTable access counter into a miss counter only.
The estimated histogram always starts its first bucket at 1, so by marking cache
accesses we will be wrongly feeding "1" into the buckets.
Notice that this is not yet ideal: nodetool is supposed to show a histogram of
all reads, and by doing this we are changing its meaning slightly. Workloads
that serve mostly from cache will be distorted towards their misses.
The real solution is to use a different histogram, but we will need to enforce
a newer version of nodetool for that: the current issue is that nodetool expects
an EstimatedHistogram in a specific format in the other side.
Conflicts:
row_cache.hh
Message-Id: <a599fa9e949766e7c9697450ae34fc28e881e90a.1472742276.git.glauber@scy
lladb.com>
Signed-off-by: Glauber Costa <glauber@scylladb.com>
This patch changes the database and column_family types so a
trace_state_ptr can be passed in when querying. This enables tracing
of the inner components.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
If we have a cache hit, we still need to update our sstable histogram - notting
that we have touched 0 SSTables.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
That is done for single partition queries only - mimicking what
Cassandra does on that matter.
For this to be correct, we also need to update this histogram on cache
hits - in which case we update the read as having touched 0 SSTables. That
will be done on a separate patch.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Remove clustering_key_filter_factory and clustering_key_filtering_context.
Use partition_slice directly with a static get_ranges method.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
We have two counters that tracks how many memtable flushes are in progress, and
how much memory are they pinning.
The problem is, after we have revamped the code to limit the amount of flushes
in progress, those counters became useless: as they live inside the semaphore
side, they will only be incremented once we have past the semaphore.
One wouldn't notice if working with CPU-bound problems, where memtables don't
pile. But as soon as they do, those counters will always show the same numbers:
the depth of the semaphore, which doesn't mean much. The problem is poised to
become much worse: once we enable write behind in full and set the semaphore's
depth to one, that's the number we'll see here all the time.
The fix is to move the counters outside the semaphore, which will bring back its
old semantics.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <c5ae6903e170f3f356cdda7ed78a4c9ba8d5f024.1471370504.git.glauber@scylladb.com>
"When reading a partition try to read it all
but once more bytes are read than a given limit
we decide that partition is wide and we don't cache it.
Instead we retry the read with clustering key filtering
applied."