"Before, the logic for releasing writes blocked on dirty worked like this:
1) When region group size changes and it is not under pressure and there
are some requests blocked, then schedule request releasing task
2) request releasing task, if no pressure, runs one request and if there are
still blocked requests, schedules next request releasing task
If requests don't change the size of the region group, then either some request
executes or there is a request releasing task scheduled. The amount of scheduled
tasks is at most 1, there is a single releasing thread.
However, if requests themselves would change the size of the group, then each
such change would schedule yet another request releasing thread, growing the task
queue size by one.
The group size can also change when memory is reclaimed from the groups (e.g.
when contains sparse segments). Compaction may start many request releasing
threads due to group size updates.
Such behavior is detrimental for performance and stability if there are a lot
of blocked requests. This can happen on 1.5 even with modest concurrency
because timed out requests stay in the queue. This is less likely on 1.6 where
they are dropped from the queue.
The releasing of tasks may start to dominate over other processes in the
system. When the amount of scheduled tasks reaches 1000, polling stops and
server becomes unresponsive until all of the released requests are done, which
is either when they start to block on dirty memory again or run out of blocked
requests. It may take a while to reach pressure condition after memtable flush
if it brings virtual dirty much below the threshold, which is currently the
case for workloads with overwrites producing sparse regions.
I saw this happening in a write workload from issue #2021 where the number of
request releasing threads grew into thousands.
Fix by ensuring there is at most one request releasing thread at a time. There
will be one releasing fiber per region group which is woken up when pressure is
lifted. It executes blocked requests until pressure occurs."
* tag 'tgrabiec/lsa-single-threaded-releasing-v2' of github.com:cloudius-systems/seastar-dev:
tests: lsa: Add test for reclaimer starting and stopping
tests: lsa: Add request releasing stress test
lsa: Avoid avalanche releasing of requests
lsa: Move definitions to .cc
lsa: Simplify hard pressure notification management
lsa: Do not start or stop reclaiming on hard pressure
tests: lsa: Adjust to take into account that reclaimers are run synchronously
lsa: Document and annotate reclaimer notification callbacks
tests: lsa: Use with_timeout() in quiesce()
(cherry picked from commit 7a00dd6985)
Before this patch system table writes were not writing to commit log
because database::add_column_family() disables writes to commit log
for the table which is added if _commitlog is not set at that
time. Fix by initializing commit log before system tables are created.
Fixes#1986.
Fixes recent regression in
batch_test.py:TestBatch.replay_after_schema_change_test after
scylla-jmx was updated to not flush system tables on nodetool flush.
Could cause system keyspace writes to be delayed for more than before
under heavy write workload. Refs #1926.
Message-Id: <1483618117-4535-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit cd630fece6)
"This patchset contains fixes for the changes introduced in "Query result
size limiting". It also improves handling of short data reads.
I order to minimise chances of digest mismatch during data queries replicas
that were asked just to return a digest also keep track of the size of the
data (in the IDL representation) so that they would stop at the same point
nodes doing full data queries would. Moreover, data queries are not
affected by per-shard memory limit and the coordinator sends individual
result size limits to replicas in order not to depend on hardcoded values.
It is still possible to get digest mismatches if the IDL changes (e.g. a
new field is added), but, hopefully, that won't be a serious problem."
* 'pdziepak/short-read-fixes/v4' of github.com:cloudius-systems/seastar-dev:
query: introduce result_memory_accounter::foreign_state
storage_proxy: fix short reads in parallel range queries
storage_proxy: pass maximum result size to replicas
mutation_partition: use result limiter for digest reads
query: make result_memory_limiter constants available for linker
result_memory_limiter: add accounter for digest reads
idl: allow writers to use any output stream
result_memory_limiter: split new_read() to new_{data, mutation}_read()
idl: is_short_read() was added in 1.6
mutation_partition: honour allowed_short_read for static rows
storage_proxy: fix _is_short_read computation
storage_proxy: disallow short reads if got no live rows
storage_proxy: don't stop after result with no live rows
(cherry picked from commit 868b4d110c)
"This patchset ensures the partition limit is enforced at
the storage_proxy level. To achieve this, we add the partition
count to query::result, and allow the result_merger to trim
excess partitions."
* 'enforce-partition-limit/v3' of https://github.com/duarten/scylla:
storage_proxy: Decrease limits when retrying command
storage_proxy: Don't fetch superfluous partitions
query::result: Add partition count
column_family: Use counters in query::result::builder
query_result_builder: Use the underlying counters
mutation_partition: Count partitions in query_compacted
mutation_partition: Remove tabs in query_compacted
query::result::builder: Add partition count
query_result_merger: Limit partitions
(cherry picked from commit 6bb875bdb7)
Shared sstables will now be resharded in the same order to guarantee
that all shards owning a sstable will agree on its deletion nearly
the same time, therefore, reducing disk space requirement.
That's done by picking which column family to reshard in UUID order,
and each individual column family will reshard its shared sstables
in generation order.
Fixes#1952.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <87ff649ed24590c55c00cbb32bffd8fa2743e36e.1482342754.git.raphaelsc@scylladb.com>
(cherry picked from commit 27fb8ec512)
A case could be made that we should have counters for them no matter
what, since it can help us reason about the distribution of memory among
the groups. But with the hierarchy being broken in 1.5 it becomes even
more important. Now by looking solely at dirty, we will have no idea
about how much memory we are using in those groups.
After this patch, the dirty_memory_manager will register its metrics
for the 3 groups that we have, and the legacy names will be used to show
totals.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <0d04ca4c7e8472097f16a5dc950b77c73766049e.1481831644.git.glauber@scylladb.com>
(cherry picked from commit 7133583797)
A naive approach was to create a set of readers for each range and pass
them all to combining reader. This however performed badly if the number
of ranges was high.
The solution is to use multi range reader which uses only a single set
of readers and fast forwards from range to range when necessary. This
adds another requirement that the ranges passed to
make_streaming_reader() are sorted and disjoint.
This pach ensures than when we start executing a query a minimum result
size is reserved from result_memory_limiter.
Moreover, range queries need a way of merging memory usage information
from different shards.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
"The current criteria for memtable flush is not being respected. The
problem is demonstrated to happen when the dirty memory group is over
limit, and so is the system table extra allowance. In that situation,
both the normal region and the system table region will be under
pressure and try to flush.
More specifically, because the normal region inherits from the system
region, if the normal region is under pressure (over the soft limit
threshold), the system region will certainly be as well, even though it
has an extra allowance. This is because after virtual dirty, we start
blocking when we reach half the region, but memory itself can grow up to
100 % of the region. So the total amount of memory used will be
certainly bigger than the system pressure threshold, which is now 50 %
plus the allowance.
To fix that, this patch reworks the flush logic so that the regions are
not dependent on each other.
Fixes#1918"
* 'flush-criteria-v6' of github.com:glommer/scylla:
config: get rid of memtable_total_space
database: rework dirty memory hierarchy
system keyspace: write batchlog mutation in user memory
database: remove flush_token
database: abstract pressure condition notification
database: encapsulate semaphore_units into a flush_permit
database: remove friendship declaration
database: simplify flush_one
database: make memtable_list aware in cases it can't flush
Issue #1918 describes a problem, in which we are generating smaller
memtables than we could, and therefore not respecting the flush
criteria.
That happens because group sizes (and limits) for pressure purposes, and
the the soft threshold is currently at 40 %. This causes system group's
soft threshold to be way below regular's virtual dirty limit and close
to regular group's soft threshold. The system group was very likely to
become under soft pressure when regular was because writes to regular
group are not yet throttled when they cross both soft thresholds.
This is a direct consequence of the linear hierarchy between the regions
and to guarantee that it won't happen we would have acqire the semaphore
of all ancestor regions when flushing from a child region. While that
works, it can lead to problems on its own, like priority inversion if
the regions have different priorities - like streaming and regular, and
groups lower in the hierarchy, like user, blocking explicit flushes
from their ancestors
To fix that, this patch reorganizes the dirty memory region groups so
that groups are now completely independent. As a disadvantage, when
streaming happen we will draw some memory from the cache, but we will
live with it for the time being.
Fixes#1918
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Done in a separate patch to reduce clutter in the main patch.
Soon we'll be testing for one more condition.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
flush_one has to make sure that we're using the correct
dirty_memory_manager object, because we could be flushing from a region
group different than the one the flush request originated.
It's simpler to just assume flush_one will be dealing with the right
object, and use a different object instead of "this" when calling it.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Some of our CFs can't be flushed. Those are the ones who are not marked
as having durable writes. We treat them just the same from the point of
view of the flush logic, but they provide a function that doesn't do
anything and just returns right away.
We already had troubles with that in the past, and that also poses a
problem for an upcoming patch reworking the flush memtable pick
criteria.
It's easier, simpler, and cleaner, to just make the memtable_list aware
it can't flush. Achieving that is also not very complicated: we just
need a special constructor that doesn't take a seal function and then we
make sure that it is initialized to an empty std::function
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Currently, we weren't completing a query as early as possible if it
reached the partition limit, we instead had to wait until reaching the
end of the specified partition ranges. This patches fixes that by
including a check to the partition limit in the termination condition.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20161213114559.26438-1-duarte@scylladb.com>
Allow to make a streaming reader with a vector of ranges in addition to
a single range. This will be used soon in following streaming patch.
We can make the reader more efficient later.
The procedure to calculate max purgeable timestamp is optimized
by only visiting sstables that overlap with key being currently
compacted. That's done using incremental sstable selector.
Function to calculate maximum purgeable timestamp is made 10 times
faster when compacting sstables overlap with 10% of all sstables.
Fixes#1322.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
So that memory is released gradually (impacting latency less) and
sooner than when memtable is destroyed. Active readers may keep the
memtable alive for unbounded amount of time.
Refs #1879
The implementation assumes that memtable's region group is owned by
dirty_memory_manager, and tries to obtain a reference to it like this:
boost::intrusive::get_parent_from_member(_region.group(), &dirty_memory_manager::_region_group));
This is undefined behavior when the region's group does not come from
dirty manager. It's safer to be explicit about this dependency by
taking a reference to dirty_memory_manager in the constructor.
New sstables are loaded and added in parallel, meaning that scylla can
potentially return stale data if a new sstable containing a tombstone
wasn't loaded yet. Compaction should also not run until all new sstables
are added for similar reasons.
Fix is about separating blocking and non-blocking steps to allow
atomic add of multiple new sstables.
Fixes#1368.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <14283b8a4a69127071d1fabef320a93c91817ec2.1480356073.git.raphaelsc@scylladb.com>
"We currently write the size_estimates system table for every schema
on a periodic basis, currently set to 5 minutes, which can interfere
with an ongoing workload.
This patchset virtualizes it such that queries are intercepted and we
calculate the results on the fly, only for the ranges the caller is interested in.
Fixes#1616"
* 'virtual-estimates/v4' of github.com:duarten/scylla:
size_estimates_virtual_reader: Add unit test
db: Delete size_estimates_recorder
size_estimates: Add virtual reader
column_family: Add support for virtual readers
storage_service: get_local_tokens() returns a future
nonwrapping_range: Add slice() function
range: Find a sequence's lower and upper bounds
system_keyspace: Build mutations for size estimates
size_estimates: Store the token range as bytes
range_estimates: Add schema
murmur3_partitioner: Convert maximum_token to sstring
When we finish writing a memtable, we revert the dirty memory charges
immediately. When we do that, dirty memory will grow back to what it
was, and soon (we hope) will go down again when we release the requests
for real.
During that time, we may not accept new requests. Sealing can take a
long time, specially in the face of Linux issues like the ones we have
seen in the past. It also will take proportionally more time if the
SSTables end up being small, which is a possibility in some scenarios.
This patch changes the dirty_memory_manager so that the charges won't be
reverted right after we finish the flush. Rather, we will hold on to it,
and revert it right before we update the cache. We don't need to do it
for all classes of memtable writes, because after we finish flushing,
flush_one() will destroy the hashed element anyway.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <2d5a8f6ca57d5036f4850ac163557bca59b8063d.1480004384.git.glauber@scylladb.com>
"Clusters with a large number of nodes, or a low number of vnodes, and a
high number of shards, or a combination, suffer from an aliasing problem:
both vnodes and intra-node sharding consider the most significant bits
to select the owning node and owning shard respectively. Since the same
bits are used for both, a low number of vnodes leads to some shards
being overcommitted relative to others.
This series fixes the problem by sharding on bits 0:47 of the token
(murmur3 partitioner only), leaving the most significant 12 bits for
vnodes. Simulation shows that this value provides reasonable sharding
for 100-node, 30-shard clusters.
In order to prevent re-sharding sstables on each boot, token ranges for
the range are stored in a new sub-component of the sstable Statistics
component. With the default 12 ignored bits we have 4096 token ranges
for non-Level-compacted SSTables, which takes some space but is still
reasonable.
Fixes #1277."
Instead of calculating the owning shard from the sstable's partition
key range, delegate to the new sstable method for getting owning shard
infomation. This insulates us from changes in the sharding algorithm.
We have a semaphore controlling the amount of background work generated
by the memtable flush process. However, because we are not moving it
inside the memtable post-flush continuation, the units are being
released when we star the flush and not when we finish it.
That's not the intended behavior and that can cause flushes to
accumulate.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <b7dc1866ed3473b9b1862c433d59c5ebd8575dbc.1479839600.git.glauber@scylladb.com>
We current pass a region group to the memtable, but after so many recent
changes, that is a bit too low level. This patch changes that so we pass
a memtable list instead.
Doing that also has a couple of advantages. Mainly, during flush we must
get to a memtable to a memtable_list. Currently we do that by going to
the memtable to a column family through the schema, and from there to
the memtable_list.
That, however, involves calling virtual functions in a derived class,
because a single column family could have both streaming and normal
memtables. If we pass a memtable_list to the memtable, we can keep
pointer, and when needed get the memtable_list directly.
Not only that gets rid of the inheritance for aesthetic reasons, but
that inheritance is not even correct anymore. Since the introduction of
the big streaming memtables, we now have a plethora of lists per column
family and this transversal is totally wrong. We haven't noticed before
because we were flushing the memtables based on their individual sizes,
but it has been wrong all along for edge cases in which we would have to
resort to size-based flush. This could be the case, for instance, with
various plan_ids in flight at the same time.
At this point, there is no more reason to keep the derived classes for
the dirty_memory_manager. I'm only keeping them around to reduce
clutter, although they are useful for the specialized constructors and
to communicate to the reader exactly what they are. But those can be
removed in a follow up patch if we want.
The old memtable constructor signature is kept around for the benefit of
two tests in memtable_tests which have their own flush logic. In the
future we could do something like we do for the SSTable tests, and have
a proxy class that is friends with the memtable class. That too, is left
for the future.
Fixes#1870
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <811ec9e8e123dc5fc26eadbda82b0bae906657a9.1479743266.git.glauber@scylladb.com>
Virtual readers allow queries to selected tables, usually system
tables, to be answered by the engine. This is useful for tables which
aren't written by users and whose contents can be calculated on
demand.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
In the last iterations of this patchset, we have moved explicit flushes
to acquire the semaphore directly and the coalescing inside the
memtable_list. As a result, we are no longer keeping any kind of action
for them inside the condition variable. Checking for them has no longer
a purpose.
This is a cleanup patch that remove does checks.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <732676ccfe4ac93eb57aa799ec94b841499a01a6.1479500646.git.glauber@scylladb.com>
If a Column Family is non-durable, then its flushes will never create a
memtable flush reader. Our current flush logic depends on that being
created and destroyed to release the semaphore permits on the flush.
We will remove the permits ourselves it there is an exception, but not
under normal circumnstances. Given this issue, however, it would be more
adequate to always try to remove the permits after we flush. If the
permits were already removed by the flush reader, then this test will
just see that the permit is not in the map and return. But if it is
still there, then it is removed.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <049334c3b4bef620af2c7c045e6c84347dcf9013.1479498026.git.glauber@scylladb.com>
This patch addresses post-merge follow up comments by Tomek.
Basically, what we do is:
- we don't need to signal() from remove_from_flush_manager(), because
the explicit flushes no longer wait on the condition variable. So we
don't.
- We now wait on the stop() flushes (regardless of their return status)
so we can make sure that the _flush_queue will indeed be done with.
- we acquire the semaphore before shutting down the dirty_memory_manager
to make sure that there are no pending flushes
- the flush manager that holds the semaphore has to match in the exception
handler
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <a23ab5098934546c660a08de64cd9294bb3a2008.1479400239.git.glauber@scylladb.com>
The leakage results in deleted sstables being opened until shutdown, and disk
space isn't released. That's because column_family::rebuild_sstable_list()
will not remove reference to deleted sstables if an exception was triggered in
sstables::delete_atomically(). A sstable only has its files closed when its
object is destructed.
The exception happens when a major compaction is issued in parallel to a
regular one, and one of them will be unable to delete a sstable already deleted
by the other. That results in remove_by_toc_name() triggering boost::filesystem
::filesystem_error because TOC and temporary TOC don't exist.
We wouldn't have seen this problem if major compaction were going through
compaction manager, but remove_by_toc_name() and rebuild_sstable_list() should
be made resilient.
Fixes#1840.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <d43b2e78f9658e2c3c5bbb7f813756f18874bf92.1479390842.git.raphaelsc@scylladb.com>
The way we currently flush memtables, we seal the current one but wait
on a semaphore for the actual flush to proceed.
This is pointless, because if the flush is not proceeding we'll use up
memory for the new entries anyway, be them in a newly opened memtable or
not. As a matter of fact, by opening a new memtable we are foregoing
coalescing opportunities.
After recent changes to the flush paths, we are now in a position to do
differently. We move the semaphore earlier, and if we can't acquire it
we keep appending to the current memtable.
For explicit flushes, we'll queue and prioritize them over memory-based
flushes. This has the nice property of potentially coalescing various
flushes for the same CF into one.
Coalescing flushes for the same CF is particularly helpful for
commitlog-initiated flushes that can't complete within the flush period.
What we see currently, is that under heavy load the commitlog will keep
sealing memtables adding to the existing load.
Another interesting property of this approach is that we can keep the
disk utilization higher, by allowing a new flush to start before the
memtable is fully sealed. By design, every time a memtable is finished
flushing it will call revert_potentially_cleaned_up_memory() to revert
the virtual memory charges. That is the perfect moment for us to act.
It indicates that all the data flushing part is done.
The way we'll do it is by keeping the semaphore_units alive for this
memtable. When the flush ends, we destroy that object. This will
effectively trigger the next flush if there is a next flush that can be
initiated.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
After recent changes to the memtable code, there is no reason for us to
uphold a maximum memtable size. Now that we only flush one memtable at a
time anyway, and also have soft limit notifications from the
region_group_reclaimer, we can just set the soft limit to the target
size and let all of that be handled by the dirty_memory_manager.
It does have the added property that we'll be flushing when we globally
reach the soft limit threshold. In conditions in which we have multiple
CF writes fighting for memory, that guarantees that we will start
flushing much earlier than the hard limit.
The threshold is set to 1/4 of dirty memory. While in theory we would
prefer the memtables to go as big as 1/2 of dirty memory, in my
experiments I have found 1/4 to be a better fit, at least for the
moment.
The reason for such behavior is that in situations where we have slow
disks, setting the soft limit to 1/2 of dirty will put us in a situation
in which we may not have finished writing down the memtable when we hit
the limit, and then throttle. When set the threshold to 1/4 of dirty, we
don't throttle at all.
This behavior could potentially be fixed by not doing the full
memtable-based throttling after we do the commitlog throttling, but that
is not something realistic for the moment.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Currently the virtual dirty mechanism is not properly set for system
tables. We haven't divided the system table allowance by two, which
means it won't start thottling earlier as it was supposed to.
In practice, this has little effect because system table requests are
very well behaved, their sizes well known, and they tend to be
force-flushed. But we should be consistent.
Signed-off-by: Glauber Costa <glauber@scylladb.com>