streaming generates lots of small sstables with large token range,
which triggers O(N^2) in space in interval map.
level 0 sstables will now be stored in a structure that has O(N)
in space complexity and which will be included for every read.
Fixes#2287.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170417185509.6633-1-raphaelsc@scylladb.com>
(cherry picked from commit 11b74050a1)
"Before, the logic for releasing writes blocked on dirty worked like this:
1) When region group size changes and it is not under pressure and there
are some requests blocked, then schedule request releasing task
2) request releasing task, if no pressure, runs one request and if there are
still blocked requests, schedules next request releasing task
If requests don't change the size of the region group, then either some request
executes or there is a request releasing task scheduled. The amount of scheduled
tasks is at most 1, there is a single releasing thread.
However, if requests themselves would change the size of the group, then each
such change would schedule yet another request releasing thread, growing the task
queue size by one.
The group size can also change when memory is reclaimed from the groups (e.g.
when contains sparse segments). Compaction may start many request releasing
threads due to group size updates.
Such behavior is detrimental for performance and stability if there are a lot
of blocked requests. This can happen on 1.5 even with modest concurrency
because timed out requests stay in the queue. This is less likely on 1.6 where
they are dropped from the queue.
The releasing of tasks may start to dominate over other processes in the
system. When the amount of scheduled tasks reaches 1000, polling stops and
server becomes unresponsive until all of the released requests are done, which
is either when they start to block on dirty memory again or run out of blocked
requests. It may take a while to reach pressure condition after memtable flush
if it brings virtual dirty much below the threshold, which is currently the
case for workloads with overwrites producing sparse regions.
I saw this happening in a write workload from issue #2021 where the number of
request releasing threads grew into thousands.
Fix by ensuring there is at most one request releasing thread at a time. There
will be one releasing fiber per region group which is woken up when pressure is
lifted. It executes blocked requests until pressure occurs."
* tag 'tgrabiec/lsa-single-threaded-releasing-v2' of github.com:cloudius-systems/seastar-dev:
tests: lsa: Add test for reclaimer starting and stopping
tests: lsa: Add request releasing stress test
lsa: Avoid avalanche releasing of requests
lsa: Move definitions to .cc
lsa: Simplify hard pressure notification management
lsa: Do not start or stop reclaiming on hard pressure
tests: lsa: Adjust to take into account that reclaimers are run synchronously
lsa: Document and annotate reclaimer notification callbacks
tests: lsa: Use with_timeout() in quiesce()
(cherry picked from commit 7a00dd6985)
"This patchset contains fixes for the changes introduced in "Query result
size limiting". It also improves handling of short data reads.
I order to minimise chances of digest mismatch during data queries replicas
that were asked just to return a digest also keep track of the size of the
data (in the IDL representation) so that they would stop at the same point
nodes doing full data queries would. Moreover, data queries are not
affected by per-shard memory limit and the coordinator sends individual
result size limits to replicas in order not to depend on hardcoded values.
It is still possible to get digest mismatches if the IDL changes (e.g. a
new field is added), but, hopefully, that won't be a serious problem."
* 'pdziepak/short-read-fixes/v4' of github.com:cloudius-systems/seastar-dev:
query: introduce result_memory_accounter::foreign_state
storage_proxy: fix short reads in parallel range queries
storage_proxy: pass maximum result size to replicas
mutation_partition: use result limiter for digest reads
query: make result_memory_limiter constants available for linker
result_memory_limiter: add accounter for digest reads
idl: allow writers to use any output stream
result_memory_limiter: split new_read() to new_{data, mutation}_read()
idl: is_short_read() was added in 1.6
mutation_partition: honour allowed_short_read for static rows
storage_proxy: fix _is_short_read computation
storage_proxy: disallow short reads if got no live rows
storage_proxy: don't stop after result with no live rows
(cherry picked from commit 868b4d110c)
The test assumed that mutations added to the commitlog are visible to
reads as soon as a new segment is opened. That's not true because
buffers are written back in the background, and new segment may be
active while the previous one is still being written or not yet
synced.
Fix the test so that it expectes that the number of mutations read
this way is <= the number of mutations read, and that after all
segments are synced, the number of mutations read is equal.
Message-Id: <1481630481-19395-1-git-send-email-tgrabiec@scylladb.com>
Issue #1918 describes a problem, in which we are generating smaller
memtables than we could, and therefore not respecting the flush
criteria.
That happens because group sizes (and limits) for pressure purposes, and
the the soft threshold is currently at 40 %. This causes system group's
soft threshold to be way below regular's virtual dirty limit and close
to regular group's soft threshold. The system group was very likely to
become under soft pressure when regular was because writes to regular
group are not yet throttled when they cross both soft thresholds.
This is a direct consequence of the linear hierarchy between the regions
and to guarantee that it won't happen we would have acqire the semaphore
of all ancestor regions when flushing from a child region. While that
works, it can lead to problems on its own, like priority inversion if
the regions have different priorities - like streaming and regular, and
groups lower in the hierarchy, like user, blocking explicit flushes
from their ancestors
To fix that, this patch reorganizes the dirty memory region groups so
that groups are now completely independent. As a disadvantage, when
streaming happen we will draw some memory from the cache, but we will
live with it for the time being.
Fixes#1918
Signed-off-by: Glauber Costa <glauber@scylladb.com>
When row cache is disabled, update_cache() will do nothing to the
memtable. Active readers may keep the memtable alive for unbounded
amount of time, preventing it from going away. This doesn't play well
with virtual dirty accounting. Soon before calling update_cache(), the
memory which was subtracted during flush is added back to the amount
of virtual dirty memory. If there was write pressure all along, we
will be at the dirty memory limit. When we give back subtracted memory
this will put virtual dirty way above the limit. This will stall all
writes until another memtable flush drags virtual dirty down or
readers finally release the memtable. We want to prevent upward
jumps of virtual dirty.
First part of the fix is to ensure that as long as the memtable's
region is in the dirty group, we will not revert flushed memory. This
must happen synchronously from region's memory being removed from the
group in order to prevent upward virtual dirty jumps. To make this
easier, tracking of flushed memory was moved to the memtable object.
Another part of the fix is to gradually clear the memtable when cache
is disabled in a similar fashion as when it's moved to cache. This
ensures that the actual memory held by memtable's region is released
sooner than it dies.
Refs #1879
The implementation assumes that memtable's region group is owned by
dirty_memory_manager, and tries to obtain a reference to it like this:
boost::intrusive::get_parent_from_member(_region.group(), &dirty_memory_manager::_region_group));
This is undefined behavior when the region's group does not come from
dirty manager. It's safer to be explicit about this dependency by
taking a reference to dirty_memory_manager in the constructor.
If exception is triggered early in boot when doing an I/O operation,
scylla will fail because io checker calls storage service to stop
transport services, and not all of them were initialized yet.
Scylla was failing as follow:
scylla: ./seastar/core/sharded.hh:439: Service& seastar::sharded<Service>::local()
[with Service = gms::gossiper]: Assertion `local_is_initialized()' failed.
Aborting on shard 0.
Backtrace:
0x000000000048a2ca
0x000000000048a3d3
0x00007fc279e739ff
0x00007fc279ad6a27
0x00007fc279ad8629
0x00007fc279acf226
0x00007fc279acf2d1
0x0000000000c145f8
0x000000000110d1bc
0x000000000041bacd
0x00000000005520f1
0x00007fc279aeaf1f
Aborted (core dumped)
Refs #883.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Signed-off-by: Asias He <asias@scylladb.com>
Message-Id: <963f7b0f5a7a8a1405728b414a7d7a6dccd70581.1479172124.git.asias@scylladb.com>
Problem will cause size tiered to return small jobs when there are
more than max_threshold sstables of similar size. For example, if
max_threshold is 32, and there are 36 sstables of similar size,
strategy will only return 4 sstables to be compacted. That's because
we incorrectly create a new bucket when it meets the max threshold.
What we should do is to allow buckets to grow beyond max threshold
and trim them when selecting the most suitable one for compaction.
Important to mention that estimation for size tiered will now
work better when there are more than max_threshold sstables of
similar size.
Fixes#1901.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <080bad70d6cb86eaf52ac1bdd6765ac47aab5b03.1478316140.git.raphaelsc@scylladb.com>
"We currently write the size_estimates system table for every schema
on a periodic basis, currently set to 5 minutes, which can interfere
with an ongoing workload.
This patchset virtualizes it such that queries are intercepted and we
calculate the results on the fly, only for the ranges the caller is interested in.
Fixes#1616"
* 'virtual-estimates/v4' of github.com:duarten/scylla:
size_estimates_virtual_reader: Add unit test
db: Delete size_estimates_recorder
size_estimates: Add virtual reader
column_family: Add support for virtual readers
storage_service: get_local_tokens() returns a future
nonwrapping_range: Add slice() function
range: Find a sequence's lower and upper bounds
system_keyspace: Build mutations for size estimates
size_estimates: Store the token range as bytes
range_estimates: Add schema
murmur3_partitioner: Convert maximum_token to sstring
"Clusters with a large number of nodes, or a low number of vnodes, and a
high number of shards, or a combination, suffer from an aliasing problem:
both vnodes and intra-node sharding consider the most significant bits
to select the owning node and owning shard respectively. Since the same
bits are used for both, a low number of vnodes leads to some shards
being overcommitted relative to others.
This series fixes the problem by sharding on bits 0:47 of the token
(murmur3 partitioner only), leaving the most significant 12 bits for
vnodes. Simulation shows that this value provides reasonable sharding
for 100-node, 30-shard clusters.
In order to prevent re-sharding sstables on each boot, token ranges for
the range are stored in a new sub-component of the sstable Statistics
component. With the default 12 ignored bits we have 4096 token ranges
for non-Level-compacted SSTables, which takes some space but is still
reasonable.
Fixes #1277."
Sharding on the most significant token bits aliases with the vnode mechanism,
which also uses the most significant bits; this requires a huge number of
vnodes to achieve good sharding.
This patch teaches the murmur3 partitioner to ignore the most significant
N bits when calculating a token's hard, so we use token bits which still have
some entropy. In effect, with changes the token range layout from
shard 0
shard 1
...
shard S-1
to
shard 0
shard 1
...
shard S-1
shard 0
shard 1
...
shard S-1
...
shard 0
shard 1
...
shard S-1
Where the number of repetitions of the block is 2^(ignored msb bits).
For compatibility, the default is zero ignored bits, matching the pre-patch
state, until we wire things up.
After 7c28ed, the schemas defined in the test became compressed by
default. This patch changes the test so that it is explicit about
which schemas shouldn't define a compressor.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1478646530-5558-1-git-send-email-duarte@scylladb.com>
partition_range passed to row_cache::make_reader
has to be kept alive as long as the resulting reader
is used.
Otherwise weird things start to happen.
This used to work just because of a pure luck.
When I started changing the row_cache implementation
I run into very weird behaviors for this tests.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <2c9e337dbbcf35f4e1394cad043eda10b8c2bd4a.1478602876.git.piotr@scylladb.com>
Fixes#1775
stream lacks a check "is_open", which is a bummer. We have to both
prevent exception propagation and add a flag of our own to make sure
exceptions in producer code reaches consumer, and does not simply
get lost in the reactor.
Message-Id: <1478508817-18854-1-git-send-email-calle@scylladb.com>
"The atomic sstable deletion provides exception safety at the cost of
quadratic behavior in the number of sstables awaiting deletion. This
causes high cpu utilization during startup.
Change the code to avoid quadratic complexity, and add some unit tests.
See #1812."
Wrapping ranges are a pain, so we are moving wrap handling to the edges.
Since cql can't generate wrapping ranges, this means thrift and the ring
maintenance code; also range->ring transformations need to merge the first
and last ranges.
Message-Id: <1478105905-31613-1-git-send-email-avi@scylladb.com>
Commit 7dcd70124a "tests/sstables: add
test for fast forwarding reader" added a test for skipping parts of
sstable. Unfortunately, it did not include the sstables it was trying to
read.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Originally, cache tests checked how many times a mutation reader was
created from the underlying mutation source to determine whether
continuity flag is working correctly.
This is not going to work with fast forwarding mutation readers so the
test is switched to count number of mutations (+ end of stream markers)
returned from underlying mutaiton readers which is much less fragile.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
That overload was used only by unit test and violated guarantee that
partition range lives until mutation reader is done.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Make split_after() more generic by allowing split_point to be anywhere,
not just within the input range. If the split_point is before, the entire
range is returned; and if it is after, stdx::nullopt is returned.
"before" and "after" are not well defined for wrap-around ranges, so
but we are phasing them out and soon there will not be
wrapping_range::split_after() users.
This is a prerequisite for converting partition_range and friends to
nonwrapping_range.
Message-Id: <1475765099-10657-1-git-send-email-avi@scylladb.com>
Remove inclusions from header files (primary offender is fb_utilities.hh)
and introduce new messaging_service_fwd.hh to reduce rebuilds when the
messaging service changes.
Message-Id: <1475584615-22836-1-git-send-email-avi@scylladb.com>
"Description:
============
Scylla currently suffers from a brick wall behavior of the request throttler.
Requests pile up until we reach the dirty memory limit, at which point we stop
serving them until we have freed enough memory to allow for more requests.
The problem is that freeing dirty memory means writing an SSTable to completion.
That can take a long time, even if we are blessed with great disks. Those long
waiting times can and will translate into timeouts. That is bad behavior.
What this patch does is introduce one form of virtual dirty memory accounting.
Instead of allowing 100 % of the dirty memory to be filled up until we stop
accepting requests, we will do that when we reach 50 % of memory. However,
instead of releasing requests only when an SSTable is fully written, we start
releasing them when some memory was written.
The practical effect of that, is that once we reach 50 % occupancy in our dirty
memory region, we will bring the system from CPU speed to disk speed, and will
start accepting requests only at the rate we are able to write memory back.
Results
=======
With this patchset running a load big enough to easily saturate the disk,
(commitlog disabled to highlight the effects of the memtable writer), I am able
to run scylla for many minutes, with timeouts occurring only when I run out of
disk space, whereas without this patch a swarm of timeouts would start merely 2
seconds after the load started - and would never get stable.
In V2, I have sent a set of graphs illustrating the performance of this solution.
This version does not have any significant differences in that front.
For details, please refer to
https://groups.google.com/d/msg/scylladb-dev/iCvD-3Z-QqY/EM8KUh_MAQAJ
Accuracy of the accounting:
---------------------------
It is important for us to be as accurate as possible when accounting freed
memory, since every byte we mark as freed may allow one or more requests to be
executed. I have measured the accuracy of this approach (ignoring padding,
object size for the mutation fragments) to be 99.83 % of used memory in the
test workload I have ran (large, 65k mutations). Memtables under this circumnstance
tend to have a very high occupancy ratio because throttle breeds idle, and idle
breeds compact-on-idle.
Known Issues:
-------------
A lot of time can be elapsed between destroying the flush_reader and actually
releasing memory. The release of memory only happens when the SSTable is fully
sealed, and we have to flush the files, as well as finish writing all SSTable
components at this point. This happened in practice with a buggy kernel that
would result in flushes taking a long time.
After that is fixed, this is just a theoretical problem and in practice it
shouldn't matter given the time we expect those operations to take."
* 'virtual-dirty-v6' of github.com:glommer/scylla:
database: allow virtual dirty memory management
streamed_mutation: make _buffer private
add accounting of memory read to partition_snapshot_reader
move partition_snapshot_reader code to header file
LSA: allow a group to query its own region group
memtables: split scanning reader in two
sstables: use special reader for writing a memtable
LSA: export information about object memory footprint
LSA: export information about size of the throttle queue
database: export virtual dirty bytes region group
We allocate objects of a certain size, but we use a bit more memory to hold
them. To get a clerer picture about how much memory will an object cost us, we
need help from the allocator. This patch exports an interface that allow users
to query into a specific allocator to get that information.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
"This patch set ensures we can correctly handle queries
where the minimum token is specified."
* 'min-token/v3' of github.com:duarten/scylla:
cql_query_test: Add test case for min/max token bounds
token_restriction: Deal with minimum tokens
partitioner: Parse token from bytes
When max sstable size is increased, higher levels are suffering from
starvation because we decide to compact a given level if the following
calculation results in a number greater than 1.001:
level_size(L) / max_size_for_level_l(L)
Fixes#1720.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>