* seastar 397685c...c1dbd89 (13):
> lowres_clock: drop cache-line alignment for _timer
> net/packet: add missing include
> Merge "Adding histogram and description support" from Amnon
> reactor: Fix the error: cannot bind 'std::unique_ptr' lvalue to 'std::unique_ptr&&'
> Set the option '--server' of tests/tcp_sctp_client to be required
> core/memory: Remove superfluous assignment
> core/memory: Remove dead code
> core/reactor: Use logger instead of cerr
> fix inverted logic in overprovision parameter
> rpc: fix timeout checking condition
> rpc: use lowres_clock instead of high resolution one
> semaphore: make semaphore's clock configurable
> rpc: detect timedout outgoing packets earlier
Includes treewide change to accomodate rpc changing its timeout clock
to lowres_clock.
Includes fixup from Amnon:
collectd api should use the metrics getters
As part of a preperation of the change in the metrics layer, this change
the way the collectd api uses the metrics value to use the getters
instead of calling the member directly.
This will be important when the internal implementation will changed
from union to variant.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1485457657-17634-1-git-send-email-amnon@scylladb.com>
That's because a single shard is used to calculate generation for new
sstables in upload directory, and that will result in that single shard
sharing all the resources with other shards.
For refresh without upload dir, it currently works fine because we
reshuffle column family dir instead.
flush_upload_dir() is now a free function, takes a distributed database
object, and uses calculate_shard_from_sstable_generation() to decide
which shard will move sstable using its own generation namespace.
Fixes#2008.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <b0cccf7bbb61416ff8718bac92fdca90cc5fb9c9.1484253232.git.raphaelsc@scylladb.com>
This patch ensures that when adding a shared sstable, we select only
one cpu to update that column family's stats. This is important so we
don't overestimated the on-disk size of sstables when resharding
This fixes only a temporary miscount of the current load, since shared
sstables are eventually re-written, but a fixes a permanent miscount
of the total load.
Refs #1592
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170119144823.31041-1-duarte@scylladb.com>
After resharding, sstables may be owned by all shards, which
means that file descriptors and memory usage for metadata will
increase by a factor equal to number of shards. That can easily
lead to OOM.
SSTable components are immutable, so they can be stored in one
shard and shared with others that need it. We use the following
formula to decide which shard will open the sstable and share
it with the others: (generation % smp::count), which is the
inverse of how we calculate generation for new sstables.
So if no resharding is performed, everything is shard-local.
With this approach, resource usage due to loaded sstables will
be evenly distributed among shards.
For this approach to work, we now only populate keyspaces from
shard 0. It's now the sole responsible for iterating through
column family dirs. In addition, most of population functions
are now free and take distributed database object as parameter.
Fixes#1951.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
"This patchset contains fixes for the changes introduced in "Query result
size limiting". It also improves handling of short data reads.
I order to minimise chances of digest mismatch during data queries replicas
that were asked just to return a digest also keep track of the size of the
data (in the IDL representation) so that they would stop at the same point
nodes doing full data queries would. Moreover, data queries are not
affected by per-shard memory limit and the coordinator sends individual
result size limits to replicas in order not to depend on hardcoded values.
It is still possible to get digest mismatches if the IDL changes (e.g. a
new field is added), but, hopefully, that won't be a serious problem."
* 'pdziepak/short-read-fixes/v4' of github.com:cloudius-systems/seastar-dev:
query: introduce result_memory_accounter::foreign_state
storage_proxy: fix short reads in parallel range queries
storage_proxy: pass maximum result size to replicas
mutation_partition: use result limiter for digest reads
query: make result_memory_limiter constants available for linker
result_memory_limiter: add accounter for digest reads
idl: allow writers to use any output stream
result_memory_limiter: split new_read() to new_{data, mutation}_read()
idl: is_short_read() was added in 1.6
mutation_partition: honour allowed_short_read for static rows
storage_proxy: fix _is_short_read computation
storage_proxy: disallow short reads if got no live rows
storage_proxy: don't stop after result with no live rows
We may want to change the default individual result size limit in the
future. If it is provided by the coordinator and not hardcoded in the
replicas this can be done without causing data query digest mismatches
or wasteful mutation query results.
Refs #1943.
* 'tgrabiec/optimize-bloom-filter' of github.com:cloudius-systems/seastar-dev:
db: Compute key hash once in partition_presence_checker
bloom_filter: Allow checking presence using pre-hashed key
db: Use incremental selector in partition_presence_checker
This patch extracts update_column_family from schema_tables into
database so it can be used when adding materialized views, in future
patches.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch moves some duplicate code into the
add_column_family_and_create_directory() function. It also saves some
superfluous keyspace lookups and readies the code to be used by
materialized views.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds utility functions to keyspace_metadata to select only
the tables or only the views out of all the schemas.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds the view class, which will contains functions related
to populating a view, either from the base table's write path or from
the view building mechanism which copies over already existing data in
the base table.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This reduces the number of sstables we need to check to only those
whose token range overlaps with the key. Reduces cache update
time. Especially effective with leveled compaction strategy.
Refs #1943.
Incremental selector works with an immutable sstable set, so cache
updates need to be serialized. Otherwise we could mispopulate due to
stale presence information.
Presence checker interface was changed to accept decorated key in
order to gain easy access to the token, which is required by
the incremental selector.
A case could be made that we should have counters for them no matter
what, since it can help us reason about the distribution of memory among
the groups. But with the hierarchy being broken in 1.5 it becomes even
more important. Now by looking solely at dirty, we will have no idea
about how much memory we are using in those groups.
After this patch, the dirty_memory_manager will register its metrics
for the 3 groups that we have, and the legacy names will be used to show
totals.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <0d04ca4c7e8472097f16a5dc950b77c73766049e.1481831644.git.glauber@scylladb.com>
A naive approach was to create a set of readers for each range and pass
them all to combining reader. This however performed badly if the number
of ranges was high.
The solution is to use multi range reader which uses only a single set
of readers and fast forwards from range to range when necessary. This
adds another requirement that the ranges passed to
make_streaming_reader() are sorted and disjoint.
This pach ensures than when we start executing a query a minimum result
size is reserved from result_memory_limiter.
Moreover, range queries need a way of merging memory usage information
from different shards.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Issue #1918 describes a problem, in which we are generating smaller
memtables than we could, and therefore not respecting the flush
criteria.
That happens because group sizes (and limits) for pressure purposes, and
the the soft threshold is currently at 40 %. This causes system group's
soft threshold to be way below regular's virtual dirty limit and close
to regular group's soft threshold. The system group was very likely to
become under soft pressure when regular was because writes to regular
group are not yet throttled when they cross both soft thresholds.
This is a direct consequence of the linear hierarchy between the regions
and to guarantee that it won't happen we would have acqire the semaphore
of all ancestor regions when flushing from a child region. While that
works, it can lead to problems on its own, like priority inversion if
the regions have different priorities - like streaming and regular, and
groups lower in the hierarchy, like user, blocking explicit flushes
from their ancestors
To fix that, this patch reorganizes the dirty memory region groups so
that groups are now completely independent. As a disadvantage, when
streaming happen we will draw some memory from the cache, but we will
live with it for the time being.
Fixes#1918
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We had a flush_token structure in addition to the flush_permit because
we needed to keep a pointer to the dirty_memory_manager and apply
changes to the region group upon the region destruction. Since Tomek's
latest series, this is no longer needed and now this structure doesn't
have a place in the world anymore. Simplify the code by removing it.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Done in a separate patch to reduce clutter in the main patch.
Soon we'll be testing for one more condition.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We will soon need to hold more than a semaphore_units<> object per
flush, potentially.
Preparation patch for that.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Some of our CFs can't be flushed. Those are the ones who are not marked
as having durable writes. We treat them just the same from the point of
view of the flush logic, but they provide a function that doesn't do
anything and just returns right away.
We already had troubles with that in the past, and that also poses a
problem for an upcoming patch reworking the flush memtable pick
criteria.
It's easier, simpler, and cleaner, to just make the memtable_list aware
it can't flush. Achieving that is also not very complicated: we just
need a special constructor that doesn't take a seal function and then we
make sure that it is initialized to an empty std::function
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Allow to make a streaming reader with a vector of ranges in addition to
a single range. This will be used soon in following streaming patch.
We can make the reader more efficient later.
The procedure to calculate max purgeable timestamp is optimized
by only visiting sstables that overlap with key being currently
compacted. That's done using incremental sstable selector.
Function to calculate maximum purgeable timestamp is made 10 times
faster when compacting sstables overlap with 10% of all sstables.
Fixes#1322.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The implementation assumes that memtable's region group is owned by
dirty_memory_manager, and tries to obtain a reference to it like this:
boost::intrusive::get_parent_from_member(_region.group(), &dirty_memory_manager::_region_group));
This is undefined behavior when the region's group does not come from
dirty manager. It's safer to be explicit about this dependency by
taking a reference to dirty_memory_manager in the constructor.
Streaming memtable have a delayed mode where many flushes are coalesced
together into one, with the actual flush happening later and propagated
to all the previous waiters.
However, the timer that triggers the actual flush was not using the
newly introduced flush infrastructure. This was a minor problem because
those flushes wouldn't try to take the semaphore, and so we could have
many flushes going on at the same time.
What was a potential performance issue became a correctness issue when
we moved the reversal of the dirty memory accounting out of
revert_potentially_cleaned_up_memory() into remove_from_flush_manager().
Since the latter is only called through the flush infrastructure, it
simply wasn't called. So the deferral of the reversal exposed this bug.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <0d5755375bc27524b8cfb9970c76d492b14d9eea.1480522742.git.glauber@scylladb.com>
New sstables are loaded and added in parallel, meaning that scylla can
potentially return stale data if a new sstable containing a tombstone
wasn't loaded yet. Compaction should also not run until all new sstables
are added for similar reasons.
Fix is about separating blocking and non-blocking steps to allow
atomic add of multiple new sstables.
Fixes#1368.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <14283b8a4a69127071d1fabef320a93c91817ec2.1480356073.git.raphaelsc@scylladb.com>
"We currently write the size_estimates system table for every schema
on a periodic basis, currently set to 5 minutes, which can interfere
with an ongoing workload.
This patchset virtualizes it such that queries are intercepted and we
calculate the results on the fly, only for the ranges the caller is interested in.
Fixes#1616"
* 'virtual-estimates/v4' of github.com:duarten/scylla:
size_estimates_virtual_reader: Add unit test
db: Delete size_estimates_recorder
size_estimates: Add virtual reader
column_family: Add support for virtual readers
storage_service: get_local_tokens() returns a future
nonwrapping_range: Add slice() function
range: Find a sequence's lower and upper bounds
system_keyspace: Build mutations for size estimates
size_estimates: Store the token range as bytes
range_estimates: Add schema
murmur3_partitioner: Convert maximum_token to sstring
When we finish writing a memtable, we revert the dirty memory charges
immediately. When we do that, dirty memory will grow back to what it
was, and soon (we hope) will go down again when we release the requests
for real.
During that time, we may not accept new requests. Sealing can take a
long time, specially in the face of Linux issues like the ones we have
seen in the past. It also will take proportionally more time if the
SSTables end up being small, which is a possibility in some scenarios.
This patch changes the dirty_memory_manager so that the charges won't be
reverted right after we finish the flush. Rather, we will hold on to it,
and revert it right before we update the cache. We don't need to do it
for all classes of memtable writes, because after we finish flushing,
flush_one() will destroy the hashed element anyway.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <2d5a8f6ca57d5036f4850ac163557bca59b8063d.1480004384.git.glauber@scylladb.com>
We current pass a region group to the memtable, but after so many recent
changes, that is a bit too low level. This patch changes that so we pass
a memtable list instead.
Doing that also has a couple of advantages. Mainly, during flush we must
get to a memtable to a memtable_list. Currently we do that by going to
the memtable to a column family through the schema, and from there to
the memtable_list.
That, however, involves calling virtual functions in a derived class,
because a single column family could have both streaming and normal
memtables. If we pass a memtable_list to the memtable, we can keep
pointer, and when needed get the memtable_list directly.
Not only that gets rid of the inheritance for aesthetic reasons, but
that inheritance is not even correct anymore. Since the introduction of
the big streaming memtables, we now have a plethora of lists per column
family and this transversal is totally wrong. We haven't noticed before
because we were flushing the memtables based on their individual sizes,
but it has been wrong all along for edge cases in which we would have to
resort to size-based flush. This could be the case, for instance, with
various plan_ids in flight at the same time.
At this point, there is no more reason to keep the derived classes for
the dirty_memory_manager. I'm only keeping them around to reduce
clutter, although they are useful for the specialized constructors and
to communicate to the reader exactly what they are. But those can be
removed in a follow up patch if we want.
The old memtable constructor signature is kept around for the benefit of
two tests in memtable_tests which have their own flush logic. In the
future we could do something like we do for the SSTable tests, and have
a proxy class that is friends with the memtable class. That too, is left
for the future.
Fixes#1870
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <811ec9e8e123dc5fc26eadbda82b0bae906657a9.1479743266.git.glauber@scylladb.com>
Virtual readers allow queries to selected tables, usually system
tables, to be answered by the engine. This is useful for tables which
aren't written by users and whose contents can be calculated on
demand.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch addresses post-merge follow up comments by Tomek.
Basically, what we do is:
- we don't need to signal() from remove_from_flush_manager(), because
the explicit flushes no longer wait on the condition variable. So we
don't.
- We now wait on the stop() flushes (regardless of their return status)
so we can make sure that the _flush_queue will indeed be done with.
- we acquire the semaphore before shutting down the dirty_memory_manager
to make sure that there are no pending flushes
- the flush manager that holds the semaphore has to match in the exception
handler
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <a23ab5098934546c660a08de64cd9294bb3a2008.1479400239.git.glauber@scylladb.com>
The way we currently flush memtables, we seal the current one but wait
on a semaphore for the actual flush to proceed.
This is pointless, because if the flush is not proceeding we'll use up
memory for the new entries anyway, be them in a newly opened memtable or
not. As a matter of fact, by opening a new memtable we are foregoing
coalescing opportunities.
After recent changes to the flush paths, we are now in a position to do
differently. We move the semaphore earlier, and if we can't acquire it
we keep appending to the current memtable.
For explicit flushes, we'll queue and prioritize them over memory-based
flushes. This has the nice property of potentially coalescing various
flushes for the same CF into one.
Coalescing flushes for the same CF is particularly helpful for
commitlog-initiated flushes that can't complete within the flush period.
What we see currently, is that under heavy load the commitlog will keep
sealing memtables adding to the existing load.
Another interesting property of this approach is that we can keep the
disk utilization higher, by allowing a new flush to start before the
memtable is fully sealed. By design, every time a memtable is finished
flushing it will call revert_potentially_cleaned_up_memory() to revert
the virtual memory charges. That is the perfect moment for us to act.
It indicates that all the data flushing part is done.
The way we'll do it is by keeping the semaphore_units alive for this
memtable. When the flush ends, we destroy that object. This will
effectively trigger the next flush if there is a next flush that can be
initiated.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
After recent changes to the memtable code, there is no reason for us to
uphold a maximum memtable size. Now that we only flush one memtable at a
time anyway, and also have soft limit notifications from the
region_group_reclaimer, we can just set the soft limit to the target
size and let all of that be handled by the dirty_memory_manager.
It does have the added property that we'll be flushing when we globally
reach the soft limit threshold. In conditions in which we have multiple
CF writes fighting for memory, that guarantees that we will start
flushing much earlier than the hard limit.
The threshold is set to 1/4 of dirty memory. While in theory we would
prefer the memtables to go as big as 1/2 of dirty memory, in my
experiments I have found 1/4 to be a better fit, at least for the
moment.
The reason for such behavior is that in situations where we have slow
disks, setting the soft limit to 1/2 of dirty will put us in a situation
in which we may not have finished writing down the memtable when we hit
the limit, and then throttle. When set the threshold to 1/4 of dirty, we
don't throttle at all.
This behavior could potentially be fixed by not doing the full
memtable-based throttling after we do the commitlog throttling, but that
is not something realistic for the moment.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We would like to know from which region is a particular flush coming
from, and account accordingly. The reasoning behind that, is that soon
we'll be driving the flushes internally from the dirty_memory_manager
without explcitly triggering them.
We need to start a flush before the current one finishes, otherwise
we'll have a period without significant disk activity when the current
SSTable is being sealed, the caches are being updated, etc.
Signed-off-by: Glauber Costa <glauber@scylladb.com>