Restrict readers based on their memory consumption, instead of the count
of the top-level readers. To do this an interposer is installed at the
input_stream level which tracks buffers emitted by the stream. This way
we can have an accurate picture of the readers' actual memory
consumption.
New readers will consume 16k units from the semaphore up-front. This is
to account for their own memory consumption, apart from the buffers they
will allocate. Creating the reader will be deferred to when there are
enough resources to create it. As before only new readers will be
blocked on an exhausted semaphore, existing readers can continue to
work.
"Currently restricting_mutation_reader restricts mutation_readers on a
count basis. This is inaccurate on multiple levels. The reader might be
a combined_mutation_reader, which might be composed of multiple
individual readers, whose number might change during the lifetime of the
reader. The memory consumption of the readers can vary and may change
during the lifetime of the reader as well.
To remedy this, make the restriction memory-consumption based. The
restricting semaphore is now configured with the amount of memory
(bytes) that its readers are allowed to consume in total. New readers
consume 128k units up-front to account for read-ahead buffers, and then
consume additional units for any buffer (returned
from input_stream<>::read()) they keep around.
Like before, readers already allowed to read will not be blocked;
instead, new readers will be blocked on their first read if all the units
are consumed."
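The admission scheme described above can be sketched roughly as follows. This is a minimal, single-threaded illustration and not the actual implementation: reader_memory_budget and all of its members are hypothetical names, and the up-front cost is a constructor parameter because the messages above quote different figures for it (16k and 128k).

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical stand-in for a memory-based reader semaphore.
class reader_memory_budget {
    std::ptrdiff_t _available;   // may go negative: admitted readers are never blocked
    std::ptrdiff_t _reader_cost; // units charged up-front for each new reader
    std::deque<std::function<void()>> _waiters; // deferred reader creations
public:
    reader_memory_budget(std::ptrdiff_t bytes, std::ptrdiff_t reader_cost)
        : _available(bytes), _reader_cost(reader_cost) {}

    // Create the reader now if the up-front cost fits, otherwise defer it.
    void admit(std::function<void()> create_reader) {
        if (_waiters.empty() && _available >= _reader_cost) {
            _available -= _reader_cost;
            create_reader();
        } else {
            _waiters.push_back(std::move(create_reader));
        }
    }

    // Charged for every buffer an admitted reader keeps around.
    void consume(std::ptrdiff_t bytes) { _available -= bytes; }

    // Called when a tracked buffer or a whole reader goes away.
    void signal(std::ptrdiff_t bytes) {
        _available += bytes;
        while (!_waiters.empty() && _available >= _reader_cost) {
            auto create_reader = std::move(_waiters.front());
            _waiters.pop_front();
            _available -= _reader_cost;
            create_reader();
        }
    }

    std::ptrdiff_t available() const { return _available; }
    std::size_t waiters() const { return _waiters.size(); }
};
```

Note that consume() never blocks, so only admit() can be delayed, which matches the rule that existing readers continue to work on an exhausted semaphore.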
Fixes #2692.
* 'bdenes/restricting_mutation_reader-v4' of https://github.com/denesb/scylla:
Update reader restriction related metrics
Add restricted_reader_test unit test
restricted_mutation_reader: restrict based-on memory consumption
mutation_reader.hh: Move restricted_reader related code
Dirty memory manager for non-system column families was being used
when applying mutations to system cfs.
That previously led to a deadlock when updating history. Basically,
write disable waits on compaction, and compaction waits on a write
that would release dirty memory for updating compaction history.
Using the correct dirty memory manager alone wouldn't solve this
problem if writes are disabled for a system cf, but together with the
previous change, which updates history outside the sstable lock, the
problem is completely solved.
Refs #2769.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170918215238.9810-3-raphaelsc@scylladb.com>
This is needed because compaction can deadlock if refresh disables
writes, which waits for compaction, and compaction in turn waits for
dirty memory[1] that would be released by a memtable write.
Dirty memory manager for non-system cfs was being used for system cfs,
which was useful for exposing this problem.
[1]: when updating compaction history.
Fixes #2769.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170918215238.9810-2-raphaelsc@scylladb.com>
I am about to introduce a read monitor, along with the infrastructure
to manage it. Having it all live in sstables.hh forces us to include
that header every time, even in places that don't really need it.
Move to its own header.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Right now we pass a permit to the memtable writer and that permit is
used inside write_memtable_to_sstable to compose a write_monitor.
We would like to extend the write_monitor to include other things, that
right now are not available as parameters to write_memtable_to_sstable -
and which are possibly too specialized to be.
The solution for that is to pass the write_monitor instead of the permit
to the writer. Conceptually, that also makes sense because the
write_monitor is something the sstable writer is aware of. Permits, on
the other hand, are a database concept that is alien to the sstable
writer.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20170915032836.21154-1-glauber@scylladb.com>
Collect coordinator-side read statistics per CF and use them in the
percentile speculative read executor. Getting a percentile from an
estimated_histogram
object is rather expensive, so cache it and recalculate only once per
second (or if requested percentile changes).
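The caching rule can be sketched as below. This is only an illustration under stated assumptions: cached_percentile is a hypothetical class, and a sorted vector of samples stands in for the estimated_histogram whose real interface is not shown here.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical cache in front of an expensive percentile computation.
class cached_percentile {
    using clock = std::chrono::steady_clock;
    std::vector<int64_t> _samples; // stand-in for the real histogram
    double _cached_pct = 0;
    int64_t _cached_value = 0;
    clock::time_point _computed_at{};
    bool _valid = false;
public:
    unsigned recalculations = 0; // exposed only to make the caching observable

    void add(int64_t latency) { _samples.push_back(latency); }

    // pct is in [0, 1]. Recompute only if nothing is cached yet, the
    // requested percentile changed, or the cached value is a second old.
    int64_t get(double pct, clock::time_point now = clock::now()) {
        if (!_valid || pct != _cached_pct
                || now - _computed_at >= std::chrono::seconds(1)) {
            _cached_value = compute(pct);
            _cached_pct = pct;
            _computed_at = now;
            _valid = true;
        }
        return _cached_value;
    }
private:
    int64_t compute(double pct) {
        ++recalculations;
        if (_samples.empty()) {
            return 0;
        }
        auto sorted = _samples;
        std::sort(sorted.begin(), sorted.end());
        auto idx = static_cast<std::size_t>(pct * (sorted.size() - 1));
        return sorted[idx];
    }
};
```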
Fixes #2757
Message-Id: <20170911131752.27369-3-gleb@scylladb.com>
This patchset reduces includes of sstables.hh, reducing compile time
by both reducing the amount of code compiled, and the amount of
needless recompiles caused by false dependencies. It does so by
replacing lw_shared_ptr<sstable>, which requires a complete class,
with a new custom type shared_sstable, which allows an incomplete
sstable class definition.
* https://github.com/avikivity/scylla deps2/v2.1
database: change truncate() to flush while compaction is disabled
database: make run_with_compaction_disabled() a non-template
database: add indirection to compaction_manager instance
database: remove dependency on compaction.hh and compaction_manager.hh
size_estimates_virtual_reader.hh: add missing include
system_keyspace: add missing include
main: add missing include
storage_service: add missing include
repair: add missing include
compaction.hh: add missing include and forward declaration
compaction_manager: add missing include
shared_index_lists.hh: add missing include
perf_fast_forward: add missing include
sstable_mutation_test: add missing include
sstables: extract version and format enum into a separate header file
database.hh: add missing forward declaration for
foreign_sstable_open_info
cql_test_env: add forward declaration
database: make column_family::disable_sstable_write() out-of-line
sstables: introduce make_sstable() as a shortcut for
make_lw_shared<sstable>
treewide: use shared_sstable, make_sstable in place of
lw_shared_ptr<sstable>
sstables: use support for lw_shared_ptr with incomplete type for
shared_sstable
sstables: reduce dependencies
streaming: remove unneeded includes
When a table is created, it doesn't contain any data, so we can mark the whole
data range as continuous in cache. This way reads will immediately hit, and
flushes will populate. If sstables are later attached, the attaching process
is supposed to invalidate affected ranges (and it does).
Fixes #2536.
Message-Id: <1505200269-4031-1-git-send-email-tgrabiec@scylladb.com>
In preparation for making run_with_compaction_disabled() a non-template,
we want to remove any non-copyable captures (so the function can be
a std::function, which requires copyability). Move the flush within
the compaction disabled region. This changes the behavior, but it shouldn't
matter.
Scylla already refuses to load counter sstables that do not have a Scylla
component. However, if this happens because of a 'nodetool refresh'
command, the existing protection will trigger after sstables have been
moved to the data directory. This is too late, so an additional check
is added when the upload directory is scanned.
Cache imposes requirements on how updates to the on-disk mutation source
are made:
1) each change to the on-disk mutation source must be followed
by cache synchronization reflecting that change
2) The two must be serialized with other synchronizations
3) must have strong failure guarantees (atomicity)
Because of that, sstable list update and cache synchronization must be
done under a lock, and cache synchronization cannot fail to synchronize.
Normally cache synchronization achieves the no-failure guarantee by
wiping the cache (which is noexcept) in case a failure is detected.
There are some setup steps, however, which cannot be skipped, e.g.
taking a lock followed by switching the cache to use the new snapshot.
That truly cannot fail. The lock inside cache synchronizers is
redundant, since the user needs to take it anyway around the combined
operation.
In order to make ensuring strong exception guarantees easier, and to
make the cache interface easier to use correctly, this patch moves
the control of the combined update into the cache. This is done by
having cache::update() et al accept a callback (external_updater)
which is supposed to perform modification of the underlying mutation
source when invoked.
This is in-line with the layering. Cache is layered on top of the
on-disk mutation source (it wraps it) and reading has to go through
cache. After the patch, modification also goes through cache. This way
more of cache's requirements can be confined to its implementation.
The failure semantics of update() and other synchronizers needed to
change due to strong exception guarantees. Now if it fails, it means
the update was not performed, neither to the cache nor to the
underlying mutation source.
The database::_cache_update_sem goes away, serialization is done
internally by the cache.
The external_updater needs to have strong exception guarantees. This
requirement is not new. It is however currently violated in some
places. This patch marks those callbacks as noexcept and leaves a
FIXME. Those should be fixed, but that's not in the scope of this
patch. Aborting is still better than corrupting the state.
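The shape of the new interface can be illustrated with a toy model. Everything here is an assumption made for the sketch: tiny_cache, its members, and the fail_preparation switch are hypothetical, a std::set stands in for the on-disk mutation source, and a mutex stands in for whatever serialization the cache uses internally.

```cpp
#include <functional>
#include <mutex>
#include <set>
#include <stdexcept>

// Hypothetical miniature of a cache layered over an on-disk source.
class tiny_cache {
    std::mutex _update_lock;    // serializes updates inside the cache
    std::set<int>& _underlying; // stand-in for the on-disk mutation source
    std::set<int> _cached;
public:
    explicit tiny_cache(std::set<int>& underlying)
        : _underlying(underlying), _cached(underlying) {}

    // All fallible preparation happens before external_updater runs, so a
    // failure leaves both the cache and the underlying source untouched.
    // external_updater itself is required not to throw.
    void update(std::function<void()> external_updater,
                bool fail_preparation = false) {
        std::lock_guard<std::mutex> guard(_update_lock);
        if (fail_preparation) { // simulates e.g. a failed allocation
            throw std::runtime_error("preparation failed");
        }
        external_updater();     // modifies the underlying source
        try {
            _cached = _underlying;  // synchronize the cache
        } catch (...) {
            _cached.clear();        // wiping cannot fail
        }
    }

    const std::set<int>& cached() const { return _cached; }
};
```

The caller no longer takes a lock around the combined operation; it only supplies the callback that changes the underlying source.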
Fixes #2754.
Also fixes the following test failure:
tests/row_cache_test.cc(949): fatal error: in "test_update_failure": critical check it->second.equal(*s, mopt->partition()) has failed
which started to trigger after commit 318423d50b. Thread stack
allocation may fail, in which case we did not do the necessary
invalidation.
Commit e3ad676433 missed a few places.
It is required to serialize sstable list update and cache synchronization
in order to preserve partition update isolation.
Fixes #2746.
This was part of "add gate for generic async operations to column family" but
somehow didn't make it into the final patch.
Add the missing piece.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20170830164205.4497-1-glauber@scylladb.com>
run_with_compaction_disabled(), which is called by truncate, has a
pretty large defer point in remove(). When the code gets to finally
execute, we can't guarantee that the column family will still be alive.
That is true in particular if we issued a drop table command following
truncate: by the time truncate gets to resume, the CF will be gone.
Before the column family is dropped, it will always call its stop()
method, which means we have an opportunity to do some waiting there. We
already wait for flushes and current compactions to end.
Traditionally, we have been solving similar problems by adding a gate
that will catch asynchronous operations and making sure that potentially
asynchronous operations will enter the gate before executing. Let's do
the same thing here. We will close() the gate during stop().
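The gate pattern can be sketched in plain, synchronous C++ as follows. This is only a model of the idea: simple_gate and gate_holder are hypothetical names, and a real gate's close() would wait asynchronously for in-flight operations instead of exposing a drained() check.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical, synchronous stand-in for a gate guarding a column family.
class simple_gate {
    std::size_t _in_flight = 0;
    bool _closed = false;
public:
    // Operations must enter before touching the guarded object.
    void enter() {
        if (_closed) {
            throw std::runtime_error("gate closed");
        }
        ++_in_flight;
    }
    void leave() { --_in_flight; }
    // Called from stop(): refuse new entries from now on.
    void close() { _closed = true; }
    // True once stop() may safely let the guarded object be destroyed.
    bool drained() const { return _closed && _in_flight == 0; }
};

// RAII helper so that every enter() is paired with a leave().
class gate_holder {
    simple_gate& _gate;
public:
    explicit gate_holder(simple_gate& g) : _gate(g) { _gate.enter(); }
    ~gate_holder() { _gate.leave(); }
    gate_holder(const gate_holder&) = delete;
    gate_holder& operator=(const gate_holder&) = delete;
};
```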
Fixes #2726
Signed-off-by: Glauber Costa <glauber@scylladb.com>
truncate can throw exceptions. If it does, cf->stop() will never be
called because it is contained in a .then clause instead of finally.
One of the things that truncate does - in a finally block of its own -
is initiate a final compaction. If it returns an exception nobody will
wait for that compaction to finish (since cf->stop() is the one doing
that) and we'll crash.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The number of keyspace and column family metrics reported is
proportional to the number of shards times the number of keyspaces/column
families.
This can cause a performance issue both on the reporting system and on
the collecting system.
This patch adds a configuration flag (set to false by default) to enable
or disable those metrics.
Fixes #2701
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <20170821113843.1036-1-amnon@scylladb.com>
Two reasons for this change:
1) every compaction should be multiplexed to the manager, which in turn
will decide when to schedule it. Improvements to the manager will
immediately benefit every existing compaction type.
2) active tasks metric will now track ongoing reshard jobs.
Fixes #2671.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170817224334.6402-1-raphaelsc@scylladb.com>
incremental_reader_selector assumes the partition_range it receives has a lower
bound, but it was seen in mutation_test that this is not so.
Fix by checking whether the bound exists or not.
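The kind of check involved can be illustrated as below. The types are assumptions for the sketch: token_range and starting_token are hypothetical, and the real partition_range interface is not reproduced here.

```cpp
#include <optional>

// Hypothetical range whose bounds may be absent.
struct token_range {
    std::optional<long> start; // empty means "no lower bound"
    std::optional<long> end;   // empty means "no upper bound"
};

// Pick the position to start from without assuming a lower bound exists.
long starting_token(const token_range& r, long minimum_token) {
    return r.start.value_or(minimum_token);
}
```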
Message-Id: <20170815095852.14149-1-avi@scylladb.com>
Exhausted readers can be fast forwarded, so we have to keep them
around. However, if the current reader is not fast forwardable, then
we can drop those readers and their buffers.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
incremental_reader_selector is a specialization of reader_selector for
the case when sstables have narrow and/or disjoint token ranges. To
exploit this it creates new readers on-demand when their sstable's
token range intersects with the current ring position.
A selection contains - in addition to the list of sstables - a next_token
which is a hint as to what is the next best token to call select() with.
This should be the smallest token such that at the next call to
select() the least number of new sstables will be returned, without
skipping any.
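One way to picture select() and next_token is the sketch below. It is an illustration under simplifying assumptions, not the real selector: tokens are plain integers, sstable_range, selection and incremental_selector are hypothetical types, and next_token is taken to be the first token of the earliest sstable not yet considered.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

struct sstable_range { int first; int last; }; // inclusive token range

struct selection {
    std::vector<sstable_range> sstables; // newly selected sstables
    std::optional<int> next_token;       // hint for the next select() call
};

class incremental_selector {
    std::vector<sstable_range> _pending; // sorted by first token
    std::size_t _next = 0;               // first sstable not yet considered
public:
    explicit incremental_selector(std::vector<sstable_range> sstables)
        : _pending(std::move(sstables)) {
        std::sort(_pending.begin(), _pending.end(),
                  [](const sstable_range& a, const sstable_range& b) {
                      return a.first < b.first;
                  });
    }

    // Return sstables whose range has been reached by pos, skipping those
    // that already ended before pos.
    selection select(int pos) {
        selection sel;
        while (_next < _pending.size() && _pending[_next].first <= pos) {
            if (_pending[_next].last >= pos) {
                sel.sstables.push_back(_pending[_next]);
            }
            ++_next;
        }
        if (_next < _pending.size()) {
            sel.next_token = _pending[_next].first;
        }
        return sel;
    }
};
```

Calling select() again with the returned next_token brings in the next sstable without skipping any, since no unselected sstable starts before it.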
In commit f38e4ff3f, we have separated streaming reads from normal reads
for the purpose of determining the maximum number of reads going on.
However, we'll now be totally unaware of how many reads will be
happening on behalf of streaming and that can be important information
when debugging issues.
This patch adds this metric so we don't fly blind.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <1501909973-32519-1-git-send-email-glauber@scylladb.com>
Streaming reads and normal reads share a semaphore, so if a bunch of
streaming reads use all available slots, no normal reads can proceed.
Fix by assigning streaming reads their own semaphore; they will compete
with normal reads once issued, and the I/O scheduler will determine the
winner.
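The effect of splitting the semaphore can be sketched with two independent slot counters. The names read_slots and read_concurrency are hypothetical, and the counters are simple non-blocking stand-ins for real semaphores.

```cpp
// Hypothetical non-blocking stand-in for a counting semaphore.
class read_slots {
    unsigned _free;
public:
    explicit read_slots(unsigned count) : _free(count) {}
    bool try_acquire() {
        if (_free == 0) {
            return false;
        }
        --_free;
        return true;
    }
    void release() { ++_free; }
};

// Each read class draws from its own pool, so exhausting one pool
// cannot starve the other.
struct read_concurrency {
    read_slots normal;
    read_slots streaming;

    read_slots& slots_for(bool is_streaming) {
        return is_streaming ? streaming : normal;
    }
};
```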
Fixes #2663.
Message-Id: <20170802153107.939-1-avi@scylladb.com>
If we fail a streaming read due to queue overload, we will fail the entire repair.
Remove the limit for streaming, and trust the caller (repair) to have bounded
concurrency.
Fixes #2659.
Message-Id: <20170802143448.28311-1-avi@scylladb.com>
"This series reduces that effect in two ways:
1. Remove the latency counters from the system keyspaces
2. Reduce the histogram size by limiting the maximum number of buckets and
stop the last bucket."
Fixes #2650.
* 'amnon/remove_cf_latency_v2' of github.com:cloudius-systems/seastar-dev:
database: remove latency from the system table
estimated histogram: return a smaller histogram
If we fail to flush an sstable, after creating the flush_reader, then
we will have released the flush permit when we retry the flush. Ensure
that when retrying, we re-acquire the flush permit.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>