Timeouts are a global property. However, for tables in keyspaces like
the system keyspace, we don't want to uphold that timeout--in fact, we
wan't no timeout there at all.
We already apply such configuration for requests waiting in the queued
sstable queue: system keyspace requests won't be removed. However, the
storage proxy will insert its own timeouts in those requests, causing
them to fail.
This patch changes the storage proxy read layer so that the timeout is
applied based on the column family configuration, which is in turn
inherited from the keyspace configuration. This matches our usual
way of passing db parameters down.
In terms of implementation, we can either move the timeout inside the
abstract read executor or keep it external. The former is a bit cleaner,
the the latter has the nice property that all executors generated will
share the exact same timeout point. In this patch, we chose the latter.
We are also careful to propagate the timeout information to the replica.
So even if we are talking about the local replica, when we add the
request to the concurrency queue, we will do it in accordance with the
timeout specified by the storage proxy layer.
After this patch, Scylla is able to start just fine with very low
timeouts--since read timeouts in the system keyspace are now ignored.
Fixes#2462
Implementation notes, and general comments about open discussion in 2462:
* Because we are not bypassing the timeout, just setting it high enough,
I consider the concerns about the batchlog moot: if we fail for any
other reason that will be propagated. Last case, because the timeout
is per-CF, we could do what we do for the dirty memory manager and
move the batchlog alone to use a different timeout setting.
* Storage proxy likes specifying its timeouts as a time_point, whereas
when we get low enough as to deal with the read_concurrency_config,
we are talking about deltas. So at some point we need to convert time_points
to durations. We do that in the database query functions.
v2:
- use per-request instead of per-table timeouts.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
This patch enables passing a timeout to the restricted_mutation_reader
through the read path interface -- using fill_buffer and friends. This
will serve as a basis for having per-timeout requests.
The config structure still has a timeout, but that is so far only used
to actually pass the value to the query interface. Once that starts
coming from the storage proxy layer (next patch) we will remove.
The query callers are patched so that we pass the timeout down. We patch
the callers in database.cc, but leave the streaming ones alone. That can
be safely done because the default for the query path is now no_timeout,
and that is what the streaming code wants. So there is no need to
complicate the interface to allow for passing a timeout that we intend
to disable.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
In the last patch, we enabled per-request timeouts, we enable timeouts
in fill_buffer. There are many places, though, in which we
fast_forward_to before we fill_buffer, so in order to make that
effective we need to propagate the timeouts to fast_forward_to as well.
In the same way as fill_buffer, we make the argument optional wherever
possible in the high level callers, making them mandatory in the
implementations.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
At the moment, various different subsystems use their different
ideas of what a timeout_clock is. This makes it a bit harder to pass
timeouts between them because although most are actually a lowres_clock,
that is not guaranteed to be the case. As a matter of fact, the timeout
for restricted reads is expressed as nanoseconds, which is not a valid
duration in the lowres_clock.
As a first step towards fixing this, we'll consolidate all of the
existing timeout_clocks in one, now called db::timeout_clock. Other
things that tend to be expressed in terms of that clock--like the fact
that the maximum time_point means no timeout and a semaphore that
wait()s with that resolution are also moved to the common header.
In the upcoming patch we will fix the restricted reader timeouts to
be expressed in terms of the new timeout_clock.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
"This patchset implements the compaction controller for I/O shares. The
goal is to automatic adjust compaction shares based on a
strategy-specific backlog. A higher backlog will translate into higher
shares.
As compaction progresses, that reduces the backlog. As new data is
flushed, that increases the backlog. The goal of the controler is to
keep the backlog constant at a certain rate, so that we don't go neither
too fast or too slow.
Tracking reads and writes:
==========================
Tracking of reads and writes happen through the read_monitor and the
write_monitor. The write monitor is an existing interface that has the
purpose of releasing the write permit at particular points of the write
process. We enhance it so to get a reference to an instance that tracks
the current offset inside the sstables::file_writer. This way the
backlog tracker can always know for sure what's the offset of the
current write.
A similar thing is done for reads. The data_consumer already tracks the
position of the current read, and we isolate that into a structure to
which we can get a reference. A read_monitor allows us to connect the
compaction to that reference.
Lifetime management:
====================
In general, tracking objects will be owned by their callers and passed
down as references. The compaction object will own the read monitors and
the compaction write monitors and the memtable flush write monitor will
be kept alive in a do_with block around the flush itself.
The backlog_{write,read}_progress_manager needs to be kept alive until
the SSTable is no longer in progress. For writes, that means until we
are able to add the SSTable charges in full, and for reads (compaction)
that means until we are able to remove the charges in full.
It is important to do that to avoid spikes in the graph. If we remove
the progress managers in a different operation than updating the SSTable
list we will be left in a temporary state where charges appear or
disappear abruptly, to be fixed when the final
add_sstable/remove_sstable happens. So we want those things to happen
together.
The compaction_backlog_tracker is kept alive until the strategy changes,
for example, through ALTER TABLE. Current charges are transferred to the
new strategy's compaction_backlog_tracker object when we do that. If the
type of strategy changes, the current read charges are forgotten. We can
do that because those running compaction will not really contribute to
decrease the backlog of the new compaction strategy.
Tranfer of Charges
==================
When ALTER TABLE happens, we need to transfer ongoing writes to the new
backlog manager. Ongoing reads will still be tracked by the
backlog_manager that originated them.
The rationale for that is that reads still belong to the current
compaction, with the strategy that generated them. But new Tables being
written will add to the backlog of the new strategy.
Note that ALTER TABLE operations not necessarily cause a change of
Strategy. We can be using the same strategy but just changing
properties. If that is the case, we expect no discontinuity in the
backlog graph (tested).
Resharding
==========
Resharding compactions are more complex than normal compactions because
the SSTables are created in one shard and later sent to another shard.
It is better, then, to track resharding compactions separately and let
them have their own backlog tracker, which will insert backlog in
proportion to the amount of data to be resharded.
Memtable Flush I/O Controller
=============================
With the current infrastructure it becomes trivial to add a new
controller, for either I/O or CPU. This patchset then adds an I/O
controller for memtable flushes, using the same backlog algorithm that
we already used for CPU."
* 'compaction-controller-io-v5' of github.com:glommer/scylla:
database: add a controller for I/O on memtable flushes.
document the compaction controller
compaction: adjust shares for compactions
backlog_controllers: implement generic I/O controller
factor out some of the controller code
io shares: multiply all shares by 10
compaction_strategy: implement backlog manager for the SizeTiered strategy
infrastructure for backlog estimator for compaction work.
sstables: notify about end of data component write
sstables: add read_monitor_generator
sstables: add read_monitor
sstables: enhance data consumer with a position tracker
sstables: enhance the file_writer with an offset tracker
sstables: pass references instead of pointers for write_monitor
compaction: control destruction of readers
The algorithm and principle of operation is the same as the CPU
controller. It is, however, always enabled and we will operate on
I/O shares.
I/O-bound workloads are expected to hit the maximum once virtual
dirty fills up and stay there while the load is steady.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Compactions can be a heavy disk user and the I/O scheduler can always
guarantee that it uses its fair share of disk.
Such fair share can, however, be a lot more than what compaction indeed
need. This patch draws on the controllers infrastructure to adjust the
I/O shares that the compaction class will get so that compaction
bandwidth is dynamically adjusted.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The control algorithm we are using for memtables have proven itself
quite successful. We will very likely use the same for other processes,
like compactions.
Make the code a bit more generic, so that a new controller has to only
set the desired parameters
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The issue is triggered by compaction of sstables of level higher than 0.
The problem happens when interval map of partitioned sstable set stores
intervals such as follow:
[-9223362900961284625 : -3695961740249769322 ]
(-3695961740249769322 : -3695961103022958562 ]
When selector is called for first interval above, the exclusive lower
bound of the second interval is returned as next token, but the
inclusivess info is not returned.
So reader_selector was returning that there *were* new readers when
the current token was -3695961740249769322 because it was stored in
selector position field as inclusive, but it's actually exclusive.
This false positive was leading to infinite recursion in combined
reader because sstable set's incremental selector itself knew that
there were actually *no* new readers, and therefore *no* progress
could be made.
Fix is to use ring_position in reader_selector, such that
inclusiveness would be respected.
So reader_selector::has_new_readers() won't return false positive
under the conditions described above.
Fixes#2908.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This patch adds infrastucture in various points in the system to allow
us to determine the amount of work present as backlog from compactions.
What needs to be done can be explained in three major pieces:
1) Add hooks in the points where sstables are added or inserted to a
column family (or more precisely, to a compaction_strategy object).
2) Add hooks in reads and write monitors that allows a compaction
backlog estimator (tracker) to become aware of bytes that are
partially written and compacted away.
3) Add a per-column family class (compaction_backlog_tracker) that
can be used to track work that is done and relevant to compactions
(like the two above), and a compaction manager to provide a
system-wide backlog based on the response of the individual trackers.
The definition of how much backlog one has is strategy-specific. The
Null strategy is easy, as it never really has any backlog, and so is the
major strategy - since what it really matters is the backlog of the
underlying compaction strategy.
Although backlogs are strategy-specific, they should be "compatible", in
the sense that if a particular strategy has more work to do, it should
yield a higher number than its counterparts.
All the others are presented in this patch as unimplemented: they will
always advertise a mild backlog that should yield a constant
CPU-utilization if used alone.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We need to notify the monitor that the offset tracker that we are using is
about to be destroyed and will no longer be valid.
While we could modify the file_writer interface so that we could capture
the offset_tracker and take ownership of it - guaranteeing it is alive
until we reach the existing on_write_completed(), this feels like a
layer violation.
It is also potentially useful in general to offer the monitor callers
with knowledge that writing the data portion is done.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Passing the read monitor down to the sstable readers is tricky. The
point of interest - like compaction - are usually very far from the
interfaces that register the monitor, like read_rows. Between the two,
there is usually a mutation_reader, which is and ought to be totally
unaware of the read monitor: technically, a mutation_reader may not even
know it is backed by sstables.
The solution is to create a read_monitor_generator, that can be passed
from the upper layers, like compaction, to the layers that are actually
making the decision of which sstables to create readers for.
Note that we don't need an equivalent piece of infrastructure for
writes, because writes don't happen through hidden layers and have all
the information they need to initialize their monitors.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Callers, like the memtable flusher or compactions will be able to find
out the current amount of bytes written at any time.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
This came from Avi's review on the read_monitors. He suggests we
wouldn't keep shared pointers, and would instead have the caller
ensuring lifetime. That makes sense, but having the writer interface
using shared_ptr and the read interface using references would lead to
an inconsistent interface.
For the sake of consistency we will change the write monitor to take
references before we do that. From database.cc's perspective, we could
now keep the monitors in a do_with() block, but we will keep the
shared_ptrs to manage their lifetime in anticipation of upcoming patches
in this series, where we'll have to pass them somewhere else.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
"When we get two range tombstones with the same lower bound from
different data sources (e.g. two sstable), which need to be combined
into a single stream, they need to be de-overlapped, because each
mutation fragment in the stream must have a different position. If we
have range tombstones [1, 10) and [1, 20), the result of that
de-overlapping will be [1, 10) and [10, 20]. The problem is that if
the stream corresponds to a clustering slice with upper bound greater
than 1, but lower than 10, the second range tombstone would appear as
being out of the query range. This is currently violating assumptions
made by some consumers, like cache populator.
One effect of this may be that a reader will miss rows which are in
the range (1, 10) (after the start of the first range tombstone, and
before the start of the second range tombstone), if the second range
tombstone happens to be the last fragment which was read for a
discontinuous range in cache and we stopped reading at that point
because of a full buffer and cache was evicted before we resumed
reading, so we went to reading from the sstable reader again. There
could be more cases in which this violation may resurface.
There is also a related bug in mutation_fragment_merger. If the reader
is in forwarding mode, and the current range is [1, 5], the reader
would still emit range_tombstone([10, 20]). If that reader is later
fast forwarded to another range, say [6, 8], it may produce fragments
with smaller positions which were emitted before, violating
monotonicity of fragment positions in the stream.
A similar bug was also present in partition_snapshot_flat_reader.
Possible solutions:
1) relax the assumption (in cache) that streams contain only relevant
range tombstones, and only require that they contain at least all
relevant tombstones
2) allow subsequent range tombstones in a stream to share the same
starting position (position is weakly monotonic), then we don't need
to de-overlap the tombstones in readers.
3) teach combining readers about query restrictions so that they can drop
fragments which fall outside the range
4) force leaf readers to trim all range tombstones to query restrictions
This patch implements solution no 2. It simplifies combining readers,
which don't need to accumulate and trim range tombstones.
I don't like solution 3, because it makes combining readers more
complicated, slower, and harder to properly construct (currently
combining readers don't need to know restrictions of the leaf
streams).
Solution 4 is confined to implementations of leaf readers, but also
has disadvantage of making those more complicated and slower.
There is only one consumer which needs the tombstones with monotonic positions, and
that is the sstable writer.
Fixes #3093."
* tag 'tgrabiec/fix-out-of-range-tombstones-v1' of github.com:scylladb/seastar-dev:
tests: row_cache: Introduce test for concurrent read, population and eviction
tests: sstables: Add test for writing combined stream with range tombstones at same position
tests: memtable: Test that combined mutation source is a mutation source
tests: memtable: Test that memtable with many versions is a mutation source
tests: mutation_source: Add test for stream invariants with overlapping tombstones
tests: mutation_reader: Test fast forwarding of combined reader with overlapping range tombstones
tests: mutation_reader: Test combined reader slicing on random mutations
tests: mutation_source_test: Extract random_mutation_generator::make_partition_keys()
mutation_fragment: Introduce range()
clustering_interval_set: Introduce overlaps()
clustering_interval_set: Extract private make_interval()
mutation_reader: Allow range tombstones with same position in the fragment stream
sstables: Handle consecutive range_tombstone fragments with same position
tests: streamed_mutation_assertions: Merge range_tombstones with the same position in produces_range_tombstone()
streamed_mutation: Introduce peek()
mutation_fragment: Extract mergeable_with()
mutation_reader: Move definition of combining mutation reader to source file
mutation_reader: Use make_combined_reader() to create combined reader
"Currently, compaction manager will serialize compaction of same size tier
(or weight) if they belong to the same column family. However, it fails to
do so if the compaction jobs belong to different column families.
That can lead to an ungodly amount of running compaction which gets worse
the higher the number of shards and active column families. The problem
is that it may affect overall system performance due to excessive resource
usage. It's easy to trigger it during bootstraping after loading node with
new sstables or repairing, or if lots of cfs are being actively written."
Fixes#1295.
* 'similar_sized_compaction_serialization_v4' of github.com:raphaelsc/scylla:
sstables: remove column_family from compaction_weight_registration
compaction_manager: serialize compaction of same size tier for different cfs
sstables: introduces deregister() and weight() to compaction_weight_registration
sstables: move compaction_weight_registration to its own header
sstables: improve compact_sstables() interface
Currently, compaction manager will serialize compaction of same size tier
(or weight) if they belong to the same column family. However, it fails to
do so if the compaction jobs belong to different column families.
That can lead to an ungodly amount of running compaction which gets worse
the higher the number of shards and active column families. The problem
is that it may affect overall system performance due to excessive resource
usage. It's easy to trigger it during bootstraping after loading node with
new sstables or repairing, or if lots of cfs are being actively written.
That being said, compaction jobs of same size tier are now serialized
on a given shard, such that maximum number of compaction (system wise)
is now:
(SHARDS) * (SIZE TIERS)
instead of:
(SHARDS) * (COLUMN FAMILIES) * (SIZE TIERS)
We'll work hard to release a size tier (weight) for a column family
waiting on it as fast as possible, given that we wouldn't like to
underutilize resources available for compaction. We want one starting
after the other. Compaction for a column family that cannot run now
because the size tier is taken, will be postponed. There's a worker
that will be sleeping on a condition variable that will be signalled
whenever a compaction completes. FIFO ordering is used on postponed
list for fairness.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Motivation is that a new field in the descriptor will be forwarded
to compaction procedure without extending parameter list even more.
Also beautifies the interface, making it concise and easier to
play with.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Before flat mutation readers sstable::read_row() returned a
future<streamed_mutation>. That required a helper reader that would wait
for the streamed_mutations from all relevant sstables to be created and
then construct a mutation merger.
With flat mutation readers sstable::read_row_flat() returns a
flat_mutation_reader (no futures) so that the code can be simplified by
collecting all the relevant readers and creating a combined reader
without suspension points.
The unfortunate disadvantage of the flat_mutation_reader-based approach
is the fact that combined reader now needlessly compares the partition
keys even though we know that we read only a single partition, but
optimising that is out of scope of this patch.
All users of the filtering reader need only the decorated key of a
partition, but currently the predicate is given a reference to
streamed_mutations which are obsolete now.
We have had an issue recently where failed SSTable writes left the
generated SSTables dangling in a potentially invalid state. If the write
had, for instance, started and generated tmp TOCs but not finished,
those files would be left for dead.
We had fixed this in commit b7e1575ad4,
but streaming memtables still have the same isse.
Note that we can't fix this in the common function
write_memtable_to_sstable because different flushers have different
retry policies.
Fixes#3062
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20171213011741.8156-1-glauber@scylladb.com>
"Didn't affect any release. Regression introduced in 301358e.
Fixes#3041"
* 'resharding_fix_v4' of github.com:raphaelsc/scylla:
tests: add sstable resharding test to test.py
tests: fix sstable resharding test
sstables: Fix resharding by not filtering out mutation that belongs to other shard
db: introduce make_range_sstable_reader
rename make_range_sstable_reader to make_local_shard_sstable_reader
db: extract sstable reader creation from incremental_reader_selector
db: reuse make_range_sstable_reader in make_sstable_reader
introduce reader variant that will allow its caller to read a range
in a given table without any filter applied.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Tomek says:
"I think that the least surprising behavior for a function named like this
is to read the sstables unfiltered (it just reads them), and the filtering
should be indicated specially in the name or by accepting a parameter."
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
For now only the interface is converted, behind the scenes the previous
implementation remains, it's output is simply converted by
flat_mutation_reader_from_mutation_reader. The implementation will be
converted in the following patches.
The wrapper is no longer needed because
read_range_rows returns ::mutation_reader instead of
sstables::mutation_reader and the reader returned from
it keeps the pointer to shared_sstable that was used to
create the reader.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>