This reverts commit 0e9b75d406.
commitlog_test fails with this:
Running 14 test cases...
ERROR 2016-11-28 20:48:00,565 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:00,578 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:10,591 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:20,601 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
tests/commitlog_test.cc(203): fatal error in "test_commitlog_discard_completed_segments": critical check dn <= nn failed
ERROR 2016-11-28 20:48:20,645 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:20,837 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
WARN 2016-11-28 20:48:20,838 [shard 0] commitlog - Exception in segment reservation: std::system_error (error system:2, No such file or directory)
ERROR 2016-11-28 20:48:20,952 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:31,064 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:31,083 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:31,098 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:31,111 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
ERROR 2016-11-28 20:48:31,113 [shard 0] commitlog - Segment reserve is full! Ignoring and trying to continue, but shouldn't happen
WARN 2016-11-28 20:48:31,116 [shard 0] commitlog - Could not allocate 16388 k bytes output buffer (16388 k required)
*** 1 failure detected in test suite "tests/commitlog_test.cc"
WARN 2016-11-28 20:48:31,117 [shard 0] commitlog - Exception in segment reservation: std::system_error (error system:2, No such file or directory)
New sstables are loaded and added in parallel, meaning that scylla can
potentially return stale data if a new sstable containing a tombstone
wasn't loaded yet. Compaction should also not run until all new sstables
are added for similar reasons.
Fix is about separating blocking and non-blocking steps to allow
atomic add of multiple new sstables.
Fixes#1368.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <14283b8a4a69127071d1fabef320a93c91817ec2.1480356073.git.raphaelsc@scylladb.com>
When requests hit the commitlog, each of them will be assigned a replay
position, which we expect to be ordered. If reorders happen, the request
will be discarded and re-applied. Although this is supposed to be rare,
it does increase our latencies, specially when big requests are
involved. Processing big requests is expensive and if we have to do it
twice that adds to the cost.
The commitlog is supposed to issue replay positions in order, and it
coudl be that the code that adds them to the memtables will reorder
them. However, there is one instance in which the commitlog will not
keep its side of the bargain.
That happens when the reserve is exhausted, and we are allocating a
segment directly at the same time the reserve is being replenished. The
following sequence of events with its deferring points will ilustrate
it:
on_timer:
return this->allocate_segment(false). // defer here // then([this](sseg_ptr s) {
At this point, the segment id is already allocated.
new_segment():
if (_reserve_segments.empty()) {
[ ... ]
return allocate_segment(true).then ...
At this point, we have a new segment that has an id that is higher than
the previous id allocated.
Then we resume the execution from the deferring point in on_timer():
i = _reserve_segments.emplace(i, std::move(s));
The next time we need to allocate a segment, we'll pick it from the
reserve. But the segment in the reserve has an id that is lower than the
id that we have already used.
Reorders are bad, but this one is particularly bad: because the reorder
happens with the segment id side of the replay position, that means that
every request that falls into that segment will have to be reinserted.
This bug can be a bit tricky to reproduce. To make it more common, we
can artificially add a sleep() fiber after the allocate_segment(false)
in on_timer(). If we do that, we'll see a sea of reinsertions going on
in the logs (if dblog is set to debug).
Applying this patch (keeping the sleep) will make them all disappear.
We do this by rewriting the reserve logic, so that the segments always
come from the reserve. If we draw from a single pool all the time, there
is no chance of reordering happening. To make that more amenable, we'll
have the reserve filler always running in the background and take it out
of the timer code.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <2606b97df39997bcf3af84a23adf17e094ffb0b8.1480107174.git.glauber@scylladb.com>
"We currently write the size_estimates system table for every schema
on a periodic basis, currently set to 5 minutes, which can interfere
with an ongoing workload.
This patchset virtualizes it such that queries are intercepted and we
calculate the results on the fly, only for the ranges the caller is interested in.
Fixes#1616"
* 'virtual-estimates/v4' of github.com:duarten/scylla:
size_estimates_virtual_reader: Add unit test
db: Delete size_estimates_recorder
size_estimates: Add virtual reader
column_family: Add support for virtual readers
storage_service: get_local_tokens() returns a future
nonwrapping_range: Add slice() function
range: Find a sequence's lower and upper bounds
system_keyspace: Build mutations for size estimates
size_estimates: Store the token range as bytes
range_estimates: Add schema
murmur3_partitioner: Convert maximum_token to sstring
When we finish writing a memtable, we revert the dirty memory charges
immediately. When we do that, dirty memory will grow back to what it
was, and soon (we hope) will go down again when we release the requests
for real.
During that time, we may not accept new requests. Sealing can take a
long time, specially in the face of Linux issues like the ones we have
seen in the past. It also will take proportionally more time if the
SSTables end up being small, which is a possibility in some scenarios.
This patch changes the dirty_memory_manager so that the charges won't be
reverted right after we finish the flush. Rather, we will hold on to it,
and revert it right before we update the cache. We don't need to do it
for all classes of memtable writes, because after we finish flushing,
flush_one() will destroy the hashed element anyway.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <2d5a8f6ca57d5036f4850ac163557bca59b8063d.1480004384.git.glauber@scylladb.com>
GCC 5.3.1 was unable to convert bound to optional<bound>.
sstables/sstables.cc:2494:123: error: no matching function for call to
‘nonwrapping_range<dht::ring_position>::nonwrapping_range(dht::ring_position,
dht::ring_position)’
(dtr.right.exclusive ? dht::ring_position::starting_at :
dht::ring_position::ending_at)(std::move(t2)));
In file included from ./dht/i_partitioner.hh:52:0,
from ./query-request.hh:28,
from ./clustering_key_filter.hh:27,
from sstables/sstables.hh:35,
from sstables/sstables.cc:38:
./range.hh:441:14: note: candidate: nonwrapping_range<T>::nonwrapping_range(
const wrapping_range<U>&) [with T = dht::ring_position]
explicit nonwrapping_range(const wrapping_range<T>& r)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <95bbf984cd73a61739c8da99cf6cd5e94f1d1457.1479954360.git.raphaelsc@scylladb.com>
Since not all distributions have a version of LZ4 with
LZ4_compress_default(), we use it conditionally.
This is specially important beginning with version 1.7.3 of LZ4,
which deprecates the LZ4_compress() function in favour of
LZ4_compress_default() and thus prevents Scylla from compiling
due to the deprecated warning.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20161124092339.23017-1-duarte@scylladb.com>
In Thrift, SliceRange defines a count that limits the number of cells
to return from that row (in CQL3 terms, it limits the number of rows
in that partition). While this limit is honored in the engine, the
Thrift layer also applies the same limit, which, while redundant in
most cases, is used to support the get_paged_slice verb.
Currently, the limit is not being reset per Thrift row (CQL3
partition), so in practice, instead of limiting the cells in a row,
we're limiting the rows we return as well. This patch fixes that by
ensuring the limit applies only within a row/partition.
Fixes#1882
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20161123220001.15496-1-duarte@scylladb.com>
With the default value of 12, a node's range is partitioned into
4096 * smp::count sub-ranges which are queried sequentually for a range
scan. If the number of rows in the table is smaller than the required
result size, we will query all of them. This can take so long that we
time out.
A better fix is to query multiple sub-ranges in parallel and merge them,
but for that we need to resurrect the non-sequential merger.
"Clusters with a large number of nodes, or a low number of vnodes, and a
high number of shards, or a combination, suffer from an aliasing problem:
both vnodes and intra-node sharding consider the most significant bits
to select the owning node and owning shard respectively. Since the same
bits are used for both, a low number of vnodes leads to some shards
being overcommitted relative to others.
This series fixes the problem by sharding on bits 0:47 of the token
(murmur3 partitioner only), leaving the most significant 12 bits for
vnodes. Simulation shows that this value provides reasonable sharding
for 100-node, 30-shard clusters.
In order to prevent re-sharding sstables on each boot, token ranges for
the range are stored in a new sub-component of the sstable Statistics
component. With the default 12 ignored bits we have 4096 token ranges
for non-Level-compacted SSTables, which takes some space but is still
reasonable.
Fixes #1277."
Sharding on the most significant token bits aliases with the vnode mechanism,
which also uses the most significant bits; this requires a huge number of
vnodes to achieve good sharding.
This patch teaches the murmur3 partitioner to ignore the most significant
N bits when calculating a token's hard, so we use token bits which still have
some entropy. In effect, with changes the token range layout from
shard 0
shard 1
...
shard S-1
to
shard 0
shard 1
...
shard S-1
shard 0
shard 1
...
shard S-1
...
shard 0
shard 1
...
shard S-1
Where the number of repetitions of the block is 2^(ignored msb bits).
For compatibility, the default is zero ignored bits, matching the pre-patch
state, until we wire things up.
Instead of calculating the owning shard from the sstable's partition
key range, delegate to the new sstable method for getting owning shard
infomation. This insulates us from changes in the sharding algorithm.
When we load an sstable, we don't know beforehand which shards it belongs
to; we don't want to open it until we do. Add a method that allows us
to read just the sharding data, without opening anything else.
Add a metadata component that describes token ranges that are spanned by
this sstable. With the current sharding algorithm, where each shard owns
a single token range, the first/last partition key is sufficient to
describing sharding information, but for multi-range algorithms, this
is not sufficient.
We have a semaphore controlling the amount of background work generated
by the memtable flush process. However, because we are not moving it
inside the memtable post-flush continuation, the units are being
released when we star the flush and not when we finish it.
That's not the intended behavior and that can cause flushes to
accumulate.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <b7dc1866ed3473b9b1862c433d59c5ebd8575dbc.1479839600.git.glauber@scylladb.com>
Instead of calculating the offset for each statistic component manually,
use a loop to iterate over all components, accumulating the offset as we
go along.
Introduce a new function that reuses the file_writer code to compute
the serialized size of an sstable object, by serializing it into memory
and discarding the result.
write() doesn't need to change its input; so change it to const.
The only snag is that describe_type() isn't and can't be made const-correct,
so cheat when it is called and const_cast the input.
This helps in writing a generic serialized_size() that is const correct,
in the next patch.
Recently we have changed our shutdown strategy to wait for the
_request_controller semaphore to make sure no other allocations are
in-flight. That was done to fix an actual issue.
The problem is that this wasn't done early enough. We acquire the
semaphore after we have already marked ourselves as _shutdown and
released the timer.
That means that if there is an allocation in flight that needs to use a
new segment, it will never finish - and we'll therefore neve acquire
the semaphore.
Fix it by acquiring it first. At this point the allocations will all be
done and gone, and then we can shutdown everything else.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <5c2a2f20e3832b6ea37d6541897519a9307294ed.1479765782.git.glauber@scylladb.com>
storage_proxy has an optimization where it tries to query multiple token
ranges concurrently to satisfy very large requests (an optimization which is
likely meaningless when paging is enabled, as it always should be). However,
the rows-per-range code severely underestimates the number of rows per range,
resulting in a large number of "read-ahead" internal queries being performed,
the results of most of which are discarded.
Fix by disabling this code. We should likely remove it completely, but let's
start with a band-aid that can be backported.
Fixes#1863.
Message-Id: <20161120165741.2488-1-avi@scylladb.com>
We current pass a region group to the memtable, but after so many recent
changes, that is a bit too low level. This patch changes that so we pass
a memtable list instead.
Doing that also has a couple of advantages. Mainly, during flush we must
get to a memtable to a memtable_list. Currently we do that by going to
the memtable to a column family through the schema, and from there to
the memtable_list.
That, however, involves calling virtual functions in a derived class,
because a single column family could have both streaming and normal
memtables. If we pass a memtable_list to the memtable, we can keep
pointer, and when needed get the memtable_list directly.
Not only that gets rid of the inheritance for aesthetic reasons, but
that inheritance is not even correct anymore. Since the introduction of
the big streaming memtables, we now have a plethora of lists per column
family and this transversal is totally wrong. We haven't noticed before
because we were flushing the memtables based on their individual sizes,
but it has been wrong all along for edge cases in which we would have to
resort to size-based flush. This could be the case, for instance, with
various plan_ids in flight at the same time.
At this point, there is no more reason to keep the derived classes for
the dirty_memory_manager. I'm only keeping them around to reduce
clutter, although they are useful for the specialized constructors and
to communicate to the reader exactly what they are. But those can be
removed in a follow up patch if we want.
The old memtable constructor signature is kept around for the benefit of
two tests in memtable_tests which have their own flush logic. In the
future we could do something like we do for the SSTable tests, and have
a proxy class that is friends with the memtable class. That too, is left
for the future.
Fixes#1870
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <811ec9e8e123dc5fc26eadbda82b0bae906657a9.1479743266.git.glauber@scylladb.com>
Now that access to the size_estimates system is virtualized, we no
longer need the recorder.
Fixes#1616
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch add a virtual mutation_reader so that queries
to the size_estimates system table are handled by the engine
without needing to perform any IO.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Virtual readers allow queries to selected tables, usually system
tables, to be answered by the engine. This is useful for tables which
aren't written by users and whose contents can be calculated on
demand.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch changes the get_local_tokens() function in storage_service
to return a future instead of requiring running under a seastar::thread.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch add the slice() function to nonwrapping range, which uses
its bounds to slice an input sequence.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch extracts a pair of functions from mutation_partition to
calculate the lower and upper bounds of a sequence from a
nonwrapping_range.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds a function to system_keyspace responsible for creating
a mutation to a partition of the size_estimates system table from a
set of range_estimates.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch changes the range_estimates struct so that the tokens are
represented as utf8 encoded bytes. This will make future patches
require less conversions.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>