split_range_to_single_shard() returns a vector of size 4096, with
each element (a partition_range) of size 100. The total of 400k can
cause defragmentation if memory is fragmented.
Fix by using a deque.
Fixes#2707.
Message-Id: <20170819141017.28287-1-avi@scylladb.com>
After this patch and the following patches to use the new
range_streamder interface, all the following cluster operations:
- bootstrap
- rebuild
- decommission
- removenode
will use the same code to do the streaming.
The range_streamer is now extended to support both fetch from and push
to peer node. Another big change is now the range_streamer will stream
less ranges at a time, so less data, per stream_plan and range_streamer
will remember which ranges are failed to stream and can retry later.
The retry policy is very simple at the moment it retries at most 5 times
and sleep 1 minutes, 1.5^2 minutes, 1.5^3 minutes ....
Later, we can introduce api for user to decide when to stop retrying and
the retry interval.
The benefits:
- All the cluster operation shares the same code to stream
- We can know the operation progress, e.g., we can know total number of
ranges need to be streamed and number of ranges finished in
bootstrap, decommission and etc.
- All the cluster operation can survive peer node down during the
operation which usually takes long time to complete, e.g., when adding
a new node, currently if any of the existing node which streams data to
the new node had issue sending data to the new node, the whole bootstrap
process will fail. After this patch, we can fix the problematic node
and restart it, the joining node will retry streaming from the node
again.
- We can fail streaming early and timeout early and retry less because
all the operations use stream can survive failure of a single
stream_plan. It is not that important for now to have to make a single
stream_plan successful. Note, another user of streaming, repair, is now
using small stream_plan as well and can rerun the repair for the
failed ranges too.
This is one step closer to supporting the resumable add/remove node
opeartions.
ring_position_comparator has overloads for comparing ring_positions as
well as sstables::key_view. In the case of the latter it needs to
compute the token of the key. However, the sstable layer could cache
some tokens so let's allow the comparator callers to provide it
directly.
dht::token needs to be stored as a pointer now and not a reference so
that validity of old pointers is not impacted by in-place object
construction which would occur in the copy-assignment operator.
[1] says that old pointers can be used to access the new object only
if the type "does not contain any non-static data member whose type is
const-qualified or a reference type".
[1] http://en.cppreference.com/w/cpp/language/lifetime#Storage_reuse
- introcduced "seastarx.hh" header, which does a "using namespace seastar";
- 'net' namespace conflicts with seastar::net, renamed to 'netw'.
- 'transport' namespace conflicts with seastar::transport, renamed to
cql_transport.
- "logger" global variables now conflict with logger global type, renamed
to xlogger.
- other minor changes
The "exponentiality" is not carried over from one range to another, because
we expect one or two ranges (two ranges result from a wrapped around thrift
token range).
Like the regular sharder, the exponential sharder divides a range into
subranges owned by individual ranges. Unlike the regular sharder, it
generates ever-increasing subranges, spanning more and more shards, and
eventually returns several subranges per shard. To avoid using
exponential cpu and memory, subranges belonging to a single shard are merged,
and a flag is set to indicate the subranges are not ordered wrt. each other.
Right now, next_token_for_shard() only allows iterating linearly in shard
order. Add the ability to select a specific shard to skip to (in case we're
only interested in a single shard), and to select larger ranges (so that
exponential increases are not implemented by iteration).
The partitioners now depend on smp::count to be initialized correctly,
but smp::count isn't available at static initialization time.
The scylla executable isn't affected because it calls set_global_partitioner()
after smp::count has been initialized.
Fix by deferring initialization to the first global_partitioner() call.
It was the observation that ring_position_range_sharder doesn't support
wrapping ranges that started the nonwrapping_range madness, but that
class still has some leftover wrapping ranges. Close the circle by
removing them.
Message-Id: <20161123153113.8944-1-avi@scylladb.com>
When murmur3_partitioner_ignore_msb_bits = 12 (which we'd like to be the
default), a scan range can be split into a large number of subranges, each
going to a separate shard. With the current implementation, subranges were
queried sequentially, resulting in very long latency when the table was empty
or nearly empty.
Switch to an exponential retry mechanism, where the number of subranges
queried doubles each time, dropping the latency from O(number of subranges)
to O(log(number of subranges)).
If, during an iteration of a retry, we read at most one range
from each shard, then partial results are merged by concatentation. This
optimizes for the dense(r) case, where few partial results are required.
If, during an iteration of a retry, we need more than one range per
shard, then we collapse all of a shard's ranges into just one range,
and merge partial results by sorting decorated keys. This reduces
the number of sstable read creations we need to make, and optimizes for
the sparse table case, where we need many partial results, most of which
are empty.
We don't merge subranges that come from different partition ranges,
because those need to be sorted in request order, not decorated key order.
[tgrabiec: trivial conflicts]
Message-Id: <20161220170532.25173-1-avi@scylladb.com>
std::vector<dht::partition_range> and std::vector<dht::token_range> are
used in a lot of places, introduce dht::partition_range_vector and
dht::token_range_vector as the alias.
nonwrapping_range<ring_position> and nonwrapping_range<token> are used
in many places. Let's make an alias for them to make it less verbose.
Also there is a query::partition_range in query-request.hh which is the alias of
nonwrapping_range<ring_position>. query::partition_range is used in
places not related to query at all. Let's unify the usage project wide.
We are going to require these functions to return sorted and disjoint
ranges. They already do so (provided that the input ranges are sorted
and disjoint), but if the guarantee is not explicitly stated it may
disappear some day.
This patch fixes a typo in i_partitioner::tri_compare() where we were
using std::max instead of std::min, thus avoiding accessing random
memory and getting random results.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20161211165043.17816-1-duarte@scylladb.com>
"We currently write the size_estimates system table for every schema
on a periodic basis, currently set to 5 minutes, which can interfere
with an ongoing workload.
This patchset virtualizes it such that queries are intercepted and we
calculate the results on the fly, only for the ranges the caller is interested in.
Fixes#1616"
* 'virtual-estimates/v4' of github.com:duarten/scylla:
size_estimates_virtual_reader: Add unit test
db: Delete size_estimates_recorder
size_estimates: Add virtual reader
column_family: Add support for virtual readers
storage_service: get_local_tokens() returns a future
nonwrapping_range: Add slice() function
range: Find a sequence's lower and upper bounds
system_keyspace: Build mutations for size estimates
size_estimates: Store the token range as bytes
range_estimates: Add schema
murmur3_partitioner: Convert maximum_token to sstring