It was the observation that ring_position_range_sharder doesn't support
wrapping ranges that started the nonwrapping_range madness, but that
class still has some leftover wrapping ranges. Close the circle by
removing them.
Message-Id: <20161123153113.8944-1-avi@scylladb.com>
When murmur3_partitioner_ignore_msb_bits = 12 (which we'd like to be the
default), a scan range can be split into a large number of subranges, each
going to a separate shard. With the current implementation, subranges were
queried sequentially, resulting in very long latency when the table was empty
or nearly empty.
Switch to an exponential retry mechanism, where the number of subranges
queried doubles each time, dropping the latency from O(number of subranges)
to O(log(number of subranges)).
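
The doubling scheme can be illustrated with a minimal sketch. This is not the code from the patch: scan_with_doubling, scan_result and the query_subrange callback are hypothetical names, rows are modelled as plain ints, and the subranges of a round are queried in a loop here, whereas a real implementation would issue them concurrently.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical result type: collected rows plus the number of rounds used.
struct scan_result {
    std::vector<int> rows;
    std::size_t rounds;
};

// Query subranges 0..n_subranges-1 in ring order, doubling the number of
// subranges queried in each round until enough rows are collected.
scan_result scan_with_doubling(
        std::size_t n_subranges, std::size_t row_limit,
        const std::function<std::vector<int>(std::size_t)>& query_subrange) {
    scan_result res{{}, 0};
    std::size_t next = 0;   // first subrange not yet queried
    std::size_t batch = 1;  // subranges to query in this round
    while (next < n_subranges && res.rows.size() < row_limit) {
        const std::size_t end = std::min(n_subranges, next + batch);
        // A real implementation would issue these queries in parallel.
        for (; next < end; ++next) {
            std::vector<int> part = query_subrange(next);
            res.rows.insert(res.rows.end(), part.begin(), part.end());
        }
        ++res.rounds;
        batch *= 2;
    }
    // Trim any rows read beyond the limit.
    if (res.rows.size() > row_limit) {
        res.rows.resize(row_limit);
    }
    return res;
}
```

In this sketch an empty table finishes after a logarithmic number of rounds, while a dense table finishes after the first round.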
If, during an iteration of a retry, we read at most one range
from each shard, then partial results are merged by concatenation. This
optimizes for the dense(r) case, where few partial results are required.
If, during an iteration of a retry, we need more than one range per
shard, then we collapse all of a shard's ranges into just one range,
and merge partial results by sorting decorated keys. This reduces
the number of sstable readers we need to create, and optimizes for
the sparse table case, where we need many partial results, most of which
are empty.
We don't merge subranges that come from different partition ranges,
because those need to be sorted in request order, not decorated key order.
[tgrabiec: trivial conflicts]
Message-Id: <20161220170532.25173-1-avi@scylladb.com>
std::vector<dht::partition_range> and std::vector<dht::token_range> are
used in a lot of places. Introduce dht::partition_range_vector and
dht::token_range_vector as aliases for them.
nonwrapping_range<ring_position> and nonwrapping_range<token> are used
in many places. Let's make an alias for them to make it less verbose.
Also, there is a query::partition_range in query-request.hh, which is an alias of
nonwrapping_range<ring_position>. query::partition_range is used in
places not related to queries at all. Let's unify the usage project-wide.
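
As a rough sketch, the range aliases and the vector aliases built on top of them could look like the following. The token, ring_position and nonwrapping_range definitions below are empty stand-ins so that the snippet compiles on its own; only the alias names come from the commit messages.

```cpp
#include <type_traits>
#include <vector>

// Empty stand-ins for the real types; only the aliasing pattern matters here.
struct token {};
struct ring_position {};
template <typename T>
class nonwrapping_range {};

namespace dht {

// Aliases for the two commonly used range instantiations.
using partition_range = nonwrapping_range<ring_position>;
using token_range = nonwrapping_range<token>;

// Aliases for vectors of those ranges.
using partition_range_vector = std::vector<partition_range>;
using token_range_vector = std::vector<token_range>;

} // namespace dht
```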
We are going to require these functions to return sorted and disjoint
ranges. They already do so (provided that the input ranges are sorted
and disjoint), but if the guarantee is not explicitly stated it may
disappear some day.
This patch fixes a typo in i_partitioner::tri_compare(), where we were
using std::max instead of std::min. With the fix, we no longer access
random memory and get random results.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20161211165043.17816-1-duarte@scylladb.com>
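
The class of bug fixed above can be shown with a generic three-way byte comparison. This is an illustrative sketch, not the body of i_partitioner::tri_compare(); tri_compare_bytes is a hypothetical name. The common prefix must be bounded by the shorter input, since using std::max here would read past the end of the shorter buffer.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <string_view>

// Three-way comparison of two byte sequences: returns -1, 0 or 1.
int tri_compare_bytes(std::string_view a, std::string_view b) {
    // Compare only the common prefix; std::min keeps the read in bounds.
    const std::size_t n = std::min(a.size(), b.size());
    const int c = n ? std::memcmp(a.data(), b.data(), n) : 0;
    if (c != 0) {
        return c < 0 ? -1 : 1;
    }
    // Equal prefixes: the shorter sequence sorts first.
    if (a.size() == b.size()) {
        return 0;
    }
    return a.size() < b.size() ? -1 : 1;
}
```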
"We currently write the size_estimates system table for every schema
on a periodic basis, currently set to 5 minutes, which can interfere
with an ongoing workload.
This patchset virtualizes it such that queries are intercepted and we
calculate the results on the fly, only for the ranges the caller is interested in.
Fixes #1616"
* 'virtual-estimates/v4' of github.com:duarten/scylla:
size_estimates_virtual_reader: Add unit test
db: Delete size_estimates_recorder
size_estimates: Add virtual reader
column_family: Add support for virtual readers
storage_service: get_local_tokens() returns a future
nonwrapping_range: Add slice() function
range: Find a sequence's lower and upper bounds
system_keyspace: Build mutations for size estimates
size_estimates: Store the token range as bytes
range_estimates: Add schema
murmur3_partitioner: Convert maximum_token to sstring
Sharding on the most significant token bits aliases with the vnode mechanism,
which also uses the most significant bits; this requires a huge number of
vnodes to achieve good sharding.
This patch teaches the murmur3 partitioner to ignore the most significant
N bits when calculating a token's shard, so we use token bits which still have
some entropy. In effect, this changes the token range layout from
shard 0
shard 1
...
shard S-1
to
shard 0
shard 1
...
shard S-1
shard 0
shard 1
...
shard S-1
...
shard 0
shard 1
...
shard S-1
Where the number of repetitions of the block is 2^(ignored msb bits).
For compatibility, the default is zero ignored bits, matching the pre-patch
state, until we wire things up.
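
A minimal sketch of the idea, under stated assumptions: tokens are modelled as unsigned 64-bit integers, shard_of_sketch is a hypothetical name, and only the top 32 bits of the shifted token are used so that the arithmetic fits in 64 bits. The partitioner's real computation may differ in precision and rounding.

```cpp
#include <cstdint>

// Map a token to a shard while ignoring its `ignore_msb` most significant bits.
// Assumes shards >= 1 and ignore_msb < 64.
unsigned shard_of_sketch(uint64_t token, unsigned shards, unsigned ignore_msb) {
    // Shift the ignored bits out, so the next bits become the most significant.
    token <<= ignore_msb;
    // Treat the top 32 remaining bits as a fraction in [0, 1) and scale it
    // by the shard count: shard = floor(fraction * shards).
    return static_cast<unsigned>(((token >> 32) * shards) >> 32);
}
```

With 4 shards and one ignored bit, the shard sequence 0..3 in this sketch repeats twice across the token space, matching the 2^(ignored msb bits) repetitions described above.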
This patch ensures we can convert the maximum_token to an sstring.
For Cassandra, the minimum and maximum tokens have the same
representation. So, we use the string representation of the
minimum_token for the maximum_token.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Building on the single-range sharder, add a sharder for vectors of
partition ranges. This helps with wrapped ranges, which are translated
into a vector containing two ranges.
Divides a ring_position range into a sequence of shard/range pairs. This
allows sequential iteration over shards in ring order.
The current multi-partition query executes on all shards in parallel, but
this is very wasteful, as most of the data will be thrown away if it is not
included in the page. With the generator, we can switch to sequential
execution.
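
The division into shard/range pairs can be sketched as follows. The model is deliberately simplified and all names are hypothetical: a range is a half-open interval of unsigned 64-bit tokens rather than a ring_position range, and the sharding function uses only the top 32 bits of the token with no ignored bits.

```cpp
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// Half-open token interval [start, end); a simplification of a ring_position range.
struct token_span {
    uint64_t start;
    uint64_t end;
};

// Hypothetical sharding function: top 32 token bits scaled by the shard count.
unsigned span_shard_of(uint64_t t, unsigned shards) {
    return static_cast<unsigned>(((t >> 32) * shards) >> 32);
}

// First token of the shard following t's shard, or nullopt for the last shard.
// Assumes a small shard count, so (s << 32) + shards - 1 cannot overflow.
std::optional<uint64_t> span_next_shard_start(uint64_t t, unsigned shards) {
    const uint64_t s = static_cast<uint64_t>(span_shard_of(t, shards)) + 1;
    if (s >= shards) {
        return std::nullopt;
    }
    const uint64_t hi = ((s << 32) + shards - 1) / shards;  // ceil(s * 2^32 / shards)
    return hi << 32;
}

// Split r into (shard, subrange) pairs, emitted in ring order.
std::vector<std::pair<unsigned, token_span>> split_by_shard(token_span r, unsigned shards) {
    std::vector<std::pair<unsigned, token_span>> out;
    uint64_t pos = r.start;
    while (pos < r.end) {
        const unsigned shard = span_shard_of(pos, shards);
        const std::optional<uint64_t> next = span_next_shard_start(pos, shards);
        // The subrange ends at the next shard boundary or at the end of r.
        const uint64_t stop = (next && *next < r.end) ? *next : r.end;
        out.emplace_back(shard, token_span{pos, stop});
        pos = stop;
    }
    return out;
}
```

A caller can then walk the returned pairs in order and stop as soon as the page is full, instead of querying every shard.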
When performing a range query, we want to iterate over shards, running the
query on each shard in order until the query range is exhausted or we have
the right number of rows.
To be able to do this, introduce token_for_next_shard(), which allows us
to determine the boundary between shards.
It is a sort-of inverse to shard_of(), in that
shard_of(token_for_next_shard(t)) == shard_of(t) + 1
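
The relationship between the two functions can be checked on a simplified model. The sketch below assumes unsigned 64-bit tokens, no ignored bits, and a sharding function that uses only the top 32 token bits; sketch_shard_of and sketch_token_for_next_shard are hypothetical stand-ins, and the last shard is handled by returning no token instead of wrapping around.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical sharding function: top 32 token bits scaled by the shard count.
unsigned sketch_shard_of(uint64_t t, unsigned shards) {
    return static_cast<unsigned>(((t >> 32) * shards) >> 32);
}

// Smallest token owned by the shard after t's shard, or nullopt if t is
// already in the last shard. Assumes a small shard count (no overflow below).
std::optional<uint64_t> sketch_token_for_next_shard(uint64_t t, unsigned shards) {
    const uint64_t s = static_cast<uint64_t>(sketch_shard_of(t, shards)) + 1;
    if (s >= shards) {
        return std::nullopt;
    }
    // Smallest 32-bit prefix hi with (hi * shards) >> 32 == s.
    const uint64_t hi = ((s << 32) + shards - 1) / shards;  // ceil(s * 2^32 / shards)
    return hi << 32;
}
```

The rounding in the inverse has to match the rounding in the forward function, which is why the boundary is computed with a ceiling division.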
Wrapping ranges are a pain, so we are moving wrap handling to the edges.
Since cql can't generate wrapping ranges, this means thrift and the ring
maintenance code; also range->ring transformations need to merge the first
and last ranges.
Message-Id: <1478105905-31613-1-git-send-email-avi@scylladb.com>
This patch adds the from_bytes() function to the i_partitioner class,
whose purpose is to parse a particular token and explicitly handle the
case when the minimum token is specified.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Cassandra 1.x clusters often use RandomPartitioner. Supporting
RandomPartitioner will allow easier migration to Scylla.
Tests are added to make sure scylla generates the same token as
Cassandra does for the same partition key.
Fixes #1438
Message-Id: <3bc8b7f06fad16d59aaaa96e2827198ce74214c6.1469166766.git.asias@scylladb.com>
Enable the --partitioner option so that the user can choose a partitioner other
than the default Murmur3Partitioner. Currently, only Murmur3Partitioner
and ByteOrderedPartitioner are supported. When an unsupported partitioner
is specified, an error will be propagated to the user.
In order to support ByteOrderedPartitioner, we need to implement the
missing describe_ownership and midpoint functions in the
byte_ordered_partitioner class.
As a starter, this patch uses a simple node token distance based method
to calculate ownership. C* uses a complicated key samples based method.
We can switch to what C* does later.
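
A token-distance-based ownership calculation can be sketched as below. This is a simplified model, not the byte_ordered_partitioner code: tokens are unsigned 64-bit integers instead of byte strings, they are assumed to be sorted and distinct, and ownership_by_distance is a hypothetical name. Each token owns the part of the ring between its predecessor and itself.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Returns, for each token, the fraction of the ring between the previous
// token (exclusive) and this token (inclusive).
std::map<uint64_t, double> ownership_by_distance(const std::vector<uint64_t>& sorted_tokens) {
    std::map<uint64_t, double> result;
    const std::size_t n = sorted_tokens.size();
    if (n == 0) {
        return result;
    }
    if (n == 1) {
        // A single token owns the whole ring.
        result[sorted_tokens[0]] = 1.0;
        return result;
    }
    const double ring_size = 18446744073709551616.0;  // 2^64
    for (std::size_t i = 0; i < n; ++i) {
        const uint64_t prev = sorted_tokens[(i + n - 1) % n];
        // Unsigned subtraction wraps, which handles the first token's
        // distance from the last token across the end of the ring.
        const uint64_t dist = sorted_tokens[i] - prev;
        result[sorted_tokens[i]] = static_cast<double>(dist) / ring_size;
    }
    return result;
}
```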
Tests are added to tests/partitioner_test.cc.
Fixes #1378
We currently log as follows:
May 9 00:09:13 node3.nl scylla[2546]: [shard 0] storage_service - This
node was decommissioned and will not rejoin the ring unless
cassandra.override_decommission=true has been set,or all existing data
is removed and the node is bootstrapped again
However, the user should use
override_decommission:true
instead of
cassandra.override_decommission:true
in scylla.yaml where the cassandra prefix is stripped.
Fixes #1240
Message-Id: <b0c9424c6922431ad049ab49391771e07ca6fbde.1467079190.git.asias@scylladb.com>
- int connections_per_host
Scylla does not create connections per stream_session, instead it uses
rpc, thus connections_per_host is not relevant to scylla.
- bool keep_ss_table_level
- int repaired_at
Scylla does not stream sstable files. They are not relevant to scylla.
messaging_service will automatically use the private ip address to connect to a
peer node if possible. There is no need for an upper layer like
streaming to worry about it. Dropping it simplifies things a bit.
Since commit 16596385ee, long_token() is already checking
t.is_minimum(), so the comment which explains why it does not (for
performance) is no longer relevant. And we no longer need to check
t._kind before calling long_token (the check we do here is the same
as is_minimum).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>