It was the observation that ring_position_range_sharder doesn't support
wrapping ranges that started the nonwrapping_range madness, but that
class still has some leftover wrapping ranges. Close the circle by
removing them.
Message-Id: <20161123153113.8944-1-avi@scylladb.com>
(cherry picked from commit 8686a59ea5)
When murmur3_partitioner_ignore_msb_bits = 12 (which we'd like to be the
default), a scan range can be split into a large number of subranges, each
going to a separate shard. With the current implementation, subranges were
queried sequentially, resulting in very long latency when the table was empty
or nearly empty.
Switch to an exponential retry mechanism, where the number of subranges
queried doubles each time, dropping the latency from O(number of subranges)
to O(log(number of subranges)).
If, during an iteration of a retry, we read at most one range
from each shard, then partial results are merged by concatentation. This
optimizes for the dense(r) case, where few partial results are required.
If, during an iteration of a retry, we need more than one range per
shard, then we collapse all of a shard's ranges into just one range,
and merge partial results by sorting decorated keys. This reduces
the number of sstable read creations we need to make, and optimizes for
the sparse table case, where we need many partial results, most of which
are empty.
We don't merge subranges that come from different partition ranges,
because those need to be sorted in request order, not decorated key order.
[tgrabiec: trivial conflicts]
Message-Id: <20161220170532.25173-1-avi@scylladb.com>
(cherry picked from commit a1cafed370)
We are going to require these functions to return sorted and disjoint
ranges. They already do so (provided that the input ranges are sorted
and disjoint), but if the guarantee is not explicitly stated it may
disappear some day.
This patch fixes a typo in i_partitioner::tri_compare() where we were
using std::max instead of std::min, thus avoiding accessing random
memory and getting random results.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20161211165043.17816-1-duarte@scylladb.com>
"We currently write the size_estimates system table for every schema
on a periodic basis, currently set to 5 minutes, which can interfere
with an ongoing workload.
This patchset virtualizes it such that queries are intercepted and we
calculate the results on the fly, only for the ranges the caller is interested in.
Fixes#1616"
* 'virtual-estimates/v4' of github.com:duarten/scylla:
size_estimates_virtual_reader: Add unit test
db: Delete size_estimates_recorder
size_estimates: Add virtual reader
column_family: Add support for virtual readers
storage_service: get_local_tokens() returns a future
nonwrapping_range: Add slice() function
range: Find a sequence's lower and upper bounds
system_keyspace: Build mutations for size estimates
size_estimates: Store the token range as bytes
range_estimates: Add schema
murmur3_partitioner: Convert maximum_token to sstring
Sharding on the most significant token bits aliases with the vnode mechanism,
which also uses the most significant bits; this requires a huge number of
vnodes to achieve good sharding.
This patch teaches the murmur3 partitioner to ignore the most significant
N bits when calculating a token's hard, so we use token bits which still have
some entropy. In effect, with changes the token range layout from
shard 0
shard 1
...
shard S-1
to
shard 0
shard 1
...
shard S-1
shard 0
shard 1
...
shard S-1
...
shard 0
shard 1
...
shard S-1
Where the number of repetitions of the block is 2^(ignored msb bits).
For compatibility, the default is zero ignored bits, matching the pre-patch
state, until we wire things up.
This patch ensures we can convert the maximum_token to an sstring.
For Cassandra, the minimum and maximum tokens have the same
representation. So, we use the string representation of the
maximum_token for the maximum_token.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Building on the single-range sharder, add a sharder for vectors of
partition ranges. This helps with wrapped ranges, which are translated
into a vector containing two shards.
Divides a ring_position range into a sequence of shard/range pairs. This
allows sequential iteration over shards in ring order.
The current multi-partition query executes on all shards in parallel, but
this is very wasteful, as most of the data will be thrown away if it is not
included in the page. With the generator, we can switch to sequential
execution.
When performing a range query, we want to iterate over shards, running the
query on each shard in order until the query range is exhausted or we have
the right number of rows.
To be able to do this, introduce token_for_next_shard(), which allows us
to determine the boundary between shards.
It is a sort-of inverse to shard_of(), in that
shard_of(token_for_next_range(t)) == shard_of(t) + 1
Wrapping ranges are a pain, so we are moving wrap handling to the edges.
Since cql can't generate wrapping ranges, this means thrift and the ring
maintenance code; also range->ring transformations need to merge the first
and last ranges.
Message-Id: <1478105905-31613-1-git-send-email-avi@scylladb.com>
This patch adds the from_bytes() function to the i_partitioner class,
whose purpose is parse a particular token and explicitly handle the
case when the minimum token is specified.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Cassandra 1.x clusters often use RandomPartitioner. Supporting
RandomPartitioner will allow easier migration to Scylla
Tests are added to make sure scylla generates the same token as
Cassandra does for the same partition key.
Fixes#1438
Message-Id: <3bc8b7f06fad16d59aaaa96e2827198ce74214c6.1469166766.git.asias@scylladb.com>
Enable --partitioner option so that user can choose partitioner other
than the default Murmur3Partitioner. Currently, only Murmur3Partitioner
and ByteOrderedPartitioner are supported. When non-supported partitioner
is specifed, error will be propogated to user.
In order to support ByteOrderedPartitioner, we need to implement the
missing describe_ownership and midpoint function in
byte_ordered_partitioner class.
As a starter, this path uses a simple node token distance based method
to calculate ownership. C* uses a complicated key samples based method.
We can switch to what C* does later.
Tests are added to tests/partitioner_test.cc.
Fixes#1378
We currently log as follow:
May 9 00:09:13 node3.nl scylla[2546]: [shard 0] storage_service - This
node was decommissioned and will not rejoin the ring unless
cassandra.override_decommission=true has been set,or all existing data
is removed and the node is bootstrapped again
Howerver, user should use
override_decommission:true
instead of
cassandra.override_decommission:true
in scylla.yaml where the cassandra prefix is stripped.
Fixes#1240
Message-Id: <b0c9424c6922431ad049ab49391771e07ca6fbde.1467079190.git.asias@scylladb.com>
- int connections_per_host
Scylla does not create connections per stream_session, instead it uses
rpc, thus connections_per_host is not relevant to scylla.
- bool keep_ss_table_level
- int repaired_at
Scylla does not stream sstable files. They are not relevant to scylla.
messaging_service will use private ip address automatically to connect a
peer node if possible. There is no need for the upper level like
streaming to worry about it. Drop it simplifies things a bit.
Since commit 16596385ee, long_token() is already checking
t.is_minimum(), so the comment which explains why it does not (for
performance) is no longer relevant. And we no longer need to check
t._kind before calling long_token (the check we do here is the same
as is_minimum).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The midpoint() algorithm to find a token between two tokens doesn't
work correctly in case of wraparound. The code tried to handle this
case, but did it wrong. So this patch fixes the midpoint() algorithm,
and adds clearer comments about why the fixed algorithm is correct.
This patch also modifies two midpoint() tests in partitioner_test,
which were incorrect - they verified that midpoint() returns some expected
values, but expected values were wrong!
We also add to the test a more fundemental test of midpoint() correctness,
which doesn't check the midpoint against a known value (which is easy to
get wrong, like indeed happened); Rather we simply check that the midpoint
is really inside the range (according to the token ordering operator).
This simple test failed with the old implementation of midpoint() and
passes with the new one.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Found by debug build
==10190==ERROR: AddressSanitizer: new-delete-type-mismatch on 0x602000084430 in thread T0:
object passed to delete has wrong type:
size of the allocated type: 16 bytes;
size of the deallocated type: 8 bytes.
#0 0x7fe244add512 in operator delete(void*, unsigned long) (/lib64/libasan.so.2+0x9a512)
#1 0x3c674fe in std::default_delete<dht::range_streamer::i_source_filter>::operator()(dht::range_streamer::i_source_filter*)
const /usr/include/c++/5.1.1/bits/unique_ptr.h:76
#2 0x3c60584 in std::unique_ptr<dht::range_streamer::i_source_filter, std::default_delete<dht::range_streamer::i_source_filter> >::~unique_ptr()
/usr/include/c++/5.1.1/bits/unique_ptr.h:236
#3 0x3c7ac22 in void __gnu_cxx::new_allocator<std::unique_ptr<dht::range_streamer::i_source_filter,
std::default_delete<dht::range_streamer::i_source_filter> > >::destroy<std::unique_ptr<dht::range_streamer::i_source_filter,
std::default_delete<dht::range_streamer::i_source_filter> > >(std::unique_ptr<dht::range_streamer::i_source_filter,
std::default_delete<dht::range_streamer::i_source_filter> >*) /usr/include/c++/5.1.1/ext/new_allocator.h:124
...
std::set_difference requires the container to be sorted which is not
true here, use remove_if.
Do not use assert, use throw instead so that we can recover from this
error.
The storage server uses the token_range in origin to return inforamtion
about the ring.
This import the structures. The functionality in origin is redundant in
this case and was not imported.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>