Commit Graph

166 Commits

Author SHA1 Message Date
Avi Kivity
af118ab52b murmur3_partitioner: fix build on clang
Don't know what the root cause is, but the fix is harmless.
2017-04-17 23:03:15 +03:00
Avi Kivity
c05f60387b i_partitioner: remove unused function
Found by clang.
2017-04-17 23:03:15 +03:00
Avi Kivity
a496ec7f5b byte_ordered_partitioner: fix bad operator precedence
Found by clang.
2017-04-17 23:03:15 +03:00
Tomasz Grabiec
d4b6e430ed dht: Introduce ring_position_view 2017-03-28 18:10:39 +02:00
Tomasz Grabiec
55a7cceef5 dht: Move comparison logic from ring_position::tri_compare() to ring_position_comparator
It will soon define common ordering for many objects, not just
ring_position.
2017-03-28 18:10:39 +02:00
Tomasz Grabiec
65a8920b25 dht: Make min/max tokens capturable by reference
So that they can be later used in views.
2017-03-28 18:10:39 +02:00
Avi Kivity
54b8acdd9f dht: add hashing and comparison helpers to dht::decorated_key
An std::hash specialization, and an equality comparator.
2017-01-20 11:24:14 +02:00
Avi Kivity
141048e0e5 dht: improve token hash function
For a small token, we can just return it, since it already is a hash.
We hash large tokens using murmur3, which is supposedly a good hash.
2017-01-20 11:24:14 +02:00
Avi Kivity
8686a59ea5 dht: use nonwrapping_ranges in ring_position_range_sharder
It was the observation that ring_position_range_sharder doesn't support
wrapping ranges that started the nonwrapping_range madness, but that
class still has some leftover wrapping ranges.  Close the circle by
removing them.
Message-Id: <20161123153113.8944-1-avi@scylladb.com>
2016-12-22 14:40:30 +01:00
Avi Kivity
a1cafed370 storage_proxy: handle range scans of sparsely populated tables
When murmur3_partitioner_ignore_msb_bits = 12 (which we'd like to be the
default), a scan range can be split into a large number of subranges, each
going to a separate shard.  With the current implementation, subranges were
queried sequentially, resulting in very long latency when the table was empty
or nearly empty.

Switch to an exponential retry mechanism, where the number of subranges
queried doubles each time, dropping the latency from O(number of subranges)
to O(log(number of subranges)).

If, during an iteration of a retry, we read at most one range
from each shard, then partial results are merged by concatenation.  This
optimizes for the dense(r) case, where few partial results are required.

If, during an iteration of a retry, we need more than one range per
shard, then we collapse all of a shard's ranges into just one range,
and merge partial results by sorting decorated keys.  This reduces
the number of sstable read creations we need to make, and optimizes for
the sparse table case, where we need many partial results, most of which
are empty.

We don't merge subranges that come from different partition ranges,
because those need to be sorted in request order, not decorated key order.

[tgrabiec: trivial conflicts]

Message-Id: <20161220170532.25173-1-avi@scylladb.com>
2016-12-20 18:32:29 +01:00
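The doubling strategy described in a1cafed370 can be illustrated with a minimal sketch. It only counts how many rounds a scan needs. The names `scan_with_doubling`, `scan_result`, and `rows_per_subrange` are hypothetical and do not come from the Scylla source. The sketch sums subranges sequentially, whereas the commit implies that all subranges of one round are queried together.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type: rows returned and number of query rounds used.
struct scan_result {
    std::size_t rows;
    std::size_t rounds;
};

// Assumption: rows_per_subrange[i] is the number of rows the i-th subrange
// would return. Each round covers twice as many subranges as the previous one.
scan_result scan_with_doubling(const std::vector<std::size_t>& rows_per_subrange,
                               std::size_t row_limit) {
    assert(row_limit > 0);
    std::size_t next = 0;    // first subrange not yet queried
    std::size_t batch = 1;   // subranges to query in the current round
    std::size_t rows = 0;
    std::size_t rounds = 0;
    while (next < rows_per_subrange.size() && rows < row_limit) {
        const std::size_t end = std::min(next + batch, rows_per_subrange.size());
        for (; next < end; ++next) {
            rows += rows_per_subrange[next];
        }
        ++rounds;
        batch *= 2;          // exponential growth of the batch size
    }
    return {std::min(rows, row_limit), rounds};
}
```

With 1000 empty subranges the loop finishes after 10 rounds (1 + 2 + ... + 512 covers 1023 subranges) instead of 1000 sequential queries, while a dense table that fills the page from its first subrange needs a single round.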
Asias He
937f28d2f1 Convert to use dht::partition_range_vector and dht::token_range_vector 2016-12-19 14:08:50 +08:00
Asias He
7a446986fa dht: Introduce dht::partition_range_vector and dht::token_range_vector
std::vector<dht::partition_range> and std::vector<dht::token_range> are
used in a lot of places, introduce dht::partition_range_vector and
dht::token_range_vector as the alias.
2016-12-19 08:09:28 +08:00
Asias He
85034c1b57 Convert to use dht::partition_range 2016-12-19 08:04:30 +08:00
Asias He
d1178fa299 Convert to use dht::token_range 2016-12-19 08:04:29 +08:00
Asias He
1f06eedb58 dht: Rename token_range to token_range_endpoints
It is a helper class used in storage_service only. Rename it so we can
use it for the real dht::token_range.
2016-12-19 08:04:29 +08:00
Asias He
264b6ee69e dht: Introduce dht::token_range and dht::partition_range
nonwrapping_range<ring_position> and nonwrapping_range<token> are used
in many places. Let's make an alias for them to make it less verbose.

Also there is a query::partition_range in query-request.hh which is the alias of
nonwrapping_range<ring_position>. query::partition_range is used in
places not related to query at all. Let's unify the usage project wide.
2016-12-19 08:04:29 +08:00
Paweł Dziepak
b86a826baf dht: describe split_range[s]_to_shards() guarantees
We are going to require these functions to return sorted and disjoint
ranges. They already do so (provided that the input ranges are sorted
and disjoint), but if the guarantee is not explicitly stated it may
disappear some day.
2016-12-15 13:07:32 +00:00
Asias He
463cc4fbde dht: Introduce split_ranges_to_shards
Split ranges into a map of shard ranges with the ring_position_range_sharder
helper.
2016-12-12 09:04:21 +08:00
Asias He
044c4ff44c dht: Introduce split_range_to_shards
Split a range into a map of shard ranges with the ring_position_range_sharder
helper.
2016-12-12 09:04:21 +08:00
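As a rough illustration of what splitting a range into a per-shard map involves (044c4ff44c), here is a minimal sketch. It assumes unsigned 64-bit tokens, half-open spans, and a shard layout with no ignored most-significant bits. `token_span`, `owner_shard`, and `split_span_to_shards` are hypothetical names. The real helper works on ring_position ranges through ring_position_range_sharder, which is not modelled here.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// unsigned __int128 is a GCC/Clang extension, used for the 128-bit product.
using u128 = unsigned __int128;
using token_span = std::pair<uint64_t, uint64_t>;  // half-open: [first, second)

// Assumed layout: the token space is cut into `shards` equal contiguous slices.
unsigned owner_shard(uint64_t token, unsigned shards) {
    return static_cast<unsigned>((u128(token) * shards) >> 64);
}

// Cut `r` at every shard boundary and group the pieces by owning shard.
std::map<unsigned, std::vector<token_span>>
split_span_to_shards(token_span r, unsigned shards) {
    assert(shards > 0);
    std::map<unsigned, std::vector<token_span>> out;
    uint64_t pos = r.first;
    while (pos < r.second) {
        const unsigned s = owner_shard(pos, shards);
        // First token owned by shard s + 1 (equals 2^64 for the last shard).
        const u128 boundary = ((u128(s + 1) << 64) + shards - 1) / shards;
        const uint64_t stop =
            boundary < r.second ? static_cast<uint64_t>(boundary) : r.second;
        out[s].push_back({pos, stop});
        pos = stop;
    }
    return out;
}
```

Because the pieces are produced in token order, each shard's vector is sorted and disjoint whenever the input span is, which is the property b86a826baf later states explicitly.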
Duarte Nunes
ada2f1092e dht: Make i_partitioner::tri_compare pure virtual
This patch makes the i_partitioner::tri_compare() function pure
virtual as it is overridden by all partitioners.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20161211172037.16496-1-duarte@scylladb.com>
2016-12-11 19:29:37 +02:00
Duarte Nunes
bb66b051ed dht: Make i_partitioner::tri_compare memory safe
This patch fixes a typo in i_partitioner::tri_compare() where we were
using std::max instead of std::min. The fix avoids accessing random
memory and getting random results.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20161211165043.17816-1-duarte@scylladb.com>
2016-12-11 18:58:10 +02:00
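The class of bug fixed in bb66b051ed can be shown with a small three-way comparison of byte strings. This is an illustrative sketch, not the code from i_partitioner. `tri_compare_bytes` is a hypothetical name. The point is that only the common prefix, `std::min` of the two lengths, may be compared byte by byte. Using `std::max` would read past the end of the shorter buffer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string_view>

// Returns -1, 0 or 1, ordering byte strings lexicographically.
int tri_compare_bytes(std::string_view a, std::string_view b) {
    // Only the common prefix may be read from both buffers.
    const std::size_t n = std::min(a.size(), b.size());
    assert(n <= a.size() && n <= b.size());  // the invariant std::max would break
    const int r = n ? std::memcmp(a.data(), b.data(), n) : 0;
    if (r != 0) {
        return r < 0 ? -1 : 1;
    }
    // Equal prefixes: the shorter string sorts first.
    if (a.size() == b.size()) {
        return 0;
    }
    return a.size() < b.size() ? -1 : 1;
}
```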
Avi Kivity
28857e42e7 Merge "Virtualize size_estimates system table" from Duarte
"We currently write the size_estimates system table for every schema
on a periodic basis, currently set to 5 minutes, which can interfere
with an ongoing workload.

This patchset virtualizes it such that queries are intercepted and we
calculate the results on the fly, only for the ranges the caller is interested in.

Fixes #1616"

* 'virtual-estimates/v4' of github.com:duarten/scylla:
  size_estimates_virtual_reader: Add unit test
  db: Delete size_estimates_recorder
  size_estimates: Add virtual reader
  column_family: Add support for virtual readers
  storage_service: get_local_tokens() returns a future
  nonwrapping_range: Add slice() function
  range: Find a sequence's lower and upper bounds
  system_keyspace: Build mutations for size estimates
  size_estimates: Store the token range as bytes
  range_estimates: Add schema
  murmur3_partitioner: Convert maximum_token to sstring
2016-11-28 10:12:59 +02:00
Avi Kivity
07d5a20bae Wire up sharding ignore msb parameter to configuration
We might have used a fancy map<sstring, any> to pass the parameters, but
that's overkill for now.
2016-11-22 22:40:47 +02:00
Avi Kivity
8b1d689de8 partitioner: add ignore_msb parameters to byte ordered and random partitioners
Ignored; doesn't make sense on byte ordered, and random is deprecated.
2016-11-22 21:56:42 +02:00
Avi Kivity
af16c0fac4 murmur3_partitioner: shard on the middle token bits, not most significant bits
Sharding on the most significant token bits aliases with the vnode mechanism,
which also uses the most significant bits; this requires a huge number of
vnodes to achieve good sharding.

This patch teaches the murmur3 partitioner to ignore the most significant
N bits when calculating a token's shard, so we use token bits which still have
some entropy.  In effect, this changes the token range layout from

   shard 0
   shard 1
   ...
   shard S-1

to

   shard 0
   shard 1
   ...
   shard S-1

   shard 0
   shard 1
   ...
   shard S-1

   ...

   shard 0
   shard 1
   ...
   shard S-1

Where the number of repetitions of the block is 2^(ignored msb bits).

For compatibility, the default is zero ignored bits, matching the pre-patch
state, until we wire things up.
2016-11-22 21:56:42 +02:00
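The layout change in af16c0fac4 can be sketched in a few lines. The sketch assumes tokens are plain unsigned 64-bit values (the real murmur3 tokens are signed, so a bias step would be needed first), and `shard_of_token` is a hypothetical name. Shifting the ignored bits out before scaling the token into the shard range makes the shard sequence repeat 2^(ignored bits) times across the token space.

```cpp
#include <cassert>
#include <cstdint>

// unsigned __int128 is a GCC/Clang extension, used for the 128-bit product.
using u128 = unsigned __int128;

// Map a token to a shard, ignoring the `ignore_msb` most significant bits.
unsigned shard_of_token(uint64_t token, unsigned shards, unsigned ignore_msb) {
    assert(shards > 0 && ignore_msb < 64);
    // Discard the top bits so the remaining bits select the shard.
    const uint64_t shifted = token << ignore_msb;
    // Scale [0, 2^64) onto [0, shards) by taking the high half of the product.
    return static_cast<unsigned>((u128(shifted) * shards) >> 64);
}
```

With 4 shards and no ignored bits, the token 2^62 lands on shard 1 and 2^63 on shard 2. With one ignored bit, the whole shard sequence appears twice, so 2^63 lands on shard 0 again, just like token 0.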
Duarte Nunes
01815ecd24 murmur3_partitioner: Convert maximum_token to sstring
This patch ensures we can convert the maximum_token to an sstring.
For Cassandra, the minimum and maximum tokens have the same
representation. So, we use the string representation of the
minimum_token for the maximum_token.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-11-21 10:56:32 +00:00
Duarte Nunes
66f6a367a4 ring_position_range_sharder: Avoid copying eagerly
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20161104115632.15974-1-duarte@scylladb.com>
2016-11-13 11:42:23 +02:00
Avi Kivity
7202b94183 dht: introduce a sharder for vectors of partition ranges
Building on the single-range sharder, add a sharder for vectors of
partition ranges.  This helps with wrapped ranges, which are translated
into a vector containing two ranges.
2016-11-03 19:10:20 +02:00
Avi Kivity
43a2380899 dht: add a generator for shard/range pairs
Divides a ring_position range into a sequence of shard/range pairs.  This
allows sequential iteration over shards in ring order.

The current multi-partition query executes on all shards in parallel, but
this is very wasteful, as most of the data will be thrown away if it is not
included in the page.  With the generator, we can switch to sequential
execution.
2016-11-03 19:10:17 +02:00
Avi Kivity
1f88d103a8 partitioner: add i_partitioner::token_for_next_shard()
When performing a range query, we want to iterate over shards, running the
query on each shard in order until the query range is exhausted or we have
the right number of rows.

To be able to do this, introduce token_for_next_shard(), which allows us
to determine the boundary between shards.

It is a sort-of inverse to shard_of(), in that

  shard_of(token_for_next_shard(t)) == shard_of(t) + 1
2016-11-03 19:09:23 +02:00
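The relation between shard_of() and token_for_next_shard() stated in 1f88d103a8 can be checked on a simplified model. The sketch assumes unsigned 64-bit tokens and equal contiguous slices per shard. Returning an empty optional on the last shard is a choice made for this illustration and is not taken from the commit. Both functions below are stand-ins for the i_partitioner members of the same name.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// unsigned __int128 is a GCC/Clang extension, used for the 128-bit product.
using u128 = unsigned __int128;

// Assumed layout: shard s owns the s-th of `shards` equal token slices.
unsigned shard_of(uint64_t token, unsigned shards) {
    assert(shards > 0);
    return static_cast<unsigned>((u128(token) * shards) >> 64);
}

// Smallest token owned by the shard after the one that owns `token`,
// or an empty optional when `token` already belongs to the last shard.
std::optional<uint64_t> token_for_next_shard(uint64_t token, unsigned shards) {
    const unsigned next = shard_of(token, shards) + 1;
    if (next == shards) {
        return std::nullopt;
    }
    // Smallest t with floor(t * shards / 2^64) == next is ceil(next * 2^64 / shards).
    const u128 numerator = u128(next) << 64;
    return static_cast<uint64_t>((numerator + shards - 1) / shards);
}
```

For any token that is not on the last shard, the returned token is the exact boundary: it belongs to the next shard, and the token just before it still belongs to the current one.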
Avi Kivity
6c45b0bae8 partitioner: make comparators public
The public comparison operators depend on global_partitioner(), and are
therefore less useful for tests.
2016-11-03 11:27:40 +02:00
Avi Kivity
6320181b97 partitioner: const correctness for comparators 2016-11-03 11:27:40 +02:00
Avi Kivity
470826d127 partitioner: change partitioners to have shard counts independent from smp::count
Useful for testing.
2016-11-03 11:27:40 +02:00
Avi Kivity
a35136533d Convert ring_position and token ranges to be nonwrapping
Wrapping ranges are a pain, so we are moving wrap handling to the edges.

Since cql can't generate wrapping ranges, this means thrift and the ring
maintenance code; also range->ring transformations need to merge the first
and last ranges.

Message-Id: <1478105905-31613-1-git-send-email-avi@scylladb.com>
2016-11-02 21:04:11 +02:00
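Moving wrap handling to the edges, as a35136533d describes, amounts to turning one wrapping range into at most two nonwrapping ones before the rest of the code sees it. The sketch below assumes signed 64-bit tokens and intervals that are open at the start and closed at the end. `token_interval`, `unwrap`, and `contains` are hypothetical names, and the special minimum token of the real ring is not modelled.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Assumed model: (start, end], wrapping around the ring when start >= end.
struct token_interval {
    int64_t start;
    int64_t end;
};

// Split a possibly wrapping interval into nonwrapping pieces in token order.
std::vector<token_interval> unwrap(token_interval r) {
    if (r.start < r.end) {
        return {r};  // already nonwrapping
    }
    constexpr int64_t lo = std::numeric_limits<int64_t>::min();
    constexpr int64_t hi = std::numeric_limits<int64_t>::max();
    // The wrapped part up to `end` comes first, then the tail after `start`.
    return {{lo, r.end}, {r.start, hi}};
}

// Membership test, valid for nonwrapping intervals only.
bool contains(token_interval r, int64_t t) {
    assert(r.start <= r.end);
    return r.start < t && t <= r.end;
}
```

For example, the wrapping interval (20, 10] becomes (min, 10] and (20, max], so token 5 falls in the first piece, token 25 in the second, and token 15 in neither.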
Duarte Nunes
862f51cddf partitioner: Parse token from bytes
This patch adds the from_bytes() function to the i_partitioner class,
whose purpose is to parse a particular token and explicitly handle the
case when the minimum token is specified.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-09-30 11:17:02 +00:00
Avi Kivity
4fcebd4ca6 random_partitioner: fix overflow in shard_of()
uint128_t will overflow if smp::count > 2.  Replace with a larger type.

Message-Id: <1471188765-30142-1-git-send-email-avi@scylladb.com>
2016-08-15 09:41:54 +03:00
Asias He
2f4cd86809 random_partitioner: Implement random_partitioner
Cassandra 1.x clusters often use RandomPartitioner. Supporting
RandomPartitioner will allow easier migration to Scylla.

Tests are added to make sure scylla generates the same token as
Cassandra does for the same partition key.

Fixes #1438

Message-Id: <3bc8b7f06fad16d59aaaa96e2827198ce74214c6.1469166766.git.asias@scylladb.com>
2016-07-24 16:25:25 +03:00
Duarte Nunes
aaa76d58ba query: Move to_partition_range to dht namespace
This patch moves to_partition_range, from the query namespace
to the dht namespace, where it is a more natural fit.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1468498060-19251-1-git-send-email-duarte@scylladb.com>
2016-07-15 10:41:52 +02:00
Asias He
f4389349e4 config: Enable partitioner option
Enable the --partitioner option so that the user can choose a partitioner other
than the default Murmur3Partitioner. Currently, only Murmur3Partitioner
and ByteOrderedPartitioner are supported. When an unsupported partitioner
is specified, an error will be propagated to the user.
2016-07-08 17:44:55 +08:00
Asias He
9c27b5c46e byte_ordered_partitioner: Implement missing describe_ownership and midpoint
In order to support ByteOrderedPartitioner, we need to implement the
missing describe_ownership and midpoint function in
byte_ordered_partitioner class.

As a starter, this patch uses a simple node token distance based method
to calculate ownership. C* uses a complicated key samples based method.
We can switch to what C* does later.

Tests are added to tests/partitioner_test.cc.

Fixes #1378
2016-07-08 17:44:55 +08:00
Asias He
f6a2672be0 storage_service: Modify log to match config option of scylla
We currently log as follows:

May  9 00:09:13 node3.nl scylla[2546]:  [shard 0] storage_service - This
node was decommissioned and will not rejoin the ring unless
cassandra.override_decommission=true has been set,or all existing data
is removed and the node is bootstrapped again

However, the user should use

   override_decommission:true

instead of

   cassandra.override_decommission:true

in scylla.yaml where the cassandra prefix is stripped.

Fixes #1240
Message-Id: <b0c9424c6922431ad049ab49391771e07ca6fbde.1467079190.git.asias@scylladb.com>
2016-07-04 10:47:49 +02:00
Piotr Jastrzebski
27575a0528 Fix previous_entry_is_continuous
Rename it to check_previous_entry.
Remove unnecessary test.
Make sure ring_position always has working relation_to_keys method.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <6bc790d492ba9b5c302a50218f3e26b924f657d0.1467101754.git.piotr@scylladb.com>
2016-06-28 10:27:08 +02:00
Asias He
ee0585cee9 dht: Add default constructor for token
It is needed to put a token into a boost interval_map in the following
patch.
2016-05-17 17:32:15 +08:00
Pekka Enberg
38a54df863 Fix pre-ScyllaDB copyright statements
People keep tripping over the old copyrights and copy-pasting them to
new files. Search and replace "Cloudius Systems" with "ScyllaDB".

Message-Id: <1460013664-25966-1-git-send-email-penberg@scylladb.com>
2016-04-08 08:12:47 +03:00
Gleb Natapov
775cc93880 remove unused range and token serializers 2016-02-02 12:15:49 +02:00
Asias He
bdd6a69af7 streaming: Drop unused parameters
- int connections_per_host

Scylla does not create connections per stream_session, instead it uses
rpc, thus connections_per_host is not relevant to scylla.

- bool keep_ss_table_level
- int repaired_at

Scylla does not stream sstable files. They are not relevant to scylla.
2016-01-25 11:38:13 +08:00
Gleb Natapov
043d132ba9 Remove no longer used serializers. 2016-01-24 12:45:41 +02:00
Gleb Natapov
49ce2b83df Add ring_position constructor needed by serializer. 2016-01-24 12:45:41 +02:00
Asias He
89b79d44de streaming: Get rid of the _connecting_ parameter
messaging_service will use the private ip address automatically to connect to a
peer node if possible. There is no need for the upper level like
streaming to worry about it. Dropping it simplifies things a bit.
2015-12-31 11:25:08 +01:00
Nadav Har'El
f0b27671a2 murmur3 partitioner: remove outdated comment, and code
Since commit 16596385ee, long_token() is already checking
t.is_minimum(), so the comment which explains why it does not (for
performance) is no longer relevant. And we no longer need to check
t._kind before calling long_token (the check we do here is the same
as is_minimum).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2015-12-30 10:01:29 +02:00