Commit Graph

227 Commits

Michael Munday
b9a2f4a228 dht: fix byte ordered partitioner midpoint calculation
New versions of boost saturate the output of the convert_to method,
so we need to mask out the part we want to extract.

Updates #3922.

Message-Id: <20181116191441.35000-1-mike.munday@ibm.com>
2018-11-16 21:19:06 +02:00
Avi Kivity
82818758ca dht: convert sprint() to format()
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().

Mechanically converted with https://github.com/avikivity/unsprint.
2018-11-01 13:16:17 +00:00
Avi Kivity
7ff5569ee8 dht: fix bad format string syntax
Some sprint() calls use the fmt language instead of the printf syntax. Convert
them all the way to format().
2018-11-01 13:16:17 +00:00
Duarte Nunes
e46ef6723b Merge seastar upstream
* seastar d152f2d...c1e0e5d (6):
  > scripts: perftune.py: properly merge parameters from the command line and the configuration file
  > fmt: update to 5.2.1
  > io_queue: only increment statistics when request is admitted
  > Adds `read_first_line.cc` and `read_first_line.hh` to CMake.
  > fstream: remove default extent allocation hint
  > core/semaphore: Change the access of semaphore_units main ctor

Due to a compile-time fight between fmt and boost::multiprecision, a
lexical_cast was added to mediate.

sprint("%s", var) no longer accepts numeric values, so some sprint()s were
converted to format() calls. Since more may be lurking, we'll need to remove
all sprint() calls.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-10-25 12:53:30 +03:00
Asias He
7f826d3343 streaming: Expose reason for streaming
On receiving a mutation_fragment or a mutation triggered by a streaming
operation, we pass an enum stream_reason to notify the receiver what
the streaming is used for, so the receiver can decide on further operations,
e.g., sending view updates, beyond applying the streamed data to disk.

Fixes #3276
Message-Id: <f15ebcdee25e87a033dcdd066770114a499881c0.1539498866.git.asias@scylladb.com>
2018-10-15 22:03:28 +01:00
Benny Halevy
7eef527769 handle both special token_kinds in dht::tri_compare
Handle the before_all_keys and after_all_keys token_kinds
at the highest layer, before calling into the virtual
i_partitioner::tri_compare, which is not set up to handle these cases.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20181015165612.29356-1-bhalevy@scylladb.com>
2018-10-15 20:00:54 +03:00
Asias He
8edf3defdf range_streamer: Futurize add_ranges
It might take a long time for get_all_ranges_with_sources_for and
get_all_ranges_with_strict_sources_for to calculate, which causes reactor
stalls. To fix, run them in a thread and yield. Those functions are used in
the slow path, so it is ok to yield more than needed.

Fixes #3639

Message-Id: <63aa7794906ac020c9d9b2984e1351a8298a249b.1536135617.git.asias@scylladb.com>
2018-10-09 09:46:50 +03:00
Botond Dénes
867f69b9d1 dht::i_partitioner: add partition_ranges_view 2018-09-03 10:31:44 +03:00
Asias He
95849371aa range_streamer: Remove unordered_multimap usage
We need the mappings from dht::token_range to
std::vector<inet_address> and from inet_address to dht::token_range_vector in
various places. Currently, we use std::unordered_multimap and convert it to
std::unordered_map. It is better to use std::unordered_map in the first
place. The changes are as below:

- Change from

  std::unordered_multimap<dht::token_range, inet_address>

to

  std::unordered_map<dht::token_range, std::vector<inet_address>>

- Change from

   std::unordered_multimap<inet_address, dht::token_range>

to

   std::unordered_map<inet_address, dht::token_range_vector>

Message-Id: <b8ecc41775e46ec064db3ee07510c404583390aa.1533106019.git.asias@scylladb.com>
2018-08-01 13:01:41 +03:00
Asias He
4a0b561376 storage_service: Get rid of moving operation
The moving operation changes a node's token to a new token. It is
supported only when a node has one token. The legacy moving operation is
useful in the early days before the vnode is introduced where a node has
only one token. I don't think it is useful anymore.

In the future, we might support adjusting the number of vnodes to reblance
the token range each node owns.

Removing it simplifies the cluster operation logic and code.

Fixes #3475

Message-Id: <144d3bea4140eda550770b866ec30e961933401d.1533111227.git.asias@scylladb.com>
2018-08-01 11:18:17 +03:00
Nadav Har'El
25bd139508 cross-tree: clean up use of std::random_device()
std::random_device() uses the relatively slow /dev/urandom, and we rarely if
ever intend to use it directly - we normally want to use it to seed a faster
random_engine (a pseudo-random number generator).

In many places in the code, we first created a random_device variable and then
used it to create a random_engine variable. However, this practice created the
risk of a programmer accidentally using the random_device object instead of the
random_engine object, because both have the same API; this hurts performance.

This risk materialized in just two places in the code, utils/uuid.cc and
gms/gossiper.cc. A patch for uuid.cc was sent previously by Pawel and is
not included in this patch; the fix for gossiper.{cc,hh} is included here.

To avoid risking the same mistake in the future, this patch switches across the
code to an idiom where the random_device object is not *named*, so cannot be
accidentally used. We use the following idiom:

   std::default_random_engine _engine{std::random_device{}()};

Here std::random_device{}() creates the random device (/dev/urandom) and pulls
a random integer from it. It then uses this seed to create the random_engine
(the pseudo-random number generator). The std::random_device{} object is
temporary and unnamed, and cannot be unintentionally used directly.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180726154958.4405-1-nyh@scylladb.com>
2018-07-26 16:54:58 +01:00
Asias He
1f06ee3960 range_streamer: Limit nr of nodes to stream in parallel
For example, to bootstrap a 50th node in a cluster

 [shard 0] range_streamer - Bootstrap with
 [127.0.0.8, 127.0.0.2, 127.0.0.24, 127.0.0.21, 127.0.0.49, 127.0.0.44,
 127.0.0.9, 127.0.0.7, 127.0.0.47, 127.0.0.15, 127.0.0.5, 127.0.0.30,
 127.0.0.14, 127.0.0.12, 127.0.0.36, 127.0.0.11, 127.0.0.48, 127.0.0.28,
 127.0.0.33, 127.0.0.10, 127.0.0.41, 127.0.0.4, 127.0.0.40, 127.0.0.3,
 127.0.0.6, 127.0.0.43, 127.0.0.22, 127.0.0.26, 127.0.0.42, 127.0.0.25,
 127.0.0.17, 127.0.0.37, 127.0.0.23, 127.0.0.13, 127.0.0.38, 127.0.0.1,
 127.0.0.18, 127.0.0.20, 127.0.0.39, 127.0.0.27, 127.0.0.34, 127.0.0.32,
 127.0.0.19, 127.0.0.16, 127.0.0.31, 127.0.0.45, 127.0.0.29, 127.0.0.35,
 127.0.0.46]
 for keyspace=keyspace1 started, nodes_to_stream=49, nodes_in_parallel=49

the new node will get data from 49 existing nodes.

Currently, it will stream from all 49 existing nodes at the same
time. It is not a good idea to stream from all the nodes in parallel,
which can overwhelm the bootstrapping node, i.e., 49 nodes sending, 1 node
receiving.

To fix this, limit the number of nodes to stream from in parallel. We should
have better control over memory usage and parallelism, but for now limit
the number of nodes to a maximum of 16 as a starting point. With this limit,
each shard can work with as many as 16 remote nodes in parallel, which I
think provides enough parallelism for streaming performance.

This change affects the bootstrap/decommission/removenode operations,
and has no effect on repair.

Refs #2782

Message-Id: <980610dc97490d4f16281a0c3203b9bee73e04e4.1531989557.git.asias@scylladb.com>
2018-07-19 11:44:05 +03:00
Asias He
506eed325a dht: Fix typo in boot_strapper.cc
Eror -> Error

Message-Id: <ab1050c526f6e70c3a365595376acde7706d86e9.1531877929.git.asias@scylladb.com>
2018-07-18 10:00:27 +03:00
Avi Kivity
f4caa418ff Merge "Fix the "LCS data-loss bug"" from Botond
"
This series fixes the "LCS data-loss bug" where full scans (and
everything that uses them) would miss some small percentage (> 0.001%)
of the keys. This could easily lead to permanent data-loss as compaction
and decommission both use full scans.
aeffbb673 worked around this bug by disabling the incremental reader
selectors (the class identified as the source of the bug) altogether.
This series fixes the underlying issue and reverts aeffbb673.

The root cause of the bug is that the `incremental_reader_selector` uses
the current read position to poll for new readers using
`sstable_set::incremental_selector::select()`. This means that when the
currently open sstables contain no partitions that would intersect with
some of the yet unselected sstables, those sstables would be ignored.
Solve the problem by not calling `select()` with the current read
position and always pass the `next_position` returned in the previous
call. This means that the traversal of the sstable-set happens at a pace
defined by the sstable-set itself and this guarantees that no sstable
will be jumped over. When asked for new readers the
`incremental_reader_selector` will now iteratively call `select()` using
the `next_position` from the previous `select()` call until it either
receives some new, yet unselected sstables, or `next_position` surpasses
the read position (in which case `select()` will be tried again later).
The `sstable_set::incremental_selector` was not suitable in its present
state to support calling `select()` with the `next_position` from a
previous call as in some cases it could not make progress due to
inclusiveness related ambiguities. So, in preparation for the above fix,
`sstable_set` was updated to work in terms of ring-position instead of
tokens. Ring-position can express positions in a much more fine-grained
way than a token, including positions after/before tokens and keys. This
allows for a clear expression of `next_position` such that calling
`select()` with it guarantees forward progress in the token-space.

Tests: unit(release, debug)

Refs: #3513
"

* 'leveled-missing-keys/v4' of https://github.com/denesb/scylla:
  tests/mutation_reader_test: combined_mutation_reader_test: use SEASTAR_THREAD_TEST_CASE
  tests/mutation_reader_test: refactor combined_mutation_reader_test
  tests/mutation_reader_test: fix reader_selector related tests
  Revert "database: stop using incremental selectors"
  incremental_reader_selector: don't jump over sstables
  mutation_reader: reader_selector: use ring_position instead of token
  sstables_set::incremental_selector: use ring_position instead of token
  compatible_ring_position: refactor to compatible_ring_position_view
  dht::ring_position_view: use token_bound from ring_position
  i_partitioner: add free function ring-position tri comparator
  mutation_reader_merger::maybe_add_readers(): remove early return
  mutation_reader_merger: get rid of _key
2018-07-05 09:33:12 +03:00
Botond Dénes
a8e795a16e sstables_set::incremental_selector: use ring_position instead of token
Currently `sstable_set::incremental_selector` works in terms of tokens.
Sstables can be selected with tokens and internally the token-space is
partitioned (in `partitioned_sstable_set`, used for LCS) with tokens as
well. This is problematic for several reasons.
The sub-range of the token-space that an sstable covers is defined in
terms of decorated keys. It is even possible that multiple sstables cover
multiple non-overlapping sub-ranges of a single token. The current
system is unable to model this and will at best result in selecting
unnecessary sstables.
The usage of token for providing the next position where the
intersecting sstables change [1] causes further problems. Attempting to
walk over the token-space by repeatedly calling `select()` with the
`next_position` returned from the previous call will quite possibly lead
to an infinite loop as a token cannot express inclusiveness/exclusiveness
and thus the incremental selector will not be able to make progress when
the upper and lower bounds of two neighbouring intervals share the same
token with different inclusiveness e.g. [t1, t2](t2, t3].

To solve these problems update incremental_selector to work in terms of
ring position. This makes it possible to partition the token-space
among sstables at decorated key granularity. It also makes it possible
for select() to return a next_position that is guaranteed to make
progress.

partitioned_sstable_set now builds the internal interval map using the
decorated key of the sstables, not just the tokens.
incremental_selector::select() now uses `dht::ring_position_view` as
both the selector and the next_position. ring_position_view can express
positions between keys so it can also include information about
inclusiveness/exclusiveness of the next interval guaranteeing forward
progress.

[1] `sstable_set::incremental_selector::selection::next_position`
2018-07-04 17:42:33 +03:00
Botond Dénes
bf2645c616 compatible_ring_position: refactor to compatible_ring_position_view
compatible_ring_position's sole purpose is to allow creating
boost::icl::interval_map with dht::ring_position as the key and list of
sstables as the value. This purpose is served equally well if
compatible_ring_position wraps a `dht::ring_position_view` instead of a
`dht::ring_position` with the added benefit of not having to copy the
possibly heavy `dht::decorated_key` around. It also makes it possible
to do lookups with `dht::ring_position_view` which is much more
versatile and allows avoiding copies just to make lookups.
The only downside is that `dht::ring_position_view` requires the
lifetime of the "viewed" object to be taken care of. This is not a
concern however, as so long as an interval is present in the map the
represented sstable is guaranteed to be alive too, as the interval map
participates in the ownership of the stored sstables.

Rename compatible_ring_position to compatible_ring_position_view to
reflect the changes.
While at it, upgrade std::experimental::optional to std::optional.
2018-07-04 08:19:39 +03:00
Botond Dénes
48b07ba5d3 dht::ring_position_view: use token_bound from ring_position
Currently dht::ring_position_view's dht::token constructor takes the
token bound in the form of a raw `uint8_t`. This allows passing a
weight of "0", which is illegal, as a single token does not represent a
single ring position but an interval, since an arbitrary number of keys can
have the same token. dht::ring_position uses an enum in its dht::token
constructor. Import that same enum into the dht::ring_position_view
scope and take a `token_bound` instead of `uint8_t`.
This is especially important as in later patches the internal weight of
the ring_position_view will be exposed and illegal values can cause all
sorts of problems.
2018-07-04 08:19:34 +03:00
Botond Dénes
01bd34d117 i_partitioner: add free function ring-position tri comparator
Having to create an object just to compare two ring positions (or views)
is annoying and unnecessary. Provide a free function version as well.
2018-07-02 11:41:09 +03:00
Avi Kivity
db2c029f7a dht: add i_partitioner::sharding_ignore_msb()
While the sharding algorithm is exposed (as cpu_sharding_algorithm_name()),
the ignore_msb parameter is not. Add a function to do that.
2018-07-01 12:17:35 +03:00
Asias He
27cb41ddeb range_streamer: Use float for time took for stream
It is useful when the total time to stream is small, e.g., 2.0 seconds
vs. 2.9 seconds. Showing the time as an integer number of seconds is not
accurate in such cases.

Message-Id: <d801b57279981c72acb907ad4b0190ba4d938a3d.1530175052.git.asias@scylladb.com>
2018-06-28 11:39:14 +03:00
Asias He
d23dafa7ac dht: Remove column_families parameter in add_rx_ranges and add_tx_ranges
In 4b1034b (storage_service: Remove the stream_hints), we removed the
only user of the API with the column_families parameter:

    std::vector column_families = { db::system_keyspace::HINTS };
    streamer->add_tx_ranges(keyspace, std::move(ranges_per_endpoint),
                            column_families);

We can simplify the range_streamer code a bit by removing it.

Fixes #3476

Tests: dtest update_cluster_layout_tests.py
Message-Id: <c81d79c5e6dbc8dd78c1242837de892e39d6abd2.1528356342.git.asias@scylladb.com>
2018-06-10 14:53:40 +03:00
Glauber Costa
250d9332dc partitioner: export the name of the algorithm used to do intra-node sharding
We will export this on system tables. To avoid hard-coding it in the system
table level, keep it at least in the dht layer where it belongs.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-06-04 11:25:58 -04:00
Avi Kivity
9eb7c0c65b Merge "Remove (some) reactor stalls in the SSTable code" from Glauber
"
This is an improvement on my latest series. Instead of just
dealing with the problem of destroying the Summary that I have
identified in a previous test, I have tried to find other sources
of stalls.

Some of them are on readers and would affect early processes and
operations like nodetool refresh.

Others are on writers, which can affect any SSTable being written.

Two of those stalls (on large filter, on summary read), I saw in a
synthetic benchmark where I used very small values + nodetool compact
to generate one SSTable with many keys. They were 80ms and 20ms
respectively, and now they are totally gone.

For others, I just tried to be safe (for instance, if we know
reading/writing large vectors can be costly, just always insert
preemption points in them).

With all of these patches applied, I no longer see stalls coming from
the SSTable code in those tests (although given enough time, I am sure I
can find more).

Tests: unit (release)
Fixes: #3282, Fixes #3281, Fixes #3269
"

* 'sstables-stalls-v3-updated' of github.com:glommer/scylla:
  large_bitset/bloom filter: add preemption points in loops
  sstables: read filter in a thread
  abstract summary entry version of the token with a token view
  add a token_view
  sstables: rework summary entries reading
  sstables: avoid calls to resize for vectors
  sstables: replace potentially large for loop with do_until
  summary_entry: do not store key bytes in each summary entry
  tests: change tests to make summary non-copyable
  chunked_vector: do not iterate to destruct trivially destructible types
2018-03-16 09:43:36 +01:00
Glauber Costa
dddc7e1676 add a token_view
Ideally we would like tokens to be trivially destructible, so that we
can easily dispose of giant vectors holding them. While that is hard to
do with our current infrastructure, we can introduce a token_view, which
holds a bytes_view element instead of the real data, making it
trivially destructible.

The comparators are then changed to take a token_view, and an implicit
conversion function is provided from tokens so they get compared.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-03-15 12:24:09 -04:00
Asias He
9b5585ebd5 range_streamer: Stream 10% of ranges instead of 10 ranges per time
If there are a lot of ranges, e.g., num_tokens=2048, 10 ranges per
stream plan will cause tons of stream plans to be created to stream data,
each carrying very little data. This causes each stream plan to have low
transfer bandwidth, so the total time to complete the streaming increases.

It makes more sense to send a percentage of the total ranges per stream
plan than a fixed number of ranges.

Here is an example of streaming a keyspace with 513 ranges in
total, 10 ranges vs. 10% of the ranges:

Before:
[shard 0] range_streamer - Bootstrap with 127.0.0.1 for
keyspace=system_traces, 510 out of 513 ranges: ranges = 51
[shard 0] range_streamer - Bootstrap with ks for keyspace=127.0.0.1
succeeded, took 107 seconds

After:
[shard 0] range_streamer - Bootstrap with 127.0.0.1 for
keyspace=system_traces, 510 out of 513 ranges: ranges = 10
[shard 0] range_streamer - Bootstrap with ks for keyspace=127.0.0.1
succeeded, took 22 seconds

Message-Id: <a890b84fbac0f3c3cc4021e30dbf4cdf135b93ea.1520992228.git.asias@scylladb.com>
2018-03-14 10:12:12 +02:00
Asias He
73d8e2743f dht: Fix log in range_streamer
The address and keyspace should be swapped.

Before:
  range_streamer - Bootstrap with ks3 for keyspace=127.0.0.1 succeeded,
  took 56 seconds

After:
  range_streamer - Bootstrap with 127.0.0.1 for keyspace=ks3 succeeded,
  took 56 seconds

Message-Id: <5c49646f1fbe45e3a1e7545b8470e04b166922c4.1520416042.git.asias@scylladb.com>
2018-03-07 11:49:58 +02:00
Raphael S. Carvalho
19d994cfff dht: make it easier to create ring_position_view from token
That's done by adding a separate explicit constructor.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2018-01-03 15:26:26 -02:00
Raphael S. Carvalho
68ac0832b7 dht: introduce is_min/max for ring_position
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2018-01-03 15:26:25 -02:00
Paweł Dziepak
8c3b7fea81 Merge "Introduce new API and converters from/to old mutation_reader" from Piotr
"This changeset is the first step to flatten mutation_reader.
Then it introduces new mutation_fragment types for partition header and end of partition.
Using those a new flat_mutation_reader is defined.
Finally it introduces converters between new flat_mutation_reader and
old mutation_reader."

* 'haaawk/flattened_mutation_reader_v12' of github.com:scylladb/seastar-dev:
  Add tests for flat_mutation_reader
  Introduce conversion from flat_mutation_reader to mutation_reader
  Introduce conversion from mutation_reader to flat_mutation_reader
  Introduce flat_mutation_reader
  Extract FlattenedConsumer concept using GCC6_CONCEPT
  Introduce partition_end mutation_fragment
  Introduce a position for end of partition
  Introduce partition_start mutation_fragment
  Introduce FragmentConsumer
  Introduce a position for partition start
  streamed_mutation: Extract concepts using GCC6_CONCEPT macro
2017-10-16 12:14:23 +01:00
Duarte Nunes
2210d10552 gms/gossiper: Cleanup is_alive()
Make it use get_endpoint_state_for_endpoint_ptr(), check if gossiper is
enabled, mark it as const, and have some callers use it instead of open
coding the logic.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-11 10:02:32 +01:00
Piotr Jastrzebski
2516b42752 Introduce partition_start mutation_fragment
This type of mutation_fragment will be used in new mutation_reader
to signal the beginning of the next partition.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2017-10-10 16:15:59 +02:00
Duarte Nunes
ceebbe14cc gossiper: Avoid endpoint_state copies
gossiper::get_endpoint_state_for_endpoint() returns a copy of
endpoint_state, which we've seen can be very expensive.

This patch adds a similar function which returns a pointer instead,
and changes the call sites where using the pointer-returning variant
is deemed safe (the pointer neither escapes the function, nor crosses
any defer point).

Fixes #764

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2017-10-10 13:48:02 +01:00
Tomasz Grabiec
741ec61269 streaming: Fix streaming not streaming all ranges
It skipped one sub-range in each 10-range batch, and
tried to access the range vector using the end() iterator.

Fixes sporadic failures of
update_cluster_layout_tests.py:TestUpdateClusterLayout.simple_add_node_1_test.

Message-Id: <1505848902-16734-1-git-send-email-tgrabiec@scylladb.com>
2017-09-20 10:33:59 +03:00
Botond Dénes
a980ff6463 Use abort() instead of assert + throw in unreachable code
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <393c3730111dfe090c44d8fc2e31602956a7d008.1504022425.git.bdenes@scylladb.com>
2017-09-03 11:07:27 +03:00
Botond Dénes
d1209c548a Fix -Wreturn-type warnings
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <99f7a006daaa78eb87720ac51c394093398bc868.1504013915.git.bdenes@scylladb.com>
2017-08-29 16:41:09 +03:00
Tomasz Grabiec
2ca99be27d ring_position_view: Print token instead of token pointer
Broken in e989d65539.
Message-Id: <1503667158-7544-1-git-send-email-tgrabiec@scylladb.com>
2017-08-25 14:25:21 +01:00
Avi Kivity
81a33df25d dht: reduce split_range_to_single_shard contiguous memory demand
split_range_to_single_shard() returns a vector of size 4096, with
each element (a partition_range) of size 100. The 400k total is a large
contiguous allocation that can require defragmentation if memory is fragmented.

Fix by using a deque.

Fixes #2707.
Message-Id: <20170819141017.28287-1-avi@scylladb.com>
2017-08-21 14:25:45 +02:00
Duarte Nunes
ec75eac37d ring_position_exponential_vector_sharder: Take ranges by rvalue
Avoids some copies.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170814093310.29200-1-duarte@scylladb.com>
2017-08-14 12:55:43 +03:00
Asias He
f239b11a84 storage_service: Use the new range_streamer interface for bootstrap
So that the bootstrap operation will now stream small batches of ranges at
a time and re-stream the failed ranges.
2017-08-07 16:31:47 +08:00
Asias He
6810031ba7 dht: Extend range_streamer interface
After this patch and the following patches that use the new
range_streamer interface, all the following cluster operations:

- bootstrap
- rebuild
- decommission
- removenode

will use the same code to do the streaming.

The range_streamer is now extended to support both fetching from and pushing
to peer nodes. Another big change is that the range_streamer will now stream
fewer ranges at a time (so less data) per stream_plan, and the range_streamer
will remember which ranges failed to stream and can retry them later.

The retry policy is very simple at the moment: it retries at most 5 times
and sleeps 1 minute, 1.5^2 minutes, 1.5^3 minutes, and so on.

Later, we can introduce api for user to decide when to stop retrying and
the retry interval.

The benefits:

- All the cluster operations share the same streaming code.

- We can know the operation's progress, e.g., the total number of
  ranges that need to be streamed and the number of ranges finished in
  bootstrap, decommission, etc.
- All the cluster operations can survive a peer node going down during the
  operation, which usually takes a long time to complete. E.g., when adding
  a new node, currently if any of the existing nodes streaming data to
  the new node has an issue sending data, the whole bootstrap process
  will fail. After this patch, we can fix the problematic node and
  restart it, and the joining node will retry streaming from that node
  again.
- We can fail streaming early, time out early, and retry less, because
  all the operations that use streaming can survive the failure of a
  single stream_plan. It is no longer critical that any single
  stream_plan succeed. Note that another user of streaming, repair, now
  uses small stream_plans as well and can rerun the repair for the
  failed ranges too.

This is one step closer to supporting resumable add/remove node
operations.
2017-08-07 16:31:47 +08:00
Paweł Dziepak
68e57a742f ring_position_comparator: drop unused overloads 2017-07-26 14:36:37 +01:00
Paweł Dziepak
fe7eba7f06 ring_position_comparator: accept sstables::decorated_key_view
ring_position_comparator has overloads for comparing ring_positions as
well as sstables::key_view. In the case of the latter it needs to
compute the token of the key. However, the sstable layer could cache
some tokens, so let's allow the comparator callers to provide it
directly.
2017-07-26 14:36:36 +01:00
Tomasz Grabiec
60678f0e8a ring_position: Optimize construction from r-value references of decorated_key
Message-Id: <1500650171-26291-1-git-send-email-tgrabiec@scylladb.com>
2017-07-24 10:25:14 +03:00
Asias He
d835cf2748 dht: Add selective_token_range_sharder
It is like ring_position_range_sharder, but it works with
dht::token_range. This sharder returns the ranges belonging to a
selected shard.
2017-07-04 18:46:19 +08:00
Tomasz Grabiec
e989d65539 dht: Make ring_position_view copyable
dht::token needs to be stored as a pointer now, and not a reference, so
that the validity of old pointers is not impacted by the in-place object
construction that would occur in the copy-assignment operator.

[1] says that old pointers can be used to access the new object only
if the type "does not contain any non-static data member whose type is
const-qualified or a reference type".

[1] http://en.cppreference.com/w/cpp/language/lifetime#Storage_reuse
2017-06-24 18:06:11 +02:00
Tomasz Grabiec
81e7b561da dht: Add ring_position min()/max() 2017-06-24 18:06:11 +02:00
Avi Kivity
f9f2f18145 dht: fix bad to_sstring() call
to_sstring() is part of seastar, not the global namespace.
2017-06-22 17:51:27 +03:00
Avi Kivity
ebaeefa02b Merge seastar upstream (seastar namespace)
 - introduced "seastarx.hh" header, which does a "using namespace seastar";
 - 'net' namespace conflicts with seastar::net, renamed to 'netw'.
 - 'transport' namespace conflicts with seastar::transport, renamed to
   cql_transport.
 - "logger" global variables now conflict with the logger global type, renamed
   to xlogger.
 - other minor changes
2017-05-21 12:26:15 +03:00
Calle Wilund
6ca07f16c1 scylla: fix compilation errors on gcc 5
Message-Id: <1495030581-2138-1-git-send-email-calle@scylladb.com>
2017-05-17 17:56:06 +03:00
Avi Kivity
68034604e1 dht: murmur3_partitioner: simplify moving to and from the zero-based token range 2017-05-17 13:50:30 +03:00