In 4b1034b (storage_service: Remove the stream_hints), we removed the
only user of the api with the column_families parameter.
std::vector column_families = { db::system_keyspace::HINTS };
streamer->add_tx_ranges(keyspace, std::move(ranges_per_endpoint),
column_families);
We can simplify the code range_streamer a bit by removing it.
Fixes#3476
Tests: dtest update_cluster_layout_tests.py
Message-Id: <c81d79c5e6dbc8dd78c1242837de892e39d6abd2.1528356342.git.asias@scylladb.com>
We will export this on system tables. To avoid hard-coding it in the system
table level, keep it at least in the dht layer where it belongs.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
"
This is an improvement on my latest series. Instead of just
dealing with the problem of destroying the Summary that I have
identified in a previous test, I have tried to find other sources
of stalls.
Some of them are on readers and would affect early processes and
operations like nodetool refresh.
Others are on writers, which can affect any SSTable being written.
Two of those stalls (on large filter, on summary read), I saw in a
synthetic benchmark where I used very small values + nodetool compact
to generate one SSTable with many keys. They were 80ms and 20ms
respectively, and now they are totally gone.
For others, I just tried to be safe (for instance, if we know
reading/writing large vectors can be costly, just always insert
preemption points in them).
With all of these patches applied, I no longer see stalls coming from
the SSTable code in those tests (although given enough time, I am sure I
can find more).
Tests: unit (release)
Fixes: #3282, Fixes#3281, Fixes#3269
"
* 'sstables-stalls-v3-updated' of github.com:glommer/scylla:
large_bitset/bloom filter: add preemption points in loops
sstables: read filter in a thread
abstract summary entry version of the token with a token view
add a token_view
sstables: rework summary entries reading
sstables: avoid calls to resize for vectors
sstables: replace potentially large for loop with do_until
summary_entry: do not store key bytes in each summary entry
tests: change tests to make summary non-copyable
chunked_vector: do not iterate to destruct trivially destructible types
Ideally we would like tokens to be trivially destructible, so that we
can easily dispose of giant vectors holding them. While that is hard to
do with our current infrastructure, we can introduce a token_view, which
holds a bytes_view elements instead of the real data - making it
trivially destructible.
The comparators are then changed to take a token_view, and an implicit
conversion function is provided from tokens so they get compared.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
If there are a lot of ranges, e.g., num_tokens=2048, 10 ranges per
stream plan will cause tons of stream plan to be created to stream data,
each having very few data. This cause each stream plan has low transfer
bandwidth, so that the total time to complete the streaming increases.
It makes more sense to send a percentage of the total ranges per stream
plan than a fixed ranges.
Here is an example to stream a keyspace with 513 ranges in
total, 10 ranges v.s. 10% ranges:
Before:
[shard 0] range_streamer - Bootstrap with 127.0.0.1 for
keyspace=system_traces, 510 out of 513 ranges: ranges = 51
[shard 0] range_streamer - Bootstrap with ks for keyspace=127.0.0.1
succeeded, took 107 seconds
After:
[shard 0] range_streamer - Bootstrap with 127.0.0.1 for
keyspace=system_traces, 510 out of 513 ranges: ranges = 10
[shard 0] range_streamer - Bootstrap with ks for keyspace=127.0.0.1
succeeded, took 22 seconds
Message-Id: <a890b84fbac0f3c3cc4021e30dbf4cdf135b93ea.1520992228.git.asias@scylladb.com>
The address and keyspace should be swapped.
Before:
range_streamer - Bootstrap with ks3 for keyspace=127.0.0.1 succeeded,
took 56 seconds
After:
range_streamer - Bootstrap with 127.0.0.1 for keyspace=ks3 succeeded,
took 56 seconds
Message-Id: <5c49646f1fbe45e3a1e7545b8470e04b166922c4.1520416042.git.asias@scylladb.com>
"This changeset is the first step to flatten mutation_reader.
Then it introduces new mutation_fragment types for partition header and end of partition.
Using those a new flat_mutation_reader is defined.
Finally it introduces converters between new flat_mutation_reader and
old mutation_reader."
* 'haaawk/flattened_mutation_reader_v12' of github.com:scylladb/seastar-dev:
Add tests for flat_mutation_reader
Introduce conversion from flat_mutation_reader to mutation_reader
Introduce conversion from mutation_reader to flat_mutation_reader
Introduce flat_mutation_reader
Extract FlattenedConsumer concept using GCC6_CONCEPT
Introduce partition_end mutation_fragment
Introduce a position for end of partition
Introduce partition_start mutation_fragment
Introduce FragmentConsumer
Introduce a position for partition start
streamed_mutation: Extract concepts using GCC6_CONCEPT macro
Make it use get_endpoint_state_for_endpoint_ptr(), check if gossiper is
enabled, mark it as const, and have some callers use it instead of open
coding the logic.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This type of mutation_fragment will be used in new mutation_reader
to signal the beginning of the next partition.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
gossiper::get_endpoint_state_for_endpoint() returns a copy of
endpoint_state, which we've seen can be very expensive.
This patch adds a similar function which returns a pointer instead,
and changes the call sites where using the pointer-returning variant
is deemed safe (the pointer neither escapes the function, nor crosses
any defer point).
Fixes#764
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
It skipped one sub-range in each of the 10 range batch, and
tried to access the range vector using end() iterator.
Fixes sporadic failures of
update_cluster_layout_tests.py:TestUpdateClusterLayout.simple_add_node_1_test.
Message-Id: <1505848902-16734-1-git-send-email-tgrabiec@scylladb.com>
split_range_to_single_shard() returns a vector of size 4096, with
each element (a partition_range) of size 100. The total of 400k can
cause defragmentation if memory is fragmented.
Fix by using a deque.
Fixes#2707.
Message-Id: <20170819141017.28287-1-avi@scylladb.com>
After this patch and the following patches to use the new
range_streamder interface, all the following cluster operations:
- bootstrap
- rebuild
- decommission
- removenode
will use the same code to do the streaming.
The range_streamer is now extended to support both fetch from and push
to peer node. Another big change is now the range_streamer will stream
less ranges at a time, so less data, per stream_plan and range_streamer
will remember which ranges are failed to stream and can retry later.
The retry policy is very simple at the moment it retries at most 5 times
and sleep 1 minutes, 1.5^2 minutes, 1.5^3 minutes ....
Later, we can introduce api for user to decide when to stop retrying and
the retry interval.
The benefits:
- All the cluster operation shares the same code to stream
- We can know the operation progress, e.g., we can know total number of
ranges need to be streamed and number of ranges finished in
bootstrap, decommission and etc.
- All the cluster operation can survive peer node down during the
operation which usually takes long time to complete, e.g., when adding
a new node, currently if any of the existing node which streams data to
the new node had issue sending data to the new node, the whole bootstrap
process will fail. After this patch, we can fix the problematic node
and restart it, the joining node will retry streaming from the node
again.
- We can fail streaming early and timeout early and retry less because
all the operations use stream can survive failure of a single
stream_plan. It is not that important for now to have to make a single
stream_plan successful. Note, another user of streaming, repair, is now
using small stream_plan as well and can rerun the repair for the
failed ranges too.
This is one step closer to supporting the resumable add/remove node
opeartions.
ring_position_comparator has overloads for comparing ring_positions as
well as sstables::key_view. In the case of the latter it needs to
compute the token of the key. However, the sstable layer could cache
some tokens so let's allow the comparator callers to provide it
directly.
dht::token needs to be stored as a pointer now and not a reference so
that validity of old pointers is not impacted by in-place object
construction which would occur in the copy-assignment operator.
[1] says that old pointers can be used to access the new object only
if the type "does not contain any non-static data member whose type is
const-qualified or a reference type".
[1] http://en.cppreference.com/w/cpp/language/lifetime#Storage_reuse
- introcduced "seastarx.hh" header, which does a "using namespace seastar";
- 'net' namespace conflicts with seastar::net, renamed to 'netw'.
- 'transport' namespace conflicts with seastar::transport, renamed to
cql_transport.
- "logger" global variables now conflict with logger global type, renamed
to xlogger.
- other minor changes
The "exponentiality" is not carried over from one range to another, because
we expect one or two ranges (two ranges result from a wrapped around thrift
token range).
Like the regular sharder, the exponential sharder divides a range into
subranges owned by individual ranges. Unlike the regular sharder, it
generates ever-increasing subranges, spanning more and more shards, and
eventually returns several subranges per shard. To avoid using
exponential cpu and memory, subranges belonging to a single shard are merged,
and a flag is set to indicate the subranges are not ordered wrt. each other.
Right now, next_token_for_shard() only allows iterating linearly in shard
order. Add the ability to select a specific shard to skip to (in case we're
only interested in a single shard), and to select larger ranges (so that
exponential increases are not implemented by iteration).
The partitioners now depend on smp::count to be initialized correctly,
but smp::count isn't available at static initialization time.
The scylla executable isn't affected because it calls set_global_partitioner()
after smp::count has been initialized.
Fix by deferring initialization to the first global_partitioner() call.
It was the observation that ring_position_range_sharder doesn't support
wrapping ranges that started the nonwrapping_range madness, but that
class still has some leftover wrapping ranges. Close the circle by
removing them.
Message-Id: <20161123153113.8944-1-avi@scylladb.com>