After this patch and the following patches to use the new
range_streamder interface, all the following cluster operations:
- bootstrap
- rebuild
- decommission
- removenode
will use the same code to do the streaming.
The range_streamer is now extended to support both fetch from and push
to peer node. Another big change is now the range_streamer will stream
less ranges at a time, so less data, per stream_plan and range_streamer
will remember which ranges are failed to stream and can retry later.
The retry policy is very simple at the moment it retries at most 5 times
and sleep 1 minutes, 1.5^2 minutes, 1.5^3 minutes ....
Later, we can introduce api for user to decide when to stop retrying and
the retry interval.
The benefits:
- All the cluster operation shares the same code to stream
- We can know the operation progress, e.g., we can know total number of
ranges need to be streamed and number of ranges finished in
bootstrap, decommission and etc.
- All the cluster operation can survive peer node down during the
operation which usually takes long time to complete, e.g., when adding
a new node, currently if any of the existing node which streams data to
the new node had issue sending data to the new node, the whole bootstrap
process will fail. After this patch, we can fix the problematic node
and restart it, the joining node will retry streaming from the node
again.
- We can fail streaming early and timeout early and retry less because
all the operations use stream can survive failure of a single
stream_plan. It is not that important for now to have to make a single
stream_plan successful. Note, another user of streaming, repair, is now
using small stream_plan as well and can rerun the repair for the
failed ranges too.
This is one step closer to supporting the resumable add/remove node
opeartions.
* seastar f14d2a3...7a49ae5 (8):
> sharded: improve support for cooperating sharded<> services
> sharded: support for peer services
> semaphore: add a version of with_semaphore that takes a duration timeout
> scripts: perftune.py: fix the CPU mask generation for more than 64 CPUs
> Revert "future-utils: make when_all() (vector variant) exception safe"
> Revert "future-utils: fix gross compilation errors in when_all()"
> future-utils: fix gross compilation errors in when_all()
> future-utils: make when_all() (vector variant) exception safe
Includes change to batchlog_manager constructor to adapt it to
seastar::sharded::start() change.
This reverts commit 98757069a5. We have the
failure detector which will detect an unresponsive node and fail the RPC.
Adding a timeout can just introduce false positives.
In commit f38e4ff3f, we have separated streaming reads from normal reads
for the purpose of determining the maximum number of reads going on.
However, we'll now be totally unaware of how many reads will be
happening on behalf of streaming and that can be important information
when debugging issues.
This patch adds this metric so we don't fly blind.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <1501909973-32519-1-git-send-email-glauber@scylladb.com>
We have problem to run fstrim with nomerges=2, so we need to change
the parameter to 1 during fstrim execution.
To do this, this fix changes follow things:
- revert dropping scylla_fstrim on Ubuntu 16.04/CentOS
- disable distribution provided fstrim script
- enable scylla_fstrim on all distributions
- introduce --set-nomerges on scylla-blocktune
- scylla_fstrim call scylla-blocktune by following order:
- 'scylla-blocktune --set-nomerges 1'
- 'fstrim' for each devices
- 'scylla-blocktune --set-nomerges 2'
Fixes#2649
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1501531393-21109-1-git-send-email-syuu@scylladb.com>
Streaming reads and normal reads share a semaphore, so if a bunch of
streaming reads use all available slots, no normal reads can proceed.
Fix by assigning streaming reads their own semaphore; they will compete
with normal reads once issued, and the I/O scheduler will determine the
winner.
Fixes#2663.
Message-Id: <20170802153107.939-1-avi@scylladb.com>
If we fail a streaming read due queue overload, we will fail the entire repair.
Remove the limit for streaming, and trust the caller (repair) to have bounded
concurrency.
Fixes#2659.
Message-Id: <20170802143448.28311-1-avi@scylladb.com>
"Currently scanning reads go to reconciliation stage directly which
requires asking for mutation data from all peers. This series makes
it to try matching digests first like a single partition read."
Fixes#2666.
* 'gleb/digest_scan' of github.com:cloudius-systems/seastar-dev:
storage_proxy: make range_slice_read_executor go through digest matching state
storage_proxy: add capability to read data/digest for non singular ranges
storage_proxy: remove redundant parameter from never_speculating_read_executor constructor
Currently scanning reads go to reconciliation stage directly which
requires asking for mutation data from all peers. This patch makes
it to try matching digests first like a single partition read.
The change requires internode protocol changes since currently it is not
possible to ask for multi partition data/digest over RPC. It means that
the capability has to be guarded by new gossip feature flag which the
patch also adds.
Using --hostname to give the container a meaningful name is a good
practice, and make the monitoring dashboard easier to understand
Signed-off-by: Tzach Livyatan <tzach@scylladb.com>
Message-Id: <20170803081027.6675-1-tzach@scylladb.com>
never_speculating_read_executor always waits for all targets so
block_for parameter is always equal to targets.size(). No need to
to pass it explicitly.
"This series adds an option to use paging in internal query and use that for the
get compaction history function.
Internal paging will be done explicitly, to use paging, you first create a
state object (that contains the query as well) and use that state to get the
first page, the result will contain both the query result and a new state that
can be used to get the next page.
Fixes#2366"
* 'amnon/paged_compaction_history_v5' of github.com:cloudius-systems/seastar-dev:
system_keyspace: Use paging for get compaction history
Add paging for internal queries
query_options: Allows creating query_options from query_options
"This series tries to fix possible repair stuck."
Fixes#2660, #2661, #2662.
* tag 'asias/repair_stuck_v2.1' of github.com:cloudius-systems/seastar-dev:
repair: Make send_repair_checksum_range timeout
repair: Singal parallelism_semaphore in case of error
repair: Fix repair_tracker done
It specifies the maximum gossip shadow round time. It can be used to
reduce the gossip feature check time during node boot up.
For instance, when the first node in the cluster, which listed both
itself and other node as seed in the yaml config, boots up, it will try
to talk to other seed nodes which are not started yet. The gossip shadow
round will be used to fetch the feature info of the cluster. Since there
is no other seed node in the cluster, the shadow round will fail. User
can reduce the default shadow_round_ms option to reduce the boot time.
Fixes#2615
Message-Id: <10916ce9059f3c7f1a1fb465919ae57de3b67d59.1500540297.git.asias@scylladb.com>
The timer is armed inside the section guarded by the _timer_reads_gate
therefore it has to be canceled after the gate is closed.
Otherwise we may end up with the armed timer after stop() method has
returned a ready future.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1501603059-32515-1-git-send-email-vladz@scylladb.com>
"This series ensures the always write correct cell names to promoted
index cell blocks, taking into account the eoc of range tombstones.
Fixes#2333"
* 'pi-cell-name/v1' of github.com:duarten/scylla:
tests/sstable_mutation_test: Test promoted index blocks are monotonic
sstables: Consider eoc when flushing pi block
sstables: Extract out converting bound_kind to eoc
It was meant to be run in the foreground since it is waited upon during
stop(), but as it is now from the stop() perspective it is completed
after first connection is accepted.
Fixes#2652
Message-Id: <20170801125558.GS20001@scylladb.com>