Commit Graph

127 Commits

Author SHA1 Message Date
Duarte Nunes
a025bf6a7d Merge seastar upstream
Seastar introduced a "compat" namespace, which conflicts with Scylla's
own "compat" namespaces. The merge thus includes changes to scope
uses of Scylla's "compat" namespaces.

* seastar 8ad870f...9bb1611  (5):
  > util/variant_utils: Ensure variant_cast behaves well with rvalues
  > util/std-compat: Fix infinite recursion
  > doc/tutorial: Undo namespace changes
  > util/variant_utils: Add cast_variant()
  > Add compatbility with C++17's library types

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-08-14 13:07:09 +01:00
Nadav Har'El
5e47061438 repair: fix small error-handling logic mistake
As noticed by Tomasz Grabiec, we test a future's available() after
having already waited for it with when_all(), which is pointless.

The code after the wrong if() exchanges the contents of a token-range
between this node and several other live neighbors; We can't do this
exchange if either this node is broken or there is no other live neighbor.
So this is what we needed to test. so !available() should have been failed().

Also the test for live_neighbors_checksum.empty() added in commit 7c873f0d1f
is unnecessary - we build live_neighbors and live_neighbors_checksum
together, so if one of them is empty, so is the other.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180710114940.26027-1-nyh@scylladb.com>
2018-07-10 15:04:03 +03:00
Nadav Har'El
3194ce16b3 repair: fix combination of "-pr" and "-local" repair options
When nodetool repair is used with the combination of the "-pr" (primary
range) and "-local" (only repair with nodes in the same DC) options,
Scylla needs to define the "primary ranges" differently: Rather than
assign one node in the entire cluster to be the primary owner of every
token, we need one node in each data-center - so that a "-local"
repair will cover all the tokens.

Fixes #3557.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180701132445.21685-1-nyh@scylladb.com>
2018-07-01 16:39:33 +03:00
Vladimir Krivopalov
acdce55572 Inject CryptoPP namespace where Crypto++ byte typedef is used.
In Crypto++ v6, the `byte` typedef has been moved from the global
namespace to the CryptoPP:: namespace.
To make Scylla code compile with both old and new versions, bring the
namespace in so that the code works regardless of the scope of `byte`
definition.

Fixes #3252

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <60e7bfe868b778b1c9bbe15d7247db64b61bd406.1520272198.git.vladimir@scylladb.com>
2018-03-05 20:43:07 +02:00
Pekka Enberg
bd365a10d3 Merge "Add an API to get all active repairs" from Amnon
"This series adds an API to return the active repairs by their IDs.

 After this series a call to:

   curl -X GET --header "Accept: application/json" "http://localhost:10000/storage_service/active_repair/"

 Will return an array with the ids of the active repairs.

 Fixes #3193"

* 'amnon/get_active_repairs_v3' of github.com:scylladb/seastar-dev:
  API: Add get active repair api
  repair: Add a get_active_repairs function to return the active repair
2018-02-19 15:32:17 +02:00
Amnon Heiman
3f2eae35fd repair: Add a get_active_repairs function to return the active repair
This patch adds a function that returns an array with the ids of the
active repairs by filtering the RUNNING ones in the repair tracker status.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2018-02-14 11:43:37 +02:00
Duarte Nunes
7ba63b1521 atomic_cell_hash: Add specialization for atomic_cell_or_collection
Replace the atomic_cell_or_collection::feed_hash() member function
with the specialization of appending_hash, and use that instead.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 00:22:51 +00:00
Duarte Nunes
a0d748c71c range_tombstone: Replace feed_hash() member function with appending_hash
Replace range_tombstone::feed_hash() with the specialization of
appending_hash, so that we can use the general feed_hash() function.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 00:22:50 +00:00
Duarte Nunes
12507fb9ce keys: Replace feed_hash() member function with appending_hash
Replace the feed_hash() member function of partition_key and
clustering_key_prefix with the specialization of appending_hash,
so that we can use the general feed_hash() function.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-02-01 00:22:50 +00:00
Piotr Jastrzebski
ee6f2ca554 partition_checksum::compute_legacy: use only flat reader
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-01-24 20:56:48 +01:00
Vlad Zolotarov
97506f39b2 repair: use seastar::cache_line_size for aligning to the cache line size
Use seastar::cache_line_size for cache line alignment instead of a hard coded value (64) - this value is
not always correct, e.g. PPC64 platform, where cache line size is 128B.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2017-12-08 13:38:13 -05:00
Paweł Dziepak
dca93bea23 db: convert make_streaming_reader() to flat_mutation_reader 2017-11-13 16:49:52 +00:00
Paweł Dziepak
f690e2e80b repair: convert partition_checksum::compute_streamed() to flat streams 2017-11-13 16:49:52 +00:00
Paweł Dziepak
d71a14b943 repair: make partition_hasher consume flat mutation streams 2017-11-13 16:49:52 +00:00
Paweł Dziepak
2b774119a1 mutation_hasher: copy mutation_hasher to repair.cc
Repair is the exclusive user of mutation_hasher. Moving it there will
make integration with partition_checksum easier.
2017-11-13 16:49:52 +00:00
Paweł Dziepak
f648f94464 partition_checksum: introduce compute() for flat_mutation_reader 2017-11-13 16:49:52 +00:00
Avi Kivity
43a72254ff repair: add missing include 2017-09-11 20:09:45 +03:00
Avi Kivity
a59e375aad Merge "Support termination of repair jobs" from Asias
"This series implements the missing API to terminate all repairs.

For example:

$ curl -X POST  --header "Accept: application/json"
"http://127.0.0.1:10000/storage_service/force_terminate_repair"

With the new stream_plan::abort() api we can now abort the stream
session assocaited with the repair as well.

On top of this, we can support termination of single repair job instead all
jobs.

Fixes #2105"

* tag 'asisas/repair_abort_v4' of github.com:scylladb/seastar-dev:
  repair: Support termination of repair jobs
  repair: Track repair_info
  repair: Intorduce repair id to repair_info map
  api: Add force_terminate_repair API
  streaming: Add abort to stream_plan
  streaming: Add abort_all_stream_sessions for stream_coordinator
  streaming: Introduce streaming::abort()
  streaming: Make stream_manager and coordinator message debug level
  streaming: Check if _stream_result is valid
  streaming: Log peer address in on_error
  streaming: Introduce received_failed_complete_message
2017-09-06 12:58:05 +03:00
Asias He
e14bb7b1d5 repair: Remove #if'ed code in repair_ranges
It is unlikely we will use parallel_for_each version in repair_ranges.
Get rid of the dead code.

Message-Id: <31a9366adfe0262512a326ef9703aa0bba05e1fb.1503996138.git.asias@scylladb.com>
2017-09-03 11:13:02 +03:00
Asias He
471e8b341f repair: Support termination of repair jobs
This patch implements the missing API to terminate all repairs.

For example:

$ curl -X POST  --header "Accept: application/json"
"http://127.0.0.2:10000/storage_service/force_terminate_repair"

With the new stream_plan::abort() api we can now abort the stream
session assocaited with the repair as well.

Fixes #2105
2017-08-30 15:19:52 +08:00
Asias He
07d9dc03ec repair: Track repair_info
Make repair_info a shared pointer and store them in _repairs map so we
can find by the repair id and access them later.
2017-08-30 15:19:52 +08:00
Asias He
5c9732c645 repair: Intorduce repair id to repair_info map
The maps are stored in a vector. The vector has smp::count elements, each
element will be accessed by only one shard.

The add_repair_info, remove_repair_info and get_repair_info helpers
are added.
2017-08-30 15:19:51 +08:00
Asias He
68346f7e53 repair: Use with_semaphore for sp_parallelism_semaphore
Instead of calling semaphore.signal() manually.

Message-Id: <51b7ecdebac91763a2340fe00959742810614845.1503648936.git.asias@scylladb.com>
2017-08-27 12:50:38 +03:00
Asias He
69c81bcc87 repair: Do not allow repair until node is in NORMAL status
The following backtrace was reported by user when running repair and keeping restarting the node at the same time.

 #0  0x00007eff077281d7 in raise () from /lib64/libc.so.6
 #1  0x00007eff07729a08 in abort () from /lib64/libc.so.6
 #2  0x00007eff07721146 in __assert_fail_base () from /lib64/libc.so.6
 #3  0x00007eff077211f2 in __assert_fail () from /lib64/libc.so.6
 #4  0x00000000010ef2c2 in locator::token_metadata::first_token_index (this=0x641000214e98, start=...) at locator/token_metadata.cc:133
 #5  0x00000000010ef2d9 in locator::token_metadata::first_token (this=0x641000214e98, start=...) at locator/token_metadata.cc:143
 #6  0x00000000010e329d in locator::abstract_replication_strategy::get_natural_endpoints (this=0x641000494000, search_token=...)
     at locator/abstract_replication_strategy.cc:66
 #7  0x0000000001481186 in get_neighbors (hosts=std::vector of length 0, capacity 0, data_centers=std::vector of length 0, capacity 0,
     range=<error reading variable: access outside bounds of object referenced via synthetic pointer>, ksname=..., db=...) at repair/repair.cc:196
 #8  repair_range<nonwrapping_range<dht::token> > (range=..., ri=...) at repair/repair.cc:781
 #9  <lambda(auto:99&)>::<lambda(auto:100&&)>::<lambda(auto:101&)>::<lambda()>::operator() (__closure=0x7efec07f7460) at repair/repair.cc:1005
 #10 futurize<future<bool_class<stop_iteration_tag> > >::apply<repair_ranges(repair_info)::<lambda(auto:99&)>::

It is reproduced with

1) while true; do curl -X POST --header "Content-Type: application/json" --header "Accept: application/json" "http://127.0.0.1:10000/storage_service/repair_async/ks3"; done

2) start node 127.0.0.1, stop node 127.0.0.1 in a loop

The problem is, during boot up, the token_metadata is not replicated to all shards until
the node goes into NORMAL status.

To fix, check until node is in NORMAL status before allowing repair.

Fixes #2723
2017-08-23 14:40:04 +08:00
Asias He
763fa83232 repair: Fix build in repair_cf_range
The compiler does not like the mutable.
Message-Id: <83c5e8a944b72a095b8e29e9988986e6ca9cefc5.1501690749.git.asias@scylladb.com>
2017-08-02 18:57:32 +02:00
Asias He
5798625d73 repair: Singal parallelism_semaphore in case of error
If we throw after we take the semaphore and beforew the when_all
below runs, no one will increase the semaphore.

Fixes #2661
Message-Id: <49540ede4c8a6d84004e10e0f63690e3c21d72c7.1501686383.git.asias@scylladb.com>
2017-08-02 18:32:32 +03:00
Avi Kivity
ac31abf6a4 repair: don't lambda-capture repair_tracker
It is static, so it need not be captured, and some compilers complain.
2017-08-02 18:07:31 +03:00
Avi Kivity
ce60ef59f3 Revert "repair: Singal parallelism_semaphore in case of error"
This reverts commit a548eee28c. It releases
the semaphore too early (noted by Glauber).
2017-08-02 17:13:46 +03:00
Asias He
a548eee28c repair: Singal parallelism_semaphore in case of error
If we throw after we take the semaphore and beforew the when_all
below runs, one one will increase the semaphore.

Fixes #2661
2017-08-02 21:41:45 +08:00
Asias He
abcff4c78e repair: Fix repair_tracker done
If it throws after repair_tracker.start and before the when_all below,
the repair_tracker.done will never be called for this repair id.

Fixes #2660
2017-08-02 21:40:29 +08:00
Asias He
b10e961a64 repair: Use selective_token_range_sharder
With this change, we ask all the shard to handle the ranges provided by
user and we use selective_token_range_sharder to split the ranges and
ignore the ranges do not belong to the current shard.
2017-07-04 18:46:19 +08:00
Nadav Har'El
d177ec05cb repair: further limit parallelism of checksum calculation
Repair today has a semaphore limiting the number of ongoing checksum
comparisons running in parallel (on one shard) to 100. We needed this
number to be fairly high, because a "checksum comparison" can involve
high latency operations - namely, sending an RPC request to another node
in a remote DC and waiting for it to calculate a checksum there, and while
waiting for a response we need to proceed calculating checksums in parallel.

But as a consequence, in the current code, we can end up with as many as
100 fibers all at the same stage of reading partitions to checksum from
sstables. This requires tons of memory, to hold at least 128K of buffer
(even more with read-ahead) for each of these fibers, plus partition data
for each. But doing 100 reads in parallel is pointless - one (or very few)
should be enough.

So this patch adds another semaphore to limit the number of checksum
*calculations* (including the read and checksum calculation) on each shard
to just 2. There may still be 100 ongoing checksum *comparisons*, in
other stages of the comparisons (sending the checksum requests to other
and waiting for them to return), but only 2 will ever be in the stage of
reading from disk and checksumming them.

The limit of 2 checksum calculations (per shard) applies on the repair
slave, not just to the master: The slave may receive many checksum
requests in parallel, but will only actually work on 2 at a time.

Because the parallelism=100 now rate-limits operations which use very little
memory, in the future we can safely increase it even more, to support
situations where the disk is very fast but the link between nodes has
very high latency.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170703151329.25716-1-nyh@scylladb.com>
2017-07-03 18:14:57 +03:00
Asias He
b2a2fbcf73 repair: Do not store the failed ranges
The number of failed ranges can be large so it can consume a lot of memory.
We already logged the failed ranges in the log. No need to storge them
in memory.

Message-Id: <7a70c4732667c5c3a69211785e8efff0c222fc28.1498809367.git.asias@scylladb.com>
2017-07-03 10:00:25 +03:00
Asias He
cc02a62756 repair: Prefer nodes in local dc when streaming
When peer nodes have the same partition data, i.e., with the same
checksum, we currently choose to stream from any of them randomly.
To improve streaming performance, select the peer within the same DC.
This patch is supposed to improve repair perforamnce with multiple DC.

Message-Id: <c6a345b6e8ed2b59f485e53c865241e463b44507.1498490831.git.asias@scylladb.com>
2017-06-26 18:34:21 +03:00
Asias He
47345078ec repair: Repair on all shards
Currently, shard zero is the coordinator of the repair. All the work of
checksuming of the local node and sending of the repair checksum rpc
verb is done on shard zero only. This causes other shards being
underutilized.

With this patch, we split the ranges need to be repaired into at least
smp::count ranges, so sizeof(ranges) / smp::count will be assigned to
each shard. For exmaple, we have 8 shards and 256 ragnes, each shard
will repair 32 ranges. Each shard will repair the 32 ranges
sequencially.  There will be at most 8 (smp::count) ranges of repair in
parallel.
2017-06-14 17:52:49 +08:00
Asias He
54831a344c repair: Allow one stream plan in flight
In "repair: Use more stream_plan" (commit 2043ffc064), we
switched to do stream while doing checksum instead of do stream only
after checksum pahse is completed. We take a parallelism_semaphore
before we do checksum, if there are more than sub_ranges_to_stream
(1024) ranges, we start a stream_plan and wait for the streaming to
complete (still under the parallelism_semaphore). So at most
parallelism_semaphore (100) stream_plans can be in parallel.

The parallelism_semaphore limits the parallelism of both checksum and the
streaming plan. However, it is not necessary to have the same
parallelism for both checksum and streaming, because 1) a streaming
operation itself runs in parallel (handling ranges on all shards in
prallel, sending mutaitons in parallel) , 2) and with more streaming plan
(in worse case 100) means we can write to 100 memtables at the same time
and flush 100 memtables to disk at the same time which can take a lot of
memory.

With this patch, we only allow one stream plan in flight.
2017-06-14 17:52:36 +08:00
Asias He
2bcb368a13 repair: Fix range use after free
Capture it by value.

scylla:  [shard 0] repair - repair's stream failed: streaming::stream_exception (Stream failed)
scylla:  [shard 0] repair - Failed sync of range ==<runtime_exception
(runtime error: Invalid token. Should have size 8, has size 0#012)>: streaming::stream_exception (Stream failed)

Message-Id: <7fda4432e54365f64b556e7e4c26e36d3a9bb1b7.1497238229.git.asias@scylladb.com>
2017-06-12 11:00:57 +03:00
Asias He
3fdb8a3d3f repair: Remove unused sub_ranges_max
With the sub range iterator, it is not used anymore. Drop it.
2017-06-07 08:52:45 +08:00
Asias He
ca00c10b35 repair: Reduce parallelism in repair_ranges
We currently repair all the ranges in parallel.

1) All the ranges will contend for parallelism_semaphore, instead of
processing multiple ranges in parallel and calculating the sub ranges
(which take memory) for each range in parallel, we can handle the ranges
one bye one.

We could have enough parallelism because the checksum are calucated on
all the shards.

2) If for some reason the repair failed, if we handle ranges 1 by 1, we
can log which range of repair is successful. Next time, we can ignore
them. If we start ranges in parallel, it has a high chance, no single
range is completed because all the ranges are on going.

Refs #1912
2017-06-07 08:50:57 +08:00
Asias He
3852665156 repair: Tweak the log a bit
- Count n out m ranges the repair is running for (kind of progress report)
- Make the 'Found differing range' log debug because it can be millions
  of such entries
- Print the failed ranges
2017-06-07 08:50:57 +08:00
Asias He
2043ffc064 repair: Use more stream_plan
In the very beginning, we use a stream_plan for each checksum range.
Later, we changed to use a single stream_plan for all the checksum
ranges. It pushes memory presure to streaming, e.g., millinons of ranges
in a vector to send over RPC.

To fix, we do checksum and streaming in parallel, limit the number of
checksum ranges stored in memory.

Fixes #2430
2017-06-07 08:50:56 +08:00
Nadav Har'El
b3ff37e67f repair: iterator over subranges instead of list
When starting repair, we divided the large token ranges (vnodes) linto small
subranges of a desired length (around 100 partition), and built a huge list
of those subranges - to iterate over them later and compare checksums of
those chunks.

However, building this list up-front is completely unnecessary, and wastes
a lot of memory: In a test with 1 TB of data, as much as 3 gigabytes was
spent on this list. Instead, what we do in this patch is to find the next
chunk in a DFS-like splitting algorithm, using only the token range
midpoint() function (as before). The amount of memory needed for this is
O(logN), instead of O(N) in the previous implementation.

Refs #2430.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2017-06-07 08:50:56 +08:00
Avi Kivity
ebaeefa02b Merge seatar upstream (seastar namespace)
- introcduced "seastarx.hh" header, which does a "using namespace seastar";
 - 'net' namespace conflicts with seastar::net, renamed to 'netw'.
 - 'transport' namespace conflicts with seastar::transport, renamed to
   cql_transport.
 - "logger" global variables now conflict with logger global type, renamed
   to xlogger.
 - other minor changes
2017-05-21 12:26:15 +03:00
Asias He
66e3b73b9c repair: Fix partition estimation
We estimate number of partitions for a given range of a column familiy
and split the range into sub ranges contains fewer partitions as a
checksum unit.

The estimation is wrong, because we need to count the partitions on all
the shards, instead of only counting the local shard.

Fixes #2299

Message-Id: <7876285bd26cfaf65563d6e03ec541626814118a.1493817339.git.asias@scylladb.com>
2017-05-03 16:25:45 +03:00
Avi Kivity
ae7d7ae20f repair: avoid auto in function argument declaration
'auto' in a non-lambda function argument is not legal C++, and is hard
to read besides.  Replace with the right type.
2017-04-17 23:03:15 +03:00
Asias He
39d2e59e7e repair: Fix midpoint is not contained in the split range assertion in split_and_add
We have:

  auto halves = range.split(midpoint, dht::token_comparator());

We saw a case where midpoint == range.start, as a result, range.split
will assert becasue the range.start is marked non-inclusive, so the
midpoint doesn't appear to be contain()ed in the range - hence the
assertion failure.

Fixes #2148

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Asias He <asias@scylladb.com>
Message-Id: <93af2697637c28fbca261ddfb8375a790824df65.1489023933.git.asias@scylladb.com>
2017-03-09 09:09:17 +01:00
Asias He
937f28d2f1 Convert to use dht::partition_range_vector and dht::token_range_vector 2016-12-19 14:08:50 +08:00
Asias He
e5485f3ea6 Get rid of query::partition_range
Use dht::partition_range instead
2016-12-19 08:09:25 +08:00
Asias He
d1178fa299 Convert to use dht::token_range 2016-12-19 08:04:29 +08:00
Avi Kivity
32fb4c3661 Merge "repair: Reduce unnecessary streaming traffic even more" from Asias
"In 7c873f0d (repair: Reduce unnecessary streaming traffic), we optimize
in cases when 1) all the remote nodes has the same checksum and 2) local node
has zero checksum.

In this series, we make the optimization more generec and cover more cases."

* tag 'asias/repair/node_reducer/v3' of github.com:cloudius-systems/seastar-dev:
  repair: Reduce unnecessary streaming traffic even more
  repair: Add hash specialization for partition_checksum
2016-12-18 16:53:39 +02:00