This patch makes hashing for repair calculate checksums in a way that
doesn't require rebuilding the whole mutation.
Unfortunately, such checksums are incompatible with the old ones, so the
old way of computing checksums is preserved for compatibility reasons.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
sstable_list is now a map<generation, sstable>; change it to a set
in preparation for replacing it with sstable_set. The change simplifies
a lot of code; the only casualty is the code that computes the highest
generation number.
dtest treats error-level log messages as serious errors. It is not a
serious error for streaming to fail to send a verb and fail a streaming
session, triggering a repair failure, for example when the peer node is
gone or stopped. Switch to log level warn instead of error.
Fixes repair_additional_test.py:RepairAdditionalTest.repair_kill_3_test
Fixes: #1335
Message-Id: <406fb0c4a45b81bd9c0aea2a898d7ca0787b23e9.1465979288.git.asias@scylladb.com>
The repair code as it is right now is a bit convoluted: it resorts to detached
continuations + do_for_each when calling sync_ranges, and deals with the
problem of excessive parallelism by employing a semaphore inside that loop.
Even so, we still generate a great number of checksum requests, because
the ranges themselves are processed in parallel.
It would be better to have a single semaphore to limit the overall parallelism
for all requests.
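The single-semaphore shape can be sketched outside of Seastar. The following standalone C++ (hypothetical names, not the Scylla code; using threads where the real code uses futures and continuations) shows one counting semaphore capping how many simulated checksum requests are in flight at once:

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Minimal counting semaphore (std::counting_semaphore requires C++20;
// Scylla itself uses seastar::semaphore with futures).
class semaphore {
    std::mutex m_;
    std::condition_variable cv_;
    int units_;
public:
    explicit semaphore(int units) : units_(units) {}
    void acquire() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return units_ > 0; });
        --units_;
    }
    void release() {
        { std::lock_guard<std::mutex> lk(m_); ++units_; }
        cv_.notify_one();
    }
};

// Issue `n` checksum requests, but let only `limit` run at once.
// Returns the peak number of requests observed in flight.
int run_requests(int n, int limit) {
    semaphore sem(limit);
    std::atomic<int> in_flight{0}, peak{0};
    std::vector<std::thread> workers;
    for (int i = 0; i < n; ++i) {
        workers.emplace_back([&] {
            sem.acquire();
            int cur = ++in_flight;                 // track concurrency
            int prev = peak.load();
            while (cur > prev && !peak.compare_exchange_weak(prev, cur)) {}
            std::this_thread::sleep_for(std::chrono::milliseconds(1)); // simulated checksum work
            --in_flight;
            sem.release();
        });
    }
    for (auto& t : workers) t.join();
    return peak.load();
}
```

However many ranges are processed in parallel, the one shared semaphore bounds the total number of outstanding requests, which is exactly the property the patch wants.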
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Streaming currently has one class, which is used to contain the read
operations generated by the streaming process. Those reads come from two
places:
- checksums (if doing repair)
- reading mutations to be sent over the wire.
Depending on the amount of data we're dealing with, that can generate a
significant chunk of data, with seconds' worth of backlog, and if we need to
have the incoming writes intertwined with those reads, those can take a long
time.
Even if a node is only acting as a receiver, it may still do a lot of
reads: in the case of repair, those come from the checksums.
However, in more complicated failure scenarios, it is not hard to imagine a
node that will be both sending and receiving a lot of data.
The best way to guarantee progress on both fronts is to put the two kinds of
operations into different classes.
This patch introduces a new write class, and renames the old read class so it
can have a more meaningful name.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
In due time we will have to fix this, but as an interim step, let's use
a "better" magic number.
The problem with 100 is that as soon as partitions start to get bigger,
we use too much memory. Since this is multiplied by the number of token
ranges, and happens in every shard, the final number can become really big,
and the amount of resources we use goes up proportionally.
This means that even if we are mistaken about the new number (we probably
are), it is better to err on the side of a more conservative resource
usage.
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <97158f3db5734916cee4ccf12eaa66e7402570bb.1457448855.git.glauber@scylladb.com>
Use the existing "feed_hash" mechanism to find a checksum of the
content of a mutation, instead of serializing the mutation (with freeze())
and then finding the checksum of that string.
The serialized form is more prone to future changes, and not really
guaranteed to provide equal hashes for mutations which are considered
"equal".
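The feed_hash idea can be illustrated with a self-contained sketch (hypothetical names; a toy FNV-1a hasher stands in for the cryptographic hasher the real code feeds): the hasher is updated field by field with the mutation's content, so equal content always yields equal hashes, independent of any serialization format.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Toy stand-in for the hasher concept; the real code feeds a
// cryptographic hasher, FNV-1a just keeps this sketch dependency-free.
struct fnv_hasher {
    uint64_t state = 1469598103934665603ull;
    void update(const char* p, size_t n) {
        for (size_t i = 0; i < n; ++i) {
            state ^= static_cast<unsigned char>(p[i]);
            state *= 1099511628211ull;
        }
    }
};

// feed_hash-style helpers: hash the *content*, not a serialized blob.
void feed_hash(fnv_hasher& h, uint32_t v) {
    h.update(reinterpret_cast<const char*>(&v), sizeof(v));
}
void feed_hash(fnv_hasher& h, const std::string& s) {
    uint32_t len = static_cast<uint32_t>(s.size()); // length-prefix to avoid ambiguity
    feed_hash(h, len);
    h.update(s.data(), s.size());
}

// Hypothetical, drastically simplified mutation.
struct toy_mutation { std::string key; uint32_t timestamp; std::string value; };

uint64_t content_hash(const toy_mutation& m) {
    fnv_hasher h;
    feed_hash(h, m.key);
    feed_hash(h, m.timestamp);
    feed_hash(h, m.value);
    return h.state;
}
```

The serialize-then-hash alternative would make the checksum depend on freeze()'s output layout, which is exactly the fragility this commit removes.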
Fixes #971
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1456958676-27121-1-git-send-email-nyh@scylladb.com>
To implement nodetool's "--start-token"/"--end-token" feature, we need
to be able to repair only *part* of the ranges held by this node.
Our REST API already had a "ranges" option where the tool can list the
specific ranges to repair, but using this interface in the JMX
implementation is inconvenient, because it requires the *Java* code
to be able to intersect the given start/end token range with the actual
ranges held by the repaired node.
A more reasonable approach, which this patch takes, is to add new
"startToken"/"endToken" options to the repair's REST API. What these
options do is find the node's token ranges as usual, and only
then *intersect* them with the user-specified token range. The JMX
implementation becomes much simpler (in a separate patch for scylla-jmx),
and the real work is done in the C++ code, where it belongs, not in
Java code.
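The intersection step is simple; a sketch (with tokens simplified to integers and ranges to half-open [start, end) intervals, whereas real Scylla ranges are wrapping dht::token ranges with open/closed bounds):

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Simplified token range: half-open [start, end) over integer tokens.
using token_range = std::pair<long, long>;

// Keep only the parts of the node's local ranges that fall inside the
// user-specified [start, end) window.
std::vector<token_range> intersect_ranges(const std::vector<token_range>& local,
                                          long start, long end) {
    std::vector<token_range> out;
    for (const auto& r : local) {
        long s = std::max(r.first, start);
        long e = std::min(r.second, end);
        if (s < e) {
            out.push_back({s, e});    // non-empty overlap survives
        }
    }
    return out;
}
```

With this in the C++ side, the Java layer only needs to forward the two token strings it got from nodetool.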
With the additional scylla-jmx patch to use the new REST API options
provided here, this fixes #917.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1455807739-25581-1-git-send-email-nyh@scylladb.com>
When shutting down a node gracefully, this patch asks all ongoing repairs
started on this node to stop as soon as possible (without completing
their work), and then waits for these repairs to finish (with failure,
usually, because they didn't complete).
We need to do this, because if the repair loop continues to run while we
start destructing the various services it relies on, it can crash (as
reported in #699, although the specific crash reported there no longer
occurs after some changes in the streaming code). Additionally, it is
important to stop the ongoing repair, and not wait for it to complete
its normal operation, because that can take a very long time, and shutdown
is supposed to not take more than a few seconds.
Fixes#699.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1455218873-6201-1-git-send-email-nyh@scylladb.com>
Change the partition_checksum structure to be better suited for the
new serializers:
1. Use std::array<> instead of a C array, as the latter is not
supported by the new serializers.
2. Use an array of 32 bytes, instead of 4 8-byte integers. This will
guarantee that no byte-swapping monkey-business will be done on
these checksums.
The checksum XOR and equality-checking methods still temporarily
cast the bytes to 8-byte chunks, for (hopefully) better performance.
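The described layout can be sketched as follows (illustrative only, not the actual Scylla definition; the commit says the methods cast to 8-byte chunks, while this sketch uses memcpy, which compilers optimize to the same loads without strict-aliasing concerns):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

// 32 raw bytes instead of 4 uint64_t, so serialization never
// byte-swaps the digest.
struct partition_checksum {
    std::array<uint8_t, 32> digest{};

    // XOR-merge another checksum in, processed in 8-byte chunks.
    void add(const partition_checksum& other) {
        for (size_t i = 0; i < 32; i += 8) {
            uint64_t a, b;
            std::memcpy(&a, digest.data() + i, 8);
            std::memcpy(&b, other.digest.data() + i, 8);
            a ^= b;
            std::memcpy(digest.data() + i, &a, 8);
        }
    }

    bool operator==(const partition_checksum& o) const {
        return digest == o.digest;
    }
};
```

Because the wire format is now a plain byte array, both little- and big-endian nodes serialize the same digest bytes in the same order.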
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1454364900-3076-1-git-send-email-nyh@scylladb.com>
After this patch, our I/O operations will be tagged with a specific priority class.
There are five available classes, defined in the previous patch:
1) memtable flush
2) commitlog writes
3) streaming mutation
4) SSTable compaction
5) CQL query
Signed-off-by: Glauber Costa <glauber@scylladb.com>
In the existing code, when we fail to reach one of the replicas of some
range being repaired, we would give up, and not continue to repair the
living replicas of this range. The thinking behind this was since the
repair should be considered failed anyway, there's no point in trying
to do a half-job better.
However, in a discussion I had with Shlomi, he raised the following
alternative thinking, which convinced me: In a large cluster, having
one node or another temporarily dead has a high probability. In that
case, even if the repair is doomed to be considered "failed",
we want it at least to do as much as it possibly can to repair the
data on the living part of the cluster. This is what this patch does:
If we can only reach some of the replicas of a given range, the repair
will be considered failed (as before), but we will still repair the
reachable replicas of this range, if they have different checksums.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1453724443-29320-1-git-send-email-nyh@scylladb.com>
Theoretically, one could want to repair a single host *and* all the hosts
in one or more other data centers which don't include this host. However,
Cassandra's "nodetool repair" explicitly does not allow this, and fails if
given a list of data centers (via the "-dc" option) which doesn't include
the host starting the repair. So we need to behave like "nodetool repair"
and fail in this case too.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1453037016-25775-1-git-send-email-nyh@scylladb.com>
The intent is to make data returned by queries always conform to a
single schema version, which is requested by the client. For CQL
queries, for example, we want to use the same schema which was used to
compile the query. The other node expects to receive data conforming
to the requested schema.
Interface on shard level accepts schema_ptr, across nodes we use
table_schema_version UUID. To transfer schema_ptr across shards, we
use global_schema_ptr.
Because schema is identified by UUID across nodes, requestors must
be prepared to be queried for the definition of the schema. They
must hold a live schema_ptr around the request. This guarantees that
schema_registry will always know about the requested version. This is
not an issue because for queries the requestor needs to hold on to the
schema anyway to be able to interpret the results. But care must be
taken to always use the same schema version for making the request and
parsing the results.
Schema requesting across nodes is currently stubbed (throws runtime
exception).
Support the "hosts" and "dataCenters" parameters of repair. The first
specifies the known good hosts to repair this host from (plus this host),
and the second asks to restrict the repair to the local data center (you
must issue the repair to a node in the data center you want to repair -
issuing the command to a data center other than the named one returns
an error).
For example these options are used by nodetool commands like:
nodetool repair -hosts 127.0.0.1,127.0.0.2 keyspace
nodetool repair -dc datacenter1
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The existing repair code always streamed the entire content of the
database. In this overhaul, we send "repair_checksum_range" messages to
the other nodes to verify whether they have exactly the same data as
this node, and if they do, we avoid streaming the identical data.
We attempt to split the token ranges so that each contains an estimated
100 keys, and send these ranges' checksums. Future versions of this
code will need to improve this estimation (and make this "100" a parameter).
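The splitting itself can be sketched like this (hypothetical helper, with tokens simplified to integers; it assumes keys are spread uniformly over the token span, which is the crude estimation the text says needs improving):

```cpp
#include <utility>
#include <vector>

// Split [start, end) into subranges holding roughly `target_keys`
// (the hard-coded 100 mentioned above) each, given an estimate of the
// total key count and assuming uniform key distribution over tokens.
std::vector<std::pair<long, long>> split_range(long start, long end,
                                               long estimated_keys,
                                               long target_keys = 100) {
    long parts = (estimated_keys + target_keys - 1) / target_keys; // ceil
    if (parts < 1) {
        parts = 1;
    }
    std::vector<std::pair<long, long>> out;
    long span = end - start;
    for (long i = 0; i < parts; ++i) {
        long s = start + span * i / parts;
        long e = start + span * (i + 1) / parts;
        if (s < e) {
            out.push_back({s, e});   // subranges tile [start, end) exactly
        }
    }
    return out;
}
```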
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a function sync_range() for synchronizing all partitions
in a given token range between a set of replicas (this node and a list of
neighbors).
Repair will call this function once it has decided that the data the
replicas hold in this range is not identical.
The implementation streams all the data in the given range, from each of
the neighbors to this node - so now this node contains the most up-to-date
data. It then streams the resulting data back to all the neighbors.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds functions for calculating the checksum of all the
partitions in a given token range in the given column-family - either
in the current shard, or across all shards in this node.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a mechanism for calculating a checksum for a set of
partitions. The repair process will use these checksums to compare the
data held by different replicas.
We use a strong checksum (SHA-256) for each individual partition in the set,
and then a simple XOR of those checksums to produce a checksum for the
entire set. XOR is good enough for merging strong checksums, and allows us
to independently calculate the checksums of different subsets of the
original sets - e.g., each shard can calculate its own checksum and we
can XOR the resulting checksums to get the final checksum.
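The XOR-merge property is easy to demonstrate in a sketch (illustrative only: a 64-bit std::hash stands in for the real per-partition SHA-256 digest, so no crypto library is needed; the algebra is identical):

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Stand-in for the per-partition strong digest (SHA-256 in the real
// code, truncated here to 64 bits for a dependency-free sketch).
uint64_t partition_digest(const std::string& partition) {
    return std::hash<std::string>{}(partition);
}

// XOR-merge per-partition digests into a set-level checksum. XOR is
// commutative and associative, so each shard can checksum its own
// disjoint subset independently and the node XORs the per-shard results.
uint64_t set_checksum(const std::vector<std::string>& partitions) {
    uint64_t acc = 0;
    for (const auto& p : partitions) {
        acc ^= partition_digest(p);
    }
    return acc;
}
```

Order-independence is what allows each shard to scan its partitions in whatever order is convenient; the combined result is the same as checksumming the whole set at once.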
Apache Cassandra uses a very similar checksum scheme, also using SHA-256
and XOR. One small difference in the implementation is that we include the
partition key in its checksum, while Cassandra does not; I believe that
omission has no real justification (although it is very unlikely to cause
problems in practice). See further discussion on this in
https://issues.apache.org/jira/browse/CASSANDRA-10728.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
messaging_service will automatically use a private IP address to connect
to a peer node if possible. There is no need for an upper level like
streaming to worry about it. Dropping it simplifies things a bit.
Check the list of column families passed as an option to repair, to
provide the user with a more meaningful exception when a non-existent
column family is passed.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This was a plain bug - ranges_opt is supposed to parse the option into
the vector "var", but took the vector by value.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Support the "columnFamilies" parameter of repair, allowing repair of
only some of the column families of a keyspace, instead of all of them.
For example, using a command like "nodetool repair keyspace cf1 cf2".
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
A default value was not set for the "incremental" and "parallelism"
repair parameters, so Scylla can wrongly decide that they have an
unsupported value.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Add partial support for the "incremental" option (only support the
"false" setting, i.e., not incremental repair) and the "parallelism"
option (the choice of sequential or parallel repair is ignored - we
always use our own technique).
This is needed because scylla-jmx passes these options by default
(e.g., "incremental=false" is passed to say this is *not* incremental
repair, and we just need to allow this and ignore it).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
When throwing an "unsupported repair options" exception to the caller
(such as "nodetool repair"), also list which options were not recognized.
Additionally, list the options when logging the repair operation.
This patch includes an operator<< implementation for pretty-printing an
std::unordered_map. We may want to move it later to a more central
location - even Seastar (like we have a pretty-printer for std::vector
in core/sstring.hh).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch fixes a bug where the *first* run of "nodetool repair" always
returned immediately, instead of waiting for the repair to complete.
Repair operations are asynchronous: Starting a repair returns a numeric
id, which can then be used to query for the repair's completion, and this
is what "nodetool repair" does (through our JMX layer). We started with
the repair ID "0", the next one is "1", and so on.
The problem is that "nodetool repair", when it sees 0 being returned,
treats it not as a regular repair ID, but rather as an answer that
there is nothing to repair - printing a message to that effect and *not*
waiting for the repair (which was correctly started) to complete.
The trivial fix is to start our repair IDs at 1, instead of 0.
We currently do not return 0 in any case (we don't know there is nothing
to repair before we actually start the work, and parameter errors
cause an exception, not a return of 0).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Add in repair another info message, "Endpoints ... and ... have ... range(s)
out of sync for <column-family>", which the repair dtest expects.
This patch is a kind of silly attempt to appease issue #81 (should we mark
it fixed?). It's kind of silly, because without merkle trees (see issue #82),
we really have no way of knowing if there's any differences between the two
nodes, so we always say there is "1 range" difference. So if the dtest expects
such a message *not* to appear (because there are no differences), it might
fail.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
As requested in issue #79, this patch ensures that if the user attempts
to pass an unknown repair option, the operation fails rather than the
option simply be ignored.
An "unknown repair option" may be one of Cassandra's options we don't yet
support ("parallelism", "incremental", "jobThreads", "columnFamilies",
"dataCenters", "hosts" and "trace"), or any other unknown option name -
in either case, the operation will fail rather than ignore the option
which might have been important.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
This patch implements repair's "primaryRange" and "ranges" options:
Without these options, a repair defaults to repairing all the ranges for which
this node holds a replica (each range is repaired by contacting the other
replicas of this range).
If the "primaryRange" option is passed, instead of repairing all ranges, only
the "primary ranges" of this node are repaired - for each range, only one node
has this range as its "primary range". The intention is that a user can start
a "primaryRange" repair on all nodes, and the result would be that each range
will only be repaired once.
If the "ranges" option is passed, it explicitly lists the ranges to
repair, overriding the automatic determination of ranges explained above.
Fixes #212.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Avi asked not to use an atomic integer to produce ids for repair
operations. The existing code had another bug: It could return some
id immediately, but because our start_repair() hasn't started running
code on cpu 0 yet, the new id was not yet registered and if we were to
call repair_get_status() for this id too quickly, it could fail.
The solution for both issues is that start_repair() should return not
an int, but a future<int>: the integer id is incremented on cpu 0 (so
no atomics are needed), and then returned and the future is fulfilled.
Note that the future returned by start_repair() does not wait for the
repair to be over - just for its index to be registered and be usable
to a call to repair_get_status().
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Requested by Avi. The added benefit is that the code for repairing
all the ranges in parallel is now identical to the code of repairing
the ranges one by one - just replace do_for_each with parallel_for_each,
and no need for a different implementation using semaphores like I had
before this patch.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
[in v2: 1. Fixed a few small bugs.
2. Added rudimentary support for parallel/sequential repair.
3. Verified that the code works correctly with Asias's fix to streaming]
This patch adds the capability to track repair operations which we have
started, and check whether they are still running or completed (successfully
or unsuccessfully).
As before, one starts a repair with the REST API:
curl -X GET --header "Content-Type: application/json" --header "Accept: application/json" "http://127.0.0.1:10000/storage_service/repair_async/try1"
where "try1" is the name of the keyspace. This returns a repair id -
a small integer starting with 0. This patch adds support for a similar
request to *query* the status of a previously started repair, by adding
the "id=..." option to the query, which enquires about the status of the
repair with this id. For example,
curl -i -X GET --header "Content-Type: application/json" --header "Accept: application/json" "http://127.0.0.1:10000/storage_service/repair_async/try1?id=0"
gets the current status of this repair 0. This status can be RUNNING,
SUCCESSFUL or FAILED, or an HTTP 400 "unknown repair id ..." in case an
invalid id is passed (not the id of any real repair that was previously
started).
This patch also adds two alternative code-paths in the main repair flow
do_repair_start(): One where each range is repaired one after another,
and one where all the ranges are repaired in parallel. At the moment, the
enabled code is the parallel version, just as before this patch. But the
sequential version will also be useful for implementing the "parallel" vs
"sequential" repair options of Cassandra.
Note that if you try to use repair, you are likely to run into a bug in
the streaming code which results in Scylla either crashing or a repair
hanging (never realising it finished). Asias already has a fix for this bug,
and will hopefully publish it soon, but it is unrelated to the repair code,
so I think this patch can be committed independently.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
handle_exception() should really discard the future's value automatically,
and in an upcoming version of Seastar, won't. So instead of
sp.execute().handle_exception(...)
(where execute() returns a future which is *not* future<>)
We need to write
sp.execute().discard_result().handle_exception(...)
This already works in today's Seastar (the extra discard_result()
doesn't cause any harm), and will be necessary when handle_exception()
in Seastar is improved (I'll send a patch soon).
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Add a FIXME about something I'm unsure about - does repair only need to
repair this node, or should it also make an effort to repair the other nodes
(or more accurately, their specific token ranges being repaired) if we're
already communicating with them?
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
If a stream failed, print a clear error message that repair failed, instead
of ignoring it and letting Seastar's generic "warning, exception was ignored"
be the only thing the user will see.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
The previous repair code exchanged data with the other nodes holding
one arbitrary token. This will only work correctly when all the nodes
replicate all the data. In a more realistic scenario, the node being
repaired holds copies of several token ranges, and each of these ranges
has a different set of replicas we need to perform the repair with.
So this patch does the right thing - we perform a separate repair_range()
for each of the local ranges, and each of those will find a (possibly)
different set of nodes to communicate with.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
This patch adds the beginning of node repair support. Repair is initiated
on a node using the REST API, for example to repair all the column families
in the "try1" keyspace, you can use:
curl -X GET --header "Content-Type: application/json" --header "Accept: application/json" "http://127.0.0.1:10000/storage_service/repair_async/try1"
I tested that the repair already works (exchanges mutations with all other
replicas, and successfully repairs them), so I think it can be committed,
but it will need more work to be complete:
1. Repair options are not yet supported (range repair, sequential/parallel
repair, choice of hosts, datacenters and column families, etc.).
2. *All* the data of the keyspace is exchanged - Merkle Trees (or an
alternative optimization) and partial data exchange haven't been
implemented yet.
3. Full repair for nodes with multiple separate ranges is not yet
implemented correctly. E.g., consider 10 nodes with vnodes and RF=2,
so each vnode's range has a different host as a replica, so we need
to exchange each key range separately with a different remote host.
4. Our repair operation returns a numeric operation id (like Origin),
but we don't yet provide any means to use this id to check on ongoing
repairs like Origin allows.
5. Error handling, logging, etc., needs to be improved.
6. SMP nodes (with multiple shards) should work correctly (thanks to
Asias's latest patch for SMP mutation streaming) but haven't been
tested.
7. Incremental repair is not supported (see
http://www.datastax.com/dev/blog/more-efficient-repairs)
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>