This patch makes hashing for repair calculate checksums in a way that
doesn't require rebuilding the whole mutation.
Unfortunately, such checksums are incompatible with the old ones, so the
old way of computing checksums is preserved for compatibility reasons.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
sstable_list is now a map<generation, sstable>; change it to a set
in preparation for replacing it with sstable_set. The change simplifies
a lot of code; the only casualty is the code that computes the highest
generation number.
dtest treats error-level log messages as serious errors. It is not a
serious error for streaming to fail to send a verb and fail a streaming
session, triggering a repair failure, for example when the peer node is
gone or stopped. Switch to log level warn instead of error.
Fixes repair_additional_test.py:RepairAdditionalTest.repair_kill_3_test
Fixes: #1335
Message-Id: <406fb0c4a45b81bd9c0aea2a898d7ca0787b23e9.1465979288.git.asias@scylladb.com>
The repair code as it is right now is a bit convoluted: it resorts to detached
continuations + do_for_each when calling sync_ranges, and deals with the
problem of excessive parallelism by employing a semaphore inside that loop.
Even so, we still generate a great number of checksum requests, because
the ranges themselves are processed in parallel.
It would be better to have a single semaphore to limit the overall parallelism
for all requests.
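The single-semaphore shape can be sketched outside of Seastar. The following standalone C++ (hypothetical names, not the Scylla code; using threads where the real code uses futures and continuations) shows one counting semaphore capping how many simulated checksum requests are in flight at once:

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Minimal counting semaphore (std::counting_semaphore requires C++20;
// Scylla itself uses seastar::semaphore with futures).
class semaphore {
    std::mutex m_;
    std::condition_variable cv_;
    int units_;
public:
    explicit semaphore(int units) : units_(units) {}
    void acquire() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return units_ > 0; });
        --units_;
    }
    void release() {
        { std::lock_guard<std::mutex> lk(m_); ++units_; }
        cv_.notify_one();
    }
};

// Issue `n` checksum requests, but let only `limit` run at once.
// Returns the peak number of requests observed in flight.
int run_requests(int n, int limit) {
    semaphore sem(limit);
    std::atomic<int> in_flight{0}, peak{0};
    std::vector<std::thread> workers;
    for (int i = 0; i < n; ++i) {
        workers.emplace_back([&] {
            sem.acquire();
            int cur = ++in_flight;                 // track concurrency
            int prev = peak.load();
            while (cur > prev && !peak.compare_exchange_weak(prev, cur)) {}
            std::this_thread::sleep_for(std::chrono::milliseconds(1)); // simulated checksum work
            --in_flight;
            sem.release();
        });
    }
    for (auto& t : workers) t.join();
    return peak.load();
}
```

However many ranges are processed in parallel, the one shared semaphore bounds the total number of outstanding requests, which is exactly the property the patch wants.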
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Streaming currently has one class, which is used to contain the read
operations generated by the streaming process. Those reads come from two
places:
- checksums (if doing repair)
- reading mutations to be sent over the wire.
Depending on the amount of data we're dealing with, that can generate a
significant chunk of data, with seconds' worth of backlog, and if we need to
have the incoming writes intertwined with those reads, those can take a long
time.
Even if a node is only acting as a receiver, it may still do a lot of
reads: in the case of repair, those come from the checksums.
However, in more complicated failure scenarios, it is not hard to imagine a
node that will be both sending and receiving a lot of data.
The best way to guarantee progress on both fronts is to put the two kinds of
operations into different classes.
This patch introduces a new write class, and renames the old read class so it
can have a more meaningful name.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
In due time we will have to fix this, but as an interim step, let's use
a "better" magic number.
The problem with 100 is that as soon as partitions start to get bigger,
we use too much memory. Since this is multiplied by the number of token
ranges, and happens in every shard, the final number can become really big,
and the amount of resources we use goes up proportionally.
This means that even if we are mistaken about the new number (we probably
are), it is better to err on the side of a more conservative resource
usage.
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <97158f3db5734916cee4ccf12eaa66e7402570bb.1457448855.git.glauber@scylladb.com>
Use the existing "feed_hash" mechanism to find a checksum of the
content of a mutation, instead of serializing the mutation (with freeze())
and then finding the checksum of that string.
The serialized form is more prone to future changes, and not really
guaranteed to provide equal hashes for mutations which are considered
"equal".
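The feed_hash idea can be illustrated with a self-contained sketch (hypothetical names; a toy FNV-1a hasher stands in for the cryptographic hasher the real code feeds): the hasher is updated field by field with the mutation's content, so equal content always yields equal hashes, independent of any serialization format.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Toy stand-in for the hasher concept; the real code feeds a
// cryptographic hasher, FNV-1a just keeps this sketch dependency-free.
struct fnv_hasher {
    uint64_t state = 1469598103934665603ull;
    void update(const char* p, size_t n) {
        for (size_t i = 0; i < n; ++i) {
            state ^= static_cast<unsigned char>(p[i]);
            state *= 1099511628211ull;
        }
    }
};

// feed_hash-style helpers: hash the *content*, not a serialized blob.
void feed_hash(fnv_hasher& h, uint32_t v) {
    h.update(reinterpret_cast<const char*>(&v), sizeof(v));
}
void feed_hash(fnv_hasher& h, const std::string& s) {
    uint32_t len = static_cast<uint32_t>(s.size()); // length-prefix to avoid ambiguity
    feed_hash(h, len);
    h.update(s.data(), s.size());
}

// Hypothetical, drastically simplified mutation.
struct toy_mutation { std::string key; uint32_t timestamp; std::string value; };

uint64_t content_hash(const toy_mutation& m) {
    fnv_hasher h;
    feed_hash(h, m.key);
    feed_hash(h, m.timestamp);
    feed_hash(h, m.value);
    return h.state;
}
```

The serialize-then-hash alternative would make the checksum depend on freeze()'s output layout, which is exactly the fragility this commit removes.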
Fixes #971
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1456958676-27121-1-git-send-email-nyh@scylladb.com>
To implement nodetool's "--start-token"/"--end-token" feature, we need
to be able to repair only *part* of the ranges held by this node.
Our REST API already had a "ranges" option where the tool can list the
specific ranges to repair, but using this interface in the JMX
implementation is inconvenient, because it requires the *Java* code
to be able to intersect the given start/end token range with the actual
ranges held by the repaired node.
A more reasonable approach, which this patch takes, is to add new
"startToken"/"endToken" options to the repair's REST API. What these
options do is find the node's token ranges as usual, and only
then *intersect* them with the user-specified token range. The JMX
implementation becomes much simpler (in a separate patch for scylla-jmx),
and the real work is done in the C++ code, where it belongs, not in
Java code.
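The intersection step is simple; a sketch (with tokens simplified to integers and ranges to half-open [start, end) intervals, whereas real Scylla ranges are wrapping dht::token ranges with open/closed bounds):

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Simplified token range: half-open [start, end) over integer tokens.
using token_range = std::pair<long, long>;

// Keep only the parts of the node's local ranges that fall inside the
// user-specified [start, end) window.
std::vector<token_range> intersect_ranges(const std::vector<token_range>& local,
                                          long start, long end) {
    std::vector<token_range> out;
    for (const auto& r : local) {
        long s = std::max(r.first, start);
        long e = std::min(r.second, end);
        if (s < e) {
            out.push_back({s, e});    // non-empty overlap survives
        }
    }
    return out;
}
```

With this in the C++ side, the Java layer only needs to forward the two token strings it got from nodetool.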
With the additional scylla-jmx patch to use the new REST API options
provided here, this fixes #917.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1455807739-25581-1-git-send-email-nyh@scylladb.com>
When shutting down a node gracefully, this patch asks all ongoing repairs
started on this node to stop as soon as possible (without completing
their work), and then waits for these repairs to finish (with failure,
usually, because they didn't complete).
We need to do this, because if the repair loop continues to run while we
start destructing the various services it relies on, it can crash (as
reported in #699, although the specific crash reported there no longer
occurs after some changes in the streaming code). Additionally, it is
important to stop the ongoing repair, and not wait for it to complete
its normal operation, because that can take a very long time, and shutdown
is supposed to not take more than a few seconds.
Fixes#699.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1455218873-6201-1-git-send-email-nyh@scylladb.com>
Change the partition_checksum structure to be better suited for the
new serializers:
1. Use std::array<> instead of a C array, as the latter is not
supported by the new serializers.
2. Use an array of 32 bytes, instead of 4 8-byte integers. This will
guarantee that no byte-swapping monkey-business will be done on
these checksums.
The checksum XOR and equality-checking methods still temporarily
cast the bytes to 8-byte chunks, for (hopefully) better performance.
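The described layout can be sketched as follows (illustrative only, not the actual Scylla definition; the commit says the methods cast to 8-byte chunks, while this sketch uses memcpy, which compilers optimize to the same loads without strict-aliasing concerns):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

// 32 raw bytes instead of 4 uint64_t, so serialization never
// byte-swaps the digest.
struct partition_checksum {
    std::array<uint8_t, 32> digest{};

    // XOR-merge another checksum in, processed in 8-byte chunks.
    void add(const partition_checksum& other) {
        for (size_t i = 0; i < 32; i += 8) {
            uint64_t a, b;
            std::memcpy(&a, digest.data() + i, 8);
            std::memcpy(&b, other.digest.data() + i, 8);
            a ^= b;
            std::memcpy(digest.data() + i, &a, 8);
        }
    }

    bool operator==(const partition_checksum& o) const {
        return digest == o.digest;
    }
};
```

Because the wire format is now a plain byte array, both little- and big-endian nodes serialize the same digest bytes in the same order.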
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1454364900-3076-1-git-send-email-nyh@scylladb.com>
After this patch, our I/O operations will be tagged with a specific priority class.
There are five available classes, defined in the previous patch:
1) memtable flush
2) commitlog writes
3) streaming mutation
4) SSTable compaction
5) CQL query
Signed-off-by: Glauber Costa <glauber@scylladb.com>
In the existing code, when we fail to reach one of the replicas of some
range being repaired, we would give up, and not continue to repair the
living replicas of this range. The thinking behind this was since the
repair should be considered failed anyway, there's no point in trying
to do a half-job better.
However, in a discussion I had with Shlomi, he raised the following
alternative thinking, which convinced me: In a large cluster, having
one node or another temporarily dead has a high probability. In that
case, even if the repair is doomed to be considered "failed",
we want it at least to do as much as it possibly can to repair the
data on the living part of the cluster. This is what this patch does:
If we can only reach some of the replicas of a given range, the repair
will be considered failed (as before), but we will still repair the
reachable replicas of this range, if they have different checksums.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1453724443-29320-1-git-send-email-nyh@scylladb.com>
Theoretically, one could want to repair a single host *and* all the hosts
in one or more other data centers which don't include this host. However,
Cassandra's "nodetool repair" explicitly does not allow this, and fails if
given a list of data centers (via the "-dc" option) which doesn't include
the host starting the repair. So we need to behave like "nodetool repair"
and fail in this case too.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1453037016-25775-1-git-send-email-nyh@scylladb.com>
The intent is to make data returned by queries always conform to a
single schema version, which is requested by the client. For CQL
queries, for example, we want to use the same schema which was used to
compile the query. The other node expects to receive data conforming
to the requested schema.
Interface on shard level accepts schema_ptr, across nodes we use
table_schema_version UUID. To transfer schema_ptr across shards, we
use global_schema_ptr.
Because schema is identified by UUID across nodes, requestors must
be prepared to be queried for the definition of the schema. They
must hold a live schema_ptr around the request. This guarantees that
schema_registry will always know about the requested version. This is
not an issue because for queries the requestor needs to hold on to the
schema anyway to be able to interpret the results. But care must be
taken to always use the same schema version for making the request and
parsing the results.
Schema requesting across nodes is currently stubbed (throws runtime
exception).
Support the "hosts" and "dataCenters" parameters of repair. The first
specifies the known good hosts to repair this host from (plus this host),
and the second asks to restrict the repair to the local data center (you
must issue the repair to a node in the data center you want to repair -
issuing the command to a data center other than the named one returns
an error).
For example these options are used by nodetool commands like:
nodetool repair -hosts 127.0.0.1,127.0.0.2 keyspace
nodetool repair -dc datacenter1
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The existing repair code always streamed the entire content of the
database. In this overhaul, we send "repair_checksum_range" messages to
the other nodes to verify whether they have exactly the same data as
this node, and if they do, we avoid streaming the identical data.
We attempt to split the token ranges so that each contains an estimated
100 keys, and send these ranges' checksums. Future versions of this
code will need to improve this estimation (and make this "100" a parameter).
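The splitting itself can be sketched like this (hypothetical helper, with tokens simplified to integers; it assumes keys are spread uniformly over the token span, which is the crude estimation the text says needs improving):

```cpp
#include <utility>
#include <vector>

// Split [start, end) into subranges holding roughly `target_keys`
// (the hard-coded 100 mentioned above) each, given an estimate of the
// total key count and assuming uniform key distribution over tokens.
std::vector<std::pair<long, long>> split_range(long start, long end,
                                               long estimated_keys,
                                               long target_keys = 100) {
    long parts = (estimated_keys + target_keys - 1) / target_keys; // ceil
    if (parts < 1) {
        parts = 1;
    }
    std::vector<std::pair<long, long>> out;
    long span = end - start;
    for (long i = 0; i < parts; ++i) {
        long s = start + span * i / parts;
        long e = start + span * (i + 1) / parts;
        if (s < e) {
            out.push_back({s, e});   // subranges tile [start, end) exactly
        }
    }
    return out;
}
```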
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a function sync_range() for synchronizing all partitions
in a given token range between a set of replicas (this node and a list of
neighbors).
Repair will call this function once it has decided that the data the
replicas hold in this range is not identical.
The implementation streams all the data in the given range, from each of
the neighbors to this node - so now this node contains the most up-to-date
data. It then streams the resulting data back to all the neighbors.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds functions for calculating the checksum of all the
partitions in a given token range in the given column-family - either
in the current shard, or across all shards in this node.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a mechanism for calculating a checksum for a set of
partitions. The repair process will use these checksums to compare the
data held by different replicas.
We use a strong checksum (SHA-256) for each individual partition in the set,
and then a simple XOR of those checksums to produce a checksum for the
entire set. XOR is good enough for merging strong checksums, and allows us
to independently calculate the checksums of different subsets of the
original sets - e.g., each shard can calculate its own checksum and we
can XOR the resulting checksums to get the final checksum.
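The XOR-merge property is easy to demonstrate in a sketch (illustrative only: a 64-bit std::hash stands in for the real per-partition SHA-256 digest, so no crypto library is needed; the algebra is identical):

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Stand-in for the per-partition strong digest (SHA-256 in the real
// code, truncated here to 64 bits for a dependency-free sketch).
uint64_t partition_digest(const std::string& partition) {
    return std::hash<std::string>{}(partition);
}

// XOR-merge per-partition digests into a set-level checksum. XOR is
// commutative and associative, so each shard can checksum its own
// disjoint subset independently and the node XORs the per-shard results.
uint64_t set_checksum(const std::vector<std::string>& partitions) {
    uint64_t acc = 0;
    for (const auto& p : partitions) {
        acc ^= partition_digest(p);
    }
    return acc;
}
```

Order-independence is what allows each shard to scan its partitions in whatever order is convenient; the combined result is the same as checksumming the whole set at once.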
Apache Cassandra uses a very similar checksum scheme, also using SHA-256
and XOR. One small difference in the implementation is that we include the
partition key in its checksum, while Cassandra does not; I believe that
omission has no real justification (although it is very unlikely to cause
problems in practice). See further discussion on this in
https://issues.apache.org/jira/browse/CASSANDRA-10728.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
messaging_service will automatically use a private IP address to connect
to a peer node if possible. There is no need for an upper level like
streaming to worry about it. Dropping it simplifies things a bit.
Check the list of column families passed as an option to repair, to
provide the user with a more meaningful exception when a non-existent
column family is passed.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This was a plain bug - ranges_opt is supposed to parse the option into
the vector "var", but took the vector by value.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Support the "columnFamilies" parameter of repair, allowing repair of
only some of the column families of a keyspace, instead of all of them.
For example, using a command like "nodetool repair keyspace cf1 cf2".
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
A default value was not set for the "incremental" and "parallelism"
repair parameters, so Scylla can wrongly decide that they have an
unsupported value.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Add partial support for the "incremental" option (only support the
"false" setting, i.e., not incremental repair) and the "parallelism"
option (the choice of sequential or parallel repair is ignored - we
always use our own technique).
This is needed because scylla-jmx passes these options by default
(e.g., "incremental=false" is passed to say this is *not* incremental
repair, and we just need to allow this and ignore it).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
When throwing an "unsupported repair options" exception to the caller
(such as "nodetool repair"), also list which options were not recognized.
Additionally, list the options when logging the repair operation.
This patch includes an operator<< implementation for pretty-printing an
std::unordered_map. We may want to move it later to a more central
location - even Seastar (like we have a pretty-printer for std::vector
in core/sstring.hh).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch fixes a bug where the *first* run of "nodetool repair" always
returned immediately, instead of waiting for the repair to complete.
Repair operations are asynchronous: Starting a repair returns a numeric
id, which can then be used to query for the repair's completion, and this
is what "nodetool repair" does (through our JMX layer). We started with
the repair ID "0", the next one is "1", and so on.
The problem is that "nodetool repair", when it sees 0 being returned,
treats it not as a regular repair ID, but rather as an answer that
there is nothing to repair - printing a message to that effect and *not*
waiting for the repair (which was correctly started) to complete.
The trivial fix is to start our repair IDs at 1, instead of 0.
We currently do not return 0 in any case (we don't know there is nothing
to repair before we actually start the work, and parameter errors
cause an exception, not a return of 0).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Add in repair another info message, "Endpoints ... and ... have ... range(s)
out of sync for <column-family>", which the repair dtest expects.
This patch is a kind of silly attempt to appease issue #81 (should we mark
it fixed?). It's kind of silly, because without merkle trees (see issue #82),
we really have no way of knowing if there's any differences between the two
nodes, so we always say there is "1 range" difference. So if the dtest expects
such a message *not* to appear (because there are no differences), it might
fail.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
As requested in issue #79, this patch ensures that if the user attempts
to pass an unknown repair option, the operation fails rather than the
option simply be ignored.
An "unknown repair option" may be one of Cassandra's options we don't yet
support ("parallelism", "incremental", "jobThreads", "columnFamilies",
"dataCenters", "hosts" and "trace"), or any other unknown option name -
in either case, the operation will fail rather than ignore the option
which might have been important.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
This patch implements repair's "primaryRange" and "ranges" options:
Without these options, a repair defaults to repairing all the ranges for which
this node holds a replica (each range is repaired by contacting the other
replicas of this range).
If the "primaryRange" option is passed, instead of repairing all ranges, only
the "primary ranges" of this node are repaired - for each range, only one node
has this range as its "primary range". The intention is that a user can start
a "primaryRange" repair on all nodes, and the result would be that each range
will only be repaired once.
If the "ranges" option is passed, it explicitly lists the ranges to
repair, overriding the automatic determination of ranges explained above.
Fixes #212.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Avi asked not to use an atomic integer to produce ids for repair
operations. The existing code had another bug: It could return some
id immediately, but because our start_repair() hasn't started running
code on cpu 0 yet, the new id was not yet registered and if we were to
call repair_get_status() for this id too quickly, it could fail.
The solution for both issues is that start_repair() should return not
an int, but a future<int>: the integer id is incremented on cpu 0 (so
no atomics are needed), and then returned and the future is fulfilled.
Note that the future returned by start_repair() does not wait for the
repair to be over - just for its index to be registered and be usable
to a call to repair_get_status().
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Requested by Avi. The added benefit is that the code for repairing
all the ranges in parallel is now identical to the code of repairing
the ranges one by one - just replace do_for_each with parallel_for_each,
and no need for a different implementation using semaphores like I had
before this patch.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
[in v2: 1. Fixed a few small bugs.
2. Added rudimentary support for parallel/sequential repair.
3. Verified that the code works correctly with Asias's fix to streaming]
This patch adds the capability to track repair operations which we have
started, and check whether they are still running or completed (successfully
or unsuccessfully).
As before, one starts a repair with the REST API:
curl -X GET --header "Content-Type: application/json" --header "Accept: application/json" "http://127.0.0.1:10000/storage_service/repair_async/try1"
where "try1" is the name of the keyspace. This returns a repair id -
a small integer starting with 0. This patch adds support for a similar
request to *query* the status of a previously started repair, by adding
the "id=..." option to the query, which enquires about the status of the
repair with this id. For example,
curl -i -X GET --header "Content-Type: application/json" --header "Accept: application/json" "http://127.0.0.1:10000/storage_service/repair_async/try1?id=0"
gets the current status of this repair 0. This status can be RUNNING,
SUCCESSFUL or FAILED, or an HTTP 400 "unknown repair id ..." in case an
invalid id is passed (not the id of any real repair that was previously
started).
This patch also adds two alternative code-paths in the main repair flow
do_repair_start(): One where each range is repaired one after another,
and one where all the ranges are repaired in parallel. At the moment, the
enabled code is the parallel version, just as before this patch. But the
sequential version will also be useful for implementing the "parallel" vs
"sequential" repair options of Cassandra.
Note that if you try to use repair, you are likely to run into a bug in
the streaming code which results in Scylla either crashing or a repair
hanging (never realising it finished). Asias already has a fix for this bug,
and will hopefully publish it soon, but it is unrelated to the repair code,
so I think this patch can be committed independently.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
handle_exception() should really discard the future's value automatically,
and in an upcoming version of Seastar, won't. So instead of
sp.execute().handle_exception(...)
(where execute() returns a future which is *not* future<>)
We need to write
sp.execute().discard_result().handle_exception(...)
This already works in today's Seastar (the extra discard_result()
doesn't cause any harm), and will be necessary when handle_exception()
in Seastar is improved (I'll send a patch soon).
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Add a FIXME about something I'm unsure about - does repair only need to
repair this node, or should it also make an effort to repair the other nodes
(or more accurately, their specific token ranges being repaired) if we're
already communicating with them?
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
If a stream failed, print a clear error message that repair failed, instead
of ignoring it and letting Seastar's generic "warning, exception was ignored"
be the only thing the user will see.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
The previous repair code exchanged data with the other nodes holding
one arbitrary token. This will only work correctly when all the nodes
replicate all the data. In a more realistic scenario, the node being
repaired holds copies of several token ranges, and each of these ranges
has a different set of replicas we need to perform the repair with.
So this patch does the right thing - we perform a separate repair_range()
for each of the local ranges, and each of those will find a (possibly)
different set of nodes to communicate with.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
This patch adds the beginning of node repair support. Repair is initiated
on a node using the REST API, for example to repair all the column families
in the "try1" keyspace, you can use:
curl -X GET --header "Content-Type: application/json" --header "Accept: application/json" "http://127.0.0.1:10000/storage_service/repair_async/try1"
I tested that the repair already works (exchanges mutations with all other
replicas, and successfully repairs them), so I think it can be committed,
but it will need more work to be complete:
1. Repair options are not yet supported (range repair, sequential/parallel
repair, choice of hosts, datacenters and column families, etc.).
2. *All* the data of the keyspace is exchanged - Merkle Trees (or an
alternative optimization) and partial data exchange haven't been
implemented yet.
3. Full repair for nodes with multiple separate ranges is not yet
implemented correctly. E.g., consider 10 nodes with vnodes and RF=2,
so each vnode's range has a different host as a replica, so we need
to exchange each key range separately with a different remote host.
4. Our repair operation returns a numeric operation id (like Origin),
but we don't yet provide any means to use this id to check on ongoing
repairs like Origin allows.
5. Error handling, logging, etc., needs to be improved.
6. SMP nodes (with multiple shards) should work correctly (thanks to
Asias's latest patch for SMP mutation streaming) but haven't been
tested.
7. Incremental repair is not supported (see
http://www.datastax.com/dev/blog/more-efficient-repairs)
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>