scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-25 02:50:33 +00:00

Author	SHA1	Message	Date
Duarte Nunes	a025bf6a7d	Merge seastar upstream Seastar introduced a "compat" namespace, which conflicts with Scylla's own "compat" namespaces. The merge thus includes changes to scope uses of Scylla's "compat" namespaces. * seastar 8ad870f...9bb1611 (5): > util/variant_utils: Ensure variant_cast behaves well with rvalues > util/std-compat: Fix infinite recursion > doc/tutorial: Undo namespace changes > util/variant_utils: Add cast_variant() > Add compatbility with C++17's library types Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-08-14 13:07:09 +01:00
Nadav Har'El	5e47061438	repair: fix small error-handling logic mistake As noticed by Tomasz Grabiec, we test a future's available() after having already waited for it with when_all(), which is pointless. The code after the wrong if() exchanges the contents of a token-range between this node and several other live neighbors; We can't do this exchange if either this node is broken or there is no other live neighbor. So this is what we needed to test. so !available() should have been failed(). Also the test for live_neighbors_checksum.empty() added in commit `7c873f0d1f` is unnecessary - we build live_neighbors and live_neighbors_checksum together, so if one of them is empty, so is the other. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20180710114940.26027-1-nyh@scylladb.com>	2018-07-10 15:04:03 +03:00
Nadav Har'El	3194ce16b3	repair: fix combination of "-pr" and "-local" repair options When nodetool repair is used with the combination of the "-pr" (primary range) and "-local" (only repair with nodes in the same DC) options, Scylla needs to define the "primary ranges" differently: Rather than assign one node in the entire cluster to be the primary owner of every token, we need one node in each data-center - so that a "-local" repair will cover all the tokens. Fixes #3557. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20180701132445.21685-1-nyh@scylladb.com>	2018-07-01 16:39:33 +03:00
Vladimir Krivopalov	acdce55572	Inject CryptoPP namespace where Crypto++ `byte` typedef is used. In Crypto++ v6, the `byte` typedef has been moved from the global namespace to the CryptoPP:: namespace. To make Scylla code compile with both old and new versions, bring the namespace in so that the code works regardless of the scope of `byte` definition. Fixes #3252 Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com> Message-Id: <60e7bfe868b778b1c9bbe15d7247db64b61bd406.1520272198.git.vladimir@scylladb.com>	2018-03-05 20:43:07 +02:00
Pekka Enberg	bd365a10d3	Merge "Add an API to get all active repairs" from Amnon "This series adds an API to return the active repairs by their IDs. After this series a call to: curl -X GET --header "Accept: application/json" "http://localhost:10000/storage_service/active_repair/" Will return an array with the ids of the active repairs. Fixes #3193" * 'amnon/get_active_repairs_v3' of github.com:scylladb/seastar-dev: API: Add get active repair api repair: Add a get_active_repairs function to return the active repair	2018-02-19 15:32:17 +02:00
Amnon Heiman	3f2eae35fd	repair: Add a get_active_repairs function to return the active repair This patch adds a function that returns an array with the ids of the active repairs by filtering the RUNNING ones in the repair tracker status. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2018-02-14 11:43:37 +02:00
Duarte Nunes	7ba63b1521	atomic_cell_hash: Add specialization for atomic_cell_or_collection Replace the atomic_cell_or_collection::feed_hash() member function with the specialization of appending_hash, and use that instead. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-02-01 00:22:51 +00:00
Duarte Nunes	a0d748c71c	range_tombstone: Replace feed_hash() member function with appending_hash Replace range_tombstone::feed_hash() with the specialization of appending_hash, so that we can use the general feed_hash() function. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-02-01 00:22:50 +00:00
Duarte Nunes	12507fb9ce	keys: Replace feed_hash() member function with appending_hash Replace the feed_hash() member function of partition_key and clustering_key_prefix with the specialization of appending_hash, so that we can use the general feed_hash() function. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-02-01 00:22:50 +00:00
Piotr Jastrzebski	ee6f2ca554	partition_checksum::compute_legacy: use only flat reader Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2018-01-24 20:56:48 +01:00
Vlad Zolotarov	97506f39b2	repair: use seastar::cache_line_size for aligning to the cache line size Use seastar::cache_line_size for cache line alignment instead of a hard coded value (64) - this value is not always correct, e.g. PPC64 platform, where cache line size is 128B. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-12-08 13:38:13 -05:00
Paweł Dziepak	dca93bea23	db: convert make_streaming_reader() to flat_mutation_reader	2017-11-13 16:49:52 +00:00
Paweł Dziepak	f690e2e80b	repair: convert partition_checksum::compute_streamed() to flat streams	2017-11-13 16:49:52 +00:00
Paweł Dziepak	d71a14b943	repair: make partition_hasher consume flat mutation streams	2017-11-13 16:49:52 +00:00
Paweł Dziepak	2b774119a1	mutation_hasher: copy mutation_hasher to repair.cc Repair is the exclusive user of mutation_hasher. Moving it there will make integration with partition_checksum easier.	2017-11-13 16:49:52 +00:00
Paweł Dziepak	f648f94464	partition_checksum: introduce compute() for flat_mutation_reader	2017-11-13 16:49:52 +00:00
Avi Kivity	43a72254ff	repair: add missing include	2017-09-11 20:09:45 +03:00
Avi Kivity	a59e375aad	Merge "Support termination of repair jobs" from Asias "This series implements the missing API to terminate all repairs. For example: $ curl -X POST --header "Accept: application/json" "http://127.0.0.1:10000/storage_service/force_terminate_repair" With the new stream_plan::abort() API we can now abort the stream session associated with the repair as well. On top of this, we can support termination of a single repair job instead of all jobs. Fixes #2105" * tag 'asisas/repair_abort_v4' of github.com:scylladb/seastar-dev: repair: Support termination of repair jobs repair: Track repair_info repair: Intorduce repair id to repair_info map api: Add force_terminate_repair API streaming: Add abort to stream_plan streaming: Add abort_all_stream_sessions for stream_coordinator streaming: Introduce streaming::abort() streaming: Make stream_manager and coordinator message debug level streaming: Check if _stream_result is valid streaming: Log peer address in on_error streaming: Introduce received_failed_complete_message	2017-09-06 12:58:05 +03:00
Asias He	e14bb7b1d5	repair: Remove #if'ed code in repair_ranges It is unlikely we will use parallel_for_each version in repair_ranges. Get rid of the dead code. Message-Id: <31a9366adfe0262512a326ef9703aa0bba05e1fb.1503996138.git.asias@scylladb.com>	2017-09-03 11:13:02 +03:00
Asias He	471e8b341f	repair: Support termination of repair jobs This patch implements the missing API to terminate all repairs. For example: $ curl -X POST --header "Accept: application/json" "http://127.0.0.2:10000/storage_service/force_terminate_repair" With the new stream_plan::abort() API we can now abort the stream session associated with the repair as well. Fixes #2105	2017-08-30 15:19:52 +08:00
Asias He	07d9dc03ec	repair: Track repair_info Make repair_info a shared pointer and store them in _repairs map so we can find by the repair id and access them later.	2017-08-30 15:19:52 +08:00
Asias He	5c9732c645	repair: Intorduce repair id to repair_info map The maps are stored in a vector. The vector has smp::count elements, each element will be accessed by only one shard. The add_repair_info, remove_repair_info and get_repair_info helpers are added.	2017-08-30 15:19:51 +08:00
Asias He	68346f7e53	repair: Use with_semaphore for sp_parallelism_semaphore Instead of calling semaphore.signal() manually. Message-Id: <51b7ecdebac91763a2340fe00959742810614845.1503648936.git.asias@scylladb.com>	2017-08-27 12:50:38 +03:00
Asias He	69c81bcc87	repair: Do not allow repair until node is in NORMAL status The following backtrace was reported by user when running repair and keeping restarting the node at the same time. #0 0x00007eff077281d7 in raise () from /lib64/libc.so.6 #1 0x00007eff07729a08 in abort () from /lib64/libc.so.6 #2 0x00007eff07721146 in __assert_fail_base () from /lib64/libc.so.6 #3 0x00007eff077211f2 in __assert_fail () from /lib64/libc.so.6 #4 0x00000000010ef2c2 in locator::token_metadata::first_token_index (this=0x641000214e98, start=...) at locator/token_metadata.cc:133 #5 0x00000000010ef2d9 in locator::token_metadata::first_token (this=0x641000214e98, start=...) at locator/token_metadata.cc:143 #6 0x00000000010e329d in locator::abstract_replication_strategy::get_natural_endpoints (this=0x641000494000, search_token=...) at locator/abstract_replication_strategy.cc:66 #7 0x0000000001481186 in get_neighbors (hosts=std::vector of length 0, capacity 0, data_centers=std::vector of length 0, capacity 0, range=<error reading variable: access outside bounds of object referenced via synthetic pointer>, ksname=..., db=...) at repair/repair.cc:196 #8 repair_range<nonwrapping_range<dht::token> > (range=..., ri=...) at repair/repair.cc:781 #9 <lambda(auto:99&)>::<lambda(auto:100&&)>::<lambda(auto:101&)>::<lambda()>::operator() (__closure=0x7efec07f7460) at repair/repair.cc:1005 #10 futurize<future<bool_class<stop_iteration_tag> > >::apply<repair_ranges(repair_info)::<lambda(auto:99&)>:: It is reproduced with 1) while true; do curl -X POST --header "Content-Type: application/json" --header "Accept: application/json" "http://127.0.0.1:10000/storage_service/repair_async/ks3"; done 2) start node 127.0.0.1, stop node 127.0.0.1 in a loop The problem is, during boot up, the token_metadata is not replicated to all shards until the node goes into NORMAL status. To fix, check until node is in NORMAL status before allowing repair. Fixes #2723	2017-08-23 14:40:04 +08:00
Asias He	763fa83232	repair: Fix build in repair_cf_range The compiler does not like the mutable. Message-Id: <83c5e8a944b72a095b8e29e9988986e6ca9cefc5.1501690749.git.asias@scylladb.com>	2017-08-02 18:57:32 +02:00
Asias He	5798625d73	repair: Singal parallelism_semaphore in case of error If we throw after we take the semaphore and before the when_all below runs, no one will increase the semaphore. Fixes #2661 Message-Id: <49540ede4c8a6d84004e10e0f63690e3c21d72c7.1501686383.git.asias@scylladb.com>	2017-08-02 18:32:32 +03:00
Avi Kivity	ac31abf6a4	repair: don't lambda-capture repair_tracker It is static, so it need not be captured, and some compilers complain.	2017-08-02 18:07:31 +03:00
Avi Kivity	ce60ef59f3	Revert "repair: Singal parallelism_semaphore in case of error" This reverts commit `a548eee28c`. It releases the semaphore too early (noted by Glauber).	2017-08-02 17:13:46 +03:00
Asias He	a548eee28c	repair: Singal parallelism_semaphore in case of error If we throw after we take the semaphore and before the when_all below runs, no one will increase the semaphore. Fixes #2661	2017-08-02 21:41:45 +08:00
Asias He	abcff4c78e	repair: Fix repair_tracker done If it throws after repair_tracker.start and before the when_all below, the repair_tracker.done will never be called for this repair id. Fixes #2660	2017-08-02 21:40:29 +08:00
Asias He	b10e961a64	repair: Use selective_token_range_sharder With this change, we ask all the shards to handle the ranges provided by the user, and we use selective_token_range_sharder to split the ranges and ignore the ranges that do not belong to the current shard.	2017-07-04 18:46:19 +08:00
Nadav Har'El	d177ec05cb	repair: further limit parallelism of checksum calculation Repair today has a semaphore limiting the number of ongoing checksum comparisons running in parallel (on one shard) to 100. We needed this number to be fairly high, because a "checksum comparison" can involve high latency operations - namely, sending an RPC request to another node in a remote DC and waiting for it to calculate a checksum there, and while waiting for a response we need to proceed calculating checksums in parallel. But as a consequence, in the current code, we can end up with as many as 100 fibers all at the same stage of reading partitions to checksum from sstables. This requires tons of memory, to hold at least 128K of buffer (even more with read-ahead) for each of these fibers, plus partition data for each. But doing 100 reads in parallel is pointless - one (or very few) should be enough. So this patch adds another semaphore to limit the number of checksum calculations (including the read and checksum calculation) on each shard to just 2. There may still be 100 ongoing checksum comparisons, in other stages of the comparisons (sending the checksum requests to other and waiting for them to return), but only 2 will ever be in the stage of reading from disk and checksumming them. The limit of 2 checksum calculations (per shard) applies on the repair slave, not just to the master: The slave may receive many checksum requests in parallel, but will only actually work on 2 at a time. Because the parallelism=100 now rate-limits operations which use very little memory, in the future we can safely increase it even more, to support situations where the disk is very fast but the link between nodes has very high latency. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20170703151329.25716-1-nyh@scylladb.com>	2017-07-03 18:14:57 +03:00
Asias He	b2a2fbcf73	repair: Do not store the failed ranges The number of failed ranges can be large so it can consume a lot of memory. We already logged the failed ranges in the log. No need to store them in memory. Message-Id: <7a70c4732667c5c3a69211785e8efff0c222fc28.1498809367.git.asias@scylladb.com>	2017-07-03 10:00:25 +03:00
Asias He	cc02a62756	repair: Prefer nodes in local dc when streaming When peer nodes have the same partition data, i.e., with the same checksum, we currently choose to stream from any of them randomly. To improve streaming performance, select the peer within the same DC. This patch is supposed to improve repair performance with multiple DCs. Message-Id: <c6a345b6e8ed2b59f485e53c865241e463b44507.1498490831.git.asias@scylladb.com>	2017-06-26 18:34:21 +03:00
Asias He	47345078ec	repair: Repair on all shards Currently, shard zero is the coordinator of the repair. All the work of checksumming the local node and sending the repair checksum rpc verb is done on shard zero only. This leaves the other shards underutilized. With this patch, we split the ranges that need to be repaired into at least smp::count ranges, so sizeof(ranges) / smp::count will be assigned to each shard. For example, if we have 8 shards and 256 ranges, each shard will repair 32 ranges. Each shard will repair the 32 ranges sequentially. There will be at most 8 (smp::count) ranges of repair in parallel.	2017-06-14 17:52:49 +08:00
Asias He	54831a344c	repair: Allow one stream plan in flight In "repair: Use more stream_plan" (commit `2043ffc064`), we switched to streaming while doing checksums instead of streaming only after the checksum phase is completed. We take a parallelism_semaphore before we do checksum, if there are more than sub_ranges_to_stream (1024) ranges, we start a stream_plan and wait for the streaming to complete (still under the parallelism_semaphore). So at most parallelism_semaphore (100) stream_plans can be in parallel. The parallelism_semaphore limits the parallelism of both checksum and the streaming plan. However, it is not necessary to have the same parallelism for both checksum and streaming, because 1) a streaming operation itself runs in parallel (handling ranges on all shards in parallel, sending mutations in parallel), 2) and more streaming plans (in the worst case 100) mean we can write to 100 memtables at the same time and flush 100 memtables to disk at the same time, which can take a lot of memory. With this patch, we only allow one stream plan in flight.	2017-06-14 17:52:36 +08:00
Asias He	2bcb368a13	repair: Fix range use after free Capture it by value. scylla: [shard 0] repair - repair's stream failed: streaming::stream_exception (Stream failed) scylla: [shard 0] repair - Failed sync of range ==<runtime_exception (runtime error: Invalid token. Should have size 8, has size 0#012)>: streaming::stream_exception (Stream failed) Message-Id: <7fda4432e54365f64b556e7e4c26e36d3a9bb1b7.1497238229.git.asias@scylladb.com>	2017-06-12 11:00:57 +03:00
Asias He	3fdb8a3d3f	repair: Remove unused sub_ranges_max With the sub range iterator, it is not used anymore. Drop it.	2017-06-07 08:52:45 +08:00
Asias He	ca00c10b35	repair: Reduce parallelism in repair_ranges We currently repair all the ranges in parallel. 1) All the ranges will contend for parallelism_semaphore, instead of processing multiple ranges in parallel and calculating the sub ranges (which take memory) for each range in parallel, we can handle the ranges one by one. We could have enough parallelism because the checksums are calculated on all the shards. 2) If for some reason the repair failed, if we handle ranges one by one, we can log which range of repair is successful. Next time, we can ignore them. If we start ranges in parallel, there is a high chance that no single range is completed because all the ranges are ongoing. Refs #1912	2017-06-07 08:50:57 +08:00
Asias He	3852665156	repair: Tweak the log a bit - Count n out m ranges the repair is running for (kind of progress report) - Make the 'Found differing range' log debug because it can be millions of such entries - Print the failed ranges	2017-06-07 08:50:57 +08:00
Asias He	2043ffc064	repair: Use more stream_plan In the very beginning, we use a stream_plan for each checksum range. Later, we changed to use a single stream_plan for all the checksum ranges. It pushes memory pressure to streaming, e.g., millions of ranges in a vector to send over RPC. To fix, we do checksum and streaming in parallel, limit the number of checksum ranges stored in memory. Fixes #2430	2017-06-07 08:50:56 +08:00
Nadav Har'El	b3ff37e67f	repair: iterator over subranges instead of list When starting repair, we divided the large token ranges (vnodes) into small subranges of a desired length (around 100 partitions), and built a huge list of those subranges - to iterate over them later and compare checksums of those chunks. However, building this list up-front is completely unnecessary, and wastes a lot of memory: In a test with 1 TB of data, as much as 3 gigabytes was spent on this list. Instead, what we do in this patch is to find the next chunk in a DFS-like splitting algorithm, using only the token range midpoint() function (as before). The amount of memory needed for this is O(logN), instead of O(N) in the previous implementation. Refs #2430. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2017-06-07 08:50:56 +08:00
Avi Kivity	ebaeefa02b	Merge seatar upstream (seastar namespace) - introduced "seastarx.hh" header, which does a "using namespace seastar"; - 'net' namespace conflicts with seastar::net, renamed to 'netw'. - 'transport' namespace conflicts with seastar::transport, renamed to cql_transport. - "logger" global variables now conflict with logger global type, renamed to xlogger. - other minor changes	2017-05-21 12:26:15 +03:00
Asias He	66e3b73b9c	repair: Fix partition estimation We estimate the number of partitions for a given range of a column family and split the range into sub ranges that contain fewer partitions as a checksum unit. The estimation is wrong, because we need to count the partitions on all the shards, instead of only counting the local shard. Fixes #2299 Message-Id: <7876285bd26cfaf65563d6e03ec541626814118a.1493817339.git.asias@scylladb.com>	2017-05-03 16:25:45 +03:00
Avi Kivity	ae7d7ae20f	repair: avoid auto in function argument declaration 'auto' in a non-lambda function argument is not legal C++, and is hard to read besides. Replace with the right type.	2017-04-17 23:03:15 +03:00
Asias He	39d2e59e7e	repair: Fix midpoint is not contained in the split range assertion in split_and_add We have: auto halves = range.split(midpoint, dht::token_comparator()); We saw a case where midpoint == range.start, as a result, range.split will assert because the range.start is marked non-inclusive, so the midpoint doesn't appear to be contain()ed in the range - hence the assertion failure. Fixes #2148 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Signed-off-by: Asias He <asias@scylladb.com> Message-Id: <93af2697637c28fbca261ddfb8375a790824df65.1489023933.git.asias@scylladb.com>	2017-03-09 09:09:17 +01:00
Asias He	937f28d2f1	Convert to use dht::partition_range_vector and dht::token_range_vector	2016-12-19 14:08:50 +08:00
Asias He	e5485f3ea6	Get rid of query::partition_range Use dht::partition_range instead	2016-12-19 08:09:25 +08:00
Asias He	d1178fa299	Convert to use dht::token_range	2016-12-19 08:04:29 +08:00
Avi Kivity	32fb4c3661	Merge "repair: Reduce unnecessary streaming traffic even more" from Asias "In `7c873f0d` (repair: Reduce unnecessary streaming traffic), we optimize in cases when 1) all the remote nodes have the same checksum and 2) local node has zero checksum. In this series, we make the optimization more generic and cover more cases." * tag 'asias/repair/node_reducer/v3' of github.com:cloudius-systems/seastar-dev: repair: Reduce unnecessary streaming traffic even more repair: Add hash specialization for partition_checksum	2016-12-18 16:53:39 +02:00
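
Commit b3ff37e67f above replaces an up-front list of subranges with a DFS-like iterator that needs only a midpoint function and O(log N) memory. The following is a minimal sketch of that idea, not the code from repair.cc. It assumes C++17, models tokens as plain 64-bit integers, and uses hypothetical names (`token_range`, `subrange_iterator`, `max_width`). It also splits on range width, whereas the commit messages describe splitting by an estimated partition count.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in for a token range: the half-open interval (start, end].
struct token_range {
    std::uint64_t start;
    std::uint64_t end;
    std::uint64_t width() const { return end - start; }
};

// Yields consecutive subranges no wider than max_width, one at a time.
// Only the not-yet-visited right halves are kept on a stack, so memory
// grows with the splitting depth rather than with the number of subranges.
class subrange_iterator {
public:
    subrange_iterator(token_range whole, std::uint64_t max_width)
        : _max_width(max_width == 0 ? 1 : max_width) {  // a zero limit could never be met
        if (whole.width() > 0) {
            _pending.push_back(whole);
        }
    }

    // Returns the next subrange in token order, or nullopt when exhausted.
    std::optional<token_range> next() {
        if (_pending.empty()) {
            return std::nullopt;
        }
        token_range r = _pending.back();
        _pending.pop_back();
        // Descend into the left half, remembering each right half for later.
        // width() > max_width >= 1 implies width() >= 2, so mid lies strictly
        // inside the range and both halves are non-empty.
        while (r.width() > _max_width) {
            const std::uint64_t mid = r.start + r.width() / 2;
            _pending.push_back(token_range{mid, r.end});
            r = token_range{r.start, mid};
        }
        return r;
    }

    // Number of ranges currently remembered on the stack.
    std::size_t pending() const { return _pending.size(); }

private:
    std::vector<token_range> _pending;
    std::uint64_t _max_width;
};
```

A caller would loop with `while (auto r = it.next()) { ... }` and checksum each subrange as it is produced. The guard that keeps the midpoint strictly inside the range sidesteps the `midpoint == range.start` situation described in commit 39d2e59e7e, although the real fix there may differ.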


127 Commits