scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-25 19:10:42 +00:00

Author	SHA1	Message	Date
Botond Dénes	136fc856c5	treewide: silence discarded future warnings for questionable discards This patch silences the remaining discarded future warnings, those where it cannot be determined with reasonable confidence that this was indeed the actual intent of the author, or that the discarding of the future could lead to problems. For all those places a FIXME is added, with the intent that these will be soon followed-up with an actual fix. I deliberately haven't fixed any of these, even if the fix seems trivial. It is too easy to overlook a bad fix mixed in with so many mechanical changes.	2019-08-26 19:28:43 +03:00
Vlad Zolotarov	53cf90b075	ec2_snitch: properly build the AWS meta server address Explicitly pass the port number of the AWS metadata server API when creating a corresponding socket. This patch fixes the regression introduced by `4ef940169f`. Fixes #4719 Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2019-07-25 10:50:01 +03:00
Calle Wilund	e9816efe06	Remove usage of inet_address::raw_addr()	2019-07-08 14:13:09 +00:00
Calle Wilund	4ef940169f	Replace use of "ipv4_addr" with socket_address Allows the various sockets to use ipv6 address binding if so configured.	2019-07-08 14:13:09 +00:00
Avi Kivity	c42d59d805	locator: fix pessimizing moves Remove pessimizing moves, as reported by gcc 9.	2019-05-07 09:27:27 +03:00
Benny Halevy	ff4d8b6e85	treewide: use std::filesystem Rather than {std::experimental,boost,seastar::compat}::filesystem On Sat, 2019-03-23 at 01:44 +0200, Avi Kivity wrote: > The intent for seastar::compat was to allow the application to choose > the C++ dialect and have seastar follow, rather than have seastar choose > the types and have the application follow (as in your patch). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2019-03-28 14:21:10 +02:00
Duarte Nunes	fa2b0384d2	Replace std::experimental types with C++17 std version. Replace stdx::optional and stdx::string_view with the C++ std counterparts. Some instances of boost::variant were also replaced with std::variant, namely those that called seastar::visit. Scylla now requires GCC 8 to compile. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20190108111141.5369-1-duarte@scylladb.com>	2019-01-08 13:16:36 +02:00
Avi Kivity	c96fc1d585	Merge "Introduce row level repair" from Asias " === How the partition level repair works - The repair master decides which ranges to work on. - The repair master splits the ranges to sub ranges which contain around 100 partitions. - The repair master computes the checksum of the 100 partitions and asks the related peers to compute the checksum of the 100 partitions. - If the checksum matches, the data in this sub range is synced. - If the checksum mismatches, repair master fetches the data from all the peers and sends back the merged data to peers. === Major problems with partition level repair - A mismatch of a single row in any of the 100 partitions causes 100 partitions to be transferred. A single partition can be very large. Not to mention the size of 100 partitions. - Checksum (find the mismatch) and streaming (fix the mismatch) will read the same data twice === Row level repair Row level checksum and synchronization: detect row level mismatch and transfer only the mismatch === How the row level repair works - To solve the problem of reading data twice Read the data only once for both checksum and synchronization between nodes. We work on a small range which contains only a few megabytes of rows. We read all the rows within the small range into memory. Find the mismatch and send the mismatch rows between peers. We need to find a sync boundary among the nodes which contains only N bytes of rows. - To solve the problem of sending unnecessary data. We need to find the mismatched rows between nodes and only send the delta. The problem is called set reconciliation problem which is a common problem in distributed systems. For example: Node1 has set1 = {row1, row2, row3} Node2 has set2 = { row2, row3} Node3 has set3 = {row1, row2, row4} To repair: Node1 fetches nothing from Node2 (set2 - set1), fetches row4 (set3 - set1) from Node3. 
Node1 sends row1 and row4 (set1 + set2 + set3 - set2) to Node2 Node1 sends row3 (set1 + set2 + set3 - set3) to Node3. === How to implement repair with set reconciliation - Step A: Negotiate sync boundary class repair_sync_boundary { dht::decorated_key pk; position_in_partition position } Reads rows from disk into row buffers until the size is larger than N bytes. Return the repair_sync_boundary of the last mutation_fragment we read from disk. The smallest repair_sync_boundary of all nodes is set as the current_sync_boundary. - Step B: Get missing rows from peer nodes so that repair master contains all the rows Request combined hashes from all nodes between last_sync_boundary and current_sync_boundary. If the combined hashes from all nodes are identical, data is synced, goto Step A. If not, request the full hashes from peers. At this point, the repair master knows exactly what rows are missing. Request the missing rows from peer nodes. Now, local node contains all the rows. - Step C: Send missing rows to the peer nodes Since local node also knows what peer nodes own, it sends the missing rows to the peer nodes. === What the RPC API looks like - repair_range_start() Step A: - request_sync_boundary() Step B: - request_combined_row_hashes() - request_full_row_hashes() - request_row_diff() Step C: - send_row_diff() - repair_range_stop() === Performance evaluation We created a cluster of 3 Scylla nodes on AWS using i3.xlarge instances. We created a keyspace with a replication factor of 3 and inserted 1 billion rows to each of the 3 nodes. Each node has 241 GiB of data. We tested 3 cases below. 1) 0% synced: one of the nodes has zero data. The other two nodes have 1 billion identical rows. Time to repair: old = 87 min new = 70 min (rebuild took 50 minutes) improvement = 19.54% 2) 100% synced: all of the 3 nodes have 1 billion identical rows. 
Time to repair: old = 43 min new = 24 min improvement = 44.18% 3) 99.9% synced: each node has 1 billion identical rows and 1 billion * 0.1% distinct rows. Time to repair: old: 211 min new: 44 min improvement: 79.15% Bytes sent on wire for repair: old: tx = 162 GiB, rx = 90 GiB new: tx = 1.15 GiB, rx = 0.57 GiB improvement: tx = 99.29%, rx = 99.36% It is worth noting that row level repair sends and receives exactly the number of rows needed in theory. In this test case, repair master needs to receive 2 million rows and send 4 million rows. Here are the details: Each node has 1 billion * 0.1% distinct rows, that is 1 million rows. So repair master receives 1 million rows from repair slave 1 and 1 million rows from repair slave 2. Repair master sends 1 million rows from repair master and 1 million rows received from repair slave 1 to repair slave 2. Repair master sends 1 million rows from repair master and 1 million rows received from repair slave 2 to repair slave 1. In the result, we saw the rows on wire were as expected. 
tx_row_nr = 1000505 + 999619 + 1001257 + 998619 (4 shards, the numbers are for each shard) = 4'000'000 rx_row_nr = 500233 + 500235 + 499559 + 499973 (4 shards, the numbers are for each shard) = 2'000'000 Fixes: #3033 Tests: dtests/repair_additional_test.py " * 'asias/row_level_repair_v7' of github.com:cloudius-systems/seastar-dev: (51 commits) repair: Enable row level repair repair: Add row_level_repair repair: Add docs for row level repair repair: Add repair_init_messaging_service_handler repair: Add repair_meta repair: Add repair_writer repair: Add repair_reader repair: Add repair_row repair: Add fragment_hasher repair: Add decorated_key_with_hash repair: Add get_random_seed repair: Add get_common_diff_detect_algorithm repair: Add shard_config repair: Add suportted_diff_detect_algorithms repair: Add repair_stats to repair_info repair: Introduce repair_stats flat_mutation_reader: Add make_generating_reader storage_service: Introduce ROW_LEVEL_REPAIR feature messaging_service: Add RPC verbs for row level repair repair: Export the repair logger ...	2018-12-25 13:13:00 +02:00
Calle Wilund	bfc6c89b00	network_topology_strategy: Simplify calculate_natural_endpoints Fixes #2896 (hopefully) Implementation of origin change c000da13563907b99fe220a7c8bde3c1dec74ad5 Reduces the amount of maps and sets and general complexity of endpoint calculation by simply mapping dc:s to expected node counts, re-using endpoint sets and iterate thusly. Tested with transposed origin unit test comparing old vs. new algo results. (Next patch)	2018-12-17 13:10:59 +00:00
Calle Wilund	707bff563e	token_metadata: Add "get_location" ip to dc+rack accessor	2018-12-12 09:32:05 +00:00
Asias He	063dfcda26	messaging_service: Add constructor for msg_addr Which takes the ip address and shard id.	2018-12-12 16:49:01 +08:00
Avi Kivity	775b7e41f4	Update seastar submodule * seastar d59fcef...b924495 (2): > build: Fix protobuf generation rules > Merge "Restructure files" from Jesse Includes fixup patch from Jesse: " Update Seastar `#include`s to reflect restructure All Seastar header files are now prefixed with "seastar" and the configure script reflects the new locations of files. Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com> Message-Id: <5d22d964a7735696fb6bb7606ed88f35dde31413.1542731639.git.jhaberku@scylladb.com> "	2018-11-21 00:01:44 +02:00
Vlad Zolotarov	2636395c65	locator: ec2_multi_region_snitch::start(): print a human readable error if Public IP may not be retrieved Public IP is required for Ec2MultiRegionSnitch. If it's not available, a different snitch should be used. This patch results in a readable error message being printed instead of just a cryptic message with the HTTP response body. Fixes #3897 Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2018-11-01 11:50:58 -04:00
Vlad Zolotarov	c462af5549	locator: ec2_multi_region_snitch::start(): rework on top of seastar::thread Rework ec2_multi_region_snitch::start() on top of seastar::async() in order to simplify the code. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2018-11-01 10:48:37 -04:00
Avi Kivity	0c33d13165	locator: convert sprint() to format() sprint() recently became more strict, throwing on sprint("%s", 5). Replace with the more modern format(). Mechanically converted with https://github.com/avikivity/unsprint.	2018-11-01 13:16:17 +00:00
Avi Kivity	1ce52d5432	locator: fix abstract_replication_strategy::get_ranges() and friends violating sort order get_ranges() is supposed to return ranges in sorted order. However, `a35136533d` broke this and returned the range that was supposed to be last in the second position (e.g. [0, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9]). This broke cleanup, which relied on the sort order to perform a binary search. Other users of the get_ranges() family did not rely on the sort order. Fixes #3872. Message-Id: <20181019113613.1895-1-avi@scylladb.com>	2018-10-19 16:47:12 +00:00
Asias He	91dae0149d	token_metadata: Invalidate cached ring in update_normal_tokens In commit `4a0b561376`, "storage_service: Get rid of moving operation", we removed remove_from_moving() in update_normal_tokens(). However, remove_from_moving() calls invalidate_cached_rings(). We should call invalidate_cached_rings() in update_normal_tokens(), otherwise we will get wrong token range to address map in the token_metadata cache. This issue exists in master only. It is not in any of the releases. Message-Id: <c03f2ed478cfdb84494f36dce9a8cfc05ed9e0cd.1538288364.git.asias@scylladb.com>	2018-09-30 11:06:46 +03:00
Duarte Nunes	a025bf6a7d	Merge seastar upstream Seastar introduced a "compat" namespace, which conflicts with Scylla's own "compat" namespaces. The merge thus includes changes to scope uses of Scylla's "compat" namespaces. * seastar 8ad870f...9bb1611 (5): > util/variant_utils: Ensure variant_cast behaves well with rvalues > util/std-compat: Fix infinite recursion > doc/tutorial: Undo namespace changes > util/variant_utils: Add cast_variant() > Add compatibility with C++17's library types Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-08-14 13:07:09 +01:00
Asias He	95849371aa	range_streamer: Remove unordered_multimap usage We need the mapping between dht::token_range to std::vector<inet_address> and inet_address to dht::token_range_vector in various places. Currently, we use std::unordered_multimap and convert to std::unordered_map. It is better to use std::unordered_map in the first place. The changes like below: - Change from std::unordered_multimap<dht::token_range, inet_address> to std::unordered_map<dht::token_range, std::vector<inet_address>> - Change from std::unordered_multimap<inet_address, dht::token_range> to std::unordered_map<inet_address, dht::token_range_vector> Message-Id: <b8ecc41775e46ec064db3ee07510c404583390aa.1533106019.git.asias@scylladb.com>	2018-08-01 13:01:41 +03:00
Asias He	4a0b561376	storage_service: Get rid of moving operation The moving operation changes a node's token to a new token. It is supported only when a node has one token. The legacy moving operation was useful in the early days before vnodes were introduced, when a node had only one token. I don't think it is useful anymore. In the future, we might support adjusting the number of vnodes to rebalance the token range each node owns. Removing it simplifies the cluster operation logic and code. Fixes #3475 Message-Id: <144d3bea4140eda550770b866ec30e961933401d.1533111227.git.asias@scylladb.com>	2018-08-01 11:18:17 +03:00
Nadav Har'El	3194ce16b3	repair: fix combination of "-pr" and "-local" repair options When nodetool repair is used with the combination of the "-pr" (primary range) and "-local" (only repair with nodes in the same DC) options, Scylla needs to define the "primary ranges" differently: Rather than assign one node in the entire cluster to be the primary owner of every token, we need one node in each data-center - so that a "-local" repair will cover all the tokens. Fixes #3557. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20180701132445.21685-1-nyh@scylladb.com>	2018-07-01 16:39:33 +03:00
Vlad Zolotarov	2dde372ae6	locator::ec2_multi_region_snitch: don't call for ec2_snitch::gossiper_starting() ec2_snitch::gossiper_starting() calls the base class (default) method, which sets _gossip_started to TRUE and thereby prevents the following reconnectable_snitch_helper registration. Fixes #3454 Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> Message-Id: <1528208520-28046-1-git-send-email-vladz@scylladb.com>	2018-06-06 12:00:17 +03:00
Botond Dénes	536a32bb5e	query_singular(): return the used replicas This patch implements the last_replicas returning part of the query() signature changes for singular queries. It allows client code to save the last returned replicas and pass them to query() on the next page as the preferred-replicas parameter, thus facilitating read requests for the next page hitting the same replicas.	2018-03-13 10:34:34 +02:00
Avi Kivity	5f2600a71d	migration_manager: remove dependency on messaging_service.hh in header Use the new msg_addr.hh header to remove a dependency on messaging_service.hh.	2018-03-12 20:05:23 +02:00
Avi Kivity	af383228fb	locator: remove empty file locator.cc Empty but for compiler-time-consuming includes. Message-Id: <20180312073018.21646-1-avi@scylladb.com>	2018-03-12 10:32:26 +01:00
Avi Kivity	29d0a46220	locator: add copyright and license statements to production_snitch_base.cc Message-Id: <20180312073104.21840-1-avi@scylladb.com>	2018-03-12 10:30:48 +01:00
Avi Kivity	b946f8b308	locator: de-inline reconnectable_snitch_helper Reduce dependencies by de-inlining reconnectable_snitch_helper. A new home is found in production_snitch_base.cc, which is somewhat related.	2018-03-11 18:31:05 +02:00
Avi Kivity	84004a2574	locator: de-inline production_snitch_base De-inlining allows us to remove some dependencies, and those functions are too complex to inline anyway. A few always-throwing functions get the [[noreturn]] attribute to avoid damaging code generation.	2018-03-11 18:22:49 +02:00
Botond Dénes	ee307751e6	token_metadata: make get_host_id() and get_endpoint_for_host() const Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <febcb558848f8e06661bba592263e55e3192ed47.1519741336.git.bdenes@scylladb.com>	2018-02-27 16:29:13 +02:00
Pekka Enberg	f1f691b555	Merge "Add the GoogleCloudSnitch" from Vlad "This series adds the GoogleCloudSnitch. Fixes #1619" * 'google-cloud-snitch-v4' of https://github.com/vladzcloudius/scylla: config: uncomment/add the supported snitches description tests: added gce_snitch_test locator::gce_snitch: implementation of the GoogleCloudSnitch locator::snitch_base: properly log the failure during the snitch startup	2018-02-19 15:58:56 +02:00
Asias He	c17ce79977	token_metadata: Handle affected_ranges with do_for_each affected_ranges can be very large in a large cluster or a node with a big num_tokens count. calculate_natural_endpoints takes more time to process in this case as well. Futurize calculate_pending_ranges_for_leaving and handle the loop with do_for_each to give some time for the reactor to breathe, so it does not block.	2018-02-13 19:00:43 +08:00
Asias He	60143a7517	token_metadata: Split token_metadata::calculate_pending_ranges token_metadata::calculate_pending_ranges is a complicated function. Split it into 3 parts for leaving operation, moving operation, bootstrap operation.	2018-02-13 19:00:43 +08:00
Asias He	1834dd023f	token_metadata: Futurize calculate_pending_ranges Now, do_update_pending_ranges is futurized. We can finally futurize token_metadata::calculate_pending_ranges in order to convert the loops inside it to do_for_each instead of plain for loops to avoid reactor stalls.	2018-02-13 19:00:43 +08:00
Asias He	96266fc76a	token_metadata: Speed up token_metadata::get_endpoint token_metadata::calculate_pending_ranges -> abstract_replication_strategy::calculate_natural_endpoints -> token_metadata::get_endpoint() With std::map INFO 2018-02-09 14:58:32,960 [shard 0] token_metadata - In calculate_pending_ranges: affected_ranges.size=6145 stars Reactor stalled for 4000 ms on shard 0. Backtrace: 0x00000000004b12cb 0x00000000004b1561 /lib64/libpthread.so.0+0x00000000000123af 0x0000000001159e25 0x00000000011581eb 0x000000000114f122 0x000000000119f8c7 0x00000000011985a4 0x00000000011a7e16 0x0000000001364741 0x00000000013fe9fd 0x00000000013ff792 0x00000000014024b2 0x000000000141a66f 0x000000000141d7be 0x00000000010ed234 0x000000000112fdaa 0x00000000011301f4 0x000000000043543d INFO 2018-02-09 14:58:35,993 [shard 0] token_metadata - In calculate_pending_ranges: affected_ranges.size=6145 ends With std::unordered_map INFO 2018-02-09 14:47:50,251 [shard 0] token_metadata - In calculate_pending_ranges: affected_ranges.size=6145 stars INFO 2018-02-09 14:47:51,585 [shard 0] token_metadata - In calculate_pending_ranges: affected_ranges.size=6145 ends	2018-02-13 19:00:42 +08:00
Vlad Zolotarov	8ae2996bf8	locator::gce_snitch: implementation of the GoogleCloudSnitch This is a snitch that should be used when Scylla runs in GCE VMs in both single and multi data center (DC) configurations. This snitch interacts with the GCE (instance metadata) API, as described at https://cloud.google.com/compute/docs/storing-retrieving-metadata, similarly to how ec2_snitchXXX interacts with the AWS API. However, unlike ec2_multi_region_snitch, the GCE snitch only gets the instance's zone and sets the DC and the RACK based on it, e.g. for us-central1-a the DC is set to 'us-central' and the RACK to 'a'. GCE snitch doesn't have to learn the internal and external IPs of the instance because in GCE instances from different regions can interact using internal IPs (in AWS they can't). Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2018-02-05 09:57:03 -05:00
Vlad Zolotarov	0a8549abf1	locator::snitch_base: properly log the failure during the snitch startup Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2018-02-05 09:49:54 -05:00
Duarte Nunes	2f05d7423a	locator/reconnectable_snitch_helper: Avoid versioned_value copies Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-10-11 10:02:32 +01:00
Duarte Nunes	28d63a76df	locator/production_snitch_base: Cleanup get_endpoint_info() Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-10-11 10:02:32 +01:00
Duarte Nunes	ceebbe14cc	gossiper: Avoid endpoint_state copies gossiper::get_endpoint_state_for_endpoint() returns a copy of endpoint_state, which we've seen can be very expensive. This patch adds a similar function which returns a pointer instead, and changes the call sites where using the pointer-returning variant is deemed safe (the pointer neither escapes the function, nor crosses any defer point). Fixes #764 Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-10-10 13:48:02 +01:00
Tomasz Grabiec	46c7e06e56	locator: Optimize token_metadata::is_member() Currently it's linear in the number of tokens in the system in the worst case. We could use the knowledge which _topology has to make it O(1). Fixes #2873. Message-Id: <1507630182-13410-1-git-send-email-tgrabiec@scylladb.com>	2017-10-10 14:27:54 +03:00
Calle Wilund	dd2b8821a4	everywhere_strategy: Make get_natural_endpoints handle non-init state Make get_natural_endpoints return local address iff token metadata is not yet setup (since that is the one address we already know of). If a request has a consistency level requiring more endpoints, it will still fail, but for calls with, for example, CL=ONE, at startup we will succeed, and more or less act like local strategy. Yet, further down the line, have data distributed as desired. Acked-by: Gleb Natapov <gleb@scylladb.com> Message-Id: <20170926113512.15707-1-calle@scylladb.com>	2017-09-26 15:21:30 +03:00
Asias He	0ec574610d	locator: Get rid of assert in token_metadata In commit `69c81bcc87` (repair: Do not allow repair until node is in NORMAL status), we saw a coredump due to an assert in token_metadata::first_token_index. Throw an exception instead of aborting the whole scylla process. Message-Id: <c110645cee1ee3897e30a3ae1b7ab3f49c97412c.1504752890.git.asias@scylladb.com>	2017-09-14 10:33:02 +03:00
Avi Kivity	27d3ab20a9	locator: add missing include "log.hh" It's currently made available via another include, which is going away.	2017-08-27 15:17:05 +03:00
Avi Kivity	ebaeefa02b	Merge seastar upstream (seastar namespace) - introduced "seastarx.hh" header, which does a "using namespace seastar"; - 'net' namespace conflicts with seastar::net, renamed to 'netw'. - 'transport' namespace conflicts with seastar::transport, renamed to cql_transport. - "logger" global variables now conflict with logger global type, renamed to xlogger. - other minor changes	2017-05-21 12:26:15 +03:00
Vlad Zolotarov	181c68e97d	token_metadata::get_host_id(ep): add a missing 'throw' Caught by PVS-Studio static analyzer: The object was created but it is not being used. The 'throw' keyword could be missing: throw runtime_error(FOO); Reported-by: Phillip Khandeliants Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-04-26 14:54:34 -04:00
Asias He	937f28d2f1	Convert to use dht::partition_range_vector and dht::token_range_vector	2016-12-19 14:08:50 +08:00
Asias He	e5485f3ea6	Get rid of query::partition_range Use dht::partition_range instead	2016-12-19 08:09:25 +08:00
Asias He	d1178fa299	Convert to use dht::token_range	2016-12-19 08:04:29 +08:00
Asias He	e523803a5d	token_metadata: Introduce interval_to_range helper It is used to convert a boost::icl::interval<token> interval back to a range<token>.	2016-12-12 11:09:26 +08:00
Avi Kivity	a35136533d	Convert ring_position and token ranges to be nonwrapping Wrapping ranges are a pain, so we are moving wrap handling to the edges. Since cql can't generate wrapping ranges, this means thrift and the ring maintenance code; also range->ring transformations need to merge the first and last ranges. Message-Id: <1478105905-31613-1-git-send-email-avi@scylladb.com>	2016-11-02 21:04:11 +02:00
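
Note on c96fc1d585 (row level repair): the set reconciliation example in that commit message (Node1, Node2, Node3) can be reproduced with plain set operations. The following is only a minimal sketch, not Scylla's implementation. The names `row_set`, `difference` and `union_all` are hypothetical, chosen for illustration, and rows are modeled as strings rather than the row hashes the commit message describes.

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for the set of rows a node holds in one sync range.
// The commit message compares row hashes; plain strings are used here.
using row_set = std::set<std::string>;

// Rows present in `a` but absent from `b`, i.e. (a - b).
// std::set iterates in sorted order, as std::set_difference requires.
row_set difference(const row_set& a, const row_set& b) {
    row_set out;
    std::set_difference(a.begin(), a.end(), b.begin(), b.end(),
                        std::inserter(out, out.end()));
    return out;
}

// Union of the repair master's rows and every peer's rows.
row_set union_all(const row_set& master, const std::vector<row_set>& peers) {
    row_set out = master;
    for (const auto& peer : peers) {
        out.insert(peer.begin(), peer.end());
    }
    return out;
}
```

With set1 = {row1, row2, row3}, set2 = {row2, row3} and set3 = {row1, row2, row4}, `difference(set3, set1)` gives {row4} (what Node1 fetches from Node3) and `difference(set2, set1)` is empty. Taking `all = union_all(set1, {set2, set3})`, `difference(all, set2)` gives {row1, row4} and `difference(all, set3)` gives {row3}, which are the rows Node1 sends to Node2 and Node3 in the commit message.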

214 Commits