scylladb

Author	SHA1	Message	Date
Rafael Ávila de Espíndola	fd5ea2df5a	Avoid including cryptopp headers cryptopp's config.h has the following pragma: #pragma GCC diagnostic ignored "-Wunused-function" It is not wrapped in a push/pop. Because of that, including cryptopp headers disables that warning on scylla code too. The issue has been reported as https://github.com/weidai11/cryptopp/issues/793 To work around it, this patch uses a pimpl to have a single .cc file that has to include cryptopp headers. While at it, it also reduces the differences and code duplication between the md5 and sha1 hashers. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2019-02-20 08:03:46 -08:00
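The pimpl arrangement described in fd5ea2df5a can be sketched in one block as follows. This is only an illustration: the class name `hasher`, its methods, and the byte-sum "hash" are hypothetical stand-ins, and the real patch wraps cryptopp's md5 and sha1 rather than anything shown here. The point is that the class declaration never needs the third-party header, so only one `.cc` file is exposed to its pragmas.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string_view>

// --- What would live in the header: no third-party includes needed. ---
class hasher {
    struct impl;                    // only declared here; defined in the .cc file
    std::unique_ptr<impl> _impl;
public:
    hasher();
    ~hasher();                      // defined where impl is a complete type
    void update(std::string_view data);
    std::uint64_t finalize() const;
};

// --- What would live in the single .cc file. ---
// The third-party headers would be included here and nowhere else.
// This sketch uses a plain byte sum as a stand-in (an assumption, not a hash).
struct hasher::impl {
    std::uint64_t sum = 0;
};

hasher::hasher() : _impl(std::make_unique<impl>()) {}
hasher::~hasher() = default;

void hasher::update(std::string_view data) {
    assert(_impl);
    for (unsigned char c : data) {
        _impl->sum += c;            // accumulate each byte
    }
}

std::uint64_t hasher::finalize() const {
    return _impl->sum;
}
```

Because `impl` is incomplete in the header, the destructor has to be defined out of line, which is why `~hasher()` is declared explicitly.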
Benny Halevy	b6ad61d2e5	dht: move declaration of default_partitioner from sstable_datafile_test to i_partitioner.hh So it can be used by other tests Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2019-02-14 22:16:52 +02:00
Duarte Nunes	fa2b0384d2	Replace std::experimental types with C++17 std version. Replace stdx::optional and stdx::string_view with the C++ std counterparts. Some instances of boost::variant were also replaced with std::variant, namely those that called seastar::visit. Scylla now requires GCC 8 to compile. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20190108111141.5369-1-duarte@scylladb.com>	2019-01-08 13:16:36 +02:00
Avi Kivity	f02c64cadf	streaming: stream_session: remove include of db/view/view_update_from_staging_generator.hh This header, which is easily replaced with a forward declaration, introduces a dependency on database.hh everywhere. Remove it and scatter includes of database.hh in source files that really need it.	2019-01-05 17:33:25 +02:00
Avi Kivity	c96fc1d585	Merge "Introduce row level repair" from Asias " === How the partition level repair works - The repair master decides which ranges to work on. - The repair master splits the ranges to sub ranges which contain around 100 partitions. - The repair master computes the checksum of the 100 partitions and asks the related peers to compute the checksum of the 100 partitions. - If the checksum matches, the data in this sub range is synced. - If the checksum mismatches, repair master fetches the data from all the peers and sends back the merged data to peers. === Major problems with partition level repair - A mismatch of a single row in any of the 100 partitions causes 100 partitions to be transferred. A single partition can be very large. Not to mention the size of 100 partitions. - Checksum (find the mismatch) and streaming (fix the mismatch) will read the same data twice === Row level repair Row level checksum and synchronization: detect row level mismatch and transfer only the mismatch === How the row level repair works - To solve the problem of reading data twice Read the data only once for both checksum and synchronization between nodes. We work on a small range which contains only a few megabytes of rows. We read all the rows within the small range into memory. Find the mismatch and send the mismatch rows between peers. We need to find a sync boundary among the nodes which contains only N bytes of rows. - To solve the problem of sending unnecessary data. We need to find the mismatched rows between nodes and only send the delta. The problem is called set reconciliation problem which is a common problem in distributed systems. For example: Node1 has set1 = {row1, row2, row3} Node2 has set2 = { row2, row3} Node3 has set3 = {row1, row2, row4} To repair: Node1 fetches nothing from Node2 (set2 - set1), fetches row4 (set3 - set1) from Node3. 
Node1 sends row1 and row4 (set1 + set2 + set3 - set2) to Node2 Node1 sends row3 (set1 + set2 + set3 - set3) to Node3. === How to implement repair with set reconciliation - Step A: Negotiate sync boundary class repair_sync_boundary { dht::decorated_key pk; position_in_partition position } Reads rows from disk into row buffers until the size is larger than N bytes. Return the repair_sync_boundary of the last mutation_fragment we read from disk. The smallest repair_sync_boundary of all nodes is set as the current_sync_boundary. - Step B: Get missing rows from peer nodes so that repair master contains all the rows Request combined hashes from all nodes between last_sync_boundary and current_sync_boundary. If the combined hashes from all nodes are identical, data is synced, goto Step A. If not, request the full hashes from peers. At this point, the repair master knows exactly what rows are missing. Request the missing rows from peer nodes. Now, local node contains all the rows. - Step C: Send missing rows to the peer nodes Since local node also knows what peer nodes own, it sends the missing rows to the peer nodes. === How the RPC API looks like - repair_range_start() Step A: - request_sync_boundary() Step B: - request_combined_row_hashes() - reqeust_full_row_hashes() - request_row_diff() Step C: - send_row_diff() - repair_range_stop() === Performance evaluation We created a cluster of 3 Scylla nodes on AWS using i3.xlarge instance. We created a keyspace with a replication factor of 3 and inserted 1 billion rows to each of the 3 nodes. Each node has 241 GiB of data. We tested 3 cases below. 1) 0% synced: one of the node has zero data. The other two nodes have 1 billion identical rows. Time to repair: old = 87 min new = 70 min (rebuild took 50 minutes) improvement = 19.54% 2) 100% synced: all of the 3 nodes have 1 billion identical rows. 
Time to repair: old = 43 min new = 24 min improvement = 44.18% 3) 99.9% synced: each node has 1 billion identical rows and 1 billion * 0.1% distinct rows. Time to repair: old: 211 min new: 44 min improvement: 79.15% Bytes sent on wire for repair: old: tx= 162 GiB, rx = 90 GiB new: tx= 1.15 GiB, rx = 0.57 GiB improvement: tx = 99.29%, rx = 99.36% It is worth noting that row level repair sends and receives exactly the number of rows needed in theory. In this test case, repair master needs to receive 2 million rows and send 4 million rows. Here are the details: Each node has 1 billion * 0.1% distinct rows, that is 1 million rows. So repair master receives 1 million rows from repair slave 1 and 1 million rows from repair slave 2. Repair master sends 1 million rows from repair master and 1 million rows received from repair slave 1 to repair slave 2. Repair master sends 1 million rows from repair master and 1 million rows received from repair slave 2 to repair slave 1. In the result, we saw the rows on wire were as expected. 
tx_row_nr = 1000505 + 999619 + 1001257 + 998619 (4 shards, the numbers are for each shard) = 4'000'000 rx_row_nr = 500233 + 500235 + 499559 + 499973 (4 shards, the numbers are for each shard) = 2'000'000 Fixes: #3033 Tests: dtests/repair_additional_test.py " * 'asias/row_level_repair_v7' of github.com:cloudius-systems/seastar-dev: (51 commits) repair: Enable row level repair repair: Add row_level_repair repair: Add docs for row level repair repair: Add repair_init_messaging_service_handler repair: Add repair_meta repair: Add repair_writer repair: Add repair_reader repair: Add repair_row repair: Add fragment_hasher repair: Add decorated_key_with_hash repair: Add get_random_seed repair: Add get_common_diff_detect_algorithm repair: Add shard_config repair: Add suportted_diff_detect_algorithms repair: Add repair_stats to repair_info repair: Introduce repair_stats flat_mutation_reader: Add make_generating_reader storage_service: Introduce ROW_LEVEL_REPAIR feature messaging_service: Add RPC verbs for row level repair repair: Export the repair logger ...	2018-12-25 13:13:00 +02:00
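The three-node set reconciliation example in c96fc1d585 can be worked through with plain set operations. The sketch below is not the repair implementation: `row_set`, `difference`, `merge`, `repair_plan`, and `plan_repair` are hypothetical names, and rows are modelled as strings instead of hashed mutation fragments. It only reproduces the arithmetic of which rows Node1 fetches and which rows it sends.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <string>

using row_set = std::set<std::string>;

// Rows present in `a` but missing from `b` (a - b).
row_set difference(const row_set& a, const row_set& b) {
    row_set out;
    std::set_difference(a.begin(), a.end(), b.begin(), b.end(),
                        std::inserter(out, out.end()));
    return out;
}

// Union of two row sets (a + b).
row_set merge(const row_set& a, const row_set& b) {
    row_set out = a;
    out.insert(b.begin(), b.end());
    return out;
}

struct repair_plan {
    row_set fetch_from_node2, fetch_from_node3;  // what Node1 is missing
    row_set send_to_node2, send_to_node3;        // what each peer is missing
};

// Node1 acts as repair master: first pull what it lacks, then push to each
// peer whatever that peer lacks relative to the union of all sets.
repair_plan plan_repair(const row_set& set1, const row_set& set2, const row_set& set3) {
    repair_plan p;
    p.fetch_from_node2 = difference(set2, set1);
    p.fetch_from_node3 = difference(set3, set1);
    row_set all = merge(merge(set1, set2), set3);
    assert(all.size() >= set1.size());
    p.send_to_node2 = difference(all, set2);
    p.send_to_node3 = difference(all, set3);
    return p;
}
```

With the sets from the commit message, Node1 fetches nothing from Node2 and row4 from Node3, then sends row1 and row4 to Node2 and row3 to Node3.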
Botond Dénes	1865e5da41	treewide: remove include database.hh from headers where possible Many headers don't really need to include database.hh, the include can be replaced by forward declarations and/or including the actually needed headers directly. Some headers don't need this include at all. Each header was verified to be compilable on its own after the change, by including it into an empty `.cc` file and compiling it. `.cc` files that used to get `database.hh` through headers that no longer include it were changed to include it themselves.	2018-12-14 08:03:57 +02:00
Asias He	1367c8c47e	dht: Add make_partitioner Given the name and shard count and the sharding_ignore_msb_bits, make a partitioner. It is used by row level repair.	2018-12-12 16:49:01 +08:00
Asias He	f1a914060b	dht: Add constructor for decorated_key which takes token and partition_key decorated_key(const dht::token& t, const partition_key& k)	2018-12-12 16:49:01 +08:00
Avi Kivity	864f55e745	config: remove inclusions of db/config.hh from header files Instead, distribute those inclusions to .cc files that require them. This reduces rebuilds when config.hh changes, and makes it easier to locate files that need config disaggregation.	2018-12-09 20:11:38 +02:00
Michael Munday	53fdde75f6	dht: use little endian byte order explicitly for token hash This avoids a difference between little and big endian systems. We now also calculate a full murmur hash for tokens with less than 8 bytes, however in practice the token size is always 8. Message-Id: <20181120214733.43800-1-mike.munday@ibm.com>	2018-11-21 11:44:29 +02:00
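One common way to make byte order explicit, as 53fdde75f6 describes, is to assemble the integer from individual bytes with shifts instead of reinterpreting memory. The helper below is a hypothetical sketch (the name `read_le64` is not from the patch) showing that the result is the same on little- and big-endian hosts.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assemble a 64-bit value from up to 8 bytes, treating data[0] as the least
// significant byte. Shifts operate on values, not memory layout, so the
// result does not depend on the host's byte order.
std::uint64_t read_le64(const std::uint8_t* data, std::size_t len) {
    assert(len <= 8);
    std::uint64_t v = 0;
    for (std::size_t i = 0; i < len; ++i) {
        v |= static_cast<std::uint64_t>(data[i]) << (8 * i);
    }
    return v;
}
```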
Avi Kivity	775b7e41f4	Update seastar submodule * seastar d59fcef...b924495 (2): > build: Fix protobuf generation rules > Merge "Restructure files" from Jesse Includes fixup patch from Jesse: " Update Seastar `#include`s to reflect restructure All Seastar header files are now prefixed with "seastar" and the configure script reflects the new locations of files. Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com> Message-Id: <5d22d964a7735696fb6bb7606ed88f35dde31413.1542731639.git.jhaberku@scylladb.com> "	2018-11-21 00:01:44 +02:00
Michael Munday	b9a2f4a228	dht: fix byte ordered partitioner midpoint calculation New versions of boost saturate the output of the convert_to method so we need to mask the part we want to extract. Updates #3922. Message-Id: <20181116191441.35000-1-mike.munday@ibm.com>	2018-11-16 21:19:06 +02:00
Avi Kivity	82818758ca	dht: convert sprint() to format() sprint() recently became more strict, throwing on sprint("%s", 5). Replace with the more modern format(). Mechanically converted with https://github.com/avikivity/unsprint.	2018-11-01 13:16:17 +00:00
Avi Kivity	7ff5569ee8	dht: fix bad format string syntax Some sprint() calls use the fmt language instead of the printf syntax. Convert them all the way to format().	2018-11-01 13:16:17 +00:00
Duarte Nunes	e46ef6723b	Merge seastar upstream * seastar d152f2d...c1e0e5d (6): > scripts: perftune.py: properly merge parameters from the command line and the configuration file > fmt: update to 5.2.1 > io_queue: only increment statistics when request is admitted > Adds `read_first_line.cc` and `read_first_line.hh` to CMake. > fstream: remove default extent allocation hint > core/semaphore: Change the access of semaphore_units main ctor Due to a compile-time fight between fmt and boost::multiprecision, a lexical_cast was added to mediate. sprint("%s", var) no longer accepts numeric values, so some sprint()s were converted to format() calls. Since more may be lurking we'll need to remove all sprint() calls. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-10-25 12:53:30 +03:00
Asias He	7f826d3343	streaming: Expose reason for streaming On receiving a mutation_fragment or a mutation triggered by a streaming operation, we pass an enum stream_reason to notify the receiver what the streaming is used for. So the receiver can decide further operation, e.g., send view updates, beyond applying the streaming data on disk. Fixes #3276 Message-Id: <f15ebcdee25e87a033dcdd066770114a499881c0.1539498866.git.asias@scylladb.com>	2018-10-15 22:03:28 +01:00
Benny Halevy	7eef527769	handle both special token_kinds in dht::tri_compare Handle the before_all_keys and after_all_keys token_kind at the highest layer before calling into the virtual i_partitioner::tri_compare that is not set up to handle these cases. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20181015165612.29356-1-bhalevy@scylladb.com>	2018-10-15 20:00:54 +03:00
Asias He	8edf3defdf	range_streamer: Futurize add_ranges It might take a long time for get_all_ranges_with_sources_for and get_all_ranges_with_strict_sources_for to run, which causes reactor stalls. To fix, run them in a thread and yield. Those functions are used in the slow path, so it is ok to yield more than needed. Fixes #3639 Message-Id: <63aa7794906ac020c9d9b2984e1351a8298a249b.1536135617.git.asias@scylladb.com>	2018-10-09 09:46:50 +03:00
Botond Dénes	867f69b9d1	dht::i_partitioner: add partition_ranges_view	2018-09-03 10:31:44 +03:00
Asias He	95849371aa	range_streamer: Remove unordered_multimap usage We need the mapping between dht::token_range to std::vector<inet_address> and inet_address to dht::token_range_vector in various places. Currently, we use std::unordered_multimap and convert to std::unordered_map. It is better to use std::unordered_map in the first place. The changes like below: - Change from std::unordered_multimap<dht::token_range, inet_address> to std::unordered_map<dht::token_range, std::vector<inet_address>> - Change from std::unordered_multimap<inet_address, dht::token_range> to std::unordered_map<inet_address, dht::token_range_vector> Message-Id: <b8ecc41775e46ec064db3ee07510c404583390aa.1533106019.git.asias@scylladb.com>	2018-08-01 13:01:41 +03:00
Asias He	4a0b561376	storage_service: Get rid of moving operation The moving operation changes a node's token to a new token. It is supported only when a node has one token. The legacy moving operation is useful in the early days before the vnode is introduced where a node has only one token. I don't think it is useful anymore. In the future, we might support adjusting the number of vnodes to rebalance the token range each node owns. Removing it simplifies the cluster operation logic and code. Fixes #3475 Message-Id: <144d3bea4140eda550770b866ec30e961933401d.1533111227.git.asias@scylladb.com>	2018-08-01 11:18:17 +03:00
Nadav Har'El	25bd139508	cross-tree: clean up use of std::random_device() std::random_device() uses the relatively slow /dev/urandom, and we rarely if ever intend to use it directly - we normally want to use it to seed a faster random_engine (a pseudo-random number generator). In many places in the code, we first created a random_device variable, and then using it created a random_engine variable. However, this practice created the risk of a programmer accidentally using the random_device object, instead of the random_engine object, because both have the same API; this hurts performance. This risk materialized in just two places in the code, utils/uuid.cc and gms/gossiper.cc. A patch for uuid.cc was sent previously by Pawel and is not included in this patch, and the fix for gossiper.{cc,hh} is included here. To avoid risking the same mistake in the future, this patch switches across the code to an idiom where the random_device object is not named, so cannot be accidentally used. We use the following idiom: std::default_random_engine _engine{std::random_device{}()}; Here std::random_device{}() creates the random device (/dev/urandom) and pulls a random integer from it. It then uses this seed to create the random_engine (the pseudo-random number generator). The std::random_device{} object is temporary and unnamed, and cannot be unintentionally used directly. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20180726154958.4405-1-nyh@scylladb.com>	2018-07-26 16:54:58 +01:00
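The seeding idiom quoted in 25bd139508 looks like this when placed in a class. The `dice` class and its `roll` method are hypothetical and exist only to give the member a home. The `_engine` declaration itself is the line from the commit message.

```cpp
#include <cassert>
#include <random>

class dice {
    // The random_device is an unnamed temporary: it is invoked once to
    // produce a seed and is gone afterwards, so no later code can call it
    // by mistake instead of the (much faster) engine.
    std::default_random_engine _engine{std::random_device{}()};
public:
    // Returns a pseudo-random integer in the closed range [lo, hi].
    int roll(int lo, int hi) {
        assert(lo <= hi);
        std::uniform_int_distribution<int> dist(lo, hi);
        return dist(_engine);
    }
};
```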
Asias He	1f06ee3960	range_streamer: Limit nr of nodes to stream in parallel For example, to bootstrap a 50th node in a cluster [shard 0] range_streamer - Bootstrap with [127.0.0.8, 127.0.0.2, 127.0.0.24, 127.0.0.21, 127.0.0.49, 127.0.0.44, 127.0.0.9, 127.0.0.7, 127.0.0.47, 127.0.0.15, 127.0.0.5, 127.0.0.30, 127.0.0.14, 127.0.0.12, 127.0.0.36, 127.0.0.11, 127.0.0.48, 127.0.0.28, 127.0.0.33, 127.0.0.10, 127.0.0.41, 127.0.0.4, 127.0.0.40, 127.0.0.3, 127.0.0.6, 127.0.0.43, 127.0.0.22, 127.0.0.26, 127.0.0.42, 127.0.0.25, 127.0.0.17, 127.0.0.37, 127.0.0.23, 127.0.0.13, 127.0.0.38, 127.0.0.1, 127.0.0.18, 127.0.0.20, 127.0.0.39, 127.0.0.27, 127.0.0.34, 127.0.0.32, 127.0.0.19, 127.0.0.16, 127.0.0.31, 127.0.0.45, 127.0.0.29, 127.0.0.35, 127.0.0.46] for keyspace=keyspace1 started, nodes_to_stream=49, nodes_in_parallel=49 the new node will get data from 49 existing nodes. Currently, it will stream from all the 49 existing nodes at the same time. It is not a good idea to stream from all the nodes in parallel which can overwhelm the bootstrap node, i.e., 49 nodes sending, 1 node receiving. To fix this, limit the nr of nodes to stream in parallel. We should have better control over the memory usage and parallelism. But for now, limit the nr of nodes to a maximum of 16 as a starter. With this limit, each shard can work with as many as 16 remote nodes in parallel, I think this has enough parallelism for streaming in terms of performance. This change affects the bootstrap/decommission/removenode node operations and does not affect repair. Refs #2782 Message-Id: <980610dc97490d4f16281a0c3203b9bee73e04e4.1531989557.git.asias@scylladb.com>	2018-07-19 11:44:05 +03:00
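The effect of the 16-node cap in 1f06ee3960 can be illustrated by grouping peers into batches. This is a deliberate simplification and an assumption: the actual change may keep a sliding window of in-flight peers rather than waiting for whole batches, and `split_into_batches` is a hypothetical helper, not a function from range_streamer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Split the peer list into consecutive groups of at most `max_parallel`
// nodes, so that no more than that many peers are streamed from at once.
std::vector<std::vector<std::string>>
split_into_batches(const std::vector<std::string>& nodes, std::size_t max_parallel) {
    assert(max_parallel > 0);
    std::vector<std::vector<std::string>> batches;
    for (std::size_t i = 0; i < nodes.size(); i += max_parallel) {
        std::size_t end = std::min(i + max_parallel, nodes.size());
        batches.emplace_back(nodes.begin() + i, nodes.begin() + end);
    }
    return batches;
}
```

For the 49-peer bootstrap in the log above, a cap of 16 gives groups of 16, 16, 16, and 1.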
Asias He	506eed325a	dht: Fix typo in boot_strapper.cc Eror -> Error Message-Id: <ab1050c526f6e70c3a365595376acde7706d86e9.1531877929.git.asias@scylladb.com>	2018-07-18 10:00:27 +03:00
Avi Kivity	f4caa418ff	Merge "Fix the "LCS data-loss bug"" from Botond " This series fixes the "LCS data-loss bug" where full scans (and everything that uses them) would miss some small percentage (> 0.001%) of the keys. This could easily lead to permanent data-loss as compaction and decommission both use full scans. `aeffbb673` worked around this bug by disabling the incremental reader selectors (the class identified as the source of the bug) altogether. This series fixes the underlying issue and reverts `aeffbb673`. The root cause of the bug is that the `incremental_reader_selector` uses the current read position to poll for new readers using `sstable_set::incremental_selector::select()`. This means that when the currently open sstables contain no partitions that would intersect with some of the yet unselected sstables, those sstables would be ignored. Solve the problem by not calling `select()` with the current read position and always pass the `next_position` returned in the previous call. This means that the traversal of the sstable-set happens at a pace defined by the sstable-set itself and this guarantees that no sstable will be jumped over. When asked for new readers the `incremental_reader_selector` will now iteratively call `select()` using the `next_position` from the previous `select()` call until it either receives some new, yet unselected sstables, or `next_position` surpasses the read position (in which case `select()` will be tried again later). The `sstable_set::incremental_selector` was not suitable in its present state to support calling `select()` with the `next_position` from a previous call as in some cases it could not make progress due to inclusiveness related ambiguities. So in preparation to the above fix `sstable_set` was updated to work in terms of ring-position instead of tokens. Ring-position can express positions in a much more fine-grained way than token, including positions after/before tokens and keys. 
This allows for a clear expression of `next_position` such that calling `select()` with it guarantees forward progress in the token-space. Tests: unit(release, debug) Refs: #3513 " * 'leveled-missing-keys/v4' of https://github.com/denesb/scylla: tests/mutation_reader_test: combined_mutation_reader_test: use SEASTAR_THREAD_TEST_CASE tests/mutation_reader_test: refactor combined_mutation_reader_test tests/mutation_reader_test: fix reader_selector related tests Revert "database: stop using incremental selectors" incremental_reader_selector: don't jump over sstables mutation_reader: reader_selector: use ring_position instead of token sstables_set::incremental_selector: use ring_position instead of token compatible_ring_position: refactor to compatible_ring_position_view dht::ring_position_view: use token_bound from ring_position i_partitioner: add free function ring-position tri comparator mutation_reader_merger::maybe_add_readers(): remove early return mutation_reader_merger: get rid of _key	2018-07-05 09:33:12 +03:00
Botond Dénes	a8e795a16e	sstables_set::incremental_selector: use ring_position instead of token Currently `sstable_set::incremental_selector` works in terms of tokens. Sstables can be selected with tokens and internally the token-space is partitioned (in `partitioned_sstable_set`, used for LCS) with tokens as well. This is problematic for several reasons. The sub-range sstables cover from the token-space is defined in terms of decorated keys. It is even possible that multiple sstables cover multiple non-overlapping sub-ranges of a single token. The current system is unable to model this and will at best result in selecting unnecessary sstables. The usage of token for providing the next position where the intersecting sstables change [1] causes further problems. Attempting to walk over the token-space by repeatedly calling `select()` with the `next_position` returned from the previous call will quite possibly lead to an infinite loop as a token cannot express inclusiveness/exclusiveness and thus the incremental selector will not be able to make progress when the upper and lower bounds of two neighbouring intervals share the same token with different inclusiveness e.g. [t1, t2](t2, t3]. To solve these problems update incremental_selector to work in terms of ring position. This makes it possible to partition the token-space among sstables at decorated key granularity. It also makes it possible for select() to return a next_position that is guaranteed to make progress. partitioned_sstable_set now builds the internal interval map using the decorated key of the sstables, not just the tokens. incremental_selector::select() now uses `dht::ring_position_view` as both the selector and the next_position. ring_position_view can express positions between keys so it can also include information about inclusiveness/exclusiveness of the next interval guaranteeing forward progress. [1] `sstable_set::incremental_selector::selection::next_position`	2018-07-04 17:42:33 +03:00
Botond Dénes	bf2645c616	compatible_ring_position: refactor to compatible_ring_position_view compatible_ring_position's sole purpose is to allow creating boost::icl::interval_map with dht::ring_position as the key and list of sstables as the value. This function is served equally well if compatible_ring_position wraps a `dht::ring_position_view` instead of a `dht::ring_position` with the added benefit of not having to copy the possibly heavy `dht::decorated_key` around. It also makes it possible to do lookups with `dht::ring_position_view` which is much more versatile and allows avoiding copies just to make lookups. The only downside is that `dht::ring_position_view` requires the lifetime of the "viewed" object to be taken care of. This is not a concern however, as so long as an interval is present in the map the represented sstable is guaranteed to be alive too, as the interval map participates in the ownership of the stored sstables. Rename compatible_ring_position to compatible_ring_position_view to reflect the changes. While at it upgrade the std::experimental::optional to std::optional.	2018-07-04 08:19:39 +03:00
Botond Dénes	48b07ba5d3	dht::ring_position_view: use token_bound from ring_position Currently dht::ring_position_view's dht::token constructor takes the token bound in the form of a raw `uint8_t`. This allows for passing a weight of "0" which is illegal as single token does not represent a single ring position but an interval as arbitrary number of keys can have the same token. dht::ring_position uses an enum in its dht::token constructor. Import that same enum into the dht::ring_position_view scope and take a `token_bound` instead of `uint8_t`. This is especially important as in later patches the internal weight of the ring_position_view will be exposed and illegal values can cause all sorts of problems.	2018-07-04 08:19:34 +03:00
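The reasoning in 48b07ba5d3, that a bare token stands for an interval of ring positions and so only a "before" or "after" bound is meaningful, can be sketched with a scoped enum. Everything here is illustrative: the enumerator values, the `token_position` struct, and this `tri_compare` are assumptions made for the example and are not copied from `dht::ring_position_view`.

```cpp
#include <cassert>
#include <cstdint>

// Only two bounds exist; there is no enumerator for a weight of 0, so the
// illegal "exactly at the token" position cannot be requested by name.
enum class token_bound : std::int8_t { start = -1, end = 1 };

// Hypothetical position built from a token: before or after all keys that
// share the token.
struct token_position {
    std::int64_t token;
    std::int8_t weight;
    token_position(std::int64_t t, token_bound b)
        : token(t), weight(static_cast<std::int8_t>(b)) {
        assert(weight == -1 || weight == 1);
    }
};

// Orders by token first, then by bound, so start(t) < end(t) < start(t + 1).
int tri_compare(const token_position& a, const token_position& b) {
    if (a.token != b.token) {
        return a.token < b.token ? -1 : 1;
    }
    if (a.weight != b.weight) {
        return a.weight < b.weight ? -1 : 1;
    }
    return 0;
}
```

Taking the enum instead of a raw `uint8_t` moves the check from a runtime convention to the type of the constructor parameter.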
Botond Dénes	01bd34d117	i_partitioner: add free function ring-position tri comparator Having to create an object just to compare two ring positions (or views) is annoying and unnecessary. Provide a free function version as well.	2018-07-02 11:41:09 +03:00
Avi Kivity	db2c029f7a	dht: add i_partitioner::sharding_ignore_msb() While the sharding algorithm is exposed (as cpu_sharding_algorithm_name()), the ignore_msb parameter is not. Add a function to do that.	2018-07-01 12:17:35 +03:00
Asias He	27cb41ddeb	range_streamer: Use float for time took for stream It is useful when the total time to stream is small, e.g, 2.0 seconds and 2.9 seconds. Showing the time as an integer number of seconds is not accurate in such cases. Message-Id: <d801b57279981c72acb907ad4b0190ba4d938a3d.1530175052.git.asias@scylladb.com>	2018-06-28 11:39:14 +03:00
Asias He	d23dafa7ac	dht: Remove column_families parameter in add_rx_ranges and add_tx_ranges In `4b1034b` (storage_service: Remove the stream_hints), we removed the only user of the api with the column_families parameter. std::vector column_families = { db::system_keyspace::HINTS }; streamer->add_tx_ranges(keyspace, std::move(ranges_per_endpoint), column_families); We can simplify the code range_streamer a bit by removing it. Fixes #3476 Tests: dtest update_cluster_layout_tests.py Message-Id: <c81d79c5e6dbc8dd78c1242837de892e39d6abd2.1528356342.git.asias@scylladb.com>	2018-06-10 14:53:40 +03:00
Glauber Costa	250d9332dc	partitioner: export the name of the algorithm used to do intra-node sharding We will export this on system tables. To avoid hard-coding it in the system table level, keep it at least in the dht layer where it belongs. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-06-04 11:25:58 -04:00
Avi Kivity	9eb7c0c65b	Merge "Remove (some) reactor stalls in the SSTable code" from Glauber " This is an improvement on my latest series. Instead of just dealing with the problem of destroying the Summary that I have identified in a previous test, I have tried to find other sources of stalls. Some of them are on readers and would affect early processes and operations like nodetool refresh. Others are on writers, which can affect any SSTable being written. Two of those stalls (on large filter, on summary read), I saw in a synthetic benchmark where I used very small values + nodetool compact to generate one SSTable with many keys. They were 80ms and 20ms respectively, and now they are totally gone. For others, I just tried to be safe (for instance, if we know reading/writing large vectors can be costly, just always insert preemption points in them). With all of these patches applied, I no longer see stalls coming from the SSTable code in those tests (although given enough time, I am sure I can find more). Tests: unit (release) Fixes: #3282, Fixes #3281, Fixes #3269 " * 'sstables-stalls-v3-updated' of github.com:glommer/scylla: large_bitset/bloom filter: add preemption points in loops sstables: read filter in a thread abstract summary entry version of the token with a token view add a token_view sstables: rework summary entries reading sstables: avoid calls to resize for vectors sstables: replace potentially large for loop with do_until summary_entry: do not store key bytes in each summary entry tests: change tests to make summary non-copyable chunked_vector: do not iterate to destruct trivially destructible types	2018-03-16 09:43:36 +01:00
Glauber Costa	dddc7e1676	add a token_view Ideally we would like tokens to be trivially destructible, so that we can easily dispose of giant vectors holding them. While that is hard to do with our current infrastructure, we can introduce a token_view, which holds a bytes_view elements instead of the real data - making it trivially destructible. The comparators are then changed to take a token_view, and an implicit conversion function is provided from tokens so they get compared. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-03-15 12:24:09 -04:00
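The idea behind dddc7e1676 is that a non-owning view is trivially destructible, so a large container of views can be dropped without running a destructor per element. Below is a minimal sketch under stated assumptions: `std::string` and `std::string_view` stand in for the project's `bytes` and `bytes_view`, and the implicit conversion is written as a converting constructor, which may differ from how the patch provides it.

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <type_traits>

// Owning token: holds its bytes, so its destructor has real work to do.
struct token {
    std::string data;
};

// Non-owning view over a token's bytes.
struct token_view {
    std::string_view data;
    token_view(std::string_view d) : data(d) {}
    token_view(const token& t) : data(t.data) {}  // implicit conversion from token
};

static_assert(!std::is_trivially_destructible_v<token>);
static_assert(std::is_trivially_destructible_v<token_view>);

// The comparator takes views; tokens convert implicitly at the call site.
int tri_compare(token_view a, token_view b) {
    int c = a.data.compare(b.data);
    return c < 0 ? -1 : (c > 0 ? 1 : 0);
}
```

The usual caveat for views applies: the viewed token must outlive every `token_view` that refers to it.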
Asias He	9b5585ebd5	range_streamer: Stream 10% of ranges instead of 10 ranges per time If there are a lot of ranges, e.g., num_tokens=2048, 10 ranges per stream plan will cause tons of stream plans to be created to stream data, each carrying very little data. This causes each stream plan to have low transfer bandwidth, so that the total time to complete the streaming increases. It makes more sense to send a percentage of the total ranges per stream plan than a fixed number of ranges. Here is an example to stream a keyspace with 513 ranges in total, 10 ranges v.s. 10% ranges: Before: [shard 0] range_streamer - Bootstrap with 127.0.0.1 for keyspace=system_traces, 510 out of 513 ranges: ranges = 51 [shard 0] range_streamer - Bootstrap with ks for keyspace=127.0.0.1 succeeded, took 107 seconds After: [shard 0] range_streamer - Bootstrap with 127.0.0.1 for keyspace=system_traces, 510 out of 513 ranges: ranges = 10 [shard 0] range_streamer - Bootstrap with ks for keyspace=127.0.0.1 succeeded, took 22 seconds Message-Id: <a890b84fbac0f3c3cc4021e30dbf4cdf135b93ea.1520992228.git.asias@scylladb.com>	2018-03-14 10:12:12 +02:00
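The arithmetic behind 9b5585ebd5 is easy to check by hand. The two helpers below are hypothetical, and the floor of one range per plan is an assumption added so that small inputs still make progress. For the 513-range keyspace in the example, 10% gives 51 ranges per plan and 11 plans, versus 52 plans at a fixed 10 ranges each.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Ranges carried by each stream plan: 10% of the total, rounded down,
// but never fewer than 1 (the floor is an assumption of this sketch).
std::size_t ranges_per_stream_plan(std::size_t total_ranges) {
    return std::max<std::size_t>(1, total_ranges / 10);
}

// Number of stream plans needed to cover all ranges (ceiling division).
std::size_t stream_plan_count(std::size_t total_ranges, std::size_t per_plan) {
    assert(per_plan > 0);
    return (total_ranges + per_plan - 1) / per_plan;
}
```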
Asias He	73d8e2743f	dht: Fix log in range_streamer The address and keyspace should be swapped. Before: range_streamer - Bootstrap with ks3 for keyspace=127.0.0.1 succeeded, took 56 seconds After: range_streamer - Bootstrap with 127.0.0.1 for keyspace=ks3 succeeded, took 56 seconds Message-Id: <5c49646f1fbe45e3a1e7545b8470e04b166922c4.1520416042.git.asias@scylladb.com>	2018-03-07 11:49:58 +02:00
Raphael S. Carvalho	19d994cfff	dht: make it easier to create ring_position_view from token that's done by adding a separate explicit constructor Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2018-01-03 15:26:26 -02:00
Raphael S. Carvalho	68ac0832b7	dht: introduce is_min/max for ring_position Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2018-01-03 15:26:25 -02:00
Paweł Dziepak	8c3b7fea81	Merge "Introduce new API and converters from/to old mutation_reader" from Piotr "This changeset is the first step to flatten mutation_reader. Then it introduces new mutation_fragment types for partition header and end of partition. Using those a new flat_mutation_reader is defined. Finally it introduces converters between new flat_mutation_reader and old mutation_reader." * 'haaawk/flattened_mutation_reader_v12' of github.com:scylladb/seastar-dev: Add tests for flat_mutation_reader Introduce conversion from flat_mutation_reader to mutation_reader Introduce conversion from mutation_reader to flat_mutation_reader Introduce flat_mutation_reader Extract FlattenedConsumer concept using GCC6_CONCEPT Introduce partition_end mutation_fragment Introduce a position for end of partition Introduce partition_start mutation_fragment Introduce FragmentConsumer Introduce a position for partition start streamed_mutation: Extract concepts using GCC6_CONCEPT macro	2017-10-16 12:14:23 +01:00
Duarte Nunes	2210d10552	gms/gossiper: Cleanup is_alive() Make it use get_endpoint_state_for_endpoint_ptr(), check if gossiper is enabled, mark it as const, and have some callers use it instead of open coding the logic. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-10-11 10:02:32 +01:00
Piotr Jastrzebski	2516b42752	Introduce partition_start mutation_fragment This type of mutation_fragment will be used in new mutation_reader to signal the beginning of the next partition. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-10-10 16:15:59 +02:00
Duarte Nunes	ceebbe14cc	gossiper: Avoid endpoint_state copies gossiper::get_endpoint_state_for_endpoint() returns a copy of endpoint_state, which we've seen can be very expensive. This patch adds a similar function which returns a pointer instead, and changes the call sites where using the pointer-returning variant is deemed safe (the pointer neither escapes the function, nor crosses any defer point). Fixes #764 Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-10-10 13:48:02 +01:00
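The copy-avoidance described in ceebbe14cc amounts to offering a lookup that hands back a pointer into the container instead of a copy of the element. The sketch below uses hypothetical types (`endpoint_state`, `gossiper_sketch`) and string keys. It is not the gossiper's real interface, only the shape of the pattern.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Stand-in for a state object that is expensive to copy.
struct endpoint_state {
    std::map<std::string, std::string> application_states;
    bool alive = false;
};

class gossiper_sketch {
    std::map<std::string, endpoint_state> _states;
public:
    void set_state(const std::string& endpoint, endpoint_state s) {
        _states[endpoint] = std::move(s);
    }
    // Pointer-returning lookup: no copy is made. The pointer is only safe to
    // use while the entry exists, so callers must not keep it across points
    // where the map may change.
    const endpoint_state* get_state_ptr(const std::string& endpoint) const {
        auto it = _states.find(endpoint);
        return it == _states.end() ? nullptr : &it->second;
    }
    // Example caller that uses the pointer immediately and lets it go.
    bool is_alive(const std::string& endpoint) const {
        const endpoint_state* s = get_state_ptr(endpoint);
        return s != nullptr && s->alive;
    }
};
```

A null return doubles as the "unknown endpoint" signal, which a copy-returning variant would need an optional or an exception to express.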
Tomasz Grabiec	741ec61269	streaming: Fix streaming not streaming all ranges It skipped one sub-range in each of the 10 range batch, and tried to access the range vector using end() iterator. Fixes sporadic failures of update_cluster_layout_tests.py:TestUpdateClusterLayout.simple_add_node_1_test. Message-Id: <1505848902-16734-1-git-send-email-tgrabiec@scylladb.com>	2017-09-20 10:33:59 +03:00
Botond Dénes	a980ff6463	Use abort() instead of assert + throw in unreachable code Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <393c3730111dfe090c44d8fc2e31602956a7d008.1504022425.git.bdenes@scylladb.com>	2017-09-03 11:07:27 +03:00
Botond Dénes	d1209c548a	Fix -Wreturn-type warnings Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <99f7a006daaa78eb87720ac51c394093398bc868.1504013915.git.bdenes@scylladb.com>	2017-08-29 16:41:09 +03:00
Tomasz Grabiec	2ca99be27d	ring_position_view: Print token instead of token pointer Broken in `e989d65539`. Message-Id: <1503667158-7544-1-git-send-email-tgrabiec@scylladb.com>	2017-08-25 14:25:21 +01:00
Avi Kivity	81a33df25d	dht: reduce split_range_to_single_shard contiguous memory demand split_range_to_single_shard() returns a vector of size 4096, with each element (a partition_range) of size 100. The total of 400k can cause defragmentation if memory is fragmented. Fix by using a deque. Fixes #2707. Message-Id: <20170819141017.28287-1-avi@scylladb.com>	2017-08-21 14:25:45 +02:00
Duarte Nunes	ec75eac37d	ring_position_exponential_vector_sharder: Take ranges by rvalue Avoids some copies. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170814093310.29200-1-duarte@scylladb.com>	2017-08-14 12:55:43 +03:00
Asias He	f239b11a84	storage_service: Use the new range_streamer interface for bootstrap So that bootstrap operation will now stream small ranges at a time and restream the failed ranges.	2017-08-07 16:31:47 +08:00


238 Commits