scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-21 17:10:35 +00:00

Author	SHA1	Message	Date
Botond Dénes	e0284bb9ee	treewide: add missing headers and/or forward declarations	2020-03-23 09:29:45 +02:00
Rafael Ávila de Espíndola	c0072eab30	everywhere: Be more explicit that we don't want std::make_shared If sstring is made an alias to std::string ADL causes std::make_shared to be found. Explicitly ask for ::make_shared. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2020-03-10 13:13:48 -07:00
Avi Kivity	906784639d	Merge "Clean sstables from using global objects" from Pavel E " This set cleans sstable_writer_config and surrounding sstables code from using global storage_ and feature_ service-s and database by moving the configuration logic onto sstables_manager (that was supposed to do it since `eebc3701a5`). Most of the complexity is hidden around sstable_writer_config creation, this set makes the sstables_manager create this object with an explicit call. All the rest are consequences of this change. Tests: unit(debug), manual start-stop " * 'br-clean-sstables-manager-2' of https://github.com/xemul/scylla: sstables: Move get_highest_supported_format sstables: Remove global get_config() helper sstables: Use manager's config() in .new_sstable_component_file() sstable_writer_config: Extend with more db::config stuff sstables_manager: Don't use global helper to generate writer config sstable_writer_config: Sanitize out some features fields initialization sstable_writer_config: Factor out some field initialization sstables: Generate writer config via manager only sstables: Keep reference on manager test: Re-use existing global sstables_manager table: Pass sstable_writer_config into write_memtable_to_sstable	2020-03-03 18:33:01 +02:00
Raphael S. Carvalho	40e75fb109	streaming/stream_transfer_task: avoid pointless iterations in has_relevant_range_on_this_shard() When has_relevant_range_on_this_shard() found a relevant range, it will unnecessarily iterate through the end. Verified manually that this could be thousands of pointless iterations when streaming data to a node just added. The relevant code could be simplified by de-futurizing it but I think it remains so to allow task scheduler to preempt it if necessary. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20200220224048.28804-2-raphaelsc@scylladb.com>	2020-02-28 07:57:12 +02:00
Raphael S. Carvalho	8a986bc23b	streaming/stream_transfer_task: avoid unnecessary copies of ranges Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20200220224048.28804-1-raphaelsc@scylladb.com>	2020-02-28 07:57:12 +02:00
Pavel Emelyanov	5adce3390c	sstables: Generate writer config via manager only The sstable_writer_config creation looks simple (just declare the struct instance) but behind the scenes references storage and feature services, messes with database config, etc. This patch teaches the sstables_manager to generate the writer config and makes the rest of the code use it. For future safety, creating the sstable_writer_config by hand is prohibited. The manager is referenced through table-s and sstable-s; since the two existing sstables_managers live on the database object, and table-s and sstable-s both live shorter than the database, this reference is safe. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-02-25 14:31:04 +03:00
Raphael S. Carvalho	56f66cff9f	dht: Extract to_partition_ranges() from streaming to allow reuse Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2020-02-20 10:53:01 -03:00
Piotr Jastrzebski	9494da2102	distribute_reader_and_consume_on_shards: don't take partitioner This function already takes schema so it can get partitioner using schema::get_partitioner. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2020-02-17 10:59:15 +01:00
Piotr Jastrzebski	db19a76b1f	selective_token_range_sharder: stop calling global_partitioner() This requires a change in repair, which uses selective_token_range_sharder. Repair operates on a set of tables, so we have to make sure that all of those tables use the same partitioner. This is achieved by adding a check to the repair_info constructor. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2020-02-17 10:19:15 +01:00
Piotr Jastrzebski	dd1120454b	dht: move sharders to a separate header i_partitioner.hh is widely included while sharders are used only in 6 places so there's no need to include them in the whole codebase. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2020-02-17 10:19:02 +01:00
Pavel Emelyanov	b11cf6e950	cql3/query_processor.hh: Debloat from other headers This gives ~30% less (251 jobs -> 181 jobs) recompile when touching it Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Message-Id: <20200212225828.3374-1-xemul@scylladb.com>	2020-02-16 11:22:30 +02:00
Pavel Emelyanov	abe588888d	database: Use feature service Keep local feature_service reference on database. This relaxes the circular storage_service <-> database reference, but does not remove it completely. This needs some args tossing in apply_to_builder, but it's rather straightforward, so comes in the same patch. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2020-02-03 15:16:23 +03:00
Asias He	145fd0313a	streaming: Fix map access in stream_manager::get_progress When the progress is queried, e.g., by nodetool netstats, the progress info might not be updated yet. Fix it by checking before accessing the map to avoid errors like: std::out_of_range (_Map_base::at) Fixes: #5437 Tests: nodetool_additional_test.py:TestNodetool.netstats_test	2020-01-06 10:31:15 +02:00
Asias He	6b7344f6e5	streaming: Fix typo in stream_result_future::maybe_complete s/progess/progress/ Refs: #5437	2019-12-16 11:12:03 +02:00
Pavel Solodovnikov	2f442f28af	treewide: add const qualifiers throughout the code base	2019-11-26 02:24:49 +03:00
Asias He	b89ced4635	streaming: Do not open rpc stream connection if reader has no data We can use the reader::peek() to check if the reader contains any data. If not, do not open the rpc stream connection. It helps to reduce the port usage. Refs: #4943	2019-10-08 10:31:02 +02:00
Botond Dénes	783277fb02	stream_session: STREAM_MUTATION_FRAGMENTS: print errors in receive and distribute phase Currently when an error happens during the receive and distribute phase it is swallowed and we just return a -1 status to the remote. We only log errors that happen during responding with the status. This means that when streaming fails, we only know that something went wrong, but the node on which the failure happened doesn't log anything. Fix by also logging errors happening in the receive and distribute phase. Also mention the phase in which the error happened in both error log messages. Refs: #4901 Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <20190903115735.49915-1-bdenes@scylladb.com>	2019-09-05 13:43:00 +02:00
Botond Dénes	136fc856c5	treewide: silence discarded future warnings for questionable discards This patch silences the remaining discarded future warnings, those where it cannot be determined with reasonable confidence that this was indeed the actual intent of the author, or that the discarding of the future could lead to problems. For all those places a FIXME is added, with the intent that these will be soon followed-up with an actual fix. I deliberately haven't fixed any of these, even if the fix seems trivial. It is too easy to overlook a bad fix mixed in with so many mechanical changes.	2019-08-26 19:28:43 +03:00
Botond Dénes	fddd9a88dd	treewide: silence discarded future warnings for legit discards This patch silences those future discard warnings where it is clear that discarding the future was actually the intent of the original author, and they did the necessary precautions (handling errors). The patch also adds some trivial error handling (logging the error) in some places, which were lacking this, but otherwise look ok. No functional changes.	2019-08-26 18:54:44 +03:00
Asias He	49a73aa2fc	streaming: Move stream_mutation_fragments_cmd to a new file (#4812) Avoid including the lengthy stream_session.hh in messaging_service. More importantly, fix the build because currently messaging_service.cc and messaging_service.hh do not include stream_mutation_fragments_cmd. I am not sure why it builds on my machine. Spotted this when backporting the "streaming: Send error code from the sender to receiver" to 3.0 branch. Refs: #4789	2019-08-07 14:59:46 +02:00
Asias He	288371ce75	streaming: Do not call rpc stream flush in send_mutation_fragments The stream close() guarantees the data sent will be flushed. No need to call the stream flush() since the stream is not reused. Follow up fix for commit `bac987e32a` (streaming: Send error code from the sender to receiver). Refs #4789	2019-08-07 14:31:17 +02:00
Asias He	bac987e32a	streaming: Send error code from the sender to receiver In case of error on the sender side, the sender does not propagate the error to the receiver. The sender will close the stream. As a result, the receiver will get nullopt from the source in get_next_mutation_fragment and pass mutation_fragment_opt with no value to the generating_reader. In turn, the generating_reader generates end of stream. However, the last element that the generating_reader has generated can be any type of mutation_fragment. This makes the sstable that consumes the generating_reader violate the mutation_fragment stream rule. To fix, we need to propagate the error. However, RPC streaming does not support propagating the error in the framework. The user has to send an error code explicitly. Fixes: #4789	2019-08-06 16:54:56 +02:00
Asias He	64a4c0ede2	streaming: Do not open rpc stream connection if ranges are not relevant to a shard Given a list of ranges to stream, stream_transfer_task will create a reader with the ranges and create an rpc stream connection on all the shards. When the user provides ranges to repair with -st -et options, e.g., using scylla-manager, such ranges can belong to only one shard, and repair will pass such ranges to streaming. As a result, only one shard will have data to send while the rpc stream connections are created on all the shards, which can cause the kernel to run out of ports on some systems. To mitigate the problem, do not open the connection if the ranges do not belong to the shard at all. Refs: #4708	2019-07-18 18:31:21 +03:00
Botond Dénes	12b8405720	streaming,repair: restore indentation Deferred from the previous two patches.	2019-06-26 18:45:36 +03:00
Botond Dénes	9c2407573c	streaming: pass the data stream through the compaction strategy's interposer consumer	2019-06-26 18:45:36 +03:00
Botond Dénes	2693f1838a	Introduce mutation_writer namespace Currently there is a single mutation_writer: `multishard_writer`, however in the next patch we are going to add another one. This is the right moment to move these into a common namespace (and folder), we have way too much stuff scattered already in the top-level namespace (and folder). Also rename `tests/multishard_writer_test.cc` to `tests/mutation_writer_test.cc`, this test-suite will be the home of all the different mutation writers' unit test cases.	2019-06-26 15:45:59 +03:00
Asias He	f212dfb887	streaming: Reject stream if the _sys_dist_ks or _view_update_generator are not ready They are of type db::system_distributed_keyspace and db::view::view_update_generator. n1 is in normal status n2 boots up and _sys_dist_ks or _view_update_generator are not initialized n1 runs stream, n2 is the follower. n2 uses the _sys_dist_ks or _view_update_generator "Assertion `local_is_initialized()' failed" is observed Fixes #4360 Message-Id: <4ae13e1640ac8707a9ba0503a2744f6faf89ecf4.1554330030.git.asias@scylladb.com>	2019-04-04 10:48:00 +03:00
Benny Halevy	223e1af521	sstables: provide large_data_handler to constructor And use it for writing the sstable and/or when deleting it. Refs #4198 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2019-03-26 16:24:19 +02:00
Asias He	b8158dd65d	streaming: Get rid of the keep alive timer in streaming There is no guarantee that rpc streaming makes progress in some time period. Remove the keep alive timer in streaming to avoid killing the session when the rpc streaming is just slow. The keep alive timer is used to close the session in the following case: n2 (the rpc streaming sender) streams to n1 (the rpc streaming receiver) kill -9 n2 We need this because we do not kill the session when gossip thinks a node is down, because we think the node being down might only be temporary and it is a waste to drop the previous work that has been done, especially when the stream session takes a long time. Since in range_streamer we do not stream all data in a single stream session (we stream 10% of the data at a time) and we have retry logic, I think it is fine to kill a stream session when gossip thinks a node is down. This patch changes the code to close all stream sessions with a node that gossip thinks is down. Message-Id: <bdbb9486a533eee25fcaf4a23a946629ba946537.1551773823.git.asias@scylladb.com>	2019-03-12 12:20:28 +01:00
Rafael Ávila de Espíndola	625080b414	Rename large_partition_handler Now that it also handles large rows, rename it to large_data_handler. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>	2019-01-28 15:03:14 -08:00
Piotr Jastrzebski	1ac7283550	Fix cross shard cf usage in streaming Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2019-01-24 18:13:30 +01:00
Duarte Nunes	04a14b27e4	Merge 'Add handling staging sstables to /upload dir' from Piotr " This series adds generating view updates from sstables added through /upload directory if their tables have accompanying materialized views. Said sstables are left in /upload directory until updates are generated from them and are treated just like staging sstables from /staging dir. If there are no views for a given tables, sstables are simply moved from /upload dir to datadir without any changes. Tests: unit (release) " * 'add_handling_staging_sstables_to_upload_dir_5' of https://github.com/psarna/scylla: all: rename view_update_from_staging_generator distributed_loader: fix indentation service: add generating view updates from uploaded sstables init: pass view update generator to storage service sstables: treat sstables in upload dir as needing view build sstables,table: rename is_staging to requires_view_building distributed_loader: use proper directory for opening SSTable db,view: make throttling optional for view_update_generator	2019-01-15 18:19:27 +00:00
Piotr Sarna	0eb703dc80	all: rename view_update_from_staging_generator The new name, view_update_generator, is both more concise and correct, since we now generate from directories other than "/staging".	2019-01-15 17:31:47 +01:00
Piotr Sarna	7e61f02365	streaming: add phasing incoming streams Incoming streams are now phased, which can be leveraged later to wait for all ongoing streams to finish. Refs #4032	2019-01-15 10:28:15 +01:00
Duarte Nunes	fa2b0384d2	Replace std::experimental types with C++17 std version. Replace stdx::optional and stdx::string_view with the C++ std counterparts. Some instances of boost::variant were also replaced with std::variant, namely those that called seastar::visit. Scylla now requires GCC 8 to compile. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20190108111141.5369-1-duarte@scylladb.com>	2019-01-08 13:16:36 +02:00
Avi Kivity	f02c64cadf	streaming: stream_session: remove include of db/view/view_update_from_staging_generator.hh This header, which is easily replaced with a forward declaration, introduces a dependency on database.hh everywhere. Remove it and scatter includes of database.hh in source files that really need it.	2019-01-05 17:33:25 +02:00
Piotr Sarna	9d46715613	streaming,view: move view update checks to separate file Checking if view update path should be used for sstables is going to be reused in row level repair code, so relevant functions are moved to a separate header.	2019-01-03 08:31:40 +01:00
Asias He	d90836a2d3	streaming: Make total_incoming_bytes and total_outgoing_bytes metrics monotonic Currently, they increase and decrease as the stream sessions are created and destroyed. Make them Prometheus monotonically increasing counters for easier monitoring. Message-Id: <7c07cea25a59a09377292dc8f64ed33ff12eda87.1545959905.git.asias@scylladb.com>	2018-12-30 16:52:17 +02:00
Duarte Nunes	bab7e6877b	streaming/stream_session: Only stage sstables for tables with views When streaming, sstables for which we need to generate view updates are placed in a special staging directory. However, we only need to do this for tables that actually have views. Refs #4021 Message-Id: <20181227215412.5632-1-duarte@scylladb.com>	2018-12-28 18:32:24 +02:00
Gleb Natapov	37b4043677	streaming: always read from rpc::source until end-of-stream during mutation sending rpc::source cannot be abandoned until EOS is reached, but current code does not obey it if error code is received, it throws exception instead that aborts the reading loop. Fix it by moving exception throwing out of the loop. Fixes: #4025 Message-Id: <20181227135051.GC29458@scylladb.com>	2018-12-27 16:50:53 +02:00
Duarte Nunes	66e45469b2	streaming/stream_session: Don't use table reference across defer points When creating a sstable from which to generate view updates, we held on to a table reference across defer points. In case there's a concurrent schema drop, the table object might be destroyed and we will incur in a use-after-free. Solve this by holding on to a shared pointer and pinning the table object. Refs #4021 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20181227105921.3601-1-duarte@scylladb.com>	2018-12-27 13:05:46 +02:00
Avi Kivity	c96fc1d585	Merge "Introduce row level repair" from Asias " === How the partition level repair works - The repair master decides which ranges to work on. - The repair master splits the ranges into sub ranges, each containing around 100 partitions. - The repair master computes the checksum of the 100 partitions and asks the related peers to compute the checksum of the 100 partitions. - If the checksum matches, the data in this sub range is synced. - If the checksum mismatches, repair master fetches the data from all the peers and sends back the merged data to peers. === Major problems with partition level repair - A mismatch of a single row in any of the 100 partitions causes 100 partitions to be transferred. A single partition can be very large. Not to mention the size of 100 partitions. - Checksum (find the mismatch) and streaming (fix the mismatch) will read the same data twice === Row level repair Row level checksum and synchronization: detect row level mismatch and transfer only the mismatch === How the row level repair works - To solve the problem of reading data twice Read the data only once for both checksum and synchronization between nodes. We work on a small range which contains only a few megabytes of rows. We read all the rows within the small range into memory. Find the mismatch and send the mismatched rows between peers. We need to find a sync boundary among the nodes which contains only N bytes of rows. - To solve the problem of sending unnecessary data. We need to find the mismatched rows between nodes and only send the delta. The problem is called the set reconciliation problem, which is a common problem in distributed systems. For example: Node1 has set1 = {row1, row2, row3} Node2 has set2 = { row2, row3} Node3 has set3 = {row1, row2, row4} To repair: Node1 fetches nothing from Node2 (set2 - set1), fetches row4 (set3 - set1) from Node3. 
Node1 sends row1 and row4 (set1 + set2 + set3 - set2) to Node2 Node1 sends row3 (set1 + set2 + set3 - set3) to Node3. === How to implement repair with set reconciliation - Step A: Negotiate sync boundary class repair_sync_boundary { dht::decorated_key pk; position_in_partition position } Reads rows from disk into row buffers until the size is larger than N bytes. Return the repair_sync_boundary of the last mutation_fragment we read from disk. The smallest repair_sync_boundary of all nodes is set as the current_sync_boundary. - Step B: Get missing rows from peer nodes so that repair master contains all the rows Request combined hashes from all nodes between last_sync_boundary and current_sync_boundary. If the combined hashes from all nodes are identical, data is synced, goto Step A. If not, request the full hashes from peers. At this point, the repair master knows exactly what rows are missing. Request the missing rows from peer nodes. Now, local node contains all the rows. - Step C: Send missing rows to the peer nodes Since local node also knows what peer nodes own, it sends the missing rows to the peer nodes. === How the RPC API looks like - repair_range_start() Step A: - request_sync_boundary() Step B: - request_combined_row_hashes() - reqeust_full_row_hashes() - request_row_diff() Step C: - send_row_diff() - repair_range_stop() === Performance evaluation We created a cluster of 3 Scylla nodes on AWS using i3.xlarge instance. We created a keyspace with a replication factor of 3 and inserted 1 billion rows to each of the 3 nodes. Each node has 241 GiB of data. We tested 3 cases below. 1) 0% synced: one of the node has zero data. The other two nodes have 1 billion identical rows. Time to repair: old = 87 min new = 70 min (rebuild took 50 minutes) improvement = 19.54% 2) 100% synced: all of the 3 nodes have 1 billion identical rows. 
Time to repair: old = 43 min new = 24 min improvement = 44.18% 3) 99.9% synced: each node has 1 billion identical rows and 1 billion * 0.1% distinct rows. Time to repair: old: 211 min new: 44 min improvement: 79.15% Bytes sent on wire for repair: old: tx= 162 GiB, rx = 90 GiB new: tx= 1.15 GiB, rx = 0.57 GiB improvement: tx = 99.29%, rx = 99.36% It is worth noting that row level repair sends and receives exactly the number of rows needed in theory. In this test case, repair master needs to receive 2 million rows and send 4 million rows. Here are the details: Each node has 1 billion * 0.1% distinct rows, that is 1 million rows. So repair master receives 1 million rows from repair slave 1 and 1 million rows from repair slave 2. Repair master sends 1 million rows from repair master and 1 million rows received from repair slave 1 to repair slave 2. Repair master sends 1 million rows from repair master and 1 million rows received from repair slave 2 to repair slave 1. In the result, we saw the rows on wire were as expected. 
tx_row_nr = 1000505 + 999619 + 1001257 + 998619 (4 shards, the numbers are for each shard) = 4'000'000 rx_row_nr = 500233 + 500235 + 499559 + 499973 (4 shards, the numbers are for each shard) = 2'000'000 Fixes: #3033 Tests: dtests/repair_additional_test.py " * 'asias/row_level_repair_v7' of github.com:cloudius-systems/seastar-dev: (51 commits) repair: Enable row level repair repair: Add row_level_repair repair: Add docs for row level repair repair: Add repair_init_messaging_service_handler repair: Add repair_meta repair: Add repair_writer repair: Add repair_reader repair: Add repair_row repair: Add fragment_hasher repair: Add decorated_key_with_hash repair: Add get_random_seed repair: Add get_common_diff_detect_algorithm repair: Add shard_config repair: Add suportted_diff_detect_algorithms repair: Add repair_stats to repair_info repair: Introduce repair_stats flat_mutation_reader: Add make_generating_reader storage_service: Introduce ROW_LEVEL_REPAIR feature messaging_service: Add RPC verbs for row level repair repair: Export the repair logger ...	2018-12-25 13:13:00 +02:00
Gleb Natapov	393269d34b	streaming: hold to sink while close() is running and call close on error as well Currently if something throws while streaming in mutation sending loop sink is not closed. Also when close() is running the code does not hold onto sink object. close() is async, so sink should be kept alive until it completes. The patch uses do_with() to hold onto sink while close is running and run close() on error path too. Fixes #4004. Message-Id: <20181220155931.GL3075@scylladb.com>	2018-12-20 18:03:37 +02:00
Asias He	bcba6b4f4d	streaming: Futurize estimate_partitions The loop can take a long time if the number of sstables and/or ranges are large. To fix, futurize the loop. Fixes: #4005 Message-Id: <3b05cb84f3f57cc566702142c6365a04b075018e.1545290730.git.asias@scylladb.com>	2018-12-20 12:08:03 +02:00
Asias He	0067d32b47	flat_mutation_reader: Add make_generating_reader Move generating_reader from stream_session.cc to flat_mutation_reader.cc. It will be used by repair code soon. Also introduce a helper make_generating_reader to hide the implementation of generating_reader.	2018-12-12 16:49:01 +08:00
Piotr Sarna	8e6021dfa1	streaming: don't check view building of system tables System tables will never need view building, and, what's more, are actually used in the process of view build checking. So, checking whether system tables need a view update path is simplified to returning 'false'.	2018-11-28 09:21:56 +01:00
Piotr Sarna	6ad2c39f88	streaming: remove unused sstable_is_staging bool class sstable_is_staging bool class is not used anywhere in the code anymore, so it's removed.	2018-11-28 09:21:56 +01:00
Avi Kivity	775b7e41f4	Update seastar submodule * seastar d59fcef...b924495 (2): > build: Fix protobuf generation rules > Merge "Restructure files" from Jesse Includes fixup patch from Jesse: " Update Seastar `#include`s to reflect restructure All Seastar header files are now prefixed with "seastar" and the configure script reflects the new locations of files. Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com> Message-Id: <5d22d964a7735696fb6bb7606ed88f35dde31413.1542731639.git.jhaberku@scylladb.com> "	2018-11-21 00:01:44 +02:00
Piotr Sarna	32c0fe8df2	streaming: stream tables with views through staging sstables While streaming to a table with paired views, staging sstables are used. After the table is written to disk, it's used to generate all required view updates. It's also resistant to restarts as it's stored on a hard drive in staging/ directory. Refs #3275	2018-11-13 15:04:42 +01:00
Piotr Sarna	dc74887ff3	streaming: add system distributed keyspace ref to streaming Streaming code needs system distributed keyspace to check if streamed sstables should be staging, so a proper reference is added.	2018-11-13 15:01:53 +01:00
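
The set reconciliation example in the "Introduce row level repair" merge (c96fc1d585) can be sketched in a few lines. This is a minimal illustration of the set arithmetic only, not the repair code itself: the function name `reconcile`, the node labels, and the use of plain Python sets of row names are assumptions made for the example, whereas the real implementation exchanges row hashes over RPC.

```python
def reconcile(master_rows, peer_rows):
    """Return (to_fetch, to_send) for a repair master and its peers.

    master_rows: set of rows held by the repair master (hypothetical model).
    peer_rows:   dict mapping a peer name to the set of rows it holds.
    """
    # Rows the master is missing from each peer: peer - master.
    to_fetch = {peer: rows - master_rows for peer, rows in peer_rows.items()}

    # After fetching, the master holds the union of all sets.
    merged = set(master_rows).union(*peer_rows.values())

    # Rows each peer is missing: union - peer.
    to_send = {peer: merged - rows for peer, rows in peer_rows.items()}
    return to_fetch, to_send


# The three-node example from the commit message.
set1 = {"row1", "row2", "row3"}  # Node1, the repair master
set2 = {"row2", "row3"}          # Node2
set3 = {"row1", "row2", "row4"}  # Node3

to_fetch, to_send = reconcile(set1, {"node2": set2, "node3": set3})
print(to_fetch)
print(to_send)
```

With these inputs Node1 fetches nothing from Node2 and row4 from Node3, then sends row1 and row4 to Node2 and row3 to Node3, matching the commit message.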

492 Commits