scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-28 10:41:12 +00:00

Author	SHA1	Message	Date
Piotr Sarna	acf7bedad4	idl,service: add persistent last partition row count In order to process paged queries with per-partition limits properly, paging state needs to keep additional information: what was the row count of last partition returned in previous run. That's necessary because the end of previous page and the beginning of current one might consist of rows with the same partition key and we need to be able to trim the results to the number indicated by per-partition limit.	2019-02-18 11:06:44 +01:00
Duarte Nunes	fa2b0384d2	Replace std::experimental types with C++17 std version. Replace stdx::optional and stdx::string_view with the C++ std counterparts. Some instances of boost::variant were also replaced with std::variant, namely those that called seastar::visit. Scylla now requires GCC 8 to compile. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20190108111141.5369-1-duarte@scylladb.com>	2019-01-08 13:16:36 +02:00
Avi Kivity	c96fc1d585	Merge "Introduce row level repair" from Asias " === How the partition level repair works - The repair master decides which ranges to work on. - The repair master splits the ranges to sub ranges which contain around 100 partitions. - The repair master computes the checksum of the 100 partitions and asks the related peers to compute the checksum of the 100 partitions. - If the checksum matches, the data in this sub range is synced. - If the checksum mismatches, repair master fetches the data from all the peers and sends back the merged data to peers. === Major problems with partition level repair - A mismatch of a single row in any of the 100 partitions causes 100 partitions to be transferred. A single partition can be very large. Not to mention the size of 100 partitions. - Checksum (find the mismatch) and streaming (fix the mismatch) will read the same data twice === Row level repair Row level checksum and synchronization: detect row level mismatch and transfer only the mismatch === How the row level repair works - To solve the problem of reading data twice Read the data only once for both checksum and synchronization between nodes. We work on a small range which contains only a few megabytes of rows. We read all the rows within the small range into memory. Find the mismatch and send the mismatched rows between peers. We need to find a sync boundary among the nodes which contains only N bytes of rows. - To solve the problem of sending unnecessary data. We need to find the mismatched rows between nodes and only send the delta. This is known as the set reconciliation problem, which is a common problem in distributed systems. For example: Node1 has set1 = {row1, row2, row3} Node2 has set2 = { row2, row3} Node3 has set3 = {row1, row2, row4} To repair: Node1 fetches nothing from Node2 (set2 - set1), fetches row4 (set3 - set1) from Node3. 
Node1 sends row1 and row4 (set1 + set2 + set3 - set2) to Node2 Node1 sends row3 (set1 + set2 + set3 - set3) to Node3. === How to implement repair with set reconciliation - Step A: Negotiate sync boundary class repair_sync_boundary { dht::decorated_key pk; position_in_partition position } Reads rows from disk into row buffers until the size is larger than N bytes. Return the repair_sync_boundary of the last mutation_fragment we read from disk. The smallest repair_sync_boundary of all nodes is set as the current_sync_boundary. - Step B: Get missing rows from peer nodes so that repair master contains all the rows Request combined hashes from all nodes between last_sync_boundary and current_sync_boundary. If the combined hashes from all nodes are identical, data is synced, goto Step A. If not, request the full hashes from peers. At this point, the repair master knows exactly what rows are missing. Request the missing rows from peer nodes. Now, local node contains all the rows. - Step C: Send missing rows to the peer nodes Since local node also knows what peer nodes own, it sends the missing rows to the peer nodes. === What the RPC API looks like - repair_range_start() Step A: - request_sync_boundary() Step B: - request_combined_row_hashes() - reqeust_full_row_hashes() - request_row_diff() Step C: - send_row_diff() - repair_range_stop() === Performance evaluation We created a cluster of 3 Scylla nodes on AWS using i3.xlarge instances. We created a keyspace with a replication factor of 3 and inserted 1 billion rows to each of the 3 nodes. Each node has 241 GiB of data. We tested 3 cases below. 1) 0% synced: one of the nodes has zero data. The other two nodes have 1 billion identical rows. Time to repair: old = 87 min new = 70 min (rebuild took 50 minutes) improvement = 19.54% 2) 100% synced: all of the 3 nodes have 1 billion identical rows. 
Time to repair: old = 43 min new = 24 min improvement = 44.18% 3) 99.9% synced: each node has 1 billion identical rows and 1 billion * 0.1% distinct rows. Time to repair: old: 211 min new: 44 min improvement: 79.15% Bytes sent on wire for repair: old: tx = 162 GiB, rx = 90 GiB new: tx = 1.15 GiB, rx = 0.57 GiB improvement: tx = 99.29%, rx = 99.36% It is worth noting that row level repair sends and receives exactly the number of rows needed in theory. In this test case, repair master needs to receive 2 million rows and send 4 million rows. Here are the details: Each node has 1 billion * 0.1% distinct rows, that is 1 million rows. So repair master receives 1 million rows from repair slave 1 and 1 million rows from repair slave 2. Repair master sends 1 million rows from repair master and 1 million rows received from repair slave 1 to repair slave 2. Repair master sends 1 million rows from repair master and 1 million rows received from repair slave 2 to repair slave 1. In the results, we saw the rows on wire were as expected. 
tx_row_nr = 1000505 + 999619 + 1001257 + 998619 (4 shards, the numbers are for each shard) = 4'000'000 rx_row_nr = 500233 + 500235 + 499559 + 499973 (4 shards, the numbers are for each shard) = 2'000'000 Fixes: #3033 Tests: dtests/repair_additional_test.py " * 'asias/row_level_repair_v7' of github.com:cloudius-systems/seastar-dev: (51 commits) repair: Enable row level repair repair: Add row_level_repair repair: Add docs for row level repair repair: Add repair_init_messaging_service_handler repair: Add repair_meta repair: Add repair_writer repair: Add repair_reader repair: Add repair_row repair: Add fragment_hasher repair: Add decorated_key_with_hash repair: Add get_random_seed repair: Add get_common_diff_detect_algorithm repair: Add shard_config repair: Add suportted_diff_detect_algorithms repair: Add repair_stats to repair_info repair: Introduce repair_stats flat_mutation_reader: Add make_generating_reader storage_service: Introduce ROW_LEVEL_REPAIR feature messaging_service: Add RPC verbs for row level repair repair: Export the repair logger ...	2018-12-25 13:13:00 +02:00
Duarte Nunes	d54ac4961d	idl: Add db::view::update_backlog Add db::view::update_backlog to the newly created view.idl.hh. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-12-19 22:38:30 +00:00
Asias He	48341a2d4d	idl: Add decorated_key support Needed by the row level repair RPC verbs.	2018-12-12 16:49:01 +08:00
Asias He	1db4e3fd0a	idl: Add row_level_diff_detect_algorithm Needed by the row level repair RPC verbs.	2018-12-12 16:49:01 +08:00
Asias He	ccc706559f	idl: Add get_sync_boundary_response Needed by the row level repair RPC verbs.	2018-12-12 16:49:01 +08:00
Asias He	1173d1dd5a	idl: Add repair_sync_boundary Needed by the row level repair RPC verbs.	2018-12-12 16:49:01 +08:00
Asias He	dc223e9216	idl: Add partition_key_and_mutation_fragments Needed by the row level repair RPC verbs.	2018-12-12 16:49:01 +08:00
Asias He	5fbbc63676	idl: Add position_in_partition Needed by the row level repair RPC verbs.	2018-12-12 16:49:01 +08:00
Asias He	e9fbc27740	idl: Add bound_weight It will be used by the row level repair code.	2018-12-12 16:49:01 +08:00
Asias He	3c39462397	idl: Add partition_region Needed by the row level repair RPC verbs.	2018-12-12 16:49:01 +08:00
Asias He	e2b9840e24	idl: Add repair_hash Needed by the row level repair RPC verbs.	2018-12-12 16:49:01 +08:00
Paweł Dziepak	9024187222	partition_slice: use small_vector for column_ids	2018-12-06 14:21:04 +00:00
Asias He	7f826d3343	streaming: Expose reason for streaming On receiving a mutation_fragment or a mutation triggered by a streaming operation, we pass an enum stream_reason to notify the receiver what the streaming is used for. So the receiver can decide further operation, e.g., send view updates, beyond applying the streaming data on disk. Fixes #3276 Message-Id: <f15ebcdee25e87a033dcdd066770114a499881c0.1539498866.git.asias@scylladb.com>	2018-10-15 22:03:28 +01:00
Nadav Har'El	36a657fc10	schema: persist "view virtual" columns to a separate system table In the previous patch, we added a "view virtual" flag on columns. In this patch we add persistence to this flag: I.e., writing it to the on-disk schema table and reading it back on startup. But the implementation is not as simple as adding a flag: In the on-disk system tables, we have a "columns" table listing all the columns in the database and their types. Cqlsh's "DESCRIBE MATERIALIZED VIEW" works by reading this "columns" table, and listing all of the requested view's columns. Therefore, we cannot add "virtual columns" - which are columns not added by the user and not intended to be seen - to this list. We therefore need to create in this patch a separate list for virtual columns, in a new table "view_virtual_columns". This table is essentially identical to the existing "columns" table, just separate. We need to write each column to the appropriate table (columns with the view_virtual flag to "view_virtual_columns", columns without it to the old "columns"), read from both on startup, and remember to delete columns from both when a table is dropped. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2018-08-16 15:30:06 +03:00
Asias He	fd71c5718f	gossip: Reduce continuous memory usage Gossip SYN and ACK uses std::vector to store a list of gossip_digest, the larger the cluster, the more continuous memory is needed. To reduce the memory pressure which might cause std::bad_alloc, switch the std::vector to chunked_vector. In addition, change add_local_application_state to use std::list instead of std::vector. Refs #2782	2018-07-17 20:15:32 +08:00
Paweł Dziepak	ed12555192	idl: add idl description of frozen_mutation_fragments	2018-05-25 10:15:10 +01:00
Paweł Dziepak	aa4e589ace	frozen_mutation: introduce frozen_mutation_fragment This patch introduces IDL definition as well as serialisers and deserialisers for freezing mutation_fragment so that they can be transferred between nodes in a cluster.	2018-05-25 10:15:10 +01:00
Paweł Dziepak	b2e9491728	tests/idl: test variant being the first member of a structure	2018-05-25 10:15:10 +01:00
Paweł Dziepak	d731cf427d	tests/idl: test serialising and deserialising empty structures	2018-05-25 10:15:10 +01:00
Botond Dénes	ddd70dc113	Use dht::token_range alias for last/preferred replicas Use the pre-existing type alias instead of fully spelling out the type everywhere.	2018-05-10 06:22:39 +03:00
Botond Dénes	b55dcc2ce5	Add query_read_repair_decision to paging-state This new field will store the repair-decision made on the first page of the query. This decision will be sticky to all pages of the query. In mixed clusters the decision might not happen on the first page and it might even change during the query as old coordinators will not store nor respect the decision.	2018-03-19 15:17:31 +02:00
Botond Dénes	f281b3e923	Add last_replicas to paging_state Helps paged queries consistently hit the same replicas for each subsequent page. Replicas that already served a page will keep the readers used for filling it around in a cache. Subsequent page request hitting the same replicas can reuse these readers to fill the pages avoiding the work of creating these readers from scratch on every page. In a mixed cluster older coordinators will ignore this value. The value of last_replicas may change between pages as nodes may become available/unavailable or the coordinator may decide to send the read requests to different replicas at its discretion. Replicas are identified by an opaque uuid which should only make sense to the storage-proxy.	2018-03-13 10:34:34 +02:00
Nadav Har'El	fa284f6307	Add query UUID to read command This patch adds the parameter to read_command which is needed for caching of readers across multiple pages of a paged query, which we will introduce in the next patches. The query_uuid is a UUID of a previously saved reader, which the replica is now asked to recall and resume (if this saved reader is no longer in the cache, it is fine, a new reader will be started). Additionally a helper flag is_first_page is added so that the replica can avoid doing any cache lookups (and incrementing miss counters) for the first page. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2018-03-13 10:34:34 +02:00
Nadav Har'El	ec7c56d18a	Add query UUID to paging state This patch adds to the "paging_state", the opaque cookie that clients are supposed to provide when asking for the next page on a paged query, a unique id field. This new field will be used to tell that a new request for a page really continues the previous page, and doesn't just by chance start at the same position the previous page stopped. We need to support setups with mixed versions - a client may get a paging state from a coordinator running a new version of Scylla and send it to a different coordinator running an old version - or vice versa. So the new uuid field is set up to have a default uuid of UUID() (a recognizable invalid uuid 0), so new versions receiving no uuid from an old version will set this invalid uuid, and old versions receiving a uuid from a new version will simply ignore it. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2018-03-13 10:34:34 +02:00
Duarte Nunes	0bab3e59c2	service/storage_service: Add and use xxhash feature We add a cluster feature that informs whether the xxHash algorithm is supported, and allow nodes to switch to it. We use a cluster feature because older versions are not ready to receive a different digest algorithm than MD5 when answering a data request. If we ever should add a new hash algorithm, we would also need to add a new cluster feature for that algorithm. The alternative would be to add code so a coordinator could negotiate what digest algorithm to use with the set of replicas it is contacting. Fixes #2884 Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2018-02-01 01:02:50 +00:00
Duarte Nunes	3b9a9b7321	query-result: Send row and partition count over the wire To avoid calculating them on the coordinator side. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-08-14 10:29:06 +02:00
Tomasz Grabiec	cdf5b67522	schema_tables: Introduce system_schema.scylla_tables It will be used to store Scylla-specific table metadata. We cannot store it in the standard "tables" table for compatibility reasons - Cassandra will fail to read the schema if it encounters columns it is not expecting.	2017-07-11 14:52:23 +02:00
Pekka Enberg	8112d7c5c0	idl: Fix frozen_schema version numbers The IDL changes will appear in 2.0 so fix up the version numbers. Message-Id: <1499680669-6757-1-git-send-email-penberg@scylladb.com>	2017-07-10 14:02:20 +03:00
Gleb Natapov	fab18c0c5a	database: introduce cache_temperature class The class will represent cache hit rate for a column family and is serializable for use with RPC.	2017-06-13 09:57:14 +03:00
Avi Kivity	c4faa1e202	Merge "tracing: tracing spans and time series helper table" from Vlad " - Introduce a parent span IP and span ID paradigm. - Introduce time series tables to simplify traces processing. - Add the "How to get traces?" chapter to the tracing.md. " * 'tracing-span-ids-and-time-series-helpers-v4' of github.com:cloudius-systems/seastar-dev: docs: tracing.md: add a "how to get traces" chapter tracing::trace_keyspace_helper: introduce a time series helper tables tracing: cleanup: use nullptr instead of trace_state_ptr() tracing: introduce a span ID and parent span ID	2017-05-28 12:01:35 +03:00
Calle Wilund	6c8b5fc09d	schema_tables: Use v3 schema tables and formats Switches system/schema_* for system_schema/*, updates schema/schema builder and uses to hold/expect v3 style info (i.e. types & dropped).	2017-05-10 16:44:48 +00:00
Avi Kivity	8c5c5d3004	Merge "CQL front-end for secondary indices" from Pekka "This patch series adds CQL front-end support for secondary indices. You can now execute CREATE INDEX and DROP INDEX statements, which will update the newly added "Indexes" system table. However, the indexes are not actually backed up by anything nor are they available for CQL queries. The feature is hidden behind a new cluster feature flag and enabled only with the "--experimental" flag." * 'penberg/cql-2i/v2' of github.com:cloudius-systems/seastar-dev: (34 commits) schema: Kill index_type enum schema: Kill index_info class cql3/statements/create_index_statement: Use database::existing_index_names() in validation cql3/statements: Use secondary index manager in alter_table_statement class index: Add secondary_index_manager thrift/handler: Use index_metadata db/schema_tables: Index persistence schema: Add all_indices() to schema class schema: Remove add_default_index_names() from schema_builder class db/schema_tables: Add system table for indices cql3/Cgl.g: DROP INDEX cql3/statements: Add drop_index_statement class database: Add find_indexed_table() to database class cql3: Return change event from announce_migration() cql3/statements: Multiple index targets for CREATE INDEX cql3/statements: Use index_metadata in create_index_statement class cql3/statements: Use feature flag in create_index_statement class service/storage_service: Add feature flag for secondary indices database: Add get_available_index_name() to database class schema: Add get_default_index_name() to index_metadata class ...	2017-05-08 17:04:40 +03:00
Pekka Enberg	11474ed4c6	db/schema_tables: Index persistence	2017-05-08 10:03:28 +03:00
Vlad Zolotarov	b0f660331a	tracing: introduce a span ID and parent span ID This patch makes the tracing framework follow the general idea of Google's Dapper paper: traces generated in a context of the same query are forming a single-rooted acyclic tree where in a ScyllaDB case vertexes are spans running on each involved replica Node and edges are RPCs sent from one Node to another. - Each vertex in the tree above has an ID - "span ID". - In order to be able to build the tree from the sessions traces we need to know the parent "span ID" - the ID of a span that sent an RPC that created the current span. - Each span of a tracing session is given a 64-bit random span ID. - The root span has a span_id::illegal_id value. This patch adds: - The described above parent span ID and a span ID to the one_session_records object. - The current span ID is passed in the trace_info struct to the remote replica. - Add parent_id and span_id columns to system_traces.events table for the parent ID and span ID. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-04-25 21:52:23 -04:00
Duarte Nunes	4e693383f7	mutation_partion: Use row_tombstone This patch replaces the current row tombstone representation by a row_tombstone. The intent of the patch is thus to reify the idea of shadowable tombstones, which up until now we considered all materialized view row tombstones to be. We need to distinguish shadowable from non-shadowable row tombstones to support scenarios such as the following, when inserting to a table with a materialized view: 1. insert into base (p, v1, v2) values (3, 1, 3) using timestamp 1 2. delete from base using timestamp 2 where p = 3 3. insert into base (p, v1) values (3, 1) using timestamp 3 These should yield a view row where v2 is definitely null, but with the current implementation, v2 will pop back with its value v2=3@TS=1, even though it's dead in the base row. This is because the row tombstone inserted at 2) is a shadowable one. This patch only addresses the memory representation of such row_tombstones. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-04-25 11:46:33 +02:00
Duarte Nunes	8cc29f84fb	idl-compiler: Support optional fields in views When generating view code, the compiler was ignoring optional fields. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-04-25 11:43:04 +02:00
Paweł Dziepak	374c8a56ac	commitlog: avoid copying column_mapping It is safe to copy column_mapping across shards. Such a guarantee comes at the cost of performance. This patch makes commitlog_entry_writer use the IDL-generated writer to serialise commitlog_entry so that column_mapping is not copied. This also simplifies commitlog_entry itself. Performance difference tested with: perf_simple_query -c4 --write --duration 60 (medians) before after diff write 79434.35 89247.54 +12.3%	2017-02-27 17:05:58 +00:00
Paweł Dziepak	9989239c97	idl: add idl description of consistency level	2017-02-02 10:35:14 +00:00
Paweł Dziepak	9f1ebd4f7c	idl/mutation: add counter serialisation logic	2017-02-02 10:35:14 +00:00
Paweł Dziepak	b8e29cc99c	idl: is_short_read() was added in 1.6	2016-12-22 13:35:04 +01:00
Duarte Nunes	19a76a82e8	frozen_schema: Support view schemas This patch allows a view schema to be frozen. To unfreeze such a schema, we add an is_view attribute to the schema idl. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Paweł Dziepak	43fe3439ca	reconcilable_result: properly propagate short_read flag reconcilable_result can be merged with another or transformed into query::result. Make sure that short_read information is never lost. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:10:02 +00:00
Paweł Dziepak	da7ca85040	query: allow short reads When paging is used the cluster is allowed to return fewer rows than the client asked for. However, if this possibility is used we need a way of telling that to the coordinator and the paging implementation so that they can differentiate between short reads caused by the replica running out of data to send and short reads caused by any other means. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:10:01 +00:00
Avi Kivity	18078bea9b	storage_proxy: avoid calculating digest when only one replica is contacted If we're talking to just one replica, the digest is not going to be used, so better not to calculate it at all. The optimization helps with LOCAL_ONE queries where the result is large, but does not contain large blobs (many small rows). This patch adds a digest_algorithm parameter to the READ_DATA verb that can take on two values: none and MD5 (default), and sets it to none when we're reading from one replica. In the future we may add other values for more hardware-friendly digest algorithms. Message-Id: <1479380600-19206-1-git-send-email-avi@scylladb.com>	2016-11-17 13:04:30 +02:00
Avi Kivity	a35136533d	Convert ring_position and token ranges to be nonwrapping Wrapping ranges are a pain, so we are moving wrap handling to the edges. Since cql can't generate wrapping ranges, this means thrift and the ring maintenance code; also range->ring transformations need to merge the first and last ranges. Message-Id: <1478105905-31613-1-git-send-email-avi@scylladb.com>	2016-11-02 21:04:11 +02:00
Vlad Zolotarov	a491ac0f18	tracing: introduce a log_slow_query logic The main idea is to log queries that take "too long" to complete. The "too long" is above the given threshold. To achieve the above this patch does the following: - Introduce two new properties to the tracing::trace_state: - "Full tracing": when the tracing of this query was explicitly requested. In this state we will record all possible traces related to this query: both on the coordinator and on any replica involved. - "Log slow query": when slow query logging is enabled. If slow query logging is enabled and a session's "duration" is above the specified threshold we will create a record in the "slow queries log" and write all trace records created on the coordinator and on a replica if a replica's session lasts longer than that threshold. (We will propagate the Coordinator's slow query logging threshold to replicas in the context of a specific tracing/logging session). The properties above are independent, namely they may be enabled and/or disabled independently and any combination of them is legal (naturally, creating a tracing session when both states above are disabled makes no sense). - Instrument the tracing::tracing service to allow the following: - Enable/disable slow query logging. - Set/get the slow query duration threshold (in microseconds). - Set/get the slow query log record TTL value (in seconds). - Instrument the trace_keyspace_helper to write a slow query log entry when requested. - The slow query logging is disabled by default and the threshold is set to half a second. - The TTL of a slow log record is set to 86400 seconds by default. - It makes sense to use the same "slow query logging threshold" and a "slow query record TTL" both on a coordinator and on a replica Nodes in a context of the same tracing session: - Pass both TTL and a threshold to the replica in a trace_info. 
This patch also implements the new slow query logging specific logic: - Don't write the pending tracing records before the end of a tracing session until "duration" reaches the logging threshold. - Don't build the parameters<sstring, sstring> map unless we know we will write it to I/O. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-08-28 18:28:44 +03:00
Vlad Zolotarov	8609900621	tracing: introduce trace_state capabilities bit field - Instead of keeping separate booleans introduce a trace_state_props_set enum_set and pass it around instead of separate booleans. - Change the trace_info to hold this value in addition to write_on_close. Initialize a corresponding bit in an enum_set based on a write_on_close value in a trace_info constructor for a backward compatibility. - Separate a trace_state constructor into two: - For a primary session object. - For a secondary session object. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-08-23 18:34:36 +03:00
Paweł Dziepak	dcf794b04d	idl: make bytes compatible with bytes_ostream This patch makes idl type "bytes" compatible with both bytes and bytes_ostream. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-08-22 09:31:33 +01:00
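
The set reconciliation example in the row level repair merge (c96fc1d585) can be sketched in a few lines of C++. This is only an illustration of the arithmetic in that commit message, not the code from the repair module: `row_set`, `repair_plan`, `difference` and `plan_repair` are hypothetical names, and rows are modelled as plain strings rather than hashed mutation fragments.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model: a node's rows within one sync boundary, as strings.
using row_set = std::set<std::string>;

// Rows present in `a` but not in `b` (the "a - b" of the commit message).
row_set difference(const row_set& a, const row_set& b) {
    row_set out;
    std::set_difference(a.begin(), a.end(), b.begin(), b.end(),
                        std::inserter(out, out.end()));
    return out;
}

// Hypothetical result type: what the repair master fetches and sends.
struct repair_plan {
    row_set merged;                        // union of all rows (master after Step B)
    std::vector<row_set> fetch_from_peer;  // per peer: peer - master
    std::vector<row_set> send_to_peer;     // per peer: merged - peer
};

repair_plan plan_repair(const row_set& master, const std::vector<row_set>& peers) {
    assert(!peers.empty());  // a repair needs at least one peer
    repair_plan plan;
    plan.merged = master;
    // Step B: pull the rows the master is missing from each peer.
    for (const auto& peer : peers) {
        row_set missing = difference(peer, master);
        plan.merged.insert(missing.begin(), missing.end());
        plan.fetch_from_peer.push_back(std::move(missing));
    }
    // Step C: push to each peer the rows it lacks from the merged set.
    for (const auto& peer : peers) {
        plan.send_to_peer.push_back(difference(plan.merged, peer));
    }
    return plan;
}
```

With the three sets from the commit message, this plan fetches nothing from Node2 and row4 from Node3, then sends row1 and row4 to Node2 and row3 to Node3.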


94 Commits
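
Commit acf7bedad4 explains why the paging state must remember the row count of the last partition returned on the previous page. A minimal sketch of that trimming logic follows, assuming rows arrive grouped by partition key. The `row`, `paging_state` and `trim_page` names and fields are hypothetical and do not mirror the actual IDL definitions.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical row model: partition key plus a clustering value.
struct row {
    std::string partition_key;
    int clustering_key;
};

// Hypothetical paging state carrying the extra per-partition counter.
struct paging_state {
    std::optional<std::string> last_partition_key;
    std::uint32_t last_partition_row_count = 0;
};

// Trim one fetched page to the per-partition limit, continuing the count
// from the previous page when the page starts in the same partition.
std::vector<row> trim_page(const std::vector<row>& fetched,
                           std::uint32_t per_partition_limit,
                           paging_state& state) {
    std::vector<row> page;
    std::optional<std::string> current = state.last_partition_key;
    std::uint32_t count = state.last_partition_row_count;
    for (const auto& r : fetched) {
        if (!current || *current != r.partition_key) {
            current = r.partition_key;  // new partition: restart the count
            count = 0;
        }
        if (count < per_partition_limit) {
            page.push_back(r);
            ++count;
        }
    }
    // Persist the counter so the next page can resume from it.
    state.last_partition_key = current;
    state.last_partition_row_count = count;
    return page;
}
```

Without the persisted counter, a page that begins in the middle of a partition would restart from zero and could return more rows for that partition than the limit allows.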