scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-30 19:46:48 +00:00

Author	SHA1	Message	Date
Avi Kivity	f3c8cbbac5	Merge "Introduce dht::token_range and dht::partition_range" from Asias "nonwrapping_range<ring_position> and nonwrapping_range<token> are used in many places. Let's make an alias for them to make it less verbose. Also there is a query::partition_range in query-request.hh which is the alias of nonwrapping_range<ring_position>. query::partition_range is used in places not related to query at all. Let's unify the usage project wide." * tag 'asias/repair_dht_token_range/v2' of github.com:cloudius-systems/seastar-dev: Convert to use dht::partition_range_vector and dht::token_range_vector dht: Introduce dht::partition_range_vector and dht::token_range_vector Get rid of query::partition_range Convert to use dht::partition_range Convert to use dht::token_range dht: Rename token_range to token_range_endpoints dht: Introduce dht::token_range and dht::partition_range	2016-12-19 10:59:52 +02:00
Asias He	937f28d2f1	Convert to use dht::partition_range_vector and dht::token_range_vector	2016-12-19 14:08:50 +08:00
Asias He	7a446986fa	dht: Introduce dht::partition_range_vector and dht::token_range_vector std::vector<dht::partition_range> and std::vector<dht::token_range> are used in a lot of places, introduce dht::partition_range_vector and dht::token_range_vector as the alias.	2016-12-19 08:09:28 +08:00
Asias He	e5485f3ea6	Get rid of query::partition_range Use dht::partition_range instead	2016-12-19 08:09:25 +08:00
Asias He	85034c1b57	Convert to use dht::partition_range	2016-12-19 08:04:30 +08:00
Asias He	d1178fa299	Convert to use dht::token_range	2016-12-19 08:04:29 +08:00
Asias He	1f06eedb58	dht: Rename token_range to token_range_endpoints It is a helper class used in storage_service only. Rename it so we can use it for the real dht::token_range.	2016-12-19 08:04:29 +08:00
Asias He	264b6ee69e	dht: Introduce dht::token_range and dht::partition_range nonwrapping_range<ring_position> and nonwrapping_range<token> are used in many places. Let's make an alias for them to make it less verbose. Also there is a query::partition_range in query-request.hh which is the alias of nonwrapping_range<ring_position>. query::partition_range is used in places not related to query at all. Let's unify the usage project wide.	2016-12-19 08:04:29 +08:00
Avi Kivity	32fb4c3661	Merge "repair: Reduce unnecessary streaming traffic even more" from Asias "In `7c873f0d` (repair: Reduce unnecessary streaming traffic), we optimize in cases when 1) all the remote nodes have the same checksum and 2) local node has zero checksum. In this series, we make the optimization more generic and cover more cases." * tag 'asias/repair/node_reducer/v3' of github.com:cloudius-systems/seastar-dev: repair: Reduce unnecessary streaming traffic even more repair: Add hash specialization for partition_checksum	2016-12-18 16:53:39 +02:00
Avi Kivity	3421ebe8be	Merge "storage_proxy: Enforce row limit" from Duarte "This patchset ensures the row limit is enforced at the storage_proxy level. Upper layers like the pager may already be depending on this behavior." * 'enforce-row-limit/v3' of https://github.com/duarten/scylla: query_pagers: Don't trim returned rows select_statement: Don't always trim result set query_result_merger: Limit rows mutation_query: to_data_query_result enforces row limit	2016-12-18 08:15:51 +02:00
Avi Kivity	6bb875bdb7	Merge "storage_proxy: Enforce partition limit" from Duarte "This patchset ensures the partition limit is enforced at the storage_proxy level. To achieve this, we add the partition count to query::result, and allow the result_merger to trim excess partitions." * 'enforce-partition-limit/v3' of https://github.com/duarten/scylla: storage_proxy: Decrease limits when retrying command storage_proxy: Don't fetch superfluous partitions query::result: Add partition count column_family: Use counters in query::result::builder query_result_builder: Use the underlying counters mutation_partition: Count partitions in query_compacted mutation_partition: Remove tabs in query_compacted query::result::builder: Add partition count query_result_merger: Limit partitions	2016-12-16 13:57:37 +02:00
Glauber Costa	7133583797	track streaming and system virtual dirty memory A case could be made that we should have counters for them no matter what, since it can help us reason about the distribution of memory among the groups. But with the hierarchy being broken in 1.5 it becomes even more important. Now by looking solely at dirty, we will have no idea about how much memory we are using in those groups. After this patch, the dirty_memory_manager will register its metrics for the 3 groups that we have, and the legacy names will be used to show totals. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <0d04ca4c7e8472097f16a5dc950b77c73766049e.1481831644.git.glauber@scylladb.com>	2016-12-16 10:59:40 +02:00
Avi Kivity	293876c72f	Merge "Limit number of readers streaming uses" from Paweł "Original, naive db::make_streaming_reader() implementation created a set of memtable and sstable readers for every partition range. This caused bad interaction with the code limiting sstable readers concurrency and was suboptimal. This series introduces multi range mutation reader that takes mutation source and a sorted, disjoint vector of ranges. It creates only a single set of memtable and sstable readers and fast forwards it to the next range once the current one is completed." * 'pdziepak/multi-range-reader/v1' of github.com:cloudius-systems/seastar-dev: db: use multi range reader for streaming readers dht: describe split_range[s]_to_shards() guarantees repair: remove outdated fixme test/mutation_reader_test: add multi_range_reader test tests/mutation_reader: extract key creation code mutation_reader: add multi_range_reader	2016-12-15 17:48:31 +02:00
Paweł Dziepak	cf679a413c	db: use multi range reader for streaming readers A naive approach was to create a set of readers for each range and pass them all to combining reader. This however performed badly if the number of ranges was high. The solution is to use multi range reader which uses only a single set of readers and fast forwards from range to range when necessary. This adds another requirement that the ranges passed to make_streaming_reader() are sorted and disjoint.	2016-12-15 13:54:43 +00:00
Paweł Dziepak	b86a826baf	dht: describe split_range[s]_to_shards() guarantees We are going to require these functions to return sorted and disjoint ranges. They already do so (provided that the input ranges are sorted and disjoint), but if the guarantee is not explicitly stated it may disappear some day.	2016-12-15 13:07:32 +00:00
Paweł Dziepak	5287417136	repair: remove outdated fixme	2016-12-15 13:07:32 +00:00
Paweł Dziepak	5b0cf20f75	test/mutation_reader_test: add multi_range_reader test	2016-12-15 13:07:32 +00:00
Paweł Dziepak	787a976c2b	tests/mutation_reader: extract key creation code	2016-12-15 13:07:32 +00:00
Paweł Dziepak	52a4e79210	mutation_reader: add multi_range_reader So far, the only way to combine outputs of multiple readers was to use combining reader. It is very general and, in particular, supports the case when the readers emit mutations from overlapping ranges. However, we have cases (e.g. streaming) when we need to read from several disjoint ranges. Combining reader is a suboptimal solution as it requires creating a reader for each range and ignores the fact that they do not overlap. This patch introduces multi_range_mutation_reader which takes a mutation_source and a sorted set of disjoint ranges. Internally, it uses mutation_reader::fast_forward_to() to move to the next range once the current one is completed.	2016-12-15 13:07:31 +00:00
Duarte Nunes	0518895f5b	query_pagers: Don't trim returned rows Since storage_proxy::query() now respects the read_command limits, we can remove the trimming logic from query_pagers. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-15 11:00:46 +00:00
Duarte Nunes	7ce859799b	select_statement: Don't always trim result set Trimming the result set is only needed when the query contains an "IN" relation, an ORDER BY clause, and defines a limit, which is the case where we query different ranges concurrently. We don't use the result_merger to trim since we first need to reorder the rows. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-15 11:00:46 +00:00
Duarte Nunes	fee0b7fa48	query_result_merger: Limit rows This patch makes the row limit enforced by the storage_proxy layer. It adds a row limit to the query_result_merger, useful when merging results for concurrent queries. More importantly, it provides guarantees that upper layers may be relying on implicitly (e.g., the paging code). Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-15 11:00:36 +00:00
Duarte Nunes	efc986d548	mutation_query: to_data_query_result enforces row limit This patch changes mutation_query::to_data_query_result() so that it enforces the row limit alongside the partition limit and the per-partition limit. In the following patch, we'll enforce the row limit in an upper layer, but this lets us optimize the case where only one replica replies. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-15 10:56:40 +00:00
Duarte Nunes	c2072c7dc9	storage_proxy: Decrease limits when retrying command This patch changes a read_command's limits when retrying it, so that we don't ask for more rows than necessary. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-15 10:41:06 +00:00
Duarte Nunes	9572c19dc6	storage_proxy: Don't fetch superfluous partitions This patch ensures we keep track of how many partitions we've queried so we don't ask for more than the number we need. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-15 10:27:46 +00:00
Duarte Nunes	93be8d7cef	query::result: Add partition count This patch adds a partition count to query::result, filled by the query::result::builder. The partition count is present whenever the result carries data, being absent only for the case where the result contains only a digest. We also ensure that counts are present for an empty query::result. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-15 10:27:46 +00:00
Duarte Nunes	781cd82cb8	column_family: Use counters in query::result::builder This patch changes column_family::query() to use the counters in the builder to determine how many partitions and rows to ask for and also to implement the stop condition. This saves a continuation to do the bookkeeping, and allows us to remove data_query_result. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-15 10:27:46 +00:00
Duarte Nunes	05b2ef4fa2	query_result_builder: Use the underlying counters This patch changes the query_result_builder to use the counters provided by the query::result::builder. It also ensures they are kept current. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-15 10:27:46 +00:00
Duarte Nunes	f5cf7f7921	mutation_partition: Count partitions in query_compacted This patch changes mutation_partition::query_compacted() to count the number of partitions written to the underlying writer. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-15 10:27:46 +00:00
Duarte Nunes	f21dfb8217	mutation_partition: Remove tabs in query_compacted Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-15 10:27:46 +00:00
Duarte Nunes	2409b6b250	query::result::builder: Add partition count This patch adds a partition count to the query::result::builder. It is intended to be incremented by users, and later used to build a query::result. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-15 10:27:46 +00:00
Duarte Nunes	108011a839	query_result_merger: Limit partitions This patch adds a partition limit to the query_result_merger, useful when merging results for concurrent queries. This change also makes the partition limit enforced by the storage_proxy layer, no changes being needed by the upper layers, namely the Thrift interface. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-15 10:27:41 +00:00
Pekka Enberg	06c5216c9d	Merge "Improve gossip feature logging" from Asias	2016-12-15 10:36:54 +02:00
Asias He	e578e65103	gossip: Log feature enabled message on shard zero only A feature is per node. No need to log it once per shard.	2016-12-15 16:33:11 +08:00
Asias He	4137fab91b	gossip: Make log in check_features debug level We saw the message twice for the same feature check. This is a bit confusing. INFO 2016-12-15 11:26:23,993 [shard 0] gossip - Checking if need_features {RANGE_TOMBSTONES} in features {} INFO 2016-12-15 11:26:23,993 [shard 0] gossip - Checking if need_features {RANGE_TOMBSTONES} in features {} INFO 2016-12-15 11:26:23,993 [shard 0] gossip - Checking if need_features {LARGE_PARTITIONS} in features {} INFO 2016-12-15 11:26:23,993 [shard 0] gossip - Checking if need_features {LARGE_PARTITIONS} in features {} This is because ss._range_tombstones_feature = gms::feature(RANGE_TOMBSTONES_FEATURE); ss._large_partitions_feature = gms::feature(LARGE_PARTITIONS_FEATURE); The first message is printed when gms::feature(RANGE_TOMBSTONES_FEATURE) is constructed. The second message is printed when the ss._range_tombstones_feature is copy-constructed.	2016-12-15 16:33:10 +08:00
Asias He	2b1ebc4719	gossip: Introduce gms:features::enable helper Add the helper function to enable a feature and log that the feature is enabled. When a feature is enabled, we see INFO 2016-12-15 11:29:32,443 [shard 0] gossip - Feature LARGE_PARTITIONS is enabled INFO 2016-12-15 11:29:32,443 [shard 0] gossip - Feature RANGE_TOMBSTONES is enabled in the log.	2016-12-15 16:33:10 +08:00
Paweł Dziepak	b70e5d2089	Merge seastar upstream Submodule seastar 6fbd792..0b98024: > fstream: fix read ahead byte metric types > fstream: add read-ahead metrics > future-util: make stop_iteration use bool_class<> > util: introduce bool_class<Tag>	2016-12-14 15:01:13 +00:00
Avi Kivity	57f4910832	Merge "Query result size limiting" from Paweł "This series makes Scylla limit the size of the query results it produces in case they grow unreasonably large. This is possible because CQL paging queries do not guarantee that the returned page is going to have page_size rows, and pages smaller than that do not indicate the end of the stream. Non-paged queries and Thrift requests do not have such flexibility and they still get all the requested data (though their memory usage is still accounted for and may limit paged queries). There is a maximum result size (1 MB) and all result builders will stop after reaching it. Moreover, there is a per-shard limitation on the amount of memory used by all results combined (10%). To avoid tiny results, a query has to reserve (wait if necessary) 4 kB before it starts executing; after that it can consume more memory without any additional waiting, provided it is below the individual and shard-local limits. Enabling the cluster to return fewer rows than requested also means some changes for the coordinator. Firstly, if it receives such a short result from a replica, retrying it with a larger limit makes no sense. Instead, in such cases the coordinator removes the clustering rows it has incomplete information about and sends a short result back to the client. Moreover, even if no replica returned a short response, reconciliation may have made the result short. In this case, the coordinator does not necessarily need to retry the query either. Unfortunately, with the current implementation short responses ruin data queries since they will cause a digest mismatch.
Three new metrics were added: * database_bytes_total_result_memory -- total memory used by query results * database_total_operations_short_data_queries -- data queries that were limited by size, particularly bad as it forces the coordinator to retry them as mutation queries * database_total_operations_short_mutation_queries -- mutation queries limited by size" * 'pdziepak/short-paged-reads/v4' of github.com:cloudius-systems/seastar-dev: storage_proxy: clean up after primary_key introduction cql3: allow short reads with paged queries storage_proxy: handle intentional short reads storage_proxy: make sure coordinator has complete data storage_proxy: honour partition limit storage_proxy: use cmd limits to determine that replica reached end db: add metrics for short reads and memory used for results data_query: limit result size mutation_query: limit result size db: create result_memory_accounters when starting query query_builder: add partition_slice getter reconcilable_result: keep result_memory_tracker object mutation_compactor: honour stop_iteration from consumers db: add result_memory_limiter query: add result size limiter reconcilable_result: properly propagate short_read flag query_pagers: handle short reads properly query: allow short reads serializer_impl: add serializer for bool_class<Tag>	2016-12-14 16:53:07 +02:00
Paweł Dziepak	4c69d7e2fe	storage_proxy: clean up after primary_key introduction primary_key was introduced as a replacement for std::pair<dht::decorated_key, std::optional<clustering_key>>. To simplify the patch introducing it, its fields were named 'first' and 'second'. This patch changes the names to something less useless, removes the old row_address alias and removes is_missing_rows() in favour of the primary_key::less_compare_clustering comparator. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:28:37 +00:00
Paweł Dziepak	dde4bd5051	cql3: allow short reads with paged queries Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:28:37 +00:00
Paweł Dziepak	3c173d87b5	storage_proxy: handle intentional short reads If the result is going to be too large the replica may decide to make it shorter and coordinator should handle this properly (i.e. do not retry). Moreover, coordinator could avoid some retries by setting the short_read flag itself. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:28:37 +00:00
Paweł Dziepak	dd67de7218	storage_proxy: make sure coordinator has complete data got_incomplete_information() ensures that the coordinator has received all required data from all replicas (see `77dbe3c12f` "storage_proxy: fix reconciliation with limits" for examples of when that may not be the case). However, this function is called only if the reconciled result has at least as many rows as the user asked for. This was correct when we had only a total row limit: if the result was shorter than that, either all replicas sent all the data they have or the coordinator will retry anyway. However, since then we got a partition limit and a per-partition row limit, and a request may be limited by one of these while still being below the total row limit. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:28:36 +00:00
Paweł Dziepak	2ff5308d8e	storage_proxy: honour partition limit At the moment the coordinator does not care much for the partition limit. In particular it doesn't check whether after reconciliation the result still contains enough partitions. This patch makes it honour the partition limit and increase it in the retried queries if necessary. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:28:36 +00:00
Paweł Dziepak	7bed7aa7de	storage_proxy: use cmd limits to determine that replica reached end Coordinator may retry a query with larger limits. However, the code determining whether a replica has no more data always used the original limits. This may cause a livelock. For example, consider a cluster having the following partitions (deletions cover live cells): node1: pk=0, v=0; pk=1, v=1. node2: delete pk=0; delete pk=1; pk=2, v=2; pk=3, v=3. Now, if there is a query SELECT * FROM cf LIMIT 2, the first node is going to send partitions 0 and 1 while the second node is going to send 2 and 3 + tombstones for 0 and 1. The coordinator will decide that it needs to retry the request with a larger row limit since node1 may have some information about partitions 2 and 3 that is newer than what node2 has sent. However, when the second response arrives node1 will still send only two rows since it has no more data. Because the coordinator uses the original row limit it will not notice that this node reached the end and we are going to get another retry without making any progress. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:28:36 +00:00
Paweł Dziepak	cfd4d0f680	db: add metrics for short reads and memory used for results Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:28:36 +00:00
Paweł Dziepak	ba51e7e8db	data_query: limit result size Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:10:02 +00:00
Paweł Dziepak	f1b9f49f2b	mutation_query: limit result size Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:10:02 +00:00
Paweł Dziepak	6c33a4f177	db: create result_memory_accounters when starting query This patch ensures that when we start executing a query a minimum result size is reserved from result_memory_limiter. Moreover, range queries need a way of merging memory usage information from different shards. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:10:02 +00:00
Paweł Dziepak	0bce4047bd	query_builder: add partition_slice getter Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:10:02 +00:00
Paweł Dziepak	15de8de9e5	reconcilable_result: keep result_memory_tracker object Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:10:02 +00:00
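
The multi-range reader idea described in 52a4e79210 and cf679a413c (one reader, fast-forwarded across sorted, disjoint ranges instead of one reader per range) can be illustrated with a toy sketch. This is not Scylla's implementation: `key_range`, `sorted_source_reader` and `read_ranges` are hypothetical names, integer keys stand in for partitions, and a sorted `std::vector<int>` stands in for a mutation source. Only the name `fast_forward_to` is borrowed from the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a partition range: half-open interval [start, end).
struct key_range {
    int start;
    int end;
};

// Hypothetical stand-in for a reader over a sorted data source.
// It keeps a single cursor that only ever moves forward.
class sorted_source_reader {
    const std::vector<int>& _keys;
    std::size_t _pos = 0;
    key_range _current{0, 0};
public:
    explicit sorted_source_reader(const std::vector<int>& keys) : _keys(keys) {}

    // Skip ahead to the first key >= r.start. The cursor never moves back,
    // which is why the ranges must be sorted and disjoint.
    void fast_forward_to(key_range r) {
        _current = r;
        while (_pos < _keys.size() && _keys[_pos] < r.start) {
            ++_pos;
        }
    }

    // Return the next key inside the current range, or nullptr when the
    // range (or the source) is exhausted.
    const int* next() {
        if (_pos < _keys.size() && _keys[_pos] < _current.end) {
            return &_keys[_pos++];
        }
        return nullptr;
    }
};

// Read all keys covered by the given sorted, disjoint ranges using ONE reader.
std::vector<int> read_ranges(const std::vector<int>& keys,
                             const std::vector<key_range>& ranges) {
    sorted_source_reader reader(keys);
    std::vector<int> out;
    for (const key_range& r : ranges) {
        reader.fast_forward_to(r);
        while (const int* k = reader.next()) {
            out.push_back(*k);
        }
    }
    return out;
}
```

Calling `read_ranges` with keys `{1, 3, 5, 7, 9, 11}` and ranges `[0, 4)` and `[7, 10)` visits each key at most once with a single cursor. If the ranges were unsorted or overlapping, the forward-only cursor would silently skip data, which matches the requirement stated in cf679a413c that ranges passed to make_streaming_reader() be sorted and disjoint.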


10950 Commits