scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-30 19:46:48 +00:00

Author	SHA1	Message	Date
Paweł Dziepak	dd67de7218	storage_proxy: make sure coordinator has complete data got_incomplete_information() ensures that the coordinator has received all required data from all replicas. (see `77dbe3c12f` "storage_proxy: fix reconciliation with limits" for the examples when that may not be the case). However, this function is called only if the reconciled result has at least as many rows as the user asked for. This was correct when we had only a total row limit: if the result was shorter than that, either all replicas sent all the data they have or the coordinator will retry anyway. However, since then we got a partition limit and a per-partition row limit, and a request may be limited by one of these while still being below the total row limit. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:28:36 +00:00
Paweł Dziepak	2ff5308d8e	storage_proxy: honour partition limit At the moment the coordinator does not care much for the partition limit. In particular it doesn't check whether after reconciliation the result still contains enough partitions. This patch makes it honour the partition limit and increase it in the retried queries if necessary. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:28:36 +00:00
Paweł Dziepak	7bed7aa7de	storage_proxy: use cmd limits to determine that replica reached end Coordinator may retry a query with larger limits. However, code determining whether replica has no more data always used the original limits. This may cause a livelock. For example, consider a cluster having the following partitions (deletions cover live cells): node1: pk=0, v=0; pk=1, v=1. node2: delete pk=0; delete pk=1; pk=2, v=2; pk=3, v=3. Now, if there is a query SELECT * FROM cf LIMIT 2, the first node is going to send partitions 0 and 1 while the second node is going to send 2 and 3 plus tombstones for 0 and 1. The coordinator will decide that it needs to retry the request with a larger row limit since node1 may have some information about partitions 2 and 3 that is newer than what node2 has sent. However, when the second response arrives node1 will still send only two rows since it has no more data. Because the coordinator uses the original row limit it will not notice that this node reached the end and we are going to get another retry without making any progress. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:28:36 +00:00
Paweł Dziepak	cfd4d0f680	db: add metrics for short reads and memory used for results Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:28:36 +00:00
Paweł Dziepak	ba51e7e8db	data_query: limit result size Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:10:02 +00:00
Paweł Dziepak	f1b9f49f2b	mutation_query: limit result size Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:10:02 +00:00
Paweł Dziepak	6c33a4f177	db: create result_memory_accounters when starting query This patch ensures that when we start executing a query a minimum result size is reserved from result_memory_limiter. Moreover, range queries need a way of merging memory usage information from different shards. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:10:02 +00:00
Paweł Dziepak	0bce4047bd	query_builder: add partition_slice getter Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:10:02 +00:00
Paweł Dziepak	15de8de9e5	reconcilable_result: keep result_memory_tracker object Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:10:02 +00:00
Paweł Dziepak	34f9eb4cbd	mutation_compactor: honour stop_iteration from consumers Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:10:02 +00:00
Paweł Dziepak	5d7185fd39	db: add result_memory_limiter Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:10:02 +00:00
Paweł Dziepak	ee89d80d5c	query: add result size limiter This patch introduces an infrastructure for limiting result size. There is a shard-local limit which makes sure that all results combined do not use more than 10% of the shard memory. There is also an individual limit which restricts a result to 4 MB. In order to avoid sending tiny results there is a minimum guaranteed size (4 kB), which the query needs to reserve before it starts producing the result. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:10:02 +00:00
Paweł Dziepak	43fe3439ca	reconcilable_result: properly propagate short_read flag reconcilable_result can be merged with another or transformed into query::result. Make sure that short_read information is never lost. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:10:02 +00:00
Paweł Dziepak	837d24f1b2	query_pagers: handle short reads properly Currently, the paging implementation assumes that the server either returns as many rows as it was asked for or has reached the end. Soon, that's not going to be true, so instead of making any assumptions about the number of rows returned, use the new "short read" flag to determine whether there is going to be more data. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:10:02 +00:00
Paweł Dziepak	da7ca85040	query: allow short reads When paging is used the cluster is allowed to return fewer rows than the client asked for. However, if this possibility is used we need a way of telling that to the coordinator and the paging implementation so that they can differentiate between short reads caused by the replica running out of data to send and short reads caused by any other means. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:10:01 +00:00
Paweł Dziepak	7a15c89b1d	serializer_impl: add serializer for bool_class<Tag> Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:10:01 +00:00
Takuya ASADA	8918a4be57	dist/common/scripts/scylla_setup: don't abort scylla_setup when each setup script failed Instead of abort scylla_setup, print warning message then continue to next setup. Fixes #1357 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1481713664-18429-1-git-send-email-syuu@scylladb.com>	2016-12-14 13:31:50 +02:00
Tomasz Grabiec	c9344826e9	tests: Remove unintentional enablement of trace-level logging Sneaked in by mistake.	2016-12-14 10:58:07 +01:00
Tomasz Grabiec	fe6a70dba1	tests: commitlog: Fix assumption about write visibility The test assumed that mutations added to the commitlog are visible to reads as soon as a new segment is opened. That's not true because buffers are written back in the background, and a new segment may be active while the previous one is still being written or not yet synced. Fix the test so that it expects that the number of mutations read this way is <= the number of mutations read, and that after all segments are synced, the number of mutations read is equal. Message-Id: <1481630481-19395-1-git-send-email-tgrabiec@scylladb.com>	2016-12-14 11:29:33 +02:00
Avi Kivity	a61ff53150	Merge "rework flush criteria" from Glauber "The current criteria for memtable flush is not being respected. The problem is demonstrated to happen when the dirty memory group is over limit, and so is the system table extra allowance. In that situation, both the normal region and the system table region will be under pressure and try to flush. More specifically, because the normal region inherits from the system region, if the normal region is under pressure (over the soft limit threshold), the system region will certainly be as well, even though it has an extra allowance. This is because after virtual dirty, we start blocking when we reach half the region, but memory itself can grow up to 100 % of the region. So the total amount of memory used will be certainly bigger than the system pressure threshold, which is now 50 % plus the allowance. To fix that, this patch reworks the flush logic so that the regions are not dependent on each other. Fixes #1918" * 'flush-criteria-v6' of github.com:glommer/scylla: config: get rid of memtable_total_space database: rework dirty memory hierarchy system keyspace: write batchlog mutation in user memory database: remove flush_token database: abstract pressure condition notification database: encapsulate semaphore_units into a flush_permit database: remove friendship declaration database: simplify flush_one database: make memtable_list aware in cases it can't flush	2016-12-14 11:24:10 +02:00
Takuya ASADA	c18a95cddf	dist/redhat: add scylla_lib.sh to scylla.spec Fix .rpm build error. Fixes #1932 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1481703992-9596-1-git-send-email-syuu@scylladb.com>	2016-12-14 10:27:37 +02:00
Glauber Costa	56df53f51e	compaction_manager: fix shutdown sequence By the time we are able to acquire this semaphore, we may be stopped already. So we need to test it before we go ahead. I can see shutdown hangs before this patch that are fixed with it applied. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <e5b378893128d086d584ffbb2acd3fb687648e5c.1481655433.git.glauber@scylladb.com>	2016-12-14 09:26:24 +01:00
Glauber Costa	2aa6514667	config: get rid of memtable_total_space Those values are now statically set. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-12-13 17:05:12 -05:00
Glauber Costa	80440c0d79	database: rework dirty memory hierarchy Issue #1918 describes a problem, in which we are generating smaller memtables than we could, and therefore not respecting the flush criteria. That happens because group sizes (and limits) for pressure purposes, and the soft threshold is currently at 40 %. This causes system group's soft threshold to be way below regular's virtual dirty limit and close to regular group's soft threshold. The system group was very likely to come under soft pressure when regular was, because writes to the regular group are not yet throttled when they cross both soft thresholds. This is a direct consequence of the linear hierarchy between the regions, and to guarantee that it won't happen we would have to acquire the semaphore of all ancestor regions when flushing from a child region. While that works, it can lead to problems of its own, like priority inversion if the regions have different priorities - like streaming and regular, and groups lower in the hierarchy, like user, blocking explicit flushes from their ancestors. To fix that, this patch reorganizes the dirty memory region groups so that groups are now completely independent. As a disadvantage, when streaming happens we will draw some memory from the cache, but we will live with it for the time being. Fixes #1918 Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-12-13 14:07:53 -05:00
Glauber Costa	db7cc3cba8	system keyspace: write batchlog mutation in user memory Batchlog is a potentially memory-intensive table whose workload is driven by user needs, not system's. Move it to the user dirty memory manager. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-12-13 13:59:35 -05:00
Glauber Costa	be9e4c71ad	database: remove flush_token We had a flush_token structure in addition to the flush_permit because we needed to keep a pointer to the dirty_memory_manager and apply changes to the region group upon the region destruction. Since Tomek's latest series, this is no longer needed and now this structure doesn't have a place in the world anymore. Simplify the code by removing it. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-12-13 13:59:34 -05:00
Glauber Costa	98030ad66c	database: abstract pressure condition notification Done in a separate patch to reduce clutter in the main patch. Soon we'll be testing for one more condition. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-12-13 13:59:34 -05:00
Glauber Costa	c9a8b03311	database: encapsulate semaphore_units into a flush_permit We will soon need to hold more than a semaphore_units<> object per flush, potentially. Preparation patch for that. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-12-13 13:59:34 -05:00
Glauber Costa	2e8c7d2c62	database: remove friendship declaration Not needed anymore since memtable started having a direct pointer to the memtable list. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-12-13 13:59:34 -05:00
Glauber Costa	bb1509c21e	database: simplify flush_one flush_one has to make sure that we're using the correct dirty_memory_manager object, because we could be flushing from a region group different than the one the flush request originated. It's simpler to just assume flush_one will be dealing with the right object, and use a different object instead of "this" when calling it. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-12-13 13:59:34 -05:00
Glauber Costa	8ab7c04caa	database: make memtable_list aware in cases it can't flush Some of our CFs can't be flushed. Those are the ones who are not marked as having durable writes. We treat them just the same from the point of view of the flush logic, but they provide a function that doesn't do anything and just returns right away. We already had troubles with that in the past, and that also poses a problem for an upcoming patch reworking the flush memtable pick criteria. It's easier, simpler, and cleaner, to just make the memtable_list aware it can't flush. Achieving that is also not very complicated: we just need a special constructor that doesn't take a seal function and then we make sure that it is initialized to an empty std::function Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-12-13 13:59:34 -05:00
Takuya ASADA	0a6312d254	dist/common/scripts/scylla_ntp_setup: fix incorrect usage of is_debian_variant Use it as "if is_debian_variant; then". Fixes #1931 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1481644262-29383-1-git-send-email-syuu@scylladb.com>	2016-12-13 18:29:42 +02:00
Takuya ASADA	ed4cd1908f	dist/common/scripts/scylla_selinux_setup: correct CentOS/RHEL detection CentOS/RHEL is using SELinux, and it's NOT Debian variant, so fixed from "is_debian_variant" to "! is_debian_variant". Fixes #1930 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1481643873-28984-1-git-send-email-syuu@scylladb.com>	2016-12-13 18:29:29 +02:00
Takuya ASADA	6c0dc55495	dist/common/scripts/scylla_selinux_setup: to use is_debian_variant(), need to source /usr/lib/scylla/scylla_lib.sh This fixes following command not found error: ``` /usr/sbin/scylla_selinux_setup: line 7: is_debian_variant: command not found ``` Fixes #1929 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1481643308-28637-1-git-send-email-syuu@scylladb.com>	2016-12-13 18:29:13 +02:00
Takuya ASADA	3b74c50546	dist/ubuntu: add uuidgen to package dependency We haven't added uuidgen to Ubuntu/Debian package dependency, so scylla_setup script may abort because of command not found. Fixes #1928 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1481642385-27941-1-git-send-email-syuu@scylladb.com>	2016-12-13 18:28:48 +02:00
Duarte Nunes	1e75a4950e	database: Complete query when hitting partition limit Until now, we weren't completing a query as early as possible if it reached the partition limit; we instead had to wait until reaching the end of the specified partition ranges. This patch fixes that by including a check of the partition limit in the termination condition. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20161213114559.26438-1-duarte@scylladb.com>	2016-12-13 14:53:46 +02:00
Tomasz Grabiec	f451014785	schema: Implement operator<< for column_mapping Message-Id: <1481310679-14074-1-git-send-email-tgrabiec@scylladb.com>	2016-12-13 12:20:46 +02:00
Tomasz Grabiec	059a1a4f22	db: Fix commitlog replay to not drop cell mutations with older schema column_mapping is not safe to access across shards, because data_type is not safe to access. One of the manifestations of this is that abstract_type::is_value_compatible_with() always fails if the two types belong to different shards. During replay, column_mapping lives on the replaying shard, and is used by converting_mutation_partition_applier against the schema on the target shard. Since types in the mapping will be considered incompatible with types in the schema, all cells will be dropped. Fix by using column_mapping in a safe way, by copying it to the target shard if necessary. Each shard maintains its own cache of column mappings. Fixes #1924. Message-Id: <1481310463-13868-1-git-send-email-tgrabiec@scylladb.com>	2016-12-13 12:19:32 +02:00
Avi Kivity	32d55bbb4c	Merge seastar upstream * seastar 0773e98...6fbd792 (2): > tls: Only run our "verify" function in client session > Merge "Clean the metric definition" from Amnon Includes patch from Amnon adjusting the metrics registration due to seastar API changes.	2016-12-13 12:17:14 +02:00
Avi Kivity	6f9c317b91	Merge "Use uuid file in housekeeping" from Amnon "This patch adds the use of uuid file to the housekeeping daily version check. uuid file are optional, if a file is missing no uuid will be used."	2016-12-13 10:52:44 +02:00
Avi Kivity	c67782f169	Merge seastar upstream * seastar 0a74317...0773e98 (6): > tls: Add support for client certificate verification & priority strings > semaphore: add consume_units > semaphore: add available_units() > thread: check need_preempt for threads in a scheduling group as well > tutorial: fix semaphore example, and text > stop_iteration: add && and || operators	2016-12-12 18:06:19 +02:00
Avi Kivity	c801cc4bd1	Merge "streaming and repair updates" from Asias "This series: - We can make reader with ranges - Fix possible use after free of 'si' - Streaming ranges now are sorted and merged - Fix shard_begin shard_end end loop in both streaming and repair"	2016-12-12 11:32:42 +02:00
Asias He	ba54654af3	streaming: Use interval_set to sort and merge ranges So that the ranges are sorted and have no overlaps. We can have fewer ranges to deal with and it can help the mutation readers to optimize. Here is an example of ranges generated by repair: Before: INFO 2016-12-07 17:44:21,185 [shard 0] stream_session - cf_id = dec9fa90-bc3b-11e6-af78-000000000001, before ranges = {(-3383928698815274642, -3376937163195039606], (-7260764223708720005, -7251657821052234309], (-4767213984179237293, -4747032371925842389], (-7645879646119667643, -7589962743703481776], (-2340199306656526861, -2320523117224780931], (-576028861239229331, -560973674020019962], (-4070378863644120252, -3987599893827407860], (-2551584407739673151, -2498779102482524711], (-5416061903556353312, -5354212455975869358], (37594980457713898, 67885601051654285], (3083778975065200884, 3091232478835418439], (3131345970514528877, 3187922544267434961], (5765437476661317163, 5778671293583720541], (5960610072466058818, 5972289771228014343], (7749618183851698485, 7758080813117351135], (-3987599893827407860, -3899198931034439776], (-7251657821052234309, -7131649010279865221], (-3576581915808403133, -3383928698815274642], (-417850207760366422, -327959672080599465], (-2671876682129336880, -2551584407739673151], (-1305178847032904465, -1137497074548854552], (8540448858050275827, 8610171849752115483], (-560973674020019962, -417850207760366422], (-2498779102482524711, -2340199306656526861], (2394447940525988167, 2523396860109747637], (-6703329224557608009, -6517757811218772762], (-3675103288021821677, -3576581915808403133], (-5622185785296846551, -5416061903556353312], (8610171849752115483, 8742605005068551458], (8068079250973315241, 8185655671734937642], (560264964510741191, 790641981923757238], (5581202487214475094, 5765437476661317163], (8742605005068551458, 8923908282731801645], (-6038176423022601107, -5622185785296846551], (5778671293583720541, 5960610072466058818], (-3899198931034439776, -3675103288021821677], (8356739976149429222, 8540448858050275827], (-6517757811218772762, -6038176423022601107], (-8052600134279395253, -7645879646119667643], (-327959672080599465, 37594980457713898], (7758080813117351135, 8019254284118543066], (4781565016737645510, 5067070718000527886], (2523396860109747637, 3083778975065200884], (-5354212455975869358, -4767213984179237293], (6784138025918878582, 7190719703944308372], (67885601051654285, 447405341661896387], (-2190610927722759275, -1305178847032904465], (-4747032371925842389, -4070378863644120252]}, size=48 After: INFO 2016-12-07 17:44:21,185 [shard 0] stream_session - cf_id = dec9fa90-bc3b-11e6-af78-000000000001, after ranges = {(-8052600134279395253, -7589962743703481776], (-7260764223708720005, -7131649010279865221], (-6703329224557608009, -3376937163195039606], (-2671876682129336880, -2320523117224780931], (-2190610927722759275, -1137497074548854552], (-576028861239229331, 447405341661896387], (560264964510741191, 790641981923757238], (2394447940525988167, 3091232478835418439], (3131345970514528877, 3187922544267434961], (4781565016737645510, 5067070718000527886], (5581202487214475094, 5972289771228014343], (6784138025918878582, 7190719703944308372], (7749618183851698485, 8019254284118543066], (8068079250973315241, 8185655671734937642], (8356739976149429222, 8923908282731801645]}, size=15	2016-12-12 11:09:26 +08:00
Asias He	e523803a5d	token_metadata: Introduce interval_to_range helper It is used to convert a boost::icl::interval<token> interval back to a range<token>.	2016-12-12 11:09:26 +08:00
Asias He	af3d76e6ac	repair: Fix a typo in the log sucessfully -> successfully	2016-12-12 11:09:26 +08:00
Asias He	374324e6fb	repair: Fix shard_begin and shard_end A range now alternates between different shards: the first part of the range goes to shard X, the next to shard X+1, but after a while we go back to shard X. So we can't do a simple loop between shard_begin and shard_end. Fix by using the newly introduced dht::split_range_to_shards Use the cf.make_streaming_reader with ranges to simplify the code a bit.	2016-12-12 11:09:26 +08:00
Asias He	1987264beb	streaming: Make streaming reader with ranges Now that we have the new interface to make readers with ranges, we can simplify the code a lot. 1) Fewer readers are needed. Before: number of ranges of readers; after: smp::count readers at most. 2) No foreign_ptr is needed. There is no need to forward to a shard to make the foreign_ptr for send_info in the first phase and forward to that shard to execute the send_info in the second phase. 3) No do_with is needed in send_mutations since si now is a lw_shared_ptr. 4) Fix possible use after free of 'si' in do_send_mutations. We need to take a reference of 'si' when sending the mutation with send_stream_mutation rpc call, otherwise: msg1 got exception; si->mutations_done.broken(); si is freed; msg2 got exception; si is used again. The issue is introduced in `dc50ce0ce5` (streaming: Make the mutation readers when streaming starts) which is master only, branch 1.5 is not affected.	2016-12-12 09:04:21 +08:00
Asias He	463cc4fbde	dht: Introduce split_ranges_to_shards Split ranges into a shard ranges map with the ring_position_range_sharder helper.	2016-12-12 09:04:21 +08:00
Asias He	044c4ff44c	dht: Introduce split_range_to_shards Split a range into shard ranges map with ring_position_range_sharder helper.	2016-12-12 09:04:21 +08:00
Asias He	cd2105b8bd	database: make_streaming_reader for ranges Allow to make a streaming reader with a vector of ranges in addition to a single range. This will be used soon in following streaming patch. We can make the reader more efficient later.	2016-12-12 09:04:21 +08:00
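
The effect of commit ba54654af3 (sorting ranges and merging the overlapping or touching ones) can be illustrated with a minimal sketch. The commit itself uses interval_set; the version below is a plain standard-library stand-in, not the code from the patch. It assumes non-wrapping ranges of the form (start, end], and the names `token_range` and `sort_and_merge` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for a token range (start, end]:
// start is exclusive, end is inclusive, and start < end (no wrap-around).
using token_range = std::pair<std::int64_t, std::int64_t>;

// Sorts the ranges by start and merges every pair that overlaps or touches.
// This mimics the observable result described in the commit message,
// not the interval_set based implementation it actually uses.
std::vector<token_range> sort_and_merge(std::vector<token_range> ranges) {
    std::sort(ranges.begin(), ranges.end());
    std::vector<token_range> merged;
    for (const auto& r : ranges) {
        // (a, b] and (c, d] with a <= c form one range when c <= b,
        // because (a, b] followed by (b, d] covers (a, d] without a gap.
        if (!merged.empty() && r.first <= merged.back().second) {
            merged.back().second = std::max(merged.back().second, r.second);
        } else {
            merged.push_back(r);
        }
    }
    // Sanity check: the output is sorted and has gaps between ranges.
    for (std::size_t i = 1; i < merged.size(); ++i) {
        assert(merged[i - 1].second < merged[i].first);
    }
    return merged;
}
```

Applied to two of the ranges from the log above, (-3576581915808403133, -3383928698815274642] and (-3383928698815274642, -3376937163195039606], the sketch yields the single range (-3576581915808403133, -3376937163195039606], which is consistent with those tokens ending up inside one merged range in the "after" output.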

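The three limits described in commit ee89d80d5c (a shard-wide budget of 10% of shard memory, a 4 MB cap per result, and a 4 kB minimum reserved up front) can be sketched as follows. This is a simplified, single-threaded illustration under stated assumptions, not the result_memory_limiter from the patch: all class and method names are hypothetical, and the sketch reports failure where a real implementation might wait for memory to become available.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical shard-local budget: all results together may use
// at most 10% of the shard memory.
class result_limiter_sketch {
    std::size_t _available;
public:
    static constexpr std::size_t minimum_result_size = 4 * 1024;        // 4 kB
    static constexpr std::size_t maximum_result_size = 4 * 1024 * 1024; // 4 MB

    explicit result_limiter_sketch(std::size_t shard_memory)
        : _available(shard_memory / 10) {}

    // Assumption: fail immediately instead of waiting for memory.
    bool reserve(std::size_t bytes) {
        if (bytes > _available) {
            return false;
        }
        _available -= bytes;
        return true;
    }
    void release(std::size_t bytes) { _available += bytes; }
    std::size_t available() const { return _available; }
};

// Hypothetical per-query accounter tracking one result.
class result_accounter_sketch {
    result_limiter_sketch& _limiter;
    std::size_t _reserved = 0;
    std::size_t _used = 0;
public:
    explicit result_accounter_sketch(result_limiter_sketch& l) : _limiter(l) {}
    result_accounter_sketch(const result_accounter_sketch&) = delete;
    result_accounter_sketch& operator=(const result_accounter_sketch&) = delete;
    ~result_accounter_sketch() { _limiter.release(_reserved); }

    // Reserve the guaranteed minimum before producing any data.
    bool start() {
        if (!_limiter.reserve(result_limiter_sketch::minimum_result_size)) {
            return false;
        }
        _reserved = result_limiter_sketch::minimum_result_size;
        return true;
    }

    // Account for newly produced bytes. Returns true when the query
    // should stop and return a short read.
    bool update_and_check(std::size_t bytes) {
        _used += bytes;
        if (_used > result_limiter_sketch::maximum_result_size) {
            return true; // individual limit reached
        }
        if (_used > _reserved) {
            std::size_t extra = _used - _reserved;
            if (!_limiter.reserve(extra)) {
                return true; // shard-wide budget exhausted
            }
            _reserved += extra;
        }
        return false;
    }
};
```

Growth up to the reserved minimum never consults the shard-wide budget again, so a query that has started can always return at least 4 kB. Stopping early in this way is what makes a result a short read in the sense of commit da7ca85040.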

10907 Commits