scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-20 16:40:35 +00:00

Author	SHA1	Message	Date
Amnon Heiman	45b6070832	Merge seastar upstream * seastar 397685c...c1dbd89 (13): > lowres_clock: drop cache-line alignment for _timer > net/packet: add missing include > Merge "Adding histogram and description support" from Amnon > reactor: Fix the error: cannot bind 'std::unique_ptr' lvalue to 'std::unique_ptr&&' > Set the option '--server' of tests/tcp_sctp_client to be required > core/memory: Remove superfluous assignment > core/memory: Remove dead code > core/reactor: Use logger instead of cerr > fix inverted logic in overprovision parameter > rpc: fix timeout checking condition > rpc: use lowres_clock instead of high resolution one > semaphore: make semaphore's clock configurable > rpc: detect timedout outgoing packets earlier Includes treewide change to accommodate rpc changing its timeout clock to lowres_clock. Includes fixup from Amnon: collectd api should use the metrics getters As part of the preparation for the change in the metrics layer, this changes the way the collectd api uses the metrics value, so that it uses the getters instead of calling the member directly. This will be important when the internal implementation changes from union to variant. Signed-off-by: Amnon Heiman <amnon@scylladb.com> Message-Id: <1485457657-17634-1-git-send-email-amnon@scylladb.com>	2017-02-01 14:39:08 +02:00
Raphael S. Carvalho	1857ba0abc	db: fix bad resource usage distribution when resharding due to refresh That's because a single shard is used to calculate generation for new sstables in upload directory, and that will result in that single shard sharing all the resources with other shards. For refresh without upload dir, it currently works fine because we reshuffle column family dir instead. flush_upload_dir() is now a free function, takes a distributed database object, and uses calculate_shard_from_sstable_generation() to decide which shard will move sstable using its own generation namespace. Fixes #2008. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <b0cccf7bbb61416ff8718bac92fdca90cc5fb9c9.1484253232.git.raphaelsc@scylladb.com>	2017-01-19 18:55:21 +02:00
Duarte Nunes	d53f96e0da	column_family: Only update stats once for a shared sstables This patch ensures that when adding a shared sstable, we select only one cpu to update that column family's stats. This is important so we don't overestimate the on-disk size of sstables when resharding. For the current load this fixes only a temporary miscount, since shared sstables are eventually rewritten, but for the total load it fixes a permanent miscount. Refs #1592 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20170119144823.31041-1-duarte@scylladb.com>	2017-01-19 17:40:35 +02:00
Tomasz Grabiec	ddfee57c97	Replace iostream include with iosfwd in headers Message-Id: <1484656119-8386-4-git-send-email-tgrabiec@scylladb.com>	2017-01-17 14:52:44 +02:00
Vlad Zolotarov	cda382e8d6	database: move collectd registrations to metrics registration layer Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2017-01-10 16:24:54 -05:00
Raphael S. Carvalho	68dfcf5256	db: avoid excessive memory usage during resharding After resharding, sstables may be owned by all shards, which means that file descriptors and memory usage for metadata will increase by a factor equal to number of shards. That can easily lead to OOM. SSTable components are immutable, so they can be stored in one shard and shared with others that need it. We use the following formula to decide which shard will open the sstable and share it with the others: (generation % smp::count), which is the inverse of how we calculate generation for new sstables. So if no resharding is performed, everything is shard-local. With this approach, resource usage due to loaded sstables will be evenly distributed among shards. For this approach to work, we now only populate keyspaces from shard 0. It is now solely responsible for iterating through column family dirs. In addition, most of the population functions are now free and take a distributed database object as parameter. Fixes #1951. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2017-01-09 15:24:36 -02:00
Avi Kivity	868b4d110c	Merge "Fixes for intentional short reads" from Paweł "This patchset contains fixes for the changes introduced in "Query result size limiting". It also improves handling of short data reads. In order to minimise chances of digest mismatch during data queries, replicas that were asked just to return a digest also keep track of the size of the data (in the IDL representation) so that they would stop at the same point nodes doing full data queries would. Moreover, data queries are not affected by per-shard memory limit and the coordinator sends individual result size limits to replicas in order not to depend on hardcoded values. It is still possible to get digest mismatches if the IDL changes (e.g. a new field is added), but, hopefully, that won't be a serious problem." * 'pdziepak/short-read-fixes/v4' of github.com:cloudius-systems/seastar-dev: query: introduce result_memory_accounter::foreign_state storage_proxy: fix short reads in parallel range queries storage_proxy: pass maximum result size to replicas mutation_partition: use result limiter for digest reads query: make result_memory_limiter constants available for linker result_memory_limiter: add accounter for digest reads idl: allow writers to use any output stream result_memory_limiter: split new_read() to new_{data, mutation}_read() idl: is_short_read() was added in 1.6 mutation_partition: honour allowed_short_read for static rows storage_proxy: fix _is_short_read computation storage_proxy: disallow short reads if got no live rows storage_proxy: don't stop after result with no live rows	2016-12-26 10:42:49 +02:00
Raphael S. Carvalho	fd80499b3d	database: make column_family::add_sstable() private again Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <38226308bee2970a91b0e35370d6a646b85ecfe9.1482459877.git.raphaelsc@scylladb.com>	2016-12-23 11:42:16 +02:00
Paweł Dziepak	1a52569f7d	storage_proxy: pass maximum result size to replicas We may want to change the default individual result size limit in the future. If it is provided by the coordinator and not hardcoded in the replicas this can be done without causing data query digest mismatches or wasteful mutation query results.	2016-12-22 17:16:23 +01:00
Tomasz Grabiec	c7ff2a2bb0	db: Expose column_family::add_sstable Needed by compaction tests.	2016-12-22 13:24:46 +01:00
Avi Kivity	875635554d	Merge "Reduce overhead of partition presence checker during cache update" from Tomasz Refs #1943. * 'tgrabiec/optimize-bloom-filter' of github.com:cloudius-systems/seastar-dev: db: Compute key hash once in partition_presence_checker bloom_filter: Allow checking presence using pre-hashed key db: Use incremental selector in partition_presence_checker	2016-12-21 14:24:54 +02:00
Duarte Nunes	06ab61a570	schema_tables: Extract update_column_family This patch extracts update_column_family from schema_tables into database so it can be used when adding materialized views, in future patches. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	40c684b5f5	database: Extract common create cf code This patch moves some duplicate code into the add_column_family_and_create_directory() function. It also saves some superfluous keyspace lookups and readies the code to be used by materialized views. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	2b231f22b8	keyspace_metadata: Add tables() and views() functions This patch adds utility functions to keyspace_metadata to select only the tables or only the views out of all the schemas. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Duarte Nunes	7818339791	materialized views: Add view class This patch adds the view class, which will contain functions related to populating a view, either from the base table's write path or from the view building mechanism which copies over already existing data in the base table. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-12-20 13:06:11 +00:00
Tomasz Grabiec	78844fa2e5	db: Use incremental selector in partition_presence_checker This reduces the number of sstables we need to check to only those whose token range overlaps with the key. Reduces cache update time. Especially effective with leveled compaction strategy. Refs #1943. Incremental selector works with an immutable sstable set, so cache updates need to be serialized. Otherwise we could mispopulate due to stale presence information. Presence checker interface was changed to accept decorated key in order to gain easy access to the token, which is required by the incremental selector.	2016-12-19 14:20:58 +01:00
Asias He	937f28d2f1	Convert to use dht::partition_range_vector and dht::token_range_vector	2016-12-19 14:08:50 +08:00
Asias He	e5485f3ea6	Get rid of query::partition_range Use dht::partition_range instead	2016-12-19 08:09:25 +08:00
Glauber Costa	7133583797	track streaming and system virtual dirty memory A case could be made that we should have counters for them no matter what, since it can help us reason about the distribution of memory among the groups. But with the hierarchy being broken in 1.5 it becomes even more important. Now by looking solely at dirty, we will have no idea about how much memory we are using in those groups. After this patch, the dirty_memory_manager will register its metrics for the 3 groups that we have, and the legacy names will be used to show totals. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <0d04ca4c7e8472097f16a5dc950b77c73766049e.1481831644.git.glauber@scylladb.com>	2016-12-16 10:59:40 +02:00
Paweł Dziepak	cf679a413c	db: use multi range reader for streaming readers A naive approach was to create a set of readers for each range and pass them all to combining reader. This however performed badly if the number of ranges was high. The solution is to use multi range reader which uses only a single set of readers and fast forwards from range to range when necessary. This adds another requirement that the ranges passed to make_streaming_reader() are sorted and disjoint.	2016-12-15 13:54:43 +00:00
Paweł Dziepak	cfd4d0f680	db: add metrics for short reads and memory used for results Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:28:36 +00:00
Paweł Dziepak	ba51e7e8db	data_query: limit result size Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:10:02 +00:00
Paweł Dziepak	6c33a4f177	db: create result_memory_accounters when starting query This patch ensures that when we start executing a query a minimum result size is reserved from result_memory_limiter. Moreover, range queries need a way of merging memory usage information from different shards. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:10:02 +00:00
Paweł Dziepak	5d7185fd39	db: add result_memory_limiter Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-12-14 14:10:02 +00:00
Glauber Costa	2aa6514667	config: get rid of memtable_total_space Those values are now statically set. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-12-13 17:05:12 -05:00
Glauber Costa	80440c0d79	database: rework dirty memory hierarchy Issue #1918 describes a problem, in which we are generating smaller memtables than we could, and therefore not respecting the flush criteria. That happens because group sizes (and limits) for pressure purposes, and the soft threshold is currently at 40 %. This causes system group's soft threshold to be way below regular's virtual dirty limit and close to regular group's soft threshold. The system group was very likely to become under soft pressure when regular was because writes to regular group are not yet throttled when they cross both soft thresholds. This is a direct consequence of the linear hierarchy between the regions and to guarantee that it won't happen we would have to acquire the semaphore of all ancestor regions when flushing from a child region. While that works, it can lead to problems on its own, like priority inversion if the regions have different priorities - like streaming and regular, and groups lower in the hierarchy, like user, blocking explicit flushes from their ancestors. To fix that, this patch reorganizes the dirty memory region groups so that groups are now completely independent. As a disadvantage, when streaming happens we will draw some memory from the cache, but we will live with it for the time being. Fixes #1918 Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-12-13 14:07:53 -05:00
Glauber Costa	be9e4c71ad	database: remove flush_token We had a flush_token structure in addition to the flush_permit because we needed to keep a pointer to the dirty_memory_manager and apply changes to the region group upon the region destruction. Since Tomek's latest series, this is no longer needed and now this structure doesn't have a place in the world anymore. Simplify the code by removing it. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-12-13 13:59:34 -05:00
Glauber Costa	98030ad66c	database: abstract pressure condition notification Done in a separate patch to reduce clutter in the main patch. Soon we'll be testing for one more condition. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-12-13 13:59:34 -05:00
Glauber Costa	c9a8b03311	database: encapsulate semaphore_units into a flush_permit We will soon need to hold more than a semaphore_units<> object per flush, potentially. Preparation patch for that. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-12-13 13:59:34 -05:00
Glauber Costa	2e8c7d2c62	database: remove friendship declaration Not needed anymore since memtable started having a direct pointer to the memtable list. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-12-13 13:59:34 -05:00
Glauber Costa	8ab7c04caa	database: make memtable_list aware in cases it can't flush Some of our CFs can't be flushed. Those are the ones who are not marked as having durable writes. We treat them just the same from the point of view of the flush logic, but they provide a function that doesn't do anything and just returns right away. We already had troubles with that in the past, and that also poses a problem for an upcoming patch reworking the flush memtable pick criteria. It's easier, simpler, and cleaner, to just make the memtable_list aware it can't flush. Achieving that is also not very complicated: we just need a special constructor that doesn't take a seal function and then we make sure that it is initialized to an empty std::function Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-12-13 13:59:34 -05:00
Asias He	cd2105b8bd	database: make_streaming_reader for ranges Allow to make a streaming reader with a vector of ranges in addition to a single range. This will be used soon in following streaming patch. We can make the reader more efficient later.	2016-12-12 09:04:21 +08:00
Raphael S. Carvalho	fcfc84e836	compaction: reduce bloom filter overhead with incremental selector The procedure to calculate max purgeable timestamp is optimized by only visiting sstables that overlap with key being currently compacted. That's done using incremental sstable selector. Function to calculate maximum purgeable timestamp is made 10 times faster when compacting sstables overlap with 10% of all sstables. Fixes #1322. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-12-09 16:17:17 -02:00
Tomasz Grabiec	1b5f338c17	memtable: Track flushed memory in memtable object	2016-12-05 12:59:09 +01:00
Tomasz Grabiec	c3768fe4de	memtable: Pass dirty_memory_manager& to memtable constructor The implementation assumes that memtable's region group is owned by dirty_memory_manager, and tries to obtain a reference to it like this: boost::intrusive::get_parent_from_member(_region.group(), &dirty_memory_manager::_region_group)); This is undefined behavior when the region's group does not come from dirty manager. It's safer to be explicit about this dependency by taking a reference to dirty_memory_manager in the constructor.	2016-12-05 12:59:09 +01:00
Glauber Costa	d7256e7b21	database: do not call seal directly from the streaming timer Streaming memtables have a delayed mode where many flushes are coalesced together into one, with the actual flush happening later and propagated to all the previous waiters. However, the timer that triggers the actual flush was not using the newly introduced flush infrastructure. This was a minor problem because those flushes wouldn't try to take the semaphore, and so we could have many flushes going on at the same time. What was a potential performance issue became a correctness issue when we moved the reversal of the dirty memory accounting out of revert_potentially_cleaned_up_memory() into remove_from_flush_manager(). Since the latter is only called through the flush infrastructure, it simply wasn't called. So the deferral of the reversal exposed this bug. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <0d5755375bc27524b8cfb9970c76d492b14d9eea.1480522742.git.glauber@scylladb.com>	2016-11-30 18:00:55 +01:00
Tomasz Grabiec	b5d5612f98	database: Add counter for timed out writes	2016-11-29 16:40:59 +01:00
Tomasz Grabiec	2c561ecaed	db: Allow writes to be timed out	2016-11-29 16:40:58 +01:00
Tomasz Grabiec	b1ae6ad2ad	db: Introduce counters for failed reads and writes	2016-11-29 16:40:58 +01:00
Raphael S. Carvalho	f141b0cdae	database: atomically add new sstables to cf when refreshing New sstables are loaded and added in parallel, meaning that scylla can potentially return stale data if a new sstable containing a tombstone wasn't loaded yet. Compaction should also not run until all new sstables are added for similar reasons. Fix is about separating blocking and non-blocking steps to allow atomic add of multiple new sstables. Fixes #1368. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <14283b8a4a69127071d1fabef320a93c91817ec2.1480356073.git.raphaelsc@scylladb.com>	2016-11-28 20:30:48 +02:00
Avi Kivity	28857e42e7	Merge " Virtualize size_estimates system table" from Duarte "We currently write the size_estimates system table for every schema on a periodic basis, currently set to 5 minutes, which can interfere with an ongoing workload. This patchset virtualizes it such that queries are intercepted and we calculate the results on the fly, only for the ranges the caller is interested in. Fixes #1616" * 'virtual-estimates/v4' of github.com:duarten/scylla: size_estimates_virtual_reader: Add unit test db: Delete size_estimates_recorder size_estimates: Add virtual reader column_family: Add support for virtual readers storage_service: get_local_tokens() returns a future nonwrapping_range: Add slice() function range: Find a sequence's lower and upper bounds system_keyspace: Build mutations for size estimates size_estimates: Store the token range as bytes range_estimates: Add schema murmur3_partitioner: Convert maximum_token to sstring	2016-11-28 10:12:59 +02:00
Glauber Costa	c32803f2f0	database: move reversion of virtual dirty state closer to update_cache. When we finish writing a memtable, we revert the dirty memory charges immediately. When we do that, dirty memory will grow back to what it was, and soon (we hope) will go down again when we release the requests for real. During that time, we may not accept new requests. Sealing can take a long time, especially in the face of Linux issues like the ones we have seen in the past. It also will take proportionally more time if the SSTables end up being small, which is a possibility in some scenarios. This patch changes the dirty_memory_manager so that the charges won't be reverted right after we finish the flush. Rather, we will hold on to them, and revert them right before we update the cache. We don't need to do it for all classes of memtable writes, because after we finish flushing, flush_one() will destroy the hashed element anyway. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <2d5a8f6ca57d5036f4850ac163557bca59b8063d.1480004384.git.glauber@scylladb.com>	2016-11-24 18:18:15 +01:00
Glauber Costa	0ca8c3f162	database: keep a pointer to the memtable list in a memtable We currently pass a region group to the memtable, but after so many recent changes, that is a bit too low level. This patch changes that so we pass a memtable list instead. Doing that also has a couple of advantages. Mainly, during flush we must get from a memtable to a memtable_list. Currently we do that by going from the memtable to a column family through the schema, and from there to the memtable_list. That, however, involves calling virtual functions in a derived class, because a single column family could have both streaming and normal memtables. If we pass a memtable_list to the memtable, we can keep a pointer, and when needed get the memtable_list directly. Not only does that get rid of the inheritance for aesthetic reasons, but that inheritance is not even correct anymore. Since the introduction of the big streaming memtables, we now have a plethora of lists per column family and this traversal is totally wrong. We haven't noticed before because we were flushing the memtables based on their individual sizes, but it has been wrong all along for edge cases in which we would have to resort to size-based flush. This could be the case, for instance, with various plan_ids in flight at the same time. At this point, there is no more reason to keep the derived classes for the dirty_memory_manager. I'm only keeping them around to reduce clutter, although they are useful for the specialized constructors and to communicate to the reader exactly what they are. But those can be removed in a follow up patch if we want. The old memtable constructor signature is kept around for the benefit of two tests in memtable_tests which have their own flush logic. In the future we could do something like we do for the SSTable tests, and have a proxy class that is friends with the memtable class. That too, is left for the future. Fixes #1870 Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <811ec9e8e123dc5fc26eadbda82b0bae906657a9.1479743266.git.glauber@scylladb.com>	2016-11-21 18:18:27 +02:00
Duarte Nunes	cd7e2fd602	column_family: Add support for virtual readers Virtual readers allow queries to selected tables, usually system tables, to be answered by the engine. This is useful for tables which aren't written by users and whose contents can be calculated on demand. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-11-21 11:15:05 +00:00
Glauber Costa	461778918b	fix shutdown and exception conditions for flush logic This patch addresses post-merge follow up comments by Tomek. Basically, what we do is: - we don't need to signal() from remove_from_flush_manager(), because the explicit flushes no longer wait on the condition variable. So we don't. - We now wait on the stop() flushes (regardless of their return status) so we can make sure that the _flush_queue will indeed be done with. - we acquire the semaphore before shutting down the dirty_memory_manager to make sure that there are no pending flushes - the flush manager that holds the semaphore has to match in the exception handler Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <a23ab5098934546c660a08de64cd9294bb3a2008.1479400239.git.glauber@scylladb.com>	2016-11-17 21:16:44 +01:00
Glauber Costa	f08162e181	database: rework memtable flush logic The way we currently flush memtables, we seal the current one but wait on a semaphore for the actual flush to proceed. This is pointless, because if the flush is not proceeding we'll use up memory for the new entries anyway, be they in a newly opened memtable or not. As a matter of fact, by opening a new memtable we are forgoing coalescing opportunities. After recent changes to the flush paths, we are now in a position to do differently. We move the semaphore earlier, and if we can't acquire it we keep appending to the current memtable. For explicit flushes, we'll queue and prioritize them over memory-based flushes. This has the nice property of potentially coalescing various flushes for the same CF into one. Coalescing flushes for the same CF is particularly helpful for commitlog-initiated flushes that can't complete within the flush period. What we see currently is that under heavy load the commitlog will keep sealing memtables, adding to the existing load. Another interesting property of this approach is that we can keep the disk utilization higher, by allowing a new flush to start before the memtable is fully sealed. By design, every time a memtable is finished flushing it will call revert_potentially_cleaned_up_memory() to revert the virtual memory charges. That is the perfect moment for us to act. It indicates that all the data flushing part is done. The way we'll do it is by keeping the semaphore_units alive for this memtable. When the flush ends, we destroy that object. This will effectively trigger the next flush if there is a next flush that can be initiated. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-11-16 21:20:58 -05:00
Glauber Costa	895e838ac0	get rid of max_memtable_size After recent changes to the memtable code, there is no reason for us to uphold a maximum memtable size. Now that we only flush one memtable at a time anyway, and also have soft limit notifications from the region_group_reclaimer, we can just set the soft limit to the target size and let all of that be handled by the dirty_memory_manager. It does have the added property that we'll be flushing when we globally reach the soft limit threshold. In conditions in which we have multiple CF writes fighting for memory, that guarantees that we will start flushing much earlier than the hard limit. The threshold is set to 1/4 of dirty memory. While in theory we would prefer the memtables to go as big as 1/2 of dirty memory, in my experiments I have found 1/4 to be a better fit, at least for the moment. The reason for such behavior is that in situations where we have slow disks, setting the soft limit to 1/2 of dirty will put us in a situation in which we may not have finished writing down the memtable when we hit the limit, and then throttle. When we set the threshold to 1/4 of dirty, we don't throttle at all. This behavior could potentially be fixed by not doing the full memtable-based throttling after we do the commitlog throttling, but that is not something realistic for the moment. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-11-16 21:20:24 -05:00
Glauber Costa	2ed3f342c1	pass a region to dirty_memory_manager accounting API We would like to know which region a particular flush is coming from, and account accordingly. The reasoning behind that is that soon we'll be driving the flushes internally from the dirty_memory_manager without explicitly triggering them. We need to start a flush before the current one finishes, otherwise we'll have a period without significant disk activity when the current SSTable is being sealed, the caches are being updated, etc. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-11-16 21:20:24 -05:00
Tomasz Grabiec	c1a7e2090e	Revert "database: change find_column_families signature so it returns a lw_shared_ptr" This reverts commit `f3528ede65`.	2016-11-04 10:48:21 +01:00
Tomasz Grabiec	6366eb5cf8	Revert "correctly calculate latencies for writes" This reverts commit `a382f10fc4`.	2016-11-04 10:48:02 +01:00
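
Commit 68dfcf5256 above chooses the shard that opens a shared sstable with (generation % smp::count), which it describes as the inverse of how generations are calculated for new sstables. A minimal sketch of that round trip follows. It assumes new generations are allocated as base * shard_count + shard; that allocation formula and both function names are illustrative assumptions, not code taken from the Scylla tree.

```cpp
#include <cstdint>

// Hypothetical helper: build a generation number that belongs to `shard`.
// Assumption: generations are interleaved across shards as
// base * shard_count + shard.
constexpr std::int64_t generation_for_new_sstable(std::int64_t base,
                                                  unsigned shard,
                                                  unsigned shard_count) {
    return base * shard_count + shard;
}

// Hypothetical helper mirroring the formula quoted in the commit message:
// (generation % smp::count). Only meaningful for non-negative generations.
constexpr unsigned shard_from_generation(std::int64_t generation,
                                         unsigned shard_count) {
    return static_cast<unsigned>(generation % shard_count);
}

// Round trip: a generation allocated for shard 3 of 4 maps back to shard 3.
static_assert(shard_from_generation(generation_for_new_sstable(7, 3, 4), 4) == 3,
              "generation allocated on a shard must map back to that shard");
```

Under this assumption, sstables written without resharding are always opened by the shard that created them, which matches the commit's remark that everything stays shard-local in that case.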


462 Commits