scylladb

Author	SHA1	Message	Date
Duarte Nunes	1e75a4950e	database: Complete query when hitting partition limit Currently, we weren't completing a query as early as possible if it reached the partition limit; we instead had to wait until reaching the end of the specified partition ranges. This patch fixes that by including a check to the partition limit in the termination condition. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20161213114559.26438-1-duarte@scylladb.com>	2016-12-13 14:53:46 +02:00
Asias He	cd2105b8bd	database: make_streaming_reader for ranges Allow to make a streaming reader with a vector of ranges in addition to a single range. This will be used soon in following streaming patch. We can make the reader more efficient later.	2016-12-12 09:04:21 +08:00
Raphael S. Carvalho	fcfc84e836	compaction: reduce bloom filter overhead with incremental selector The procedure to calculate max purgeable timestamp is optimized by only visiting sstables that overlap with key being currently compacted. That's done using incremental sstable selector. Function to calculate maximum purgeable timestamp is made 10 times faster when compacting sstables overlap with 10% of all sstables. Fixes #1322. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-12-09 16:17:17 -02:00
Glauber Costa	733d87fcc6	database: try to acquire semaphore before we start flush As Tomek pointed out, as we are starting the flush before we acquire the semaphore, we are not really limiting parallelism, but only delaying the end of the flush instead. Fixes #1919 Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <6cbf9ec2f3a341c76becf94f794cfa16539c5192.1481120410.git.glauber@scylladb.com>	2016-12-08 12:18:32 +01:00
Tomasz Grabiec	527ff6aa40	db: Clear memtable after flush when cache is disabled So that memory is released gradually (impacting latency less) and sooner than when memtable is destroyed. Active readers may keep the memtable alive for unbounded amount of time. Refs #1879	2016-12-05 12:59:09 +01:00
Tomasz Grabiec	1b5f338c17	memtable: Track flushed memory in memtable object	2016-12-05 12:59:09 +01:00
Tomasz Grabiec	c3768fe4de	memtable: Pass dirty_memory_manager& to memtable constructor The implementation assumes that memtable's region group is owned by dirty_memory_manager, and tries to obtain a reference to it like this: boost::intrusive::get_parent_from_member(_region.group(), &dirty_memory_manager::_region_group)); This is undefined behavior when the region's group does not come from dirty manager. It's safer to be explicit about this dependency by taking a reference to dirty_memory_manager in the constructor.	2016-12-05 12:59:09 +01:00
Tomasz Grabiec	b5d5612f98	database: Add counter for timed out writes	2016-11-29 16:40:59 +01:00
Tomasz Grabiec	2c561ecaed	db: Allow writes to be timed out	2016-11-29 16:40:58 +01:00
Tomasz Grabiec	b1ae6ad2ad	db: Introduce counters for failed reads and writes	2016-11-29 16:40:58 +01:00
Raphael S. Carvalho	f141b0cdae	database: atomically add new sstables to cf when refreshing New sstables are loaded and added in parallel, meaning that scylla can potentially return stale data if a new sstable containing a tombstone wasn't loaded yet. Compaction should also not run until all new sstables are added for similar reasons. Fix is about separating blocking and non-blocking steps to allow atomic add of multiple new sstables. Fixes #1368. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <14283b8a4a69127071d1fabef320a93c91817ec2.1480356073.git.raphaelsc@scylladb.com>	2016-11-28 20:30:48 +02:00
Avi Kivity	28857e42e7	Merge " Virtualize size_estimates system table" from Duarte "We currently write the size_estimates system table for every schema on a periodic basis, currently set to 5 minutes, which can interfere with an ongoing workload. This patchset virtualizes it such that queries are intercepted and we calculate the results on the fly, only for the ranges the caller is interested in. Fixes #1616" * 'virtual-estimates/v4' of github.com:duarten/scylla: size_estimates_virtual_reader: Add unit test db: Delete size_estimates_recorder size_estimates: Add virtual reader column_family: Add support for virtual readers storage_service: get_local_tokens() returns a future nonwrapping_range: Add slice() function range: Find a sequence's lower and upper bounds system_keyspace: Build mutations for size estimates size_estimates: Store the token range as bytes range_estimates: Add schema murmur3_partitioner: Convert maximum_token to sstring	2016-11-28 10:12:59 +02:00
Glauber Costa	c32803f2f0	database: move reversion of virtual dirty state closer to update_cache. When we finish writing a memtable, we revert the dirty memory charges immediately. When we do that, dirty memory will grow back to what it was, and soon (we hope) will go down again when we release the requests for real. During that time, we may not accept new requests. Sealing can take a long time, especially in the face of Linux issues like the ones we have seen in the past. It also will take proportionally more time if the SSTables end up being small, which is a possibility in some scenarios. This patch changes the dirty_memory_manager so that the charges won't be reverted right after we finish the flush. Rather, we will hold on to it, and revert it right before we update the cache. We don't need to do it for all classes of memtable writes, because after we finish flushing, flush_one() will destroy the hashed element anyway. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <2d5a8f6ca57d5036f4850ac163557bca59b8063d.1480004384.git.glauber@scylladb.com>	2016-11-24 18:18:15 +01:00
Avi Kivity	d58c8aaa32	db: remove unused belongs_to_{current,other}_shard(s) functions Obsoleted by new sharding mechanism, but break the build for some.	2016-11-23 21:39:29 +02:00
Paweł Dziepak	919825a2c7	Merge "Improve sharding in large clusters" from Avi "Clusters with a large number of nodes, or a low number of vnodes, and a high number of shards, or a combination, suffer from an aliasing problem: both vnodes and intra-node sharding consider the most significant bits to select the owning node and owning shard respectively. Since the same bits are used for both, a low number of vnodes leads to some shards being overcommitted relative to others. This series fixes the problem by sharding on bits 0:47 of the token (murmur3 partitioner only), leaving the most significant 12 bits for vnodes. Simulation shows that this value provides reasonable sharding for 100-node, 30-shard clusters. In order to prevent re-sharding sstables on each boot, token ranges for the range are stored in a new sub-component of the sstable Statistics component. With the default 12 ignored bits we have 4096 token ranges for non-Level-compacted SSTables, which takes some space but is still reasonable. Fixes #1277."	2016-11-23 11:25:53 +00:00
Avi Kivity	024c8ef8a1	db: adjust sstable load to use sstable self-reporting of shard ownership Instead of calculating the owning shard from the sstable's partition key range, delegate to the new sstable method for getting owning shard information. This insulates us from changes in the sharding algorithm.	2016-11-22 21:56:40 +02:00
Glauber Costa	13973e7f3b	keep background work semaphore alive during sstable flush We have a semaphore controlling the amount of background work generated by the memtable flush process. However, because we are not moving it inside the memtable post-flush continuation, the units are being released when we start the flush and not when we finish it. That's not the intended behavior and that can cause flushes to accumulate. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <b7dc1866ed3473b9b1862c433d59c5ebd8575dbc.1479839600.git.glauber@scylladb.com>	2016-11-22 19:54:08 +01:00
Glauber Costa	0ca8c3f162	database: keep a pointer to the memtable list in a memtable We currently pass a region group to the memtable, but after so many recent changes, that is a bit too low level. This patch changes that so we pass a memtable list instead. Doing that also has a couple of advantages. Mainly, during flush we must get from a memtable to a memtable_list. Currently we do that by going from the memtable to a column family through the schema, and from there to the memtable_list. That, however, involves calling virtual functions in a derived class, because a single column family could have both streaming and normal memtables. If we pass a memtable_list to the memtable, we can keep a pointer, and when needed get the memtable_list directly. Not only does that get rid of the inheritance for aesthetic reasons, but that inheritance is not even correct anymore. Since the introduction of the big streaming memtables, we now have a plethora of lists per column family and this traversal is totally wrong. We haven't noticed before because we were flushing the memtables based on their individual sizes, but it has been wrong all along for edge cases in which we would have to resort to size-based flush. This could be the case, for instance, with various plan_ids in flight at the same time. At this point, there is no more reason to keep the derived classes for the dirty_memory_manager. I'm only keeping them around to reduce clutter, although they are useful for the specialized constructors and to communicate to the reader exactly what they are. But those can be removed in a follow up patch if we want. The old memtable constructor signature is kept around for the benefit of two tests in memtable_tests which have their own flush logic. In the future we could do something like we do for the SSTable tests, and have a proxy class that is friends with the memtable class. That, too, is left for the future. Fixes #1870 Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <811ec9e8e123dc5fc26eadbda82b0bae906657a9.1479743266.git.glauber@scylladb.com>	2016-11-21 18:18:27 +02:00
Duarte Nunes	cd7e2fd602	column_family: Add support for virtual readers Virtual readers allow queries to selected tables, usually system tables, to be answered by the engine. This is useful for tables which aren't written by users and whose contents can be calculated on demand. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-11-21 11:15:05 +00:00
Glauber Costa	504b5ac30f	database: don't check for waiters in the condition variable predicate. In the last iterations of this patchset, we have moved explicit flushes to acquire the semaphore directly and the coalescing inside the memtable_list. As a result, we are no longer keeping any kind of action for them inside the condition variable. Checking for them no longer has a purpose. This is a cleanup patch that removes those checks. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <732676ccfe4ac93eb57aa799ec94b841499a01a6.1479500646.git.glauber@scylladb.com>	2016-11-18 21:34:48 +01:00
Glauber Costa	1933349654	database: fix direct flushes of non-durable column families. If a Column Family is non-durable, then its flushes will never create a memtable flush reader. Our current flush logic depends on that being created and destroyed to release the semaphore permits on the flush. We will remove the permits ourselves if there is an exception, but not under normal circumstances. Given this issue, however, it would be more adequate to always try to remove the permits after we flush. If the permits were already removed by the flush reader, then this test will just see that the permit is not in the map and return. But if it is still there, then it is removed. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <049334c3b4bef620af2c7c045e6c84347dcf9013.1479498026.git.glauber@scylladb.com>	2016-11-18 21:32:29 +01:00
Glauber Costa	461778918b	fix shutdown and exception conditions for flush logic This patch addresses post-merge follow up comments by Tomek. Basically, what we do is: - we don't need to signal() from remove_from_flush_manager(), because the explicit flushes no longer wait on the condition variable. So we don't. - We now wait on the stop() flushes (regardless of their return status) so we can make sure that the _flush_queue will indeed be done with. - we acquire the semaphore before shutting down the dirty_memory_manager to make sure that there are no pending flushes - the flush manager that holds the semaphore has to match in the exception handler Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <a23ab5098934546c660a08de64cd9294bb3a2008.1479400239.git.glauber@scylladb.com>	2016-11-17 21:16:44 +01:00
Raphael S. Carvalho	3dc9294023	db: do not leak deleted sstable when deletion triggers an exception The leakage results in deleted sstables being opened until shutdown, and disk space isn't released. That's because column_family::rebuild_sstable_list() will not remove reference to deleted sstables if an exception was triggered in sstables::delete_atomically(). A sstable only has its files closed when its object is destructed. The exception happens when a major compaction is issued in parallel to a regular one, and one of them will be unable to delete a sstable already deleted by the other. That results in remove_by_toc_name() triggering boost::filesystem ::filesystem_error because TOC and temporary TOC don't exist. We wouldn't have seen this problem if major compaction were going through compaction manager, but remove_by_toc_name() and rebuild_sstable_list() should be made resilient. Fixes #1840. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <d43b2e78f9658e2c3c5bbb7f813756f18874bf92.1479390842.git.raphaelsc@scylladb.com>	2016-11-17 17:46:36 +02:00
Glauber Costa	f08162e181	database: rework memtable flush logic The way we currently flush memtables, we seal the current one but wait on a semaphore for the actual flush to proceed. This is pointless, because if the flush is not proceeding we'll use up memory for the new entries anyway, be them in a newly opened memtable or not. As a matter of fact, by opening a new memtable we are foregoing coalescing opportunities. After recent changes to the flush paths, we are now in a position to do differently. We move the semaphore earlier, and if we can't acquire it we keep appending to the current memtable. For explicit flushes, we'll queue and prioritize them over memory-based flushes. This has the nice property of potentially coalescing various flushes for the same CF into one. Coalescing flushes for the same CF is particularly helpful for commitlog-initiated flushes that can't complete within the flush period. What we see currently, is that under heavy load the commitlog will keep sealing memtables adding to the existing load. Another interesting property of this approach is that we can keep the disk utilization higher, by allowing a new flush to start before the memtable is fully sealed. By design, every time a memtable is finished flushing it will call revert_potentially_cleaned_up_memory() to revert the virtual memory charges. That is the perfect moment for us to act. It indicates that all the data flushing part is done. The way we'll do it is by keeping the semaphore_units alive for this memtable. When the flush ends, we destroy that object. This will effectively trigger the next flush if there is a next flush that can be initiated. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-11-16 21:20:58 -05:00
Glauber Costa	895e838ac0	get rid of max_memtable_size After recent changes to the memtable code, there is no reason for us to uphold a maximum memtable size. Now that we only flush one memtable at a time anyway, and also have soft limit notifications from the region_group_reclaimer, we can just set the soft limit to the target size and let all of that be handled by the dirty_memory_manager. It does have the added property that we'll be flushing when we globally reach the soft limit threshold. In conditions in which we have multiple CF writes fighting for memory, that guarantees that we will start flushing much earlier than the hard limit. The threshold is set to 1/4 of dirty memory. While in theory we would prefer the memtables to go as big as 1/2 of dirty memory, in my experiments I have found 1/4 to be a better fit, at least for the moment. The reason for such behavior is that in situations where we have slow disks, setting the soft limit to 1/2 of dirty will put us in a situation in which we may not have finished writing down the memtable when we hit the limit, and then throttle. When we set the threshold to 1/4 of dirty, we don't throttle at all. This behavior could potentially be fixed by not doing the full memtable-based throttling after we do the commitlog throttling, but that is not something realistic for the moment. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-11-16 21:20:24 -05:00
Glauber Costa	da738a6cd1	database: remove outdated comment Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-11-16 21:20:23 -05:00
Glauber Costa	919de98aa5	database: uphold virtual dirty for system tables. Currently the virtual dirty mechanism is not properly set for system tables. We haven't divided the system table allowance by two, which means it won't start throttling earlier as it was supposed to. In practice, this has little effect because system table requests are very well behaved, their sizes well known, and they tend to be force-flushed. But we should be consistent. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-11-16 21:20:23 -05:00
Gleb Natapov	27e041606b	fix LOCAL_ONE printout Message-Id: <20161109125307.GH7766@scylladb.com>	2016-11-09 12:53:55 +00:00
Tomasz Grabiec	c1a7e2090e	Revert "database: change find_column_families signature so it returns a lw_shared_ptr" This reverts commit `f3528ede65`.	2016-11-04 10:48:21 +01:00
Tomasz Grabiec	3b5ccda70e	Revert "database: refactor code so apply_in_memory() is called only once" This reverts commit `3f825f593d`.	2016-11-04 10:48:18 +01:00
Tomasz Grabiec	6366eb5cf8	Revert "correctly calculate latencies for writes" This reverts commit `a382f10fc4`.	2016-11-04 10:48:02 +01:00
Tomasz Grabiec	a5ee87611a	Revert "database: when querying, move latency counter instead of copying" This reverts commit `8840a5a593`.	2016-11-04 10:47:58 +01:00
Glauber Costa	8840a5a593	database: when querying, move latency counter instead of copying It is comprised of two time points. Let's move it instead of copying it. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <c7c155c77780e188bfbe05881c81ce86456016d5.1478111467.git.glauber@scylladb.com>	2016-11-03 13:27:31 +01:00
Glauber Costa	a382f10fc4	correctly calculate latencies for writes Right now we are calculating latencies only when we are about to add an item to the memtable. That's incorrect and misleading, for two reasons. First, it leaves the commitlog latencies out. But second, it is done after the memtable wall effect is applied, which means we are not counting throttle time either in the memtables or in the commitlog. To do that, we'll start the latency_counter object as soon as possible and move it all the way to apply_in_memory(). That should span the entire write operation. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <4e424780d290fd5938046060df2b17e2b470b717.1478111467.git.glauber@scylladb.com>	2016-11-03 13:27:31 +01:00
Glauber Costa	3f825f593d	database: refactor code so apply_in_memory() is called only once There are two variants of apply_in_memory() being called in do_apply(): with and without the commitlog. The main differences are that when the commitlog is involved, we need to wait for its future to complete before moving to apply_in_memory. That can easily be factored out by providing an always-ready future if we don't have the commitlog enabled, and waiting on that. The second, is that the commitlog version can cause apply_in_memory to generate an exception if there is replay position reordering. However, there is no harm in appending the exception handler to both versions. In one of them it's an impossible exception, but that's fine. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <8cee0cad9b1930a057a24e095f0a655069ae8be2.1478111467.git.glauber@scylladb.com>	2016-11-03 13:27:31 +01:00
Glauber Costa	f3528ede65	database: change find_column_families signature so it returns a lw_shared_ptr There are places in which we need to use the column family object many times, with deferring points in between. Because the column family may have been destroyed in the deferring point, we need to go and find it again. If we use lw_shared_ptr, however, we'll be able to at least guarantee that the object will be alive. Some users will still need to check, if they want to guarantee that the column family wasn't removed. But others that only need to make sure we don't access an invalid object will be able to avoid the cost of re-finding it just fine. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <722bf49e158da77ff509372c2034e5707706e5bf.1478111467.git.glauber@scylladb.com>	2016-11-03 13:27:31 +01:00
Avi Kivity	a35136533d	Convert ring_position and token ranges to be nonwrapping Wrapping ranges are a pain, so we are moving wrap handling to the edges. Since cql can't generate wrapping ranges, this means thrift and the ring maintenance code; also range->ring transformations need to merge the first and last ranges. Message-Id: <1478105905-31613-1-git-send-email-avi@scylladb.com>	2016-11-02 21:04:11 +02:00
Raphael S. Carvalho	d11e839520	db: make refresh resilient to permission denied error User may forget to set permission of new sstables in upload dir before refreshing them, and that will result in shutdown. io_checker is now able to work with a custom handler, so all we have to do is to whitelist EACCES. Fixes #1709. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-10-27 16:50:40 -02:00
Raphael S. Carvalho	a3e065da9b	db: make it possible to use custom error handler with io checker By default, io checker will cause Scylla to shut down if it finds specific system errors. Right now, io checker isn't flexible enough to allow a specialized handler. For example, we don't want Scylla to shut down if there's a permission problem when uploading new files from upload dir. This desired flexibility is made possible here by allowing a handler parameter to io check functions and also changing existing code to take advantage of it. That's a step towards fixing #1709. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-10-27 15:54:21 -02:00
Raphael S. Carvalho	bc2d351c25	sstables: remove duplicated declaration of remove_by_toc_name Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-10-26 11:21:27 -02:00
Raphael S. Carvalho	fa308c079c	database: fix collectd metrics for clustering key filter Same instance name was used for exported metrics, which is definitely wrong. Checked it works properly now via collectd exporter. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <471a36706113af60aeba86fb56a365feb4dab31a.1477086706.git.raphaelsc@scylladb.com>	2016-10-22 09:51:18 +03:00
Paweł Dziepak	6755a679f6	drop key readers key_readers weren't used since introduction of continuity flag to cache entries. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-10-19 15:29:08 +01:00
Paweł Dziepak	7bebfb851f	database: enable fast forwarding of range_sstable_reader When fast forwarding a reader that combines sstable reader we must also remember that the set of sstables for the new range may be different than for the previous one. The reader introduced in this patch makes sure that we read from correct sstables. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-10-19 15:29:08 +01:00
Tomasz Grabiec	4357d0a6d9	db: Add counter for writes blocked on dirty memory There is already queue_length-requests_blocked_memory, but it's a gauge so does not reflect what happened between the sampling points. total_operations-requests_blocked_memory will allow to see if there were any (and how many) requests which were blocked by dirty memory. Message-Id: <1476098616-12682-1-git-send-email-tgrabiec@scylladb.com>	2016-10-10 14:25:22 +03:00
Glauber Costa	33e9c2bbdd	memtable: reduce sstable flush concurrency to one Limiting the concurrency of memtable flushes to 4 was a temporary workaround for the fact that we lacked good write behind support. Now that write behind is properly merged we can reduce the concurrency to what it should be, one. This means that memtable flushes will now be serialized, and only when one of them ends will the next one begin. Disk parallelism is obtained through the write-behind mechanism. Fixes #1373 Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <528f9ef928b5101bed952df600eb8555c275497a.1475881100.git.glauber@scylladb.com>	2016-10-09 10:48:57 +03:00
Tomasz Grabiec	2a5a90f391	db: Do not timeout streaming readers There is a limit to concurrency of sstable readers on each shard. When this limit is exhausted (currently 100 readers) readers queue. There is a timeout after which queued readers are failed, equal to read_request_timeout_in_ms (5s by default). The reason we have the timeout here is primarily because the readers created for the purpose of serving a CQL request no longer need to execute after waiting longer than read_request_timeout_in_ms. The coordinator no longer waits for the result so there is no point in proceeding with the read. This timeout should not apply for readers created for streaming. The streaming client currently times out after 10 minutes, so we could wait at least that long. Timing out sooner makes streaming unreliable, which under high load may prevent streaming from completing. The change sets no timeout for streaming readers at replica level, similarly as we do for system tables readers. Fixes #1741. Message-Id: <1475840678-25606-1-git-send-email-tgrabiec@scylladb.com>	2016-10-07 15:41:04 +03:00
Raphael S. Carvalho	7ea4513595	database: trigger compaction after loading new sstables Scylla wasn't trying to compact new sstables uploaded via 'nodetool refresh'. Thus, all new sstables were left uncompacted until user issued 'nodetool flush' or a new sstable was written which would trigger compaction too. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <bbdf274c8bb49f4bedeefcb85da78a6fb61a1232.1475535203.git.raphaelsc@scylladb.com>	2016-10-06 18:26:49 +03:00
Avi Kivity	f8118d9fc2	Merge "Virtual dirty memory management" from Glauber "Description: ============ Scylla currently suffers from a brick wall behavior of the request throttler. Requests pile up until we reach the dirty memory limit, at which point we stop serving them until we have freed enough memory to allow for more requests. The problem is that freeing dirty memory means writing an SSTable to completion. That can take a long time, even if we are blessed with great disks. Those long waiting times can and will translate into timeouts. That is bad behavior. What this patch does is introduce one form of virtual dirty memory accounting. Instead of allowing 100 % of the dirty memory to be filled up until we stop accepting requests, we will do that when we reach 50 % of memory. However, instead of releasing requests only when an SSTable is fully written, we start releasing them when some memory was written. The practical effect of that is that once we reach 50 % occupancy in our dirty memory region, we will bring the system from CPU speed to disk speed, and will start accepting requests only at the rate we are able to write memory back. Results ======= With this patchset running a load big enough to easily saturate the disk, (commitlog disabled to highlight the effects of the memtable writer), I am able to run scylla for many minutes, with timeouts occurring only when I run out of disk space, whereas without this patch a swarm of timeouts would start merely 2 seconds after the load started - and would never get stable. In V2, I have sent a set of graphs illustrating the performance of this solution. This version does not have any significant differences in that front. For details, please refer to https://groups.google.com/d/msg/scylladb-dev/iCvD-3Z-QqY/EM8KUh_MAQAJ Accuracy of the accounting: --------------------------- It is important for us to be as accurate as possible when accounting freed memory, since every byte we mark as freed may allow one or more requests to be executed. I have measured the accuracy of this approach (ignoring padding, object size for the mutation fragments) to be 99.83 % of used memory in the test workload I have run (large, 65k mutations). Memtables under this circumstance tend to have a very high occupancy ratio because throttle breeds idle, and idle breeds compact-on-idle. Known Issues: ------------- A lot of time can be elapsed between destroying the flush_reader and actually releasing memory. The release of memory only happens when the SSTable is fully sealed, and we have to flush the files, as well as finish writing all SSTable components at this point. This happened in practice with a buggy kernel that would result in flushes taking a long time. After that is fixed, this is just a theoretical problem and in practice it shouldn't matter given the time we expect those operations to take." * 'virtual-dirty-v6' of github.com:glommer/scylla: database: allow virtual dirty memory management streamed_mutation: make _buffer private add accounting of memory read to partition_snapshot_reader move partition_snapshot_reader code to header file LSA: allow a group to query its own region group memtables: split scanning reader in two sstables: use special reader for writing a memtable LSA: export information about object memory footprint LSA: export information about size of the throttle queue database: export virtual dirty bytes region group	2016-10-04 20:57:52 +03:00
Glauber Costa	f89a67c75c	database: allow virtual dirty memory management Scylla currently suffers from a brick wall behavior of the request throttler. Requests pile up until we reach the dirty memory limit, at which point we stop serving them until we have freed enough memory to allow for more requests. The problem is that freeing dirty memory means writing an SSTable to completion. That can take a long time, even if we are blessed with great disks. Those long waiting times can and will translate into timeouts. That is bad behavior. What this patch does is introduce one form of virtual dirty memory accounting. Instead of allowing 100 % of the dirty memory to be filled up until we stop accepting requests, we will do that when we reach 50 % of memory. However, instead of releasing requests only when an SSTable is fully written, we start releasing them when some memory was written. The practical effect of that is that once we reach 50 % occupancy in our dirty memory region, we will bring the system from CPU speed to disk speed, and will start accepting requests only at the rate we are able to write memory back. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-10-04 10:39:10 -04:00
Raphael S. Carvalho	747b42299c	database: remove unused code Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <95e1ed590c9e45d15f19a84824a4dce05aefdab8.1475528611.git.raphaelsc@scylladb.com>	2016-10-04 09:26:43 +03:00
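
The virtual dirty memory accounting described in f89a67c75c can be illustrated with a small standalone model. This is only a sketch under stated assumptions: the class `virtual_dirty_accountant` and all of its members are hypothetical names invented for this example and are not ScyllaDB's `dirty_memory_manager` API. It shows the two ideas from the commit message: throttling starts at 50 % of the dirty memory budget, and bytes count as freed as soon as they are written back rather than when the SSTable is fully sealed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical toy model of virtual dirty accounting.
// Not ScyllaDB code; names and structure are assumptions for illustration.
class virtual_dirty_accountant {
    std::size_t _limit;        // total dirty memory budget, in bytes
    std::size_t _real = 0;     // bytes still held by memtables
    std::size_t _flushed = 0;  // bytes of a flushing memtable already written back
public:
    explicit virtual_dirty_accountant(std::size_t limit) : _limit(limit) {}

    // Virtual dirty = real dirty minus what has already reached disk.
    std::size_t virtual_dirty() const { return _real - _flushed; }

    // Stop admitting requests at 50 % of the budget instead of 100 %.
    bool should_throttle() const { return virtual_dirty() >= _limit / 2; }

    // A write adds to real dirty memory.
    void on_write(std::size_t bytes) { _real += bytes; }

    // Partial flush progress releases virtual dirty memory immediately,
    // clamped so it never exceeds the real dirty amount.
    void on_flush_progress(std::size_t bytes) {
        _flushed = std::min(_real, _flushed + bytes);
    }

    // When the flush completes, the memory is released for real, so the
    // virtual credit is converted into an actual reduction.
    void on_flush_complete() {
        _real -= _flushed;
        _flushed = 0;
    }
};
```

In this model, requests resume at the rate at which `on_flush_progress()` reports bytes written, which is the "CPU speed to disk speed" effect the commit message describes.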

667 Commits