scylladb

Author	SHA1	Message	Date
Glauber Costa	f08162e181	database: rework memtable flush logic The way we currently flush memtables, we seal the current one but wait on a semaphore for the actual flush to proceed. This is pointless, because if the flush is not proceeding we'll use up memory for the new entries anyway, be they in a newly opened memtable or not. As a matter of fact, by opening a new memtable we are foregoing coalescing opportunities. After recent changes to the flush paths, we are now in a position to do differently. We move the semaphore earlier, and if we can't acquire it we keep appending to the current memtable. For explicit flushes, we'll queue and prioritize them over memory-based flushes. This has the nice property of potentially coalescing various flushes for the same CF into one. Coalescing flushes for the same CF is particularly helpful for commitlog-initiated flushes that can't complete within the flush period. What we see currently is that under heavy load the commitlog will keep sealing memtables, adding to the existing load. Another interesting property of this approach is that we can keep the disk utilization higher, by allowing a new flush to start before the memtable is fully sealed. By design, every time a memtable is finished flushing it will call revert_potentially_cleaned_up_memory() to revert the virtual memory charges. That is the perfect moment for us to act. It indicates that all the data flushing part is done. The way we'll do it is by keeping the semaphore_units alive for this memtable. When the flush ends, we destroy that object. This will effectively trigger the next flush if there is a next flush that can be initiated. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-11-16 21:20:58 -05:00
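The coalescing behaviour described in f08162e181 can be illustrated with a toy model. This is a minimal Python sketch, not ScyllaDB code: `FlushCoordinator`, `request_flush`, and `flush_done` are hypothetical names, and a single "in progress" slot stands in for the semaphore units that the commit keeps alive for the flushing memtable. Prioritizing explicit flushes over memory-based ones is not modelled.

```python
from collections import OrderedDict


class FlushCoordinator:
    """Toy model: one flush at a time, repeated requests per CF coalesce."""

    def __init__(self):
        self.in_progress = None        # CF currently being flushed, if any
        self.pending = OrderedDict()   # CF name -> number of coalesced requests
        self.started = []              # order in which flushes actually began

    def request_flush(self, cf):
        if self.in_progress is None:
            self._start(cf)
        else:
            # The flush slot is taken: do not seal a new memtable, just
            # remember that this CF wants a flush. Repeats collapse into one.
            self.pending[cf] = self.pending.get(cf, 0) + 1

    def _start(self, cf):
        self.in_progress = cf
        self.started.append(cf)

    def flush_done(self):
        # Releasing the slot is what triggers the next flush, if one is queued.
        self.in_progress = None
        if self.pending:
            cf, _count = self.pending.popitem(last=False)
            self._start(cf)
```

In this model, three flush requests for the same CF that arrive while another flush is running result in one flush of that CF once the slot is released.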
Glauber Costa	895e838ac0	get rid of max_memtable_size After recent changes to the memtable code, there is no reason for us to uphold a maximum memtable size. Now that we only flush one memtable at a time anyway, and also have soft limit notifications from the region_group_reclaimer, we can just set the soft limit to the target size and let all of that be handled by the dirty_memory_manager. It does have the added property that we'll be flushing when we globally reach the soft limit threshold. In conditions in which we have multiple CF writes fighting for memory, that guarantees that we will start flushing much earlier than the hard limit. The threshold is set to 1/4 of dirty memory. While in theory we would prefer the memtables to go as big as 1/2 of dirty memory, in my experiments I have found 1/4 to be a better fit, at least for the moment. The reason for such behavior is that in situations where we have slow disks, setting the soft limit to 1/2 of dirty will put us in a situation in which we may not have finished writing down the memtable when we hit the limit, and then throttle. When we set the threshold to 1/4 of dirty, we don't throttle at all. This behavior could potentially be fixed by not doing the full memtable-based throttling after we do the commitlog throttling, but that is not something realistic for the moment. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-11-16 21:20:24 -05:00
Glauber Costa	2ed3f342c1	pass a region to dirty_memory_manager accounting API We would like to know which region a particular flush is coming from, and account accordingly. The reasoning behind that is that soon we'll be driving the flushes internally from the dirty_memory_manager without explicitly triggering them. We need to start a flush before the current one finishes, otherwise we'll have a period without significant disk activity when the current SSTable is being sealed, the caches are being updated, etc. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-11-16 21:20:24 -05:00
Tomasz Grabiec	c1a7e2090e	Revert "database: change find_column_families signature so it returns a lw_shared_ptr" This reverts commit `f3528ede65`.	2016-11-04 10:48:21 +01:00
Tomasz Grabiec	6366eb5cf8	Revert "correctly calculate latencies for writes" This reverts commit `a382f10fc4`.	2016-11-04 10:48:02 +01:00
Glauber Costa	a382f10fc4	correctly calculate latencies for writes Right now we are calculating latencies only when we are about to add an item to the memtable. That's incorrect and misleading, for two reasons. First, it leaves the commitlog latencies out. But second, it is done after the memtable wall effect is applied, which means we are not counting throttle time in either the memtables or the commitlog. To do that, we'll start the latency_counter object as soon as possible and move it all the way to apply_in_memory(). That should span the entire write operation. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <4e424780d290fd5938046060df2b17e2b470b717.1478111467.git.glauber@scylladb.com>	2016-11-03 13:27:31 +01:00
Glauber Costa	f3528ede65	database: change find_column_families signature so it returns a lw_shared_ptr There are places in which we need to use the column family object many times, with deferring points in between. Because the column family may have been destroyed in the deferring point, we need to go and find it again. If we use lw_shared_ptr, however, we'll be able to at least guarantee that the object will be alive. Some users will still need to check, if they want to guarantee that the column family wasn't removed. But others that only need to make sure we don't access an invalid object will be able to avoid the cost of re-finding it just fine. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <722bf49e158da77ff509372c2034e5707706e5bf.1478111467.git.glauber@scylladb.com>	2016-11-03 13:27:31 +01:00
Paweł Dziepak	6755a679f6	drop key readers key_readers weren't used since introduction of continuity flag to cache entries. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-10-19 15:29:08 +01:00
Glauber Costa	33e9c2bbdd	memtable: reduce sstable flush concurrency to one Limiting the concurrency of memtable flushes to 4 was a temporary workaround for the fact that we lacked good write behind support. Now that write behind is properly merged we can reduce the concurrency to what it should be, one. This means that memtable flushes will now be serialized, and only when one of them ends will the next one begin. Disk parallelism is obtained through the write-behind mechanism. Fixes #1373 Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <528f9ef928b5101bed952df600eb8555c275497a.1475881100.git.glauber@scylladb.com>	2016-10-09 10:48:57 +03:00
Tomasz Grabiec	2a5a90f391	db: Do not timeout streaming readers There is a limit to concurrency of sstable readers on each shard. When this limit is exhausted (currently 100 readers) readers queue. There is a timeout after which queued readers are failed, equal to read_request_timeout_in_ms (5s by default). The reason we have the timeout here is primarily because the readers created for the purpose of serving a CQL request no longer need to execute after waiting longer than read_request_timeout_in_ms. The coordinator no longer waits for the result so there is no point in proceeding with the read. This timeout should not apply for readers created for streaming. The streaming client currently times out after 10 minutes, so we could wait at least that long. Timing out sooner makes streaming unreliable, which under high load may prevent streaming from completing. The change sets no timeout for streaming readers at replica level, similarly as we do for system tables readers. Fixes #1741. Message-Id: <1475840678-25606-1-git-send-email-tgrabiec@scylladb.com>	2016-10-07 15:41:04 +03:00
Glauber Costa	aa6a96d09b	database: export virtual dirty bytes region group Currently, we export the region group where memtables are placed as dirty bytes. Upcoming patches will optimistically mark some bytes in this region as free, a scheme we know as "virtual dirty". We are still interested in knowing the real state of the dirty region, so we will keep track of the bytes virtually freed and split the counters in two. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-09-27 12:09:08 -04:00
Asias He	b505e34062	database: Introduce make_streaming_reader The make_streaming_reader returns a combined mutation reader that reads mutations from sstables and memtable. The memtable reader handles memtable flushing automatically so no special handling is needed here. It will be used by streaming soon.	2016-09-26 16:02:48 +08:00
Raphael S. Carvalho	67343798cf	api: implement api to return sstable count per level 'nodetool cfstats' wasn't showing per-level sstable count because the API wasn't implemented. Fixes #1119. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <0dcdf9196eaec1692003fcc8ef18c77d0834b2c6.1474410770.git.raphaelsc@scylladb.com>	2016-09-21 09:13:40 +03:00
Raphael S. Carvalho	b9f67351da	db: expose clustering filter info via collectd That's needed to observe behavior of clustering filter, and to check if it's worthwhile for a specific workload. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-09-02 11:32:23 -03:00
Duarte Nunes	f4cf2f2aef	tracing: Make trace_state_ptr argument required This patch makes the optional trace_state_ptr arguments introduced in previous patches mandatory where possible. Functions which are called internally don't have a trace context, so for those we keep the argument's default value for convenience. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-09-01 12:04:32 +02:00
Duarte Nunes	030db65c62	database: Accept a trace_state_ptr This patch changes the database and column_family types so a trace_state_ptr can be passed in when querying. This enables tracing of the inner components. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-09-01 12:04:28 +02:00
Glauber Costa	0f413695ac	database: make column family stats mutable The make_reader method is currently a const method, but we would like to start keeping hit statistics from it. Instead of relaxing the const condition too much, we can just mark the _stats field as mutable, indicating that make_reader will not be able to change anything in the CF, except for keeping statistics. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-08-31 15:13:24 -04:00
Glauber Costa	5c4d73577a	initialize sstables_per_read histogram with 35 instead of 90 buckets This is to match what Cassandra does. Nodetool may be expecting this on the other side. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-08-31 15:13:24 -04:00
Glauber Costa	4310635bae	move estimated histogram to utils Nothing sstable-specific in it, really. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-08-31 15:13:23 -04:00
Glauber Costa	ffc2131c51	decouple estimated_histogram from sstables There is nothing really that fundamentally ties the estimated histogram to sstables. This patch gets rid of the few incidental ties. They are: - the namespace name, which is now moved to utils. Users inside sstables/ now need to add a namespace prefix, while the ones outside have to change it to the right one - sstables::merge, which has a very non-descriptive name to begin with, is changed to a more descriptive name that can live inside utils/ - the disk_types.hh include has to be removed - but it had no reason to be here in the first place. Todo, is to actually move the file outside sstables/. That is done in a separate step for clarity. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-08-31 15:13:23 -04:00
Piotr Jastrzebski	3607d99269	Remove clustering_key_filtering_context. Remove clustering_key_filter_factory and clustering_key_filtering_context. Use partition_slice directly with a static get_ranges method. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-08-30 20:31:55 +02:00
Piotr Jastrzebski	636a4acfd0	Add flag to configure max size of a cached partition. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-07-21 09:47:20 +02:00
Duarte Nunes	3518db531e	database: Get non-system column_families This patch adds a utility function that allows fetching the set of column_families that do not belong to the system keyspace. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-18 23:58:31 +00:00
Duarte Nunes	4bc00c2055	database: Expose selection of sstables by a range This patch allows a set of a column_family's sstables to be selected according to a range of ring_positions. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-18 23:58:31 +00:00
Avi Kivity	24e3026e32	Merge "compaction manager refactoring" from Raphael	2016-07-10 17:16:23 +03:00
Tomasz Grabiec	c0233c877d	db: Avoid out-of-memory when flushing cannot keep up memtable_list::seal_on_overlflow() is called on each mutation to check if current memtable should be flushed. It will call memtable_list::seal_active_memtable() when that is the case. The number of concurrent seals is guarded by a semaphore, starting from commit `0f64eb7e7d`, and allows at most 4 of them. If there are 4 flushes already pending, every incoming mutation will enqueue a new flush task on the semaphore's wait list, without waiting for it. The wait queue can grow without bounds, eventually leading to out-of-memory. The fix is to seal the memtable immediately to satisfy should_flush() condition, but limit concurrency of actual flushes. This way the wait queue size on the semaphore is limited by memtables pending a flush, which is fairly limited. Message-Id: <1467997652-16513-1-git-send-email-tgrabiec@scylladb.com>	2016-07-10 10:53:51 +03:00
Raphael S. Carvalho	e38f66c6fe	database: make certain column family functions const qualified Reviewed-by: Nadav Har'El <nyh@scylladb.com> Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-07-08 15:05:22 -03:00
Paweł Dziepak	32a5de7a1f	db: handle receiving fragmented mutations If mutations are fragmented during streaming, special care must be taken so that isolation guarantees are not broken. Mutations received with flag "fragmented" set are applied to a memtable that is used only by that particular streaming task and the sstables created by flushing such memtables are not made visible until the task is complete. Also, in case the streaming fails all data is dropped. This means that fragmented mutations cannot benefit from coalescing of writes from multiple streaming plans, hence the separate way of handling them so that there is no loss of performance for small partitions. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-07 12:18:35 +01:00
Paweł Dziepak	f2ae31711e	streaming: inform CF when streaming fails Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-07 12:18:35 +01:00
Paweł Dziepak	4031c0ed8f	streaming: pass plan_id to column family for apply and flush plan_id is needed to keep track of the origin of mutations so that if they are fragmented all fragments are made visible at the same time, when that particular streaming plan_id completes. Basically, each streaming plan that sends big (fragmented) mutations is going to have its own memtables and a list of sstables which will get flushed and made visible when that plan completes (or dropped if it fails). Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-07 12:18:35 +01:00
Paweł Dziepak	51ec7a7285	db: wait for ongoing flushes at end of streaming When flush_streaming_mutations() is called at the end of streaming it is supposed to flush all data and then invalidate cache ranges. However, if there are already some memtable flushes in progress it won't wait for them. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-07 12:18:35 +01:00
Glauber Costa	54ce6221a7	allow the dirty memory manager to be used without a database object Some of our tests don't provide a database object to a CF. Create a default dirty memory manager object that can be used without a database for them. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <872f8c9232ff87d788e271b1db86c814d7a75d9f.1467832713.git.glauber@scylladb.com>	2016-07-07 10:00:43 +01:00
Glauber Costa	b0932ceb04	database: act on LSA pressure notification Issue 1195 describes a scenario with a fairly easy reproducer in which we can freeze the database. That involves writing simultaneously to multiple CFs, such that the sum of all the memory they are using is larger than the dirty memory limit, without any of them individually being larger than the memtable size. Because we will never reach the individual memtable seal size for any of them, none of them will initiate a flush, leading the database to a halt. The LSA has now gained infrastructure that allows us to be notified when pressure conditions mount. What we will do in this case is initiate a flush ourselves. Fixes #1195 Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-07-05 17:46:28 -04:00
Glauber Costa	7169b727ea	move system tables to its own region In the spirit of what we are doing for the read semaphore, this patch moves system writes to its own dirty memory manager. Not only will it make sure that system tables will not be serialized by its own semaphore, but it will also put system tables in its own region group. Moving system tables to its own region group has the advantage that system requests won't be waiting during throttle behind a potentially big queue of user requests, since requests are tended to in FIFO order within the same region group. However, system tables being more controlled and predictable, we can actually go a step further and give them some extra reservation so they may not necessarily block even if under pressure (up to 10 MB more). Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-07-05 17:46:28 -04:00
Glauber Costa	c358947284	database: wrap semaphore and region group into a new dirty memory manager We currently have a semaphore in the column family level that protects us against multiple concurrent sstable flushes. However, storing that semaphore into the CF, not the database, was a (implementation, not design) mistake. One comment in particular makes it quite clear: // Ideally, we'd allow one memtable flush per shard (or per database object), and write-behind // would take care of the rest. But that still has issues, so we'll limit parallelism to some // number (4), that we will hopefully reduce to 1 when write behind works. So I aimed for the shard, but ended up coding it into the CF because that's closer to the flush point - my bad. This patch fixes this while paving the way for active reclaim to take place. It wraps the semaphore and the region group in a new structure, the dirty_memory_manager. The immediate benefit is that we don't need to be passing both the semaphore and the region group downwards in the DB -> CF path. The long term benefit is that we now have one unified structure that can hold shared flush data in all of the CFs. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-07-05 15:29:04 -04:00
Glauber Costa	0c31f3e626	database: move memtable throttler to the LSA throttler The LSA infrastructure, through the use of its region groups, now has a throttler mechanism built-in. This patch converts the current throttlers so that the LSA throttler is used instead. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-07-05 15:05:19 -04:00
Avi Kivity	4cb7618601	Convert column_family::_sstables to sstable_set Using sstable_set will allow us to filter sstables during a query before actually creating a reader (this is left to the next patch; here we just convert the users of the _sstables field).	2016-07-03 10:32:27 +03:00
Avi Kivity	2a46410f4a	Change sstable_list from a map to a set sstable_list is now a map<generation, sstable>; change it to a set in preparation for replacing it with sstable_set. The change simplifies a lot of code; the only casualty is the code that computes the highest generation number.	2016-07-03 10:26:57 +03:00
Avi Kivity	9ac730dcc9	mutation_reader: make restricting_mutation_reader even more restricting While limiting the number of concurrently executing sstable readers reduces our memory load, the queued readers, although consuming a small amount of memory, can still grow without bounds. To limit the damage, add two limits on the queue: - a timeout, which is equal to the read timeout - a queue length limit, which is equal to 2% of the shard memory divided by an estimate of the queued request size (1kb) Together, these limits bound the amount of memory needed by queued disk requests in case the disk can't keep up. Message-Id: <1467206055-30769-1-git-send-email-avi@scylladb.com>	2016-06-29 15:17:35 +02:00
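The queue length limit in 9ac730dcc9 is simple arithmetic. A minimal Python sketch follows, assuming the "1kb" estimate means 1024 bytes; the function name `max_queued_readers` and its parameters are hypothetical and do not come from the ScyllaDB source.

```python
def max_queued_readers(shard_memory_bytes, queued_request_bytes=1024):
    """Queue length limit: 2% of shard memory / estimated queued request size.

    Integer arithmetic is used so the result is exact for any input.
    """
    memory_budget = shard_memory_bytes * 2 // 100   # 2% of the shard's memory
    return memory_budget // queued_request_bytes


# A shard with 1 GiB of memory may budget about 20 MiB for queued readers.
one_gib_limit = max_queued_readers(1024 ** 3)
```

Under these assumptions the limit scales linearly with shard memory, so a shard with twice the memory admits roughly twice as many queued readers.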
Avi Kivity	edeef03b34	db: restrict replica read concurrency Since reading mutations can consume a large amount of memory, which, moreover, is not predictable at the time the read is initiated, restrict the number of reads to 100 per shard. This is more than enough to saturate the disk, and hopefully enough to prevent allocation failures. Restriction is applied in column_family::make_sstable_reader(), which is called either on a cache miss or if the cache is disabled. This allows cached reads to proceed without restriction, since their memory usage is supposedly low. Reads from the system keyspace use a separate semaphore, to prevent user reads from blocking system reads. Perhaps we should select the semaphore based on the source of the read rather than the keyspace, but for now using the keyspace is sufficient.	2016-06-27 17:17:56 +03:00
Duarte Nunes	aacc7193f2	schema: Replace keyspace's schema_ptr on CF update This patch ensures we replace the schema_ptr held by its respective keyspace object when a column family is being updated. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <20160623085710.26168-1-duarte@scylladb.com>	2016-06-23 11:11:52 +02:00
Glauber Costa	e08fa7dafa	fix potential stale data in cache update We currently have a problem in update_cache that can be triggered by ordering issues related to memtable flush termination (not initiation) and/or update_cache() call duration. That issue is described in #1364, and in short, happens if a call to update_cache starts before an ongoing call finishes. There is now a new SSTable that should be consulted by the presence checker that is not. The partition checker operates on a stale list because we need to make sure the SSTable we just wrote is excluded from it. This patch changes the partition checker so that all SSTables currently in use are consulted, except for the one we have just flushed. That provides both the guarantee that we won't check our own SSTable and access to the most up-to-date SSTable list. Fixes #1364 Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <fa1cee672bba8e21725c6847353552791225295f.1466534499.git.glauber@scylladb.com>	2016-06-23 10:54:44 +02:00
Nadav Har'El	3372052d48	Rewriting shared sstables only after all shards loaded sstables After commit `faa4581`, each shard only starts splitting its shared sstables after opening all sstables. This was important because compaction needs to be aware of all sstables. However, another bug remained: If one shard finishes loading its sstables and starts the splitting compactions, and in parallel a different shard is still opening sstables - the second shard might find a half-written sstable being written by the first shard, and abort on a malformed sstable. So in this patch we start the shared sstable rewrites - on all shards - only after all shards finished loading their sstables. Doing this is easy, because main.cc already contains a list of sequential steps where each uses invoke_on_all() to make sure the step completes on all shards before continuing to the next step. Fixes #1371 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <1466426641-3972-1-git-send-email-nyh@scylladb.com>	2016-06-20 16:25:24 +03:00
Nadav Har'El	faa45812b2	Rewrite shared sstables only after entire CF is read Starting in commit `721f7d1d4f`, we start "rewriting" a shared sstable (i.e., splitting it into individual shards) as soon as it is loaded in each shard. However as discovered in issue #1366, this is too soon: Our compaction process relies in several places on compaction only being done after all the sstables of the same CF have been loaded. One example is that we need to know the content of the other sstables to decide which tombstones we can expire (this is issue #1366). Another example is that we use the last generation number we are aware of to decide the number of the next compaction output - and this is wrong before we saw all sstables. So with this patch, while loading sstables we only make a list of shared sstables which need to be rewritten - and the actual rewrite is only started when we finish reading all the sstables for this CF. We need to do this in two cases: reboot (when we load all the existing sstables we find on disk), and nodetool refresh (when we import a set of new sstables). Fixes #1366. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <1466344078-31290-1-git-send-email-nyh@scylladb.com>	2016-06-19 16:50:51 +03:00
Avi Kivity	465c0a4ead	Merge "Make stronger guarantees in row_cache's clear/invalidate" from Tomasz "Correctness of current uses of clear() and invalidate() relies on fact that cache is not populated using readers created before invalidation. Sstables are first modified and then cache is invalidated. This is not guaranteed by current implementation though. As pointed out by Avi, a populating read may race with the call to clear(). If that read started before clear() and completed after it, the cache may be populated with data which does not correspond to the new sstable set. To provide such guarantee, invalidate() variants were adjusted to synchronize using _populate_phaser, similarly like row_cache::update() does. Fixes #1291."	2016-06-13 09:55:29 +03:00
Nadav Har'El	721f7d1d4f	Rewrite shared sstables soon after startup Several shards may share the same sstable - e.g., when re-starting scylla with a different number of shards, or when importing sstables from an external source. Sharing an sstable is fine, but it can result in excessive disk space use because the shared sstable cannot be deleted until all the shards using it have finished compacting it. Normally, we have no idea when the shards will decide to compact these sstables - e.g., with size-tiered compaction a large sstable will take a long time until we decide to compact it. So what this patch does is to initiate compaction of the shared sstables - on each shard using it - so that as soon as possible after the restart, the original sstable is split into separate sstables per shard, and the original sstable can be deleted. If several sstables are shared, we serialize this compaction process so that each shard only rewrites one sstable at a time. Regular compactions may happen in parallel, but they will not be able to choose any of the shared sstables because those are already marked as being compacted. Commit `3f2286d0` increased the need for this patch, because since that commit, if we don't delete the shared sstable, we also cannot delete additional sstables which the different shards compacted with it. For one scylla user, this resulted in so much excessive disk space use, that it literally filled the whole disk. After this patch, commit `3f2286d0`, or the discussion in issue #1318 on how to improve it, is no longer necessary, because we will never compact a shared sstable together with any other sstable - as explained above, the shared sstables are marked as "being compacted" so the regular compactions will avoid them. Fixes #1314. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <1465406235-15378-1-git-send-email-nyh@scylladb.com> Reviewed-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-06-08 15:44:29 -04:00
Raphael S. Carvalho	1b8e170254	compaction: retry compaction until strategy is satisfied Previously, we were using a stat to decide if compaction should be retried, but that's not efficient. The information is also lost after node is restarted. After these changes, compaction will be retried until strategy is satisfied, i.e. there is nothing to compact. We will now be doing the following in a loop: Get compaction job from compaction strategy. If cannot run, finish the loop. Otherwise, compact this column family. Go back to start of the loop. By the way, pending_compactions stat will be deprecated after this commit. Previously, it was increased to indicate the want for compaction and decreased when compaction finished. Now, we can compact more than we asked for, so it would be decreased below 0. Also, it's the strategy that will tell the want for compaction. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <899df0d8d807f6b5d9bb8600d7c63b4e260cc282.1465398243.git.raphaelsc@scylladb.com>	2016-06-08 11:31:56 -04:00
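The retry loop listed in 1b8e170254 can be sketched as follows. This is a minimal Python illustration, not the ScyllaDB implementation: `run_compactions` and `ToyStrategy` are hypothetical, and the toy strategy simply merges a fixed number of sstables into one whenever enough of them exist.

```python
def run_compactions(strategy):
    """Compact until the strategy has nothing left to propose."""
    completed = 0
    while True:
        job = strategy.get_job()   # get compaction job from the strategy
        if job is None:            # cannot run: finish the loop
            break
        strategy.compact(job)      # otherwise, compact this column family
        completed += 1             # and go back to the start of the loop
    return completed


class ToyStrategy:
    """Hypothetical strategy: merge `threshold` sstables into one."""

    def __init__(self, sstable_count, threshold=4):
        self.sstable_count = sstable_count
        self.threshold = threshold

    def get_job(self):
        if self.sstable_count < self.threshold:
            return None
        return self.threshold      # number of sstables to merge

    def compact(self, job):
        self.sstable_count -= job - 1   # `job` inputs become one output
```

Because the loop asks the strategy again after every compaction, no separate counter of wanted compactions is needed, which matches the commit's point that the strategy decides when there is nothing left to do.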
Tomasz Grabiec	170a214628	row_cache: Make stronger guarantees in clear/invalidate Correctness of current uses of clear() and invalidate() relies on fact that cache is not populated using readers created before invalidation. Sstables are first modified and then cache is invalidated. This is not guaranteed by current implementation though. As pointed out by Avi, a populating read may race with the call to clear(). If that read started before clear() and completed after it, the cache may be populated with data which does not correspond to the new sstable set. To provide such guarantee, invalidate() variants were adjusted to synchronize using _populate_phaser, similarly like row_cache::update() does.	2016-06-06 13:21:06 +02:00
Glauber Costa	0f64eb7e7d	serialize memtable flush for a memtable_list We can only free memory for a region_group when the entire memtable is released. This means that while the disk can handle requests from multiple memtables just fine, we won't free any memory until all of them finish. If we are under a pressure situation we will take a lot more time to leave it. Ideally, with write-behind, we would allow just one memtable to be flushed at a time. But since we don't have it enabled, it's better to serialize the flushes so that only some memtables (4) are flushed at a time. Having the memtable writer bandwidth all to itself, the memtable will finish sooner, release memory sooner, and recover the system's health sooner. We would like to do that without having streaming and memtables starve each other. Ideally, that should mean half the bandwidth for each - but that sacrifices memtable writes in the common case there is no streaming. Again, write behind will help here, and since this is something we intend to do, there is no need to complicate the code too much for an interim solution. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-05-31 17:18:35 -04:00
Glauber Costa	46c79be401	database: allow callers to specify memtable list's flush behavior This patch introduces an explicit behavior enum class - one of delayed or immediate, that allow callers to tell the memtable list whether they want a delayed flush (default), or force an immediate flush. So far this only affects the streaming code (memtables just ignore it), but the concept is one that can be easily generalized. With that in place, we can revert back the stop function to use the standard flush. I have argued before that adding infrastructure like that would not be worth it for the sake of stop alone, but some other code could now use it. Specifically, the active reclaimer for the throttler would like to force immediate flushes, as delayed flushes really won't make a lot of difference in reducing memory usage. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-05-31 17:17:48 -04:00


417 Commits