Commit Graph

567 Commits

Author SHA1 Message Date
Duarte Nunes
aacc7193f2 schema: Replace keyspace's schema_ptr on CF update
This patch ensures we replace the schema_ptr held by its respective
keyspace object when a column family is being updated.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20160623085710.26168-1-duarte@scylladb.com>
2016-06-23 11:11:52 +02:00
Glauber Costa
e08fa7dafa fix potential stale data in cache update
We currently have a problem in update_cache that can be triggered by ordering
issues related to memtable flush termination (not initiation) and/or
update_cache() call duration.

That issue is described in #1364 and, in short, happens if a call to
update_cache starts before an ongoing call finishes. There is then a new SSTable
that should be consulted by the presence checker but is not.

The presence checker operates on a stale list because we need to make sure the
SSTable we just wrote is excluded from it.  This patch changes the presence
checker so that all SSTables currently in use are consulted, except for the one
we have just flushed. That provides both the guarantee that we won't check our
own SSTable and access to the most up-to-date SSTable list.
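
A minimal standalone sketch of the resulting rule (the names here are
illustrative, not Scylla's actual API): consult every SSTable currently in
use except the one we have just flushed.

    #include <algorithm>
    #include <iterator>
    #include <memory>
    #include <vector>

    struct sstable;
    using sstable_ptr = std::shared_ptr<sstable>;

    // Build the presence checker's set from the live SSTable list,
    // excluding only the SSTable just written from the flushed memtable.
    std::vector<sstable_ptr>
    presence_check_set(const std::vector<sstable_ptr>& live_sstables,
                       const sstable_ptr& just_flushed) {
        std::vector<sstable_ptr> out;
        std::copy_if(live_sstables.begin(), live_sstables.end(),
                     std::back_inserter(out),
                     [&](const sstable_ptr& s) { return s != just_flushed; });
        return out;
    }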

Fixes #1364

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <fa1cee672bba8e21725c6847353552791225295f.1466534499.git.glauber@scylladb.com>
2016-06-23 10:54:44 +02:00
Duarte Nunes
69798df95e query: Limit number of partitions returned
This is required to implement a thrift verb.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-06-22 09:48:13 +02:00
Paweł Dziepak
0828c88b25 mutation_partition: implement streaming-friendly data_query()
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-06-20 21:31:19 +01:00
Paweł Dziepak
b6f78a8e2f sstable: make sstable reads return streamed_mutation
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-06-20 21:29:50 +01:00
Paweł Dziepak
737eb73499 mutation_reader: make readers return streamed_mutations
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2016-06-20 21:29:50 +01:00
Nadav Har'El
3372052d48 Rewriting shared sstables only after all shards loaded sstables
After commit faa4581, each shard only starts splitting its shared sstables
after opening all sstables. This was important because compaction needs to
be aware of all sstables.

However, another bug remained: If one shard finishes loading its sstables
and starts the splitting compactions, and in parallel a different shard is
still opening sstables - the second shard might find a half-written sstable
being written by the first shard, and abort on a malformed sstable.

So in this patch we start the shared sstable rewrites - on all shards -
only after all shards have finished loading their sstables. Doing this is easy,
because main.cc already contains a list of sequential steps where each
uses invoke_on_all() to make sure the step completes on all shards before
continuing to the next step.
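
The pattern is roughly the following (a sketch in modern Seastar spellings;
load_sstables() and rewrite_shared_sstables() are hypothetical method names,
not Scylla's actual ones):

    #include <seastar/core/distributed.hh>
    #include <seastar/core/future.hh>

    struct database {
        seastar::future<> load_sstables();           // hypothetical
        seastar::future<> rewrite_shared_sstables(); // hypothetical
    };

    seastar::future<> startup(seastar::distributed<database>& db) {
        // Step 1: every shard opens its sstables. invoke_on_all()
        // resolves only once the step has completed on all shards.
        return db.invoke_on_all([] (database& local) {
            return local.load_sstables();
        }).then([&db] {
            // Step 2: only now do the shards start splitting their shared
            // sstables, so none can observe a half-written sstable.
            return db.invoke_on_all([] (database& local) {
                return local.rewrite_shared_sstables();
            });
        });
    }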

Fixes #1371

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1466426641-3972-1-git-send-email-nyh@scylladb.com>
2016-06-20 16:25:24 +03:00
Nadav Har'El
faa45812b2 Rewrite shared sstables only after entire CF is read
Starting in commit 721f7d1d4f, we start "rewriting" a shared sstable (i.e.,
splitting it into individual shards) as soon as it is loaded in each shard.

However, as discovered in issue #1366, this is too soon: our compaction
process relies in several places on compaction only being done after all
the sstables of the same CF have been loaded. One example is that we
need to know the content of the other sstables to decide which tombstones
we can expire (this is issue #1366). Another example is that we use the
last generation number we are aware of to decide the number of the next
compaction output - and this is wrong before we have seen all sstables.

So with this patch, while loading sstables we only make a list of shared
sstables which need to be rewritten - and the actual rewrite is only started
when we finish reading all the sstables for this CF. We need to do this in
two cases: reboot (when we load all the existing sstables we find on disk),
and nodetool refresh (when we import a set of new sstables).
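
An illustrative sketch of the deferral (hypothetical types and helpers,
not Scylla's actual code): during load we merely collect the shared
sstables; the rewrite starts once the whole CF has been read.

    #include <memory>
    #include <utility>
    #include <vector>

    struct sstable;
    using sstable_ptr = std::shared_ptr<sstable>;

    bool is_shared(const sstable_ptr&);                // stand-in helper
    void rewrite_on_owning_shards(const sstable_ptr&); // stand-in helper

    struct cf_loader {
        std::vector<sstable_ptr> _shared;  // rewrite candidates, deferred

        void on_sstable_loaded(sstable_ptr sst) {
            if (is_shared(sst)) {
                _shared.push_back(std::move(sst)); // don't rewrite yet
            }
        }

        void on_load_complete() {
            // All sstables (and generation numbers) are known here, so
            // tombstone-expiry and output-generation decisions are safe.
            for (auto& sst : _shared) {
                rewrite_on_owning_shards(sst);
            }
        }
    };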

Fixes #1366.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1466344078-31290-1-git-send-email-nyh@scylladb.com>
2016-06-19 16:50:51 +03:00
Raphael S. Carvalho
0b2cd41daf database: remember sstable level when cleaning it up
The cleanup operation wasn't preserving the level of sstables. That has
a bad impact on performance because compaction work is lost.

Fixes #1317.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <35ce8fbbb4590725bb0414e6a5450fcbe6cb7212.1465843387.git.raphaelsc@scylladb.com>
2016-06-14 08:06:00 +03:00
Duarte Nunes
c896309383 database: Actually decrease query_state limit
query_state expects the current row limit to be updated so it
can be enforced across partition ranges. A regression introduced
in e4e8acc946 prevented that from
happening by passing a copy of the limit to querying_reader.

This patch fixes the issue by having column_family::query update
the limit as it processes partitions from the querying_reader.
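
A sketch of the fix (illustrative signatures): the row limit must be passed
by reference, so rows consumed in one partition range reduce what the next
range is allowed to return.

    #include <cstdint>

    void read_partition_range(uint32_t& remaining_rows); // stand-in reader

    void run_query(uint32_t row_limit, int n_ranges) {
        uint32_t remaining = row_limit;
        for (int i = 0; i < n_ranges && remaining > 0; ++i) {
            // The regression passed a copy here; each range then saw the
            // full original limit instead of what was actually left.
            read_partition_range(remaining);
        }
    }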

Fixes #1338

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1465804012-30535-1-git-send-email-duarte@scylladb.com>
2016-06-13 10:03:27 +02:00
Avi Kivity
465c0a4ead Merge "Make stronger guarantees in row_cache's clear/invalidate" from Tomasz
"Correctness of current uses of clear() and invalidate() relies on fact
that cache is not populated using readers created before
invalidation. Sstables are first modified and then cache is
invalidated. This is not guaranteed by current implementation
though. As pointed out by Avi, a populating read may race with the
call to clear(). If that read started before clear() and completed
after it, the cache may be populated with data which does not
correspond to the new sstable set.

To provide such guarantee, invalidate() variants were adjusted to
synchronize using _populate_phaser, similarly like row_cache::update()
does.

Fixes #1291."
2016-06-13 09:55:29 +03:00
Nadav Har'El
721f7d1d4f Rewrite shared sstables soon after startup
Several shards may share the same sstable - e.g., when re-starting scylla
with a different number of shards, or when importing sstables from an
external source. Sharing an sstable is fine, but it can result in excessive
disk space use because the shared sstable cannot be deleted until all
the shards using it have finished compacting it. Normally, we have no idea
when the shards will decide to compact these sstables - e.g., with size-
tiered-compaction a large sstable will take a long time until we decide
to compact it. So what this patch does is initiate compaction of the
shared sstables - on each shard using them - so that as soon as possible
after the restart, the original sstable is split into separate sstables
per shard and can be deleted. If several sstables are shared, we
serialize this compaction process so that each shard only rewrites one
sstable at a time. Regular compactions may happen in parallel, but they
will not be able to choose any of the shared sstables because those are
already marked as being compacted.

Commit 3f2286d0 increased the need for this patch, because since that
commit, if we don't delete the shared sstable, we also cannot delete
additional sstables which the different shards compacted with it. For one
scylla user, this resulted in so much excessive disk space use, that it
literally filled the whole disk.

After this patch, commit 3f2286d0 - or the discussion in issue #1318 on how
to improve it - is no longer necessary, because we will never compact a shared
sstable together with any other sstable - as explained above, the shared
sstables are marked as "being compacted" so the regular compactions will
avoid them.

Fixes #1314.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1465406235-15378-1-git-send-email-nyh@scylladb.com>
Reviewed-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-06-08 15:44:29 -04:00
Raphael S. Carvalho
1b8e170254 compaction: retry compaction until strategy is satisfied
Previously, we were using a stat to decide if compaction should be
retried, but that's not efficient. The information is also lost
after the node is restarted.

After these changes, compaction will be retried until the strategy is
satisfied, i.e. there is nothing to compact.
We will now be doing the following in a loop (sketched below):
1) Get a compaction job from the compaction strategy.
2) If it cannot run, finish the loop.
3) Otherwise, compact this column family and go back to step 1.
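
A minimal sketch of that loop (hypothetical names): keep asking the
strategy for work until it reports there is nothing left to compact.

    struct compaction_job {
        bool can_run;
        // input sstables, weight, ...
    };

    compaction_job get_job_from_strategy(); // stand-in for the strategy call
    void run_compaction(const compaction_job&);

    void compact_until_strategy_satisfied() {
        for (;;) {
            auto job = get_job_from_strategy();
            if (!job.can_run) {
                break;            // strategy satisfied: nothing to compact
            }
            run_compaction(job);  // may create new work, hence the loop
        }
    }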

By the way, the pending_compactions stat will be deprecated after this
commit. Previously, it was increased to indicate the need for
compaction and decreased when compaction finished. Now, we can
compact more than we asked for, so it would be decreased below 0.
Also, it's now the strategy that tells us whether compaction is needed.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <899df0d8d807f6b5d9bb8600d7c63b4e260cc282.1465398243.git.raphaelsc@scylladb.com>
2016-06-08 11:31:56 -04:00
Tomasz Grabiec
170a214628 row_cache: Make stronger guarantees in clear/invalidate
Correctness of the current uses of clear() and invalidate() relies on the
fact that the cache is not populated using readers created before
invalidation. Sstables are first modified and then the cache is
invalidated. This is not guaranteed by the current implementation,
though. As pointed out by Avi, a populating read may race with the
call to clear(). If that read started before clear() and completed
after it, the cache may be populated with data which does not
correspond to the new sstable set.

To provide such a guarantee, the invalidate() variants were adjusted to
synchronize using _populate_phaser, similarly to what row_cache::update()
does.
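
A rough sketch of the adjusted invalidate() (assuming a phased-barrier
API like Scylla's utils::phased_barrier; member and helper names are
illustrative, not verbatim):

    #include <seastar/core/future.hh>
    #include "utils/phased_barrier.hh" // assumed header for the phaser

    struct cache_sketch {
        utils::phased_barrier _populate_phaser;

        void clear_entries(); // stand-in for dropping cached partitions

        seastar::future<> invalidate() {
            // Advance the phase and wait for populating reads started
            // before this point (the previous phase) to drain.
            return _populate_phaser.advance_and_await().then([this] {
                // No pre-invalidation read can repopulate the cache now,
                // so clearing cannot race with a stale populating read.
                clear_entries();
            });
        }
    };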
2016-06-06 13:21:06 +02:00
Raphael S. Carvalho
3f4500cb71 db: compaction strategy changes via alter table must have immediate effect
At the moment, compaction strategy changes via ALTER TABLE have no effect until
node restart.

Tomek says: "Statements of the following form should have immediate effect:
ALTER TABLE t WITH compaction = { 'class' : 'LeveledCompactionStrategy' };"

Fixes #877.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <3b72c494f887643b82a272ef0a9995edb970382c.1464726828.git.raphaelsc@scylladb.com>
2016-06-02 16:59:50 +02:00
Pekka Enberg
d03f65d94e database: Don't use std::cbegin() and std::cend()
They're not supported by GCC 4.9.

Fixes #1305
Message-Id: <1464877984-27856-1-git-send-email-penberg@scylladb.com>
2016-06-02 16:57:24 +02:00
Avi Kivity
8dcbddc7ed Merge "Serialize memtable flushes" from Glauber
"One of the things we need to do as part of the throttle rework I am doing is to
serialize memtable flushes to some extent - that will guarantee that in case
we're throttling, the flushes finish earlier and release memory earlier,
compared to the case in which we just let all memtables flush freely and
simultaneously."
2016-06-01 18:31:18 +03:00
Pekka Enberg
3ca7fc2a8b database: Add sstable filename to thrown malformed_sstable_exceptions
2016-06-01 14:56:10 +03:00
Glauber Costa
0f64eb7e7d serialize memtable flush for a memtable_list
We can only free memory for a region_group when the entire memtable is released.
This means that while the disk can handle requests from multiple memtables just fine,
we won't free any memory until all of them finish. If we are under memory
pressure, it will take a lot longer to get out of it.

Ideally, with write-behind, we would allow just one memtable to be flushed at a
time. But since we don't have it enabled, it's better to serialize the flushes
so that only some memtables (4) are flushed at a time. With more of the
memtable writer's bandwidth to itself, each memtable will finish sooner,
release its memory sooner, and restore the system's health sooner.

We would like to do that without having streaming and memtables starve each
other. Ideally, that should mean half the bandwidth for each - but that
sacrifices memtable writes in the common case there is no streaming. Again,
write behind will help here, and since this is something we intend to do, there
is no need to complicate the code too much for an interim solution.
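
A sketch of the mechanism (the count of 4 mirrors the message above; the
names and the use of seastar::with_semaphore are illustrative, not the
actual memtable_list interface):

    #include <memory>
    #include <seastar/core/future.hh>
    #include <seastar/core/semaphore.hh>

    struct memtable;
    using memtable_ptr = std::shared_ptr<memtable>;

    seastar::future<> flush_to_sstable(memtable_ptr); // stand-in writer

    seastar::semaphore flush_slots{4}; // at most 4 concurrent flushes

    seastar::future<> seal_and_flush(memtable_ptr mt) {
        // Each flush holds one slot; with only 4 running at a time, each
        // finishes - and releases its memory - sooner under pressure.
        return seastar::with_semaphore(flush_slots, 1, [mt] {
            return flush_to_sstable(mt);
        });
    }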

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-05-31 17:18:35 -04:00
Glauber Costa
46c79be401 database: allow callers to specify memtable list's flush behavior
This patch introduces an explicit behavior enum class - one of delayed or
immediate - that allows callers to tell the memtable list whether they want a
delayed flush (the default) or to force an immediate flush. So far this only affects
the streaming code (memtables just ignore it), but the concept is one that can
be easily generalized.
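
A sketch of the concept (hypothetical member names): callers state the
flush behavior explicitly, and the default stays delayed.

    enum class flush_behavior { delayed, immediate };

    struct memtable_list_sketch {
        void seal_active_memtable(flush_behavior b = flush_behavior::delayed) {
            if (b == flush_behavior::immediate) {
                // force the flush right away (e.g. for the active
                // reclaimer, where a delayed flush barely helps)
            } else {
                // default path: the flush may be deferred
            }
        }
    };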

With that in place, we can revert the stop function back to using the standard
flush. I have argued before that adding infrastructure like that would not be
worth it for the sake of stop alone, but some other code could now use it.

Specifically, the active reclaimer for the throttler would like to force
immediate flushes, as delayed flushes really won't make a lot of difference in
reducing memory usage.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-05-31 17:17:48 -04:00
Glauber Costa
30d54cef38 database: add a comment explaining the choice of function in CF stop
We have recently committed a fix to a broken streaming bug that involved
reverting column_family::stop() back to calling the custom seal functions
explicitly for both memtables and streaming memtables.

We here add a comment to explain why that had to be done.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <fe94b5883e9c29adc7fc9ee9f498894c057e7b64.1464293167.git.glauber@scylladb.com>
2016-05-29 11:28:15 +03:00
Glauber Costa
46f60f52d9 database: do not use implicitly stated seal function when closing the CF
In commit 4981362f57, I have introduced a regression that was thankfully
caught by our dtest infrastructure.

That patch is a preparation patch for the active reclaim patchset that is to
come, and it consolidated all the flushes using the memtable_list's seal_fn
function instead of calling the seal function explicitly.

The problem here is that the streaming memtables have the delayed mechanism,
about which the memtable_list is unaware. Calling memtable_list's
seal_active_memtable() for the streaming memtables calls the delayed version,
which does not guarantee a flush. If we're lucky, we will indeed flush after the
timer expires, but if we're not, we'll just stop the CF with data not flushed.

There are two options to fix this: the first is to teach the memtable_list about
the delayed/forced mechanism, and the second is to just call the correct
function explicitly during shutdown, and then when the time comes to add
continuations to the result of the seal, add them here as well.

Although the second option involves a bit more work and duplication, I think it
is better in the sense that the delayed/forced mechanism really is something
that belongs to streaming only. With streaming being the only user, I don't think it
justifies complicating the memtable_list with this concept.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <b26017c825ccf585f39f58c4ab3787d78e551f5f.1464126884.git.glauber@scylladb.com>
2016-05-25 08:21:24 +03:00
Pekka Enberg
ceb29f9d32 Merge "Introduce upload dir for sstable migration" from Raphael
"This change is intended to make migration process safer and easier.
 All column families will now have a directory called upload.
 With this feature, users may choose to copy migrated sstables to the upload
 directory of the respective column families, and run 'nodetool refresh'.
 That's supposed to be the preferred option from now on."
2016-05-24 16:36:47 +03:00
Avi Kivity
9637c2232c Merge "Move the JMX timer polling logic to Scylla" from Amnon
2016-05-24 13:07:52 +03:00
Raphael S. Carvalho
c2fa3b796d db: fix read consistency after refresh
If sstable loaded by refresh covers a row that is cached by the
column family, read query may fail to return consistent data.
What we should do is to clear cache for the column family being
loaded with new sstables.

Fixes #1212.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <a08c9885a5ceb0b2991e40337acf5b7679580a66.1464072720.git.raphaelsc@scylladb.com>
2016-05-24 12:11:41 +03:00
Raphael S. Carvalho
e5f0314afd db: introduce upload directory for sstable migration
This change is intended to make the migration process safer and easier.
All column families will now have a directory called upload.
With this feature, users may choose to copy migrated sstables to the upload
directory of the respective column families, and call 'nodetool refresh'.
That's supposed to be the preferred option from now on.

For each sstable in the upload directory, refresh will do the following:
1) Mutate the sstable level to 0.
2) Create hard links to its components in the column family dir, using
a new generation. We make this safe by creating a hard link to a temporary
TOC first (see the sketch below).
3) Remove all of its components from the upload directory.
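
A rough sketch of step 2's safety trick, written in modern C++ for brevity
(paths, component names and the generation prefix are illustrative): link
the TOC under a temporary name first, then rename it into place, so a
crash midway never leaves a new generation with a complete-looking TOC.

    #include <filesystem>
    #include <string>
    #include <vector>
    namespace fs = std::filesystem;

    void link_sstable(const fs::path& upload_dir, const fs::path& cf_dir,
                      const std::vector<std::string>& components,
                      const std::string& new_gen_prefix) {
        // 1. Hard-link the TOC as a temporary file first.
        fs::create_hard_link(upload_dir / "TOC.txt",
                             cf_dir / (new_gen_prefix + "TOC.txt.tmp"));
        // 2. Hard-link the remaining components under the new generation.
        for (const auto& c : components) {
            fs::create_hard_link(upload_dir / c,
                                 cf_dir / (new_gen_prefix + c));
        }
        // 3. Commit: rename the temporary TOC into place.
        fs::rename(cf_dir / (new_gen_prefix + "TOC.txt.tmp"),
                   cf_dir / (new_gen_prefix + "TOC.txt"));
    }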

This new code runs after refresh has checked for new sstables in the column
family directory; otherwise, we could have a generation conflict.
Unlike the first step, this new step runs with sstable write enabled.
It's easier here because we know exactly which sstables are new.

After that, refresh will load new sstables found in column family
and upload directories.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-05-20 17:26:21 -03:00
Raphael S. Carvalho
74c8a87777 sstables: fix statistics rewrite
It's not working because it tries to overwrite the existing statistics
file with the exclusive flag.
It's fixed by writing the new statistics into a temporary file and
renaming it into place.

If Scylla fails in the middle of a rewrite, a temporary file is left
over, so the boot code was adjusted to delete temporary files created
by this rewrite procedure.
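
A generic sketch of the write-then-rename pattern (file naming is
illustrative): rename() atomically replaces the target on POSIX
filesystems, so the statistics file is never half-written in place.

    #include <cstdio>
    #include <string>

    bool rewrite_statistics(const std::string& path, const std::string& data) {
        std::string tmp = path + ".tmp";
        FILE* f = std::fopen(tmp.c_str(), "wb");
        if (!f) {
            return false;
        }
        std::fwrite(data.data(), 1, data.size(), f);
        std::fclose(f);
        // A crash before this point leaves only "*.tmp" debris, which
        // the boot code was adjusted to delete.
        return std::rename(tmp.c_str(), path.c_str()) == 0;
    }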

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-05-20 17:24:15 -03:00
Raphael S. Carvalho
ee0f66eef6 db: fix migration of sstables with level greater than 0
Refresh will rewrite the statistics of any migrated sstable with level
> 0. However, this operation is currently not working because the O_EXCL
flag is used, meaning that the create will fail.

It turns out that we don't actually need to change the on-disk level of
an sstable by overwriting the statistics file.
We can simply set the in-memory level of the sstable to 0. If Scylla reboots
before all migrated sstables are compacted, the leveled strategy is smart
enough to detect sstables that overlap, and set their in-memory level
to 0.

Fixes #1124.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2016-05-17 11:08:08 -03:00
Amnon Heiman
750f30cf07 column_family: Change histogram to
timed_rate_moving_average_and_histogram

As part of moving the derived statistics into scylla, this replaces the
histogram object in the column_family with
timed_rate_moving_average_and_histogram.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2016-05-17 11:53:15 +03:00
Glauber Costa
17b9203719 database: invert order of elements
So that the sizes of the region can be initialized first

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <dc3df186a977b492d83c0a397f206c2db940aa37.1463448522.git.glauber@scylladb.com>
2016-05-17 11:28:39 +03:00
Glauber Costa
2ff6d38d0c database: use a single constructor for the column family
We've been keeping two constructors for the column family to allow for a
version without the commitlog. But it's by now quite complicated to maintain
the two, because changes always have to be made in two places.

This patch adds a private constructor that does the actual construction, and
has the public constructors call it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <dd3cb0b9c20ad154a6131bad6ece619f70ed5025.1463448522.git.glauber@scylladb.com>
2016-05-17 11:28:39 +03:00
Glauber Costa
8fede5b98e memtables: isolate logic for disk writes disabled
When we have disk writes disabled, we exit immediately from the flush
function. We can just encode that separately and pass a different function
in the memtable_list creation. That simplifies the memtable flush a bit.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <908e3b5eb2c6ee84b8ad7b31c3673be5531a087c.1463448522.git.glauber@scylladb.com>
2016-05-17 11:28:38 +03:00
Glauber Costa
4981362f57 memtables: always seal through memtable_list seal function
I would like to be able to apply a function at the end of every flush that is
common to both memtables and streaming memtables - for instance, to unthrottle
current waiters. Right now some calls to seal_active_memtable are open-coded,
calling the column family's function directly, for both the main memtable list
and the streaming list.

This patch moves all the current open code callers to call the respective
memtable_list function.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <0c780254f3c4eb03e2bcd856b83941cf49a84b85.1463448522.git.glauber@scylladb.com>
2016-05-17 11:28:37 +03:00
Piotr Jastrzebski
dcba6f5c45 Pass clustering_row_ranges to mutation readers.
This will allow readers to reduce the amount of data read.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-05-16 14:36:57 +02:00
Piotr Jastrzebski
23c23abe53 Make memtable mutation_reader slice using clustering ranges.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-05-16 11:46:41 +02:00
Piotr Jastrzebski
484d2ecd0a Slice data with clustering key range in sstable reader
Add additional parameters to mp_row_consumer so that it can fetch
only cells for the given clustering key ranges.

This will be used in row_cache when it works on the clustering key
level instead of the partition key level.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2016-05-16 11:46:30 +02:00
Pekka Enberg
d93d46e721 Merge "ALTER KEYSPACE" from Calle
"Implementation of ALTER KEYSPACE.
Fixes #429"
2016-05-10 22:07:06 +03:00
Piotr Jastrzebski
240a185727 Stop scanning keyspace data directory when populating.
Iterate over column families and check/create directories for them
instead of scanning the keyspace data directory and filtering directories
against the column families that exist in the system tables for this keyspace.

Fixes #1008

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <26da66eec67a1ab1318917a66161915cdef924ab.1462890592.git.piotr@scylladb.com>
2016-05-10 17:35:55 +03:00
Calle Wilund
6ef7885ae3 database: Implement update_keyspace
Reloads keyspace metadata and replaces it in the existing keyspace.
Note: since keyspace metadata, and consequently the replication
strategy, now become volatile, keyspace::metadata now returns the
shared pointer by value (i.e. keep-alive).
The replication strategy should receive the same treatment, but
since it is extensively used yet never kept across a
continuation, I've just added a comment for now.
2016-05-10 14:31:30 +00:00
Avi Kivity
80302d98dd database: silence atomic deletion cancellation logs during compaction
Those logs are expected during shutdown.
2016-05-07 20:37:48 +03:00
Raphael S. Carvalho
5aeeb0b3e8 compaction: add support to parallel compaction on the same column family
It was noticed that small sstables will accumulate for a column family because
scylla was limited to two compactions per shard, and a column family could have
at most one compaction running on a given shard. With the number of sstables
increasing rapidly, read performance degrades.

At the moment, our compaction manager works by running two compaction task
handlers that run in parallel to the rest of the system. Each task handler
gets to run when needed, gets a column family from compaction manager queue,
runs compaction on it, and goes to sleep again. That's basically its cycle.
Compaction manager only allows one instance of a column family to be on its
queue, meaning that it's impossible for a column family to be compacted in
parallel. One compaction starts after another for a given column family.

To solve the problem described, we want to concurrently run compaction jobs
of a column family that have different "size tiers" (or "weights").
For those unfamiliar, a compaction job contains a list of sstables that will be
compacted together.
The "size tier" of a compaction job is the log of the total size of the input
sstables. So a compaction job only gets to run if its "size tier" is not the
same as that of an ongoing compaction. There is no point in compacting
concurrently at the same "size tier", because that slows down both compactions.

We will no longer queue column families in compaction manager. Instead, we
create a new fiber to run compaction on demand.
This fiber that runs asynchronously will do the following:
1) Get a compaction job from compaction strategy.
2) Calculate "size tier" of compaction job.
3) Run compaction job if its "size tier" is not the same of an ongoing
compaction for the given column family.
As before, it may decide to re-compact a column family based on a stat stored
in the column family object.

Ran all compaction-related dtests.

Fixes #1216.

Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <d30952ff136192a522bde4351926130addec8852.1462311908.git.raphaelsc@scylladb.com>
2016-05-04 11:46:09 +03:00
Calle Wilund
9130b0de16 database.cc: Fix compilation error with boost 1.55
Message-Id: <1461067254-526-1-git-send-email-calle@scylladb.com>
2016-04-25 12:54:43 +03:00
Pekka Enberg
f6da9bc92b Merge "Additional mutations/queries related collectd metrics" from Vlad
"This series introduces some additional metrics (mostly) in a storage_proxy and
a database level that are meant to create a better picture of how data flows
in the cluster.

First of all, where possible, counters of each category (e.g. total writes at the
storage_proxy level) are split into the following categories:
   - operations performed on a local Node
   - operations performed on remote Nodes aggregated per DC

At the storage_proxy level the following metrics have this "split"
nature (all on the sending side):
   - total writes (attempts/errors)
   - writes performed as a result of a Read Repair logic
   - total data reads (attempts/completed/errors)
   - total digest reads (attempts/completed/errors)
   - total mutations data reads (attempts/completed/errors)

In a batchlog_manager:
   - writes performed as a result of a batchlog replay logic

Thus, if for instance somebody wants to get an idea of how many writes
the current Node performs due to user-requested mutations only, he/she has
to take the counter of total writes and subtract the writes resulting from Read
Repairs and batchlog replays.

On the receiving side of the storage_proxy we add the following two counters:
   - total number of received mutations
   - total number of forwarded mutations (attempts/errors)

In order to get a better picture of what is going on on the local Node
we are adding two counters at the database level:
   - total number of writes
   - total number of reads

Comparing these to the total writes/reads in the storage_proxy may give a good
idea of whether there is excessive access to the local DB, for example."
2016-04-21 15:58:45 +03:00
Vlad Zolotarov
97e5bfa815 database: add metrics for total writes and reads
This patch adds a counter of total writes and reads
for each shard.

It seems that nothing ensures that all database queries are
done before the database object is destroyed.
Make _stats an lw_shared_ptr in order to ensure that the object is
alive when the lambda gets to increment it.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2016-04-21 11:28:53 +03:00
Duarte Nunes
c7b3a4b144 udt: Parse user types system table
This patch loads and parses the user types system table during
bootstrap.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2016-04-20 09:54:06 +02:00
Tomasz Grabiec
45527fcffa Merge branch 'glommer/issue-1144-v5'
From Glauber:

There are currently some outstanding issues with the throttling code. It's
easier to see them with the streaming code, but at least one of them is general.

One of them is related to situations in which the amount of memory available
leaves only one memtable fitting in memory. That would only happen with the
general code if we set the memtable cleanup threshold to 100% - and I don't
even know if that is valid - but will happen quite often with the streaming code.
If that happens, we'll start throttling when that memtable is being written,
but won't be able to put anything else in its place - leading to unnecessary
throttling.

The second, and more serious, happens when we start throttling and the amount
of available memory is not at least 1MB. This can deadlock the database in
the sense that it will prevent any request from continuing, and in turn causing
a flush due to memtable size. It is a good practice anyway to always guarantee
progress.

Fixes #1144
2016-04-18 12:20:13 +02:00
Glauber Costa
9c87ae3496 throttle: always release at least one request if we are below the limit
Our current throttling code releases one request per 1MB of available memory.
If we are below the memory limit, but not by 1MB or more, then
we will keep getting called to unthrottle, but never really release anything.

If another memtable is close to the flushing point, those requests may be
exactly the ones that would make it flush. Without them, we'll freeze the
database.

In general, we need to always release at least one request to make sure that
progress is always achieved.
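
A sketch of the rule (illustrative): per unthrottle round, release one
waiter per 1MB of headroom below the limit, but never fewer than one, so
progress is guaranteed whenever we are below the limit.

    #include <algorithm>
    #include <cstddef>

    size_t requests_to_release(size_t limit, size_t used) {
        if (used >= limit) {
            return 0;  // still at/over the limit: keep throttling
        }
        size_t headroom_mb = (limit - used) >> 20;
        return std::max<size_t>(1, headroom_mb);
    }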

This fixes #1144

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-04-14 13:13:15 -04:00
Glauber Costa
2c5dfe08c1 memtable_list: make sure at least two memtables are available
This is usually not a problem for the main memtable list - although it can be,
depending on settings - but it shows up easily for the streaming memtables list.

We would like to have at least two memtables, even if we have to cut them short.
If we don't do that, one memtable will use all available memory and we'll
force throttling until that memtable gets totally flushed.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-04-14 12:12:50 -04:00
Glauber Costa
1daede7396 unnest throttle_state
throttle_state is currently a nested member of database, but there is no
particular reason - aside from the fact that it is currently only ever
referenced by the database - for us to keep it nested.

We'll soon want to have some interaction between this and the column family, to
allow us to flush during throttle. To make that easier, let's unnest it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-04-14 12:12:50 -04:00
Glauber Costa
39def369ce move information about memtables' region group inside memtable list
This is a preparation patch so we can move the throttling infrastructure inside
the memtable_list. To do that, the region group will have to be passed to the
throttler so let's just go ahead and store it.

As a consequence, all the CF has to tell us is the current
schema - no longer how to create a new memtable.

Also, with a new parameter to be passed to the memtable_list, the creation code
gets quite big and hard to follow, so let's move the creation functions into a
helper.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-04-14 12:12:50 -04:00