scylladb

Author	SHA1	Message	Date
Nadav Har'El	721f7d1d4f	Rewrite shared sstables soon after startup Several shards may share the same sstable - e.g., when re-starting scylla with a different number of shards, or when importing sstables from an external source. Sharing an sstable is fine, but it can result in excessive disk space use because the shared sstable cannot be deleted until all the shards using it have finished compacting it. Normally, we have no idea when the shards will decide to compact these sstables - e.g., with size-tiered compaction a large sstable will take a long time until we decide to compact it. So what this patch does is to initiate compaction of the shared sstables - on each shard using it - so that as soon as possible after the restart, the original sstable is split into separate sstables per shard, and the original sstable can be deleted. If several sstables are shared, we serialize this compaction process so that each shard only rewrites one sstable at a time. Regular compactions may happen in parallel, but they will not be able to choose any of the shared sstables because those are already marked as being compacted. Commit `3f2286d0` increased the need for this patch, because since that commit, if we don't delete the shared sstable, we also cannot delete additional sstables which the different shards compacted with it. For one scylla user, this resulted in so much excessive disk space use, that it literally filled the whole disk. After this patch, commit `3f2286d0`, or the discussion in issue #1318 on how to improve it, is no longer necessary, because we will never compact a shared sstable together with any other sstable - as explained above, the shared sstables are marked as "being compacted" so the regular compactions will avoid them. Fixes #1314. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <1465406235-15378-1-git-send-email-nyh@scylladb.com> Reviewed-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-06-08 15:44:29 -04:00
Raphael S. Carvalho	1b8e170254	compaction: retry compaction until strategy is satisfied Previously, we were using a stat to decide if compaction should be retried, but that's not efficient. The information is also lost after node is restarted. After these changes, compaction will be retried until strategy is satisfied, i.e. there is nothing to compact. We will now be doing the following in a loop: Get compaction job from compaction strategy. If cannot run, finish the loop. Otherwise, compact this column family. Go back to start of the loop. By the way, pending_compactions stat will be deprecated after this commit. Previously, it was increased to indicate the want for compaction and decreased when compaction finished. Now, we can compact more than we asked for, so it would be decreased below 0. Also, it's the strategy that will tell the want for compaction. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <899df0d8d807f6b5d9bb8600d7c63b4e260cc282.1465398243.git.raphaelsc@scylladb.com>	2016-06-08 11:31:56 -04:00
Raphael S. Carvalho	3f4500cb71	db: compaction strategy changes via alter table must have immediate effect At the moment, compaction strategy changes via ALTER TABLE have no effect until node restart. Tomek says: "Statements of the following form should have immediate effect: ALTER TABLE t WITH compaction = { 'class' : 'LeveledCompactionStrategy' };" Fixes #877. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <3b72c494f887643b82a272ef0a9995edb970382c.1464726828.git.raphaelsc@scylladb.com>	2016-06-02 16:59:50 +02:00
Pekka Enberg	d03f65d94e	database: Don't use std::cbegin() and std::cend() They're not supported by GCC 4.9. Fixes #1305 Message-Id: <1464877984-27856-1-git-send-email-penberg@scylladb.com>	2016-06-02 16:57:24 +02:00
Avi Kivity	8dcbddc7ed	Merge "Serialize memtable flushes" from Glauber "One of the things we need to do as part of the throttle rework I am doing is to serialize memtable flushes to some extent - that will guarantee that in case we're throttling, the flushes finish earlier and release memory earlier, if compared to the case in which we just let all tables flush freely and simultaneously."	2016-06-01 18:31:18 +03:00
Pekka Enberg	3ca7fc2a8b	database: Add sstable filename to thrown malformed_sstable_exceptions	2016-06-01 14:56:10 +03:00
Glauber Costa	0f64eb7e7d	serialize memtable flush for a memtable_list We can only free memory for a region_group when the entire memtable is released. This means that while the disk can handle requests from multiple memtables just fine, we won't free any memory until all of them finish. If we are under a pressure situation we will take a lot more time to leave it. Ideally, with write-behind, we would allow just one memtable to be flushed at a time. But since we don't have it enabled, it's better to serialize the flushes so that only some memtables (4) are flushed at a time. Having the memtable writer bandwidth all to itself, the memtable will finish sooner, release memory sooner, and recover the system's health sooner. We would like to do that without having streaming and memtables starve each other. Ideally, that should mean half the bandwidth for each - but that sacrifices memtable writes in the common case there is no streaming. Again, write behind will help here, and since this is something we intend to do, there is no need to complicate the code too much for an interim solution. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-05-31 17:18:35 -04:00
Glauber Costa	46c79be401	database: allow callers to specify memtable list's flush behavior This patch introduces an explicit behavior enum class - one of delayed or immediate - that allows callers to tell the memtable list whether they want a delayed flush (default), or force an immediate flush. So far this only affects the streaming code (memtables just ignore it), but the concept is one that can be easily generalized. With that in place, we can revert back the stop function to use the standard flush. I have argued before that adding infrastructure like that would not be worth it for the sake of stop alone, but some other code could now use it. Specifically, the active reclaimer for the throttler would like to force immediate flushes, as delayed flushes really won't make a lot of difference in reducing memory usage. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-05-31 17:17:48 -04:00
Glauber Costa	30d54cef38	database: add a comment explaining the choice of function in CF stop We have recently committed a fix to a broken streaming bug that involved reverting column_family::stop() back to calling the custom seal functions explicitly for both memtables and streaming memtables. We here add a comment to explain why that had to be done. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <fe94b5883e9c29adc7fc9ee9f498894c057e7b64.1464293167.git.glauber@scylladb.com>	2016-05-29 11:28:15 +03:00
Glauber Costa	46f60f52d9	database: do not use implicitly stated seal function when closing the CF In commit `4981362f57`, I have introduced a regression that was thankfully caught by our dtest infrastructure. That patch is a preparation patch for the active reclaim patchset that is to come, and it consolidated all the flushes using the memtable_list's seal_fn function instead of calling the seal function explicitly. The problem here is that the streaming memtables have the delayed mechanism, about which the memtable_list is unaware. Calling memtable_list's seal_active_memtable() for the streaming memtables calls the delayed version, that does not guarantee flush. If we're lucky, we will indeed flush after the timer expires, but if we're not we'll just stop the CF with data not flushed. There are two options to fix this: the first is to teach the memtable_list about the delayed/forced mechanism, and the second is to just call the correct function explicitly during shutdown, and then when the time comes to add continuations to the result of the seal, add them here as well. Although the second option involves a bit more work and duplication, I think it is better in the sense that the delayed / forced mechanism really is something that belongs to streaming only. Since this is the only user, I don't think it justifies complicating the memtable_list with this concept. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <b26017c825ccf585f39f58c4ab3787d78e551f5f.1464126884.git.glauber@scylladb.com>	2016-05-25 08:21:24 +03:00
Pekka Enberg	ceb29f9d32	Merge "Introduce upload dir for sstable migration" from Raphael "This change is intended to make migration process safer and easier. All column families will now have a directory called upload. With this feature, users may choose to copy migrated sstables to upload directory of respective column families, and run 'nodetool refresh'. That's supposed to be the preferred option from now on."	2016-05-24 16:36:47 +03:00
Avi Kivity	9637c2232c	Merge "Move the JMX timer polling logic to Scylla" from Amnon	2016-05-24 13:07:52 +03:00
Raphael S. Carvalho	c2fa3b796d	db: fix read consistency after refresh If sstable loaded by refresh covers a row that is cached by the column family, read query may fail to return consistent data. What we should do is to clear cache for the column family being loaded with new sstables. Fixes #1212. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <a08c9885a5ceb0b2991e40337acf5b7679580a66.1464072720.git.raphaelsc@scylladb.com>	2016-05-24 12:11:41 +03:00
Raphael S. Carvalho	e5f0314afd	db: introduce upload directory for sstable migration This change is intended to make migration process safer and easier. All column families will now have a directory called upload. With this feature, users may choose to copy migrated sstables to upload directory of respective column families, and call 'nodetool refresh'. That's supposed to be the preferred option from now on. For each sstable in upload directory, refresh will do the following: 1) Mutate sstable level to 0. 2) Create hard links to its components in column family dir, using a new generation. We make it safe by creating a hard link to temporary TOC first. 3) Remove all of its components in upload directory. This new code runs after refresh checked for new sstables in the column family directory. Otherwise, we could have a generation conflict. Unlike the first step, this new step runs with sstable write enabled. It's easier here because we know exactly which sstables are new. After that, refresh will load new sstables found in column family and upload directories. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-05-20 17:26:21 -03:00
Raphael S. Carvalho	74c8a87777	sstables: fix statistics rewrite It's not working because it tries to overwrite existing statistics file with exclusive flag. It's fixed by writing new statistics into temporary file and renaming it into place. If Scylla failed in middle of rewrite, a temporary file is left over. So boot code was adjusted to delete a temporary file created by this rewrite procedure. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-05-20 17:24:15 -03:00
Raphael S. Carvalho	ee0f66eef6	db: fix migration of sstables with level greater than 0 Refresh will rewrite statistics of any migrated sstable with level > 0. However, this operation is currently not working because O_EXCL flag is used, meaning that create will fail. It turns out that we don't actually need to change on-disk level of a sstable by overwriting statistics file. We can only set in-memory level of a sstable to 0. If Scylla reboots before all migrated sstables are compacted, leveled strategy is smart enough to detect sstables that overlap, and set their in-memory level to 0. Fixes #1124. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-05-17 11:08:08 -03:00
Amnon Heiman	750f30cf07	column_family: Change histogram to timed_rate_moving_average_and_histogram As part of moving the derived statistic in to scylla, this replaces the histogram object in the column_family to timed_rate_moving_average_and_histogram. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2016-05-17 11:53:15 +03:00
Glauber Costa	17b9203719	database: invert order of elements So that the sizes of the region can be initialized first Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <dc3df186a977b492d83c0a397f206c2db940aa37.1463448522.git.glauber@scylladb.com>	2016-05-17 11:28:39 +03:00
Glauber Costa	2ff6d38d0c	database: use a single constructor for the column family We've been keeping two constructors for the column family to allow for a version without the commitlog. But it's by now quite complicated to maintain the two, because changes always have to be made in two places. This patch adds a private constructor that does the actual construction, and have the public constructors to call it. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <dd3cb0b9c20ad154a6131bad6ece619f70ed5025.1463448522.git.glauber@scylladb.com>	2016-05-17 11:28:39 +03:00
Glauber Costa	8fede5b98e	memtables: isolate logic for disk writes disabled When we have disk writes disabled, we exit immediately from the flush function. We can just encode that separately and pass a different function in the memtable_list creation. That simplifies the memtable flush a bit. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <908e3b5eb2c6ee84b8ad7b31c3673be5531a087c.1463448522.git.glauber@scylladb.com>	2016-05-17 11:28:38 +03:00
Glauber Costa	4981362f57	memtables: always seal through memtable_list seal function I would like to be able to apply a function at the end of every flush, that is common for both memtables and streaming memtables. For instance, to unthrottle current waiters. Right now some calls to seal_active_memtable are open coded, calling the column family's function directly, for both the main memtable list and the streaming list. This patch moves all the current open code callers to call the respective memtable_list function. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <0c780254f3c4eb03e2bcd856b83941cf49a84b85.1463448522.git.glauber@scylladb.com>	2016-05-17 11:28:37 +03:00
Piotr Jastrzebski	dcba6f5c45	Pass clustering_row_ranges to mutation readers. This will allow readers to reduce the amount of data read. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-05-16 14:36:57 +02:00
Piotr Jastrzebski	23c23abe53	Make memtable mutation_reader slice using clustering ranges. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-05-16 11:46:41 +02:00
Piotr Jastrzebski	484d2ecd0a	Slice data with clustering key range in sstable reader Add additional parameters to mp_row_consumer to be able to fetch only cells for given clustering key ranges This will be used in row_cache when it will work on clustering key level instead of partition key level. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2016-05-16 11:46:30 +02:00
Pekka Enberg	d93d46e721	Merge "ALTER KEYSPACE" from Calle "Implementation of ALTER KEYSPACE. Fixes #429"	2016-05-10 22:07:06 +03:00
Piotr Jastrzebski	240a185727	Stop scanning keyspace data directory when populating. Iterate over column families and check/create directories for them instead of scanning keyspace data directory and filtering directories against column families that exist in system tables for this keyspace. Fixes #1008 Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <26da66eec67a1ab1318917a66161915cdef924ab.1462890592.git.piotr@scylladb.com>	2016-05-10 17:35:55 +03:00
Calle Wilund	6ef7885ae3	database: Implement update_keyspace Reloads keyspace metadata and replaces in existing keyspace. Note: since keyspace metadata, and consequently, replication strategy now becomes volatile, keyspace::metadata now returns shared pointer by value (i.e. keep-alive). Replication strategy should receive the same treatment, but since it is extensively used, but never kept across a continuation, I've just added a comment for now.	2016-05-10 14:31:30 +00:00
Avi Kivity	80302d98dd	database: silence atomic deletion cancellation logs during compaction Those logs are expected during shutdown.	2016-05-07 20:37:48 +03:00
Raphael S. Carvalho	5aeeb0b3e8	compaction: add support to parallel compaction on the same column family It was noticed that small sstables will accumulate for a column family because scylla was limited to two compactions per shard, and a column family could have at most one compaction running at a given shard. With the number of sstables increasing rapidly, read performance is degraded. At the moment, our compaction manager works by running two compaction task handlers that run in parallel to the rest of the system. Each task handler gets to run when needed, gets a column family from compaction manager queue, runs compaction on it, and goes to sleep again. That's basically its cycle. Compaction manager only allows one instance of a column family to be on its queue, meaning that it's impossible for a column family to be compacted in parallel. One compaction starts after another for a given column family. To solve the problem described, we want to concurrently run compaction jobs of a column family that have different "size tier" (or "weight"). For those unfamiliar, compaction job contains a list of sstables that will be compacted together. The "size tier" of a compaction job is the log of the total size of the input sstables. So a compaction job only gets to run if its "size tier" is not the same as that of an ongoing compaction. There is no point in compacting concurrently at the same "size tier", because that slows down both compactions. We will no longer queue column families in compaction manager. Instead, we create a new fiber to run compaction on demand. This fiber that runs asynchronously will do the following: 1) Get a compaction job from compaction strategy. 2) Calculate "size tier" of compaction job. 3) Run compaction job if its "size tier" is not the same as that of an ongoing compaction for the given column family. As before, it may decide to re-compact a column family based on a stat stored in column family object. Ran all compaction-related dtests. Fixes #1216. Reviewed-by: Nadav Har'El <nyh@scylladb.com> Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <d30952ff136192a522bde4351926130addec8852.1462311908.git.raphaelsc@scylladb.com>	2016-05-04 11:46:09 +03:00
Calle Wilund	9130b0de16	database.cc: Fix compilation error with boost 1.55 Message-Id: <1461067254-526-1-git-send-email-calle@scylladb.com>	2016-04-25 12:54:43 +03:00
Pekka Enberg	f6da9bc92b	Merge "Additional mutations/queries related collectd metrics" from Vlad "This series introduces some additional metrics (mostly) in a storage_proxy and a database level that are meant to create a better picture of how data flows in the cluster. First of all where possible counters of each category (e.g. total writes in the storage proxy level) are split into the following categories: - operations performed on a local Node - operations performed on remote Nodes aggregated per DC In a storage_proxy level there are the following metrics that have this "split" nature (all on a sending side): - total writes (attempts/errors) - writes performed as a result of a Read Repair logic - total data reads (attempts/completed/errors) - total digest reads (attempts/completed/errors) - total mutations data reads (attempts/completed/errors) In a batchlog_manager: - writes performed as a result of a batchlog replay logic Thereby if for instance somebody wants to get an idea of how many writes the current Node performs due to user requested mutations only he/she has to take a counter of total writes and subtract the writes resulted by Read Repairs and batchlog replays. On a receiving side of a storage_proxy we add the two following counters: - total number of received mutations - total number of forwarded mutations (attempts/errors) In order to get a better picture of what is going on on a local Node we are adding two counters on a database level: - total number of writes - total number of reads Comparing these to total writes/reads in a storage_proxy may give a good idea if there is an excessive access to a local DB for example."	2016-04-21 15:58:45 +03:00
Vlad Zolotarov	97e5bfa815	database: add metrics for total writes and reads This patch adds a counter of total writes and reads for each shard. It seems that nothing ensures that all database queries are ready before database object is destroyed. Make _stats lw_shared_ptr in order to ensure that the object is alive when lambda gets to incrementing it. Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2016-04-21 11:28:53 +03:00
Duarte Nunes	c7b3a4b144	udt: Parse user types system table This patch loads and parses the user types system table during bootstrap. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-04-20 09:54:06 +02:00
Tomasz Grabiec	45527fcffa	Merge branch 'glommer/issue-1144-v5' From Glauber: There are currently some outstanding issues with the throttling code. It's easier to see them with the streaming code, but at least one of them is general. One of them is related to situations in which the amount of memory available leaves only one memtable fitting in memory. That would only happen with the general code if we set the memtable cleanup threshold to 100 % - and I don't even know if it is valid - but will happen quite often with the streaming code. If that happens, we'll start throttling when that memtable is being written, but won't be able to put anything else in its place - leading to unnecessary throttling. The second, and more serious, happens when we start throttling and the amount of available memory is not at least 1MB. This can deadlock the database in the sense that it will prevent any request from continuing, and in turn causing a flush due to memtable size. It is a good practice anyway to always guarantee progress. Fixes #1144	2016-04-18 12:20:13 +02:00
Glauber Costa	9c87ae3496	throttle: always release at least one request if we are below the limit Our current throttling code releases one request per 1MB of memory available that we have. If we are below the memory limit, but not by 1MB or more, then we will keep getting to unthrottle, but never really do anything. If another memtable is close to the flushing point, those requests may be exactly the ones that would make it flush. Without them, we'll freeze the database. In general, we need to always release at least one request to make sure that progress is always achieved. This fixes #1144 Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-04-14 13:13:15 -04:00
Glauber Costa	2c5dfe08c1	memtable_list: make sure at least two memtables are available This is usually not a problem for the main memtable list - although it can be, depending on settings, but shows up easily for the streaming memtables list. We would like to have at least two memtables, even if we have to cut it short. If we don't do that, one memtable will use all available memory and we'll force throttling until the memtable gets totally flushed. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-04-14 12:12:50 -04:00
Glauber Costa	1daede7396	unnest throttle_state throttle_state is currently a nested member of database, but there is no particular reason - aside from the fact that it is currently only ever referenced by the database for us to do so. We'll soon want to have some interaction between this and the column family, to allow us to flush during throttle. To make that easier, let's unnest it. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-04-14 12:12:50 -04:00
Glauber Costa	39def369ce	move information about memtables' region group inside memtable list This is a preparation patch so we can move the throttling infrastructure inside the memtable_list. To do that, the region group will have to be passed to the throttler so let's just go ahead and store it. In consequence of that, all that the CF has to tell us is what is the current schema - no longer how to create a new memtable. Also, with a new parameter to be passed to the memtable_list the creation code gets quite big and hard to follow. So let's move the creation functions to a helper. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-04-14 12:12:50 -04:00
Avi Kivity	a843aea547	db: delete compacted sstables atomically If sstables A, B are compacted, A and B must be deleted atomically. Otherwise, if A has data that is covered by a tombstone in B, and that tombstone is deleted, and if B is deleted while A is not, then the data in A is resurrected. Fixes #1181.	2016-04-14 17:14:26 +03:00
Paweł Dziepak	2db70cf912	database: remove throw() specifiers Most of them are missing std::bad_alloc (which leads to aborts) and they force the compiler to add unnecessary runtime checks. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-04-11 23:52:13 +01:00
Glauber Costa	8453ff7788	make get_sstable_key_range an instance method Because just creating an SSTable object does not generate any I/O, get_sstable_key_range should be an instance method. The main advantage of doing that is that we won't have to read the summary twice. The way we're doing it currently, if it happens to be a shard-relevant table we'll call load() - which reads the summary again. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-04-08 17:14:29 -04:00
Avi Kivity	db03295c8a	Merge "Fix query digest mismatch" from Tomasz "Currently data query digest includes cells and tombstones which may have expired or be covered by higher-level tombstones. This causes digest mismatch between replicas if some elements are compacted on one of the nodes and not on others. This mismatch triggers read-repair which doesn't resolve because mutations received by mutation queries are not differing, they are compacted already. The fix adds compacting step before writing and digesting query results by reusing the algorithm used by mutation query. This is not the most optimal way to fix this. The compaction step could be folded with the query writing, there is redundancy in both steps. However such change carries more risk, and thus was postponed. perf_simple_query test (cassandra-stress-like partitions) shows regression from 83k to 77k (7%) ops/s. Fixes #1165."	2016-04-08 12:13:29 +03:00
Pekka Enberg	38a54df863	Fix pre-ScyllaDB copyright statements People keep tripping over the old copyrights and copy-pasting them to new files. Search and replace "Cloudius Systems" with "ScyllaDB". Message-Id: <1460013664-25966-1-git-send-email-penberg@scylladb.com>	2016-04-08 08:12:47 +03:00
Tomasz Grabiec	f15c380a4f	database: Compact mutations when executing data queries Currently data query digest includes cells and tombstones which may have expired or be covered by higher-level tombstones. This causes digest mismatch between replicas if some elements are compacted on one of the nodes and not on others. This mismatch triggers read-repair which doesn't resolve because mutations received by mutation queries are not differing, they are compacted already. The fix adds compacting step before writing and digesting query results by reusing the algorithm used by mutation query. This is not the most optimal way to fix this. The compaction step could be folded with the query writing, there is redundancy in both steps. However such change carries more risk, and thus was postponed. perf_simple_query test (cassandra-stress-like partitions) shows regression from 83k to 77k (7%) ops/s. Fixes #1165.	2016-04-07 19:56:58 +02:00
Calle Wilund	ff5df306e3	database: Use disk-marking delete function in discard_sstables Fixes #797 To make sure an inopportune crash after truncate does not leave sstables on disk to be considered live, and thus resurrect data, after a truncate, use delete function that renames the TOC file to make sure we've marked sstables as dead on disk when we finish this discard call. Message-Id: <1458575440-505-2-git-send-email-calle@scylladb.com>	2016-03-24 12:02:08 +02:00
Glauber Costa	34a9fc106f	database: keep streaming memtables in their own region group Theoretically, because we can have a lot of pending streaming memtables, we can have the database start throttling and incoming connections slowing down during streaming. Turns out this is actually a very easy condition to trigger. That is basically because the other side of the wire in this case is quite efficient in sending us work. This situation is alleviated a bit by reducing parallelism, but not only does it not go away completely, once we have the tools to start increasing parallelism again it will become commonplace. The solution for this is to limit the streaming memtables to a fraction of the total allowed dirty memory. Using the nesting capability built into the LSA regions, we will make the streaming region group a child of the main region group. With that, we can throttle streaming requests separately, while at the same time being able to control the total amount of dirty memory as well. Because of the property, it can still be the case that incoming requests will throttle earlier due to streaming - unless we allow for more dirty memory to be used during repairs - but at least that effect will be limited. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-03-23 09:40:47 -04:00
Glauber Costa	455d5a57d2	streaming memtables: coalesce incoming writes The repair process will potentially send ranges containing few mutations, definitely not enough to fill a memtable. It wants to know whether or not each of those ranges individually succeeded or failed, so we need a future for each. Small memtables being flushed are bad, and we would like to write bigger memtables so we can better utilize our disks. One of the ways to fix that is changing the repair itself to send more mutations in a single batch. But relying on that is a bad idea for two reasons: First, the goals of the SSTable writer and the repair sender are at odds. The SSTable writer wants to write as few SSTables as possible, while the repair sender wants to break down the range in pieces as small as it can and checksum them individually, so it doesn't have to send a lot of mutations for no reason. Second, even if the repair process wants to process larger ranges at once, some ranges themselves may be small. So while most ranges would be large, we would still have potentially some fairly small SSTables lying around. The best course of action in this case is to coalesce the incoming streams write-side. repair can now choose whatever strategy - small or big ranges - it wants, rest assured that the incoming memtables will be coalesced together. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-03-23 09:38:22 -04:00
Glauber Costa	5fa866223d	streaming: add incoming streaming mutations to a different sstable Keeping the mutations coming from the streaming process as mutations like any other have a number of advantages - and that's why we do it. However, this makes it impossible for Seastar's I/O scheduler to differentiate between incoming requests from clients, and those who are arriving from peers in the streaming process. As a result, if the streaming mutations consume a significant fraction of the total mutations, and we happen to be using the disk at its limits, we are in no position to provide any guarantees - defeating the whole purpose of the scheduler. To implement that, we'll keep a separate set of memtables that will contain only streaming mutations. We don't have to do it this way, but doing so makes life a lot easier. In particular, to write an SSTable, our API requires (because the filter requires), that a good estimate on the number of partitions is informed in advance. The partitions also need to be sorted. We could write mutations directly to disk, but the above conditions couldn't be met without significant effort. In particular, because mutations can be arriving from multiple peer nodes, we can't really sort them without keeping a staging area anyway. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-03-23 09:13:00 -04:00
Glauber Costa	78189de57f	database: make seal_on_overflow a method of the memtable_list Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-03-23 09:12:59 -04:00
Glauber Costa	635bb942b2	database: move add_memtable as a method of the memtable_list The column family still has to teach the memtable list how to allocate a new memtable, since it uses CF parameters to do so. After that, the memtable_list's constructor takes a seal and a create function and is complete. The copy constructor can now go, since there are no users left. The behavior of keeping a reference to the underlying memtables can also go, since we can now guarantee that nobody is keeping references to it (it is not even a shared pointer anymore). Individual memtables are, and users may be keeping references to them individually. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-03-23 09:12:59 -04:00
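
Illustration for commit 1b8e170254 ("compaction: retry compaction until strategy is satisfied"). The loop that message describes (get a job from the strategy, stop if there is none, otherwise compact and go back to the start) can be sketched roughly as below. This is a toy model in Python, not Scylla's C++ code: `compact_until_satisfied`, `ToyStrategy`, the threshold of 4 and the sstable names are all hypothetical and exist only for this example.

```python
from typing import Callable, List, Optional


def compact_until_satisfied(
    get_job: Callable[[], Optional[List[str]]],
    run_job: Callable[[List[str]], None],
) -> int:
    """Keep compacting until the strategy reports there is nothing to compact."""
    jobs_run = 0
    while True:
        job = get_job()      # ask the strategy for the next compaction job
        if not job:          # nothing to compact: the strategy is satisfied
            break
        run_job(job)         # compact, then go back to the start of the loop
        jobs_run += 1
    return jobs_run


class ToyStrategy:
    """Hypothetical strategy: wants a compaction whenever `threshold` sstables exist."""

    def __init__(self, sstables: List[str], threshold: int = 4) -> None:
        self.sstables = list(sstables)
        self.threshold = threshold
        self._merged = 0

    def get_job(self) -> Optional[List[str]]:
        if len(self.sstables) < self.threshold:
            return None
        return self.sstables[: self.threshold]

    def run_job(self, job: List[str]) -> None:
        # Replace the input sstables with a single merged output.
        self.sstables = [s for s in self.sstables if s not in job]
        self._merged += 1
        self.sstables.append(f"merged-{self._merged}")
```

Because the loop re-asks the strategy after every job, no separate "pending compactions" counter is needed in this sketch; the strategy alone decides when to stop.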

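Illustration for commit 9c87ae3496 ("throttle: always release at least one request if we are below the limit"). The message says the throttler releases one request per 1MB of available memory and should release at least one whenever usage is below the limit. A minimal sketch of that rule follows, assuming integer byte counts; the function name `requests_to_release` and its parameters are hypothetical and do not come from the Scylla source.

```python
MB = 1 << 20  # bytes in one megabyte


def requests_to_release(memory_limit: int, memory_used: int, waiting: int) -> int:
    """Return how many throttled requests to release right now."""
    if waiting == 0 or memory_used >= memory_limit:
        return 0  # nobody is waiting, or we are not below the limit
    available = memory_limit - memory_used
    per_mb = available // MB  # one request per full 1MB of free memory
    # Below the limit by less than 1MB gives per_mb == 0; release one
    # request anyway so that progress is always made.
    return min(waiting, max(1, per_mb))
```

Without the `max(1, ...)` floor, a system sitting just under its limit would release nothing on every wakeup, which is the freeze the commit message describes.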

555 Commits