scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-26 11:30:36 +00:00

Author	SHA1	Message	Date
Asias He	cdb43c5586	batchlog_manager: Allow user initiated batchlog replay operation During decommission, the storage_service::unbootstrap() needs to initiate a batchlog replay operation. To sync the replay operation initiated by the timer in batchlog_manager and storage_service, a semaphore is introduced. To simplify the semaphore locking, the management code now always runs on shard zero, but the real work is distributed to all shards.	2016-03-30 20:54:30 +08:00
Glauber Costa	23808ba184	sstables: fix exception printouts in check_marker As Nadav noticed in his bug report, check_marker is creating its error messages using characters instead of numbers - which is what we intended here in the first place. That happens because sprint(), when faced with an 8-bit type, interprets this as a character. To avoid that we'll use uint16_t types, taking care not to sign-extend them. The bug report also noted that one of the error messages is missing a parameter, and that is also fixed. Fixes #1122 Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <74f825bbff8488ffeb1911e626db51eed88629b1.1459266115.git.glauber@scylladb.com>	2016-03-29 19:23:28 +03:00
Takuya ASADA	c1277bacb4	dist/common/scripts: prevent misinterpreting blank input as '/dev/', show error when the entered device path is not found Fixes #1110 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1459267786-19123-1-git-send-email-syuu@scylladb.com>	2016-03-29 19:18:51 +03:00
Glauber Costa	d5c1366e85	compaction: be verbose about which table is causing an exception When we, for some reason, fail to compact an SSTable, we do not log the file name leaving us with cryptic messages that tell us what happened, but not where it happened. This patch adds logging in compaction so that we'll know what's going on. Please note that readers are more of a concern, because the SSTable being written technically does not exist yet. Still, better safe than sorry: if open_data fails, or we leave an unfinished SSTable, it is still good to know which one was the culprit. Some argument can be made about whether we should log this at the lower SSTable level, or at the compaction level. The reason I am logging this at the compaction level, is that we don't really know which exception will trigger, and where: it may be the case that we're seeing exceptions that are not SSTable specific, and may not have the chance to log it properly. In particular, if the exception happens inside the reader: read_rows() and friends only return a mutation reader, which doesn't really do anything until we call read(). But at that time, we don't hold any pointers to the SSTable anymore. In summary, logging at the compaction level guarantees that we always do it no matter what. Exceptions that are part of the main SSTable path can log the file name as well if they want: if that's the case, we'll be left with the name appearing twice. That's totally harmless, and better than none. Fixes #1123 Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <c5c969fb6aeb788a037bd7a4ea69979c1042cb34.1459263847.git.glauber@scylladb.com>	2016-03-29 18:15:56 +03:00
Glauber Costa	d536846433	commitlog: initialize sync period with actual sync period commitlog's sync period is initialized as the batch period, and not as the sync period itself as it should be. I've found this by code inspection, but unless I am missing something really fundamental, this seems to be completely wrong. It's been working fine because in our defaults, I have checked that both variables default to the same value. But it seems to me that as long as anyone would change one of them, the behavior wouldn't be as expected. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <2e7c565242fe5d4481a3ee8b0ba425ef14f5e42a.1459252783.git.glauber@scylladb.com>	2016-03-29 15:21:02 +03:00
Takuya ASADA	a5bb6c4b1b	dist/ubuntu: drop classical sysv init script, only support Upstart for Ubuntu 14.04LTS Sysv init script was added just to prevent a warning message from lintian, and was never really used by Ubuntu users. As a result, we often break this script since the upstart/systemd unit files change frequently. It may confuse users, so it's better to use Upstart only, just like Fedora/CentOS. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1459177601-20269-2-git-send-email-syuu@scylladb.com>	2016-03-29 11:48:18 +03:00
Takuya ASADA	42ce77a3b7	dist/redhat: prevent 'yum: command not found' on some Fedora environment On some Fedora environments such as Fedora official AMI, dnf-yum package is not installed by default, causes command not found error when we run our setup scripts. To prevent this, we need to add dnf-yum to scylla-server package dependency. Fixes #1106 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1459099744-23068-1-git-send-email-syuu@scylladb.com>	2016-03-29 11:29:09 +03:00
Avi Kivity	adffb1c061	dist/ubuntu: improve handling of bad command line options On a bad command line, Scylla will exit with an exit code of 2. Mark it as a "normal" exit, to prevent a respawn. Fixes #1087 Message-Id: <1458827221-12833-1-git-send-email-avi@scylladb.com>	2016-03-29 11:14:45 +03:00
Avi Kivity	c1d8fb56f7	dist/ubuntu: specify kill timeout Allow more time for commitlog flushing Message-Id: <1458827216-12778-1-git-send-email-avi@scylladb.com>	2016-03-29 11:14:27 +03:00
Raphael Carvalho	d515a7fd85	sstables: fix deletion of sstable with temporary TOC After `4e52b41a4`, remove_by_toc_name() became aware of temporary TOC files, however, it doesn't consider that some components may be missing if temporary TOC is present. When creating a new sstable, the first thing we do is to write all components into temporary TOC, so content of a temporary TOC isn't reliable until it is renamed. Solution is about implementing the following flow (described by Avi): "Flow should be: - remove all components in parallel - forgive ENOENT, since the component may not have been written; otherwise deletion error should be raised - fsync the directory - delete the temporary TOC " This problem can be reproduced by running compaction without disk space, so compaction would fail and leave a partial sstable that would be marked for deletion. Afterwards, remove_by_toc_name() would try to delete a component that doesn't exist because it looked at the content of temporary TOC. Fixes #1095. Signed-off-by: Raphael Carvalho <raphaelsc@scylladb.com> Message-Id: <0cfcaacb43cc5bad3a8a7ea6c1fa6f325c5de97d.1459194263.git.raphaelsc@scylladb.com>	2016-03-29 10:38:01 +03:00
Tomasz Grabiec	d1db23e353	storage_service: Fix typos Message-Id: <1458837390-26634-1-git-send-email-tgrabiec@scylladb.com>	2016-03-29 10:29:04 +03:00
Pekka Enberg	994390769f	Update scylla-ami submodule * dist/ami/files/scylla-ami 89e7436...7019088 (1): > Re-enable clocksource=tsc on AMI	2016-03-29 10:18:06 +03:00
Takuya ASADA	201b0c6ab3	dist: re-enable clocksource=tsc on AMI clocksource=tsc on boot parameter mistakenly dropped on `b3c85aea89`, need to re-enable. [ penberg: Manual backport of commit `050fb911d5` to 1.0. ] Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1459180643-4389-1-git-send-email-syuu@scylladb.com> (cherry picked from commit `80242ff443`)	2016-03-29 10:17:41 +03:00
Pekka Enberg	227daecba6	Revert "dist: move setup scripts to /usr/sbin" This reverts commit `989357189a` because it broke our Jenkins packaging jobs.	2016-03-29 10:17:05 +03:00
Pekka Enberg	d1ec97e76f	Revert "dist: re-enable clocksource=tsc on AMI" This reverts commit `050fb911d5` in preparation for reverting `989357189a`.	2016-03-29 10:16:48 +03:00
Takuya ASADA	050fb911d5	dist: re-enable clocksource=tsc on AMI clocksource=tsc on boot parameter mistakenly dropped on `b3c85aea89`, need to re-enable. Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1459180643-4389-1-git-send-email-syuu@scylladb.com>	2016-03-29 09:53:23 +03:00
Asias He	62d443a07d	streaming: Fix log of plan_id and session address in stream_session They got swapped. Fix it up. Spotted by looking at the log. Message-Id: <d163d71e9a96d1a45c3a4c529519790eeff7c486.1459172778.git.asias@scylladb.com>	2016-03-29 09:01:06 +03:00
Nadav Har'El	a05577ca41	sstable: fix read failure of certain sstables We had a problem reading certain existing Cassandra sstables into Scylla. Our consume_range_tombstone() function assumes that the start and end columns have certain "end of component" markers, and wants to verify that assumption. But because of bugs in older versions of Cassandra, see https://issues.apache.org/jira/browse/CASSANDRA-7593, sometimes the "end of component" was missing (set to 0). CASSANDRA-7593 suggested this problem might exist on the start column, so we allowed for that, but now we discovered a case where also the end column is set to 0 - causing the test in consume_range_tombstone() to fail and the sstable read to fail - causing Scylla to not be able to import that sstable from Cassandra. Allowing for a 0 also on the end column made it possible to read that sstable, compact it, and so on. Fixes #1125. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <1459173964-23242-1-git-send-email-nyh@scylladb.com>	2016-03-28 17:09:37 +03:00
Duarte Nunes	db881fdc8f	cql: Add support for pg-style string literal This patch adds support for pg-style string literals to the CQL grammar. Fixes #1078 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1459093238-2529-1-git-send-email-duarte@scylladb.com>	2016-03-28 17:06:03 +03:00
yan cui	e5d1c031ac	dist: add ubuntu docker file	2016-03-28 10:14:12 +03:00
Avi Kivity	a919113fdb	schema_tables: fix deadlock in cross-node communications Seastar wrongly limits the number of concurrent submit_to()s to a single remote shard. This can cause an ABBA deadlock: fiberA fiberB (x127) submit_to(0) # lock schema <- returns submit_to(0) # lock schema (waits) submit_to(0) # do work (waits) The fiberBs wait for fiberA, which in turn waits for a fiberB to return. While the correct fix is to remove the client-side limit and replace it with a server-side per-verb limit, we start with a simpler fix that replaces the blocking lock call with a non-blocking call, removing the deadlock. Fixes #1088. Message-Id: <1459095357-28950-1-git-send-email-avi@scylladb.com>	2016-03-28 10:12:10 +03:00
Raphael Carvalho	e6e5999282	Fix corner-case in refresh Problem found by dtest which loads sstables with generation 1 and 2 into an empty column family. The root of the problem is that reshuffle procedure changes new sstables to start from generation 2 at least. So reshuffle could try to set generation 1 to 2 when generation 2 exists. This problem can be fixed by starting from generation 1 instead, so reshuffle would handle this case properly. Fixes #1099. Signed-off-by: Raphael Carvalho <raphaelsc@scylladb.com> Message-Id: <88c51fbda9557a506ad99395aeb0a91cd550ede4.1458917237.git.raphaelsc@scylladb.com>	2016-03-27 10:03:32 +03:00
Avi Kivity	077c0d1022	dist: ami: fix AMI_OPT receiving no value We assign AMI=0 and AMI_OPT=1, so in the true case, AMI_OPT has no value, and a later compare fails.	2016-03-26 21:16:28 +03:00
Takuya ASADA	989357189a	dist: move setup scripts to /usr/sbin Since these scripts are user command, should be on $PATH. Fixes #1092 Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1458860407-25269-1-git-send-email-syuu@scylladb.com>	2016-03-25 11:50:13 +03:00
Takuya ASADA	2582dbe4a0	dist/ami: use tilde for release candidate builds Sync with ubuntu package versioning rule Signed-off-by: Takuya ASADA <syuu@scylladb.com> Message-Id: <1458882718-29317-1-git-send-email-syuu@scylladb.com>	2016-03-25 11:34:28 +03:00
Glauber Costa	e750a94300	sanity check Seastar's I/O queue configuration While Seastar in general can accept any parameter for its I/O queues, Scylla in particular shouldn't run with them disabled. Such will be the status when the max-io-requests parameter is not enabled. On top of that, we would like to have enough depth per I/O queue to allow for shard-local parallelism. Therefore, we will require a minimum per-queue capacity of 4. In machines where the disk iodepth is not enough to allow for 4 concurrent requests per shard, one should reduce the number of I/O queues. For --max-io-requests, we will check the parameter itself. However, the --num-io-queues parameter is not mandatory, and given enough concurrent requests, Seastar's default configuration can very well just be doing the right thing. So for that, we will check the final result of each I/O queue. As is the case with other checks of this sort, this can be overridden by the --developer-mode switch. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <63bf7e91ac10c95810351815bb8f5e94d75592a5.1458836000.git.glauber@scylladb.com>	2016-03-25 11:33:57 +03:00
Tomasz Grabiec	53bbcf4a1e	schema_tables: Wait for notifications to be processed. Listeners may defer since: `93015bcc54` "migration_manager: Make the migration callbacks runs inside seastar thread" Not all places were adjusted to wait for them. Fix that. Message-Id: <1458837613-27616-1-git-send-email-tgrabiec@scylladb.com>	2016-03-24 19:04:12 +02:00
Avi Kivity	12744217b8	Initial github issue template Message-Id: <1458817106-1513-1-git-send-email-avi@scylladb.com>	2016-03-24 15:37:00 +02:00
Benoît Canet	4ac1126677	collectd: Write to the network to get rid of spurious log messages Closes #1018 Suggested-by: Avi Kivity <avi@scylladb.com> Signed-off-by: Benoît Canet <benoit@scylladb.com> Message-Id: <1458759378-4935-1-git-send-email-benoit@scylladb.com>	2016-03-24 12:34:14 +02:00
Calle Wilund	ff5df306e3	database: Use disk-marking delete function in discard_sstables Fixes #797 To make sure an inopportune crash after truncate does not leave sstables on disk to be considered live, and thus resurrect data, after a truncate, use delete function that renames the TOC file to make sure we've marked sstables as dead on disk when we finish this discard call. Message-Id: <1458575440-505-2-git-send-email-calle@scylladb.com>	2016-03-24 12:02:08 +02:00
Calle Wilund	4e52b41a46	sstables: Add delete func to rename TOC ensuring table is marked dead Note: "normal" remove_by_toc_name must now be prepared for and check if the TOC of the sstable is already moved to temp file when we get to the juicy delete parts. Message-Id: <1458575440-505-1-git-send-email-calle@scylladb.com>	2016-03-24 12:01:53 +02:00
Asias He	6fd6e57e80	streaming: Harden keep alive timer - Do nothing in case the session is closed, so that we do not fire up the timer again - Print log info when no progress has been made when the timer expires, which is very useful for debugging an idle session - Grab a reference when the keep alive timer is running Message-Id: <9f2cc3164696905a6a39c0d072a980765d598dfd.1458782956.git.asias@scylladb.com>	2016-03-24 11:58:54 +02:00
Avi Kivity	112a930f92	Merge "Bring back simplify session completion logic" from Asias "The following patches were reverted because they were thought to break Glauber's "Make sure repairs do not cripple incoming load" series. It turns out these two patches just made another bug more visible. The bug is fixed in `c2eff7e824` (streaming: Complete receive task after the flush). We can bring the two patches back now. Passed repair_additional_test.py and update_cluster_layout_tests.py with smp 2."	2016-03-24 11:57:20 +02:00
Tomasz Grabiec	341b509f68	cql_test_env: Make initialization exception-safe Currently start() is not prepared to handle exceptions thrown from service initialization. It's easy to trigger such an exception by starting two tests at the same time, which will result in a socket bind error. Exception thrown from start() typically results in assertion failures like this one: seastar::sharded<Service>::~sharded() [with Service = database]: Assertion `_instances.empty()' failed. This patch fixes the problem by combining start() and stop() in a single do_with() and using RAII for stopping services. Now exceptions thrown from service initialization should stop services in proper order and let the original exception pass through. Example result: fatal error in "test_new_schema_with_no_structural_change_is_propagated": std::runtime_error: bind: Address already in use Message-Id: <1458768018-27662-1-git-send-email-tgrabiec@scylladb.com>	2016-03-24 11:20:01 +02:00
Shlomi Livne	d3a91e737b	fix a collision between --ami command line param and env sysconfig scylla-server includes an AMI, and the script also used an AMI variable; fix this by renaming the script variable. `6a18634f9f` introduced this issue since it started importing the sysconfig scylla-server Signed-off-by: Shlomi Livne <shlomi@scylladb.com> Message-Id: <0bc472bb885db2f43702907e3e40d871f1385972.1458767984.git.shlomi@scylladb.com>	2016-03-24 08:14:41 +02:00
Asias He	fe263e5436	Revert "Revert "streaming: Start to send mutations after PREPARE_DONE_MESSAGE"" This reverts commit `1f29a698d5`.	2016-03-24 08:43:17 +08:00
Asias He	a6dd6e6d55	Revert "Revert "streaming: Simplify session completion logic"" This reverts commit `354fca9d56`.	2016-03-24 07:48:27 +08:00
Gleb Natapov	0afd1c6f0a	config: enable truncate_request_timeout_in_ms option Option truncate_request_timeout_in_ms is used by truncate. Mark it as used. Message-Id: <20160323162649.GH2282@scylladb.com>	2016-03-23 18:50:24 +02:00
Yoav Kleinberger	91269d0c15	tools/scyllatop: add sums to aggregate view the aggregate view now supports both sums and means. Signed-off-by: Yoav Kleinberger <yoav@scylladb.com> Message-Id: <1328af8efb113a786d7402b0704220108bfb28db.1458749600.git.yoav@scylladb.com>	2016-03-23 18:49:57 +02:00
Shlomi Livne	6a18634f9f	scylla_io_setup import scylla-server env args scylla_io_setup requires the scylla-server env to be set up to run correctly. Previously scylla_io_setup was encapsulated in scylla-io.service, which assured this. Extracting CPUSET,SMP from SCYLLA_ARGS as CPUSET is needed for invoking io_tune Signed-off-by: Shlomi Livne <shlomi@scylladb.com> Message-Id: <d49af9cb54ae327c38e451ff76fe0322e64a5f00.1458747527.git.shlomi@scylladb.com>	2016-03-23 17:54:06 +02:00
Pekka Enberg	8bf3d4f550	Merge "Make sure repairs do not cripple incoming load" from Glauber "This series makes sure that the influence of repairs on the ongoing loads is limited. This patch does not fix the situation completely, but it will be the best we can do for 1.0 Here's a brief explanation about some potentially contentious points, and future work: 1) With the old parallelism semaphore in tree, we could never really drop parallelism below 256, since even with (local) parallelism = 1, we would still have 256 vnodes. So while the number 100 is totally empirical, we know for a fact that around 200-something, we start having real trouble. (total) parallelism = 100 is enough to allow us to survive a load as much as 3 times heavier than the load described in Issue944. So while it is empirical, at least it is based on something 2) I totally support changing the checksumming algorithm. However, I would rather focus my efforts on testing this to exhaustion than doing this at the moment. But if anybody wants to do it, I think it is a great thing to have before 1.0. Especially because we'll probably need a new verb for that, so we would be better off having it from the start 3) This problem was made harder due to the fact that there are three conditions really that can affect the ongoing load. Only one of them needs to trigger for us to see degradation, so fixing them individually will usually buy us nothing. Those are: a) The disk bandwidth. Since the mutations are all together in the same memtable/commitlog as normal memtables, we can't differentiate between them from the I/O Scheduler perspective. This is not an issue of course if the incoming mutations are not enough for us to saturate the disk, but especially given the highly parallel nature of repair, we usually will. If the commitlog queue starts getting too big, for instance, new requests will start being put to wait.
The effect of this part of the series is to completely shift the high waiting times from those classes to the streaming ones (unfortunately compaction is still affected, but that's fine IMHO). With the new streaming classes, the waiting time of memtable / commitlog requests is still kept in the microseconds range. The streaming classes, on the other hand, will be in the hundreds of milliseconds range, or even seconds. b) The memory consumption: since the whole problem that leads to a) is the fact that due to high disk activity some requests will have to wait, we will end up with a lot of streaming memtables not yet flushed. Because of that, we will start throttling new incoming CQL requests and all the isolation efforts are rendered useless. Once again, due to the highly parallel nature of repair, this turned out to be a very easy condition to trigger. The solution proposed here is to limit a maximum amount of dirty memory for the repair job (in here, 25 %). This way, we can endure even slightly heavier loads without sweating too much. c) The task scheduler: repair generates a ton of requests for range checksums, and we actually want to keep it that way - so that the ranges checksummed are small enough so we don't have to resend a lot of mutations for no reason. However, if we pile up thousands of continuations in the task scheduler, seastar has absolutely no mechanism (right now) to prioritize between different kinds of requests. That means that the continuations that are supposed to be handling user requests will simply not run for a long time. Even if the Seastar load is less than 100 % that is still a problem, since that is just adding hundreds of milliseconds worth of latencies to any request processing. Fixes #944 and fixes #1033."	2016-03-23 16:07:06 +02:00
Yoav Kleinberger	d2cfb86dc8	tools/scyllatop: defend against unexpected strings from collectd Signed-off-by: Yoav Kleinberger <yoav@scylladb.com> Message-Id: <cd7ecf6b3b82bd2027179cbec4e689a946469e9a.1458740337.git.yoav@scylladb.com>	2016-03-23 16:05:59 +02:00
Asias He	c2eff7e824	streaming: Complete receive task after the flush A STREAM_MUTATION_DONE message will signal the receiver that the sender has completed the sending of stream mutations. When the receiver finds it has zero tasks to send and zero tasks to receive, it will finish the stream_session, and in turn finish the stream_plan if all the stream_sessions are finished. We should call receive_task_completed only after the flush finishes so that when stream_plan is finished all the data is on disk. Fixes repair_disjoint_data_test issue with Glauber's "[PATCH v4 0/9] Make sure repairs do not cripple incoming load" series ====================================================================== FAIL: repair_disjoint_data_test (repair_additional_test.RepairAdditionalTest) ---------------------------------------------------------------------- Traceback (most recent call last): File "scylla-dtest/repair_additional_test.py", line 102, in repair_disjoint_data_test self.check_rows_on_node(node1, 3000) File "scylla-dtest/repair_additional_test.py", line 33, in check_rows_on_node self.assertEqual(len(result), rows, len(result)) AssertionError: 2461	2016-03-23 09:40:49 -04:00
Glauber Costa	f49e965d78	repair: rework repair code so we can limit parallelism The repair code as it is right now is a bit convoluted: it resorts to detached continuations + do_for_each when calling sync_ranges, and deals with the problem of excessive parallelism by employing a semaphore inside that range. Still, even by doing that, we still generate a great number of checksum requests because the ranges themselves are processed in parallel. It would be better to have a single-semaphore to limit the overall parallelism for all requests. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-03-23 09:40:49 -04:00
Glauber Costa	34a9fc106f	database: keep streaming memtables in their own region group Theoretically, because we can have a lot of pending streaming memtables, we can have the database start throttling and incoming connections slowing down during streaming. Turns out this is actually a very easy condition to trigger. That is basically because the other side of the wire in this case is quite efficient in sending us work. This situation is alleviated a bit by reducing parallelism, but not only does it not go away completely, once we have the tools to start increasing parallelism again it will become commonplace. The solution for this is to limit the streaming memtables to a fraction of the total allowed dirty memory. Using the nesting capability built into the LSA regions, we will make the streaming region group a child of the main region group. With that, we can throttle streaming requests separately, while at the same time being able to control the total amount of dirty memory as well. Because of the property, it can still be the case that incoming requests will throttle earlier due to streaming - unless we allow for more dirty memory to be used during repairs - but at least that effect will be limited. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-03-23 09:40:47 -04:00
Glauber Costa	455d5a57d2	streaming memtables: coalesce incoming writes The repair process will potentially send ranges containing few mutations, definitely not enough to fill a memtable. It wants to know whether or not each of those ranges individually succeeded or failed, so we need a future for each. Small memtables being flushed are bad, and we would like to write bigger memtables so we can better utilize our disks. One of the ways to fix that is changing the repair itself to send more mutations in a single batch. But relying on that is a bad idea for two reasons: First, the goals of the SSTable writer and the repair sender are at odds. The SSTable writer wants to write as few SSTables as possible, while the repair sender wants to break down the range in pieces as small as it can and checksum them individually, so it doesn't have to send a lot of mutations for no reason. Second, even if the repair process wants to process larger ranges at once, some ranges themselves may be small. So while most ranges would be large, we would still have potentially some fairly small SSTables lying around. The best course of action in this case is to coalesce the incoming streams write-side. repair can now choose whatever strategy - small or big ranges - it wants, resting assured that the incoming memtables will be coalesced together. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-03-23 09:38:22 -04:00
Glauber Costa	5fa866223d	streaming: add incoming streaming mutations to a different sstable Keeping the mutations coming from the streaming process as mutations like any other has a number of advantages - and that's why we do it. However, this makes it impossible for Seastar's I/O scheduler to differentiate between incoming requests from clients, and those that are arriving from peers in the streaming process. As a result, if the streaming mutations consume a significant fraction of the total mutations, and we happen to be using the disk at its limits, we are in no position to provide any guarantees - defeating the whole purpose of the scheduler. To implement that, we'll keep a separate set of memtables that will contain only streaming mutations. We don't have to do it this way, but doing so makes life a lot easier. In particular, to write an SSTable, our API requires (because the filter requires), that a good estimate on the number of partitions is informed in advance. The partitions also need to be sorted. We could write mutations directly to disk, but the above conditions couldn't be met without significant effort. In particular, because mutations can be arriving from multiple peer nodes, we can't really sort them without keeping a staging area anyway. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-03-23 09:13:00 -04:00
Glauber Costa	10c8ca6ace	priority manager: separate streaming reads from writes Streaming has currently one class, that can be used to contain the read operations being generated by the streaming process. Those reads come from two places: - checksums (if doing repair) - reading mutations to be sent over the wire. Depending on the amount of data we're dealing with, that can generate a significant chunk of data, with seconds worth of backlog, and if we need to have the incoming writes intertwined with those reads, those can take a long time. Even if one node is only acting as a receiver, it may still read a lot for the checksums - if we're talking about repairs, those are coming from the checksums. However, in more complicated failure scenarios, it is not hard to imagine a node that will be both sending and receiving a lot of data. The best way to guarantee progress on both fronts is to put both kinds of operations into different classes. This patch introduces a new write class, and renames the old read class so it can have a more meaningful name. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-03-23 09:12:59 -04:00
Glauber Costa	78189de57f	database: make seal_on_overflow a method of the memtable_list Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-03-23 09:12:59 -04:00
Glauber Costa	635bb942b2	database: move add_memtable as a method of the memtable_list The column family still has to teach the memtable list how to allocate a new memtable, since it uses CF parameters to do so. After that, the memtable_list's constructor takes a seal and a create function and is complete. The copy constructor can now go, since there are no users left. The behavior of keeping a reference to the underlying memtables can also go, since we can now guarantee that nobody is keeping references to it (it is not even a shared pointer anymore). Individual memtables are, and users may be keeping references to them individually. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-03-23 09:12:59 -04:00


9022 Commits