Commit Graph

9022 Commits

Author SHA1 Message Date
Asias He
cdb43c5586 batchlog_manager: Allow user initiated bachlog replay operation
During decommission, the storage_service::unbootstrap() needs to
initiate a batchlog replay operation. To sync the replay operation
initiated by the timer in batchlog_manager and storage_service, a
semaphore is introduced.  To simplify the semaphore locking, the
management code now always runs on shard zero, but the real work is
distruted to all shards.
2016-03-30 20:54:30 +08:00
Glauber Costa
23808ba184 sstables: fix exception printouts in check_marker
As Nadav noticed in his bug report, check_marker is creating its error messages
using characters instead of numbers - which is what we intended here in the
first place.

That happens because sprint(), when faced with an 8-byte type, interprets this
as a character.  To avoid that we'll use uint16_t types, taking care not to
sign-extend them.

The bug also noted that one of the error messages is missing a parameter, and
that is also fixed.

Fixes #1122

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <74f825bbff8488ffeb1911e626db51eed88629b1.1459266115.git.glauber@scylladb.com>
2016-03-29 19:23:28 +03:00
Takuya ASADA
c1277bacb4 dist/common/scripts: prevent misinterpret blank input as '/dev/', show error when inputted device path is not found
Fixes #1110

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1459267786-19123-1-git-send-email-syuu@scylladb.com>
2016-03-29 19:18:51 +03:00
Glauber Costa
d5c1366e85 compaction: be verbose about which table is causing an exception
When we, for some reason, fail to compact an SSTable, we do not log the file
name leaving us with cryptic messages that tell us what happened, but not where
it happened.

This patch adds logging in compaction so that we'll know what's going on.
Please note that readers are more of a concern, because the SSTable being
written technically do not exist yet. Still, better safe than sorry: if
open_data fails, or we leave an unfinished SSTable, it is still good to know
which one was the culprit.

Some argument can be made about whether we should log this at the lower SSTable
level, or at the compaction level.

The reason I am logging this at the compaction level, is that we don't really
know which exception will trigger, and where: it may be the case that we're
seeing exceptions that are not SSTable specific, and may not have the chance to
log it properly.

In particular, if the exception happens inside the reader: read_rows() and
friends only return a mutation reader, which doesn't really do anything until
we call read(). But at that time, we don't hold any pointers to the SSTable
anymore.

In Summary, logging at the compaction level guarantees that we always do it no
matter what. Exceptions that are part of the main SSTable path can log the file
name as well if they want: if that's the case, we'll be left with the name
appearing twice. That's totally harmless, and better than none.

Fixes #1123

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <c5c969fb6aeb788a037bd7a4ea69979c1042cb34.1459263847.git.glauber@scylladb.com>
2016-03-29 18:15:56 +03:00
Glauber Costa
d536846433 commitlog: initialize sync period with actual sync period
commitlog's sync period is initialized as the batch period, and not as the
sync period itself as it should be.

I've found this by code inspection, but unless I am missing something
really fundamental, this seems to be completely wrong. It's been working
fine because in our defaults, I have checked that both variables default to
the same value. But it seems to me that as long as anyone would change one
of them, the behavior wouldn't be as expected.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <2e7c565242fe5d4481a3ee8b0ba425ef14f5e42a.1459252783.git.glauber@scylladb.com>
2016-03-29 15:21:02 +03:00
Takuya ASADA
a5bb6c4b1b dist/ubuntu: drop classical sysv init script, only support Upstart for Ubuntu 14.04LTS
Sysv init script was added just for prevent warning message on lintian,
never really used by Ubuntu users. Result of that, we often break this
script since upstart/systemd unit file frequently changed. It may
confuse users, it's better to use Upstart only, just like Fedora/CentOS.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1459177601-20269-2-git-send-email-syuu@scylladb.com>
2016-03-29 11:48:18 +03:00
Takuya ASADA
42ce77a3b7 dist/redhat: prevent 'yum: command not found' on some Fedora environment
On some Fedora environments such as Fedora official AMI, dnf-yum package
is not installed by default, causes command not found error when we run
our setup scripts.

To prevent this, we need to add dnf-yum to scylla-server package
dependency.

Fixes #1106

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1459099744-23068-1-git-send-email-syuu@scylladb.com>
2016-03-29 11:29:09 +03:00
Avi Kivity
adffb1c061 dist/ubuntu: improve handling of bad command line options
On a bad command line, Scylla will exit with an exit code of 2.  Mark it
as a "normal" exit, to prevent a respawn.

Fixes #1087
Message-Id: <1458827221-12833-1-git-send-email-avi@scylladb.com>
2016-03-29 11:14:45 +03:00
Avi Kivity
c1d8fb56f7 dist/ubuntu: specify kill timeout
Allow more time for commitlog flushing
Message-Id: <1458827216-12778-1-git-send-email-avi@scylladb.com>
2016-03-29 11:14:27 +03:00
Raphael Carvalho
d515a7fd85 sstables: fix deletion of sstable with temporary TOC
After 4e52b41a4, remove_by_toc_name() became aware of temporary TOC
files, however, it doesn't consider that some components may be
missing if temporary TOC is present.
When creating a new sstable, the first thing we do is to write all
components into temporary TOC, so content of a temporary TOC isn't
reliable until it is renamed.

Solution is about implementing the following flow (described by Avi):
"Flow should be:

  - remove all components in parallel
  - forgive ENOENT, since the compoent may not have been written;
otherwise deletion error should be raised
  - fsync the directory
  - delete the temporary TOC
"

This problem can be reproduced by running compaction without disk
space, so compaction would fail and leave a partial sstable that would
be marked for deletion. Afterwards, remove_by_toc_name() would try to
delete a component that doesn't exist because it looked at the content
of temporary TOC.

Fixes #1095.

Signed-off-by: Raphael Carvalho <raphaelsc@scylladb.com>
Message-Id: <0cfcaacb43cc5bad3a8a7ea6c1fa6f325c5de97d.1459194263.git.raphaelsc@scylladb.com>
2016-03-29 10:38:01 +03:00
Tomasz Grabiec
d1db23e353 storage_service: Fix typos
Message-Id: <1458837390-26634-1-git-send-email-tgrabiec@scylladb.com>
2016-03-29 10:29:04 +03:00
Pekka Enberg
994390769f Update scylla-ami submodule
* dist/ami/files/scylla-ami 89e7436...7019088 (1):
  > Re-enable clocksource=tsc on AMI
2016-03-29 10:18:06 +03:00
Takuya ASADA
201b0c6ab3 dist: re-enable clocksource=tsc on AMI
clocksource=tsc on boot parameter mistakenly dropped on b3c85aea89, need to re-enable.

[ penberg: Manual backport of commit 050fb911d5 to 1.0. ]
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1459180643-4389-1-git-send-email-syuu@scylladb.com>

(cherry picked from commit 80242ff443)
2016-03-29 10:17:41 +03:00
Pekka Enberg
227daecba6 Revert "dist: move setup scripts to /usr/sbin"
This reverts commit 989357189a because it
broke our Jenkins packaging jobs.
2016-03-29 10:17:05 +03:00
Pekka Enberg
d1ec97e76f Revert "dist: re-enable clocksource=tsc on AMI"
This reverts commit 050fb911d5 in
preparation for reverting 989357189a.
2016-03-29 10:16:48 +03:00
Takuya ASADA
050fb911d5 dist: re-enable clocksource=tsc on AMI
clocksource=tsc on boot parameter mistakenly dropped on b3c85aea89, need to re-enable.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1459180643-4389-1-git-send-email-syuu@scylladb.com>
2016-03-29 09:53:23 +03:00
Asias He
62d443a07d streaming: Fix log of plan_id and session address in stream_session
They are get swapped. Fix it up. Spotted by looking at the log.
Message-Id: <d163d71e9a96d1a45c3a4c529519790eeff7c486.1459172778.git.asias@scylladb.com>
2016-03-29 09:01:06 +03:00
Nadav Har'El
a05577ca41 sstable: fix read failure of certain sstables
We had a problem reading certain existing Cassandra sstables into
Scylla.

Our consume_range_tombstone() function assumes that the start and end
columns have a certain "end of component" markers, and want to verify
that assumption. But because of bugs in older versions of Cassandra,
see https://issues.apache.org/jira/browse/CASSANDRA-7593, sometimes the
"end of component" was missing (set to 0). CASSANDRA-7593 suggested
this problem might exist on the start column, so we allowed for that,
but now we discovered a case where also the end column is set to 0 -
causing the test in consume_range_tombstone() to fail and the sstable
read to fail - causing Scylla to no be able to import that sstable from
Cassandra. Allowing for an 0 also on the end column made it possible
to read that sstable, compact it, and so on.

Fixes #1125.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1459173964-23242-1-git-send-email-nyh@scylladb.com>
2016-03-28 17:09:37 +03:00
Duarte Nunes
db881fdc8f cql: Add support for pg-style string literal
This patch adds support for pg-style string literals to the CQL
grammar.

Fixes #1078

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1459093238-2529-1-git-send-email-duarte@scylladb.com>
2016-03-28 17:06:03 +03:00
yan cui
e5d1c031ac dist: add ubuntu docker file 2016-03-28 10:14:12 +03:00
Avi Kivity
a919113fdb schema_tables: fix deadlock in cross-node communications
Seastar wrongly limits the number of concurrent submit_to()s to a single
remote shard.  This can cause an ABBA deadlock:

  fiberA                fiberB (x127)
  submit_to(0)                         # lock schema
  <- returns
                        submit_to(0)   # lock schema (waits)
  submit_to(0)                         # do work (waits)

The fiberBs wait for fiberA, which in turn waits for a fiberB to return.

While the correct fix is to remote the client-side limit and replace it
with a server-side per-verb limit, we start with a simpler fix that
replaces the blocking lock call with a non-blocking call, removing the
deadlock.

Fixes #1088.

Message-Id: <1459095357-28950-1-git-send-email-avi@scylladb.com>
2016-03-28 10:12:10 +03:00
Raphael Carvalho
e6e5999282 Fix corner-case in refresh
Problem found by dtest which loads sstables with generation 1 and 2 into an
empty column family. The root of the problem is that reshuffle procedure
changes new sstables to start from generation 2 at least. So reshuffle could
try to set generation 1 to 2 when generation 2 exists.
This problem can be fixed by starting from generation 1 instead, so reshuffle
would handle this case properly.

Fixes #1099.

Signed-off-by: Raphael Carvalho <raphaelsc@scylladb.com>
Message-Id: <88c51fbda9557a506ad99395aeb0a91cd550ede4.1458917237.git.raphaelsc@scylladb.com>
2016-03-27 10:03:32 +03:00
Avi Kivity
077c0d1022 dist: ami: fix AMI_OPT receiving no value
We assign AMI=0 and AMI_OPT=1, so in the true case, AMI_OPT has no value,
and a later compare fails.
2016-03-26 21:16:28 +03:00
Takuya ASADA
989357189a dist: move setup scripts to /usr/sbin
Since these scripts are user command, should be on $PATH.

Fixes #1092

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1458860407-25269-1-git-send-email-syuu@scylladb.com>
2016-03-25 11:50:13 +03:00
Takuya ASADA
2582dbe4a0 dist/ami: use tilde for release candidate builds
Sync with ubuntu package versioning rule

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1458882718-29317-1-git-send-email-syuu@scylladb.com>
2016-03-25 11:34:28 +03:00
Glauber Costa
e750a94300 sanity check Seastar's I/O queue configuration
While Seastar in general can accept any parameter for its I/O queues, Scylla
in particular shouldn't run with them disabled. Such will be the status when
the max-io-requests parameter is not enabled.

On top of that, we would like to have enough depth per I/O queue not to allow
for shard-local parallelism. Therefore, we will require a minimum per-queue
capacity of 4. In machines where the disk iodepth is not enough to allow for 4
concurrent requests per shard, one should reduce the number of I/O queues.

For --max-io-requests, we will check the parameter itself. However, the
--num-io-queues parameter is not mandatory, and given enough concurrent
requests, Seastar's default configuration can very well just be doing the right
thing. So for that, we will check the final result of each I/O queue.

As it is the case with other checks of the sorts, this can be overridden by
the --developer-mode switch.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <63bf7e91ac10c95810351815bb8f5e94d75592a5.1458836000.git.glauber@scylladb.com>
2016-03-25 11:33:57 +03:00
Tomasz Grabiec
53bbcf4a1e schema_tables: Wait for notifications to be processed.
Listeners may defer since:

 93015bcc54 "migration_manager: Make the migration callbacks runs inside seastar thread"

Not all places were adjusted to wait for them. Fix that.

Message-Id: <1458837613-27616-1-git-send-email-tgrabiec@scylladb.com>
2016-03-24 19:04:12 +02:00
Avi Kivity
12744217b8 Initial github issue template
Message-Id: <1458817106-1513-1-git-send-email-avi@scylladb.com>
2016-03-24 15:37:00 +02:00
Benoît Canet
4ac1126677 collectd: Write to the network to get rid of spurious log messages
Closes #1018

Suggested-by: Avi Kivity <avi@scylladb.com>
Signed-of-by: Benoît Canet <benoit@scylladb.com>
Message-Id: <1458759378-4935-1-git-send-email-benoit@scylladb.com>
2016-03-24 12:34:14 +02:00
Calle Wilund
ff5df306e3 database: Use disk-marking delete function in discard_sstables
Fixes #797

To make sure an inopportune crash after truncate does not leave
sstables on disk to be considered live, and thus resurrect data,
after a truncate, use delete function that renames the TOC file to
make sure we've marked sstables as dead on disk when we finish
this discard call.
Message-Id: <1458575440-505-2-git-send-email-calle@scylladb.com>
2016-03-24 12:02:08 +02:00
Calle Wilund
4e52b41a46 sstables: Add delete func to rename TOC ensuring table is marked dead
Note: "normal" remove_by_toc_name must now be prepared for and check
if the TOC of the sstable is already moved to temp file when we
get to the juicy delete parts.
Message-Id: <1458575440-505-1-git-send-email-calle@scylladb.com>
2016-03-24 12:01:53 +02:00
Asias He
6fd6e57e80 streaming: Harden keep alive timer
- Do nothing in case the session is closed, to prevent we fire up the
  timer again

- Print log info when no progress has been made if the time expires, it
  is very useful to debug a idle session

- Grab a reference when the keep alive timer is running

Message-Id: <9f2cc3164696905a6a39c0d072a980765d598dfd.1458782956.git.asias@scylladb.com>
2016-03-24 11:58:54 +02:00
Avi Kivity
112a930f92 Merge "Bring back simplify session completion logic" from Asias
"The following patches are reverted becasue they were thought they break
Glauber's "Make sure repairs do not cripple incoming load" series. It turns out
these two patches just made another bug more visisble. The bug is fixed in
c2eff7e824 (streaming: Complete receive task
after the flush). We can bring the two patches back now.

Passed repair_additional_test.py and update_cluster_layout_tests.py with smp 2."
2016-03-24 11:57:20 +02:00
Tomasz Grabiec
341b509f68 cql_test_env: Make initialization exception-safe
Currently start() is not prepared to handle exceptions thrown from
service initialization. It's easy to trigger such exceprion by
starting two tests at the same time, which will result in socket bind
error.

Exception thrown from start() typically results in assertion failures
like this one:

  seastar::sharded<Service>::~sharded() [with Service = database]: Assertion `_instances.empty()' failed.

This patch fixes the problem by combining start() and stop() in a
single do_with() and using RAII for stopping services.

Now exceptions thrown from service initialization should stop services
in proper order and let the original exception to pass
through. Example result:

  fatal error in "test_new_schema_with_no_structural_change_is_propagated": std::runtime_error: bind: Address already in use
Message-Id: <1458768018-27662-1-git-send-email-tgrabiec@scylladb.com>
2016-03-24 11:20:01 +02:00
Shlomi Livne
d3a91e737b fix a collision betwen --ami command line param and env
sysconfig scylla-server includes an AMI, the script also used an AMI
variable fix this by renaming the script variable

6a18634f9f introduced this issue since it
started imported the sysconfig scylla-server

Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
Message-Id: <0bc472bb885db2f43702907e3e40d871f1385972.1458767984.git.shlomi@scylladb.com>
2016-03-24 08:14:41 +02:00
Asias He
fe263e5436 Revert "Revert "streaming: Start to send mutations after PREPARE_DONE_MESSAGE""
This reverts commit 1f29a698d5.
2016-03-24 08:43:17 +08:00
Asias He
a6dd6e6d55 Revert "Revert "streaming: Simplify session completion logic""
This reverts commit 354fca9d56.
2016-03-24 07:48:27 +08:00
Gleb Natapov
0afd1c6f0a config: enable truncate_request_timeout_in_ms option
Option truncate_request_timeout_in_ms is used by truncate. Mark it as
used.

Message-Id: <20160323162649.GH2282@scylladb.com>
2016-03-23 18:50:24 +02:00
Yoav Kleinberger
91269d0c15 tools/scyllatop: add sums to aggregate view
the aggregate view now supports both sums and means.

Signed-off-by: Yoav Kleinberger <yoav@scylladb.com>
Message-Id: <1328af8efb113a786d7402b0704220108bfb28db.1458749600.git.yoav@scylladb.com>
2016-03-23 18:49:57 +02:00
Shlomi Livne
6a18634f9f scylla_io_setup import scylla-server env args
scylla_io_seup requires the scylla-server env to be setup to run
correctly. previously scylla_io_setup was encapsulated in
scylla-io.service that assured this.

extracting CPUSET,SMP from SCYLLA_ARGS as CPUSET is needed for invoking
io_tune

Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
Message-Id: <d49af9cb54ae327c38e451ff76fe0322e64a5f00.1458747527.git.shlomi@scylladb.com>
2016-03-23 17:54:06 +02:00
Pekka Enberg
8bf3d4f550 Merge "Make sure repairs do not cripple incoming load" from Glauber
"This series makes sure that the influence of repairs on the ongoing loads is
limited. This patch does not fix the situation completely, but it will be the
best we can do for 1.0

Here's a brief explanation about some potentially contentions points, and future work:

1) With the old parallelism semaphore in tree, we could never really drop parallelism
below 256, since even with (local) parallelism = 1, we would still have 256 vnodes. So while
the number 100 is totally empirical, we know for a fact that around 200-something, we
start having real trouble. (total) parallelism = 100 is enough to allow us to survive
a load as much as 3 times heavier than the load described in Issue944. So while it is
empirical, at least it is based on something

2) I totally support changing the checksumming algorithm. However, I would rather focus
my efforts on testing this to exhaustion than doing this at the moment. But if anybody
wants to do it, I think it is a great thing to have before 1.0. Specially because
we'll probably need a new verb for that, so we would be better off having it from the start

3) This problem was made harder due to the fact that there are three conditions really that
can affect the ongoing load. Only one of them needs to trigger for us to see degradation, so
fixing them individually will usually buy us nothing.

Those are:
    a) The disk bandwidth. Since the mutations are all together in the same memtable/commitlog
       as normal memtables, we can differentiate between them from the I/O Scheduler perspective.
       This is not an issue of course if the incoming mutations are not enough for us to saturate
       the disk, but specially given the highly parallel nature of repair, we usually will. If
       the commitlog queue starts getting too big, for instance, new requests will start being
       put to wait. The effect of this part of the series is to *completely* shift the high waiting
       times from those classes to the streaming ones (unfortunately compaction is still affected,
       but that's fine IMHO). With the new streaming classes, the waiting time of a memtable / commitlog
       requests is still kept in the microseconds range. The streaming classes, on the other hand,
       will be in the hundreds of milliseconds range, or even seconds.

    b) The memory consumption: since the whole problem that leads to a) is the fact that due to high
       disk activity some requests will have to wait, we will end up with a lot of streaming memtables
       not yet flushed. Because of that, we will start throttling new incoming CQL requests and all
       the isolation efforts are rendered useless. Once again, due to the highly parallel nature of
       repair, this turned out to be a very easy condition to trigger. The solution proposed here
       is to limit a maximum amount of dirty memory for the repair job (in here, 25 %). This way,
       we can endure even slightly heavier loads without sweating too much.

    c) The task scheduler: repair generates a ton of requests for range checksums, and we actually
       want to keep it that way - so that the ranges checksummed are small enough so we don't have
       to resend a lot of mutations for no reason. However, if we pile up thousands of continuations
       in the task scheduler, seastar has absolutely no mechanism (right now) to prioritize between
       different kinds of requests. That means that the continuations that are supposed to be handling
       user requests will simply not for a long time. Even if the Seastar load is less than 100 %
       that is still a problem, since that is just adding hundreds of milliseconds worth of latencies to
       any request processing.

Fixes #944 and fixes #1033."
2016-03-23 16:07:06 +02:00
Yoav Kleinberger
d2cfb86dc8 tools/scyllatop: defend against unexpected strings from collectd
Signed-off-by: Yoav Kleinberger <yoav@scylladb.com>
Message-Id: <cd7ecf6b3b82bd2027179cbec4e689a946469e9a.1458740337.git.yoav@scylladb.com>
2016-03-23 16:05:59 +02:00
Asias He
c2eff7e824 streaming: Complete receive task after the flush
A STREAM_MUTATION_DONE message will signal the receiver that the sender
has completed the sending of streams mutations. When the receiver finds
it has zero task to send and zero task to receive, it will finish the
stream_session, and in turn finish the stream_plan if all the
stream_sessions are finished. We should call receive_task_completed only
after the flush finishes so that when stream_plan is finshed all the
data is on disk.

Fixes repair_disjoint_data_test issue with Glauber's "[PATCH v4 0/9] Make
sure repairs do not cripple incoming load" serries

======================================================================
FAIL: repair_disjoint_data_test
(repair_additional_test.RepairAdditionalTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "scylla-dtest/repair_additional_test.py",
line 102, in repair_disjoint_data_test
    self.check_rows_on_node(node1, 3000)
  File "scylla-dtest/repair_additional_test.py",
line 33, in check_rows_on_node
    self.assertEqual(len(result), rows, len(result))
AssertionError: 2461
2016-03-23 09:40:49 -04:00
Glauber Costa
f49e965d78 repair: rework repair code so we can limit parallelism
The repair code as it is right now is a bit convoluted: it resorts to detached
continuations + do_for_each when calling sync_ranges, and deals with the
problem of excessive parallelism by employing a semaphore inside that range.

Still, even by doing that, we still generate a great number of
checksum requests because the ranges themselves are processed in parallel.

It would be better to have a single-semaphore to limit the overall parallelism
for all requests.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-03-23 09:40:49 -04:00
Glauber Costa
34a9fc106f database: keep streaming memtables in their own region group
Theoretically, because we can have a lot of pending streaming memtables, we can
have the database start throttling and incoming connections slowing down during
streaming.

Turns out this is actually a very easy condition to trigger. That is basically
because the other side of the wire in this case is quite efficient in sending
us work. This situation is alleviated a bit by reducing parallelism, but not
only it does't go away completely, once we have the tools to start increasing
parallelism again it will become common place.

The solution for this is to limit the streaming memtables to a fraction of the
total allowed dirty memory. Using the nesting capability built in in the LSA
regions, we will make the streaming region group a child of the main region
group.  With that, we can throttle streaming requests separately, while at the
same time being able to control the total amount of dirty memory as well.

Because of the property, it can still be the case that incoming requests will
throttle earlier due to streaming - unless we allow for more dirty memory to be
used during repairs - but at least that effect will be limited.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-03-23 09:40:47 -04:00
Glauber Costa
455d5a57d2 streaming memtables: coalesce incoming writes
The repair process will potentially send ranges containing few mutations,
definitely not enough to fill a memtable. It wants to know whether or not each
of those ranges individually succeeded or failed, so we need a future for each.

Small memtables being flushed are bad, and we would like to write bigger
memtables so we can better utilize our disks.

One of the ways to fix that, is changing the repair itself to send more
mutations at a single batch. But relying on that is a bad idea for two reasons:

First, the goals of the SSTable writer and the repair sender are at odds. The
SSTable writer wants to write as few SSTables as possible, while the repair
sender wants to break down the range in pieces as small as it can and checksum
them individually, so it doesn't have to send a lot of mutations for no reason.

Second, even if the repair process wants to process larger ranges at once, some
ranges themselves may be small. So while most ranges would be large, we would
still have potentially some fairly small SSTables lying around.

The best course of action in this case is to coalesce the incoming streams
write-side.  repair can now choose whatever strategy - small or big ranges - it
wants, resting assure that the incoming memtables will be coalesced together.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-03-23 09:38:22 -04:00
Glauber Costa
5fa866223d streaming: add incoming streaming mutations to a different sstable
Keeping the mutations coming from the streaming process as mutations like any
other have a number of advantages - and that's why we do it.

However, this makes it impossible for Seastar's I/O scheduler to differentiate
between incoming requests from clients, and those who are arriving from peers
in the streaming process.

As a result, if the streaming mutations consume a significant fraction of the
total mutations, and we happen to be using the disk at its limits, we are in no
position to provide any guarantees - defeating the whole purpose of the
scheduler.

To implement that, we'll keep a separate set of memtables that will contain
only streaming mutations. We don't have to do it this way, but doing so
makes life a lot easier. In particular, to write an SSTable, our API requires
(because the filter requires), that a good estimate on the number of partitions
is informed in advance. The partitions also need to be sorted.

We could write mutations directly to disk, but the above conditions couldn't be
met without significant effort. In particular, because mutations can be
arriving from multiple peer nodes, we can't really sort them without keeping a
staging area anyway.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-03-23 09:13:00 -04:00
Glauber Costa
10c8ca6ace priority manager: separate streaming reads from writes
Streaming has currently one class, that can be used to contain the read
operations being generated by the streaming process. Those reads come from two
places:

- checksums (if doing repair)
- reading mutations to be sent over the wire.

Depending on the amount of data we're dealing with, that can generate a
significant chunk of data, with seconds worth of backlog, and if we need to
have the incoming writes intertwined with those reads, those can take a long
time.

Even if one node is only acting as a receiver, it may still read a lot for the
checksums - if we're talking about repairs, those are coming from the
checksums.

However, in more complicated failure scenarios, it is not hard to imagine a
node that will be both sending and receiving a lot of data.

The best way to guarantee progress on both fronts, is to put both kinds of
operations into different classes.

This patch introduces a new write class, and rename the old read class so it
can have a more meaningful name.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-03-23 09:12:59 -04:00
Glauber Costa
78189de57f database: make seal_on_overflow a method of the memtable_list
Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-03-23 09:12:59 -04:00
Glauber Costa
635bb942b2 database: move add_memtable as a method of the memtable_list
The column family still has to teach the memtable list how to allocate a new memtable,
since it uses CF parameters to do so.

After that, the memtable_list's constructor takes a seal and a create function and is complete.
The copy constructor can now go, since there are no users left.
The behavior of keeping a reference to the underlying memtables can also go, since we can now
guarantee that nobody is keeping references to it (it is not even a shared pointer anymore).
Individual memtables are, and users may be keeping references to them individually.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-03-23 09:12:59 -04:00