During decommission, the storage_service::unbootstrap() needs to
initiate a batchlog replay operation. To sync the replay operation
initiated by the timer in batchlog_manager and storage_service, a
semaphore is introduced. To simplify the semaphore locking, the
management code now always runs on shard zero, but the real work is
distruted to all shards.
As Nadav noticed in his bug report, check_marker is creating its error messages
using characters instead of numbers - which is what we intended here in the
first place.
That happens because sprint(), when faced with an 8-byte type, interprets this
as a character. To avoid that we'll use uint16_t types, taking care not to
sign-extend them.
The bug also noted that one of the error messages is missing a parameter, and
that is also fixed.
Fixes#1122
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <74f825bbff8488ffeb1911e626db51eed88629b1.1459266115.git.glauber@scylladb.com>
When we, for some reason, fail to compact an SSTable, we do not log the file
name leaving us with cryptic messages that tell us what happened, but not where
it happened.
This patch adds logging in compaction so that we'll know what's going on.
Please note that readers are more of a concern, because the SSTable being
written technically do not exist yet. Still, better safe than sorry: if
open_data fails, or we leave an unfinished SSTable, it is still good to know
which one was the culprit.
Some argument can be made about whether we should log this at the lower SSTable
level, or at the compaction level.
The reason I am logging this at the compaction level, is that we don't really
know which exception will trigger, and where: it may be the case that we're
seeing exceptions that are not SSTable specific, and may not have the chance to
log it properly.
In particular, if the exception happens inside the reader: read_rows() and
friends only return a mutation reader, which doesn't really do anything until
we call read(). But at that time, we don't hold any pointers to the SSTable
anymore.
In Summary, logging at the compaction level guarantees that we always do it no
matter what. Exceptions that are part of the main SSTable path can log the file
name as well if they want: if that's the case, we'll be left with the name
appearing twice. That's totally harmless, and better than none.
Fixes#1123
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <c5c969fb6aeb788a037bd7a4ea69979c1042cb34.1459263847.git.glauber@scylladb.com>
commitlog's sync period is initialized as the batch period, and not as the
sync period itself as it should be.
I've found this by code inspection, but unless I am missing something
really fundamental, this seems to be completely wrong. It's been working
fine because in our defaults, I have checked that both variables default to
the same value. But it seems to me that as long as anyone would change one
of them, the behavior wouldn't be as expected.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <2e7c565242fe5d4481a3ee8b0ba425ef14f5e42a.1459252783.git.glauber@scylladb.com>
Sysv init script was added just for prevent warning message on lintian,
never really used by Ubuntu users. Result of that, we often break this
script since upstart/systemd unit file frequently changed. It may
confuse users, it's better to use Upstart only, just like Fedora/CentOS.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1459177601-20269-2-git-send-email-syuu@scylladb.com>
On some Fedora environments such as Fedora official AMI, dnf-yum package
is not installed by default, causes command not found error when we run
our setup scripts.
To prevent this, we need to add dnf-yum to scylla-server package
dependency.
Fixes#1106
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1459099744-23068-1-git-send-email-syuu@scylladb.com>
After 4e52b41a4, remove_by_toc_name() became aware of temporary TOC
files, however, it doesn't consider that some components may be
missing if temporary TOC is present.
When creating a new sstable, the first thing we do is to write all
components into temporary TOC, so content of a temporary TOC isn't
reliable until it is renamed.
Solution is about implementing the following flow (described by Avi):
"Flow should be:
- remove all components in parallel
- forgive ENOENT, since the compoent may not have been written;
otherwise deletion error should be raised
- fsync the directory
- delete the temporary TOC
"
This problem can be reproduced by running compaction without disk
space, so compaction would fail and leave a partial sstable that would
be marked for deletion. Afterwards, remove_by_toc_name() would try to
delete a component that doesn't exist because it looked at the content
of temporary TOC.
Fixes#1095.
Signed-off-by: Raphael Carvalho <raphaelsc@scylladb.com>
Message-Id: <0cfcaacb43cc5bad3a8a7ea6c1fa6f325c5de97d.1459194263.git.raphaelsc@scylladb.com>
We had a problem reading certain existing Cassandra sstables into
Scylla.
Our consume_range_tombstone() function assumes that the start and end
columns have a certain "end of component" markers, and want to verify
that assumption. But because of bugs in older versions of Cassandra,
see https://issues.apache.org/jira/browse/CASSANDRA-7593, sometimes the
"end of component" was missing (set to 0). CASSANDRA-7593 suggested
this problem might exist on the start column, so we allowed for that,
but now we discovered a case where also the end column is set to 0 -
causing the test in consume_range_tombstone() to fail and the sstable
read to fail - causing Scylla to no be able to import that sstable from
Cassandra. Allowing for an 0 also on the end column made it possible
to read that sstable, compact it, and so on.
Fixes#1125.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1459173964-23242-1-git-send-email-nyh@scylladb.com>
Seastar wrongly limits the number of concurrent submit_to()s to a single
remote shard. This can cause an ABBA deadlock:
fiberA fiberB (x127)
submit_to(0) # lock schema
<- returns
submit_to(0) # lock schema (waits)
submit_to(0) # do work (waits)
The fiberBs wait for fiberA, which in turn waits for a fiberB to return.
While the correct fix is to remote the client-side limit and replace it
with a server-side per-verb limit, we start with a simpler fix that
replaces the blocking lock call with a non-blocking call, removing the
deadlock.
Fixes#1088.
Message-Id: <1459095357-28950-1-git-send-email-avi@scylladb.com>
Problem found by dtest which loads sstables with generation 1 and 2 into an
empty column family. The root of the problem is that reshuffle procedure
changes new sstables to start from generation 2 at least. So reshuffle could
try to set generation 1 to 2 when generation 2 exists.
This problem can be fixed by starting from generation 1 instead, so reshuffle
would handle this case properly.
Fixes#1099.
Signed-off-by: Raphael Carvalho <raphaelsc@scylladb.com>
Message-Id: <88c51fbda9557a506ad99395aeb0a91cd550ede4.1458917237.git.raphaelsc@scylladb.com>
While Seastar in general can accept any parameter for its I/O queues, Scylla
in particular shouldn't run with them disabled. Such will be the status when
the max-io-requests parameter is not enabled.
On top of that, we would like to have enough depth per I/O queue not to allow
for shard-local parallelism. Therefore, we will require a minimum per-queue
capacity of 4. In machines where the disk iodepth is not enough to allow for 4
concurrent requests per shard, one should reduce the number of I/O queues.
For --max-io-requests, we will check the parameter itself. However, the
--num-io-queues parameter is not mandatory, and given enough concurrent
requests, Seastar's default configuration can very well just be doing the right
thing. So for that, we will check the final result of each I/O queue.
As it is the case with other checks of the sorts, this can be overridden by
the --developer-mode switch.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <63bf7e91ac10c95810351815bb8f5e94d75592a5.1458836000.git.glauber@scylladb.com>
Fixes#797
To make sure an inopportune crash after truncate does not leave
sstables on disk to be considered live, and thus resurrect data,
after a truncate, use delete function that renames the TOC file to
make sure we've marked sstables as dead on disk when we finish
this discard call.
Message-Id: <1458575440-505-2-git-send-email-calle@scylladb.com>
Note: "normal" remove_by_toc_name must now be prepared for and check
if the TOC of the sstable is already moved to temp file when we
get to the juicy delete parts.
Message-Id: <1458575440-505-1-git-send-email-calle@scylladb.com>
- Do nothing in case the session is closed, to prevent we fire up the
timer again
- Print log info when no progress has been made if the time expires, it
is very useful to debug a idle session
- Grab a reference when the keep alive timer is running
Message-Id: <9f2cc3164696905a6a39c0d072a980765d598dfd.1458782956.git.asias@scylladb.com>
"The following patches are reverted becasue they were thought they break
Glauber's "Make sure repairs do not cripple incoming load" series. It turns out
these two patches just made another bug more visisble. The bug is fixed in
c2eff7e824 (streaming: Complete receive task
after the flush). We can bring the two patches back now.
Passed repair_additional_test.py and update_cluster_layout_tests.py with smp 2."
Currently start() is not prepared to handle exceptions thrown from
service initialization. It's easy to trigger such exceprion by
starting two tests at the same time, which will result in socket bind
error.
Exception thrown from start() typically results in assertion failures
like this one:
seastar::sharded<Service>::~sharded() [with Service = database]: Assertion `_instances.empty()' failed.
This patch fixes the problem by combining start() and stop() in a
single do_with() and using RAII for stopping services.
Now exceptions thrown from service initialization should stop services
in proper order and let the original exception to pass
through. Example result:
fatal error in "test_new_schema_with_no_structural_change_is_propagated": std::runtime_error: bind: Address already in use
Message-Id: <1458768018-27662-1-git-send-email-tgrabiec@scylladb.com>
scylla_io_seup requires the scylla-server env to be setup to run
correctly. previously scylla_io_setup was encapsulated in
scylla-io.service that assured this.
extracting CPUSET,SMP from SCYLLA_ARGS as CPUSET is needed for invoking
io_tune
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
Message-Id: <d49af9cb54ae327c38e451ff76fe0322e64a5f00.1458747527.git.shlomi@scylladb.com>
"This series makes sure that the influence of repairs on the ongoing loads is
limited. This patch does not fix the situation completely, but it will be the
best we can do for 1.0
Here's a brief explanation about some potentially contentions points, and future work:
1) With the old parallelism semaphore in tree, we could never really drop parallelism
below 256, since even with (local) parallelism = 1, we would still have 256 vnodes. So while
the number 100 is totally empirical, we know for a fact that around 200-something, we
start having real trouble. (total) parallelism = 100 is enough to allow us to survive
a load as much as 3 times heavier than the load described in Issue944. So while it is
empirical, at least it is based on something
2) I totally support changing the checksumming algorithm. However, I would rather focus
my efforts on testing this to exhaustion than doing this at the moment. But if anybody
wants to do it, I think it is a great thing to have before 1.0. Specially because
we'll probably need a new verb for that, so we would be better off having it from the start
3) This problem was made harder due to the fact that there are three conditions really that
can affect the ongoing load. Only one of them needs to trigger for us to see degradation, so
fixing them individually will usually buy us nothing.
Those are:
a) The disk bandwidth. Since the mutations are all together in the same memtable/commitlog
as normal memtables, we can differentiate between them from the I/O Scheduler perspective.
This is not an issue of course if the incoming mutations are not enough for us to saturate
the disk, but specially given the highly parallel nature of repair, we usually will. If
the commitlog queue starts getting too big, for instance, new requests will start being
put to wait. The effect of this part of the series is to *completely* shift the high waiting
times from those classes to the streaming ones (unfortunately compaction is still affected,
but that's fine IMHO). With the new streaming classes, the waiting time of a memtable / commitlog
requests is still kept in the microseconds range. The streaming classes, on the other hand,
will be in the hundreds of milliseconds range, or even seconds.
b) The memory consumption: since the whole problem that leads to a) is the fact that due to high
disk activity some requests will have to wait, we will end up with a lot of streaming memtables
not yet flushed. Because of that, we will start throttling new incoming CQL requests and all
the isolation efforts are rendered useless. Once again, due to the highly parallel nature of
repair, this turned out to be a very easy condition to trigger. The solution proposed here
is to limit a maximum amount of dirty memory for the repair job (in here, 25 %). This way,
we can endure even slightly heavier loads without sweating too much.
c) The task scheduler: repair generates a ton of requests for range checksums, and we actually
want to keep it that way - so that the ranges checksummed are small enough so we don't have
to resend a lot of mutations for no reason. However, if we pile up thousands of continuations
in the task scheduler, seastar has absolutely no mechanism (right now) to prioritize between
different kinds of requests. That means that the continuations that are supposed to be handling
user requests will simply not for a long time. Even if the Seastar load is less than 100 %
that is still a problem, since that is just adding hundreds of milliseconds worth of latencies to
any request processing.
Fixes#944 and fixes #1033."
A STREAM_MUTATION_DONE message will signal the receiver that the sender
has completed the sending of streams mutations. When the receiver finds
it has zero task to send and zero task to receive, it will finish the
stream_session, and in turn finish the stream_plan if all the
stream_sessions are finished. We should call receive_task_completed only
after the flush finishes so that when stream_plan is finshed all the
data is on disk.
Fixes repair_disjoint_data_test issue with Glauber's "[PATCH v4 0/9] Make
sure repairs do not cripple incoming load" serries
======================================================================
FAIL: repair_disjoint_data_test
(repair_additional_test.RepairAdditionalTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "scylla-dtest/repair_additional_test.py",
line 102, in repair_disjoint_data_test
self.check_rows_on_node(node1, 3000)
File "scylla-dtest/repair_additional_test.py",
line 33, in check_rows_on_node
self.assertEqual(len(result), rows, len(result))
AssertionError: 2461
The repair code as it is right now is a bit convoluted: it resorts to detached
continuations + do_for_each when calling sync_ranges, and deals with the
problem of excessive parallelism by employing a semaphore inside that range.
Still, even by doing that, we still generate a great number of
checksum requests because the ranges themselves are processed in parallel.
It would be better to have a single-semaphore to limit the overall parallelism
for all requests.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Theoretically, because we can have a lot of pending streaming memtables, we can
have the database start throttling and incoming connections slowing down during
streaming.
Turns out this is actually a very easy condition to trigger. That is basically
because the other side of the wire in this case is quite efficient in sending
us work. This situation is alleviated a bit by reducing parallelism, but not
only it does't go away completely, once we have the tools to start increasing
parallelism again it will become common place.
The solution for this is to limit the streaming memtables to a fraction of the
total allowed dirty memory. Using the nesting capability built in in the LSA
regions, we will make the streaming region group a child of the main region
group. With that, we can throttle streaming requests separately, while at the
same time being able to control the total amount of dirty memory as well.
Because of the property, it can still be the case that incoming requests will
throttle earlier due to streaming - unless we allow for more dirty memory to be
used during repairs - but at least that effect will be limited.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The repair process will potentially send ranges containing few mutations,
definitely not enough to fill a memtable. It wants to know whether or not each
of those ranges individually succeeded or failed, so we need a future for each.
Small memtables being flushed are bad, and we would like to write bigger
memtables so we can better utilize our disks.
One of the ways to fix that, is changing the repair itself to send more
mutations at a single batch. But relying on that is a bad idea for two reasons:
First, the goals of the SSTable writer and the repair sender are at odds. The
SSTable writer wants to write as few SSTables as possible, while the repair
sender wants to break down the range in pieces as small as it can and checksum
them individually, so it doesn't have to send a lot of mutations for no reason.
Second, even if the repair process wants to process larger ranges at once, some
ranges themselves may be small. So while most ranges would be large, we would
still have potentially some fairly small SSTables lying around.
The best course of action in this case is to coalesce the incoming streams
write-side. repair can now choose whatever strategy - small or big ranges - it
wants, resting assure that the incoming memtables will be coalesced together.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Keeping the mutations coming from the streaming process as mutations like any
other have a number of advantages - and that's why we do it.
However, this makes it impossible for Seastar's I/O scheduler to differentiate
between incoming requests from clients, and those who are arriving from peers
in the streaming process.
As a result, if the streaming mutations consume a significant fraction of the
total mutations, and we happen to be using the disk at its limits, we are in no
position to provide any guarantees - defeating the whole purpose of the
scheduler.
To implement that, we'll keep a separate set of memtables that will contain
only streaming mutations. We don't have to do it this way, but doing so
makes life a lot easier. In particular, to write an SSTable, our API requires
(because the filter requires), that a good estimate on the number of partitions
is informed in advance. The partitions also need to be sorted.
We could write mutations directly to disk, but the above conditions couldn't be
met without significant effort. In particular, because mutations can be
arriving from multiple peer nodes, we can't really sort them without keeping a
staging area anyway.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Streaming has currently one class, that can be used to contain the read
operations being generated by the streaming process. Those reads come from two
places:
- checksums (if doing repair)
- reading mutations to be sent over the wire.
Depending on the amount of data we're dealing with, that can generate a
significant chunk of data, with seconds worth of backlog, and if we need to
have the incoming writes intertwined with those reads, those can take a long
time.
Even if one node is only acting as a receiver, it may still read a lot for the
checksums - if we're talking about repairs, those are coming from the
checksums.
However, in more complicated failure scenarios, it is not hard to imagine a
node that will be both sending and receiving a lot of data.
The best way to guarantee progress on both fronts, is to put both kinds of
operations into different classes.
This patch introduces a new write class, and rename the old read class so it
can have a more meaningful name.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The column family still has to teach the memtable list how to allocate a new memtable,
since it uses CF parameters to do so.
After that, the memtable_list's constructor takes a seal and a create function and is complete.
The copy constructor can now go, since there are no users left.
The behavior of keeping a reference to the underlying memtables can also go, since we can now
guarantee that nobody is keeping references to it (it is not even a shared pointer anymore).
Individual memtables are, and users may be keeping references to them individually.
Signed-off-by: Glauber Costa <glauber@scylladb.com>