From Glauber:
There are current some outstanding issues with the throttling code. It's
easier to see them with the streaming code, but at least one of them is general.
One of them is related to situations in which the amount of memory available
leaves only one memtable fitting in memory. That would only happen with the
general code if we set the memtable cleanup threshold to 100 % - and I don't
even know if it is valid - but will happen quite often with the streaming code.
If that happens, we'll start throttling when that memtable is being written,
but won't be able to put anything else in its place - leading to unnecessary
throttling.
The second, and more serious, happens when we start throttling and the amount
of available memory is not at least 1MB. This can deadlock the database in
the sense that it will prevent any request from continuing, and in turn causing
a flush due to memtable size. It is a good practice anyway to always guarantee
progress.
Fixes#1144
Broken by f15c380a4f.
This resulted in empty collection being returned in the results
instead of no collection.
Fixes org.apache.cassandra.cql3.validation.entities.CollectionsTest
from cassandra-unit-tests.
* seastar 2aeb9dd...2185f37 (15):
> reactor: avoid issuing systemwide memory barriers in parallel
> Revert "Use sys_membarrier() when available"
> Merge "Various exception-safety fixes" from Tomasz
> future-util: make map reduce exception safe
> collectd: do not give up after a failure
> future-util: make repeat_until_value exception safe
> rpc: do not block connection when unknown verbs is received
> rpc: do not wait for a reply after timeout
> rpc: move connection stats to base class
> core/reactor: Handle io_submit failures inside flush_pending_aio
> apps/iotune: add --fs-check option to use iotune for kernel version check
> Merge "Some exception safety patches" from Paweł
> tls: Fix conversion of dh_params::level to gnutls_sec_param_t
> core: posix_thread: Mark start_routine as noexcept
> fair_queue: better overflow protection
"If we compact sstables A, B into a new sstable C we must either delete both
A and B, or none of them. This is because a tombstone in B may delete data
in A, and during compaction, both the tombstone and the data are removed.
If only B is deleted, then the data gets resurrected.
Non-atomic deletion occurs because the filesystem does not support atomic
deletion of multiple files; but the window for that is small and is not
addressed in this patchset. Another case is when A is shared across
multiple shards (as is the case when changing shard count, or migrating
from existing Cassandra sstables). This case is covered by this patchset.
Fixes #1181."
Our current throttling code releases one requests per 1MB of memory available
that we have. If we are below the memory limit, but not by 1MB or more, then
we will keep getting to unthrottle, but never really do anything.
If another memtable is close to the flushing point, those requests may be
exactly the ones that would make it flush. Without them, we'll freeze the
database.
In general, we need to always release at least one request to make sure that
progress is always achieved.
This fixes#1144
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Since commit c1cffd06 logger catch errors internally, so no need to
catch most of them at the top level. Only those that can happen during
parameter evaluation can reach here. Change parameters to not throw
too.
query::result transformation to printable form is very heavy operation
that allocates memory and thus can fail. Add a class to query::result that
can be used with logger to push to string conversion when output is
performed.
This is usually not a problem for the main memtable list - although it can be,
depending on settings, but shows up easily for the streaming memtables list.
We would like to have at least two memtables, even if we have to cut it short.
If we don't do that, one memtable will have use all available memory and we'll
force throttling until the memtable gets totally flushed.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
throttle_state is currently a nested member of database, but there is no
particular reason - aside from the fact that it is currently only ever
referenced by the database for us to do so.
We'll soon want to have some interaction between this and the column family, to
allow us to flush during throttle. To make that easier, let's unnest it.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
This is a preparation patch so we can move the throttling infrastructure inside
the memtable_list. To do that, the region group will have to be passed to the
throttler so let's just go ahead and store it.
In consequence of that, all that the CF has to tell us is what is the current
schema - no longer how to create a new memtable.
Also, with a new parameter to be passed to the memtable_list the creation code
gets quite big and hard to follow. So let's move the creation functions to a
helper.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
If sstables A, B are compacted, A and B must be deleted atomically.
Otherwise, if A has data that is covered by a tombstone in B, and that
tombstone is deleted, and if B is deleted while A is not, then the data
in A is resurrected.
Fixes#1181.
A shared sstable must be compacted by all shards before it can be deleted.
Since we're stoping, that's not going to happen. Cancel those pending
deletions to let anyone waiting on them to continue.
When we compact a set of sstables, we have to remove the set atomically,
otherwise we can resurrect data if the following happens:
insert data to sstable A
insert tombstone to sstable B
compact A+B -> C (removing both data and tombstone)
delete B only
read data from A
Since an sstable may be shared by multiple shard, and each shard performs
compaction at a different time, we need to defer deletion of an sstable
set until all shards agree that the set can be deleted.
An additional atomicity issue exists because posix does not provide a way
to atomically delete multiple files. This issue is not addressed by this
patch.
"With this series, we support all the 3 nodetool removenode commands, e.g.,
$ nodetool removenode 778948bf-6709-4eb5-80fe-bee911e9c3bf
$ nodetool removenode status
RemovalStatus: Removing token (-8969872965815280276). Waiting for
replication confirmation from [127.0.0.3,127.0.0.1].
$ nodetool removenode force
RemovalStatus: No token removals in process.
Tested with:
1)
- start 3 nodes
- inject data with
cassandra-stress write no-warmup cl=TWO n=2000000 -schema 'replication(factor=2)'
- kill -9 node2
- wait for node2 to be in DOWN state
- run nodetool removenode host2_host_id on node1
2)
- start 3 nodes
- inject data with
cassandra-stress write no-warmup cl=TWO n=2000000 -schema 'replication(factor=2)'
- kill -9 node2
- wait for node2 to be in DOWN state
- run nodetool removenode host2_host_id on node1
- kill -9 node3
- nodetool removenode will wait forever since node3 is gonne, node3
will never send the replication confirmation to node1
- run nodetool removenode force on node1
nodetool removenode completes with the following error:
$ nodetool removenode 31690b82-ebb0-4594-8bcf-1ce82b6e0f6e
nodetool: Scylla API server HTTP POST to URL
'/storage_service/remove_node' failed: nodetool removenode force is called by user
nodetool removenode force completes sucessfully
$ nodetool removenode force
RemovalStatus: Removing token (-9171569494049085776). Waiting for
replication confirmation from [127.0.0.3,127.0.0.1].
Fixes #1135."
Make the Docker image more user-friendly by starting up JMX proxy in the
background and install Scylla tools in the image. Also add a welcome
banner like we have with our AMI so that users have pointers to nodetool
and cqlsh, as well as our documentation.
Message-Id: <1460376059-3678-1-git-send-email-penberg@scylladb.com>
This is kind of sorting, so it belongs there, but it also fixes a bug in
storage_proxy::get_read_executor() that assumes filter_for_query() do
not change order of nodes in all_nodes when extra replica is chosen.
Otherwise if coordinator ip happens to be last in all_nodes then it will
be chosen as extra replica and will be quired twice.
Message-Id: <1460549369-29523-1-git-send-email-gleb@scylladb.com>
We choosed #!/bin/sh for shebang when we started to implement installer scripts, not bash.
After we started to work on Ubuntu, we found that we mistakenly used bash syntax on AMI script, it caused error since /bin/sh is dash on Ubuntu.
So we changed shebang to /bin/bash for the script, from that time we have both sh scripts and bash scripts.
(2f39e2e269)
If we use bash syntax on sh scripts, it won't work on Ubuntu but works on Fedora/CentOS, could be very easy to confusing.
So switch all scripts to #!/bin/bash. It will much safer.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1460594643-30666-1-git-send-email-syuu@scylladb.com>
"This patchset contains some fixes spotted during post-merged review
by {Nad,}av{,i}. I don't consider any of them a must for backport to 1.0,
but since we haven't yet even backported the main series, might as well backport
everything.
It also includes some unit tests to make sure that they will be kept working
in the future."
There is a problem in the implementation of leveled compaction strategy that
prevents level 1 from being compacted into level 2, and so forth. As a result,
all sstables will only belong to either level 0 or 1. One of the consequences
is level 1 being overwhelmed by a huge amount of sstables.
The root of the problem is a conditional statement in the code that prevents a
single sstable, with level > 0, from being compacted into a subsequent level
that is empty or has no overlapping sstables.
Fixes#1180.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <9a4bffdb0368dea77b49c23687015ff5832299ab.1460508373.git.raphaelsc@scylladb.com>
They are used by nodetool removenode:
$ nodetool removenode force
$ nodetool removenode status
For example:
$ nodetool removenode status
RemovalStatus: Removing token (-8969872965815280276). Waiting for
replication confirmation from [127.0.0.3,127.0.0.1].
$ nodetool removenode force
RemovalStatus: No token removals in process.
Tested with:
1)
- start 3 nodes
- inject data with
cassandra-stress write no-warmup cl=TWO n=2000000 -schema 'replication(factor=2)'
- kill -9 node2
- wait for node2 to be in DOWN state
- run nodetool removenode host2_host_id on node1
2)
- start 3 nodes
- inject data with
cassandra-stress write no-warmup cl=TWO n=2000000 -schema 'replication(factor=2)'
- kill -9 node2
- wait for node2 to be in DOWN state
- run nodetool removenode host2_host_id on node1
- kill -9 node3
- nodetool removenode will wait forever since node3 is gonne, node3
will never send the replication confirmation to node1
- run nodetool removenode force on node1
nodetool removenode completes with the following error:
$ nodetool removenode 31690b82-ebb0-4594-8bcf-1ce82b6e0f6e
nodetool: Scylla API server HTTP POST to URL
'/storage_service/remove_node' failed: nodetool removenode force is called by user
nodetool removenode force completes sucessfully
$ nodetool removenode force
RemovalStatus: Removing token (-9171569494049085776). Waiting for
replication confirmation from [127.0.0.3,127.0.0.1].
Fixes 1135.
This change will allow user to specify the maximum size of a new sstable
created as a result of leveled compaction.
Example of using this setting:
ALTER TABLE ks.test5 with compaction = {'sstable_size_in_mb': '1000',
'class': 'LeveledCompactionStrategy'}
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <ebb9844401af74388bda12586c2435283f6d8db8.1460486043.git.raphaelsc@scylladb.com>
When we recreate the summary from a missing Summary, we should make
sure it is generated sanely, and that it resembles the Summary that
would have otherwise been there.
In this tests we'll grab one of the Summary tests we've been doing,
and just apply them to the non-existent Summary file. We expect
the same results on those cases. Plus, a new test is added with some
sanity checking.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Now that we can boot without a Summary file, we can just as easily boot
with a broken one.
Suggested by Nadav, and it is actually very easy to do, so do it.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Spotted by Avi post-merge
1) Need to close the file
2) Should be using the parameter pc instead of the default_class
Signed-off-by: Glauber Costa <glauber@scylladb.com>
This shouldn't be a problem in practice, because if read_toc() fails,
the users will just tend to discard the sstable object altogether, and
not insist on using it.
However, if somebody does try to keep using it, a subsequent read_toc() could
theoretically have some components filled up leading the new reader to believe
the toc was populated successfully.
It is easier to just clear the _components set and never worry about it, than
trying to reason about whether or not that could happen.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
To share systemd unit file between Fedora/CentOS and Ubuntu, generate
systemd unit file on building time since Fedora/CentOS and Ubuntu has
sysconfdir on different place.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1459779957-11007-1-git-send-email-syuu@scylladb.com>
Using leveled compaction strategy, only a few sstables will contain a
given key, so we need to filter out the rest. Using the summary entries
to filter keys works if the key is before the first summary entry,
but does not work if it is after the last summary entry, because the last
summary entry does not represent the last key; so sstables that are
are towards the beginning of the ring are read even if they do not contain
the key, greatly reducing read performance.
Fix by consulting the summary's first_key/last_key entries before consulting
the summary entry array.