Commit Graph

9166 Commits

Author SHA1 Message Date
Tomasz Grabiec
45527fcffa Merge branch 'glommer/issue-1144-v5'
From Glauber:

There are current some outstanding issues with the throttling code. It's
easier to see them with the streaming code, but at least one of them is general.

One of them is related to situations in which the amount of memory available
leaves only one memtable fitting in memory. That would only happen with the
general code if we set the memtable cleanup threshold to 100 % - and I don't
even know if it is valid - but will happen quite often with the streaming code.
If that happens, we'll start throttling when that memtable is being written,
but won't be able to put anything else in its place - leading to unnecessary
throttling.

The second, and more serious, happens when we start throttling and the amount
of available memory is not at least 1MB. This can deadlock the database in
the sense that it will prevent any request from continuing, and in turn causing
a flush due to memtable size. It is a good practice anyway to always guarantee
progress.

Fixes #1144
2016-04-18 12:20:13 +02:00
Gleb Natapov
f3b515052b udt: fix error generation if accessed type is not udt
Fixes #1198
Message-Id: <1460884314-3717-2-git-send-email-gleb@scylladb.com>
2016-04-18 12:45:03 +03:00
Duarte Nunes
ece89069dd udt: Implement to_string() for selectable
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1460884314-3717-1-git-send-email-gleb@scylladb.com>
2016-04-18 12:44:48 +03:00
Pekka Enberg
edf7f098e2 Merge "Fix query of collection cell with all items deleted" from Tomek 2016-04-18 11:01:24 +03:00
Tomasz Grabiec
2e08d0f698 Merge branch 'dev/gleb/logging'
Logging improvements from Gleb.
2016-04-15 19:03:44 +02:00
Tomasz Grabiec
89bc32b020 tests: Add test for query of collection with deleted item 2016-04-15 18:14:05 +02:00
Tomasz Grabiec
c69d0a8e87 mutation_partition: Fix collection emptiness check
Broken by f15c380a4f.

This resulted in empty collection being returned in the results
instead of no collection.

Fixes org.apache.cassandra.cql3.validation.entities.CollectionsTest
from cassandra-unit-tests.
2016-04-15 18:14:05 +02:00
Tomasz Grabiec
b0d4782016 types: Add default argument values to is_any_live() 2016-04-15 18:14:05 +02:00
Avi Kivity
0de32ab120 Merge seastar upstream
* seastar 2aeb9dd...2185f37 (15):
  > reactor: avoid issuing systemwide memory barriers in parallel
  > Revert "Use sys_membarrier() when available"
  > Merge "Various exception-safety fixes" from Tomasz
  > future-util: make map reduce exception safe
  > collectd: do not give up after a failure
  > future-util: make repeat_until_value exception safe
  > rpc: do not block connection when unknown verbs is received
  > rpc: do not wait for a reply after timeout
  > rpc: move connection stats to base class
  > core/reactor: Handle io_submit failures inside flush_pending_aio
  > apps/iotune: add --fs-check option to use iotune for kernel version check
  > Merge "Some exception safety patches" from Paweł
  > tls: Fix conversion of dh_params::level to gnutls_sec_param_t
  > core: posix_thread: Mark start_routine as noexcept
  > fair_queue: better overflow protection
2016-04-15 16:06:53 +03:00
Pekka Enberg
3f2286d02e Merge "Delete compacted sstables atomically" from Avi
"If we compact sstables A, B into a new sstable C we must either delete both
A and B, or none of them.  This is because a tombstone in B may delete data
in A, and during compaction, both the tombstone and the data are removed.
If only B is deleted, then the data gets resurrected.

Non-atomic deletion occurs because the filesystem does not support atomic
deletion of multiple files; but the window for that is small and is not
addressed in this patchset.  Another case is when A is shared across
multiple shards (as is the case when changing shard count, or migrating
from existing Cassandra sstables).  This case is covered by this patchset.

Fixes #1181."
2016-04-14 22:04:15 +03:00
Glauber Costa
9c87ae3496 throttle: always release at least one request if we are below the limit
Our current throttling code releases one requests per 1MB of memory available
that we have. If we are below the memory limit, but not by 1MB or more, then
we will keep getting to unthrottle, but never really do anything.

If another memtable is close to the flushing point, those requests may be
exactly the ones that would make it flush. Without them, we'll freeze the
database.

In general, we need to always release at least one request to make sure that
progress is always achieved.

This fixes #1144

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-04-14 13:13:15 -04:00
Gleb Natapov
9801d69d53 storage_proxy: add query result row count to brief format
Report number of rows in brief reporting format, but only if
we can count them without linearizing result's buffer.
2016-04-14 19:26:00 +03:00
Gleb Natapov
53993527ed storage_proxy: move verbose query result printing into separate logger
If query result is large tracing cannot be done since printing the result takes
too much time and space.
2016-04-14 19:26:00 +03:00
Gleb Natapov
46e5d05220 storage_proxy: cleanup query logging.
Since commit c1cffd06 logger catch errors internally, so no need to
catch most of them at the top level. Only those that can happen during
parameter evaluation can reach here. Change parameters to not throw
too.
2016-04-14 19:26:00 +03:00
Gleb Natapov
15ebe5e4e5 query: add calculate_row_count function to query::result 2016-04-14 19:26:00 +03:00
Gleb Natapov
f47b2dad18 query: add lazy printer to query::result
query::result transformation to printable form is very heavy operation
that allocates memory and thus can fail. Add a class to query::result that
can be used with logger to push to string conversion when output is
performed.
2016-04-14 19:26:00 +03:00
Glauber Costa
2c5dfe08c1 memtable_list: make sure at least two memtables are available
This is usually not a problem for the main memtable list - although it can be,
depending on settings, but shows up easily for the streaming memtables list.

We would like to have at least two memtables, even if we have to cut it short.
If we don't do that, one memtable will have use all available memory and we'll
force throttling until the memtable gets totally flushed.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-04-14 12:12:50 -04:00
Glauber Costa
1daede7396 unnest throttle_state
throttle_state is currently a nested member of database, but there is no
particular reason - aside from the fact that it is currently only ever
referenced by the database for us to do so.

We'll soon want to have some interaction between this and the column family, to
allow us to flush during throttle. To make that easier, let's unnest it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-04-14 12:12:50 -04:00
Glauber Costa
39def369ce move information about memtables' region group inside memtable list
This is a preparation patch so we can move the throttling infrastructure inside
the memtable_list. To do that, the region group will have to be passed to the
throttler so let's just go ahead and store it.

In consequence of that, all that the CF has to tell us is what is the current
schema - no longer how to create a new memtable.

Also, with a new parameter to be passed to the memtable_list the creation code
gets quite big and hard to follow. So let's move the creation functions to a
helper.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-04-14 12:12:50 -04:00
Avi Kivity
a843aea547 db: delete compacted sstables atomically
If sstables A, B are compacted, A and B must be deleted atomically.
Otherwise, if A has data that is covered by a tombstone in B, and that
tombstone is deleted, and if B is deleted while A is not, then the data
in A is resurrected.

Fixes #1181.
2016-04-14 17:14:26 +03:00
Avi Kivity
3798d04ae8 sstables: convert sstable::mark_for_deletion() to atomic deletion infrastructure
All deletions must go through the same data structure, or some atomic
deletions will never be satisified.
2016-04-14 17:14:26 +03:00
Avi Kivity
e43dbac836 main: cancel pending atomic deletions on shutdown
A shared sstable must be compacted by all shards before it can be deleted.
Since we're stoping, that's not going to happen.  Cancel those pending
deletions to let anyone waiting on them to continue.
2016-04-14 17:14:26 +03:00
Avi Kivity
2ba584db8d sstables: add delete_atomically(), for atomically deleting multiple sstables
When we compact a set of sstables, we have to remove the set atomically,
otherwise we can resurrect data if the following happens:

 insert data to sstable A
 insert tombstone to sstable B
 compact A+B -> C (removing both data and tombstone)
 delete B only
 read data from A

Since an sstable may be shared by multiple shard, and each shard performs
compaction at a different time, we need to defer deletion of an sstable
set until all shards agree that the set can be deleted.

An additional atomicity issue exists because posix does not provide a way
to atomically delete multiple files.  This issue is not addressed by this
patch.
2016-04-14 17:14:26 +03:00
Pekka Enberg
a1a9294d8c Merge "Support nodetool removenode force and status" from Asias
"With this series, we support all the 3 nodetool removenode commands, e.g.,

$ nodetool removenode 778948bf-6709-4eb5-80fe-bee911e9c3bf

$ nodetool removenode status
RemovalStatus: Removing token (-8969872965815280276). Waiting for
replication confirmation from [127.0.0.3,127.0.0.1].

$ nodetool removenode force
RemovalStatus: No token removals in process.

Tested with:

1)
- start 3 nodes
- inject data with
  cassandra-stress write no-warmup cl=TWO n=2000000 -schema 'replication(factor=2)'
- kill -9 node2
- wait for node2 to be in DOWN state
- run nodetool removenode host2_host_id on node1

2)
- start 3 nodes
- inject data with
  cassandra-stress write no-warmup cl=TWO n=2000000 -schema 'replication(factor=2)'
- kill -9 node2
- wait for node2 to be in DOWN state
- run nodetool removenode host2_host_id on node1
- kill -9 node3
- nodetool removenode will wait forever since node3 is gonne, node3
  will never send the replication confirmation to node1
- run nodetool removenode force on node1
  nodetool removenode completes with the following error:
    $ nodetool removenode 31690b82-ebb0-4594-8bcf-1ce82b6e0f6e
    nodetool: Scylla API server HTTP POST to URL
    '/storage_service/remove_node' failed: nodetool removenode force is called by user
  nodetool removenode force completes sucessfully
    $ nodetool removenode force
    RemovalStatus: Removing token (-9171569494049085776). Waiting for
    replication confirmation from [127.0.0.3,127.0.0.1].

Fixes #1135."
2016-04-14 15:44:33 +03:00
Pekka Enberg
144d1e3216 dist/docker/redhat: Start up JMX proxy and include tools
Make the Docker image more user-friendly by starting up JMX proxy in the
background and install Scylla tools in the image. Also add a welcome
banner like we have with our AMI so that users have pointers to nodetool
and cqlsh, as well as our documentation.
Message-Id: <1460376059-3678-1-git-send-email-penberg@scylladb.com>
2016-04-14 15:41:21 +03:00
Pekka Enberg
355c3ea331 dist/docker/redhat: Make sure image builds against latest Scylla
Use "yum clean expire-cache" to make sure we build against the latest
Scylla release.
Message-Id: <1460374418-27315-1-git-send-email-penberg@scylladb.com>
2016-04-14 15:41:10 +03:00
Gleb Natapov
6f13715f8c storage_proxy: add logging to read executor creation path
Message-Id: <1460549369-29523-4-git-send-email-gleb@scylladb.com>
2016-04-14 14:58:02 +03:00
Gleb Natapov
14ecadb247 storage_proxy: add logging for mutation write path
Message-Id: <1460549369-29523-3-git-send-email-gleb@scylladb.com>
2016-04-14 14:57:29 +03:00
Gleb Natapov
dbb1217896 cl: enable logging for insufficient LOCAL_QUORUM consistency
Message-Id: <1460549369-29523-2-git-send-email-gleb@scylladb.com>
2016-04-14 14:56:58 +03:00
Gleb Natapov
dfdbb1e703 storage_proxy: move hack to make coordinator most preferable node for read into sorting function
This is kind of sorting, so it belongs there, but it also fixes a bug in
storage_proxy::get_read_executor() that assumes filter_for_query() do
not change order of nodes in all_nodes when extra replica is chosen.
Otherwise if coordinator ip happens to be last in all_nodes then it will
be chosen as extra replica and will be quired twice.
Message-Id: <1460549369-29523-1-git-send-email-gleb@scylladb.com>
2016-04-14 14:56:21 +03:00
Takuya ASADA
f98997120a dist: #!/bin/bash for all scripts
We choosed #!/bin/sh for shebang when we started to implement installer scripts, not bash.
After we started to work on Ubuntu, we found that we mistakenly used bash syntax on AMI script, it caused error since /bin/sh is dash on Ubuntu.
So we changed shebang to /bin/bash for the script, from that time we have both sh scripts and bash scripts.
(2f39e2e269)
If we use bash syntax on sh scripts, it won't work on Ubuntu but works on Fedora/CentOS, could be very easy to confusing.
So switch all scripts to #!/bin/bash. It will much safer.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1460594643-30666-1-git-send-email-syuu@scylladb.com>
2016-04-14 12:01:28 +03:00
Pekka Enberg
60352f810a Merge "Fixes for the reading of missing Summary" from Glauber
"This patchset contains some fixes spotted during post-merged review
by {Nad,}av{,i}. I don't consider any of them a must for backport to 1.0,
but since we haven't yet even backported the main series, might as well backport
everything.

It also includes some unit tests to make sure that they will be kept working
in the future."
2016-04-13 11:32:05 +03:00
Raphael S. Carvalho
beaacbda2e tests: test that leveled strategy was fixed
L1 wasn't being compacted into L2.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <1a357896a448eafa7da4d28bc56fa02b89d4193e.1460508373.git.raphaelsc@scylladb.com>
2016-04-13 11:14:28 +03:00
Raphael S. Carvalho
c7b728e716 sstables: Fix leveled compaction strategy
There is a problem in the implementation of leveled compaction strategy that
prevents level 1 from being compacted into level 2, and so forth. As a result,
all sstables will only belong to either level 0 or 1. One of the consequences
is level 1 being overwhelmed by a huge amount of sstables.

The root of the problem is a conditional statement in the code that prevents a
single sstable, with level > 0, from being compacted into a subsequent level
that is empty or has no overlapping sstables.

Fixes #1180.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <9a4bffdb0368dea77b49c23687015ff5832299ab.1460508373.git.raphaelsc@scylladb.com>
2016-04-13 11:14:14 +03:00
Asias He
1e84699a64 api: Wire up storage_service removal_status and force_remove_completion
They are used by nodetool removenode:

$ nodetool removenode force
$ nodetool removenode status

For example:

$ nodetool removenode status
RemovalStatus: Removing token (-8969872965815280276). Waiting for
replication confirmation from [127.0.0.3,127.0.0.1].

$ nodetool removenode force
RemovalStatus: No token removals in process.

Tested with:

1)
- start 3 nodes
- inject data with
  cassandra-stress write no-warmup cl=TWO n=2000000 -schema 'replication(factor=2)'
- kill -9 node2
- wait for node2 to be in DOWN state
- run nodetool removenode host2_host_id on node1

2)
- start 3 nodes
- inject data with
  cassandra-stress write no-warmup cl=TWO n=2000000 -schema 'replication(factor=2)'
- kill -9 node2
- wait for node2 to be in DOWN state
- run nodetool removenode host2_host_id on node1
- kill -9 node3
- nodetool removenode will wait forever since node3 is gonne, node3
  will never send the replication confirmation to node1
- run nodetool removenode force on node1
  nodetool removenode completes with the following error:
    $ nodetool removenode 31690b82-ebb0-4594-8bcf-1ce82b6e0f6e
    nodetool: Scylla API server HTTP POST to URL
    '/storage_service/remove_node' failed: nodetool removenode force is called by user
  nodetool removenode force completes sucessfully
    $ nodetool removenode force
    RemovalStatus: Removing token (-9171569494049085776). Waiting for
    replication confirmation from [127.0.0.3,127.0.0.1].

Fixes 1135.
2016-04-13 14:53:28 +08:00
Asias He
891e947314 storage_service: Rename remove_node to removenode
nodetool uses removenode command to remove a node. Rename the
implementation in storage_service to match the command.
2016-04-13 14:53:28 +08:00
Asias He
9ffb95216d storage_service: Add force_remove_completion
It is needed by the

$ nodetool removenode force

command.
2016-04-13 14:53:28 +08:00
Asias He
7c7e5967f6 storage_service: Add get_removal_status
It is needed by the

$ nodetool removenode status

command.
2016-04-13 14:53:28 +08:00
Asias He
8d7cd07d6c storage_service: Add print info in confirm_replication
The message is rare but it is very useful to debug removenode operation.
2016-04-13 14:53:28 +08:00
Asias He
ffe91b5755 token_metadata: Do not assert in get_host_id
Throw an exception instead of assert.
2016-04-13 14:53:27 +08:00
Raphael S. Carvalho
c28d168619 sstables: allow user to specify max sstable size with leveled strategy
This change will allow user to specify the maximum size of a new sstable
created as a result of leveled compaction.

Example of using this setting:
ALTER TABLE ks.test5 with compaction = {'sstable_size_in_mb': '1000',
'class': 'LeveledCompactionStrategy'}

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <ebb9844401af74388bda12586c2435283f6d8db8.1460486043.git.raphaelsc@scylladb.com>
2016-04-13 09:13:33 +03:00
Raphael S. Carvalho
15246f31f7 sstables: fix incorrect sstable size when compression is enabled
Size of uncompressed sstable was being unconditionally used to determine
when to stop writing a table. When compression is enabled, compressed
size should be used instead. Problem affected Scylla when compression
and leveled strategy were used.

Fixes #1177.

Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <d9bf26def41fb33ca297f4127ce042b7f67adf96.1460484529.git.raphaelsc@scylladb.com>
2016-04-13 09:01:01 +03:00
Glauber Costa
60ab3b3f50 sstable_tests: make sure the generation of the Summary is sane
When we recreate the summary from a missing Summary, we should make
sure it is generated sanely, and that it resembles the Summary that
would have otherwise been there.

In this tests we'll grab one of the Summary tests we've been doing,
and just apply them to the non-existent Summary file. We expect
the same results on those cases. Plus, a new test is added with some
sanity checking.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-04-12 11:55:01 -04:00
Glauber Costa
114ba5e3a8 be robust against broken summary files
Now that we can boot without a Summary file, we can just as easily boot
with a broken one.

Suggested by Nadav, and it is actually very easy to do, so do it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-04-12 11:55:01 -04:00
Glauber Costa
72dc45999d review fixes for generate_summary
Spotted by Avi post-merge
1) Need to close the file
2) Should be using the parameter pc instead of the default_class

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-04-12 11:55:01 -04:00
Glauber Costa
f78f43850d clear components if reading toc fail
This shouldn't be a problem in practice, because if read_toc() fails,
the users will just tend to discard the sstable object altogether, and
not insist on using it.

However, if somebody does try to keep using it, a subsequent read_toc() could
theoretically have some components filled up leading the new reader to believe
the toc was populated successfully.

It is easier to just clear the _components set and never worry about it, than
trying to reason about whether or not that could happen.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-04-12 11:55:01 -04:00
Glauber Costa
0f41ef1b84 index_reader: avoid misleading parent name
Also add comments about the expected signature of IndexConsumer

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-04-12 11:15:11 -04:00
Takuya ASADA
1eebe8bce1 dist: Support systemd for Ubuntu 15.10
To share systemd unit file between Fedora/CentOS and Ubuntu, generate
systemd unit file on building time since Fedora/CentOS and Ubuntu has
sysconfdir on different place.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1459779957-11007-1-git-send-email-syuu@scylladb.com>
2016-04-12 14:39:26 +03:00
Avi Kivity
715794cce6 sstables: filter sstables single-row read using first_key/last_key
Using leveled compaction strategy, only a few sstables will contain a
given key, so we need to filter out the rest.  Using the summary entries
to filter keys works if the key is before the first summary entry,
but does not work if it is after the last summary entry, because the last
summary entry does not represent the last key; so sstables that are
are towards the beginning of the ring are read even if they do not contain
the key, greatly reducing read performance.

Fix by consulting the summary's first_key/last_key entries before consulting
the summary entry array.
2016-04-12 10:33:17 +03:00
Pekka Enberg
64c9ebb962 Merge "More exception safety fixes" from Paweł
"This is the second part of exception safety fixes for issues
discovered using memory allocation failure injector."
2016-04-12 08:08:00 +03:00