Commit Graph

11716 Commits

Author SHA1 Message Date
Glauber Costa
94e90d4a17 column_family: do not open code generation calculation
We already have a function that wraps this, re-use it.  This FIXME is still
relevant, so just move it there. Let's not lose it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-03-10 21:05:47 -05:00
Glauber Costa
46fdeec60a colum_family: remove mutation_count
We use memory usage as a threshold these days, and nowhere is _mutation_count
checked. Get rid of it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2016-03-10 21:05:47 -05:00
Gleb Natapov
16135c2084 make initialization run in a thread
While looking at initialization code I felt like my head is going to
explode. Moving initialization into a thread makes things a little bit
better. Only lightly tested.

Message-Id: <20160310163142.GE28529@scylladb.com>
2016-03-10 17:42:05 +01:00
Gleb Natapov
176aa25d35 fix developer-mode parameter application on SMP
I am almost sure we want to apply it once on each shard, and not multiple
times on a single shard.

Message-Id: <20160310155804.GB28529@scylladb.com>
2016-03-10 17:17:48 +01:00
Pekka Enberg
97bef4fb7c build: Fix http/http_response_parser.hh dependency
Make sure http_response.hh that is pulled by locator/ec2_snitch.hh is
built. The commit is similar to what commit 6ccf8f8 ("build: make sure
to ask seastar to build http/request_parser.hh, and depend on it") did
for request_parser.hh.

Fixes the following build error on CentOS:

  In file included from ./locator/ec2_multi_region_snitch.hh:41:0,
                   from locator/ec2_multi_region_snitch.cc:39:
  ./locator/ec2_snitch.hh:24:40: fatal error: http/http_response_parser.hh: No such file or directory

Spotted by Shlomi.
Message-Id: <1457612266-315-1-git-send-email-penberg@scylladb.com>
2016-03-10 14:46:41 +01:00
Gleb Natapov
51ca3122cf cleanup forward declaration for key types
Message-Id: <20160310075138.GC6117@scylladb.com>
2016-03-10 10:52:19 +01:00
Pekka Enberg
5dd1fda6cf main: Initialize system keyspace earlier
We start services like gossiper before system keyspace is initialized
which means we can start writing too early. Shuffle code so that system
keyspace is initialized earlier.

Refs #1014
Message-Id: <1457593758-9444-1-git-send-email-penberg@scylladb.com>
2016-03-10 10:39:27 +01:00
Pekka Enberg
f2f35a2f50 Merge "fix shutdown and improve logging" from Asias
"Fixes #1005 and probably fixes #1013."
2016-03-10 08:21:48 +02:00
Asias He
a9ec752939 streaming: Reduce STREAM_MUTATION error logging
There might be larger number of STREAM_MUTATION inflight. Log one error
per column_family per range to avoid spam the log.
2016-03-10 10:56:48 +08:00
Asias He
134b814cde gossip: Log status info when stopping gossip 2016-03-10 10:56:48 +08:00
Asias He
7c4c99d7c7 streaming: Fix a log level in get_column_family_stores
It is supposed to be debug level instead of info level.
2016-03-10 10:56:48 +08:00
Asias He
cb90ff2709 storage_service: Make decommission log info instead of debug level
The log is just a few lines. It is very useful to tell which step fails
in case of error when we do decommission.
2016-03-10 10:56:48 +08:00
Asias He
ed723665df gossip: Do not stop gossip more than once
If we do
   - Decommission a node
   - Stop a node
we will shutdown gossip more than once in:
   - storage_service::decommission
   - storage_service::drain_on_shutdown

Fix by checking if it is already stopped and back off if so.
2016-03-10 10:56:48 +08:00
Asias He
138c5f5834 storage_service: Do not stop messaging_service more than once
If we do
   - Decommission a node
   - Stop a node
we will shutdown messaging_service more than once in:
   - storage_service::decommission
   - storage_service::drain_on_shutdown

Fixes #1005
Refs  #1013

This fix a dtest failure in debug build.

update_cluster_layout_tests.TestUpdateClusterLayout.simple_decommission_node_1_test/

/data/jenkins/workspace/urchin-dtest/label/monster/mode/debug/scylla/seastar/core/future.hh:802:35:
runtime error: member call on null pointer of type 'struct
future_state'
core/future.hh:334:49: runtime error: member access within null
pointer of type 'const struct future_state'
ASAN:SIGSEGV
=================================================================
==4557==ERROR: AddressSanitizer: SEGV on unknown address
0x000000000000 (pc 0x00000065923e bp 0x7fbf6ffac430 sp 0x7fbf6ffac420
T0)
    #0 0x65923d in future_state<>::available() const
/data/jenkins/workspace/urchin-dtest/label/monster/mode/debug/scylla/seastar/core/future.hh:334
    #1 0x41458f1 in future<>::available()
/data/jenkins/workspace/urchin-dtest/label/monster/mode/debug/scylla/seastar/core/future.hh:802
    #2 0x41458f1 in then_wrapped<parallel_for_each(Iterator, Iterator,
Func&&)::<lambda(parallel_for_each_state&)> [with Iterator =
std::__detail::_Node_iterator<std::pair<const net::msg_addr,
net::messaging_service::shard_info>, false, true>; Func =
net::messaging_service::stop()::<lambda(auto:39&)> [with auto:39 =
std::unordered_map<net::msg_addr, net::messaging_service::shard_info,
net::msg_addr::hash>]::<lambda(std::pair<const net::msg_addr,
net::messaging_service::shard_info>&)>]::<lambda(future<>)>, future<>
> /data/jenkins/workspace/urchin-dtest/label/monster/mode/debug/scylla/seastar/core/future.hh:878
2016-03-10 10:56:48 +08:00
Tomasz Grabiec
838a038cbd log: Fix operator<<(std::ostream&, const std::exception_ptr&)
Attempt to print std::nested_exception currently results in exception
to leak outside the printer. Fix by capturing all exception in the
final catch block.

For nested exception, the logger will print now just
"std::nested_exception".  For nested exceptions specifically we should
log more, but that is a separate problem to solve.
Message-Id: <1457532215-7498-1-git-send-email-tgrabiec@scylladb.com>
2016-03-09 16:05:03 +02:00
Pekka Enberg
2566f8dc18 configure: Remove 'scylla_libs' variable
It's not actually used by anyone so drop it.
Message-Id: <1457531753-27891-2-git-send-email-penberg@scylladb.com>
2016-03-09 14:56:54 +01:00
Pekka Enberg
9bfb6a0c5b configure: Add boost date_time library as a dependency
It's needed to fix the debug build.
Message-Id: <1457531753-27891-1-git-send-email-penberg@scylladb.com>
2016-03-09 14:56:51 +01:00
Takuya ASADA
0ab3d0fd52 dist: use SEASTAR_IO instead of SCYLLA_IO
sync with iotune, fixes #1010

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1457530910-1273-1-git-send-email-syuu@scylladb.com>
2016-03-09 15:45:34 +02:00
Gleb Natapov
f242c6395c storage_proxy: add counter for retries reads
Message-Id: <20160309130453.GF2253@scylladb.com>
2016-03-09 14:09:42 +01:00
Pekka Enberg
ab502bcfa8 types: Implement to_string for timestamps and dates
The to_string() function is used for logging purpose so use boost
to_iso_extended_string() to format both timestamps and dates.

Fixes #968 (showstopper)
Message-Id: <1457528755-6164-1-git-send-email-penberg@scylladb.com>
2016-03-09 14:08:33 +01:00
Pekka Enberg
8eedaca948 Merge "streaming: handle cf is deleted" from Asias
"Fixes #979
 Fixes #976"
2016-03-09 14:52:27 +02:00
Asias He
3a4ea227d8 storage_service: Fix effective_ownership
Now, get_ranges_for_endpoint will unwrap the first range. With t0 t1 t2
t3, the first range (t3,t0] will be splitted as (min,t0] and (t3,max].
Skippping the range (t3,max] we will get the correct ownership number as
if the first range were not splitted.

Fixes #928
Message-Id: <2e30ebd53f3dba3cc5e0cf36d5541c354b0e30ca.1457506704.git.asias@scylladb.com>
2016-03-09 13:26:01 +01:00
Asias He
d9ead889f3 streaming: Handle cf is deleted when sending STREAM_MUTATION_DONE
In the preparation phase of streaming, we check that remote node has all
the cf_id which are needed for the entire streaming process, including the
cf_id which local node will send to remote node and wise versa.

So, at later time, if the cf_id is missing, it must be that the cf_id is
deleted. It is fine to ingore no_such_column_family exception. In this
patch, we change the code to ignore at server side to avoid sending the
exception back, to avoid handle exception in an IDL compatiable way.

One thing we can improve is that the sender might know the cf is deleted
later than the receiver does. In this case, the sender will send some
more mutations if we send back the no_such_column_family back to the
sender. However, since we do not throw exceptions in the receiver stream
mutation handler, it will not cause a lot of overhead, the receiver will
just ignore the mutation received.

Fixes #979
2016-03-09 16:50:38 +08:00
Asias He
efa74dbae0 streaming: Do not send if the cf is deleted
It is possible that a cf is deleted after we make the cf reader. Avoid
sending them to avoid the unnecessary overhead to send them on the wire and
the peer node to drop the received mutations.
2016-03-09 16:50:38 +08:00
Asias He
4abaacfc61 db: Introduce column_family_exists
It is cheaper than throwing a no_such_column_family exception to test if
a cf is gone, e.g., deleted.
2016-03-09 16:50:38 +08:00
Asias He
dca9e594cc streaming: Remove the unused test code
It is introduced in the early development of streaming. We have dtest
for streaming now, drop it.
Message-Id: <1457499303-21163-1-git-send-email-asias@scylladb.com>
2016-03-09 10:31:42 +02:00
Pekka Enberg
4f3d6977f1 Merge "Abort stream_session if peer is removed or restarted" from Asias
"Hook streaming with gossip callback so we can abort
the stream_session in such case:

- a node is restarted
- a node is removed from the cluster

Fixes #1001."
2016-03-09 10:18:42 +02:00
Nadav Har'El
2f56577794 sstables: more efficient read of compressed data file
Before this patch, reading large ranges from a compressed data file involved
two inefficiencies:

 1.  The compressed data file was read one compressed chunk at a time.
     Such a chunk is around 30 KB in size, well below our desired sstable
     read-ahead size (sstable_buffer_size = 128 KB).

 2.  Because the compressed chunks have variable length (the uncompressed
     chunk has a fixed length) they are not aligned to disk blocks, so
     consecutive chunks have overlapping blocks which were unnecessarily
     read twice.

The fix for both issues is to build the compressed_file_input_stream on
an existing file_input_stream, instead of using direct file IO to read the
individual chunks. file_input_stream takes care of doing the appropriate
amount of read-ahead, and the compressed_file_input_stream layer does the
decompression of the data read from the underlying layer.

Fixes #992.

Historical note: Implementing compressed_file_input_stream on top of
file_input_stream was already tried in the past, and rejected. The problem
at that time was that compressed_file_input_stream's constructor did not
specify the *end* of the range to read, so that when we wanted to read
only a small range we got too much read-ahead beyond the exactly one
compressed chunk that we needed to read.  Following the fix to issue #964,
we now know on every streaming read also the intended *end* of the stream,
so we can now use this to stop reading at the end of the last required
chunk, even when we use a read-ahead buffer much larger than a chunk.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1457304335-8507-1-git-send-email-nyh@scylladb.com>
2016-03-09 10:14:15 +02:00
Glauber Costa
8260b8fc6f touch CF directories during startup
We try to be robust against files disappearing (due to any kind of corruption)
inside the data directory.

But if the data directory itself goes missing, that's a situation that we don't
handle correctly.  We will keep accepting writes normally, but when we try to
flush the memtable to disk, we'll fail with a system error.

Having the CF directory disappearing is not a common thing. But it is also one
that we can easily protect against, by touching all CF directories we know
about on startup.

Fixes #999

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <ed66373dccca11742150a6d08e21ece3980227d3.1457379853.git.glauber@scylladb.com>
2016-03-09 09:06:51 +02:00
Asias He
bf3507d093 messaging_service: Stop retrying if node is removed from gossip
- Start a node
- Inject data
- Start another node to bootstrap
- Before the second node finishes streaming, kill the second node
- After a while the node will be removed from the cluster becusue it does
  not manage to join the cluster.
- At this time, messaging_service might keep retrying the
  stream_mutations unncessarily.

To fix, check if the peer node is still a known node in the gossip.
2016-03-09 07:35:20 +08:00
Asias He
1f3928c321 streaming: Hook streaming with gossip callback
If the peer node of a stream_session is restarted or removed we should
abort the streaming. It is better to hook gossip callback in the stream
manager than in each streamm_session.
2016-03-09 07:35:20 +08:00
Glauber Costa
2cd756ae5e repair: replace a magic number with another magic number
In due time we will have to fix this, but as an interim step, let's use
a "better" magic number.

The problem with 100, is that as soon as the partitions start to go bigger,
we're using too much memory. Since this is multiplied by the number of token
ranges, and happens in every shard, the final number can become really big,
and the amount of resources we use go up proportionally.

This means that even we are mistaken about the new number (we probably are),
in this case it is better to err on the side of a more conservative resource
usage.

Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <97158f3db5734916cee4ccf12eaa66e7402570bb.1457448855.git.glauber@scylladb.com>
2016-03-08 17:29:00 +02:00
Nadav Har'El
b7e29691c2 sstables: avoid index and data file over-reads
When we do a streaming read that knows the expected *end* position of the
read, we can use a large read-ahead buffer, and at the same time, stop
reading at exactly the intended end (or small rounding of it to the DMA
block size) and not waste resources blindly reading a large amount of data
after the end just to fill the read-ahead buffer.

The sstable reading code, both for reading the data file and the index file,
created a file input stream without specifiying its end, thereby losing
this optimization - so when a large buffer was used, we would get a large
over-read. This patch fixes this, so sstable data file and index file are
read using a file input stream which is a ware of its end.

Fixes #964.

Note that this patch does not change the behavior when reading a
*compressed* data file. For compressed read, we did not have the problem
of over-read in the first place, because chunks are read one by one.
But we do have other sources of inefficiencies there (stemming, again,
from the fact that the compressed chunks are read one by one), and I
opened a separate issue #992 for that.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1457219304-12680-1-git-send-email-nyh@scylladb.com>
2016-03-08 17:26:10 +02:00
Calle Wilund
8575f1391f lists.cc: fix update insert of frozen list
Fixes #967

Frozen lists are just atomic cells. However, old code inserted the
frozen data directly as an atomic_cell_or_collection, which in turn
meant it lacked the header data of a cell. When in turn it was
handled by internal serialization (freeze), since the schema said
is was not a (non-frozen) collection, we tried to look at frozen
list data as cell header -> most likely considered dead.
Message-Id: <1457432538-28836-1-git-send-email-calle@scylladb.com>
2016-03-08 13:48:45 +01:00
Pekka Enberg
81af486b69 Update scylla-ami submodule
* dist/ami/files/scylla-ami d4a0e18...84bcd0d (1):
  > Add --ami parameter
2016-03-08 13:49:31 +02:00
Takuya ASADA
254b0fa676 dist: show message to use XFS for scylla data directory and also notify about developer mode, when iotune fails
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1457426286-15925-1-git-send-email-syuu@scylladb.com>
2016-03-08 12:20:33 +02:00
Pekka Enberg
83d82ea901 Merge "Fix Ubuntu package issues on AMI" from Takuya
"This fixes bugs on Ubuntu package and AMI scripts, closes #991."
2016-03-08 11:51:30 +02:00
Takuya ASADA
18a27de3c8 dist: export all entries on /etc/default/scylla-server on Ubuntu
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
2016-03-08 18:18:30 +09:00
Gleb Natapov
ce6d1a242a storage_proxy: fix background_reads counter
background_reads collectd counter was not always properly decremented.
Fix it and streamline background read repair error handling.

Message-Id: <20160307182255.GI4849@scylladb.com>
2016-03-07 19:41:09 +01:00
Yoav Kleinberger
1cd01cd2ab tools/scyllatop: defend against curses "out of screen bounds" error
Fixes issue #945 (hopefully)
This issue was probably the result of trying to write outside the
confines of the window. The views.Base class now defends against this.

Signed-off-by: Yoav Kleinberger <yoav@scylladb.com>
Message-Id: <9735806b211567f3239e187d87437c484f532291.1457265435.git.yoav@scylladb.com>
2016-03-07 18:02:26 +01:00
Raphael S. Carvalho
0f4239d63a service: improve logging of storage_service::load_new_sstables
Closes #952.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <2402f387c32d2d1221e740edb67e56c1593c1936.1457366098.git.raphaelsc@scylladb.com>
2016-03-07 18:01:52 +01:00
Raphael S. Carvalho
e850c1406e sstables: update comment
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <8abc1c6c66ed8d3bb35ecfb6d8251de3f61a97ae.1457093016.git.raphaelsc@scylladb.com>
2016-03-07 17:36:34 +01:00
Raphael S. Carvalho
822759eee0 compaction_manager: update stat pending_tasks properly
Size of both _cfs_to_cleanup and _cfs_to_compact must be added when
calculating a new value to _stats.pending_tasks.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <b601e24d0631922798575f39d00fb54fe00d4971.1457093016.git.raphaelsc@scylladb.com>
2016-03-07 17:36:03 +01:00
Gleb Natapov
2d092bbd32 storage_proxy: send read requests with timeout
No need to wait for replies long after request is timed out.
Message-Id: <1457351304-28721-2-git-send-email-gleb@scylladb.com>
2016-03-07 14:00:11 +01:00
Gleb Natapov
4122422d19 storage_proxy: always wait for digest read resolver done future
Currently it is waited upon only if background read repair check is
needed and this cause unhandled exception warning to be printed if
it enters failed state. Fix this by always waiting on it, but doing
anything beyond ignoring an exception only if check is needed.
Message-Id: <1457351304-28721-1-git-send-email-gleb@scylladb.com>
2016-03-07 14:00:09 +01:00
Gleb Natapov
626c9d046b fix EACH_QUORUM handling during bootstrapping
Currently write acknowledgements handling does not take bootstrapping
node into account for CL=EACH_QUORUM. The patch fixes it.

Fixes #994

Message-Id: <20160307121620.GR2253@scylladb.com>
2016-03-07 13:56:34 +01:00
Raphael S. Carvalho
d65642cee8 fix storage_service::load_new_sstables() to not disable write permanently
Avi says:
"If an exception happens, then enable_sstable_writes won't be called."

The problem is fixed by catching a possible exception and enabling sstable
write for the relevant column family if it wasn't enabled already.

Closes #953.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <32c1bcb2c60c7b9e5514eb0a95062f40ca92093a.1457119308.git.raphaelsc@scylladb.com>
2016-03-07 13:56:02 +01:00
Gleb Natapov
f59415b3c6 Take pending endpoints into account while checking for sufficient live nodes
During bootstrapping additional copies of data has to be made to ensure
that CL level is met (see CASSANDRA-833 for details). Our code does
that, but it does not take into account that bootstraping node can be
dead which may cause request to proceed even though there is no
enough live nodes for it to be completed. In such a case request neither
completes nor timeouts, so it appear to be stuck from CQL layer POV. The
patch fixes this by taking into account pending nodes while checking
that there are enough sufficient live nodes for operation to proceed.

Fixes #965

Message-Id: <20160303165250.GG2253@scylladb.com>
2016-03-07 13:30:13 +01:00
Gleb Natapov
8dad399256 log: add space between log level and date in the outpu
It was dropped by 6dc51027a3

Message-Id: <20160306125313.GI2253@scylladb.com>
2016-03-07 13:06:06 +01:00
Tomasz Grabiec
9deb036e4e Merge branch 'dev/issue-845-set-incremental-backup-config-v1' from seastar-dev.git
From Vlad:

This series modifies the 'database' class to use the internal
_enable_incremental_backups value (initialized with
'incremental_backups' configuration value) instead of using the
'incremental_backups' configuration value directly.

Then we update this internal value in runtime from 'nodetool
enable/disablebackup' API callback so that newly created keyspaces and
column families use the newly configured incremental backup
configuration.
2016-03-07 10:47:20 +01:00