This fixes gossip test shutdown similar to what commit 13ce48e ("tests:
Fix stop of storage_service in cql_test_env") did for CQL tests:
gossip_test: /home/penberg/scylla/seastar/core/sharded.hh:439: Service& seastar::sharded<Service>::local() [with Service = net::messaging_service]: Assertion `local_is_initialized()' failed.
Running 1 test case...
[snip]
unknown location(0): fatal error in "test_boot_shutdown": signal: SIGABRT (application abort requested)
seastar/tests/test-utils.cc(32): last checkpoint
Message-Id: <1458126520-20025-1-git-send-email-penberg@scylladb.com>
The cf can be deleted after the cf deletion check. Handle this case as
well.
Use "warn" level to log if cf is missing. Although we can handle the
case, it is good to distinguish whether the receiver of streaming
applied all the stream mutations or not. We believe that the cf is
missing because it was dropped, but it could be missing because of a bug
or something we didn't anticipate here.
Related patch: "streaming: Handle cf is deleted when sending
STREAM_MUTATION_DONE"
Fixes simple_add_new_node_while_schema_changes_test failure.
Message-Id: <c4497e0500f50e0a3422efb37e73130765c88c57.1458090598.git.asias@scylladb.com>
It is a legacy API from c*. Since we can wait for the
update_pending_ranges to complete, we can wait for it directly instead
of calling block_until_update_pending_ranges_finished to do so.
Also, change do_update_pending_ranges to be private.
Message-Id: <ac79b2879ec08fdcd3b2278ff68962cc71492f12.1458040608.git.asias@scylladb.com>
The verb is just for reporting and debugging purposes, but it is better
not to register it until it can return a meaningful value. Besides, it
really belongs to the migration manager subsystem anyway.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1458037053-14836-1-git-send-email-pdziepak@scylladb.com>
Streaming is used by bootstrap and repair. Streaming uses storage_proxy
class to apply the frozen_mutation and db/column_family class to
invalidate row cache. Defer the initialization until just before repair
and bootstrap init.
Message-Id: <8e99cf443239dd8e17e6b6284dab171f7a12365c.1458034320.git.asias@scylladb.com>
Register the REPAIR_CHECKSUM_RANGE messaging service verb handler after
we have replayed the commitlog to avoid responding with bogus checksums.
Message-Id: <1458027934-8546-1-git-send-email-penberg@scylladb.com>
"At the moment, the migration_listener callbacks return void, so it is impossible
to wait for the callbacks to complete. Make the callbacks run inside a seastar
thread, so if we need to wait for the callback, we can make it call
foo_operation().get() in the callback. It is easier than making the callbacks
return future<>.
Fixes #1000."
If a keyspace is created after we calculate the pending ranges during
bootstrap, we will ignore the keyspace in pending ranges when handling
write requests for that keyspace, which will cause data loss if rf = 1.
Fixes #1000
At the moment, the callbacks return void, so it is impossible to wait for
the callbacks to complete. Make the callbacks run inside a seastar
thread, so if we need to wait for the callback, we can make it call
foo_operation().get() in the callback. It is easier than making the
callbacks return future<>.
Some network equipment that does TCP session tracking tends to drop TCP
sessions after a period of inactivity. Use the keepalive mechanism to
prevent this from happening for our inter-node communication.
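For illustration only, here is a minimal Python sketch of enabling TCP
keepalive on a socket. It is not the seastar rpc client code; the helper
name enable_keepalive and the timing values are assumptions made for this
example, and the finer-grained options are only set where the platform
exposes them.

```python
import socket

def enable_keepalive(sock, idle_s=60, interval_s=10, probes=5):
    """Turn on TCP keepalive; the timing values are illustrative only."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The per-connection timing knobs are platform-specific, so set them
    # only when the socket module exposes them.
    for name, value in (("TCP_KEEPIDLE", idle_s),
                        ("TCP_KEEPINTVL", interval_s),
                        ("TCP_KEEPCNT", probes)):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
    return sock
```

With keepalive on, an idle connection periodically sends probes, so
session-tracking equipment keeps seeing traffic on it.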
Message-Id: <20160314173344.GI31837@scylladb.com>
* seastar 88cc232...0739576 (4):
> rpc: allow configuring keepalive for rpc client
> net: add keepalive configuration to socket interface
> iotune: refuse to run if there is not enough space available
> rpc: make client connection error more clear
Defer registering migration manager RPC verbs until after the commitlog
has been replayed so that our own schema is fully loaded before other
nodes start querying it or sending schema updates.
Message-Id: <1457971028-7325-1-git-send-email-penberg@scylladb.com>
The same shard may create an sstables::sstable object for the same SStable
that doesn't belong to it more than once and mark it
for deletion (e.g. in a 'nodetool refresh' flow).
In that case the destructor of sstables::sstable counted
the deletion requests from the same shard more than once, since it used a simple
counter incremented on each deletion request, while it should
count requests from the same shard as a single request. This matters because
the removal logic waits for all shards to agree on the removal of a specific
SStable by comparing the counter mentioned above to the total
number of shards, and once they are equal the SStable files are actually removed.
This patch fixes this by replacing the counter with an std::unordered_set<unsigned>
that stores the shard ids of the shards requesting the deletion
of the sstable object, and compares the size() of this set
to smp::count in order to decide whether to actually delete the corresponding
SStable files.
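The difference between the counter and the set can be sketched in a few
lines of Python. This is a toy model, not the sstables::sstable code: the
class DeletionTracker and its method names are hypothetical, shard_count
stands in for smp::count and the Python set stands in for
std::unordered_set<unsigned>.

```python
class DeletionTracker:
    """Tracks which shards asked to delete one shared sstable."""

    def __init__(self, shard_count):
        self.shard_count = shard_count   # stands in for smp::count
        self.requesting_shards = set()   # stands in for std::unordered_set<unsigned>

    def request_deletion(self, shard_id):
        """Record a request; return True once every shard has asked."""
        # Adding the same shard id twice leaves the set unchanged, so a
        # repeated request from one shard is counted only once.
        self.requesting_shards.add(shard_id)
        return len(self.requesting_shards) == self.shard_count
```

With two shards, two requests from shard 0 do not trigger removal; only a
request from shard 1 completes the agreement. A plain counter would have
reached 2 after the second request from shard 0.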
Fixes #1004
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1457886812-32345-1-git-send-email-vladz@cloudius-systems.com>
When we are about to write a new sstable, we check if the sstable exists
by checking if the respective TOC exists. That check was added to handle a
possible attempt to write a new sstable with a generation already in use.
Gleb was worried that a TOC could appear after the check, and that's indeed
possible if there is an ongoing sstable write that uses the same generation
(running in parallel).
If the TOC appears after the check, we would again clobber an existing sstable with
a temporary TOC, and the user wouldn't be able to boot scylla anymore without manual
intervention.
Then Nadav proposed the following solution:
"We could do this by the following variant of Raphael's idea:
1. create .txt.tmp unconditionally, as before the commit 031bf57c1
(if we can't create it, fail).
2. Now confirm that .txt does not exist. If it does, delete the .txt.tmp
we just created and fail.
3. continue as usual
4. and at the end, as before, rename .txt.tmp to .txt.
The key to solving the race is step 1: Since we created .txt.tmp in step 1
and know this creation succeeded, we know that we cannot be running in
parallel with another writer - because such a writer too would have tried to
create the same file, and kept it existing until the very last step of its
work (step 4)."
This patch implements the solution described above.
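The four steps can be sketched in Python as follows. This is only an
illustration under assumptions, not the code in this patch: the function
name write_sstable_toc and the write_components callback are hypothetical,
and the sketch assumes the temporary file is created exclusively
(O_CREAT | O_EXCL), so that a second writer's creation attempt fails.

```python
import os

def write_sstable_toc(directory, write_components):
    """Create TOC.txt.tmp first, then check TOC.txt, then rename at the end."""
    toc = os.path.join(directory, "TOC.txt")
    tmp = toc + ".tmp"
    # Step 1: exclusive create; raises FileExistsError if another writer
    # already holds the temporary TOC for this generation.
    fd = os.open(tmp, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    try:
        # Step 2: a sealed sstable already owns this generation, so fail.
        if os.path.exists(toc):
            raise FileExistsError(toc)
        # Step 3: continue as usual (here: just produce the TOC contents).
        os.write(fd, write_components().encode())
    except BaseException:
        # Remove only the temporary file we created ourselves.
        os.close(fd)
        os.remove(tmp)
        raise
    os.close(fd)
    # Step 4: publish the TOC by renaming the temporary file.
    os.rename(tmp, toc)
    return toc
```

A second call for the same directory fails in step 2 and leaves the
existing TOC.txt untouched.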
Let me also say that the race is theoretical and scylla wasn't affected by
it so far.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <ef630f5ac1bd0d11632c343d9f77a5f6810d18c1.1457818331.git.raphaelsc@scylladb.com>
Since calculate_pending_ranges will modify token_metadata, we need to
replicate it to other shards. With this patch, when we call
calculate_pending_ranges, token_metadata will be replicated to the other
non-zero shards.
In addition, it is not useful as a standalone class. We can merge it
into the storage_service. Kill one singleton class.
Fixes #1033
Refs #962
Message-Id: <fb5b26311cafa4d315eb9e72d823c5ade2ab4bda.1457943074.git.asias@scylladb.com>
"This series adds more information (i.e. keys and tombstones) to the
query result digest in order to ensure correctness and increase the
chances of early detection of disagreement between replicas.
The digest is no longer computed by hashing query::result but built
using the query result builder. That is necessary since the query
result itself doesn't contain all information required to compute
the digest. Another consequence of this is that now replicas asked
for a result need to send both the result and the digest to
the coordinator as it won't be able to compute the digest itself.
Unfortunately, these patches change our on-wire communication:
1) hash computation is different
2) format of query::result is changed (and it is made non-final)
Fixes #182."
Query result digest is used to verify that all replicas have the same
data. Therefore, it needs to contain more information than the query
result itself in order to ensure proper detection of disagreements.
Generally, adding clustering keys to the digest regardless of whether
the client asked for them will guarantee correctness. However, adding
tombstones as well improves the chances of early detection of nodes
containing stale data.
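As a rough illustration of why keys and tombstones belong in the digest,
consider this Python sketch. It is not the query result builder: the
function result_digest, the row tuple layout and the choice of SHA-256 are
all assumptions made for the example.

```python
import hashlib

def result_digest(rows):
    """Hash clustering keys, values and tombstones of each row.

    `rows` is a list of (clustering_key, value, tombstone_timestamp)
    tuples; value or tombstone_timestamp may be None.
    """
    h = hashlib.sha256()
    for ck, value, tombstone_ts in rows:
        # Feed the key and the tombstone into the hash even though the
        # client may only have asked for the value.
        h.update(repr((ck, value, tombstone_ts)).encode())
    return h.hexdigest()
```

Two replicas that return the same value for different clustering keys, or
that differ only in a tombstone, now produce different digests, which a
hash over the selected values alone would not show.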
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
The result digest is going to be computed in the query result builder and
requires information not available in the query result. That's why the
digest now needs to be sent to the other nodes together with the result,
as they won't be able to compute it on their own.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Currently, if there is a disagreement between replicas we get mutations
from all of them, merge these mutations and send the result to the
client. The difference between the result and the mutation sent by a
particular replica is sent back to that replica to repair it.
Unfortunately, that may not suffice to provide the user with correct results
in case of disagreements.
Consider the following scenario:
    create table cf(p int, c int, r int, primary key(p, c));
    node1:
        p=0, c=1, r=1 (timestamp = 1)
        p=0, c=2, r=2 (timestamp = 2)
    node2:
        p=0, c=1, r=tombstone (timestamp = 2)
        p=0, c=2, r=1 (timestamp = 1)
    query:
        select r from cf limit 1;
Let's assume there are no row markers. node1 will send only the outdated
cell (p=0, c=1, r=1) while node2 will send both the tombstone for c=1 and
the outdated cell (p=0, c=2, r=1). A disagreement will be detected, the
replies will be merged and the coordinator will respond to the client
with result r=1, while the correct answer is r=2.
The solution proposed in this patch is to attempt to detect cases when
the problem may occur and retry queries with a larger limit, which results
in replicas providing more information.
The detection logic is simple: the partition key and clustering key of
the last row in the reconciled result are compared with the partition
keys and clustering keys of the last rows of replies from replicas
(except short reads). If the (pk, ck) of a replica's last row is smaller
than the (pk, ck) of the last row of the reconciled result, the query is
retried.
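The detection rule can be tried out on the scenario above with a small
Python model. It is a sketch under assumptions, not the storage_proxy
code: rows are (pk, ck, value, timestamp) tuples sorted by (pk, ck), a
tombstone is modelled as value None, integer order stands in for the real
key order, and the names reconcile and needs_retry are hypothetical.

```python
def reconcile(replies, limit):
    """Merge replica replies by (pk, ck); the newest timestamp wins."""
    newest = {}
    for rows in replies:
        for pk, ck, value, ts in rows:
            key = (pk, ck)
            if key not in newest or ts > newest[key][1]:
                newest[key] = (value, ts)
    # Tombstones (value None) are not returned to the client.
    live = [(k, v) for k, (v, _) in sorted(newest.items()) if v is not None]
    return live[:limit]

def needs_retry(replies, short_reads, limit):
    """True if some replica's last row sorts before the last reconciled row."""
    result = reconcile(replies, limit)
    if not result:
        return False
    last_key = result[-1][0]
    for rows, short in zip(replies, short_reads):
        if short or not rows:
            continue  # short reads are excluded from the comparison
        pk, ck, _, _ = rows[-1]
        if (pk, ck) < last_key:
            return True
    return False
```

For the scenario, node1 replies with [(0, 1, 1, 1)] and node2 with
[(0, 1, None, 2), (0, 2, 1, 1)]. The reconciled row is (0, 2) with r=1,
node1's last row (0, 1) sorts before it, and so the query is retried.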
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Fix bootstrap_test.py:TestBootstrap.failed_bootstap_wiped_node_can_join_test
Logs on node 1:
INFO 2016-03-11 15:53:43,287 [shard 0] gossip - FatClient 127.0.0.2 has been silent for 30000ms, removing from gossip
INFO 2016-03-11 15:53:43,287 [shard 0] stream_session - stream_manager: Close all stream_session with peer = 127.0.0.2 in on_remove
WARN 2016-03-11 15:53:43,498 [shard 0] stream_session - [Stream #4e411ba0-e75e-11e5-81f8-000000000000] stream_transfer_task: Fail to send STREAM_MUTATION_DONE to 127.0.0.2:0: std::runtime_error ([Stream #4e411ba0-e75e-11e5-81f8-000000000000] GOT STREAM_ MUTATION_DONE 127.0.0.1: Can not find stream_manager)
terminate called without an active exception
Backtrace on node 1:
#0 0x00007fb74723da98 in raise () from /lib64/libc.so.6
#1 0x00007fb74723f69a in abort () from /lib64/libc.so.6
#2 0x00007fb74ab84aed in __gnu_cxx::__verbose_terminate_handler() () from /lib64/libstdc++.so.6
#3 0x00007fb74ab82936 in ?? () from /lib64/libstdc++.so.6
#4 0x00007fb74ab82981 in std::terminate() () from /lib64/libstdc++.so.6
#5 0x00007fb74ab82be9 in __cxa_rethrow () from /lib64/libstdc++.so.6
#6 0x0000000000f3521e in streaming::stream_transfer_task::<lambda()>::<lambda(auto:44)>::operator()<std::__exception_ptr::exception_ptr> (ep=..., __closure=0x7ffce74d8630) at streaming/stream_transfer_task.cc:169
#7 do_void_futurize_apply<const streaming::stream_transfer_task::start()::<lambda()>::<lambda(auto:44)>&, std::__exception_ptr::exception_ptr> (func=...) at /home/asias/src/cloudius-systems/scylla/seastar/core/future.hh:1142
#8 futurize<void>::apply<const streaming::stream_transfer_task::start()::<lambda()>::<lambda(auto:44)>&, std::__exception_ptr::exception_ptr> (func=...) at /home/asias/src/cloudius-systems/scylla/seastar/core/future.hh:1190
#9 future<>::<lambda(auto:7&&)>::operator()<future<> > ( fut=fut@entry=<unknown type in /home/asias/src/cloudius-systems/scylla/build/release/scylla, CU 0xec84d00, DIE 0xee2561d>, __closure=__closure@entry=0x7ffce74d8630) at /home/asias/src/cloudius-systems/scylla/seastar/core/future.hh:1014
Message-Id: <1457684884-4776-2-git-send-email-asias@scylladb.com>
Currently, if sstable::write_components() is called to write a new sstable
using the same generation as an sstable that exists, a temporary TOC will
be unconditionally created. Afterwards, the same sstable::write_components()
will fail when it reaches sstable::create_data(), because the data
component already exists for that generation (in this scenario).
After that, the user will not be able to boot scylla anymore because there is
a generation with both a TOC and a temporary TOC. We cannot simply remove a
generation with a TOC and a temporary TOC because user data will be lost (again,
in this scenario). After all, the temporary TOC was only created because
sstable::write_components() was wrongly called with the generation of an
sstable that exists.
The solution proposed by this patch is to throw an exception if a TOC file
exists for the generation used.
Some SSTable unit tests were also changed to guarantee that we don't try
to overwrite components of an existing sstable.
Refs #1014.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <caffc4e19cdcf25e4c6b9dd277d115422f8246c4.1457643565.git.raphaelsc@scylladb.com>
Since commit 2f56577 ("sstables: more efficient read of compressed data
file"), the compressed_file_input_stream uses a file_input_stream to
efficiently read the compressed data in chunks of some desired size (128 KB
is our default) instead of in smaller compressed chunks.
However, I had a bug where I mis-calculated the desired length of the
read (giving the *end byte* instead of the length!) and as a result
file_input_stream did not know where the read was supposed to stop, and
always read 128 KB buffers. The results were not incorrect, because the
sstable reader stops when it needs to, even if given too much data. But
it was inefficient because too much data was read in the last buffer.
With this patch, the length is correctly given to the input stream, and
it can read a much smaller buffer at the end of the read, not the full
128 KB. I tested that this actually happens.
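To make the length versus end-byte distinction concrete, here is a small
Python sketch. It is not the compressed_file_input_stream code; the
function read_range and its parameters are hypothetical and only model a
reader that is told how many bytes remain.

```python
def read_range(f, start, end, buffer_size=128 * 1024):
    """Read bytes [start, end) in buffers of at most buffer_size bytes."""
    f.seek(start)
    remaining = end - start  # the length of the read, not the end offset
    chunks = []
    while remaining > 0:
        # Knowing the remaining length lets the last buffer be smaller.
        chunk = f.read(min(buffer_size, remaining))
        if not chunk:
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return chunks
```

Reading bytes 100 to 350 with a 100-byte buffer yields buffers of 100,
100 and 50 bytes. Passing the end offset (350) as if it were the length
would have asked for 350 bytes instead of 250.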
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1457633616-15193-1-git-send-email-nyh@scylladb.com>
"As described in issue #1014, we have found ourselves in a situation where
SSTables can be written too early, and that causes problems for the existing
SSTables. While this shouldn't happen - and Pekka's recent patch to move
populate() a lot earlier in initialization should fix that, when that did
happen what we had was not enough to prevent it from overwriting existing
tables.
We should do a lot better job protecting against that.
Also, some of the exceptions that are generated are totally inconclusive. This
series also aims at making some of the exceptions more descriptive."
This patch makes sure that every time we need to create a new generation number -
the very first step in the creation of a new SSTable, the respective CF is already
initialized and populated. Failure to do so can lead to data being overwritten.
Extensive details about why this is important can be found
in Scylla's GitHub Issue #1014.
Nothing should be writing to SSTables before we have the chance to populate the
existing SSTables and calculate what the next generation number should be.
However, if that happens, we want to protect against it in a way that does not
involve overwriting existing tables. This is one of the ways to do it: every
column family starts in an unwriteable state, and when it can finally be written
to, we mark it as writeable.
Note that this *cannot* be a part of add_column_family. That adds a column family
to a db in memory only, and if anybody is about to write to a CF, that was most
likely already called. We need to call this explicitly when we are sure we're ready
to issue disk operations safely.
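The idea can be modelled with a toy Python class. This is only a sketch
and not the column_family code: the class and method names below are
hypothetical, and the generation bookkeeping is simplified to a single
integer.

```python
class ColumnFamily:
    """Toy model: generation numbers are refused until the CF is ready."""

    def __init__(self):
        self._writeable = False          # every CF starts unwriteable
        self._highest_generation = 0

    def populate(self, existing_generations):
        """Record the generations already present on disk."""
        self._highest_generation = max(existing_generations, default=0)

    def mark_ready_for_writes(self):
        """Called explicitly once disk operations are known to be safe."""
        self._writeable = True

    def next_generation(self):
        if not self._writeable:
            raise RuntimeError("column family is not ready for writes")
        self._highest_generation += 1
        return self._highest_generation
```

Asking for a generation before mark_ready_for_writes() raises instead of
handing out a number that may collide with an sstable already on disk.
After populate([3, 7]) and mark_ready_for_writes(), the next generation
is 8.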
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The standard C++ exception messages that will be thrown if anything goes
wrong while writing the file are suboptimal: they barely tell us the name of the failing
file.
Use a specialized create function so that we can capture that better.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Deletion of previous stale, temporary SSTables is done by Shard0. Therefore,
let's run Shard0 first. Technically, we could just have all shards agree on the
deletion and just delete it later, but that is prone to races.
Those races are not supposed to happen during normal operation, but if we have
bugs, they can. Scylla's GitHub Issue #1014 is an example of a situation where
that can happen, making existing problems worse. So running a single shard
first and making sure that all temporary tables are deleted provides
extra protection against such situations.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We are no longer using the in_flight_seals gate, but forgot to remove it.
To guarantee that all seal operations will have finished when we're done,
we are using the memtable_flush_queue, which also guarantees order. But
that gate was never removed.
The FIXME code should also be removed, since such an interface does exist now.
Signed-off-by: Glauber Costa <glauber@scylladb.com>