When we are about to write a new sstable, we check if the sstable exists
by checking if respective TOC exists. That check was added to handle a
possible attempt to write a new sstable with a generation being used.
Gleb was worried that a TOC could appear after the check, and that's indeed
possible if there is an ongoing sstable write that uses the same generation
(running in parallel).
If TOC appear after the check, we would again crap an existing sstable with
a temporary, and user wouldn't be to boot scylla anymore without manual
intervention.
Then Nadav proposed the following solution:
"We could do this by the following variant of Raphael's idea:
1. create .txt.tmp unconditionally, as before the commit 031bf57c1
(if we can't create it, fail).
2. Now confirm that .txt does not exist. If it does, delete the .txt.tmp
we just created and fail.
3. continue as usual
4. and at the end, as before, rename .txt.tmp to .txt.
The key to solving the race is step 1: Since we created .txt.tmp in step 1
and know this creation succeeded, we know that we cannot be running in
parallel with another writer - because such a writer too would have tried to
create the same file, and kept it existing until the very last step of its
work (step 4)."
This patch implements the solution described above.
Let me also say that the race is theoretical and scylla wasn't affected by
it so far.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <ef630f5ac1bd0d11632c343d9f77a5f6810d18c1.1457818331.git.raphaelsc@scylladb.com>
(cherry picked from commit 0af786f3ea)
Currently, if sstable::write_components() is called to write a new sstable
using the same generation of a sstable that exists, a temporary TOC will
be unconditionally created. Afterwards, the same sstable::write_components()
will fail when it reaches sstable::create_data(). The reason is obvious
because data component exists for that generation (in this scenario).
After that, user will not be able to boot scylla anymore because there is
a generation with both a TOC and a temporary TOC. We cannot simply remove a
generation with TOC and temporary TOC because user data will be lost (again,
in this scenario). After all, the temporary TOC was only created because
sstable::write_components() was wrongly called with the generation of a
sstable that exists.
Solution proposed by this patch is to trigger exception if a TOC file
exists for the generation used.
Some SSTable unit tests were also changed to guarantee that we don't try
to overwrite components of an existing sstable.
Refs #1014.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <caffc4e19cdcf25e4c6b9dd277d115422f8246c4.1457643565.git.raphaelsc@scylladb.com>
(cherry picked from commit 031bf57c19)
The standard C++ exception messages that will be thrown if there is anything
wrong writing the file, are suboptimal: they barely tell us the name of the failing
file.
Use a specialized create function so that we can capture that better.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
(cherry picked from commit f2a8bcabc2)
Attempt to print std::nested_exception currently results in exception
to leak outside the printer. Fix by capturing all exception in the
final catch block.
For nested exception, the logger will print now just
"std::nested_exception". For nested exceptions specifically we should
log more, but that is a separate problem to solve.
Message-Id: <1457532215-7498-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 838a038cbd)
Fixes#967
Frozen lists are just atomic cells. However, old code inserted the
frozen data directly as an atomic_cell_or_collection, which in turn
meant it lacked the header data of a cell. When in turn it was
handled by internal serialization (freeze), since the schema said
is was not a (non-frozen) collection, we tried to look at frozen
list data as cell header -> most likely considered dead.
Message-Id: <1457432538-28836-1-git-send-email-calle@scylladb.com>
(cherry picked from commit 8575f1391f)
Currently write acknowledgements handling does not take bootstrapping
node into account for CL=EACH_QUORUM. The patch fixes it.
Fixes#994
Message-Id: <20160307121620.GR2253@scylladb.com>
(cherry picked from commit 626c9d046b)
In region destructor, after active segments is freed pointer to it is
left unchanged. This confuses the remaining parts of the destructor
logic (namely, removal from region group) which may rely on the
information in region_impl::_active.
In this particular case the problem was that code removing from the
region group called region_impl::occupancy() which was
dereferencing _active if not null.
Fixes#993.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1457341670-18266-1-git-send-email-pdziepak@scylladb.com>
(cherry picked from commit 99b61d3944)
If timeout happens after cl promise is fulfilled, but before
continuation runs it removes all the data that cl continuation needs
to calculate result. Fix this by calculating result immediately and
returning it in cl promise instead of delaying this work until
continuation runs. This has a nice side effect of simplifying digest
mismatch handling and making it exception free.
Fixes#977.
Message-Id: <1457015870-2106-3-git-send-email-gleb@scylladb.com>
Read executor may ask for more than one data reply during digest
resolving stage, but only one result is actually needed to satisfy
a query, so no need to store all of them.
Message-Id: <1457015870-2106-2-git-send-email-gleb@scylladb.com>
In digest resolver for cl to be achieved it is not enough to get correct
number of replies, but also to have data reply among them. The condition
in digest timeout does not check that, fortunately we have a variable
that we set to true when cl is achieved, so use it instead.
Message-Id: <1457015870-2106-1-git-send-email-gleb@scylladb.com>
Ubuntu 14.04LTS package is broken now because iotune does not statically linked against libstdc++, so this patch fixed it.
Requires seastar patch to add --static-stdc++ on configure.py.
Fixes#982
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1456995050-22007-1-git-send-email-syuu@scylladb.com>
1) As explained in commit 697b16414a (gossip: Make gossip message
handling async), in each gossip round we can make talking to the 1-3
peer nodes in parallel to reduce latency of gossip round.
2) Gossip syn message uses one way rpc message, but now the returned
future of the one way message is ready only when message is dequeued for
some reason (sent or dropped). If we wait for the one way syn messge to
return it might block the gossip round for a unbounded time. To fix, do
not wait for it in the gossip round. The downside is there will be no
back pressure to bound the syn messages, however since the messages are
once per second, I think it is fine.
Message-Id: <ea4655f121213702b3f58185378bb8899e422dd1.1456991561.git.asias@scylladb.com>
Currently schema changes are only logged at coordinator node which
initiates the change. It would be helpful in post morten analysis to
also see when and how schema changes are resolved when applied on
other nodes.
Message-Id: <1456953095-1982-1-git-send-email-tgrabiec@scylladb.com>
Use the existing "feed_hash" mechanism to find a checksum of the
content of a mutation, instead of serializing the mutation (with freeze())
and then finding the checksum of that string.
The serialized form is more prone to future changes, and not really
guaranteed to provide equal hashes for mutations which are considered
"equal".
Fixes#971
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1456958676-27121-1-git-send-email-nyh@scylladb.com>
While is is formally better to take a local lock first and
then first contend for a global, in this case it is arguably
better to ensure we get a gate exception synchronously (early)
instead of potentially in a continuation. Old version might
cause us to do a gate::leave even while never entered.
And since we should really only have one active (contending)
segment per shard anyway, it should not matter.
Message-Id: <1456931988-5876-1-git-send-email-calle@scylladb.com>
Fixes#865
(Some) gcc 5 (5.3.0 for me) on ubuntu will generate errors on
compilation of this code (compiling logalloc_test). The memcpy
to inline storage seems to confuse the compiler.
Simply change to std::copy, which shuts the compiler up.
Any decent stl should convert primitive std::copy to memcpy
anyway, but since it is also the inline (small storage),
it should not matter which way.
Message-Id: <1456931988-5876-4-git-send-email-calle@scylladb.com>
"This series implements describe_schema_versions so that we nodetool
describecluster can return proper schema information for the whole
cluster. It involves adding new verb SCHEMA_CHECK which is used to get
schema version for a given node and a simple map-reduce that using that
verb gets info from the whole cluster.
This fixes#677, fixes#684, and fixes #472."
Useful for determining order of events in logs of different nodes, or
for estimating how much time passed between two events.
Fixes#941.
Example log:
INFO 2016-03-01 18:30:37,688 [shard 0] gossip - Waiting for gossip to settle before accepting client requests...
INFO 2016-03-01 18:30:45,689 [shard 0] gossip - No gossip backlog; proceeding
INFO 2016-03-01 18:30:45,689 [shard 0] storage_service - Starting listening for CQL clients on localhost:9042...
Message-Id: <1456853532-28800-1-git-send-email-tgrabiec@scylladb.com>
Start with coarse control:
1) converting the run_with_write_api_lock operations:
join_ring, start_gossiping, stop_gossiping, start_rpc_server,
stop_rpc_server, start_native_transport, stop_native_transport,
decommission, remove_node, drain, move, rebuild
to use run_with_api_lock which uses a flag to indicate current operation
in progress.
If one of the above operation is in progress when admin issues another
opeartion we return a "try again" exception to avoid running two
operations in parallel.
2) converting the run_with_read_api_lock to use no lock.
Fixes#850.
Message-Id: <00782b601028ed87437e5decae382f72dff634f6.1456758391.git.asias@scylladb.com>