[in v2: 1. Fixed a few small bugs.
2. Added rudimentary support for parallel/sequential repair.
3. Verified that the code works correctly with Asias's fix to streaming]
This patch adds the capability to track repair operations which we have
started, and check whether they are still running or completed (successfully
or unsuccessfully).
As before, one starts a repair with the REST API:
curl -X GET --header "Content-Type: application/json" --header "Accept: application/json" "http://127.0.0.1:10000/storage_service/repair_async/try1"
where "try1" is the name of the keyspace. This returns a repair id -
a small integer starting with 0. This patch adds support for a similar
request to *query* the status of a previously started repair, by adding
the "id=..." option to the query, which enquires about the status of the
repair with this id. For example,
curl -i -X GET --header "Content-Type: application/json" --header "Accept: application/json" "http://127.0.0.1:10000/storage_service/repair_async/try1?id=0"
gets the current status of repair 0. This status can be RUNNING,
SUCCESSFUL or FAILED, or a HTTP 400 "unknown repair id ..." in case an
invalid id is passed (not the id of any real repair that was previously
started).
This patch also adds two alternative code paths in the main repair flow,
do_repair_start(): one where each range is repaired only after the previous
one finishes, and one where all the ranges are repaired in parallel. At the
moment, the enabled code is the parallel version, just as before this patch.
But having both will also be useful for implementing the "parallel" vs
"sequential" repair options of Cassandra.
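For illustration only, here is a minimal sketch of what the two code paths
look like with Seastar's loop primitives; the token_range type and the
repair_range() helper are placeholders, not the actual names used in
do_repair_start():
#include <vector>
#include <seastar/core/future.hh>
#include <seastar/core/loop.hh>
// Placeholder for the token range type used by the repair code.
struct token_range {};
// Hypothetical helper: repairs a single range, resolving when it is done.
seastar::future<> repair_range(const token_range& range);
// Sequential variant: each range is repaired only after the previous one completes.
seastar::future<> repair_sequentially(std::vector<token_range>& ranges) {
    return seastar::do_for_each(ranges, [] (token_range& r) {
        return repair_range(r);
    });
}
// Parallel variant: all ranges are repaired concurrently (the path enabled today).
seastar::future<> repair_in_parallel(std::vector<token_range>& ranges) {
    return seastar::parallel_for_each(ranges, [] (token_range& r) {
        return repair_range(r);
    });
}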
Note that if you try to use repair, you are likely to run into a bug in
the streaming code which results in Scylla either crashing or a repair
hanging (never realising it finished). Asias already has a fix for this bug,
and will hopefully publish it soon, but it is unrelated to the repair code,
so I think this patch can be committed independently.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
The solution was proposed by Nadav. When writing a new sstable,
write all the usual files, write the TOC to a temporary file, and
then rename it, which is atomic.
Files not belonging to any TOC are invalid, so we ensure that
partially written sstables aren't reused.
Avi also proposed using fsync on the sstable directory to guarantee
that the files reached the disk before sealing the sstable.
Subsequently, we should add code to avoid loading sstables whose
TOC is either temporary or doesn't exist. Temporary TOC files
should also be deleted.
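As a rough sketch of the sealing step described above (the helper and file
names are illustrative, not the actual sstable code):
#include <seastar/core/seastar.hh>   // rename_file(), sync_directory()
#include <seastar/core/sstring.hh>
// Sketch: after all component files have been written, publish the sstable by
// atomically renaming the temporary TOC into place, syncing the directory
// before and after so both the component files and the rename are durable.
seastar::future<> seal_sstable(seastar::sstring dir, seastar::sstring tmp_toc, seastar::sstring toc) {
    return seastar::sync_directory(dir).then([=] {
        return seastar::rename_file(tmp_toc, toc);
    }).then([=] {
        return seastar::sync_directory(dir);
    });
}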
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
There are two counters for the "current number of open connections" in the cql_server:
- _connects: incremented every time a new connection is opened. Should be used
  for a derived statistic (connections/sec).
- _connections: incremented when a connection is opened and decremented when it
  is closed, so it tracks the current number of open connections.
By mistake, _connects was registered as the source for both the derived and the
gauge collectd statistics, while it should feed only the derived counter, and
_connections should be the source for the gauge.
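For illustration, the distinction looks like this with Seastar's current
metrics API (not the scollectd registration this commit actually touches;
the group and metric names here are made up):
#include <seastar/core/metrics.hh>
class connection_stats {
    uint64_t _connects = 0;     // total connections ever opened (monotonic)
    uint64_t _connections = 0;  // connections currently open
    seastar::metrics::metric_groups _metrics;
public:
    void register_metrics() {
        namespace sm = seastar::metrics;
        _metrics.add_group("transport", {
            // Rate-style value: register as a counter/derive metric only.
            sm::make_counter("connections_total", _connects,
                    sm::description("Total CQL connections opened")),
            // Point-in-time value: register as a gauge.
            sm::make_gauge("current_connections", _connections,
                    sm::description("Currently open CQL connections")),
        });
    }
};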
Fixes issue #143
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
mutation_result_merger can outlive query::read_command, so it has to
hold a shared pointer to it instead of a reference. The bug was introduced
by 89e36541c3.
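A sketch of the ownership change, with the types heavily simplified (the real
code uses query::read_command and the merger's actual members differ):
#include <utility>
#include <seastar/core/shared_ptr.hh>
// Stand-in for query::read_command.
struct read_command { /* ... */ };
class mutation_result_merger {
    // Shares ownership so the command stays alive for as long as the merger
    // does, instead of dangling when the caller's read_command is destroyed.
    seastar::lw_shared_ptr<read_command> _cmd;  // was: const read_command&
public:
    explicit mutation_result_merger(seastar::lw_shared_ptr<read_command> cmd)
        : _cmd(std::move(cmd)) { }
};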
This allows token::_data to be in a different representation
than the one expected by the token type.
Signed-off-by: Paweł Dziepak <pdziepak@cloudius-systems.com>
_v.begin() points to the next element. If the size of the last element
in a compound is zero, then iterators pointing to the second-to-last and
the last element would seem equal. To fix this we also have to compare
_types_left.
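A hypothetical, heavily simplified model of the situation (member names follow
the commit text; the real compound iterator differs):
#include <cstddef>
#include <string_view>
struct compound_iterator {
    std::string_view _v;      // stand-in for the real bytes_view
    std::size_t _types_left;  // stand-in for the remaining-types state
    bool operator==(const compound_iterator& o) const {
        // Comparing _v.begin() alone is ambiguous when the last component has
        // zero length: the iterator already points past its (empty) data, so
        // "one empty component left" and "end" look the same. Comparing
        // _types_left as well tells them apart.
        return _v.begin() == o._v.begin() && _types_left == o._types_left;
    }
};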
Signed-off-by: Paweł Dziepak <pdziepak@cloudius-systems.com>
* seastar b56a6eb...432e973 (3):
> dpdk: merge local patch to fix ixgbe
> dpdk: rebase to latest upstream
> net::dpdk: actually check the resulting cluster and not the original packet
Currently the limit is enforced only on partition boundaries, so the real
result can contain 2*row_limit - 1 rows in the worst case (row_limit - 1 rows
already accumulated, plus an entire additional partition of row_limit rows).
Fix it by trimming rows from a mutation if only part of its rows fit the
requested limit.
Cassandra 2.2 allows IN on any column and it seems that we support that fine,
but dtests expect us to reject such queries.
Signed-off-by: Paweł Dziepak <pdziepak@cloudius-systems.com>
Because of the reverse flag in the partition slice, rows inside the bounds will
be returned in reverse order; however, we still have to make sure
that the bounds themselves are in the expected order.
Signed-off-by: Paweł Dziepak <pdziepak@cloudius-systems.com>
Values inside an IN clause should be sorted and duplicates removed if the
restricted columns are part of the clustering key, which is always true
for multi-column restrictions.
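A minimal illustration of the required normalization step, assuming a
less-than comparator derived from the clustering key types (names here are
made up):
#include <algorithm>
#include <vector>
// Sort the IN values into clustering order and drop duplicates, where two
// values are considered equal iff neither compares less than the other.
template <typename Value, typename Less>
void sort_and_dedup(std::vector<Value>& values, Less less) {
    std::sort(values.begin(), values.end(), less);
    values.erase(std::unique(values.begin(), values.end(),
            [&] (const Value& a, const Value& b) {
                return !less(a, b) && !less(b, a);
            }),
        values.end());
}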
Signed-off-by: Paweł Dziepak <pdziepak@cloudius-systems.com>
Values inside IN () restrictions may be either in a vector _in_values or
a marker (_in_marker or _value). To determine which one is appropriate
we check whether _in_values is empty, which is wrong because an IN clause
can be empty (and there is no marker in such a case). This is fixed by
using the presence of a marker to determine whether a vector of values
or a marker should be used.
Signed-off-by: Paweł Dziepak <pdziepak@cloudius-systems.com>
"This series introduces the i_endpoint_snitch::reset_snitch() static method
that allows to replace the current (global) snitch instance with the new one.
This is done in an (per-shard) atomic way transparent so anyone holding a reference
to snitch_ptr.
This series starts with some cleanups, adds the above method and the unit test
that verifies its functionality."
"I am currently looking at the performance of our index_read, since it was in
the past pinpointed at the source of problems.
While the read side is the one that is mostly interesting, I would like to test
both - besides anything else, it is easier to test reads after writes so we
don't have to create synthetic data with outside tools.
This patch introduces the write side benchmark (read side will hopefully come
tomorrow). While the write side is, as mentioned, not the most interesting
part, I did see some standing from the flamegraph that allowed me to optimize
one particular function, yielding a 8.6 % improvement."
"Related to 108
Does not fix the problem (fully at least), but at least:
* Throws exceptions instead of crashing
* Tries to back off slightly (allocate less) if possible
* Logs it
Also recycles segments to keep them from being fragmented by the memory system
handle_exception() shouldn't really discard the future's value automatically,
and in an upcoming version of Seastar, it won't. So instead of
sp.execute().handle_exception(...)
(where execute() returns a future which is *not* future<>)
we need to write
sp.execute().discard_result().handle_exception(...)
This already works in today's Seastar (the extra discard_result()
doesn't cause any harm), and will be necessary when handle_exception()
in Seastar is improved (I'll send a patch soon).
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
Loading data from memory tends to be the most expensive part of the comparison
operations. Because we don't have a tri_compare function for tokens, we end up
having to do an equality test, which loads the token's data from memory, and
then, because all we know is that they are not equal, we need to do another
comparison (and another load).
Having two dereferences is harmful, and shows up in my simple benchmark. This
is because before writing to sstables, we must order the keys in decorated key
order, which is heavy on the comparisons.
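For illustration, a three-way comparison of this kind touches the data only
once per call; this is only a sketch of the idea (bytes_view-like data is
modeled with std::string_view), not the actual change:
#include <algorithm>
#include <cstring>
#include <string_view>
// Returns <0, 0 or >0 after a single pass over the bytes, so callers ordering
// decorated keys never need a second dereference to break the "not equal" tie.
inline int tri_compare(std::string_view a, std::string_view b) {
    auto n = std::min(a.size(), b.size());
    if (n != 0) {
        if (int r = std::memcmp(a.data(), b.data(), n)) {
            return r;
        }
    }
    return a.size() < b.size() ? -1 : a.size() > b.size() ? 1 : 0;
}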
The proposed change speeds up index write benchmark by 8.6%:
Before:
41458.14 +- 1.49 partitions / sec (30 runs)
After:
45020.81 +- 3.60 partitions / sec (30 runs)
Parameters:
--smp 6 --partitions 500000
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
This is a test that allows us to measure the performance of our sstable index
reads and writes (currently only writes are implemented). A lot of potentially
common code is put into a header, which will make writing new tests easier if
needed.
We don't want to take shortcuts for this, so all reading and writing is done
through public sstable interfaces.
For writing, there is no way to write the index without writing the datafile.
But because we are only writing the primary key, the datafile will not contain
anything else. This is the closest we can get to index-only testing with the
public interfaces.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
If a directory is found, recursively delete it. This will be useful for
allowing the creation of test structures like test/cpuX/sstable.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Our normal test directory may not be good enough for performance testing. The
reason is that, while our git tree with its relative path will usually be
sitting on a standard ext4 filesystem, we want the performance tests to run
against XFS, which is our deployment target.
It is a lot easier to point the perf test at an already mounted XFS directory
than to meddle with mounts inside the codebase's relative path for this alone.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
In some situations, it is useful to have the test directory persistent. To do that,
expose the inner function that creates it.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
resume_io() is different from start() in that it won't try to read the
configuration and will only restart the periodic I/O task (if any).
This also means that resume_io() cannot fail, while start() will return an
exceptional future if it fails to read the configuration.
pause_io() is the counterpart of resume_io() - it stops the periodic I/O task
(if any). After it returns a ready future, the snitch will not try to read any
configuration until either start() or resume_io() is called.
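A hypothetical usage sketch of the new pair (assuming, as the text suggests,
that both calls return a future; the snitch type and the helper name are
placeholders):
#include <utility>
#include <seastar/core/future.hh>
// Quiesce snitch I/O, run some maintenance work (itself returning a future<>),
// then resume the periodic I/O task.
template <typename Snitch, typename Work>
seastar::future<> with_io_paused(Snitch& snitch, Work work) {
    return snitch.pause_io().then([&snitch, work = std::move(work)] () mutable {
        return work();
    }).then([&snitch] {
        return snitch.resume_io();
    });
}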
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>