Exception handling was broken because, since the I/O checker was added,
system error exceptions are wrapped in a storage_io_error exception.
Also, the message printed when handling an exception wasn't precise
enough for all cases, for example, lack of permission to write to an
existing data directory.
Fixes #883.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <b2dc75010a06f16ab1b676ce905ae12e930a700a.1478542388.git.raphaelsc@scylladb.com>
(cherry picked from commit 9a9f0d3a0f)
Snapshot destructor may free some objects managed by the LSA. That's why
partition_snapshot_reader destructor explicitly destroys the snapshot it
uses. However, it was possible that an exception thrown by _read_section
prevented that from happening, so the snapshot was destroyed implicitly
without the current allocator being set to LSA.
Refs #1831.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1478778570-2795-1-git-send-email-pdziepak@scylladb.com>
(cherry picked from commit f16d6f9c40)
The moving_average constructor is defined like this:
moving_average(latency_counter::duration interval, latency_counter::duration tick_interval)
But when it is time to initialize them, we do this:
... {tick_interval(), std::chrono::minutes(1)} ...
As can be seen, the interval and the tick interval are inverted. This
leads to the metrics being assigned bogus values.
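A minimal sketch of why the swap compiles silently: both parameters have
the same type, so only the order tells them apart. The struct name, the
sanity check and the 5 second tick period are assumptions made for this
illustration, not the real code.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical stand-in for the real class; only the parameter order matters.
struct moving_average_sketch {
    using duration = std::chrono::steady_clock::duration;
    duration interval;       // window the average is computed over
    duration tick_interval;  // how often the average is updated

    moving_average_sketch(duration interval, duration tick_interval)
        : interval(interval), tick_interval(tick_interval) {
        // A cheap sanity check that would have caught the inverted call.
        assert(tick_interval <= interval);
    }
};

// Assumed tick period, for illustration only.
inline std::chrono::seconds tick_interval() { return std::chrono::seconds(5); }

// Buggy order:    {tick_interval(), std::chrono::minutes(1)}
// Intended order: interval first, tick interval second.
inline moving_average_sketch make_one_minute_average() {
    return moving_average_sketch{std::chrono::minutes(1), tick_interval()};
}
```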
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <d83f09eed20ea2ea007d120544a003b2e0099732.1478798595.git.glauber@scylladb.com>
(cherry picked from commit d3f11fbabf)
* seastar b62d7a5...5adb964 (2):
> file: make close() more robust against concurrent calls
> rpc: Do not close client connection on error response for a timed out request
When the max sstable size is increased, higher levels suffer from
starvation because we decide to compact a given level if the following
calculation results in a number greater than 1.001:
level_size(L) / max_size_for_level_l(L)
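The check can be sketched as follows. The tenfold growth of the target
size per level is an assumption borrowed from Cassandra's leveled
strategy, and the function names are hypothetical; the point is only
that the score depends on the max sstable size.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: for L >= 1 the target size of a level is 10^L times the max
// sstable size. Level 0 is handled differently and is not covered here.
inline double max_size_for_level_sketch(int level, uint64_t max_sstable_size) {
    assert(level >= 1 && max_sstable_size > 0);
    double size = static_cast<double>(max_sstable_size);
    for (int i = 0; i < level; ++i) {
        size *= 10;
    }
    return size;
}

// A level is a compaction candidate when its score exceeds 1.001.
inline bool level_needs_compaction(uint64_t level_size, int level, uint64_t max_sstable_size) {
    const double score = static_cast<double>(level_size)
                       / max_size_for_level_sketch(level, max_sstable_size);
    return score > 1.001;
}
```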
Fixes #1720.
For this backport, I needed to add schema as parameter to sstable
functions that return first and last decorated keys.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit a8ab4b8f37)
Uniform token range distribution across sstables in a level > 1 was broken,
because we were only choosing the sstable with the lowest first key when
compacting a level > 0. This resulted in a performance problem because,
for example, L1->L2 may have a huge overlap over time.
Last compacted key will now be stored for each level to ensure sort of
"round robin" selection of sstables for compactions at level >= 1.
That's also done by C*, and they were once affected by it as described in
https://issues.apache.org/jira/browse/CASSANDRA-6284.
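The "round robin" selection can be sketched like this, assuming the
sstables of a level are kept sorted by first key. The types and names
are hypothetical and keys are plain strings here; it is an illustration
of the idea, not the actual strategy code.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified sstable: only the first key matters here.
struct sstable_sketch {
    std::string first_key;
};

// Picks the next candidate in a level sorted by first key: the first sstable
// that starts after last_compacted_key, wrapping around to the beginning when
// there is none. Returns nullptr for an empty level.
inline const sstable_sketch* pick_candidate(const std::vector<sstable_sketch>& sorted_level,
                                            const std::string& last_compacted_key) {
    assert(std::is_sorted(sorted_level.begin(), sorted_level.end(),
        [](const sstable_sketch& a, const sstable_sketch& b) { return a.first_key < b.first_key; }));
    if (sorted_level.empty()) {
        return nullptr;
    }
    auto it = std::upper_bound(sorted_level.begin(), sorted_level.end(), last_compacted_key,
        [](const std::string& key, const sstable_sketch& s) { return key < s.first_key; });
    if (it == sorted_level.end()) {
        it = sorted_level.begin();
    }
    return &*it;
}
```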
Fixes #1719.
For this backport, I added schema parameter to compaction_strategy::
notify_completion() because sstable doesn't store schema here.
Most conflicts were that some interfaces take schema parameter at
this version.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit a3bf7558f2)
Under the hood, the selectable::add_and_get_index() function
deliberately filters out duplicate columns. This causes
simple_selector::get_output_row() to return a row with all duplicate
columns filtered out, which triggers an assertion because of row
mismatch with metadata (which contains the duplicate columns).
The fix is rather simple: just make selection::from_selectors() use
selection_with_processing if the number of selectors and column
definitions don't match -- like Apache Cassandra does.
Fixes #1367
Message-Id: <1477989740-6485-1-git-send-email-penberg@scylladb.com>
(cherry picked from commit e1e8ca2788)
We use the `data_resource` class in the CQL parser, which lets users refer
to a table resource without specifying a keyspace. This asserts out in
get_level() for no good reason, as we already know the intended level
based on the constructor. Therefore, change `data_resource` to track the
level like upstream Cassandra does and use that.
Fixes #1790
Message-Id: <1477599169-2945-1-git-send-email-penberg@scylladb.com>
(cherry picked from commit b54870764f)
We store all auth perm strings in upper case, but the user might very
well pass this in lower case.
We could use a standard key comparator / hash here, but since the
strings tend to be small, the new sstring will likely be allocated on
the stack here and this approach yields significantly less code.
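A minimal sketch of that approach, using std::string in place of
sstring. The enum, the map contents and the function name are
hypothetical; only the "upper-case a copy, then look it up" step
reflects the description above.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <unordered_map>

enum class permission_sketch { SELECT, MODIFY };  // hypothetical subset

// Looks a permission up by name, ignoring case. The keys are stored in upper
// case, so upper-casing a copy of the input is enough; no custom hash or
// comparator is needed.
inline const permission_sketch* find_permission(std::string name) {
    static const std::unordered_map<std::string, permission_sketch> by_name = {
        {"SELECT", permission_sketch::SELECT},
        {"MODIFY", permission_sketch::MODIFY},
    };
    std::transform(name.begin(), name.end(), name.begin(),
                   [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
    auto it = by_name.find(name);
    return it == by_name.end() ? nullptr : &it->second;
}
```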
Fixes #1791.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <51df92451e6e0a6325a005c19c95eaa55270da61.1477594199.git.glauber@scylladb.com>
(cherry picked from commit ef3c7ab38e)
The move constructor of partition_version was not invoking move
constructor of anchorless_list_base_hook. As a result, when
partition_version objects were moved, e.g. during LSA compaction, they
were unlinked from their lists.
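The failure mode can be reproduced with a toy intrusive hook. All names
below are hypothetical and the hook is far simpler than
anchorless_list_base_hook; the sketch only shows that a hand-written
move constructor has to name the base move constructor explicitly.

```cpp
#include <cassert>
#include <utility>

// Hypothetical intrusive hook: moving it re-points the neighbours at the new
// object, which is what keeps a moved element linked.
struct list_hook_sketch {
    list_hook_sketch* prev = nullptr;
    list_hook_sketch* next = nullptr;

    list_hook_sketch() = default;
    list_hook_sketch(list_hook_sketch&& o) noexcept : prev(o.prev), next(o.next) {
        if (prev) { prev->next = this; }
        if (next) { next->prev = this; }
        o.prev = nullptr;
        o.next = nullptr;
    }
};

struct version_sketch : list_hook_sketch {
    int payload;

    explicit version_sketch(int p) : payload(p) {}

    // The base move constructor must be invoked explicitly. Leaving it out
    // default-constructs the hook, and the new object ends up unlinked.
    version_sketch(version_sketch&& o) noexcept
        : list_hook_sketch(std::move(o)), payload(o.payload) {}
};
```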
This can make readers return invalid data, because not all versions
will be reachable.
It also causes leaks of the versions which are not directly attached
to a memtable entry. This will trigger an assertion failure in the LSA
region destructor. This assertion triggers with row cache disabled. With cache
enabled (default) all segments are merged into the cache region, which
currently is not destroyed on shutdown, so this problem would go
unnoticed. With cache disabled, memtable region is destroyed after
memtable is flushed and after all readers stop using that memtable.
Fixes #1753.
Message-Id: <1476778472-5711-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit fe387f8ba0)
Commit e6ef49e ("db: Do not timeout streaming readers") breaks compilation of database.cc:
database.cc: In lambda function:
database.cc:282:62: error: ‘const class io_priority_class’ has no member named ‘id’
if (service::get_local_streaming_read_priority().id() == pc.id()) {
^~
database.cc:282:73: error: ‘const class io_priority_class’ has no member named ‘id’
if (service::get_local_streaming_read_priority().id() == pc.id()) {
...because we don't have Seastar commit 823a404 ("io_priority_class:
remove non-explicit operator unsigned") backported.
Fix the issue by using the non-explicit operator instead of explicit id().
Acked-by: Tomasz Grabiec <tgrabiec@scylladb.com>
Message-Id: <1476425276-17171-1-git-send-email-penberg@scylladb.com>
Paging code assumes that clustering row range [a, a] contains only one
row which may not be true. Another problem is that it tries to use
range<> interface for dealing with clustering key ranges which doesn't
work because of the lack of correct comparator.
Refs #1446.
Fixes #1684.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1475236805-16223-1-git-send-email-pdziepak@scylladb.com>
(cherry picked from commit eb1fcf3ecc)
The expire time which is used to decide when to remove a node from
gossip membership is gossiped around the cluster. We switched to steady
clock in the past. In order to have a consistent time_point in all the
nodes in the cluster, we have to use wall clock. Switch to use
system_clock for gossip.
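A sketch of what "consistent time_point" means in practice: a
system_clock time point can be turned into a number that other nodes can
interpret, provided their wall clocks are synchronized and share an
epoch (true on mainstream platforms, but an assumption here). A
steady_clock reading has no such meaning outside the machine that took
it. The function names and the millisecond encoding are hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Encodes an expiry time as milliseconds since the system_clock epoch.
inline int64_t encode_expire_time(std::chrono::system_clock::time_point tp) {
    return std::chrono::duration_cast<std::chrono::milliseconds>(
        tp.time_since_epoch()).count();
}

// Decodes a value produced by encode_expire_time(), possibly on another node.
inline std::chrono::system_clock::time_point decode_expire_time(int64_t ms) {
    return std::chrono::system_clock::time_point(std::chrono::milliseconds(ms));
}
```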
Fixes #1704
(cherry picked from commit f0d3084c8b)
There is a limit to concurrency of sstable readers on each shard. When
this limit is exhausted (currently 100 readers) readers queue. There
is a timeout after which queued readers are failed, equal to
read_request_timeout_in_ms (5s by default). The reason we have the
timeout here is primarily because the readers created for the purpose
of serving a CQL request no longer need to execute after waiting
longer than read_request_timeout_in_ms. The coordinator no longer
waits for the result so there is no point in proceeding with the read.
This timeout should not apply for readers created for streaming. The
streaming client currently times out after 10 minutes, so we could
wait at least that long. Timing out sooner makes streaming unreliable,
which under high load may prevent streaming from completing.
The change sets no timeout for streaming readers at the replica level,
as we already do for system table readers.
Fixes #1741.
Message-Id: <1475840678-25606-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 2a5a90f391)
CQL server is supposed to throttle requests so that they don't
overflow memory. The problem is that it currently accounts for
request's memory only around reading of its frame from the connection
and not actual request execution. As a result too many requests may be
allowed to execute and we may run out of memory.
Fixes #1708.
Message-Id: <1475149302-11517-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 7e25b958ac)
It is possible that endpoint_state_map does not contain the entry for
the node itself when collectd accesses it.
Fixes the issue:
Sep 18 11:33:16 XXX scylla[19483]: [shard 0] seastar - Exceptional
future ignored: std::out_of_range (_Map_base::at)
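The exception comes from map::at(), which throws when the key is
missing. A minimal sketch of a lookup that tolerates a missing entry
follows; the types, the function name and the fallback value are
assumptions for illustration, not the actual fix.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical, trimmed-down endpoint state.
struct endpoint_state_sketch {
    int heart_beat_version = 0;
};

using endpoint_state_map_sketch = std::map<std::string, endpoint_state_sketch>;

// Returns the heartbeat version for `endpoint`, or `fallback` when the map has
// no entry for it yet. map::at() would throw std::out_of_range in that case.
inline int heart_beat_version_or(const endpoint_state_map_sketch& states,
                                 const std::string& endpoint, int fallback) {
    auto it = states.find(endpoint);
    return it == states.end() ? fallback : it->second.heart_beat_version;
}
```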
Fixes #1656
Message-Id: <8ffe22a542ff71e8c121b06ad62f94db54cc388f.1474377722.git.asias@scylladb.com>
(cherry picked from commit aa47265381)
timeuuid_type_impl::compare_bytes is a "trichotomic" comparator (-1,
0, 1) while less() is a "less" comparator (false, true). The code
incorrectly returns c1 instead of c1 < 0 which breaks the ordering.
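The difference is easy to see in a small sketch. memcmp stands in for
the real comparator here (timeuuid ordering is more involved), and all
names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Three-way comparison of two equal-length byte strings: -1, 0 or 1.
inline int compare_bytes_sketch(const unsigned char* a, const unsigned char* b, std::size_t n) {
    assert(a != nullptr && b != nullptr);
    const int c = std::memcmp(a, b, n);
    return (c > 0) - (c < 0);
}

// Buggy: this is what returning the int from a bool function amounts to.
// Both -1 ("less") and 1 ("greater") become true.
inline bool less_buggy(const unsigned char* a, const unsigned char* b, std::size_t n) {
    return compare_bytes_sketch(a, b, n) != 0;
}

// Fixed: only a negative result means "less".
inline bool less_fixed(const unsigned char* a, const unsigned char* b, std::size_t n) {
    return compare_bytes_sketch(a, b, n) < 0;
}
```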
Fixes #1196.
Message-Id: <1473956716-5209-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 804fe50b7f)
On instances different than i2/m3/c3 we provide instructions to run
scylla_io_setup. Running scylla_io_setup requires access to
/var/lib/scylla to create a temporary file. To gain access to that
directory the user should run 'sudo scylla_io_setup'.
refs: #1645
Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
Message-Id: <4ce90ca1ba4da8f07cf8aa15e755675463a22933.1473935778.git.shlomi@scylladb.com>
(cherry picked from commit acb83073e2)
"This series backports fixes for #1670 on top of 1.3 branch.
Fixes abort when querying with contradicting clustering column
restrictions, for example:
SELECT * FROM test WHERE k = 0 AND ck < 1 and ck > 2"
Example of affected query:
SELECT * FROM test WHERE k = 0 AND ck < 1 and ck > 2
Refs #1670.
This commit brings back the backport of "Don't allow CK wrapping
ranges" by Duarte by reverting commit 11d7f83d52.
It also has the following fix, which is introduced by the
aforementioned commit, squashed to improve bisectability:
"cql3: Consider bound type when detecting wrap around
This patch uses the clustering bounds comparator to correctly detect
wrap around of a clustering range. This fixes a manifestation of #1446,
introduced by b1f9688432, where a query
such as select * from cf where k = 0x00 and c0 = 0x02 and c1 > 0x02
would result in a range containing a clustering key and a prefix,
incorrectly ordered by the prefix equality or lexicographical
comparators.
Refs #1446
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
(cherry picked from commit ee2694e27d)"
This patch extracts bounds_view from range_tombstone so its comparator
can be reused elsewhere.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
(cherry picked from commit 878927d9d2)
Currently we get a boost::bad_lexical_cast on startup if initial_token has
a list which contains spaces after commas, e.g.:
initial_token: -1100081313741479381, -1104041856484663086, ...
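A sketch of parsing that tolerates such spaces: split on commas and trim
each item before converting it. It uses std::stoll for brevity, whereas
the real code goes through boost::lexical_cast, which rejects
surrounding whitespace; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Strips leading and trailing spaces and tabs.
inline std::string trim_sketch(const std::string& s) {
    const char* ws = " \t";
    const auto b = s.find_first_not_of(ws);
    if (b == std::string::npos) {
        return "";
    }
    const auto e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

// Splits a comma-separated token list and converts each trimmed item.
inline std::vector<int64_t> parse_initial_tokens(const std::string& list) {
    std::vector<int64_t> tokens;
    std::istringstream in(list);
    std::string item;
    while (std::getline(in, item, ',')) {
        item = trim_sketch(item);
        if (!item.empty()) {
            tokens.push_back(std::stoll(item));
        }
    }
    return tokens;
}
```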
Fixes #1664.
Message-Id: <1473840915-5682-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit a498da1987)
In posix_net_conf.sh's single-queue NIC mode (that is, RPS-enabled mode), cpu0 and its sibling are excluded from the CPUs that do network stack processing, and the NIC IRQ is assigned to cpu0.
Since the network stack never runs on cpu0 and its sibling, we need to exclude these CPUs from Scylla too to get better performance.
To do this, we get the RPS CPU mask from posix_net_conf.sh and pass it to scylla_cpuset_setup, which constructs /etc/scylla.d/cpuset.conf when scylla_setup is executed.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1472544875-2033-2-git-send-email-syuu@scylladb.com>
(cherry picked from commit 533dc0485d)
Right now scylla_prepare passes the -mq option to posix_net_conf.sh when the number of RX queues is > 1, but posix_net_conf.sh itself sets the NIC mode to sq when queues < ncpus / 2.
So the two use different logic, and there is no longer any need to pass -sq/-mq to posix_net_conf.sh because it autodetects the queue mode.
Drop the detection logic from scylla_prepare and let posix_net_conf.sh detect the mode.
Fixes #1406
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1472544875-2033-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 0c3bb2ee63)
Alexandr Porunov reports that Scylla fails to start up after reboot as follows:
Aug 25 19:44:51 scylla1 scylla[637]: Exiting on unhandled exception of type 'std::system_error': Error system:99 (Cannot assign requested address)
The problem is that because there's no dependency on the network service,
Scylla simply attempts to start up too soon in the boot sequence and
fails.
Fixes #1618.
Message-Id: <1472212447-21445-1-git-send-email-penberg@scylladb.com>
(cherry picked from commit 2d3aee73a6)
Size estimates for a particular column family are recorded every 5
minutes. However, when a user calls the describe_splits(_ex) verbs,
they may want to see estimates for a recently created and updated
column family; this is legitimate and common in testing. However, a
client may also call describe_splits(_ex) very frequently and
recording the estimates on every call is wasteful and, worse, can
cause clients to give up. This patch fixes this by only recording
estimates if the first attempt to query them produces no results.
Refs #1139
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1471900595-4715-1-git-send-email-duarte@scylladb.com>
(cherry picked from commit 440c1b2189)
We have API for getting pending compaction tasks both in column
family and compaction manager. Column family is already returning
pending tasks properly.
Compaction manager's one is used by 'nodetool compactionstats', and
was returning a value which doesn't reflect pending compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <a20b88938ad39e95f98bfd7f93e4d1666d1c6f95.1471641211.git.raphaelsc@scylladb.com>
(cherry picked from commit d8be32d93a)
Reversed iterators are adaptors for 'normal' iterators. These underlying
iterators point to different objects than the reversed iterators
themselves.
The consequence of this is that removing an element pointed to by a
reversed iterator may invalidate a reversed iterator which points to a
completely different object.
This is what happens in trim_rows for reversed queries. Erasing a row
can invalidate the end iterator and the loop would fail to stop.
The solution is to introduce the
reversal_traits::erase_dispose_and_update_end() function, which erases and
disposes of the object pointed to by a given iterator but also takes a
reference to an end iterator and updates it if necessary to make sure
that it stays valid.
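The idea can be sketched on a std::list. This is an illustration with a
hypothetical helper name, not the reversal_traits code, which works on
intrusive containers and also disposes of the erased object.

```cpp
#include <cassert>
#include <iterator>
#include <list>

// Erases the element a reverse iterator points to and returns the reverse
// iterator that follows it. If the erased node was the base of `end`, `end`
// is refreshed as well, so the caller's loop still terminates.
template <typename T>
typename std::list<T>::reverse_iterator
erase_and_update_end(std::list<T>& c,
                     typename std::list<T>::reverse_iterator it,
                     typename std::list<T>::reverse_iterator& end) {
    assert(it != c.rend());
    auto to_erase = std::next(it).base();  // forward iterator to the element *it
    const bool update_end = end.base() == to_erase;
    auto ret = typename std::list<T>::reverse_iterator(c.erase(to_erase));
    if (update_end) {
        end = ret;
    }
    return ret;
}
```

A loop such as `while (it != end) { it = erase_and_update_end(rows, it, end); }`
then stops even when the last erased element was the one `end` was
built on.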
Fixes #1609.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1472080609-11642-1-git-send-email-pdziepak@scylladb.com>
(cherry picked from commit 6012a7e733)
Normally, the check version should start and stop with the scylla-server
service.
If it fails to find scylla server, there is no need to check the
version, nor to report it, so it can stop silently.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
(cherry picked from commit 2b98335da4)
There is a problem with Python's SSL support on Ubuntu 14.04:
ubuntu@ip-10-81-165-156:~$ /usr/lib/scylla/scylla-housekeeping -q version
Traceback (most recent call last):
File "/usr/lib/scylla/scylla-housekeeping", line 94, in <module>
args.func(args)
File "/usr/lib/scylla/scylla-housekeeping", line 71, in check_version
latest_version = get_json_from_url(version_url + "?version=" + current_version)["version"]
File "/usr/lib/scylla/scylla-housekeeping", line 50, in get_json_from_url
response = urllib2.urlopen(req)
File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 404, in open
response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 422, in _open
'_open', req)
File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1222, in https_open
return self.do_open(httplib.HTTPSConnection, req)
File "/usr/lib/python2.7/urllib2.py", line 1184, in do_open
raise URLError(err)
urllib2.URLError: <urlopen error [Errno 1] _ssl.c:510: error:14077410:SSL routines:SSL23_GET_SERVER_HELLO:sslv3 alert handshake failure>
Instead of using Python libraries to connect to the check version
server, we will use curl for that.
Fixes #1600
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
(cherry picked from commit 4598674673)
To work around an SSL problem with Python on Ubuntu 14.04, we need to
use curl. Add it as a dependency so that it's available on the host.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
(cherry picked from commit 91944b736e)
.in is the suffix for template files which need to be rewritten at build time, but these systemd unit files do not require rewriting, so don't name them .in; reference them directly from the .spec.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1471607533-3821-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit aac60082ae)
After state_processor().process_state() returns proceed::no the upper
layer should have a chance to act before more data is pushed to the
consumer. This means that in case of proceed::no verify_end_state()
should not be called immediately since it may invoke
consume_end_partition().
Fixes #1605.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1471943032-7290-1-git-send-email-pdziepak@scylladb.com>
(cherry picked from commit 5feed84e32)
Move scylla-server and scylla-jmx supervisord config files to separate
files and make the main supervisord.conf scan /etc/supervisord.conf.d/
directory. This makes it easier for people to extend the Docker image
and add their own services.
Message-Id: <1471588406-25444-1-git-send-email-penberg@scylladb.com>
(cherry picked from commit 9d1d8baf37)
Clustering rows in the sstables are sorted in the ascending order so we
can use that to minimise number of comparisons when checking if a row is
in the requested range.
Refs #1544.
Paweł further explains the backport rationale for 1.3:
"Apart from making sense on its own, this patch has a very curious
property of working around #1544 in a way that doesn't make #1446 hit
us harder than usual.
So, in the branch-1.3 we can:
- revert 85376ce555
'Merge "Don't allow CK wrapping ranges" from Duarte' -- previous,
insufficient workaround for #1544
- apply this patch
- rejoice as cql_query_test passes and #1544 is no longer a problem
The scenario above assumes that this patch doesn't introduce any
regressions."
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Reviewed-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <1471608921-30818-1-git-send-email-pdziepak@scylladb.com>
(cherry picked from commit e60bb83688)
Glauber "eagle eyes" Costa pointed out that the Scylla logo used in our
Docker image documentation looks broken because it's missing the Scylla
text.
Fix the problem by using the Scylla mascot instead.
Message-Id: <1471525154-2800-1-git-send-email-penberg@scylladb.com>
(cherry picked from commit 2bf5e8de6e)
The bug tracker URL in our Docker image documentation is not clickable
because the URL Markdown extracts automatically is broken.
Fix that and add some more links on how to get help and report issues.
Message-Id: <1471524880-2501-1-git-send-email-penberg@scylladb.com>
(cherry picked from commit 4d90e1b4d4)
Allow the user to use the `supervisorctl' program to start and stop
services. `exec` needed to be added to the scylla and scylla-jmx starter
scripts - otherwise supervisord loses track of the actual process we
want to manage.
Signed-off-by: Yoav Kleinberger <yoav@scylladb.com>
Message-Id: <1471442960-110914-1-git-send-email-yoav@scylladb.com>
(cherry picked from commit 25fb5e831e)
The WorkingDirectory directive does not support environment variables on
the systemd version that is shipped with Ubuntu 16.04. Fortunately, not
setting WorkingDirectory implicitly sets it to the user's home directory,
which is the same thing (i.e. /var/lib/scylla).
Fixes #1319
Signed-off-by: Benoit Canet <benoit@scylladb.com>
Message-Id: <1470053876-1019-1-git-send-email-benoit@scylladb.com>
(cherry picked from commit 90ef150ee9)
Wrapping ranges are not supported in CQL3. If one is specified,
this patch converts it to an empty range.
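A minimal sketch of the conversion, with integer keys and exclusive
bounds standing in for clustering keys. The struct and function names
are hypothetical; "empty" is modelled as an empty list of ranges.

```cpp
#include <cassert>
#include <vector>

// Hypothetical range over integer clustering keys: start < ck < end.
struct ck_range_sketch {
    int start;
    int end;
};

// A range whose end lies before its start would "wrap around". Since CQL3
// has no such ranges, it is turned into an empty set of ranges instead.
inline std::vector<ck_range_sketch> to_clustering_ranges(ck_range_sketch r) {
    if (r.end < r.start) {
        return {};
    }
    return {r};
}
```

For a restriction such as `ck < 1 and ck > 2`, the range has start 2 and
end 1, so nothing is selected.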
Fixes #1544
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch makes the storage_proxy return an empty result when the
query doesn't define any clustering ranges (default or specific).
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch makes make_clustering_range not enforce that the range be
non-wrapping, so that it can be validated differently if needed. A
make_clustering_range_and_validate function is introduced that keeps
the old behavior.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Currently, partition snapshot destructor can throw which is a big no-no.
The solution is to ignore the exception and leave versions unmerged and
hope that subsequent reads will succeed at merging.
However, another problem is that the merge doesn't use allocating
sections which means that memory won't be reclaimed to satisfy its
needs. If the cache is full this may result in partition versions not
being merged for a very long time.
This patch introduces partition_snapshot::merge_partition_versions()
which contains all the version merging logic that was previously present
in the snapshot destructor. This function may throw so that it can be
used with allocating sections.
The actual merging and handling of potential errors is done from
partition_snapshot_reader destructor. It tries to merge versions under
the allocating section. Only if that fails it gives up and leaves them
unmerged.
Fixes #1578.
Fixes #1579.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1471265544-23579-1-git-send-email-pdziepak@scylladb.com>
(cherry picked from commit 5cae44114f)
$ tools/scyllatop/scyllatop.py '*gossip*'
node-1/gossip-0/gauge-heart_beat_version 1.0
node-2/gossip-0/gauge-heart_beat_version 1.0
node-3/gossip-0/gauge-heart_beat_version 1.0
Gossip heart beat version changes every second. If everything is working
correctly, the gauge-heart_beat_version output should be 1.0. If not,
the gauge-heart_beat_version output should be less than 1.0.
Message-Id: <cbdaa1397cdbcd0dc6a67987f8af8038fd9b2d08.1470712861.git.asias@scylladb.com>
(cherry picked from commit ef782f0335)
[v2: fix check for static column (don't check if the schema is not compound)
and move want-static-columns flag inside the filtering context to avoid
changing all the callers.]
When a CQL request asks to read only a range of clustering keys inside
a partition, we actually need to read not just these clustering rows, but
also the static columns and add them to the response (as explained by Tomek
in issue #1568).
With the current code, that CQL request is translated into an
sstable::read_row() with a clustering-key filter. But this currently
only reads the requested clustering keys - NOT the static columns.
We don't want sstable::read_row() to unconditionally read the static
columns from disk because if, for example, they are already cached, we
might not want to read them from disk. We don't have such a partial-partition
cache yet, but we are likely to have one in the future.
This patch adds to the clustering key filter object a flag saying whether we
need to read the static columns (actually, it's a function, returning this
flag per partition, to match the API for the clustering-key filtering).
When sstable::read_row() sees the flag for this partition is true, it also
requests to read the static columns.
Currently, the code always passes "true" for this flag - because we don't
have the logic to cache partially-read partitions.
The current find_disk_ranges() code does not yet support returning a
non-contiguous byte range, so this patch, if it notices that this partition
really has static columns in addition to the range it needs to read,
falls back to reading the entire partition. This is a correct solution
(and fixes #1568) but not the most efficient solution. Because static
columns are relatively rare, let's start with this solution (correct
but less efficient when there are static columns); providing the
non-contiguous reading support is left as a FIXME.
Fixes #1568
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1471124536-19471-1-git-send-email-nyh@scylladb.com>
(cherry picked from commit 0d00da7f7f)
When the housekeeping configuration name was changed from conf to cfg it
was no longer included as part of the conf rpm.
This change adds a macro that determines whether the file should be
included or not and uses that macro to conditionally add the
configuration file to the rpm.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1471169042-19099-1-git-send-email-amnon@scylladb.com>
(cherry picked from commit 612f677283)
Files with a conf extension are run by the scylla_prepare on the AMI.
The scylla-housekeeping configuration file is not a bash script and
should not be run.
This patch changes its extension to cfg, which is more Python-like.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1470896759-22651-2-git-send-email-amnon@scylladb.com>
(cherry picked from commit 5a4fc9c503)
maybe_flush_pi_block, which is called for each cell, assumes that
block_first_colname will be empty when the first cell is encountered
for each partition.
This didn't hold after writing a partition which generated no index
entry, because block_first_colname was cleared only when there was any
data written into the promoted index. Fix by always clearing the name.
The effect was that the promoted index entry for the next partition
would be flushed sooner than necessary (still counting since the start
of the previous partition) and with offset pointing to the start of
the current partition. This will cause parsing error when such sstable
is read through promoted index entry because the offset is assumed to
point to a cell not to partition start.
Fixes #1567
Message-Id: <1470909915-4400-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit f1c2481040)
The dist flag marks the Debian package as a distributed package.
As such, the housekeeping configuration file will be included in the
package and will not need to be created by scylla_setup.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1470907208-502-2-git-send-email-amnon@scylladb.com>
(cherry picked from commit a24941cc5f)
"The series adds an optional configuration file to scylla-housekeeping. The
file acts as a way to prevent scylla-housekeeping from running: a missing
configuration file will make scylla-housekeeping exit immediately.
The series adds a flag to build_rpm that differentiates between public
distributions, which contain the configuration file, and private
distributions, which do not contain it, causing the setup script to
create it."
(cherry picked from commit da4d33802e)
This series handles two issues:
* Moving to python2: though python3 is supported, some modules that we
need are not rpm installable, so python3 will wait until it is more
mature.
* Check version should send the current version when it checks for a new
one, and a simple string compare is wrong.
(cherry picked from commit ec62f0d321)
This patch handles two issues with check version:
* When checking for a version, the script sends the current version
* Instead of string compare it uses parse_version to compare the
versions.
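Why a string compare is wrong can be shown with a small sketch. The
script itself is Python and uses parse_version; the C++ below only
mirrors the idea, handles plain dotted numeric versions only, and uses
hypothetical names.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Splits "1.10.2" into {1, 10, 2}. Only dotted numeric versions are handled;
// anything else makes std::stoi throw.
inline std::vector<int> parse_version_sketch(const std::string& v) {
    std::vector<int> parts;
    std::istringstream in(v);
    std::string part;
    while (std::getline(in, part, '.')) {
        parts.push_back(std::stoi(part));
    }
    return parts;
}

// Compares component by component, so 1.9 sorts before 1.10.
inline bool version_less(const std::string& a, const std::string& b) {
    return parse_version_sketch(a) < parse_version_sketch(b);
}
```

As plain strings, "1.9" compares greater than "1.10" because '9' sorts
after '1', which is the mistake the patch avoids.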
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
(cherry picked from commit 406fa11cc5)
There is a problem with python module installation in python3,
especially on CentOS. Though python34 has a normal package, a lot of the
modules are missing yum installation and can only be installed by pip.
This patch switches the scylla-housekeeping implementation to use
python2; we should switch back to python3 when CentOS python3 is
more mature.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
(cherry picked from commit 641e5dc57c)
The promoted-index reading code contained a bug where it copied the value
of a disengaged optional (this non-value was never used, but it was still
copied). Fix it by keeping the optional<> as such longer.
This bug caused tests/sstable_test in the debug build to crash (the release
build somehow worked).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1470742418-8813-1-git-send-email-nyh@scylladb.com>
(cherry picked from commit e005762271)
Add '--smp', '--memory', and '--overprovisioned' options to the Docker
image. The options are written to /etc/scylla.d/docker.conf file, which
is picked up by the Scylla startup scripts.
You can now, for example, restrict your Docker container to 1 CPU and 1
GB of memory with:
$ docker run --name some-scylla penberg/scylla --smp 1 --memory 1G --overprovisioned 1
Needed by folks who want to run Scylla on Docker in production.
Cc: Sasha Levin <alexander.levin@verizon.com>
Message-Id: <1470680445-25731-1-git-send-email-penberg@scylladb.com>
(cherry picked from commit 6a5ab6bff4)
The sanitizer of the debug build warns when a "bool" variable is read when
containing a value not 0 or 1. In particular, if a class has an
uninitialized bool field, which class logic allows to only be set later,
then "move"ing such an object will read the uninitialized value and produce
this warning.
This patch fixes four of these warnings seen in sstable_test by initializing
some bool fields to false, even though the code doesn't strictly need this
initialization.
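The fix amounts to giving such fields a default member initializer. A
minimal sketch with a hypothetical struct:

```cpp
#include <cassert>
#include <utility>

// Hypothetical state object. Without the "= false" initializers, moving or
// copying a freshly constructed object would read indeterminate bools, which
// is what the sanitizer reports.
struct row_state_sketch {
    bool deleted = false;
    bool is_static = false;
};

// Moving now only ever reads well-defined values.
inline row_state_sketch relocate(row_state_sketch&& s) {
    return row_state_sketch(std::move(s));
}
```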
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1470744318-10230-1-git-send-email-nyh@scylladb.com>
(cherry picked from commit c2e4f5ba16)
Commit 0d8463aba5 broke some of the tests with an assertion
failure about local_is_initialized(). It turns out that there is more than
one level of local_is_initialized() we need to check... For some tests,
neither locals were initialized, but for others, one was and the other
wasn't, and the wrong one was tested.
With this patch, all unit tests except "flush_queue_test.cc" pass on my
machine. I doubt this test is relevant to the promoted index patches,
but I'll continue to investigate it.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1470695199-32649-1-git-send-email-nyh@scylladb.com>
(cherry picked from commit bce020efbd)
"The goal of this patch series is to support reading and writing of a
"promoted index" - the Cassandra 2.* SSTable feature which allows reading
only a part of the partition without needing to read an entire partition
when it is very long. To make a long story short, a "promoted index" is
a sample of each partition's column names, written to the SSTable Index
file with that partition's entry. See a longer explanation of the index
file format, and the promoted index, here:
https://github.com/scylladb/scylla/wiki/SSTables-Index-File
There are two main features in this series - first enabling reading of
parts of partitions (using the promoted index stored in an sstable),
and then enabling writing promoted indexes to new sstables. These two
features are broken up into smaller stand-alone pieces to facilitate the
review.
Three features are still missing from this series and are planned to be
developed later:
1. When we fail to parse a partition's promoted index, we silently fall back
to reading the entire partition. We should log (with rate limiting) and
count these errors, to help in debugging sstable problems.
2. The current code only uses the promoted index when looking for a single
contiguous clustering-key range. If the ck range is non-contiguous, we
fall back to reading the entire partition. We should use the promoted
index in that case too.
3. The current code only uses the promoted index when reading a single
partition, via sstable::read_row(). When scanning through all or a
range of partitions (read_rows() or read_range_rows()), we do not yet
use the promoted index; we read contiguously from the data file (we do not
even read from the index file, so unsurprisingly we can't use it)."
(cherry picked from commit 700feda0db)
The sanitizer of the debug build warns when a "bool" variable is read while
containing a value other than 0 or 1. In particular, if a class has an
uninitialized bool field, which class logic allows to only be set later,
then "move"ing such an object will read the uninitialized value and produce
this warning.
This patch fixes four of these warnings seen in sstable_test by initializing
some bool fields to false, even though the code doesn't strictly need this
initialization.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1470744318-10230-1-git-send-email-nyh@scylladb.com>
(cherry picked from commit c2e4f5ba16)
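A minimal sketch of the kind of fix described above, assuming a hypothetical class (`index_reader_state` and its field names are illustrative and not taken from the Scylla sources). Default member initializers give every bool a well-defined value, so moving the object never reads an indeterminate one:

```cpp
#include <cassert>
#include <utility>

// Hypothetical class, not taken from the Scylla sources.
struct index_reader_state {
    // Default member initializers guarantee a well-defined value even if the
    // class logic only assigns the real value later.
    bool has_promoted_index = false;
    bool finished = false;
};

// Moving (or copying) the object reads every field, including the bools.
inline index_reader_state relocate(index_reader_state s) {
    index_reader_state moved = std::move(s);
    return moved;
}
```

With the initializers in place, `relocate(index_reader_state{})` returns an object whose flags are both false, and the sanitizer has nothing to report.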
This patch adds writing of promoted index to sstables.
The promoted index is basically a sample of columns and their positions
for large partitions: The promoted index appears in the sstable's index
file for partitions which are larger than 64 KB, and divides the partition
into 64 KB blocks (as in Cassandra, this interval is configurable through
the column_index_size_in_kb config parameter). Beyond modifying the index
file, having a promoted index may also modify the data file: Since each
of the blocks may be read independently, we need to add at the beginning of
each block the list of range tombstones that are still open at that
position.
See also https://github.com/scylladb/scylla/wiki/SSTables-Index-File
Fixes #959
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 0d8463aba5)
This patch sets the default validator for dynamic column families.
Doing so has no consequences in terms of behavior, but it causes the
correct type to be shown when describing the column family through
cassandra-cli.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1470739773-30497-1-git-send-email-duarte@scylladb.com>
(cherry picked from commit 0ed19ec64d)
This patch ensures we always send the column metadata, even when the
column family is dynamic and the metadata is empty, as some clients
like cassandra-cli always assume its presence.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1470740971-31169-1-git-send-email-duarte@scylladb.com>
(cherry picked from commit f63886b32e)
There was no way to set up the correct repo when the AMI is built with the --localrpm option, since the AMI does not have access to the 'version' file and we did not pass the repo URL to the AMI.
So detect the optimal repo path when starting the AMI build, pass the repo URL to the AMI, and set it up correctly.
Note: this changes the behavior of the --repo option of build_ami.sh/scylla_install_pkg.
It used to be a repository URL, but it is now the URL of a .repo/.list file.
This is optimal for distributions that require 3rd-party packages to install scylla, such as CentOS7.
Existing shell scripts that invoke build_ami.sh, such as our Jenkins jobs, need to be changed accordingly.
Fixes #1414
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1469636377-17828-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit d3746298ae)
"Kubernetes is unhappy with our Docker image because we start systemd
under the hood. Fix that by switching to use "supervisord" to manage the
two processes -- "scylla" and "scylla-jmx":
http://blog.kunicki.org/blog/2016/02/12/multiple-entrypoints-in-docker/
While at it, fix up "docker logs" and "docker exec cqlsh" to work
out-of-the-box, and update our documentation to match what we have.
Further work is needed to ensure Scylla production configuration works
as expected and is documented accordingly."
(cherry picked from commit 28ee2bdbd2)
Previously, the Docker image could only be run interactively, which is
not conducive to running clusters. This patch makes the docker image
run in the background (using systemd). This makes the docker workflow
similar to working with virtual machines, i.e. the user launches a
container, and once it is running they can connect to it with
docker exec -it <container_name> bash
and immediately use `cqlsh` to control it.
In addition, the configuration of scylla is done using established
scripts, such as `scylla_dev_mode_setup`, `scylla_cpuset_setup` and
`scylla_io_setup`, whereas previously code from these scripts was
duplicated into the docker startup file.
To specify seeds for making a cluster, use the --seeds command line
argument, e.g.
docker run -d --privileged scylladb/scylla
docker run -d --privileged scylladb/scylla --seeds 172.17.0.2
Other options include --developer-mode, --cpuset and --broadcast-address.
The --developer-mode option is on by default, so that we don't fail users
who just want to play with this.
The Dockerfile entrypoint script was rewritten as a few Python modules.
The move to Python is merited because:
* Using `sed` to manipulate YAML is fragile
* Lack of proper command line parsing resulted in introducing ad-hoc environment variables
* Shell scripts don't throw exceptions, and it's easy to forget to check exit codes for every single command
I've made an effort to make the entrypoint `go` script very simple and readable.
The gory details are hidden inside the other Python modules.
Signed-off-by: Yoav Kleinberger <yoav@scylladb.com>
Message-Id: <1468938693-32168-1-git-send-email-yoav@scylladb.com>
(cherry picked from commit d1d1be4c1a)
Calls like later() and with_gate() may allocate memory, although that is not
very common. This can create a problem in the sense that it will potentially
recurse and bring us back to the allocator during free - which is the very thing
we are trying to avoid with the call to later().
This patch wraps the relevant calls in the reclaimer lock. This does mean that the
allocation may fail if we are under severe pressure - which includes having
exhausted all reserved space - but at least we won't recurse back to the
allocator.
To make sure we do this as early as possible, we just fold both release_requests
and do_release_requests into a single function.
Thanks Tomek for the suggestion.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <980245ccc17960cf4fcbbfedb29d1878a98d85d8.1470254846.git.glauber@scylladb.com>
(cherry picked from commit fe6a0d97d1)
Issue 1510 describes a scenario in which, under load, we allocate memory within
release_requests() leading to a reentry into an invalid state in our
blocked requests' shared_promise.
This is not easy to trigger since not all allocations will actually get to the
point in which they need a new segment, let alone have that happening during
another allocator call.
Having those kinds of reentry is something we have always sought to avoid with
release_requests(): this is the reason why most of the actual routine is
deferred after a call to later().
However, that is a trick we cannot use for updating the state of the blocked
requests' shared_promise: we can't guarantee when that is going to run, and we
always need a valid shared_promise, in a valid state, waiting for new requests
to hook into.
The solution employed by this patch is to make sure that no allocation
operations whatsoever happen during the initial part of release_requests on
behalf of the shared promise. Allocation is now deferred to first use, which
relieves release_requests() from all allocation duties. All it needs to do is
free the old object and signal to its user that an allocation is needed (by
storing {} into the shared_promise).
Fixes #1510
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <49771e51426f972ddbd4f3eeea3cdeef9cc3b3c6.1470238168.git.glauber@scylladb.com>
(cherry picked from commit ad58691afb)
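The "defer allocation to first use" idea can be sketched as follows. This is an illustration under stated assumptions, not the actual patch: `waiter_list` is a hypothetical stand-in for the blocked requests' shared_promise, and `blocked_requests` is an invented wrapper. The release path only frees memory, while the first later use performs the allocation:

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical stand-in for the blocked requests' shared_promise.
struct waiter_list {
    std::vector<int> waiters;
};

class blocked_requests {
    std::unique_ptr<waiter_list> _current; // empty means "allocate on first use"
public:
    // Runs in the reclaim path: only frees, never allocates.
    void release() noexcept {
        _current.reset();
    }
    // Runs in the request path, where allocating is safe.
    waiter_list& current() {
        if (!_current) {
            _current.reset(new waiter_list());
        }
        return *_current;
    }
    bool allocated() const noexcept {
        return static_cast<bool>(_current);
    }
};
```

The empty pointer plays the role of the `{}` stored into the shared_promise: it is always a valid state, and it tells the next user that a fresh object is needed.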
This patch ensures that when the schema is dense, regardless of
compact_storage being set, the single regular column is translated
into a compact column.
This fixes an issue where Thrift dynamic column families are
translated to a dense schema with a regular column, instead of a
compact one.
Since a compact column is also a regular column (e.g., for purposes of
querying), no further changes are required.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1470062410-1414-1-git-send-email-duarte@scylladb.com>
(cherry picked from commit 5995aebf39)
Fixes #1535.
data_consume_rows_context needs to have close() called and the returned
future waited for before it can be destroyed. data_consume_context::impl
does that in the background upon its destruction.
However, it is possible that the sstable is removed before
data_consume_rows_context::close() completes in which case EBADF may
happen. The solution is to make data_consume_context::impl keep a
reference to the sstable and extend its lifetime until closing of
data_consume_rows_context (which is performed in the background)
completes.
A side effect of this change is that data_consume_context no longer
requires its user to make sure that the sstable exists as long as it is
in use, since it owns its own reference to it.
Fixes #1537.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1470222225-19948-1-git-send-email-pdziepak@scylladb.com>
(cherry picked from commit 02ffc28f0d)
"This patch series ensures we don't count dead partitions (i.e.,
partitions with no live rows) towards the partition_limit. We also
enforce the partition limit at the storage_proxy level, so that
limits with smp > 1 work correctly."
(cherry picked from commit 5f11a727c9)
* seastar f603f88...0b53ab2 (2):
> reactor: limit task backlog
> reactor: make sure a poll cycle always happens when later is called
Fix runaway task queue growth on cpu-bound loads.
This series adds the ability for the partition cache to keep information
on whether partition size makes it uncacheable. During reads, these
entries save us IO operations since we already know that the partition
is too big to be put in the cache.
First part of the patchset makes all mutation_readers allow the
streamed_mutations they produce to outlive them, which is a guarantee
used later by the code handling reading large partitions.
(cherry picked from commit d2ed75c9ff)
Inherit the alignment parameters from the underlying file instead of
defaulting to 4096. This gives better read performance on disks with 512-byte
sectors.
Fixes #1532.
Message-Id: <1470122188-25548-1-git-send-email-avi@scylladb.com>
(cherry picked from commit 9f35e4d328)
The current code assumes cell names are always compound and may
wrongly report a non-static row as such. This patch addresses this
and adds a test case to catch regressions.
Backports the fix to #1495.
get_sstables_including_compacted_undeleted() may return a temporary shared
ptr which will be destroyed before the loop if not stored locally.
Fixes #1514
Message-Id: <20160728100504.GD2502@scylladb.com>
(cherry picked from commit 3531dd8d71)
This patch enforces compatibility between a cell and the
corresponding column definition with regards to them being
static.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
The current code assumes cell names are always compound and may
wrongly report a non-static row as such, since it looks at the first
bytes of the name assuming they are the component's length.
Tables with compact storage (which cannot contain static rows) may not
have a compound comparator, so we check for the table's compoundness
before checking for the static marker. We do this by delegating to
composite_view::is_static.
Fixes #1495
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1469616205-4550-4-git-send-email-duarte@scylladb.com>
composite_view's is_static function is wrong because:
1) It doesn't guard against the composite being a compound;
2) Doesn't deal with widening due to integral promotions and
consequent sign extension.
This patch fixes this by ensuring there's only one correct
implementation of is_static, to avoid code duplication and
enforce test coverage.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1469616205-4550-2-git-send-email-duarte@scylladb.com>
If a composite is not a compound, then it doesn't carry a length
prefix where static information is encoded. In its absence, a
non-compound composite can never be static.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1469397561-7748-1-git-send-email-duarte@scylladb.com>
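A minimal sketch of the two guards described above, under stated assumptions: the function signature is illustrative (it is not the actual composite_view API), and the static marker is assumed to be a 16-bit length prefix equal to 0xFFFF, as in Cassandra's composite encoding. Casting each byte to an unsigned type before widening avoids the sign-extension problem:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative signature; the real check lives in composite_view::is_static.
// Assumption: a static composite starts with the 16-bit marker 0xFFFF.
inline bool is_static(const std::vector<std::int8_t>& bytes, bool is_compound) {
    // A non-compound composite has no length prefix, so it can never be static.
    if (!is_compound || bytes.size() < 2) {
        return false;
    }
    // Convert to unsigned *before* widening. Widening a signed byte such as
    // -1 directly to int yields -1 (sign extension), not 255.
    const auto hi = static_cast<std::uint8_t>(bytes[0]);
    const auto lo = static_cast<std::uint8_t>(bytes[1]);
    const auto prefix = static_cast<std::uint16_t>((hi << 8) | lo);
    return prefix == 0xFFFF;
}
```

Keeping both checks in one function is what lets callers such as the sstable reader delegate to a single, tested implementation.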
Keep track of every read of wide partition that's
not cached.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
(cherry picked from commit 37a7d49676)
If limit is exceeded then return the streamed_mutation
and don't cache it.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
(cherry picked from commit 98c12dc2e2)
If mutation is bigger than this limit
it won't be read and mutation_from_streamed_mutation
will return empty optional.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
(cherry picked from commit 0d39bb1ad0)
A promoted index may cause an sstable to have range tombstones duplicated
several times. These duplicates appear in the "wrong" place since they
are smaller than the entity preceding them.
This patch ignores such duplicates by skipping range tombstones that are
smaller than previously read ones.
Moreover, these duplicated range tombstones may appear in the middle of
a clustering row, so the sstable reader has also gained the ability to
merge parts of the row in such cases.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
(cherry picked from commit 08032db269)
This patch changes the column_visitor so that it preserves the order
of the partitions it visits when building the accumulation result.
This is required by verbs such as get_range_slice, on top of which
users can implement paging. In such cases, the last key returned by
the query will be the start of the range for the next query. If
that key is not actually the last in the partitioner's order, then
the new request will likely result in duplicate values being sent.
Ref #693
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1469568135-19644-1-git-send-email-duarte@scylladb.com>
(cherry picked from commit 5aaf43d1bc)
If a partition contains no static and clustering rows or range tombstones,
mp_row_consumer will return a disengaged mutation_fragment_opt with the
is_mutation_end flag set to mark the end of this partition.
The current mutation_reader::impl code incorrectly recognized a disengaged
mutation fragment as the end of the stream of all mutations. This patch
fixes that by using the is_mutation_end flag to determine whether the end of
the partition or the end of the stream was reached.
Fixes #1503.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1469525449-15525-1-git-send-email-pdziepak@scylladb.com>
(cherry picked from commit efa690ce8c)
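The distinction can be sketched with a simplified model. The types below are hypothetical (the real code deals with mutation_fragment_opt and the consumer's is_mutation_end flag); the point is only that a disengaged result must be interpreted together with the flag:

```cpp
#include <cassert>

// Hypothetical, simplified model of what the consumer hands to the reader.
struct fragment_result {
    bool engaged;          // true if a mutation fragment is present
    bool is_mutation_end;  // true if the current partition has ended
};

enum class read_event { fragment, end_of_partition, end_of_stream };

inline read_event classify(const fragment_result& r) {
    if (r.engaged) {
        return read_event::fragment;
    }
    // A disengaged result alone is ambiguous; the flag disambiguates it.
    return r.is_mutation_end ? read_event::end_of_partition
                             : read_event::end_of_stream;
}
```

An empty partition therefore shows up as `end_of_partition` right away, and the reader moves on to the next partition instead of stopping.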
I tried to start scylla-housekeeping service by:
# sudo systemctl restart scylla-housekeeping.service
But it failed because of a wrong script path. Error details:
systemd[5605]: Failed at step EXEC spawning
/usr/lib/scylla/scylla-Housekeeping: No such file or directory
The right script name is 'scylla-housekeeping'
Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <c11319a3c7d3f22f613f5f6708699be0aa6bd740.1469506477.git.amos@scylladb.com>
(cherry picked from commit 64530e9686)
The query_size_estimates() function queries the size_estimates system
table for a given keyspace and table, filtering out the token ranges
according to the specified tokens.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
(cherry picked from commit ecfa04da77)
This patch fixes stop() by checking the current CPU instead of
whether the service is active (which it won't be at the time stop() is
called).
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
(cherry picked from commit d984cc30bf)
This patch makes range_estimates a proper struct, where tokens are
represented as dht::tokens rather than dht::ring_position*.
We also pass other arguments to update_ and clear_size_estimates by
copy, since one will already be required.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
(cherry picked from commit e16f3f2969)
This patch ensures we fail when creating a mixed column family, either
when adding columns to a dynamic CF through updated_column_family() or
when adding a dynamic column upon insertion.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1469378658-19853-1-git-send-email-duarte@scylladb.com>
(cherry picked from commit 5c4a2044d5)
This patch implements the describe_splits verb on top of
describe_splits_ex.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
(cherry picked from commit ab08561b89)
This patch implements the describe_splits_ex verbs by querying the
size_estimates system table for all the estimates in the specified
token range.
If the keys_per_split argument is bigger than the
estimated partitions count, then we merge ranges until keys_per_split
is met. Note that since the tokens can't be split any further,
keys_per_split might be less than the reported number of keys in one
or more ranges.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
(cherry picked from commit 472c23d7d2)
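The merging step can be sketched roughly as below. This is an illustration only: `estimate`, its fields and `merge_until` are hypothetical names, tokens are modelled as plain integers, and the input is assumed to be adjacent ranges in token order. Adjacent ranges are merged until the accumulated partition count reaches keys_per_split; ranges are never split, so a single range may exceed it:

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of one row of size estimates: a token range and the
// estimated number of partitions in it.
struct estimate {
    long start_token;
    long end_token;
    long partitions;
};

// Merges adjacent ranges until each merged range holds at least
// keys_per_split partitions. Ranges are never split.
inline std::vector<estimate> merge_until(const std::vector<estimate>& in,
                                         long keys_per_split) {
    std::vector<estimate> out;
    estimate cur{};
    bool open = false;
    for (const auto& e : in) {
        if (!open) {
            cur = e;
            open = true;
        } else {
            cur.end_token = e.end_token;
            cur.partitions += e.partitions;
        }
        if (cur.partitions >= keys_per_split) {
            out.push_back(cur);
            open = false;
        }
    }
    if (open) {
        out.push_back(cur); // trailing remainder smaller than keys_per_split
    }
    return out;
}
```

For example, with estimates of 3, 4, 10 and 2 partitions and keys_per_split of 5, the first two ranges are merged, the third stands alone and already exceeds the requested size, and the last one is returned as a smaller remainder.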
This patch converts an exceptions::invalid_request_exception
into a Thrift InvalidRequestException instead of into a generic one.
This makes TitanDB work correctly, which expects an
InvalidRequestException when setting a non-existent keyspace.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1469362086-1013-1-git-send-email-duarte@scylladb.com>
(cherry picked from commit 2be45c4806)
This patch changes lookup_schema() so it directly calls
database::find_schema() instead of going through
database::find_column_family(). It also drops conversion of the
no_such_column_family exception, as that is already handled at a higher
layer.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
(cherry picked from commit 8991d35231)
We use ::abs(), which has an int parameter, on long arguments, resulting
in incorrect results.
Switch to std::abs() instead, which has the correct overloads.
Fixes #1494.
Message-Id: <1469347802-28933-1-git-send-email-avi@scylladb.com>
(cherry picked from commit 900639915d)
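A small sketch of why the parameter type matters. Since some standard library headers also declare wider `::abs` overloads, the narrowing is made explicit here with a cast rather than by calling `::abs` directly; the helper names are illustrative, and the narrowed result assumes the usual 32-bit two's-complement int:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Mimics passing a 64-bit value to a function declared as int abs(int):
// the argument is narrowed to int before the absolute value is taken.
inline std::int64_t abs_through_int(std::int64_t v) {
    return std::abs(static_cast<int>(v));
}

// std::abs picks the overload matching the argument type, so nothing is lost.
inline std::int64_t abs_full_width(std::int64_t v) {
    return std::abs(v);
}
```

For a value such as -(2^32 + 5), the full-width version returns 2^32 + 5, while the narrowed version only sees the low 32 bits.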
Fixes #1484.
We drop tables as part of keyspace drop. Table drop starts with
creating a snapshot on all shards. All shards must use the same
snapshot timestamp which, among other things, is part of the snapshot
name. The timestamp is generated using supplied timestamp generating
function (joinpoint object). The joinpoint object will wait for all
shards to arrive and then generate and return the timestamp.
However, we drop tables in parallel, using the same joinpoint
instance. So joinpoint may be contacted by snapshotting shards of
tables A and B concurrently, generating timestamp t1 for some shards
of table A and some shards of table B. Later the remaining shards of
table A will get a different timestamp. As a result, different shards
may use different snapshot names for the same table. The snapshot
creation will never complete because the sealing fiber waits for all
shards to signal it, on the same name.
The fix is to give each table a separate joinpoint instance.
Message-Id: <1469117228-17879-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 5e8f0efc85)
A seastar::value_of() lambda used in a trace point was doing the unthinkable:
it called std::move() on a value captured by reference. Not only did it compile(!!!),
but it also actually std::move()ed the shared_ptr before it was used in a make_result(),
which naturally caused a SIGSEGV crash.
Fixes #1491
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1469193763-27631-1-git-send-email-vladz@cloudius-systems.com>
(cherry picked from commit 9423c13419)
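The pitfall is easy to reproduce in isolation. The sketch below uses plain lambdas instead of seastar::value_of() and invented function names: moving from a by-reference capture empties the caller's shared_ptr, while a by-value capture leaves it intact:

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Returns true if the caller's pointer was emptied by the lambda.
inline bool by_reference_capture_steals() {
    auto result = std::make_shared<int>(42);
    // Captured by reference: std::move() acts on the caller's variable.
    auto describe = [&result] {
        std::shared_ptr<int> taken = std::move(result);
        return *taken;
    };
    describe();
    return result == nullptr;
}

inline bool by_value_capture_steals() {
    auto result = std::make_shared<int>(42);
    // Captured by value: the lambda works on its own copy.
    auto describe = [result] { return *result; };
    describe();
    return result == nullptr;
}
```

Any later dereference of the emptied pointer, as with the make_result() call mentioned above, is then a null pointer access.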
This test sets the time window to 1 hour and checks that the strategy
works accordingly.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit cf54af9e58)
Now date tiered compaction strategy will take into account the
strategy options which are defined in the schema.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit eaa6e281a2)
"This series includes the following:
- Introduction of a formatted message support in trace().
- Major rename: s/flush_/write_/, s/flush()/kick()/, s/store_/write_/.
- Some cosmetic fixes found on the way.
- Fix a bug in a shutdown flow.
- Instrumentation to MUTATE, PREPARE, EXECUTE and BATCH flow and some
related changes.
- A patch that aligns the QUERY tracing format with the Origin.
- Methods and functions description in tracing/trace_state.hh."
Add a proper description to a tracing::trace() that clarifies
that the tracing message string and the positional parameters
are going to be copied if tracing state is initialized.
Add a description for trace_state::begin() methods and for a
tracing::begin() helper function.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Having a trace_state_ptr in the storage_proxy level is needed to trace code bits in this level.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
- Don't put the query name as a 'request' but rather save it as one of entries in a
'params' map.
- Save some additional query parameters in 'params'.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Store the trace state in the abstract_write_response_handler.
Instrument send_mutation RPC to receive an additional
rpc::optional parameter that will contain optional<trace_info>
value.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Adding this method allows using tracing helper functions
and remove the no longer needed accessors in the query_state.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
From now on trace_state::trace() is able to receive the sprint-ready
format string with the arguments that will be applied only during
the flush event.
This patch also optimizes the way the source address is evaluated -
do it only once instead of twice if tracing is requested.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
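The idea of deferring formatting until the flush can be sketched as follows. This is not the trace_state API: `trace_session` is a hypothetical class, and instead of a format string plus arguments it stores a closure that builds the message, which gives the same effect of copying the arguments at trace time and formatting them only later:

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical class illustrating deferred message formatting.
class trace_session {
    std::vector<std::function<std::string()>> _pending;
public:
    // Records how to build the message; nothing is formatted yet.
    void trace(std::function<std::string()> make_message) {
        _pending.push_back(std::move(make_message));
    }
    // Builds all pending messages and clears the queue.
    std::vector<std::string> flush() {
        std::vector<std::string> out;
        out.reserve(_pending.size());
        for (auto& make : _pending) {
            out.push_back(make());
        }
        _pending.clear();
        return out;
    }
};
```

A caller would capture the arguments by copy, for example `s.trace([rows] { return "read " + std::to_string(rows) + " rows"; })`, so the string is only built when `flush()` runs.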
Sometimes we want to be able to set "params" map after we
started a tracing session, e.g. when the parameters values,
like a consistency level parsed from the "options" part of a binary frame,
are available only after some heavy part of a flow we would like
to trace.
This patch includes the following changes:
- No longer pass a map to the begin().
- Limit the parameters to the known set.
- Define a method to set each such parameter and save its
value till the final sstring->sstring map is created.
- Construct the final sstring->sstring map in the destructor of the trace_state
object in order to defer all the formatting to be after the traced flow.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
A backend helper has to constantly communicate with the corresponding
tracing::tracing instance. Saving a reference to the tracing::tracing instance
will save us a lot of tracing::get_local_tracing_instance() calls and thus
a lot of dereferencing.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
- Extend the i_tracing_backend_helper interface to accept the event
record timestamp.
- Grab the current timestamp when the event record is taken.
- Add the instrumentation to the trace_keyspace_helper to create a unique time-UUID
from a given std::chrono::duration object.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
This helper returns an std::experimental::optional<trace_info>
which is initialized or not initialized depending on whether
a given trace_state_ptr is initialized or not.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Add support for passing a format string plus positional parameters
for creation of a trace point message.
Format string should be given in a fmt library native format described
here: http://fmtlib.net/latest/syntax.html#syntax .
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
kick() backend during shutdown and restrict accessing a backend
after that.
Flush pending records when service is being shut down.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Prevent the destruction of tracing::tracing instances while there
are still tracing::trace_state objects that are using it:
- Make tracing::tracing inherit from seastar::async_sharded_service<tracing::tracing>.
- Grab a tracing::tracing.shared_from_this() in each
tracing::trace_state object using it.
- Use a saved pointer to the local tracing::tracing instance in a destructor
instead of accessing it via tracing::get_local_tracing_instance()
to avoid "local is not initialized" assert when sessions are
being destroyed after the service was stopped.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
In names of functions and variables:
s/flush_/write_/
s/store_/write_/
In a i_tracing_backend_helper:
s/flush()/kick()/
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
"This patchset implements the size_estimates_recorder, which periodically
writes estimations for all the non-system column families in the
size_estimates system table. This table is updated per schema with a set
of token ranges and the associated estimations of how many partitions
there are and their mean size.
Fixes #352"
This patch implements the size_estimates_recorder, which periodically
writes estimations for all the non-system column families in the
size_estimates system table. The size_estimates_recorder class
corresponds to the one in Cassandra's SizeEstimatesRecorder.java.
Estimation is carried out by shard 0. Since we're estimating based on
data in shared sstables, having multiple shards doing this would skew
the results.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch implements functions that allow the size_estimates system
table to be updated and cleared. The size_estimates table is updated
per schema with a set of token ranges and the associated estimations
of how many partitions there are and their mean size.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds a utility function that allows fetching the set of
column_families that do not belong to the system keyspace.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch allows a set of a column_family's sstables to be
selected according to a range of ring_positions.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch makes it so that the template arguments of
range<T>::transform are more easily deducible by the compiler.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
That was a bug in the test itself. It could happen that an sstable would
incorrectly belong to the next time window if the current minute is
approaching its end. The fix is to have all sstables that we want in
the same time window use the same min/max timestamp.
Fixes #1448.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <ee25d49e7ed12b4cf7d018a08163404c3d122e56.1468782787.git.raphaelsc@scylladb.com>
This patch avoids moving entries from range tombstones and clustering
rows sets in streamed_mutation_from_mutation(). Such action breaks these
sets as the entries will be left in some unknown state.
Instead, the sets are being broken in a supported and predictable way
using unlink_leftmost_without_rebalance().
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1468843205-18852-1-git-send-email-pdziepak@scylladb.com>
* seastar d699205...a45823a (5):
> rpc: do not call shutdown function on already closed fd
> log: Do not crash if logger is invoked from non-reactor thread
> rpc: remove unaligned_cast and reinterpret_cast uses
> unaligned: note unaligned_cast<> is deprecated
> byteorder: add unaligned read/write helpers
Fixes #1463.
This patch adds authorization to the DDL thrift verbs. Since checking
for authorization is asynchronous, we now need to copy the verb
arguments so they can be accessed from the continuations.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This function is similar to has_column_family_access, but skips
validating if the specified keyspace and column family names map to a
valid schema, as it already takes one as an argument.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch transforms the mutation map, a map of keys to a map of column
families to mutations, into a map of column families to a map of keys
to mutations. This is a more natural organization, as things
like checking access permissions are done by column family.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
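The regrouping can be sketched with standard containers. The aliases below are illustrative stand-ins (strings instead of Thrift keys and mutations), and `group_by_column_family` is a hypothetical helper, not the function added by the patch:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

using mutation_list = std::vector<std::string>; // stand-in for Thrift mutations
// key -> column family -> mutations
using by_key = std::map<std::string, std::map<std::string, mutation_list>>;
// column family -> key -> mutations
using by_cf = std::map<std::string, std::map<std::string, mutation_list>>;

inline by_cf group_by_column_family(const by_key& input) {
    by_cf out;
    for (const auto& key_entry : input) {
        for (const auto& cf_entry : key_entry.second) {
            // Swap the two map levels: the column family becomes the outer key.
            out[cf_entry.first][key_entry.first] = cf_entry.second;
        }
    }
    return out;
}
```

With this layout, a per-column-family step such as a permission check runs once per outer entry rather than once per key.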
This is a wrapper around with_cob, which fetches a schema and forwards
it to a supplied function.
The patch also removes superfluous return instructions.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch validates that a user is correctly logged in (if
authentication is required) for the required verbs.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
The scylla-server.service will try to start scylla-housekeeping.
This patch adds a question to scylla_setup asking whether to enable the version
check; if the answer is no, scylla-housekeeping will be masked.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1468741129-1977-1-git-send-email-amnon@scylladb.com>
Move the to_bytes_view(temporary_buffer<char>) function from the source file
to a header file where it can be used in more places.
This saves one use of reinterpret_cast (which we are now re-evaluating),
and moreover, we want to use this function also in the promoted index
code (to return a bytes_view from the promoted index which was saved as a
temporary_buffer).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1468761437-27046-1-git-send-email-nyh@scylladb.com>
* seastar 5e97d5f...d699205 (3):
> rpc: fix race between send loop and expiration timer
> rpc: fix cancellable type move operations
> reactor: create new files with a more reasonable default mode
Range queries need to take special care when transitioning between
ranges that are read from sstables and ranges that are already in the
cache.
The original code in such a case just started a secondary reader and told it
to unconditionally mark the last entry as continuous (the primary reader has
already returned an element that immediately follows the range that is
going to be read from sstables).
However, that information may get stale. For instance, by the time the
secondary reader finishes reading its range, the element immediately
following it may get evicted from the cache, thus causing the continuity flag
to be incorrectly set.
The solution is to ensure that the element immediately after the range
read from sstables is still in the cache.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1468586893-15266-1-git-send-email-pdziepak@scylladb.com>
From Duarte:
This patchset adds support for the data manipulation verbs. It defers support
for super columns and mixed CFs (a static CF treated as dynamic) to later
patchsets.
Everything is done on top of storage_proxy; it was only necessary to modify the
layers below to add support for different kinds of limits: per partition row
limit, which corresponds to limiting the number of columns returned when
querying a dynamic CF, and limit on the number of partitions returned, so that
we can emulate the one thrift row per key model when querying dynamic CFs.
Ref #399
By default, the schema is marked as compound regardless of the
comparator. Since a composite comparator for static CFs is currently
unsupported (otherwise thrift column families would be
indistinguishable from CQL ones), just mark them as non-compound.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch prevents CQL3 column families from being returned to
clients or subject to updates from thrift.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
The get_multi_slice verb is used to perform multiple slices on a
single row key in one operation. It takes a set of column_slices,
which we normalize to not contain any overlapping ranges.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds the deoverlap function to range.hh, which takes in a
vector of possibly overlapping ranges and returns a vector of
non-overlapping ranges covering the same values.
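The idea behind deoverlap can be sketched as below. This is only an
illustration under simplifying assumptions: interval_sketch and
deoverlap_sketch are hypothetical names, and the sketch handles closed
integer intervals only, whereas the function in range.hh works on the
generic range type.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical closed interval [first, second] over ints; a stand-in
// for the generic range type, not the type used in range.hh.
using interval_sketch = std::pair<int, int>;

// Sort by start bound, then fold every interval that overlaps (or
// touches) the last emitted interval into it.
std::vector<interval_sketch> deoverlap_sketch(std::vector<interval_sketch> in) {
    std::sort(in.begin(), in.end());
    std::vector<interval_sketch> out;
    for (const auto& r : in) {
        assert(r.first <= r.second);  // malformed intervals are not handled
        if (!out.empty() && r.first <= out.back().second) {
            out.back().second = std::max(out.back().second, r.second);
        } else {
            out.push_back(r);
        }
    }
    return out;
}
```

Sorting first means each interval only ever needs to be compared with
the most recently emitted one.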
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
The get_paged_slice verb is similar to the get_range_slices verb,
except that it doesn't take a SlicePredicate. Instead, it takes a
column from which to start the query.
For dynamic CFs, we use the partition_slice::specific_ranges to single
out the first partition, and query starting from the start_column row.
For static CFs, we issue an initial query to fetch the remainder of
columns from the first partition, and at least one more query to fetch
the subsequent columns until the limit is reached. This implies a
performance penalty for static CFs.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
The get_range_slices verb is similar to the multiget_slice verb,
except that it operates on a range of partition keys (or tokens).
In origin, empty partitions are returned as part of the KeySlice, for
which the key will be filled in but the columns vector will be empty.
Since in our case we don't return empty partitions, we don't know which
partition keys in the specified range we should return back to the client.
So for now, our behavior differs from Origin.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch implements the multiget_count verb in a similar fashion as
multiget_slice, but using an accumulator that counts the returned
columns instead of creating thrift ColumnOrSuperColumn objects.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch builds a query::read_command from a SlicePredicate,
for both dynamic and static column families.
For dynamic CFs, restrictions on the clustering columns are added, and
for static CFs, limits and ordering are defined inline by selecting the
correct regular columns.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds support to send a cell's ttl as part of a query's
result. This is needed for thrift support.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds the is_dynamic() function to thrift_schema, which
tells whether the underlying column family is dynamic or not,
according to thrift rules.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds support for composite comparators (which, for dynamic
column families, means composite clustering keys) and for composite
keys (composite partition keys).
Support for composite column names and regular columns is deferred,
which will entail making compound_type an abstract_type.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
"This series replaces the original scylla-help.py
It contains only a basic script that checks daily for the version and
reports if a newer version is found.
The script is added as a service and will be started and shut down with
scylla-server."
Currently, for any column family, we create a directory for it in all
keyspace directories. This is incredibly awkward.
Fix by iterating over just the keyspace's column families, not all
column families in existence.
Fixes #1457.
Message-Id: <1468495182-18424-1-git-send-email-avi@scylladb.com>
The check version script uses the python requests package; this adds the
dependency to the ubuntu package.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Ubuntu 14.04 upstart does not support timers for recurrent operations.
The upstart cookbook suggests a way to mimic this functionality here:
http://upstart.ubuntu.com/cookbook/#run-a-job-periodically
This patch adds a service that runs the house-keeping daily.
Setting it up as a service ensures that it starts and stops with the
scylla-server service.
filter_for_query() gets a list of endpoints sorted by preference and
should preserve that order after filtering out non-local endpoints for
a local query. partition() does not guarantee this while
stable_partition() does, so use it instead.
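A minimal sketch of why the stable variant matters here. endpoint_sketch
and filter_local_sketch are hypothetical names, and the exact shape of
filter_for_query() is not reproduced; only the ordering guarantee of
std::stable_partition is shown.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical endpoint record, not Scylla's endpoint type.
struct endpoint_sketch {
    std::string name;
    bool local;
};

// Keep only local endpoints, preserving the incoming preference order.
// std::partition would also move the local endpoints to the front, but
// it is allowed to reorder them relative to each other.
std::vector<std::string> filter_local_sketch(std::vector<endpoint_sketch> eps) {
    auto first_remote = std::stable_partition(eps.begin(), eps.end(),
        [](const endpoint_sketch& e) { return e.local; });
    eps.erase(first_remote, eps.end());
    std::vector<std::string> names;
    for (const auto& e : eps) {
        names.push_back(e.name);
    }
    return names;
}
```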
Fixes #1450.
Message-Id: <20160713100909.GM10767@scylladb.com>
From Paweł:
This is another episode in the "convert X to streamed mutations" series.
Hashing mutations (mainly for repair) is converted so that it doesn't
need to rebuild whole mutation.
The first part of the series changes the way streamed mutations deal
with range tombstones. Since it is not necessary to make sure we write
disjoint tombstones to sstables there is no need anymore for streamed
mutations to produce disjoint tombstones and, consequently, no need for
range tombstones to be split into range_tombstone_begin and
range_tombstone_end.
The second part is the actual hashing implementation. However, to ensure
that the hash depends only on the contents of the mutation and not on the
way it is stored in different data sources, range tombstones have to be
made disjoint before they are hashed.
This series also ensures that any changes caused by streamed mutations
to hashing and streaming do not break repair during upgrade.
This patch makes hashing for repair calculate checksums in a way that
doesn't require rebuilding whole mutation.
Unfortunately, such checksums are incompatible with the old ones so the
old way for computing checksums is preserved for compatibility reasons.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
The receiving side needs to handle fragmented mutations properly so that
isolation guarantees are not broken. If the receiving node may be an old
one do not fragment mutations.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
mutation_hasher is a consumer of streamed_mutation that feeds its data
to a specified hasher.
It is not compatible with hashing_partition_visitor.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Originally, streamed_mutations guaranteed that emitted tombstones are
disjoint. In order to achieve that two separate objects were produced
for each range tombstone: range_tombstone_begin and range_tombstone_end.
Unfortunately, this forced sstable writer to accumulate all clustering
rows between range_tombstone_begin and range_tombstone_end.
However, since there is no need to write disjoint tombstones to sstables
(see #1153 "Write range tombstones to sstables like Cassandra does") it
is also not necessary for streamed_mutations to produce disjoint range
tombstones.
This patch changes that by making streamed_mutation produce
range_tombstone objects directly.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
range_tombstone::flip() flips range bounds. This is necessary in order
to use range tombstone in reversed mutation fragment streams.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
range_tombstone_accumulator is a helper class that allows determining
tombstone for a clustering row when range tombstones and clustering rows
are streamed from streamed_mutation.
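The role of such an accumulator can be sketched as follows. All names
(rt_sketch, accumulator_sketch, no_tombstone) are hypothetical, clustering
positions are reduced to ints, and the interface is an assumption for
illustration rather than the one added by this patch.

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// Hypothetical range tombstone covering clustering positions [start, end].
struct rt_sketch {
    int start;
    int end;
    long timestamp;
};

constexpr long no_tombstone = std::numeric_limits<long>::min();

class accumulator_sketch {
    std::vector<rt_sketch> _active;
public:
    // Tombstones are expected in stream order, interleaved with rows.
    void apply(const rt_sketch& t) { _active.push_back(t); }

    // Returns the newest deletion timestamp covering the row at pos, or
    // no_tombstone. Rows must be passed in increasing position order.
    long tombstone_for_row(int pos) {
        // A tombstone that ended before this row cannot cover later rows.
        _active.erase(std::remove_if(_active.begin(), _active.end(),
            [pos](const rt_sketch& t) { return t.end < pos; }), _active.end());
        long newest = no_tombstone;
        for (const auto& t : _active) {
            if (t.start <= pos) {
                newest = std::max(newest, t.timestamp);
            }
        }
        return newest;
    }
};
```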
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
range_tombstone::apply() allows merging two, possibly overlapping, range
tombstones with the same start bound and produces one or two disjoint
range tombstones as a result.
It is intended to be used for merging tombstones coming from different
sources.
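One plausible reading of the merge described above, as a sketch. The
names tombstone_sketch and apply_sketch are hypothetical, bounds are
half-open ints, and the rule that the newer timestamp wins on the
overlapping part is an assumption, not taken from the patch.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical range tombstone over the half-open range [start, end).
struct tombstone_sketch {
    int start;
    int end;
    long timestamp;
};

// Merge two tombstones that share a start bound into one or two
// disjoint tombstones; the newer timestamp wins where they overlap.
std::vector<tombstone_sketch> apply_sketch(tombstone_sketch a, tombstone_sketch b) {
    assert(a.start == b.start);
    if (a.end > b.end) {
        std::swap(a, b);  // from here on, a is the shorter (or equal) one
    }
    if (b.timestamp >= a.timestamp) {
        return {b};       // the longer tombstone shadows the shorter one
    }
    if (a.end == b.end) {
        return {a};       // same range, a is newer
    }
    // a wins on the shared prefix, b survives on the remainder.
    return {a, tombstone_sketch{a.end, b.end, b.timestamp}};
}
```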
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
The sstable parsing code calls mp_row_consumer::flush() after every
clustering row has been read, and this puts the now complete row in a single
field "_ready". The assumption is that at this point parsing will stop, the
consumer will move out this _ready (mp_row_consumer::get_mutation_fragment())
and when flush() is later called again, _ready will be empty again.
This assumption is correct in our code, but is based on an intricate
combination of esoteric parts of the code, such as:
1. In data_consume_row_context we stop parsing after reading the partition's
header, before reading any clustering rows, giving the caller the chance
to call sstable_streamed_mutation::read_next() to be prepared for the
incoming mutations.
2. In mp_row_consumer::flush_if_needed(), we stop the parser after each
individual clustering row.
It is easy to break this assumption, and I did this in one of my code changes,
and the result was silent loss of clustering rows, as "_ready" got silently
overwritten before the reader had a chance to move it out.
What this patch does is to add an assertion: If a clustering row is silently
lost before being transferred to the mutation fragment reader, we croak.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1468389955-24600-1-git-send-email-nyh@scylladb.com>
This patch fixes a regression introduced in
f81329be60, which made keys compound by
default when using a particular ctor, in turn leading to mismatches
when comparing the same key built with functions that properly
consider compoundness.
As a temporary fix, the sstable::key and sstable::key_view classes
store raw bytes instead of a composite.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1468339295-3924-1-git-send-email-duarte@scylladb.com>
Since the timestamp is not serialized, it must always be the last
parameter of query::read_command. This patch reorders it with the
partition_limit parameters and updates callers that specified a
timestamp argument.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1468312334-10623-1-git-send-email-duarte@scylladb.com>
This makes scylla-server try to start scylla-housekeeping.
Failing to start the service will not interfere with the scylla-server
start.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
From Duarte:
This patchset adds a representation of a legacy composite
value to compound_compat.hh and replaces the one in
sstables/key.hh. This patchset is needed for the thrift series.
The scylla housekeeping service is responsible for recurrent tasks.
It is currently set to run daily and report if the version is correct.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
scylla-housekeeping is a script that checks for and reports hardware and software issues.
In its first phase it checks for a newer version and reports if the
running version is old.
To see the available options run
scylla-housekeeping help
We have imported most of our data about config options from Cassandra. Due to
that, many options that mention the database by name are still using
"Cassandra".
Especially for the user-visible options, which are something that a user sees, we
should really be using Scylla here.
This patch was created by automatically replacing every occurrence of "Cassandra"
with "Scylla" and then discarding the ones in which the change didn't
make sense (such as Unused options and mentions of the Cassandra documentation).
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <1423e1d7e36874a1f46bd091aec96dcb4d8482d9.1468267193.git.glauber@scylladb.com>
The sstables::key class now delegates much of its functionality
to the composite class. All existing behavior is preserved.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds support for parsing legacy compound values by
introducing the composite class, a wrapper around a sequence of bytes
serialized in the legacy format for compounds. Compound values can be
sent through the thrift API.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
* seastar 9267dfa...e7a7d41 (3):
> Merge "Compression support for RPC" from Gleb
> reactor: allow sleeping while disk aio is pending
> sstring: add resize method
Continuation reordering could cause us to repeatedly see the
segment-local flag var even though actual write/sync ops are done.
Can cause wild recursion without actual delayed continuation ->
SOE.
Fix by also checking queue status, since this is the wait object.
Message-Id: <1468234873-13581-1-git-send-email-calle@scylladb.com>
centos-master jenkins job failed at building libgo, but we don't need go language, so let's disable it on scylla-gcc package.
Also we never use ada, disable it too.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1468166660-23323-1-git-send-email-syuu@scylladb.com>
"Checking bloom filters of sstables to compute max purgeable timestamp
for compaction is expensive in terms of CPU time. We can avoid
calculating it if we're not about to GC any tombstone.
This patch changes compacting functions to accept a function instead
of ready value for max_purgeable.
I verified that bloom filter operations no longer appear on flame
graphs during compaction-heavy workload (without tombstones).
Refs #1322."
Checking bloom filters of sstables to compute max purgeable timestamp
for compaction is expensive in terms of CPU time. We can avoid
calculating it if we're not about to GC any tombstone.
This patch changes compacting functions to accept a function instead
of ready value for max_purgeable.
I verified that bloom filter operations no longer appear on flame
graphs during compaction-heavy workload (without tombstones).
Refs #1322.
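The lazy evaluation described above can be sketched like this.
cell_sketch and count_purgeable_sketch are hypothetical names, the
comparison against max_purgeable is a simplification (gc grace period is
ignored), and the callback stands in for the expensive bloom filter
checks.

```cpp
#include <cstddef>
#include <functional>
#include <optional>
#include <vector>

// Hypothetical flattened cell: either live data or a tombstone.
struct cell_sketch {
    bool is_tombstone;
    long timestamp;
};

// Counts tombstones that may be dropped. get_max_purgeable is invoked
// at most once, and only if a tombstone is actually encountered.
std::size_t count_purgeable_sketch(const std::vector<cell_sketch>& cells,
                                   const std::function<long()>& get_max_purgeable) {
    std::optional<long> max_purgeable;
    std::size_t purged = 0;
    for (const auto& c : cells) {
        if (!c.is_tombstone) {
            continue;
        }
        if (!max_purgeable) {
            max_purgeable = get_max_purgeable();
        }
        if (c.timestamp < *max_purgeable) {
            ++purged;
        }
    }
    return purged;
}
```

With a workload that has no tombstones the callback is never run, which
matches the observation about the flame graphs.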
memtable_list::seal_on_overflow() is called on each mutation to check
if current memtable should be flushed. It will call
memtable_list::seal_active_memtable() when that is the case.
The number of concurrent seals is guarded by a semaphore, starting
from commit 0f64eb7e7d, and allows
at most 4 of them.
If there are 4 flushes already pending, every incoming mutation will
enqueue a new flush task on the semaphore's wait list, without waiting
for it. The wait queue can grow without bounds, eventually leading to
out-of-memory.
The fix is to seal the memtable immediately to satisfy should_flush()
condition, but limit concurrency of actual flushes. This way the wait
queue size on the semaphore is limited by memtables pending a flush,
which is fairly limited.
Message-Id: <1467997652-16513-1-git-send-email-tgrabiec@scylladb.com>
With so many consumer concepts out there, it is confusing to name
parameters using the generic "Consumer" name; let's name them after
(already defined) concepts: CompactedMutationsConsumer, FlattenedConsumer.
Previously, same function was used to handle both regular compaction
and cleanup requests. That's bad because a lot of conditions were
added for both compaction types to live in the same function.
Now, cleanup and regular compaction will live in different functions.
They share a lot of code, so helper functions were introduced.
This change is also important for user-initiated compaction that
will go through compaction manager in the future.
Code is also a lot easier to read now.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
There is no longer a need to use gate for regular termination of
fiber that runs compaction. Now, we only set task->stopping to
true, ask for compaction termination, and wait for its future to
resolve. Code is simplified a lot with this change.
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Enable the --partitioner option so that the user can choose a partitioner
other than the default Murmur3Partitioner. Currently, only Murmur3Partitioner
and ByteOrderedPartitioner are supported. When an unsupported partitioner
is specified, an error will be propagated to the user.
In order to support ByteOrderedPartitioner, we need to implement the
missing describe_ownership and midpoint functions in the
byte_ordered_partitioner class.
As a starter, this patch uses a simple method based on node token distance
to calculate ownership. C* uses a complicated method based on key samples.
We can switch to what C* does later.
Tests are added to tests/partitioner_test.cc.
Fixes #1378
If a node fails to talk to any seed node, shadow round will fail. We
should exit shadow round state before we continue.
This issue is spotted by
consistency_test.TestConsistency.data_query_digest_test dtest.
Message-Id: <ba0613532a69bac369ca316ab61d907b320c8e68.1467963674.git.asias@scylladb.com>
As Nadav notes we use the chunk length as the buffer size for the compressed
stream too.
Fix by using it only for the outer (uncompressed) stream; the inner
(compressed) stream uses the sstable buffer size, 128 kiB.
Fixes #1402.
Message-Id: <1467910556-5759-1-git-send-email-avi@scylladb.com>
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Support for streaming of large partitions from Paweł:
This series converts streaming to streaming_mutations so that there
is no need to store the full mutation in memory in order to send or receive
it.
The first several patches add a way of estimating mutation fragment
memory usage and introduce fragment_and_freeze() which produces
a stream of reasonably sized frozen mutations from a single streamed
mutation.
The second part of this patchset makes sure that streaming mutations
in fragments doesn't break isolation guarantees. This is achieved by
delaying visibility of sstables produced by streaming until the
streaming is completed. However, our current receiving code merges
mutations from all streaming plans together thus making it impossible
to track which data was received from a particular streaming plan.
The solution to that problem is to introduce an additional flag to
STREAM_MUTATION verb which informs the receiver whether the mutation
is fragmented and care must be taken to preserve isolation. Small
mutations are handled as before, with writes from different stream
plans coalesced, while big mutations are handled separately for each
streaming task.
Commit 206955e4 "streaming: Reduce memory usage when sending mutations"
moved streaming mutation limiter from do_send_mutations() to
send_mutations(). The reason for that was that send_mutation() did full
mutation copies. That's no longer the case and streaming limiter should
be moved back to do_send_mutation() in order to provide back pressure to
fragment_and_freeze().
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
If mutations are fragmented during streaming, special care must be
taken so that isolation guarantees are not broken.
Mutations received with the flag "fragmented" set are applied to a memtable
that is used only by that particular streaming task, and the sstables
created by flushing such memtables are not made visible until the task
is complete. Also, in case the streaming fails, all data is dropped.
This means that fragmented mutations cannot benefit from coalescing of
writes from multiple streaming plans, hence separate way of handling
them so that there is no loss of performance for small partitions.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
plan_id is needed to keep track of the origin of mutations so that if
they are fragmented all fragments are made visible at the same time,
when that particular streaming plan_id completes.
Basically, each streaming plan that sends big (fragmented) mutations is
going to have its own memtables and a list of sstables which will get
flushed and made visible when that plan completes (or dropped if it
fails).
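The bookkeeping described above can be sketched as follows.
stream_visibility_sketch is a hypothetical class, plan ids are reduced
to ints and sstables to names; it only illustrates the rule that data
becomes visible on completion and is dropped on failure.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

class stream_visibility_sketch {
    // sstables flushed for a plan that has not completed yet
    std::map<int, std::vector<std::string>> _pending;
    // sstables that readers are allowed to see
    std::vector<std::string> _visible;
public:
    void add_flushed(int plan_id, std::string sstable) {
        _pending[plan_id].push_back(std::move(sstable));
    }
    // Publish all sstables of the plan at once.
    void complete(int plan_id) {
        auto it = _pending.find(plan_id);
        if (it == _pending.end()) {
            return;
        }
        for (auto& s : it->second) {
            _visible.push_back(std::move(s));
        }
        _pending.erase(it);
    }
    // Drop everything the failed plan has produced so far.
    void fail(int plan_id) { _pending.erase(plan_id); }
    const std::vector<std::string>& visible() const { return _visible; }
};
```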
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
When flush_streaming_mutations() is called at the end of streaming it is
supposed to flush all data and then invalidate cache ranges. However, if
there are already some memtable flushes in progress it won't wait for them.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
The purpose of this patch is to split the actions of writing sstable and
sealing it. As long as the sstable is unsealed it is considered
incomplete and is going to be removed on reboot.
Such functionality is needed in order to defer visibility of sstables
created during streaming until the streaming is complete.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
* seastar c82c36f...9267dfa (6):
> app_template: Make run() wait for func when reactor exit is triggered externally
> core: Introduce futurize_apply() helper
> rpc: make unexpected eof messages more informative
> Fix boost version check
> reactor: more fix for smp poll with older boost
> reactor: fix build on older boost due to spsc_queue::read_available()
2a46410f4a changed sstable_list from a map
to a set, so it is no longer sorted by generation. The code for finding
the list of sstables not being compacted relied on this sort order, and
now broke, returning a longer list than needed (including some of the
sstables being compacted). As a result, the compaction code preserved
the tombstones, incorrectly thinking there was still live data they
referenced.
Fix by sorting the set explicitly.
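A sketch of why the sort order matters, assuming the lookup is a
sorted-range set difference (the message above does not name the
algorithm). not_compacting_sketch is a hypothetical helper and sstables
are represented by their generation numbers only.

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

std::vector<long> not_compacting_sketch(std::vector<long> all,
                                        std::vector<long> compacting) {
    // std::set_difference requires both inputs to be sorted with the
    // same ordering; on unsorted input it can return extra elements.
    std::sort(all.begin(), all.end());
    std::sort(compacting.begin(), compacting.end());
    std::vector<long> out;
    std::set_difference(all.begin(), all.end(),
                        compacting.begin(), compacting.end(),
                        std::back_inserter(out));
    return out;
}
```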
Fixes #1429.
Message-Id: <1467793026-6571-1-git-send-email-avi@scylladb.com>
"After this patchset, date tiered compaction strategy is supported by Scylla.
For those who don't know what it is about, the following article may help:
https://labs.spotify.com/2014/12/18/date-tiered-compaction/
It's also nicely explained here by our wiki page:
https://github.com/scylladb/scylla/wiki/SSTable-compaction#date-tiered-compaction
Basically, date tiered strategy was developed to help the database perform better
when facing a time series workload. Date tiered strategy will work to keep data
written at nearly the same time together, such that the number of relevant sstables
for a time-based query is relatively low. We still lack support for filtering out
sstables based on time parameters of a query, but that feature should come ASAP.
The following dtests now pass:
compaction_test.py:TestCompaction_with_DateTieredCompactionStrategy.compaction_delete_test
compaction_test.py:TestCompaction_with_DateTieredCompactionStrategy.compaction_strategy_switching_test
Used cassandra-stress with the parameter '-schema compaction\(strategy=DateTieredCompactionStrategy\)'
to check stability.
Fixes #511."
"Issue 1195 describes a scenario with a fairly easy reproducer in which
we can freeze the database. That involves writing simultaneously to
multiple CFs, such that the sum of all the memory they are using is larger
than the dirty memory limit, without any of them individually being
larger than the memtable size.
This patchset rewrites the throttling code, including now active flushes
so that this situation cannot happen.
Fixes #1195"
This commit is basically about converting Java to C++.
Date tiered compaction strategy isn't wired yet.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Strongly based on org.apache.cassandra.db.compaction.
CompactionController.getFullyExpiredSSTables.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
We weren't updating max local deletion time for cells that contain
ttl, or for tombstone cells.
If there is a live cell with no ttl, then max local deletion time
is supposed to store maximum value, which means that the sstable
will not be fully expired later on.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
File can be found at the following C* directory:
src/java/org/apache/cassandra/db/compaction
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Issue 1195 describes a scenario with a fairly easy reproducer in which we can
freeze the database. That involves writing simultaneously to multiple CFs, such
that the sum of all the memory they are using is larger than the dirty memory
limit, without any of them individually being larger than the memtable size.
Because we will never reach the individual memtable seal size for any of them,
none of them will initiate a flush, leading the database to a halt.
The LSA has now gained infrastructure that allows us to be notified when pressure
conditions mount. What we will do in this case is initiate a flush ourselves.
Fixes #1195
Signed-off-by: Glauber Costa <glauber@scylladb.com>
In the spirit of what we are doing for the read semaphore, this patch moves
system writes to its own dirty memory manager. Not only will it make sure that
system tables will not be serialized by its own semaphore, but it will also put
system tables in its own region group.
Moving system tables to its own region group has the advantage that system
requests won't be waiting during throttle behind a potentially big queue of user
requests, since requests are tended to in FIFO order within the same region
group. However, system tables being more controlled and predictable, we can
actually go a step further and give them some extra reservation so they may not
necessarily block even if under pressure (up to 10 MB more).
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We currently have a semaphore in the column family level that protects us against
multiple concurrent sstable flushes. However, storing that semaphore into the CF,
not the database, was a (implementation, not design) mistake.
One comment in particular makes it quite clear:
// Ideally, we'd allow one memtable flush per shard (or per database object), and write-behind
// would take care of the rest. But that still has issues, so we'll limit parallelism to some
// number (4), that we will hopefully reduce to 1 when write behind works.
So I aimed for the shard, but ended up coding it into the CF because that's closer to the
flush point - my bad.
This patch fixes this while paving the way for active reclaim to take place. It wraps the semaphore
and the region group in a new structure, the dirty_memory_manager. The immediate benefit is that we
don't need to be passing both the semaphore and the region group downwards in the DB -> CF path. The
long term benefit is that we now have one unified structure that can hold shared flush data in all
of the CFs.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The LSA memory pressure mechanism will let us know which region is the best
candidate for eviction when under pressure. We need to somehow then translate
region -> memtable -> column family.
The easiest way to convert from region to memtable is to have memtable inherit
from region. Although this requires multiple inheritance, which
always raises a bit of a flag, the other class we inherit from is
enable_shared_from_this, which has a very simple and well defined interface. So
I think it is worth doing.
Once we have the memtable, grabbing the column family is easy provided we have a
database object. We can grab it from the schema.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The LSA infrastructure, through the use of its region groups, now has
a throttler mechanism built-in. This patch converts the current throttlers
so that the LSA throttler is used instead.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
"This series contains remaining changes necessary to safely enable read
ahead of sstables. Basically, it makes sure that input_streams are
always properly closed (even in case of exception during read)."
At the moment, it's not possible to know how many compactions are needed for
the compaction strategy to be satisfied. The exact number of pending
compactions cannot be known, but the strategy can provide an estimate.
For size tiered, it's based on the number of sstables in each bucket. By dividing
bucket size by max threshold, we get the number of compactions needed to compact
that single bucket.
For leveled, it's about the number of sstables that exceed the limit in
each level.
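The size-tiered estimate can be sketched as below.
estimated_pending_sketch is a hypothetical function; rounding the
division up and skipping buckets smaller than min_threshold are
assumptions, since the message above mentions only the division by max
threshold.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// bucket_sizes[i] is the number of sstables in size-tiered bucket i.
std::size_t estimated_pending_sketch(const std::vector<std::size_t>& bucket_sizes,
                                     std::size_t min_threshold,
                                     std::size_t max_threshold) {
    assert(max_threshold > 0);
    std::size_t pending = 0;
    for (std::size_t size : bucket_sizes) {
        // Assumption: a bucket below min_threshold is not compacted at all.
        if (size >= min_threshold) {
            // One compaction takes at most max_threshold sstables; round up.
            pending += (size + max_threshold - 1) / max_threshold;
        }
    }
    return pending;
}
```

For example, with min_threshold 4 and max_threshold 32, buckets of 4, 33
and 2 sstables contribute 1, 2 and 0 compactions respectively.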
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <e209e52f6159ee274a8358b69961a7c0ce357f7d.1467667054.git.raphaelsc@scylladb.com>
Checking features for a seed node is a bit more complicated than for a
non-seed node, because a non-seed node can always talk to at least one seed
node, while a seed node may not.
In this patch, we distinguish a new cluster from an existing cluster by
checking if the system table is empty. We relax the feature check for a
new cluster because the feature check is mostly useful when upgrading an
existing cluster, to prevent an old node from joining a new cluster.
When talking to a seed node fails during the check, we fall back to the
check using features stored in the system table. This makes restarting a
seed node when no other seed node is up possible (no other seed node at
all, or other seed node is not up yet).
I tested the following scenarios.
1) start a completely new seed node in a new cluster
* system table is empty, skip the check.
2) start a cluster, restart one seed node, at least one other seed node
is up
* system table is not empty, check with shadow round, shadow round will
  succeed
3) start a cluster, restart one seed node, no other seed node is up
* system table is not empty, check with shadow round, shadow round will
  fail, fallback to system table check.
4) start a cluster, shutdown all the nodes, start one seed node with new
ip address, seed list in yaml is updated with new ip address
* system table is not empty, check with shadow round, shadow round will
  fail, fallback to system table check
In 3a36ec33db (gossip: Wait longer for seed node during boot up), we
increased the timeout by a factor of 60, i.e., ring_delay * 60 = 5
seconds * 60 = 5 minutes.
In 57ee9676c2 (storage_service: Fix default ring_delay time), we fixed
the default ring_delay to 30 seconds. Now the timeout is 30 * 60 seconds
= 30 minutes, which is too long.
Make it 5 minutes.
If someone tried to naively use utf8_type->decompose("18wX"), this would
mysteriously fail, returning an empty key.
decompose takes a data_value, so the compiler looked for an implicit
conversion from the string constant (const char*) to data_value. We did
not have such a conversion, only conversion from sstring. But the compiler
chose (backed by the C++ standard, no doubt) to implicitly convert the
const char* to a bool (!), and then use data_value(bool). It did not
convert the const char* to an sstring, nor did it warn about the possible
ambiguity.
So this patch adds a data_value(const char*) constructor, so people will
not fall into the same trap that I fell into...
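The trap can be reproduced in isolation with the sketch below. The types
are hypothetical stand-ins (std::string replaces sstring, and the structs
only record which constructor ran); they are not data_value itself.

```cpp
#include <string>

enum class kind_sketch { boolean, text };

// Before the patch: only bool and string constructors exist. A string
// literal reaches bool through standard conversions alone, while the
// string constructor would need a second user-defined conversion, which
// is not allowed in an implicit conversion.
struct value_before_sketch {
    kind_sketch kind;
    value_before_sketch(bool) : kind(kind_sketch::boolean) {}
    value_before_sketch(std::string) : kind(kind_sketch::text) {}
};

// After the patch: the const char* constructor is a better match than
// the pointer-to-bool conversion.
struct value_after_sketch {
    kind_sketch kind;
    value_after_sketch(bool) : kind(kind_sketch::boolean) {}
    value_after_sketch(std::string) : kind(kind_sketch::text) {}
    value_after_sketch(const char*) : kind(kind_sketch::text) {}
};

// Stand-ins for a function taking the value type by const reference.
kind_sketch decompose_before(const value_before_sketch& v) { return v.kind; }
kind_sketch decompose_after(const value_after_sketch& v) { return v.kind; }
```

Calling decompose_before("18wX") silently selects the bool constructor,
whereas decompose_after("18wX") selects the const char* one.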
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1467643462-6349-1-git-send-email-nyh@scylladb.com>
In a leveled column family, there can be many thousands of sstables, since
each sstable is limited to a relatively small size (160M by default).
With the current approach of reading from all sstables in parallel, cpu
quickly becomes a bottleneck as we need to check the bloom filter for each
of these sstables.
This patch addresses the problem by introducing a
compaction-strategy-specific data structure for holding sstables. This
data structure has a method to obtain the sstables used for a read.
For leveled compaction strategy, this data structure is an interval map,
which can be efficiently used to select the right sstables.
When a seed node boots up with more than one node in the seed list, it
will fail to talk to the other seed node which is not up yet.
This fails the feature check, so the seed node will not boot.
Skip the feature check for seed nodes for now, until we have a proper solution.
Fixes recent dtest failures due to failing to boot the seed node.
Message-Id: <e1d4110f96817e45f81dc0bc948dd14600fc5333.1467251799.git.asias@scylladb.com>
"This series converts sstable writers (including compaction) to streamed
mutations and makes them use consumer-style interface.
Code related to sstable writes and compaction is converted to consumers
that can be used with consume_flattened_in_thread() (which is a variant
of consume_flattened() intended to be run inside a thread).
compact_for_query is improved so that it can be reused by sstable
compaction."
Amnon says:
The API that returns the version currently returns the compatibility
version (i.e. the compatible origin version, currently 2.1.8).
The check version functionality needs to know the currently running
version of scylla. For that a new API was added that returns the current
version.
The result is equivalent to running scylla --version.
After this series a call to:
$ curl -X GET "http://localhost:10000/storage_service/scylla_release_version"
"666.development-20160703.72f0d4d"
Which is the json representation of:
$ ./build/release/scylla --version
666.development-20160703.72f0d4d
We currently log as follows:
May 9 00:09:13 node3.nl scylla[2546]: [shard 0] storage_service - This
node was decommissioned and will not rejoin the ring unless
cassandra.override_decommission=true has been set,or all existing data
is removed and the node is bootstrapped again
However, the user should use
override_decommission:true
instead of
cassandra.override_decommission:true
in scylla.yaml where the cassandra prefix is stripped.
Fixes#1240
Message-Id: <b0c9424c6922431ad049ab49391771e07ca6fbde.1467079190.git.asias@scylladb.com>
data_resource lookup uses data_resource::name(), which uses sprint(), which
uses (indirectly) locale, which takes a global lock. This is a bottleneck
on large machines.
Fix by not using name() during lookup.
Fixes#1419
Message-Id: <1467616296-17645-1-git-send-email-avi@scylladb.com>
This adds a definition for the scylla release version. The API already
returns the compatibility version (i.e. the compatible origin version).
This definition returns the scylla version; a call to the API should
return the same result as running scylla --version.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Apply compaction strategy specific logic to narrow down the set of sstables
used for a query; can speed up reads using LeveledCompactionStrategy
significantly.
Fixes#1185.
Using sstable_set will allow us to filter sstables during a query before
actually creating a reader (this is left to the next patch; here we just
convert the users of the _sstables field).
Allow compaction_strategy to create a container for sstables that is
optimized for the strategy.
Most compaction_strategies return bag_sstable_set; leveled compaction
returns the specialized partitioned_sstable_set.
partitioned_sstable_set assumes that sstables are mostly partitioned along
the token range: only a few sstables will be needed to access a particular
token. It is implemented as an interval_map.
bag_sstable_set is a generic sstable_set implementation: it assumes nothing
about the sstables. It is implemented as a vector, and any select will
return the entire sstable set.
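The difference between the two containers can be sketched as follows. This is only an illustration: sstable_info, bag_set and partitioned_set are hypothetical stand-ins, tokens are plain integers, and a linear scan replaces the interval_map used by the real partitioned_sstable_set.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for an sstable: a name plus the inclusive token range it covers.
struct sstable_info {
    std::string name;
    int64_t first_token;
    int64_t last_token;
};

// Generic container: assumes nothing about the sstables, so select() returns all of them.
struct bag_set {
    std::vector<sstable_info> all;
    std::vector<std::string> select(int64_t /*token*/) const {
        std::vector<std::string> out;
        for (const auto& s : all) {
            out.push_back(s.name);
        }
        return out;
    }
};

// Partition-aware container: returns only sstables whose range contains the token.
// A linear scan keeps the sketch short; an interval map would avoid visiting every sstable.
struct partitioned_set {
    std::vector<sstable_info> all;
    std::vector<std::string> select(int64_t token) const {
        std::vector<std::string> out;
        for (const auto& s : all) {
            if (s.first_token <= token && token <= s.last_token) {
                out.push_back(s.name);
            }
        }
        return out;
    }
};
```

With mostly disjoint ranges, the partitioned variant hands a read only the few sstables that can contain the token, while the bag variant always returns the whole set.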
sstable_set abstracts the notion of a container of sstables, allowing
different compaction strategies to supply their own implementation. The
intended user is leveled compaction strategy; since it partitions sstables,
it can quickly restrict the number of sstables that participate in a query
by looking at the min/max partition key.
sstable_set also maintains an internal lw_shared_ptr<sstable_list>,
in parallel with the abstract container. This is to support
column_family::get_sstable(), which returns a lw_shared_ptr<sstable_list>
which must be anchored somewhere if it is not saved at the caller side,
as it isn't in most current callers.
ring_position is built for modern code that does not require default
constructors or stateless comparators. But not all code is modern, so
supply a compatible_ring_position that works with old code, at the cost
of some extra storage. Intended user is boost's interval container
library.
sstable_list is now a map<generation, sstable>; change it to a set
in preparation for replacing it with sstable_set. The change simplifies
a lot of code; the only casualty is the code that computes the highest
generation number.
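A minimal sketch of why the highest-generation computation is the one place that gets harder. The types here are assumptions for illustration (an unordered set of shared pointers, an sstable reduced to its generation), not the actual declarations.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <memory>
#include <unordered_set>

// Hypothetical, reduced sstable: only the generation number matters here.
struct sstable {
    int64_t generation;
};
using shared_sstable = std::shared_ptr<sstable>;
using sstable_list = std::unordered_set<shared_sstable>;

// With a map keyed by generation, the highest generation is the last key.
// With a set of sstables it has to be found by scanning every element.
int64_t highest_generation(const sstable_list& list) {
    int64_t highest = 0;
    for (const auto& sst : list) {
        highest = std::max(highest, sst->generation);
    }
    return highest;
}
```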
If read ahead is going to be enabled it is important to close
input_stream<> properly (and wait for completion) before destroying it.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
This patch moves compaction logic to a consumer that can be used with
consume_flattened_in_thread(). Internally, sstable_writer is used to
write individual sstables.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
sstable_writer encapsulates all logic related to writing sstable.
Previously introduced component_writer is used to write actual
mutations. sstable_writer is intended to be used with
consume_flattened_in_thread(). Its purpose is to be used by higher-level
consumer that needs to write possibly more than one sstable (sstable
compaction is an example of such consumer).
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
This patch rewrites do_write_components() so that it can use
consume_flattened_in_thread(). All components-writing code is moved to a
new consumer: component_writer.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
This is a version of consume_flattened() intended to be run inside a
thread. All consumer code is going to be invoked in the same thread
context.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
compact_mutation code is going to be shared among queries and sstable
compaction. There are some differences though. Queries don't provide
_max_purgeable and sstable compaction doesn't need any limits.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
_max_purgeable can be different for each partition, since it is computed
using sstables in which that partition is present (or likely to be
present).
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Since decorated keys are already computed it is better to pass more
information than less. Consumers interested just in partition key can
just drop token and the ones requiring full decorated key don't need to
recompute it.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
query::full_slice is a partition slice which has full clustering row
ranges for all partition keys and no per-partition row limit.
Options and columns are not set.
It is used as a helper object in cases when a reference to
partition_slice is needed but the user code needs just all data there is
(an example of such case would be sstable compaction).
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Currently, each sstable write has its separate thread. However, the goal
is to have compaction use consume_flattened() with a consumer that
creates and writes the sstables. consume_flattened() needs to be executed
inside a thread, since sstable writer may defer.
This patch is a first step in preparations and it just makes whole
compaction logic run inside a thread. That makes little sense now, since
all sstable writes spawn their own threads but that's going to change
in the following patches.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
The partition_limit should have been added to the end of the ctor
argument list, as its current placement causes some callers to pass it
the timestamp instead of the limit.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1467239360-6853-3-git-send-email-duarte@scylladb.com>
While limiting the number of concurrently executing sstable readers reduces
our memory load, the queued readers, although consuming a small amount of
memory, can still grow without bounds.
To limit the damage, add two limits on the queue:
- a timeout, which is equal to the read timeout
- a queue length limit, which is equal to 2% of the shard memory divided
by an estimate of the queued request size (1kb)
Together, these limits bound the amount of memory needed by queued disk
requests in case the disk can't keep up.
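The queue length limit is simple arithmetic. The sketch below assumes integer division and a hypothetical function name; only the 2% figure and the 1 KB estimate come from the text.

```cpp
#include <cassert>
#include <cstddef>

// Maximum number of queued reads for a shard: 2% of the shard's memory
// divided by the estimated size of one queued request (1 KB in the text).
constexpr std::size_t max_queued_reads(std::size_t shard_memory_bytes,
                                       std::size_t request_estimate_bytes = 1024) {
    return shard_memory_bytes / 50 / request_estimate_bytes; // 2% == 1/50
}
```

For a shard with 1 GiB of memory this allows roughly 21 thousand queued reads, which together occupy at most about 20 MiB.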
Message-Id: <1467206055-30769-1-git-send-email-avi@scylladb.com>
At the moment, we only trigger compaction after creating a new
sstable as a result of memtable flush, or some other event such
as changing compaction strategy of a column family.
However, it's important to trigger compaction on boot too.
That will happen after loading all column families.
Fixes#1404.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <54d38a418157454eec97aaba6b8a6b6e51484db4.1467135349.git.raphaelsc@scylladb.com>
scylla-jmx no longer shuts itself down. A better setup would be for
it to be started when scylla-server starts and to
shut down when scylla-server shuts down.
This patch does the scylla-server part of the change.
The scylla-server unit definition Wants scylla-jmx.service, so there
is no need to enable scylla-jmx.service.
A patch to scylla-jmx will cause it to shut down when scylla-server
shuts down.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1467184502-4358-1-git-send-email-amnon@scylladb.com>
Following Cassandra, our default sstable compression chunk size is 64 KB.
The big downside of this default size is that small reads need to read
and uncompress a large chunk, around 32 KB (if compression halves the data
size). In this patch we switch the default chunk size to 4 KB, which allows
faster small reads (the report in issue #1337 was of a 60-fold speedup...).
Since commit 2f56577, large reads will not be significantly slowed down by
changing to a small chunk size. The remaining potential downside of this
change is lowering of the compression ratio because of the smaller chunks
individually compressed. However, experimentation shows that the compression
ratio is hurt somewhat, but not dramatically, by lowering the chunk size:
A recent survey of Cassandra compression in
https://www.percona.com/blog/2016/03/09/evaluating-database-compression-methods/
reports a compression ratio of 2 for 64 KB chunks, vs. 1.75 for 4 KB chunks.
My own test on a cassandra-stress workload (whose data is relatively hard
to compress), showed compression ratio 1.25 for 64 KB chunk, vs. 1.23 for
4 KB chunks.
Also remember that if a user wants to control the chunk length for a
particular table, he can - the 64 KB or 4 KB sizes are just the default.
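The read-amplification argument can be written out as a small calculation. This is a rough model with a hypothetical function name: it assumes a small read touches exactly one chunk and that the chunk compresses at the average ratio.

```cpp
#include <cassert>

// Approximate bytes read from disk for a small read that touches one chunk:
// uncompressed chunk size divided by the compression ratio.
constexpr double bytes_read_per_small_read(double chunk_bytes, double compression_ratio) {
    return chunk_bytes / compression_ratio;
}
```

With a 64 KB chunk and a ratio of 2 this gives the 32 KB quoted above; with a 4 KB chunk and the lower ratio of 1.75 it is a little over 2 KB.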
Fixes#1337
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1467063335-12096-1-git-send-email-nyh@scylladb.com>
Helps in identifying pointers allocated through the seastar
allocator. Shows which thread the pointer belongs to, its size
class, whether it's live or free, and the offset relative to the
live object.
Example:
(gdb) scylla ptr 0x6040abe88170
thread 1, small (size <= 320), live (0x6040abe88140 +48)
Message-Id: <1467047215-1763-1-git-send-email-tgrabiec@scylladb.com>
* seastar 3029ebe...c15055c (5):
> memory: add option to mlock() all memory
> reactor: run idle poll handler with a pure poll function
> ignore all but one failed futures in map_reduce
> tutorial: more general exception printout on startup
> resource: don't abort on too-high io queue count
Fixes#1395.
Fixes#1400.
From Avi:
Both the cql binary transport and the rpc server have protection against
too many concurrent requests overwhelming the database due to transient
allocations. They work by estimating the amount of memory a request
requires, and accounting that against a semaphore. When the semaphore
blocks, we stop dequeuing requests from the tcp connection.
Unfortunately, this doesn't work for reads, because we can't estimate the
required memory size. A small read request can require many sstables to be
read, perhaps concurrently, and a large response to be generated.
Fix by limiting the number of concurrent reads in a shard to 100. This
is more than enough concurrency for any reasonable disk, and there is no
network communication at this level, so we're safe from high network
latency requiring high concurrency.
Fixes#1398.
Since reading mutations can consume a large amount of memory, which, moreover,
is not predictable at the time the read is initiated, restrict the number
of reads to 100 per shard. This is more than enough to saturate the disk,
and hopefully enough to prevent allocation failures.
Restriction is applied in column_family::make_sstable_reader(), which is
called either on a cache miss or if the cache is disabled. This allows
cached reads to proceed without restriction, since their memory usage is
supposedly low.
Reads from the system keyspace use a separate semaphore, to prevent
user reads from blocking system reads. Perhaps we should select the
semaphore based on the source of the read rather than the keyspace,
but for now using the keyspace is sufficient.
A restricting_reader wraps a mutation_reader, and restricts its concurrency
using a provided semaphore; this allows controlling read concurrency, which
is important since reads can consume a lot of resources ((number of
participating sstables) * 128k after we have streaming mutations, and a lot
more before).
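The admission idea can be sketched without any asynchronous machinery. read_limiter below is a hypothetical, single-threaded stand-in for a semaphore-guarded reader; the real restricting_reader works with futures and a seastar semaphore.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// At most `limit` reads run at once; the rest wait in a FIFO queue.
class read_limiter {
    std::size_t _available;
    std::deque<std::function<void()>> _waiters;
public:
    explicit read_limiter(std::size_t limit) : _available(limit) {}

    // Start `read` now if a unit is free, otherwise queue it.
    void admit(std::function<void()> read) {
        if (_available > 0) {
            --_available;
            read();
        } else {
            _waiters.push_back(std::move(read));
        }
    }

    // Called when a read finishes: hand its unit to the next waiter, if any.
    void release() {
        if (!_waiters.empty()) {
            auto next = std::move(_waiters.front());
            _waiters.pop_front();
            next();
        } else {
            ++_available;
        }
    }

    std::size_t queued() const { return _waiters.size(); }
};
```

Using one limiter for user reads and another for system reads gives the separation between keyspaces described above.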
This patch adds support for thrift prepared statements. It specializes
the result_message::prepared into two types:
result_message::prepared::cql and result_message::prepared::thrift, as
their identifiers have different types.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
query_options::prepare() changes the values array, but this is not the
one used by query_options internally (e.g., in get_value_at). So we
need to also recalculate the value_views after prepare() is called.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Having both the values and value_views arguments in the query_options
ctor is confusing, since query_options uses only the value_views field
but that is not communicated to the caller.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Similarly to the with_cob functions, this one takes the exn_cob
function and ensures it is called in case of an exception. This
is useful when the return type of the thrift verb is not nothrow
move constructible; by holding on to the cob inside the verb and
calling it directly when we have the result we avoid having to
wrap it in a smart pointer.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
gcc 6 complains that deleting a managed_bytes::external isn't defined
because the size isn't known. I'm not sure it's correct, but there's no
way to tell because flexible arrays aren't standardized.
Fix by using an array of zero size.
Message-Id: <1466715187-4125-1-git-send-email-avi@scylladb.com>
Using a template lambda invokes a bug in Fedora 24's boost where the
lambda's parameter is an internal boost type rather than a range_tombstone.
Constraining the parameter with an explicit type avoids the problem.
Message-Id: <1466844211-17298-1-git-send-email-avi@scylladb.com>
This adds to the definition of the collectd API the ability to turn on
and off specific collectd metrics.
For the GET endpoint, a POST option was added that allows enabling or
disabling a metric.
The general GET endpoint now returns the enable flag, which indicates whether
the metric is enabled.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1466932139-19264-2-git-send-email-amnon@scylladb.com>
"This patchset implements the thrift describe verbs:
- describe_keyspace
- describe_keyspaces
- describe_cluster_name
- describe_version
- describe_ring
- describe_local_ring
- describe_token_map
- describe_partitioner
- describe_snitch
- describe_schema_versions
The verbs describe_splits and describe_splits_ex are not implemented
because they are marked as experimental (Origin's thrift interface has
this to say about them: "experimental API for hadoop/parallel query
support. may change violently and without warning."). Some drivers have
moved away from depending on this verb (SPARKC-94). The correct way to
implement the verbs for us would be to use the size_estimates system table
(CASSANDRA-7688). However, we currently don't populate size_estimates, which
is done by SizeEstimatesRecorder.java in Origin."
This patch removes a conversion function from an internal type
name to Origin's naming, which isn't needed because the
abstract_type hierarchy already keeps that mapping.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
We don't implement describe_splits, and this patch describes why that
is. In a nutshell, to properly implement this, we would need something
like Origin's SizeEstimatesRecorder.java, but as the verb is marked as
experimental, we don't do it for now.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch completes the system_add_keyspace verb by setting all
relevant options on the new schemas.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
In thrift, a static column family is one where all columns are
defined upon schema creation. It maps to a CQL table with a singular
partition key and a set of regular columns.
On the other hand, a dynamic column family is one which allows columns
to be dynamically added by insertion requests. It maps to a CQL table
with a partition key and a clustering key, which will hold the names of
the dynamic columns, and a regular column, which will hold the respective
values. If the thrift comparator type is composite, then there will be a
clustering column for each of the composite's components.
There can also be mixed column families; supporting those is future
work.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch moves the make_exception function from thrift/handler.cc to
the new header file thrift/utils.hh.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
We currently have a problem in update_cache that can be triggered by ordering
issues related to memtable flush termination (not initiation) and/or
update_cache() call duration.
That issue is described in #1364 and, in short, happens if a call to
update_cache starts before an ongoing call finishes. There is then a new SSTable
that should be consulted by the presence checker but is not.
The partition checker operates on a stale list because we need to make sure the
SSTable we just wrote is excluded from it. This patch changes the partition
checker so that all SSTables currently in use are consulted, except for the one
we have just flushed. That provides both the guarantee that we won't check our
own SSTable and access to the most up-to-date SSTable list.
Fixes#1364
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <fa1cee672bba8e21725c6847353552791225295f.1466534499.git.glauber@scylladb.com>
Add contiguity flag to cache entry and set it in scanning reader.
Partitions fetched during scanning are continuous
and we know there's nothing between them.
Clear contiguity flag on cache entries
when the succeeding entry is removed.
Use continuous flag in range queries.
Don't go to disk if we know that there's nothing
between two entries we have in cache. We know that
when continuous flag of the first one is set to true.
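A sketch of how such a flag lets a range query skip the disk. The cache is modeled as a std::map with integer keys, and both range bounds are assumed to be cached keys; cache_entry and range_fully_cached are hypothetical names.

```cpp
#include <cassert>
#include <map>

struct cache_entry {
    bool continuous = false; // true: nothing exists between this entry and the next one
};

// Can the range [first, last] be served from the cache alone?
bool range_fully_cached(const std::map<int, cache_entry>& cache, int first, int last) {
    if (first > last) {
        return false;
    }
    auto it = cache.find(first);
    auto end = cache.find(last);
    if (it == cache.end() || end == cache.end()) {
        return false;
    }
    for (; it != end; ++it) {
        if (!it->second.continuous) {
            return false; // something may exist after this entry: must go to disk
        }
    }
    return true;
}
```

Removing an entry must clear the flag on its predecessor, otherwise the predecessor would wrongly claim that nothing follows it.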
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <72bae432717037e95d1ac9465deaccfa7c7da707.1466627603.git.piotr@scylladb.com>
If we don't, std::terminate() causes a core dump, even though an
exception is sort-of-expected here and can be handled.
Add an exception handler to fix.
Fixes#1379.
Message-Id: <1466595221-20358-1-git-send-email-avi@scylladb.com>
gdb gets confused if a non-fully-qualified class name is used when
we are in some namespace context. Help it out by adding a :: prefix.
Message-Id: <1466587895-8690-1-git-send-email-avi@scylladb.com>
This patchset adds two new types of query limits:
- Per partition row limit, which limits how many rows
a given partition may return; needed both for thrift
and for future CQL features;
- Limit on the number of partitions returned, needed
by thrift.
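How the limits interact can be sketched as follows. This is an illustration only: partitions are vectors of integers and apply_limits is a hypothetical helper, not the compact_query code touched by the patches.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Each inner vector holds the rows of one partition (rows are just ints here).
using partition = std::vector<int>;

// Applies three limits: partitions returned, rows per partition, and total rows.
std::vector<partition> apply_limits(const std::vector<partition>& input,
                                    std::size_t row_limit,
                                    std::size_t per_partition_row_limit,
                                    std::size_t partition_limit) {
    std::vector<partition> out;
    std::size_t rows_left = row_limit;
    for (const auto& p : input) {
        if (out.size() == partition_limit || rows_left == 0) {
            break;
        }
        std::size_t n = std::min({p.size(), per_partition_row_limit, rows_left});
        out.emplace_back(p.begin(), p.begin() + static_cast<std::ptrdiff_t>(n));
        rows_left -= n;
    }
    return out;
}
```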
This patch renames compact_query::_partition_limit to
_current_partition_limit for clarity, as the next patch adds
a partition limit that limits the number of partitions.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch renames compact_query::_limit to _row_limit for
clarity, as a subsequent patch introduces yet another limit.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds a per-partition row limit. It ensures both local
queries and the reconciliation logic abide by this limit.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch changes the way we fetch each replica's last row to
determine if we got incomplete information from any of them. Instead
of fetching the last rows up front, we fetch them on demand only if we
actually trigger the code that needs them. We now get the last row from
the versions vector of vectors.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch extracts to a function the code that actually determines
the last row of a partition based on the direction of the query.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Since systemd moved to PermissionsStartOnly, only upstart uses sudoers.
So move common/sudoers.d to dist/ubuntu and drop it from the .rpm.
Also, Ubuntu 15.10/16.04 do not require sudoers since they use systemd,
so copy sudoers only for 14.04.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1466536491-9860-1-git-send-email-syuu@scylladb.com>
* seastar 401c333...3029ebe (3):
> util: add a seastar::value_of() helper function
> rpc: force closing listen fd on server stop
> reactor: fix I/O priority class id assignment
From Paweł:
This series introduces streaming_mutations which allow mutations to be
streamed between the producers and the consumers as a series of
mutation_fragments. Because of that the mutation streaming interface
works well with partitions larger than available memory provided that
actual producer and consumer implementations can support this as well.
mutation_fragments are the basic objects that are emitted by
streamed_mutations; they can represent a static row, a clustering row,
or the beginning or the end of a range tombstone. They are ordered by their
clustering keys (with static rows being always the first emitted mutation
fragment). The beginning of range tombstone is emitted before any
clustering row affected by that tombstone and the end of range tombstone
is emitted after the last clustering row affected by it. Range tombstones
are disjoint.
In this series all producers are converted to fully support the new
interface, that includes cache, memtables and sstables. Mutation queries
and data queries are the only consumers converted so far.
To minimize the per-mutation_fragment overhead streamed_mutations use
batching. The actual producer implementation fills a buffer until
it is full (currently, buffer size is 16, the limit should, however,
be changed to depend on the actual size in memory of the stored elements)
or end of stream is reached.
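The batching scheme can be sketched like this. batching_stream is a hypothetical class: fragments are integers and the source is an in-memory vector, whereas a real producer would read from a memtable, cache or sstable. Only the buffer size of 16 comes from the text.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

class batching_stream {
    static constexpr std::size_t buffer_size = 16; // value quoted in the text
    std::vector<int> _source;
    std::size_t _pos = 0;
    std::deque<int> _buffer;
    std::size_t _refills = 0;

    // Fill the buffer until it is full or the source is exhausted.
    void fill_buffer() {
        ++_refills;
        while (_buffer.size() < buffer_size && _pos < _source.size()) {
            _buffer.push_back(_source[_pos++]);
        }
    }
public:
    explicit batching_stream(std::vector<int> source) : _source(std::move(source)) {}

    // Returns false at end of stream; otherwise stores the next fragment in `out`.
    bool next(int& out) {
        if (_buffer.empty()) {
            if (_pos == _source.size()) {
                return false;
            }
            fill_buffer();
        }
        out = _buffer.front();
        _buffer.pop_front();
        return true;
    }

    std::size_t refills() const { return _refills; }
};
```

The consumer pays the cost of calling into the producer once per batch rather than once per fragment.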
In order to guarantee isolation of writes reads from cache and memtable
use MVCC. When a reader is created it takes a snapshot of the particular
cache or memtable entry. The snapshot is immutable and if there happen
to be any incoming writes while the read is active a new version of
partition is created. When the snapshot is destroyed partition versions
are merged together as much as possible.
Performance results with perf_simple_query (median of results with
duration 15):
         before       after        diff
write    618652.70    618047.58    -0.10%
read     661712.44    608070.49    -8.11%
This reverts commit 2d7f8f4a47.
Avi sayeth:
"Isn't this the other way round? EBS is persistent."
and
"The patch is wrong too. Instance store takes 5 minutes to boot
compared to 1 minute for EBS."
From Glauber:
This is my new take at the "Move throttler to the LSA" series, except
this one doesn't actually move anything anywhere: I am leaving all
memtable conversion out, and instead I am sending just the LSA bits +
LSA active reclaim. This should help us see where we are going, and
then we can discuss all memtable changes in a series on its own,
logically separated (and hopefully already integrated with virtual
dirty).
[tgrabiec: trivial merge conflicts in logalloc.cc]
If a CF does not have any sstables at all, we should treat it
as having a replay position of zero. However, since we also
must deal with potential re-sharding, we cannot just set
shard->uuid->zero initially, because we don't know what shards
existed.
Go through all CF:s post map-reduce, and for every shard where
a CF does not have an RP-mapping (no sstables found), set the
global min pos (for shard) to zero.
Fixes#1372
Message-Id: <1465991864-4211-1-git-send-email-calle@scylladb.com>
Try to emulate the origin behaviour for batch reply. They use an
explicit write handler, combining
1.) Hinting to all known dead endpoints
2.) Sending to all presumed live, requiring ack from all
3.) Hinting to endpoint to which send failed.
We don't have hints, so try to work around by doing send with
cl=ALL, and if send fails (wholly or partially), retain the
batch in the log.
This is still a slight behavioural difference, and we also risk
filling up the batch log in extreme cases. (Though probably not
in any real environment).
Refs #1222
Message-Id: <1466444170-23797-1-git-send-email-calle@scylladb.com>
We now keep the regions sorted by size, and the children region groups as well.
Internally, the LSA has all information it needs to make size-based reclaim
decisions. However, we don't do reclaim internally, but rather warn our user
that a pressure situation is mounting.
The user of a region_group doesn't need to evict the largest region in case of
pressure and is free to do whatever it chooses - including nothing. But more
likely than not, taking into account which region is the largest makes sense.
This patch puts together this last missing piece of the puzzle, and exports the
information we have internally to the user.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Region is implemented using the pimpl pattern (region_impl), and all its
relevant data is present in a private structure instead of the region itself.
That private structure is the one that the other parts of the LSA will refer to,
the region_group being the prime example. To allow classes such as the
region_group to externally export a particular region, we will introduce a
backpointer region_impl -> region.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We are currently just allowing the region_group to specify a throttle_threshold,
that triggers throttling when a certain amount of memory is reached. We would
like to notify the callers that such condition is reached, so that the callers
can do something to alleviate it - like triggering flushes of their structures.
The approach we are taking here is to pass a reclaimer instance. Any user of a
region_group can specialize its methods start_reclaiming and stop_reclaiming,
which will be called when the region_group comes under pressure or ceases to
be, respectively.
Now that we have such facility, it makes more sense to move the
throttle_threshold here than having it separately.
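A sketch of the callback arrangement. Only the names start_reclaiming, stop_reclaiming and throttle_threshold come from the text; region_group_sketch, counting_reclaimer and the strict "greater than" comparison are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Callback interface carrying the threshold; users override the two hooks.
class reclaimer {
    std::size_t _threshold;
public:
    explicit reclaimer(std::size_t threshold) : _threshold(threshold) {}
    virtual ~reclaimer() = default;
    std::size_t throttle_threshold() const { return _threshold; }
    virtual void start_reclaiming() {}
    virtual void stop_reclaiming() {}
};

// Hypothetical group that tracks total memory and notifies only on threshold crossings.
class region_group_sketch {
    reclaimer& _reclaimer;
    std::size_t _total = 0;
    bool _under_pressure = false;

    void check() {
        bool pressure = _total > _reclaimer.throttle_threshold();
        if (pressure && !_under_pressure) {
            _reclaimer.start_reclaiming();
        } else if (!pressure && _under_pressure) {
            _reclaimer.stop_reclaiming();
        }
        _under_pressure = pressure;
    }
public:
    explicit region_group_sketch(reclaimer& r) : _reclaimer(r) {}
    void add(std::size_t bytes) { _total += bytes; check(); }
    void remove(std::size_t bytes) { assert(bytes <= _total); _total -= bytes; check(); }
};

// Example user: counts notifications; a real one might trigger memtable flushes.
struct counting_reclaimer : reclaimer {
    using reclaimer::reclaimer;
    int starts = 0;
    int stops = 0;
    void start_reclaiming() override { ++starts; }
    void stop_reclaiming() override { ++stops; }
};
```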
Signed-off-by: Glauber Costa <glauber@scylladb.com>
When we decide to evict from a specific region_group due to excessive memory
usage, we must also consider looking at each of its children (subgroups). It
could very well be that most of memory is used by one of the subgroups, and
we'll have to evict from there.
We also want to make sure we are evicting from the biggest region of all, and
not the biggest region in the biggest region_group. To understand why this is
important, consider the case in which the regions are memtables associated with
dirty region groups. It could be that a very big memtable was recently flushed,
and a fairly small one took its place. That region group is still quite large
because the memtable hasn't finished flushing yet, but that doesn't mean we
should evict from it.
To allow us to efficiently pick which region is the largest, each root of each
subtree will keep track of its maximal score, defined as the maximum between our
largest region total_space and the maximum maximal score of subtrees.
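The definition of the maximal score translates directly into a recursion. In this sketch, group is a hypothetical tree node and the score is recomputed on every call, whereas the patch keeps it tracked at each subtree root.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tree node for a region group.
struct group {
    std::vector<std::size_t> region_sizes; // total_space of each region directly in this group
    std::vector<group> children;           // subgroups
};

// Maximal score: the larger of this group's biggest region and
// the maximal score of any of its subtrees.
std::size_t maximal_score(const group& g) {
    std::size_t best = 0;
    for (auto s : g.region_sizes) {
        best = std::max(best, s);
    }
    for (const auto& c : g.children) {
        best = std::max(best, maximal_score(c));
    }
    return best;
}
```

A subgroup holding regions of 50 and 45 is larger in total than one holding a single region of 60, yet the score correctly points at the region of 60.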
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Currently, the regions in a region group are organized in a simple vector.
We can do better by using a binomial heap, as we do for segments, and then
updating when there is change. Internally to the LSA, we are in good position
to always know when change happens, so that's really the best way to do it.
The end game here is to easily call for the reclaim of the largest offending
region (potentially asynchronously). Because of that, we aren't really interested
in the region occupancy, but in the region reclaimable occupancy instead: that's
simply equal to the occupancy if the region is reclaimable, and 0 otherwise. Doing
that effectively lists all non-reclaimable regions at the end of the heap, in no
particular order.
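The ordering can be sketched with a standard heap. Note that std::priority_queue stands in here for the binomial heap mentioned above, and region_info is a hypothetical reduction of a region to the two properties that matter.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

struct region_info {
    std::size_t occupancy;
    bool reclaimable;
    // Occupancy if the region can be reclaimed, 0 otherwise.
    std::size_t reclaimable_occupancy() const { return reclaimable ? occupancy : 0; }
};

// Orders regions so that the one with the largest reclaimable occupancy is on top.
struct by_reclaimable_occupancy {
    bool operator()(const region_info& a, const region_info& b) const {
        return a.reclaimable_occupancy() < b.reclaimable_occupancy();
    }
};

using region_heap =
    std::priority_queue<region_info, std::vector<region_info>, by_reclaimable_occupancy>;
```

A large non-reclaimable region sinks to the bottom, so the top of the heap is always the best reclaim candidate.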
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The database code uses a throttling function to make sure that memory
used for the dirty region never is over the limit. We track that with
a region group, so it makes sense to move this as generic functionality
into LSA.
This patch implements the LSA-side functionality and a later patch will
convert the current memtable throttler to use it.
Unlike the current throttling mechanism, we'll not use a timer-based
mechanism here. Aside from being more generic and friendlier towards
other users, this is a good change for current memtable by itself.
The constants - 10ms and 1MB chosen by the current throttler are arbitrary, and we
would be better off without them. Let's discuss the merits of each separately:
1) 10ms timer: If we are throttling, we expect somebody to flush the memtables
for memory to be released. Since we are in position to know exactly when a memtable
was written, thus releasing memory, we can just call unthrottle at that point, instead
of using a timer.
2) 1MB release threshold: we do that because we have no idea how much memory a request
will use, so we put the cut somehow. However, because of 1) we don't call unthrottle
through a timer anymore, and do it directly instead. This means that we can just execute
the request and see how much memory it has used, with no need to guess. So we'll call
unthrottle at the end of every request that was previously throttled.
Writing the code this way also has the advantage that we need one less continuation in
the common case of the database not being throttled.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
compact_for_query is an intermediate stage used to compact data in a
flattened stream of mutations before they are consumed by query building
consumers.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Mutation reader produces a stream of streamed_mutations. Each
streamed_mutation itself is a stream so basically we are dealing here
with a stream of streams.
consume_flattened() flattens such stream of streams making all its
elements consumable by a single consumer. It also allows reversing
the mutations before consumption using reverse_streamed_mutation().
reverse_streamed_mutation() is an inefficient way of reversing
streamed_mutations. First, it collects all mutation_fragments and then
it emits them in reverse order (except the static row, which is always
the first element); it also flips the bounds of range tombstones.
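The collect-then-reverse approach can be sketched as follows. fragment and reverse_fragments are hypothetical: a fragment is reduced to a kind and an integer position, and the whole stream is already collected in a vector.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

enum class fragment_kind { static_row, clustering_row, range_tombstone_begin, range_tombstone_end };

struct fragment {
    fragment_kind kind;
    int key; // stand-in for a clustering position
    bool operator==(const fragment& o) const { return kind == o.kind && key == o.key; }
};

// Reverses a collected stream: the static row stays first, everything else is
// emitted in reverse order, and range tombstone begin/end markers swap roles.
std::vector<fragment> reverse_fragments(std::vector<fragment> in) {
    auto first = in.begin();
    if (first != in.end() && first->kind == fragment_kind::static_row) {
        ++first;
    }
    std::reverse(first, in.end());
    for (auto it = first; it != in.end(); ++it) {
        if (it->kind == fragment_kind::range_tombstone_begin) {
            it->kind = fragment_kind::range_tombstone_end;
        } else if (it->kind == fragment_kind::range_tombstone_end) {
            it->kind = fragment_kind::range_tombstone_begin;
        }
    }
    return in;
}
```

Because everything is collected first, memory use grows with the partition size, which is the inefficiency noted above.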
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
flip_bound_kind() changes start bound to end bound and vice versa while
preserving the inclusiveness.
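A minimal sketch of such a function, assuming an enum with four bound kinds; the enumerator names are chosen for illustration and may not match the real declaration.

```cpp
#include <cassert>

enum class bound_kind { excl_end, incl_start, incl_end, excl_start };

// Turns a start bound into an end bound and vice versa; inclusiveness is kept.
constexpr bound_kind flip_bound_kind(bound_kind bk) {
    switch (bk) {
    case bound_kind::excl_end:   return bound_kind::excl_start;
    case bound_kind::incl_end:   return bound_kind::incl_start;
    case bound_kind::incl_start: return bound_kind::incl_end;
    case bound_kind::excl_start: return bound_kind::excl_end;
    }
    return bk; // not reached for valid enumerators
}
```

Flipping twice yields the original kind.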
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
To ensure isolation of operation when streaming a mutation from a
mutable source (such as cache or memtable) MVCC is used.
Each entry in memtable or cache is actually a list of used versions of
that entry. Incoming writes are either applied directly to the last
version (if it wasn't being read by anyone) or prepended to the list
(if the former head was being read by someone). When a reader finishes, it
tries to squash versions together provided there is no other reader that
could prevent this.
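The version list can be sketched in a few lines. Everything here is a simplified assumption: versioned_entry is a hypothetical class, writes are integers appended to a vector, and squashing only happens once no version is pinned at all, which is coarser than merging "as much as possible".

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <list>
#include <utility>
#include <vector>

struct version {
    std::vector<int> writes; // stand-in for partition data
    int readers = 0;         // snapshots currently pinning this version
};

class versioned_entry {
    std::list<version> _versions; // front() is the newest version
public:
    versioned_entry() : _versions(1) {}

    // A reader pins the newest version; later writes will not modify it.
    version& take_snapshot() {
        ++_versions.front().readers;
        return _versions.front();
    }

    void apply(int write) {
        if (_versions.front().readers > 0) {
            _versions.emplace_front(); // head is being read: start a new version
        }
        _versions.front().writes.push_back(write);
    }

    // Unpin a version; once nobody reads any version, squash them into one.
    // The reference `v` must not be used after this call.
    void release(version& v) {
        --v.readers;
        bool pinned = std::any_of(_versions.begin(), _versions.end(),
                                  [](const version& x) { return x.readers > 0; });
        if (pinned) {
            return;
        }
        version merged;
        for (auto it = _versions.rbegin(); it != _versions.rend(); ++it) { // oldest first
            merged.writes.insert(merged.writes.end(), it->writes.begin(), it->writes.end());
        }
        _versions.clear();
        _versions.push_back(std::move(merged));
    }

    std::size_t version_count() const { return _versions.size(); }
    const std::vector<int>& latest() const { return _versions.front().writes; }
};
```

A reader holding a snapshot never sees writes applied after the snapshot was taken, which is the isolation property described above.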
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Originally, ranges for reversed queries were in descending order and
ranges for forward queries in ascending order. However,
streamed_mutations require them to always be in ascending order.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
The main user of this list is MVCC implementation in partition_version.cc.
The reason why boost::intrusive::list<> cannot be used is that there is no
single owner of the list who could keep the boost::intrusive::list<> object
alive. In the MVCC case there is at least one partition_entry object and
possibly multiple partition_snapshot objects whose lifetimes are independent,
and the list must remain in a valid state as long as at least one of them
is alive.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
It is incorrect to update row_cache with a memtable that is also its
underlying storage. The reason for that is that after memtable is merged
into row_cache they share lsa region. Then when there is a cache miss
it asks underlying storage for data. This will result in the memtable
reader running under the row_cache allocation section. Since the memtable reader
also uses an allocation section, the result is an assertion failure since
allocation sections from the same lsa region cannot be nested.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
With streamed_mutations a partition with many small rows doesn't stress
the cache as much as the test expects. Use large clustering rows instead.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
range_tombstone_stream encapsulates logic responsible for turning
range_tombstone_list into a stream of mutation_fragments and merging
that stream with a stream of clustering rows.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Row markers and collections weren't filtered out even if they belonged
to a clustering row that shouldn't be in the result. The check whether
to include cell or not was done only for live and dead atomic cells.
This patch adds appropriate checks for collections and row markers.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
mutation_reader and streamed_mutation may use the same stream as a source of
mutation_fragments and of mutations themselves (this happens in sstable reader).
In such a case, asking for the next streamed_mutation from mutation_reader would
invalidate all other streamed_mutations.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
streamed_mutation represents a mutation in a form of a stream of
mutation_fragments. streamed_mutation emits mutation fragments in the
order they should appear in the sstables, i.e. static row is always
the first one, then clustering rows and range tombstones are emitted
according to the lexicographical ordering of their clustering keys and
bounds of the range tombstones.
Range tombstones are disjoint, i.e. after emitting
range_tombstone_begin it is guaranteed that there is going to be a
single range_tombstone_end before another range_tombstone_begin is
emitted.
The ordering of mutation_fragments also guarantees that by the time
the consumer sees a clustering row it has already received all
relevant tombstones.
Partition key and partition tombstone are not streamed and are part of
the streamed_mutation itself.
streamed_mutation uses batching. The mutation implementations are
supposed to fill a buffer with mutation fragments until is_buffer_full()
or the end of stream is encountered.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
This commit introduces mutation_fragment class which represents the parts
of mutation streamed by streamed_mutation.
mutation_fragment can be:
- a static row (only one in the mutation)
- a clustering row
- start of range tombstone
- end of range tombstone
There is an ordering (implemented in position_in_partition class) between
mutation_fragment objects. It reflects the order in which content of
partition appears in the sstables.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
After commit faa4581, each shard only starts splitting its shared sstables
after opening all sstables. This was important because compaction needs to
be aware of all sstables.
However, another bug remained: If one shard finishes loading its sstables
and starts the splitting compactions, and in parallel a different shard is
still opening sstables - the second shard might find a half-written sstable
being written by the first shard, and abort on a malformed sstable.
So in this patch we start the shared sstable rewrites - on all shards -
only after all shards finished loading their sstables. Doing this is easy,
because main.cc already contains a list of sequential steps where each
uses invoke_on_all() to make sure the step completes on all shards before
continuing to the next step.
Fixes #1371
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1466426641-3972-1-git-send-email-nyh@scylladb.com>
Using an ordering mechanism better than rw-locks for write/flush
means we can wait for pending write in batch mode, and coalesce
data from more than one mutation into a chunk.
It also means we can wait for a specific read+flush pair (based on
file position).
Downside is that we will not do parallel writes in batch mode (unless
we run out of buffer), which might underutilize the disk bandwidth.
Upside is that running in batch mode (i.e. per-write consistency)
now has way better bandwidth, and also, at least with high mutation
rate, better average latency.
Message-Id: <1465990064-2258-1-git-send-email-calle@scylladb.com>
Scylla will not start if the disk was not benchmarked,
so run io_tune with the right parameters.
Also add the cpu_set environment variables for passing
cpu set to iotune and scylla.
Signed-off-by: Benoît Canet <benoit@scylladb.com>
Message-Id: <1466412846-4760-2-git-send-email-benoit@scylladb.com>
Use the PermissionsStartOnly systemd option to apply the permission
related configurations only to the start command. This allows us to stop
using "sudo" for ExecStartPre and ExecStopPost hooks and drop the
"requiretty" /etc/sudoers hack from Scylla's RPM.
Tested-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1466407587-31734-1-git-send-email-penberg@scylladb.com>
Because we build on CentOS 7, which does not have the %sysctl_apply macro,
the macro is not expanded, and therefore executed incorrectly even on 7.2,
which does.
Fix by expanding the macro manually.
Fixes #1360.
Message-Id: <1466250006-19476-1-git-send-email-avi@scylladb.com>
Since most of the time people are running scylla_setup on
a fully upgraded ubuntu 14.04 box, we rarely reach that
code path, but once we do we end up with an error. Let's
fix that.
Signed-off-by: Lucas Meneghel Rodrigues <lmr@scylladb.com>
Message-Id: <1466205496-3885-2-git-send-email-lmr@scylladb.com>
Starting in commit 721f7d1d4f, we start "rewriting" a shared sstable (i.e.,
splitting it into individual shards) as soon as it is loaded in each shard.
However as discovered in issue #1366, this is too soon: Our compaction
process relies in several places on compaction being done only after all
the sstables of the same CF have been loaded. One example is that we
need to know the content of the other sstables to decide which tombstones
we can expire (this is issue #1366). Another example is that we use the
last generation number we are aware of to decide the number of the next
compaction output - and this is wrong before we saw all sstables.
So with this patch, while loading sstables we only make a list of shared
sstables which need to be rewritten - and the actual rewrite is only started
when we finish reading all the sstables for this CF. We need to do this in
two cases: reboot (when we load all the existing sstables we find on disk),
and nodetool refresh (when we import a set of new sstables).
Fixes #1366.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1466344078-31290-1-git-send-email-nyh@scylladb.com>
Commit daad2eb "row_cache: fix memory leak in case of schema upgrade
failure" has fixed a memory leak caused by failed upgrade_entry().
However, in case of upgrade failure memtable_entry used to create the
new cache entry was left in some invalid state. If the operation was
retried the cache would attempt again to apply that memtable_entry which
now would be in invalid state.
The solution is either to ignore upgrade_entry() exceptions or to not
call it at all and let the cache entry be upgraded on demand. This patch
implements the latter.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1466163435-27367-1-git-send-email-pdziepak@scylladb.com>
When update() causes a new entry to be inserted to the cache the
procedure is as follows:
1. allocate and construct new entry
2. upgrade entry schema
3. add entry to lru list and cache tree
Step 2 may fail and at this point the pointer to the entry is neither
protected by RAII nor added in any of the cache containers. The solution
is to swap steps 2 and 3 so that even if the upgrade fails the entry is
already owned by the cache and won't leak.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1466161709-25288-1-git-send-email-pdziepak@scylladb.com>
We want to prevent an older version of scylla, which has fewer features,
from joining a cluster of newer scylla nodes, which have more features,
because when scylla sees that a feature is enabled on all other nodes, it
will start to use the feature and assume that existing and future nodes
will always have this feature.
In order to support downgrade during a rolling upgrade, we need to support
the case of mixed old and new nodes.
1) All old nodes
O O O O O <- N OK
O O O O O <- O OK
2) All new nodes
N N N N N <- N OK
N N N N N <- O FAIL
3) Mixed old and new nodes
O N O N O <- N OK
O N O N O <- O OK
(O == old node, N == new node, <- == joining the cluster)
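The table above can be read as a subset check: a node may join if it
understands every feature that all existing nodes have in common. The
sketch below illustrates that rule only. The names feature_set,
common_features() and may_join() are hypothetical and are not the
gossiper's actual interface:

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <string>
#include <utility>
#include <vector>

using feature_set = std::set<std::string>;

// Features supported by every node already in the cluster (intersection).
inline feature_set common_features(const std::vector<feature_set>& nodes) {
    if (nodes.empty()) {
        return {};
    }
    feature_set common = nodes.front();
    for (const auto& n : nodes) {
        feature_set tmp;
        std::set_intersection(common.begin(), common.end(), n.begin(), n.end(),
                              std::inserter(tmp, tmp.begin()));
        common = std::move(tmp);
    }
    return common;
}

// The joining node must understand all of the cluster's common features.
inline bool may_join(const feature_set& local, const feature_set& remote_common) {
    return std::includes(local.begin(), local.end(),
                         remote_common.begin(), remote_common.end());
}
```

In a mixed cluster the common feature set is empty, so both old and new
nodes pass the check, which matches case 3 above.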
With this patch, I tested:
1.1) Add new node to new node cluster
gossip - Feature check passed. Local node 127.0.0.4 features =
{RANGE_TOMBSTONES}, Remote common_features = {RANGE_TOMBSTONES}
1.2) Add old node to old node cluster
gossip - Feature check passed. Local node 127.0.0.4 features = {},
Remote common_features = {}
2.1) Add new node to new node cluster
gossip - Feature check passed. Local node 127.0.0.4 features =
{RANGE_TOMBSTONES}, Remote common_features = {RANGE_TOMBSTONES}
2.2) Add old node to new node cluster
seastar - Exiting on unhandled exception: std::runtime_error (Feature
check failed. This node can not join the cluster because it does not
understand the feature. Local node 127.0.0.4 features = {}, Remote
common_features = {RANGE_TOMBSTONES})
3.1) Add new node to mixed cluster
gossip - Feature check passed. Local node 127.0.0.4 features =
{RANGE_TOMBSTONES}, Remote common_features = {}
3.2) Add old node to mixed cluster
gossip - Feature check passed. Local node 127.0.0.4 features = {},
Remote common_features = {}
Fixes #1253
If the feature string is empty, boost::split will return
std::set<sstring> = {""} instead of std::set<sstring> = {}
which will make a node with a feature, e.g. std::set<sstring> =
{"RANGE_TOMBSTONES"}, think it does not understand the feature of
a node with no features at all.
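One way to avoid the spurious empty token is to drop empty tokens while
splitting. The sketch below uses only the standard library as a stand-in
for boost::split. The function name parse_features() and the ','
separator are assumptions made for illustration:

```cpp
#include <set>
#include <string>

// Split a separator-delimited feature string, skipping empty tokens so that
// an empty input yields an empty set rather than a set containing "".
inline std::set<std::string> parse_features(const std::string& s, char sep = ',') {
    std::set<std::string> out;
    std::string::size_type start = 0;
    while (start <= s.size()) {
        std::string::size_type end = s.find(sep, start);
        if (end == std::string::npos) {
            end = s.size();
        }
        if (end > start) {
            out.insert(s.substr(start, end - start));
        }
        start = end + 1;
    }
    return out;
}
```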
The user may specify the time after which speculative retry should happen
instead of relying on cf statistics. Use the provided value in the speculative
executor.
Message-Id: <20160616104422.GH5961@scylladb.com>
Currently, we only stop the CQL transport server. Extract a
stop_transport() function from drain_on_shutdown() and call it from
do_isolate_on_error() to also shut down the inter-node RPC transport,
Thrift, and other communications services.
Fixes #1353
"Reclaiming many segments was observed to cause up to multi-ms
latency. With the new setting, the latency of reclamation cycle with
full segments (worst case mode) is below 1ms.
I saw no difference in throughput in a CQL write micro benchmark
in either of these workloads:
- full segments, reclaim by random eviction
- sparse segments (3% occupancy), reclaim by compaction and no eviction
Fixes #1274."
Commit f42673ed1e ("scylla_setup: Hide busy block devices from RAID0
configuration") wasn't enumerating anything. Additionally, it listed
from /dev/ and not /dev/dm, which broke the test conditions.
This one uses blkid instead of /proc/partitions.
A follow up patch will be required to mask encrypted devices.
Signed-off-by: Benoît Canet <benoit@scylladb.com>
Message-Id: <1466059657-12377-1-git-send-email-benoit@scylladb.com>
Allocations will still be allowed if made directly, but callers will have the
choice (in an upcoming patch) to proceed only if memory is below this threshold.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Tomek correctly points out that since we are now using "this" in lambda
captures, we should make the region_group not movable. We currently define a
move constructor, but it has no users, so we should just remove it.
The copy constructor is already deleted, and so are the copy and move assignment
operators. So by removing the move constructor, we should be fine.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
This patch looks in /proc/mount for the device name so that
the device and its subdevices are excluded from the available
RAID0 targets. It does the same with physical volumes from the device
mapper.
Fixes #1189
Message-Id: <1466001423-9547-4-git-send-email-benoit@scylladb.com>
is_atomic() is called for each cell in mutation applies, compaction
and query. Since the value doesn't change it can be easily cached which
would save one indirection and virtual call.
Results of perf_simple_query -c1 (median, duration 60):
           before       after
read     54611.49    55396.01   +1.44%
write    65378.92    68554.25   +4.86%
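The caching idea can be sketched as follows. All type names here are
hypothetical stand-ins, and where the real code stores the cached flag is
an assumption. The sketch only shows how a value that never changes can
be read once at construction, so later checks skip the indirection and
the virtual call:

```cpp
// Hypothetical type hierarchy with a virtual is_atomic().
struct abstract_type_sketch {
    virtual ~abstract_type_sketch() = default;
    virtual bool is_atomic() const = 0;
};

struct atomic_type_sketch : abstract_type_sketch {
    bool is_atomic() const override { return true; }
};

struct collection_type_sketch : abstract_type_sketch {
    bool is_atomic() const override { return false; }
};

// Hypothetical column descriptor that caches the answer once.
class column_sketch {
    const abstract_type_sketch* _type;
    bool _is_atomic; // cached at construction; the type's answer never changes
public:
    explicit column_sketch(const abstract_type_sketch& type)
        : _type(&type), _is_atomic(type.is_atomic()) {}

    // Hot path: plain member read, no pointer chase and no virtual dispatch.
    bool is_atomic() const { return _is_atomic; }
    const abstract_type_sketch& type() const { return *_type; }
};
```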
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1465991045-11140-1-git-send-email-pdziepak@scylladb.com>
On NUMA hardware, autonuma may reduce performance by
unmapping memory.
Since we do manual NUMA placement, autonuma will not
help anything.
We ought to disable it by setting the kernel.numa_balancing
sysctl to 0.
Fixes: #1120
Signed-off-by: Benoît Canet <benoit@scylladb.com>
Message-Id: <1466006345-9972-1-git-send-email-benoit@scylladb.com>
dtest takes error level log as serious error. It is not a serious error
for streaming to fail to send a verb and fail a streaming session which
triggers a repair failure, for example when the peer node is gone or
stopped. Switch to use log level warn instead of level error.
Fixes repair_additional_test.py:RepairAdditionalTest.repair_kill_3_test
Fixes: #1335
Message-Id: <406fb0c4a45b81bd9c0aea2a898d7ca0787b23e9.1465979288.git.asias@scylladb.com>
dtest takes error level log as serious error. It is not a serious error
for streaming to fail to send a verb and fail a streaming session, for
example when the peer node is gone or stopped. Switch to use log level warn
instead of level error.
Fixes repair_additional_test.py:RepairAdditionalTest.repair_kill_3_test
Fixes: #1335
Message-Id: <0149d30044e6e4d80732f1a20cd20593de489fc8.1465979288.git.asias@scylladb.com>
Reclaiming many segments was observed to cause up to multi-ms
latency. With the new setting, the latency of reclamation cycle with
full segments (worst case mode) is below 1ms.
I saw no decrease in throughput compared to the step of 16 segments in
either of these modes:
- full segments, reclaim by random eviction
- sparse segments (3% occupancy), reclaim by compaction and no eviction
Fixes #1274.
push_back() is not reentrant with pop_front(), used by the evictor. If
the reclaimer runs when std::deque allocates a new node, the deque will get
corrupted. Fix by running push_back() under the reclaim lock.
tracing::tracing local instance is dereferenced from a
cql_server::connection::process_request(), therefore tracing::tracing
service may be stop()ed only after a CQL server service is down.
On the other hand it may not be stopped before RPC service is down
because a remote side may request a tracing for a specific command too.
This patch splits the tracing::tracing stop() into two phases:
1) Flush all pending tracing records and stop the backend.
2) Stop the service.
The first phase is called after CQL server is down and before RPC is down.
The second phase is called after RPC is down.
Fixes #1339
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Message-Id: <1465840496-19990-1-git-send-email-vladz@cloudius-systems.com>
* seastar 864d6dc...401c333 (8):
> scollectd: Support filtering specific collectd metrics
> core: Integrate error reporting with the logging framework
> rpc: wait for all replies to be completed before closing rpc server
> rpc: clean up resource accounting
> queue: fix race between pop_eventually() and abort()
> rpc_test: fix cancel test to not depend on timing.
> tutorial: explain application-specific command line options
> add ostream output operator for std::unordered_map
When read repair writes diffs back to replicas it is enough to wait
for the requested CL to guarantee read monotonicity. This patch makes read
repair writes reuse the regular mutate functionality, which already tracks
CL status. This is done by changing the write response handler to not hold
the mutation directly, but instead hold a container that, depending on
whether this is a read repair write or a regular one, can provide a
different mutation per destination.
Message-Id: <20160613124727.GL1096@scylladb.com>
query_state expects the current row limit to be updated so it
can be enforced across partition ranges. A regression introduced
in e4e8acc946 prevented that from
happening by passing a copy of the limit to querying_reader.
This patch fixes the issue by having column_family::query update
the limit as it processes partitions from the querying_reader.
Fixes #1338
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1465804012-30535-1-git-send-email-duarte@scylladb.com>
"Correctness of current uses of clear() and invalidate() relies on fact
that cache is not populated using readers created before
invalidation. Sstables are first modified and then cache is
invalidated. This is not guaranteed by current implementation
though. As pointed out by Avi, a populating read may race with the
call to clear(). If that read started before clear() and completed
after it, the cache may be populated with data which does not
correspond to the new sstable set.
To provide such a guarantee, invalidate() variants were adjusted to
synchronize using _populate_phaser, similarly to how row_cache::update()
does.
Fixes #1291."
There are various call-sites that explicitly check for EEXIST and
ENOENT:
$ git grep "std::error_code(E"
database.cc: if (e.code() != std::error_code(EEXIST, std::system_category())) {
database.cc: if (e.code() != std::error_code(ENOENT, std::system_category())) {
database.cc: if (e.code() != std::error_code(ENOENT, std::system_category())) {
database.cc: if (e.code() != std::error_code(ENOENT, std::system_category())) {
sstables/sstables.cc: if (e.code() == std::error_code(ENOENT, std::system_category())) {
sstables/sstables.cc: if (e.code() == std::error_code(ENOENT, std::system_category())) {
Commit 961e80a ("Be more conservative when deciding when to shut down
due to disk errors") turned these errors into a storage_io_exception
that is not expected by the callers, which causes 'nodetool snapshot'
functionality to break, for example.
Whitelist the two error codes to revert back to the old behavior of
io_check().
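A sketch of the whitelisting idea, using standard C++ only. The names
io_check_sketch(), is_expected_by_callers() and storage_io_error_sketch
are hypothetical and do not reproduce the real io_check() or exception
class:

```cpp
#include <cerrno>
#include <stdexcept>
#include <string>
#include <system_error>

// Hypothetical stand-in for the storage I/O exception type.
struct storage_io_error_sketch : std::runtime_error {
    explicit storage_io_error_sketch(const std::system_error& e)
        : std::runtime_error(std::string("Storage I/O error: ") + e.what()) {}
};

// EEXIST and ENOENT are checked explicitly by callers, so they must pass through.
inline bool is_expected_by_callers(const std::system_error& e) {
    return e.code() == std::error_code(EEXIST, std::system_category())
        || e.code() == std::error_code(ENOENT, std::system_category());
}

template <typename Func>
auto io_check_sketch(Func&& func) -> decltype(func()) {
    try {
        return func();
    } catch (const std::system_error& e) {
        if (is_expected_by_callers(e)) {
            throw; // rethrow unchanged so call sites can compare e.code()
        }
        throw storage_io_error_sketch(e); // everything else is treated as a disk error
    }
}
```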
Message-Id: <1465454446-17954-1-git-send-email-penberg@scylladb.com>
Make storage_io_exception exception error message less cryptic by
actually including the human-readable error message from
std::system_error...
Before:
nodetool: Scylla API server HTTP POST to URL '/storage_service/snapshots' failed: Storage io error errno: 2
After:
nodetool: Scylla API server HTTP POST to URL '/storage_service/snapshots' failed: Storage I/O error: 2: No such file or directory
We can improve this further by including the name of the file that the I/O
error happened on.
Message-Id: <1465452061-15474-1-git-send-email-penberg@scylladb.com>
Several shards may share the same sstable - e.g., when re-starting scylla
with a different number of shards, or when importing sstables from an
external source. Sharing an sstable is fine, but it can result in excessive
disk space use because the shared sstable cannot be deleted until all
the shards using it have finished compacting it. Normally, we have no idea
when the shards will decide to compact these sstables - e.g., with size-
tiered-compaction a large sstable will take a long time until we decide
to compact it. So what this patch does is to initiate compaction of the
shared sstables - on each shard using it - so that as soon as possible after
the restart, the original sstable is split into separate
sstables per shard, and the original sstable can be deleted. If several
sstables are shared, we serialize this compaction process so that each
shard only rewrites one sstable at a time. Regular compactions may happen
in parallel, but they will not be able to choose any of the shared
sstables because those are already marked as being compacted.
Commit 3f2286d0 increased the need for this patch, because since that
commit, if we don't delete the shared sstable, we also cannot delete
additional sstables which the different shards compacted with it. For one
scylla user, this resulted in so much excessive disk space use, that it
literally filled the whole disk.
After this patch, commit 3f2286d0, or the improvement to it discussed in
issue #1318, is no longer necessary, because we will never compact a shared
sstable together with any other sstable - as explained above, the shared
sstables are marked as "being compacted" so the regular compactions will
avoid them.
Fixes #1314.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1465406235-15378-1-git-send-email-nyh@scylladb.com>
Reviewed-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Previously, we were using a stat to decide if compaction should be
retried, but that's not efficient. The information is also lost
after node is restarted.
After these changes, compaction will be retried until strategy is
satisfied, i.e. there is nothing to compact.
We will now be doing the following in a loop:
Get compaction job from compaction strategy.
If cannot run, finish the loop.
Otherwise, compact this column family.
Go back to start of the loop.
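The loop above can be sketched as below. The compaction_job struct and
the two callbacks are hypothetical placeholders for the strategy and the
column family, not the real interfaces:

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical job descriptor: an empty sstable list means "nothing to compact".
struct compaction_job {
    std::vector<std::string> sstables;
    bool can_run() const { return !sstables.empty(); }
};

// Keep asking the strategy for work until it is satisfied.
// Returns the number of compactions that were run.
inline int compact_until_satisfied(
        const std::function<compaction_job()>& get_job,
        const std::function<void(const compaction_job&)>& compact) {
    int runs = 0;
    for (;;) {
        compaction_job job = get_job(); // get compaction job from the strategy
        if (!job.can_run()) {
            break;                      // strategy is satisfied, finish the loop
        }
        compact(job);                   // otherwise, compact this column family
        ++runs;
    }
    return runs;
}
```

Because the loop is driven by the strategy alone, no counter has to
survive a restart.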
By the way, the pending_compactions stat will be deprecated after this
commit. Previously, it was increased to signal that a compaction was wanted
and decreased when a compaction finished. Now, we can
compact more than we asked for, so it would be decreased below 0.
Also, it is now the strategy that tells whether compaction is wanted.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <899df0d8d807f6b5d9bb8600d7c63b4e260cc282.1465398243.git.raphaelsc@scylladb.com>
Sometimes a metric previously reported from collectd is not available
anymore. Previously, this caused scyllatop to log an exception to the
user, which in effect destroys the user experience and inhibits
monitoring of other metrics. This patch makes ScyllaTop handle this
problem. It will display such metrics as 'not available', and exclude
them from sum and average computations.
Closes issue #1287.
Signed-off-by: Yoav Kleinberger <yoav@scylladb.com>
Message-Id: <1465301178-27544-1-git-send-email-yoav@scylladb.com>
From Asias:
In f27e5d2a6 (messaging_service: Delay listening ms during boot up),
messaging_service startup is split into two stages. Adjust the api
registration code and fix up the messaging_service stop code.
This patch makes a few minor improvements in the parser:
- merge first and rest into 2-argument form of Word to define
identifier – should give some performance boost, simpler code
- replace Literal(keyword_string) with Keyword(keyword_string)
throughout - stricter parsing, avoids misinterpreting identifiers
with keywords
- replace expr.setResultsName("name") with expr("name") throughout –
this is a style change (no actual change in underlying parser
behavior), but I find this form easier to follow
- add calls to setName to make exceptions more readable
Message-Id: <005901d1bbd2$711f7bb0$535e7310$@austin.rr.com>
There are two problems:
1. _server_tls is not stopped
2. _server and _server_tls might not be created if
messaging_service::start_listen is not called yet.
Since messaging_service is fully initialized in
storage_service::init_server which calls
messaging_service::start_listen, we need to delay
the messaging_service api registration until after it.
The rate_moving_average is used by timed_rate_moving_average to return
its internal values.
If there are no timed events, the mean_rate is not properly initialized.
To solve that, the mean_rate is now initialized to 0 in the structure
definition.
Refs #1306
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1465231006-7081-1-git-send-email-amnon@scylladb.com>
This variable if set to true will activate
developer mode. It will be set by using the
-e option of docker run.
The xfs bind mount behavior and the cpuset behavior
will be set by using the relevant docker command
line options and documented in the scylla/docker
howto.
Fixes: #1267
Signed-off-by: Benoît Canet <benoit@scylladb.com>
Message-Id: <1465213713-2537-1-git-send-email-benoit@scylladb.com>
Add support for defining a probability (a value in the [0,1] range)
for tracing the next CQL request.
Traces for requests that are chosen to be traced due to this feature
are not going to be flushed immediately.
Use the std::subtract_with_carry_engine (implements the "lagged Fibonacci" algorithm)
random number engine for the fastest generation of random integer values.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Correctness of current uses of clear() and invalidate() relies on the fact
that cache is not populated using readers created before
invalidation. Sstables are first modified and then cache is
invalidated. This is not guaranteed by current implementation
though. As pointed out by Avi, a populating read may race with the
call to clear(). If that read started before clear() and completed
after it, the cache may be populated with data which does not
correspond to the new sstable set.
To provide such a guarantee, invalidate() variants were adjusted to
synchronize using _populate_phaser, similarly to how row_cache::update()
does.
A tracing session life cycle includes 3 stages:
1) Active: when new trace records are being added to this session.
2) Pending for flushing to a storage: when session is over but not
yet flushed to the storage ("backend").
3) Flushing: when session's records are being flushed to the storage
and this process is not yet completed.
Sessions may accumulate in each of the stages above and we should limit
the maximum number of sessions accumulated in each of them in order to avoid an OOM
situation.
Current in-tree implementation only limits the number of tracing sessions
accumulated in the first ("Active") stage.
Since currently every closing session is being immediately flushed (as long
as "settraceprobability" is not implemented) the second stage never accumulates
tracing sessions.
The third stage is currently not controlled at all and if, for instance, we
manage to push enough tracing sessions towards a slow storage backend, they may
accumulate there, consuming an uncontrolled amount of memory, and may eventually consume
all of it.
This patch fixes this unpleasant situation by applying the following strategy:
- Limit the total number of accumulated tracing sessions in all stages above together
by a static value - 2 times "flush threshold". "2 times" is needed to allow new
tracing sessions to accumulate in stage 2 while sessions in stage 3 are still
being processed.
- Forcefully flush sessions in stage 2 to the storage when their count reaches the "flush
threshold".
This ensures that there are never more than (2 * "flush threshold") sessions in total (in any stage)
on each shard.
An advantage of this strategy is its simplicity - we only need a single threshold to control all stages.
If we feel that we need finer granularity for each stage we may add separate limits for each of them
in the future.
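The two rules can be sketched with per-shard counters. The struct and
member names below are hypothetical and only illustrate the arithmetic
of the single-threshold scheme:

```cpp
#include <cstddef>

// Hypothetical per-shard bookkeeping for the three session stages.
struct session_limits_sketch {
    std::size_t flush_threshold;
    std::size_t active = 0;   // stage 1: records are still being added
    std::size_t pending = 0;  // stage 2: finished, waiting to be flushed
    std::size_t flushing = 0; // stage 3: flush to the backend in progress

    explicit session_limits_sketch(std::size_t threshold)
        : flush_threshold(threshold) {}

    std::size_t total() const { return active + pending + flushing; }

    // Rule 1: never hold more than 2 * flush_threshold sessions in total.
    bool may_create_session() const { return total() < 2 * flush_threshold; }

    // Rule 2: force a flush once stage 2 reaches the flush threshold.
    bool must_flush_pending() const { return pending >= flush_threshold; }
};
```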
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
* dist/ami/files/scylla-ami 72ae258...863cc45 (3):
> Move --cpuset/--smp parameter settings from scylla_sysconfig_setup to scylla_ami_setup
> convert scylla_install_ami to bash script
> 'sh -x -e' is not valid since all scripts converted to bash script, so remove them
A call to tracing::tracing::create_session() doesn't guarantee that a session is created.
Check that the session is actually created before trying to use it.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Currently we only shut down on EIO. Expand this to shut down on any
system_error.
This may cause us to shut down prematurely due to a transient error,
but this is better than not shutting down due to a permanent error
(such as ENOSPC or EPERM). We may whitelist certain errors in the future
to improve the behavior.
Fixes #1311.
Message-Id: <1465136956-1352-1-git-send-email-avi@scylladb.com>
It was discussed that leveled strategy may not benefit from parallel
compaction feature because almost all compaction jobs will have similar
size. It was also found that leveled strategy wasn't working correctly
with it because two overlapping sstables (targeting the same level)
could be created in parallel by two ongoing compactions.
Fixes #1293.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <60fe165d611c0283ca203c6d3aa2662ab091e363.1464883077.git.raphaelsc@scylladb.com>
From Duarte:
This patchset adds the range_tombstone_list data structure,
used to hold a set of disjoint range tombstones, and changes
the internal representation of row tombstones to use that
data structure.
Fixes #1155
[tgrabiec: Added compound_wrapper::make_empty(const schema&) overload
to fix compilation failure in tracing code]
This patch enables the RANGE_TOMBSTONES supported feature, meaning
that the node is capable of accepting row entry tombstones as range
tombstones.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch uses the composite_marker to add inclusiveness information
to the prefixes of a range tombstone.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Since Scylla now supports proper range tombstones, the code for
reading ranges from sstables and converting them to overlapping
tombstones is no longer necessary, and is, in fact, wasteful as
the internal representation converts overlapping tombstones back to
ranges.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch moves the difference between two mutation_partition's
row_tombstones inside the range_tombstone_list.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch changes the type of the mutation partition's row_tombstones
to be a range_tombstone_list, so that they are now represented as a
set of disjoint ranges. All of its usages are updated accordingly.
Fixes #1155
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds the range tombstones feature, which is not enabled
yet, to the storage_service, so that consumers can query for it.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch changes the gms::feature destructor so it
checks whether the gossiper has been stopped before trying
to unregister the feature.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch extracts the code from sstables/partition.cc which is used
to transform a set of range tombstones into a set of overlapping
scylladb tombstones.
The range_tombstone_merger will be used to send mutations to nodes not
yet updated to support the internal range tombstone representation.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This class is responsible for representing a set of range tombstones
as non-overlapping disjoint sets of range tombstones.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch introduces the range_tombstone class, composed of
a [start, end] pair of clustering_key_prefixes, the type
of inclusiveness of each bound, and a tombstone.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch changes the idl-compiler so that the default value of a
field can be set to the value of a previous field in the class:
class P {
uint32_t x;
uint32_t y = x;
};
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
... and make it a clustering_key_prefix, in preparation of
supporting not-whole-row range tombstones.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Config provides operators << >> for string_map which makes it impossible
to have generic stream operators for unordered_map. Fix it by making
string_map a separate type and not just an alias.
Message-Id: <20160602102642.GJ9939@scylladb.com>
Limit disk bandwidth to 5MB/s to emulate a slow disk:
echo "8:0 5000000" > /cgroup/blkio/limit/blkio.throttle.write_bps_device
echo "8:0 5000000" > /cgroup/blkio/limit/blkio.throttle.read_bps_device
Start scylla node 1 with low memory:
scylla -c 1 -m 128M --auto-bootstrap false
Run c-s:
taskset -c 7 cassandra-stress write duration=5m cl=ONE -schema
'replication(factor=1)' -pop seq=1..100000 -rate threads=20
limit=2000/s -node 127.0.0.1
Start scylla node 2 with low memory:
scylla -c 1 -m 128M --auto-bootstrap true
Without this patch, I saw std::bad_alloc during streaming
ERROR 2016-06-01 14:31:00,196 [shard 0] storage_proxy - exception during
mutation write to 127.0.0.1: std::bad_alloc (std::bad_alloc)
...
ERROR 2016-06-01 14:31:10,172 [shard 0] database - failed to move
memtable to cache: std::bad_alloc (std::bad_alloc)
...
To fix:
1. Apply the streaming mutation limiter before we read the mutation into
memory to avoid wasting memory holding the mutation which we can not
send.
2. Reduce the parallelism of sending streaming mutations. Before, we sent all
ranges in parallel; now we send the ranges one by one.
before: nr_vnode * nr_shard * (send_info + cf.make_reader memory usage)
after: nr_shard * (send_info + cf.make_reader memory usage)
We can at least save memory usage by the factor of nr_vnode, 256 by
default.
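The two estimates can be written down as a small worked example. The struct and function names are hypothetical, and per_range_bytes stands for the "send_info + cf.make_reader memory usage" term, whose real value is not given here.

```cpp
#include <cassert>
#include <cstdint>

struct streaming_params {
    uint64_t nr_vnode;         // 256 by default
    uint64_t nr_shard;
    uint64_t per_range_bytes;  // send_info + cf.make_reader memory usage (assumed figure)
};

// All ranges sent in parallel.
uint64_t memory_before(const streaming_params& p) {
    return p.nr_vnode * p.nr_shard * p.per_range_bytes;
}

// Ranges sent one by one.
uint64_t memory_after(const streaming_params& p) {
    return p.nr_shard * p.per_range_bytes;
}
```

With nr_vnode = 256 and any fixed per-range cost, memory_before() is 256 times memory_after().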
In my setup, fix 1) alone was not enough; with both fixes 1) and 2), I saw
no std::bad_alloc. Also, I did not see streaming bandwidth drop due
to 2).
In addition, I tested grow_cluster_test.py:GrowClusterTest.test_grow_3_to_4,
as described:
https://github.com/scylladb/scylla/issues/1270#issuecomment-222585375
With this patch, I saw no std::bad_alloc any more.
Fixes: #1270
Message-Id: <7703cf7a9db40e53a87f0f7b5acbb03fff2daf43.1464785542.git.asias@scylladb.com>
"This series introduces a tracing infrastructure that may be used
for tracing CQL commands execution and measuring latencies of separate
stages of CQL handling as defined by a CQL binary protocol specification.
To begin tracing one should create a "tracing session", which may then
be used to issue tracing events.
If execution of a specific CQL command involves other Nodes (not only a Coordinator),
then a "tracing session ID" is passed to that Node (in the context of the
corresponding RPC call). Then this "session ID" may be used to create a
"secondary tracing session" to issue tracing events in the context of the original session.
The series contains an implementation of tracing that uses a keyspace in the current
cluster for storing tracing information.
This series contains a demo per-request tracing instrumentation of a QUERY
CQL command and even this instrumentation is partial: it only fully instruments
a QUERY->SELECT->read_data call chain.
This is only the very beginning of the proper instrumentation which is
to come.
Right now the latencies for a single SELECT for a single row with RF 1 from a 2 Nodes cluster
on my laptop started using ccm (for C* all default parameters, for scylla - memory 256MB, --smp 2)
are as follows (pseudo-graphics warning):
--------------------------------------------------------------------------------------------
| scylla (2 Nodes x 2 shards each) | C* 2.1.8
_______________________________________|___________________________________|________________
Coordinator and replica are same Node | |
(TRACING OFF): | 0.3ms | 0.3ms
c-s with a single thread mean latency | (was 0.2ms before the last |
value | rebase with a master) |
--------------------------------------------------------------------------------------------
Coordinator and replica are same Node | |
(TRACING ON) | ~250us | ~1200us
Running a SELECT command from a cqlsh | |
a few times | |
--------------------------------------------------------------------------------------------
Coordinator and replica are not on the | |
same Node | ~700us | >2500us
(TRACING ON) | |
--------------------------------------------------------------------------------------------
To begin tracing one may use the cqlsh "TRACING ON/OFF" commands:
cqlsh> TRACING ON
Now Tracing is enabled
cqlsh> select "C0", "C1" from keyspace1.standard1 where key=0x12345679;
C0 | C1
--------------------+------
0x000000000001e240 | null
(1 rows)
Tracing session: 146f0180-21e7-11e6-b244-000000000000
activity | timestamp | source | source_elapsed
-------------------------------------------------------------------+----------------------------+-----------+----------------
select "C0", "C1" from keyspace1.standard1 where key=0x12345679; | 2016-05-24 22:38:24.536000 | 127.0.0.1 | 0
message received from /127.0.0.1 [0] | 2016-05-24 22:38:24.537000 | 127.0.0.2 | --
Done reading options [0] | 2016-05-24 22:38:24.537000 | 127.0.0.1 | 3
read_data handling is done [0] | 2016-05-24 22:38:24.537000 | 127.0.0.2 | 37
Parsing a statement [0] | 2016-05-24 22:38:24.537000 | 127.0.0.1 | 3
Processing a statement [0] | 2016-05-24 22:38:24.537000 | 127.0.0.1 | 56
Done processing - preparing a result [0] | 2016-05-24 22:38:24.537000 | 127.0.0.1 | 550
Request complete | 2016-05-24 22:38:24.536560 | 127.0.0.1 | 560
cqlsh>"
This is a demo instrumentation:
- Check if a tracing info is present in the read_command.
- If yes - create a tracing session with the given tracing
session ID.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Instrument a coordinator of a SELECT query to send tracing session
info to the corresponding replica Nodes.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
- Store a trace state inside a client_state.
- Start tracing in a cql_server::connection::process_query().
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
- Add a tracing ID (UUID) optional field to cql_server::response.
- If _tracing_id is set make_frame() would insert a tracing ID
in the response message. According to CQL spec it should be the
first thing in the response "body" and the TRACING bit (0x02) should be
set in the "flags" field.
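A simplified sketch of the framing rule described above. The response struct and the make_frame() signature are hypothetical, and the header layout assumes CQL binary protocol v3/v4 (version, flags, 2-byte stream, opcode, 4-byte length).

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

constexpr uint8_t tracing_flag = 0x02;  // TRACING bit in the "flags" field

struct response {
    uint8_t opcode = 0;
    std::vector<uint8_t> body;
    bool has_tracing_id = false;          // stands in for an optional UUID
    std::array<uint8_t, 16> tracing_id{}; // raw UUID bytes
};

// Builds a response frame. If a tracing ID is set, it is written first in the body
// and the TRACING flag is raised.
std::vector<uint8_t> make_frame(const response& r, uint8_t version, uint16_t stream) {
    const uint8_t flags = r.has_tracing_id ? tracing_flag : 0;
    const uint32_t length = static_cast<uint32_t>(
        r.body.size() + (r.has_tracing_id ? r.tracing_id.size() : 0));
    std::vector<uint8_t> frame;
    frame.push_back(static_cast<uint8_t>(0x80 | version));  // response direction bit
    frame.push_back(flags);
    frame.push_back(static_cast<uint8_t>(stream >> 8));
    frame.push_back(static_cast<uint8_t>(stream & 0xff));
    frame.push_back(r.opcode);
    for (int shift = 24; shift >= 0; shift -= 8) {
        frame.push_back(static_cast<uint8_t>((length >> shift) & 0xff));
    }
    if (r.has_tracing_id) {
        frame.insert(frame.end(), r.tracing_id.begin(), r.tracing_id.end());
    }
    frame.insert(frame.end(), r.body.begin(), r.body.end());
    return frame;
}
```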
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
When client_state is created with an external_tag - store
a client address in the client state.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
trace_state: Is a single tracing session.
tracing: A sharded service that contains an i_trace_backend_helper instance
and is a "factory" of trace_state objects.
trace_state main interface functions are:
- begin(): Start time counting (should be used via tracing::begin() wrapper).
- trace(): Create a tracing event - it's coupled with a time passed since begin()
(should be used via tracing::trace() wrapper).
- ~trace_state(): Destructor will close the tracing session.
"tracing" service main interface functions are:
- start(): Initialize a backend.
- stop(): Shut down a backend.
- create_session(): Creates a new tracing session.
(tracing::end_session(): Is called by a trace_state destructor).
When trace_state needs to store a tracing event it uses a backend helper from
the "tracing" service.
The "tracing" service limits the number of open tracing sessions to a static number.
If this number is reached, further sessions will be dropped.
trace_state implements a similar strategy in regard to tracing events per single
session.
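The interplay between the two classes can be sketched as follows. This is a simplified, synchronous model: the limits, the active_sessions() accessor and the in-memory event list are assumptions, and the backend helper is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

class tracing;

// A single tracing session. Closing the session is tied to the destructor.
class trace_state {
    tracing& _owner;
    std::vector<std::string> _events;
    std::size_t _max_events;
public:
    trace_state(tracing& owner, std::size_t max_events)
        : _owner(owner), _max_events(max_events) {}
    ~trace_state();
    // Records an event; returns false if the per-session limit was reached.
    bool trace(std::string msg) {
        if (_events.size() >= _max_events) {
            return false;  // event dropped
        }
        _events.push_back(std::move(msg));
        return true;
    }
};

// Factory of trace_state objects with a static cap on open sessions.
class tracing {
    std::size_t _max_sessions;
    std::size_t _active = 0;
public:
    explicit tracing(std::size_t max_sessions) : _max_sessions(max_sessions) {}
    // Returns a null pointer when the session limit is reached.
    std::unique_ptr<trace_state> create_session() {
        if (_active >= _max_sessions) {
            return nullptr;  // session dropped
        }
        ++_active;
        return std::unique_ptr<trace_state>(new trace_state(*this, 2));
    }
    void end_session() { --_active; }
    std::size_t active_sessions() const { return _active; }
};

inline trace_state::~trace_state() { _owner.end_session(); }
```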
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
Uses a CQL keyspace system_traces to store tracing information.
Uses two tables:
CREATE TABLE system_traces.sessions (
session_id uuid,
command text,
client inet,
coordinator inet,
duration int,
parameters map<text, text>,
request text,
started_at timestamp,
PRIMARY KEY ((session_id)))
and
CREATE TABLE system_traces.events (
session_id uuid,
event_id timeuuid,
activity text,
source inet,
source_elapsed int,
thread text,
PRIMARY KEY ((session_id), event_id))
system_traces.sessions table contains records of tracing sessions.
system_traces.sessions columns description:
- session_id: an ID of the session.
- command: type of a command this session was created for
(currently supported "NONE", "QUERY" and "REPAIR").
- client: IP of the client that issued the command.
- coordinator: IP of a coordinator that received the command.
- duration: total duration of the tracing session (in us).
- parameters: optional parameters for this session, passed to
i_trace_state::begin() call.
- request: a CQL command this tracing session is created for.
- started_at: the time the session has been started at.
system_traces.events contains records of separate tracing events.
system_traces.events columns description:
- session_id: an ID of the session.
- event_id: an ID of the event.
- activity: the trace point description - a message given to
i_trace_state::trace().
- source: IP of the Node where trace event was issued.
- source_elapsed: time passed since creation of a tracing session (in us) on
the Node where this trace event was issued.
- thread: name of the thread in whose context this trace event was
issued (currently it's "core N", where 'N' is an index of
a shard the trace event was issued on).
This class will cache lambdas creating the corresponding mutations for each tracing
record requested to be stored till flush() method is called.
flush() will merge all pending mutations to "sessions" and "events" tables and
then apply a mutation to "events" table and when it completes - to "sessions"
table. This way it'll ensure that when some tracing session is visible, all its
events are visible too.
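The ordering guarantee can be sketched like this. It is a synchronous simplification: mutation is a string stand-in, trace_record_cache is a hypothetical name, and the apply callback represents writing a batch of mutations to a table.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

using mutation = std::string;  // stand-in for a real mutation

class trace_record_cache {
    std::vector<std::function<mutation()>> _pending_sessions;
    std::vector<std::function<mutation()>> _pending_events;
public:
    // Cache lambdas that create the mutations later, at flush time.
    void store_session_record(std::function<mutation()> make) {
        _pending_sessions.push_back(std::move(make));
    }
    void store_event_record(std::function<mutation()> make) {
        _pending_events.push_back(std::move(make));
    }
    // Applies events first and sessions second, so a visible session
    // always has its events visible too.
    template <typename Apply>
    void flush(Apply apply) {
        std::vector<mutation> events;
        std::vector<mutation> sessions;
        for (auto& make : _pending_events) {
            events.push_back(make());
        }
        for (auto& make : _pending_sessions) {
            sessions.push_back(make());
        }
        _pending_events.clear();
        _pending_sessions.clear();
        apply("events", events);
        apply("sessions", sessions);
    }
};
```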
trace_keyspace_helper exposes a few metrics via collectd:
- tracing_error - a total number of errors (not including OOM)
- bad_column_family_errors - number of times a tracing record wasn't
stored because system_trace tables' schema
didn't match the expected value. This may happen if
a DB administrator is doing funny things like altering
the schemas of the above tables.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
This class represents an interface for a specific backend that is
going to store tracing information.
The specific implementation may, and is expected to, implement caching
of pending tracing records.
Interface functions are:
- start(): Initialize a backend (e.g. create keyspace and tables).
- stop(): Flush all pending work and shut down the backend.
- store_session_record()/store_event_record():
Cache/store the corresponding tracing records.
- flush(): Flush pending tracing records.
Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
writes_attempts is supposed to count how many times data was sent out, but
currently it also counts replicas in other DCs that get the data
through a coordinator. Fix it by counting only when data is actually sent.
Message-Id: <20160601153124.GB9939@scylladb.com>
"One of the things we need to do as part of the throttle rework I am doing is to
serialize memtable flushes to some extent - that will guarantee that in case
we're throttling, the flushes finish earlier and release memory earlier, if
compared to the case in which we just let all tables flush freely and
simultaneously."
* seastar 0bcdd28...864d6dc (4):
> Logging framework
> Add libubsan and libasan to fedora deps docs
> tests: add rpc cancellable tests
> rpc: add cancellable interface
Dropped logging implementation in favor of seastar's due to a link
conflict with operator<<.
This series adds a constructor to malformed_sstable_exception that
includes a filename and converts some call-sites to use it.
There are still plenty of low-level sites that don't even know the
sstable filename they are operating on. We need to either change the
code to carry the filename to lower layers or find a higher-level
call-site where we can catch malformed_sstable_exception and rethrow it
with the sstable filename. But that's for another series by someone who
knows the sstable code well.
Refs #669.
This reverts commit b3ed55be1d.
The issue is in the failing dtest, not this commit. Gleb writes:
"The bug is in the test, not the patch. Test waits for repair session
to end one way or the other when node is killed, but for nodetool to
know if repair is completed it needs to poll for it. If node dies
before nodetool managed to see repair completion it will be stuck
forever since jmx is alive, but does not provide answers any more.
The patch changes timing, repair is completed much closer to exit now,
so problem appears, but it may happen even without the patch.
The fix is for dtest to kill jmx as part of killing a node
operation."
Now that Lucas fixed the problem in scylla-ccm, revert the revert.
We can only free memory for a region_group when the entire memtable is released.
This means that while the disk can handle requests from multiple memtables just fine,
we won't free any memory until all of them finish. If we are under a pressure situation
we will take a lot more time to leave it.
Ideally, with write-behind, we would allow just one memtable to be flushed at a
time. But since we don't have it enabled, it's better to serialize the flushes
so that only some memtables (4) are flushed at a time. Having the memtable writer
bandwidth all to itself, the memtable will finish sooner, release memory sooner,
and recover the system's health sooner.
We would like to do that without having streaming and memtables starve each
other. Ideally, that should mean half the bandwidth for each - but that
sacrifices memtable writes in the common case there is no streaming. Again,
write behind will help here, and since this is something we intend to do, there
is no need to complicate the code too much for an interim solution.
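One way to picture the serialization is a small limiter that lets at most four flushes run and queues the rest. This is only a sketch: flush_limiter, submit() and done() are hypothetical names, and the real code would be asynchronous rather than callback-driven like this.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

class flush_limiter {
    std::size_t _max_concurrent;
    std::size_t _running = 0;
    std::deque<std::function<void()>> _waiting;
public:
    explicit flush_limiter(std::size_t max_concurrent = 4)
        : _max_concurrent(max_concurrent) {}

    // Starts the flush now if a slot is free, otherwise queues it.
    void submit(std::function<void()> start_flush) {
        if (_running < _max_concurrent) {
            ++_running;
            start_flush();
        } else {
            _waiting.push_back(std::move(start_flush));
        }
    }

    // Must be called when a started flush completes; starts the next queued one.
    void done() {
        assert(_running > 0);
        --_running;
        if (!_waiting.empty()) {
            std::function<void()> next = std::move(_waiting.front());
            _waiting.pop_front();
            ++_running;
            next();
        }
    }

    std::size_t running() const { return _running; }
    std::size_t waiting() const { return _waiting.size(); }
};
```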
Signed-off-by: Glauber Costa <glauber@scylladb.com>
This patch introduces an explicit behavior enum class - one of delayed or
immediate, that allows callers to tell the memtable list whether they want a
delayed flush (default), or force an immediate flush. So far this only affects
the streaming code (memtables just ignore it), but the concept is one that can
be easily generalized.
With that in place, we can revert back the stop function to use the standard
flush. I have argued before that adding infrastructure like that would not be
worth it for the sake of stop alone, but some other code could now use it.
Specifically, the active reclaimer for the throttler would like to force
immediate flushes, as delayed flushes really won't make a lot of difference in
reducing memory usage.
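A minimal sketch of such an interface. The name flush_behavior and the shape of the seal function are assumptions made for the example; only the delayed/immediate distinction and the delayed default come from the description above.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Whether a flush may be postponed or must happen right away.
enum class flush_behavior { delayed, immediate };

class memtable_list {
    std::function<void(flush_behavior)> _seal_fn;
public:
    explicit memtable_list(std::function<void(flush_behavior)> seal_fn)
        : _seal_fn(std::move(seal_fn)) {}

    // Delayed is the default; callers such as a reclaimer can force an immediate flush.
    void seal_active_memtable(flush_behavior behavior = flush_behavior::delayed) {
        _seal_fn(behavior);
    }
};
```

A streaming list would pass a seal function that honors the argument, while a regular memtable list would pass one that ignores it.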
Signed-off-by: Glauber Costa <glauber@scylladb.com>
When a node starts up, a peer node can send a gossip syn message to it
before the gossip message handlers are registered in messaging_service.
We can see:
scylla[123]: [shard 0] rpc - client a.b.c.d: unknown verb exception 6 ignored
To fix, we delay the start of listening in messaging_service until the
gossip message handlers are registered.
Message-Id: <9b20d85e199ef0e44cdcde2920123a301a88f3d7.1464254400.git.asias@scylladb.com>
Metadata usually doesn't change after it is created; make that visible in
the code, allowing further optimizations to be applied later.
Message-Id: <1464334638-7971-3-git-send-email-avi@scylladb.com>
Rather than dynamic_cast<>ing the statement to see whether it is a
select statement, add a virtual function to cql_statement to get the
result metadata.
This is faster and easier to follow.
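A sketch of the pattern with simplified stand-in classes. The metadata struct and update_statement are illustrative only; returning a pointer to const metadata is an assumption that matches the idea that metadata does not change after creation.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct metadata {
    std::vector<std::string> column_names;
};

class cql_statement {
public:
    virtual ~cql_statement() = default;
    // Statements that return no rows keep the default and report no metadata.
    virtual std::shared_ptr<const metadata> get_result_metadata() const {
        return nullptr;
    }
};

class select_statement : public cql_statement {
    std::shared_ptr<const metadata> _metadata;
public:
    explicit select_statement(std::vector<std::string> columns)
        : _metadata(std::make_shared<metadata>(metadata{std::move(columns)})) {}
    std::shared_ptr<const metadata> get_result_metadata() const override {
        return _metadata;
    }
};

class update_statement : public cql_statement {};
```

The caller asks every statement the same question through the base class, with no dynamic_cast<> to a concrete type.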
Message-Id: <1464334638-7971-2-git-send-email-avi@scylladb.com>
Scylla-jmx and collectd can preempt scylla and induce long latencies. Tune
the scheduler to provide lower latencies.
Since when the support processes are not running we normally do not context
switch (one thread per core, remember?), there should be no effect on
throughput.
The tunings are provided in a separate package, which can be uninstalled
if the server is shared with other applications which are negatively
affected by the tuning.
Fixes #1218.
Message-Id: <1464529625-12825-1-git-send-email-avi@scylladb.com>
compact_on_idle will lead users to thinking we're talking about sstable
compaction, not log-structured-allocator compaction.
Rename the variable to reduce the probability of confusion.
Message-Id: <1464261650-14136-1-git-send-email-avi@scylladb.com>
When a read and a write to a partition happen in parallel, the reader may detect
a digest mismatch that could trigger a cross-DC read repair attempt,
but the repair is not really needed, so the added latency is not justified.
This patch tries to prevent such parallel access from causing a heavy
cross-DC repair operation by checking the timestamp of the most recent
modification. If the modification happened less than "write timeout"
seconds ago, the patch assumes that the read operation raced with a write
and cancels the cross-DC repair, but only if CL is LOCAL_*.
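The decision can be sketched as a single predicate. The enum lists only a few consistency levels, and the function name and parameters are hypothetical.

```cpp
#include <cassert>
#include <chrono>

enum class consistency_level { ONE, QUORUM, LOCAL_ONE, LOCAL_QUORUM };

inline bool is_datacenter_local(consistency_level cl) {
    return cl == consistency_level::LOCAL_ONE || cl == consistency_level::LOCAL_QUORUM;
}

using clock_type = std::chrono::system_clock;

// True if a digest mismatch should be treated as a read/write race,
// in which case the cross-DC repair is skipped.
inline bool should_skip_cross_dc_repair(consistency_level cl,
                                        clock_type::time_point last_modification,
                                        clock_type::time_point now,
                                        std::chrono::milliseconds write_timeout) {
    return is_datacenter_local(cl) && (now - last_modification) < write_timeout;
}
```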
The space calculation counters in column family had two problems:
1. The total bytes is an ever-growing counter, which is meaningless for
the API.
2. Simply summing the sizes on all shards ignores the fact that the
same sstable file can be referenced by multiple shards; this is
especially noticeable during migration.
To solve this, the implementation was modified so that instead of
collecting the sizes, the API collects a map of file name to size
and then does the summing.
This removes the duplicates and fixes the total bytes calculation.
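The de-duplicating sum can be sketched as follows, assuming each shard reports a map from sstable file name to size. The type alias and function name are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

using file_sizes = std::unordered_map<std::string, uint64_t>;

// Merges the per-shard maps by file name, then sums, so a file shared
// by several shards is counted once.
uint64_t total_space_used(const std::vector<file_sizes>& per_shard) {
    file_sizes merged;
    for (const file_sizes& shard : per_shard) {
        for (const auto& entry : shard) {
            merged[entry.first] = entry.second;
        }
    }
    uint64_t total = 0;
    for (const auto& entry : merged) {
        total += entry.second;
    }
    return total;
}
```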
Calling cfstats before the change, under load, after a compaction happened:
$ nodetool cfstats keyspace1
Keyspace: keyspace1
Verify write latency 1068253.0 76435
Read Count: 75915
Read Latency: 0.5953986037015082 ms.
Write Count: 76435
Write Latency: 0.013975966507490025 ms.
Pending Flushes: 0
Table: standard1
SSTable count: 5
Space used (live): 44261215
Space used (total): 219724478
After the fix:
$ nodetool cfstats keyspace1
Keyspace: keyspace1
Verify write latency 1863206.0 124219
Read Count: 125401
Read Latency: 0.9381053978835895 ms.
Write Count: 124219
Write Latency: 0.01499936402643718 ms.
Pending Flushes: 0
Table: standard1
SSTable count: 6
Space used (live): 50402904
Space used (total): 50402904
Space used by snapshots (total): 0
Fixes: #1042
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1464518757-14666-2-git-send-email-amnon@scylladb.com>
We have recently committed a fix for a streaming bug that involved
reverting column_family::stop() back to calling the custom seal functions
explicitly for both memtables and streaming memtables.
Here we add a comment to explain why that had to be done.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <fe94b5883e9c29adc7fc9ee9f498894c057e7b64.1464293167.git.glauber@scylladb.com>
"This patch changes the way we wait for supported features. We no longer
sleep periodically, waking up to check if the wanted features are now
available. Instead, we register waiters in a condition variable that is
signaled whenever new endpoint information is received.
We also add a new poll interface based on the feature class, which
encapsulates the availability of a cluster feature."
This class encapsulates the waiting for a cluster feature. A feature
object is registered with the gossiper, which is responsible for later
marking it as enabled.
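A callback-based sketch of this arrangement. It is a simplification: the classes below are hypothetical stand-ins that run plain callbacks instead of using a condition variable, and on_endpoint_update() is an assumed hook for "new endpoint information was received".

```cpp
#include <cassert>
#include <functional>
#include <set>
#include <string>
#include <utility>
#include <vector>

class feature {
    std::string _name;
    bool _enabled = false;
    std::vector<std::function<void()>> _waiters;
public:
    explicit feature(std::string name) : _name(std::move(name)) {}
    const std::string& name() const { return _name; }
    // Poll interface.
    bool enabled() const { return _enabled; }
    // Runs fn right away if the feature is enabled, otherwise once it becomes enabled.
    void when_enabled(std::function<void()> fn) {
        if (_enabled) {
            fn();
        } else {
            _waiters.push_back(std::move(fn));
        }
    }
    void enable() {
        if (_enabled) {
            return;
        }
        _enabled = true;
        std::vector<std::function<void()>> waiters = std::move(_waiters);
        _waiters.clear();
        for (auto& fn : waiters) {
            fn();
        }
    }
};

class gossiper {
    std::vector<feature*> _registered;
public:
    void register_feature(feature& f) { _registered.push_back(&f); }
    // Called whenever new endpoint information arrives.
    void on_endpoint_update(const std::set<std::string>& supported_features) {
        for (feature* f : _registered) {
            if (supported_features.count(f->name())) {
                f->enable();
            }
        }
    }
};
```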
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch changes the sleep-based mechanism of detecting new features
by instead registering waiters with a condition variable that is
signaled whenever new endpoint information is received.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch removes the timeout when waiting for features,
since future patches will make this argument unnecessary.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch fixes an inadvertent change to the shadow endpoint state
map in gossiper::run, done by calling get_heart_beat_state() which
also updates the endpoint state's timestamp. This did not happen for
the normal map, but did happen for the shadow map. As a result, every
time gossiper::run() was scheduled, endpoint_map_changed would always
be true and all the shards would make superfluous copies of the
endpoint state maps.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1464309023-3254-2-git-send-email-duarte@scylladb.com>
"This patchset provides a way to enable SET_NIC(posix_net_conf.sh) on
non-AMI environments.
It also adds support for the -mq option of the script.
It also contains a number of bug fixes for the scripts.
Fixes #1192"
NOTE: scyllatop now requires the urwid library
Previously, if there were more metrics than lines in the terminal
window, the user could not see some of the metrics. Now the user can
scroll.
As an added bonus, the program will not crash when the window size
changes.
Signed-off-by: Yoav Kleinberger <yoav@scylladb.com>
Message-Id: <1464098832-5755-1-git-send-email-yoav@scylladb.com>
Vlad reported a strange user configuration:
SCYLLA_ARGS="--log-to-syslog 1 --log-to-stdout 0 --default-log-level
info --collectd-address=127.0.0.1:25826 --collectd=1
--collectd-poll-period 60000 --network-stack posix --num-io-queues 32
--max-io-requests 128 --replace-address 10.0.4.131"
seed_provider:
- class_name: org.apache.cassandra.locator.SimpleSeedProvider
parameters:
- seeds: "10.0.4.131"
However, 10.0.4.131 is the IP address of the node itself.
When the node was started, the following messages were reported:
Apr 13 06:31:12 n0 scylla[19681]: [shard 0] gossip - Connect seeds again
... (20 seconds passed)
Apr 13 06:31:13 n0 scylla[19681]: [shard 0] gossip - Connect seeds again
... (21 seconds passed)
Apr 13 06:31:14 n0 scylla[19681]: [shard 0] gossip - Connect seeds again
... (22 seconds passed)
Apr 13 06:31:15 n0 scylla[19681]: [shard 0] gossip - Connect seeds again
... (23 seconds passed)
The configuration is invalid, because for --replace-address to
work, at least one working seed node should be alive. Catch the
configuration error and fail with an appropriate error message.
Fixes #1183
Message-Id: <a94a082d896313e7a668915ae21fe2c03719da3a.1464164058.git.asias@scylladb.com>
_live_endpoints_just_added tracks the peer nodes which have just become live.
When a down node comes back, the peer nodes can receive multiple messages
which would mark the node up, e.g., messages piled up in the sender's
tcp stack after a node was blocked with gdb and released. Each such
message will trigger an echo message, and when the reply to the echo
message is received (real_mark_alive), the same node will be added to
_live_endpoints_just_added more than once. Thus, we see the
same node being favored more than once:
INFO 2016-04-12 12:09:57,399 [shard 0] gossip -
do_gossip_to_live_member: Favor newly added node 127.0.0.2
INFO 2016-04-12 12:09:58,412 [shard 0] gossip -
do_gossip_to_live_member: Favor newly added node 127.0.0.2
INFO 2016-04-12 12:09:59,429 [shard 0] gossip -
do_gossip_to_live_member: Favor newly added node 127.0.0.2
INFO 2016-04-12 12:10:00,429 [shard 0] gossip -
do_gossip_to_live_member: Favor newly added node 127.0.0.2
INFO 2016-04-12 12:10:01,430 [shard 0] gossip -
do_gossip_to_live_member: Favor newly added node 127.0.0.2
INFO 2016-04-12 12:10:02,442 [shard 0] gossip -
do_gossip_to_live_member: Favor newly added node 127.0.0.2
INFO 2016-04-12 12:10:03,454 [shard 0] gossip -
do_gossip_to_live_member: Favor newly added node 127.0.0.2
To fix, do not insert the node if it is already in
_live_endpoints_just_added.
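The guard amounts to a membership check before the push_back. In this sketch inet_address is a string stand-in and mark_just_added() is a hypothetical free function.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

using inet_address = std::string;  // stand-in for the real address type

// Adds the node only if it is not already tracked as just added.
void mark_just_added(std::vector<inet_address>& just_added, const inet_address& addr) {
    if (std::find(just_added.begin(), just_added.end(), addr) == just_added.end()) {
        just_added.push_back(addr);
    }
}
```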
Fixes #1178
Message-Id: <6bcfad4430fbc63b4a8c40ec86a2744bdfafb40f.1464161975.git.asias@scylladb.com>
In commit 4981362f57, I have introduced a regression that was thankfully
caught by our dtest infrastructure.
That patch is a preparation patch for the active reclaim patchset that is to
come, and it consolidated all the flushes using the memtable_list's seal_fn
function instead of calling the seal function explicitly.
The problem here is that the streaming memtables have the delayed mechanism,
about which the memtable_list is unaware. Calling memtable_list's
seal_active_memtable() for the streaming memtables calls the delayed version,
that does not guarantee flush. If we're lucky, we will indeed flush after the
timer expires, but if we're not we'll just stop the CF with data not flushed.
There are two options to fix this: the first is to teach the memtable_list about
the delayed/forced mechanism, and the second is to just call the correct
function explicitly during shutdown, and then when the time comes to add
continuations to the result of the seal, add them here as well.
Although the second option involves a bit more work and duplication, I think it
is better in the sense that the delayed / forced mechanism really is something
that belongs to streaming only. With streaming being the only user, I don't think it
justifies complicating the memtable_list with this concept.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <b26017c825ccf585f39f58c4ab3787d78e551f5f.1464126884.git.glauber@scylladb.com>
"This change is intended to make migration process safer and easier.
All column families will now have a directory called upload.
With this feature, users may choose to copy migrated sstables to upload
directory of respective column families, and run 'nodetool refresh'.
That's supposed to be the preferred option from now on."
The default CQL frame compression algorithm in Cassandra is LZ4. Add
support for decompressing incoming frames and compressing outgoing
frames with LZ4 if the CQL driver asks for that.
Fixes #416
Message-Id: <1464086807-11325-1-git-send-email-penberg@scylladb.com>
* seastar 6a849ac...aed893e (3):
> net: move 'transport' enum to seastar namespace
> net: sctp protocol support for posix stack
> future: Support get() when state is at a promise
This patch solves a problem where a complex type is defined as version
dependent (with the version attribute) but doesn't have a default value.
In those cases the default constructor is used, but in the case of
complex types (template) param_type should be used to get the C++ type.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1463916723-15322-1-git-send-email-amnon@scylladb.com>
This change is intended to make migration process safer and easier.
All column families will now have a directory called upload.
With this feature, users may choose to copy migrated sstables to upload
directory of respective column families, and call 'nodetool refresh'.
That's supposed to be the preferred option from now on.
For each sstable in upload directory, refresh will do the following:
1) Mutate sstable level to 0.
2) Create hard links to its components in column family dir, using
a new generation. We make it safe by creating a hard link to temporary
TOC first.
3) Remove all of its components in upload directory.
This new code runs after refresh checked for new sstables in the column
family directory. Otherwise, we could have a generation conflict.
Unlike the first step, this new step runs with sstable write enabled.
It's easier here because we know exactly which sstables are new.
After that, refresh will load new sstables found in column family
and upload directories.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The rewrite is not working because it tries to overwrite the existing statistics
file with the exclusive flag.
It's fixed by writing the new statistics into a temporary file and
renaming it into place.
If Scylla fails in the middle of the rewrite, a temporary file is left
over, so the boot code was adjusted to delete any temporary file created
by this rewrite procedure.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Currently, we register snitch API in set_server_gossip_settle() which
waits until a node has joined the cluster. This makes 'nodetool status'
not properly show the status of a joining node. Fix the issue by
registering snitch API earlier.
Fixes #1269.
Message-Id: <1463576381-15484-1-git-send-email-penberg@scylladb.com>
Since we added the scylla-conf package, we cannot install scylla-server/-tools without it, and because of this --localrpm is failing.
So copy the scylla-conf package to the AMI and install it to fix the problem.
These parameters are only required for AMI, not for non-AMI environments which want to enable SET_NIC, so split them into an individual script / conf file and call it from the AMI install script.
In a preparation move for the LSA throttler, we have reordered the
initialization fields in database.hh so that the sizes of the regions are
computed before the initialization of the region.
However, that seemingly innocent move broke one of our tests. The reason
is that if we don't destroy the column families before destroying the
region, we may end up with a use-after-free in the memtable destructor, which
itself expects to call into the region.
This patch reorders the initialization so that the CF list still comes after the
dirty regions (therefore being destroyed first), while maintaining the relative
ordering between size / region that we needed in the first place.
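The underlying C++ rule is that members are destroyed in reverse order of declaration. A sketch with hypothetical stand-in types that log their destruction:

```cpp
#include <cassert>
#include <string>
#include <vector>

std::vector<std::string> destruction_log;

struct region {
    ~region() { destruction_log.push_back("region"); }
};

struct column_families {
    region& _region;
    explicit column_families(region& r) : _region(r) {}
    // In the real code a memtable destructor may still call into the region here.
    ~column_families() { destruction_log.push_back("column_families"); }
};

struct database {
    // Declared first, therefore destroyed last.
    region _dirty_region;
    // Declared after the region, therefore destroyed before it.
    column_families _column_families{_dirty_region};
};
```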
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <0669984b5bccdb2c950f2444bdee4427abad56ba.1463508884.git.glauber@scylladb.com>
In perf-flame, I saw in
service::storage_proxy::create_write_response_handler (2.66% cpu)
gossiper::is_alive takes 0.72% cpu
locator::token_metadata::pending_endpoints_for takes 1.2% cpu
After this patch:
service::storage_proxy::create_write_response_handler (2.17% cpu)
gossiper::is_alive does not show up at all
locator::token_metadata::pending_endpoints_for takes 1.3% cpu
There is no need to copy the endpoint_state from the endpoint_state_map
to check if a node is alive. Optimize it since gossiper::is_alive is
called in the fast path.
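A sketch of a lookup that avoids the copy. The classes are simplified stand-ins; set_state() and the fields of endpoint_state are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>

struct endpoint_state {
    bool alive = false;
    // Larger per-endpoint state that would be expensive to copy.
    std::unordered_map<std::string, std::string> application_state;
};

class gossiper {
    std::unordered_map<std::string, endpoint_state> _endpoint_state_map;
public:
    void set_state(const std::string& endpoint, endpoint_state state) {
        _endpoint_state_map[endpoint] = std::move(state);
    }
    // Reads the flag through the iterator; no endpoint_state is copied.
    bool is_alive(const std::string& endpoint) const {
        auto it = _endpoint_state_map.find(endpoint);
        return it != _endpoint_state_map.end() && it->second.alive;
    }
};
```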
Message-Id: <2144310aef8d170cab34a2c96cb67cabca761ca8.1463540290.git.asias@scylladb.com>
Refresh will rewrite statistics of any migrated sstable with level
> 0. However, this operation is currently not working because the O_EXCL
flag is used, meaning that create will fail.
It turns out that we don't actually need to change the on-disk level of
an sstable by overwriting the statistics file.
It is enough to set the in-memory level of the sstable to 0. If Scylla reboots
before all migrated sstables are compacted, leveled strategy is smart
enough to detect sstables that overlap, and set their in-memory level
to 0.
Fixes #1124.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
pending_endpoints_for is called frequently by
storage_proxy::create_write_response_handler when executing CQL queries.
Before this patch, each call to pending_endpoints_for involved
converting a multimap (std::unordered_multimap<range<token>,
inet_address>) to a map (std::unordered_map<range<token>,
std::unordered_set<inet_address>>).
To speed up the token to pending endpoint mapping search, an interval map
is introduced. It is faster than searching the map linearly and can
avoid caching the token/pending endpoint mapping.
With this patch, the drop in operations per second while a node is being
added is much smaller.
Before:
45K to 10K
After:
45K to 38K
(The numbers were measured with the streaming code skipping sending data, to
rule out the streaming factor.)
Refs: #1223
object
The API now exposes rate_moving_average and
rate_moving_average_and_histogram.
The old endpoints remain for the transition period, but are marked as
deprecated.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch replaces the latency histogram with
rate_moving_average_and_histogram and the counters with
rate_moving_average.
The old endpoints were left unchanged but marked as deprecated where
needed.
This patch replaces the helper function for column family with two
functions, one that collects the relevant column family from all shards
and another one that does the translation to a json object.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
timed_rate_moving_average_and_histogram
As part of moving the derived statistics into scylla, this replaces the
histogram object in the column_family with
timed_rate_moving_average_and_histogram.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
As part of moving the derived statistics into scylla, this replaces the
counter in the row_cache stats with
timed_rate_moving_average_and_histogram.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
timed_rate_moving_average_and_histogram
As part of moving the derived statistics into scylla, this replaces the
histogram object in the storage_proxy with
timed_rate_moving_average_and_histogram, and the read, write and range
counters were replaced by rate_moving_average.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch adds the helper functions that are used to sum
rate_moving_average and rate_moving_average_and_histogram.
The current sum functionality for histogram was modified to support
rate and histogram but return a histogram. This way the current endpoints
continue to behave the same.
It also cleans up the histogram-related methods by using the plus operator
of the histogram.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch adds a few data structures for derived and cumulative
statistics that are similar to the yammer implementation used by
JMX.
It also adds a plus operator to histogram, which cleans up the histogram
usage.
moving_average - An exponentially-weighted moving average. Calculates an
event rate over a given interval.
rate_moving_average and timed_rate_moving_average - Calculate 1m, 5m and
15m EWMAs, an all-time average and a counter.
rate_moving_average_and_histogram and
timed_rate_moving_average_and_histogram - Combine a histogram with a
rate_moving_average. They also expose a histogram API, so it will be an
easy task to replace a histogram with a
timed_rate_moving_average_and_histogram.
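To illustrate the moving_average idea described above, here is a minimal Python sketch of an exponentially-weighted moving average of an event rate. The class name, the tick-based update and the smoothing formula are assumptions modelled on the yammer meter; this is not the code added by the patch.

```python
import math


class MovingAverage:
    """EWMA of an event rate, updated once per tick (illustrative only)."""

    def __init__(self, interval_s, tick_interval_s):
        # Smoothing factor: an assumption borrowed from the yammer meter.
        self.alpha = 1.0 - math.exp(-tick_interval_s / interval_s)
        self.tick_interval_s = tick_interval_s
        self.count = 0            # events seen since the last tick
        self.rate = 0.0           # smoothed events per second
        self.initialized = False

    def add(self, n=1):
        self.count += n

    def tick(self):
        # Rate observed during the tick that just ended.
        instant = self.count / self.tick_interval_s
        self.count = 0
        if self.initialized:
            self.rate += self.alpha * (instant - self.rate)
        else:
            # The first tick seeds the average with the observed rate.
            self.rate = instant
            self.initialized = True
        return self.rate


avg = MovingAverage(interval_s=60, tick_interval_s=5)
avg.add(50)
first = avg.tick()    # 50 events over 5 seconds
second = avg.tick()   # no new events, so the rate decays toward zero
```

A rate_moving_average in this spirit would hold three such objects (1m, 5m and 15m intervals) plus a plain counter for the all-time average.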
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
We've been keeping two constructors for the column family to allow for a
version without the commitlog. But it's by now quite complicated to maintain
the two, because changes always have to be made in two places.
This patch adds a private constructor that does the actual construction, and
has the public constructors call it.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <dd3cb0b9c20ad154a6131bad6ece619f70ed5025.1463448522.git.glauber@scylladb.com>
I would like to be able to apply a function at the end of every flush, that is
common for both memtables and streaming memtables. For instance, to unthrottle
current waiters. Right now some calls to seal_active_memtable are open coded,
calling the column family's function directly, for both the main memtable list
and the streaming list.
This patch moves all the current open code callers to call the respective
memtable_list function.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <0c780254f3c4eb03e2bcd856b83941cf49a84b85.1463448522.git.glauber@scylladb.com>
As Nadav pointed out, SETENV and sudo -E might cause a security hole:
https://github.com/scylladb/scylla/issues/1028#issuecomment-196202171
So drop them now, and source the envfiles from the scylla_prepare /
scylla_stop scripts instead.
Also, in the "[PATCH] ubuntu: Fix the init script variable sourcing" thread
we had a problem passing variables from the envfiles to scylla_prepare /
scylla_stop on Ubuntu, so it seems better to source them from these scripts.
Additionally, this fixes #1249
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1462989906-30062-1-git-send-email-syuu@scylladb.com>
"The Prepared message has a metadata section that's similar to result set
metadata but not exactly the same. Fix serialization by introducing a
separate prepared_metadata class like Origin has and implement
serialization as per the CQL protocol specification. This fixes one CQL
binary protocol version 4 issue that we currently have.
The changes have been verified by running the gocql integration tests
using v4. Please note that this series does *not* enable v4 for clients
because Cassandra 2.1.x series only supports CQL binary protocol v3."
Introduce a new prepared_metadata class that holds prepared statement
metadata and implement CQL binary protocol serialization that works for
all versions.
From Piotr:
Fixes #656.
It makes it possible to slice using clustering ranges in mutation
readers. We don't have row index yet so the slicing is just ignoring
data which is out of range.
Add additional parameters to mp_row_consumer to be able to fetch
only the cells for the given clustering key ranges.
This will be used in row_cache once it works at the clustering key
level instead of the partition key level.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
rate_moving_average and rate_moving_average_and_histogram are types
used by JMX. They are based on the yammer meter and timer and
are used to collect derived information.
Specifically: rate_moving_average calculates rates and
rate_moving_average_and_histogram collects rates and a
histogram.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
When running with DEBUG verbosity, scyllatop will now log every single
value it receives from collectd. When you suspect that scyllatop is
somehow distorting values, this is a good way to check it.
Signed-off-by: Yoav Kleinberger <yoav@scylladb.com>
Message-Id: <1463320730-6631-1-git-send-email-yoav@scylladb.com>
"Writes may start to be rejected by replicas after issuing alter table
which doesn't affect columns. This affects all versions with alter table
support.
Fixes #1258"
Currently we only do that when column set changes. When prepared
statements are executed, parameters like read repair chance are read
from schema version stored in the statement. Not invalidating prepared
statements on changes of such parameters will appear as if alter took
no effect.
Fixes #1255.
Message-Id: <1462985495-9767-1-git-send-email-tgrabiec@scylladb.com>
Spotted during code review.
If it doesn't defer, we may execute then_wrapped() body before we
change the state. Fix by moving then_wrapped() body after state changes.
The problem was that "s" would not be marked as synced-with if it came from
shard != 0.
As a result, mutation using that schema would fail to apply with an exception:
"attempted to mutate using not synced schema of ..."
The problem could surface when altering schema without changing
columns and restarting one of the nodes so that it forgets past
versions.
Fixes #1258.
Will be covered by dtest:
SchemaManagementTest.test_prepared_statements_work_after_node_restart_after_altering_schema_without_changing_columns
Stop using /var/lib/scylla, use $SCYLLA_HOME instead.
systemd does not seem to expand variables in Environment="HOME=$SCYLLA_HOME", but both CentOS and Ubuntu are able to run scylla-server without $HOME, so it was dropped.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1462977871-26632-1-git-send-email-syuu@scylladb.com>
ALTER KEYSPACE should allow no replication strategy to be set,
in which case old strategy should be kept.
Initial translation from origin missed this.
Fixes #1256
Message-Id: <1462967584-2875-2-git-send-email-calle@scylladb.com>
Currently scylla_io_setup is hardcoded to run iotune on /var/lib/scylla, but the user may change the data directory by modifying scylla.yaml, and it may be on a different block device.
So use scylla_config_get.py to get the configuration from scylla.yaml and pass it to iotune.
Fixes #1167
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1462955824-21983-2-git-send-email-syuu@scylladb.com>
scylla_config_get.py is added to parse scylla.yaml.
It can be used like 'scylla_config_get.py [key name]' from a shell script or the command line.
This is needed for scylla_io_setup, to get 'data_file_directories' from a shell script.
Currently it does not support specifying the key name of a nested data structure, but that is enough for scylla_io_setup.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1462955824-21983-1-git-send-email-syuu@scylladb.com>
Since Ubuntu 14.04LTS needs the scylla-gdb package, which installs to /opt/scylladb, we need to port the scylla-env package to Ubuntu as well.
This change introduces scylla-env package to Ubuntu 14.04LTS.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1462825880-20866-2-git-send-email-syuu@scylladb.com>
Since Ubuntu 14.04LTS needs the scylla-gdb package, which installs to /opt/scylladb, we need to port the scylla-env package to Ubuntu as well.
To do that, first share the package directory in dist/common/dep.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1462825880-20866-1-git-send-email-syuu@scylladb.com>
Reloads keyspace metadata and replaces it in the existing keyspace.
Note: since keyspace metadata, and consequently the replication
strategy, now becomes volatile, keyspace::metadata now returns a
shared pointer by value (i.e. keep-alive).
Replication strategy should receive the same treatment, but
since it is extensively used yet never kept across a
continuation, I've just added a comment for now.
1.) It most likely is not, i.e. either tcp or more likely, ssl
negotiation failure. In any case, we can still try next
connection.
2.) Not retrying will cause us to "leak" the accept, and then hang
on shutdown.
Also, promote logging message on accept exception to "warn", since
dtest(s?) depend on seeing log output.
Message-Id: <1462283265-27051-4-git-send-email-calle@scylladb.com>
To simplify init of msg service, use credentials_builder
to encapsulate tls options so actual credentials can be
more easily created in each shard.
Message-Id: <1462283265-27051-2-git-send-email-calle@scylladb.com>
* seastar 7782ad4...3dec26f (3):
> tests/mkcert.gmk: Fix makefile bug in snakeoil cert generator
> tls_test: Add case to do a little checking of credentials_builder
> tls: Add credentials_builder - copyable credentials "factory"
From Avi:
When we shut down, we may have to give up on some pending atomic
sstable deletions, because not all shards may have agreed to delete
all members of the set.
This is expected, so silence these frightening error messages.
Fixes #1235.
This patch adds support for secure connection attempts to be
cancellable.
Fixes #862
Includes seastar upstream merge:
* seastar f1a3520...7782ad4 (1):
> Merge "rpc: Allow client connections to be cancelled" from Duarte
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1462783335-10731-1-git-send-email-duarte@scylladb.com>
Avi says:
"During shutdown, we prevent new compactions, but perhaps too late.
Memtables are flushed and these can trigger compaction."
To solve that, let's stop compaction manager at a very early step
of shutdown. We will still try to stop compaction manager in
database::stop() because user may ask for a shutdown before scylla
was fully started. It's fine to stop compaction manager twice.
Only the first call will actually stop the manager.
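The "stop twice is fine" behaviour described above can be sketched as follows. This is a toy Python model with hypothetical names, assuming a simple stopped flag; it does not reflect the actual compaction manager code.

```python
class CompactionManager:
    """Toy stand-in; only the stop-once behaviour is modelled."""

    def __init__(self):
        self.stopped = False
        self.shutdown_runs = 0    # how many times teardown really ran

    def stop(self):
        if self.stopped:
            return                # later calls are no-ops
        self.stopped = True
        self.shutdown_runs += 1   # real teardown would happen here


cm = CompactionManager()
cm.stop()   # early in shutdown
cm.stop()   # again from database::stop()
```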
Fixes #1238.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <c64ab11f3c91129c424259d317e48abc5bde6ff3.1462496694.git.raphaelsc@scylladb.com>
* seastar e536555...ab74536 (4):
> reactor: kill max_inline_continuations
> smp: optimize smp_message_queue::flush_request_batch() for empty queue
> thread: do not yield if idle
> Merge "Fixes for iotune" from Glauber
Clustering key prefix may have fewer columns than described in schema.
Deserialization should stop when end of buffer is reached.
Message-Id: <20160503140420.GP23113@scylladb.com>
It was noticed that small sstables will accumulate for a column family because
scylla was limited to two compactions per shard, and a column family could have
at most one compaction running at a given shard. With the number of sstables
increasing rapidly, read performance is degraded.
At the moment, our compaction manager works by running two compaction task
handlers that run in parallel to the rest of the system. Each task handler
gets to run when needed, gets a column family from compaction manager queue,
runs compaction on it, and goes to sleep again. That's basically its cycle.
Compaction manager only allows one instance of a column family to be on its
queue, meaning that it's impossible for a column family to be compacted in
parallel. One compaction starts after another for a given column family.
To solve the problem described, we want to concurrently run compaction jobs
of a column family that have different "size tier" (or "weight").
For those unfamiliar, a compaction job contains a list of sstables that will be
compacted together.
The "size tier" of a compaction job is the log of the total size of the input
sstables. So a compaction job only gets to run if its "size tier" is not the
same as that of an ongoing compaction. There is no point in compacting
concurrently at the same "size tier", because that slows down both compactions.
We will no longer queue column families in compaction manager. Instead, we
create a new fiber to run compaction on demand.
This fiber that runs asynchronously will do the following:
1) Get a compaction job from compaction strategy.
2) Calculate "size tier" of compaction job.
3) Run compaction job if its "size tier" is not the same as that of an ongoing
compaction for the given column family.
As before, it may decide to re-compact a column family based on a stat stored
in column family object.
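The steps above can be sketched roughly as follows. This is a simplified Python model, not the patch itself: the log base (2), the class name and the method names are assumptions made for illustration, and a job is represented only by the list of its input sstable sizes in bytes.

```python
def size_tier(sstable_sizes):
    """Integer log2 of the total input size (base 2 is an assumption)."""
    total = sum(sstable_sizes)
    return total.bit_length() - 1 if total > 0 else 0


class ColumnFamilyCompactions:
    """Tracks the size tiers of compactions running for one column family."""

    def __init__(self):
        self.running = set()

    def try_start(self, sstable_sizes):
        tier = size_tier(sstable_sizes)
        if tier in self.running:
            return None           # same tier already compacting: postpone
        self.running.add(tier)
        return tier

    def finish(self, tier):
        self.running.discard(tier)
```

With this model, two jobs of different tiers for the same column family can run at once, while a second job of an already running tier has to wait until `finish()` is called for that tier.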
Ran all compaction-related dtests.
Fixes #1216.
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <d30952ff136192a522bde4351926130addec8852.1462311908.git.raphaelsc@scylladb.com>
In initial implementation I figured this was not required, but
we get issues communicating across nodes if system tables
don't have the same UUID, since creation is forcefully local, yet
shared.
Just do a manual re-create of the schema with a name UUID, and
use migration manager directly.
Message-Id: <1462194588-11964-1-git-send-email-calle@scylladb.com>
The patch calculates the row count during result building and while merging.
If one of the results being merged does not have a row count, the
merged result will not have one either.
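The merging rule can be illustrated with a short Python sketch. The function name is hypothetical, and a missing row count is modelled as None; this only shows the rule and is not the actual result-merging code.

```python
def merge_row_counts(counts):
    """Sum row counts; the merged count is unknown if any input is unknown."""
    total = 0
    for count in counts:
        if count is None:
            return None   # one result lacks a row count, so does the merge
        total += count
    return total
```

For example, `merge_row_counts([3, 4])` yields a known count, while `merge_row_counts([3, None])` yields no count at all.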
Fixes: #1220
While the server_credentials object is technically immutable
(esp with last change in seastar), the ::shared_ptr holding them
is not safe to share across shards.
Pre-create cpu x credentials and then move-hand them out in service
start-up instead.
Fixes assertion error in debug builds. And just maybe real memory
corruption in release.
Requires seastar tls change:
"Change server_credentials to copy dh_params input"
Message-Id: <1462187704-2056-1-git-send-email-calle@scylladb.com>
Leveled compaction strategy is doing a lot of work whenever it's asked to get
a list of sstables to be compacted. It's checking if a sstable overlaps with
another sstable in the same level twice. First, when adding a sstable to a
list with sstables at the same level. Second, after adding all sstables to
their respective lists.
It's enough to check that a sstable creates an overlap in its level only once.
So I am changing the code to unconditionally insert a sstable to its respective
list, and after that, it will call repair_overlapping_sstables() that will send
any sstable that creates an overlap in its level to L0 list.
By the way, the optimization isn't in the compaction itself, but in the
strategy code that gets a set of sstables to be compacted.
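The single-pass approach described above could look roughly like this. It is a Python sketch under stated assumptions: sstables are modelled as hypothetical (name, level, first_key, last_key) tuples with comparable keys, and the overlap test is a plain comparison against the last sstable kept in the level. The real repair_overlapping_sstables() may differ in detail.

```python
from collections import defaultdict


def build_levels(sstables):
    """sstables: iterable of (name, level, first_key, last_key) tuples."""
    levels = defaultdict(list)
    for sst in sstables:
        levels[sst[1]].append(sst)       # unconditional insert, no check yet

    # Single overlap check per level, done after all inserts.
    for level in [l for l in levels if l > 0]:
        kept, prev_last = [], None
        for sst in sorted(levels[level], key=lambda s: s[2]):
            if prev_last is not None and sst[2] <= prev_last:
                levels[0].append(sst)    # overlaps a kept sstable: send to L0
            else:
                kept.append(sst)
                prev_last = sst[3]
        levels[level] = kept
    return levels
```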
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <8c8526737277cb47987a3a5dbd5ff3bb81a6d038.1461965074.git.raphaelsc@scylladb.com>
Currently scylla_setup is unusable when the user does not want to install scylla-jmx, because it checks for the package unconditionally. Some users (or developers) do not want to install it, so ask on the interactive prompt whether to skip the check.
Also, the scylla-tools package should be installed in most cases, so a check for that package was added.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1460662354-10221-1-git-send-email-syuu@scylladb.com>
"summary":"Start reporting on one or more collectd metric",
"type":"void",
"nickname":"enable_collectd",
"produces":[
"application/json"
],
"parameters":[
{
"name":"pluginid",
"description":"The plugin ID, describe the component the metric belongs to. Examples are cache, thrift, etc'. Regex are supported.The plugin ID, describe the component the metric belong to. Examples are: cache, thrift etc'. regex are supported",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
},
{
"name":"instance",
"description":"The plugin instance typically #CPU indicating per CPU metric. Regex are supported. Omit for all",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"type",
"description":"The plugin type, the type of the information. Examples are total_operations, bytes, total_operations, etc'. Regex are supported. Omit for all",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"type_instance",
"description":"The plugin type instance, the specific metric. Exampls are total_writes, total_size, zones, etc'. Regex are supported, Omit for all",
"required":false,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"enable",
"description":"set to true to enable all, anything else or omit to disable",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
}
]
}
]
},
@@ -63,10 +114,10 @@
"operations":[
{
"method":"GET",
"summary":"Get a collectd value",
"summary":"Get a list of all collectd metrics and their status",
"type":"array",
"items":{
"type":"type_instance_id"
"type":"collectd_metric_status"
},
"nickname":"get_collectd_items",
"produces":[
@@ -74,6 +125,25 @@
],
"parameters":[
]
},
{
"method":"POST",
"summary":"Enable or disable all collectd metrics",
"type":"void",
"nickname":"enable_all_collectd",
"produces":[
"application/json"
],
"parameters":[
{
"name":"enable",
"description":"set to true to enable all, anything else or omit to disable",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
}
]
}
]
}
@@ -113,6 +183,20 @@
}
}
}
},
"collectd_metric_status":{
"id":"collectd_metric_status",
"description":"Holds a collectd id and an enable flag",