Files with a conf extension are run by the scylla_prepare on the AMI.
The scylla-housekeeping configuration file is not a bash script and
should not be run.
This patch changes its extension to cfg which is more python like.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1470896759-22651-2-git-send-email-amnon@scylladb.com>
maybe_flush_pi_block, which is called for each cell, assumes that
block_first_colname will be empty when the first cell is encountered
for each partition.
This didn't hold after writing partition which generated no index
entry, because block_first_colname was cleared only when there way any
data written into the promoted index. Fix by always clearing the name.
The effect was that the promoted index entry for the next partition
would be flushed sooner than necessary (still counting since the start
of the previous partition) and with offset pointing to the start of
the current partition. This will cause parsing error when such sstable
is read through promoted index entry because the offset is assumed to
point to a cell not to partition start.
Fixes#1567
Message-Id: <1470909915-4400-1-git-send-email-tgrabiec@scylladb.com>
The dist flag mark the debian package as distributed package.
As such the housekeeping configuration file will be included in the
package and will not need to be created by the scylla_setup.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1470907208-502-2-git-send-email-amnon@scylladb.com>
Add '--smp', '--memory', and '--overprovisioned' options to the Docker
image. The options are written to /etc/scylla.d/docker.conf file, which
is picked up by the Scylla startup scripts.
You can now, for example, restrict your Docker container to 1 CPU and 1
GB of memory with:
$ docker run --name some-scylla penberg/scylla --smp 1 --memory 1G --overprovisioned 1
Needed by folks who want to run Scylla on Docker in production.
Cc: Sasha Levin <alexander.levin@verizon.com>
Message-Id: <1470680445-25731-1-git-send-email-penberg@scylladb.com>
Scylla was tracking min and max column names instead. Min and max
clustering components are tracked to optimize reads that use a
clustering filter. For more details:
https://issues.apache.org/jira/browse/CASSANDRA-5514
Also fix potential bug if clustering value is empty.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The sanitizer of the debug build warns when a "bool" variable is read when
containing a value not 0 or 1. In particular, if a class has an
uninitialized bool field, which class logic allows to only be set later,
then "move"ing such an object will read the uninitialized value and produce
this warning.
This patch fixes four of these warnings seen in sstable_test by initializing
some bool fields to false, even though the code doesn't strictly need this
initialization.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1470744318-10230-1-git-send-email-nyh@scylladb.com>
This patch sets the default validator for dynamic column families.
Doing so has no consequences in terms of behavior, but it causes the
correct type to be shown when describing the column family through
cassandra-cli.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1470739773-30497-1-git-send-email-duarte@scylladb.com>
"The series adds an optional configuration file to the scylla-housekeeping. The
file act as a way to prevent the scylla-housekeeping to run. A missing
configuration file, will make the scylla-housekeeping immediately.
The series adds a flag to the build_rpm that differentiate between public
distributions that would contain the configuration file and private
distributions that will not contain it which will cause the setup script to
create it."
The promoted-index reading code contained a bug where it copied the value
of an disengaged optional (this non-value was never used, but it was still
copied ). Fix it by keeping the optional<> as such longer.
This bug caused tests/sstable_test in the debug build to crash (the release
build somehow worked).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1470742418-8813-1-git-send-email-nyh@scylladb.com>
This patch ensures we always send the column metadata, even when the
column family is dynamic and the metadata is empty, as some clients
like cassandra-cli always assume its presence.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1470740971-31169-1-git-send-email-duarte@scylladb.com>
Commit 0d8463aba5 broke some of the tests with an assertion
failure about local_is_initialized(). It turns out that there is more than
one level of local_is_initialized() we need to check... For some tests,
neither locals were initialized, but for others, one was and the other
wasn't, and the wrong one was tested.
With this patch, all unit tests except "flush_queue_test.cc" pass on my
machine. I doubt this test is relevant to the promoted index patches,
but I'll continue to investigate it.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1470695199-32649-1-git-send-email-nyh@scylladb.com>
"
While periodic mode is a all-bets-off crap-shoot as far as knowing if
data actually reached disk or not, batch mode is supposed to be
somewhat more reliable/deterministic.
Thus, if we get an exception writing/flushing the current buffer,
we should propagate exceptions to all execution paths involved
in this buffer.
Flush queue can now (optionally) propagate exceptions to all clients, and
commit log uses this to ensure that commit log writers in batch mode
all generate exceptions on disk errors.
Also includes some rudimentary tests for flush queue mechanisms.
Note: other main user, sstable flushing, is not affected, as default
mode is still to keep exceptions to individual worker continuations,
not waiters."
"The goal of this patch series is to support reading and writing of a
"promoted index" - the Cassandra 2.* SSTable feature which allows reading
only a part of the partition without needing to read an entire partition
when it is very long. To make a long story short, a "promoted index" is
a sample of each partition's column names, written to the SSTable Index
file with that partition's entry. See a longer explanation of the index
file format, and the promoted index, here:
https://github.com/scylladb/scylla/wiki/SSTables-Index-File
There are two main features in this series - first enabling reading of
parts of partitions (using the promoted index stored in an sstable),
and then enable writing promoted indexes to new sstables. These two
features are broken up into smaller stand-alone pieces to facilitate the
review.
Three features are still missing from this series and are planned to be
developed later:
1. When we fail to parse a partition's promoted index, we silently fall back
to reading the entire partition. We should log (with rate limiting) and
count these errors, to help in debugging sstable problems.
2. The current code only uses the promoted index when looking for a single
contiguous clustering-key range. If the ck range is non-contiguous, we
fall back to reading the entire partition. We should use the promoted
index in that case too.
3. The current code only uses the promoted index when reading a single
partition, via sstable::read_row(). When scanning through all or a
range of partitions (read_rows() or read_range_rows()), we do not yet
use the promoted index; We read contiguously from data file (we do not
even read from the index file, so unsurprisingly we can't use it)."
In this unit test, we create using Scylla C++ code, the same large
partition with 13520 CQL rows as we previously imported from Cassandra
for the large partition test. We then verify that the sstable index file
we just wrote is byte-for-byte identical to the one previously created by
Cassandra. They should indeed be identical, because the data file has the
same layout (even if timestamps are different) and our default promoted-
index block size is the same (64K) so the sample of columns should be
identical.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds writing of promoted index to sstables.
The promoted index is basically a sample of columns and their positions
for large partitions: The promoted index appears in the sstable's index
file for partitions which are larger than 64 KB, and divides the partition
to 64 KB blocks (as in Cassandra, this interval is configurable through
the column_index_size_in_kb config parameter). Beyond modifying the index
file, having a promoted index may also modify the data file: Since each
of blocks may be read independently, we need to add in the beginning of
each block the list of range tombstones that are still open at that
position.
See also https://github.com/scylladb/scylla/wiki/SSTables-Index-FileFixes#959
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Add to the range_tombstone_accumulator a range_tombstones_for_row(ck)
method.
Just like the existing tombstone_for_row(ck), this function drops from
the accumulator tombstones that end before ck. But while the existing
function returned just a single tombstone affecting the given row (the
most recent tombstone), the new function range_tombstones_for_row(ck)
returns all the accumulated range tombstones which cover ck.
This function will be useful for the promoted-index writing code later,
which divides a partition into blocks which may be read independently,
so each block needs to start with a repeat of the earlier tombstones
which still cover the first row in the new block.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds support more efficiently reading small parts of a large
partition, without reading the entire partition as we had to do so far.
This is done using the "promoted index".
The "promoted index" is stored in the sstable index file, and provides
for each large sstable row ("partition" in CQL nomenclature) a sample of
the column names at (for example) 64KB intervals. This means that when we
read a slice of columns (e.g., cql rows), or page through a large partition,
we do not have to read the entire partition from disk.
This patch only implements the read side of promoted index - a later patch
will add the write-side support (i.e., writing the promoted index to the
index file while saving the sstable). Nevertheless this patch can already
be tested by reading existing sstables from Cassandra which include a
promoted index - such as the one included in the test in the previous patch.
The use of the promoted index currently has two limitations:
1. It is only used when reading a single partition with sstable::read_row(),
not when scanning through many partitions with sstable::read_range_rows()
or sstable::read_rows().
2. It is only used when filtering a single clustering-key range, rather
than a list of disjoint ranges. A single range is the common case.
These two issues will be improved later. In the meantime, in those
unsupported cases we simply continue to read entire partitions, so we're not
worse-off than before.
Also note that this patch only helps when sstable::read_row() is used with
a clustering-key prefix (i.e., a slice). Our higher-level request handling
code may decide to read an entire partition into the cache, and not use
a clustering-key prefix at all when reading. We will need to indepdently
improve the high-level code to use read_row()'s slicing capabilities
when paging through large partitions, for example.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Our sstable reading code is currently hard-coded to read entire partitions,
even if we know that only a subset of the columns are requested.
This patch introduces find_disk_ranges(), a function to find the ranges of
bytes we need to read from the sstable data file to guarantee that the
desired columns from the desired partition are read.
The returned range may be the entire byte range of the given partition -
as found using the summary and index files - but if the index contains a
"promoted index" (basically a sample of column positions for each key)
we may return a smaller range. The "disk_read_range" type introduced in
the previous patch is extended here to support reading a partial partition -
by including additional information which would be missed when reading only
part of a partition (viz., the partition key and the partition's tombstone).
This function isn't used in this patch - we will wire its use in the next
patch, which will complete the read-side support for the promoted index.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Our index_entry type, holding one partition's entry that we read from the
index file, already contained the "_promoted_index" which we read from
disk - as an unparsed byte buffer. But there wasn't any API to access
this buffer after it was read. This patch adds a trivial getter, to get
a read-only view of this buffer.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The "struct column" code in partition.cc is generally useful code for
parsing serialized column names from the sstable. It is currently private
inside the "mp_row_consumer" class. But in a next patch we'll also want
to use it in the "sstable" class, for the promoted-index parsing code,
which among other things also needs to deserialize column names.
The trivial fix, in this patch, is to make this code "public". However,
for now it is still available only in partition.cc.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Currently, the main sstable data parsing entry point data_consume_rows()
takes a contiguous range of bytes to read from disk and parse. This range
is supposed to be an entire partition or contiguous group of partitions.
and is self contained (can be parsed without extra information about the
identity of these partitions).
For the promoted index feature (which we will add in a following patch)
we will want the range to span only a part of a partition, and will need
the caller to provide some information not available to the parser (such
as the partition's key). In the future, we will also want to support a
vector of byte ranges, instead of just one.
So in preparation for this, this patch simply replaces the start/end pair
by a new class disk_read_range, which can be easily extended in later
patches. No new functionality is introduced in this patch.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In a later patch adding "promoted index" read support, we would like to
parse only part of an sstable row. In that case, the parser should start
not at the usual ROW_START state, but rather at the ATOM_START state.
But there's a problem: The sstable parser consumer currently assumes that
the parser stops after the start of the row, before reading any atoms.
So in the partial row case too, we must stop parsing before reading the
first atom.
For this, this patch adds the new "STOP_THEN_ATOM_START" parser state.
When starting in this state, the parser stops immediately (with
row_consumer::proceed::no), and when restarted again it will be in the
ATOM_START case.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a test that takes an sstable with one partition of 13,520
clustering rows (spanning 700 KB in the data file), and attempts to read
various slices CQL rows, counting that we got back the expected number
of rows.
The sstable included here was generated by Cassandra, and includes a
promoted index. Promoted index reading is not supported yet (we will
add it in the next patch), so for now the code will always read the
entire partition from disk; But still the clustering-key filtering is
already functional, and will drop some of the rows as requested,
so this test will pass.
Later, when we add promoted index support, we should check that this test
still passes - promoted index will make the reads in this test more
efficient (which the test cannot verify), but the important thing to check
is that it doesn't break any of these tests.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Adding the configuration file is a way to make the running
scylla-housekeeping service not run the check version.
If the file does not exists, it will be set by the setup script.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
The new dist flag difrentiate between public distribution and private
compilations.
For public distributation the housekeeping configuration file will be
installed.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
When running the scylla_setup, the script would check that the
housekeeping configuration file: housekepping.conf exists.
If not, it would ask the user if to run the check version option or not
and create the file accordingly.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
The housekeeping.conf is a configuration file for scylla-housekeeping.
By default it will be included in the rpm and state that the
check-version would be run.
If the file is missing, or if check-version is set to false, the check
version operation will not be performed.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch adds an optional config file that can be passed from the
command line.
If the config file is specified and does not exist, the script will
terminate.
The only parameter that is currently available is check-version that can
be either true or false.
The ConfigParser module is used to read the config file, that should be
on the form of:
[housekeeping]
check-version: True
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
"The compact column is a dense schema's single regular column. Its
existence has been a source of bugs, so this patchset removes the
column_kind::compact_column, as well as further references to compact
columns from the code base.
Fixes#1542"
"Kubernetes is unhappy with our Docker image because we start systemd
under the hood. Fix that by switching to use "supervisord" to manage the
two processes -- "scylla" and "scylla-jmx":
http://blog.kunicki.org/blog/2016/02/12/multiple-entrypoints-in-docker/
While at it, fix up "docker logs" and "docker exec cqlsh" to work
out-of-the-box, and update our documentation to match what we have.
Further work is needed to ensure Scylla production configuration works
as expected and is documented accordingly."
Calls like later() and with_gate() may allocate memory, although that is not
very common. This can create a problem in the sense that it will potentially
recurse and bring us back to the allocator during free - which is the very thing
we are trying to avoid with the call to later().
This patch wraps the relevant calls in the reclaimer lock. This do mean that the
allocation may fail if we are under severe pressure - which includes having
exhausted all reserved space - but at least we won't recurse back to the
allocator.
To make sure we do this as early as possible, we just fold both release_requests
and do_release_requests into a single function
Thanks Tomek for the suggestion.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <980245ccc17960cf4fcbbfedb29d1878a98d85d8.1470254846.git.glauber@scylladb.com>
Fix Docker Hub documentation to match what we have right now.
More work is needed in the following areas:
* How to make a cluster
* How to configure Docker image for production use