The direct motivation for this is to be able to use two index readers
from a single mutation reader, one for the lower bound of the range and
one for the upper bound, without sacrificing the optimization of
avoiding index reads when forwarding to partition ranges which are
close by. After the change, all index readers of a given sstable will
share index buffers, so the lower bound reader can reuse the page read
by the upper bound reader.
The reason for using two readers is to be able to skip inside the
partition range, not only outside of it. This is not possible if we use
the same index reader to locate the upper bound of the range, because
the cursor can only be advanced.
Failing to close a file properly before destroying the file object
causes crashes.
[tgrabiec: fixed typo]
Message-Id: <20170221144858.GG11471@scylladb.com>
The close() operation is like a destructor: it cannot fail. It may
report errors, but the close itself succeeds. So we should proceed with
closing even if it reports an error.
Message-Id: <1484245886-7269-1-git-send-email-tgrabiec@scylladb.com>
After resharding, sstables may be owned by all shards, which
means that file descriptor and memory usage for metadata will
increase by a factor equal to the number of shards. That can easily
lead to OOM.
SSTable components are immutable, so they can be stored in one
shard and shared with others that need it. We use the following
formula to decide which shard will open the sstable and share
it with the others: (generation % smp::count), which is the
inverse of how we calculate generation for new sstables.
So if no resharding is performed, everything is shard-local.
With this approach, resource usage due to loaded sstables will
be evenly distributed among shards.
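The owner selection described above can be sketched as follows. This is
a minimal illustration, not the actual implementation: shard_count
stands in for smp::count, and owner_shard() and next_generation() are
hypothetical names. The generation scheme shown (each shard allocating
generations congruent to its own id) is an assumption, chosen to make
the "inverse" relationship visible.

```cpp
#include <cstdint>

// Hypothetical stand-in for smp::count (number of shards on this node).
constexpr unsigned shard_count = 4;

// Shard that opens the sstable and shares its components with the others.
constexpr unsigned owner_shard(std::int64_t generation) {
    return static_cast<unsigned>(generation % shard_count);
}

// Assumed generation scheme: shard `shard` picks a generation greater than
// `highest_seen` that is congruent to `shard` modulo shard_count.
constexpr std::int64_t next_generation(std::int64_t highest_seen, unsigned shard) {
    std::int64_t base = (highest_seen / shard_count + 1) * shard_count;
    return base + shard;
}
```

Under that assumption, an sstable written by a shard maps back to the
same shard, which is why nothing is shared unless resharding happens.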
For this approach to work, we now only populate keyspaces from
shard 0. It is now solely responsible for iterating through
column family dirs. In addition, most of the population functions
are now free functions that take the distributed database object as
a parameter.
Fixes #1951.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
We intend to share immutable sstable components among shards to
reduce excessive memory usage when resharding shared sstables.
This change is about grouping those components into a structure,
and using a foreign_ptr to make sure that the structure will be
deleted by whichever shard created it.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Rename _components to _recognized_components because _components
will be used to name a field with shareable components.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
File output streams take responsibility for closing the file: they
close the file as part of closing the stream.
During sstable writing we create an sstable object and keep file
references there as well. The sstable object also has responsibility
for closing the files, and does so from sstable::~sstable().
Double close was supposed to be avoided by a construct like this:
    writer.close().get();
    _file = {};
However if close() failed, which can happen when write-ahead failed,
_file would not be cleared, and both the writer and sstable would
close the file. This will result in a crash in
append_challenged_posix_file_impl::close(), which is not prepared to
be closed twice.
Another problem is that if an exception happened before we reached that
construct, we should still close the writer. Currently we don't, so
there's no double close on the file, but that's a bug which needs to
be fixed, and once that's fixed a double close on _file will be even
more likely.
The fix employed here is to not keep files inside sstable object when
writing. As soon as the writer is constructed, it's the only owner of
the file.
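The ownership rule above can be sketched in plain C++. This is a
simplified, synchronous model with hypothetical types (fake_file,
file_writer, sstable_like); the real code uses Seastar files and
futures. The point it illustrates is that once the writer is the only
owner, a failing close() cannot lead to a second close.

```cpp
#include <memory>
#include <stdexcept>
#include <utility>

// Hypothetical file type that counts close() calls; closing twice is a bug.
struct fake_file {
    int* close_calls;    // incremented on every close()
    bool fail_on_close;  // simulate close() reporting an error
    void close() {
        ++*close_calls;
        if (fail_on_close) {
            throw std::runtime_error("close failed");
        }
    }
};

// The writer is the sole owner of the file once constructed.
class file_writer {
    std::unique_ptr<fake_file> _file;
public:
    explicit file_writer(std::unique_ptr<fake_file> f) : _file(std::move(f)) {}
    // Gives up ownership before closing, so the file is closed at most
    // once even if close() throws.
    void close() {
        if (auto f = std::exchange(_file, nullptr)) {
            f->close();
        }
    }
};

// Hypothetical sstable-like object: it hands the file to the writer and
// keeps no reference of its own, so it has nothing to close later.
struct sstable_like {
    void write(std::unique_ptr<fake_file> f) {
        file_writer w(std::move(f));
        w.close();  // may throw; the file has still been closed only once
    }
};
```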
Fixes #1764.
Message-Id: <1482428648-22553-1-git-send-email-tgrabiec@scylladb.com>
The Cassandra-derived sstable tools (and likely Cassandra itself) object to
a new sub-component in the Statistics component; create a new Scylla
component instead to host this data.
Allow declaring discriminated unions (with an enum type as the
discriminant and any sstable serializable type as a value) and sets
of these unions, with the discriminant as the key. Parsers and writers
are auto-generated.
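As a rough illustration of the data model, here is a sketch using
std::variant and std::map. The tag names, the value types, and the
on-disk layout (count, then discriminant, payload size, payload) are
all assumptions made for this example; the text above does not specify
them, and the real parsers and writers are generated rather than
hand-written like this.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <variant>
#include <vector>

// Hypothetical discriminant; real tag names are not given in the text.
enum class tag : std::uint32_t { count = 1, label = 2 };

// One member of the union: the value type depends on the discriminant.
using tagged_value = std::variant<std::uint64_t, std::string>;

// A set of union members, keyed by discriminant (at most one per tag).
using tagged_set = std::map<tag, tagged_value>;

// Appends a 32-bit value in big-endian order.
inline void put_u32(std::vector<std::uint8_t>& out, std::uint32_t v) {
    for (int shift = 24; shift >= 0; shift -= 8) {
        out.push_back(static_cast<std::uint8_t>(v >> shift));
    }
}

// Assumed layout: element count, then (discriminant, payload size, payload)
// for each element. A size prefix lets a reader skip unknown discriminants.
inline std::vector<std::uint8_t> serialize(const tagged_set& s) {
    std::vector<std::uint8_t> out;
    put_u32(out, static_cast<std::uint32_t>(s.size()));
    for (const auto& [t, v] : s) {
        std::vector<std::uint8_t> payload;
        if (auto* n = std::get_if<std::uint64_t>(&v)) {
            put_u32(payload, static_cast<std::uint32_t>(*n >> 32));
            put_u32(payload, static_cast<std::uint32_t>(*n));
        } else {
            const auto& str = std::get<std::string>(v);
            payload.assign(str.begin(), str.end());
        }
        put_u32(out, static_cast<std::uint32_t>(t));
        put_u32(out, static_cast<std::uint32_t>(payload.size()));
        out.insert(out.end(), payload.begin(), payload.end());
    }
    return out;
}
```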
GCC 5.3.1 was unable to convert bound to optional<bound>.
sstables/sstables.cc:2494:123: error: no matching function for call to
‘nonwrapping_range<dht::ring_position>::nonwrapping_range(dht::ring_position,
dht::ring_position)’
(dtr.right.exclusive ? dht::ring_position::starting_at :
dht::ring_position::ending_at)(std::move(t2)));
In file included from ./dht/i_partitioner.hh:52:0,
from ./query-request.hh:28,
from ./clustering_key_filter.hh:27,
from sstables/sstables.hh:35,
from sstables/sstables.cc:38:
./range.hh:441:14: note: candidate: nonwrapping_range<T>::nonwrapping_range(
const wrapping_range<U>&) [with T = dht::ring_position]
explicit nonwrapping_range(const wrapping_range<T>& r)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <95bbf984cd73a61739c8da99cf6cd5e94f1d1457.1479954360.git.raphaelsc@scylladb.com>
When we load an sstable, we don't know beforehand which shards it belongs
to; we don't want to open it until we do. Add a method that allows us
to read just the sharding data, without opening anything else.
Add a metadata component that describes token ranges that are spanned by
this sstable. With the current sharding algorithm, where each shard owns
a single token range, the first/last partition key is sufficient to
describe sharding information, but for multi-range algorithms, this
is not sufficient.
Instead of calculating the offset for each statistic component manually,
use a loop to iterate over all components, accumulating the offset as we
go along.
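A minimal sketch of the accumulating loop, assuming each component
knows its own serialized size. The component struct, compute_offsets()
and the header_size parameter are hypothetical names for illustration,
not the actual sstable code.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical description of one statistics sub-component.
struct component {
    std::uint32_t type;
    std::uint32_t serialized_size;
};

// Returns (type, offset) pairs, where each offset is where that
// component's body starts, given the size of the header that precedes
// all of them.
inline std::vector<std::pair<std::uint32_t, std::uint32_t>>
compute_offsets(const std::vector<component>& components, std::uint32_t header_size) {
    std::vector<std::pair<std::uint32_t, std::uint32_t>> offsets;
    offsets.reserve(components.size());
    std::uint32_t offset = header_size;
    for (const auto& c : components) {
        offsets.emplace_back(c.type, offset);
        offset += c.serialized_size;  // next component starts right after
    }
    return offsets;
}
```

Adding a component then only requires adding it to the list; no offset
arithmetic has to be updated by hand.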
write() doesn't need to change its input; so change it to const.
The only snag is that describe_type() isn't and can't be made const-correct,
so cheat when it is called and const_cast the input.
This helps in writing a generic serialized_size() that is const correct,
in the next patch.
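The const_cast workaround can be sketched like this. example_header
and the visitor-style describe_type() shown here are assumptions that
only approximate the shape of the real interface; the sketch shows a
write() that accepts const input while calling a non-const
describe_type() without modifying anything.

```cpp
#include <cstdint>
#include <type_traits>
#include <vector>

using buffer = std::vector<std::uint8_t>;

// Writes an unsigned integer in big-endian order.
template <typename T>
std::enable_if_t<std::is_unsigned_v<T>> write(buffer& out, const T& v) {
    for (int shift = static_cast<int>(8 * (sizeof(T) - 1)); shift >= 0; shift -= 8) {
        out.push_back(static_cast<std::uint8_t>(v >> shift));
    }
}

// Hypothetical serializable type. describe_type() is non-const because the
// same visitor interface is assumed to be shared with parsers, which do
// modify the members.
struct example_header {
    std::uint32_t version;
    std::uint64_t length;
    template <typename Visitor>
    auto describe_type(Visitor&& v) { return v(version, length); }
};

// write() takes its input by const reference. Constness is cast away only
// to call describe_type(); the visitor receives the members as const again.
template <typename T>
std::enable_if_t<std::is_class_v<T>> write(buffer& out, const T& t) {
    const_cast<T&>(t).describe_type([&out](const auto&... members) {
        (write(out, members), ...);
    });
}
```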
The leak results in deleted sstables being kept open until shutdown, so disk
space isn't released. That's because column_family::rebuild_sstable_list()
will not remove references to deleted sstables if an exception was triggered in
sstables::delete_atomically(). An sstable only has its files closed when its
object is destroyed.
The exception happens when a major compaction is issued in parallel with a
regular one, and one of them is unable to delete an sstable already deleted
by the other. That results in remove_by_toc_name() throwing
boost::filesystem::filesystem_error because the TOC and temporary TOC don't
exist.
We wouldn't have seen this problem if major compaction were going through
compaction manager, but remove_by_toc_name() and rebuild_sstable_list() should
be made resilient.
Fixes #1840.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <d43b2e78f9658e2c3c5bbb7f813756f18874bf92.1479390842.git.raphaelsc@scylladb.com>
If the sstable Summary is not present, Scylla does not refuse to boot but
instead creates the summary information on the fly. There is a bug in this
code, though. A Summary file is a map between keys and offsets into the Index
file, but the code creates a map between keys and Data file offsets
instead. Fix it by keeping the offset of an index entry in the index_entry
structure and using it during Summary file creation.
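To make the distinction concrete, here is a small sketch of summary
generation. The struct layouts, build_summary() and the fixed sampling
interval are assumptions for illustration only; the relevant detail is
which of the two offsets ends up in the summary.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// One entry parsed from the Index file (simplified).
struct index_entry {
    std::string key;
    std::uint64_t data_position;   // where the partition starts in the Data file
    std::uint64_t index_position;  // where this entry itself starts in the Index file
};

struct summary_entry {
    std::string key;
    std::uint64_t position;  // must be an Index file offset
};

// Samples every `interval`-th index entry into a summary.
// `interval` must be greater than zero.
inline std::vector<summary_entry>
build_summary(const std::vector<index_entry>& entries, std::size_t interval) {
    std::vector<summary_entry> summary;
    for (std::size_t i = 0; i < entries.size(); i += interval) {
        // The bug described above amounts to using data_position here.
        summary.push_back({entries[i].key, entries[i].index_position});
    }
    return summary;
}
```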
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20161116165421.GA22296@scylladb.com>
In order to ensure exception safety, the atomic sstable deletion code
creates a copy of the list of sstables pending deletion, modifies that
copy, and then replaces the original data with the copy. This guarantees
that any exception does not change the data, since the assignment does
not require allocation.
However, it does result in quadratic behavior. During startup, all
sstables are loaded on each shard, and each shard deletes sstables that
do not have any partitions served by that shard; this results in
almost all sstables being deleted from all shards, with all that work
going to shard 0; the list grows to O(nr sstables), and there are
O((nr sstables) * (nr shards)) operations to perform.
Fix by replacing the copy-modify-assign method with an in-place update,
but one that is designed to only commit changes after all allocations
have been made; in addition, instead of using a list, use a hash table,
removing another source of quadratic behavior.
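One way to get "allocate first, commit afterwards" with a standard
hash table is sketched below. It is not the code from this patch:
add_pending() is a hypothetical helper, and the technique shown
(staging nodes in a temporary set, reserving buckets, then merging
with C++17 unordered_set::merge) is swapped in to illustrate the idea.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Adds `names` to `pending` without copying the existing contents of
// `pending`. If an allocation fails, the elements of `pending` are unchanged.
inline void add_pending(std::unordered_set<std::string>& pending,
                        const std::vector<std::string>& names) {
    // Phase 1: everything that can allocate (and therefore throw).
    std::unordered_set<std::string> staged(names.begin(), names.end());
    pending.reserve(pending.size() + staged.size());
    // Phase 2: commit. merge() relinks the already-allocated nodes, and the
    // buckets were reserved above, so no allocation is expected here.
    pending.merge(staged);
}
```

The cost of each update is proportional to the number of names added,
not to the number of sstables already pending.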
Fixes #1812 (the quadratic behavior part).
By default, the io checker will cause Scylla to shut down if it finds
specific system errors. Right now, the io checker isn't flexible
enough to allow a specialized handler. For example, we don't want
Scylla to shut down if there's a permission problem when
uploading new files from the upload dir. This flexibility is
made possible here by adding a handler parameter to the io check
functions and changing existing code to take advantage of it.
That's a step towards fixing #1709.
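The shape of such an interface might look like the following sketch.
All names here (io_error_handler, io_check(), the two handlers and the
shutdown_requested flag) are hypothetical and synchronous; they model
the idea of passing the error policy as a parameter and should not be
read as the actual signatures.

```cpp
#include <exception>
#include <functional>
#include <system_error>
#include <utility>

// Handler invoked with the exception that an I/O operation threw.
using io_error_handler = std::function<void(std::exception_ptr)>;

// Set by the handlers; stands in for "shut the node down".
inline bool shutdown_requested = false;

// Default policy in this sketch: any system error is treated as fatal.
inline void default_io_error_handler(std::exception_ptr ep) {
    try {
        std::rethrow_exception(ep);
    } catch (const std::system_error&) {
        shutdown_requested = true;
    } catch (...) {
        // Not a system error; leave it to the caller.
    }
}

// Specialized policy for the upload dir: permission problems are not fatal.
inline void upload_dir_error_handler(std::exception_ptr ep) {
    try {
        std::rethrow_exception(ep);
    } catch (const std::system_error& e) {
        if (e.code() != std::errc::permission_denied) {
            shutdown_requested = true;
        }
    } catch (...) {
        // Not a system error; leave it to the caller.
    }
}

// Runs func; on failure lets the handler inspect the error, then rethrows.
template <typename Func>
auto io_check(const io_error_handler& handler, Func&& func) {
    try {
        return std::forward<Func>(func)();
    } catch (...) {
        handler(std::current_exception());
        throw;
    }
}
```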
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Single partition and partition range reads are expected to behave
considerably differently, so it is worth having them use separate file
stream histories. This also makes reads use a different history for each
sstable, which is also a good thing.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
This patch allows sstable readers to be fast forwarded without having to
recreate the reader (and drop all buffers in the process). It is built
on top of index_reader and the ability of data_consume_context to be
fast forwarded.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
index_reader is a helper that implements index lookups. Its goal is to
avoid dropping read buffers if they still may be needed (for example to
get end bound of the range or after fast forwarding the reader).
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
This patch implements the get_key_samples() function, on which a
future patch will base an implementation of the describe_splits()
thrift verb closer to Cassandra's.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds the estimated_keys_for_range() function, which
estimates the number of keys present in the specified range.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
The latest virtual dirty patches broke the SSTable tests. The reason for
this is that those tests flush synthetic memtables that do not have
a region_group attached to them.
Normally in cases like this we would just give the flush_reader an empty
region group. However, the memtable class constructor takes a
region_group pointer and that can be null according to the interface.
So we must conditionally test it.
If there isn't a region_group involved, the virtual dirty accounting
should be disabled: after all, we won't even have the baseline memory
to begin with.
One of the approaches to fix this could be to just provide null
accounter classes to be used as a surrogate for the accounting classes
in this case. However, since this is mostly used for tests, a much
simpler way is to just revert back to the scanning reader in that case.
The scanning reader is similar enough to the flush_reader, except that
it can handle partial ranges, slices, and delegate accesses to an
sstable post-flush. We don't need any of that, but as argued above,
there is no need to remove it either.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
Message-Id: <1475667271-60806-1-git-send-email-glommer@scylladb.com>
Right now the special reader doesn't do much, but the idea is that we will
soon replace it with a reader that specializes in flush, and is in turn able
to provide read-side on-flush functionality like virtual dirty.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
stdx::optional<T> uses quite elaborate std::enable_if_t magic to decide
whether the argument passed to its constructor should be forwarded to
the T constructor or to a stdx::optional<T> constructor.
Apparently, with GCC 6.2, a T constructor which accepts any type
confuses that magic and we end up with compile errors.
The solution is to have a from_range() method that replaces the
constructor from a range. There is also a constructor that creates a key
from std::vector<bytes> so that code generated by IDL works as it did
before.
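The resulting shape of the class can be sketched as below. The key
class and the bytes alias are simplified stand-ins (the real key types
are more involved), and std::optional is used here in place of
stdx::optional. What matters is that the catch-all constructor
template is gone, so optional<key> no longer has to guess what a
generic argument was meant for.

```cpp
#include <iterator>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-in for the real bytes type.
using bytes = std::string;

class key {
    std::vector<bytes> _components;
public:
    // Non-template constructor kept for callers that already have a vector
    // (for example, generated code).
    explicit key(std::vector<bytes> components)
        : _components(std::move(components)) {}

    // Named factory replacing a constructor template that accepted any range.
    template <typename Range>
    static key from_range(const Range& r) {
        return key(std::vector<bytes>(std::begin(r), std::end(r)));
    }

    const std::vector<bytes>& components() const { return _components; }
};
```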
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1474550971-15309-1-git-send-email-pdziepak@scylladb.com>
Leveled strategy heavily uses the first and last decorated keys of an
sstable to get overlapping sstables in a given level. By storing the
first and last decorated keys in the sstable object, it's expected
that performance of leveled strategy (not compaction) will be
improved.
We will set first and last keys in sstable when either loading
or sealing it.
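A minimal sketch of why cached bounds help: the overlap check becomes
two comparisons on fields already in memory. decorated_key is
simplified here to a (token, key) pair, and sstable_bounds, overlaps()
and overlapping() are hypothetical names, not the strategy's actual
code.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-in for a decorated key: (token, partition key),
// ordered by token first.
using decorated_key = std::pair<std::int64_t, std::string>;

// Hypothetical per-sstable record holding the cached bounds.
struct sstable_bounds {
    decorated_key first;
    decorated_key last;
};

// Two closed key intervals overlap unless one ends before the other starts.
inline bool overlaps(const sstable_bounds& a, const sstable_bounds& b) {
    return !(a.last < b.first) && !(b.last < a.first);
}

// Returns the sstables of `level` whose key range overlaps `candidate`.
inline std::vector<const sstable_bounds*>
overlapping(const sstable_bounds& candidate, const std::vector<sstable_bounds>& level) {
    std::vector<const sstable_bounds*> result;
    for (const auto& sst : level) {
        if (overlaps(candidate, sst)) {
            result.push_back(&sst);
        }
    }
    return result;
}
```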
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <0abca819454ab4c088541bb49714f1f6a7dc4f42.1473959677.git.raphaelsc@scylladb.com>