From Pawel:
This series enables row cache to serve range queries. In order to achieve
that row cache needs to know whether there are some other partitions in
the specified range that are not cached and need to be read from the sstables.
That information is provied by key_readers, which work very similarly to
mutation_readers, but return only the decorated key of partitions in
range. In case of sstables key_readers is implemented to use partition
index.
Approach like this has the disadvantage of needing to access the disk
even if all partitions in the range are cached. There are (at least) two
solutions ways of dealing with that problem:
- cache partition index - that will also help in all other places where it
is neededed
- add a flag to cache_entry which, when set, indicates that the immediate
successor of the partition is also in the cache. Such flag would be set
by mutation reader and cleared during eviction. It will also allow
newly created mutations from memtable to be moved to cache provided that
both their successors and predecessors are already there.
The key_reader part of this patchsets adds a lot of new code that probably
won't be used in any other place, but the alternative would be to always
interleave reads from cache with reads from sstables and that would be
more heavy on partition index, which isn't cached.
Fixes#185.
"Fixes: #469
We occasionally generate memtables that are not empty, yet have no
high replay_position set. (Typical case is CL replay, but apparently
there are others).
Moreover, we can do this repeatedly, and thus get caught in the flush
queue ordering restrictions.
Solve this by treating a flush without replay_position as a flush at the
highest running position, i.e. "last" in queue. Note that this will not
affect the actual flush operation, nor CL callbacks, only anyone waiting
for the operation(s) to complete.
To do this, the flush_queue had its restrictions eased, and some introspection
methods added."
We go to the filesystem to check if the snapshot exists. This should make us
robust against deletions of existing snapshots from the filesystem.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
This allows for us to delete an existing snapshot. It works at the column
family level, and removing it from the list of keyspace snapshots needs to
happen only when all CFs are processed. Therefore, that is provided as a
separate operation.
The filesystem code is a bit ugly: it can be made better by making our file
lister more generic. First step would be to call it walker, not lister...
For now, we'll use the fact that there are mostly two levels in the snapshot
hierarchy to our advantage, and avoid a full recursion - using the same lambda
for all calls would require us to provide a separate class to handle the state,
that's part of making this generic.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
There are situations in which we would like to match more than one directory
type. One example of that, would be a recursive delete operation: we need to
delete the files inside directories and the directories themselves, but we
still don't want a "delete all" since finding anything other than a directory
or a file is an error, and we should treat it as such.
Since there aren't that many times, it should be ok performance wise to just
use a list. I am using an unordered_set here just because it is easy enough,
but we could actually relax it later if needed. In any case, users of the
interface should not worry about that, and that decision is abstracted away
into lister::dir_entry_types.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
"Those are fixes needed for the snapshotting process itself. I have bundled this
in the create_snapshot series before to avoid a rebase, but since I will have to
rewrite that to get rid of the snapshot manager (and go to the filesystem),
I am sending those out on their own."
We occasionally generate memtables that are not empty, yet have no
high replay_position set. (Typical case is CL replay, but apparently
there are others).
Moreover, we can do this repeatedly, and thus get caught in the flush
queue ordering restrictions.
Solve this by treating a flush without replay_position as a flush at the
highest running position, i.e. "last" in queue. Note that this will not
affect the actual flush operation, nor CL callbacks, only anyone waiting
for the operation(s) to complete.
"This patchset introduces leveled compaction to Scylla.
We don't handle all corner cases yet, but we already have the strategy
and compaction working as expected. Test cases were written and I also
tested the stability with a load of cassandra-stress.
Leveled compaction may output more than one sstable because there is
a limit on the size of sstables. 160M by default.
Related to handling of partial compaction, it's still something to be
worked on.
Anyway, it will not be a big problem. Why? Suppose that a leveled
compaction will generate 2 sstables, and scylla is interrupted after
the first sstable is completely written but before the second one is
completely written. The next boot will delete the second sstable,
because it was partially written, but will not do anything with the
first one as it was completely written.
As a result, we will have two sstables with redundant data."
With the distribute-and-sync method we are using, if an exception happens in
the snapshot creation for any reason (think file permissions, etc), that will
just hang the server since our shard won't do the necessary work to
synchronize and note that we done our part (or tried to) in snapshot creation.
Make the then clause a finally, so that the sync part is always executed.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
create_links will fail in one of the shards if one of the SSTables happen to be
shared. It should be fine if the link already exists, so let's just ignore that case.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
Current code calls make_directory, which will fail if the directory already exists.
We didn't use this code path much before, but once we start creating CF directories
on CF creation - and not on SSTable creation, that will become our default method.
Use touch_directory instead
Signed-off-by: Glauber Costa <glommer@scylladb.com>
Adapt our compaction code to start writing a new sstable if the
one being written reached its maximum size. Leveled strategy works
with that concept. If a strategy other than leveled is being used,
everything will work as before.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Currently, we are calculating truncated_at during truncate() independently for
each shard. It will work if we're lucky, but it is fairly easy to trigger cases
in which each shard will end up with a slightly different time.
The main problem here, is that this time is used as the snapshot name when auto
snapshots are enabled. Previous to my last fixes, this would just generate two
separate directories in this case, which is wrong but not severe.
But after the fix, this means that both shards will wait for one another to
synchronize and this will hang the database.
Fix this by making sure that the truncation time is calculated before
invoke_on_all in all needed places.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
We are generating a general object ({}), whereas Cassandra 2.1.x generates an
array ([]). Let's do that as well to avoid surprising parsers.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
We still need to write a manifest when there are no files in the snapshot.
But because we have never reached the touch_directory part in the sstables
loop for that case, nobody would have created jsondir in that case.
Since now all the file handling is done in the seal_snapshot phase, we should
just make sure the directory exists before initiating any other disk activity.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
We currently have one optimization that returns early when there are no tables
to be snapshotted.
However, because of the way we are writing the manifest now, this will cause
the shard that happens to have tables to be waiting forever. So we should get
rid of it. All shards need to pass through the synchronization point.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
If we are hashing more than one CF, the snapshot themselves will all have the same name.
This will cause the files from one of them to spill towards the other when writing the manifest.
The proper hash is the jsondir: that one is unique per manifest file.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
This patch adds the read and write latency estimated histogram support
and add an estimatd histogram to the number of sstable that were used in
a read.
Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
Currently, the snapshot code has all shards writing the manifest file. This is
wrong, because all previous writes to the last will be overwritten. This patch
fixes it, by synchronizing all writes and leaving just one of the shards with the
task of closing the manifest.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
The way manifest creation is currently done is wrong: instead of a final
manifest containing all files from all shards, the current code writes a
manifest containing just the files from the shard that happens to be the
unlucky loser of the writing race.
In preparation to fix that, separate the manifest creation code from the rest.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
We do need to sync jsondir after we write the manifest file (previously done,
but with a question), and before we start it (not previously done) to guarantee
that the manifest file won't reference any file that is not visible yet.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
Remove the about to be dropped CF from the UUID lookup table before
truncating and stopping it. This closes a race window where new
operations based on the UUID might be initiated after truncate
completes.
Signed-off-by: Pekka Enberg <penberg@scylladb.com>
From Pekka:
This patch series implements support for CQL DROP TABLE. It uses the newly
added truncate infrastructure under the hood. After this series, the
test_table CQL test in dtest passes:
[penberg@nero urchin-dtest]$ nosetests -v cql_tests.py:TestCQL.table_test
table_test (cql_tests.TestCQL) ... ok
----------------------------------------------------------------------
Ran 1 test in 23.841s
OK
For drop_column_family(), we want to first remove the column_family from
lookup tables and truncate after that to avoid races. Introduce a
truncate() variant that takes keyspace and column_family references.
Signed-off-by: Pekka Enberg <penberg@scylladb.com>
Currently, we control incremental backups behavior from the storage service.
This creates some very concrete problems, since the storage service is not
always available and initialized.
The solution is to move it to the column family (and to the keyspace so we can
properly propagate the conf file value). When we change this from the api, we will
have to iterate over all of them, changing the value accordingly.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
This patch contains the following changes, in the definition of the read
and write latency histogram it removes the mask value, so the the
default value will be used.
To support the gothering of the read latency histogram the query method
cannot be const as it modifies the histogram statistics.
The read statistic is sample based and it should have no real impact on
performance, if there will be an impact, we can always change it in the
future to a lower sampling rate.
Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
Only tables that arise from flushes are backed up. Compacted tables are not.
Therefore, the place for that to happen is right after our flush.
Note that due to our sharded architecture, it is possible that in the face of a
value change some shards will backup sstables while others won't.
This is, in theory, possible to mitigate through a rwlock. However, this
doesn't differ from the situation where all tables are coming from a single
shard and the toggle happens in the middle of them.
The code as is guarantees that we'll never partially backup a single sstable,
so that is enough of a guarantee.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
When we convert exceptions into CQL server errors, type information is
not preserved. Therefore, improve exception error messages to make
debugging dtest failures, for example, slightly easier.
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
When we write an SSTable, all its components are already in memory. load() is
to big of a hammer.
We still want to keep the write operation separated from the preparation to
read, but in the case of a newly written SSTable, all we need to do is to open
the index and data file.
Fixes#300
Signed-off-by: Glauber Costa <glommer@scylladb.com>
WARN level is for messages which should draw log reader's attention,
journalctl highlights them for example. Populating of keyspace is a
fairly normal thing, so it should be logged on lower level.
Race condition happens when two or more shards will try to delete
the same partial sstable. So the problem doesn't affect scylla
when it boots with a single shard.
To fix this problem, shard 0 will be made the responsible for
deleting a partial sstable.
fixes#359.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
If an sstable is irrelevant for a shard, delete it. The deletion will
only complete when all shards agree (either ignore the sstable or
delete it after compaction).
In event of a compaction failure, run_compaction would be called
more than one time for a request, which could result in an
underflow in the stats pending_compactions.
Let's fix that by only decreasing it if compaction succeeded.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
When populating a column family, we will now delete all components
of a sstable with a temporary toc file. A sstable with a temporary
TOC file means that it was partially written, and can be safely
deleted because the respective data is either saved in the commit
log, or in the compacted sstables in case of the partial sstable
being result of a compaction.
Deletion procedure is guarded against power failure by only deleting
the temporary TOC file after all other components were deleted.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>