Commit Graph

438 Commits

Author SHA1 Message Date
Vlad Zolotarov
756de38a9d database: actually check that a snapshot directory exists
Actually check that a snapshot directory with a given tag
exists instead of just checking that a 'snapshot' directory
exists.

Fixes issue #689

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2015-12-29 12:59:00 +01:00
Avi Kivity
41bd266ddd db: provide more information on "Unrecognized error" while loading sstables
This information can be used to understand the root cause of the failure.

Refs #692.
2015-12-29 10:23:32 +02:00
Pekka Enberg
eeadf601e6 Merge "cleanups and improvements" from Raphael 2015-12-18 13:45:11 +02:00
Pekka Enberg
e56bf8933f Improve not implemented errors
Print out the function name where we're throwing the exception from to
make it easier to debug such exceptions.
2015-12-18 10:51:37 +01:00
Raphael S. Carvalho
41be378ff1 db: fix build of sstable list in column_family::compact_sstables
The last two loops were incorrectly inside the first one. That's a
bug because a new sstable may be emplaced more than once in the
sstable list, which can cause several problems. mark_for_deletion
may also be called more than once for compacted sstables, however,
it is idempotent.
Found this issue while auditing the code.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2015-12-16 17:46:17 +02:00
Raphael S. Carvalho
6142efaedb db: fix indentation
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2015-12-14 12:43:34 -02:00
Raphael S. Carvalho
7bbc1b49b6 db: add missing sstable::mark_for_deletion call
If a sstable doesn't belong to current shard, mark_for_deletion
should be called for the deletion manager to still work.
It doesn't mean that the sstable will be deleted, but that the
sstable is not relevant to the current shard, thus it can be
deleted by the deletion manager in the future.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2015-12-14 12:42:26 -02:00
Amnon Heiman
2086c651ba column_family: get_snapshot_details should return empty map for no snapshots
If there is no snapshot directory for the specific column family,
get_snapshot_details should return an empty map.

This patch check that a directory exists before trying to iterate over
it.

Fixes #619

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2015-12-07 12:51:04 +01:00
Tomasz Grabiec
bc23ebcbc3 schema_tables: Replace schema_result::value_type with equivalent movable type
future<> requires and will assert nothrow move constructible types.
2015-12-07 09:50:27 +01:00
Amnon Heiman
7e79d35f85 Estimated histogram: Clean the add interface
The add interface of the estimated histogram is confusing as it is not
clear what units are used.

This patch removes the general add method and replace it with a add_nano
that adds nanoseconds or add that gets duration.

To be compatible with origin, nanoseconds vales are translated to
microseconds.
2015-12-01 15:28:06 +02:00
Asias He
aa2b11f21b database: Move is_replacing and get_replace_address to database class
So they can be used outside storage_service.
2015-11-30 09:15:42 +08:00
Tomasz Grabiec
a7c11d1e30 db: Fix handling of missing column family
The FIXMEs are no longer valid, we load schema on bootstrap and don't
support hot-plugging of column families via file system (nor does
Cassandra).

Handling of missing tables matches Cassandra 2.1, applies log
it and continue, queries propagate the error.
2015-11-25 16:59:15 +02:00
Raphael S. Carvalho
0f3ccc1143 db: optimize the sstable loading process
Currently, we only determine if a sstable belongs to current shard
after loading some of its components into memory. For example,
filter may be considerably big and its content is irrelevant to
decide if a sstable should be included to a given shard.
Start using the functions previously introduced to optimize the
sstable loading process. add_sstable no longer checks if a sstable
is relevant to the current shard.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2015-11-19 13:34:25 -02:00
Raphael S. Carvalho
0ce2b7bc8d db: introduce belongs_to_current_shard
Returns true if key range belongs to current shard.
False otherwise.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2015-11-19 13:34:21 -02:00
Raphael S. Carvalho
966e8c7144 db: introduce parallelism to sstable loading
Boot may be slow because the function that loads sstables do so
serially instead of in parallel. In the callback supplied to
lister::scan_dir, let's push the future returned by probe_file
(function that loads sstable) into a vector of future and wait
for all of them at the end.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2015-11-19 13:34:11 -02:00
Glauber Costa
fa1ae45218 database: export collectd metrics about the state of memtable flushing
When analyzing a recent performance issue, I found helpful to keep track of
the amount of memtables that are currently in flight, as well as how much memory
they are consuming in the system.

Although those are memtable statistics, I am grouping them under the "cf_stats"
structure: being the column family a central piece of the puzzle, it is reasonable
to assume that a lot of metrics about it would be potentially welcome in the future.

Note that we don't want to reuse the "stats" structure in the column family: for once,
the fields not always map precisely (pending flushes, for instance, only tracks explicit
flushes), and also the stats structure is a lot more complex than we need.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-11-12 20:17:22 +02:00
Calle Wilund
284b10cabe Make partition_slice::row_ranges mulitplex on partition
Allows for having more than one clustering row range set, depending on
PK queried (although right now limited to one - which happens to be exactly
the number of mutiplexing paging needs... What a coincidence...)

Encapsulates the row_ranges member in a query function, and if needed holds
ranges outside the default one in an extra object.

Query result::builder::add_partition now fetches the correct row range for
the partition, and this is the range used in subsequent iteration.
2015-11-10 13:12:33 +01:00
Gleb Natapov
d77a2a0f03 do not try to write same memtable to sstable twice if moving it to a cache failed.
Error handling in column_family::try_flush_memtable_to_sstable() is
misplaced. It happens after update_cache(), so writing sstable may
have succeeded, but moving memtable into the cache may have failed.
update_cache() destroys memtable even if it fails, but error handler
is not aware of it (it does not even distinguish whether error happened
during sstable creation or moving into cache) and when it tells caller
to retry it retries with already destroyed memtable. Fix it by ignoring
moving to cache errors.
2015-11-09 11:27:37 +01:00
Avi Kivity
cb93af2ad7 Revert "do not try to write same memtable to sstable twice if moving it to a cache failed."
This reverts commit fff37d15cd.

Says Tomek (and the comment in the code):

"update_cache() must be called before unlinking the memtable because cache + memtable at any time is supposed to be authoritative source of data for contained partitions. If there is a cache hit in cache, sstables won't be checked. If we unlink the memtable before cache is updated, it's possible that a query will miss data which was in that unlinked memtable, if it hits in the cache (with an old value)."
2015-11-09 11:22:12 +02:00
Gleb Natapov
fff37d15cd do not try to write same memtable to sstable twice if moving it to a cache failed.
Error handling in column_family::try_flush_memtable_to_sstable() is
misplaced. It happens after update_cache(), so writing sstable may
have succeeded, but moving memtable into the cache may have failed.
update_cache() destroys memtable even if it fails, but error handler
is not aware of it (it does not even distinguish whether error happened
during sstable creation or moving into cache) and when it tells caller
to retry it retries with already destroyed memtable. Fix it by ignoring
moving to cache errors.
2015-11-09 09:56:45 +02:00
Asias He
20ecb0bede database: Introduce get_initial_tokens
Get initial tokens specified by the initial_token in scylla.conf.

E.g.,

   --initial-token "-1112521204969569328,1117992399013959838"
   --initial-token "1117992399013959838"

It can be multiple tokens split by comma.
2015-11-04 10:40:12 +08:00
Calle Wilund
ceb9f4d647 database: Just do commitlog::shutdown on shutdown. It will do flushes. 2015-10-26 14:56:24 +01:00
Avi Kivity
f7087da054 Merge "GET methods for snapshots" from Glauber
"The snapshots API need to expose GET methods so people can
query information on them. Now that taking snapshots is supported,
this relatively simple series implement get_snapshot_details, a
column family method, and wire that up through the storage_service."
2015-10-22 15:23:45 +03:00
Avi Kivity
5f3a46eabb Merge "load_new_sstables" from Glauber
"This patchset implements load_new_sstables, allowing one to move tables inside the
data directory of a CF, and then call "nodetool refresh" to start using them.

Keep in mind that for Cassandra, this is deemed an unsafe operation:
https://issues.apache.org/jira/browse/CASSANDRA-6245

It is still for us something we should not recommend - unless the CF is totally
empty and not yet used, but we can do a much better job in the safety front.

To guarantee that, the process works in four steps:

1) All writes to this specific column family are disabled. This is a horrible thing to
   do, because dirty memory can grow much more than desired during this. Throughout out
   this implementation, we will try to keep the time during which the writes are disabled
   to its bare minimum.

   While disabling the writes, each shard will tell us about the highest generation number
   it has seen.

2) We will scan all tables that we haven't seen before. Those are any tables found in the
   CF datadir, that are higher than the highest generation number seen so far. We will link
   them to new generation numbers that are sequential to the ones we have so far, and end up
   with a new generation number that is returned to the next step

3) The generation number computed in the previous step is now propagated to all CFs, which
   guarantees that all further writes will pick generation numbers that won't conflict with
   the existing tables. Right after doing that, the writes are resumed.

4) The tables we found in step 2 are passed on to each of the CFs. They can now load those
   tables while operations to the CF proceed normally."
2015-10-22 13:42:24 +03:00
Glauber Costa
36cea4313e column family: load new sstables
CF-level code to load new SSTables. There isn't really a lot of complication
here. We don't even need to repopulate the entire SSTable directory: by
requiring that the external service who is coordinating this tell us explicitly
about the new SSTables found in the scan process, we can just load them
specifically and add them to the SSTable map.

All new tables will start their lifes as shared tables, and will be unshared
if it is possible to do so: this all happens inside add_sstable and there isn't
really anything special in this front.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-21 18:06:22 +02:00
Glauber Costa
61be9fb02d reshuffle tables: mechanism to adjust new sstables' generation number
Before loading new SSTables into the node, we need to make sure that their
generation numbers are sequential (at least if we want to follow Cassandra's
footsteps here).

Note that this is unsafe by design. More information can be found at:
https://issues.apache.org/jira/browse/CASSANDRA-6245

However, we can already to slightly better in two ways:

Unlike Cassandra, this method takes as a parameter a generation number. We
will not touch tables that are before that number at all. That number must be
calculated from all shards as the highest generation number they have seen themselves.
Calling load_new_sstables in the absence of new tables will therefore do nothing,
and will be completely safe.

It will also return the highest generation number found after the reshuffling
process.  New writers should start writing after that. Therefore, new tables
that are created will have a generation number that is higher than any of this,
and will therefore be safe.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-21 18:06:22 +02:00
Glauber Costa
1351c1cc13 database: mechanism to stop writing sstables
During certain operations we need to stop writing SSTables. This is needed when
we want to load new SSTables into the system. They will have to be scanned by all
shards, agreed upon, and in most cases even renamed. Letting SSTables be written
at that point makes it inherently racy - specially with the rename.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-21 18:06:22 +02:00
Glauber Costa
29e2ad7fd8 column family: commonize code to calculate the desired SSTable generation
We will reuse this for load_new_sstables.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-21 18:02:43 +02:00
Tomasz Grabiec
764d913d84 Merge branch 'pdziepak/row-cache-range-query/v4' from seastar-dev.git
From Pawel:

This series enables row cache to serve range queries. In order to achieve
that row cache needs to know whether there are some other partitions in
the specified range that are not cached and need to be read from the sstables.
That information is provied by key_readers, which work very similarly to
mutation_readers, but return only the decorated key of partitions in
range. In case of sstables key_readers is implemented to use partition
index.

Approach like this has the disadvantage of needing to access the disk
even if all partitions in the range are cached. There are (at least) two
solutions ways of dealing with that problem:
 - cache partition index - that will also help in all other places where it
   is neededed
 - add a flag to cache_entry which, when set, indicates that the immediate
   successor of the partition is also in the cache. Such flag would be set
   by mutation reader and cleared during eviction. It will also allow
   newly created mutations from memtable to be moved to cache provided that
   both their successors and predecessors are already there.

The key_reader part of this patchsets adds a lot of new code that probably
won't be used in any other place, but the alternative would be to always
interleave reads from cache with reads from sstables and that would be
more heavy on partition index, which isn't cached.

Fixes #185.
2015-10-21 15:26:45 +02:00
Glauber Costa
77513a40db database: get_snapshot_details
For each of the snapshots available, the api may query for some information:
the total size on disk, and the "real" size. As far as I could understand, the
real size is the size that is used by the SSTables themselves, while the total
size includes also the metadata about the snapshot - like the manifest.json
file.

Details follow:

In the original Cassandra code, total size is:

    long sizeOnDisk = FileUtils.folderSize(snapshot);

folderSize recurses on directories, and adds file.length() on files. Again, my
understanding is that file_size() would give us the same as the length() method
for Java.

The other value, real (or true) size is:

    long trueSize = getTrueAllocatedSizeIn(snapshot);

getTrueAllocatedSizeIn seems to be a tree walker, whose visitor is an instance
of TrueFilesSizeVisitor. What that visitor does, is add up the size of the files
within the tree who are "acceptable".

An acceptable file is a file which:

starts with the same prefix as we want (IOW, belongs to the same SSTable, we
will just test that directly), and is not "alive". The alive list is just the
list of all SSTables in the system that are used by the CFs.

What this tries to do, is to make sure that the trueSnapshotSize is just the
extra space on disk used by the snapshot. Since the snapshots are links, then
if a table goes away, it adds to this size. If it would be there anyway, it does
not.

We can do that in a lot simpler fashion: for each file, we will just look at
the original CF directory, and see if we can find the file there. If we can't,
then it counts towards the trueSize. Even for files that are deleted after
compaction, that "eventually" works, and that simplifies the code tremendously
given that we don't have to neither list all files in the system - as Cassandra
does - or go check other shards for liveness information - as we would have to
do.

The scheme I am proposing may need some tweaks when we support multiple data
directories, as the SSTables may not be directly below the snapshot level.
Still, it would be trivial to inform the CF about their possible locations.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-21 13:48:44 +02:00
Avi Kivity
5453cfbab7 Merge "snapshots: take + clear" from Glauber
"This is the code for taking a snapshot, and clearing a snapshot."
2015-10-21 08:59:42 +03:00
Paweł Dziepak
c1e95dd893 row_cache: pass underlying key_source to row_cache
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2015-10-20 20:27:53 +02:00
Paweł Dziepak
96a42a9c69 column_family: add sstables_as_key_source()
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2015-10-20 20:27:53 +02:00
Avi Kivity
cf734132e7 Merge "Flusing of CF:s without replay positions" from Calle
"Fixes: #469

We occasionally generate memtables that are not empty, yet have no
high replay_position set. (Typical case is CL replay, but apparently
there are others).

Moreover, we can do this repeatedly, and thus get caught in the flush
queue ordering restrictions.

Solve this by treating a flush without replay_position as a flush at the
highest running position, i.e. "last" in queue. Note that this will not
affect the actual flush operation, nor CL callbacks, only anyone waiting
for the operation(s) to complete.

To do this, the flush_queue had its restrictions eased, and some introspection
methods added."
2015-10-20 17:36:57 +03:00
Glauber Costa
d236b01b48 snapshots: check existence of snapshots
We go to the filesystem to check if the snapshot exists. This should make us
robust against deletions of existing snapshots from the filesystem.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-20 15:58:26 +02:00
Glauber Costa
d3aef2c1a5 database: support clear snapshot
This allows for us to delete an existing snapshot. It works at the column
family level, and removing it from the list of keyspace snapshots needs to
happen only when all CFs are processed. Therefore, that is provided as a
separate operation.

The filesystem code is a bit ugly: it can be made better by making our file
lister more generic. First step would be to call it walker, not lister...

For now, we'll use the fact that there are mostly two levels in the snapshot
hierarchy to our advantage, and avoid a full recursion - using the same lambda
for all calls would require us to provide a separate class to handle the state,
that's part of making this generic.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-20 15:38:14 +02:00
Glauber Costa
500ee99c93 file lister: allow for more than one directory type
There are situations in which we would like to match more than one directory
type.  One example of that, would be a recursive delete operation: we need to
delete the files inside directories and the directories themselves, but we
still don't want a "delete all" since finding anything other than a directory
or a file is an error, and we should treat it as such.

Since there aren't that many times, it should be ok performance wise to just
use a list. I am using an unordered_set here just because it is easy enough,
but we could actually relax it later if needed. In any case, users of the
interface should not worry about that, and that decision is abstracted away
into lister::dir_entry_types.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-20 15:38:14 +02:00
Avi Kivity
2575a4602e Merge "Fix for snapshots/create_links and shared SSTables" from Glauber
"Those are fixes needed for the snapshotting process itself. I have bundled this
in the create_snapshot series before to avoid a rebase, but since I will have to
rewrite that to get rid of the snapshot manager (and go to the filesystem),
I am sending those out on their own."
2015-10-20 13:49:17 +03:00
Calle Wilund
02732f19f2 database: Handle CF flush with no high replay_position
We occasionally generate memtables that are not empty, yet have no
high replay_position set. (Typical case is CL replay, but apparently
there are others).

Moreover, we can do this repeatedly, and thus get caught in the flush
queue ordering restrictions.

Solve this by treating a flush without replay_position as a flush at the
highest running position, i.e. "last" in queue. Note that this will not
affect the actual flush operation, nor CL callbacks, only anyone waiting
for the operation(s) to complete.
2015-10-20 08:24:04 +02:00
Avi Kivity
2ccb5feabd Merge "Support nodetool cfhistogram"
"This series adds the missing estimated histogram to the column family and to
the API so the nodetool cfhistogram would work."
2015-10-19 17:11:46 +03:00
Avi Kivity
f4706c7050 Merge "initial support to leveled compaction" from Raphael
"This patchset introduces leveled compaction to Scylla.
We don't handle all corner cases yet, but we already have the strategy
and compaction working as expected. Test cases were written and I also
tested the stability with a load of cassandra-stress.

Leveled compaction may output more than one sstable because there is
a limit on the size of sstables. 160M by default.
Related to handling of partial compaction, it's still something to be
worked on.

Anyway, it will not be a big problem. Why? Suppose that a leveled
compaction will generate 2 sstables, and scylla is interrupted after
the first sstable is completely written but before the second one is
completely written. The next boot will delete the second sstable,
because it was partially written, but will not do anything with the
first one as it was completely written.
As a result, we will have two sstables with redundant data."
2015-10-19 16:17:45 +03:00
Glauber Costa
218fdebbeb snapshot: do not allow exceptions in snapshot creation hang us
With the distribute-and-sync method we are using, if an exception happens in
the snapshot creation for any reason (think file permissions, etc), that will
just hang the server since our shard won't do the necessary work to
synchronize and note that we done our part (or tried to) in snapshot creation.

Make the then clause a finally, so that the sync part is always executed.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-19 13:37:02 +02:00
Glauber Costa
9083a0e5a7 snapshots: fix generation of snapshots with shared sstables
create_links will fail in one of the shards if one of the SSTables happen to be
shared. It should be fine if the link already exists, so let's just ignore that case.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-19 13:37:01 +02:00
Tomasz Grabiec
19d7d30e67 Replace references to 'urchin' with 'scylla' 2015-10-19 11:08:05 +03:00
Glauber Costa
df857eb8c6 database: touch directories for the column family
Current code calls make_directory, which will fail if the directory already exists.
We didn't use this code path much before, but once we start creating CF directories
on CF creation - and not on SSTable creation, that will become our default method.

Use touch_directory instead

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-17 13:08:07 +02:00
Raphael S. Carvalho
35b75e9b67 adapt compaction procedure to support leveled strategy
Adapt our compaction code to start writing a new sstable if the
one being written reached its maximum size. Leveled strategy works
with that concept. If a strategy other than leveled is being used,
everything will work as before.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-10-16 01:54:52 -03:00
Calle Wilund
012ab24469 column_family: Add flush queue object to act as ordering guarantee 2015-10-14 14:07:40 +02:00
Glauber Costa
b2fef14ada do not calculate truncation time independently
Currently, we are calculating truncated_at during truncate() independently for
each shard. It will work if we're lucky, but it is fairly easy to trigger cases
in which each shard will end up with a slightly different time.

The main problem here, is that this time is used as the snapshot name when auto
snapshots are enabled. Previous to my last fixes, this would just generate two
separate directories in this case, which is wrong but not severe.

But after the fix, this means that both shards will wait for one another to
synchronize and this will hang the database.

Fix this by making sure that the truncation time is calculated before
invoke_on_all in all needed places.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-09 17:17:11 +03:00
Glauber Costa
1549a43823 snapshots: fix json type
We are generating a general object ({}), whereas Cassandra 2.1.x generates an
array ([]). Let's do that as well to avoid surprising parsers.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-08 16:54:51 +02:00
Glauber Costa
cc343eb928 snapshots: handle jsondir creation for empty files case
We still need to write a manifest when there are no files in the snapshot.
But because we have never reached the touch_directory part in the sstables
loop for that case, nobody would have created jsondir in that case.

Since now all the file handling is done in the seal_snapshot phase, we should
just make sure the directory exists before initiating any other disk activity.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-08 16:54:51 +02:00