Commit Graph

53948 Commits

Author SHA1 Message Date
Paweł Dziepak
1c05d7b927 mutation_partition: fix row_marker::apply() for equal timestamps
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2015-10-22 12:08:53 +02:00
Paweł Dziepak
7fab0ee867 mutation_partition: add compare_row_marker_for_merge()
A compare_atomic_cell_for_merge() equivalent intended to be used
with row markers.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2015-10-22 12:08:53 +02:00
Asias He
04291dec28 storage_service: Enable call to excise 2015-10-22 17:41:02 +08:00
Asias He
ee551d070f storage_service: Enable add_expire_time_if_found in excise and handle_state_removing 2015-10-22 17:41:02 +08:00
Asias He
2137ab5522 storage_service: Implement add_expire_time_if_found 2015-10-22 17:41:02 +08:00
Asias He
e2fbd146d7 storage_service: Print keyspace info unbootstrap in debug 2015-10-22 17:41:02 +08:00
Asias He
affab296b0 storage_service: Fix ranges in stream_hints
Use range = make_open_ended_both_sides to represent the entire ring.
2015-10-22 17:41:02 +08:00
Asias He
0d2fb9c99d storage_service: Add extract_expire_time 2015-10-22 17:41:02 +08:00
Asias He
58225216b3 storage_service: Fix immediate return for get_changed_ranges_for_leaving
It is a leftover from when get_changed_ranges_for_leaving was stubbed.
2015-10-22 17:41:02 +08:00
Asias He
fb27d682ad storage_service: Fix use after free for stream_plan
sp is a stack variable, so it is gone when the function returns.
Fix it using a shared pointer.
2015-10-22 17:41:02 +08:00
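The lifetime fix described in fb27d682ad can be illustrated with a minimal sketch. The `plan` struct and `task_queue` alias below are hypothetical stand-ins (the real stream_plan and its asynchronous machinery are not shown in this log); only the ownership pattern is the point.

```cpp
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for stream_plan; only its lifetime matters here.
struct plan {
    std::string description;
};

// Deferred work, standing in for a continuation that runs after
// the scheduling function has already returned.
using task_queue = std::vector<std::function<std::string()>>;

// Buggy shape (not compiled): a plan declared on the stack and captured
// by reference would dangle as soon as schedule() returns.
//
// Fixed shape: the plan lives on the heap and the callback holds a
// shared_ptr copy, so the plan stays alive until the callback is destroyed.
void schedule(task_queue& q, std::string description) {
    auto sp = std::make_shared<plan>(plan{std::move(description)});
    q.push_back([sp] { return sp->description; });
}
```

Capturing the shared pointer by value is what extends the lifetime; capturing it by reference would reintroduce the same bug.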
Asias He
69b7028f84 storage_service: Fix token contains in handle_state_leaving
std::includes requires sorted containers, but get_tokens_for returns a
std::unordered_set. Fix by putting the tokens into a std::set.
2015-10-22 17:41:02 +08:00
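The std::includes requirement mentioned in 69b7028f84 can be shown in a few lines. This is a sketch only: `token` is a hypothetical alias and `contains_all` is an illustrative helper, not a function from the codebase.

```cpp
#include <algorithm>
#include <set>
#include <string>
#include <unordered_set>

// Hypothetical stand-in for the real token type.
using token = std::string;

// Returns true if every token in `subset` is also present in `all`.
// std::includes requires both input ranges to be sorted, so the
// unordered tokens are first copied into a std::set.
bool contains_all(const std::set<token>& all,
                  const std::unordered_set<token>& subset) {
    std::set<token> sorted(subset.begin(), subset.end());
    return std::includes(all.begin(), all.end(),
                         sorted.begin(), sorted.end());
}
```

Passing the unordered_set iterators straight to std::includes compiles, but the result is unspecified because the precondition is violated.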
Asias He
4785798904 storage_service: Kill unimplemented in decommission 2015-10-22 17:41:02 +08:00
Asias He
ce6dd0f8f8 storage_service: Implement start_leaving 2015-10-22 17:41:02 +08:00
Paweł Dziepak
513ab87b47 row_cache: update hit and miss stats in scanning reader
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2015-10-22 12:25:02 +03:00
Paweł Dziepak
b1b830bcbb row_cache: merge cache_entry::compare and ring_position_compare
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2015-10-22 12:25:02 +03:00
Tomasz Grabiec
f1306d3771 tests: Add tests for reversed mutation queries 2015-10-22 10:32:08 +02:00
Tomasz Grabiec
cc5cc7117d mutation_query: Respect 'reversed' partition_slice option
Fixes #480
2015-10-22 10:32:08 +02:00
Tomasz Grabiec
9dbd5a92d0 partition_slice_builder: Introduce reversed() 2015-10-22 10:32:08 +02:00
Amnon Heiman
c130381284 Adding live_scanned and tombstone scanned histogram to column family
This series adds a histogram to the column family for live scanned and
tombstone scanned.

It exposes those histograms via the API instead of the stub implementation
that currently exists.

The implementation update of the histogram will be added in a different
series.

Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
2015-10-22 11:13:28 +03:00
Amnon Heiman
378a97b66b API: Add row cache hits and misses per column family
Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
2015-10-22 11:12:14 +03:00
Glauber Costa
fd8e5c7e4c api: load new sstables
Just a wrapper into the storage_service's homonymous call.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-21 18:30:04 +02:00
Glauber Costa
673788ed46 storage_service load_new_sstables.
This is the storage_service implementation of load_new_sstables, and this is
where most of the complication lives.

Keep in mind that for Cassandra, this is deemed an unsafe operation:
https://issues.apache.org/jira/browse/CASSANDRA-6245

For us it is still something we should not recommend, unless the CF is
totally empty and not yet used, but we can do a much better job on the safety front.

To guarantee that, the process works in four steps:

1) All writes to this specific column family are disabled. This is a horrible thing to
   do, because dirty memory can grow much more than desired during this. Throughout
   this implementation, we will try to keep the time during which the writes are disabled
   to its bare minimum.

   While disabling the writes, each shard will tell us about the highest generation number
   it has seen.

2) We will scan all tables that we haven't seen before. Those are any tables found in the
   CF datadir, that are higher than the highest generation number seen so far. We will link
   them to new generation numbers that are sequential to the ones we have so far, and end up
   with a new generation number that is returned to the next step.

3) The generation number computed in the previous step is now propagated to all CFs, which
   guarantees that all further writes will pick generation numbers that won't conflict with
   the existing tables. Right after doing that, the writes are resumed.

4) The tables we found in step 2 are passed on to each of the CFs. They can now load those
   tables while operations to the CF proceed normally.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-21 18:06:22 +02:00
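The four steps listed in 673788ed46 can be walked through with a much-simplified in-memory model. Everything here is an assumption made for illustration: `shard_cf` is a hypothetical struct, SSTables are reduced to bare generation numbers, and the real implementation is asynchronous and works on files.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical model of one shard's view of a column family.
struct shard_cf {
    std::int64_t highest_generation = 0;  // highest generation seen so far
    bool writes_enabled = true;
    std::vector<std::int64_t> loaded;     // generations handed over in step 4
};

// `on_disk` holds the generations found in the CF data directory.
// Returns the generations assigned to the newly found tables.
std::vector<std::int64_t> load_new_sstables(std::vector<shard_cf>& shards,
                                            std::vector<std::int64_t> on_disk) {
    // Step 1: disable writes and collect the highest generation seen.
    std::int64_t highest = 0;
    for (auto& s : shards) {
        s.writes_enabled = false;
        highest = std::max(highest, s.highest_generation);
    }
    // Step 2: anything above `highest` is new; assign sequential generations.
    std::sort(on_disk.begin(), on_disk.end());
    std::vector<std::int64_t> fresh;
    std::int64_t next = highest;
    for (std::int64_t gen : on_disk) {
        if (gen > highest) {
            fresh.push_back(++next);
        }
    }
    // Step 3: propagate the new high-water mark, then resume writes.
    for (auto& s : shards) {
        s.highest_generation = next;
        s.writes_enabled = true;
    }
    // Step 4: hand the new tables to every shard while writes proceed.
    for (auto& s : shards) {
        s.loaded = fresh;
    }
    return fresh;
}
```

Writes are re-enabled in step 3, before the tables are loaded in step 4, which matches the stated goal of keeping the write-disabled window as short as possible.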
Glauber Costa
36cea4313e column family: load new sstables
CF-level code to load new SSTables. There isn't really a lot of complication
here. We don't even need to repopulate the entire SSTable directory: by
requiring that the external service who is coordinating this tell us explicitly
about the new SSTables found in the scan process, we can just load them
specifically and add them to the SSTable map.

All new tables will start their lives as shared tables, and will be unshared
if it is possible to do so: this all happens inside add_sstable and there isn't
really anything special in this front.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-21 18:06:22 +02:00
Glauber Costa
54aaa58899 sstable_tests: test reshuffle operation
Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-21 18:06:22 +02:00
Glauber Costa
a8db2b28c7 sstable tests: test set_generation
No code works until it's been tested.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-21 18:06:22 +02:00
Glauber Costa
c5950c7bf7 sstable_test: get rid of frees
They exist. They shouldn't.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-21 18:06:22 +02:00
Glauber Costa
f60021f87f sstable_tests: commonize code to compare two components.
The current code assumes a particular dir/generation pair. We
will use it for a more generic case. This code could really use some
clean up, by the way. We should do it later.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-21 18:06:22 +02:00
Glauber Costa
61be9fb02d reshuffle tables: mechanism to adjust new sstables' generation number
Before loading new SSTables into the node, we need to make sure that their
generation numbers are sequential (at least if we want to follow Cassandra's
footsteps here).

Note that this is unsafe by design. More information can be found at:
https://issues.apache.org/jira/browse/CASSANDRA-6245

However, we can already do slightly better in two ways:

Unlike Cassandra, this method takes as a parameter a generation number. We
will not touch tables that are before that number at all. That number must be
calculated from all shards as the highest generation number they have seen themselves.
Calling load_new_sstables in the absence of new tables will therefore do nothing,
and will be completely safe.

It will also return the highest generation number found after the reshuffling
process.  New writers should start writing after that. Therefore, new tables
that are created will have a generation number that is higher than any of this,
and will therefore be safe.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-21 18:06:22 +02:00
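The two properties claimed in 61be9fb02d (tables at or below the given generation are never touched, and the highest generation after reshuffling is returned) can be sketched as a pure function. The signature and the `reshuffle` name used here are illustrative assumptions; the real mechanism renames files rather than returning a map.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Maps each generation above `max_seen` to the next sequential number.
// Returns the old-to-new mapping and the highest generation now in use.
std::pair<std::map<std::int64_t, std::int64_t>, std::int64_t>
reshuffle(std::vector<std::int64_t> found, std::int64_t max_seen) {
    std::sort(found.begin(), found.end());
    found.erase(std::unique(found.begin(), found.end()), found.end());

    std::map<std::int64_t, std::int64_t> renamed;
    std::int64_t next = max_seen;
    for (std::int64_t gen : found) {
        if (gen <= max_seen) {
            continue;  // already known to some shard: never touched
        }
        renamed[gen] = ++next;
    }
    return {renamed, next};
}
```

With no tables above `max_seen`, the mapping is empty and the returned generation equals `max_seen`, which corresponds to the "will therefore do nothing" case in the commit message.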
Glauber Costa
1351c1cc13 database: mechanism to stop writing sstables
During certain operations we need to stop writing SSTables. This is needed when
we want to load new SSTables into the system. They will have to be scanned by all
shards, agreed upon, and in most cases even renamed. Letting SSTables be written
at that point makes it inherently racy - especially with the rename.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-21 18:06:22 +02:00
Glauber Costa
29e2ad7fd8 column family: commonize code to calculate the desired SSTable generation
We will reuse this for load_new_sstables.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-21 18:02:43 +02:00
Glauber Costa
3f6d47f1f2 sstables: change the current level of an sstable
This will be used, for instance, when importing an SSTable.
We would like to force all new SSTables to sit at level 0 for
compaction purposes.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-21 18:02:42 +02:00
Glauber Costa
e11b828b6f sstables: allow an sstable to set its generation number
In some situations (restoring a backup from load_new_sstables), we want to
change the SSTable generation number. This patch provides a procedure to
achieve that.

It does so by linking the old files to new ones, and then removing the old
ones.

The reason we link instead of removing, is that we want to make sure that in
case there is a crash in the middle, the old data is still accessible.

If the crash happens after the link is done but before we start removing the
old files, that is fine: we will end up with duplicated data that will
disappear after the next compaction.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-21 18:02:42 +02:00
Glauber Costa
a4d8e99f1c sstables: fix create_links so a TemporaryTOC is generated
That is the way to generate groups of files for the SSTables, so we must do it.
Because the links were mostly used by processes like snapshots and backups
where an external tool would (hopefully) verify the results, it was not that
serious.

But we now plan to use links to bring things into the main directory. It must
absolutely be done right.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-21 18:02:42 +02:00
Glauber Costa
876e770df6 sstables: allow create_links to work with an arbitrary generation
During some situations (restoring a snapshot for instance) we may want a file
to get a different generation. This patch changes the code in create_links
slightly, so that it is able to link not only to a different location, but to
files with a different name, possibly in the same location - that is equivalent
to a generation change.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-21 18:02:42 +02:00
Glauber Costa
27efe8bde9 sstables: make read_toc public
This is done on behalf of load_new_sstables: we would like to know which
components are present in the file, but without triggering the read for the
rest of the metadata.

As noted by Avi, using this directly can leave the SSTable in an inconsistent
state. We will have to fix it later since this is not the first offender.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-21 18:02:42 +02:00
Glauber Costa
fcebf6f72d sstable tests: don't use set_generation method
There is no reason aside from testing for a table to just change its generation
number.

There will be, however, when we support loading new sstables. The method
however needs to be completely rewritten, so let's make sure the tests are not
using that.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-21 18:02:42 +02:00
Glauber Costa
f3bad2032d database: fix type for sstable generation.
Avoid using long for it, and let's use a fixed size instead.  Let's do signed
instead of unsigned to avoid upsetting any code that we may have converted.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-21 18:01:20 +02:00
Calle Wilund
412c2a1e5b storage_proxy: mutate_atomically fix for consistency and BL removal
The change to use consistency_level::ONE in send_batchlog_mutation
sort of fixes #478, but is not 100% correct.
When doing async_remove_from_batchlog, the CL is actually supposed to
be ANY.

Also, we should _not_ remove the batch log mutation from any nodes
if the mutate fails, since having it there in case of failure is sort of
the whole point of it. I.e. async_remove_from_batchlog should not be
called from a "finally", but from a "then".

Refs #478
2015-10-21 16:48:27 +02:00
Tomasz Grabiec
764d913d84 Merge branch 'pdziepak/row-cache-range-query/v4' from seastar-dev.git
From Pawel:

This series enables row cache to serve range queries. In order to achieve
that row cache needs to know whether there are some other partitions in
the specified range that are not cached and need to be read from the sstables.
That information is provided by key_readers, which work very similarly to
mutation_readers, but return only the decorated key of partitions in
range. In case of sstables key_readers is implemented to use partition
index.

An approach like this has the disadvantage of needing to access the disk
even if all partitions in the range are cached. There are (at least) two
ways of dealing with that problem:
 - cache partition index - that will also help in all other places where it
   is needed
 - add a flag to cache_entry which, when set, indicates that the immediate
   successor of the partition is also in the cache. Such flag would be set
   by mutation reader and cleared during eviction. It will also allow
   newly created mutations from memtable to be moved to cache provided that
   both their successors and predecessors are already there.

The key_reader part of this patchset adds a lot of new code that probably
won't be used in any other place, but the alternative would be to always
interleave reads from cache with reads from sstables and that would be
more heavy on partition index, which isn't cached.

Fixes #185.
2015-10-21 15:26:45 +02:00
Gleb Natapov
6a2a0d628b storage_proxy: use CL=ONE to write logged batch
This is a regression created by logged batch code rework.

Fixes #478.
2015-10-21 15:29:49 +03:00
Avi Kivity
c69c02c162 Merge 2015-10-21 15:17:32 +03:00
Avi Kivity
c49dd5c576 Merge "move dependencies to /opt/scylladb" from Takuya 2015-10-21 15:17:04 +03:00
Glauber Costa
71c1b2fe69 api: get true snapshot size
Thin wrapper around storage service's facility.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-21 13:48:44 +02:00
Glauber Costa
c0630bedc2 api: get_snapshot_details
That's basically conversion work between what the storage_service returns
and the json types.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-21 13:48:44 +02:00
Glauber Costa
cf96d68478 storage_service: true snapshot size
For CFStats, one of the things needed is the size used by the snapshots. Since
the bulk of the work is map-reducing it and adding them together, we will
call get_snapshot_details for the column family and selectively add just
what we need. No need for a separate method here.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-21 13:48:44 +02:00
Glauber Costa
718fea7048 storage_service: get_snapshot_details
The column family object can, for each column family, provide us with a map between
each snapshot it knows about, and two sizes: the total size, and the "real" (or live)
size, which is how much extra space the snapshot is costing us.

This patch map-reduces all CFs to accumulate that system-wide, and then formats that
into a map of "snapshot_details". That is a more convenient format to be consumed
by our json generator.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-21 13:48:44 +02:00
Glauber Costa
77513a40db database: get_snapshot_details
For each of the snapshots available, the api may query for some information:
the total size on disk, and the "real" size. As far as I could understand, the
real size is the size that is used by the SSTables themselves, while the total
size includes also the metadata about the snapshot - like the manifest.json
file.

Details follow:

In the original Cassandra code, total size is:

    long sizeOnDisk = FileUtils.folderSize(snapshot);

folderSize recurses on directories, and adds file.length() on files. Again, my
understanding is that file_size() would give us the same as the length() method
for Java.

The other value, real (or true) size is:

    long trueSize = getTrueAllocatedSizeIn(snapshot);

getTrueAllocatedSizeIn seems to be a tree walker, whose visitor is an instance
of TrueFilesSizeVisitor. What that visitor does, is add up the size of the files
within the tree that are "acceptable".

An acceptable file is a file which:

starts with the same prefix as we want (IOW, belongs to the same SSTable, we
will just test that directly), and is not "alive". The alive list is just the
list of all SSTables in the system that are used by the CFs.

What this tries to do, is to make sure that the trueSnapshotSize is just the
extra space on disk used by the snapshot. Since the snapshots are links, then
if a table goes away, it adds to this size. If it would be there anyway, it does
not.

We can do that in a lot simpler fashion: for each file, we will just look at
the original CF directory, and see if we can find the file there. If we can't,
then it counts towards the trueSize. Even for files that are deleted after
compaction, that "eventually" works, and that simplifies the code tremendously
given that we have to neither list all files in the system - as Cassandra
does - nor go check other shards for liveness information - as we would
otherwise have to do.

The scheme I am proposing may need some tweaks when we support multiple data
directories, as the SSTables may not be directly below the snapshot level.
Still, it would be trivial to inform the CF about their possible locations.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-21 13:48:44 +02:00
Avi Kivity
16006949d0 logalloc: make migrator an object, not a function pointer
The migrator tells lsa how to move an object when it is compacted.
Currently it is a function pointer, which means we must know how to move
the object at compile time.  Making it an object allows us to build the
migration function at runtime, making it suitable for runtime-defined types
(such as tuples and user-defined types).

In the future, we may also store the size there for fixed-size types,
reducing lsa overhead.

C++ variable templates would have made this patch smaller, but unfortunately
they are only supported on gcc 5+.
2015-10-21 11:24:56 +02:00
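The difference that 16006949d0 describes, between a function pointer fixed at compile time and a migrator object built at run time, can be sketched as follows. The class names are hypothetical and do not reflect the actual logalloc interface; storing the size in the object only illustrates the "in the future" remark from the commit message.

```cpp
#include <cstddef>
#include <cstring>

// Abstract migrator: tells the allocator how to move one object.
struct migrator {
    virtual ~migrator() = default;
    virtual void migrate(void* src, void* dst) const = 0;
};

// A migrator configured at run time. The object size is a constructor
// argument, so it can describe types that are only defined at run time.
class bytewise_migrator final : public migrator {
    std::size_t _size;
public:
    explicit bytewise_migrator(std::size_t size) : _size(size) {}
    void migrate(void* src, void* dst) const override {
        std::memcpy(dst, src, _size);
    }
};

// What a compaction step would do: relocate through whichever
// migrator was registered for the object.
void relocate(const migrator& m, void* src, void* dst) {
    m.migrate(src, dst);
}
```

A plain function pointer has no state of its own, so it could not carry the run-time size the way the object does here.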
Avi Kivity
e2cd40e3bc Merge "remove and decommission node support part 2" from Asias
"More preparatory patches for remove and decommission node support:

- stream hints and ranges
- unbootstrap
- replication finished notification"
2015-10-21 12:24:14 +03:00
Takuya ASADA
1bf18679bb dist: add more build dependency for binutils
Signed-off-by: Takuya ASADA <syuu@cloudius-systems.com>
2015-10-21 09:02:40 +00:00