Commit Graph

291 Commits

Author SHA1 Message Date
Avi Kivity
f7087da054 Merge "GET methods for snapshots" from Glauber
"The snapshots API needs to expose GET methods so people can
query information on them. Now that taking snapshots is supported,
this relatively simple series implements get_snapshot_details, a
column family method, and wires that up through the storage_service."
2015-10-22 15:23:45 +03:00
Avi Kivity
5f3a46eabb Merge "load_new_sstables" from Glauber
"This patchset implements load_new_sstables, allowing one to move tables into the
data directory of a CF and then call "nodetool refresh" to start using them.

Keep in mind that for Cassandra, this is deemed an unsafe operation:
https://issues.apache.org/jira/browse/CASSANDRA-6245

It is still something we should not recommend - unless the CF is totally
empty and not yet used - but we can do a much better job on the safety front.

To guarantee that, the process works in four steps:

1) All writes to this specific column family are disabled. This is a horrible thing to
   do, because dirty memory can grow much more than desired during this. Throughout
   this implementation, we will try to keep the time during which the writes are disabled
   to its bare minimum.

   While disabling the writes, each shard will tell us about the highest generation number
   it has seen.

2) We will scan all tables that we haven't seen before. Those are any tables found in the
   CF datadir whose generation is higher than the highest generation number seen so far. We will link
   them to new generation numbers that are sequential to the ones we have so far, and end up
   with a new generation number that is returned to the next step.

3) The generation number computed in the previous step is now propagated to all CFs, which
   guarantees that all further writes will pick generation numbers that won't conflict with
   the existing tables. Right after doing that, the writes are resumed.

4) The tables we found in step 2 are passed on to each of the CFs. They can now load those
   tables while operations to the CF proceed normally."
2015-10-22 13:42:24 +03:00
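The four-step procedure described in the merge message above can be sketched as follows. This is a simplified, single-threaded illustration; all names (`toy_cf`, `refresh`) are hypothetical and not Scylla's actual API:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// A toy model of the four-step refresh: writes are paused, unknown
// sstables (generation > highest seen) are relinked to sequential
// generations, the new floor is published, and writes resume.
struct toy_cf {
    bool writes_enabled = true;
    int64_t highest_generation = 0;   // highest generation seen so far
    std::vector<int64_t> sstables;    // generations of loaded sstables
};

// Steps 1-3: pause writes, scan for generations above the known maximum,
// assign them fresh sequential generations, publish the new floor, and
// resume writes. Step 4 then loads the relinked tables while the CF serves.
inline int64_t refresh(toy_cf& cf, std::vector<int64_t> found_on_disk) {
    cf.writes_enabled = false;            // step 1: stop writes (kept brief)
    int64_t next = cf.highest_generation;
    std::vector<int64_t> relinked;
    std::sort(found_on_disk.begin(), found_on_disk.end());
    for (int64_t gen : found_on_disk) {
        if (gen > cf.highest_generation) {   // step 2: only unseen tables
            relinked.push_back(++next);      // link to a sequential generation
        }
    }
    cf.highest_generation = next;         // step 3: propagate the new floor...
    cf.writes_enabled = true;             // ...and resume writes
    for (int64_t gen : relinked) {        // step 4: load while serving normally
        cf.sstables.push_back(gen);
    }
    return next;
}
```

Note that calling `refresh` with no new tables is a no-op, matching the safety property described in the reshuffle commit below.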
Amnon Heiman
c130381284 Adding live_scanned and tombstone scanned histograms to column family
This series adds histograms to the column family for live scanned and
tombstone scanned.

It exposes those histograms via the API instead of the stub implementation
that currently exists.

The implementation update of the histogram will be added in a different
series.

Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
2015-10-22 11:13:28 +03:00
Glauber Costa
36cea4313e column family: load new sstables
CF-level code to load new SSTables. There isn't really a lot of complication
here. We don't even need to repopulate the entire SSTable directory: by
requiring that the external service that is coordinating this tell us explicitly
about the new SSTables found in the scan process, we can just load them
specifically and add them to the SSTable map.

All new tables will start their lives as shared tables, and will be unshared
if it is possible to do so: this all happens inside add_sstable and there isn't
really anything special on this front.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-21 18:06:22 +02:00
Glauber Costa
61be9fb02d reshuffle tables: mechanism to adjust new sstables' generation number
Before loading new SSTables into the node, we need to make sure that their
generation numbers are sequential (at least if we want to follow Cassandra's
footsteps here).

Note that this is unsafe by design. More information can be found at:
https://issues.apache.org/jira/browse/CASSANDRA-6245

However, we can already do slightly better in two ways:

Unlike Cassandra, this method takes as a parameter a generation number. We
will not touch tables that are before that number at all. That number must be
calculated from all shards as the highest generation number they have seen themselves.
Calling load_new_sstables in the absence of new tables will therefore do nothing,
and will be completely safe.

It will also return the highest generation number found after the reshuffling
process.  New writers should start writing after that. Therefore, new tables
that are created will have a generation number that is higher than any of these,
and will therefore be safe.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-21 18:06:22 +02:00
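The parameter described above ("calculated from all shards as the highest generation number they have seen") reduces to a maximum across shards; a minimal sketch with hypothetical names, not the actual Scylla code:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Each shard reports the highest sstable generation it has seen; the
// baseline passed to the reshuffle step is the maximum across shards,
// so tables at or below it are never touched.
inline int64_t baseline_generation(const std::vector<int64_t>& per_shard_highest) {
    int64_t base = 0;
    for (int64_t g : per_shard_highest) {
        base = std::max(base, g);
    }
    return base;
}
```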
Glauber Costa
1351c1cc13 database: mechanism to stop writing sstables
During certain operations we need to stop writing SSTables. This is needed when
we want to load new SSTables into the system. They will have to be scanned by all
shards, agreed upon, and in most cases even renamed. Letting SSTables be written
at that point makes it inherently racy - especially with the rename.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-21 18:06:22 +02:00
Glauber Costa
29e2ad7fd8 column family: commonize code to calculate the desired SSTable generation
We will reuse this for load_new_sstables.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-21 18:02:43 +02:00
Glauber Costa
f3bad2032d database: fix type for sstable generation.
Avoid using long for it; use a fixed-size type instead. Make it signed
rather than unsigned to avoid upsetting any code that we may have converted.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-21 18:01:20 +02:00
Tomasz Grabiec
764d913d84 Merge branch 'pdziepak/row-cache-range-query/v4' from seastar-dev.git
From Pawel:

This series enables row cache to serve range queries. In order to achieve
that, row cache needs to know whether there are some other partitions in
the specified range that are not cached and need to be read from the sstables.
That information is provided by key_readers, which work very similarly to
mutation_readers, but return only the decorated keys of partitions in
range. In the case of sstables, key_reader is implemented using the partition
index.

An approach like this has the disadvantage of needing to access the disk
even if all partitions in the range are cached. There are (at least) two
ways of dealing with that problem:
 - cache the partition index - that will also help in all other places where it
   is needed
 - add a flag to cache_entry which, when set, indicates that the immediate
   successor of the partition is also in the cache. Such a flag would be set
   by the mutation reader and cleared during eviction. It would also allow
   newly created mutations from memtable to be moved to cache provided that
   both their successors and predecessors are already there.

The key_reader part of this patchset adds a lot of new code that probably
won't be used in any other place, but the alternative would be to always
interleave reads from cache with reads from sstables and that would be
more heavy on partition index, which isn't cached.

Fixes #185.
2015-10-21 15:26:45 +02:00
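The role of a key_reader, as described above, can be illustrated with a toy function that compares the keys in a range (as the partition index would yield them) against the cache, to find the partitions that still need a disk read. Names and types here are illustrative, not the actual Scylla interfaces:

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Toy version of the idea behind key_reader: it yields only the keys of
// partitions in a range (modeled here as sorted strings). Comparing that
// stream against the cache tells us which partitions must be read from
// the sstables rather than served from cache.
inline std::vector<std::string> keys_missing_from_cache(
        const std::vector<std::string>& keys_in_range,   // from partition index
        const std::set<std::string>& cached_keys) {
    std::vector<std::string> missing;
    for (const auto& k : keys_in_range) {
        if (!cached_keys.count(k)) {
            missing.push_back(k);   // must be read from sstables
        }
    }
    return missing;
}
```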
Glauber Costa
77513a40db database: get_snapshot_details
For each of the snapshots available, the api may query for some information:
the total size on disk, and the "real" size. As far as I could understand, the
real size is the size that is used by the SSTables themselves, while the total
size also includes the metadata about the snapshot - like the manifest.json
file.

Details follow:

In the original Cassandra code, total size is:

    long sizeOnDisk = FileUtils.folderSize(snapshot);

folderSize recurses on directories, and adds file.length() on files. Again, my
understanding is that file_size() would give us the same as Java's length()
method.

The other value, real (or true) size is:

    long trueSize = getTrueAllocatedSizeIn(snapshot);

getTrueAllocatedSizeIn seems to be a tree walker, whose visitor is an instance
of TrueFilesSizeVisitor. What that visitor does is add up the sizes of the files
within the tree that are "acceptable".

An acceptable file is a file which:

starts with the prefix we want (IOW, belongs to the same SSTable; we
will just test that directly), and is not "alive". The alive list is just the
list of all SSTables in the system that are used by the CFs.

What this tries to do is to make sure that the trueSnapshotSize is just the
extra space on disk used by the snapshot. Since the snapshots are links,
if a table goes away, it adds to this size. If it would be there anyway, it does
not.

We can do that in a much simpler fashion: for each file, we will just look at
the original CF directory and see if we can find the file there. If we can't,
then it counts towards the trueSize. Even for files that are deleted after
compaction, that "eventually" works, and it simplifies the code tremendously,
given that we neither have to list all files in the system - as Cassandra
does - nor check other shards for liveness information - as we would have to
do.

The scheme I am proposing may need some tweaks when we support multiple data
directories, as the SSTables may not be directly below the snapshot level.
Still, it would be trivial to inform the CF about their possible locations.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-21 13:48:44 +02:00
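The proposed true-size computation can be sketched in miniature: directories are modeled as in-memory maps and sets, and a snapshot file counts toward the true size only if it is absent from the CF's live directory. This illustrates the idea, not the actual filesystem-walking code:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>

// Simplified model of the proposed computation: because snapshots are
// hard links, a file shared with the live CF directory costs no extra
// space; only files that survive solely in the snapshot count.
inline uint64_t true_snapshot_size(
        const std::map<std::string, uint64_t>& snapshot_files,  // name -> size
        const std::set<std::string>& live_cf_files) {
    uint64_t true_size = 0;
    for (const auto& [name, size] : snapshot_files) {
        if (!live_cf_files.count(name)) {
            true_size += size;   // only the snapshot keeps this data alive
        }
    }
    return true_size;
}
```

The total size, by contrast, would simply sum every file in the snapshot directory.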
Paweł Dziepak
96a42a9c69 column_family: add sstables_as_key_source()
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
2015-10-20 20:27:53 +02:00
Glauber Costa
d236b01b48 snapshots: check existence of snapshots
We go to the filesystem to check if the snapshot exists. This should make us
robust against deletions of existing snapshots from the filesystem.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-20 15:58:26 +02:00
Glauber Costa
d3aef2c1a5 database: support clear snapshot
This allows us to delete an existing snapshot. It works at the column
family level, and removing it from the list of keyspace snapshots needs to
happen only when all CFs are processed. Therefore, that is provided as a
separate operation.

The filesystem code is a bit ugly: it can be made better by making our file
lister more generic. First step would be to call it walker, not lister...

For now, we'll use the fact that there are mostly two levels in the snapshot
hierarchy to our advantage, and avoid a full recursion - using the same lambda
for all calls would require us to provide a separate class to handle the state;
that's part of making this generic.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-20 15:38:14 +02:00
Avi Kivity
2ccb5feabd Merge "Support nodetool cfhistogram"
"This series adds the missing estimated histogram to the column family and to
the API so the nodetool cfhistogram would work."
2015-10-19 17:11:46 +03:00
Raphael S. Carvalho
35b75e9b67 adapt compaction procedure to support leveled strategy
Adapt our compaction code to start writing a new sstable if the
one being written has reached its maximum size. Leveled strategy works
with that concept. If a strategy other than leveled is being used,
everything will work as before.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-10-16 01:54:52 -03:00
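The rollover behavior described above can be modeled with a toy writer that starts a new output sstable once the current one reaches the strategy's maximum size; with an effectively unlimited maximum it degenerates to the old single-output behavior. `toy_compaction_writer` is a hypothetical name, not Scylla's actual class:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy model of the compaction writer change: data is appended to the
// current output sstable, and once appending would exceed the strategy's
// maximum size a new output table is started.
struct toy_compaction_writer {
    uint64_t max_sstable_size;
    std::vector<uint64_t> outputs{0};   // bytes written per output table

    void write(uint64_t bytes) {
        if (outputs.back() + bytes > max_sstable_size && outputs.back() > 0) {
            outputs.push_back(0);       // roll over to a new sstable
        }
        outputs.back() += bytes;
    }
};
```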
Calle Wilund
012ab24469 column_family: Add flush queue object to act as ordering guarantee 2015-10-14 14:07:40 +02:00
Glauber Costa
b2fef14ada do not calculate truncation time independently
Currently, we are calculating truncated_at during truncate() independently for
each shard. It will work if we're lucky, but it is fairly easy to trigger cases
in which each shard will end up with a slightly different time.

The main problem here is that this time is used as the snapshot name when auto
snapshots are enabled. Previous to my last fixes, this would just generate two
separate directories in this case, which is wrong but not severe.

But after the fix, this means that both shards will wait for one another to
synchronize and this will hang the database.

Fix this by making sure that the truncation time is calculated before
invoke_on_all in all needed places.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-09 17:17:11 +03:00
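The fix described above amounts to sampling the clock once and broadcasting the value, instead of letting each shard sample it independently. A toy model with an injectable clock (names are illustrative; the real code runs across shards via invoke_on_all):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Each "shard" records the truncation time it was handed.
struct toy_shard {
    int64_t truncated_at = 0;
};

// The bug comes from each shard calling the clock itself; the fix is to
// sample once, before fanning out, and pass the same value everywhere.
template <typename Clock>
inline void truncate_all(std::vector<toy_shard>& shards, Clock now) {
    const int64_t truncated_at = now();   // sampled ONCE, before the fan-out
    for (auto& s : shards) {
        s.truncated_at = truncated_at;    // every shard sees the same time
    }
}
```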
Amnon Heiman
6d90eebfb9 column family: Add estimated histogram impl
This patch adds the read and write latency estimated histogram support
and adds an estimated histogram of the number of sstables that were used in
a read.

Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
2015-10-08 14:59:17 +03:00
Tomasz Grabiec
bc1d159c1b Merge branch 'penberg/cql-drop-table/v3' from seastar-dev.git
From Pekka:

This patch series implements support for CQL DROP TABLE. It uses the newly
added truncate infrastructure under the hood. After this series, the
test_table CQL test in dtest passes:

  [penberg@nero urchin-dtest]$ nosetests -v cql_tests.py:TestCQL.table_test
  table_test (cql_tests.TestCQL) ... ok

  ----------------------------------------------------------------------
  Ran 1 test in 23.841s

  OK
2015-10-06 13:39:25 +02:00
Pekka Enberg
afbb2f865d database: Add keyspace_metadata::remove_column_family() helper
Signed-off-by: Pekka Enberg <penberg@scylladb.com>
2015-10-06 11:28:55 +03:00
Pekka Enberg
0651ab6901 database: Futurize drop_column_family() function
Futurize drop_column_family() so that we can call truncate() from it.

Signed-off-by: Pekka Enberg <penberg@scylladb.com>
2015-10-06 11:28:55 +03:00
Pekka Enberg
85ffaa5330 database: Add truncate() variant that does not look up CF by name
For drop_column_family(), we want to first remove the column_family from
lookup tables and truncate after that to avoid races. Introduce a
truncate() variant that takes keyspace and column_family references.

Signed-off-by: Pekka Enberg <penberg@scylladb.com>
2015-10-06 11:28:54 +03:00
Glauber Costa
639ba2b99d incremental backups: move control to the CF level
Currently, we control incremental backups behavior from the storage service.
This creates some very concrete problems, since the storage service is not
always available and initialized.

The solution is to move it to the column family (and to the keyspace so we can
properly propagate the conf file value). When we change this from the api, we will
have to iterate over all of them, changing the value accordingly.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-05 13:16:11 +02:00
Glauber Costa
69d1358627 database: non const versions of get_keyspaces/column_families
We will need to change some properties of the keyspace / cf. We need an accessor
that is not marked as const.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-05 13:13:37 +02:00
Amnon Heiman
1f16765140 column family: setting the read and write latency histogram
This patch contains the following changes: in the definition of the read
and write latency histograms, it removes the mask value, so that the
default value will be used.

To support the gathering of the read latency histogram, the query method
cannot be const, as it modifies the histogram statistics.

The read statistic is sample based and should have no real impact on
performance; if there is an impact, we can always change it in the
future to a lower sampling rate.

Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
2015-10-04 11:52:19 +03:00
Pekka Enberg
5e27d476d4 database: Improve exception error messages
When we convert exceptions into CQL server errors, type information is
not preserved. Therefore, improve exception error messages to make
debugging dtest failures, for example, slightly easier.

Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-10-01 11:23:46 +03:00
Calle Wilund
68b8d8f48c database: Implement "truncate" for column family
Including snapshotting.
2015-09-30 09:09:42 +02:00
Calle Wilund
56228fba24 column family: Add "snapshot" operation. 2015-09-30 09:09:42 +02:00
Calle Wilund
c141e15a4a column family: Add "run_with_compaction_disabled" helper
A'la origin. Could as well have been RAII.
2015-09-30 09:09:41 +02:00
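The commit note suggests an RAII formulation would have worked too; a sketch of that alternative, using a toy boolean flag in place of the real compaction manager:

```cpp
#include <cassert>

// RAII guard: compaction is disabled on construction and re-enabled on
// destruction, so it is restored even if the callback throws.
struct compaction_guard {
    bool& enabled;
    explicit compaction_guard(bool& e) : enabled(e) { enabled = false; }
    ~compaction_guard() { enabled = true; }
};

// Toy version of run_with_compaction_disabled built on the guard.
template <typename Func>
inline auto run_with_compaction_disabled(bool& compaction_enabled, Func f) {
    compaction_guard g(compaction_enabled);
    return f();
}
```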
Avi Kivity
d5cf0fb2b1 Add license notices 2015-09-20 10:43:39 +03:00
Amnon Heiman
089bd6a5bd column family: Expose the compaction strategy
This exposes the compaction strategy object.

Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
2015-09-12 08:35:34 +03:00
Amnon Heiman
3af683e6f4 column family: add estimate read, write
This adds an estimated read and estimated write histogram to the column
family stats object.
2015-09-12 08:35:03 +03:00
Amnon Heiman
dd7638cfa9 Expose the dirty_memory_region_group in database and add occupancy to
column_family

This patch adds a getter for the dirty_memory_region_group in the
database object and adds an occupancy method to column family that
returns the total occupancy of all the memtables in the column family.

Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
2015-09-10 00:22:08 +03:00
Avi Kivity
b96018411b Merge "Fix flush in the middle of scanning bug" from Tomasz
Fixes #309.

Conflicts:
	sstables/sstables.cc
2015-09-09 11:56:04 +03:00
Tomasz Grabiec
320ff132f8 sstables: Relax header dependencies 2015-09-09 10:07:43 +02:00
Gleb Natapov
df468504b6 schema_table: convert code to use distributed<storage_proxy> instead of storage_proxy&
All database code was converted to it when storage_proxy was made
distributed, but then new code was written to use storage_proxy& again.
Passing a distributed<> object is safer, since it can be passed between
shards safely. There was a patch to fix one such case yesterday; I found
one more while converting.
2015-09-09 10:19:30 +03:00
Tomasz Grabiec
c623fbe1f7 database: Keep sstable as lw_shared_ptr<> from the beginning
Allows us to save on indentation, and we need it as shared anyway later.
2015-09-08 10:19:19 +02:00
Calle Wilund
380649eb66 Database: Add commitlog flush handler to switch memtables to disk
Initiates flushing of CF:s to sstable on CL disk overflow (flush req)
2015-09-07 13:21:46 +02:00
Avi Kivity
349015a269 Merge "Fix migration manager logging" from Pekka
"Fix migration manager logging to output what origin does. Fixes #112."
2015-08-31 16:27:49 +03:00
Calle Wilund
987454d012 Database: Add "flush_all_memtables" 2015-08-31 14:29:50 +02:00
Pekka Enberg
03e0bcd8cb database: Add operator<< for keyspace_metadata
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-08-31 13:35:19 +03:00
Pekka Enberg
04a65ec06f database: Add keyspace_metadata::validate() helper
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-08-31 11:54:56 +03:00
Avi Kivity
012fd41fc0 db: hard dirty memory limit
Unlike cache, dirty memory cannot be evicted at will, so we must limit it.

This patch establishes a hard limit of 50% of all memory.  Above that,
new requests are not allowed to start.  This allows the system some time
to clean up memory.

Note that we will need more fine-grained bandwidth control than this;
the hard limit is the last line of defense against running out of reclaimable
memory.

Tested with a mixed read/write load; after reads start to dominate writes
(due to the proliferation of small sstables, and the inability of compaction
to keep up), dirty memory usage starts to climb until the hard stop prevents
it from climbing further and OOMing the server.
2015-08-28 14:47:17 +02:00
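The hard limit described above reduces to a simple admission check; a minimal sketch (the 50% constant comes from the commit text, the function name is hypothetical):

```cpp
#include <cassert>
#include <cstdint>

// Admission control for the hard dirty-memory limit: once dirty memory
// reaches half of all memory, new requests are not allowed to start,
// giving the system time to flush and clean up.
inline bool admit_request(uint64_t dirty_bytes, uint64_t total_bytes) {
    return dirty_bytes < total_bytes / 2;   // hard limit: 50% of all memory
}
```

As the commit notes, this is a last line of defense; finer-grained bandwidth control would throttle writers before the limit is hit.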
Avi Kivity
5f62f7a288 Revert "Merge "Commit log replay" from Calle"
Due to test breakage.

This reverts commit 43a4491043, reversing
changes made to 5dcf1ab71a.
2015-08-27 12:39:08 +03:00
Avi Kivity
0fff367230 Merge "test for compaction metadata's ancestors" from Raphael 2015-08-27 11:07:53 +03:00
Avi Kivity
4e3c9c5493 Merge "compaction manager fixes" from Raphael 2015-08-27 11:05:26 +03:00
Avi Kivity
43a4491043 Merge "Commit log replay" from Calle
"Initial implementation/transposition of commit log replay.

* Changes replay position to be shard aware
* Commit log segment ID:s now follow basically the same scheme as origin;
  max(previous ID, wall clock time in ms) + shard info (for us)
* SStables now use the DB definition of replay_position.
* Stores and propagates (compaction) flush replay positions in sstables
* If CL segments are left over from a previous run, they, and existing
  sstables are inspected for high water mark, and then replayed from
  those marks to amend mutations potentially lost in a crash
* Note that CPU count changes are "handled" only in so much as shard matching is
  per the _previous_ run's shards, not the current ones.

Known limitations:
* Mutations deserialized from old CL segments are _not_ fully validated
  against existing schemas.
* System::truncated_at (not currently used) does not handle sharding afaik,
  so watermark ID:s coming from there are dubious.
* Mutations that fail to apply (invalid, broken) are not placed in blob files
  like origin. Partly because I am lazy, but also partly because our serial
  format differs, and we currently have no tools to do anything useful with it
* No replay filtering (Origin allows a system property to designate a filter
  file, detailing which keyspace/cf:s to replay). Partly because we have no
  system properties.

There is no unit test for the commit log replayer (yet).
Because I could not really come up with a good one given the test
infrastructure that exists (tricky to kill stuff just "right").
The functionality is verified by manual testing, i.e. running scylla,
building up data (cassandra-stress), kill -9 + restart.
This of course does not really fully validate whether the resulting DB is
100% valid compared to the one at kill -9 time, but at least it verified that replay
took place and mutations were applied.
(Note that origin also lacks validity testing)"
2015-08-27 10:53:36 +03:00
Amnon Heiman
b5ceef451e keyspace: Add get_non_system_keyspaces and expose the replication strategy
This patch adds the get_non_system_keyspaces method found in origin and
exposes the replication strategy via the get_replication_strategy method.

Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
2015-08-25 19:39:13 +03:00
Calle Wilund
df8d7a8295 Database: Add "flush_all_memtables" 2015-08-25 09:41:56 +02:00
Avi Kivity
4390be3956 Rename 'negative_mutation_reader' to 'partition_presence_checker'
Suggested by Tomek.
2015-08-24 18:03:22 +03:00