381 Commits

Author SHA1 Message Date
Pekka Enberg
b40999b504 database: Fix drop_column_family() UUID lookup race
Remove the about to be dropped CF from the UUID lookup table before
truncating and stopping it. This closes a race window where new
operations based on the UUID might be initiated after truncate
completes.

Signed-off-by: Pekka Enberg <penberg@scylladb.com>
2015-10-06 17:10:17 +02:00
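
The ordering described in commit b40999b504 (unregister first, then truncate and stop) can be sketched roughly as below. This is a minimal Python illustration, not the project's C++ code: the `Database` and `ColumnFamily` classes, the dictionary-based lookup table, and the `truncate`/`stop` stubs are all hypothetical stand-ins.

```python
import uuid


class ColumnFamily:
    """Hypothetical stand-in for a column family."""

    def __init__(self, name):
        self.name = name
        self.rows = ["row1", "row2"]
        self.stopped = False

    def truncate(self):
        # Discard all data held by this column family.
        self.rows.clear()

    def stop(self):
        self.stopped = True


class Database:
    """Hypothetical stand-in for the database object."""

    def __init__(self):
        self.by_uuid = {}  # UUID -> ColumnFamily lookup table

    def add_column_family(self, cf):
        cf_id = uuid.uuid4()
        self.by_uuid[cf_id] = cf
        return cf_id

    def find_column_family(self, cf_id):
        # Raises KeyError once the column family has been unregistered.
        return self.by_uuid[cf_id]

    def drop_column_family(self, cf_id):
        # Step 1: unregister first, so no new UUID-based operation can
        # reach the column family while it is being torn down.
        cf = self.by_uuid.pop(cf_id)
        # Step 2: only now truncate and stop it.
        cf.truncate()
        cf.stop()
        return cf
```

With this ordering, a lookup by UUID that arrives after truncation has begun fails instead of starting new work on a column family that is going away.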
Pekka Enberg
9576b0ef23 database: Implement drop_keyspace()
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-10-06 14:53:35 +03:00
Tomasz Grabiec
bc1d159c1b Merge branch 'penberg/cql-drop-table/v3' from seastar-dev.git
From Pekka:

This patch series implements support for CQL DROP TABLE. It uses the newly
added truncate infrastructure under the hood. After this series, the
test_table CQL test in dtest passes:

  [penberg@nero urchin-dtest]$ nosetests -v cql_tests.py:TestCQL.table_test
  table_test (cql_tests.TestCQL) ... ok

  ----------------------------------------------------------------------
  Ran 1 test in 23.841s

  OK
2015-10-06 13:39:25 +02:00
Pekka Enberg
b1e6ab144a database: Implement drop_column_family()
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-10-06 11:28:55 +03:00
Pekka Enberg
0651ab6901 database: Futurize drop_column_family() function
Futurize drop_column_family() so that we can call truncate() from it.

Signed-off-by: Pekka Enberg <penberg@scylladb.com>
2015-10-06 11:28:55 +03:00
Pekka Enberg
85ffaa5330 database: Add truncate() variant that does not look up CF by name
For drop_column_family(), we want to first remove the column_family from
lookup tables and truncate after that to avoid races. Introduce a
truncate() variant that takes keyspace and column_family references.

Signed-off-by: Pekka Enberg <penberg@scylladb.com>
2015-10-06 11:28:54 +03:00
Glauber Costa
639ba2b99d incremental backups: move control to the CF level
Currently, we control incremental backups behavior from the storage service.
This creates some very concrete problems, since the storage service is not
always available and initialized.

The solution is to move it to the column family (and to the keyspace so we can
properly propagate the conf file value). When we change this from the api, we will
have to iterate over all of them, changing the value accordingly.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-05 13:16:11 +02:00
Avi Kivity
7c23ec49ae Merge "Support incremental backups" from Glauber
"Generate backups when the configuration file indicates we should;
toggle behavior on/off through the API."
2015-10-04 13:49:20 +03:00
Amnon Heiman
1f16765140 column family: setting the read and write latency histogram
This patch changes the definition of the read and write latency
histograms: it removes the mask value, so that the default value will
be used.

To support gathering the read latency histogram, the query method
cannot be const, as it modifies the histogram statistics.

The read statistic is sample-based and should have no real impact on
performance. If there is an impact, we can always lower the sampling
rate in the future.

Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
2015-10-04 11:52:19 +03:00
Glauber Costa
d4edb82c9e column_family: incremental backups
Only tables that arise from flushes are backed up. Compacted tables are not.
Therefore, the place for that to happen is right after our flush.

Note that due to our sharded architecture, it is possible that in the face of a
value change some shards will back up sstables while others won't.

This is, in theory, possible to mitigate through a rwlock. However, this
doesn't differ from the situation where all tables are coming from a single
shard and the toggle happens in the middle of them.

The code as is guarantees that we'll never partially back up a single sstable,
so that is enough of a guarantee.

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-10-02 18:23:27 +02:00
Pekka Enberg
5e27d476d4 database: Improve exception error messages
When we convert exceptions into CQL server errors, type information is
not preserved. Therefore, improve exception error messages to make
debugging dtest failures, for example, slightly easier.

Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-10-01 11:23:46 +03:00
Calle Wilund
68b8d8f48c database: Implement "truncate" for column family
Including snapshotting.
2015-09-30 09:09:42 +02:00
Calle Wilund
56228fba24 column family: Add "snapshot" operation. 2015-09-30 09:09:42 +02:00
Calle Wilund
c141e15a4a column family: Add "run_with_compaction_disabled" helper
À la Origin. Could just as well have been RAII.
2015-09-30 09:09:41 +02:00
Glauber Costa
22294dd6a0 do not re-read sstable components after write
When we write an SSTable, all its components are already in memory. load() is
too big of a hammer.

We still want to keep the write operation separated from the preparation to
read, but in the case of a newly written SSTable, all we need to do is to open
the index and data file.

Fixes #300

Signed-off-by: Glauber Costa <glommer@scylladb.com>
2015-09-29 10:00:26 +02:00
Tomasz Grabiec
d033cdcefe db: Move "Populating Keyspace ..." message from WARN to INFO level
WARN level is for messages which should draw the log reader's attention;
journalctl highlights them, for example. Populating a keyspace is a
fairly normal thing, so it should be logged at a lower level.
2015-09-23 15:28:44 +02:00
Avi Kivity
d5cf0fb2b1 Add license notices 2015-09-20 10:43:39 +03:00
Raphael S. Carvalho
461ecc55e3 sstable: fix race condition when deleting a partial sstable
The race condition happens when two or more shards try to delete
the same partial sstable, so the problem doesn't affect scylla
when it boots with a single shard.
To fix this problem, shard 0 is made responsible for
deleting a partial sstable.

fixes #359.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-09-16 19:58:44 +03:00
Avi Kivity
7e1d03d098 db: delete ignored sstables
If an sstable is irrelevant for a shard, delete it.  The deletion will
only complete when all shards agree (either ignore the sstable or
delete it after compaction).
2015-09-14 10:14:00 +02:00
Avi Kivity
cab2148141 Merge "partial sstable handling" from Raphael
closes issue #75.
2015-09-13 12:03:50 +03:00
Raphael S. Carvalho
e65c91f324 db: avoid possible underflow on stats pending_compactions
In the event of a compaction failure, run_compaction would be called
more than once for a request, which could result in an
underflow in the stats pending_compactions.
Let's fix that by only decreasing it if compaction succeeded.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-09-13 11:59:34 +03:00
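
The fix in commit e65c91f324 (decrement the counter only when compaction succeeds) can be illustrated with a small Python sketch. The `CompactionStats` class and `run_request` function are hypothetical; the real code is C++, and the retry behaviour is reduced here to a list of attempt outcomes.

```python
class CompactionStats:
    """Hypothetical counter mirroring the pending_compactions statistic."""

    def __init__(self):
        self.pending_compactions = 0


def run_request(stats, attempts):
    """Run one compaction request, retrying until an attempt succeeds.

    `attempts` is a list of booleans, one per call to the compaction
    routine (True means that attempt succeeded).
    """
    stats.pending_compactions += 1  # one increment per request
    for succeeded in attempts:
        if succeeded:
            # Decrement only on success, so failed attempts that are
            # retried cannot push the counter below zero.
            stats.pending_compactions -= 1
            break
    return stats.pending_compactions
```

Decrementing on every attempt instead would subtract several times for a single increment, which is the underflow the commit message describes.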
Raphael S. Carvalho
538611ab93 sstable: delete sstable generation with temporary toc file
When populating a column family, we will now delete all components
of an sstable with a temporary TOC file. An sstable with a temporary
TOC file means that it was partially written, and can be safely
deleted because the respective data is either saved in the commit
log, or in the compacted sstables in case of the partial sstable
being the result of a compaction.
The deletion procedure is guarded against power failure by only deleting
the temporary TOC file after all other components were deleted.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-09-13 03:17:58 -03:00
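
The power-failure guard described in commit 538611ab93 (delete the temporary TOC file last) might look roughly like this in Python. The function name, the `prefix` convention for grouping components, and the example file names are assumptions made for this sketch, not the project's actual naming scheme or code.

```python
import os


def delete_partial_sstable(directory, prefix, temporary_toc):
    """Delete every component of a partially written sstable.

    `prefix` identifies the files belonging to one sstable generation and
    `temporary_toc` is the name of its temporary TOC file. Both naming
    schemes are assumptions made for this sketch.
    """
    for name in sorted(os.listdir(directory)):
        if name.startswith(prefix) and name != temporary_toc:
            os.remove(os.path.join(directory, name))
    # Remove the temporary TOC last: if the process dies part-way through,
    # the marker is still on disk and the cleanup can be retried on the
    # next boot.
    toc_path = os.path.join(directory, temporary_toc)
    if os.path.exists(toc_path):
        os.remove(toc_path)
```

Because the marker outlives every other component, an interrupted cleanup never leaves behind orphaned files that look like a complete sstable.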
Raphael S. Carvalho
7677202700 db: handle temporary TOC file when populating cf
When populating a cf, we should also check for an sstable with a
temporary TOC file, and act accordingly. For the time being,
we will only refuse to boot. Subsequent work is to gather all
files of an sstable with a temporary TOC file and delete them.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-09-13 03:03:30 -03:00
Amnon Heiman
dd7638cfa9 Expose the dirty_memory_region_group in database and add occupancy to
column_family

This patch adds a getter for the dirty_memory_region_group in the
database object and adds an occupancy method to column family that
returns the total occupancy of all the memtables in the column family.

Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
2015-09-10 00:22:08 +03:00
Tomasz Grabiec
882f231ef2 database: Move sstable_range_wrapping_reader to sstable_mutation_readers.hh
Fixes compilation problem in memtable.cc. Should be part of the series
committed in b96018411b
2015-09-09 12:03:06 +02:00
Avi Kivity
b96018411b Merge "Fix flush in the middle of scanning bug" from Tomasz
Fixes #309.

Conflicts:
	sstables/sstables.cc
2015-09-09 11:56:04 +03:00
Tomasz Grabiec
a0c180ef49 memtable: Fix flush in the middle of scanning bug
Fixes #309.

When scanning memtable readers detect it was flushed, which means that
it started to be moved to cache, they fall back to reading from
memtable's sstable.

Eventually what we should do is to combine memtable and cache contents
so that as long as data is not evicted we won't do IO. We do not
support scanning in cache yet though, so there is no point in doing
this now, and it is not trivial.
2015-09-09 10:17:35 +02:00
Avi Kivity
5bbe526738 Merge sstable deletion
Deleting sstables is tricky, since they can be shared across shards.

This patchset introduces an sstable deletion agreement table, that records
the agreement of shards to delete an sstable.  Sstables are only deleted
after all shards have agreed.

With this, we can change core count across boots.

Fixes #53.
2015-09-09 11:01:13 +03:00
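
The agreement rule from the "Merge sstable deletion" message (an sstable is deleted only after all shards have agreed) can be sketched as a small in-memory table. The `DeletionAgreement` class is hypothetical; persistence and changing the core count across boots are not modeled here.

```python
class DeletionAgreement:
    """Hypothetical table recording which shards agreed to delete an sstable."""

    def __init__(self, shard_count):
        self.shard_count = shard_count
        self.agreed = {}   # sstable name -> set of shard ids that agreed
        self.deleted = []  # sstables actually deleted, in order

    def agree(self, sstable, shard):
        """Record one shard's agreement; delete once every shard agreed."""
        shards = self.agreed.setdefault(sstable, set())
        shards.add(shard)
        if len(shards) == self.shard_count:
            del self.agreed[sstable]
            self.deleted.append(sstable)  # stand-in for removing the files
            return True
        return False
```

Using a set per sstable means a shard that reports its agreement twice is still counted only once.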
Gleb Natapov
df468504b6 schema_table: convert code to use distributed<storage_proxy> instead of storage_proxy&
All database code was converted to it when storage_proxy was made
distributed, but then new code was written to use storage_proxy& again.
Passing a distributed<> object is safer since it can be passed between
shards safely. There was a patch to fix one such case yesterday; I found
one more while converting.
2015-09-09 10:19:30 +03:00
Avi Kivity
b76d7db432 db: mark newly created sstables as unshared
Other shards know nothing about them, so they won't mark them for deletion
when the time comes.
2015-09-08 16:45:28 +03:00
Calle Wilund
1004e090f8 Database: Use commitlog::shutdown to help making shutdown more coherent
Should more or less mean that data in sstables + stuff in CL is the
actual DB state.
2015-09-08 11:55:21 +02:00
Tomasz Grabiec
d52853c4fe database: Restore indentation 2015-09-08 10:19:19 +02:00
Tomasz Grabiec
c623fbe1f7 database: Keep sstable as lw_shared_ptr<> from the beginning
Allows us to save on indentation, and we need it as shared anyway later.
2015-09-08 10:19:19 +02:00
Tomasz Grabiec
820a50a36e db: Move FIXME to a more appropriate place
From column_family's point of view, calling write_components() is all
it needs. The FIXME belongs more to an implementation of
write_components().
2015-09-08 10:19:19 +02:00
Tomasz Grabiec
ecf4841953 Fix typo in 'attempt' 2015-09-08 10:19:19 +02:00
Avi Kivity
a95d3f9cf5 Merge "Commitlog shutdown" from Calle
"Refs #293

* Add a commitlog::sync_all_segments, that explicitly forces all pending
  disk writes
* Only delete segments from disk IFF they are marked clean. Thus on partial
  shutdown or whatnot, even if CL is destroyed (destructor runs), disk files
  not yet clean vis-à-vis sstables are preserved and replayable
* Do a sync_all_segments first of all in database::stop.

Exactly what to not stop in main I leave up to others' discretion, or at least
another patch."
2015-09-08 11:11:18 +03:00
Tomasz Grabiec
15ae1a92cb Merge branch 'pdziepak/compaction-remove-items/v4' from seastar-dev.git
From Pawel:

This series makes compaction remove items that are no longer items:
 - expired cells are changed into tombstones
 - items covered by higher level tombstones are removed
 - expired tombstones are removed if possible

Fixes #70.
Fixes #71.
2015-09-08 09:23:00 +02:00
Paweł Dziepak
969fe6b878 sstables: make compact_sstables() take ref to column_family
Signed-off-by: Paweł Dziepak <pdziepak@cloudius-systems.com>
2015-09-07 21:20:32 +02:00
Calle Wilund
6b81845041 Database: Do a commitlog::sync_all first on stop.
Refs #293
IFF one desires to _not_ shut down stuff cleanly, still running this first
in database::stop will at least ensure that mutations already in CL transit
will end up on disk and be replayable
2015-09-07 20:32:04 +02:00
Calle Wilund
d614143f5e Commitlog/database: Fixup series "Commit log flush request on disk overflow"
Also at seastar-dev: calle/commitlog_flush_v3
(And, yes, this time I _did_ update the remote!)

Refs #262

Commit of original series was done on stale version (v2) due to author's
inability to multitask and update git repos.

v3:
* Removed future<> return value from callbacks. I.e. flush callback is now
  only fully synchronous over actual call
2015-09-07 21:29:19 +03:00
Avi Kivity
dee9060b12 Merge "Commit log flush request on disk overflow" from Calle
"Fixes #262

Handles CL disk size exceeding configured max size by calling flush handlers
for each dirty CF id / high replay_position mark. (Instead of uncontrolled
delete as previously).

* Increased default max disk size to 8GB. Same as Origin/scylla.yaml (so no
   real change, but synced).
* Divide the max disk size by cpus (so sum of all shards == max)
* Abstract flush callbacks in CL
* Handler in DB that initiates memtable->sstable writes when called.

Note that the flush request is done "synchronously" in new_segment() (i.e.
when getting a new segment and crossing threshold). This is however more or
less congruent with Origin, which will do a request-sync in the corresponding
case.
Actual dealing with the request should at least in production code however be
done async, and in DB it is, i.e. we initiate sstable writes. Hopefully
they finish soon, and CL segments will be released (before next segment is
allocated).

If the flush request does _not_ eventually result in any CF:s becoming
clean and segments released we could potentially be issuing flushes
repeatedly, but never more often than on every new segment."
2015-09-07 18:46:48 +03:00
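
The overflow handling summarized in the "Commit log flush request on disk overflow" merge can be sketched as follows: the total size limit is divided evenly across shards, and crossing the per-shard limit when a new segment is allocated invokes the registered flush handlers for each dirty column family. The `CommitLogShard` class, its method names, and the byte accounting are hypothetical simplifications, not the project's API.

```python
GIB = 1024 ** 3


class CommitLogShard:
    """Hypothetical per-shard commit log that requests flushes on overflow."""

    def __init__(self, total_max_bytes, shard_count, segment_bytes):
        # Each shard gets an equal share, so the shares sum to the total.
        self.max_bytes = total_max_bytes // shard_count
        self.segment_bytes = segment_bytes
        self.used_bytes = 0
        self.flush_handlers = []
        self.dirty = {}  # column family id -> highest replay position

    def add_flush_handler(self, handler):
        self.flush_handlers.append(handler)

    def mark_dirty(self, cf_id, replay_position):
        current = self.dirty.get(cf_id, 0)
        self.dirty[cf_id] = max(current, replay_position)

    def new_segment(self):
        """Allocate a segment; on crossing the limit, request flushes."""
        self.used_bytes += self.segment_bytes
        if self.used_bytes > self.max_bytes:
            for cf_id, position in sorted(self.dirty.items()):
                for handler in self.flush_handlers:
                    handler(cf_id, position)
```

In this sketch the handlers are called at most once per `new_segment()` call, which mirrors the note that flushes are never requested more often than on every new segment.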
Calle Wilund
380649eb66 Database: Add commitlog flush handler to switch memtables to disk
Initiates flushing of CF:s to sstable on CL disk overflow (flush req)
2015-09-07 13:21:46 +02:00
Tomasz Grabiec
802a9db9b0 Fix spelling of 'definitely_doesnt_exist' 2015-09-06 21:24:58 +02:00
Glauber Costa
0fc2995b54 database: initialize sst field
The reader has a field for the sstable, but we are not initializing it, so it
can be destroyed before we finish our job. It seems to work here, but transposing
this code to the test case crashed it. So this means at some point we will crash
here as well.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-09-02 06:01:38 +03:00
Paweł Dziepak
9ab44d6754 database: log row::max_vector_size and internal_count
Signed-off-by: Paweł Dziepak <pdziepak@cloudius-systems.com>
2015-08-31 17:29:16 +02:00
Avi Kivity
349015a269 Merge "Fix migration manager logging" from Pekka
"Fix migration manager logging to output what origin does. Fixes #112."
2015-08-31 16:27:49 +03:00
Avi Kivity
f2a79aa7f6 Merge Prepare for closing sstables, part 1
Read-ahead will require that we close input_streams.  As part of that
we have to close sstables, and mutation_readers (which encapsulate
input_streams).  This is part 1 of a patchset series to do that.

(The overarching goal is to enable read-ahead for sstables, see #244)

Conflicts:
	sstables/compaction.cc
2015-08-31 16:15:18 +03:00
Avi Kivity
7090dffe91 mutation_reader: switch to a class based implementation
Using a lambda for implementing a mutation_reader is nifty, but does not
allow us to add methods.

Switch to a class-based implementation in anticipation of adding a close()
method.
2015-08-31 15:53:53 +03:00
Calle Wilund
987454d012 Database: Add "flush_all_memtables" 2015-08-31 14:29:50 +02:00
Calle Wilund
f14e3cf8d0 Database: do not create shard-specific dirs for commitlog
New ID scheme allows for a single dir for all segments from all shards.
2015-08-31 14:29:46 +02:00