342 Commits

Author SHA1 Message Date
Calle Wilund
d614143f5e Commitlog/database: Fixup series "Commit log flush request on disk overflow"
Also at seastar-dev: calle/commitlog_flush_v3
(And, yes, this time I _did_ update the remote!)

Refs #262

Commit of original series was done on stale version (v2) due to the author's
inability to multitask and update git repos.

v3:
* Removed future<> return value from callbacks. I.e. flush callback is now
  only fully synchronous over the actual call
2015-09-07 21:29:19 +03:00
Avi Kivity
dee9060b12 Merge "Commit log flush request on disk overflow" from Calle
"Fixes #262

Handles CL disk size exceeding configured max size by calling flush handlers
for each dirty CF id / high replay_position mark. (Instead of uncontrolled
delete as previously).

* Increased default max disk size to 8GB. Same as Origin/scylla.yaml (so no
   real change, but synced).
* Divide the max disk size by cpus (so sum of all shards == max)
* Abstract flush callbacks in CL
* Handler in DB that initiates memtable->sstable writes when called.

Note that the flush request is done "synchronously" in new_segment() (i.e.
when getting a new segment and crossing threshold). This is however more or
less congruent with Origin, which will do a request-sync in the corresponding
case.
Actual dealing with the request should at least in production code however be
done async, and in DB it is, i.e. we initiate sstable writes. Hopefully
they finish soon, and CL segments will be released (before next segment is
allocated).

If the flush request does _not_ eventually result in any CF:s becoming
clean and segments released we could potentially be issuing flushes
repeatedly, but never more often than on every new segment."
2015-09-07 18:46:48 +03:00
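The overflow handling described in the merge above can be modelled roughly as follows. This is a minimal Python sketch, not the Scylla code: `ShardCommitlog`, `add_flush_handler`, `mark_dirty` and the byte sizes are hypothetical names and values chosen for illustration. It shows the configured maximum being divided by the shard count and flush handlers being called from `new_segment()` once the shard's share is exceeded.

```python
from typing import Callable, Dict, List

# Hypothetical names throughout; this models the idea, not the Scylla API.
FlushHandler = Callable[[str, int], None]  # (cf_id, high replay position)


class ShardCommitlog:
    def __init__(self, total_max_bytes: int, shard_count: int, segment_bytes: int):
        # Each shard gets an equal slice, so the sum over all shards equals the total.
        self.max_bytes = total_max_bytes // shard_count
        self.segment_bytes = segment_bytes
        self.segments = 0
        self.dirty: Dict[str, int] = {}  # cf_id -> highest replay position written
        self.handlers: List[FlushHandler] = []

    def add_flush_handler(self, handler: FlushHandler) -> None:
        self.handlers.append(handler)

    def mark_dirty(self, cf_id: str, position: int) -> None:
        self.dirty[cf_id] = max(position, self.dirty.get(cf_id, 0))

    def new_segment(self) -> None:
        self.segments += 1
        if self.segments * self.segment_bytes > self.max_bytes:
            # Over the limit: request flushes instead of deleting segments.
            for cf_id, position in sorted(self.dirty.items()):
                for handler in self.handlers:
                    handler(cf_id, position)
```

In this model a handler would only start the memtable write and return, which matches the note that the request is synchronous while the actual flush is not.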
Calle Wilund
380649eb66 Database: Add commitlog flush handler to switch memtables to disk
Initiates flushing of CF:s to sstable on CL disk overflow (flush req)
2015-09-07 13:21:46 +02:00
Tomasz Grabiec
802a9db9b0 Fix spelling of 'definitely_doesnt_exist' 2015-09-06 21:24:58 +02:00
Glauber Costa
0fc2995b54 database: initialize sst field
The reader has a field for the sstable, but we are not initializing it, so it
can be destroyed before we finish our job. It seems to work here, but transposing
this code to the test case crashed it. So this means at some point we will crash
here as well.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-09-02 06:01:38 +03:00
Paweł Dziepak
9ab44d6754 database: log row::max_vector_size and internal_count
Signed-off-by: Paweł Dziepak <pdziepak@cloudius-systems.com>
2015-08-31 17:29:16 +02:00
Avi Kivity
349015a269 Merge "Fix migration manager logging" from Pekka
"Fix migration manager logging to output what origin does. Fixes #112."
2015-08-31 16:27:49 +03:00
Avi Kivity
f2a79aa7f6 Merge Prepare for closing sstables, part 1
Read-ahead will require that we close input_streams.  As part of that
we have to close sstables, and mutation_readers (which encapsulate
input_streams).  This is part 1 of a patchset series to do that.

(The overarching goal is to enable read-ahead for sstables, see #244)

Conflicts:
	sstables/compaction.cc
2015-08-31 16:15:18 +03:00
Avi Kivity
7090dffe91 mutation_reader: switch to a class based implementation
Using a lambda for implementing a mutation_reader is nifty, but does not
allow us to add methods.

Switch to a class-based implementation in anticipation of adding a close()
method.
2015-08-31 15:53:53 +03:00
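The motivation in the commit above can be illustrated with a small Python analogue (the original is C++; `ListMutationReader` is a hypothetical name). A reader that is only a callable can be invoked but cannot grow further methods, whereas a class keeps the call syntax and leaves room for `close()`.

```python
from typing import Iterable, Iterator, Optional


class ListMutationReader:
    """Hypothetical reader: calling it returns the next mutation, or None at the end."""

    def __init__(self, mutations: Iterable[str]):
        self._it: Iterator[str] = iter(mutations)
        self.closed = False

    def __call__(self) -> Optional[str]:
        if self.closed:
            raise RuntimeError("reader is closed")
        return next(self._it, None)

    def close(self) -> None:
        # The extra method that a bare lambda could not provide.
        self.closed = True
```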
Calle Wilund
987454d012 Database: Add "flush_all_memtables" 2015-08-31 14:29:50 +02:00
Calle Wilund
f14e3cf8d0 Database: do not create shard-specific dirs for commitlog
New ID scheme allows for a single dir for all segments from all shards.
2015-08-31 14:29:46 +02:00
Pekka Enberg
03e0bcd8cb database: Add operator<< for keyspace_metadata
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-08-31 13:35:19 +03:00
Pekka Enberg
04a65ec06f database: Add keyspace_metadata::validate() helper
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-08-31 11:54:56 +03:00
Avi Kivity
012fd41fc0 db: hard dirty memory limit
Unlike cache, dirty memory cannot be evicted at will, so we must limit it.

This patch establishes a hard limit of 50% of all memory.  Above that,
new requests are not allowed to start.  This allows the system some time
to clean up memory.

Note that we will need more fine-grained bandwidth control than this;
the hard limit is the last line of defense against running out of reclaimable
memory.

Tested with a mixed read/write load; after reads start to dominate writes
(due to the proliferation of small sstables and the inability of compaction
to keep up), dirty memory usage starts to climb until the hard stop prevents
it from climbing further and ooming the server.
2015-08-28 14:47:17 +02:00
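A minimal sketch of the hard limit described above, assuming a simple byte counter. `DirtyMemoryGate` and its methods are hypothetical names, and where a real server would presumably make the request wait, this model just returns False.

```python
class DirtyMemoryGate:
    """Hypothetical admission gate for dirty (unevictable) memory."""

    def __init__(self, total_memory: int, hard_fraction: float = 0.5):
        self.limit = int(total_memory * hard_fraction)
        self.dirty = 0

    def try_start(self, request_bytes: int) -> bool:
        # Above the hard limit, new requests are not allowed to start.
        if self.dirty > self.limit:
            return False
        self.dirty += request_bytes
        return True

    def flushed(self, nbytes: int) -> None:
        # Memory becomes reclaimable again once a memtable reaches disk.
        self.dirty = max(0, self.dirty - nbytes)
```

Requests already admitted are not interrupted; the gate only stops new ones, which gives flushing time to bring the counter back under the limit.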
Avi Kivity
5f62f7a288 Revert "Merge "Commit log replay" from Calle"
Due to test breakage.

This reverts commit 43a4491043, reversing
changes made to 5dcf1ab71a.
2015-08-27 12:39:08 +03:00
Avi Kivity
0fff367230 Merge "test for compaction metadata's ancestors" from Raphael 2015-08-27 11:07:53 +03:00
Avi Kivity
4e3c9c5493 Merge "compaction manager fixes" from Raphael 2015-08-27 11:05:26 +03:00
Avi Kivity
43a4491043 Merge "Commit log replay" from Calle
"Initial implementation/transposition of commit log replay.

* Changes replay position to be shard aware
* Commit log segment ID:s now follow basically the same scheme as origin;
  max(previous ID, wall clock time in ms) + shard info (for us)
* SStables now use the DB definition of replay_position.
* Stores and propagates (compaction) flush replay positions in sstables
* If CL segments are left over from a previous run, they, and existing
  sstables are inspected for high water mark, and then replayed from
  those marks to amend mutations potentially lost in a crash
* Note that CPU count change is "handled" insofar as shard matching is
  per the _previous_ run's shards, not the current ones.

Known limitations:
* Mutations deserialized from old CL segments are _not_ fully validated
  against existing schemas.
* System::truncated_at (not currently used) does not handle sharding afaik,
  so watermark ID:s coming from there are dubious.
* Mutations that fail to apply (invalid, broken) are not placed in blob files
  like origin. Partly because I am lazy, but also partly because our serial
  format differs, and we currently have no tools to do anything useful with it
* No replay filtering (Origin allows a system property to designate a filter
  file, detailing which keyspace/cf:s to replay). Partly because we have no
  system properties.

There is no unit test for the commit log replayer (yet).
Because I could not really come up with a good one given the test
infrastructure that exists (tricky to kill stuff just "right").
The functionality is verified by manual testing, i.e. running scylla,
building up data (cassandra-stress), kill -9 + restart.
This of course does not really fully validate whether the resulting DB is
100% valid compared to the one at k-9, but at least it verified that replay
took place, and mutations were applied.
(Note that origin also lacks validity testing)"
2015-08-27 10:53:36 +03:00
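The segment ID scheme mentioned above, max(previous ID, wall clock time in ms) plus shard info, might look roughly like the Python sketch below. How the shard is encoded is not stated in the message, so packing it into the low `SHARD_BITS` bits is an assumption, as is the `+ 1` that keeps two IDs allocated in the same millisecond distinct. All names are hypothetical.

```python
import time
from typing import Callable, Tuple

SHARD_BITS = 8  # assumption: width reserved for the shard number


class SegmentIdAllocator:
    def __init__(self, shard: int,
                 clock_ms: Callable[[], int] = lambda: int(time.time() * 1000)):
        assert 0 <= shard < (1 << SHARD_BITS)
        self.shard = shard
        self.clock_ms = clock_ms
        self.prev_base = 0

    def next_id(self) -> int:
        # The "+ 1" is an assumption so IDs from the same millisecond still differ.
        self.prev_base = max(self.prev_base + 1, self.clock_ms())
        return (self.prev_base << SHARD_BITS) | self.shard


def split_id(segment_id: int) -> Tuple[int, int]:
    """Return (time-based part, shard) for an ID built by SegmentIdAllocator."""
    return segment_id >> SHARD_BITS, segment_id & ((1 << SHARD_BITS) - 1)
```

Because the shard can be recovered from the ID, segments from all shards can share one directory, and replay can match each segment to the shard that wrote it in the previous run.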
Avi Kivity
e6965c520d Merge "Adding the ownership support to storage_service" from Amnon
"This series adds the missing code from origin to support this functionality.
While doing so, some methods were changed to be const where that was more
appropriate, and a few const versions of methods were added where both
variations were required."
2015-08-25 20:13:33 +03:00
Amnon Heiman
b5ceef451e keyspace: Add the get_non_system_keyspaces and expose the replication
This patch adds the get_non_system_keyspaces method that is found in origin
and exposes the replication strategy through the get_replication_strategy
method.

Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
2015-08-25 19:39:13 +03:00
Vlad Zolotarov
08e7736f0b database::find_column_family(): init the exception with the readable message
Make the exceptions created inside database::find_column_family() return
a readable message from their what() method.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
2015-08-25 18:00:19 +03:00
Calle Wilund
df8d7a8295 Database: Add "flush_all_memtables" 2015-08-25 09:41:56 +02:00
Calle Wilund
5524da8f18 Database: do not create shard-specific dirs for commitlog
New ID scheme allows for a single dir for all segments from all shards.
2015-08-25 09:40:52 +02:00
Avi Kivity
4390be3956 Rename 'negative_mutation_reader' to 'partition_presence_checker'
Suggested by Tomek.
2015-08-24 18:03:22 +03:00
Raphael S. Carvalho
c65af6e188 api: add get_unleveled_sstables to column family api
Add an API function that returns the count of sstables in L0 if leveled
compaction strategy is enabled, 0 otherwise. Currently, we don't
support leveled compaction strategy, so the function always returns
zero.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-08-24 11:56:31 -03:00
Raphael S. Carvalho
4c9c144987 compaction_manager: avoid concurrent compaction on the same cf
It was noticed that the same sstable files could be selected for
compaction if concurrent compaction happens on the same cf.
That's possible because compaction manager uses 2 tasks for
handling compactions.

Solution is to not duplicate cf in the compaction manager queue,
and re-schedule compaction for a cf if needed.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-08-24 11:11:47 -03:00
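The fix described above (no duplicate cf in the queue, re-schedule when needed) can be sketched in Python as follows. `CompactionQueue` and its methods are hypothetical and only model the bookkeeping, not the actual compaction manager.

```python
from collections import deque
from typing import Deque, Optional, Set


class CompactionQueue:
    """Hypothetical queue that never holds or runs the same cf twice at once."""

    def __init__(self) -> None:
        self._queue: Deque[str] = deque()
        self._queued: Set[str] = set()
        self._running: Set[str] = set()
        self._resubmit: Set[str] = set()

    def submit(self, cf: str) -> None:
        if cf in self._running:
            self._resubmit.add(cf)  # re-schedule once the current run ends
        elif cf not in self._queued:
            self._queued.add(cf)
            self._queue.append(cf)

    def take(self) -> Optional[str]:
        if not self._queue:
            return None
        cf = self._queue.popleft()
        self._queued.discard(cf)
        self._running.add(cf)
        return cf

    def done(self, cf: str) -> None:
        self._running.discard(cf)
        if cf in self._resubmit:
            self._resubmit.discard(cf)
            self.submit(cf)
```

With two tasks calling `take()`, the second one gets nothing for a cf that the first is still compacting, so the same sstable files cannot be selected twice.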
Avi Kivity
8a4648761c tests: make test cql environment use volatile system keyspace
Prevents hangs due to the database not being able to persist a memtable.

Tested-by: Asias He <asias@cloudius-systems.com>
2015-08-24 13:50:22 +03:00
Avi Kivity
6f11322220 db: move annoying log on non-durable cf to quieter place
Fixes #174.
2015-08-23 23:12:07 +03:00
Avi Kivity
c01bc16f58 db: don't give up flushing a memtable on error
We must try again, or the memtable's memory will never be reclaimed.
2015-08-19 19:36:41 +03:00
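The retry behaviour described above could be sketched like this in Python. `flush_until_success` is a hypothetical helper, and treating `OSError` as the retryable failure is an assumption made for the example.

```python
import time
from typing import Callable


def flush_until_success(flush: Callable[[], None],
                        delay_s: float = 0.0,
                        sleep: Callable[[float], None] = time.sleep) -> int:
    """Call flush() until it succeeds; return the number of attempts made."""
    attempts = 0
    while True:
        attempts += 1
        try:
            flush()
            return attempts
        except OSError:
            # Giving up would leave the memtable's memory unreclaimable, so retry.
            sleep(delay_s)
```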
Avi Kivity
6846909533 db: extract sstable flushing code to a function 2015-08-19 19:36:41 +03:00
Avi Kivity
5bf5476beb db: add collectd counter for dirty memory 2015-08-19 19:36:41 +03:00
Avi Kivity
c175025bb6 db: place all memtables into a single region_group
We can use this to track the amount of unevictable memory in the
system.
2015-08-19 19:36:41 +03:00
Avi Kivity
7b67b04822 db: wire up max memtable size configuration 2015-08-19 13:17:27 +03:00
Avi Kivity
176ab06f77 db: demote commitlog reordering detected log message to debug
It's less rare than we thought and also less interesting.
2015-08-19 09:26:23 +03:00
Raphael S. Carvalho
820ba6f4d2 adapt compaction manager for column family removal
We need a way to remove a column family from the compaction manager
because when dropping a column family we need to make sure that the
compaction manager doesn't hold a reference to it anymore.

So compaction manager queue is now of column_family, allowing us
to cancel requests pertaining to a column family being dropped.
There may be an ongoing compaction for the column family being
dropped, so we also need to wait for its termination.

Testcase for compaction manager was also adapted and improved.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-08-18 11:38:06 +03:00
Glauber Costa
89366dc2c2 sstables: do not accept files with missing TOC.
We can catch most errors when we try to load an sstable. But if the TOC file is
the one missing, we won't try to load the sstable at all. This case is still an
invalid case, but it is way easier for us to treat it by waiting for all files
to be loaded, and then checking if we saw a file during scan_dir, without its
corresponding TOC.

Fixes #114

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-16 15:21:40 +03:00
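The check described above, done after the whole directory has been scanned, can be sketched in Python. The function name is hypothetical, and the file names in the usage are assumed examples in which everything before the last dash identifies the sstable and the remainder names the component.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def sstables_missing_toc(filenames: Iterable[str]) -> List[str]:
    """Return sstable prefixes that have component files but no TOC."""
    components: Dict[str, Set[str]] = defaultdict(set)
    for name in filenames:
        # Assumption: "<prefix>-<Component>", e.g. "la-1-big-Data.db".
        prefix, _, component = name.rpartition("-")
        components[prefix].add(component)
    return sorted(p for p, c in components.items() if "TOC.txt" not in c)
```

For example, a directory holding `la-1-big-Data.db`, `la-1-big-TOC.txt` and `la-2-big-Data.db` would report `la-2-big` as invalid.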
Glauber Costa
0650579ace sstables: refuse to boot on corrupted sstables
We are now skipping them. That's dangerous.

Fixes #115

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-16 15:21:38 +03:00
Raphael S. Carvalho
9823164c89 db: introduce compaction manager
Currently, each column family creates a fiber to handle compaction requests
in parallel to the system. If there are N column families, N compactions
could be running in parallel, which is definitely horrible.

To solve that problem, a per-database compaction manager is introduced here.

Compaction manager is a feature used to service compaction requests from N
column families. Parallelism is made available by creating more than one
fiber to service the requests. That being said, N compaction requests will
be served by M fibers.

A compaction request being submitted will go to a job queue shared between
all fibers, and the fiber with the lowest amount of pending jobs will be
signalled.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-08-11 17:25:46 +03:00
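The "N requests served by M fibers" arrangement above can be approximated with Python's asyncio, using tasks in place of fibers. This is only a sketch: `run_compactions` is a hypothetical name, the compaction work is a placeholder, and the "signal the fiber with the fewest pending jobs" policy is not modelled; workers simply pull from the shared queue.

```python
import asyncio
from typing import List, Tuple


async def run_compactions(column_families: List[str],
                          fibers: int) -> List[Tuple[int, str]]:
    """Serve one request per column family using a fixed number of workers."""
    queue: "asyncio.Queue[str]" = asyncio.Queue()
    for cf in column_families:
        queue.put_nowait(cf)
    served: List[Tuple[int, str]] = []

    async def fiber(fiber_id: int) -> None:
        while True:
            try:
                cf = queue.get_nowait()
            except asyncio.QueueEmpty:
                return
            await asyncio.sleep(0)  # stand-in for the actual compaction work
            served.append((fiber_id, cf))

    await asyncio.gather(*(fiber(i) for i in range(fibers)))
    return served
```

However many column families submit requests, at most `fibers` compactions are in flight at once.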
Avi Kivity
1016b21089 cache: improve preloading of flushed memtable mutations
If a mutation definitely doesn't exist in all sstables, then we can
certainly load it into the cache.
2015-08-09 22:46:08 +03:00
Glauber Costa
c2a0232048 database: generate UUIDs compatible with Cassandra 2.1.8
Without this, Cassandra won't even try to read our sstables. The containing
directories will be ignored.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-07 08:31:56 -05:00
Glauber Costa
c8ca9b376d database: change default sstable version
Let's change the default generated tables to ka, which is the one that is present
in Origin

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-07 08:31:55 -05:00
Glauber Costa
2d1b965f91 database: change filename parser to also accept ka
A ka file has a slightly different name on disk. Change the
parser so we can deal with both

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-07 08:31:55 -05:00
Glauber Costa
cd8c9ad288 sstables: add ks and cf name to sstable constructor
When a schema is available, we use it. However, we have, by now, way too many
tests. Some of them use tables for which we don't even know the schema. It would
have been a massive amount of work to require a schema for all of them - so I am
keeping both constructors around.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-07 08:31:55 -05:00
Glauber Costa
77e06c3ab1 sstables: remove name parameter
It is currently only used to log a message, and for that we have an sstable
method that will do just fine. Using the name itself just makes it being passed
along throughout the captures.  Remove it.

Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
2015-08-07 08:31:53 -05:00
Raphael S. Carvalho
64fcd16c0c db: adding data to column family statistics for API
Adding required data for column family API to be implemented.

Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
2015-08-06 17:38:59 +03:00
Avi Kivity
48a1ce28fc Merge "Switch to log-structured allocator" from Tomasz 2015-08-06 15:45:39 +03:00
Tomasz Grabiec
926509525f row_cache: Switch to using LSA 2015-08-06 14:05:16 +02:00
Tomasz Grabiec
18ec9c3643 db: Move column_family::flush() to source file 2015-08-06 14:05:16 +02:00
Tomasz Grabiec
3b92ba2857 db: Add memtable flush logging 2015-08-06 14:05:16 +02:00
Pekka Enberg
dae1119796 database: Fix create keyspace ASan error
ASan does not like commit 05c23c7f73
("database: Add create_keyspace_on_all() helper"):

  ==8112==WARNING: AddressSanitizer failed to allocate 0x7f88b84fc690 bytes
  ==8112==AddressSanitizer's allocator is terminating the process instead of returning 0
  ==8112==If you don't like this behavior set allocator_may_return_null=1
  ==8112==Sanitizer CHECK failed: ../../../../libsanitizer/sanitizer_common/sanitizer_allocator.cc:147 ((0)) != (0) (0, 0)

I was not able to determine the source of the bug. Make ASan happy by
reverting the code movement and using the "cpu zero" trick we use for
table creation.

Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
2015-08-06 13:02:58 +03:00