"This patchset introduces leveled compaction to Scylla.
We don't handle all corner cases yet, but we already have the strategy
and compaction working as expected. Test cases were written, and I also
tested stability under a cassandra-stress load.
Leveled compaction may output more than one sstable because there is
a limit on sstable size (160 MB by default).
Handling of partial compaction is still something to be worked on.
Even so, it will not be a big problem. Why? Suppose a leveled
compaction generates 2 sstables, and Scylla is interrupted after
the first sstable is completely written but before the second one is.
The next boot will delete the second sstable, because it was partially
written, but will leave the first one alone, as it was completely
written. As a result, we will have two sstables with redundant data."
Current code calls make_directory, which fails if the directory already exists.
We didn't use this code path much before, but once we start creating CF directories
on CF creation - and not on SSTable creation - that will become our default method.
Use touch_directory instead.
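The difference can be sketched with plain POSIX calls (illustrative only; Scylla goes through Seastar's filesystem API, and the helper name here is made up): mkdir-style creation fails with EEXIST on the second call, while a touch-style helper treats an existing directory as success.

```cpp
#include <cerrno>
#include <string>
#include <sys/stat.h>
#include <sys/types.h>

// Idempotent directory creation: succeed whether or not the directory
// already exists, failing only on real errors (bad path, permissions...).
bool touch_directory_sketch(const std::string& path) {
    if (::mkdir(path.c_str(), 0755) == 0) {
        return true;             // created it just now
    }
    return errno == EEXIST;      // already there: that's fine too
}
```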
Signed-off-by: Glauber Costa <glommer@scylladb.com>
Adapt our compaction code to start writing a new sstable once the
one being written reaches its maximum size. The leveled strategy works
with that concept. If a strategy other than leveled is in use,
everything works as before.
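The size-capped writer described above can be modeled in isolation. This is a hypothetical standalone sketch, not Scylla's actual compaction writer; partition sizes stand in for real data:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical model of a size-capped compaction writer: seal the current
// sstable and start a new one whenever adding the next partition would
// push the output past the per-sstable cap (160 MB by default in leveled).
std::vector<std::vector<std::size_t>>
split_into_sstables(const std::vector<std::size_t>& partition_sizes,
                    std::size_t max_sstable_size) {
    std::vector<std::vector<std::size_t>> sstables;
    std::vector<std::size_t> current;
    std::size_t current_size = 0;
    for (std::size_t p : partition_sizes) {
        if (!current.empty() && current_size + p > max_sstable_size) {
            sstables.push_back(current);   // seal and start a new sstable
            current.clear();
            current_size = 0;
        }
        current.push_back(p);
        current_size += p;
    }
    if (!current.empty()) {
        sstables.push_back(current);       // seal the last one
    }
    return sstables;
}
```

With a non-leveled strategy there is effectively no cap, so everything lands in a single output, matching the "works as before" behavior.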
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Currently, we calculate truncated_at during truncate() independently on
each shard. It works if we're lucky, but it is fairly easy to trigger cases
in which each shard ends up with a slightly different time.
The main problem is that this time is used as the snapshot name when auto
snapshots are enabled. Prior to my last fixes, this would just generate two
separate directories in this case, which is wrong but not severe.
After the fix, however, it means that the shards will wait for one another to
synchronize, and this will hang the database.
Fix this by making sure that the truncation time is calculated before
invoke_on_all in all needed places.
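The ordering can be modeled without Seastar (invoke_on_all and the sharding machinery are elided; all names here are illustrative): the timestamp is taken once, and the same value is handed to every shard, instead of each shard reading the clock itself.

```cpp
#include <chrono>
#include <vector>

using db_clock_rep = std::chrono::milliseconds::rep;

// Fan a single, pre-computed truncation time out to every shard. The buggy
// pattern was the equivalent of calling now() inside the per-shard lambda.
std::vector<db_clock_rep> fan_out_truncation_time(db_clock_rep truncated_at,
                                                  unsigned shard_count) {
    return std::vector<db_clock_rep>(shard_count, truncated_at);
}

// Caller side: compute the time once, before the fan-out.
std::vector<db_clock_rep> truncate_sketch(unsigned shard_count) {
    auto truncated_at = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::system_clock::now().time_since_epoch()).count();
    return fan_out_truncation_time(truncated_at, shard_count);
}
```

Every shard then uses an identical snapshot name, so the synchronization point cannot deadlock on mismatched names.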
Signed-off-by: Glauber Costa <glommer@scylladb.com>
We are generating a general object ({}), whereas Cassandra 2.1.x generates an
array ([]). Let's do that as well to avoid surprising parsers.
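A hand-rolled sketch of the array form (illustrative only; the real manifest writer differs):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Serialize a snapshot's file list as a JSON array, the way Cassandra
// 2.1.x does, so an empty snapshot yields [] rather than {}.
std::string files_as_json_array(const std::vector<std::string>& files) {
    std::string out = "[";
    for (std::size_t i = 0; i < files.size(); i++) {
        if (i > 0) {
            out += ", ";
        }
        out += "\"" + files[i] + "\"";
    }
    out += "]";
    return out;
}
```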
Signed-off-by: Glauber Costa <glommer@scylladb.com>
We still need to write a manifest when there are no files in the snapshot.
But because we never reach the touch_directory part of the sstables
loop in that case, nobody would have created jsondir.
Since all the file handling is now done in the seal_snapshot phase, we should
just make sure the directory exists before initiating any other disk activity.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
We currently have an optimization that returns early when there are no tables
to be snapshotted.
However, because of the way we now write the manifest, this causes
any shard that does have tables to wait forever. So we should get
rid of it: all shards need to pass through the synchronization point.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
If we are hashing more than one CF, the snapshots themselves will all have the same name.
This causes the files from one of them to spill into the other when writing the manifest.
The proper hash key is the jsondir: that one is unique per manifest file.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
Currently, the snapshot code has all shards writing the manifest file. This is
wrong, because every write but the last is overwritten. This patch
fixes it by synchronizing all writes and leaving just one shard with the
task of closing the manifest.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
The way manifest creation is currently done is wrong: instead of a final
manifest containing all files from all shards, the current code writes a
manifest containing just the files from the shard that happens to be the
unlucky loser of the writing race.
In preparation to fix that, separate the manifest creation code from the rest.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
We do need to sync jsondir after we write the manifest file (previously done,
but with a question), and before we start writing it (not previously done), to
guarantee that the manifest file won't reference any file that is not visible yet.
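The directory sync itself looks like this in plain POSIX (Scylla uses Seastar's async I/O; this only models the ordering): call it once before writing the manifest, so every file the manifest will name is already durable in the directory, and once after, to make the manifest entry itself durable.

```cpp
#include <fcntl.h>
#include <string>
#include <unistd.h>

// fsync a directory so that recently created (or removed) entries in it
// survive a crash. Returns false on any failure.
bool sync_directory_sketch(const std::string& dir) {
    int fd = ::open(dir.c_str(), O_RDONLY | O_DIRECTORY);
    if (fd < 0) {
        return false;
    }
    bool ok = ::fsync(fd) == 0;
    ::close(fd);
    return ok;
}
```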
Signed-off-by: Glauber Costa <glommer@scylladb.com>
Remove the about to be dropped CF from the UUID lookup table before
truncating and stopping it. This closes a race window where new
operations based on the UUID might be initiated after truncate
completes.
Signed-off-by: Pekka Enberg <penberg@scylladb.com>
From Pekka:
This patch series implements support for CQL DROP TABLE. It uses the newly
added truncate infrastructure under the hood. After this series, the
test_table CQL test in dtest passes:
[penberg@nero urchin-dtest]$ nosetests -v cql_tests.py:TestCQL.table_test
table_test (cql_tests.TestCQL) ... ok
----------------------------------------------------------------------
Ran 1 test in 23.841s
OK
For drop_column_family(), we want to first remove the column_family from
lookup tables and truncate after that to avoid races. Introduce a
truncate() variant that takes keyspace and column_family references.
Signed-off-by: Pekka Enberg <penberg@scylladb.com>
Currently, we control incremental backup behavior from the storage service.
This creates some very concrete problems, since the storage service is not
always available and initialized.
The solution is to move it to the column family (and to the keyspace, so we can
properly propagate the config file value). When we change this from the API, we
will have to iterate over all of them, changing the value accordingly.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
This patch contains the following changes: in the definition of the read
and write latency histograms, it removes the mask value, so that the
default value will be used.
To support gathering the read latency histogram, the query method
cannot be const, as it modifies the histogram statistics.
The read statistic is sample based and should have no real impact on
performance; if there is an impact, we can always change it in the
future to a lower sampling rate.
Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
Only tables that arise from flushes are backed up; compacted tables are not.
Therefore, the place for the backup to happen is right after our flush.
Note that due to our sharded architecture, it is possible that, in the face of a
value change, some shards will back up sstables while others won't.
This is, in theory, possible to mitigate through a rwlock. However, it
doesn't differ from the situation where all tables come from a single
shard and the toggle happens in the middle of them.
The code as is guarantees that we'll never partially back up a single sstable,
and that is enough of a guarantee.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
When we convert exceptions into CQL server errors, type information is
not preserved. Therefore, improve exception error messages to make
debugging dtest failures, for example, slightly easier.
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>
When we write an SSTable, all its components are already in memory. load() is
too big a hammer.
We still want to keep the write operation separate from the preparation for
reading, but in the case of a newly written SSTable, all we need to do is open
the index and data files.
Fixes #300
Signed-off-by: Glauber Costa <glommer@scylladb.com>
WARN level is for messages which should draw the log reader's attention;
journalctl highlights them, for example. Populating a keyspace is a
fairly normal thing, so it should be logged at a lower level.
The race condition happens when two or more shards try to delete
the same partial sstable, so the problem doesn't affect Scylla
when it boots with a single shard.
To fix this, shard 0 is made responsible for
deleting partial sstables.
Fixes #359.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
If an sstable is irrelevant for a shard, delete it. The deletion will
only complete when all shards agree (either ignore the sstable or
delete it after compaction).
In the event of a compaction failure, run_compaction would be called
more than once for a request, which could result in an
underflow of the pending_compactions stat.
Fix that by only decreasing it if compaction succeeded.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
When populating a column family, we will now delete all components
of an sstable that has a temporary TOC file. An sstable with a temporary
TOC file was partially written, and can be safely
deleted because the respective data is either saved in the commit
log or, in case the partial sstable is the result of a compaction, in
the sstables that were being compacted.
The deletion procedure is guarded against power failure by only deleting
the temporary TOC file after all other components have been deleted.
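The crash-safe ordering can be sketched as follows (component and marker names are illustrative): every other component goes first, and the temporary TOC goes last, so a power failure mid-deletion still leaves the marker that tells the next boot this sstable is partial.

```cpp
#include <string>
#include <vector>

// Compute the order in which an interrupted sstable's components should
// be unlinked: everything else first, the temporary TOC marker last.
std::vector<std::string> deletion_order(const std::vector<std::string>& components,
                                        const std::string& temp_toc_name) {
    std::vector<std::string> order;
    bool has_temp_toc = false;
    for (const auto& c : components) {
        if (c == temp_toc_name) {
            has_temp_toc = true;     // defer the marker to the very end
        } else {
            order.push_back(c);      // Data, Index, Summary, ... go first
        }
    }
    if (has_temp_toc) {
        order.push_back(temp_toc_name);
    }
    return order;
}
```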
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
When populating a CF, we should also check for sstables with a
temporary TOC file, and act accordingly. For the time being,
we only refuse to boot. Subsequent work is to gather all
files of an sstable with a temporary TOC file and delete them.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
This patch adds a getter for the dirty_memory_region_group in the
database object and an occupancy method to column_family that
returns the total occupancy of all memtables in the column family.
Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
Fixes #309.
When scanning, memtable readers detect that the memtable was flushed, which
means it started to be moved to the cache, and fall back to reading from
the memtable's sstable.
Eventually, what we should do is combine memtable and cache contents
so that, as long as data is not evicted, we won't do I/O. We do not
support scanning in the cache yet, though, so there is no point in doing
this now, and it is not trivial.
Deleting sstables is tricky, since they can be shared across shards.
This patchset introduces an sstable deletion agreement table, which records
the agreement of shards to delete an sstable. Sstables are only deleted
after all shards have agreed.
With this, we can change the core count across boots.
Fixes #53.
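The agreement rule can be sketched in memory (purely illustrative; names and types are made up, and the real mechanism must survive across boots): each shard registers its verdict per sstable generation, and deletion becomes legal only once every shard has weighed in.

```cpp
#include <map>
#include <set>

// Per-sstable deletion agreement: each shard that either ignores the
// sstable or finishes compacting it calls agree(); the return value says
// whether the sstable may now actually be unlinked.
struct deletion_agreements {
    unsigned shard_count;
    std::map<long, std::set<unsigned>> agreed;  // generation -> shards agreed

    bool agree(long generation, unsigned shard) {
        agreed[generation].insert(shard);
        return agreed[generation].size() == shard_count;
    }
};
```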
All database code was converted to use distributed<storage_proxy> when
storage_proxy was made distributed, but then new code was written to use
storage_proxy& again. Passing the distributed<> object is safer, since it
can be passed between shards safely. There was a patch to fix one such case
yesterday; I found one more while converting.
"Refs #293
* Add a commitlog::sync_all_segments that explicitly forces out all pending
disk writes.
* Only delete segments from disk iff they are marked clean. Thus on a partial
shutdown or whatnot, even if the CL is destroyed (destructor runs), disk files
not yet clean vis-à-vis sstables are preserved and replayable.
* Do a sync_all_segments first of all in database::stop.
Exactly what not to stop in main I leave up to others' discretion, or at least
another patch."
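The clean-segment rule from the second bullet can be modeled standalone (the structure is illustrative, not the commitlog's real types): a segment is deletable only when every mutation in it has made it into an sstable.

```cpp
#include <cstddef>
#include <vector>

// A commitlog segment is "clean" once all data it holds has been flushed
// to sstables; only clean segments may be removed from disk, so dirty
// ones survive a partial shutdown and stay replayable.
struct segment_sketch {
    bool clean;
};

std::size_t count_deletable(const std::vector<segment_sketch>& segments) {
    std::size_t n = 0;
    for (const auto& s : segments) {
        if (s.clean) {
            n++;
        }
    }
    return n;
}
```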