"The snapshots API need to expose GET methods so people can
query information on them. Now that taking snapshots is supported,
this relatively simple series implement get_snapshot_details, a
column family method, and wire that up through the storage_service."
"This patchset implements load_new_sstables, allowing one to move tables inside the
data directory of a CF, and then call "nodetool refresh" to start using them.
Keep in mind that for Cassandra, this is deemed an unsafe operation:
https://issues.apache.org/jira/browse/CASSANDRA-6245
It is still for us something we should not recommend - unless the CF is totally
empty and not yet used, but we can do a much better job in the safety front.
To guarantee that, the process works in four steps:
1) All writes to this specific column family are disabled. This is a horrible thing to
do, because dirty memory can grow much more than desired during this window. Throughout
this implementation, we will try to keep the time during which the writes are disabled
to its bare minimum.
While disabling the writes, each shard will tell us about the highest generation number
it has seen.
2) We will scan for all tables that we haven't seen before. Those are any tables found in the
CF datadir whose generation number is higher than the highest one seen so far. We will link
them to new generation numbers that are sequential to the ones we have so far, and end up
with a new generation number that is passed to the next step.
3) The generation number computed in the previous step is now propagated to all CFs, which
guarantees that all further writes will pick generation numbers that won't conflict with
the existing tables. Right after doing that, the writes are resumed.
4) The tables we found in step 2 are passed on to each of the CFs. They can now load those
tables while operations to the CF proceed normally."
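A rough sketch of the four-step coordination described above, in plain single-threaded C++ rather than the actual sharded seastar code; all names here (shard_cf, scan_and_renumber, and so on) are illustrative, not the real API:

#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative stand-in for the per-shard column family state.
struct shard_cf {
    int64_t highest_generation = 0;  // highest sstable generation this shard has seen
    void disable_writes() {}         // step 1: stop accepting writes
    void enable_writes() {}          // step 3: resume writes
    void set_generation_floor(int64_t g) {
        highest_generation = std::max(highest_generation, g);
    }
    void load_sstables(const std::vector<int64_t>&) {}  // step 4: add to the sstable map
};

// Hypothetical scan (step 2): in the real code this walks the CF data
// directory and renames files; here it is a stub that finds nothing new.
static int64_t scan_and_renumber(int64_t floor, std::vector<int64_t>& new_gens) {
    new_gens.clear();
    return floor;
}

int64_t load_new_sstables(std::vector<shard_cf>& shards) {
    // Step 1: disable writes everywhere and learn the highest generation seen.
    int64_t floor = 0;
    for (auto& s : shards) {
        s.disable_writes();
        floor = std::max(floor, s.highest_generation);
    }
    // Step 2: pick up tables newer than `floor` and renumber them sequentially.
    std::vector<int64_t> new_gens;
    int64_t highest = scan_and_renumber(floor, new_gens);
    // Step 3: propagate the new floor so future writes cannot collide, then resume writes.
    for (auto& s : shards) {
        s.set_generation_floor(highest);
        s.enable_writes();
    }
    // Step 4: hand the new tables to every shard; normal CF operation continues meanwhile.
    for (auto& s : shards) {
        s.load_sstables(new_gens);
    }
    return highest;
}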
CF-level code to load new SSTables. There isn't really a lot of complication
here. We don't even need to repopulate the entire SSTable directory: by
requiring that the external service that is coordinating this tell us explicitly
about the new SSTables found in the scan process, we can just load them
specifically and add them to the SSTable map.
All new tables will start their lives as shared tables, and will be unshared
if it is possible to do so: this all happens inside add_sstable and there isn't
really anything special on this front.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
Before loading new SSTables into the node, we need to make sure that their
generation numbers are sequential (at least if we want to follow Cassandra's
footsteps here).
Note that this is unsafe by design. More information can be found at:
https://issues.apache.org/jira/browse/CASSANDRA-6245
However, we can already do slightly better in two ways:
Unlike Cassandra, this method takes a generation number as a parameter. We
will not touch tables whose generation comes before that number at all. That number
must be calculated from all shards as the highest generation number they have seen themselves.
Calling load_new_sstables in the absence of new tables will therefore do nothing,
and will be completely safe.
It will also return the highest generation number found after the reshuffling
process. New writers should start writing after that. Therefore, new tables
that are created will have a generation number higher than any of these, and
will be safe.
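A minimal sketch of the renumbering contract described above, operating on bare generation numbers only; the real code also renames the files on disk, and every name below is illustrative:

#include <algorithm>
#include <cstdint>
#include <map>
#include <vector>

// Map each newly found generation (anything above `floor`) to a fresh
// sequential generation, and return the highest generation handed out.
// Generations at or below `floor` are left untouched, so calling this
// with no new tables is a no-op.
int64_t renumber_new_generations(int64_t floor,
                                 std::vector<int64_t> found,
                                 std::map<int64_t, int64_t>& renames) {
    std::sort(found.begin(), found.end());
    int64_t next = floor;
    for (int64_t gen : found) {
        if (gen <= floor) {
            continue;               // already known, do not touch it
        }
        renames[gen] = ++next;      // assign the next sequential generation
    }
    return next;                    // new writers must start above this
}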
Signed-off-by: Glauber Costa <glommer@scylladb.com>
During certain operations we need to stop writing SSTables. This is needed when
we want to load new SSTables into the system. They will have to be scanned by all
shards, agreed upon, and in most cases even renamed. Letting SSTables be written
at that point makes the process inherently racy - especially with the rename.
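A toy illustration of the stop/resume idea, using a plain mutex and condition variable instead of the real per-shard machinery; the gate name and its interface are assumptions made for the example:

#include <condition_variable>
#include <mutex>

// Toy gate: writers call wait_until_enabled() before sealing an sstable,
// and the loading path brackets its scan/rename phase with disable()/enable().
class sstable_write_gate {
    std::mutex _m;
    std::condition_variable _cv;
    bool _enabled = true;
public:
    void disable() {
        std::lock_guard<std::mutex> g(_m);
        _enabled = false;
    }
    void enable() {
        {
            std::lock_guard<std::mutex> g(_m);
            _enabled = true;
        }
        _cv.notify_all();
    }
    void wait_until_enabled() {
        std::unique_lock<std::mutex> l(_m);
        _cv.wait(l, [this] { return _enabled; });
    }
};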
Signed-off-by: Glauber Costa <glommer@scylladb.com>
From Pawel:
This series enables the row cache to serve range queries. In order to achieve
that, the row cache needs to know whether there are other partitions in
the specified range that are not cached and need to be read from the sstables.
That information is provided by key_readers, which work very similarly to
mutation_readers, but return only the decorated keys of the partitions in
the range. In the case of sstables, the key_reader is implemented on top of
the partition index.
An approach like this has the disadvantage of needing to access the disk
even if all partitions in the range are cached. There are (at least) two
ways of dealing with that problem:
- cache the partition index - that will also help in all other places where it
is needed
- add a flag to cache_entry which, when set, indicates that the immediate
successor of the partition is also in the cache. Such a flag would be set
by the mutation reader and cleared during eviction. It would also allow
newly created mutations from the memtable to be moved to the cache, provided
that both their successors and predecessors are already there.
The key_reader part of this patchset adds a lot of new code that probably
won't be used in any other place, but the alternative would be to always
interleave reads from the cache with reads from the sstables, and that would
be heavier on the partition index, which isn't cached.
Fixes #185.
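For illustration only, a sketch of what a key-only reader could look like, kept deliberately simple; the names below (decorated_key, key_reader, next_key) are illustrative and not the actual interfaces from the series:

#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Illustrative stand-in for a decorated key (token + partition key).
struct decorated_key {
    int64_t token;
    std::string key;
};

// A key-only reader: like a mutation reader, but yields just the keys of
// the partitions in the requested range, in token order. For sstables this
// can be served from the partition index without touching the data file.
class key_reader {
    std::vector<decorated_key> _keys;   // toy backing store
    std::size_t _pos = 0;
public:
    explicit key_reader(std::vector<decorated_key> keys) : _keys(std::move(keys)) {}
    // Returns the next key in range, or nullopt at end-of-range.
    std::optional<decorated_key> next_key() {
        if (_pos == _keys.size()) {
            return std::nullopt;
        }
        return _keys[_pos++];
    }
};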
For each of the snapshots available, the api may query for some information:
the total size on disk, and the "real" size. As far as I could understand, the
real size is the size used by the SSTables themselves, while the total
size also includes the metadata about the snapshot - like the manifest.json
file.
Details follow:
In the original Cassandra code, total size is:
long sizeOnDisk = FileUtils.folderSize(snapshot);
folderSize recurses on directories, and adds file.length() on files. Again, my
understanding is that our file_size() would give us the same as Java's length()
method.
The other value, real (or true) size is:
long trueSize = getTrueAllocatedSizeIn(snapshot);
getTrueAllocatedSizeIn seems to be a tree walker, whose visitor is an instance
of TrueFilesSizeVisitor. What that visitor does is add up the size of the files
within the tree that are "acceptable".
An acceptable file is a file which:
starts with the same prefix as we want (IOW, belongs to the same SSTable; we
will just test that directly), and is not "alive". The alive list is just the
list of all SSTables in the system that are used by the CFs.
What this tries to do is make sure that the trueSnapshotSize is just the
extra space on disk used by the snapshot. Since the snapshot files are links,
if a table goes away, it adds to this size; if it would be there anyway, it
does not.
We can do that in a much simpler fashion: for each file, we will just look at
the original CF directory, and see if we can find the file there. If we can't,
then it counts towards the trueSize. Even for files that are deleted after
compaction, that "eventually" works, and it simplifies the code tremendously
given that we neither have to list all files in the system - as Cassandra
does - nor check other shards for liveness information - as we would otherwise
have to do.
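A simplified, synchronous sketch of that computation using std::filesystem; the real code is asynchronous and shard-aware, and the flat directory layout assumed here matches the single-data-directory case discussed below:

#include <cstdint>
#include <filesystem>

namespace fs = std::filesystem;

// Total size: every byte under the snapshot directory, manifest included.
// "True" size: only the bytes whose files no longer exist in the live CF
// directory, i.e. the extra space the snapshot's hard links are keeping alive.
struct snapshot_sizes {
    uint64_t total = 0;
    uint64_t real = 0;
};

snapshot_sizes compute_snapshot_sizes(const fs::path& snapshot_dir,
                                      const fs::path& cf_dir) {
    snapshot_sizes sizes;
    for (const auto& ent : fs::recursive_directory_iterator(snapshot_dir)) {
        if (!ent.is_regular_file()) {
            continue;
        }
        uint64_t sz = ent.file_size();
        sizes.total += sz;
        // If the live CF directory no longer has this file, the snapshot
        // is the only thing keeping the data around: count it as "real".
        if (!fs::exists(cf_dir / ent.path().filename())) {
            sizes.real += sz;
        }
    }
    return sizes;
}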
The scheme I am proposing may need some tweaks when we support multiple data
directories, as the SSTables may not be directly below the snapshot level.
Still, it would be trivial to inform the CF about their possible locations.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
"Fixes: #469
We occasionally generate memtables that are not empty, yet have no
high replay_position set. (Typical case is CL replay, but apparently
there are others).
Moreover, we can do this repeatedly, and thus get caught in the flush
queue ordering restrictions.
Solve this by treating a flush without replay_position as a flush at the
highest running position, i.e. "last" in queue. Note that this will not
affect the actual flush operation, nor CL callbacks, only anyone waiting
for the operation(s) to complete.
To do this, the flush_queue had its restrictions eased, and some introspection
methods added."
We go to the filesystem to check if the snapshot exists. This should make us
robust against deletions of existing snapshots from the filesystem.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
This allows us to delete an existing snapshot. It works at the column
family level, and removing it from the list of keyspace snapshots needs to
happen only when all CFs are processed. Therefore, that is provided as a
separate operation.
The filesystem code is a bit ugly: it can be made better by making our file
lister more generic. The first step would be to call it a walker, not a lister...
For now, we'll use the fact that there are mostly two levels in the snapshot
hierarchy to our advantage, and avoid a full recursion. Using the same lambda
for all calls would require us to provide a separate class to handle the state;
that's part of making this generic.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
There are situations in which we would like to match more than one directory
entry type. One example of that would be a recursive delete operation: we need
to delete the files inside directories and the directories themselves, but we
still don't want a "delete all", since finding anything other than a directory
or a file is an error, and we should treat it as such.
Since there aren't that many types, it should be ok performance-wise to just
use a list. I am using an unordered_set here just because it is easy enough,
but we could actually relax it later if needed. In any case, users of the
interface should not worry about that, and that decision is abstracted away
into lister::dir_entry_types.
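A sketch of the idea, assuming a small entry-type enum; the exact names are illustrative and only the shape of lister::dir_entry_types matters:

#include <unordered_set>

// Kinds of directory entries the lister can be asked to visit. Anything
// else found during the walk is treated as an error.
enum class directory_entry_type { regular, directory };

// The caller passes the set of types it wants; e.g. a recursive delete
// wants both files and directories, but still not sockets or devices.
using dir_entry_types = std::unordered_set<directory_entry_type>;

inline bool wanted(const dir_entry_types& types, directory_entry_type t) {
    return types.count(t) != 0;
}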
Signed-off-by: Glauber Costa <glommer@scylladb.com>
"Those are fixes needed for the snapshotting process itself. I have bundled this
in the create_snapshot series before to avoid a rebase, but since I will have to
rewrite that to get rid of the snapshot manager (and go to the filesystem),
I am sending those out on their own."
"This patchset introduces leveled compaction to Scylla.
We don't handle all corner cases yet, but we already have the strategy
and compaction working as expected. Test cases were written and I also
tested the stability with a load of cassandra-stress.
Leveled compaction may output more than one sstable because there is
a limit on the size of sstables. 160M by default.
Related to handling of partial compaction, it's still something to be
worked on.
Anyway, it will not be a big problem. Why? Suppose that a leveled
compaction will generate 2 sstables, and scylla is interrupted after
the first sstable is completely written but before the second one is
completely written. The next boot will delete the second sstable,
because it was partially written, but will not do anything with the
first one as it was completely written.
As a result, we will have two sstables with redundant data."
With the distribute-and-sync method we are using, if an exception happens in
the snapshot creation for any reason (think file permissions, etc.), it will
just hang the server, since our shard won't do the necessary work to
synchronize and note that we have done our part (or tried to) in snapshot creation.
Make the then clause a finally, so that the sync part is always executed.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
create_links will fail in one of the shards if one of the SSTables happens to be
shared. It should be fine if the link already exists, so let's just ignore that case.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
The current code calls make_directory, which will fail if the directory already exists.
We didn't use this code path much before, but once we start creating CF directories
on CF creation - and not on SSTable creation - this will become our default code path.
Use touch_directory instead.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
Adapt our compaction code to start writing a new sstable once the
one being written has reached its maximum size. The leveled strategy works
with that concept. If a strategy other than leveled is being used,
everything will work as before.
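A simplified sketch of a size-capped writer loop; the types and the row-granularity accounting are illustrative, not the actual compaction code:

#include <cstdint>
#include <string>
#include <vector>

// Toy output "sstable": just a list of rows plus a running byte count.
struct sstable_out {
    std::vector<std::string> rows;
    uint64_t bytes = 0;
};

// Write rows, starting a new output sstable whenever the current one would
// exceed max_size. Non-leveled strategies can pass a very large max_size and
// behave exactly as before, producing a single output.
std::vector<sstable_out> write_compacted(const std::vector<std::string>& rows,
                                         uint64_t max_size) {
    std::vector<sstable_out> outputs;
    outputs.emplace_back();
    for (const auto& row : rows) {
        if (outputs.back().bytes + row.size() > max_size && outputs.back().bytes > 0) {
            outputs.emplace_back();  // current sstable reached its cap: seal it, open a new one
        }
        outputs.back().rows.push_back(row);
        outputs.back().bytes += row.size();
    }
    return outputs;
}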
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
Currently, we are calculating truncated_at during truncate() independently for
each shard. It will work if we're lucky, but it is fairly easy to trigger cases
in which each shard will end up with a slightly different time.
The main problem here is that this time is used as the snapshot name when auto
snapshots are enabled. Prior to my last fixes, this would just generate two
separate directories in this case, which is wrong but not severe.
But after the fix, it means that the shards will wait for one another to
synchronize, and this will hang the database.
Fix this by making sure that the truncation time is calculated before
invoke_on_all in all needed places.
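The gist of the fix, sketched without the actual seastar invoke_on_all machinery; the only point is that the timestamp is taken once, before fanning out, so every shard sees the same value:

#include <chrono>
#include <cstdint>
#include <vector>

struct shard_state {
    int64_t truncated_at = 0;     // per-shard copy of the truncation timestamp
};

void truncate_all(std::vector<shard_state>& shards) {
    // Take the timestamp exactly once, before fanning out to the shards...
    const int64_t truncated_at = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::system_clock::now().time_since_epoch()).count();
    // ...so every shard agrees on it, and the auto-snapshot name derived
    // from it is identical everywhere.
    for (auto& s : shards) {
        s.truncated_at = truncated_at;
    }
}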
Signed-off-by: Glauber Costa <glommer@scylladb.com>
We are generating a general object ({}), whereas Cassandra 2.1.x generates an
array ([]). Let's do that as well to avoid surprising parsers.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
We still need to write a manifest when there are no files in the snapshot.
But because we never reach the touch_directory part of the sstables
loop in that case, nobody would have created the jsondir.
Since all the file handling is now done in the seal_snapshot phase, we should
just make sure the directory exists before initiating any other disk activity.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
We currently have one optimization that returns early when there are no tables
to be snapshotted.
However, because of the way we are writing the manifest now, this will cause
the shard that happens to have tables to wait forever. So we should get rid
of the optimization. All shards need to pass through the synchronization point.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
If we are hashing more than one CF, the snapshots themselves will all have the same name.
This will cause the files from one of them to spill into the other when writing the manifest.
The proper hash key is the jsondir: that one is unique per manifest file.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
This patch adds estimated histogram support for the read and write latencies,
and adds an estimated histogram for the number of sstables that were used in
a read.
Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
Currently, the snapshot code has all shards writing the manifest file. This is
wrong, because all writes prior to the last one will be overwritten. This patch
fixes it by synchronizing all writes and leaving just one of the shards with the
task of closing the manifest.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
The way manifest creation is currently done is wrong: instead of a final
manifest containing all files from all shards, the current code writes a
manifest containing just the files from the shard that happens to be the
unlucky loser of the writing race.
In preparation for fixing that, separate the manifest creation code from the rest.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
We do need to sync jsondir after we write the manifest file (previously done,
but with a question), and before we start writing it (not previously done), to
guarantee that the manifest file won't reference any file that is not visible yet.
Signed-off-by: Glauber Costa <glommer@scylladb.com>
Remove the about to be dropped CF from the UUID lookup table before
truncating and stopping it. This closes a race window where new
operations based on the UUID might be initiated after truncate
completes.
Signed-off-by: Pekka Enberg <penberg@scylladb.com>
From Pekka:
This patch series implements support for CQL DROP TABLE. It uses the newly
added truncate infrastructure under the hood. After this series, the
table_test CQL test in dtest passes:
[penberg@nero urchin-dtest]$ nosetests -v cql_tests.py:TestCQL.table_test
table_test (cql_tests.TestCQL) ... ok
----------------------------------------------------------------------
Ran 1 test in 23.841s
OK
For drop_column_family(), we want to first remove the column_family from
lookup tables and truncate after that to avoid races. Introduce a
truncate() variant that takes keyspace and column_family references.
Signed-off-by: Pekka Enberg <penberg@scylladb.com>
Currently, we control incremental backups behavior from the storage service.
This creates some very concrete problems, since the storage service is not
always available and initialized.
The solution is to move it to the column family (and to the keyspace so we can
properly propagate the conf file value). When we change this from the api, we will
have to iterate over all of them, changing the value accordingly.
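A sketch of the per-CF toggle and the fan-out the API path would have to do; the structure and function names are illustrative:

#include <vector>

// Illustrative column family carrying its own incremental-backups switch,
// seeded from the keyspace (which in turn reflects the config file value).
struct column_family {
    bool incremental_backups = false;
};

// When the API flips the setting, it has to visit every column family,
// since there is no longer a single storage_service-level flag to consult.
void set_incremental_backups(std::vector<column_family*>& cfs, bool enabled) {
    for (auto* cf : cfs) {
        cf->incremental_backups = enabled;
    }
}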
Signed-off-by: Glauber Costa <glommer@scylladb.com>
This patch contains the following changes: in the definition of the read
and write latency histograms, it removes the mask value, so that the
default value will be used.
To support gathering the read latency histogram, the query method
cannot be const, as it modifies the histogram statistics.
The read statistic is sample-based and should have no real impact on
performance; if there is an impact, we can always change it in the
future to a lower sampling rate.
Signed-off-by: Amnon Heiman <amnon@cloudius-systems.com>
Only tables that arise from flushes are backed up. Compacted tables are not.
Therefore, the place for that to happen is right after our flush.
Note that due to our sharded architecture, it is possible that in the face of a
value change some shards will back up sstables while others won't.
This is, in theory, possible to mitigate with a rwlock. However, it
doesn't differ from the situation where all tables are coming from a single
shard and the toggle happens in the middle of them.
The code as-is guarantees that we'll never partially back up a single sstable,
and that is enough of a guarantee.
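A sketch of the hook point, using std::filesystem hard links and an assumed backups/ subdirectory next to the sstable; names are illustrative:

#include <filesystem>

namespace fs = std::filesystem;

// Called right after a memtable flush has fully sealed a new sstable file.
// The flag is read once per sstable, so a toggle can never leave a single
// sstable half backed up - at worst different sstables see different values.
void maybe_backup_after_flush(const fs::path& sstable_file, bool incremental_backups) {
    if (!incremental_backups) {
        return;
    }
    fs::path backups_dir = sstable_file.parent_path() / "backups";
    fs::create_directories(backups_dir);
    // A hard link is enough: the sstable is immutable once sealed.
    fs::create_hard_link(sstable_file, backups_dir / sstable_file.filename());
}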
Signed-off-by: Glauber Costa <glommer@scylladb.com>
When we convert exceptions into CQL server errors, type information is
not preserved. Therefore, improve exception error messages to make
debugging dtest failures, for example, slightly easier.
Signed-off-by: Pekka Enberg <penberg@cloudius-systems.com>