scylladb

Author	SHA1	Message	Date
Vlad Zolotarov	756de38a9d	database: actually check that a snapshot directory exists Actually check that a snapshot directory with a given tag exists instead of just checking that a 'snapshot' directory exists. Fixes issue #689 Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>	2015-12-29 12:59:00 +01:00
Avi Kivity	41bd266ddd	db: provide more information on "Unrecognized error" while loading sstables This information can be used to understand the root cause of the failure. Refs #692.	2015-12-29 10:23:32 +02:00
Pekka Enberg	eeadf601e6	Merge "cleanups and improvements" from Raphael	2015-12-18 13:45:11 +02:00
Pekka Enberg	e56bf8933f	Improve not implemented errors Print out the function name where we're throwing the exception from to make it easier to debug such exceptions.	2015-12-18 10:51:37 +01:00
Raphael S. Carvalho	41be378ff1	db: fix build of sstable list in column_family::compact_sstables The last two loops were incorrectly inside the first one. That's a bug because a new sstable may be emplaced more than once in the sstable list, which can cause several problems. mark_for_deletion may also be called more than once for compacted sstables, however, it is idempotent. Found this issue while auditing the code. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2015-12-16 17:46:17 +02:00
Raphael S. Carvalho	6142efaedb	db: fix indentation Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2015-12-14 12:43:34 -02:00
Raphael S. Carvalho	7bbc1b49b6	db: add missing sstable::mark_for_deletion call If a sstable doesn't belong to current shard, mark_for_deletion should be called for the deletion manager to still work. It doesn't mean that the sstable will be deleted, but that the sstable is not relevant to the current shard, thus it can be deleted by the deletion manager in the future. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2015-12-14 12:42:26 -02:00
Amnon Heiman	2086c651ba	column_family: get_snapshot_details should return empty map for no snapshots If there is no snapshot directory for the specific column family, get_snapshot_details should return an empty map. This patch checks that a directory exists before trying to iterate over it. Fixes #619 Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2015-12-07 12:51:04 +01:00
Tomasz Grabiec	bc23ebcbc3	schema_tables: Replace schema_result::value_type with equivalent movable type future<> requires and will assert nothrow move constructible types.	2015-12-07 09:50:27 +01:00
Amnon Heiman	7e79d35f85	Estimated histogram: Clean the add interface The add interface of the estimated histogram is confusing as it is not clear what units are used. This patch removes the general add method and replaces it with an add_nano that adds nanoseconds or an add that takes a duration. To be compatible with origin, nanosecond values are translated to microseconds.	2015-12-01 15:28:06 +02:00
Asias He	aa2b11f21b	database: Move is_replacing and get_replace_address to database class So they can be used outside storage_service.	2015-11-30 09:15:42 +08:00
Tomasz Grabiec	a7c11d1e30	db: Fix handling of missing column family The FIXMEs are no longer valid, we load schema on bootstrap and don't support hot-plugging of column families via file system (nor does Cassandra). Handling of missing tables matches Cassandra 2.1, applies log it and continue, queries propagate the error.	2015-11-25 16:59:15 +02:00
Raphael S. Carvalho	0f3ccc1143	db: optimize the sstable loading process Currently, we only determine if a sstable belongs to current shard after loading some of its components into memory. For example, filter may be considerably big and its content is irrelevant to decide if a sstable should be included to a given shard. Start using the functions previously introduced to optimize the sstable loading process. add_sstable no longer checks if a sstable is relevant to the current shard. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2015-11-19 13:34:25 -02:00
Raphael S. Carvalho	0ce2b7bc8d	db: introduce belongs_to_current_shard Returns true if key range belongs to current shard. False otherwise. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2015-11-19 13:34:21 -02:00
Raphael S. Carvalho	966e8c7144	db: introduce parallelism to sstable loading Boot may be slow because the function that loads sstables does so serially instead of in parallel. In the callback supplied to lister::scan_dir, let's push the future returned by probe_file (function that loads sstable) into a vector of futures and wait for all of them at the end. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2015-11-19 13:34:11 -02:00
Glauber Costa	fa1ae45218	database: export collectd metrics about the state of memtable flushing When analyzing a recent performance issue, I found it helpful to keep track of the number of memtables that are currently in flight, as well as how much memory they are consuming in the system. Although those are memtable statistics, I am grouping them under the "cf_stats" structure: being the column family a central piece of the puzzle, it is reasonable to assume that a lot of metrics about it would be potentially welcome in the future. Note that we don't want to reuse the "stats" structure in the column family: for one, the fields do not always map precisely (pending flushes, for instance, only tracks explicit flushes), and also the stats structure is a lot more complex than we need. Signed-off-by: Glauber Costa <glommer@scylladb.com>	2015-11-12 20:17:22 +02:00
Calle Wilund	284b10cabe	Make partition_slice::row_ranges multiplex on partition Allows for having more than one clustering row range set, depending on PK queried (although right now limited to one - which happens to be exactly the number of multiplexing paging needs... What a coincidence...). Encapsulates the row_ranges member in a query function, and if needed holds ranges outside the default one in an extra object. Query result::builder::add_partition now fetches the correct row range for the partition, and this is the range used in subsequent iteration.	2015-11-10 13:12:33 +01:00
Gleb Natapov	d77a2a0f03	do not try to write same memtable to sstable twice if moving it to a cache failed. Error handling in column_family::try_flush_memtable_to_sstable() is misplaced. It happens after update_cache(), so writing sstable may have succeeded, but moving memtable into the cache may have failed. update_cache() destroys memtable even if it fails, but error handler is not aware of it (it does not even distinguish whether error happened during sstable creation or moving into cache) and when it tells caller to retry it retries with already destroyed memtable. Fix it by ignoring moving to cache errors.	2015-11-09 11:27:37 +01:00
Avi Kivity	cb93af2ad7	Revert "do not try to write same memtable to sstable twice if moving it to a cache failed." This reverts commit `fff37d15cd`. Says Tomek (and the comment in the code): "update_cache() must be called before unlinking the memtable because cache + memtable at any time is supposed to be authoritative source of data for contained partitions. If there is a cache hit in cache, sstables won't be checked. If we unlink the memtable before cache is updated, it's possible that a query will miss data which was in that unlinked memtable, if it hits in the cache (with an old value)."	2015-11-09 11:22:12 +02:00
Gleb Natapov	fff37d15cd	do not try to write same memtable to sstable twice if moving it to a cache failed. Error handling in column_family::try_flush_memtable_to_sstable() is misplaced. It happens after update_cache(), so writing sstable may have succeeded, but moving memtable into the cache may have failed. update_cache() destroys memtable even if it fails, but error handler is not aware of it (it does not even distinguish whether error happened during sstable creation or moving into cache) and when it tells caller to retry it retries with already destroyed memtable. Fix it by ignoring moving to cache errors.	2015-11-09 09:56:45 +02:00
Asias He	20ecb0bede	database: Introduce get_initial_tokens Get initial tokens specified by the initial_token in scylla.conf. E.g., --initial-token "-1112521204969569328,1117992399013959838" --initial-token "1117992399013959838" It can be multiple tokens split by comma.	2015-11-04 10:40:12 +08:00
Calle Wilund	ceb9f4d647	database: Just do commitlog::shutdown on shutdown. It will do flushes.	2015-10-26 14:56:24 +01:00
Avi Kivity	f7087da054	Merge "GET methods for snapshots" from Glauber "The snapshots API need to expose GET methods so people can query information on them. Now that taking snapshots is supported, this relatively simple series implement get_snapshot_details, a column family method, and wire that up through the storage_service."	2015-10-22 15:23:45 +03:00
Avi Kivity	5f3a46eabb	Merge "load_new_sstables" from Glauber "This patchset implements load_new_sstables, allowing one to move tables inside the data directory of a CF, and then call "nodetool refresh" to start using them. Keep in mind that for Cassandra, this is deemed an unsafe operation: https://issues.apache.org/jira/browse/CASSANDRA-6245 It is still for us something we should not recommend - unless the CF is totally empty and not yet used, but we can do a much better job in the safety front. To guarantee that, the process works in four steps: 1) All writes to this specific column family are disabled. This is a horrible thing to do, because dirty memory can grow much more than desired during this. Throughout this implementation, we will try to keep the time during which the writes are disabled to its bare minimum. While disabling the writes, each shard will tell us about the highest generation number it has seen. 2) We will scan all tables that we haven't seen before. Those are any tables found in the CF datadir, that are higher than the highest generation number seen so far. We will link them to new generation numbers that are sequential to the ones we have so far, and end up with a new generation number that is returned to the next step. 3) The generation number computed in the previous step is now propagated to all CFs, which guarantees that all further writes will pick generation numbers that won't conflict with the existing tables. Right after doing that, the writes are resumed. 4) The tables we found in step 2 are passed on to each of the CFs. They can now load those tables while operations to the CF proceed normally."	2015-10-22 13:42:24 +03:00
Glauber Costa	36cea4313e	column family: load new sstables CF-level code to load new SSTables. There isn't really a lot of complication here. We don't even need to repopulate the entire SSTable directory: by requiring that the external service that is coordinating this tell us explicitly about the new SSTables found in the scan process, we can just load them specifically and add them to the SSTable map. All new tables will start their lives as shared tables, and will be unshared if it is possible to do so: this all happens inside add_sstable and there isn't really anything special in this front. Signed-off-by: Glauber Costa <glommer@scylladb.com>	2015-10-21 18:06:22 +02:00
Glauber Costa	61be9fb02d	reshuffle tables: mechanism to adjust new sstables' generation number Before loading new SSTables into the node, we need to make sure that their generation numbers are sequential (at least if we want to follow Cassandra's footsteps here). Note that this is unsafe by design. More information can be found at: https://issues.apache.org/jira/browse/CASSANDRA-6245 However, we can already do slightly better in two ways: Unlike Cassandra, this method takes as a parameter a generation number. We will not touch tables that are before that number at all. That number must be calculated from all shards as the highest generation number they have seen themselves. Calling load_new_sstables in the absence of new tables will therefore do nothing, and will be completely safe. It will also return the highest generation number found after the reshuffling process. New writers should start writing after that. Therefore, new tables that are created will have a generation number that is higher than any of this, and will therefore be safe. Signed-off-by: Glauber Costa <glommer@scylladb.com>	2015-10-21 18:06:22 +02:00
Glauber Costa	1351c1cc13	database: mechanism to stop writing sstables During certain operations we need to stop writing SSTables. This is needed when we want to load new SSTables into the system. They will have to be scanned by all shards, agreed upon, and in most cases even renamed. Letting SSTables be written at that point makes it inherently racy - especially with the rename. Signed-off-by: Glauber Costa <glommer@scylladb.com>	2015-10-21 18:06:22 +02:00
Glauber Costa	29e2ad7fd8	column family: commonize code to calculate the desired SSTable generation We will reuse this for load_new_sstables. Signed-off-by: Glauber Costa <glommer@scylladb.com>	2015-10-21 18:02:43 +02:00
Tomasz Grabiec	764d913d84	Merge branch 'pdziepak/row-cache-range-query/v4' from seastar-dev.git From Pawel: This series enables row cache to serve range queries. In order to achieve that row cache needs to know whether there are some other partitions in the specified range that are not cached and need to be read from the sstables. That information is provided by key_readers, which work very similarly to mutation_readers, but return only the decorated key of partitions in range. In case of sstables, key_readers are implemented to use the partition index. An approach like this has the disadvantage of needing to access the disk even if all partitions in the range are cached. There are (at least) two ways of dealing with that problem: - cache partition index - that will also help in all other places where it is needed - add a flag to cache_entry which, when set, indicates that the immediate successor of the partition is also in the cache. Such flag would be set by mutation reader and cleared during eviction. It will also allow newly created mutations from memtable to be moved to cache provided that both their successors and predecessors are already there. The key_reader part of this patchset adds a lot of new code that probably won't be used in any other place, but the alternative would be to always interleave reads from cache with reads from sstables and that would be more heavy on partition index, which isn't cached. Fixes #185.	2015-10-21 15:26:45 +02:00
Glauber Costa	77513a40db	database: get_snapshot_details For each of the snapshots available, the api may query for some information: the total size on disk, and the "real" size. As far as I could understand, the real size is the size that is used by the SSTables themselves, while the total size includes also the metadata about the snapshot - like the manifest.json file. Details follow: In the original Cassandra code, total size is: long sizeOnDisk = FileUtils.folderSize(snapshot); folderSize recurses on directories, and adds file.length() on files. Again, my understanding is that file_size() would give us the same as the length() method for Java. The other value, real (or true) size is: long trueSize = getTrueAllocatedSizeIn(snapshot); getTrueAllocatedSizeIn seems to be a tree walker, whose visitor is an instance of TrueFilesSizeVisitor. What that visitor does, is add up the size of the files within the tree that are "acceptable". An acceptable file is a file which: starts with the same prefix as we want (IOW, belongs to the same SSTable, we will just test that directly), and is not "alive". The alive list is just the list of all SSTables in the system that are used by the CFs. What this tries to do, is to make sure that the trueSnapshotSize is just the extra space on disk used by the snapshot. Since the snapshots are links, then if a table goes away, it adds to this size. If it would be there anyway, it does not. We can do that in a lot simpler fashion: for each file, we will just look at the original CF directory, and see if we can find the file there. If we can't, then it counts towards the trueSize. Even for files that are deleted after compaction, that "eventually" works, and that simplifies the code tremendously given that we don't have to either list all files in the system - as Cassandra does - or go check other shards for liveness information - as we would have to do. The scheme I am proposing may need some tweaks when we support multiple data directories, as the SSTables may not be directly below the snapshot level. Still, it would be trivial to inform the CF about their possible locations. Signed-off-by: Glauber Costa <glommer@scylladb.com>	2015-10-21 13:48:44 +02:00
Avi Kivity	5453cfbab7	Merge "snapshots: take + clear" from Glauber "This is the code for taking a snapshot, and clearing a snapshot."	2015-10-21 08:59:42 +03:00
Paweł Dziepak	c1e95dd893	row_cache: pass underlying key_source to row_cache Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2015-10-20 20:27:53 +02:00
Paweł Dziepak	96a42a9c69	column_family: add sstables_as_key_source() Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2015-10-20 20:27:53 +02:00
Avi Kivity	cf734132e7	Merge "Flushing of CF:s without replay positions" from Calle "Fixes: #469 We occasionally generate memtables that are not empty, yet have no high replay_position set. (Typical case is CL replay, but apparently there are others). Moreover, we can do this repeatedly, and thus get caught in the flush queue ordering restrictions. Solve this by treating a flush without replay_position as a flush at the highest running position, i.e. "last" in queue. Note that this will not affect the actual flush operation, nor CL callbacks, only anyone waiting for the operation(s) to complete. To do this, the flush_queue had its restrictions eased, and some introspection methods added."	2015-10-20 17:36:57 +03:00
Glauber Costa	d236b01b48	snapshots: check existence of snapshots We go to the filesystem to check if the snapshot exists. This should make us robust against deletions of existing snapshots from the filesystem. Signed-off-by: Glauber Costa <glommer@scylladb.com>	2015-10-20 15:58:26 +02:00
Glauber Costa	d3aef2c1a5	database: support clear snapshot This allows for us to delete an existing snapshot. It works at the column family level, and removing it from the list of keyspace snapshots needs to happen only when all CFs are processed. Therefore, that is provided as a separate operation. The filesystem code is a bit ugly: it can be made better by making our file lister more generic. First step would be to call it walker, not lister... For now, we'll use the fact that there are mostly two levels in the snapshot hierarchy to our advantage, and avoid a full recursion - using the same lambda for all calls would require us to provide a separate class to handle the state, that's part of making this generic. Signed-off-by: Glauber Costa <glommer@scylladb.com>	2015-10-20 15:38:14 +02:00
Glauber Costa	500ee99c93	file lister: allow for more than one directory type There are situations in which we would like to match more than one directory type. One example of that would be a recursive delete operation: we need to delete the files inside directories and the directories themselves, but we still don't want a "delete all" since finding anything other than a directory or a file is an error, and we should treat it as such. Since there aren't that many types, it should be ok performance wise to just use a list. I am using an unordered_set here just because it is easy enough, but we could actually relax it later if needed. In any case, users of the interface should not worry about that, and that decision is abstracted away into lister::dir_entry_types. Signed-off-by: Glauber Costa <glommer@scylladb.com>	2015-10-20 15:38:14 +02:00
Avi Kivity	2575a4602e	Merge "Fix for snapshots/create_links and shared SSTables" from Glauber "Those are fixes needed for the snapshotting process itself. I have bundled this in the create_snapshot series before to avoid a rebase, but since I will have to rewrite that to get rid of the snapshot manager (and go to the filesystem), I am sending those out on their own."	2015-10-20 13:49:17 +03:00
Calle Wilund	02732f19f2	database: Handle CF flush with no high replay_position We occasionally generate memtables that are not empty, yet have no high replay_position set. (Typical case is CL replay, but apparently there are others). Moreover, we can do this repeatedly, and thus get caught in the flush queue ordering restrictions. Solve this by treating a flush without replay_position as a flush at the highest running position, i.e. "last" in queue. Note that this will not affect the actual flush operation, nor CL callbacks, only anyone waiting for the operation(s) to complete.	2015-10-20 08:24:04 +02:00
Avi Kivity	2ccb5feabd	Merge "Support nodetool cfhistogram" "This series adds the missing estimated histogram to the column family and to the API so the nodetool cfhistogram would work."	2015-10-19 17:11:46 +03:00
Avi Kivity	f4706c7050	Merge "initial support to leveled compaction" from Raphael "This patchset introduces leveled compaction to Scylla. We don't handle all corner cases yet, but we already have the strategy and compaction working as expected. Test cases were written and I also tested the stability with a load of cassandra-stress. Leveled compaction may output more than one sstable because there is a limit on the size of sstables. 160M by default. Related to handling of partial compaction, it's still something to be worked on. Anyway, it will not be a big problem. Why? Suppose that a leveled compaction will generate 2 sstables, and scylla is interrupted after the first sstable is completely written but before the second one is completely written. The next boot will delete the second sstable, because it was partially written, but will not do anything with the first one as it was completely written. As a result, we will have two sstables with redundant data."	2015-10-19 16:17:45 +03:00
Glauber Costa	218fdebbeb	snapshot: do not allow exceptions in snapshot creation hang us With the distribute-and-sync method we are using, if an exception happens in the snapshot creation for any reason (think file permissions, etc), that will just hang the server since our shard won't do the necessary work to synchronize and note that we have done our part (or tried to) in snapshot creation. Make the then clause a finally, so that the sync part is always executed. Signed-off-by: Glauber Costa <glommer@scylladb.com>	2015-10-19 13:37:02 +02:00
Glauber Costa	9083a0e5a7	snapshots: fix generation of snapshots with shared sstables create_links will fail in one of the shards if one of the SSTables happens to be shared. It should be fine if the link already exists, so let's just ignore that case. Signed-off-by: Glauber Costa <glommer@scylladb.com>	2015-10-19 13:37:01 +02:00
Tomasz Grabiec	19d7d30e67	Replace references to 'urchin' with 'scylla'	2015-10-19 11:08:05 +03:00
Glauber Costa	df857eb8c6	database: touch directories for the column family Current code calls make_directory, which will fail if the directory already exists. We didn't use this code path much before, but once we start creating CF directories on CF creation - and not on SSTable creation, that will become our default method. Use touch_directory instead Signed-off-by: Glauber Costa <glommer@scylladb.com>	2015-10-17 13:08:07 +02:00
Raphael S. Carvalho	35b75e9b67	adapt compaction procedure to support leveled strategy Adapt our compaction code to start writing a new sstable if the one being written reached its maximum size. Leveled strategy works with that concept. If a strategy other than leveled is being used, everything will work as before. Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>	2015-10-16 01:54:52 -03:00
Calle Wilund	012ab24469	column_family: Add flush queue object to act as ordering guarantee	2015-10-14 14:07:40 +02:00
Glauber Costa	b2fef14ada	do not calculate truncation time independently Currently, we are calculating truncated_at during truncate() independently for each shard. It will work if we're lucky, but it is fairly easy to trigger cases in which each shard will end up with a slightly different time. The main problem here, is that this time is used as the snapshot name when auto snapshots are enabled. Previous to my last fixes, this would just generate two separate directories in this case, which is wrong but not severe. But after the fix, this means that both shards will wait for one another to synchronize and this will hang the database. Fix this by making sure that the truncation time is calculated before invoke_on_all in all needed places. Signed-off-by: Glauber Costa <glommer@scylladb.com>	2015-10-09 17:17:11 +03:00
Glauber Costa	1549a43823	snapshots: fix json type We are generating a general object ({}), whereas Cassandra 2.1.x generates an array ([]). Let's do that as well to avoid surprising parsers. Signed-off-by: Glauber Costa <glommer@scylladb.com>	2015-10-08 16:54:51 +02:00
Glauber Costa	cc343eb928	snapshots: handle jsondir creation for empty files case We still need to write a manifest when there are no files in the snapshot. But because we have never reached the touch_directory part in the sstables loop for that case, nobody would have created jsondir in that case. Since now all the file handling is done in the seal_snapshot phase, we should just make sure the directory exists before initiating any other disk activity. Signed-off-by: Glauber Costa <glommer@scylladb.com>	2015-10-08 16:54:51 +02:00
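
The generation renumbering described in the load_new_sstables commits above (5f3a46eabb, 61be9fb02d) can be sketched roughly as follows. This is only an illustration in Python of the idea stated in those messages, not Scylla's C++ implementation: the function name `reshuffle_generations`, its arguments, and the sample generation numbers are all hypothetical, and the real code links files on disk rather than building a dictionary.

```python
from typing import Dict, Iterable, Tuple


def reshuffle_generations(
    found: Iterable[int], highest_seen: int
) -> Tuple[Dict[int, int], int]:
    """Hypothetical sketch: assign sequential generations to unseen sstables.

    `found` holds the generation numbers present in the CF data directory.
    `highest_seen` is the highest generation any shard has reported.
    Generations at or below `highest_seen` are left untouched.
    """
    unseen = sorted(g for g in found if g > highest_seen)
    mapping: Dict[int, int] = {}
    next_gen = highest_seen
    for old_gen in unseen:
        next_gen += 1  # continue the sequence right after the known tables
        mapping[old_gen] = next_gen
    # next_gen is the value new writers must start after.
    return mapping, next_gen


# Step 1: every shard reports the highest generation it has seen.
per_shard_highest = [3, 2, 3]
highest_seen = max(per_shard_highest)

# Step 2: tables 10 and 42 were dropped into the directory by hand.
mapping, new_highest = reshuffle_generations([1, 2, 3, 10, 42], highest_seen)
```

With no unseen tables the mapping is empty and the returned generation equals `highest_seen`, which matches the statement that calling load_new_sstables in the absence of new tables does nothing.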

438 Commits