scylladb

Author	SHA1	Message	Date
Avi Kivity	dc6be68852	Merge "promoted index for reading partial partitions" from Nadav "The goal of this patch series is to support reading and writing of a "promoted index" - the Cassandra 2.* SSTable feature which allows reading only a part of the partition without needing to read an entire partition when it is very long. To make a long story short, a "promoted index" is a sample of each partition's column names, written to the SSTable Index file with that partition's entry. See a longer explanation of the index file format, and the promoted index, here: https://github.com/scylladb/scylla/wiki/SSTables-Index-File There are two main features in this series - first enabling reading of parts of partitions (using the promoted index stored in an sstable), and then enable writing promoted indexes to new sstables. These two features are broken up into smaller stand-alone pieces to facilitate the review. Three features are still missing from this series and are planned to be developed later: 1. When we fail to parse a partition's promoted index, we silently fall back to reading the entire partition. We should log (with rate limiting) and count these errors, to help in debugging sstable problems. 2. The current code only uses the promoted index when looking for a single contiguous clustering-key range. If the ck range is non-contiguous, we fall back to reading the entire partition. We should use the promoted index in that case too. 3. The current code only uses the promoted index when reading a single partition, via sstable::read_row(). When scanning through all or a range of partitions (read_rows() or read_range_rows()), we do not yet use the promoted index; We read contiguously from data file (we do not even read from the index file, so unsurprisingly we can't use it)." (cherry picked from commit `700feda0db`)	2016-08-09 17:54:15 +03:00
Paweł Dziepak	9e8db53c46	sstables: allow row consumer to stop at any point Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-06-20 21:29:50 +01:00
Raphael S. Carvalho	70b793e4d3	tests: add test for statistics rewrite Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-05-20 17:26:12 -03:00
Raphael S. Carvalho	5aeeb0b3e8	compaction: add support to parallel compaction on the same column family It was noticed that small sstables will accumulate for a column family because scylla was limited to two compaction per shard, and a column family could have at most one compaction running at a given shard. With the number of sstables increasing rapidly, read performance is degraded. At the moment, our compaction manager works by running two compaction task handlers that run in parallel to the rest of the system. Each task handler gets to run when needed, gets a column family from compaction manager queue, runs compaction on it, and goes to sleep again. That's basically its cycle. Compaction manager only allows one instance of a column family to be on its queue, meaning that it's impossible for a column family to be compacted in parallel. One compaction starts after another for a given column family. To solve the problem described, we want to concurrently run compaction jobs of a column family that have different "size tier" (or "weight"). For those unfamiliar, compaction job contains a list of sstables that will be compacted together. The "size tier" of a compaction job is the log of the total size of the input sstables. So a compaction job only gets to run if its "size tier" is not the same of an ongoing compaction. There is no point in compacting concurrently at the same "size tier", because that slows down both compactions. We will no longer queue column families in compaction manager. Instead, we create a new fiber to run compaction on demand. This fiber that runs asynchronously will do the following: 1) Get a compaction job from compaction strategy. 2) Calculate "size tier" of compaction job. 3) Run compaction job if its "size tier" is not the same of an ongoing compaction for the given column family. As before, it may decide to re-compact a column family based on a stat stored in column family object. Ran all compaction-related dtests. Fixes #1216. Reviewed-by: Nadav Har'El <nyh@scylladb.com> Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <d30952ff136192a522bde4351926130addec8852.1462311908.git.raphaelsc@scylladb.com>	2016-05-04 11:46:09 +03:00
Glauber Costa	60ab3b3f50	sstable_tests: make sure the generation of the Summary is sane When we recreate the summary from a missing Summary, we should make sure it is generated sanely, and that it resembles the Summary that would have otherwise been there. In this test we'll grab one of the Summary tests we've been doing, and just apply them to the non-existent Summary file. We expect the same results on those cases. Plus, a new test is added with some sanity checking. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-04-12 11:55:01 -04:00
Pekka Enberg	38a54df863	Fix pre-ScyllaDB copyright statements People keep tripping over the old copyrights and copy-pasting them to new files. Search and replace "Cloudius Systems" with "ScyllaDB". Message-Id: <1460013664-25966-1-git-send-email-penberg@scylladb.com>	2016-04-08 08:12:47 +03:00
Raphael Carvalho	370b1336fe	service: fix refresh Vlad and I were working on finding the root of the problems with refresh. We found that refresh was deleting existing sstable files because of a bug in a function that was supposed to return the maximum generation of a column family. The intention of this function is to get generation from last element of column_family::_sstables, which is of type std::map. However, we were incorrectly using std::map::end() to get last element, so garbage was being read instead of maximum generation. If the garbage value is lower than the minimum generation of a column family, then reshuffle_sstables() would set generation of all existing sstables to a lower value. That would confuse our mechanism used to delete sstables because sstables loaded at boot stage were touched. Solution to this problem is about using rbegin() instead of end() to get last element from column_family::_sstables. The other problem is that refresh will only load generations that are larger than or equal to X, so new sstables with lower generation will not be loaded. Solution is about creating a set with generation of live SSTables from all shards, and using this set to determine whether a generation is new or not. The last change was about providing an unused generation to reshuffle procedure by adding one to the maximum generation. That's important to prevent reshuffle from touching an existing SSTable. Tested 'refresh' under the following scenarios: 1) Existing generations: 1, 2, 3, 4. New ones: 5, 6. 2) Existing generations: 3, 4, 5, 6. New ones: 1, 2. 3) Existing generations: 1, 2, 3, 4. New ones: 7, 8. 4) No existing generation. No new generation. 5) No existing generation. New ones: 1, 2. I also had to adapt existing testcase for reshuffle procedure. Fixes #1073. Signed-off-by: Raphael Carvalho <raphaelsc@scylladb.com> Message-Id: <1c7b8b7f94163d5cd00d90247598dd7d26442e70.1458694985.git.raphaelsc@scylladb.com>	2016-03-23 10:21:58 +02:00
Benoît Canet	1fb9a48ac5	exception: Optionally shutdown communication on I/O errors. I/O errors cannot be fixed by Scylla; the only solution is to shut down the database communications. Signed-off-by: Benoît Canet <benoit@scylladb.com> Message-Id: <1458154098-9977-1-git-send-email-benoit@scylladb.com>	2016-03-17 15:02:52 +02:00
Raphael S. Carvalho	031bf57c19	sstables: bail out if toc exists for generation used by write_components Currently, if sstable::write_components() is called to write a new sstable using the same generation of a sstable that exists, a temporary TOC will be unconditionally created. Afterwards, the same sstable::write_components() will fail when it reaches sstable::create_data(). The reason is obvious because data component exists for that generation (in this scenario). After that, user will not be able to boot scylla anymore because there is a generation with both a TOC and a temporary TOC. We cannot simply remove a generation with TOC and temporary TOC because user data will be lost (again, in this scenario). After all, the temporary TOC was only created because sstable::write_components() was wrongly called with the generation of a sstable that exists. Solution proposed by this patch is to trigger exception if a TOC file exists for the generation used. Some SSTable unit tests were also changed to guarantee that we don't try to overwrite components of an existing sstable. Refs #1014. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <caffc4e19cdcf25e4c6b9dd277d115422f8246c4.1457643565.git.raphaelsc@scylladb.com>	2016-03-11 09:22:51 +02:00
Glauber Costa	a339296385	database: turn sstable generation number into an optional This patch makes sure that every time we need to create a new generation number - the very first step in the creation of a new SSTable, the respective CF is already initialized and populated. Failure to do so can lead to data being overwritten. Extensive details about why this is important can be found in Scylla's Github Issue #1014. Nothing should be writing to SSTables before we have the chance to populate the existing SSTables and calculate what the next generation number should be. However, if that happens, we want to protect against it in a way that does not involve overwriting existing tables. This is one of the ways to do it: every column family starts in an unwriteable state, and when it can finally be written to, we mark it as writeable. Note that this cannot be a part of add_column_family. That adds a column family to a db in memory only, and if anybody is about to write to a CF, that was most likely already called. We need to call this explicitly when we are sure we're ready to issue disk operations safely. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-03-10 21:06:05 -05:00
Glauber Costa	8e4bf025ae	sstables: wire priority for read path All the SSTable read path can now take an io_priority. The public functions will take a default parameter which is Seastar's default priority. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2016-01-25 15:20:38 -05:00
Glauber Costa	74fbd8fac0	do not call open_file_dma directly We have an API that wraps open_file_dma which we use in some places, but in many other places we call the reactor version directly. This patch changes the latter to match the former. It will have the added benefit of allowing us to make easier changes to these interfaces if needed. Signed-off-by: Glauber Costa <glauber@scylladb.com> Message-Id: <29296e4ec6f5e84361992028fe3f27adc569f139.1451950408.git.glauber@scylladb.com>	2016-01-05 10:37:57 +02:00
Avi Kivity	47499dcf18	data_value: make conversion from bytes explicit Since bytes is a very generic value that is returned from many calls, it is easy to pass it by mistake to a function expecting a data_value, and to get a wrong result. It is impossible for the data_value constructor to know if the argument is a genuine bytes variable, a data_value of another type, but serialized, or some other serialized data type. To prevent misuse, make the data_value(bytes) constructor (and the complementary data_value(optional<bytes>)) explicit.	2015-11-13 17:12:29 +02:00
Avi Kivity	2c3591cbd9	data_value de-any-fication We use boost::any to convert to and from database values (stored in serialized form) and native C++ values. boost::any captures information about the data type (how to copy/move/delete etc.) and stores it inside the boost::any instance. We later retrieve the real value using boost::any_cast. However, data_value (which has a boost::any member) already has type information as a data_type instance. By teaching data_type instances about the corresponding native type, we can eliminate the use of boost::any. While boost::any is evil and eliminating it improves efficiency somewhat, the real goal is growing native type support in data_type. We will use that later to store native types in the cache, enabling O(log n) access to collections, O(1) access to tuples, and more efficient large blob support.	2015-10-30 17:38:51 +01:00
Glauber Costa	54aaa58899	sstable_tests: test reshuffle operation Signed-off-by: Glauber Costa <glommer@scylladb.com>	2015-10-21 18:06:22 +02:00
Glauber Costa	a8db2b28c7	sstable tests: test set_generation No code works until it's been tested. Signed-off-by: Glauber Costa <glommer@scylladb.com>	2015-10-21 18:06:22 +02:00
Glauber Costa	c5950c7bf7	sstable_test: get rid of frees They exist. They shouldn't. Signed-off-by: Glauber Costa <glommer@scylladb.com>	2015-10-21 18:06:22 +02:00
Glauber Costa	f60021f87f	sstable_tests: commonize code to compare two components. The current code assumes a particular dir/generation pair. We will use it for a more generic case. This code could really use some clean up, by the way. We should do it later. Signed-off-by: Glauber Costa <glommer@scylladb.com>	2015-10-21 18:06:22 +02:00
Glauber Costa	fcebf6f72d	sstable tests: don't use set_generation method There is no reason aside from testing for a table to just change its generation number. There will be, however, when we support loading new sstables. The method however needs to be completely rewritten, so let's make sure the tests are not using that. Signed-off-by: Glauber Costa <glommer@scylladb.com>	2015-10-21 18:02:42 +02:00
Avi Kivity	d5cf0fb2b1	Add license notices	2015-09-20 10:43:39 +03:00
Glauber Costa	a9ab31dd9c	index_entry: move its fields to private visibility And provide accessors. This will give us the freedom to change their internal storage. Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>	2015-08-29 14:05:36 -05:00
Glauber Costa	13d59c9618	index_entry: do away with the disk_string<> fields Now that we are using the NSM, and not the general parser for the index, there is no reason to keep using disk_string<>s in it. Since it is staying in the way of further optimizations, let's get rid of it and use bytes directly. Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>	2015-08-29 14:05:36 -05:00
Glauber Costa	93e55969f2	sstables: modify read_indexes so it no longer takes a quantity read_indexes was one of the first functions coded in the sstable read path. At the time, I made the (now so obviously) wrong decision to code it generic enough so that we could specify the number of items to be read, instead of an upper bound in the file. The main reason for that, was that without the Summary, we have no way to know where to stop reading, and the Summary is a relatively new addition to the C* codebase: while I didn't really check when it got in, the code is full of tests for its presence. That turned out to be totally useless: we always read the indexes with the help of the Summary. While the Summary is a relatively new addition to C*, it is present in all versions we aim to support. Meaning that reads without the Summary will never happen in our codebase. Even if, in the future, we happen to ditch the Summary file, we are very likely to do so in favor of some other structure that also allows us to manipulate precise borders in the Index. The code as it is, however, would not be too big of a problem if that wasn't causing us performance problems. But it is, and the majority of it is caused by the fact that our underlying read_indexes do not know in advance how many bytes to read, forcing us to do an element-per-element read. It's time for a change. Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>	2015-08-27 16:44:25 +03:00
Glauber Costa	976de6f6f4	sstables: get cf and ks strings for filename We will need them to properly build names in some situations. Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>	2015-08-07 08:31:55 -05:00
Glauber Costa	cd8c9ad288	sstables: add ks and cf name to sstable constructor When a schema is available, we use it. However, we have, by now, way too many tests. Some of them use tables for which we don't even know the schema. It would have been a massive amount of work to require a schema for all of them - so I am keeping both constructors around. Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>	2015-08-07 08:31:55 -05:00
Raphael S. Carvalho	004af400de	tests: add method in sstables::test to write components Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>	2015-08-06 17:39:05 +03:00
Avi Kivity	c720cddc5c	tests: mv tests/urchin/* -> tests/ Now that seastar is in a separate repository, we can use the tests/ directory.	2015-08-05 14:16:52 +03:00

27 Commits