continuous_data_consumer::fast_forward_to() returns a future which was
later ignored by data_consume_context::fast_forward_to().
With the current implementation, the future in question is always ready
and that's why the problem didn't manifest itself in the form of crashes
or invalid results.
Message-Id: <20170120105746.7300-1-pdziepak@scylladb.com>
If the sstable Summary is not present, Scylla does not refuse to boot but
instead creates the summary information on the fly. There is a bug in this
code, though. The Summary file is a map between keys and offsets into the Index
file, but the code created a map between keys and Data file offsets
instead. Fix it by keeping the offset of an index entry in the index_entry
structure and using it during Summary file creation.
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20161116165421.GA22296@scylladb.com>
Single partition and partition range reads are expected to behave
considerably differently, so it is worth having them use separate file
stream histories. This also makes reads use a different history for each
sstable, which is also a good thing.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
This patch allows sstable readers to be fast forwarded without having to
recreate the reader (and drop all buffers in the
process). It is built on top of index_reader and the ability of
data_consume_context to be fast forwarded.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
This patch adds support for more efficiently reading small parts of a large
partition, without reading the entire partition as we had to do so far.
This is done using the "promoted index".
The "promoted index" is stored in the sstable index file, and provides
for each large sstable row ("partition" in CQL nomenclature) a sample of
the column names at (for example) 64KB intervals. This means that when we
read a slice of columns (e.g., cql rows), or page through a large partition,
we do not have to read the entire partition from disk.
This patch only implements the read side of promoted index - a later patch
will add the write-side support (i.e., writing the promoted index to the
index file while saving the sstable). Nevertheless this patch can already
be tested by reading existing sstables from Cassandra which include a
promoted index - such as the one included in the test in the previous patch.
The use of the promoted index currently has two limitations:
1. It is only used when reading a single partition with sstable::read_row(),
not when scanning through many partitions with sstable::read_range_rows()
or sstable::read_rows().
2. It is only used when filtering a single clustering-key range, rather
than a list of disjoint ranges. A single range is the common case.
These two issues will be improved later. In the meantime, in those
unsupported cases we simply continue to read entire partitions, so we're no
worse off than before.
Also note that this patch only helps when sstable::read_row() is used with
a clustering-key prefix (i.e., a slice). Our higher-level request handling
code may decide to read an entire partition into the cache, and not use
a clustering-key prefix at all when reading. We will need to independently
improve the high-level code to use read_row()'s slicing capabilities
when paging through large partitions, for example.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
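The promoted-index lookup described above can be sketched as a binary search
over the sampled entries. This is a minimal, hypothetical illustration - the
`pi_entry` / `pi_lookup` names and the flat in-memory layout are assumptions
for the sketch, not Scylla's real structures:

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical, simplified promoted-index entry: the first column name
// covered by a ~64KB block of the partition, and that block's offset.
struct pi_entry {
    std::string first_name;  // first column name in the block
    uint64_t offset;         // offset of the block
};

// Find the offset of the block that may contain `name`: the last
// sampled entry whose first_name is <= name.
uint64_t pi_lookup(const std::vector<pi_entry>& index, const std::string& name) {
    auto it = std::upper_bound(index.begin(), index.end(), name,
        [](const std::string& n, const pi_entry& e) { return n < e.first_name; });
    if (it == index.begin()) {
        return 0;                    // name precedes all sampled names
    }
    return std::prev(it)->offset;    // start reading from this block
}
```

With 64KB sampling, a slice read only needs to start at the block returned by
the lookup instead of at the beginning of the partition.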
Currently, the main sstable data parsing entry point data_consume_rows()
takes a contiguous range of bytes to read from disk and parse. This range
is supposed to be an entire partition or a contiguous group of partitions,
and is self-contained (i.e., it can be parsed without extra information about
the identity of these partitions).
For the promoted index feature (which we will add in a following patch)
we will want the range to span only a part of a partition, and will need
the caller to provide some information not available to the parser (such
as the partition's key). In the future, we will also want to support a
vector of byte ranges, instead of just one.
So in preparation for this, this patch simply replaces the start/end pair
by a new class disk_read_range, which can be easily extended in later
patches. No new functionality is introduced in this patch.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
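The shape of the replacement class can be sketched as follows. This is an
assumed minimal form - the real disk_read_range may carry additional members;
only the start/end pair is taken from the description above:

```cpp
#include <cstdint>

// Minimal sketch of the disk_read_range idea: a small class wrapping the
// start/end byte offsets of a read, which can later grow extra fields
// (e.g. the partition key) without touching every call site that used to
// take a bare start/end pair.
class disk_read_range {
    uint64_t _start;
    uint64_t _end;
public:
    disk_read_range(uint64_t start, uint64_t end) : _start(start), _end(end) {}
    uint64_t start() const { return _start; }
    uint64_t end() const { return _end; }
    uint64_t size() const { return _end - _start; }
    bool empty() const { return _start == _end; }
};
```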
In a later patch adding "promoted index" read support, we would like to
parse only part of an sstable row. In that case, the parser should start
not at the usual ROW_START state, but rather at the ATOM_START state.
But there's a problem: The sstable parser consumer currently assumes that
the parser stops after the start of the row, before reading any atoms.
So in the partial row case too, we must stop parsing before reading the
first atom.
For this, this patch adds the new "STOP_THEN_ATOM_START" parser state.
When starting in this state, the parser stops immediately (with
row_consumer::proceed::no), and when restarted it will be in the
ATOM_START state.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
data_consume_rows_context needs to have close() called and the returned
future waited for before it can be destroyed. data_consume_context::impl
does that in the background upon its destruction.
However, it is possible that the sstable is removed before
data_consume_rows_context::close() completes in which case EBADF may
happen. The solution is to make data_consume_context::impl keep a
reference to the sstable and extend its life time until closing of
data_consume_rows_context (which is performed in the background)
completes.
A side effect of this change is that data_consume_context no longer
requires its user to ensure that the sstable exists as long as it is in
use, since it owns its own reference to it.
Fixes #1537.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1470222225-19948-1-git-send-email-pdziepak@scylladb.com>
Move the to_bytes_view(temporary_buffer<char>) function from a source file
to a header file where it can be used in more places.
This saves one use of reinterpret_cast (which we are now re-evaluating),
and moreover, we want to use this function in the promoted index code as
well (to return a bytes_view from the promoted index which was saved as a
temporary_buffer).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1468761437-27046-1-git-send-email-nyh@scylladb.com>
If read-ahead is going to be enabled, it is important to close an
input_stream<> properly (and wait for completion) before destroying it.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
When we do a streaming read that knows the expected *end* position of the
read, we can use a large read-ahead buffer, and at the same time, stop
reading at exactly the intended end (or small rounding of it to the DMA
block size) and not waste resources blindly reading a large amount of data
after the end just to fill the read-ahead buffer.
The sstable reading code, both for reading the data file and the index file,
created a file input stream without specifying its end, thereby losing
this optimization - so when a large buffer was used, we would get a large
over-read. This patch fixes this, so the sstable data file and index file are
read using a file input stream which is aware of its end.
Fixes #964.
Note that this patch does not change the behavior when reading a
*compressed* data file. For compressed read, we did not have the problem
of over-read in the first place, because chunks are read one by one.
But we do have other sources of inefficiencies there (stemming, again,
from the fact that the compressed chunks are read one by one), and I
opened a separate issue #992 for that.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <1457219304-12680-1-git-send-email-nyh@scylladb.com>
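The read-sizing logic described above can be sketched like this. The function
and constant names are illustrative assumptions, not the actual stream
implementation; the point is clamping each read-ahead request to the known end,
rounded up only to the DMA block size:

```cpp
#include <algorithm>
#include <cstdint>

// Assumed DMA block size for the sketch.
constexpr uint64_t dma_block = 4096;

// Round a byte offset up to the next DMA block boundary.
uint64_t round_up_dma(uint64_t x) {
    return (x + dma_block - 1) / dma_block * dma_block;
}

// How many bytes to actually request at position `pos`: the full
// read-ahead buffer, unless the known end is closer, in which case we
// read just up to the (DMA-rounded) end and avoid the over-read.
uint64_t next_read_size(uint64_t pos, uint64_t end, uint64_t buffer_size) {
    return std::min(buffer_size, round_up_dma(end) - pos);
}
```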
The entire SSTable read path can now take an io_priority. The public
functions take a default parameter, which is Seastar's default priority.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
When the row consumer falls through from ATOM_NAME_BYTES to ATOM_MASK,
we assume that the mask can be consumed, but it may happen that
data.size() equals zero, so the mask cannot be consumed.
The solution is to add read_8 so that the code only falls through
if the mask can be consumed right away.
Fixes #197.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
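The safe-fallthrough pattern described above can be sketched with made-up
types. Here read_8() reports whether a byte was available, so the state
machine stays in ATOM_MASK and resumes later when the buffer runs dry - the
states and the parser shape are illustrative, not the real consumer:

```cpp
#include <cstdint>
#include <string_view>

enum class state { ATOM_NAME_BYTES, ATOM_MASK, DONE };

struct parser {
    state st = state::ATOM_NAME_BYTES;
    uint8_t mask = 0;

    // Returns true only if one byte could be consumed right away.
    bool read_8(std::string_view& data, uint8_t& out) {
        if (data.empty()) {
            return false;            // must wait for more data
        }
        out = static_cast<uint8_t>(data[0]);
        data.remove_prefix(1);
        return true;
    }

    void process(std::string_view data) {
        switch (st) {
        case state::ATOM_NAME_BYTES:
            // ... name bytes would be consumed here ...
            st = state::ATOM_MASK;
            [[fallthrough]];         // safe: ATOM_MASK checks availability
        case state::ATOM_MASK:
            if (!read_8(data, mask)) {
                return;              // no data yet: stay in ATOM_MASK
            }
            st = state::DONE;
            return;
        default:
            return;
        }
    }
};
```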
Since this class is a base of the child, we don't need to pass a reference: we can
just cast our 'this' pointer.
By doing that, the move constructor can come back.
Welcome back, move constructor.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
In order to reuse the NSM in other scenarios, we need to push as much code
as possible into a common class.
This patch does that, making the continuous_data_consumer class now the main
placeholder for the NSM class. The actual readers will have to inherit from it.
However, despite using inheritance, I am not using virtual functions at all;
instead, we let the continuous_data_consumer receive an instance of the derived
class, and then it can safely call its methods without paying the cost of
virtual functions.
In another attempt, I had kept the main process() function in the derived class,
which then had the responsibility of coding the loop.
With the use of the new pattern, we can keep the loop logic in the base class,
which is a lot cleaner. There is a performance penalty associated with it, but
it is fairly small: 0.5 % in the sequential_read perf_sstable test. I think we
can live with it.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
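The inheritance-without-virtuals pattern described above is essentially CRTP:
the base is templated on the derived class and statically casts `this` to call
the derived hooks directly. A minimal sketch, with illustrative names rather
than the real continuous_data_consumer interface:

```cpp
#include <cstddef>
#include <string_view>

// The shared loop lives in the base; Derived supplies process_buffer()
// as a plain non-virtual member, so calls are resolved at compile time.
template <typename Derived>
class continuous_data_consumer {
public:
    size_t consume(std::string_view data) {
        size_t total = 0;
        while (!data.empty()) {
            // Static downcast: no vtable, no indirect call.
            size_t n = static_cast<Derived*>(this)->process_buffer(data);
            data.remove_prefix(n);
            total += n;
        }
        return total;
    }
};

// An example derived consumer that just counts the bytes it is fed.
class byte_counter : public continuous_data_consumer<byte_counter> {
public:
    size_t processed = 0;
    size_t process_buffer(std::string_view data) {
        processed += data.size();
        return data.size();      // consume everything offered
    }
};
```

This keeps the loop logic in one place while letting each reader define only
its per-buffer processing, which is the tradeoff the message above describes.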
In my previous attempt, I have separated the state processor for the main loop,
leaving that to be filled by a derived class.
That felt a lot more natural, because then we don't have to replicate the loop
logic in the derived classes.
But well, oh well, life is hard. Especially on fast paths. Doing that makes us
insert an extra call in this loop, and that is noticeable: we would be 1.5 %
slower, and that is not even counting the cost of making the state processing a
virtual function later on.
I could just argue that this is acceptable due to the decoupling gains, but why
would I argue that, if I can just rewrite it in a way that no performance is
lost?
And then I did. The disadvantage of this is that the derived class now has to
re-code the loop, but no performance is lost. An advantage, on the other hand,
is that the derived class can now call into process_buffer directly, without
using virtual functions in this path.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
It is now a static member that gets the instance members as parameters. There
is no reason for that, and this will complicate the decoupling, since the
prestate reader won't know about state.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Continuing the work of decoupling the prestate and state parts of the NSM
so we can reuse it, move the proceed class to a different holding class.
Proceeding or not has nothing to do with "rows".
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Because we previously had no way to know whether or not a read completed,
we would always go back to the main loop, and would only optimize sequential
reads for some kinds of data.
However, as one could see in the previous patch, the new read_X functions
notify completion, allowing us to just fall through to the next case if that is
the only possibility. In most cases, it isn't. With this, we can apply this
optimization throughout all cases where we don't branch states, and with a very
elegant resulting code.
The performance actually increases by 0.75 %. It is not much, but it is more
than the error margin (which sits at 0.20 %), and because the code is not made
unreadable by it, this is a clear win to me.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
In an attempt to gain some cycles, we are testing whether we can read many
values at once, and if so, using consume_be directly for those.
What we can do in this situation is read the first value, and let execution
fall through to the next case if the read succeeds.
The code actually looks a lot more elegant this way.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
In some places, we cannot use our read_* functions, because we don't know
whether or not the read succeeded, and that is important when passing the state
along.
The fix for this is trivial, since we can just return it from the reader.
Note for reviewers: the comment in one of the functions says we should use
"read_bytes(data, _u32, _key ...". But in the actual code, the buffer there is
_val, not _key.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
It shouldn't be _u16, but rather whatever we passed as len. It currently works
because all callers pass _u16 as len. But this will soon change.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
Soon enough, all the state machine will be separated from the prestate handling.
To make it easier, we will decouple them as much as we can.
Not automatically switching states in the read functions is part of this.
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
Reviewed-by: Nadav Har'El <nyh@cloudius-systems.com>
For sstable reads, bigger buffers are not always better, because it can be
the case that we want to read just a piece of the data.
However, as it so happens, we already have two variants of read: when we
want to read a single key, we use read_row(), which tries to bring all the
data in, so it uses a smaller buffer.
For read_rows(), which naturally spans multiple buffers, we have start and
end points: with those, we can make a good estimate of the expected buffer
size, and let it grow until it reaches the 128k limit we have for
writes.
Before:
209578.62 +- 135.73 partitions / sec (30 runs, 1 concurrent ops)
After:
291703.98 +- 218.95 partitions / sec (30 runs, 1 concurrent ops)
Gain:
39.19 %
Signed-off-by: Glauber Costa <glommer@cloudius-systems.com>
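The sizing policy above can be sketched in a few lines. The exact heuristic is
an assumption for illustration; only the "estimate from start/end, capped at
128k" idea comes from the message:

```cpp
#include <algorithm>
#include <cstdint>

// The 128k cap mentioned above (the limit used for writes).
constexpr uint64_t max_buffer = 128 * 1024;

// Estimate the read buffer size for a range with known start/end points:
// no bigger than the data we expect to read, and never above the cap.
uint64_t choose_buffer_size(uint64_t start, uint64_t end) {
    return std::min(end - start, max_buffer);
}
```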
data_consume_rows(0, 0) was returning all partitions instead of no
partitions, because -1 was passed as the count in that case, which was
then cast to uint64_t.
Special-casing it that way is problematic for code which calculates
the bounds, and when the key is not found we simply end up with 0 as the
upper bound. Instead of convoluting the range lookup code to special-case
0, let's simplify the interface so that (0, 0) returns no
rows, same as (1, 1). There is a new overload of data_consume_rows()
without bounds, which returns all data.
This patch changes the sstable read APIs from a "push" style
to a "pull" style.
The sstable read code has two APIs:
1. An API for sequentially consuming low-level sstable items - sstable
row's beginning and end, cells, tombstones, etc.
2. An API for sequentially consuming entire sstable rows in our "mutation"
format.
Before this patch, both APIs were in "push style": The user supplies
callback functions, and the sstable read code "pushes" to these functions
the desired items (low-level sstable parts, or whole mutations).
However, a push API is very inconvenient for users, like the query
processing code, or the compaction code, which both iterate over mutations.
Such code wants to control its own progression through the iteration -
the user prefers to "pull" the next mutation when it wants it; Moreover,
the user wants to *stop* pulling more mutations if it wants, without
worrying about various continuations that are still scheduled in the
background (the latter concern was especially problematic in the "push"
design).
The modified APIs are:
1. The functions for iterating over mutations, sstable::read_rows() et al.,
now return a "mutation_reader" object which can be used for iterating
over the mutation: mutation_reader::read() asks for the next mutation,
and returns a future to it (or an unassigned value on EOF).
You can see an example on how it is used in sstable_mutation_test.cc.
2. The functions for consuming low-level sstable items (row begin, cell,
etc.) are still partially push-style - the items are still fed into
the consumer object - but consumption now *stops* (instead of deferring
and continuing later, as in the old code) when the consumer asks it to.
The caller can resume the consumption later when it wishes (in
this sense, this is a "pull" API, because the user asks for more
input when it wants it).
This patch does *not* remove input_stream's feature of a consumer
function returning a non-ready future. However, this feature is no longer
used anywhere in our code - the new sstable reader code stops the
consumption where the old sstable reader code paused it temporarily with
a non-ready future.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
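The "pull" shape of point 1 can be sketched synchronously. This is a toy
stand-in: the real mutation_reader::read() returns a future, and `mutation`
here is just a placeholder string, but the calling convention is the same -
the caller asks for the next item when it wants it, and can simply stop
calling read() to stop:

```cpp
#include <optional>
#include <string>
#include <vector>

using mutation = std::string;  // stand-in for the real mutation type

class mutation_reader {
    std::vector<mutation> _muts;
    size_t _next = 0;
public:
    explicit mutation_reader(std::vector<mutation> muts)
        : _muts(std::move(muts)) {}

    // Pull the next mutation; an empty optional signals EOF (the real
    // API returns a future to an optionally-assigned value instead).
    std::optional<mutation> read() {
        if (_next == _muts.size()) {
            return std::nullopt;   // EOF
        }
        return _muts[_next++];
    }
};
```

Compare this with the old push style, where the reader drove callbacks and the
consumer had no easy way to stop mid-iteration without orphaned continuations.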
After commit 3ae81e68a0, we already support
in input_stream::consume() the possibility of the consumer blocking by
returning a future. But the code for sstable consumption had now way to
use this capability. This patch adds a future<> return code for
consume_row_end(), allowing the consumer to pause after reading each
sstable row (but not, currently, after each cell in the row).
We also need to use this capability in read_range_rows(), which wrongly
ignored the future<> returned by the "walker" function - now this future<>
is returned to the sstable reader, and causes it to pause reading until
the future is fulfilled.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
When the temporary buffer has enough data for a uint64 to be
consumed, we readily consume it.
The problem is that we were wrongly storing the uint64 into
a uint32 variable.
Signed-off-by: Raphael S. Carvalho <raphaelsc@cloudius-systems.com>
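The bug class fixed above is easy to demonstrate in isolation: assigning a
consumed uint64 into a uint32 variable silently drops the high 32 bits. A
minimal illustration (not the actual consumer code):

```cpp
#include <cstdint>

// Storing into the wrong-width variable truncates silently.
uint32_t truncated(uint64_t v) { return static_cast<uint32_t>(v); }

// The fix: keep the full 64-bit width end to end.
uint64_t preserved(uint64_t v) { return v; }
```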
This patch adds support to reading deleted cells (a.k.a. cell tombstones)
from the SSTable.
The way deleted cells are encoded in the sstable is explained in the
"Cell tombstone" section of
https://github.com/cloudius-systems/urchin/wiki/SSTables-interpretation-in-Urchin
This more-or-less completes the low-level SSTable row reading code - the
only remaining untreated case is counters, which we agreed to leave for
later. If counters are found in the SSTable, we'll throw an exception.
This patch adds a new callback, consume_deleted_cell, taking the name of
the cell and its deletion_time (as usual, deletion_time includes both a
64-bit timestamp, for ordering events, and a 32-bit "local_deletion_time"
used to schedule gc of old tombstones).
This patch also adds a test SSTable with deleted cell, created by the
following Cassandra commands:
CREATE TABLE deleted (
name text,
age int,
PRIMARY KEY (name)
);
INSERT INTO deleted (name, age) VALUES ('nadav', 40);
<flush table - the second table is what we're after>
DELETE age FROM deleted WHERE name = 'nadav';
We test our ability to read this sstable, and see the deleted cell
and its expected deletion time.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
This patch adds support to reading sstable cells with expiration time.
It adds two more parameters to the row_consumer::consume_cell() - "ttl"
and "expiration". The "ttl" is the original TTL set on the cell in seconds,
the "expiration" is the absolute time (in seconds since the Unix epoch) when
this cell is set to expire. I don't know why both values are needed...
When a cell has no expiration time set (most cells will be like that), the
callback will be called with expiration==0 (and ttl==0).
This patch also adds a test SSTable with cells with set TTL, created by
the following Cassandra commands:
CREATE TABLE ttl (
name text,
age int,
PRIMARY KEY (name)
);
INSERT INTO ttl (name, age) VALUES ('nadav', 40) USING TTL 3600;
And tests our ability to read the resulting sstable, and get the expected
expiration time.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
The previous implementation could read either one sstable row or several,
but only when all the data was read in advance into a contiguous memory
buffer.
This patch changes the row read implementation into a state machine,
which can work on either a pre-read buffer, or data streamed via the
input_stream::consume() function:
The sstable::data_consume_rows_at_once() method reads the given byte range
into memory and then processes it, while the sstable::data_consume_rows()
method reads the data piecemeal, without trying to fit all of it into
memory. The first function is (or will be...) optimized for reading one
row, and the second for iterating over all rows - although both
can be used to read any number of rows.
The state-machine implementation is unfortunately a bit ugly (and much
longer than the code it replaces), and could probably be improved in the
future. But the focus was parsing performance: when we use large buffers
(the default is 8192 bytes), most of the time we don't need to read
byte-by-byte, and can efficiently read entire integers at once, or even larger
chunks. For strings (like column names and values), we even avoid copying
them if they don't cross a buffer boundary.
To test the rare boundary-crossing case despite having a small sstable,
the code includes in "#if 0" a hack to split one buffer into many tiny
buffers (1 byte, or any other number) and process them one by one.
The tests still pass with this hack turned on.
This implementation of sstable reading also adds a feature not present
in the previous version: reading range tombstones. An sstable with an
INSERT of a collection always has a range tombstone (to delete all old
items from the collection), so we need this feature to read collections.
A test for this is included in this patch.
Signed-off-by: Nadav Har'El <nyh@cloudius-systems.com>
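The fast-path/slow-path split described above (read whole integers at once
when a buffer holds them, fall back to byte-by-byte across buffer boundaries)
can be illustrated with a toy state machine. The class and its interface are
made up for the sketch; only the technique is taken from the message:

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Parse one big-endian uint32 from a stream fed in arbitrary chunks.
class be32_parser {
    uint32_t _value = 0;
    int _have = 0;               // bytes accumulated so far
public:
    bool done() const { return _have == 4; }
    uint32_t value() const { return _value; }

    // Consume from `data`; returns the number of bytes used.
    size_t consume(std::string_view data) {
        if (_have == 0 && data.size() >= 4) {
            // Fast path: the whole integer is in the current buffer.
            for (int i = 0; i < 4; i++) {
                _value = (_value << 8) | static_cast<uint8_t>(data[i]);
            }
            _have = 4;
            return 4;
        }
        // Slow path: accumulate byte by byte across buffer boundaries.
        size_t used = 0;
        while (_have < 4 && used < data.size()) {
            _value = (_value << 8) | static_cast<uint8_t>(data[used++]);
            _have++;
        }
        return used;
    }
};
```

Feeding the parser 1-byte buffers, as the "#if 0" hack in the patch does for
its own state machine, exercises the boundary-crossing path.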