scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-23 18:10:39 +00:00

Author	SHA1	Message	Date
Piotr Jastrzebski	11a354b144	Introduce sstable::read_row_flat This will be used together with sstables::read_range_rows to migrate sstables::as_mutation_source(). Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-20 16:26:54 +01:00
Piotr Jastrzebski	65c6f339d6	Delete sstable_streamed_mutation It's no longer used so can be safely removed. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-20 16:26:54 +01:00
Piotr Jastrzebski	e241b0c2de	Stop using streamed_mutation in sstable_data_source Use a partition_header instead. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-20 16:26:54 +01:00
Piotr Jastrzebski	375c321e9d	Stop using streamed_mutation in consumer and reader Don't use streamed_mutation in mp_row_consumer and sstable_mutation_reader. Also use sstable_mutation_reader in sstable::read_row. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-20 16:22:57 +01:00
Piotr Jastrzebski	f7bf782a41	Store sstable_mutation_reader pointer in mp_row_consumer The reader will be used by mp_row_consumer instead of streamed_mutation in next patches. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-16 22:06:28 +01:00
Piotr Jastrzebski	145fcf846e	Move advance_to_upper_bound above sstable_mutation_reader It will be used in sstable_mutation_reader when the reader will be used to implement sstable::read_row. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-16 22:06:28 +01:00
Piotr Jastrzebski	1c7938c44d	Replace "sm" with "partition" in get_next_sm and on_sm_finished Streamed mutation won't be used any more so get_next_partition and on_partition_finished are more suitable names. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-16 22:06:28 +01:00
Piotr Jastrzebski	4943f52ad7	Remove unused sstable_mutation_reader constructor The constructor is never used so it can be safely removed. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-16 22:06:28 +01:00
Piotr Jastrzebski	c7971eb8e3	Move mp_row_consumer methods implementations to the bottom Those methods have to be below sstable_mutation_reader because they will be using the reader instead of streamed_mutation. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-16 22:06:28 +01:00
Piotr Jastrzebski	537b42e153	Turn sstable_mutation_reader into a flat_mutation_reader This is the first step which still uses streamed_mutation. Next step will be to get rid of streamed_mutation. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-16 22:06:00 +01:00
Piotr Jastrzebski	74f0c01865	Add sstables::read_rows_flat and sstables::read_range_rows_flat Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-15 15:33:23 +01:00
Piotr Jastrzebski	ea449c9cce	Replace sstables::mutation_reader with ::mutation_reader This will make migration to flat_mutation_reader much easier and sstables::mutation_reader is going away with this migration anyway. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-15 10:40:01 +01:00
Piotr Jastrzebski	228f0737f4	Reduce dependencies from mp_row_consumer to sstable_streamed_mutation Before this patch mp_row_consumer was using sstable_streamed_mutation in two ways: 1. Populate sstable_streamed_mutation's buffer with mutation_fragments 2. Advance sstable_streamed_mutation's sstable_data_source to new position. We can easily reduce those dependencies only to the first one. This will reduce the coupling between those classes and simplify the flow of execution. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2017-11-15 10:40:01 +01:00
Duarte Nunes	baeec0935f	Replace query::full_slice with schema::full_slice() query::full_slice doesn't select any regular or static columns, which is at odds with the expectations of its users. This patch replaces it with the schema::full_slice() version. Refs #2885 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1507732800-9448-2-git-send-email-duarte@scylladb.com>	2017-10-17 11:25:53 +02:00
Botond Dénes	dead2617ce	mp_row_consumer: remove unnecessary _reasource_tracker member Leftovers from `a43901f84`. Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <88237d9cd97feeca47e12ec4af89c90f1a3a6bb5.1507535176.git.bdenes@scylladb.com>	2017-10-09 10:59:40 +03:00
Botond Dénes	a43901f842	row_consumer: de-virtualize io_priority() and resource_tracker() Fixes #2830 Signed-off-by: Botond Dénes <bdenes@scylladb.com> Message-Id: <448a1f739ab8c88a7a5562bce8dce5ae6efdf934.1507302530.git.bdenes@scylladb.com>	2017-10-06 18:50:12 +01:00
Botond Dénes	47e07b787e	restricted_mutation_reader: restrict based-on memory consumption Restrict readers based on their memory consumption, instead of the count of the top-level readers. To do this, an interposer is installed at the input_stream level which tracks buffers emitted by the stream. This way we can have an accurate picture of the readers' actual memory consumption. New readers will consume 16k units from the semaphore up-front. This is to account for their own memory consumption, apart from the buffers they will allocate. Creating the reader will be deferred to when there are enough resources to create it. As before, only new readers will be blocked on an exhausted semaphore; existing readers can continue to work.	2017-10-03 12:44:12 +03:00
Avi Kivity	78eae8bf48	Revert "Merge "Make restricting_mutation_reader more accurate" from Botond" This reverts commit `c6e5dcc556`, reversing changes made to `19b21a0ab2`. Fails to build, plus the author has more changes.	2017-10-03 11:58:59 +03:00
Botond Dénes	33e97e7457	restricted_mutation_reader: restrict based-on memory consumption Restrict readers based on their memory consumption, instead of the count of the top-level readers. To do this, an interposer is installed at the input_stream level which tracks buffers emitted by the stream. This way we can have an accurate picture of the readers' actual memory consumption. New readers will consume 16k units from the semaphore up-front. This is to account for their own memory consumption, apart from the buffers they will allocate. Creating the reader will be deferred to when there are enough resources to create it. As before, only new readers will be blocked on an exhausted semaphore; existing readers can continue to work.	2017-09-20 11:14:35 +03:00
Paweł Dziepak	3e1d09e71d	sstables: do not expect counter shards to be sorted	2017-09-05 10:32:48 +01:00
Tomasz Grabiec	65e488c150	sstables: Fix abort in mutation reader for certain skip pattern The problem happens for the following sequence of events: 1) reader stops in the middle of some partition before it skips to another partition range 2) reader is fast forwarded to a partition range which has no data in the sstable. There are some partitions between the previous partition range and the one we skip to 3) the reader is asked for next partition The problem was that mutation_reader::fast_forward_to() was putting the reader in _read_enabled == false state in step 2, but data_consume_context was not fast forwarded to the range. When in step 3 we were asked for the next partition, we attempted to skip using index (because of 1). The result of the skip was some position which is outside of the current range of data_consume_context, which causes it to abort. To fix, add a check for _read_enabled before we try to skip.	2017-08-28 10:28:15 +02:00
Tomasz Grabiec	dc3c8863f3	sstables: Fix reader returning partition past the query range in some cases If index was used to skip to the next partition (because the current partition wasn't consumed in full) and reader's partition range ends before the data file ends, we did not detect that we're out of range before returning a streamed_mutation. Fix by checking _context.eof() before doing that. Refs #2733.	2017-08-28 10:16:27 +02:00
Paweł Dziepak	7b0f75c0d1	sstables: avoid indirect calls to abstract_type::is_multi_cell()	2017-07-26 14:38:27 +01:00
Paweł Dziepak	28c105e4a7	sstables: avoid copying key components	2017-07-26 14:38:27 +01:00
Paweł Dziepak	e0a04cb7fe	sstables: make sure that fill_buffer() actually fills buffer streamed_mutation::impl::fill_buffer() is supposed to either push mutation fragments to the buffer or set EOS flag. However, it was possible that mp_row_consumer would return proceed::no if a skip was needed without satisfying any of these conditions.	2017-07-26 14:36:36 +01:00
Tomasz Grabiec	a9237c1666	schema: Revert back to the 1.7 layout of static compact tables in memory We are using C* 3.x compatible layout in schema tables but want to keep using the 1.7 layout in memory for compatibility during rolling upgrade. This patch switches the schema and schema_builder classes back to the old layout. Translation of layout happens when converting to/from schema mutations. Notable changes: 1) Includes a revert of commit `6260f31e08` "thrift: Update CQL mapping of static CFs". 2) Brings back the "default_validation_class" schema attribute. In v3 it can be derived from column definitions, but in v2 it can't, so we have to store it. 3) legacy_schema_migrator and schema_builder don't have to do conversions to v3, this is now handled by the v3_columns class. schema_builder works with the same layout as schema, that is v2. 4) Includes a revert of commit `66991a7ccb` "v3 schema test fixes" Fixes #2555.	2017-07-19 09:52:15 +02:00
Nadav Har'El	3018df11b5	Allow reading exactly desired byte ranges and fast_forward_to In commit `c63e88d556`, support was added for fast_forward_to() in data_consume_rows(). Because an input stream's end cannot be changed after creation, that patch ignores the specified end byte, and uses the end of file as the end position of the stream. As a result of this, even when we want to read a specific byte range (e.g., in the repair code to checksum the partitions in a given range), the code reads an entire 128K buffer around the end byte, or significantly more, with read-ahead enabled. This causes repair to do more than 10 times the amount of I/O it really has to do in the checksumming phase (which in the current implementation, reads small ranges of partitions at a time). This patch has two levels: 1. In the lower level, sstable::data_consume_rows(), which reads all partitions in a given disk byte range, now gets another byte position, "last_end". That can be the range's end, the end of the file, or anything in between the two. It opens the disk stream until last_end, which means 1. we will never read-ahead beyond last_end, and 2. fast_forward_to() is not allowed beyond last_end. 2. In the upper level, we add to the various layers of sstable readers, mutation readers, etc., a boolean flag mutation_reader::forwarding, which says whether fast_forward_to() is allowed on the stream of mutations to move the stream to a different partition range. Note that this flag is separate from the existing boolean flag streamed_mutation::forwarding - that one talks about skipping inside a single partition, while the flag we are adding is about switching the partition range being read. Most of the functions that previously accepted streamed_mutation::forwarding now accept also the option mutation_reader::forwarding. The exceptions are functions which are known to read only a single partition, and not support fast_forward_to() a different partition range.
We note that if mutation_reader::forwarding::no is requested, and fast_forward_to() is forbidden, there is no point in reading anything beyond the range's end, so data_consume_rows() is called with last_end as the range's end. But if forwarding::yes is requested, we use the end of the file as last_end, exactly like the code before this patch did. Importantly, we note that the repair's partition reading code, column_family::make_streaming_reader, uses mutation_reader::forwarding::no, while the other existing reading code will use the default forwarding::yes. In the future, we can further optimize the amount of bytes read from disk by replacing forwarding::yes by an actual last partition that may ever be read, and use its byte position as the last_end passed to data_consume_rows. But we don't do this yet, and it's not a regression from the existing code, which also opened the file input stream until the end of the file, and not until the end of the range query. Moreover, such an improvement will not improve anything if the overall range is always very large, in which case not over-reading at its end will not improve performance. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20170619152629.11703-1-nyh@scylladb.com>	2017-06-19 18:31:32 +03:00
Avi Kivity	6e2c9ef9fb	Revert "Allow reading exactly desired byte ranges and fast_forward_to" This reverts commit `317d7fc253` (and also the related `2c57ab84b2`). It causes crashes during range scans, reported by Gleb: "To reproduce I run SELECT * FROM keyspace1.standard1; on typical c-s dataset and 3 node cluster. Backtrace: at /home/gleb/work/seastar/seastar/core/apply.hh:36 rvalue=<unknown type in /home/gleb/work/seastar/build/release/scylla, CU 0x54cf307, DIE 0x55ebf2a>) at /home/gleb/work/seastar/seastar/core/do_with.hh:57 range=std::vector of length 6, capacity 8 = {...}) at /home/gleb/work/seastar/seastar/core/future-util.hh:142 at ./seastar/core/future.hh:890 at /home/gleb/work/seastar/seastar/core/future-util.hh:119 at /home/gleb/work/seastar/seastar/core/future-util.hh:142	2017-06-18 16:10:21 +03:00
Nadav Har'El	317d7fc253	Allow reading exactly desired byte ranges and fast_forward_to In commit `c63e88d556`, support was added for fast_forward_to() in data_consume_rows(). Because an input stream's end cannot be changed after creation, that patch ignores the specified end byte, and uses the end of file as the end position of the stream. As a result of this, even when we want to read a specific byte range (e.g., in the repair code to checksum the partitions in a given range), the code reads an entire 128K buffer around the end byte, or significantly more, with read-ahead enabled. This causes repair to do more than 10 times the amount of I/O it really has to do in the checksumming phase (which in the current implementation, reads small ranges of partitions at a time). This patch has two levels: 1. In the lower level, sstable::data_consume_rows(), which reads all partitions in a given disk byte range, now gets another byte position, "last_end". That can be the range's end, the end of the file, or anything in between the two. It opens the disk stream until last_end, which means 1. we will never read-ahead beyond last_end, and 2. fast_forward_to() is not allowed beyond last_end. 2. In the upper level, we add to the various layers of sstable readers, mutation readers, etc., a boolean flag mutation_reader::forwarding, which says whether fast_forward_to() is allowed on the stream of mutations to move the stream to a different partition range. Note that this flag is separate from the existing boolean flag streamed_mutation::forwarding - that one talks about skipping inside a single partition, while the flag we are adding is about switching the partition range being read. Most of the functions that previously accepted streamed_mutation::forwarding now accept also the option mutation_reader::forwarding. The exceptions are functions which are known to read only a single partition, and not support fast_forward_to() a different partition range.
We note that if mutation_reader::forwarding::no is requested, and fast_forward_to() is forbidden, there is no point in reading anything beyond the range's end, so data_consume_rows() is called with last_end as the range's end. But if forwarding::yes is requested, we use the end of the file as last_end, exactly like the code before this patch did. Importantly, we note that the repair's partition reading code, column_family::make_streaming_reader, uses mutation_reader::forwarding::no, while the other existing reading code will use the default forwarding::yes. In the future, we can further optimize the amount of bytes read from disk by replacing forwarding::yes by an actual last partition that may ever be read, and use its byte position as the last_end passed to data_consume_rows. But we don't do this yet, and it's not a regression from the existing code, which also opened the file input stream until the end of the file, and not until the end of the range query. Moreover, such an improvement will not improve anything if the overall range is always very large, in which case not over-reading at its end will not improve performance. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20170614072122.13473-1-nyh@scylladb.com>	2017-06-15 13:22:46 +01:00
Piotr Jastrzebski	6528f3a963	Make sure mutation_reader for sstables can be fast-forwarded Fixes #2145. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> [tgrabiec: Extracted from a series, fixed title] Message-Id: <1495639745-19387-1-git-send-email-tgrabiec@scylladb.com>	2017-05-24 16:36:24 +01:00
Paweł Dziepak	3ecceaee48	Merge "Fix fast_forward_to() on sstable reader being ignored in some cases" from Tomasz "When mutation reader enters the partition using index, streamed_mutation object is returned to the user before the row start fragment is processed. In that case, when we process the row start, we should ignore it and not call setup_for_partition() again. That may override user's fast_forward_to() request." * 'tgrabiec/fix-initial-fast-forward-to-for-single-key-sstable-readers' of github.com:scylladb/seastar-dev: tests: mutation_source_test: Test forwarding in single-key readers sstables: Remove unused code sstables: mutation_reader: Fix setup_for_partition() being called twice in some cases sstables: Fix verify_end_state() to tolerate ATOM_START_2 state	2017-05-17 15:35:30 +01:00
Tomasz Grabiec	e07cc44af2	sstables: Remove unused code	2017-05-16 13:31:01 +02:00
Tomasz Grabiec	0e23f8aa9b	sstables: mutation_reader: Fix setup_for_partition() being called twice in some cases When the mutation reader enters the partition using the index, the streamed_mutation object is returned to the user before the row start fragment is processed. In that case, when we process the row start, we should ignore it and not call setup_for_partition() again. That may override the user's fast_forward_to() request.	2017-05-16 13:31:01 +02:00
Calle Wilund	6c8b5fc09d	schema_tables: Use v3 schema tables and formats Switches system/schema_* for system_schema/*, updates schema/schema builder and uses to hold/expect v3 style info (i.e. types & dropped).	2017-05-10 16:44:48 +00:00
Tomasz Grabiec	e56711a54d	sstables: mutation_reader: Avoid reading index when restrictions cover whole partition The check for is_static_row() used to be enough, but it no longer is after optimization made in commit `3e06065`, which avoids reading the static row. Message-Id: <1494241164-25810-1-git-send-email-tgrabiec@scylladb.com>	2017-05-09 11:03:18 +01:00
Duarte Nunes	d45596ae8e	sstables: Read and write shadowable tombstones This patch serializes shadowable tombstones to sstables by adding a new, incompatible atom's mask. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2017-04-25 11:46:33 +02:00
Tomasz Grabiec	3472a74de4	sstables: Remove unused code	2017-04-20 11:23:05 +02:00
Tomasz Grabiec	c1059ca8e4	sstables: mutation_reader: Use index_reader::advance_to_next_partition() to skip to next partition It's cheaper than a key-based lookup, so use it when we can.	2017-04-20 11:23:05 +02:00
Tomasz Grabiec	4742008b70	sstables: mutation_reader: Use index_reader for single-partition reads This switches single-partition query to use the index_reader infrastructure. Index lookups via index_reader are faster than find_disk_ranges(). perf_fast_forward, rows: 1000000, value size: 100 Before: Testing forwarding with clustering restriction in a large partition: pk-scan time [s] frags frag/s aio [KiB] blocked dropped idx hit idx miss idx blk cpu no 0.002182 2 916 3 152 2 0 0 1 1 88.1% After: Testing forwarding with clustering restriction in a large partition: pk-scan time [s] frags frag/s aio [KiB] blocked dropped idx hit idx miss idx blk cpu no 0.000758 2 2639 3 152 2 0 0 1 1 48.6% This is also a cleanup, a step towards converting all code to use the index_reader.	2017-04-20 11:23:05 +02:00
Tomasz Grabiec	9d8795089d	sstables: mutation_reader: Add trace-level logging	2017-04-20 11:18:55 +02:00
Tomasz Grabiec	b198c31c46	sstables: mutation_reader: Move partition reading code to sstable_data_source It will be reused for read_row(), which doesn't create a mutation_reader instance, only an sstable_data_source.	2017-04-20 11:18:26 +02:00
Tomasz Grabiec	6e4bca0be6	sstables: mutation_reader: Move definitions out of the class body To make further refactoring easier to review. No functional changes here.	2017-04-20 10:54:38 +02:00
Tomasz Grabiec	4ed7e529db	sstables: Move binary_search() to a header There are instantiations of binary_search() used in sstables.cc, but defined in partition.cc. The instantiations are explicitly declared in partition.cc, but the types changed and they became obsolete. The thing worked because partition.cc also instantiated it with the right type. But after that code will be removed, it no longer would, and we would get a linker error. To avoid such problems, define binary_search() in a header.	2017-04-20 10:54:38 +02:00
Tomasz Grabiec	3e8795494e	sstables: mutation_reader: Advance to next partition using index in some cases To produce a streamed_mutation for the next partition, we need to read its key and the tombstone. Currently we always do that by consuming the partition header from the data file. In some cases that may cause unnecessary IO. It's better to obtain partition information from the index if we already have it. We can save on IO if the user will skip past the front of partition immediately after. It is also better to pay the cost of reading the index if we know that we will need to use the index anyway soon. This patch predicts that by checking if there are any clustering restrictions. If there are any, we will almost surely need_skip() and use the index anyway. This change also lays the ground for unification of multi and single partition queries without loss of performance.	2017-04-20 10:54:37 +02:00
Tomasz Grabiec	ae72c159b1	sstables: index_reader: Introduce promoted_index_view So that we have a nice way of extracting the tombstone out of it. We do not always need a fully parsed index.	2017-04-20 10:54:37 +02:00
Tomasz Grabiec	0ef33b7f29	sstables: mutation_reader: Move _index_in_current to sstable_data_source sstable_data_source holds a shared state between mutation_reader and streamed_mutation for sstables. The information whether index is in current partition will have to be accessed by both in the following patches.	2017-04-20 10:54:37 +02:00
Tomasz Grabiec	885f53d905	sstables: mutation_reader: Avoid resetting the walker Before the change, the following scenario was happening: 1) we try to skip based on clustering restrictions 2) we find the page and fast forward to it, recording walker's lower bound counter 3) we read the first fragment, it's not a tombstone, so we reset the walker, and its lower bound counter too 4) the fragment is not in range (the range starts in the middle of the page) 5) needs_skip() is true, we redo the index lookup, which wastes some CPU This change fixes the problem by avoiding resetting the walker. We can do that because leading tombstones are checked with a non-mutable contains_tombstone()	2017-04-20 10:54:37 +02:00
Tomasz Grabiec	b030ce693d	sstables: mutation_reader: Don't try to read index to skip to static row Static row is always at the beginning, there's no point in doing index lookups.	2017-04-20 10:54:37 +02:00
Tomasz Grabiec	3e060659f1	sstables: mutation_reader: Don't try to read static row if table doesn't have any	2017-04-20 10:54:37 +02:00
Tomasz Grabiec	77d3e30239	sstables: mutation_reader: Use index to skip across clustering restrictions Improves scans with clustering restrictions. Before the change such scans would scan whole partition. Below are results of a test case from perf_fast_forward which selects few rows from a large partition using query restrictions (not fast forwarding). Before: stride rows time [s] frags frag/s aio [KiB] blocked dropped idx hit idx miss idx blk cpu 1000000 1 0.000609 1 1642 3 152 2 1 0 1 1 38.0% 500000 2 0.242255 2 8 511 64152 398 4 0 1 1 98.6% 250000 4 0.281592 4 14 749 95832 564 4 0 1 1 98.4% 125000 8 0.328056 8 24 873 111704 657 4 0 1 1 98.4% 62500 16 0.306700 16 52 935 119640 751 4 0 1 1 99.4% After: stride rows time [s] frags frag/s aio [KiB] blocked dropped idx hit idx miss idx blk cpu 1000000 1 0.000711 1 1406 3 152 2 1 0 1 1 42.1% 500000 2 0.000910 2 2197 5 216 3 2 0 1 1 39.2% 250000 4 0.001384 4 2891 9 344 5 4 0 1 1 35.3% 125000 8 0.003197 8 2502 21 728 13 8 0 1 1 53.1% 62500 16 0.006664 16 2401 41 1368 25 16 0 1 1 58.2%	2017-04-20 10:54:37 +02:00


201 Commits