scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-01 12:36:56 +00:00

Author	SHA1	Message	Date
Piotr Jastrzebski	7434be348c	sstables: Support reading range_tombstones Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com> Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2018-08-22 18:27:41 +02:00
Piotr Jastrzebski	d19a108d87	sstables: Support null columns in ck Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2018-08-22 14:32:10 +02:00
Piotr Jastrzebski	3636697663	sstables: Add consumer_m::consume_range_tombstone Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2018-08-22 12:53:15 +02:00
Vladimir Krivopalov	c8422c9a91	sstables: Add operator<< overload for bound_kind_m. Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>	2018-08-20 16:22:53 -07:00
Vladimir Krivopalov	f1b9f82ff5	sstables: Use std::optional instead of std::experimental::optional. Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>	2018-08-17 18:20:05 -07:00
Vladimir Krivopalov	3e92434eed	sstables: Support skipping inside wide partitions using index. This fix adds proper support for skipping inside wide partitions using index for sliced reads. This significantly reduces disk I/O for filtered queries. Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>	2018-08-17 18:20:04 -07:00
Vladimir Krivopalov	4bf1e9de3f	sstables: Support resetting data_consume_rows_context_m to indexable_element::cell. Set the proper parsing state when resetting to indexable_element::cell. Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>	2018-08-17 10:09:19 -07:00
Avi Kivity	b443a9b930	compaction: demote compaction start/end messages to DEBUG level Compactions start and end all the time, especially with many shards, and don't contribute much to understanding what is going on these days. Compaction throughput is available through the metrics and other information is available via the compaction history table. Demote compaction start and end messages to DEBUG level to keep the log clean. Cleaning and resharding compactions are kept as INFO, at least for now, since they are manual operations and therefore rarer. Message-Id: <20180724132859.14109-1-avi@scylladb.com>	2018-07-25 09:53:39 +01:00
Vladimir Krivopalov	ec7f853f49	sstables: Do not pass liveness_info to consume_row_end(). The liveness_info is unconditionally added to the _in_progress_row as of commit `cbfc741d70` so no need to pass it to consume_row_end() and add conditionally. Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com> Message-Id: <7cd3e599817cbd4b857c3295153602cd2b9a6ef1.1532311852.git.vladimir@scylladb.com>	2018-07-23 13:10:36 +03:00
Piotr Jastrzebski	abf3fc1b98	sstables: Fix ck filtering and fast forwarding Both were broken before this change. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>	2018-07-20 16:34:37 -07:00
Piotr Jastrzebski	564bcfa4d0	sstables: Introduce mutation_fragment_filter This class encapsulates the logic related to clustering key filtering and fast forwarding. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>	2018-07-20 16:19:07 -07:00
Vladimir Krivopalov	4d3467d793	sstables: Add getter for end_open_marker to index_reader. Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>	2018-07-20 13:51:13 -07:00
Vladimir Krivopalov	5561c713d9	sstables: Do not seek through the promoted index for static row positions. Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>	2018-07-20 13:51:13 -07:00
Vladimir Krivopalov	917528c427	sstables: Read promoted index stored in SSTables 3.x ('mc') format. Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>	2018-07-20 13:51:13 -07:00
Vladimir Krivopalov	86d14f8166	sstables: Make promoted_index_block support clustering keys for both ka/la and mc formats. This is a pre-requisite for parsing promoted index blocks written in SSTables 'mc' format. Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>	2018-07-20 13:51:13 -07:00
Vladimir Krivopalov	997ebaaa14	sstables: Support reading signed vints in continuous_data_consumer. Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>	2018-07-20 13:50:17 -07:00
Vladimir Krivopalov	540dfcc9bf	sstables: Factor out the code building a vector of fixed clustering values lengths. This code will be re-used in promoted_index_blocks_parser to parse clustering key prefixes from SSTables 3.x format. Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>	2018-07-20 13:50:17 -07:00
Vladimir Krivopalov	741d5f3b5d	sstables: Remove unused includes from index_entry.hh Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>	2018-07-20 13:50:17 -07:00
Vladimir Krivopalov	f50ffa267f	sstables: Support parsing index entries from SSTables 3.x format. With this patch, index_reader is capable of reading index_entries from both 'ka'/'la' and 'mc' formats. Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>	2018-07-20 13:50:17 -07:00
Piotr Jastrzebski	d0f8c71e28	sstables: move bound_kind_m to header and add helper methods. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2018-07-20 13:50:17 -07:00
Tomasz Grabiec	ef4fb1f91d	sstables: mp_row_consumer_m: Add trace-level logging Very useful for debugging. The old mp_row_consumer_k_l had this. Message-Id: <1532000326-28649-1-git-send-email-tgrabiec@scylladb.com>	2018-07-19 14:58:00 +03:00
Avi Kivity	ef9b36376c	Merge "database: support multiple data directories" from Glauber " While Cassandra supports multiple data directories, we have been historically supporting just one. The one-directory model suits us better because of the I/O Scheduler and so far we have seen very few requests -- if any, to support this. Still, the infrastructure needed to support multiple directories can be beneficial so I am trying to bring this in. For simplicity, we will treat the first directory in the list as the main directory. By being able to still associate one singular directory with a table, most of the code doesn't have to change and we don't have to worry about how to distribute data between the directories. In this design: - We scan all data directories for existing data. - resharding only happens within a particular data directory. - snapshot details are accumulated with data for all directories that host snapshots for the tables we are examining - snapshots are created with files in its own directories, but the manifest file goes to the main directory. For this one, note that in Cassandra the same thing happens, except that there is no "main" directory. Still the manifest file is still just in one of them. - SSTables are flushed into the main directory. - Compactions write data into the main directory Despite the restrictions, one example of usage of this is recovery. If we have network attached devices for instance, we can quickly attach a network device to an existing node and make the data immediately available as it is compacted back to main storage. 
Tests: unit (release) " * 'multi-data-file-v2' of github.com:glommer/scylla: database: change ident database: support multiple data directories database: allow resharing to specify a directory database: support multiple directories in get_snapshot_details database: move get_snapshot_info into a seastar::thread snapshots: always create the snapshot directory sstables: pass sstable dir with entry descriptor database: make nodetool listsnapshots print correct information sstables: correctly create descriptors for snapshots	2018-07-15 13:31:04 +03:00
Vladimir Krivopalov	cf7b42619d	clustering_ranges_walker: Improve class consistency and readability. This patch addresses several issues. 1. The class no longer uses the placement-new trick for move-assignment. It was incorrect to use because the class contains const references and re-initializing the same region of memory would result in undefined behaviour on accessing these members. 2. Use boost::iterator_range for tracking the current range of cr_ranges. It is easier to deal with and avoids possible bugs like assigning only one of two iterators. Message-Id: <4096182c4ee2fb1157e135c487c41012b266ba69.1531440684.git.vladimir@scylladb.com>	2018-07-13 11:23:33 +02:00
Tomasz Grabiec	b17f7257a9	sstables: index_reader: Reduce size of index_entry by indirecting promoted_index Reduces size of index_entry from 384 bytes to 64 bytes by using indirection for the optional promoted index instead of embedding it. Improves query time from 9ms to 4ms in a micro benchmark with a very large index page. Message-Id: <1531406354-10089-1-git-send-email-tgrabiec@scylladb.com>	2018-07-12 17:46:58 +03:00
Avi Kivity	a4a2f743a8	Merge "Avoid large allocations when reading sstable index pages" from Tomasz " If there is a lot of partitions in the index page, index_list may grow large and require large contiguous blocks of memory, because it's based on std::vector. That puts pressure on the memory allocator, and if memory is fragmented, may not be possible to satisfy without a lot of eviction. Switch to chunked_vector to avoid this. Refs #3597 " * 'tgrabiec/avoid-large-alloc-in-index-reader' of github.com:tgrabiec/scylla: sstables: Switch index_list to chunked_vector to avoid large allocations utils: chunked_vector: Do not require T to be default-constructible for clear() utils: chunked_vector: Implement front()	2018-07-12 15:30:18 +03:00
Tomasz Grabiec	3b2890e1db	sstables: Switch index_list to chunked_vector to avoid large allocations If there is a lot of partitions in the index page, index_list may grow large and require large contiguous blocks of memory. That puts pressure on the memory allocator, and if memory is fragmented, may not be possible to satisfy without a lot of eviction.	2018-07-11 16:55:20 +02:00
Piotr Jastrzebski	54fc6dde35	sstables: Support deleted cells in reading SST3 Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2018-07-10 10:03:29 +02:00
Glauber Costa	86239e4e22	sstables: pass sstable dir with entry descriptor We have been assuming that all SSTables for a table will be in the same directory, and we pass the directory name to make_descriptor only because that's the way in ka and la to find out the keyspace and table names. However, SSTables for a given column family could be spread into multiple directories. So let's pass it down with the descriptor so we can load from the right place. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-07-05 16:45:26 -04:00
Glauber Costa	4a62866104	sstables: correctly create descriptors for snapshots Our regular expression for parsing SSTable files tests for the directory for the la file format, since that file format does not include the ks/cf pair in the file name itself. However, the regular expression does not cover the case in which the SSTable files are coming from snapshots. This patch extends the regex so they are also covered. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2018-07-05 16:19:09 -04:00
Raphael S. Carvalho	dfd1e1229e	sstables/compaction_manager: fix typo in function name to reevaluate postponed compaction Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20180702185343.26682-1-raphaelsc@scylladb.com>	2018-07-05 18:54:14 +03:00
Avi Kivity	f4caa418ff	Merge "Fix the "LCS data-loss bug"" from Botond " This series fixes the "LCS data-loss bug" where full scans (and everything that uses them) would miss some small percentage (> 0.001%) of the keys. This could easily lead to permanent data-loss as compaction and decommission both use full scans. `aeffbb673` worked around this bug by disabling the incremental reader selectors (the class identified as the source of the bug) altogether. This series fixes the underlying issue and reverts `aeffbb673`. The root cause of the bug is that the `incremental_reader_selector` uses the current read position to poll for new readers using `sstable_set::incremental_selector::select()`. This means that when the currently open sstables contain no partitions that would intersect with some of the yet unselected sstables, those sstables would be ignored. Solve the problem by not calling `select()` with the current read position and always pass the `next_position` returned in the previous call. This means that the traversal of the sstable-set happens at a pace defined by the sstable-set itself and this guarantees that no sstable will be jumped over. When asked for new readers the `incremental_reader_selector` will now iteratively call `select()` using the `next_position` from the previous `select()` call until it either receives some new, yet unselected sstables, or `next_position` surpasses the read position (in which case `select()` will be tried again later). The `sstable_set::incremental_selector` was not suitable in its present state to support calling `select()` with the `next_position` from a previous call as in some cases it could not make progress due to inclusiveness related ambiguities. So in preparation to the above fix `sstable_set` was updated to work in terms of ring-position instead of tokens. Ring-position can express positions in a much more fine-grained way than token, including positions after/before tokens and keys. 
This allows for a clear expression of `next_position` such that calling `select()` with it guarantees forward progress in the token-space. Tests: unit(release, debug) Refs: #3513 " * 'leveled-missing-keys/v4' of https://github.com/denesb/scylla: tests/mutation_reader_test: combined_mutation_reader_test: use SEASTAR_THREAD_TEST_CASE tests/mutation_reader_test: refactor combined_mutation_reader_test tests/mutation_reader_test: fix reader_selector related tests Revert "database: stop using incremental selectors" incremental_reader_selector: don't jump over sstables mutation_reader: reader_selector: use ring_position instead of token sstables_set::incremental_selector: use ring_position instead of token compatible_ring_position: refactor to compatible_ring_position_view dht::ring_position_view: use token_bound from ring_position i_partitioner: add free function ring-position tri comparator mutation_reader_merger::maybe_add_readers(): remove early return mutation_reader_merger: get rid of _key	2018-07-05 09:33:12 +03:00
Raphael S. Carvalho	7d6af5da3a	sstables/compaction_manager: properly reevaluate postponed compactions for leveled strategy Function to reevaluate postponed compaction was called too early for strategies that don't allow parallel compaction (only leveled strategy (LCS) at this moment). Such strategies must first have the ongoing compaction deregistered before reevaluating the postponed ones. Manager uses task list of ongoing compaction to decide if there's ongoing compaction for a given column family. So compaction could stop making progress at all if and only if we stop flushing new data. So it could happen that a column family would be left with lots of pending compaction, leading the user to think all compacting is done, but after reboot, there will be lots of compaction activity. We'll both improve method to detect parallel compaction here and also add a call to reevaluate postponed compaction after compaction is done. Fixes #3534. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20180702185327.26615-1-raphaelsc@scylladb.com>	2018-07-04 16:30:21 +01:00
Botond Dénes	a8e795a16e	sstables_set::incremental_selector: use ring_position instead of token Currently `sstable_set::incremental_selector` works in terms of tokens. Sstables can be selected with tokens and internally the token-space is partitioned (in `partitioned_sstable_set`, used for LCS) with tokens as well. This is problematic for several reasons. The sub-range sstables cover from the token-space is defined in terms of decorated keys. It is even possible that multiple sstables cover multiple non-overlapping sub-ranges of a single token. The current system is unable to model this and will at best result in selecting unnecessary sstables. The usage of token for providing the next position where the intersecting sstables change [1] causes further problems. Attempting to walk over the token-space by repeatedly calling `select()` with the `next_position` returned from the previous call will quite possibly lead to an infinite loop as a token cannot express inclusiveness/exclusiveness and thus the incremental selector will not be able to make progress when the upper and lower bounds of two neighbouring intervals share the same token with different inclusiveness e.g. [t1, t2](t2, t3]. To solve these problems update incremental_selector to work in terms of ring position. This makes it possible to partition the token-space among sstables at decorated key granularity. It also makes it possible for select() to return a next_position that is guaranteed to make progress. partitioned_sstable_set now builds the internal interval map using the decorated key of the sstables, not just the tokens. incremental_selector::select() now uses `dht::ring_position_view` as both the selector and the next_position. ring_position_view can express positions between keys so it can also include information about inclusiveness/exclusiveness of the next interval guaranteeing forward progress. [1] `sstable_set::incremental_selector::selection::next_position`	2018-07-04 17:42:33 +03:00
Botond Dénes	bf2645c616	compatible_ring_position: refactor to compatible_ring_position_view compatible_ring_position's sole purpose is to allow creating boost::icl::interval_map with dht::ring_position as the key and list of sstables as the value. This function is served equally well if compatible_ring_position wraps a `dht::ring_position_view` instead of a `dht::ring_position` with the added benefit of not having to copy the possibly heavy `dht::decorated_key` around. It also makes it possible to do lookups with `dht::ring_position_view` which is much more versatile and allows avoiding copies just to make lookups. The only downside is that `dht::ring_position_view` requires the lifetime of the "viewed" object to be taken care of. This is not a concern however, as so long as an interval is present in the map the represented sstable is guaranteed to be alive too, as the interval map participates in the ownership of the stored sstables. Rename compatible_ring_position to compatible_ring_position_view to reflect the changes. While at it upgrade the std::experimental::optional to std::optional.	2018-07-04 08:19:39 +03:00
Vladimir Krivopalov	b24eb5c11d	sstables: Remove "lower_" from index_reader public methods. The index_reader class public interface has been amended to only deal with the upper bound cursor along with advancing the lower bound. Since the class users can only explicitly operate with the lower bound cursor (take data file position, advance to the next partition, etc), it no longer makes sense to specify that the method operates on the lower bound cursor in its name. Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>	2018-06-29 11:48:33 -07:00
Vladimir Krivopalov	30109a693b	sstables: Make index_reader::advance_upper_past() method private. No changes made to the code except that it is moved around. Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>	2018-06-29 11:47:48 -07:00
Vladimir Krivopalov	80d1d5017f	sstables: Stop using index_reader::advance_upper_past() outside the class. The only case when it needs to be called is when an index_reader is advanced to a specific partition as part of sstable_reader initialisation. Instead, we're passing an optional upper_bound parameter that is used to call advance_upper_past() internally if partition is found. Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>	2018-06-29 11:47:20 -07:00
Vladimir Krivopalov	a497edcbda	sstables: Move promoted_index_block from types.hh to index_entry.hh. It is only being used by index_reader internally and never exposed so should not be listed in commonly used types. Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>	2018-06-28 12:28:59 -07:00
Vladimir Krivopalov	81fba73e9d	sstables: Factor out promoted index into a separate class. An index entry may or may not have a promoted index. All the optional fields are better scoped under the same class to avoid lots of separate optional fields and give better representation. Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>	2018-06-28 12:28:59 -07:00
Vladimir Krivopalov	fc629b9ca6	sstables: Use std::optional instead of std::experimental optional in index_reader. Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>	2018-06-27 16:47:53 -07:00
Calle Wilund	054514a47a	sstables::compress: Ensure unqualified compressor name if possible Fixes #3546 Both older origin and scylla writes "known" compressor names (i.e. those in origin namespace) unqualified (i.e. LZ4Compressor). This behaviour was not preserved in the virtualization change. But probably should be. Message-Id: <20180627110930.1619-1-calle@scylladb.com>	2018-06-27 14:16:50 +03:00
Avi Kivity	cb549c767a	database: rename column_family to table The name "column_family" is both awkward and obsolete. Rename to the modern and accurate "table". An alias is kept to avoid huge code churn. To prevent a One Definition Rule violation, a preexisting "table" type is moved to a new namespace row_cache_stress_test. Tests: unit (release) Message-Id: <20180624065238.26481-1-avi@scylladb.com>	2018-06-24 14:54:46 +03:00
Tomasz Grabiec	2d4177355a	Merge "Support for writing range tombstones to SSTables 3.x" from Vladimir This patchset brings support for writing range tombstones to SSTables 3.x ('mc' format). In SSTables 3.x, range tombstones are represented by so-called range tombstone markers (hereafter RT markers) that denote range tombstone start and end bounds. So each range tombstone is represented in data file by two ordered RT markers. There are also markers that both close the previous range tombstone and open the new one in case two range tombstones are adjacent. This is done to consume less disk space on such occasions. Range tombstones written as RT markers are naturally non-overlapping. * github.com:argenet/scylla projects/sstables-30/write-range-tombstones/v6 range_tombstone_stream: Remove an unused boolean flag. Revert "Add missing enum values to bound_kind." sstables: Move to_deletion_time helper up and make it static. sstables: Write end-of-partition byte before flushing the last index block. sstables: Add support for writing range tombstones in SSTables 3.x format. tests: Add unit test covering simple range tombstone. tests: Add unit test covering adjacent range tombstones. tests: Add test to cover non-adjacent RTs. tests: Add test covering mixed rows and range tombstones. tests: Add test covering SSTables 3.x with many RTs. tests: Add unit test covering overlapping RTs and rows. tests: Add tests writing a range tombstone and a row overlapping with its start. tests: Add tests writing a range tombstone and a row overlapping with its end. tests: Add function that writes from multiple memtable into SSTables. tests: Add test where 2nd range tombstone covers the remainder of the 1st one. tests: Add test writing two non-adjacent range tombstones with same clustering key prefix at their bounds. tests: Add test covering overlapped range tombstones.	2018-06-22 15:47:18 +02:00
Vladimir Krivopalov	5559fc2121	sstables: Add support for writing range tombstones in SSTables 3.x format. For SSTables 3.x ('mc' format), range tombstones are represented by their bounds that are written to the data file as so-called RT markers. For adjacent range tombstones, an RT marker can be of a 'boundary' type which means it closes the previous range tombstone and opens the new one. Internally, sstable_writer_m relies on range_tombstone_stream to both de-overlap incoming range tombstones and order them so that when they are drained they can be easily thought of as just pairs of their bounds.	2018-06-20 18:08:36 -07:00
Avi Kivity	b97e1aeff5	Merge "Consume row marker correctly" from Piotr " Make sure we properly handle row marker and row tombstone when reading a row. Tests: unit {release} " * 'haaawk/sstables3/read-liveness-info-v4' of ssh://github.com/scylladb/seastar-dev: sstable: consume row marker in data_consume_rows_context_m sstable: Add consumer_m::consume_row_marker_and_tombstone sstable: add is_set and to_row_marker to liveness_info	2018-06-20 14:44:03 +03:00
Piotr Jastrzebski	75edaff7b6	sstable: consume row marker in data_consume_rows_context_m Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2018-06-20 13:13:29 +02:00
Piotr Jastrzebski	cbfc741d70	sstable: Add consumer_m::consume_row_marker_and_tombstone Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>	2018-06-20 13:13:16 +02:00
Tomasz Grabiec	5548eb96f7	Merge "store prepared statements parameters values" from Vlad * https://github.com/vladzcloudius/scylla.git tracing_prepared_parameters-v6: cql3::query_options: add get_names() method tracing::trace_state: hide the internals of params_values tracing: store queries statements for BATCH tracing: store the prepared statements parameters values	2018-06-19 19:12:26 +02:00
Vladimir Krivopalov	100eb03f29	sstables: Write end-of-partition byte before flushing the last index block. This is to stay compliant with the Origin for SSTables 3.x. It differs from SSTables 2.x (ka/la) as for those the last promoted index block is pushed first and the end-of-partition byte is written after. Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>	2018-06-18 14:28:25 -07:00
Vladimir Krivopalov	ad0b911b03	sstables: Move to_deletion_time helper up and make it static. It is used for writing end_open_marker for promoted index. Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>	2018-06-18 14:25:13 -07:00
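
The selector fix described in merge f4caa418ff ("Fix the "LCS data-loss bug"") can be illustrated with a small standalone model. This is only a sketch under simplifying assumptions, not the ScyllaDB implementation: plain integers stand in for `dht::ring_position_view`, and `sstable_range`, `selection`, `select_at` (modelling `select()`), `selector_walker` and `new_sstables_up_to` are hypothetical names introduced for this example. It shows the loop the commit message describes: keep calling the selector with the `next_position` from the previous call until new sstables appear or `next_position` passes the read position.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <set>
#include <vector>

// Hypothetical model: an sstable covering the inclusive range [first, last].
struct sstable_range {
    int id;
    int first;
    int last;
};

// Result of one selector call: the sstables covering the queried position and
// the next position at which the covering set may change.
struct selection {
    std::vector<int> sstables;
    int next_position;
};

constexpr int end_of_ring = std::numeric_limits<int>::max();

// Stand-in for select(): a linear scan instead of an interval map.
// Assumes every `last` is well below INT_MAX, so `last + 1` cannot overflow.
selection select_at(const std::vector<sstable_range>& set, int pos) {
    selection s{{}, end_of_ring};
    for (const auto& sst : set) {
        if (sst.first <= pos && pos <= sst.last) {
            s.sstables.push_back(sst.id);
        }
        if (sst.first > pos) {
            s.next_position = std::min(s.next_position, sst.first);
        }
        if (sst.last >= pos) {
            s.next_position = std::min(s.next_position, sst.last + 1);
        }
    }
    return s;
}

// Walks the set at the pace dictated by next_position, never by the read
// position, so no sstable is jumped over.
class selector_walker {
    const std::vector<sstable_range>& _set;
    int _next_position = std::numeric_limits<int>::min();
    std::set<int> _selected;
public:
    explicit selector_walker(const std::vector<sstable_range>& set) : _set(set) {}

    // Returns sstables not selected before; stops as soon as some are found
    // or once next_position has moved past read_position.
    std::vector<int> new_sstables_up_to(int read_position) {
        std::vector<int> fresh;
        while (fresh.empty() && _next_position != end_of_ring
               && _next_position <= read_position) {
            selection s = select_at(_set, _next_position);
            _next_position = s.next_position;
            for (int id : s.sstables) {
                if (_selected.insert(id).second) {
                    fresh.push_back(id);
                }
            }
        }
        return fresh;
    }
};
```

With sstables covering [0, 10], [20, 30] and [50, 60], a reader whose position jumps from 5 to 55 still picks up the middle sstable on its next call to `new_sstables_up_to(55)`, whereas a direct `select_at(set, 55)` would return only the last one. That skipped sstable is the failure mode the series describes.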


1519 Commits