scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-31 03:56:42 +00:00

Author	SHA1	Message	Date
Tomasz Grabiec	66292c0ef0	sstables: Fix bug in promoted index generation maybe_flush_pi_block, which is called for each cell, assumes that block_first_colname will be empty when the first cell is encountered for each partition. This didn't hold after writing partition which generated no index entry, because block_first_colname was cleared only when there was any data written into the promoted index. Fix by always clearing the name. The effect was that the promoted index entry for the next partition would be flushed sooner than necessary (still counting since the start of the previous partition) and with offset pointing to the start of the current partition. This will cause parsing error when such sstable is read through promoted index entry because the offset is assumed to point to a cell not to partition start. Fixes #1567 Message-Id: <1470909915-4400-1-git-send-email-tgrabiec@scylladb.com> (cherry picked from commit `f1c2481040`)	2016-08-11 13:09:05 +03:00
Nadav Har'El	0b9f83c6b6	sstable: avoid copying non-existent value The promoted-index reading code contained a bug where it copied the value of a disengaged optional (this non-value was never used, but it was still copied). Fix it by keeping the optional<> as such longer. This bug caused tests/sstable_test in the debug build to crash (the release build somehow worked). Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <1470742418-8813-1-git-send-email-nyh@scylladb.com> (cherry picked from commit `e005762271`)	2016-08-10 13:14:49 +03:00
Nadav Har'El	0475a98de1	Avoid some warnings in debug build The sanitizer of the debug build warns when a "bool" variable is read when containing a value not 0 or 1. In particular, if a class has an uninitialized bool field, which class logic allows to only be set later, then "move"ing such an object will read the uninitialized value and produce this warning. This patch fixes four of these warnings seen in sstable_test by initializing some bool fields to false, even though the code doesn't strictly need this initialization. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <1470744318-10230-1-git-send-email-nyh@scylladb.com> (cherry picked from commit `c2e4f5ba16`)	2016-08-09 17:54:54 +03:00
Nadav Har'El	0b69e37065	Fix failing tests Commit `0d8463aba5` broke some of the tests with an assertion failure about local_is_initialized(). It turns out that there is more than one level of local_is_initialized() we need to check... For some tests, neither locals were initialized, but for others, one was and the other wasn't, and the wrong one was tested. With this patch, all unit tests except "flush_queue_test.cc" pass on my machine. I doubt this test is relevant to the promoted index patches, but I'll continue to investigate it. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <1470695199-32649-1-git-send-email-nyh@scylladb.com> (cherry picked from commit `bce020efbd`)	2016-08-09 17:54:49 +03:00
Avi Kivity	dc6be68852	Merge "promoted index for reading partial partitions" from Nadav "The goal of this patch series is to support reading and writing of a "promoted index" - the Cassandra 2.* SSTable feature which allows reading only a part of the partition without needing to read an entire partition when it is very long. To make a long story short, a "promoted index" is a sample of each partition's column names, written to the SSTable Index file with that partition's entry. See a longer explanation of the index file format, and the promoted index, here: https://github.com/scylladb/scylla/wiki/SSTables-Index-File There are two main features in this series - first enabling reading of parts of partitions (using the promoted index stored in an sstable), and then enable writing promoted indexes to new sstables. These two features are broken up into smaller stand-alone pieces to facilitate the review. Three features are still missing from this series and are planned to be developed later: 1. When we fail to parse a partition's promoted index, we silently fall back to reading the entire partition. We should log (with rate limiting) and count these errors, to help in debugging sstable problems. 2. The current code only uses the promoted index when looking for a single contiguous clustering-key range. If the ck range is non-contiguous, we fall back to reading the entire partition. We should use the promoted index in that case too. 3. The current code only uses the promoted index when reading a single partition, via sstable::read_row(). When scanning through all or a range of partitions (read_rows() or read_range_rows()), we do not yet use the promoted index; We read contiguously from data file (we do not even read from the index file, so unsurprisingly we can't use it)." (cherry picked from commit `700feda0db`)	2016-08-09 17:54:15 +03:00
Avi Kivity	8c20741150	Revert "sstables: promoted index write support" This reverts commit `c0e387e1ac`. The full patchset needs to be backported instead.	2016-08-09 17:53:24 +03:00
Avi Kivity	3e3eaa693c	Revert "Fix failing tests" This reverts commit `8d542221eb`. It is needed, but prevents another revert from taking place. Will be reinstated later	2016-08-09 17:52:57 +03:00
Avi Kivity	03ef0a9231	Revert "Avoid some warnings in debug build" This reverts commit `47bf8181af`. It is needed, but prevents another revert from taking place. Will be reinstated later.	2016-08-09 17:52:09 +03:00
Nadav Har'El	47bf8181af	Avoid some warnings in debug build The sanitizer of the debug build warns when a "bool" variable is read when containing a value not 0 or 1. In particular, if a class has an uninitialized bool field, which class logic allows to only be set later, then "move"ing such an object will read the uninitialized value and produce this warning. This patch fixes four of these warnings seen in sstable_test by initializing some bool fields to false, even though the code doesn't strictly need this initialization. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <1470744318-10230-1-git-send-email-nyh@scylladb.com> (cherry picked from commit `c2e4f5ba16`)	2016-08-09 16:58:27 +03:00
Nadav Har'El	8d542221eb	Fix failing tests Commit `0d8463aba5` broke some of the tests with an assertion failure about local_is_initialized(). It turns out that there is more than one level of local_is_initialized() we need to check... For some tests, neither locals were initialized, but for others, one was and the other wasn't, and the wrong one was tested. With this patch, all unit tests except "flush_queue_test.cc" pass on my machine. I doubt this test is relevant to the promoted index patches, but I'll continue to investigate it. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <1470695199-32649-1-git-send-email-nyh@scylladb.com> (cherry picked from commit `bce020efbd`)	2016-08-09 16:58:27 +03:00
Nadav Har'El	c0e387e1ac	sstables: promoted index write support This patch adds writing of promoted index to sstables. The promoted index is basically a sample of columns and their positions for large partitions: The promoted index appears in the sstable's index file for partitions which are larger than 64 KB, and divides the partition to 64 KB blocks (as in Cassandra, this interval is configurable through the column_index_size_in_kb config parameter). Beyond modifying the index file, having a promoted index may also modify the data file: Since each of blocks may be read independently, we need to add in the beginning of each block the list of range tombstones that are still open at that position. See also https://github.com/scylladb/scylla/wiki/SSTables-Index-File Fixes #959 Signed-off-by: Nadav Har'El <nyh@scylladb.com> (cherry picked from commit `0d8463aba5`)	2016-08-09 16:58:27 +03:00
Paweł Dziepak	99dfbedf36	sstables: extend sstable life until reader is fully closed data_consume_rows_context needs to have close() called and the returned future waited for before it can be destroyed. data_consume_context::impl does that in the background upon its destruction. However, it is possible that the sstable is removed before data_consume_rows_context::close() completes in which case EBADF may happen. The solution is to make data_consume_context::impl keep a reference to the sstable and extend its life time until closing of data_consume_rows_context (which is performed in the background) completes. Side effect of this change is also that data_consume_context no longer requires its user to make sure that the sstable exists as long as it is in use since it owns its own reference to it. Fixes #1537. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com> Message-Id: <1470222225-19948-1-git-send-email-pdziepak@scylladb.com> (cherry picked from commit `02ffc28f0d`)	2016-08-03 13:19:50 +02:00
Tomasz Grabiec	b224ff6ede	Merge 'pdziepak/row-cache-wide-entries/v4' from seastar-dev.git This series adds the ability for partition cache to keep information whether partition size makes it uncacheable. During reads, these entries save us IO operations since we already know that the partition is too big to be put in the cache. First part of the patchset makes all mutation_readers allow the streamed_mutations they produce to outlive them, which is a guarantee used later by the code handling reading large partitions. (cherry picked from commit `d2ed75c9ff`)	2016-08-02 20:24:29 +02:00
Duarte Nunes	ff8a795021	sstables: Validate static cell is on static column This patch enforces compatibility between a cell and the corresponding column definition with regards to them being static. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-28 12:11:46 +02:00
Duarte Nunes	5ad0448cc9	sstables: Don't assume cell name is compound The current code assumes cell names are always compound and may wrongly report a non-static row as such, since it looks at the first bytes of the name assuming they are the component's length. Tables with compact storage (which cannot contain static rows) may not have a compound comparator, so we check for the table's compoundness before checking for the static marker. We do this by delegating to composite_view::is_static. Fixes #1495 Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1469616205-4550-4-git-send-email-duarte@scylladb.com>	2016-07-28 12:11:27 +02:00
Duarte Nunes	35ab2cadc2	sstables: Remove duplication in extract_clustering_key This patch removes some duplicated code in extract_clustering_key(), which is already handled in composite_view. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1469397806-8067-1-git-send-email-duarte@scylladb.com>	2016-07-28 12:11:22 +02:00
Duarte Nunes	a1cee9f97c	sstables: Remove superfluous call to check_static() When building a column we're calling check_static() two times; refactor things a bit so that this doesn't happen and we reuse the previous calculation. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1469397748-7987-1-git-send-email-duarte@scylladb.com>	2016-07-28 12:11:15 +02:00
Paweł Dziepak	07d5e939be	sstables: avoid recursion in sstable_streamed_mutation::read_next() Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com> (cherry picked from commit `04f2c278c2`)	2016-07-27 14:06:03 +03:00
Paweł Dziepak	a2a5a22504	sstables: protect against duplicated range tombstones Promoted index may cause sstable to have range tombstones duplicated several times. These duplicates appear in the "wrong" place since they are smaller than the entity preceding them. This patch ignores such duplicates by skipping range tombstones that are smaller than previously read ones. Moreover, these duplicated range tombstones may appear in the middle of clustering row, so the sstable reader has also gained the ability to merge parts of the row in such cases. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com> (cherry picked from commit `08032db269`)	2016-07-27 14:05:58 +03:00
Paweł Dziepak	69a0e6e002	sstables: fix skipping partitions with no rows If partition contains no static and clustering rows or range tombstones mp_row_consumer will return disengaged mutation_fragment_opt with is_mutation_end flag set to mark end of this partition. Current mutation_reader::impl code incorrectly recognized disengaged mutation fragment as end of the stream of all mutations. This patch fixes that by using is_mutation_end flag to determine whether end of partition or end of stream was reached. Fixes #1503. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com> Message-Id: <1469525449-15525-1-git-send-email-pdziepak@scylladb.com> (cherry picked from commit `efa690ce8c`)	2016-07-26 13:10:31 +03:00
Raphael S. Carvalho	2d66a4621a	compaction: do not convert timestamp resolution to uppercase C* only allows timestamp resolution in uppercase, so we shouldn't be forgiving about it, otherwise migration to C* will not work. Timestamp resolution is stored in compaction strategy options of schema BTW. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <d64878fc9bbcf40fd8de3d0f08cce9f6c2fde717.1469133851.git.raphaelsc@scylladb.com> (cherry picked from commit `c4f34f5038`)	2016-07-25 13:47:23 +03:00
Raphael S. Carvalho	789fb0db97	compaction: implement date tiered compaction strategy options Now date tiered compaction strategy will take into account the strategy options which are defined in the schema. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> (cherry picked from commit `eaa6e281a2`)	2016-07-21 12:00:18 +03:00
Nadav Har'El	c647d917e0	sstables: move to_bytes_view to header file Move the to_bytes_view(temporary_buffer<char>) function from source file to header file where it can be used in more places. This saves one use of reinterpret_cast (which we are now re-evaluating), and moreover, we want to use this function also in the promoted index code (to return a bytes_view from the promoted index which was saved as a temporary_buffer). Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <1468761437-27046-1-git-send-email-nyh@scylladb.com>	2016-07-17 16:29:26 +03:00
Paweł Dziepak	93cc4454a6	streamed_mutation: emit range_tombstones directly Originally, streamed_mutations guaranteed that emitted tombstones are disjoint. In order to achieve that two separate objects were produced for each range tombstone: range_tombstone_begin and range_tombstone_end. Unfortunately, this forced sstable writer to accumulate all clustering rows between range_tombstone_begin and range_tombstone_end. However, since there is no need to write disjoint tombstones to sstables (see #1153 "Write range tombstones to sstables like Cassandra does") it is also not necessary for streamed_mutations to produce disjoint range tombstones. This patch changes that by making streamed_mutation produce range_tombstone objects directly. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-13 09:51:18 +01:00
Nadav Har'El	aec90a22da	sstable parsing: assert we do not lose clustering rows The sstable parsing code calls mp_row_consumer::flush() after every clustering row has been read, and this puts the now complete row in a single field "_ready". The assumption is that at this point parsing will stop, the consumer will move out this _ready (mp_row_consumer::get_mutation_fragment()) and when flush() is later called again, _ready will be empty again. This assumption is correct in our code, but is based on an intricate combination of esoteric parts of the code, such as: 1. In data_consume_row_context we stop parsing after reading the partition's header, before reading any clustering rows, giving the caller the chance to call sstable_streamed_mutation::read_next() to be prepared for the incoming mutations. 2. In mp_row_consumer::flush_if_needed(), we stop the parser after each individual clustering row. It is easy to break this assumption, and I did this in one of my code changes, and the result was silent loss of clustering rows, as "_ready" got silently overwritten before the reader had a chance to move it out. What this patch does is to add an assertion: If a clustering row is silently lost before being transferred to the mutation fragment reader, we croak. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <1468389955-24600-1-git-send-email-nyh@scylladb.com>	2016-07-13 09:42:48 +01:00
Duarte Nunes	4eca7632ec	sstables: Replace composite fields with raw bytes This patch fixes a regression introduced in `f81329be60`, which made keys compound by default when using a particular ctor, in turn leading to mismatches when comparing the same key built with functions that properly consider compoundness. As a temporary fix, the sstable::key and sstable::key_view classes store raw bytes instead of a composite. Signed-off-by: Duarte Nunes <duarte@scylladb.com> Message-Id: <1468339295-3924-1-git-send-email-duarte@scylladb.com>	2016-07-12 18:08:04 +02:00
Duarte Nunes	f81329be60	sstables: sstables::key delegates to composite The sstables::key class now delegates much of its functionality to the composite class. All existing behavior is preserved. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-11 23:37:33 +02:00
Duarte Nunes	ad8ff1df7e	sstables: Replace composite class This patch replaces the sstables::composite class with the one in compound_compat.hh. Signed-off-by: Duarte Nunes <duarte@scylladb.com>	2016-07-11 16:55:11 +02:00
Avi Kivity	24e3026e32	Merge "compaction manager refactoring" from Raphael	2016-07-10 17:16:23 +03:00
Tomasz Grabiec	8c4b5e4283	db: Avoiding checking bloom filters during compaction Checking bloom filters of sstables to compute max purgeable timestamp for compaction is expensive in terms of CPU time. We can avoid calculating it if we're not about to GC any tombstone. This patch changes compacting functions to accept a function instead of ready value for max_purgeable. I verified that bloom filter operations no longer appear on flame graphs during compaction-heavy workload (without tombstones). Refs #1322.	2016-07-10 09:54:20 +02:00
Raphael S. Carvalho	ed5e7e6842	compaction: refactor compaction manager Previously, same function was used to handle both regular compaction and cleanup requests. That's bad because a lot of conditions were added for both compaction types to live in the same function. Now, cleanup and regular compaction will live in different functions. They share a lot of code, so helper functions were introduced. This change is also important for user-initiated compaction that will go through compaction manager in the future. Code is also a lot easier to read now. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-07-08 16:37:53 -03:00
Raphael S. Carvalho	da6a2b429d	compaction: add functions to register and deregister compacting sstables Reviewed-by: Nadav Har'El <nyh@scylladb.com> Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-07-08 16:00:51 -03:00
Raphael S. Carvalho	4d6dce8ec9	compaction: add helper function to get candidates for strategy Reviewed-by: Nadav Har'El <nyh@scylladb.com> Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-07-08 15:06:14 -03:00
Raphael S. Carvalho	bfc5376548	compaction: remove gate from compaction manager task There is no longer a need to use gate for regular termination of fiber that runs compaction. Now, we only set task->stopping to true, ask for compaction termination, and wait for its future to resolve. Code is simplified a lot with this change. Reviewed-by: Nadav Har'El <nyh@scylladb.com> Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-07-08 15:05:10 -03:00
Avi Kivity	8dab93a853	sstables: fix low disk utilization with compression and small chunk lengths As Nadav notes we use the chunk length as the buffer size for the compressed stream too. Fix by using it only for the outer (uncompressed) stream; the inner (compressed) stream uses the sstable buffer size, 128 kiB. Fixes #1402. Message-Id: <1467910556-5759-1-git-send-email-avi@scylladb.com> Reviewed-by: Nadav Har'El <nyh@scylladb.com>	2016-07-07 18:13:30 +01:00
Paweł Dziepak	5bc51821fe	sstables: allow writing unsealed sstables The purpose of this patch is to split the actions of writing sstable and sealing it. As long as the sstable is unsealed it is considered incomplete and is going to be removed on reboot. Such functionality is needed in order to defer visibility of sstables created during streaming until the streaming is complete. Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-07 12:18:35 +01:00
Paweł Dziepak	a7b6c1110f	sstables: do not require seal_sstable() to be run in thread Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-07 12:18:35 +01:00
Raphael S. Carvalho	0772d20c60	fix compilation in debug mode build/debug/sstables/compaction_strategy.o: In function `date_tiered_manifest::date_tiered_manifest(std::map<basic_sstring<char, unsigned int, 15u>, basic_sstring<char, unsigned int, 15u>, std::less<basic_sstring<char, unsigned int, 15u> >, std::allocator<std::pair<basic_sstring<char, unsigned int, 15u> const, basic_sstring<char, unsigned int, 15u> > > > const&)': /home/centos/scylla/sstables/date_tiered_compaction_strategy.hh:67: undefined reference to `date_tiered_manifest::DEFAULT_BASE_TIME_SECONDS' That's fixed by moving definition of static constexpr outside the class. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20c16ad71f64900aa5591018bc4e976406cfebb3.1467870383.git.raphaelsc@scylladb.com>	2016-07-07 11:52:37 +03:00
Avi Kivity	02530faeb2	compaction: fix tombstones not being garbage collected during compaction `2a46410f4a` changed sstable_list from a map to a set, so it is no longer sorted by generation. The code for finding the list of sstables not being compacted relied on this sort order, and now broke, returning a longer list than needed (including some of the sstables being compacted). As a result, the compaction code preserved the tombstones, incorrectly thinking there was still live data they referenced. Fix by sorting the set explicitly. Fixes #1429. Message-Id: <1467793026-6571-1-git-send-email-avi@scylladb.com>	2016-07-06 10:22:31 +02:00
Raphael S. Carvalho	b699ef2de3	compaction: wire up date tiered compaction strategy After this commit, date tiered compaction strategy is supported on Scylla. To understand how it works, take a look at our wiki page: https://github.com/scylladb/scylla/wiki/SSTable-compaction#date-tiered-compaction Fixes #511. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-07-06 02:11:47 -03:00
Raphael S. Carvalho	e5cc0cc6c4	compaction: implement date tiered compaction strategy This commit is basically about converting Java to C++. Date tiered compaction strategy isn't wired yet. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-07-06 02:11:47 -03:00
Raphael S. Carvalho	e9076f39be	compaction: implement function to get fully expired sstables Strongly based on org.apache.cassandra.db.compaction.CompactionController.getFullyExpiredSSTables. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-07-06 02:11:47 -03:00
Raphael S. Carvalho	92848efc42	sstables: make overlapping functions static That's needed for a function that will get overlapping sstables to get fully expired ones. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-07-06 01:34:34 -03:00
Raphael S. Carvalho	8d38fa49d4	sstables: move code to get uncompacting sstables to a function Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-07-06 01:33:55 -03:00
Raphael S. Carvalho	cc6c383249	sstables: properly keep track of max local deletion time We weren't updating max local deletion time for cells that contain ttl, or for tombstone cells. If there is a live cell with no ttl, then max local deletion time is supposed to store maximum value, which means that the sstable will not be fully expired later on. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-07-06 01:13:24 -03:00
Raphael S. Carvalho	1ecd9bdefc	sstables: fix type of max_local_deletion_time max_local_deletion_time was incorrectly using an unsigned type instead of a signed one. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-07-06 01:13:13 -03:00
Raphael S. Carvalho	f9ab94d266	compaction: import DateTieredCompactionStrategy.java File can be found at the following C* directory: src/java/org/apache/cassandra/db/compaction Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2016-07-06 01:12:49 -03:00
Avi Kivity	cb59e724ee	Merge "Fix enabling sstable read ahead" from Paweł "This series contains remaining changes necessary to safely enable read ahead of sstables. Basically, it makes sure that input_streams are always properly closed (even in case of exception during read)."	2016-07-05 19:04:19 +03:00
Raphael S. Carvalho	43926026c3	compaction: introduce compaction strategy method to estimate pending compaction At the moment, it's not possible to know how many compactions are needed for compaction strategy to be satisfied. It's not possible to know exactly the number of pending compactions, but the strategy can provide an estimation. For size tiered, it's based on number of sstables in each bucket. By dividing bucket size by max threshold, we get number of compactions needed to compact that single bucket. For leveled, it's about the number of sstables that exceed the limit in each level. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <e209e52f6159ee274a8358b69961a7c0ce357f7d.1467667054.git.raphaelsc@scylladb.com>	2016-07-05 19:03:11 +03:00
Paweł Dziepak	4acf77d755	sstables: drop unused data_stream_at() Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>	2016-07-04 18:17:43 +01:00
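
The size-tiered estimate described in commit 43926026c3 above (bucket size divided by max threshold) can be sketched roughly as follows. This is an illustration only, not the code from that commit: the function name `estimate_pending_size_tiered`, the plain vector of bucket sizes, the rule that buckets below `min_threshold` are skipped, and the choice to round the division up are all assumptions.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper, not Scylla's actual API.
// bucket_sizes holds the number of sstables in each size-tiered bucket.
// Assumptions: a bucket with fewer than min_threshold sstables is not yet
// eligible for compaction, and the division rounds up so that a partially
// filled batch still counts as one compaction.
inline std::size_t estimate_pending_size_tiered(
        const std::vector<std::size_t>& bucket_sizes,
        std::size_t min_threshold,
        std::size_t max_threshold) {
    if (max_threshold == 0) {
        return 0;  // avoid dividing by zero on a nonsensical configuration
    }
    std::size_t pending = 0;
    for (std::size_t size : bucket_sizes) {
        if (size >= min_threshold) {
            // Ceiling of size / max_threshold using integer arithmetic.
            pending += (size + max_threshold - 1) / max_threshold;
        }
    }
    return pending;
}
```

With buckets of 4, 33 and 2 sstables, a min threshold of 4 and a max threshold of 32, this sketch counts one compaction for the first bucket, two for the second, and none for the third.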

692 Commits