scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-25 19:10:42 +00:00

Author	SHA1	Message	Date
Tomasz Grabiec	a09479e63c	Merge "Validate position in partition monotonicity" from Benny Introduce mutation_fragment_stream_validator class and use it as a Filter to flat_mutation_reader::consume_in_thread from sstable::write_components to validate partition region and optionally clustering key monotonicity. Fixes #4803	2019-09-09 15:38:31 +02:00
Benny Halevy	34d306b982	config: add enable_sstable_key_validation option key monotonicity validation requires an overhead to store the last key and also to compare therefore provide an option to enable/disable it (disabled by default). Refs #4804 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2019-09-09 15:30:59 +03:00
Benny Halevy	496467d0a2	sstables: writer: Validate input mutation fragment stream Fixes #4803 Refs #4804 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2019-09-09 15:30:59 +03:00
Benny Halevy	41b60b8bc5	compaction: s/filter_func/make_partition_filter/ It expresses the purpose of this function better as suggested by Tomasz Grabiec. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2019-09-09 15:30:59 +03:00
Benny Halevy	bc29520eb8	flat_mutation_reader: consume_in_thread: add mutation_filter For validating mutation_fragment's monotonicity. Note: forwarding constructor allows implicit conversion by current callers. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2019-09-04 13:42:37 +03:00
Rafael Ávila de Espíndola	000514e7cc	sstable: close file_writer if an exception is thrown The previous code was not exception safe and would eventually cause a file to be destroyed without being closed, causing an assert failure. Unfortunately it doesn't seem to be possible to test this without error injection, since using an invalid directory fails before this code is executed. Fixes #4948 Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com> Message-Id: <20190904002314.79591-1-espindola@scylladb.com>	2019-09-04 13:28:55 +03:00
Avi Kivity	8fb59915bb	Merge "Minor cleanup patches for sstables" from Asias * 'cleanup_sstables' of https://github.com/asias/scylla: sstables: Move leveled_compaction_strategy implementation to source file sstables: Include dht/i_partitioner.hh for dht::partition_range	2019-09-03 14:47:44 +03:00
Rafael Ávila de Espíndola	036f51927c	sstables: Remove unused include Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com> Message-Id: <20190827210424.37848-1-espindola@scylladb.com>	2019-08-28 11:32:44 +03:00
Benny Halevy	869b518dca	sstables: auto-delete unsealed sstables Fixes #4807 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20190827082044.27223-1-bhalevy@scylladb.com>	2019-08-28 09:46:17 +03:00
Benny Halevy	20083be9f6	sstables: delete_atomically: fix misplaced parenthesis in pending_delete_log warning message Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20190818064637.9207-1-bhalevy@scylladb.com>	2019-08-26 19:50:21 +03:00
Botond Dénes	136fc856c5	treewide: silence discarded future warnings for questionable discards This patch silences the remaining discarded future warnings, those where it cannot be determined with reasonable confidence that this was indeed the actual intent of the author, or that the discarding of the future could lead to problems. For all those places a FIXME is added, with the intent that these will be soon followed-up with an actual fix. I deliberately haven't fixed any of these, even if the fix seems trivial. It is too easy to overlook a bad fix mixed in with so many mechanical changes.	2019-08-26 19:28:43 +03:00
Botond Dénes	fddd9a88dd	treewide: silence discarded future warnings for legit discards This patch silences those future discard warnings where it is clear that discarding the future was actually the intent of the original author, and they did the necessary precautions (handling errors). The patch also adds some trivial error handling (logging the error) in some places, which were lacking this, but otherwise look ok. No functional changes.	2019-08-26 18:54:44 +03:00
Asias He	2f24fd9106	sstables: Move leveled_compaction_strategy implementation to source file It is better than putting everything in header.	2019-08-26 16:49:48 +08:00
Asias He	b69138c4e4	sstables: Include dht/i_partitioner.hh for dht::partition_range Get rid of one FIXME.	2019-08-26 16:35:18 +08:00
Avi Kivity	0d0ee20f76	Merge "Implement `sstable_info` API command (info on sstables)" from Calle " Refs #4726 Implement the api portion of a "describe sstables" command. Adds rest types for collecting both fixed and dynamic attributes, some grouped. Allows extensions to add attributes as well. (Hint hint) " * 'sstabledesc' of https://github.com/elcallio/scylla: api/storage_service: Add "sstable_info" command sstables/compress: Make compressor pointer accessible from compression info sstables.hh: Add attribute description API to file extension sstables.hh: Add compression component accessor sstables.hh: Make "has_component" public	2019-08-12 21:16:08 +03:00
Raphael S. Carvalho	b436c41128	compaction_manager: Prevent sstable runs from being partially compacted Manager trims sstables off to allow compaction jobs to proceed in parallel according to their weights. The problem is that trimming procedure is not sstable run aware, so it could incorrectly remove only a subset of a sstable run, leading to partial sstable run compaction. Compaction of a sstable run could lead to inefficiency because the run structure would be messed up, affecting all amplification factors, and the same generation could even end up being compacted twice. This is fixed by making the trim procedure respect the sstable runs. Fixes #4773. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20190730042023.11351-1-raphaelsc@scylladb.com>	2019-08-11 17:20:20 +03:00
Raphael S. Carvalho	76cde84540	sstables/compaction_manager: Fix logic for filtering out partial sstable runs ignore_partial_runs() brings confusion because i__p__r() equal to true doesn't mean filter out partial runs from compaction. It actually means not caring about compaction of a partial run. The logic was wrong because any compaction strategy that chooses not to ignore partial sstable run[1] would have any fragment composing it incorrectly becoming a candidate for compaction. This problem could make compaction include only a subset of fragments composing the partial run or even make the same fragment be compacted twice due to parallel compaction. [1]: partial sstable run is a sstable that is still being generated by compaction and as a result cannot be selected as candidate whatsoever. Fix is about making sure partial sstable run has none of its fragments selected for compaction. And also renaming i__p__r. Fixes #4729. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20190807022814.12567-1-raphaelsc@scylladb.com>	2019-08-08 14:11:35 +03:00
Calle Wilund	95a8ff12e7	sstables/compress: Make compressor pointer accessible from compression info	2019-08-06 07:07:44 +00:00
Calle Wilund	d15c63627c	sstables.hh: Add attribute description API to file extension	2019-08-06 07:07:44 +00:00
Calle Wilund	4c67d702c2	sstables.hh: Add compression component accessor	2019-08-06 07:07:44 +00:00
Calle Wilund	770f912221	sstables.hh: Make "has_component" public	2019-08-06 07:07:44 +00:00
Kamil Braun	f14e6e73bb	Add ZStandard compression This adds the option to compress sstables using the Zstandard algorithm (https://facebook.github.io/zstd/). To use, pass 'sstable_compression': 'org.apache.cassandra.io.compress.ZstdCompressor' to the 'compression' argument when creating a table. You can also specify a 'compression_level'. See Zstd documentation for the available compression levels. Resolves #2613. Signed-off-by: Kamil Braun <kbraun@scylladb.com>	2019-08-05 14:55:53 +02:00
Kamil Braun	7a61bcb021	Fix the value of the chunk length parameter passed to compressors This commit also fixes a bug in sstables/compress.cc, where chunk length in bytes was passed to the compressor as chunk length in kilobytes. Fortunately, none of the compressors implemented until now used this parameter. Signed-off-by: Kamil Braun <kbraun@scylladb.com>	2019-08-05 14:31:33 +02:00
Tomasz Grabiec	43c7144133	sstables: writer: Validate that partition is closed when the input mutation stream ends Not emitting partition_end for a partition is incorrect. Sstable writer assumes that it is emitted. If it's not, the sstable will not be written correctly. The partition index entry for the last partition will be left partially written, which may result in errors during reads. Also, statistics and sstable key ranges will not include the last partition. It's better to catch this problem at the time of writing, and not generate bad sstables. Another way of handling this would be to implicitly generate a partition_end, but I don't think that we should do this. We cannot trust the mutation stream when invariants are violated, we don't know if this was really the last partition which was supposed to be written. So it's safer to fail the write. Enabled for both mc and la/ka.	2019-08-02 11:13:54 +02:00
Avi Kivity	77686ab889	Merge "Make SSTable cleanup run aware" from Raphael " Fixes #4663. Fixes #4718. " * 'make_cleanup_run_aware_v3' of https://github.com/raphaelsc/scylla: tests/sstable_datafile_test: Check cleaned sstable is generated with expected run id table: Make SSTable cleanup run aware compaction: introduce constants for compaction descriptor compaction: Make it possible to config the identifier of the output sstable run table: do not rely on undefined behavior in cleanup_sstables	2019-07-31 19:10:22 +03:00
Avi Kivity	b272db368f	sstable: index_reader: close index_reader::reader more robustly If we had an error while reading, then we would have failed to close the reader, which in turn can cause memory corruption. Make the closing more robust by using then_wrapped (that doesn't skip on exception) and log the error for analysis. Fixes #4761.	2019-07-26 14:26:04 +02:00
Raphael S. Carvalho	8c97e0e43e	compaction: introduce constants for compaction descriptor Make it easier for users, and also avoid duplicating knowledge about descriptor defaults across the codebase. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2019-07-15 23:39:44 -03:00
Raphael S. Carvalho	a1db29e705	compaction: Make it possible to config the identifier of the output sstable run Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2019-07-15 23:39:38 -03:00
Botond Dénes	7a4a609e88	Introduce Garbage Collected Consumer to Mutation Compactor Introduce consumer in mutation compactor that will only consume data that is purged away from regular consumer. The goal is to allow compaction implementation to do whatever it wants with the garbage collected data, like saving it for preventing data resurrection from ever happening, like described in issue #4531. noop_compacted_fragments_consumer is made available for users that don't need this capability. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2019-07-15 17:38:00 +03:00
Rafael Ávila de Espíndola	281f3a69f8	mc writer: Fix exception safety when closing _index_writer This fixes a possible cause of #4614. From the backtrace in that issue, it looks like a file is being closed twice. The first point in the backtrace where that seems likely is in the MC writer. My first idea was to add a writer::close and make it the responsibility of the code using the writer to call it. That way we would move work out of the destructor. That is a bit hard since the writer is destroyed from flat_mutation_reader::impl::~consumer_adapter and that would need to get a close function too. This patch instead just fixes an exception safety issue. If _index_writer->close() throws, _index_writer is still valid and ~writer will try to close it again. If the exception was thrown after _completed.set_value(), that would explain the assert about _completed.set_value() being called twice. With this patch the path outside of the destructor now moves the writer to a local variable before trying to close it. Fixes #4614 Message-Id: <20190710171747.27337-1-espindola@scylladb.com>	2019-07-10 19:27:19 +02:00
Avi Kivity	dd76943125	Merge "Segregate data when streaming by timestamp for time window compaction strategy" from Botond " When writing streamed data into sstables, while using time window compaction strategy, we have to emit a new sstable for each time window. Otherwise we can end up with sstables, mixing data from wildly different windows, ruining the compaction strategy's ability to drop entire sstables when all data within is expired. This gets worse as these mixed sstables get compacted together with sstables that used to contain a single time window. This series provides a solution to this by segregating the data by its atoms' time-windows. This is done on the new RPC streaming and the new row-level repair, memtable-flush and compaction, ensuring that the segregation requirement is respected at all times. Fixes: #2687 " * 'segregate-data-into-sstables-by-time-window-streaming/v2.1' of ssh://github.com/denesb/scylla: streaming,repair: restore indentation repair: pass the data stream through the compaction strategy's interposer consumer streaming: pass the data stream through the compaction strategy's interposer consumer TWCS: implement add_interposer_consumer() compaction_strategy: add add_interposer_consumer() Add mutation_source_metadata tests: add unit test for timestamp_based_splitting_writer Add timestamp_based_splitting_writer Introduce mutation_writer namespace	2019-06-26 19:18:52 +03:00
Botond Dénes	ee563928df	TWCS: implement add_interposer_consumer() Exploit the interposer customization point to inject a consumer that will segregate the mutation stream based on the contained atoms' timestamps, allowing the requirements of TWCS to be maintained every time sstables are written to disk. For the implementation, `timestamp_based_splitting_writer` is used, with a classifier that maps timestamps to windows.	2019-06-26 18:45:36 +03:00
Botond Dénes	a280dcfe4c	compaction_strategy: add add_interposer_consumer() This will be the customization point for compaction strategies, used to inject a specific interposer consumer that can manipulate the fragment stream so that it satisfies the requirements of the compaction strategy. For now the only candidate for injecting such an interposer is time-window compaction strategy, which needs to write sstables that only contain atoms belonging to the same time-window. By default no interposer is injected. Also add an accompanying customization point `adjust_partition_estimate()` which returns the estimated per-sstable partition-estimate that the interposer will produce.	2019-06-26 15:45:59 +03:00
Avi Kivity	adcc95dddc	Merge "sstable: mc: reader: Optimize multi-partition scans for data sets with small partitions" from Tomasz " Currently, parser and the consumer save its state and return the control to the caller, which then figures out that it needs to enter a new partition, and that it doesn't need to skip. We do it twice, after row end, and after row start. All this work could be avoided if the consumer installed by the reader adjusted its state and pushed the fragments on the spot. This patch achieves just that. This results in less CPU overhead. The ka/la reader is left still stopping after row end. Brings a 20% improvement in frag/s for a full scan in perf_fast_forward (Haswell, NVMe): perf_fast_forward -c1 -m1G --run-tests=small-partition-skips: Before: read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu -> 1 0 0.952372 4 1000000 1050009 755 1050765 1046585 976.0 971 124256 1 0 0 0 0 0 0 0 99.7% After: read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu -> 1 0 0.790178 4 1000000 1265538 1150 1266687 1263684 975.0 971 124256 2 0 0 0 0 0 0 0 99.6% Tests: unit (dev) " * 'sstable-optimize-partition-scans' of https://github.com/tgrabiec/scylla: sstable: mc: reader: Do not stop parsing across partitions sstables: reader: Move some parser state from sstable_mutation_reader to mp_row_consumer_reader sstables: reader: Simplify _single_partition_read checking sstables: reader: Update stats from on_next_partition() sstables: mutation_fragment_filter: Drop unnecessary calls to _walker.out_of_range() sstables: ka/la: reader make push_ready_fragments() safe to call many times sstables: mc: reader: Move out-of-range check out of push_ready_fragments() sstables: reader: Return void from push_ready_fragments() sstables: reader: Rename on_end_of_stream() to on_out_of_clustering_range() sstables: ka/la: reader: Make sure push_ready_fragments() does not miss to emit partition_end	2019-06-26 13:19:12 +03:00
Raphael S. Carvalho	293557a34e	sstables: Fix partitioned_sstable_set by making it self sufficient Partitioned sstable set is not self sufficient, because it uses compatible_ring_position_view as key for interval map, which is constructed from a decorated key in sstable object. If sstable object is destroyed, like when compaction releases it early, partitioned set potentially no longer works because c__r__p__v would store information that is already freed, meaning its use implies use-after-free. Therefore, the problem happens when partitioned set tries to access the interval of its interval map and uses freed information from c__r__p__v. Fix is about using the newly introduced compatible_ring_position_or_view which can hold a ring_position, meaning that partitioned set is no longer dependent on lifetime of sstable object. Retire compatible_ring_position_view.hh as it is now unused. Fixes #4572. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2019-06-23 16:29:13 +03:00
Tomasz Grabiec	fa2ed3ecce	sstable: mc: reader: Do not stop parsing across partitions Currently, parser and the consumer save its state and return the control to the caller, which then figures out that it needs to enter a new partition, and that it doesn't need to skip. We do it twice, after row end, and after row start. All this work could be avoided if the consumer installed by the reader adjusted its state and pushed the fragments on the spot. This patch achieves just that. This results in less CPU overhead. The ka/la reader is left still stopping after row end. Brings a 20% improvement in frag/s for a full scan in perf_fast_forward (Haswell, NVMe): perf_fast_forward -c1 -m1G --run-tests=small-partition-skips: Before: read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu -> 1 0 0.952372 4 1000000 1050009 755 1050765 1046585 976.0 971 124256 1 0 0 0 0 0 0 0 99.7% After: read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu -> 1 0 0.790178 4 1000000 1265538 1150 1266687 1263684 975.0 971 124256 2 0 0 0 0 0 0 0 99.6%	2019-06-19 14:29:02 +02:00
Tomasz Grabiec	386079472a	sstables: reader: Move some parser state from sstable_mutation_reader to mp_row_consumer_reader This state will be needed by the consumer to handle crossing partition boundaries on its own. While at it, document it.	2019-06-19 14:29:02 +02:00
Tomasz Grabiec	92cb07debd	sstables: reader: Simplify _single_partition_read checking The old code was making advance_to_next_partition() behave incorrectly when _single_partition_read, which was compensated by a check in read_partition(). Cleaner to exit early.	2019-06-19 14:29:02 +02:00
Tomasz Grabiec	7f4c041ba0	sstables: reader: Update stats from on_next_partition() After partition_start is emitted directly from the parser's consumer, read_partition() will not always be called for each produced partition.	2019-06-19 14:29:02 +02:00
Tomasz Grabiec	0964a8fb38	sstables: mutation_fragment_filter: Drop unnecessary calls to _walker.out_of_range() out_of_range() cannot change to true when the position falls into the ranges, we only need to check it when it falls outside them.	2019-06-19 14:29:02 +02:00
Tomasz Grabiec	556ccf4373	sstables: ka/la: reader make push_ready_fragments() safe to call many times Not a bug fix, just makes the implementation more robust against changes. Before this patch this might have resulted in partition_end being pushed many times.	2019-06-19 14:29:01 +02:00
Tomasz Grabiec	ef6edff673	sstables: mc: reader: Move out-of-range check out of push_ready_fragments() Currently, calling push_ready_fragments() with _mf_filter disengaged or with _mf_filter->out_of_range() causes it to call _reader->on_out_of_clustering_range(), which emits the partition_end fragment. It's incorrect to emit this fragment twice, or zero times, so correctness depends on the fact that push_ready_fragments() is called exactly once when transitioning between partitions. This proved to be tricky to ensure, especially after partition_end starts to be emitted in a different path as well. Ensuring that push_ready_fragments() is NOT called after partition_end is emitted from consume_partition_end() becomes tricky. After having to fix this problem many times after unrelated changes to the flow, I decided that it's better to refactor. This change moves the call of on_out_of_clustering_range() out of push_ready_fragments(), making the latter safe to call any number of times. The _mf_filter->out_of_range() check is moved to sites which update the filter. It's also good because it gets rid of conditionals.	2019-06-19 14:29:01 +02:00
Tomasz Grabiec	552fe21812	sstables: reader: Return void from push_ready_fragments() The result is ignored, which is fine, so make it official to avoid confusion.	2019-06-19 14:29:01 +02:00
Tomasz Grabiec	1488b57933	sstables: reader: Rename on_end_of_stream() to on_out_of_clustering_range() The old name is confusing, because we're not always ending the stream when we call it.	2019-06-19 14:29:01 +02:00
Tomasz Grabiec	9b8ac5ecbc	sstables: ka/la: reader: Make sure push_ready_fragments() does not miss to emit partition_end Currently, if there is a fragment in _ready and _out_of_range was set after row end was consumed, push_ready_fragments() would return without emitting partition_end. This is problematic once we make consume_row_start() emit partition_start directly, because we will want to assume that all fragments for the previous partition are emitted by then. If they're not, then we'd emit partition_start before partition_end for the previous partition. The fix is to make sure that push_ready_fragments() emits everything.	2019-06-19 14:14:38 +02:00
Piotr Jastrzebski	a41c9763a9	sstables: distinguish empty and missing cellpath Before this patch mc sstables writer was ignoring empty cellpaths. This is a wrong behaviour because it is possible to have empty key in a map. In such case, our writer creates a wrong sstable that we can't read back. This is because a complex cell expects cellpath for each simple cell it has. When writer ignores empty cellpath it writes nothing and instead it should write a length of zero to the file so that we know there's an empty cellpath. Fixes #4533 Tests: unit(release) Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <46242906c691a56a915ca5994b36baf87ee633b7.1560532790.git.piotr@scylladb.com>	2019-06-14 20:36:41 +03:00
Raphael S. Carvalho	62aa0ea3fa	sstables: fix log of failure on large data entry deletion by fixing use-after-move Fixes #4532. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20190527200828.25339-1-raphaelsc@scylladb.com>	2019-06-12 10:55:46 +03:00
Juliana Oliveira	fd83f61556	Add a warning for partitions with too many rows This patch adds a warning option to the user for situations where rows count may get bigger than initially designed. Through the warning, users can be aware of possible data modeling problems. The threshold is initially set to '100,000'. Tests: unit (dev) Message-Id: <20190528075612.GA24671@shenzou.localdomain>	2019-06-06 19:48:57 +03:00
Raphael S. Carvalho	f360d5a936	sstables: export output operator for sstable run It wasn't being exported in any header. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20190527182246.19007-1-raphaelsc@scylladb.com>	2019-06-02 10:25:51 +03:00
Raphael S. Carvalho	cabeb12b4e	sstables: add output operator for sstable run The output will look as follows: Run = { Identifier: 647044fd-d3d4-43c4-b014-b546943ead0d Fragments = { 1471=-9223317893235177836:-7063220874380325121 1478=5924386327138804918:8070482595977135657 1472=-7063202587832032132:-4903425074566642766 1473=-4903298949436784325:-2739716797579745183 1474=-2739703419744073436:-589328117804966275 1477=3734534455848060136:5924372906965333873 1476=1579822226461317527:3734518878340722529 1475=-589322393539097068:1579813857236466583 1479=8070499046054048682:9223317594733741806 } } Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20190524043331.5093-1-raphaelsc@scylladb.com>	2019-05-24 08:36:08 +03:00

1895 Commits