Introduce mutation_fragment_stream_validator class and use it as a
Filter to flat_mutation_reader::consume_in_thread from
sstable::write_components to validate partition region and optionally
clustering key monotonicity.
Fixes#4803
key monotonicity validation requires an overhead to store the last key and also to compare
therefore provide an option to enable/disable it (disabled by default).
Refs #4804
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The previous code was not exception safe and would eventually cause a
file to be destroyed without being closed, causing an assert failure.
Unfortunately it doesn't seem to be possible to test this without
error injection, since using an invalid directory fails before this
code is executed.
Fixes#4948
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190904002314.79591-1-espindola@scylladb.com>
* 'cleanup_sstables' of https://github.com/asias/scylla:
sstables: Move leveled_compaction_strategy implementation to source file
sstables: Include dht/i_partitioner.hh for dht::partition_range
This patches silences the remaining discarded future warnings, those
where it cannot be determined with reasonable confidence that this was
indeed the actual intent of the author, or that the discarding of the
future could lead to problems. For all those places a FIXME is added,
with the intent that these will be soon followed-up with an actual fix.
I deliberately haven't fixed any of these, even if the fix seems
trivial. It is too easy to overlook a bad fix mixed in with so many
mechanical changes.
This patch silences those future discard warnings where it is clear that
discarding the future was actually the intent of the original author,
*and* they did the necessary precautions (handling errors). The patch
also adds some trivial error handling (logging the error) in some
places, which were lacking this, but otherwise look ok. No functional
changes.
"
Refs #4726
Implement the api portion of a "describe sstables" command.
Adds rest types for collecting both fixed and dynamic attributes, some grouped. Allows extensions to add attributes as well. (Hint hint)
"
* 'sstabledesc' of https://github.com/elcallio/scylla:
api/storage_service: Add "sstable_info" command
sstables/compress: Make compressor pointer accessible from compression info
sstables.hh: Add attribute description API to file extension
sstables.hh: Add compression component accessor
sstables.hh: Make "has_component" public
Manager trims sstables off to allow compaction jobs to proceed in parallel
according to their weights. The problem is that trimming procedure is not
sstable run aware, so it could incorrectly remove only a subset of a sstable
run, leading to partial sstable run compaction.
Compaction of a sstable run could lead to inneficiency because the run structure
would be messed up, affecting all amplification factors, and the same generation
could even end up being compacted twice.
This is fixed by making the trim procedure respect the sstable runs.
Fixes#4773.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190730042023.11351-1-raphaelsc@scylladb.com>
ignore_partial_runs() brings confusion because i__p__r() equal to true
doesn't mean filter out partial runs from compaction. It actually means
not caring about compaction of a partial run.
The logic was wrong because any compaction strategy that chooses not to ignore
partial sstable run[1] would have any fragment composing it incorrectly
becoming a candidate for compaction.
This problem could make compaction include only a subset of fragments composing
the partial run or even make the same fragment be compacted twice due to
parallel compaction.
[1]: partial sstable run is a sstable that is still being generated by
compaction and as a result cannot be selected as candidate whatsoever.
Fix is about making sure partial sstable run has none of its fragments
selected for compaction. And also renaming i__p__r.
Fixes#4729.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20190807022814.12567-1-raphaelsc@scylladb.com>
This adds the option to compress sstables using the Zstandard algorithm
(https://facebook.github.io/zstd/).
To use, pass 'sstable_compression': 'org.apache.cassandra.io.compress.ZstdCompressor'
to the 'compression' argument when creating a table.
You can also specify a 'compression_level'. See Zstd documentation for the available
compression levels.
Resolves#2613.
Signed-off-by: Kamil Braun <kbraun@scylladb.com>
This commit also fixes a bug in sstables/compress.cc, where chunk length in bytes
was passed to the compressor as chunk length in kilobytes. Fortunately,
none of the compressors implemented until now used this parameter.
Signed-off-by: Kamil Braun <kbraun@scylladb.com>
Not emitting partition_end for a partition is incorrect. Sstable
writer assumes that it is emitted. If it's not, the sstable will not
be written correctly. The partition index entry for the last partition
will be left partially written, which will may result in errors during
reads. Also, statistics and sstable key ranges will not include the
last partition.
It's better to catch this problem at the time of writing, and not
generate bad sstables.
Another way of handling this would be to implicitly generate a
partition_end, but I don't think that we should do this. We cannot
trust the mutation stream when invariants are violated, we don't know
if this was really the last partition which was supposed to be
written. So it's safer to fail the write.
Enabled for both mc and la/ka.
"
Fixes#4663.
Fixes#4718.
"
* 'make_cleanup_run_aware_v3' of https://github.com/raphaelsc/scylla:
tests/sstable_datafile_test: Check cleaned sstable is generated with expected run id
table: Make SSTable cleanup run aware
compaction: introduce constants for compaction descriptor
compaction: Make it possible to config the identifier of the output sstable run
table: do not rely on undefined behavior in cleanup_sstables
If we had an error while reading, then we would have failed to close
the reader, which in turn can cause memory corruption. Make the
closing more robust by using then_wrapped (that doesn't skip on
exception) and log the error for analysis.
Fixes#4761.
Make it easier for users, and also avoid duplicating knowledge
about descriptor defaults across the codebase.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Introduce consumer in mutation compactor that will only consume
data that is purged away from regular consumer. The goal is to
allow compaction implementation to do whatever it wants with
the garbage collected data, like saving it for preventing
data resurrection from ever happening, like described in
issue #4531.
noop_compacted_fragments_consumer is made available for users that
don't need this capability.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This fixes a possible cause of #4614.
From the backtrace in that issue, it looks like a file is being closed
twice. The first point in the backtrace where that seems likely is in
the MC writer.
My first idea was to add a writer::close and make it the responsibility
of the code using the writer to call it. That way we would move work
out of the destructor.
That is a bit hard since the writer is destroyed from
flat_mutation_reader::impl::~consumer_adapter and that would need to
get a close function too.
This patch instead just fixes an exception safety issue. If
_index_writer->close() throws, _index_writer is still valid and
~writer will try to close it again.
If the exception was thrown after _completed.set_value(), that would
explain the assert about _completed.set_value() being called twice.
With this patch the path outside of the destructor now moves the
writer to a local variable before trying to close it.
Fixes#4614
Message-Id: <20190710171747.27337-1-espindola@scylladb.com>
"
When writing streamed data into sstables, while using time window
compaction strategy, we have to emit a new sstable for each time window.
Otherwise we can end up with sstables, mixing data from wildly different
windows, ruining the compaction strategy's ability to drop entire
sstables when all data within is expired. This gets worse as these mixed
sstables get compacted together with sstables that used to contain a
single time window.
This series provides a solution to this by segregating the data by its
atom's the time-windows. This is done on the new RPC streaming and the
new row-level, repair, memtable-flush and compaction, ensuring that the
segregation requirement is respected at all times.
Fixes: #2687
"
* 'segregate-data-into-sstables-by-time-window-streaming/v2.1' of ssh://github.com/denesb/scylla:
streaming,repair: restore indentation
repair: pass the data stream through the compaction strategy's interposer consumer
streaming: pass the data stream through the compaction strategy's interposer consumer
TWCS: implement add_interposer_consumer()
compaction_strategy: add add_interposer_consumer()
Add mutation_source_metadata
tests: add unit test for timestamp_based_splitting_writer
Add timestamp_based_splitting_writer
Introduce mutation_writer namespace
Exploit the interposer customization point to inject a consumer that will
segregate the mutation stream based on the contained atoms' timestamps,
allowing the requirements of TWCS to be mantained every time sstables
are written to disk.
For the implementation, `timestamp_based_splitting_writer` is used,
with a classifier that maps timestamps to windows.
This will be the customization point for compaction strategies, used to
inject a specific interposer consumer that can manipulate the fragment
stream so that it satisfies the requirements of the compaction strategy.
For now the only candidate for injecting such an interposer is
time-window compaction strategy, which needs to write sstables that
only contains atoms belonging to the same time-window. By default no
interposer is injected.
Also add an accompanying customization point
`adjust_partition_estimate()` which returns the estimated per-sstable
partition-estimate that the interposer will produce.
"
Currently, parser and the consumer save its state and return the
control to the caller, which then figures out that it needs to enter a
new partition, and that it doesn't need to skip. We do it twice, after
row end, and after row start. All this work could be avoided if the
consumer installed by the reader adjusted its state and pushed the
fragments on the spot. This patch achieves just that.
This results in less CPU overhead.
The ka/la reader is left still stopping after row end.
Brings a 20% improvement in frag/s for a full scan in perf_fast_forward (Haswell, NVMe):
perf_fast_forward -c1 -m1G --run-tests=small-partition-skips:
Before:
read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu
-> 1 0 0.952372 4 1000000 1050009 755 1050765 1046585 976.0 971 124256 1 0 0 0 0 0 0 0 99.7%
After:
read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu
-> 1 0 0.790178 4 1000000 1265538 1150 1266687 1263684 975.0 971 124256 2 0 0 0 0 0 0 0 99.6%
Tests: unit (dev)
"
* 'sstable-optimize-partition-scans' of https://github.com/tgrabiec/scylla:
sstable: mc: reader: Do not stop parsing across partitions
sstables: reader: Move some parser state from sstable_mutation_reader to mp_row_consumer_reader
sstables: reader: Simplify _single_partition_read checking
sstables: reader: Update stats from on_next_partition()
sstables: mutation_fragment_filter: Drop unnecessary calls to _walker.out_of_range()
sstables: ka/la: reader make push_ready_fragments() safe to call many times
sstables: mc: reader: Move out-of-range check out of push_ready_fragments()
sstables: reader: Return void from push_ready_fragments()
sstables: reader: Rename on_end_of_stream() to on_out_of_clustering_range()
sstables: ka/la: reader: Make sure push_ready_fragments() does not miss to emit partition_end
Partitioned sstable set is not self sufficient, because it uses compatible_ring_position_view
as key for interval map, which is constructed from a decorated key in sstable object.
If sstable object is destroyed, like when compaction releases it early, partitioned set
potentially no longer works because c__r__p__v would store information that is already freed,
meaning its use implies use-after-free.
Therefore, the problem happens when partitioned set tries to access the interval of its
interval map and uses freed information from c__r__p__v.
Fix is about using the newly introduced compatible_ring_position_or_view which can hold a
ring_position, meaning that partitioned set is no longer dependent on lifetime of sstable
object.
Retire compatible_ring_position_view.hh as it is now unused.
Fixes#4572.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, parser and the consumer save its state and return the
control to the caller, which then figures out that it needs to enter a
new partition, and that it doesn't need to skip. We do it twice, after
row end, and after row start. All this work could be avoided if the
consumer installed by the reader adjusted its state and pushed the
fragments on the spot. This patch achieves just that.
This results in less CPU overhead.
The ka/la reader is left still stopping after row end.
Brings a 20% improvement in frag/s for a full scan in perf_fast_forward (Haswell, NVMe):
perf_fast_forward -c1 -m1G --run-tests=small-partition-skips:
Before:
read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu
-> 1 0 0.952372 4 1000000 1050009 755 1050765 1046585 976.0 971 124256 1 0 0 0 0 0 0 0 99.7%
After:
read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu
-> 1 0 0.790178 4 1000000 1265538 1150 1266687 1263684 975.0 971 124256 2 0 0 0 0 0 0 0 99.6%
The old code was making advance_to_next_partition() behave
incorrectly when _single_partition_read, which was compensated by a
check in read_partition().
Cleaner to exit early.
Not a bug fix, just makes the implementation more robust against changes.
Before this patch this might have resulted in partition_end being
pushed many times.
Currently, calling push_ready_fragments() with _mf_filter disengaged
or with _mf_filter->out_of_range() causes it to call
_reader->on_out_of_clustering_range(), which emits the partition_end
fragment. It's incorrect to emit this fragment twice, or zero times,
so correctness depends on the fact that push_ready_fragments() is
called exactly once when transitioning between partitions.
This is proved to be tricky to ensure, especially after partition_end
starts to be emitted in a different path as well. Ensuring that
push_ready_fragments() is *NOT* called after partition_end is emitted
from consume_partition_end() becomes tricky.
After having to fix this problem many times after unrelated changes to
the flow, I decide that it's better to refactor.
This change moves the call of on_out_of_clustering_range() out of
push_ready_fragments(), making the latter safe to call any number of
times.
The _mf_filter->out_of_range() check is moved to sites which update
the filter.
It's also good because it gets rid of conditionals.
Currently, if there is a fragment in _ready and _out_of_range was set
after row end was consumer, push_ready_fragments() would return
without emitting partition_end.
This is problematic once we make consume_row_start() emit
partiton_start directly, because we will want to assume that all
fragments for the previous partition are emitted by then. If they're
not, then we'd emit partition_start before partition_end for the
previous partition. The fix is to make sure that
push_ready_fragments() emits everything.
Before this patch mc sstables writer was ignoring
empty cellpaths. This is a wrong behaviour because
it is possible to have empty key in a map. In such case,
our writer creats a wrong sstable that we can't read back.
This is becaus a complex cell expects cellpath for each
simple cell it has. When writer ignores empty cellpath
it writes nothing and instead it should write a length
of zero to the file so that we know there's an empty cellpath.
Fixes#4533
Tests: unit(release)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <46242906c691a56a915ca5994b36baf87ee633b7.1560532790.git.piotr@scylladb.com>
This patch adds a warning option to the user for situations where
rows count may get bigger than initially designed. Through the
warning, users can be aware of possible data modeling problems.
The threshold is initially set to '100,000'.
Tests: unit (dev)
Message-Id: <20190528075612.GA24671@shenzou.localdomain>