"
Not emitting partition_end for a partition is incorrect. The sstable
writer assumes that it is emitted. If it's not, the sstable will not
be written correctly. The partition index entry for the last partition
will be left partially written, which will result in errors during
reads. Also, statistics and sstable key ranges will not include the
last partition.
It's better to catch this problem at the time of writing, and not
generate bad sstables.
Another way of handling this would be to implicitly generate a
partition_end, but I don't think that we should do this. We cannot
trust the mutation stream when invariants are violated: we don't know
whether this was really the last partition that was supposed to be
written. So it's safer to fail the write.
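A minimal sketch of the check described above, assuming a simplified consumer interface. The class checked_writer and its members are hypothetical and do not mirror the real sstable writer; the point is only that end-of-stream with an open partition fails the write instead of closing the partition implicitly.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical, simplified writer. Names are illustrative only and are
// not taken from Scylla's sstable writer.
class checked_writer {
    bool _partition_open = false;
    std::string _current;               // key of the partition being written
    std::vector<std::string> _written;  // keys of fully written partitions
public:
    void consume_new_partition(const std::string& key) {
        if (_partition_open) {
            throw std::logic_error("partition_start while a partition is still open");
        }
        _partition_open = true;
        _current = key;
    }

    void consume_end_of_partition() {
        if (!_partition_open) {
            throw std::logic_error("partition_end without partition_start");
        }
        _partition_open = false;
        _written.push_back(_current);
    }

    // Fail the write rather than generating an implicit partition_end:
    // a stream that broke this invariant cannot be trusted.
    void consume_end_of_stream() {
        if (_partition_open) {
            throw std::logic_error("mutation stream ended without partition_end");
        }
    }

    const std::vector<std::string>& written() const { return _written; }
};
```

In this sketch an unclosed partition is never recorded as written, so a failed write leaves no partially accounted partition behind.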
Enabled for both mc and la/ka.
Passing --abort-on-internal-error on the command line will switch to
aborting instead of throwing an exception.
The reason we don't abort by default is that it may bring the whole
cluster down and cause unavailability, while it may not be necessary
to do so. It's safer to fail just the affected operation,
e.g. repair. However, failing the operation with an exception leaves
little information for debugging the root cause. So the idea is that the
user would enable aborts on only one of the nodes in the cluster to
get a core dump and not bring the whole cluster down.
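The throw-or-abort behaviour can be sketched roughly as follows. This is an illustration under assumptions, not the helper added by this series: the function names, the global flag and the exception type are chosen for the example only.

```cpp
#include <atomic>
#include <cassert>
#include <cstdlib>
#include <iostream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the state driven by --abort-on-internal-error.
static std::atomic<bool> abort_on_internal_error{false};

void set_abort_on_internal_error(bool enabled) {
    abort_on_internal_error.store(enabled);
}

// Report a violated invariant. With the flag set, abort so that a core
// dump is produced; otherwise throw so that only the current operation
// (e.g. repair) fails.
[[noreturn]] void on_internal_error(const std::string& msg) {
    if (abort_on_internal_error.load()) {
        std::cerr << "internal error: " << msg << std::endl;
        std::abort();
    }
    throw std::runtime_error(msg);
}
```

Keeping the decision in one helper means call sites only state the invariant that was violated, and the operator picks the failure mode per node.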
"
* 'catch-unclosed-partition-sstable-write' of https://github.com/tgrabiec/scylla:
sstables: writer: Validate that partition is closed when the input mutation stream ends
config, exceptions: Add helper for handling internal errors
utils: config_file: Introduce named_value::observe()
(cherry picked from commit 95c0804731)
(cherry picked from commit cf4c238b28)
This fixes a possible cause of #4614.
From the backtrace in that issue, it looks like a file is being closed
twice. The first point in the backtrace where that seems likely is in
the MC writer.
My first idea was to add a writer::close and make it the responsibility
of the code using the writer to call it. That way we would move work
out of the destructor.
That is a bit hard since the writer is destroyed from
flat_mutation_reader::impl::~consumer_adapter and that would need to
get a close function too.
This patch instead just fixes an exception safety issue. If
_index_writer->close() throws, _index_writer is still valid and
~writer will try to close it again.
If the exception was thrown after _completed.set_value(), that would
explain the assert about _completed.set_value() being called twice.
With this patch the path outside of the destructor now moves the
writer to a local variable before trying to close it.
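The pattern can be sketched as follows, assuming the index writer is held in a std::unique_ptr. The types here are simplified stand-ins, not the real MC writer: close_calls and fail exist only to make the double-close observable.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <utility>

// Hypothetical index writer whose close() may throw.
struct index_writer {
    int* close_calls;  // counts close() attempts, for demonstration only
    bool fail;         // when true, close() throws
    void close() {
        ++*close_calls;
        if (fail) {
            throw std::runtime_error("close failed");
        }
    }
};

class writer {
    std::unique_ptr<index_writer> _index_writer;
public:
    explicit writer(std::unique_ptr<index_writer> w)
        : _index_writer(std::move(w)) {}

    // Path outside of the destructor: move the member to a local first.
    // If close() throws, _index_writer is already null, so ~writer will
    // not try to close it a second time.
    void close_index() {
        auto w = std::move(_index_writer);
        if (w) {
            w->close();
        }
    }

    ~writer() {
        if (_index_writer) {
            try {
                _index_writer->close();
            } catch (...) {
                // Destructors must not throw.
            }
        }
    }
};
```

Without the move, a throwing close() would leave _index_writer set and the destructor would call close() again.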
Fixes #4614
Message-Id: <20190710171747.27337-1-espindola@scylladb.com>
(cherry picked from commit 281f3a69f8)
Before this patch the mc sstable writer was ignoring
empty cellpaths. This is wrong because
a map can have an empty key. In that case
the writer creates a broken sstable that we can't read back.
This is because a complex cell expects a cellpath for each
simple cell it has. When the writer ignores an empty cellpath
it writes nothing; instead it should write a length
of zero to the file so that we know there's an empty cellpath.
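A simplified illustration of why the zero length matters. This sketch assumes a one-byte length prefix, which is not the real on-disk encoding, and the function names are hypothetical. The reader expects one length-prefixed path per simple cell, so an empty path still has to occupy its length field.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

using bytes = std::vector<std::uint8_t>;

// Always write the length, even when it is zero, so that the reader can
// tell "empty cellpath" apart from "no cellpath at all".
void write_cell_path(bytes& out, const std::string& path) {
    if (path.size() > 255) {
        throw std::length_error("path too long for this one-byte sketch");
    }
    out.push_back(static_cast<std::uint8_t>(path.size()));
    out.insert(out.end(), path.begin(), path.end());
}

// Read one length-prefixed path starting at pos and advance pos past it.
std::string read_cell_path(const bytes& in, std::size_t& pos) {
    std::size_t len = in.at(pos++);
    if (pos + len > in.size()) {
        throw std::out_of_range("truncated cell path");
    }
    std::string path(reinterpret_cast<const char*>(in.data() + pos), len);
    pos += len;
    return path;
}
```

If the writer skipped the empty path entirely, the reader would interpret the next cell's bytes as the missing length and the rest of the row would be misparsed.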
Fixes #4533
Tests: unit(release)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <46242906c691a56a915ca5994b36baf87ee633b7.1560532790.git.piotr@scylladb.com>
(cherry picked from commit a41c9763a9)
"
Before this patchset empty counters were incorrectly persisted for
MC format. No value was written to disk for them. The correct way
is to still write a header that indicates that the counter is empty.
We also need to make sure that reading wrongly persisted empty
counters works because customers may have sstables with wrongly
persisted empty counters.
Fixes #4363
"
* 'haaawk/4363/v3' of github.com:scylladb/seastar-dev:
sstables: add test for empty counters
docs: add CorrectEmptyCounters to sstable-scylla-format
sstables: Add a feature for empty counters in Scylla.db.
sstables: Write header for empty counters
sstables: Remove unused variables in make_counter_cell
sstables: Handle empty counter value in read path
(cherry picked from commit 899ebe483a)
"
Static compact tables are tables with compact storage and no
clustering columns.
Before this patch, Scylla was writing rows of static compact tables as
clustered rows instead of as static rows. That's because in our in-memory
model such tables have regular rows and no static row. In Cassandra's
schema (since 3.x), those tables have columns which are marked as
static and there are no regular columns.
This worked fine as long as Scylla was writing and reading those
sstables. But when importing sstables from Cassandra, our reader was
skipping the static row, since it's not present in our schema, and
returning no rows as a result. Also, Cassandra, and Scylla tools,
would have problems reading those sstables.
Fix this by writing rows for such tables the same way as Cassandra
does. In order to support rolling downgrade, we do that only when all
nodes are upgraded.
Fixes #4139.
Tests:
- unit (dev)
"
* tag 'static-compact-mc-fix-v3.1' of github.com:tgrabiec/scylla:
tests: sstables: Test reading of static compact sstable generated by Cassandra
tests: sstables: Add test for writing and reading of static compact tables
sstables: mc: Write static compact tables the same way as Cassandra
sstable: mc: writer: Set _static_row_written inside write_static_row()
sstables: Add sstable::features()
sstables: mc: writer: Prepare write_static_row() for working with any column_kind
storage_service: Introduce the CORRECT_STATIC_COMPACT feature flag
sstables: mc: writer: Build indexed_columns together with serialization_header
sstables: mc: writer: De-optimize make_serialization_header()
sstable: mc: writer: Move attaching of mc-specific components out of generic code
(cherry picked from commit eddb98e8c6)
"
Currently we keep the entries in a circular_buffer, which uses
contiguous storage. For large partitions with many promoted index
entries this can cause OOM and sstable compaction failure.
A similar problem exists for the offset vector built
in write_promoted_index().
This change solves the problem by serializing promoted index entries
and the offset vector on the fly directly into a bytes_ostream, which
uses fragmented storage.
The serialization of the first entry is deferred, so that
serialization is avoided if there will be fewer than 2
entries. Promoted index is not added for such partitions.
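The deferral can be sketched like this. It is a minimal illustration under assumptions: promoted_index_builder, index_entry and the fixed-width big-endian encoding are hypothetical, and a std::vector stands in for the fragmented bytes_ostream.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical promoted index entry, reduced to two fields.
struct index_entry {
    std::uint64_t offset;
    std::uint64_t width;
};

class promoted_index_builder {
    std::optional<index_entry> _first;  // held back, not yet serialized
    std::vector<std::uint8_t> _buf;     // stand-in for a fragmented buffer
    std::size_t _count = 0;

    // Append v as 8 big-endian bytes.
    void put(std::uint64_t v) {
        for (int shift = 56; shift >= 0; shift -= 8) {
            _buf.push_back(static_cast<std::uint8_t>(v >> shift));
        }
    }

    void serialize(const index_entry& e) {
        put(e.offset);
        put(e.width);
    }
public:
    void add(const index_entry& e) {
        ++_count;
        if (_count == 1) {
            _first = e;  // defer: a single entry never needs serializing
            return;
        }
        if (_first) {
            serialize(*_first);  // second entry arrived, flush the first
            _first.reset();
        }
        serialize(e);
    }

    // A promoted index is only emitted for partitions with 2+ entries.
    bool has_promoted_index() const { return _count >= 2; }

    const std::vector<std::uint8_t>& serialized() const { return _buf; }
};
```

Entries are serialized as they arrive, so no per-entry container grows with the partition; only the first entry is kept in its unserialized form.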
The problem that a large enough promoted index can cause OOM still remains.
Refs #4217
Tests:
- unit (release)
- scylla-bench write
Branches: 3.0
"
* tag 'fix-large-alloc-for-promoted-index-v3' of github.com:tgrabiec/scylla:
sstables: mc: writer: Avoid large allocations for maintaining promoted index
sstables: mc: writer: Avoid double-serialization of the promoted index
(cherry picked from commit fdefee696e)
"
Contains several improvements for fast-forwarding and slicing readers. Mainly
for the MC format, but not only:
- Exiting the parser early when going out of the fast-forwarding window [MC-format-only]
- Avoiding reading of the head of the partition when slicing
- Avoiding parsing rows which are going to be skipped [MC-format-only]
"
* 'sstable-mc-optimize-slicing-reads' of github.com:tgrabiec/scylla:
sstables: mc: reader: Skip ignored rows before parsing them
sstables: mc: reader: Call _cells.clear() when row ends rather than when it starts
sstables: mc: mutation_fragment_filter: Take position_in_partition rather than a clustering_row
sstables: mc: reader: Do not call consume_row_marker_and_tombstone() for static rows
sstables: mc: parser: Allow the consumer to skip the whole row
sstables: continuous_data_consumer: Introduce skip()
sstables: continuous_data_consumer: Make position() meaningful inside state_processor::process_state()
sstables: mc: parser: Allocate dynamic_bitset once per read instead of once per row
sstables: reader: Do not read the head of the partition when index can be used
sstables: mc: mutation_fragment_filter: Check the fast-forward window first
sstables: mc: writer: Avoid calling unsigned_vint::serialized_size()
(cherry picked from commit e6d26a528f)
"
The motivation is to keep code related to each format separate, to make it
easier to comprehend and to reduce incremental compilation times.
Also reduces dependency on sstable writer code by removing writer bits from
sstables.hh.
The ka/la format writers are still left in sstables.cc; they could also be extracted.
"
* 'extract-sstable-writer-code' of github.com:tgrabiec/scylla:
sstables: Make variadic write() not picked on substitution error
sstables: Extract MC format writer to mc/writer.cc
sstables: Extract maybe_add_summary_entry() out of components_writer
sstables: Publish functions used by writers in writer.hh
sstables: Move common write functions to writer.hh
sstables: Extract sstable_writer_impl to a header
sstables: Do not include writer.hh from sstables.hh
sstables: mc: Extract bound_kind_m related stuff into mc/types.hh
sstables: types: Extract sstable_enabled_features::all()
sstables: Move components_writer to .cc
tests: sstable_datafile_test: Avoid dependency on components_writer
(cherry picked from commit b023e8b45d)