checksum_combine() is much slower than re-feeding the buffer to
checksum() for the zlib CRC32 checksummer.
Introduce Checksum::prefer_combine() to determine this and select
the more optimal behavior for a given checksummer.
Improves memtable flush performance with compression enabled by 30%.
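A sketch of how such a hook can drive the selection (the interface below is illustrative, not Scylla's actual Checksum class; only the prefer_combine() name comes from the commit, and the toy sum-of-bytes checksummer exists just to make the dispatch runnable):

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical checksummer interface; prefer_combine() lets callers pick
// between combining two partial checksums and re-feeding the raw bytes.
struct checksummer {
    virtual uint32_t checksum(uint32_t init, const char* buf, size_t len) const = 0;
    virtual uint32_t combine(uint32_t a, uint32_t b, size_t b_len) const = 0;
    // True when combine() is cheaper than re-running checksum() over the buffer.
    virtual bool prefer_combine() const = 0;
    virtual ~checksummer() = default;
};

// Toy sum-of-bytes checksum, for which combining is trivially cheap.
struct sum_checksummer : checksummer {
    uint32_t checksum(uint32_t init, const char* buf, size_t len) const override {
        for (size_t i = 0; i < len; i++) init += (unsigned char)buf[i];
        return init;
    }
    uint32_t combine(uint32_t a, uint32_t b, size_t) const override { return a + b; }
    bool prefer_combine() const override { return true; }
};

// Caller-side dispatch: combine partial checksums only when the
// checksummer says that is faster; otherwise re-feed the buffer.
uint32_t extend(const checksummer& c, uint32_t acc,
                uint32_t part, const char* buf, size_t len) {
    return c.prefer_combine() ? c.combine(acc, part, len)
                              : c.checksum(acc, buf, len);
}
```

For zlib CRC32, the same dispatch would return false from prefer_combine() and take the re-feed branch.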
do_process_buffer had two unreachable default cases and a long
if-else-if chain.
This converts the if-else-if chain to a switch and a helper
function.
This moves the error checking from run time to compile time. If we
were to add a 128-bit integer, for example, gcc would complain about it
missing from the switch.
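A generic illustration of the technique (not the actual do_process_buffer code): with `-Wswitch -Werror`, a `default`-free switch over an enum fails to compile as soon as an enumerator is missing a case:

```cpp
#include <cstddef>

enum class int_kind { i8, i16, i32, i64 };  // adding i128 here...

// ...makes gcc -Wswitch -Werror reject this function until a case is
// added, because no default: swallows unknown enumerators.
constexpr size_t width(int_kind k) {
    switch (k) {
    case int_kind::i8:  return 1;
    case int_kind::i16: return 2;
    case int_kind::i32: return 4;
    case int_kind::i64: return 8;
    }
    __builtin_unreachable();  // quiets "control reaches end" after full coverage
}
```

This is exactly why the unreachable default cases had to go: a default arm would silence the warning and push the error back to run time.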
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20181125221451.106067-1-espindola@scylladb.com>
This is needed for parallel compaction to work with the sstable-run-based
approach. That's because regular compaction clones a set containing all
sstables of its column family, so compaction A can potentially hold a
reference to a compacting sstable of compaction B, preventing compaction B
from releasing its exhausted sstables.
So all replacements are propagated to all compactions of a given column
family, and the compactions in turn, including the one which initiated the
propagation, will perform the replacement.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Motivation is that we need a more efficient way to find the compactions
that belong to a given column family in the compaction list.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Motivation is that it will be useful for catching regressions in compaction
when releasing exhausted sstables early. That's because an sstable's space
is only released once it's closed. So this will allow us to write a test
case and possibly use it for entities holding an exhausted sstable.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Filter out sstables belonging to a partial run being generated by an ongoing
compaction. Otherwise, that could lead to wrong decisions by the compaction
strategy.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
SSTables composing the same run will share the same run identifier.
Therefore, a new compaction strategy will be able to get all sstables
belonging to the same run from the sstable_set, which now keeps track of
existing runs.
The same UUID is passed to all writers of a given compaction; otherwise,
a new UUID would be picked for every sstable created by the compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
An sstable run is a structure that will hold all sstables that have the
same run identifier. Sstables belonging to the same run do not overlap
with one another.
It can be used by a compaction strategy to work on runs instead of
individual sstables.
The sstable_set structure, which holds all sstables for a given column
family, will be responsible for providing its users an interface to work
with runs instead of individual sstables.
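A rough sketch of the shape these structures could take (all types and names below are illustrative stand-ins, not the actual Scylla API; the run identifier is reduced to an int in place of the UUID):

```cpp
#include <memory>
#include <unordered_map>
#include <vector>

using run_id = int;               // stands in for the UUID run identifier
struct sstable { run_id run; };   // real sstables carry much more state

// Holds the fragments (sstables) of one run; by construction the
// fragments of a run do not overlap in key range.
class sstable_run {
    std::vector<std::shared_ptr<sstable>> _fragments;
public:
    void insert(std::shared_ptr<sstable> sst) {
        _fragments.push_back(std::move(sst));
    }
    const std::vector<std::shared_ptr<sstable>>& fragments() const {
        return _fragments;
    }
};

// The set groups sstables by run identifier, so a compaction strategy
// can iterate over whole runs rather than individual sstables.
class sstable_set {
    std::unordered_map<run_id, sstable_run> _runs;
public:
    void insert(std::shared_ptr<sstable> sst) {
        _runs[sst->run].insert(std::move(sst));
    }
    const std::unordered_map<run_id, sstable_run>& runs() const {
        return _runs;
    }
};
```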
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That's important so that a reference to an sstable is not kept throughout
the compaction procedure, which would defeat the goal of releasing
space during compaction.
The manager passes a callback to compaction, and compaction calls it
whenever there's an sstable replacement.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Motivation is that we want to release the space of exhausted sstables, and
that will only happen when all references to an sstable are gone *and* the
backlog tracker takes the early replacement into account.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
By doing that, we'll be able to release an exhausted sstable from both
simultaneously.
That's achieved by sharing the set containing input sstables with the
incremental reader selector and removing exhausted sstables from the
shared set when the time has come.
This is a step towards reducing the disk requirement for compaction by
making it delete an sstable once all of its data is in a sealed new
sstable. For that to happen, all references must be gone.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Currently, compaction only replaces input sstables at the end of
compaction, meaning compaction must be finished for all the space of
those sstables to be released.
What we can do instead is delete some input sstables earlier, under
these conditions:
1) The sstable's data must be committed to a new, sealed output sstable,
meaning it's exhausted.
2) An exhausted sstable mustn't overlap with a non-exhausted one,
because a tombstone in the exhausted sstable could have been purged,
and the shadowed data in the non-exhausted one could be resurrected
if the system crashes.
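Condition 2 boils down to an interval-overlap test over the sstables' key ranges; schematically (keys reduced to integers, all names hypothetical):

```cpp
#include <vector>

struct key_range { int first, last; };  // stand-in for token/key bounds

// Two sstables overlap iff their inclusive [first, last] ranges intersect.
bool ranges_overlap(key_range a, key_range b) {
    return a.first <= b.last && b.first <= a.last;
}

// An exhausted sstable may be deleted early only if its key range does
// not intersect the range of any non-exhausted input sstable.
bool can_delete_early(key_range exhausted,
                      const std::vector<key_range>& non_exhausted) {
    for (const auto& r : non_exhausted) {
        if (ranges_overlap(exhausted, r)) {
            return false;
        }
    }
    return true;
}
```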
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The motivation is that compaction may remove an sstable from the set while
the incremental selector is alive, and for that to work, we need to
invalidate the iterators stored by the selector. We could have added a
method to notify it, but there will be cases where the one keeping the set
cannot forward the notification to the selector, so it's better for the
selector to take care of itself. A change-counter approach is used, which
allows the selector to know when to invalidate its iterators.
After invalidation, the selector will move the iterator back into its
right place by looking for the lower bound of the current position.
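The change-counter idea in a generic sketch (illustrative classes, not the actual incremental selector): the selector remembers the counter value at capture time and re-seeks with lower_bound whenever the set has changed underneath it:

```cpp
#include <map>
#include <string>

// A set that bumps a generation counter on every mutation.
class tracked_set {
    std::map<int, std::string> _m;
    unsigned _generation = 0;
public:
    void insert(int k, std::string v) { _m.emplace(k, std::move(v)); ++_generation; }
    void erase(int k) { _m.erase(k); ++_generation; }
    unsigned generation() const { return _generation; }
    const std::map<int, std::string>& contents() const { return _m; }
};

// Caches an iterator; if the set's generation changed, it re-seeks to
// the lower bound of the requested position instead of trusting a
// possibly dangling iterator.
class selector {
    const tracked_set& _set;
    std::map<int, std::string>::const_iterator _it;
    unsigned _seen;
public:
    explicit selector(const tracked_set& s)
        : _set(s), _it(s.contents().begin()), _seen(s.generation()) {}

    const std::string* select(int pos) {
        if (_seen != _set.generation()) {  // set mutated: iterator invalid
            _it = _set.contents().lower_bound(pos);
            _seen = _set.generation();
        } else {
            while (_it != _set.contents().end() && _it->first < pos) ++_it;
        }
        return (_it != _set.contents().end() && _it->first == pos)
            ? &_it->second : nullptr;
    }
};
```

Note the stale iterator is never dereferenced: the generation check runs first, which is what makes the "selector takes care of itself" scheme safe.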
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Older sstables must have an identifier for them to be associated
with their own run.
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
It identifies the run which a particular sstable belongs to.
Existing sstables will have a random UUID associated with them
in memory.
UUID is the correct choice because it allows sstables to be
exported without conflicts between identifiers generated by
different nodes.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
These switches are fully covered. We can be sure they will stay this
way because of -Werror and gcc's -Wswitch warning.
We can also be sure that we never have an invalid enum value since the
state machine values are not read from disk.
The patch also removes a superfluous ';'.
Message-Id: <20181124020128.111083-1-espindola@scylladb.com>
The reason for that is that it's not available in sstable format mc,
so we can no longer rely on it in common code for the currently
supported formats.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20181121170057.20900-1-raphaelsc@scylladb.com>
The format message was using the new style formatting markers ("{}")
which are understood by format() but not by sprint() (the latter is
basically deprecated).
The old code was serializing the row twice. Once to get the size of
its block on disk, which is needed to write the block length, and then
to actually write the block.
This patch avoids this by serializing once into a temporary buffer and
then appending that buffer to the data file writer.
I measured about 10% improvement in memtable flush throughput with
this for the small-part dataset in perf_fast_forward.
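In outline, the change amounts to something like this (buffer and writer types are simplified stand-ins for the real sstable data-file writer):

```cpp
#include <cstdint>
#include <string>
#include <vector>

using buffer = std::vector<uint8_t>;

// Toy row serializer standing in for the real row-writing code.
void serialize_row(buffer& out, const std::string& row) {
    out.insert(out.end(), row.begin(), row.end());
}

// Simplified varint length encoding (LEB128-style).
void write_varint(buffer& out, size_t n) {
    do {
        out.push_back((n & 0x7f) | (n > 0x7f ? 0x80 : 0));
        n >>= 7;
    } while (n);
}

// Old approach: serialize once into a throwaway sink just to learn the
// block size, then serialize again for real. New approach below:
// serialize once into a temporary buffer, write its length, then append
// the buffer to the output.
void write_block(buffer& file, const std::string& row) {
    buffer tmp;
    serialize_row(tmp, row);        // single serialization pass
    write_varint(file, tmp.size()); // block length prefix
    file.insert(file.end(), tmp.begin(), tmp.end());
}
```

The cost of one temporary buffer replaces a full second serialization pass per row, which is where the flush-throughput win comes from.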
* seastar d59fcef...b924495 (2):
> build: Fix protobuf generation rules
> Merge "Restructure files" from Jesse
Includes fixup patch from Jesse:
"
Update Seastar `#include`s to reflect restructure
All Seastar header files are now prefixed with "seastar" and the
configure script reflects the new locations of files.
Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <5d22d964a7735696fb6bb7606ed88f35dde31413.1542731639.git.jhaberku@scylladb.com>
"
In commit a33f0d6, we changed the way we handle arrays during the write
and parse code to avoid reactor stalls. Some potentially big loops were
transformed into futurized loops, and also some calls to vector resizes
were replaced by a reserve + push_back idiom.
The latter broke parsing of the estimated histogram. The reason being
that the vectors that are used here are already initialized internally
by the estimated_histogram object. Therefore, when we push_back, we
don't fill the array all the way from index 0, but end up with a zeroed
beginning and only push back some of the elements we need.
We could revert this array to a resize() call. After all, the reason we
are using reserve + push_back is to avoid calling the constructor for
each element, but we don't really expect the integer specialization
to do any of that.
However, to avoid confusion with future developers who may feel tempted
to convert this as well for the sake of consistency, it is safer to
just make sure these arrays are zeroed.
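The failure mode is easy to reproduce in isolation: for a vector the object has already sized, reserve + push_back appends after the pre-existing zeros instead of overwriting them (a sketch, not the actual estimated_histogram code):

```cpp
#include <cstdint>
#include <vector>

struct histogram {
    // Pre-sized by the object itself, as estimated_histogram does.
    std::vector<int64_t> buckets = std::vector<int64_t>(4);
};

// Buggy parse: reserve + push_back grows the already-sized vector,
// keeping the zeroed prefix and doubling the length.
void parse_buggy(histogram& h, const std::vector<int64_t>& disk) {
    h.buckets.reserve(disk.size());
    for (auto v : disk) h.buckets.push_back(v);
}

// The reserve + push_back idiom is only valid when the vector starts
// out empty, so clear the pre-initialized contents first.
void parse_fixed(histogram& h, const std::vector<int64_t>& disk) {
    h.buckets.clear();
    h.buckets.reserve(disk.size());
    for (auto v : disk) h.buckets.push_back(v);
}
```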
Fixes #3918
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20181116130853.10473-1-glauber@scylladb.com>
When moving sstables between directories, this helper function
will create links and update generation and dir accordingly.
It's expected to be called in thread context.
"
It appears that in the case when there are any static columns in the
serialization header, Cassandra writes a (possibly empty) static row to
every partition in the SSTable file.
This patchset aligns Scylla's logic with that of Cassandra.
Note that Cassandra optimizes the case when no partition contains a static
row by keeping track of updated columns, which Scylla currently does
not do - see #3901 for details.
Fixes #3900.
"
* 'projects/sstables-30/write-all-static-rows/v1' of https://github.com/argenet/scylla:
tests: Test writing empty static rows for partitions in tables with static columns.
sstables: Ignore empty static rows on reading.
sstables: Write empty static rows when there are static columns in the table.
The MC format lacks ancestors metadata, so we need to work around it by
using the ancestors in the metadata collector, which is only available for
an sstable written during this instance. It works fine here because we
only want to know if a recently compacted sstable has an ancestor which
wasn't yet deleted.
Fixes #3852.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Reviewed-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <20181102154951.22950-1-raphaelsc@scylladb.com>
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().
Mechanically converted with https://github.com/avikivity/unsprint.
It is possible to have collections in a static row so we need to check
for collection-wide tombstones like with clustering rows.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Collections are permitted in static rows, so the same partitioning as for
regular columns is required.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
In Cassandra, row columns are stored in a BTree that uses the following
ordering on them:
- all atomic columns go first, then all multi-cell ones
- within each group (atomic and multi-cell), columns are
lexicographically ordered by name
Since the schema already has all columns lexicographically sorted by name,
we only need to stably partition them by atomicity.
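With std::stable_partition this is a one-liner over the name-sorted column list (illustrative types; the atomic flag stands in for the real multi-cell check):

```cpp
#include <algorithm>
#include <string>
#include <vector>

struct column {
    std::string name;
    bool atomic;  // false for multi-cell (collection) columns
};

// Reorder name-sorted columns to Cassandra's BTree order: atomic
// columns first, then multi-cell ones, preserving the lexicographic
// name order within each group (stable_partition keeps relative order).
void to_cassandra_order(std::vector<column>& cols) {
    std::stable_partition(cols.begin(), cols.end(),
                          [](const column& c) { return c.atomic; });
}
```

The stability guarantee is the whole point here: a plain std::partition would satisfy the grouping but could scramble the name order within each group.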
Fixes #3853
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
This representation makes it easier to operate with compound structures
instead of separate values that were stored in multiple containers.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Previously, we've been writing the wrong missing-columns indices for
static rows because write_missing_columns() explicitly used regular
columns internally.
Now it takes the proper column kind into account.
Fixes #3892
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
"
This patchset addresses two problems with shadowable deletion handling
in SSTables 3.x ('mc' format).
Firstly, we previously did not set the flag indicating the presence of an
extended flags byte with the HAS_SHADOWABLE_DELETION bitmask on writing.
This would break subsequent reading and cause all kinds of failures, up
to crashes.
Secondly, when reading rows with this extended flag set, we need to
preserve that information and create a shadowable_tombstone for the row.
Tests: unit {release}
+
Verified manually with 'hexdump' and a modified 'sstabledump' that the
second (shadowable) tombstone is written for MV tables by Scylla.
+
DTest (materialized_views_test.py:TestMaterializedViews.hundred_mv_concurrent_test)
that originally failed due to this issue has successfully passed locally.
"
* 'projects/sstables-30/shadowable-deletion/v4' of https://github.com/argenet/scylla:
tests: Add tests writing both regular and shadowable tombstones to SSTables 3.x.
tests: Add test covering writing and reading a shadowable tombstone with SSTables 3.x.
sstables: Support Scylla-specific extension for writing shadowable tombstones.
sstables: Introduce a feature for shadowable tombstones in Scylla.db.
memtable: Track regular and shadowable tombstones separately in encoding_stats_collector.
sstables: Error out when reading SSTables 3.x with Cassandra shadowable deletion.
sstables: Support checking row extension flags for Cassandra shadowable deletion.
Even when we're using a full clustering range, need_skip() will return
true when we start a new partition and advance_context() will be
called with position_in_partition::before_all_clustered_rows(). We
should detect that there is no need to skip to that position before
the call to advance_to(*_current_partition_key), which will read the
index page.
Fixes #3868.
Message-Id: <1539881775-8578-1-git-send-email-tgrabiec@scylladb.com>
After the new in-memory representation of cells was introduced, there was
a regression in atomic_cell_or_collection::operator<< which stopped
printing the content of the cell. This makes debugging more inconvenient
and time-consuming. This patch fixes the problem: the schema is propagated
to the atomic_cell_or_collection printer and the full content of the
cell is printed.
Fixes #3571.
Message-Id: <20181024095413.10736-1-pdziepak@scylladb.com>
The original SSTables 'mc' format, as defined in Cassandra, does not provide
a way to store shadowable deletion in addition to regular row deletion
for materialized views.
It is essential to store it because of known corner-case issues that
otherwise appear.
For this to work, we introduce a Scylla-specific extended flag to be set
in SSTables in 'mc' format that indicates a shadowable tombstone is
written after the regular row tombstone.
This is deemed to be safe because shadowable tombstones are specific to
materialized views and MV tables are not supposed to be imported or
exported.
Note that a shadowable tombstone can be written without a regular
tombstone as well as along with it.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
This is used to indicate that the SSTables being read may contain the
Scylla-specific HAS_SCYLLA_SHADOWABLE_TOMBSTONE extended flag.
If the feature is not enabled, we should not honour this flag.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
This flag can only be used in MV tables, which are not supposed to be
imported to Scylla.
Since Scylla representation of shadowable tombstones differs from that
of Cassandra, such SSTables are rejected on read and Scylla never sets
this flag on writing.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Previously we were making assumptions about missing columns
(the size of their values, whether they're collections or counters),
but those assumptions weren't always true. Now we use the column type
from the serialization header to get the right values.
Fixes #3859
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Without that, we don't know where to look for the problems.
Before:
compaction failed: sstables::malformed_sstable_exception (Too big ttl: 3163676957)
After:
compaction_manager - compaction failed: sstables::malformed_sstable_exception (Too big ttl: 4294967295 in sstable /var/lib/scylla/data/system_traces/events-8826e8e9e16a372887533bc1fc713c25/mc-832-big-Data.db)
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20181016181004.17838-1-glauber@scylladb.com>