1616 Commits

Author SHA1 Message Date
Tomasz Grabiec
743cf43847 sstables: Avoid checksum_combine() for the crc32 checksummer
checksum_combine() is much slower than re-feeding the buffer to
checksum() for the zlib CRC32 checksummer.

Introduce Checksum::prefer_combine() to determine this and select
more optimal behavior for given checksummer.

Improves performance of memtable flush with compression enabled by 30%.
2018-11-26 18:57:33 +01:00
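The selection described in the commit above can be sketched as follows. This is only an illustration under stated assumptions: `crc32_checksummer`, `byte_sum_checksummer` and `update_full_checksum()` are hypothetical names, the CRC is a plain bitwise implementation rather than zlib's, and only `prefer_combine()` and `checksum_combine()` echo names from the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical checksummer: bitwise CRC-32 (reflected polynomial 0xEDB88320).
// Combining two CRCs is assumed to be costly, so it opts out of combining.
struct crc32_checksummer {
    static constexpr bool prefer_combine() { return false; }
    static uint32_t checksum(uint32_t crc, const char* data, size_t len) {
        crc = ~crc;
        for (size_t i = 0; i < len; ++i) {
            crc ^= static_cast<unsigned char>(data[i]);
            for (int bit = 0; bit < 8; ++bit) {
                crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
            }
        }
        return ~crc;
    }
};

// Hypothetical checksummer with a trivially cheap combine: sum of bytes mod 2^32.
struct byte_sum_checksummer {
    static constexpr bool prefer_combine() { return true; }
    static uint32_t checksum(uint32_t sum, const char* data, size_t len) {
        for (size_t i = 0; i < len; ++i) {
            sum += static_cast<unsigned char>(data[i]);
        }
        return sum;
    }
    static uint32_t checksum_combine(uint32_t a, uint32_t b, size_t /*len_b*/) {
        return a + b;
    }
};

// Extends the running full-file checksum by one chunk, either by combining the
// already computed per-chunk checksum or by re-feeding the chunk's bytes.
template <typename Checksummer>
uint32_t update_full_checksum(uint32_t full, uint32_t chunk_sum,
                              const char* data, size_t len) {
    if constexpr (Checksummer::prefer_combine()) {
        return Checksummer::checksum_combine(full, chunk_sum, len);
    } else {
        return Checksummer::checksum(full, data, len);
    }
}
```

Both branches must yield the checksum of the concatenated data; the trait only decides which route is cheaper for a given checksummer.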
Tomasz Grabiec
88cf1c61ba sstables: compress: Avoid unnecessary checksum_combine() 2018-11-26 14:31:38 +01:00
Tomasz Grabiec
8372cf7bcc sstables: checksum_utils: Add missing include 2018-11-26 14:31:38 +01:00
Rafael Ávila de Espíndola
6746907999 Use fully covered switches in continuous_data_consumer
do_process_buffer had two unreachable default cases and a long
if-else-if chain.

This converts the if-else-if chain to a switch and a helper
function.

This moves the error checking from run time to compile time. If we
were to add a 128 bit integer for example, gcc would complain about it
missing from the switch.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20181125221451.106067-1-espindola@scylladb.com>
2018-11-25 22:52:11 +00:00
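A minimal sketch of the fully covered switch idea from the commit above. The enum `int_kind` and the helper `size_of()` are hypothetical and are not taken from continuous_data_consumer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical set of integer kinds a consumer might read.
enum class int_kind { u8, u16, u32, u64 };

// No default case: with -Wswitch (part of -Wall) and -Werror, adding a new
// enumerator without handling it here fails the build instead of failing at
// run time.
inline size_t size_of(int_kind k) {
    switch (k) {
    case int_kind::u8:  return sizeof(uint8_t);
    case int_kind::u16: return sizeof(uint16_t);
    case int_kind::u32: return sizeof(uint32_t);
    case int_kind::u64: return sizeof(uint64_t);
    }
    // Only reachable with an invalid enum value, which the sketch treats as a bug.
    std::abort();
}
```

Adding, say, a `u128` enumerator would make gcc report it as unhandled in `size_of()`.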
Raphael S. Carvalho
2058001f94 sstables/compaction: propagate sstable replacement to all compaction of a CF
This is needed for parallel compaction to work with sstable run based approach.
That's because regular compaction clones a set containing all sstables of its
column family. So compaction A can potentially hold a reference to a compacting
sstable of compaction B, so preventing compacting B from releasing its exhausted
sstable.

So all replacements are propagated to all compactions of a given column family,
and compactions in turn, including the one which initiated the propagation,
will do the replacement.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2018-11-24 18:53:30 -02:00
Raphael S. Carvalho
953fdcc867 sstables: store cf pointer in compaction_info
motivation is that we need a more efficient way to find compactions
that belong to a given column family in compaction list.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2018-11-24 18:53:28 -02:00
Raphael S. Carvalho
824c20b76d sstables: add sstable's on closed handling
Motivation is that it will be useful for catching regression on compaction
when releasing early exhausted sstables. That's because sstable's space
is only released once it's closed. So this will allow us to write a test
case and possibly use it for entities holding exhausted sstable.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2018-11-24 18:53:25 -02:00
Raphael S. Carvalho
e88d1d54b9 sstables/compaction_manager: prevent partial run from being selected for compaction
Filter out sstable belonging to a partial run being generated by an ongoing
compaction. Otherwise, that could lead to wrong decisions by the compaction
strategy.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2018-11-24 18:53:22 -02:00
Raphael S. Carvalho
23884fe9f6 compaction: use same run identifier for sstables generated by same compaction
SSTables composing the same run will share the same run identifier.
Therefore, a new compaction strategy will be able to get all sstables belonging
to the same run from sstable_set, which now keeps track of existing runs.

Same UUID is passed to writers of a given compaction. Otherwise, a new UUID
is picked for every sstable created by compaction.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2018-11-24 18:53:20 -02:00
Raphael S. Carvalho
4f68cb34a6 sstables: introduce sstable run
sstable run is a structure that will hold all sstables that have the same
run identifier. All sstables belonging to the same run will not overlap
with one another.
It can be used by compaction strategy to work on runs instead of individual
sstables.

sstable_set structure which holds all sstables for a given column family
will be responsible for providing to its user an interface to work with
runs instead of individual sstables.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2018-11-24 18:53:18 -02:00
Raphael S. Carvalho
fc92fb955d sstables/compaction_manager: release reference to exhausted sstable through callback
That's important for the reference to sstable to not be kept throughout
the compaction procedure, which would break the goal of releasing
space during compaction.

Manager passes a callback to compaction which calls it whenever
there's sstable replacement.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2018-11-24 18:53:16 -02:00
Raphael S. Carvalho
3f309ebba9 sstables/compaction: stop tracking exhausted input sstable in compaction_read_monitor
Motivation is that we want to release space for exhausted sstable and that
will only happen when all references to it are gone *and* that backlog
tracker takes the early replacement into account.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2018-11-24 18:53:13 -02:00
Raphael S. Carvalho
f6df949c1a compaction: share sstable set with incremental reader selector
By doing that, we'll be able to release exhausted sstable from both
simultaneously.
That's achieved by sharing set containing input sstables with the incremental
reader selector and removing exhausted sstables from shared set when the
time has come.

Step towards reducing disk requirement for compaction by making it delete
sstable whose data is all in a sealed new sstable. For that to happen,
all references must be gone.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2018-11-24 18:53:10 -02:00
Raphael S. Carvalho
e5a0b05c15 sstables/compaction: release space earlier of exhausted input sstables
Currently, compaction only replaces input sstables at the end of compaction,
meaning compaction must be finished for all the space of those sstables
to be released.

What we can do instead is to delete some input sstables earlier under
some conditions:

1) SStable data should be committed to a new, sealed output sstable,
meaning it's exhausted.
2) Exhausted sstable mustn't overlap with a non-exhausted sstable
because a tombstone in the exhausted sstable could have been purged and the
shadowed data in the non-exhausted one could be resurrected if the system
crashes.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2018-11-24 18:53:07 -02:00
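The two conditions listed in the commit above can be sketched as a predicate. This is an assumption-laden illustration: `input_sstable`, `overlaps()` and `can_release_early()` are hypothetical names, and keys are modelled as plain strings rather than Scylla's real key types.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical view of a compaction input: its key range and whether all of
// its data has already been committed to a sealed output sstable.
struct input_sstable {
    std::string first_key;
    std::string last_key;
    bool exhausted;
};

// Closed key ranges overlap when each one starts before the other one ends.
inline bool overlaps(const input_sstable& a, const input_sstable& b) {
    return a.first_key <= b.last_key && b.first_key <= a.last_key;
}

// Condition 1: the candidate is exhausted.
// Condition 2: it overlaps no input that is still being read, because data
// shadowed by a purged tombstone could otherwise reappear after a crash.
inline bool can_release_early(const input_sstable& candidate,
                              const std::vector<input_sstable>& inputs) {
    if (!candidate.exhausted) {
        return false;
    }
    for (const auto& other : inputs) {
        if (!other.exhausted && overlaps(candidate, other)) {
            return false;
        }
    }
    return true;
}
```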
Raphael S. Carvalho
ace070c8fc sstables: make partitioned sstable set's incremental selector resilient to changes in the set
The motivation is that compaction may remove a sstable from the set while the
incremental selector is alive, and for that to work, we need to invalidate
the iterators stored by the selector. We could have added a method to notify
it, but there will be a case where the one keeping the set cannot forward
the notification to the selector. So it's better for the selector to take
care of itself. Change counter approach is used which allows the selector
to know when to invalidate the iterators.

After invalidation, selector will move the iterator back into its right
place by looking for lower bound for current pos.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2018-11-24 18:53:05 -02:00
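The change counter approach from the commit above can be sketched with a plain `std::set`. All names here (`tracked_set`, `incremental_selector`) are hypothetical stand-ins, and the sketch re-seeks with `upper_bound()` of the last returned element, a small variation on the "lower bound for current pos" wording in the message.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <set>

// Hypothetical set that bumps a counter on every modification.
class tracked_set {
    std::set<int> _items;
    uint64_t _changes = 0;
public:
    void insert(int v) { _items.insert(v); ++_changes; }
    void erase(int v) { _items.erase(v); ++_changes; }
    uint64_t change_count() const { return _changes; }
    const std::set<int>& items() const { return _items; }
};

// Hypothetical selector that walks the set in order and survives changes.
class incremental_selector {
    const tracked_set& _set;
    std::set<int>::const_iterator _it;
    uint64_t _seen_changes;
    std::optional<int> _last; // last element handed out, if any
public:
    explicit incremental_selector(const tracked_set& s)
        : _set(s), _it(s.items().begin()), _seen_changes(s.change_count()) {}

    std::optional<int> next() {
        // The stored iterator may dangle after a change, so it is never
        // dereferenced before the counters have been compared.
        if (_seen_changes != _set.change_count()) {
            _it = _last ? _set.items().upper_bound(*_last)
                        : _set.items().begin();
            _seen_changes = _set.change_count();
        }
        if (_it == _set.items().end()) {
            return std::nullopt;
        }
        _last = *_it;
        ++_it;
        return _last;
    }
};
```

The owner of the set never has to notify the selector; the selector notices the change on its next call.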
Raphael S. Carvalho
a66b1954cc sstables: use a random uuid for sstables without run identifier
Older sstables must have an identifier for them to be associated
with their own run.

Reviewed-by: Nadav Har'El <nyh@scylladb.com>

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2018-11-24 18:53:01 -02:00
Raphael S. Carvalho
62025fa52c sstables: add run identifier to scylla metadata
It identifies a run which a particular sstable belongs to.
Existing sstables will have a random uuid associated with them
in memory.

UUID is the correct choice because it allows sstables to be
exported without having conflicts when using identifier generated
by different nodes.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2018-11-24 18:52:44 -02:00
Rafael Ávila de Espíndola
d18bbe9d45 Remove unreachable default cases.
These switches are fully covered. We can be sure they will stay this
way because of -Werror and gcc's -Wswitch warning.

We can also be sure that we never have an invalid enum value since the
state machine values are not read from disk.

The patch also removes a superfluous ';'.
Message-Id: <20181124020128.111083-1-espindola@scylladb.com>
2018-11-24 09:31:51 +00:00
Raphael S. Carvalho
d29482dce8 sstables: deprecate sstable metadata's ancestors
The reason for that is that it's not available in sstable format mc,
so we can no longer rely on it in common code for the currently
supported formats.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20181121170057.20900-1-raphaelsc@scylladb.com>
2018-11-23 19:38:32 +01:00
Paweł Dziepak
edb5402a73 sstable: use format() instead of sprint()
The format message was using the new style formatting markers ("{}")
which are understood by format() but not by sprint() (the latter is
basically deprecated).
2018-11-22 11:30:31 +00:00
Tomasz Grabiec
049926bfb8 sstables: mc: Avoid serialization of promoted index when empty
calculate_write_size() adds some overhead, even if we're not going to
write anything.
2018-11-21 14:04:27 +01:00
Tomasz Grabiec
0a9f5b563a sstables: mc: Avoid double serialization of rows
The old code was serializing the row twice. Once to get the size of
its block on disk, which is needed to write the block length, and then
to actually write the block.

This patch avoids this by serializing once into a temporary buffer and
then appending that buffer to the data file writer.

I measured about 10% improvement in memtable flush throughput with
this for the small-part dataset in perf_fast_forward.
2018-11-21 14:04:27 +01:00
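A minimal sketch of the single-serialization idea from the commit above. The types and functions (`row`, `serialize_row()`, `write_row_block()`) are hypothetical, and the 4-byte big-endian length prefix is an assumption made for the example, not the actual 'mc' on-disk encoding.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using bytes = std::vector<uint8_t>;

// Hypothetical row: just a list of already encoded cell bytes.
struct row {
    bytes cells;
};

// Stand-in for the (potentially expensive) row serializer.
inline void serialize_row(const row& r, bytes& out) {
    out.insert(out.end(), r.cells.begin(), r.cells.end());
}

// Serializes the row once into a temporary buffer, writes the block length
// taken from that buffer, then appends the buffer itself.
inline void write_row_block(bytes& file, const row& r) {
    bytes tmp;
    serialize_row(r, tmp);
    const uint32_t len = static_cast<uint32_t>(tmp.size());
    for (int shift = 24; shift >= 0; shift -= 8) {
        file.push_back(static_cast<uint8_t>(len >> shift));
    }
    file.insert(file.end(), tmp.begin(), tmp.end());
}
```

The trade-off is one extra buffer copy in exchange for not running the serializer a second time just to learn the size.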
Tomasz Grabiec
8e8b96c6ed sstables: checksummed_file_data_sink_impl: Bypass output_stream
We can avoid the data copying by switching from this:

  sink -> stream -> sink

to this:

  sink -> sink
2018-11-21 14:04:27 +01:00
Avi Kivity
775b7e41f4 Update seastar submodule
* seastar d59fcef...b924495 (2):
  > build: Fix protobuf generation rules
  > Merge "Restructure files" from Jesse

Includes fixup patch from Jesse:

"
Update Seastar `#include`s to reflect restructure

All Seastar header files are now prefixed with "seastar" and the
configure script reflects the new locations of files.

Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <5d22d964a7735696fb6bb7606ed88f35dde31413.1542731639.git.jhaberku@scylladb.com>
"
2018-11-21 00:01:44 +02:00
Glauber Costa
c6811bd877 sstables: correctly parse estimated histograms
In commit a33f0d6, we changed the way we handle arrays during the write
and parse code to avoid reactor stalls. Some potentially big loops were
transformed into futurized loops, and also some calls to vector resizes
were replaced by a reserve + push_back idiom.

The latter broke parsing of the estimated histogram. The reason being
that the vectors that are used here are already initialized internally
by the estimated_histogram object. Therefore, when we push_back, we
don't fill the array all the way from index 0, but end up with a zeroed
beginning and only push back some of the elements we need.

We could revert this array to a resize() call. After all, the reason we
are using reserve + push_back is to avoid calling the constructor member
for each element, but we don't really expect the integer specialization
to do any of that.

However, to avoid confusing future developers who may feel tempted
to convert this as well for the sake of consistency, it is safer to
just make sure these arrays are zeroed.

Fixes #3918

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20181116130853.10473-1-glauber@scylladb.com>
2018-11-16 20:52:44 +02:00
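The failure mode described in the commit above can be reproduced with a plain `std::vector`. The function names are hypothetical, and clearing the vector before appending is only one possible remedy assumed for this sketch; the actual patch is not shown here.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Appends parsed values with reserve + push_back. If `buckets` was already
// sized by its owner, the new values land after the existing zeroes.
inline void fill_appending(std::vector<int64_t>& buckets,
                           const std::vector<int64_t>& parsed) {
    buckets.reserve(buckets.size() + parsed.size());
    for (int64_t v : parsed) {
        buckets.push_back(v);
    }
}

// Same idiom, but starts from an empty vector so that the first parsed value
// ends up at index 0.
inline void fill_from_start(std::vector<int64_t>& buckets,
                            const std::vector<int64_t>& parsed) {
    buckets.clear();
    buckets.reserve(parsed.size());
    for (int64_t v : parsed) {
        buckets.push_back(v);
    }
}
```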
Piotr Sarna
ff361ca877 sstables: add move_to_new_dir_in_thread function
When moving sstables between directories, this helper function
will create links and update generation and dir accordingly.
It's expected to be called in thread context.
2018-11-13 11:45:30 +01:00
Piotr Sarna
b7977f4790 sstables: add staging directory to regex
datadir/staging directory becomes a valid path for an sstable.
2018-11-13 11:45:30 +01:00
Piotr Sarna
3970808294 sstables: add is_staging() method
This method returns true if the last part of directory structure
is /staging.
2018-11-13 11:45:30 +01:00
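A minimal sketch of the check described in the commit above, written as a free function over a directory string. The real method is a member of the sstable class; the signature and the string-based path handling here are assumptions.

```cpp
#include <cassert>
#include <string>

// Returns true if the last path component of `dir` is exactly "staging".
// A trailing slash yields an empty last component and therefore false.
inline bool is_staging(const std::string& dir) {
    const auto pos = dir.find_last_of('/');
    const std::string last =
        (pos == std::string::npos) ? dir : dir.substr(pos + 1);
    return last == "staging";
}
```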
Paweł Dziepak
6469a1b451 Merge "Write static rows for all partitions if there are static columns" from Vladimir
"
It appears that when there are any static columns in the serialization header,
Cassandra would write a (possibly empty) static row to every partition
in the SSTables file.

This patchset aligns Scylla's logic with that of Cassandra.

Note that Cassandra optimizes the case when no partition contains a static
row because it keeps track of updated columns, which Scylla currently does
not do - see #3901 for details.

Fixes #3900.
"

* 'projects/sstables-30/write-all-static-rows/v1' of https://github.com/argenet/scylla:
  tests: Test writing empty static rows for partitions in tables with static columns.
  sstables: Ignore empty static rows on reading.
  sstables: Write empty static rows when there are static columns in the table.
2018-11-09 12:01:25 -08:00
Raphael S. Carvalho
1c5934c934 sstables: fix procedure to get fully expired sstables with MC format
MC format lacks ancestors metadata, so we need to work around it by using
ancestors in metadata collector, which is only available for a sstable
written during this instance. It works fine here because we only want
to know if a sstable recently compacted has an ancestor which wasn't
yet deleted.

Fixes #3852.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Reviewed-by: Vladimir Krivopalov <vladimir@scylladb.com>
Message-Id: <20181102154951.22950-1-raphaelsc@scylladb.com>
2018-11-06 09:28:37 +02:00
Vladimir Krivopalov
f767dfbb33 sstables: Ignore empty static rows on reading.
Fixes #3900.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-11-05 13:47:30 -08:00
Vladimir Krivopalov
89051d37e3 sstables: Write empty static rows when there are static columns in the table.
This is consistent with what Cassandra does.

Fixes #3900.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-11-05 13:28:50 -08:00
Avi Kivity
455f00e993 sstables: convert sprint() to format()
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().

Mechanically converted with https://github.com/avikivity/unsprint.
2018-11-01 13:16:17 +00:00
Avi Kivity
738e713edf sstables: fix bad format string syntax
Some sprint() calls use the fmt language instead of the printf syntax. Convert
them all the way to format().
2018-11-01 13:16:17 +00:00
Vladimir Krivopalov
6bd738ceb1 sstables: Check for complex deletion when writing static rows.
It is possible to have collections in a static row so we need to check
for collection-wide tombstones like with clustering rows.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-10-29 14:59:19 -07:00
Vladimir Krivopalov
6b7003088a sstables: Use std::reference_wrapper<> instead of a helper structure.
No need to store column_id separately as it can be accessed from the
column_definition.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-10-29 14:58:08 -07:00
Vladimir Krivopalov
8592b834d1 sstables: Partition static columns by atomicity when reading/writing SSTables 3.x.
Collections are permitted in static rows so same partitioning as for
regular columns is required.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-10-29 10:32:02 -07:00
Vladimir Krivopalov
7e56e9fca6 sstables: Re-order columns (atomic first, then collections) for SSTables 3.x.
In Cassandra, row columns are stored in a BTree that uses the following
ordering on them:
    - all atomic columns go first, then all multi-cell ones
    - columns of both types (atomic and multi-cell) are
      lexicographically ordered by name regarding each other

Since schema already has all columns lexicographically sorted by name,
we only need to stably partition them by atomicity for that.

Fixes #3853

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-10-26 15:58:33 -07:00
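The reordering described in the commit above maps directly onto `std::stable_partition`. In this sketch the `column` struct and `order_for_sstables_3x()` are hypothetical; only the technique (stable partition by atomicity over a name-sorted list) comes from the message.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical column descriptor.
struct column {
    std::string name;
    bool is_atomic; // false for multi-cell columns such as collections
};

// Expects `cols` to be sorted by name already. Moves atomic columns to the
// front while keeping the relative (name) order inside each group.
inline void order_for_sstables_3x(std::vector<column>& cols) {
    std::stable_partition(cols.begin(), cols.end(),
                          [](const column& c) { return c.is_atomic; });
}
```

A plain `std::partition` would not do here, since it gives no guarantee about the order within each group.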
Vladimir Krivopalov
210507b867 sstables: Use a compound structure for storing information used for reading columns.
This representation makes it easier to operate with compound structures
instead of separate values that were stored in multiple containers.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-10-26 11:32:44 -07:00
Vladimir Krivopalov
44043cfd44 sstables: Honour the column kind when writing missing columns in 'mc' format.
Previously, we've been writing the wrong missing columns indices for
static rows because write_missing_columns() explicitly used regular
columns internally.

Now, it takes the proper column kind into account.

Fixes #3892

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-10-25 17:09:09 -07:00
Benny Halevy
44e5c2643b compaction_manager::maybe_stop_on_error: add stop_iteration param
some call sites are stopping in any case, regardless of what
maybe_stop_on_error returns. Reflect that in the log messages.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20181017105758.9602-2-bhalevy@scylladb.com>
2018-10-24 18:39:52 +03:00
Avi Kivity
8210f4c982 Merge "Properly writing/reading shadowable deletions with SSTables 3.x." from Vladimir
"
This patchset addresses two problems with shadowable deletions handling
in SSTables 3.x. ('mc' format).

Firstly, we previously did not set a flag indicating the presence of
extended flags byte with HAS_SHADOWABLE_DELETION bitmask on writing.
This would break subsequent reads and cause all kinds of failures, up to
and including a crash.

Secondly, when reading rows with this extended flag set, we need to
preserve that information and create a shadowable_tombstone for the row.

Tests: unit {release}
+

Verified manually with 'hexdump' and using modified 'sstabledump' that
second (shadowable) tombstone is written for MV tables by Scylla.

+
DTest (materialized_views_test.py:TestMaterializedViews.hundred_mv_concurrent_test)
that originally failed due to this issue has successfully passed locally.
"

* 'projects/sstables-30/shadowable-deletion/v4' of https://github.com/argenet/scylla:
  tests: Add tests writing both regular and shadowable tombstones to SSTables 3.x.
  tests: Add test covering writing and reading a shadowable tombstone with SSTables 3.x.
  sstables: Support Scylla-specific extension for writing shadowable tombstones.
  sstables: Introduce a feature for shadowable tombstones in Scylla.db.
  memtable: Track regular and shadowable tombstones separately in encoding_stats_collector.
  sstables: Error out when reading SSTables 3.x with Cassandra shadowable deletion.
  sstables: Support checking row extension flags for Cassandra shadowable deletion.
2018-10-24 18:20:16 +03:00
Tomasz Grabiec
9e756d3863 sstable_mutation_reader: Do not read partition index when scanning
Even when we're using a full clustering range, need_skip() will return
true when we start a new partition and advance_context() will be
called with position_in_partition::before_all_clustered_rows(). We
should detect that there is no need to skip to that position before
the call to advance_to(*_current_partition_key), which will read the
index page.

Fixes #3868.

Message-Id: <1539881775-8578-1-git-send-email-tgrabiec@scylladb.com>
2018-10-24 15:55:13 +03:00
Paweł Dziepak
637b9a7b3b atomic_cell_or_collection: make operator<< show cell content
After the new in-memory representation of cells was introduced there was
a regression in atomic_cell_or_collection::operator<< which stopped
printing the content of the cell. This makes debugging more inconvenient
and time-consuming. This patch fixes the problem. Schema is propagated
to the atomic_cell_or_collection printer and the full content of the
cell is printed.

Fixes #3571.

Message-Id: <20181024095413.10736-1-pdziepak@scylladb.com>
2018-10-24 13:29:51 +03:00
Vladimir Krivopalov
759d36a26e sstables: Support Scylla-specific extension for writing shadowable tombstones.
The original SSTables 'mc' format, as defined in Cassandra, does not provide
a way to store shadowable deletion in addition to regular row deletion
for materialized views.
It is essential to store it because of known corner-case issues that
otherwise appear.

For this to work, we introduce a Scylla-specific extended flag to be set
in SSTables in 'mc' format that indicates a shadowable tombstone is
written after the regular row tombstone.

This is deemed to be safe because shadowable tombstones are specific to
materialized views and MV tables are not supposed to be imported or
exported.

Note that a shadowable tombstone can be written without a regular
tombstone as well as along with it.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-10-23 16:30:42 -07:00
Vladimir Krivopalov
e168433945 sstables: Introduce a feature for shadowable tombstones in Scylla.db.
This is used to indicate that the SSTables being read may contain a
Scylla-specific HAS_SCYLLA_SHADOWABLE_TOMBSTONE extended flag set.

If the feature is disabled, we should not honour this flag.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-10-23 16:30:42 -07:00
Vladimir Krivopalov
b7d48c1ccd sstables: Error out when reading SSTables 3.x with Cassandra shadowable deletion.
This flag can only be set in MV tables, which are not supposed to be
imported to Scylla.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-10-23 16:30:42 -07:00
Vladimir Krivopalov
8f79f76116 sstables: Support checking row extension flags for Cassandra shadowable deletion.
This flag can be only used in MV tables that are not supposed to be
imported to Scylla.
Since Scylla representation of shadowable tombstones differs from that
of Cassandra, such SSTables are rejected on read and Scylla never sets
this flag on writing.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-10-23 16:30:42 -07:00
Piotr Jastrzebski
cafb3dc2ae sstables 3: Correctly handle dropped columns in column_translation
Previously we were making assumptions about missing columns
(the size of its value, whether it's a collection or a counter) but
they didn't have to be always true. Now we're using column type
from serialization header to use the right values.

Fixes #3859

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-10-18 19:13:44 +02:00
Glauber Costa
7edae5421d sstables: print sstable path in case of an exception
Without that, we don't know where to look for the problems.

Before:

 compaction failed: sstables::malformed_sstable_exception (Too big ttl: 3163676957)

After:

 compaction_manager - compaction failed: sstables::malformed_sstable_exception (Too big ttl: 4294967295 in sstable /var/lib/scylla/data/system_traces/events-8826e8e9e16a372887533bc1fc713c25/mc-832-big-Data.db)

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20181016181004.17838-1-glauber@scylladb.com>
2018-10-16 20:31:20 +01:00