
137 Commits

Author SHA1 Message Date
Benny Halevy
1ccd72f115 sstables: mc: use int64_t for local_deletion_time and ttl
In preparation for changing gc_clock::duration::rep to int64_t.

Refs #3353

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2019-01-22 15:34:32 +02:00
Benny Halevy
156f9ffa11 sstables: add capped_local_deletion_time stats counter
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2019-01-22 15:34:32 +02:00
Benny Halevy
7609a04565 sstables: mc: metadata collector: cap local_deletion_time at max
max local_deletion_time_tracker in stats is int32_t so just track the limit
of (max int32_t - 1) if time_point is greater than the limit.
This corresponds to Cassandra's MAX_DELETION_TIME.
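
The capping rule described above can be sketched as follows. This is a minimal illustration, not the actual metadata collector code: the names `max_tracked_deletion_time` and `cap_local_deletion_time` are hypothetical, and the input is assumed to be a 64-bit count of seconds since the Unix epoch.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical limit: (max int32_t - 1), as described in the commit message.
constexpr int64_t max_tracked_deletion_time =
    std::numeric_limits<int32_t>::max() - 1;

// Hypothetical helper: clamps a 64-bit local deletion time so that it fits
// the int32_t statistics field. Only the upper bound is handled here; values
// below the int32_t range are assumed not to occur.
constexpr int32_t cap_local_deletion_time(int64_t seconds) {
    if (seconds > max_tracked_deletion_time) {
        return static_cast<int32_t>(max_tracked_deletion_time);
    }
    return static_cast<int32_t>(seconds);
}
```

With this sketch, any time point past the limit is tracked as 2147483646, while smaller values pass through unchanged.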

Refs #3353

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2019-01-22 15:34:32 +02:00
Benny Halevy
bd6861989d sstables: mc: use proper gc_clock types for local_deletion_time and ttl
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2019-01-22 15:34:32 +02:00
Benny Halevy
6465a673f5 sstables: mc: define expired_liveness_ttl as signed int32_t
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2019-01-22 13:36:35 +02:00
Benny Halevy
c4c2133e3e sstables: mc: change write_delta_deletion_time to receive tombstone rather than deletion_time
mc format only writes delta local_deletion_time of tombstones.
Conventional deletion_time is written only for the partition header.

Restructure the code to pass a tombstone to write_delta_deletion_time
rather than struct deletion_time to prepare for using 64-bit deletion times.

The tombstone uses gc_clock::time_point while struct
deletion_time is limited to int32_t local_deletion_time.

Note that for "live" tombstones we encode <api::missing_timestamp,
no_deletion_time> as was previously evaluated by to_deletion_time().

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2019-01-22 13:36:35 +02:00
Benny Halevy
844a2de263 sstables: mc: prevent signed integer overflow
Fix runtime error: signed integer overflow
introduced by 2dc3776407

Delta-encoded values may wrap around if the encoded value is
less than the base value.  This could happen in two places:
In the mc-format serialization header itself, where the base values are implicit
Cassandra epoch time, and in the sstables data files, where the base values
are taken from the encoding_stats (later written to the serialization_header).

In these cases, when the calculation is done using signed integer/long we may see
"runtime error: signed integer overflow" messages in debug mode
(with -fsanitize=undefined / -fsanitize=signed-integer-overflow).

Overflow here is expected and harmless since we guarantee neither that
the base values in the serialization header are greater than or equal
to Cassandra's epoch nor that the delta-encoded values are always
greater than or equal to the respective base values in
the serialization header.

To prevent these warnings, the subtraction/addition should be done with unsigned
(two's complement) arithmetic and the result converted to the signed type.

Note that to keep the code simple where possible, we also rely on the implicit
conversion of signed integers to unsigned when one of the added values is unsigned
and the other is signed.
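
A minimal sketch of the technique described above, assuming 64-bit values: the subtraction and addition are carried out on unsigned operands, where wrap-around is well defined, and only the final result is converted back to the signed type. The helper names `encode_delta` and `decode_delta` are hypothetical and do not come from the patch.

```cpp
#include <cstdint>

// Hypothetical helper: computes value - base in unsigned arithmetic, so a
// value smaller than the base wraps around instead of overflowing.
constexpr uint64_t encode_delta(int64_t value, int64_t base) {
    return static_cast<uint64_t>(value) - static_cast<uint64_t>(base);
}

// Hypothetical helper: adds the delta back to the base in unsigned
// arithmetic and converts the result to the signed type. The final
// conversion is modular on two's complement targets (guaranteed since C++20).
constexpr int64_t decode_delta(uint64_t delta, int64_t base) {
    return static_cast<int64_t>(static_cast<uint64_t>(base) + delta);
}
```

For example, `encode_delta(5, 10)` wraps to a large unsigned value, and `decode_delta` applied to that value with the same base yields 5 again, without any signed overflow along the way.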

Fixes: #4098

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190120142950.15776-1-bhalevy@scylladb.com>
2019-01-20 16:59:46 +02:00
Benny Halevy
2dc3776407 sstables: mc: sign-extend serialization_header min_local_deletion_time_base and min_ttl_base
Refs #4074
Refs #3353

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190110141439.1324-1-bhalevy@scylladb.com>
2019-01-10 16:23:20 +02:00
Benny Halevy
60323b79d1 sstables: mc: sign-extend delta local_deletion_time and delta ttl
Follow Cassandra's encoding so that values that are less than the
baseline encoding_stats will wrap around in 64 bits rather than 32.
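
The difference between the two wrap-around widths can be sketched as below. Both function names are hypothetical, and the 32-bit variant is shown only for contrast; it is an assumption about what a 32-bit wrap would look like, not a copy of the earlier code.

```cpp
#include <cstdint>

// For contrast only: the subtraction wraps in 32 bits and the result is then
// zero-extended to 64 bits.
constexpr uint64_t delta_wrapping_32(int32_t value, int32_t base) {
    return static_cast<uint32_t>(value) - static_cast<uint32_t>(base);
}

// Sign-extend both operands to 64 bits first, then subtract in unsigned
// arithmetic, so a value below the base wraps around in 64 bits.
constexpr uint64_t delta_wrapping_64(int32_t value, int32_t base) {
    return static_cast<uint64_t>(static_cast<int64_t>(value))
         - static_cast<uint64_t>(static_cast<int64_t>(base));
}
```

With a value of 5 and a base of 10, the first variant produces 0xFFFFFFFB and the second 0xFFFFFFFFFFFFFFFB; the two encodings agree whenever the value is not below the base.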

Fixes #4074
Refs #3353

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190109192703.18371-1-bhalevy@scylladb.com>
2019-01-09 21:43:30 +02:00
Rafael Ávila de Espíndola
26ac2c23ef Change *_row_* names that refer to partitions
This renames some variables and functions to make it clear that they
refer to partitions and not rows.

Old versions of sstablemetadata used to refer to a row histogram, but
current versions now mention a partition histogram instead.

This patch doesn't change the exposed API names.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20181229223311.4184-2-espindola@scylladb.com>
2019-01-09 14:53:42 +02:00
Duarte Nunes
fa2b0384d2 Replace std::experimental types with C++17 std version.
Replace stdx::optional and stdx::string_view with the C++ std
counterparts.

Some instances of boost::variant were also replaced with std::variant,
namely those that called seastar::visit.

Scylla now requires GCC 8 to compile.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20190108111141.5369-1-duarte@scylladb.com>
2019-01-08 13:16:36 +02:00
Benny Halevy
40410465d7 sstables: mc: expired_liveness_ttl should be max int32_t rather than max uint32_t
Corresponding to Cassandra's EXPIRED_LIVENESS_TTL = Integer.MAX_VALUE;

Fixes #4060

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20190107172457.20430-1-bhalevy@scylladb.com>
2019-01-07 18:41:37 +01:00
Tomasz Grabiec
a4721b4d50 sstables: types: Extract sstable_enabled_features::all()
2018-12-12 12:06:45 +01:00
Tomasz Grabiec
fad4fba4bc sstables: Templetize write() functions on the writer
Will allow writing to both a file_writer, or an in-memory writer like
a bytes_ostream.
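
A minimal sketch of the idea, assuming each writer exposes a `write(const char*, size_t)` member: the serialization function is a template over the writer type, so the same code can target a file-backed or an in-memory sink. The two writer types below are hypothetical stand-ins, not the real file_writer or bytes_ostream.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical in-memory writer backed by a std::string.
struct string_writer {
    std::string out;
    void write(const char* data, std::size_t n) { out.append(data, n); }
};

// Hypothetical in-memory writer backed by a std::vector<char>.
struct vector_writer {
    std::vector<char> out;
    void write(const char* data, std::size_t n) { out.insert(out.end(), data, data + n); }
};

// Writes a 32-bit value in big-endian byte order to any Writer that
// provides write(const char*, size_t).
template <typename Writer>
void write_be32(Writer& w, uint32_t v) {
    const char buf[4] = {
        static_cast<char>(v >> 24), static_cast<char>(v >> 16),
        static_cast<char>(v >> 8),  static_cast<char>(v),
    };
    w.write(buf, sizeof(buf));
}
```

Because the writer is a template parameter rather than a virtual interface, the call is resolved at compile time for each sink type.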
2018-12-10 20:08:16 +01:00
Tomasz Grabiec
aa19f98d18 sstables: Write Statistics.db offset map entries in the same order as Cassandra
Before this patch we were writing offset map entries in unspecified
order, the one returned by std::unordered_map. Cassandra writes them
sorted by metadata_type. Use the same order for improved
compatibility.
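
One way to obtain a deterministic order from an unordered container is sketched below. The enum is a hypothetical stand-in for metadata_type (its enumerator names and numeric values are illustrative), and `sorted_offset_entries` is not a function from the patch.

```cpp
#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for the metadata_type enum; values are illustrative.
enum class metadata_type : uint32_t {
    Validation = 0,
    Compaction = 1,
    Stats = 2,
    Serialization = 3,
};

// Copies the (type, offset) pairs out of the unordered map and sorts them by
// metadata_type, so they can be written in a stable, well-defined order.
std::vector<std::pair<metadata_type, uint32_t>>
sorted_offset_entries(const std::unordered_map<metadata_type, uint32_t>& offsets) {
    std::vector<std::pair<metadata_type, uint32_t>> entries(offsets.begin(), offsets.end());
    std::sort(entries.begin(), entries.end(),
              [](const auto& a, const auto& b) { return a.first < b.first; });
    return entries;
}
```

Keeping the entries in an ordered container such as std::map would achieve the same result; sorting at write time leaves the in-memory representation unchanged.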

Fixes #3955.

Message-Id: <1543846649-22861-1-git-send-email-tgrabiec@scylladb.com>
2018-12-03 16:40:24 +02:00
Raphael S. Carvalho
a66b1954cc sstables: use a random uuid for sstables without run identifier
Older sstables must have an identifier for them to be associated
with their own run.

Reviewed-by: Nadav Har'El <nyh@scylladb.com>

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2018-11-24 18:53:01 -02:00
Raphael S. Carvalho
62025fa52c sstables: add run identifier to scylla metadata
It identifies a run which a particular sstable belongs to.
Existing sstables will have a random uuid associated with them
in memory.

UUID is the correct choice because it allows sstables to be
exported without having conflicts when using identifier generated
by different nodes.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2018-11-24 18:52:44 -02:00
Raphael S. Carvalho
d29482dce8 sstables: deprecate sstable metadata's ancestors
The reason for that is that it's not available in sstable format mc,
so we can no longer rely on it in common code for the currently
supported formats.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20181121170057.20900-1-raphaelsc@scylladb.com>
2018-11-23 19:38:32 +01:00
Avi Kivity
775b7e41f4 Update seastar submodule
* seastar d59fcef...b924495 (2):
  > build: Fix protobuf generation rules
  > Merge "Restructure files" from Jesse

Includes fixup patch from Jesse:

"
Update Seastar `#include`s to reflect restructure

All Seastar header files are now prefixed with "seastar" and the
configure script reflects the new locations of files.

Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <5d22d964a7735696fb6bb7606ed88f35dde31413.1542731639.git.jhaberku@scylladb.com>
"
2018-11-21 00:01:44 +02:00
Vladimir Krivopalov
759d36a26e sstables: Support Scylla-specific extension for writing shadowable tombstones.
The original SSTables 'mc' format, as defined in Cassandra, does not provide
a way to store shadowable deletion in addition to regular row deletion
for materialized views.
It is essential to store it because of known corner-case issues that
otherwise appear.

For this to work, we introduce a Scylla-specific extended flag to be set
in SSTables in 'mc' format that indicates a shadowable tombstone is
written after the regular row tombstone.

This is deemed to be safe because shadowable tombstones are specific to
materialized views and MV tables are not supposed to be imported or
exported.

Note that a shadowable tombstone can be written without a regular
tombstone as well as along with it.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-10-23 16:30:42 -07:00
Vladimir Krivopalov
e168433945 sstables: Introduce a feature for shadowable tombstones in Scylla.db.
This is used to indicate that the SSTables being read may contain a
Scylla-specific HAS_SCYLLA_SHADOWABLE_TOMBSTONE extended flag set.

If the feature is not enabled, we should not honour this flag.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-10-23 16:30:42 -07:00
Vladimir Krivopalov
8f79f76116 sstables: Support checking row extension flags for Cassandra shadowable deletion.
This flag can be only used in MV tables that are not supposed to be
imported to Scylla.
Since Scylla representation of shadowable tombstones differs from that
of Cassandra, such SSTables are rejected on read and Scylla never sets
this flag on writing.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-10-23 16:30:42 -07:00
Vladimir Krivopalov
e71cc5ab20 sstables: Introduce TTL limitation and special 'expired TTL' value.
This allows storing expired liveness info in SSTables 3.x format
without introducing a possible conflict with real TTL values.

As per Cassandra, TTL cannot exceed 20 years so taking the maximum value
as a special value for indicating expired liveness info is safe.
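
The reasoning above can be checked with a short sketch. The constant and function names are hypothetical, and the 20-year limit is assumed to be counted in 365-day years expressed in seconds.

```cpp
#include <cstdint>
#include <limits>

// Assumed limit: 20 years of 365 days, in seconds (630720000).
constexpr int32_t max_ttl_seconds = 20 * 365 * 24 * 60 * 60;

// Special marker value: the maximum int32_t.
constexpr int32_t expired_liveness_ttl = std::numeric_limits<int32_t>::max();

// The marker can never collide with a TTL that passes validation.
static_assert(max_ttl_seconds < expired_liveness_ttl,
              "special value must lie outside the valid TTL range");

// Hypothetical helpers for classifying a TTL read from an sstable.
constexpr bool is_valid_user_ttl(int32_t ttl) {
    return ttl >= 0 && ttl <= max_ttl_seconds;
}

constexpr bool is_expired_liveness_ttl(int32_t ttl) {
    return ttl == expired_liveness_ttl;
}
```

The static_assert documents the safety argument: the largest real TTL is well below the marker value, so the two cannot be confused.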

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-10-10 11:44:14 -07:00
Vladimir Krivopalov
bdca27ae41 sstables: Always store only min bases in serialization_header.
There previously was an inconsistency in treating min values stored in a
serialization_header. They are written to or read from a Statistics.db
as deltas against fixed bases, but when we parse timeouts from the data
file, we need the full bases, not just deltas.

This inconsistency causes wrong timestamp values if we write an sstable
and then read from it using one and the same sstable object because we
turn min values into bases on write and then don't adjust them back
because we already have them in memory.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-25 17:23:40 -07:00
Vladimir Krivopalov
48fa088ec6 sstables: Do not parse ancestors from compaction metadata for SSTables 3.x
Ancestors array has been removed starting from 'ma' format
(CASSANDRA-7066).

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-09-19 17:11:43 -07:00
Vladimir Krivopalov
4bf1e9de3f sstables: Support resetting data_consume_rows_context_m to indexable_element::cell.
Set the proper parsing state when resetting to indexable_element::cell.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-08-17 10:09:19 -07:00
Vladimir Krivopalov
a497edcbda sstables: Move promoted_index_block from types.hh to index_entry.hh.
It is only being used by index_reader internally and never exposed so
should not be listed in commonly used types.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-06-28 12:28:59 -07:00
Piotr Jastrzebski
a3683d6e0f sstables 3: add serialization_header::adjust
In SSTables 3, min timestamp and min deletion time in serialization
header are not stored normally but instead the difference between
their value and the cassandra "epoch" is stored.
This is supposed to make SSTables smaller. As a consequence, we have
to add the "epoch" after reading the values to obtain the actual
values of min timestamp and min deletion time.
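
A minimal sketch of such an adjustment follows. The struct, the function, and the epoch constants are assumptions made for illustration: the epoch is taken to be 2015-09-22 00:00:00 UTC, in microseconds for timestamps and in seconds for deletion times, and should be checked against the actual format definition.

```cpp
#include <cstdint>

// Assumed epoch constants (2015-09-22 00:00:00 UTC); verify before relying on them.
constexpr int64_t timestamp_epoch_us = 1442880000000000LL;
constexpr int64_t deletion_time_epoch_s = 1442880000LL;

// Hypothetical subset of the serialization header fields.
struct header_mins {
    int64_t min_timestamp;
    int64_t min_local_deletion_time;
};

// Turns the stored differences into actual values by adding the epoch back.
// The addition is done on unsigned operands so that wrap-around is defined.
constexpr header_mins adjust(header_mins stored) {
    return {
        static_cast<int64_t>(static_cast<uint64_t>(stored.min_timestamp)
                             + static_cast<uint64_t>(timestamp_epoch_us)),
        static_cast<int64_t>(static_cast<uint64_t>(stored.min_local_deletion_time)
                             + static_cast<uint64_t>(deletion_time_epoch_s)),
    };
}
```

A stored difference of zero therefore maps to the epoch itself, and a negative difference maps to a point before the epoch.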

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-15 09:10:48 +02:00
Piotr Jastrzebski
2b8ff15f9f column_flags_m: add HAS_COMPLEX_DELETION
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-06-07 22:47:19 +02:00
Piotr Jastrzebski
f6e1c38486 Introduce column_flags_m
This will be used for reading columns from data file.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 19:54:16 +02:00
Piotr Jastrzebski
d8cd8e04ed Add unfiltered_flags_m::has_all_columns
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 16:39:52 +02:00
Piotr Jastrzebski
b849eefc8c Use disk_string_vint_size for bytes_array_vint_size
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 16:39:52 +02:00
Piotr Jastrzebski
5ca4bfd69a disk_array_vint_size: Remove unused Size template parameter
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-05-23 16:15:44 +02:00
Vladimir Krivopalov
56ac941a2e Fix the order of items in stats_metadata.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-04 15:45:10 -07:00
Vladimir Krivopalov
5db6002720 Write serialization header to Statistics.db for SSTables 3.x.
Serialization header is a new component in Statistics.db introduced in
SSTables 3.0 ('ma') format. It is essential for reading data file as it
contains the base values used for delta-encoded values (timestamps,
TTLs, local deletion times) and description of column types.

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-04 15:43:17 -07:00
Vladimir Krivopalov
3e471116b4 Separate statistics for count of cells, columns and rows in column_stats.
SSTables 3.0 format makes a distinction between count of cells and count
of columns. In that sense, a column of a collection type counts as one
column but every atomic cell in it counts as a separate cell.
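
The distinction can be illustrated with a small sketch. The types below are hypothetical and much simpler than the real column_stats: each column only records how many atomic cells it holds, which is 1 for a regular column and one per element for a collection.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of a column within a row.
struct column {
    std::size_t cell_count;  // 1 for an atomic column, N for a collection of N elements
};

// Hypothetical counters kept separately, as the message describes.
struct row_counts {
    uint64_t columns = 0;
    uint64_t cells = 0;
};

// Each column counts once; each atomic cell inside it counts separately.
row_counts count_row(const std::vector<column>& row) {
    row_counts counts;
    for (const column& c : row) {
        counts.columns += 1;
        counts.cells += c.cell_count;
    }
    return counts;
}
```

For a row with two atomic columns and one collection of three elements, this yields three columns and five cells.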

Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
2018-05-03 17:05:06 -07:00
Piotr Jastrzebski
2ee3d8b87b Introduce consumer_m and data_consume_rows_context_m
Those classes can handle SSTables in MC format.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-26 12:49:38 +02:00
Piotr Jastrzebski
df457166b0 Add support for 3_x stats metadata
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-24 11:30:26 +02:00
Piotr Jastrzebski
e1e23ec555 Pass sstable version to describe_type
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-24 11:30:26 +02:00
Piotr Jastrzebski
1cc1f9af5f Pass sstable version to write methods
This will allow writing different versions differently

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-24 11:30:26 +02:00
Piotr Jastrzebski
08da518dae metadata_type: add Serialization type
Ignore it while reading sstable 3_x and throw
if it's present when reading 2_x.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-24 11:30:26 +02:00
Avi Kivity
28be4ff5da Revert "Merge "Implement loading sstables in 3.x format" from Piotr"
This reverts commit 513479f624, reversing
changes made to 01c36556bf. It breaks
booting.

Fixes #3376.
2018-04-23 06:47:00 +03:00
Piotr Jastrzebski
b683870644 Add support for 3_x stats metadata
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-22 15:06:51 +02:00
Piotr Jastrzebski
26ab3056ae Pass sstable version to describe_type
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-22 14:41:11 +02:00
Piotr Jastrzebski
0022c309ee Pass sstable version to write methods
This will allow writing different versions differently

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-22 14:41:10 +02:00
Piotr Jastrzebski
65fe564cd2 metadata_type: add Serialization type
Ignore it while reading sstable 3_x and throw
if it's present when reading 2_x.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2018-04-22 14:40:04 +02:00
Glauber Costa
b2f9958071 large_bitset: use a chunked_vector internally and simplify API
save and load functions for the large_bitset were introduced by Avi with
d590e327c0.

In that commit, Avi says:

"... providing iterator-based load() and save() methods.  The methods
support partial load/save so that access to very large bitmaps can be
split over multiple tasks."

The only user of this interface is SSTables. And it turns out we don't really
split the access like that. What we do instead is to create a chunked vector
and then pass its begin() method with position = 0 and let it write everything.

The problem here is that this requires the chunked vector to be fully
initialized, not just reserved. If the bitmap is large enough, that in itself
can take a long time without yielding (up to 16ms seen in my setup).

We can simplify things considerably by moving the large_bitset to use a
chunked vector internally: it already uses a poor man's version of it
by allocating chunks internally (it predates the chunked_vector).

By doing that, we can turn save() into a simple copy operation, and do
away with load altogether by adding a new constructor that will just
copy an existing chunked_vector.

Fixes #3341
Tests: unit (release)

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180409234726.28219-1-glauber@scylladb.com>
2018-04-10 10:25:06 +03:00
Glauber Costa
f5c32423b8 summary: don't go through all entries when computing memory size.
Summary has a function, memory_size(), that estimates the amount of
memory the summary takes. It is my understanding that this is called
to serve information to tooling.

First, this function is inaccurate because it doesn't take into account
the tokens per each entry, just the keys. But more importantly, it has
to iterate over all keys which can be pretty expensive if the entries
list is long. We are now keeping that in a memory area, with just
pointers in the entry. So instead of iterating through the entries, we
can iterate through the memory areas, which is much cheaper.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20180316120915.16809-1-glauber@scylladb.com>
2018-03-16 12:57:19 +00:00
Glauber Costa
e680c7c8cc abstract summary entry version of the token with a token view
dht::token doesn't have a trivial destructor, so destroying an array
full of those can be quite expensive. If we use the same trick as we
used for the summary - storing the token data in a stable memory
location - we can leave the entries with a trivial destructor and destroy
the chunks themselves. Those being larger, they will be more efficient
to delete.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-03-15 12:24:15 -04:00
Glauber Costa
091b0f9d41 summary_entry: do not store key bytes in each summary entry
If we store a bytes_view instead of bytes, that has a trivial destructor
and then we don't need to destroy each element individually. To do that,
we allocate the data in a couple of large arrays which can be disposed of
easily and point to it.
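
The approach can be sketched as follows, using std::string_view in place of bytes_view. The `key_arena` class, its chunk size, and the `summary_entry` layout are hypothetical; the point is only that entries hold views into a few large buffers and are themselves trivially destructible.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>
#include <string_view>
#include <type_traits>
#include <vector>

// Hypothetical entry: holds a view of the key bytes instead of owning them.
struct summary_entry {
    std::string_view key;
    uint64_t position;
};
static_assert(std::is_trivially_destructible<summary_entry>::value,
              "entries can be dropped without running per-element destructors");

// Hypothetical arena: key bytes live in a few large chunks whose addresses
// stay stable, so views handed out earlier remain valid as more keys arrive.
class key_arena {
    static constexpr std::size_t chunk_size = 4096;
    std::vector<std::unique_ptr<char[]>> _chunks;
    std::size_t _capacity = 0;  // size of the current (last) chunk
    std::size_t _used = 0;      // bytes consumed in the current chunk
public:
    std::string_view store(std::string_view key) {
        if (_chunks.empty() || key.size() > _capacity - _used) {
            // Start a new chunk; oversized keys get a chunk of their own size.
            _capacity = std::max(chunk_size, key.size());
            _chunks.push_back(std::make_unique<char[]>(_capacity));
            _used = 0;
        }
        char* dst = _chunks.back().get() + _used;
        std::memcpy(dst, key.data(), key.size());
        _used += key.size();
        return std::string_view(dst, key.size());
    }
    std::size_t chunk_count() const { return _chunks.size(); }
};
```

Destroying the arena frees a handful of chunks rather than one allocation per key, which is the saving the message describes.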

We still can't destroy trivially because of the token.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2018-03-14 10:46:20 -04:00