The name "column_family" is both awkward and obsolete. Rename to
the modern and accurate "table".
An alias is kept to avoid huge code churn.
To prevent a One Definition Rule violation, a preexisting "table"
type is moved to a new namespace row_cache_stress_test.
Tests: unit (release)
Message-Id: <20180624065238.26481-1-avi@scylladb.com>
This patchset brings support for writing range tombstones to SSTables
3.x ('mc' format).
In SSTables 3.x, range tombstones are represented by so-called range
tombstone markers (hereafter RT markers) that denote range tombstone
start and end bounds. So each range tombstone is represented in data
file by two ordered RT markers.
There are also markers that both close the previous range tombstone and
open the new one when two range tombstones are adjacent. This is
done to consume less disk space on such occasions.
Range tombstones written as RT markers are naturally non-overlapping.
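A minimal sketch of the marker layout described above, assuming the range
tombstones are already de-overlapped and ordered. The tuple layout and the
name to_rt_markers are hypothetical and only serve the illustration; real
bounds also carry inclusiveness, which is omitted here.

```python
from typing import List, Tuple

# Hypothetical model: (start, end, timestamp) covering [start, end).
RangeTombstone = Tuple[int, int, int]


def to_rt_markers(tombstones: List[RangeTombstone]) -> List[tuple]:
    """Turn ordered, non-overlapping range tombstones into RT markers.

    Adjacent tombstones share one 'boundary' marker that closes the
    previous tombstone and opens the next one.
    """
    markers = []
    count = len(tombstones)
    for i, (start, end, ts) in enumerate(tombstones):
        prev_adjacent = i > 0 and tombstones[i - 1][1] == start
        next_adjacent = i + 1 < count and tombstones[i + 1][0] == end
        if not prev_adjacent:
            # Otherwise the previous boundary marker already opened this one.
            markers.append(("start", start, ts))
        if next_adjacent:
            # One marker carries both the closing and the opening timestamp.
            markers.append(("boundary", end, ts, tombstones[i + 1][2]))
        else:
            markers.append(("end", end, ts))
    return markers
```

With three tombstones of which the first two are adjacent, this yields five
markers instead of six.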
* github.com:argenet/scylla projects/sstables-30/write-range-tombstones/v6
range_tombstone_stream: Remove an unused boolean flag.
Revert "Add missing enum values to bound_kind."
sstables: Move to_deletion_time helper up and make it static.
sstables: Write end-of-partition byte before flushing the last index
block.
sstables: Add support for writing range tombstones in SSTables 3.x
format.
tests: Add unit test covering simple range tombstone.
tests: Add unit test covering adjacent range tombstones.
tests: Add test to cover non-adjacent RTs.
tests: Add test covering mixed rows and range tombstones.
tests: Add test covering SSTables 3.x with many RTs.
tests: Add unit test covering overlapping RTs and rows.
tests: Add tests writing a range tombstone and a row overlapping with
its start.
tests: Add tests writing a range tombstone and a row overlapping with
its end.
tests: Add function that writes from multiple memtable into SSTables.
tests: Add test where 2nd range tombstone covers the remainder of the
1st one.
tests: Add test writing two non-adjacent range tombstones with same
clustering key prefix at their bounds.
tests: Add test covering overlapped range tombstones.
For SSTables 3.x ('mc' format), range tombstones are represented by
their bounds that are written to the data file as so-called RT markers.
For adjacent range tombstones, an RT marker can be of a 'boundary' type
which means it closes the previous range tombstone and opens the new
one.
Internally, sstable_writer_m relies on range_tombstone_stream to both
de-overlap incoming range tombstones and order them so that when they
are drained they can be easily thought of as just pairs of their bounds.
"
Make sure we properly handle row marker and row tombstone
when reading a row.
Tests: unit {release}
"
* 'haaawk/sstables3/read-liveness-info-v4' of ssh://github.com/scylladb/seastar-dev:
sstable: consume row marker in data_consume_rows_context_m
sstable: Add consumer_m::consume_row_marker_and_tombstone
sstable: add is_set and to_row_marker to liveness_info
* https://github.com/vladzcloudius/scylla.git tracing_prepared_parameters-v6:
cql3::query_options: add get_names() method
tracing::trace_state: hide the internals of params_values
tracing: store queries statements for BATCH
tracing: store the prepared statements parameters values
This is to stay compliant with the Origin for SSTables 3.x.
It differs from SSTables 2.x (ka/la) as for those the last promoted
index block is pushed first and the end-of-partition byte is written
after.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
If we fail to produce a SizeTiered compaction with the configured
min_threshold, we can try again to compact any two SSTables, unless there
is a global bypass telling us not to.
This will still privilege doing larger compactions in size buckets where
that is possible, but if we are idle we will try to compact any two.
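The selection order can be sketched roughly as follows. This is an
illustration only: the function name, the bucket representation and the
allow_two_fallback flag are hypothetical stand-ins, and the retry is modelled
as simply lowering the threshold to 2.

```python
from typing import List


def pick_stcs_candidates(buckets: List[List[str]],
                         min_threshold: int,
                         allow_two_fallback: bool = True) -> List[str]:
    """Pick SSTables to compact from size buckets.

    First try buckets that satisfy min_threshold; if none qualifies and
    the fallback is not bypassed, retry with a threshold of 2.
    """
    thresholds = [min_threshold]
    if allow_two_fallback and min_threshold > 2:
        thresholds.append(2)
    for threshold in thresholds:
        eligible = [b for b in buckets if len(b) >= threshold]
        if eligible:
            # Prefer the bucket with the most SSTables.
            return max(eligible, key=len)
    return []
```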
Signed-off-by: Glauber Costa <glauber@scylladb.com>
In SSTables 3, min timestamp and min deletion time in the serialization
header are not stored as-is; instead, the difference between each
value and the Cassandra "epoch" is stored.
This is supposed to make SSTables smaller. As a consequence, we have
to add the "epoch" after reading the values to obtain the actual
values of min timestamp and min deletion time.
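A small sketch of the delta encoding described above. The epoch constants
below are placeholders chosen for the example, not the values used by
Cassandra, and the function names are hypothetical.

```python
from typing import Tuple

# Placeholder epochs for illustration only (NOT Cassandra's real values).
TIMESTAMP_EPOCH = 1_000_000
DELETION_TIME_EPOCH = 1_000


def encode_header_stats(min_timestamp: int,
                        min_deletion_time: int) -> Tuple[int, int]:
    """What the writer stores: differences from the epoch."""
    return (min_timestamp - TIMESTAMP_EPOCH,
            min_deletion_time - DELETION_TIME_EPOCH)


def decode_header_stats(stored_timestamp: int,
                        stored_deletion_time: int) -> Tuple[int, int]:
    """What the reader must do: add the epoch back."""
    return (stored_timestamp + TIMESTAMP_EPOCH,
            stored_deletion_time + DELETION_TIME_EPOCH)
```

Values close to the epoch become small numbers, which take fewer bytes when
written as variable-length integers.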
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Store the prepared statement positional parameters values in the
corresponding system_traces.sessions entry in the 'parameters' column
(which has a map<text,text> type).
Parameters are stored as a pair of "param[X]" : "value", where X is
the index of the parameter starting from 0 and the "value" is the first
64 characters of the parameter's value string representation.
If parameters were given with their names attached (see the description
on bit 0x40 of QUERY flags in the CQL binary protocol specification) then
parameters are going to be stored in the "param[X](<bound variable name>)" : "value"
form.
If the value's string representation is longer than 64 characters then the
"value" will contain only the first 64 characters of it, followed by "..."
at the end.
For a BATCH of prepared statements the parameter "name" will have a form of
param[Y][X] where Y is the index of the corresponding prepared statement
in the BATCH and X is the index of the parameter. Both X and Y start from
0.
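The naming and truncation rules above can be sketched as follows. The helper
names are hypothetical, and combining a bound variable name with the BATCH
form is an assumption, since the description does not spell that case out.

```python
from typing import Dict, List, Optional

MAX_VALUE_LEN = 64  # from the description above


def format_value(value: str) -> str:
    """Keep at most the first 64 characters, marking truncation with '...'."""
    if len(value) <= MAX_VALUE_LEN:
        return value
    return value[:MAX_VALUE_LEN] + "..."


def param_key(index: int, name: Optional[str] = None,
              batch_index: Optional[int] = None) -> str:
    """Build 'param[X]', 'param[X](name)' or 'param[Y][X]'."""
    key = "param"
    if batch_index is not None:
        key += f"[{batch_index}]"
    key += f"[{index}]"
    if name:
        key += f"({name})"  # assumption: same suffix also inside a BATCH
    return key


def build_parameters(values: List[str],
                     names: Optional[List[str]] = None,
                     batch_index: Optional[int] = None) -> Dict[str, str]:
    """Produce the map<text,text> content for one prepared statement."""
    result = {}
    for i, value in enumerate(values):
        name = names[i] if names else None
        result[param_key(i, name, batch_index)] = format_value(value)
    return result
```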
Note:
Had to switch to boost::range::find() in sstables::big_sstable_set in order to
address the "ambiguous overload" compilation error.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
"
Implement and test support for reading collections in SSTables 3.
Tests: unit {release}
"
* 'haaawk/sstables3/read-collections-v1' of ssh://github.com/scylladb/seastar-dev:
sstables 3: Add tests for reading collections
flat_mutation_reader_assertions: add more flexible asserts
data_consume_rows_context_m: add support for collections
mp_row_consumer_m: Add support for collections
data_consume_rows_context_m: introduce cell_path
Use column_translation::*_is_collection in reading
column_translation: add *_column_is_collection()
column_flags_m: add HAS_COMPLEX_DELETION
Use read_unsigned_vint_length_bytes for COLUMN_VALUE
Use read_unsigned_vint_length_bytes for CK_BLOCKS
Implement read_unsigned_vint_length_bytes
"
This series is for nodetool getsstables.
This patch is based on:
8daaf9833a
With some minor adjustments because of the code change in sstables.
The idea is to allow searching for all the sstables that contain a
given key.
After this patch, if there is a table t1 in keyspace k1 and it has a key
called aa, then
curl -X GET "http://localhost:10000/column_family/sstables/by_key/k1%3At1?key=aa"
will return the list of sstable file names that contain that key.
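For reference, a sketch of how the request URL in the example is put
together: the keyspace and table are joined with ':' and percent-encoded
(hence k1%3At1). The helper name and the default base URL are taken from the
example or made up for illustration; no request is actually sent.

```python
from urllib.parse import quote, urlencode


def sstables_by_key_url(keyspace: str, table: str, key: str,
                        base: str = "http://localhost:10000") -> str:
    """Build the by_key URL for a given keyspace, table and key."""
    # safe="" makes quote() encode ':' as %3A as well.
    name = quote(f"{keyspace}:{table}", safe="")
    query = urlencode({"key": key})
    return f"{base}/column_family/sstables/by_key/{name}?{query}"
```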
"
* 'amnon/sstable_for_key_v4' of github.com:scylladb/seastar-dev:
Add the API implementation to get_sstables_by_key
api: column_family.json make the get_sstables_for_key doc clearer
column_family: Add the get_sstables_by_partition_key method
sstable test: add has_partition_key test
sstable: Add has_partition_key method
keys_test: add a test for nodetool_style string
keys: Add from_nodetool_style_string factory method
"
In preparation, we change LCS so that it tries harder to push data
to the last level, where the backlog is supposed to be zero.
The backlog is defined as:
backlog_of_stcs_in_l0 + Sum(L in levels) sizeof(L) * (max_levels - L) * fan_out
where:
* the fan_out is the amount of SSTables we usually compact with the
next level (usually 10).
* max_levels is the number of levels currently populated
* sizeof(L) is the total amount of data in a particular level.
Tests: unit (release)
"
* 'lcs-backlog-v2' of github.com:glommer/scylla:
LCS: implement backlog tracker for compaction controller
LCS: don't construct property in the body of constructor
LCS: try harder to move SSTables to highest levels.
leveled manifest: turn 10 into a constant
backlog: add level to write progress monitor
This is the last missing tracker among the major strategies. After
this, only DTCS is left.
To calculate the backlog, we will define the point of zero-backlog
as having all data in the last level. The backlog is then:
Sum(L in levels) sizeof(L) * (max_levels - L) * fan_out,
where:
* the fan_out is the amount of SSTables we usually compact with the
next level (usually 10).
* max_levels is the number of levels currently populated
* sizeof(L) is the total amount of data in a particular level.
Care is taken for the backlog not to jump when a new level has been just
recently created.
Aside from that, SSTables that accumulate in L0 can be subject to STCS.
We will then add a STCS backlog in those SSTables to represent that.
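A minimal sketch of the formula above. It assumes that max_levels is taken as
the index of the highest populated level, so that data already sitting in the
last level contributes zero, as the zero-backlog definition requires. The
function name is hypothetical, and the smoothing applied when a new level
appears is not modelled.

```python
from typing import List


def lcs_backlog(level_sizes: List[int], fan_out: int = 10,
                l0_stcs_backlog: int = 0) -> int:
    """level_sizes[L] is the total amount of data in level L."""
    populated = [lvl for lvl, size in enumerate(level_sizes) if size > 0]
    if not populated:
        return l0_stcs_backlog
    max_level = max(populated)  # assumption: index of highest populated level
    levels_part = sum(size * (max_level - lvl) * fan_out
                      for lvl, size in enumerate(level_sizes))
    return l0_stcs_backlog + levels_part
```

For example, with 10 units in L0, 20 in L1 and 100 in L2, the levels part is
10 * 2 * 10 + 20 * 1 * 10 = 400.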
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Right now we are constructing the _max_sstable_size_in_mb property in
the body of the constructor, which makes it hard for us to use from
other properties.
We are doing that because we'd like to test for bounds of that value. So
a cleaner way is to have a helper function for that.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Our current implementation of LCS can end up with situations in which
just a bit of data is in the highest levels, with the majority in the
lowest levels. That happens because we will only promote things to
highest levels if the amount of data in the current level is higher than
the maximum.
This is a pre-existing problem in itself, but became even clearer when
we started trying to define what is the backlog for LCS.
We have discussed ways to fix this by redefining the criteria on when
to move data to the next levels. That would require us to change the way
things are today considerably, allowing parallel compactions, etc. There
is significant risk that we'll increase write amplification and we would
need to carefully validate that.
For now I will propose a simpler change, that essentially solves the
"inverted pyramid" problem of current LCS without major disruption:
keep selecting compaction candidates with the same criteria that we use
today, which should help make sure we are not compacting high levels for
no reason; but if there is nothing to do, use the idle time to push data
to higher levels. As an added benefit, old data that is in the higher
levels can also be compacted away faster.
With this patch we see that in an idle, post-load system all data is
eventually pushed to the last level. Systems under constant writes keep
behaving the same way they did before.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We increase levels in powers of 10 but that is a parameter
of the algorithm. At least make it into a constant so that we can
reuse it somewhere else.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
"
SSTables 3.x format ('m') stores the size of the previous row or RT marker
inside each row/marker. That potentially allows traversing rows/markers
in reverse order.
The previous code calculating those sizes produced invalid
values for all rows except the first one. The bug was hard to detect
because neither Cassandra itself nor the sstabledump tool uses
those values; they are simply skipped on reading.
From the UnfilteredSerializer.deserializeRowBody() method
(https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/db/rows/UnfilteredSerializer.java#L562):
    if (header.isForSSTable())
    {
        in.readUnsignedVInt(); // Skip row size
        in.readUnsignedVInt(); // previous unfiltered size
    }
So while the previous test files were technically correct in that they
contained valid data readable by Cassandra/sstabledump, they didn't
follow the format specification.
This patchset fixes the code to produce correct values and replaces
incorrect data files with correct ones. The newly generated data files
have been validated to be identical to files generated with Cassandra
using same data and timestamps as unit tests.
Tests: Unit {release}
"
* 'projects/sstables-30/fix-prev-row_size/v1' of https://github.com/argenet/scylla:
tests: Fix test files to use correct previous row sizes.
sstables: Fix calculation of previous row size for SSTables 3.x
sstables: Factor out code building promoted index blocks into separate helpers.
"
This patchset contains two fixes to the clustering key prefixes
serialization logic for SSTables 3.x.
First, it fixes a vexing typo: a bitwise-and (&) has been used instead
of a remainder operator (%) for truncating the shift value.
This did not show up in existing tests because they all had non-empty
clustering columns values.
Added tests to cover empty clustering columns values.
Second, it fixes the logic of serialization to write values up to the
prefix length, not the length of the clustering key as defined by
schema. This matches the way it is done by the Origin.
There is, however, a special case where the prefix size is smaller than
that of a clustering key but we still need to serialize up to the full
size. This is the case when a compact table is being used and some
rows in it are added using incomplete clustering keys (containing null
for trailing columns).
In Cassandra, these prefixes still have a full length and missing
columns are just set to 'null'. In our code those prefixes have their
real length, but since we need to serialize beyond it, we pass a flag to
indicate this.
"
* 'projects/sstables-30/fix-clustering-blocks/v1' of https://github.com/argenet/scylla:
tests: Add test covering compact table with non-full clustering key.
sstables: Improve clustering blocks writing, use logical clustering prefix size.
tests: Add test covering large clustering keys (>32 columns) for SSTables 3.x
tests: Add unit test covering empty values in clustering key.
sstables: Fix typo in clustering blocks write helper.
"
Add handling for missing columns and tests for it.
There are 3 cases:
1. Number of columns in a table is smaller than 64
2. Number of columns in a table is 64 or greater
2a. and less than half of all possible columns are present in sstable
2b. and at least half of all possible columns are present in sstable
Case 1 is implemented using a bit mask; a column is present if mask & (1 << <column number>) == 0
Case 2a is implemented by storing a list of column numbers for each present column
Case 2b is implemented by storing a list of column numbers for each absent column
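The decoding side of these cases can be sketched like this. The function
names and the lists_present flag are hypothetical; in particular, how the
reader learns which of the two large-subset encodings was used is not covered
here.

```python
from typing import Iterable, List


def present_columns_small(mask: int, column_count: int) -> List[int]:
    """Fewer than 64 columns: a set bit marks a MISSING column."""
    if column_count >= 64:
        raise ValueError("bit mask encoding only covers fewer than 64 columns")
    return [i for i in range(column_count) if (mask & (1 << i)) == 0]


def present_columns_large(listed: Iterable[int], column_count: int,
                          lists_present: bool) -> List[int]:
    """64 or more columns: the stored list names either the present
    columns (sparse subset) or the absent ones (dense subset)."""
    listed_set = set(listed)
    if lists_present:
        return sorted(listed_set)
    return [i for i in range(column_count) if i not in listed_set]
```

Listing whichever side is shorter keeps the encoded list at no more than half
the number of columns.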
"
* 'haaawk/sstables3/read-missing-columns-v3' of ssh://github.com/scylladb/seastar-dev:
sstables 3: add test for reading big dense subset of columns
sstables 3: support reading big dense subsets of columns
sstables 3: add test for reading big sparse subset of columns
sstables 3: support reading big sparse subsets of columns
sstables 3: add test for reading small subset of columns
sstables 3: support reading small subsets of columns
A small subset contains no more than 63 elements.
Support for large subsets will come in the following
patches.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
For SSTables being written, we don't know their level yet. Add that
information to the write monitor. New SSTables will always be at L0.
Compacted SSTables will have their level determined by the compaction
process.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
In the Origin, the size of the clustering key prefix used during
serialization is the actual length of the prefix and not the full size
as defined in schema. So the code is fixed to align with that logic.
This, in particular, is needed to write clustering blocks for RT
markers.
There is, however, a special case where the prefix size is smaller than
that of a clustering key but we still need to serialize up to the full
size. This is the case when a compact table is being used and some
rows in it are added using incomplete clustering keys (containing null
for trailing columns).
In Cassandra, these prefixes still have a full length and missing
columns are just set to 'null'. In our code those prefixes have their
real length, but since we need to serialize beyond it, we pass a flag to
indicate this.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
What was supposed to be a remainder operation turned out to be a
bitwise 'and'. This didn't show up in existing tests only because they
all had non-empty clustering values.
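To illustrate why the typo stayed hidden, here is a simplified model, not the
actual helper: it assumes a header with two bits per clustering column, 32
columns per block, and a bit that is set only for empty values. All names are
hypothetical.

```python
from typing import Callable, List

COLUMNS_PER_BLOCK = 32  # assumption for this sketch


def shift_buggy(column_index: int) -> int:
    # '&' instead of '%': yields 0 for every index below 32.
    return (column_index & COLUMNS_PER_BLOCK) * 2


def shift_fixed(column_index: int) -> int:
    return (column_index % COLUMNS_PER_BLOCK) * 2


def make_header(values: List[bytes],
                shift_fn: Callable[[int], int]) -> int:
    """Set one header bit per empty value, at the column's shift."""
    header = 0
    for i, value in enumerate(values):
        if value == b"":
            header |= 1 << shift_fn(i)
    return header
```

With only non-empty values no bit is ever set, so both variants produce the
same header; the results diverge as soon as a column other than the first one
is empty.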
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>