"
The IndexInfo table tracks the secondary indexes that have already
been populated. Since our secondary index implementation is backed by
materialized views, we can virtualize that table so queries are
actually answered by built_views.
Fixes #3483
"
* 'built-indexes-virtual-reader/v2' of github.com:duarten/scylla:
tests/virtual_reader_test: Add test for built indexes virtual reader
db/system_keyspace: Add virtual reader for IndexInfo table
db/system_keyspace: Explain that table_name is the keyspace in IndexInfo
index/secondary_index_manager: Expose index_table_name()
db/legacy_schema_migrator: Don't migrate indexes
If reader's buffer is small enough, or preemption happens often
enough, fill_buffer() may not make enough progress to advance
_lower_bound. If iterators are also constantly invalidated across
fill_buffer() calls, the reader will not be able to make progress.
See row_cache_test.cc::test_reading_progress_with_small_buffer_and_invalidation()
for an example scenario.
Also reproduced in debug-mode row_cache_test.cc::test_concurrent_reads_and_eviction
Message-Id: <1528283957-16696-1-git-send-email-tgrabiec@scylladb.com>
"
As in #3423, ensuring token order on secondary index queries can be done
by adding an additional column to views that back secondary indexes.
This column is a first clustering column and contains token value,
computed on updates.
This series also updates tests and comments referring to issue 3423.
Tests: unit (release, debug)
"
* 'order_by_token_in_si_5' of https://github.com/psarna/scylla:
cql3: update token order comments
index, tests: add token column to secondary index schema
view: add handling of a token column for secondary indexes
view: add is_index method
Additional token column is now present in every view schema
that backs a secondary index. This column is always a first part
of the clustering key, so it forces token order on queries.
Column's name is ideally idx_token, but can be postfixed
with a number to ensure its uniqueness.
It also updates tests to make them acknowledge the new token order.
Fixes #3423
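The naming rule above (prefer idx_token, otherwise append a number) could be sketched as follows. This is only an illustration: the helper name unique_token_column_name and the "_<n>" separator are assumptions, not taken from the patch.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical helper: pick a name for the extra token column that does not
// collide with any column already present in the view schema.
std::string unique_token_column_name(const std::set<std::string>& existing) {
    const std::string base = "idx_token";
    if (!existing.count(base)) {
        return base;
    }
    // Assumed scheme: append an increasing number until the name is free.
    for (unsigned i = 0;; ++i) {
        std::string candidate = base + "_" + std::to_string(i);
        if (!existing.count(candidate)) {
            return candidate;
        }
    }
}
```

A base table that already has a column called idx_token would get a differently named token column in its index view under this scheme.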
"
SSTables 3.x format ('m') stores the size of the previous row or RT marker
inside each row/marker. That potentially allows traversing rows/markers
in reverse order.
The previous code calculating those sizes appeared to produce invalid
values for all rows except the first one. The problem with detecting
this bug was that neither Cassandra itself nor the sstabledump tool uses
those values; they are simply skipped on reading.
From the UnfilteredSerializer.deserializeRowBody() method
(https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/db/rows/UnfilteredSerializer.java#L562):
    if (header.isForSSTable())
    {
        in.readUnsignedVInt(); // Skip row size
        in.readUnsignedVInt(); // previous unfiltered size
    }
So while the previous test files were technically correct in that they
contained valid data readable by Cassandra/sstabledump, they didn't
follow the format specification.
This patchset fixes the code to produce correct values and replaces
incorrect data files with correct ones. The newly generated data files
have been validated to be identical to files generated with Cassandra
using the same data and timestamps as the unit tests.
Tests: Unit {release}
"
* 'projects/sstables-30/fix-prev-row_size/v1' of https://github.com/argenet/scylla:
tests: Fix test files to use correct previous row sizes.
sstables: Fix calculation of previous row size for SSTables 3.x
sstables: Factor out code building promoted index blocks into separate helpers.
"
This patchset contains two fixes to the clustering key prefixes
serialization logic for SSTables 3.x.
First, it fixes a vexing typo: a bitwise-and (&) has been used instead
of a remainder operator (%) for truncating the shift value.
This did not show up in existing tests because they all had non-empty
clustering columns values.
Added tests to cover empty clustering columns values.
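A minimal sketch of why the typo stayed hidden, assuming a header layout of two bits per clustering value (low bit for an empty value, high bit for null) in blocks of 32 values. The function name and the exact layout are assumptions for illustration, not the actual helper from the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// A clustering value: std::nullopt models null, "" models an empty value.
using value = std::optional<std::string>;

// Builds the header for one block of up to 32 clustering values.
// `offset` is the index of the first value covered by this block.
uint64_t make_block_header(const std::vector<value>& values, size_t offset) {
    uint64_t header = 0;
    const size_t limit = std::min(values.size(), offset + 32);
    for (size_t i = offset; i < limit; ++i) {
        // The remainder keeps the shift inside the 64-bit header.
        // A bitwise-and with the same constant (i & 32) would only ever
        // produce 0 or 32, so most values would land on the wrong bits.
        const unsigned shift = (i % 32) * 2;
        if (!values[i]) {
            header |= uint64_t(1) << (shift + 1);   // null
        } else if (values[i]->empty()) {
            header |= uint64_t(1) << shift;         // empty
        }
    }
    return header;
}
```

When every value is non-empty, no bit is ever set and the header is 0 whatever the shift is, which is consistent with the bug not showing up in the older tests.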
Second, it fixes the logic of serialization to write values up to the
prefix length, not the length of the clustering key as defined by
schema. This matches the way it is done by the Origin.
There is, however, a special case where the prefix size is smaller than
that of a clustering key but we still need to serialize up to the full
size. This is the case when a compact table is being used and some
rows in it are added using incomplete clustering keys (containing null
for trailing columns).
In Cassandra, these prefixes still have a full length and missing
columns are just set to 'null'. In our code those prefixes have their
real length, but since we need to serialize beyond it, we pass a flag to
indicate this.
"
* 'projects/sstables-30/fix-clustering-blocks/v1' of https://github.com/argenet/scylla:
tests: Add test covering compact table with non-full clustering key.
sstables: Improve clustering blocks writing, use logical clustering prefix size.
tests: Add test covering large clustering keys (>32 columns) for SSTables 3.x
tests: Add unit test covering empty values in clustering key.
sstables: Fix typo in clustering blocks write helper.
"
Add handling for missing columns and tests for it.
There are 3 cases:
1. Number of columns in a table is smaller than 64
2. Number of columns in a table is greater than 64
2a. and less than half of all possible columns are present in sstable
2b. and at least half of all possible columns are present in sstable
Case 1 is implemented using a bit mask; a column is present if (mask & (1 << <column number>)) == 0
Case 2a is implemented by storing the list of column numbers of the present columns
Case 2b is implemented by storing the list of column numbers of the absent columns
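The case selection and the bit mask test can be sketched like this. The names subset_encoding, is_present and choose_encoding are hypothetical, and treating exactly 64 columns as a "large" table is an assumption, since the description above does not say which side it falls on.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

enum class subset_encoding { bitmask, present_list, absent_list };

// Bit mask case: a cleared bit means the column is present.
// Note the parentheses: in C++, == binds tighter than &.
bool is_present(uint64_t mask, unsigned column) {
    assert(column < 64);
    return (mask & (uint64_t(1) << column)) == 0;
}

// Picks the encoding from the total number of columns in the table and
// the number of columns actually present in the row.
subset_encoding choose_encoding(size_t all_columns, size_t present_columns) {
    if (all_columns < 64) {
        return subset_encoding::bitmask;
    }
    if (present_columns * 2 < all_columns) {
        // Fewer than half present: listing the present ones is shorter.
        return subset_encoding::present_list;
    }
    // At least half present: listing the absent ones is shorter.
    return subset_encoding::absent_list;
}
```

Listing whichever side is smaller keeps the encoded list at no more than half the number of columns.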
"
* 'haaawk/sstables3/read-missing-columns-v3' of ssh://github.com/scylladb/seastar-dev:
sstables 3: add test for reading big dense subset of columns
sstables 3: support reading big dense subsets of columns
sstables 3: add test for reading big sparse subset of columns
sstables 3: support reading big sparse subsets of columns
sstables 3: add test for reading small subset of columns
sstables 3: support reading small subsets of columns
"
This is the first part of the first step of switching Scylla to the new
in-memory representation (IMR). It covers
converting cells to the new serialisation format. The actual structure
of the cells doesn't differ much from the original one with a notable
exception of the fact that large values are now fragmented and
linearisation needs to be explicit. Counters and collections still
partially rely on their old, custom serialisation code and their
handling is not optimal (although not significantly worse than it used
to be).
The new in-memory representation allows objects to be of varying size
and makes it possible to provide deserialisation context so that we
don't need to keep in each instance of an IMR type all the information
needed to interpret it. The structure of IMR types is described in C++
using some metaprogramming with the hopes of making it much easier to
modify the serialisation format than it would be in case of open-coded
serialisation functions.
Moreover, IMR types can own memory thanks to a limited support for
destructors and movers (the latter are not exactly the same thing as C++
move constructors, hence a different name). This makes it (relatively)
easy to ensure that there is an upper bound on the size of all allocations.
For now the only thing that is converted to the IMR are atomic_cells
and collections which means that the reduction in the memory footprint
is not as big as it can be, but introducing the IMR is a big step on its
own and also paves the way towards complete elimination of unbounded
memory allocations.
The first part of this patchset contains miscellaneous preparatory
changes to various parts of the Scylla codebase. They are followed by
introduction of the IMR infrastructure. Then structure of cells is
defined and all helper functions are implemented. Next are several
treewide patches that mostly deal with propagating type information to
the cell-related operations. Finally, atomic_cell and collections are
switched to use the new IMR-based cell implementation.
The IMR is described in much more detail in imr/IMR.md added in "imr:
add IMR documentation".
Refs #2031.
Refs #2409.
perf_simple_query -c4, medians of 30 results:
         ./perf_base   ./perf_imr   diff
 read    308790.08     309775.35    0.3%
 write   402127.32     417729.18    3.9%
The same with 1 byte values:
         ./perf_base1  ./perf_imr1  diff
 read    314107.26     314648.96    0.2%
 write   463801.40     433255.96   -6.6%
The memory footprint is reduced, but that is partially due to removal of
small buffer optimisation (whether it will be restored depends on the
exact measurements of the performance impact). Generally, this series was
not expected to make a huge difference as this would require converting
whole rows to the IMR.
Memory footprint:
Before:
mutation footprint:
- in cache: 1264
- in memtable: 986
After:
mutation footprint:
- in cache: 1104
- in memtable: 866
Tests: unit (release, debug)
"
* tag 'imr-cells/v3' of https://github.com/pdziepak/scylla: (37 commits)
tests/mutation: add test for changing column type
atomic_cell: switch to new IMR-based cell representation
atomic_cell: explicitly state when atomic_cell is a collection member
treewide: require type for creating collection_mutation_view
treewide: require type for comparing cells
atomic_cell: introduce fragmented buffer value interface
treewide: require type to compute cell memory usage
treewide: require type to copy atomic_cell
treewide: require type info for copying atomic_cell_or_collection
treewide: require type for creating atomic_cell
atomic_cell: require column_definition for creating atomic_cell views
tests: test imr representation of cells
types: provide information for IMR
data: introduce cell
data: introduce type_info
imr/utils: add imr object holder
imr: introduce concepts
imr: add helper for allocating objects
imr: allow creating lsa migrators for IMR objects
imr: introduce placeholders
...
With the introduction of the new in-memory representation changing
column type has become a more complex operation since it needs to handle
switch from fixed-size to variable-size types. This commit adds an
explicit test for such cases.
In preparation for the switch to the new cell representation this
patch changes the type returned by atomic_cell_view::value() to one that
requires explicit linearisation of the cell value. Even though the value
is still implicitly linearised (and only when managed by the LSA) the
new interface is the same as the target one so that no more changes to
its users will be needed.
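The idea of an interface that requires explicit linearisation can be sketched as below. fragmented_value_view is a hypothetical stand-in, not the type actually returned by atomic_cell_view::value(); the point is only that contiguous bytes are obtained through a visible call rather than an implicit conversion.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical stand-in for a cell value stored as several fragments.
class fragmented_value_view {
    std::vector<std::string_view> _fragments;
public:
    explicit fragmented_value_view(std::vector<std::string_view> fragments)
        : _fragments(std::move(fragments)) {}

    // Total size of the value across all fragments.
    size_t size() const {
        size_t n = 0;
        for (auto f : _fragments) {
            n += f.size();
        }
        return n;
    }

    // The only way to get contiguous bytes. The copy is visible at the
    // call site instead of happening behind the caller's back.
    std::string linearize() const {
        std::string out;
        out.reserve(size());
        for (auto f : _fragments) {
            out.append(f);
        }
        return out;
    }
};
```

Callers that only need the size or that can work fragment by fragment never pay for the copy.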
Since sstabledump and Cassandra do not use row size values, the new
files have been validated to be identical to files generated by
Cassandra with the same data inserted at same timestamps.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
tests/view_complex_test.cc contained a #ifdef'ed-out test claiming to
be a reproducer for issue #3362. Unfortunately, it is not - after
earlier commits the only reason this test still fails is a mistake in
the test, which expects 0 rows in a case where the real result is 1 row.
Issue #3362 does *not* have to be fixed to fix this test.
So this patch fixes the broken test, and enables it. It also adds comments
explaining what this test is supposed to do, and why it works the way it
does.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180530142214.29398-1-nyh@scylladb.com>
"
Add handling for static rows and tests for it.
"
* 'haaawk/sstables3/read-static-v1' of ssh://github.com/scylladb/seastar-dev:
sstable_3_x_test: Add test_uncompressed_compound_static_row_read
sstable_3_x_test: add test_uncompressed_static_row_read
flat_mutation_reader_assertions: improve static row assertions
data_consume_rows_context_m: Implement support for static rows
mp_row_consumer_m: Implement support for static rows
mp_row_consumer_m: Extract fill_cells
"
We currently suffer from reactor stalls caused by non-preemptible processing
of large partitions in the following places:
(1) dropping partition entries from cache or memtables does not defer
(2) dropping partition versions abandoned by detached snapshots does not defer
(3) merging of partition versions when snapshots go away does not defer
(4) cache update from memtable processes partition entries without deferring (#2578)
(5) partition entries are upgraded to new schema atomically
This series fixes problems (1), (2) and (4), but not (3) and (5).
(1) and (2) are fixed by introducing mutation_cleaner objects, which are
containers for garbage partition versions that delay the actual freeing.
Freeing happens from memory reclaimers and is incremental.
(3) and (5) are not solved yet.
(4) is solved by having partition merging process partitions with row
granularity and defer in the middle of a partition. In order to preserve update
atomicity on partition level as perceived by reads, when the update starts we
create a snapshot of the current version of the partition and process the memtable
entry by inserting data into a separate partition version. This way, if the update
defers in the middle of a partition, reads can still go to the old version and
not see partial writes. Snapshots are marked with phase numbers, and reads
will use the previous phase until the whole partition is updated. When the partition
is finally merged, the snapshots go away and the new version will eventually
be merged into the old version. Due to (3) however, this merging may still add
latency to the update path.
Remaining work:
- Solving problem (3). I think the approach to take here would be to
move the task of merging versions to the background, maybe into mutation_cleaner.
- Merging range tombstones incrementally.
Performance
===========
Performance improvements were evaluated using tests/perf_row_cache_update -c1 -m1G,
which measures time it takes to update cache from memtable for various workloads
and schemas.
For large partition with lots of small rows we see a significant reduction of
scheduling latency from ~550ms to ~23ms. The cause of the remaining latency is
problem (3) stated above. The run time is reduced by 70%.
For small partition case without clustering columns we see no degradation.
For small partition case with clustering key, but only 3 small rows per partition,
we see a 30% degradation in run time.
For large partition with lots of range tombstones we see degradation of 15% in
run time and scheduling latency.
Below you can see full statistics for cache update run time:
=== Small partitions, no overwrites:
Before:
avg = 433.965155
stdev = 35.958024
min = 340.093201
max = 468.564514
After:
avg = 436.929447 (+1%)
stdev = 37.130237
min = 349.410339
max = 489.953400
=== Small partition with a few rows:
Before:
avg = 315.379316
stdev = 30.059120
min = 240.340561
max = 342.408295
After:
avg = 407.232691 (+30%)
stdev = 53.918717
min = 269.514648
max = 444.846649
=== Large partition, lots of small rows:
Before:
avg = 412.870689
stdev = 227.411317
min = 286.990631
max = 1263.417847
After:
avg = 124.351705 (-70%)
stdev = 4.705762
min = 110.063255
max = 129.643387
=== Large partition, lots of range tombstones:
Before:
avg = 601.172644
stdev = 121.376866
min = 223.502136
max = 874.111572
After:
avg = 695.627588 (+15%)
stdev = 135.057004
min = 337.173950
max = 784.838745
"
* tag 'tgrabiec/clear-gently-all-partitions-v3' of github.com:tgrabiec/scylla:
mvcc: Use small_vector<> in partition_snapshot_row_cursor
utils: Extract small_vector.hh
mvcc: Erase rows gradually in apply_to_incomplete()
mvcc: partition_snapshot_row_cursor: Avoid row copying in consume() when possible
cache: real_dirty_memory_accounter: Move unpinning out of the hot path
mvcc: partition_snapshot_row_cursor: Reduce lookups in ensure_entry_if_complete()
mutation_partition: Reduce row lookups in apply_monotonically()
cache: Release dirty memory with row granularity
cache: Defer during partition merging
mvcc: partition_snapshot_row_cursor: Introduce consume_row()
mvcc: partition_snapshot_row_cursor: Introduce maybe_refresh_static()
mvcc: Make apply_to_incomplete() work with attached versions
cache: Propagate phase to apply_to_incomplete()
cache: Prepare for incremental apply_to_incomplete()
Introduce a coroutine wrapper
tests: mvcc: Encapsulate memory management details
tests: cache: Take into account that update() may defer
cache: real_dirty_memory_accounter: Allow construction without memtable
cache: Extract real_dirty_memory_accounter
mvcc: Destroy memtable partition versions gently
memtable: Destroy partitions incrementally from clear_gently()
mvcc: Remove rows from tracker gently
cache: Destroy partition versions incrementally
Introduce mutation_cleaner
mvcc: Introduce partition_version_list
mvcc: Fix move constructor of partition_version_ref() not preserving _unique_owner
database: Add API for incremental clearing of partition entries
cache: Define trivial methods inline
tests: Improve perf_row_cache_update
mutation_reader: Make empty mutation source advertize no partitions
Incremental merging will be implemented by the means of resumable
functions, which return stop_iteration::no when not yet
finished. We're not using futures, so that the caller can do work
around preemption points as well.
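A minimal sketch of such a resumable function, under stated assumptions: stop_iteration here is a local stand-in for the Seastar type, resumable_merge and run_to_completion are hypothetical names, and a per-call row budget stands in for a real preemption check.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Local stand-in for Seastar's stop_iteration.
enum class stop_iteration { no, yes };

// Hypothetical resumable merge: each call moves at most `budget` rows
// from `src` to `dst` and reports whether the whole merge has finished.
class resumable_merge {
    std::vector<int>& _dst;
    const std::vector<int>& _src;
    size_t _pos = 0;  // progress is kept between calls
public:
    resumable_merge(std::vector<int>& dst, const std::vector<int>& src)
        : _dst(dst), _src(src) {}

    stop_iteration operator()(size_t budget) {
        while (_pos < _src.size() && budget-- > 0) {
            _dst.push_back(_src[_pos++]);
        }
        return _pos == _src.size() ? stop_iteration::yes : stop_iteration::no;
    }
};

// Drives the merge to completion and returns the number of calls made.
size_t run_to_completion(resumable_merge& merge, size_t budget) {
    assert(budget > 0);
    size_t calls = 1;
    while (merge(budget) == stop_iteration::no) {
        // Because no future is involved, the caller is free to do its own
        // work here (release a lock, yield, update accounting) and resume.
        ++calls;
    }
    return calls;
}
```

Returning a plain enum rather than a future leaves the decision of what to do at each preemption point with the caller.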
Currently tests have a single LSA region lock around construction of
managed objects, their manipulation, and access. This way we avoid the
complexity of dealing with allocating sections. That will not be
possible once apply_to_incomplete() is changed to enter an allocating
section itself because this requires the region to be unlocked at
entry. The tests will have to take more fine-grained locks. That is
somewhat tricky and would add a lot of noise to tests. This patch will
make things easier by abstracting LSA management, among other things,
inside mvcc_container and mvcc_partition classes.
The test incorrectly assumed that once update() is started the
cache will return only versions from last_generation. This will not
hold once we start to defer during partition merging.
Now all snapshots will have a mutation_cleaner which they will use to
gently destroy freed partition_version objects.
Destruction of memtable entries during cache update is also using the
gentle cleaner now. We need to have a separate cleaner for memtable
objects even though they're owned by cache's region, because memtable
versions must be cleared without a cache_tracker.
Each memtable will have its own cleaner, which will be merged with the
cache's cleaner when memtable is merged into cache.
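The cleaner idea (queue garbage versions, free them a little at a time, merge one cleaner into another) can be sketched as follows. The class and method names are illustrative assumptions and do not reproduce the mutation_cleaner interface; a vector of ints stands in for a partition version with many rows.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <memory>
#include <utility>
#include <vector>

// Stand-in for a partition version holding many rows.
using version = std::vector<int>;

// Hypothetical cleaner: owns garbage versions and frees them gradually.
class cleaner {
    std::deque<std::unique_ptr<version>> _garbage;
public:
    // Queues a version for later destruction instead of freeing it now.
    void destroy_later(std::unique_ptr<version> v) {
        _garbage.push_back(std::move(v));
    }

    // Takes over everything queued in `other`, leaving it empty.
    void merge(cleaner& other) {
        for (auto& v : other._garbage) {
            _garbage.push_back(std::move(v));
        }
        other._garbage.clear();
    }

    bool empty() const { return _garbage.empty(); }

    // Frees at most `max_rows` rows; returns true once nothing is left.
    bool clear_gently(size_t max_rows) {
        while (max_rows > 0 && !_garbage.empty()) {
            version& v = *_garbage.front();
            while (max_rows > 0 && !v.empty()) {
                v.pop_back();
                --max_rows;
            }
            if (v.empty()) {
                _garbage.pop_front();
            }
        }
        return _garbage.empty();
    }
};
```

Each clear_gently() call does a bounded amount of work, so a large version is destroyed over several calls rather than in one long stall.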
Fixes some sources of reactor stalls on cache update when there are
large partition entries in memtables.
Instead of destroying whole partition_versions at once, we will do that
gently using mutation_cleaner to avoid reactor stalls.
Large deletions could happen when large partition gets invalidated,
upgraded to a new schema, or when it's abandoned by a detached snapshot.
Refs #3289.
"
Add handling for clustering columns and tests for it.
"
* 'haaawk/sstables3/read-ck-v3' of ssh://github.com/scylladb/seastar-dev:
Add test_uncompressed_compound_ck_read for SSTables 3.x
Add test_uncompressed_simple_read for SSTables 3.x
Implement reading clustering key from SSTables 3.x
column_translation: cache fixed value lengths for ck
data_consume_rows_context_m: use cached fixed column value lengths
column_translation: store fixed lengths of column values
consume_row_start: change type of clustering key
Rename ROW_BODY state to CLUSTERING_ROW