untyped_result_set_row's cell data type is bytes_opt, and the
get_blob() accessor accesses the value assuming it's engaged
(relying on the caller to call has() first).
has_unsalted_hash() calls get_blob() without calling has() beforehand,
potentially triggering undefined behavior.
Fix by using get_or() instead, which also simplifies the caller.
I observed failures in Jenkins in this area. It's hard to be sure
this is the root cause, since the failures triggered an internal
consistency assertion in asan rather than an asan report. However,
the error is hard to reproduce and the fix makes sense even if it
doesn't prevent the error.
See #3480 for the asan error.
Fixes #3480 (hopefully).
Message-Id: <20180602181919.29204-1-avi@scylladb.com>
"
This is the first part of the first step of switching Scylla. It covers
converting cells to the new serialisation format. The actual structure
of the cells doesn't differ much from the original one with a notable
exception of the fact that large values are now fragmented and
linearisation needs to be explicit. Counters and collections still
partially rely on their old, custom serialisation code and their
handling is not optimal (although not significantly worse than it used
to be).
The new in-memory representation allows objects to be of varying size
and makes it possible to provide deserialisation context so that we
don't need to keep in each instance of an IMR type all the information
needed to interpret it. The structure of IMR types is described in C++
using some metaprogramming, with the hope of making it much easier to
modify the serialisation format than it would be in the case of
open-coded serialisation functions.
Moreover, IMR types can own memory thanks to a limited support for
destructors and movers (the latter are not exactly the same thing as C++
move constructors, hence the different name). This makes it (relatively)
easy to ensure that there is an upper bound on the size of all allocations.
For now the only things converted to the IMR are atomic_cells
and collections, which means that the reduction in the memory footprint
is not as big as it can be, but introducing the IMR is a big step on its
own and also paves the way towards complete elimination of unbounded
memory allocations.
The first part of this patchset contains miscellaneous preparatory
changes to various parts of the Scylla codebase. They are followed by
introduction of the IMR infrastructure. Then the structure of cells is
defined and all helper functions are implemented. Next are several
treewide patches that mostly deal with propagating type information to
the cell-related operations. Finally, atomic_cell and collections are
switched to use the new IMR-based cell implementation.
The IMR is described in much more detail in imr/IMR.md added in "imr:
add IMR documentation".
Refs #2031.
Refs #2409.
perf_simple_query -c4, medians of 30 results:
./perf_base ./perf_imr diff
read 308790.08 309775.35 0.3%
write 402127.32 417729.18 3.9%
The same with 1 byte values:
./perf_base1 ./perf_imr1 diff
read 314107.26 314648.96 0.2%
write 463801.40 433255.96 -6.6%
The memory footprint is reduced, but that is partially due to removal of
small buffer optimisation (whether it will be restored depends on the
exact measurements of the performance impact). Generally, this series was
not expected to make a huge difference as this would require converting
whole rows to the IMR.
Memory footprint:
Before:
mutation footprint:
- in cache: 1264
- in memtable: 986
After:
mutation footprint:
- in cache: 1104
- in memtable: 866
Tests: unit (release, debug)
"
* tag 'imr-cells/v3' of https://github.com/pdziepak/scylla: (37 commits)
tests/mutation: add test for changing column type
  atomic_cell: switch to new IMR-based cell representation
atomic_cell: explicitly state when atomic_cell is a collection member
treewide: require type for creating collection_mutation_view
treewide: require type for comparing cells
atomic_cell: introduce fragmented buffer value interface
treewide: require type to compute cell memory usage
treewide: require type to copy atomic_cell
treewide: require type info for copying atomic_cell_or_collection
treewide: require type for creating atomic_cell
atomic_cell: require column_definition for creating atomic_cell views
tests: test imr representation of cells
types: provide information for IMR
data: introduce cell
data: introduce type_info
imr/utils: add imr object holder
imr: introduce concepts
imr: add helper for allocating objects
imr: allow creating lsa migrators for IMR objects
imr: introduce placeholders
...
Scylla now exposes the Prometheus API by default. This patch changes
scyllatop to use the Prometheus API; the collectd API is still available.
The main changes in the patch:
* Move collectd specific logic inside collectd.
* Add support for help information.
* Add command line to configure prometheus end point and to enable
collectd.
* Add a prometheus class that collects information from prometheus.
Fixes: #1541
Message-Id: <20180531124156.26336-1-amnon@scylladb.com>
Only libjsoncpp >= 1.6.0 offers a safe name() method for value
iterators. For older versions, deprecated memberName() is used
instead. Note that memberName() was deprecated because of its
inability to deal with embedded null characters.
Fixes #3471
Message-Id: <e64a62bfc24ef06daee238d79d557fe6ec8979d3.1527758708.git.sarna@scylladb.com>
With the introduction of the new in-memory representation changing
column type has become a more complex operation since it needs to handle
switch from fixed-size to variable-size types. This commit adds an
explicit test for such cases.
This patch changes the implementation of atomic_cell and
atomic_cell_or_collection to use the data::cell implementation which is
based on the new in-memory representation infrastructure.
Collections are not going to be fully converted to the IMR just yet and
still use the old serialisation format. This means that they still don't
support fragmented values very well. This patch passes along the
information that an atomic_cell is being created as a member of a
collection so that later we can avoid fragmenting the value in such cases.
As preparation for the switch to the new cell representation this
patch changes the type returned by atomic_cell_view::value() to one that
requires explicit linearisation of the cell value. Even though the value
is still implicitly linearised (and only when managed by the LSA) the
new interface is the same as the target one so that no more changes to
its users will be needed.
This commit introduces cell serializers and views based on the in-memory
representation infrastructure. The code doesn't assume anything about
how the cells are stored; they can be either a part of another IMR
object (once the rows are converted to the IMR) or separate objects
(just like the current atomic_cell).
A view schema's view_info contains the id of the base regular column
that the view includes in its primary key. Since the column id of a
particular column can potentially change with a new schema version, we
need to refresh the stored column id. We weren't doing that when
unselected base columns are added, and this patch fixes it by
triggering an update of the view schema when base columns are added
and the view contains a base regular column in its PK.
Fixes#3443
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180530194536.51202-1-duarte@scylladb.com>
IMR objects may own memory. object_allocator takes care of allocating
memory for all owned objects during the serialisation of their owner.
In practice a writer of the parent object would accept a helper object
created by object_allocator. That helper object would be responsible
either for computing the size of the buffers that have to be allocated
or for performing the actual serialisation, in the same two-phase manner
as is done for the parent IMR object.
In some cases the actual value of an IMR object is not known at
serialisation time. If the type is fixed-size we can use a placeholder
to defer writing it to a more convenient moment.
This patch introduces destructors and movers for IMR objects, which
enables them to own memory. Custom destructors and movers can be
defined by specialising the appropriate classes.
This patch adds a new way of serialising bytes and sstring objects in the
IDL. Using write_fragmented_<field-name>() the caller can pass a range
of fragments that would be serialised without linearising the buffer.
tests/view_complex_test.cc contained an #ifdef'ed-out test claiming to
be a reproducer for issue #3362. Unfortunately, it is not - after
earlier commits the only reason this test still fails is a mistake in
the test, which expects 0 rows in a case where the real result is 1 row.
Issue #3362 does *not* have to be fixed to fix this test.
So this patch fixes the broken test, and enables it. It also adds comments
explaining what this test is supposed to do, and why it works the way it
does.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180530142214.29398-1-nyh@scylladb.com>
"
Add handling for static rows and tests for it.
"
* 'haaawk/sstables3/read-static-v1' of ssh://github.com/scylladb/seastar-dev:
sstable_3_x_test: Add test_uncompressed_compound_static_row_read
sstable_3_x_test: add test_uncompressed_static_row_read
flat_mutation_reader_assertions: improve static row assertions
data_consume_rows_context_m: Implement support for static rows
mp_row_consumer_m: Implement support for static rows
mp_row_consumer_m: Extract fill_cells
"
We currently suffer from reactor stalls caused by non-preemptible processing
of large partitions in the following places:
(1) dropping partition entries from cache or memtables does not defer
(2) dropping partition versions abandoned by detached snapshots does not defer
(3) merging of partition versions when snapshots go away does not defer
(4) cache update from memtable processes partition entries without deferring (#2578)
(5) partition entries are upgraded to new schema atomically
This series fixes problems (1), (2) and (4), but not (3) and (5).
(1) and (2) are fixed by introducing mutation_cleaner objects, which are
containers for garbage partition versions that delay the actual freeing.
Freeing happens from memory reclaimers and is incremental.
(3) and (5) are not solved yet.
(4) is solved by having partition merging process partitions with row
granularity and defer in the middle of a partition. In order to preserve
update atomicity at the partition level as perceived by reads, when an
update starts we create a snapshot of the current version of the
partition and process the memtable entry by inserting data into a
separate partition version. This way, if the upgrade defers in the
middle of a partition, reads can still go to the old version and not see
partial writes. Snapshots are marked with phase numbers, and reads will
use the previous phase until the whole partition is upgraded. When the
partition is finally merged, the snapshots go away and the new version
will eventually be merged into the old version. Due to (3), however,
this merging may still add latency to the upgrade path.
Remaining work:
- Solving problem (3). I think the approach to take here would be to
move the task of merging versions to the background, maybe into mutation_cleaner.
- Merging range tombstones incrementally.
Performance
===========
Performance improvements were evaluated using tests/perf_row_cache_update -c1 -m1G,
which measures the time it takes to update cache from memtable for various workloads
and schemas.
For large partition with lots of small rows we see a significant reduction of
scheduling latency from ~550ms to ~23ms. The cause of the remaining latency is
problem (3) stated above. The run time is reduced by 70%.
For small partition case without clustering columns we see no degradation.
For small partition case with clustering key, but only 3 small rows per partition,
we see a 30% degradation in run time.
For large partition with lots of range tombstones we see degradation of 15% in
run time and scheduling latency.
Below you can see full statistics for cache update run time:
=== Small partitions, no overwrites:
Before:
avg = 433.965155
stdev = 35.958024
min = 340.093201
max = 468.564514
After:
avg = 436.929447 (+1%)
stdev = 37.130237
min = 349.410339
max = 489.953400
=== Small partition with a few rows:
Before:
avg = 315.379316
stdev = 30.059120
min = 240.340561
max = 342.408295
After:
avg = 407.232691 (+30%)
stdev = 53.918717
min = 269.514648
max = 444.846649
=== Large partition, lots of small rows:
Before:
avg = 412.870689
stdev = 227.411317
min = 286.990631
max = 1263.417847
After:
avg = 124.351705 (-70%)
stdev = 4.705762
min = 110.063255
max = 129.643387
=== Large partition, lots of range tombstones:
Before:
avg = 601.172644
stdev = 121.376866
min = 223.502136
max = 874.111572
After:
avg = 695.627588 (+15%)
stdev = 135.057004
min = 337.173950
max = 784.838745
"
* tag 'tgrabiec/clear-gently-all-partitions-v3' of github.com:tgrabiec/scylla:
mvcc: Use small_vector<> in partition_snapshot_row_cursor
utils: Extract small_vector.hh
mvcc: Erase rows gradually in apply_to_incomplete()
mvcc: partition_snapshot_row_cursor: Avoid row copying in consume() when possible
cache: real_dirty_memory_accounter: Move unpinning out of the hot path
mvcc: partition_snapshot_row_cursor: Reduce lookups in ensure_entry_if_complete()
mutation_partition: Reduce row lookups in apply_monotonically()
cache: Release dirty memory with row granularity
cache: Defer during partition merging
mvcc: partition_snapshot_row_cursor: Introduce consume_row()
mvcc: partition_snapshot_row_cursor: Introduce maybe_refresh_static()
mvcc: Make apply_to_incomplete() work with attached versions
cache: Propagate phase to apply_to_incomplete()
cache: Prepare for incremental apply_to_incomplete()
Introduce a coroutine wrapper
tests: mvcc: Encapsulate memory management details
tests: cache: Take into account that update() may defer
cache: real_dirty_memory_accounter: Allow construction without memtable
cache: Extract real_dirty_memory_accounter
mvcc: Destroy memtable partition versions gently
memtable: Destroy partitions incrementally from clear_gently()
mvcc: Remove rows from tracker gently
cache: Destroy partition versions incrementally
Introduce mutation_cleaner
mvcc: Introduce partition_version_list
mvcc: Fix move constructor of partition_version_ref() not preserving _unique_owner
database: Add API for incremental clearing of partition entries
cache: Define trivial methods inline
tests: Improve perf_row_cache_update
mutation_reader: Make empty mutation source advertize no partitions