Commit Graph

15609 Commits

Author SHA1 Message Date
Avi Kivity
187ebdbe46 auth: fix possible use of disengaged optional in has_salted_hash()
untyped_result_set_row's cell data type is bytes_opt, and the
get_blob() accessor accesses the value assuming it's engaged
(relying on the caller to call has()).

has_salted_hash() calls get_blob() without calling has() beforehand,
potentially triggering undefined behavior.

Fix by using get_or() instead, which also simplifies the caller.
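
The difference between the two access patterns can be sketched with std::optional. This is only an illustration, not the Scylla code: the bytes_opt alias, the get_or() helper and the has_salted_hash() body below are hypothetical stand-ins, and the real signatures in untyped_result_set_row may differ.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>

// Hypothetical stand-in for a result-set cell: the value may be absent.
using bytes_opt = std::optional<std::string>;

// Unsafe pattern (not compiled here): dereferencing a disengaged optional
// is undefined behaviour.
//   const std::string& hash = *cell;

// Safe pattern: fall back to a default when the cell is disengaged.
inline std::string get_or(const bytes_opt& cell, std::string fallback) {
    return cell.value_or(std::move(fallback));
}

// Hypothetical check mirroring the intent of the fix: a missing cell is
// treated like an empty value, i.e. "no salted hash".
inline bool has_salted_hash(const bytes_opt& cell) {
    return !get_or(cell, "").empty();
}
```

With this shape the caller no longer needs a separate has() check, because the disengaged case is folded into the default value.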

I observed failures in Jenkins in this area. It's hard to be sure
this is the root cause, since the failures triggered an internal
consistency assertion in asan rather than an asan report. However,
the error is hard to reproduce and the fix makes sense even if it
doesn't prevent the error.

See #3480 for the asan error.

Fixes #3480 (hopefully).
Message-Id: <20180602181919.29204-1-avi@scylladb.com>
2018-06-02 19:46:32 +01:00
Avi Kivity
aab6b0ee27 Merge "Introduce new in-memory representation for cells" from Paweł
"
This is the first part of the first step of switching Scylla. It covers
converting cells to the new serialisation format. The actual structure
of the cells doesn't differ much from the original one with a notable
exception of the fact that large values are now fragmented and
linearisation needs to be explicit. Counters and collections still
partially rely on their old, custom serialisation code and their
handling is not optimal (although not significantly worse than it used
to be).

The new in-memory representation allows objects to be of varying size
and makes it possible to provide deserialisation context so that we
don't need to keep in each instance of an IMR type all the information
needed to interpret it. The structure of IMR types is described in C++
using some metaprogramming with the hopes of making it much easier to
modify the serialisation format than it would be in case of open-coded
serialisation functions.

Moreover, IMR types can own memory thanks to a limited support for
destructors and movers (the latter are not exactly the same thing as C++
move constructors hence a different name). This makes it (relatively)
easy to ensure that there is an upper bound on the size of all allocations.

For now the only thing that is converted to the IMR are atomic_cells
and collections which means that the reduction in the memory footprint
is not as big as it can be, but introducing the IMR is a big step on its
own and also paves the way towards complete elimination of unbounded
memory allocations.

The first part of this patchset contains miscellaneous preparatory
changes to various parts of the Scylla codebase. They are followed by
introduction of the IMR infrastructure. Then structure of cells is
defined and all helper functions are implemented. Next are several
treewide patches that mostly deal with propagating type information to
the cell-related operations. Finally, atomic_cell and collections are
switched to use the new IMR-based cell implementation.

The IMR is described in much more detail in imr/IMR.md added in "imr:
add IMR documentation".

Refs #2031.
Refs #2409.

perf_simple_query -c4, medians of 30 results:

        ./perf_base  ./perf_imr   diff
 read     308790.08   309775.35   0.3%
 write    402127.32   417729.18   3.9%

The same with 1 byte values:
        ./perf_base1  ./perf_imr1   diff
 read      314107.26    314648.96   0.2%
 write     463801.40    433255.96  -6.6%

The memory footprint is reduced, but that is partially due to removal of
small buffer optimisation (whether it will be restored depends on the
exact measurements of the performance impact). Generally, this series was
not expected to make a huge difference as this would require converting
whole rows to the IMR.

Memory footprint:
Before:
mutation footprint:
 - in cache: 1264
 - in memtable: 986

After:
mutation footprint:
 - in cache: 1104
 - in memtable: 866

Tests: unit (release, debug)
"

* tag 'imr-cells/v3' of https://github.com/pdziepak/scylla: (37 commits)
  tests/mutation: add test for changing column type
  atomic_cell: switch to new IMR-based cell representation
  atomic_cell: explicitly state when atomic_cell is a collection member
  treewide: require type for creating collection_mutation_view
  treewide: require type for comparing cells
  atomic_cell: introduce fragmented buffer value interface
  treewide: require type to compute cell memory usage
  treewide: require type to copy atomic_cell
  treewide: require type info for copying atomic_cell_or_collection
  treewide: require type for creating atomic_cell
  atomic_cell: require column_definition for creating atomic_cell views
  tests: test imr representation of cells
  types: provide information for IMR
  data: introduce cell
  data: introduce type_info
  imr/utils: add imr object holder
  imr: introduce concepts
  imr: add helper for allocating objects
  imr: allow creating lsa migrators for IMR objects
  imr: introduce placeholders
  ...
2018-05-31 19:21:15 +03:00
Amnon Heiman
bc7503feee Scyllatop to use prometheus by default
Scylla now exposes the Prometheus API by default. This patch changes
scyllatop to use the Prometheus API; the collectd API is still available.

The main changes in the patch:
* Move collectd specific logic inside collectd.
* Add support for help information.
* Add command line options to configure the Prometheus endpoint and to
enable collectd.
* Add a prometheus class that collects information from Prometheus.

Fixes: #1541
Message-Id: <20180531124156.26336-1-amnon@scylladb.com>
2018-05-31 18:00:22 +03:00
Tomasz Grabiec
b5e42bc6a0 tests: row_cache: Do not hang when only one of the readers throws
Message-Id: <20180531122729.3314-1-tgrabiec@scylladb.com>
2018-05-31 18:00:22 +03:00
Piotr Sarna
360326fdc5 cql3: add compatibility with libjsoncpp < 1.6.0
Only libjsoncpp >= 1.6.0 offers a safe name() method for value
iterators. For older versions, deprecated memberName() is used
instead. Note that memberName() was deprecated because of its
inability to deal with embedded null characters.

Fixes #3471

Message-Id: <e64a62bfc24ef06daee238d79d557fe6ec8979d3.1527758708.git.sarna@scylladb.com>
2018-05-31 18:00:22 +03:00
Paweł Dziepak
131a47dea3 tests/mutation: add test for changing column type
With the introduction of the new in-memory representation changing
column type has become a more complex operation since it needs to handle
switch from fixed-size to variable-size types. This commit adds an
explicit test for such cases.
2018-05-31 15:51:11 +01:00
Paweł Dziepak
a040d37cd5 atomic_cell: switch to new IMR-based cell representation
This patch changes the implementation of atomic_cell and
atomic_cell_or_collection to use the data::cell implementation which is
based on the new in-memory representation infrastructure.
2018-05-31 15:51:11 +01:00
Paweł Dziepak
0ea6d14cf5 atomic_cell: explicitly state when atomic_cell is a collection member
Collections are not going to be fully converted to the IMR just yet and
still use the old serialisation format. This means that they still don't
support fragmented values very well. This patch passes the information
when an atomic_cell is created as a member of a collection so that later
we can avoid fragmenting the value in such cases.
2018-05-31 15:51:11 +01:00
Paweł Dziepak
e34ff8b4bf treewide: require type for creating collection_mutation_view 2018-05-31 15:51:11 +01:00
Paweł Dziepak
9bb1f10bb6 treewide: require type for comparing cells 2018-05-31 15:51:11 +01:00
Paweł Dziepak
aa25f0844f atomic_cell: introduce fragmented buffer value interface
As a preparation for the switch to the new cell representation this
patch changes the type returned by atomic_cell_view::value() to one that
requires explicit linearisation of the cell value. Even though the value
is still implicitly linearised (and only when managed by the LSA) the
new interface is the same as the target one so that no more changes to
its users will be needed.
2018-05-31 15:51:11 +01:00
Paweł Dziepak
ec9d166a4f treewide: require type to compute cell memory usage 2018-05-31 15:51:11 +01:00
Paweł Dziepak
418c159057 treewide: require type to copy atomic_cell 2018-05-31 15:51:11 +01:00
Paweł Dziepak
27014a23d7 treewide: require type info for copying atomic_cell_or_collection 2018-05-31 15:51:11 +01:00
Paweł Dziepak
e9d6fc48ac treewide: require type for creating atomic_cell 2018-05-31 15:51:11 +01:00
Paweł Dziepak
93130e80fb atomic_cell: require column_definition for creating atomic_cell views 2018-05-31 15:51:11 +01:00
Paweł Dziepak
b25cc61a13 tests: test imr representation of cells 2018-05-31 15:51:11 +01:00
Paweł Dziepak
43b216b43d types: provide information for IMR 2018-05-31 15:51:11 +01:00
Paweł Dziepak
eec33fda14 data: introduce cell
This commit introduces cell serializers and views based on the in-memory
representation infrastructure. The code doesn't assume anything about
how the cells are stored, they can be either a part of another IMR
object (once the rows are converted to the IMR) or separate objects
(just like current atomic_cell).
2018-05-31 15:51:11 +01:00
Duarte Nunes
f8626c7c93 tests/view_schema_test: Test view correctness under base schema changes
Reproducer for #3443.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180530194536.51202-2-duarte@scylladb.com>
2018-05-31 12:10:50 +03:00
Duarte Nunes
c4f267bdfe database: Refresh view dependent fields when altering base
A view schema's view_info contains the id of the base regular column
that view includes in its primary key. Since the column id of a
particular column can potentially change with a new schema version, we
need to refresh the stored column id. We weren't doing that when
unselected base columns are added, and this patch fixes it by
triggering an update of the view schema when base columns are added
and the view contains a base regular column in its PK.

Fixes #3443

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180530194536.51202-1-duarte@scylladb.com>
2018-05-31 12:10:49 +03:00
Paweł Dziepak
544b3c9a34 data: introduce type_info
This patch introduces type_info class which contains all type
information needed by IMR deserialisation contexts.
2018-05-31 10:09:01 +01:00
Paweł Dziepak
4929c1f39a imr/utils: add imr object holder
imr::object<> is an owning pointer to an IMR object. It is LSA-aware.
2018-05-31 10:09:01 +01:00
Paweł Dziepak
fd47858755 imr: introduce concepts
This commit adds type traits and concepts for sizers, serializers and
writers that help explicitly specify requirements of various interfaces.
2018-05-31 10:09:01 +01:00
Paweł Dziepak
28ea36a686 imr: add helper for allocating objects
IMR objects may own memory. object_allocator takes care of allocating
memory for all owned objects during the serialisation of their owner.

In practice a writer of the parent object would accept a helper object
created by object_allocator. That helper object would be either
responsible for computing the size of buffers that have to be allocated
or perform the actual serialisation in the same two phase manner as it
is done for the parent IMR object.
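
The two-phase manner mentioned above (first compute sizes, then write) can be sketched as follows. This is a minimal illustration under assumptions: the sizer, serializer, write_record() and serialize_record() names are hypothetical and are not the IMR or object_allocator API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string_view>
#include <vector>

// Phase 1 sink: only accumulates sizes, nothing is written.
struct sizer {
    std::size_t total = 0;
    void write(const void*, std::size_t n) { total += n; }
};

// Phase 2 sink: copies bytes into a buffer that was sized in phase 1.
struct serializer {
    std::uint8_t* out;
    void write(const void* src, std::size_t n) {
        if (n != 0) {
            std::memcpy(out, src, n);
            out += n;
        }
    }
};

// The same writer routine drives both phases, so the computed size and
// the bytes actually written cannot drift apart.
template <typename Sink>
void write_record(Sink& sink, std::uint32_t id, std::string_view payload) {
    const std::uint32_t len = static_cast<std::uint32_t>(payload.size());
    sink.write(&id, sizeof(id));
    sink.write(&len, sizeof(len));
    sink.write(payload.data(), payload.size());
}

std::vector<std::uint8_t> serialize_record(std::uint32_t id, std::string_view payload) {
    sizer s;
    write_record(s, id, payload);            // phase 1: compute the size
    std::vector<std::uint8_t> buf(s.total);  // one exact allocation
    serializer w{buf.data()};
    write_record(w, id, payload);            // phase 2: write the bytes
    return buf;
}
```

Running the writer twice costs some extra work, but it lets every allocation be sized exactly before any byte is written.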
2018-05-31 10:09:01 +01:00
Paweł Dziepak
79941f2fc7 imr: allow creating lsa migrators for IMR objects
This patch introduces helpers for creating LSA migrators from IMR
deserialisation contexts and context factories.
2018-05-31 10:09:01 +01:00
Paweł Dziepak
5ddb118c78 imr: introduce placeholders
In some cases the actual value of an IMR object is not known at the
serialisation time. If the type is fixed-size we can use a placeholder
to defer writing it to a more convenient moment.
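
The idea of a placeholder for a fixed-size value can be sketched like this. The placeholder<T> class, reserve() and write_evens() below are hypothetical names chosen for illustration and do not reflect the actual IMR interface.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Remembers where a fixed-size value belongs so it can be written later.
// An offset is stored instead of a pointer because the buffer may grow.
template <typename T>
class placeholder {
    static_assert(std::is_trivially_copyable<T>::value, "fixed-size POD only");
    std::size_t _offset;
public:
    explicit placeholder(std::size_t offset) : _offset(offset) {}
    void fill(std::vector<std::uint8_t>& buf, const T& value) const {
        std::memcpy(buf.data() + _offset, &value, sizeof(T));
    }
};

// Reserves sizeof(T) zeroed bytes at the end of the buffer.
template <typename T>
placeholder<T> reserve(std::vector<std::uint8_t>& buf) {
    const std::size_t offset = buf.size();
    buf.resize(offset + sizeof(T));
    return placeholder<T>(offset);
}

// Writes a count-prefixed list of the even input bytes. The count is
// known only after the elements have been emitted, so it is deferred.
std::vector<std::uint8_t> write_evens(const std::vector<std::uint8_t>& input) {
    std::vector<std::uint8_t> buf;
    auto count_slot = reserve<std::uint32_t>(buf);
    std::uint32_t count = 0;
    for (auto v : input) {
        if (v % 2 == 0) {
            buf.push_back(v);
            ++count;
        }
    }
    count_slot.fill(buf, count);  // fill the reserved slot at the end
    return buf;
}
```

This works only because the deferred value has a fixed size: the space it needs is known up front even though its content is not.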
2018-05-31 10:09:01 +01:00
Paweł Dziepak
8c38f09fbc tests/imr: add tests for destructor and mover methods 2018-05-31 10:09:01 +01:00
Paweł Dziepak
fa7b080443 imr: introduce destructor and mover methods
This patch introduces destructors and movers for IMR objects which
enables them to own memory. Custom destructors and movers can be
defined by specialising appropriate classes.
2018-05-31 10:09:01 +01:00
Paweł Dziepak
c02bfb942d imr/compound: introduce tagged_type<Tag, T> 2018-05-31 10:09:01 +01:00
Paweł Dziepak
a29a88c9d9 tests/imr/compound: add tests for structure<...> 2018-05-31 10:09:01 +01:00
Paweł Dziepak
4f51901dfe imr/compound: introduce structure<...> 2018-05-31 10:09:01 +01:00
Paweł Dziepak
466d91f652 tests/imr/compound: add tests for variant<Ts...> 2018-05-31 10:09:01 +01:00
Paweł Dziepak
8e4c8ce2c4 imr/compound: introduce variant<Ts...> 2018-05-31 10:09:01 +01:00
Paweł Dziepak
7c28c9eda8 tests/imr: add test for optional<T> 2018-05-31 10:09:01 +01:00
Paweł Dziepak
6d7b205d1a imr: introduce optional<T> 2018-05-31 10:09:01 +01:00
Paweł Dziepak
eb2479fa9a tests: add test for new in memory representation 2018-05-31 10:09:01 +01:00
Paweł Dziepak
a995fb337c imr: introduce fundamental types
This patch introduces fundamental IMR types: a set of flags, a POD type
and a buffer.
2018-05-31 10:09:01 +01:00
Paweł Dziepak
5f960beca1 imr: add IMR documentation 2018-05-31 10:09:01 +01:00
Paweł Dziepak
0092076167 tests: add helpers for generating random data 2018-05-31 10:09:01 +01:00
Paweł Dziepak
cc76480174 tests: introduce tests for metaprogramming helpers 2018-05-31 10:09:01 +01:00
Paweł Dziepak
ba5e64383a utils: add metaprogramming helper functions 2018-05-31 10:09:01 +01:00
Paweł Dziepak
5845d52632 idl: allow fragmented bytes_view in serialisation
This patch adds new way of serialising bytes and sstring objects in the
IDL. Using write_fragmented_<field-name>() the caller can pass a range
of fragments that would be serialised without linearising the buffer.
2018-05-31 10:09:01 +01:00
Paweł Dziepak
c41b9fc7ec utils: add fragment range
This patch introduces a FragmentRange concept which is the minimal interface all
classes representing a fragmented buffer should satisfy.
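
A minimal sketch of what such an interface could look like, and how a consumer can process a value fragment by fragment without linearising it first. The member names (begin(), end(), size_bytes(), empty()) and the fragmented_buffer and write_fragmented() helpers are assumptions made for this example, not a copy of the concept added by the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical fragmented buffer: a sequence of contiguous chunks.
class fragmented_buffer {
    std::vector<std::string> _fragments;
public:
    explicit fragmented_buffer(std::vector<std::string> f)
        : _fragments(std::move(f)) {}
    auto begin() const { return _fragments.begin(); }
    auto end() const { return _fragments.end(); }
    std::size_t size_bytes() const {
        std::size_t n = 0;
        for (const auto& f : _fragments) {
            n += f.size();
        }
        return n;
    }
    bool empty() const { return size_bytes() == 0; }
};

// Appends every fragment to the output in order. No intermediate
// contiguous copy of the whole value is created.
template <typename FragmentRange>
void write_fragmented(std::string& out, const FragmentRange& range) {
    out.reserve(out.size() + range.size_bytes());
    for (std::string_view fragment : range) {
        out.append(fragment);
    }
}
```

Any type exposing the same four members can be passed to write_fragmented(), which is the point of stating the requirement as a concept instead of a concrete class.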
2018-05-31 10:09:01 +01:00
Nadav Har'El
a1cbeeffcd tests/view_complex_test.cc: fix and enable buggy test
tests/view_complex_test.cc contained a #ifdef'ed-out test claiming to
be a reproducer for issue #3362. Unfortunately, it is not - after
earlier commits the only reason this test still fails is a mistake in
the test, which expects 0 rows in a case where the real result is 1 row.
Issue #3362 does *not* have to be fixed to fix this test.

So this patch fixes the broken test, and enables it. It also adds comments
explaining what this test is supposed to do, and why it works the way it
does.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180530142214.29398-1-nyh@scylladb.com>
2018-05-30 15:39:25 +01:00
Avi Kivity
9999e0e6bc Merge "Implement support for static rows in SSTable 3.0" from Piotr
"
Add handling for static rows and tests for it.
"

* 'haaawk/sstables3/read-static-v1' of ssh://github.com/scylladb/seastar-dev:
  sstable_3_x_test: Add test_uncompressed_compound_static_row_read
  sstable_3_x_test: add test_uncompressed_static_row_read
  flat_mutation_reader_assertions: improve static row assertions
  data_consume_rows_context_m: Implement support for static rows
  mp_row_consumer_m: Implement support for static rows
  mp_row_consumer_m: Extract fill_cells
2018-05-30 17:17:17 +03:00
Paweł Dziepak
62d0639fe9 Merge "Avoid reactor stalls in cache with large partitions" from Tomasz
"
We currently suffer from reactor stalls caused by non-preemptible processing
of large partitions in the following places:

  (1) dropping partition entries from cache or memtables does not defer

  (2) dropping partition versions abandoned by detached snapshots does not defer

  (3) merging of partition versions when snapshots go away does not defer

  (4) cache update from memtable processes partition entries without deferring (#2578)

  (5) partition entries are upgraded to new schema atomically

This series fixes problems (1), (2) and (4), but not (3) and (5).

(1) and (2) are fixed by introducing mutation_cleaner objects, which are
containers for garbage partition versions that delay the actual freeing.
Freeing happens from memory reclaimers and is incremental.
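
The incremental freeing described above can be sketched as a container that releases a bounded amount of garbage per call. This is a simplified illustration: incremental_cleaner, garbage_version and clear_some() are hypothetical names, and the real mutation_cleaner is driven by memory reclaimers, which are not modelled here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a discarded partition version with many rows.
struct garbage_version {
    std::vector<std::string> rows;
};

// Collects discarded versions and frees them a bounded amount at a time,
// so that a single large partition cannot cause one long stall.
class incremental_cleaner {
    std::deque<std::unique_ptr<garbage_version>> _garbage;
public:
    void add(std::unique_ptr<garbage_version> v) {
        _garbage.push_back(std::move(v));
    }
    bool empty() const { return _garbage.empty(); }

    // Destroys up to max_rows rows and returns how many were destroyed.
    // The caller is expected to call again later until empty() is true.
    std::size_t clear_some(std::size_t max_rows) {
        std::size_t freed = 0;
        while (freed < max_rows && !_garbage.empty()) {
            auto& rows = _garbage.front()->rows;
            const std::size_t n = std::min(max_rows - freed, rows.size());
            rows.resize(rows.size() - n);  // destroys the last n rows
            freed += n;
            if (rows.empty()) {
                _garbage.pop_front();      // version fully released
            }
        }
        return freed;
    }
};
```

Dropping a partition then becomes a cheap hand-off to the cleaner, while the expensive destruction is spread over many small steps.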

(3) and (5) are not solved yet.

(4) is solved by having partition merging process partitions with row
granularity and defer in the middle of partition. In order to preserve update
atomicity on partition level as perceived by reads, when update starts we
create a snapshot to the current version of partition and process memtable
entry by inserting data into a separate partition version. This way if upgrade
defers in the middle of partition reads can still go to the old version and
not see partial writes. Snapshots are marked with phase numbers, and reads
will use the previous phase until whole partition is upgraded. When partition
is finally merged, the snapshots go away and the new version will eventually
be merged to the old version. Due to (3) however, this merging may still add
latency to the upgrade path.

Remaining work:

  - Solving problem (3). I think the approach to take here would be to
    move the task of merging versions to the background, maybe into mutation_cleaner.

  - Merging range tombstones incrementally.

Performance
===========

Performance improvements were evaluated using tests/perf_row_cache_update -c1 -m1G,
which measures time it takes to update cache from memtable for various workloads
and schemas.

For large partition with lots of small rows we see a significant reduction of
scheduling latency from ~550ms to ~23ms. The cause of remaining latency is
problem (3) stated above. The run time is reduced by 70%.

For small partition case without clustering columns we see no degradation.

For small partition case with clustering key, but only 3 small rows per partition,
we see a 30% degradation in run time.

For large partition with lots of range tombstones we see degradation of 15% in
run time and scheduling latency.

Below you can see full statistics for cache update run time:

=== Small partitions, no overwrites:

Before:

  avg = 433.965155
  stdev = 35.958024
  min = 340.093201
  max = 468.564514

After:

  avg = 436.929447 (+1%)
  stdev = 37.130237
  min = 349.410339
  max = 489.953400

=== Small partition with a few rows:

Before:

  avg = 315.379316
  stdev = 30.059120
  min = 240.340561
  max = 342.408295

After:

  avg = 407.232691 (+30%)
  stdev = 53.918717
  min = 269.514648
  max = 444.846649

=== Large partition, lots of small rows:

Before:

  avg = 412.870689
  stdev = 227.411317
  min = 286.990631
  max = 1263.417847

After:

  avg = 124.351705 (-70%)
  stdev = 4.705762
  min = 110.063255
  max = 129.643387

=== Large partition, lots of range tombstones:

Before:

  avg = 601.172644
  stdev = 121.376866
  min = 223.502136
  max = 874.111572

After:

  avg = 695.627588 (+15%)
  stdev = 135.057004
  min = 337.173950
  max = 784.838745
"

* tag 'tgrabiec/clear-gently-all-partitions-v3' of github.com:tgrabiec/scylla:
  mvcc: Use small_vector<> in partition_snapshot_row_cursor
  utils: Extract small_vector.hh
  mvcc: Erase rows gradually in apply_to_incomplete()
  mvcc: partition_snapshot_row_cursor: Avoid row copying in consume() when possible
  cache: real_dirty_memory_accounter: Move unpinning out of the hot path
  mvcc: partition_snapshot_row_cursor: Reduce lookups in ensure_entry_if_complete()
  mutation_partition: Reduce row lookups in apply_monotonically()
  cache: Release dirty memory with row granularity
  cache: Defer during partition merging
  mvcc: partition_snapshot_row_cursor: Introduce consume_row()
  mvcc: partition_snapshot_row_cursor: Introduce maybe_refresh_static()
  mvcc: Make apply_to_incomplete() work with attached versions
  cache: Propagate phase to apply_to_incomplete()
  cache: Prepare for incremental apply_to_incomplete()
  Introduce a coroutine wrapper
  tests: mvcc: Encapsulate memory management details
  tests: cache: Take into account that update() may defer
  cache: real_dirty_memory_accounter: Allow construction without memtable
  cache: Extract real_dirty_memory_accounter
  mvcc: Destroy memtable partition versions gently
  memtable: Destroy partitions incrementally from clear_gently()
  mvcc: Remove rows from tracker gently
  cache: Destroy partition versions incrementally
  Introduce mutation_cleaner
  mvcc: Introduce partition_version_list
  mvcc: Fix move constructor of partition_version_ref() not preserving _unique_owner
  database: Add API for incremental clearing of partition entries
  cache: Define trivial methods inline
  tests: Improve perf_row_cache_update
  mutation_reader: Make empty mutation source advertize no partitions
2018-05-30 14:12:29 +01:00
Tomasz Grabiec
4561e97efe mvcc: Use small_vector<> in partition_snapshot_row_cursor
I measured 8% improvement in cache update throughput for small
partitions.
2018-05-30 14:41:41 +02:00
Tomasz Grabiec
db36ff0643 utils: Extract small_vector.hh 2018-05-30 14:41:41 +02:00
Tomasz Grabiec
5b59df3761 mvcc: Erase rows gradually in apply_to_incomplete()
So that we avoid double-buffering partitions.
2018-05-30 14:41:41 +02:00