604 Commits

Author SHA1 Message Date
Tomasz Grabiec
dbc1894bd5 lsa: Avoid unnecessary compact_and_evict_locked()
When the reclaim request was satisfied from the pool there's no need
to call compact_and_evict_locked(). This allows us to avoid calling
boost::range::make_heap(), which is a tiny performance difference, as
well as some confusing log messages.

Message-Id: <1548091941-8534-1-git-send-email-tgrabiec@scylladb.com>
2019-01-21 20:19:20 +02:00
Paweł Dziepak
e212d37a8a utils/small_vector: fix leak in copy assignment slow path
Fixes #4105.

Message-Id: <20190118153936.5039-1-pdziepak@scylladb.com>
2019-01-18 17:49:46 +02:00
Tomasz Grabiec
6461e085fe managed_bytes: Fix compilation on gcc 8.2
The compilation fails on -Warray-bounds, even though the branch is never taken:

    inlined from ‘managed_bytes::managed_bytes(bytes_view)’ at ./utils/managed_bytes.hh:195:22,
    inlined from ‘managed_bytes::managed_bytes(const bytes&)’ at ./utils/managed_bytes.hh:162:77,
    inlined from ‘dht::token dht::bytes_to_token(bytes)’ at dht/random_partitioner.cc:68:57,
    inlined from ‘dht::token dht::random_partitioner::get_token(bytes)’ at dht/random_partitioner.cc:85:39:
/usr/include/c++/8/bits/stl_algobase.h:368:23: error: ‘void* __builtin_memmove(void*, const void*, long unsigned int)’ offset 16 from the object at ‘<anonymous>’ is out of the bounds of referenced subobject ‘managed_bytes::small_blob::data’ with type ‘signed char [15]’ at offset 0 [-Werror=array-bounds]
      __builtin_memmove(__result, __first, sizeof(_Tp) * _Num);
      ~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Work around by disabling the diagnostic locally.
Message-Id: <1547205350-30225-1-git-send-email-tgrabiec@scylladb.com>
2019-01-18 13:48:05 +00:00
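For the gcc 8.2 fix above (6461e085fe), a local suppression of this kind is usually written with GCC's diagnostic pragmas. The following is a minimal sketch, not the actual patch: `small_blob_sketch` and `copy_small` are hypothetical names, and the layout is only loosely modelled on the `signed char [15]` buffer mentioned in the error message.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for a small inline buffer; not the real
// managed_bytes layout.
struct small_blob_sketch {
    signed char data[15];
    signed char size;
};

// Copies n bytes into the inline buffer. The caller guarantees n <= 15, but
// after inlining the compiler may still warn about a path it cannot prove
// dead, so the diagnostic is disabled for this one statement only.
inline void copy_small(small_blob_sketch& dst, const signed char* src, std::size_t n) {
    assert(n <= sizeof(dst.data));
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Warray-bounds"
    std::memcpy(dst.data, src, n);
#pragma GCC diagnostic pop
    dst.size = static_cast<signed char>(n);
}
```

The push/pop pair restores the previous diagnostic state, so the warning stays active for the rest of the translation unit.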
Duarte Nunes
fa2b0384d2 Replace std::experimental types with C++17 std version.
Replace stdx::optional and stdx::string_view with the C++ std
counterparts.

Some instances of boost::variant were also replaced with std::variant,
namely those that called seastar::visit.

Scylla now requires GCC 8 to compile.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20190108111141.5369-1-duarte@scylladb.com>
2019-01-08 13:16:36 +02:00
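For the std::experimental replacement above (fa2b0384d2), the C++17 standard types can be used as in the following sketch. The helper names `value_part`, `cell` and `describe` are hypothetical and do not come from the Scylla tree; the sketch only shows std::optional, std::string_view and std::variant with std::visit.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <string_view>
#include <variant>

// Hypothetical helper: returns the text after the first ':' or nothing.
inline std::optional<std::string_view> value_part(std::string_view s) {
    auto pos = s.find(':');
    if (pos == std::string_view::npos) {
        return std::nullopt;
    }
    return s.substr(pos + 1);
}

// Hypothetical variant type with two alternatives.
using cell = std::variant<int, std::string>;

// std::visit dispatches to the overload matching the active alternative.
inline std::string describe(const cell& c) {
    struct visitor {
        std::string operator()(int v) const { return "int:" + std::to_string(v); }
        std::string operator()(const std::string& v) const { return "str:" + v; }
    };
    return std::visit(visitor{}, c);
}
```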
Rafael Ávila de Espíndola
67039e942b Remove the only use of with_alignment from scylla
In c++17 there are standard ways of requesting aligned memory, so
seastar doesn't need to provide one.

This patch is in preparation for removing with_alignment from seastar.

Tests: unit (debug)

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20190107191019.22295-1-espindola@scylladb.com>
2019-01-07 21:34:47 +02:00
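For the with_alignment removal above (67039e942b), one of the standard C++17 ways of requesting aligned memory is an over-aligned type combined with a plain new-expression. A minimal sketch, assuming a hypothetical `padded_counter` type and an assumed 64-byte alignment; it is not the code the patch touched.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

// Hypothetical over-aligned type; 64 is an assumed cache-line size.
struct alignas(64) padded_counter {
    std::uint64_t value = 0;
};

// Returns true if p is a multiple of alignment.
inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Since C++17, a new-expression for an over-aligned type calls the
// std::align_val_t overload of operator new, so no helper is required.
inline std::unique_ptr<padded_counter> make_counter() {
    return std::make_unique<padded_counter>();
}
```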
Duarte Nunes
3235c13125 utils/fragmented_temporary_buffer: Correctly implement remove_suffix()
The current implementation breaks the invariant that

_size_bytes = reduce(_fragments, &temporary_buffer::size)

In particular, this breaks algorithms that check the individual
segment size.

Correctly implement remove_suffix() by destroying superfluous
temporary_buffer's and by trimming the last one, if needed.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20190103133523.34937-1-duarte@scylladb.com>
2019-01-03 13:37:01 +00:00
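The invariant described in the remove_suffix() fix above (3235c13125) can be illustrated with a simplified model. This is a sketch under stated assumptions: `fragmented_buffer_sketch` is a hypothetical type that stores fragments as std::string instead of seastar temporary_buffer, and it is not the actual implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical simplified model of a fragmented buffer.
struct fragmented_buffer_sketch {
    std::vector<std::string> fragments;
    std::size_t size_bytes = 0;

    void append(std::string frag) {
        size_bytes += frag.size();
        fragments.push_back(std::move(frag));
    }

    // Drops n bytes from the end: superfluous trailing fragments are
    // destroyed and the last surviving fragment is trimmed if needed.
    void remove_suffix(std::size_t n) {
        assert(n <= size_bytes);
        size_bytes -= n;
        while (n > 0 && n >= fragments.back().size()) {
            n -= fragments.back().size();
            fragments.pop_back();
        }
        if (n > 0) {
            fragments.back().resize(fragments.back().size() - n);
        }
    }

    // The invariant from the commit message: the cached size equals the
    // sum of the individual fragment sizes.
    bool invariant_holds() const {
        std::size_t sum = 0;
        for (const auto& f : fragments) {
            sum += f.size();
        }
        return sum == size_bytes;
    }
};
```

Only adjusting `size_bytes` without touching the fragments would leave `invariant_holds()` false, which is the kind of breakage the commit message describes.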
Duarte Nunes
1a88cd7992 utils/fragmented_temporary_buffer: Add remove_suffix
Essentially hide some bytes off the end of the buffer. Needed for
subsequent commit log changes.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-12-31 13:20:37 +00:00
Duarte Nunes
8eab0a3e01 utils/fragmented_temporary_buffer: Allow skipping in the input stream
Add fragmented_temporary_buffer::istream::skip(), needed for
subsequent commit log changes.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
2018-12-31 13:20:37 +00:00
Tomasz Grabiec
7747f2dde3 Merge "nodetool toppartitions" from Rafi & Avi
Implementation of the nodetool toppartitions query, which samples the most frequent PKs in read/write
operations over a period of time.

Content:
- data_listener classes: mechanism that interfaces with mutation readers in database and table classes,
- toppartition_query and toppartition_data_listener classes to implement toppartition-specific query (this
  interfaces with data_listeners and the REST api),
- REST api for toppartitions query.

Uses Top-k structure for handling stream summary statistics (based on implementation in C*, see #2811).

What's still missing:
- JMX interface to nodetool (interface customization may be required),
- Querying #rows and #bytes (currently, only #partitions is supported).

Fixes #2811

* https://github.com/avikivity/scylla rafie_toppartitions_v7.1:
  top_k: whitespace and minor fixes
  top_k: map template arguments
  top_k: std::list -> chunked_vector
  top_k: support for appending top_k results
  nodetool toppartitions: refactor table::config constructor
  nodetool toppartitions: data listeners
  nodetool toppartitions: add data_listeners to database/table
  nodetool toppartitions: fully_qualified_cf_name
  nodetool toppartitions: Toppartitions query implementation
  nodetool toppartitions: Toppartitions query REST API
  nodetool toppartitions: nodetool-toppartitions script
2018-12-28 16:31:24 +01:00
Rafi Einstein
eda43b93c9 top_k: support for appending top_k results
Allow appending results of one top_k into another.

Signed-off-by: Rafi Einstein <rafie@scylladb.com>
2018-12-28 16:45:56 +02:00
Rafi Einstein
aeebe8e86b top_k: std::list -> chunked_vector
Replaced std::list with chunked_vector. Because chunked_vector requires
a noexcept move constructor from its value type, change the bad_boy type
in the unit test not to throw in the move constructor.

Signed-off-by: Rafi Einstein <rafie@scylladb.com>
2018-12-28 16:45:07 +02:00
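The noexcept move constructor requirement mentioned in the chunked_vector change above (aeebe8e86b) can be expressed as a compile-time check. A minimal sketch: `tracked_value` and `strict_vector_sketch` are hypothetical names, and the container is a thin wrapper over std::vector rather than chunked_vector.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical value type whose move constructor is declared noexcept.
struct tracked_value {
    int v = 0;
    tracked_value() = default;
    explicit tracked_value(int x) : v(x) {}
    tracked_value(const tracked_value&) = default;
    tracked_value(tracked_value&& o) noexcept : v(std::exchange(o.v, 0)) {}
    tracked_value& operator=(const tracked_value&) = default;
};

// Hypothetical container that rejects value types with a throwing move
// constructor at compile time.
template <typename T>
class strict_vector_sketch {
    static_assert(std::is_nothrow_move_constructible_v<T>,
                  "value type must have a noexcept move constructor");
    std::vector<T> _items;
public:
    void push_back(T x) { _items.push_back(std::move(x)); }
    std::size_t size() const { return _items.size(); }
    const T& operator[](std::size_t i) const { return _items[i]; }
};
```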
Yibo Cai (Arm Technology China)
422987ab04 utils: add fast ascii string validation
Validate an ascii string by ORing all bytes and checking that the 7-th bit is 0.
Compared with the original std::any_of(), which checks the string byte
by byte, this new approach validates input 8 bytes at a time in two
independent streams. Performance is much higher for normal cases,
though slightly lower when the string is very short. See table below.

Speed(MB/s) of ascii string validation
+---------------+-------------+---------+
| String length | std::any_of | u64 x 2 |
+---------------+-------------+---------+
| 9 bytes       | 1691        | 1635    |
+---------------+-------------+---------+
| 31 bytes      | 2923        | 3181    |
+---------------+-------------+---------+
| 129 bytes     | 3377        | 15110   |
+---------------+-------------+---------+
| 1039 bytes    | 3357        | 31815   |
+---------------+-------------+---------+
| 16385 bytes   | 3448        | 47983   |
+---------------+-------------+---------+
| 1048576 bytes | 3394        | 31391   |
+---------------+-------------+---------+

Signed-off-by: Yibo Cai <yibo.cai@arm.com>
Message-Id: <1544669646-31881-1-git-send-email-yibo.cai@arm.com>
2018-12-24 09:58:08 +02:00
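The OR-based check described in the ascii validation commit above (422987ab04) can be sketched as follows. This is an illustration under stated assumptions, not the committed code: `is_ascii_sketch` is a hypothetical name, and the loop structure (16 bytes per iteration, byte-wise tail) is one possible reading of "8 bytes and two independent streams".

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <string_view>

// Returns true if every byte of s has its top bit (0x80) clear.
inline bool is_ascii_sketch(std::string_view s) {
    const char* p = s.data();
    std::size_t n = s.size();
    // Two accumulators with no data dependency between them.
    std::uint64_t acc0 = 0;
    std::uint64_t acc1 = 0;
    while (n >= 16) {
        std::uint64_t a;
        std::uint64_t b;
        std::memcpy(&a, p, 8);      // memcpy avoids unaligned-access issues
        std::memcpy(&b, p + 8, 8);
        acc0 |= a;
        acc1 |= b;
        p += 16;
        n -= 16;
    }
    // Remaining 0..15 bytes are ORed one at a time.
    std::uint64_t tail = 0;
    for (; n > 0; --n, ++p) {
        tail |= static_cast<unsigned char>(*p);
    }
    // A set top bit in any byte of any accumulator means non-ascii input.
    return ((acc0 | acc1 | tail) & 0x8080808080808080ULL) == 0;
}
```

Because the final mask tests the top bit of every byte, the result does not depend on the byte order of the machine.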
Rafi Einstein
533e46ac72 top_k: map template arguments
Added Hash and KeyEqual template arguments to enable unordered_map in top_k implementation.

Signed-off-by: Rafi Einstein <rafie@scylladb.com>
2018-12-20 16:41:40 +02:00
Rafi Einstein
75f21954d4 top_k: whitespace and minor fixes
Style and minor logic changes from code review.

Signed-off-by: Rafi Einstein <rafie@scylladb.com>
2018-12-20 16:41:33 +02:00
Calle Wilund
66472bc52d sequenced_set: Add "insert" method, following std::set semantics 2018-12-12 09:32:05 +00:00
Avi Kivity
475b151c97 Merge "Use utils::small_vector more in read path" from Paweł
"
This series optimises the read path by replacing some usages of
std::vector with utils::small_vector. The motivation for this change was
the observation that the profiler points to memory allocation functions
as the ones where we spend the most time and, while they have a large
number of callers, storage allocation for some vectors was close to
the top. The gains are not huge, since the problem is a lot of things
adding up and not a single slow thing, but we need to start with
something.

Unfortunately, the performance of boost::container::small_vector is
quite disappointing so a new implementation of a small_vector was
introduced.

perf_simple_query -c4 --duration 60, medians:

       ./perf_before  ./perf_after  diff
 read      343086.80     360720.53  5.1%

Tests: unit(release, small_vector in debug)
"

* tag 'small_vector/v2.1' of https://github.com/pdziepak/scylla:
  partition_slice: use small_vector for column_ids
  mutation_fragment_merger: use small_vector
  auth: use small_vector in resource
  auth: avoid list-initialisation of vectors
  idl: serialiser: add serialiser for utils::small_vector
  idl: serialiser: deduplicate vector serialisers
  utils: introduce small_vector
  intrusive_set_external_comparator: make iterator nothrow move constructible
  mutation_fragment_merger: value-initialise iterator
2018-12-10 13:50:59 +02:00
Yibo Cai (Arm Technology China)
6717816a8d utils/gz: optimize crc_combine for arm64
Signed-off-by: Yibo Cai <yibo.cai@arm.com>
Message-Id: <1544418903-26290-1-git-send-email-yibo.cai@arm.com>
2018-12-10 10:31:08 +02:00
Paweł Dziepak
23d19d21bd utils: introduce small_vector
small_vector is a variation of std::vector<> that reserves a configurable
amount of storage internally, without the need for memory allocation.
This can bring measurable gains if the expected number of elements is
small. The drawback is that moving such small_vector is more expensive
and invalidates iterators as well as references which disqualifies it in
some cases.
2018-12-06 14:21:04 +00:00
Yibo Cai (Arm Technology China)
6fadba56cc utils: optimize UTF-8 validation
UTF-8 strings are currently validated by boost::locale::conv::utf_to_utf, which
actually performs string conversion, more than is necessary.  As
observed on an Arm server, UTF-8 validation can become a bottleneck under
heavy load.

This patch introduces a brand new SIMD implementation supporting both
NEON and SSE, as well as a naive approach to handle short strings.
The naive approach is 3x faster than boost utf_to_utf, whilst SIMD
method outperforms naive approach 3x ~ 5x on Arm and x86. Details at
https://github.com/cyb70289/utf8/.

UTF-8 unit test is added to check various corner cases.

Signed-off-by: Yibo Cai <yibo.cai@arm.com>
Message-Id: <1543978498-12123-1-git-send-email-yibo.cai@arm.com>
2018-12-05 21:51:01 +02:00
Tomasz Grabiec
9a4c00beb7 utils/gz: Fix compilation on non-x86 archs
gen_crc_combine_table is now executed on every build, so it should not
fail on unsupported archs. The generated file will not contain data,
but this is fine since it should not be used.

Another problem is that u32 and u64 aliases were not visible in the #else
branch in crc_combine.cc
Message-Id: <1543864425-5650-1-git-send-email-tgrabiec@scylladb.com>
2018-12-04 18:17:27 +00:00
Tomasz Grabiec
1fb792c547 utils/gz: Add fast implementation of crc32_combine()
zlib's crc32_combine() is not very efficient. It is faster to re-checksum
the buffer using crc32(). That is still a substantial amount of work which
could be avoided.

This patch introduces a fast implementation of crc32_combine() which
uses a different algorithm than zlib. It also utilizes intrinsics for
carry-less multiplication instruction to perform the computation faster.
The details of the algorithm can be found in code comments.

Performance results using perf_checksum and second buffer of length 64 KiB:

zlib CRC32 combine:   38'851   ns
libdeflate CRC32:      4'797   ns
fast_crc32_combine():     11   ns

So the new implementation is 3500x faster than zlib's, and 417x faster than
re-checksumming the buffer using libdeflate.

Tested on i7-5960X CPU @ 3.00GHz

Performance was also evaluated using sstable writer benchmark:

  perf_fast_forward --populate --sstable-format=mc --data-directory /tmp/perf-mc \
     --value-size=10000 --rows 1000000 --datasets small-part

It yielded 9% improvement in median frag/s (129'055 vs 117'977).
2018-12-03 14:40:35 +01:00
Tomasz Grabiec
cd3d9d357b utils/gz: Add pre-computed polynomials
gen_crc_combine_table.cc will be run during build to produce tables
with precomputed polynomials (4 x 256 x u32). The definitions will
reside in:

  build/<mode>/gen/utils/gz/crc_combine_table.cc

It takes 20ms to generate on my machine.

The purpose of those polynomials will be explained in crc_combine.cc
2018-12-03 14:36:09 +01:00
Tomasz Grabiec
63e0da9e58 utils/gz: Import Barrett reduction implementation from libdeflate 2018-12-03 14:36:09 +01:00
Tomasz Grabiec
bb7d95d6c3 utils: Extract clmul() from crc.hh 2018-12-03 14:36:08 +01:00
Avi Kivity
c6d700279b class_registry: introduce a non-static variant of class_registry
class_registry's staticness has the usual problem of
static classes (loss of dependency information) and prevents us
from librarifying Scylla since all objects that define a registration
must be linked in.

Take a first step against this staticness by defining a nonstatic
variant. The static class_registry is then redefined in terms of the
nonstatic class. After all uses have been converted, the static
variant can be retired.
Message-Id: <20181126130935.12837-1-avi@scylladb.com>
2018-11-26 13:30:21 +00:00
Benny Halevy
dcd18e2b62 remove exec permission from top_k source files
This was introduced by 32525f2694

Cc: Rafi Einstein <rafie@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20181121163352.13325-1-bhalevy@scylladb.com>
2018-11-21 18:38:50 +02:00
Tomasz Grabiec
143fd6e1c2 utils: Introduce memory_data_sink 2018-11-21 14:04:27 +01:00
Avi Kivity
775b7e41f4 Update seastar submodule
* seastar d59fcef...b924495 (2):
  > build: Fix protobuf generation rules
  > Merge "Restructure files" from Jesse

Includes fixup patch from Jesse:

"
Update Seastar `#include`s to reflect restructure

All Seastar header files are now prefixed with "seastar" and the
configure script reflects the new locations of files.

Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <5d22d964a7735696fb6bb7606ed88f35dde31413.1542731639.git.jhaberku@scylladb.com>
"
2018-11-21 00:01:44 +02:00
Tomasz Grabiec
57e25fa0f8 utils: phased_barrier: Make advance_and_await() have strong exception guarantees
Currently, when advance_and_await() fails to allocate the new gate
object, it will throw bad_alloc and leave the phased_barrier object in
an invalid state. Calling advance_and_await() again on it will result
in undefined behavior (typically SIGSEGV) because _gate will be
disengaged.

One place affected by this is table::seal_active_memtable(), which
calls _flush_barrier.advance_and_await(). If this throws, subsequent
flush attempts will SIGSEGV.

This patch rearranges the code so that advance_and_await() has strong
exception guarantees.
Message-Id: <1542645562-20932-1-git-send-email-tgrabiec@scylladb.com>
2018-11-20 16:15:12 +00:00
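The rearrangement described in the phased_barrier commit above (57e25fa0f8) follows a common pattern: perform the allocation that may throw before modifying any member. The sketch below is a synchronous, hypothetical model (`gate_sketch`, `phased_barrier_sketch` and the `fail_next_alloc` test hook are invented for illustration); the real advance_and_await() is asynchronous and is not reproduced here.

```cpp
#include <cassert>
#include <memory>
#include <new>
#include <utility>

// Hypothetical stand-in for a gate; only records the phase it belongs to.
struct gate_sketch {
    unsigned phase;
};

// Hypothetical simplified barrier.
class phased_barrier_sketch {
    std::unique_ptr<gate_sketch> _gate = std::make_unique<gate_sketch>(gate_sketch{0});
    unsigned _phase = 0;
public:
    // Test hook that simulates an allocation failure on the next advance().
    bool fail_next_alloc = false;

    // Allocates the replacement gate first and only then updates members.
    // If the allocation throws, _gate and _phase are left unchanged.
    std::unique_ptr<gate_sketch> advance() {
        if (fail_next_alloc) {
            fail_next_alloc = false;
            throw std::bad_alloc();  // simulated failure of the allocation below
        }
        auto new_gate = std::make_unique<gate_sketch>(gate_sketch{_phase + 1});
        ++_phase;
        return std::exchange(_gate, std::move(new_gate));
    }

    bool engaged() const { return static_cast<bool>(_gate); }
    unsigned phase() const { return _phase; }
};
```

Moving the old gate out before allocating the new one would leave `_gate` empty after a failed allocation, which matches the disengaged state the commit message warns about.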
Avi Kivity
be99101f36 utils: convert sprint() to format()
sprint() recently became more strict, throwing on sprint("%s", 5). Replace
with the more modern format().

Mechanically converted with https://github.com/avikivity/unsprint.
2018-11-01 13:16:17 +00:00
Avi Kivity
3cf434b863 utils: estimated_histogram: convert generated format strings to fmt
Convert printf games to format games.

Note that fmt supports specifying the field width as an argument, but that
is left to a dedicated change.
2018-11-01 13:16:17 +00:00
Avi Kivity
7726ce23b7 utils: i_filter: rename "format" variable
The format variable hides the format function, which we'll soon want to use
here. Rename the format variable to unhide the function.
2018-11-01 13:16:17 +00:00
Yibo Cai (Arm Technology China)
79136e895f utils/crc: calculate crc in parallel
It achieves 2.0x speedup on intel E5 and 1.1x to 2.5x speedup on
various arm64 microarchitectures.

The algorithm cuts data into blocks of 1024 bytes and calculates crc
for each block, which is further divided into three subblocks of 336
bytes (42 uint64) each, and 16 remaining bytes (2 uint64).

For each iteration, three independent crcs are calculated, one uint64
from each subblock. This substantially increases IPC (instructions per
cycle). After the subblocks are done, the three crcs and the remaining
two uint64 are combined using carry-less multiplication to reach the
final result for one block of 1024 bytes.

Signed-off-by: Yibo Cai <yibo.cai@arm.com>
Message-Id: <1541042759-24767-1-git-send-email-yibo.cai@arm.com>
2018-11-01 10:19:32 +02:00
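The block layout in the parallel crc commit above (79136e895f) can be checked with a little compile-time arithmetic. This sketch covers only the size bookkeeping, with hypothetical names in a `crc_layout_sketch` namespace; the crc computation and the carry-less multiplication step are not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

namespace crc_layout_sketch {

constexpr std::size_t block_bytes = 1024;
constexpr std::size_t subblocks = 3;
constexpr std::size_t words_per_subblock = 42;
constexpr std::size_t subblock_bytes = words_per_subblock * sizeof(std::uint64_t);
constexpr std::size_t tail_bytes = block_bytes - subblocks * subblock_bytes;

// 42 * 8 = 336 bytes per subblock; 1024 - 3 * 336 = 16 bytes remain.
static_assert(subblock_bytes == 336);
static_assert(tail_bytes == 2 * sizeof(std::uint64_t));

// Byte offset of the i-th 64-bit word of subblock s within one block.
// In iteration i, the three streams read word_offset(0, i),
// word_offset(1, i) and word_offset(2, i).
constexpr std::size_t word_offset(std::size_t s, std::size_t i) {
    return s * subblock_bytes + i * sizeof(std::uint64_t);
}

} // namespace crc_layout_sketch
```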
Yibo Cai (Arm Technology China)
1c48e3fbec utils/crc: leverage arm64 crc extension
It achieves 6.7x to 11x speedup on various arm64 microarchitectures.

Signed-off-by: Yibo Cai <yibo.cai@arm.com>
Message-Id: <1540781879-15465-1-git-send-email-yibo.cai@arm.com>
2018-10-29 10:50:48 +02:00
Rafi Einstein
32525f2694 Space-Saving Top-k algorithm for handling stream summary statistics
Based on the following implementation ([2]) for the Space-Saving algorithm from [1].
[1] http://www.cse.ust.hk/~raywong/comp5331/References/EfficientComputationOfFrequentAndTop-kElementsInDataStreams.pdf
[2] https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/StreamSummary.java

The algorithm keeps a map between keys seen and their counts, keeping a bound on the number of tracked keys.
Replacement policy evicts the key with the lowest count while inheriting its count, and recording an estimation
of the error which results from that.
This error estimation can be later used to prove if the distribution we arrived at corresponds to the real top-K,
which we can display alongside the results.
Accuracy depends on the number of tracked keys.

Introduced as part of 'nodetool toppartition' query implementation.

Refs #2811
Message-Id: <20181027220937.58077-1-rafie@scylladb.com>
2018-10-28 10:10:28 +02:00
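The replacement policy described in the Space-Saving commit above (32525f2694) can be sketched in a few lines. This is an illustration only: `space_saving_sketch` is a hypothetical class that finds the eviction victim with a linear scan for brevity, and it is not the top_k implementation from the tree.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>

// Minimal sketch of the Space-Saving counting scheme.
class space_saving_sketch {
public:
    struct counter {
        std::uint64_t count = 0;
        std::uint64_t error = 0;  // upper bound on how much count overestimates
    };

    explicit space_saving_sketch(std::size_t capacity) : _capacity(capacity) {
        assert(capacity > 0);
    }

    void add(const std::string& key) {
        auto it = _counters.find(key);
        if (it != _counters.end()) {
            ++it->second.count;
            return;
        }
        if (_counters.size() < _capacity) {
            _counters.emplace(key, counter{1, 0});
            return;
        }
        // Evict the key with the lowest count. The newcomer inherits that
        // count and records it as its error estimate.
        auto victim = _counters.begin();
        for (auto i = _counters.begin(); i != _counters.end(); ++i) {
            if (i->second.count < victim->second.count) {
                victim = i;
            }
        }
        const std::uint64_t inherited = victim->second.count;
        _counters.erase(victim);
        _counters.emplace(key, counter{inherited + 1, inherited});
    }

    // Returns the tracked counter for key, or nullptr if key is not tracked.
    const counter* find(const std::string& key) const {
        auto it = _counters.find(key);
        return it == _counters.end() ? nullptr : &it->second;
    }

    std::size_t size() const { return _counters.size(); }

private:
    std::size_t _capacity;
    std::unordered_map<std::string, counter> _counters;
};
```

For a tracked key, the true frequency lies between `count - error` and `count`, which is what makes the error estimate usable for judging the reported top-k.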
Tomasz Grabiec
fe0a0bdf1e utils/loading_shared_values: Add missing stat update call in one of the cases
Message-Id: <1540469591-32738-1-git-send-email-tgrabiec@scylladb.com>
2018-10-25 15:15:05 +03:00
Avi Kivity
aaab8a3f46 utils: crc32: mark power crc32 assembly as not requiring an executable stack
The linker uses an opt-in system for non-executable stack: if all object files
opt into a non-executable stack, the binary will have a non-executable stack,
which is very desirable for security. The compiler cooperates by opting into
a non-executable stack whenever possible (always for our code).

However, we also have an assembly file (for fast power crc32 computations).
Since it doesn't opt into a non-executable stack, we get a binary with
executable stack, which Gentoo's build system rightly complains about.

Fix by adding the correct incantation to the file.

Fixes #3799.

Reported-by: Alexys Jacob <ultrabug@gmail.com>
Message-Id: <20181002151251.26383-1-avi@scylladb.com>
2018-10-02 18:48:23 +01:00
Paweł Dziepak
2bcaf4309e utils/reusable_buffer: do not warn about large allocations
Reusable buffers are meant to be used when protocol or third-party
library limitations force us to allocate large contiguous buffers. There
isn't much that can be done about this so there is little point in
warning about that.

Fixes #3788.
Message-Id: <20180928085141.6469-1-pdziepak@scylladb.com>
2018-09-30 11:12:23 +03:00
Paweł Dziepak
2e5b375309 utils: drop data_output 2018-09-18 17:22:59 +01:00
Paweł Dziepak
cbe2ef9e5c utils: fragmented_temporary_buffer::view: add remove_prefix() 2018-09-18 17:22:59 +01:00
Paweł Dziepak
e464ad4f5d utils: fragmented_temporary_buffer: add empty() and size_bytes() 2018-09-18 11:29:37 +01:00
Paweł Dziepak
f4bb219a8b utils: fragmented_temporary_buffer: add get_ostream() 2018-09-18 11:29:37 +01:00
Paweł Dziepak
252cf0c681 utils: crc: accept FragmentRange 2018-09-18 11:29:36 +01:00
Tomasz Grabiec
4fb3f7e8eb managed_vector: Make external_memory_usage() ignore reserved space
This ensures that row::external_memory_usage() is invariant to
insertion order of cells.

It should be so, so that accounting of a clustering_row, merged from
multiple MVCC versions by the partition_snapshot_flat_reader on behalf
of a memtable flush, doesn't give a greater result than what is used
by the memtable region. Overaccounting leads to assertion failure in
~flush_memory_accounter.

Fixes #3625 (hopefully).

Message-Id: <1535982513-19922-1-git-send-email-tgrabiec@scylladb.com>
2018-09-03 17:09:54 +03:00
Vlad Zolotarov
945d26e4ee loading_cache: make iterator work on top of lru_list iterators instead of loading_shared_values'
Reloading may hold value in the underlying loading_shared_values while
the corresponding cache values have already been deleted.

This may create weird situations like this:

<populate cache with 10 entries>
cache.remove(key1);
for (auto& e : cache) {
    std::cout << e << std::endl;
}

<all 10 entries are printed, including the one for "key1">

In order to avoid such situations we make loading_cache::iterator
a transform_iterator of lru_list::iterator instead of loading_shared_values::iterator,
because lru_list contains entries only for cached items.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-08-30 20:56:44 -04:00
Vlad Zolotarov
1e56c7dd58 loading_cache: make size() return the size of lru_list instead of loading_shared_values
The reloading flow may hold items in the underlying loading_shared_values
after they have been removed (e.g. via the remove(key) API), so loading_shared_values.size()
doesn't represent the correct value for the loading_cache. lru_list.size(), on the other hand, does.

Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
2018-08-30 15:55:30 -04:00
Duarte Nunes
f6aadd8077 Merge 'utils::loading_cache: improve reload() robustness' from Vlad
"This series introduces a few improvements related to a reload flow.

From now on the callback may assume that the "key" parameter value
is kept alive till the end of its execution in the reloading flow.

It may also safely evict as many items from the cache as needed."

Fixes #3606

* 'loading_cache_improve_reload-v1' of https://github.com/vladzcloudius/scylla:
  utils::loading_cache: hold a shared_value_ptr to the value when we reload
  utils::loading_cache::on_timer(): remove not needed capture of "this"
  utils::loading_cache::on_timer(): use chunked_vector for storing elements we want to reload
2018-08-28 10:52:20 +01:00
Tomasz Grabiec
1e50f85288 database: Make soft-pressure memtable flusher not consider already flushed memtables
The flusher picks the memtable list which contains the largest region
according to region_impl::evictable_occupancy().total_space(), which
follows region::occupancy().total_space(). But only the latest
memtable in the list can start flushing. It can happen that the
memtable corresponding to the largest region was already flushed to an
sstable (flush permit released), but not yet fsynced or moved to
cache, so it's still in the memtable list.

The latest memtable in the winning list may be small, or empty, in
which case the soft pressure flusher will not be able to make much
progress. There could be other memtable lists with non-empty
(flushable) latest memtables. This can lead to writes unnecessarily
blocking on dirty.

I observed this for the system memtable group, where it's easy for the
memtables to overshoot small soft pressure limits. The flusher kept
trying to flush empty memtables, while the previous non-empty memtable
was still in the group.

The CPU scheduler makes this worse, because it runs memtable_to_cache
in a separate scheduling group, so it further defers in time the
removal of the flushed memtable from the memtable list.

This patch fixes the problem by making regions corresponding to
memtables which started flushing report evictable_occupancy() as 0, so
that they're picked by the flusher last.

Fixes #3716.
Message-Id: <1535040132-11153-2-git-send-email-tgrabiec@scylladb.com>
2018-08-26 11:02:34 +03:00
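The selection rule from the soft-pressure flusher fix above (1e50f85288) can be modelled as follows. This is a simplified sketch with hypothetical names (`region_sketch`, `pick_for_flush`); the real flusher works on memtable lists and logalloc regions, which are not reproduced here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model of a memtable region.
struct region_sketch {
    std::string name;
    std::size_t total_space;
    bool flush_started;

    // A region whose memtable has started flushing reports 0, so the
    // flusher considers it last.
    std::size_t evictable_occupancy() const {
        return flush_started ? 0 : total_space;
    }
};

// Picks the region with the largest evictable occupancy, or nullptr if
// there are no regions.
inline const region_sketch* pick_for_flush(const std::vector<region_sketch>& regions) {
    auto it = std::max_element(regions.begin(), regions.end(),
        [](const region_sketch& a, const region_sketch& b) {
            return a.evictable_occupancy() < b.evictable_occupancy();
        });
    return it == regions.end() ? nullptr : &*it;
}
```

Without the `flush_started` check, the largest region would keep winning even though its memtable can no longer be flushed, which is the stall the commit message describes.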
Tomasz Grabiec
364418b5c5 logalloc: Make evictable_occupancy() indicate no free space
Doesn't fix any bug, but it's closer to the truth to say that all segments
are used rather than that none are used.

Message-Id: <1535040132-11153-1-git-send-email-tgrabiec@scylladb.com>
2018-08-26 11:02:32 +03:00
Avi Kivity
2c9b886b6d logalloc: reindent
No functional changes.
Message-Id: <20180731125116.32009-1-avi@scylladb.com>
2018-08-01 00:35:54 +01:00