755 Commits

Author SHA1 Message Date
Avi Kivity
9322c07c71 Merge "Use binary search in sstable promoted index" from Tomasz
"
The "promoted index" is what the sstable format calls the clustering key index within a given partition.
Large partitions with many rows have it. It's embedded in the partition index entry.

Currently, lookups in the promoted index are done by scanning the index linearly so the lookup
is O(N). For large partitions that's inefficient. It consumes both a lot of CPU and I/O.

We could do better and use binary search in the index. This patch series switches the mc-format
index reader to do that. Other formats use the old way.

The "mc" format promoted index has an extra structure at the end of the index called "offset map".
It's a vector of offsets of consecutive promoted index entries. This allows us to access random
entries in the index without reading the whole index.

The location of the offset entry for a given promoted index entry can be derived by knowing where
the offset vector ends in the index file, so the offset map also doesn't have to be read completely
into the memory.
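
To illustrate the idea, here is a minimal sketch of a binary search that reads only the offsets it needs. The layout is a deliberate simplification and an assumption, not the real "mc" encoding: entries are a 1-byte key length plus key bytes, offsets are 4-byte big-endian values, and the names read_offset, key_at, find_block and build_index are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical, simplified layout (not the real "mc" encoding):
//   [entry 0][entry 1]...[entry N-1][off 0][off 1]...[off N-1]
// Each entry is a 1-byte key length followed by the key bytes.
// Each offset is the 4-byte big-endian position of its entry.

// Reads the offset of entry i. Its location is derived from the end of
// the blob, so the rest of the offset map is never touched.
inline uint32_t read_offset(std::string_view blob, size_t n, size_t i) {
    assert(i < n && blob.size() >= n * 4);
    size_t pos = blob.size() - (n - i) * 4;
    uint32_t v = 0;
    for (size_t b = 0; b < 4; ++b) {
        v = (v << 8) | static_cast<unsigned char>(blob[pos + b]);
    }
    return v;
}

// Returns a view of the key stored in entry i.
inline std::string_view key_at(std::string_view blob, size_t n, size_t i) {
    uint32_t off = read_offset(blob, n, i);
    size_t len = static_cast<unsigned char>(blob[off]);
    return blob.substr(off + 1, len);
}

// Returns the index of the last entry whose key is <= target, or n if
// every key is greater. Touches O(log n) entries.
inline size_t find_block(std::string_view blob, size_t n, std::string_view target) {
    size_t lo = 0, hi = n; // search window is [lo, hi)
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (key_at(blob, n, mid) <= target) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return lo == 0 ? n : lo - 1;
}

// Builds a blob in the layout above from sorted keys (helper for experiments).
inline std::string build_index(const std::vector<std::string>& keys) {
    std::string entries, offsets;
    for (const auto& k : keys) {
        uint32_t off = static_cast<uint32_t>(entries.size());
        for (int shift = 24; shift >= 0; shift -= 8) {
            offsets.push_back(static_cast<char>((off >> shift) & 0xff));
        }
        entries.push_back(static_cast<char>(k.size()));
        entries += k;
    }
    return entries + offsets;
}
```

In a real reader each key_at() call would translate into a read of a cached index page rather than an in-memory access, which is why the caching described next matters.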

The trickiest part is caching. We need to cache blocks read from the index file to amortize the
cost of binary search:

  - if the promoted index fits in the 32 KiB which was read from the index when looking for
    the partition entry, we don't want to issue any additional I/O to search the promoted index.

  - with large promoted indexes, the last few bisections will fall into the same I/O block and we
    want to reuse that block.

  - we don't want the cache to grow too big. In particular, we don't want to cache the whole promoted
    index as the read progresses over the index. Scanning reads may skip multiple times.

This series implements a rather simple approach which meets all the
above requirements and is not worse than the current state of affairs:

   - Each index cursor has its own cache of the index file area which corresponds to the promoted index.
     This is managed by the cached_file class.

   - Each index cursor has its own cache of parsed blocks. This allows the upper bound estimation to
     reuse information obtained during lower bound lookup. This estimation is used to limit
     read-aheads in the data file.

   - Each cursor drops entries that it walked past so that the memory footprint stays O(log N).

   - Cached buffers are accounted to the read's reader_permit.

Later, we could have a single cache shared by many readers. For that, we need to come up with an eviction
policy.

Fixes #4007.

TESTING RESULTS

 * Point reads, large promoted index:

  Config: rows: 10000000, value size: 2000
  Partition size: 20 GB
  Index size: 7 MB

  Notes:

    - Slicing read into the middle of partition (offset=5000000, read=1) is a clear win for the binary search:

      time: 1.9ms vs 22.9ms
      CPU utilization: 8.9% vs 92.3%
      I/O: 21 reqs / 172 KiB vs 29 reqs / 3'520 KiB

      It's 12x faster, CPU utilization is 10x smaller, disk utilization is 20x smaller.

    - Slicing at the front (offset=0) is a mixed bag.

      time is similar: 1.8ms
      CPU utilization is 6.7x smaller for bsearch: 8.5% vs 57.7%
      disk bandwidth utilization is smaller for bsearch but uses more IOs: 4 reqs / 320 KiB (scan) vs 17 reqs / 188 KiB (bsearch)

      bsearch uses less bandwidth because the series reduces buffer size used for index file I/O.

      scan is issuing:

         2 * 128 KB (index page)
         2 * 32 KB (data file)

      bsearch is issuing:

         1 * 64 KB (index page)
         15 * 4 KB (promoted index)
         1 * 64 KB (data file)

      The 1 * 64 KB is chosen dynamically by seastar. Sometimes it chooses 2 * 32 KB (with read-ahead).
      32 KB is the minimum I/O currently.

      Disk utilization could be further improved by changing the way seastar's dynamic I/O adjustments work
      so that it uses 1 * 4 KB when it suffices. This is left for the follow-up.

  Command:

        perf_fast_forward --datasets=large-part-ds1 \
         --run-tests=large-partition-slicing-clustering-keys -c1 --test-case-duration=1

  Before:

    offset  read      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
    0       1         0.001836          172         1        545          9        563        175        4.0      4        320       2       2        0        1        1        0        0        0  57.7%      0
    0       32        0.001858          502        32      17220        126      17776      11526        3.2      3        324       2       1        0        1        1        0        0        0  56.4%      0
    0       256       0.002833          339       256      90374        427      91757      85931        7.0      7        776       3       1        0        1        1        0        0        0  41.1%      0
    0       4096      0.017211           58      4096     237984       2011     241802     233870       66.1     66       8376      59       2        0        1        1        0        0        0  21.4%      0
    5000000 1         0.022952           42         1         44          1         45         41       29.2     29       3520      22       2        0        1        1        0        0        0  92.3%      0
    5000000 32        0.023052           43        32       1388         14       1414       1331       31.1     32       3588      26       2        0        1        1        0        0        0  91.7%      0
    5000000 256       0.024795           41       256      10325        129      10721       9993       43.1     39       4544      29       2        0        1        1        0        0        0  86.4%      0
    5000000 4096      0.038856           27      4096     105414        398     106918     103162       95.2     95      12160      78       5        0        1        1        0        0        0  61.4%      0

  After (v2):

    offset  read      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
    0       1         0.001831          248         1        546         21        581        252       17.6     17        188       2       0        0        1        1        0        0        0   8.5%      0
    0       32        0.001910          535        32      16751        626      17770      13896       17.9     19        160       3       0        0        1        1        0        0        0   8.8%      0
    0       256       0.003545          266       256      72207       2333      89076      62852       26.9     24        764       7       0        0        1        1        0        0        0   9.7%      0
    0       4096      0.016800           56      4096     243812        524     245430     239736       83.6     83       8700      64       0        0        1        1        0        0        0  16.6%      0
    5000000 1         0.001968          351         1        508         19        538        380       21.3     21        172       2       0        0        1        1        0        0        0   8.9%      0
    5000000 32        0.002273          431        32      14077        436      15503      11551       22.7     22        268       3       0        0        1        1        0        0        0   8.9%      0
    5000000 256       0.003889          257       256      65824       2197      81833      57813       34.0     37        652      18       0        0        1        1        0        0        0  11.2%      0
    5000000 4096      0.017115           54      4096     239324        834     241310     231993       88.3     88       8844      65       0        0        1        1        0        0        0  16.8%      0

  After (v1):

    offset  read      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
    0       1         0.001886          259         1        530          4        545        261       18.0     18        376       2       2        0        1        1        0        0        0   9.1%      0
    0       32        0.001954          513        32      16381         93      16844      15618       19.0     19        408       3       2        0        1        1        0        0        0   9.3%      0
    0       256       0.003266          318       256      78393       1820      81567      61663       30.8     26       1272       7       2        0        1        1        0        0        0  10.4%      0
    0       4096      0.017991           57      4096     227666        855     231915     225781       83.1     83       8888      55       5        0        1        1        0        0        0  15.5%      0
    5000000 1         0.002353          232         1        425          2        432        232       23.0     23        396       2       2        0        1        1        0        0        0   8.7%      0
    5000000 32        0.002573          384        32      12437         47      12571        429       25.0     25        460       4       2        0        1        1        0        0        0   8.5%      0
    5000000 256       0.003994          259       256      64101       2904      67924      51427       37.0     35       1484      11       2        0        1        1        0        0        0  10.6%      0
    5000000 4096      0.018567           56      4096     220609        448     227395     219029       89.8     89       9036      59       5        0        1        1        0        0        0  15.1%      0

 * Point reads, small promoted index (two blocks):

  Config: rows: 400, value size: 200
  Partition size: 84 KiB
  Index size: 65 B

  Notes:
     - No significant difference in time
     - the same disk utilization
     - similar CPU utilization

  Command:

      perf_fast_forward --datasets=large-part-ds1 \
         --run-tests=large-partition-slicing-clustering-keys -c1 --test-case-duration=1

  Before:

    offset  read      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
    0       1         0.000279          470         1       3587         31       3829        478        3.0      3         68       2       1        0        1        1        0        0        0  21.1%      0
    0       32        0.000276         3498        32     116038        811     122756     104033        3.0      3         68       2       1        0        1        1        0        0        0  24.0%      0
    0       256       0.000412         2554       256     621044       1778     732150     559221        2.0      2         72       2       0        0        1        1        0        0        0  32.6%      0
    0       4096      0.000510         1901       400     783883       4078     819058     665616        2.0      2         88       2       0        0        1        1        0        0        0  36.4%      0
    200     1         0.000339         2712         1       2951          8       3001       2569        2.0      2         72       2       0        0        1        1        0        0        0  17.8%      0
    200     32        0.000352         2586        32      91019        266      92427      83411        2.0      2         72       2       0        0        1        1        0        0        0  20.8%      0
    200     256       0.000458         2073       200     436503       1618     453945     385501        2.0      2         88       2       0        0        1        1        0        0        0  29.4%      0
    200     4096      0.000458         2097       200     436475       1676     458349     381558        2.0      2         88       2       0        0        1        1        0        0        0  29.0%      0

  After (v1):

    Testing slicing of large partition using clustering keys:
    offset  read      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
    0       1         0.000278          492         1       3598         30       3831        500        3.0      3         68       2       1        0        1        1        0        0        0  19.4%      0
    0       32        0.000275         3433        32     116153        753     122915      92559        3.0      3         68       2       1        0        1        1        0        0        0  22.5%      0
    0       256       0.000458         2576       256     559437       2978     728075     504375        2.1      2         88       2       0        0        1        1        0        0        0  29.0%      0
    0       4096      0.000506         1888       400     790064       3306     822360     623109        2.0      2         88       2       0        0        1        1        0        0        0  36.6%      0
    200     1         0.000382         2493         1       2619         10       2675       2268        2.0      2         88       2       0        0        1        1        0        0        0  16.3%      0
    200     32        0.000398         2393        32      80422        333      84759      22281        2.0      2         88       2       0        0        1        1        0        0        0  19.0%      0
    200     256       0.000459         2096       200     435943       1608     453989     380749        2.0      2         88       2       0        0        1        1        0        0        0  30.5%      0
    200     4096      0.000458         2097       200     436410       1651     455779     382485        2.0      2         88       2       0        0        1        1        0        0        0  29.2%      0

 * Scan with skips, large index:

  Config: rows: 10000000, value size: 2000
  Partition size: 20 GB
  Index size: 7 MB

  Notes:

    - Similar time, slightly worse for binary search: 36.1 s (scan) vs 36.4 (bsearch)

    - Slightly more I/O for bsearch: 153'932 reqs / 19'703'260 KiB (scan) vs 155'651 reqs / 19'704'088 KiB (bsearch)

      Binary search reads 828 KiB more and issues 1719 more IOs.
      It does more I/O to read the promoted index offset map.

    - similar (low) memory footprint. The danger here is that by caching index blocks which we touch as we scan
      we would end up caching the whole index. But this is protected against by eviction as demonstrated by the
      last "mem" column.

  Command:

    perf_fast_forward --datasets=large-part-ds1 \
       --run-tests=large-partition-skips -c1 --test-case-duration=1

  Before:

      read    skip      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
      1       1        36.103451            4   5000000     138491         38     138601     138453   153932.0 153932   19703260  153561       1        0        1        1        0        0        0  31.5% 502690

  After (v2):

    read    skip      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
    1       1        37.000145            4   5000000     135135          6     135146     135128   155651.0 155651   19704088  138968       0        0        1        1        0        0        0  34.2%      0

  After (v1):

    read    skip      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
    1       1        36.965520            4   5000000     135261         30     135311     135231   155628.0 155628   19704216  139133       1        0        1        1        0        0        0  33.9% 248738

Also in:

  git@github.com:tgrabiec/scylla.git sstable-use-index-offset-map-v2

Tests:

  - unit (all modes)
  - manual using perf_fast_forward
"

* tag 'sstable-use-index-offset-map-v2' of github.com:tgrabiec/scylla:
  sstables: Add promoted index cache metrics
  position_in_partition: Introduce external_memory_usage()
  cached_file, sstables: Add tracing to index binary search and page cache
  sstables: Dynamically adjust I/O size for index reads
  sstables, tests: Allow disabling binary search in promoted index from perf tests
  sstables: mc: Use binary search over the promoted index
  utils: Introduce cached_file
  sstables: clustered_index: Relax scope of validity of entry_info
  sstables: index_entry: Introduce owning promoted_index_block_position
  compound_compat: Allow constructing composite from a view
  sstables: index_entry: Rename promoted_index_block_position to promoted_index_block_position_view
  sstables: mc: Extract parser for promoted index block
  sstables: mc: Extract parser for clustering out of the promoted index block parser
  sstables: consumer: Extract primitive_consumer
  sstables: Abstract the clustering index cursor behavior
  sstables: index_reader: Rearrange to reduce branching and optionals
2020-06-18 12:09:39 +03:00
Tomasz Grabiec
58532cdf11 cached_file, sstables: Add tracing to index binary search and page cache 2020-06-16 16:15:24 +02:00
Tomasz Grabiec
c95dd67d11 utils: Introduce cached_file
It is a read-through cache of a file.

Will be used to cache contents of the promoted index area from the
index file.

Currently, cached pages are evicted manually using the invalidate_*()
method family, or when the object is destroyed.

The cached_file represents a subset of the file. The reason for this
is to satisfy two requirements. One is that we have a page-aligned
caching, where pages are aligned relative to the start of the
underlying file. This matches requirements of the seastar I/O engine
on I/O requests.  Another requirement is to have an effective way to
populate the cache using an unaligned buffer which starts in the
middle of the file when we know that we won't need to access bytes
located before the buffer's position. See populate_front(). If we
couldn't assume that, we wouldn't be able to insert an unaligned
buffer into the cache.
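
A minimal sketch of the read-through, page-aligned behaviour described above. This is not the actual cached_file code: the class page_cache_sketch, its method names and the 4096-byte page size are assumptions, a std::string stands in for the file, and populate_front() is not modelled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical stand-in: the "file" is just a string kept alive by the caller.
class page_cache_sketch {
    static constexpr size_t page_size = 4096;
    const std::string& _file;
    std::map<size_t, std::string> _pages; // page index -> page contents
    size_t _loads = 0;                    // number of simulated I/Os

    const std::string& get_page(size_t idx) {
        auto it = _pages.find(idx);
        if (it == _pages.end()) {
            // Miss: read the whole page, aligned to the start of the file.
            ++_loads;
            it = _pages.emplace(idx, _file.substr(idx * page_size, page_size)).first;
        }
        return it->second;
    }
public:
    explicit page_cache_sketch(const std::string& file) : _file(file) {}

    // Read-through: serves [pos, pos + len) from cached pages, loading on miss.
    std::string read(size_t pos, size_t len) {
        assert(pos <= _file.size());
        len = std::min(len, _file.size() - pos);
        std::string out;
        while (len > 0) {
            const std::string& page = get_page(pos / page_size);
            size_t in_page = pos % page_size;
            size_t n = std::min(len, page.size() - in_page);
            out.append(page, in_page, n);
            pos += n;
            len -= n;
        }
        return out;
    }

    // Manual eviction: drops pages that lie entirely before pos.
    void invalidate_front(size_t pos) {
        _pages.erase(_pages.begin(), _pages.lower_bound(pos / page_size));
    }

    size_t loads() const { return _loads; }
    size_t cached_pages() const { return _pages.size(); }
};
```

A cursor that only moves forward can call invalidate_front() with its current position, which keeps the number of cached pages bounded while later reads of the same page cost no extra I/O.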
2020-06-16 16:15:23 +02:00
Tomasz Grabiec
1c5db178dd Merge "logalloc: Get rid of segments migration" from Pavel
But not compaction.

When reclaiming segments to seastar, non-empty segments are copied
as-is to some other place. Instead of doing this, the reclaimer can copy
only allocated objects and leave the freed holes behind, i.e. do
the regular compaction. This would be the same or better from the
timing perspective, and will help to avoid yet another compaction
pass over the same set of objects in the future.

The current migration code checks that the free segments reserve is
above the minimum before proceeding with migration, and so does the code
after this patch. Segment compaction is thus called with a non-empty
free segments set, so the new segment allocation (if it is required at
all) is guaranteed not to fail.

Plus some bikeshedding patches for the run-up.

tests: unit(dev)

* https://github.com/xemul/scylla/tree/br-logalloc-compact-on-reclaim-2:
  logalloc: Compact segments on reclaim instead of migration
  logallog: Introduce RAII allocation lock
  logalloc: Shuffle code around region::impl::compact
  logalloc: Do not lock reclaimer twice
  logalloc: Do not calculate object size twice
  logalloc: Do not convert obj_desc to migrator back and forth
2020-06-15 16:28:16 +02:00
Avi Kivity
d17b05e911 Merge 'Adding Optimized pseudo floating point estimated histogram' from Amnon
"
This series adds a pseudo-floating-point histogram implementation.
The histogram is used for time_estimated_histogram, a histogram for latency tracking, which is then used in storage_proxy as a more efficient histogram with a higher resolution.

A follow-up series will use the new histogram in other places in the system and will add an implementation that supports lower values.
Fixes #5815
Fixes #4746
"

* amnonh-quicker_estimated_histogram:
  storage_proxy: use time_estimated_histogram for latencies
  test/boost/estimated_histogram_test
  utils/histogram_metrics_helper Adding histogram converter
  utils/estimated_histogram: Adding approx_exponential_histogram
2020-06-15 10:19:36 +03:00
Amnon Heiman
f30f926703 utils/histogram_metrics_helper Adding histogram converter
This patch adds a helper converter function to convert from an
approx_exponential_histogram to a seastar::metrics::histogram.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2020-06-15 08:22:49 +03:00
Amnon Heiman
3319756f36 utils/estimated_histogram: Adding approx_exponential_histogram
This patch adds an efficient histogram implementation.
The implementation chooses efficiency over flexibility.
That is why templates are used.

How the approx_exponential_histogram pseudo-floating-point histogram
works: it splits the range [MIN, MAX] into log2(MAX/MIN) ranges, then
splits each of those ranges linearly according to a given resolution.

For example, using a resolution of 4 would be similar to using an
exponentially growing histogram with a coefficient of 1.2.

All values are uint64. To prevent handling of corner cases, it is not
allowed to set the MIN to be lower than the resolution.
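
The bucket selection can be sketched as follows. This is an illustration under stated assumptions, not the code from the patch: Min, Max and Precision are taken to be powers of two, values below Min land in the first bucket, and the names approx_histogram_sketch, log2floor and bucket_index are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Position of the highest set bit, i.e. floor(log2(v)) for v > 0.
constexpr unsigned log2floor(uint64_t v) {
    unsigned r = 0;
    while (v >>= 1) {
        ++r;
    }
    return r;
}

// Hypothetical sketch: Min, Max and Precision are powers of two.
template <uint64_t Min, uint64_t Max, uint64_t Precision>
struct approx_histogram_sketch {
    static_assert(Min >= Precision, "Min must not be lower than the resolution");
    static constexpr unsigned precision_bits = log2floor(Precision);
    static constexpr unsigned base_shift = log2floor(Min);
    // log2(Max/Min) power-of-two ranges, each split into Precision linear
    // buckets, plus one overflow bucket for values >= Max.
    static constexpr size_t num_buckets =
        (log2floor(Max) - log2floor(Min)) * Precision + 1;

    uint64_t buckets[num_buckets] = {};

    static constexpr size_t bucket_index(uint64_t v) {
        if (v < Min) {
            return 0;
        }
        if (v >= Max) {
            return num_buckets - 1;
        }
        // "Exponent": which power-of-two range the value falls into.
        unsigned range = log2floor(v);
        // "Mantissa": the precision_bits bits just below the leading one.
        // Min >= Precision guarantees that the shift amount is not negative.
        uint64_t sub = (v >> (range - precision_bits)) & (Precision - 1);
        return (static_cast<size_t>(range - base_shift) << precision_bits) + sub;
    }

    void add(uint64_t v) { ++buckets[bucket_index(v)]; }
};
```

With Min = 128, Max = 1024 and a resolution of 4, the bucket lower bounds are 128, 160, 192, 224, 256, 320 and so on, so consecutive bounds grow by a factor between roughly 1.14 and 1.25.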

The approx_exponential_histogram will probably not be used directly;
its first user is time_estimated_histogram, a histogram for durations.

It should be compared to the estimated_histogram.

Performance comparison:
Comparison was done by inserting 2^20 values into
time_estimated_histogram and estimated_histogram.

In debug mode on a local machine, an insert operation took an average of
26.0 nanoseconds vs 342.2 nanoseconds.

In release mode, an insert operation took an average of 1.90 vs 8.28 nanoseconds.

Fixes #5815

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2020-06-15 08:22:43 +03:00
Rafael Ávila de Espíndola
336d541f58 database: Use a flat_hash_map for _ks_cf_to_uuid
Given that the key is a std::pair, we have to explicitly mark the hash
and eq types as transparent for heterogeneous lookup to work.

With that, pass std::string_view to a few functions that just check if
a value is in the map.
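
For illustration only, here is the same idea with a standard-library container. The patch itself uses a flat_hash_map with transparent hash and equality types; this sketch swaps in std::map with a transparent comparator, and the names ks_cf_less, ks_cf_map and has_table as well as the int value type are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <string_view>
#include <utility>

using ks_cf = std::pair<std::string, std::string>;
using ks_cf_view = std::pair<std::string_view, std::string_view>;

// The is_transparent tag opts the container into heterogeneous lookup:
// find() then accepts any key type the comparator can compare.
struct ks_cf_less {
    using is_transparent = void;
    template <typename A, typename B>
    bool operator()(const A& a, const B& b) const {
        // Compare as views so that no std::string is constructed.
        return ks_cf_view(a.first, a.second) < ks_cf_view(b.first, b.second);
    }
};

// The int value is a placeholder for the mapped type.
using ks_cf_map = std::map<ks_cf, int, ks_cf_less>;

// Membership check that takes views and performs no string allocation.
inline bool has_table(const ks_cf_map& m, std::string_view ks, std::string_view cf) {
    return m.find(ks_cf_view(ks, cf)) != m.end();
}
```

Without the is_transparent tag, find() would only accept the stored key type, forcing callers to build a pair of std::string just to test membership.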

This increases the .text section by 11 KiB (0.03%).

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-06-14 08:18:39 -07:00
Pavel Emelyanov
d908646b28 logalloc: Compact segments on reclaim instead of migration
When reclaiming segments to seastar, the code tries to free the segments
sequentially. For this it walks the segments from left to right and frees
them, but every time a non-empty segment is met it gets migrated to another
segment that's allocated from the right end of the list.

This is sometimes a waste of cycles. The destination segment inherits the
holes from the source one, and thus it will be compacted some time in the
future. Why not compact it right at the reclamation time? It will take the
same time or less, but will result in better compaction.

To achieve this, the segment to be reclaimed is compacted with the existing
compact_segment_locked() code with some special care around it.

1. The allocation of new segments from seastar is locked
2. The reclaiming of segments with evict-and-compact is locked as well
3. The emergency pool is opened (the compaction is called with non-empty
   reserve to avoid bad_alloc exception throw in the middle of compaction)
4. The segment is forcibly removed from the histogram and the closed_occupancy
   is updated just like it is with general compaction

The segments-migration auxiliary code can be removed after this.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-08 14:07:35 +03:00
Pavel Emelyanov
4db6ef7b6d logallog: Introduce RAII allocation lock
The lock prevents the segment_pool from requesting more segments from
the underlying allocator.

To be used in next patch.
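
A minimal sketch of such a RAII lock, assuming a much simplified pool. The class segment_pool_sketch and its members are hypothetical and only mirror the behaviour described above.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Hypothetical stand-in for the segment pool.
class segment_pool_sketch {
    bool _allocation_enabled = true;
    size_t _free_segments;
public:
    explicit segment_pool_sketch(size_t reserve) : _free_segments(reserve) {}

    // RAII guard: forbids refilling from the underlying allocator for its
    // lifetime and restores the previous state on scope exit, including
    // when the scope is left by an exception.
    class allocation_lock {
        segment_pool_sketch& _pool;
        bool _prev;
    public:
        explicit allocation_lock(segment_pool_sketch& pool)
            : _pool(pool), _prev(pool._allocation_enabled) {
            _pool._allocation_enabled = false;
        }
        ~allocation_lock() { _pool._allocation_enabled = _prev; }
        allocation_lock(const allocation_lock&) = delete;
        allocation_lock& operator=(const allocation_lock&) = delete;
    };

    // Takes a segment from the reserve; refills only when allowed.
    void allocate_segment() {
        if (_free_segments == 0) {
            if (!_allocation_enabled) {
                throw std::bad_alloc();
            }
            ++_free_segments; // simulated request to the underlying allocator
        }
        --_free_segments;
    }

    bool allocation_enabled() const { return _allocation_enabled; }
    size_t free_segments() const { return _free_segments; }
};
```

Saving and restoring the previous state, rather than unconditionally re-enabling, keeps nested locks safe.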

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-08 14:07:30 +03:00
Pavel Emelyanov
2005aca444 logalloc: Shuffle code around region::impl::compact
This includes 3 small changes to facilitate the next patch:
- rename region::impl::compact to compact_segment_locked
- merge the former compact with compact_single_segment_locked
- move the log print and stats update into compact_segment_locked

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-08 14:06:45 +03:00
Pavel Emelyanov
8c81c6b7aa logalloc: Do not lock reclaimer twice
The tracker::impl::reclaim is already in reclaim-locked
section, no need for yet another nested lock.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-08 13:14:33 +03:00
Pavel Emelyanov
0392c5ca77 logalloc: Do not calculate object size twice
When walking objects on compaction the migrator->size() virtual fn is
called twice.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-08 13:14:33 +03:00
Pavel Emelyanov
81c9c4c7b2 logalloc: Do not convert obj_desc to migrator back and forth
When calling alloc_small the migrator is passed just to get the
object descriptor, but during compaction the descriptor is already
at hand, so there is no need to get it again.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-08 13:14:33 +03:00
Tomasz Grabiec
087fa42c1d Merge "utils: inject errors around paxos stages" from Alejo
Add Paxos error injections before/after save promise, proposal, decision,
paxos_response_handler, delete decision.

Adds a method to inject an error by providing a lambda, while avoiding
adding a continuation when the error injection is disabled.

For this, provide an error exception and enter() to allow flow control (i.e. return)
on simple error injections without lambdas.

Also includes Pavel's patch for CQL API for error injections, updated to
current error injection API and added one_shot support. Also added some
basic CQL API boost tests.

For CQL API there's a limitation of the current grammar not supporting
f(<terminal>) so values have to be inserted in a table until this is
resolved. See #5411

* https://github.com/alecco/scylla/tree/error_injection_v11:
  paxos: fix indentation
  paxos: add error injections
  utils: add timeout error injection with lambda
  utils: error injection add enter() for control flow
  utils: error injections provide error exceptions
  failure_injector: implement CQL API for failure injector class
  lwt: fix disabled error injection templates
2020-06-03 15:42:10 +02:00
Alejo Sanchez
a8b14b0227 utils: add timeout error injection with lambda
Even though calling then() on a ready future does not allocate a
continuation, calling then() on its result will allocate.

This error injection only adds a continuation in the dependency
chain if error injections are enabled at compile time and this particular
error injection is enabled.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-06-03 14:44:00 +02:00
Alejo Sanchez
0321172677 utils: error injection add enter() for control flow
For control flow (i.e. return) and simplicity add enter() method.

For disabled injections, this method is const returning false,
therefore it has no overhead.
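
The pattern can be sketched like this. It is not the actual utils error injection class: error_injection_sketch, save_promise and the injection name "paxos_save_promise" are hypothetical, and only the compile-time enabled/disabled split and the enter() method come from the description above.

```cpp
#include <cassert>
#include <functional>
#include <set>
#include <string>
#include <string_view>

// Hypothetical sketch: the template parameter selects the real variant
// or the no-op variant at compile time.
template <bool Enabled>
class error_injection_sketch {
    std::set<std::string, std::less<>> _enabled;
public:
    void enable(std::string_view name) { _enabled.emplace(name); }
    // True when the named injection is active, so the caller can branch.
    bool enter(std::string_view name) const {
        return _enabled.find(name) != _enabled.end();
    }
};

// Disabled variant: enter() is a constant false, so the compiler can
// drop the guarded branch entirely.
template <>
class error_injection_sketch<false> {
public:
    void enable(std::string_view) {}
    constexpr bool enter(std::string_view) const { return false; }
};

// Example call site using enter() for plain control flow, no lambda.
template <bool Enabled>
int save_promise(const error_injection_sketch<Enabled>& inj) {
    if (inj.enter("paxos_save_promise")) {
        return -1; // injected failure: return early
    }
    return 0; // normal path
}
```

Both variants expose the same interface, so call sites compile unchanged whichever one is selected.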

Add boost test.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-06-03 14:42:48 +02:00
Avi Kivity
6f394e8e90 tombstone: use comparison operator instead of ad-hoc compare() function and with_relational_operators
The comparison operator (<=>) default implementation happens to exactly
match tombstone::compare(), so use the compiler-generated defaults. Also
default operator== and operator!= (these are not brought in by operator<=>).
These become slightly faster as they perform just an equality comparison,
not three-way compare.

shadowable_tombstone and row_tombstone depend on tombstone::compare(),
so convert them too in a similar way.

with_relational_operators.hh becomes unused, so delete it.

Tests: unit (dev)
Message-Id: <20200602055626.2874801-1-avi@scylladb.com>
2020-06-02 09:28:52 +03:00
Avi Kivity
a4c44cab88 treewide: update concepts language from the Concepts TS to C++20
Seastar recently lost support for the experimental Concepts Technical
Specification (TS) and gained support for C++20 concepts. Re-enable
concepts in Scylla by updating our use of concepts to the C++20
standard.

This change:
 - peels off uses of the GCC6_CONCEPT macro
 - removes inclusions of <seastar/gcc6-concepts.hh>
 - replaces function-style concepts (no longer supported) with
   equation-style concepts
 - semicolons added and removed as needed
 - deprecated std::is_pod replaced by recommended replacement
 - updates return type constraints to use concepts instead of
   type names (either std::same_as or std::convertible_to, with
   std::same_as chosen when possible)

No attempt is made to improve the concepts; this is a specification
update only.
Message-Id: <20200531110254.2555854-1-avi@scylladb.com>
2020-06-02 09:12:21 +03:00
Piotr Sarna
7b5db478ed big_decimal: migrate to string views
Big decimals are, among other use cases, used as a main number
type for alternator, and as such can appear on the fast path.
Parsing big decimals was performed via std::regex, which is not
exactly famous for its speed, and also forces unnecessary
string copying. Therefore, the implementation is replaced
with an open-coded version based on string_views.
One previous iteration of this series also included
a hand-coded state machine implementation, but it proved
to be slower than the slightly naive string_view one.
Overall, execution time is reduced by 61.6% according to
microbenchmarks, which sounds like a promising improvement.
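
To give a feel for the open-coded approach, here is a minimal sketch of string_view based parsing. It is not the code from the patch: decimal_sketch, parse_decimal and the helpers are hypothetical, the accepted grammar (optional sign, digits, optional fraction, optional exponent) is an assumption, and int64_t stands in for an arbitrary-precision integer, so overflow is not handled.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string_view>

struct decimal_sketch {
    int64_t unscaled; // value == unscaled * 10^(-scale)
    int32_t scale;
};

// Appends decimal digits to acc; fails on any non-digit character.
inline bool append_digits(std::string_view digits, int64_t& acc) {
    for (char c : digits) {
        if (c < '0' || c > '9') {
            return false;
        }
        acc = acc * 10 + (c - '0');
    }
    return true;
}

// Strips an optional sign and reports whether it was '-'.
inline bool strip_sign(std::string_view& s) {
    if (!s.empty() && (s.front() == '-' || s.front() == '+')) {
        bool negative = s.front() == '-';
        s.remove_prefix(1);
        return negative;
    }
    return false;
}

inline std::optional<decimal_sketch> parse_decimal(std::string_view s) {
    const bool negative = strip_sign(s);
    // Split off the exponent, then the fraction. All parts are views
    // into the input, so nothing is copied.
    std::string_view exponent, fraction;
    bool has_exponent = false;
    if (auto e = s.find_first_of("eE"); e != std::string_view::npos) {
        has_exponent = true;
        exponent = s.substr(e + 1);
        s = s.substr(0, e);
    }
    if (auto dot = s.find('.'); dot != std::string_view::npos) {
        fraction = s.substr(dot + 1);
        s = s.substr(0, dot);
    }
    const bool exp_negative = strip_sign(exponent);
    int64_t unscaled = 0, exp = 0;
    if ((s.empty() && fraction.empty()) || (has_exponent && exponent.empty())
            || !append_digits(s, unscaled) || !append_digits(fraction, unscaled)
            || !append_digits(exponent, exp)) {
        return std::nullopt;
    }
    // Each fractional digit raises the scale; the exponent lowers it.
    const int64_t scale = static_cast<int64_t>(fraction.size()) - (exp_negative ? -exp : exp);
    return decimal_sketch{negative ? -unscaled : unscaled, static_cast<int32_t>(scale)};
}
```

The input is scanned a small, fixed number of times and never copied, which is the property the regex version lacked.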

Perf results:
test                                      iterations      median         mad         min         max

Regex (original):
big_decimal_test.from_string                   88895    11.228us    25.891ns    11.202us    11.510us

String view (new):
big_decimal_test.from_string                  232334     4.303us    21.660ns     4.282us     4.736us

State machine (experimental, ditched):
big_decimal_test.from_string                  148318     6.723us    51.896ns     6.672us     6.877us

Tests: unit(dev + release(big_decimal_test))
2020-06-01 16:11:49 +02:00
Pavel Emelyanov
ee31191e21 storage_service: Move get_generation_number to util/
This is purely a utility helper routine. As a nice side effect the
inclusion of storage_service.hh is removed from several unrelated
places.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-01 09:08:40 +03:00
Botond Dénes
a9e6fe4071 utils: introduce ranges::to()
Sadly, std::ranges is missing an equivalent of boost::copy_range(), so
we introduce a replacement: ranges::to(). There is an existing proposal
to introduce something similar to the standard library:
std::ranges::to() (https://github.com/cplusplus/papers/issues/145). We
name our own version similarly, so if said proposal makes it in we can
just prepend std:: and be good.
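
A rough C++17 approximation of such a helper, for illustration. The helper in the patch targets std::ranges; this sketch only relies on std::begin()/std::end(), and the namespace name ranges_sketch is hypothetical.

```cpp
#include <cassert>
#include <iterator>
#include <set>
#include <vector>

namespace ranges_sketch {

// Builds a Container from any range with begin()/end(), in the spirit
// of boost::copy_range<Container>(range).
template <typename Container, typename Range>
Container to(Range&& r) {
    return Container(std::begin(r), std::end(r));
}

} // namespace ranges_sketch
```

A call such as ranges_sketch::to<std::vector<int>>(some_set) then reads the same way the proposed std::ranges::to() would.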

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200529141407.158960-2-bdenes@scylladb.com>
2020-05-31 12:58:59 +03:00
Pavel Emelyanov
878f8d856a logalloc: Report reclamation timing with rate
The timer.stop() call, which reports not only the time taken but also
the reclamation rate, was unintentionally dropped while expanding its
scope (c70ebc7c).

Bring it back (and mark compact_and_evict_locked as private while
at it).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200528185331.10537-1-xemul@scylladb.com>
2020-05-29 14:50:43 +02:00
Pavel Emelyanov
7696ed1343 shard_tracker: Configure it in one go
Instead of doing 3 smp::invoke_on_all-s and duplicating
tracker::impl API for the tracker itself, introduce the
tracker::configure, simplify the tracker configuration
and narrow down the public tracker API.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200528185442.10682-1-xemul@scylladb.com>
2020-05-29 14:50:43 +02:00
Alejo Sanchez
bb08b5ad5a utils: error injections provide error exceptions
Provide a non-timeout error exception
to facilitate control flow in injected errors.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-05-28 11:13:55 +02:00
Alejo Sanchez
2c7e01a3b6 lwt: fix disabled error injection templates
Fix disabled injection templates to match enabled ones.
Fix corresponding test to not be a continuation.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-05-28 11:13:55 +02:00
Avi Kivity
bdb5b11d19 treewide: stop using deprecated seastar::apply()
seastar::apply() is deprecated in recent versions of seastar in favor
of std::apply(), so stop including its header. Calls to unqualified
apply(..., std::tuple<>) are resolved to std::apply() by argument
dependent lookup, so no changes to call sites are necessary.
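
The lookup rule can be illustrated with plain standard C++17, no seastar
involved. `sum_via_adl` is a hypothetical name used only for this sketch:

```cpp
#include <tuple>

inline int sum_via_adl() {
    auto add = [](int a, int b) { return a + b; };
    // No declaration named apply is visible here. Because the second
    // argument is a std::tuple, argument dependent lookup searches
    // namespace std and finds std::apply.
    return apply(add, std::make_tuple(2, 3));
}
```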

This avoids a huge number of deprecation warnings with latest seastar.
Message-Id: <20200526090552.1969633-1-avi@scylladb.com>
2020-05-27 14:07:35 +03:00
Amnon Heiman
3e5beba403 estimated_histogram: clean if0 and FIXME
This patch cleans up the estimated histogram implementation.
It removes the FIXMEs that were left in the code from the migration time
and the code commented out with #if 0.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2020-05-27 08:40:05 +03:00
Avi Kivity
076c8317c7 streaming_histogram: add missing include for uint64_t
Fails dev-headers build without it.

Message-Id: <20200523061555.72087-1-avi@scylladb.com>
2020-05-23 11:09:10 +03:00
Avi Kivity
e774ee06ed Update seastar submodule
* seastar e708d1df3a...92365e7b87 (11):
  > tests: distributed_test: convert to SEASTAR_TEST_CASE
  > Merge "Avoid undefined behavior on future self move assignments" from Rafael
  > Merge "C++20 support" from Avi
  > optimized_optional: don't use experimental C++ features
  > tests: scheduling_group_test: verify that later() doesn't modify the current group
  > tests: demos: coroutine_demo: add missing include for open_file_dma()
  > rpc: minor documentation improvements
  > rpc: Assert that sinks are closed
  > Merge "Fix most tests under valgrind" from Rafael
  > distributed_test: Fix it on slow machines
  > rpc_test: Make sure we always flush and close the sink

loading_shard_values.hh: added missing include for gcc6-concepts.hh,
exposed by the submodule update.

Frozen toolchain updated for the new valgrind dependency.
2020-05-12 14:04:16 +03:00
Rafael Ávila de Espíndola
e6f4996e44 atomic_vector: Don't pass references to callbacks
This is more strict than it needs to be, but it avoids any bugs like
the one fixed by the previous patch.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200422182304.120906-2-espindola@scylladb.com>
2020-04-23 16:06:37 +03:00
Alejo Sanchez
bd849764e0 utils: error injection sleep add support for manual_clock
Requested by @tgrabiec in previous patch (already merged).

Adds support for sleep using manual clock.
Add test.

NOTE: Removes system_clock support (and test) as sleep is not explicitly
      instantiated in seastar/src/core/reactor.cc

Branch URL: https://github.com/alecco/scylla/tree/error_injection_5_manual_clock

Tests: unit ({dev})

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Message-Id: <20200417081518.868900-1-alejo.sanchez@scylladb.com>
2020-04-17 11:45:05 +02:00
Avi Kivity
88ade3110f treewide: replace calls to engine().some_api() with some_api()
This removes the need to include reactor.hh, a source of compile
time bloat.

In some places, the call is qualified with seastar:: in order
to resolve ambiguities with a local name.

Includes are adjusted to make everything compile. We end up
having 14 translation units including reactor.hh, primarily for
deprecated things like reactor::at_exit().

Ref #1
2020-04-05 12:46:04 +03:00
Avi Kivity
1799cfa88a logalloc: use namespace-scope seastar::idle_cpu_handler and related rather than reactor scope
This allows us to drop a #include <reactor.hh>, reducing compile time.

Several translation units that lost access to required declarations
are updated with the required includes (this can be an include of
reactor.hh itself, in case the translation unit that lost it got it
indirectly via logalloc.hh)

Ref #1.
2020-04-05 12:45:08 +03:00
Rafael Ávila de Espíndola
8da235e440 everywhere: Use futurize_invoke instead of futurize<T>::invoke
No functionality change, just simpler.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200330165308.52383-1-espindola@scylladb.com>
2020-04-03 15:53:35 +02:00
Alejo Sanchez
3a4dd0a856 utils: error injection inject() returning a future
Make inject() return a future.

Suggested by Gleb.
Botond helped with handling the complex function/lambda overloads.

Refs #3295 (closed)

Tests: unit ({dev})

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Message-Id: <20200331143839.1781424-7-alejo.sanchez@scylladb.com>
2020-04-01 16:22:52 +02:00
Alejo Sanchez
8bae38cef9 utils: error injection support multiple clocks
Use a template to support multiple clock classes for the time point
used in deadline injection.

Refs: #3295   (closed)

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Message-Id: <20200331143839.1781424-6-alejo.sanchez@scylladb.com>
2020-04-01 16:22:45 +02:00
Alejo Sanchez
71f2f423bc utils: error injection reorder args for exceptions
Move exception factory to end of argument list.

Refs: #3295   (closed)

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Message-Id: <20200331143839.1781424-5-alejo.sanchez@scylladb.com>
2020-04-01 16:22:38 +02:00
Alejo Sanchez
fd1eb6a466 utils: error injection simplify API
Split error injection C++ API to have

1. sleep duration
2. sleep to deadline (timeout)

TODO: support multiple types of clocks

Refs: #3295   (closed)

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Message-Id: <20200331143839.1781424-4-alejo.sanchez@scylladb.com>
2020-04-01 16:22:30 +02:00
Alejo Sanchez
e5a2ba32b9 utils: error injection allocate string for remote invoke
Allocate string before sending to other shards.

Reported by Pavel Solodovnikov.

Refs #3295 (closed)

Tests: unit ({dev})

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Message-Id: <20200328204454.1326514-2-alejo.sanchez@scylladb.com>
2020-03-31 11:58:27 +02:00
Rafael Ávila de Espíndola
c5795e8199 everywhere: Replace engine().cpu_id() with this_shard_id()
This is a bit simpler and might allow removing a few includes of
reactor.hh.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200326194656.74041-1-espindola@scylladb.com>
2020-03-27 11:40:03 +03:00
Alejo Sanchez
febcced4f1 utils: error injection with timeout/deadline
Most Scylla code runs with a user-supplied query timeout, expressed as an
absolute clock time (deadline). When injecting test sleeps into such code, we
usually do not want to sleep beyond the user-supplied deadline. Extend the error
injection API to optionally accept a deadline and, if one is provided,
sleep no longer than until the deadline. If the current time is already past
the deadline, the sleep injection is skipped altogether.
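
The clamping rule can be sketched with std::chrono alone. The function
`bounded_sleep` and its signature are assumptions made for illustration;
they are not the API added by this patch:

```cpp
#include <algorithm>
#include <chrono>
#include <optional>

// Hypothetical helper: decide how long an injected sleep may last.
// Returns std::nullopt when the deadline has already been reached, which
// corresponds to skipping the sleep injection altogether.
template <typename Clock>
std::optional<typename Clock::duration>
bounded_sleep(typename Clock::time_point now,
              typename Clock::duration requested,
              typename Clock::time_point deadline) {
    if (now >= deadline) {
        return std::nullopt;
    }
    // Never sleep past the deadline.
    return std::min(requested, deadline - now);
}
```

The clock type is a template parameter, so the same rule applies to any
clock that provides the usual time_point and duration member types.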

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Message-Id: <20200326091600.1037717-2-alejo.sanchez@scylladb.com>
2020-03-26 12:41:10 +01:00
Rafael Ávila de Espíndola
eca0ac5772 everywhere: Update for deprecated apply functions
Now apply is only for tuples; for varargs, use invoke.

This depends on the seastar changes adding invoke.
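
The same split exists in the standard library, which makes for a compact
illustration. This sketch uses std::invoke and std::apply rather than the
seastar functions, and `add3`, `via_invoke` and `via_apply` are
hypothetical names:

```cpp
#include <functional>
#include <tuple>

inline int add3(int a, int b, int c) { return a + b + c; }

// Arguments passed one by one: use invoke.
inline int via_invoke() { return std::invoke(add3, 1, 2, 3); }

// Arguments packed in a tuple: use apply, which unpacks them.
inline int via_apply() { return std::apply(add3, std::make_tuple(1, 2, 3)); }
```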

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200324163809.93648-1-espindola@scylladb.com>
2020-03-25 08:49:53 +02:00
Avi Kivity
0d885dbb00 Merge "Make all headers standalone" from Botond
"
Make sure all headers compile on their own, without requiring any
additional includes externally.

Even though this requirement is not documented in our coding guides it
is still quasi enforced and we semi-regularly get and merge patches
adding missing includes to headers.

This patch-set fixes all headers and adds a `{mode}-headers` target that
can be used to verify each header. This target should be built by
promotion to ensure no new non-conforming code sneaks in.
Individual headers can be verified using the
`build/dev/path/to/header.hh.o` target, that is generated for every
header.

The majority of the headers was just missing `seastarx.hh`. I think we
should just include this via a compiler flag to remove the noise from
our code (in a followup).
"

* 'compiling-headers/v2' of https://github.com/denesb/scylla:
  configure.py: add {mode}-headers phony target
  treewide: add missing headers and/or forward declarations
  test/boost/sstable_test.hh: move generic stuff to test/lib/sstable_utils.hh
  sstables: size_tiered_backlog_tracker: move methods out-of-line
  sstables: date_tiered_compaction_strategy.hh: move methods out-of-line
2020-03-23 13:09:09 +02:00
Avi Kivity
c6a441f9c2 Update seastar submodule
* seastar 3c498abcab...92c488706c (14):
  > dpdk: restore including reactor.hh
  > tests: distributed_test: add missing #include <mutex>
  > reactor: un-static-ify make_pollfn()
  > merge: Reduce inclusions of reactor.hh
A few #includes added to compensate for this
  > sharded: delete move constructor
  > future: Avoid a move constructor call
  > future: Erase types a bit more in then_wrapped
  > memory: Drop a never nullopt optional
  > semaphore: specify get_units and with_semaphore as noexcept
  > spinlock.hh: Add include for <cassert> header
  > dpdk: Avoid a variable sized array
  > future: Add an explicit promise member to continuation
  > net: remove smart pointer wrappers around pollable_fd
  > Merge "cleanup reactor file functions" from Benny
2020-03-23 11:59:30 +02:00
Piotr Sarna
602a771105 Merge 'utils: error injector API' from Alejo
Closes #3295

The error_injection class allows injecting custom handlers into normal control
flow at the pre-determined injection points.

This is especially useful in various testing scenarios:
 * Throwing an exception at some rare and extreme corner-cases
 * Injecting a delay to test for timeouts to be handled correctly
 * More advanced uses with custom lambda as an injection handler

Injection points are defined by `inject` calls.

Enabling and disabling injections is done by the corresponding
`enable` and `disable` calls.

A REST frontend API is provided for convenience.
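
A toy model of the enable/disable/inject flow described above. The class
`toy_error_injection`, the injection name `toy_read_fail` and the function
`read_with_injection` are all hypothetical; the real class has a different
interface and is not reproduced here:

```cpp
#include <functional>
#include <stdexcept>
#include <string>
#include <unordered_set>

// Hypothetical, single-threaded stand-in for an injection registry.
class toy_error_injection {
    std::unordered_set<std::string> _enabled;
public:
    void enable(const std::string& name) { _enabled.insert(name); }
    void disable(const std::string& name) { _enabled.erase(name); }
    bool is_enabled(const std::string& name) const {
        return _enabled.count(name) != 0;
    }
    // Injection point: the handler runs only while the name is enabled.
    void inject(const std::string& name,
                const std::function<void()>& handler) const {
        if (is_enabled(name)) {
            handler();
        }
    }
};

// Normal control flow with a pre-determined injection point.
inline int read_with_injection(const toy_error_injection& injector) {
    injector.inject("toy_read_fail", [] {
        throw std::runtime_error("injected failure");
    });
    return 42;
}
```

With the injection disabled the function returns normally; a test enables
`toy_read_fail`, observes the exception, then disables it again.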

Branch URL:  https://github.com/alecco/scylla/tree/as_error_injection

Tests: unit {{dev}}, unit {{debug}}

* 'as_error_injection' of github.com:alecco/scylla:
  api: add error injection to REST API
  utils: add error injection
2020-03-23 08:39:22 +01:00
Botond Dénes
e0284bb9ee treewide: add missing headers and/or forward declarations
2020-03-23 09:29:45 +02:00
Pavel Solodovnikov
057adc8b4d utils: add error injection
Error injection class is implemented in order to allow injecting
various errors (exceptions, stalls, etc.) in code for testing
purposes.

Error injection is enabled via compile flag
 SCYLLA_ENABLE_ERROR_INJECTION

TODO: manage shard instances

Enable error injection in debug/dev/sanitize modes.

Unit tests for error injection class.

Closes #3295

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
2020-03-20 19:37:48 +01:00
Rafael Ávila de Espíndola
517a01a3f6 utils: Use sstring as keys in nonstatic_class_registry
Now that seastar::string::compare has been updated, it is possible to
use sstring for this.

This reverts commit 01fe766f1f.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200311005219.280737-1-espindola@scylladb.com>
2020-03-16 11:01:15 +02:00
Avi Kivity
c020b4e5e2 logalloc: increase capacity of _regions vector outside reclaim lock
Reclaim consults the _regions vector, so we don't want it moving around while
allocating more capacity. For that we take the reclaim lock. However, that
can cause a false-positive OOM during startup:

1. all memory is allocated to LSA as part of priming (2baa16b371)
2. the _regions vector is resized from 64k to 128k, requiring a segment
   to be freed (plenty are free)
3. but reclaiming_lock is taken, so we cannot reclaim anything.

To fix, resize the _regions vector outside the lock.
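
The shape of the fix, reduced to a toy. `toy_region_tracker` is a
hypothetical class, regions are modelled as ints, and the reclaim lock is
modelled as a plain flag; none of this is the logalloc code:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

class toy_region_tracker {
    std::vector<int> _regions;
    bool _reclaim_locked = false;
    bool _grew_under_lock = false;
public:
    void register_region(int region) {
        // Grow capacity first, while reclaim is still allowed, so the
        // allocation can be satisfied by freeing memory if needed.
        if (_regions.size() == _regions.capacity()) {
            _regions.reserve(std::max<std::size_t>(1, _regions.capacity() * 2));
        }
        _reclaim_locked = true;   // stand-in for taking the reclaim lock
        const auto capacity_before = _regions.capacity();
        _regions.push_back(region);  // spare capacity: no reallocation
        _grew_under_lock = _grew_under_lock
            || (_regions.capacity() != capacity_before);
        _reclaim_locked = false;  // stand-in for releasing the lock
    }
    std::size_t size() const { return _regions.size(); }
    bool reclaim_locked() const { return _reclaim_locked; }
    bool grew_under_lock() const { return _grew_under_lock; }
};
```

Because capacity is reserved before the flag is set, the vector never
moves or allocates while reclaim is disabled.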

Fixes #6003.
Message-Id: <20200311091217.1112081-1-avi@scylladb.com>
2020-03-11 12:29:31 +02:00