Commit Graph

138 Commits

Author SHA1 Message Date
Avi Kivity
9322c07c71 Merge "Use binary search in sstable promoted index" from Tomasz
"
The "promoted index" is how the sstable format calls the clustering key index within a given partition.
Large partitions with many rows have it. It's embedded in the partition index entry.

Currently, lookups in the promoted index are done by scanning the index linearly so the lookup
is O(N). For large partitions that's inefficient. It consumes both a lot of CPU and I/O.

We could do better and use binary search in the index. This patch series switches the mc-format
index reader to do that. Other formats use the old way.

The "mc" format promoted index has an extra structure at the end of the index called "offset map".
It's a vector of offsets of consecutive promoted index entries. This allows us to access random
entries in the index without reading the whole index.

The location of the offset entry for a given promoted index entry can be derived by knowing where
the offset vector ends in the index file, so the offset map also doesn't have to be read completely
into the memory.

The most tricky part is caching. We need to cache blocks read from the index file to amortize the
cost of binary search:

  - if the promoted index fits in the 32 KiB which was read from the index when looking for
    the partition entry, we don't want to issue any additional I/O to search the promoted index.

  - with large promoted indexes, the last few bisections will fall into the same I/O block and we
    want to reuse that block.

  - we don't want the cache to grow too big, we don't want to cache the whole promoted index
    as the read progresses over the index. Scanning reads may skip multiple times.

This series implements a rather simple approach which meets all the
above requirements and is not worse than the current state of affairs:

   - Each index cursor has its own cache of the index file area which corresponds to promoted index
     This is managed by the cached_file class.

   - Each index cursor has its own cache of parsed blocks. This allows the upper bound estimation to
     reuse information obtained during lower bound lookup. This estimation is used to limit
     read-aheads in the data file.

   - Each cursor drops entries that it walked past so that memory footprint stays O(log N)

   - Cached buffers are accounted to read's reader_permit.

Later, we could have a single cache shared by many readers. For that, we need to come up with eviction
policy.

Fixes #4007.

TESTING RESULTS

 * Point reads, large promoted index:

  Config: rows: 10000000, value size: 2000
  Partition size: 20 GB
  Index size: 7 MB

  Notes:

    - Slicing read into the middle of partition (offset=5000000, read=1) is a clear win for the binary search:

      time: 1.9ms vs 22.9ms
      CPU utilization: 8.9% vs 92.3%
      I/O: 21 reqs / 172 KiB vs 29 reqs / 3'520 KiB

      It's 12x faster, CPU utilization is 10x times smaller, disk utilization is 20x smaller.

    - Slicing at the front (offset=0) is a mixed bag.

      time is similar: 1.8ms
      CPU utilization is 6.7x smaller for bsearch: 8.5% vs 57.7%
      disk bandwidth utilization is smaller for bsearch but uses more IOs: 4 reqs / 320 KiB (scan) vs 17 reqs / 188 KiB (bsearch)

      bsearch uses less bandwidth because the series reduces buffer size used for index file I/O.

      scan is issuing:

         2 * 128 KB (index page)
         2 * 32 KB (data file)

      bsearch is issuing:

         1 * 64 KB (index page)
         15 * 4 KB (promoted index)
         1 * 64 KB (data file)

      The 1 * 64 KB is chosen dynamically by seastar. Sometimes it chooses 2 * 32 KB (with read-ahead).
      32 KB is the minimum I/O currently.

      Disk utilization could be further improved by changing the way seastar's dynamic I/O adjustments work
      so that it uses 1 * 4 KB when it suffices. This is left for the follow-up.

  Command:

        perf_fast_forward --datasets=large-part-ds1 \
         --run-tests=large-partition-slicing-clustering-keys -c1 --test-case-duration=1

  Before:

    offset  read      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
    0       1         0.001836          172         1        545          9        563        175        4.0      4        320       2       2        0        1        1        0        0        0  57.7%      0
    0       32        0.001858          502        32      17220        126      17776      11526        3.2      3        324       2       1        0        1        1        0        0        0  56.4%      0
    0       256       0.002833          339       256      90374        427      91757      85931        7.0      7        776       3       1        0        1        1        0        0        0  41.1%      0
    0       4096      0.017211           58      4096     237984       2011     241802     233870       66.1     66       8376      59       2        0        1        1        0        0        0  21.4%      0
    5000000 1         0.022952           42         1         44          1         45         41       29.2     29       3520      22       2        0        1        1        0        0        0  92.3%      0
    5000000 32        0.023052           43        32       1388         14       1414       1331       31.1     32       3588      26       2        0        1        1        0        0        0  91.7%      0
    5000000 256       0.024795           41       256      10325        129      10721       9993       43.1     39       4544      29       2        0        1        1        0        0        0  86.4%      0
    5000000 4096      0.038856           27      4096     105414        398     106918     103162       95.2     95      12160      78       5        0        1        1        0        0        0  61.4%      0

 After (v2):

    offset  read      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
    0       1         0.001831          248         1        546         21        581        252       17.6     17        188       2       0        0        1        1        0        0        0   8.5%      0
    0       32        0.001910          535        32      16751        626      17770      13896       17.9     19        160       3       0        0        1        1        0        0        0   8.8%      0
    0       256       0.003545          266       256      72207       2333      89076      62852       26.9     24        764       7       0        0        1        1        0        0        0   9.7%      0
    0       4096      0.016800           56      4096     243812        524     245430     239736       83.6     83       8700      64       0        0        1        1        0        0        0  16.6%      0
    5000000 1         0.001968          351         1        508         19        538        380       21.3     21        172       2       0        0        1        1        0        0        0   8.9%      0
    5000000 32        0.002273          431        32      14077        436      15503      11551       22.7     22        268       3       0        0        1        1        0        0        0   8.9%      0
    5000000 256       0.003889          257       256      65824       2197      81833      57813       34.0     37        652      18       0        0        1        1        0        0        0  11.2%      0
    5000000 4096      0.017115           54      4096     239324        834     241310     231993       88.3     88       8844      65       0        0        1        1        0        0        0  16.8%      0

 After (v1):

    offset  read      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
    0       1         0.001886          259         1        530          4        545        261       18.0     18        376       2       2        0        1        1        0        0        0   9.1%      0
    0       32        0.001954          513        32      16381         93      16844      15618       19.0     19        408       3       2        0        1        1        0        0        0   9.3%      0
    0       256       0.003266          318       256      78393       1820      81567      61663       30.8     26       1272       7       2        0        1        1        0        0        0  10.4%      0
    0       4096      0.017991           57      4096     227666        855     231915     225781       83.1     83       8888      55       5        0        1        1        0        0        0  15.5%      0
    5000000 1         0.002353          232         1        425          2        432        232       23.0     23        396       2       2        0        1        1        0        0        0   8.7%      0
    5000000 32        0.002573          384        32      12437         47      12571        429       25.0     25        460       4       2        0        1        1        0        0        0   8.5%      0
    5000000 256       0.003994          259       256      64101       2904      67924      51427       37.0     35       1484      11       2        0        1        1        0        0        0  10.6%      0
    5000000 4096      0.018567           56      4096     220609        448     227395     219029       89.8     89       9036      59       5        0        1        1        0        0        0  15.1%      0

 * Point reads, small promoted index (two blocks):

  Config: rows: 400, value size: 200
  Partition size: 84 KiB
  Index size: 65 B

  Notes:
     - No significant difference in time
     - the same disk utilization
     - similar CPU utilization

  Command:

      perf_fast_forward --datasets=large-part-ds1 \
         --run-tests=large-partition-slicing-clustering-keys -c1 --test-case-duration=1

  Before:

    offset  read      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
    0       1         0.000279          470         1       3587         31       3829        478        3.0      3         68       2       1        0        1        1        0        0        0  21.1%      0
    0       32        0.000276         3498        32     116038        811     122756     104033        3.0      3         68       2       1        0        1        1        0        0        0  24.0%      0
    0       256       0.000412         2554       256     621044       1778     732150     559221        2.0      2         72       2       0        0        1        1        0        0        0  32.6%      0
    0       4096      0.000510         1901       400     783883       4078     819058     665616        2.0      2         88       2       0        0        1        1        0        0        0  36.4%      0
    200     1         0.000339         2712         1       2951          8       3001       2569        2.0      2         72       2       0        0        1        1        0        0        0  17.8%      0
    200     32        0.000352         2586        32      91019        266      92427      83411        2.0      2         72       2       0        0        1        1        0        0        0  20.8%      0
    200     256       0.000458         2073       200     436503       1618     453945     385501        2.0      2         88       2       0        0        1        1        0        0        0  29.4%      0
    200     4096      0.000458         2097       200     436475       1676     458349     381558        2.0      2         88       2       0        0        1        1        0        0        0  29.0%      0

  After (v1):

    Testing slicing of large partition using clustering keys:
    offset  read      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
    0       1         0.000278          492         1       3598         30       3831        500        3.0      3         68       2       1        0        1        1        0        0        0  19.4%      0
    0       32        0.000275         3433        32     116153        753     122915      92559        3.0      3         68       2       1        0        1        1        0        0        0  22.5%      0
    0       256       0.000458         2576       256     559437       2978     728075     504375        2.1      2         88       2       0        0        1        1        0        0        0  29.0%      0
    0       4096      0.000506         1888       400     790064       3306     822360     623109        2.0      2         88       2       0        0        1        1        0        0        0  36.6%      0
    200     1         0.000382         2493         1       2619         10       2675       2268        2.0      2         88       2       0        0        1        1        0        0        0  16.3%      0
    200     32        0.000398         2393        32      80422        333      84759      22281        2.0      2         88       2       0        0        1        1        0        0        0  19.0%      0
    200     256       0.000459         2096       200     435943       1608     453989     380749        2.0      2         88       2       0        0        1        1        0        0        0  30.5%      0
    200     4096      0.000458         2097       200     436410       1651     455779     382485        2.0      2         88       2       0        0        1        1        0        0        0  29.2%      0

 * Scan with skips, large index:

  Config: rows: 10000000, value size: 2000
  Partition size: 20 GB
  Index size: 7 MB

  Notes:

    - Similar time, slightly worse for binary search: 36.1 s (scan) vs 36.4 (bsearch)

    - Slightly more I/O for bsearch: 153'932 reqs / 19'703'260 KiB (scan) vs 155'651 reqs / 19'704'088 KiB (bsearch)

      Binary search reads more by 828 KB and by 1719 IOs.
      It does more I/O to read the the promoted index offset map.

    - similar (low) memory footprint. The danger here is that by caching index blocks which we touch as we scan
      we would end up caching the whole index. But this is protected against by eviction as demonstrated by the
      last "mem" column.

  Command:

    perf_fast_forward --datasets=large-part-ds1 \
       --run-tests=large-partition-skips -c1 --test-case-duration=1

  Before:

      read    skip      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
      1       1        36.103451            4   5000000     138491         38     138601     138453   153932.0 153932   19703260  153561       1        0        1        1        0        0        0  31.5% 502690

  After (v2):

    read    skip      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
    1       1        37.000145            4   5000000     135135          6     135146     135128   155651.0 155651   19704088  138968       0        0        1        1        0        0        0  34.2%      0

  After (v1):

    read    skip      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
    1       1        36.965520            4   5000000     135261         30     135311     135231   155628.0 155628   19704216  139133       1        0        1        1        0        0        0  33.9% 248738

Also in:

  git@github.com:tgrabiec/scylla.git sstable-use-index-offset-map-v2

Tests:

  - unit (all modes)
  - manual using perf_fast_forward
"

* tag 'sstable-use-index-offset-map-v2' of github.com:tgrabiec/scylla:
  sstables: Add promoted index cache metrics
  position_in_partition: Introduce external_memory_usage()
  cached_file, sstables: Add tracing to index binary search and page cache
  sstables: Dynamically adjust I/O size for index reads
  sstables, tests: Allow disabling binary search in promoted index from perf tests
  sstables: mc: Use binary search over the promoted index
  utils: Introduce cached_file
  sstables: clustered_index: Relax scope of validity of entry_info
  sstables: index_entry: Introduce owning promoted_index_block_position
  compound_compat: Allow constructing composite from a view
  sstables: index_entry: Rename promoted_index_block_position to promoted_index_block_position_view
  sstables: mc: Extract parser for promoted index block
  sstables: mc: Extract parser for clustering out of the promoted index block parser
  sstables: consumer: Extract primitive_consumer
  sstables: Abstract the clustering index cursor behavior
  sstables: index_reader: Rearrange to reduce branching and optionals
2020-06-18 12:09:39 +03:00
Tomasz Grabiec
ab274b8203 sstables: clustered_index: Relax scope of validity of entry_info
entry_info holds views, which may get invalidated when the containing
index blocks are removed. Current implementations of next_entry() keeps
the blocks in memory as long as the cursor is alive but that will
change in new implementations of the cursor.

Adjust the assumption of tests accordingly.
2020-06-16 16:15:23 +02:00
Tomasz Grabiec
f2e52c433f sstables: index_entry: Rename promoted_index_block_position to promoted_index_block_position_view 2020-06-16 16:15:23 +02:00
Tomasz Grabiec
d5bf540079 sstables: Abstract the clustering index cursor behavior
In preparation for supporting more than one algorithm for lookups in
the promoted index, extract relevant logic out of the index_reader
(which is a partition index cursor).

The clustered index cursor implementation is now hidden behind
abstract interface called clustered_index_cursor.

The current implementation is put into the
scanning_clustered_index_cursor. It's mostly code movement with minor
adjustments.

In order to encapsulate iteration over promoted index entries,
clustered_index_cursor::next_entry() was introduced.

No change in behavior intended in this patch.
2020-06-16 16:14:17 +02:00
Pavel Emelyanov
60e283b23e auth: Move away from storage_service
Now after the auth start/stop is standalone, we can remove
reference from storage service to it. This frees some tests
from the need to carry the auth service around for nothing.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-12 22:14:33 +03:00
Pavel Emelyanov
6a46721fb7 auth: Move start-stop code into main
The auth service management is currently sitting in storage
service, but it was needed there just for cql/thrift start
code. After the latters has been moved away there are no
other reasons for the auth to be integrated with the storage
service, so move it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-06-12 22:14:33 +03:00
Rafael Ávila de Espíndola
555d8fe520 build: Be consistent about system versus regular headers
We were not consistent about using '#include "foo.hh"' instead of
'#include <foo.hh>' for scylla's own headers. This patch fixes that
inconsistency and, to enforce it, changes the build to use -iquote
instead of -I to find those headers.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200608214208.110216-1-espindola@scylladb.com>
2020-06-10 15:49:51 +03:00
Dejan Mircevski
9027b6636f Use sstring_view in execute_cql and assertions
This lets the functions operate on a wider variety of arguments and
may also be faster.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-06-10 08:10:43 +03:00
Glauber Costa
3972628fc0 compaction: split compaction.hh header
compaction.hh is one of our heavy headers, but some users just want to
use information on it about how to describe a compaction, not how to
perform one.

For that reason this patch splits the compaction_descriptor into a new
header.

The compaction_descriptor has, as a member type, compaction_options.
That is moved too, and brings with it the compaction_type. Both of those
structures would make sense in a separate header anyway.

The compaction_descriptor also wants the creator_fn and replacer_fn
functions.  We also take this opportunity to rename them into something
more descriptive

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2020-06-08 16:06:00 -04:00
Avi Kivity
6f394e8e90 tombstone: use comparison operator instead of ad-hoc compare() function and with_relational_operators
The comparison operator (<=>) default implementation happens to exactly
match tombstone::compare(), so use the compiler-generated defaults. Also
default operator== and operator!= (these are not brought in by operator<=>).
These become slightly faster as they perform just an equality comparison,
not three-way compare.

shadowable_tombstone and row_tombstone depend on tombstone::compare(),
so convert them too in a similar way.

with_relational_operations.hh becomes unused, so delete it.

Tests: unit (dev)
Message-Id: <20200602055626.2874801-1-avi@scylladb.com>
2020-06-02 09:28:52 +03:00
Piotr Sarna
160e2b06f9 test: move random string helpers to .cc
... since there's no reason for them to reside in a header,
and .cc is our default destination.

Message-Id: <2509410f0f71df036a7829f1f799503c1a671404.1591078777.git.sarna@scylladb.com>
2020-06-02 09:27:59 +03:00
Avi Kivity
a4c44cab88 treewide: update concepts language from the Concepts TS to C++20
Seastar recently lost support for the experimental Concepts Technical
Specification (TS) and gained support for C++20 concepts. Re-enable
concepts in Scylla by updating our use of concepts to the C++20
standard.

This change:
 - peels off uses of the GCC6_CONCEPT macro
 - removes inclusions of <seastar/gcc6-concepts.hh>
 - replaces function-style concepts (no longer supported) with
   equation-style concepts
 - semicolons added and removed as needed
 - deprecated std::is_pod replaced by recommended replacement
 - updates return type constraints to use concepts instead of
   type names (either std::same_as or std::convertible_to, with
   std::same_as chosen when possible)

No attempt is made to improve the concepts; this is a specification
update only.
Message-Id: <20200531110254.2555854-1-avi@scylladb.com>
2020-06-02 09:12:21 +03:00
Piotr Sarna
91e02ed3ad test/lib: add generating random numeric string
Useful for testing random numeric inputs, e.g. big decimals.
2020-06-01 16:11:49 +02:00
Botond Dénes
c5b0e8a45a test: move thread-safe test macro alternatives to lib/test_utils.hh
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200529130706.149603-2-bdenes@scylladb.com>
2020-05-31 16:08:02 +03:00
Botond Dénes
d68ac8bf18 treewide: remove all uses of no_reader_permit() 2020-05-28 11:34:35 +03:00
Botond Dénes
b5aa08ed77 sstables: pass valid permits to all internal reads
We will soon require a valid permit for all reads, including low level
index reads. The sstable layer has several internal reads which can not
be associated with either the user or the system read semaphores or it
would be very hard to obtain the correct semaphore, for limited/no gain.
To be able to pass a valid permit still, we either expose a permit
parameter so upper layers can pass down one, or create a local semaphore
for these reads and use that to obtain a permit.
The following methods now require a permit to be passed to them:
* `sstables::sstabe::read_data()`: only used in tests.

The following methods use internal semaphores:
* `sstables::sstable::generate_summary()` used when loading an sstable.
* `sstables::sstable::has_partition_key()`: used by a REST API method.
2020-05-28 11:34:35 +03:00
Botond Dénes
734e995639 database: add compaction read concurrency semaphore
All reads will soon require a valid permit, including those done during
compaction. To allow creating valid permits for these reads create a
compaction specific semaphore. This semaphore is unlimited as compaction
concurrency is managed by higher level layer, we use just for resource
usage accounting.
2020-05-28 11:34:35 +03:00
Botond Dénes
a08467da29 test: move away from reader_concurrency_semaphore::wait_admission()
And use the reader_permit for this instead. This refactoring has
revealed a pre-existing bug in the `test_lifecycle_policy`, which is
also addressed in this patch. The bug is that said policy executes
reader destructions in the background, and these are not waited for. For
some reason, the semaphore -> permit transition pushes these races over
the edge and we start seeing some of these destruction fibers still
being unfinished when test scopes are exited, causing all sorts of
trouble. The solution is to introduce a special gate that tests can use
to wait for all background work to finish, before the test scope is
exited.
2020-05-28 11:34:35 +03:00
Botond Dénes
9ede82ebf8 memtable: pass a valid permit to the delegate reader
All reader are soon going to require a valid permit, so make sure we
have a valid permit which we can pass to the delegate reader when
creating it. This means `memtable::make_flat_reader()` now also requires
a permit to be passed to it.
Internally the permit is stored in `scanning_reader`, which is used both
for flushes and normal reads. In the former case a permit is not
required.
2020-05-28 11:34:35 +03:00
Botond Dénes
cc5137ffe3 table: require a valid permit to be passed to most read methods
Now that the most prevalent users (range scan and single partition
reads) all pass valid permits we require all users to do so and
propagate the permit down towards `make_sstable_reader()`. The plan is
to use this permit for restricting the sstable readers, instead of the
semaphore the table is configured with. The various
`make_streaming_*reader()` overloads keep using the internal semaphores
as but they also create the permit before the read starts and pass it to
`make_sstable_reader()`.
2020-05-28 11:34:35 +03:00
Botond Dénes
0ee58d1d47 test: lib/reader_permit.hh: add make_query_class_config()
To be used by tests to obtain a query_class_config to pass to APIs that
require one. The class config contains the test semaphore.
2020-05-28 11:34:35 +03:00
Botond Dénes
97af2d98d2 test: lib: introduce reader_permit.{hh,cc}
This contains a reader concurrency semaphore for the tests, that they
can use to obtain a valid permit for reads. Soon we are going to start
working towards a point where all APIs taking a permit will require a
valid one. Before we start this work we must ensure test code is able to
obtain a valid permit.
2020-05-28 11:34:35 +03:00
Pavel Emelyanov
70391feb8e storage_service: Tossing bits around
The goal is to have main.cc add code between prepare_to_join
and join_token_ring. As a side effect this drives us closer
to proper split of storage service into sharded service itslef
vs start/boot/join code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-05-25 13:21:08 +03:00
Pavel Emelyanov
bb3a71529a features: Get rid of per-features booleans
The set of bool enable_something-s on feature_fonfig duplicates
the disabled_features set on it, so remove the former and make
full use of the latter.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-05-25 13:09:12 +03:00
Glauber Costa
7423ccc318 compaction_manager: allow early aborts through abort sources.
The shutdown process of compaction manager starts with an explicit call
from the database object. However that can only happen everything is
already initialized. This works well today, but I am soon to change
the resharding process to operate before the node is fully ready.

One can still stop the database in this case, but reshardings will
have to finish before the abort signal is processed.

This patch passes the existing abort source to the construction of the
compaction_manager and subscribes to it. If the abort source is
triggered, the compaction manager will react to it firing and all
compactions it manages will be stopped.

We still want the database object to be able to wait for the compaction
manager, since the database is the object that owns the lifetime of
the compaction manager. To make that possible we'll use a future
that is return from stop(): no matter what triggered the abort, either
an early abort during initial resharding or a database-level event like
drain, everything will shut down in the right order.

The abort source is passed to the database, who is responsible from
constructing the compaction manager.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2020-05-13 16:51:25 -04:00
Glauber Costa
e29701ca1c compaction_manager: expand state to be able to differentiate between enabled and stopped
We are having many issues with the stop code in the compaction_manager.
Part of the reason is that the "stopped" state has its meaning overloaded
to indicate both "compaction manager is not accepting compactions" and
"compaction manager is not ready or destructed".

In a later step we could default to enabled-at-start, but right now we
maintain current behavior to minimize noise.

It is only possible to stop the compaction manager once.
It is possible to enable / disable the compaction manager many times.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2020-05-13 16:51:25 -04:00
Botond Dénes
e0f5ef5ef0 test: lib/sstable_utils: add make_keys_for_shard
A variant of make_keys() which creates keys for the requested shard. As
this version is more generic than the existing local_shards_only
variant, the former is reimplemented on top of the latter.
2020-05-12 12:07:21 +03:00
Avi Kivity
5b971397aa Revert "compaction_manager: allow early aborts through abort sources."
This reverts commit e8213fb5c3. It results
in an assertion failure in remove_index_file_test.

Fixes #6413.
2020-05-10 12:32:18 +03:00
Glauber Costa
e8213fb5c3 compaction_manager: allow early aborts through abort sources.
The shutdown process of compaction manager starts with an explicit call
from the database object. However that can only happen everything is
already initialized. This works well today, but I am soon to change
the resharding process to operate before the node is fully ready.

One can still stop the database in this case, but reshardings will
have to finish before the abort signal is processed.

This patch passes the existing abort source to the construction of the
compaction_manager and subscribes to it. If the abort source is
triggered, the compaction manager will react to it firing and all
compactions it manages will be stopped.

We still want the database object to be able to wait for the compaction
manager, since the database is the object that owns the lifetime of
the compaction manager. To make that possible we'll use a future
that is return from stop(): no matter what triggered the abort, either
an early abort during initial resharding or a database-level event like
drain, everything will shut down in the right order.

The abort source is passed to the database, who is responsible from
constructing the compaction manager.

Tests: unit (dev), manual start+stop, manual drain + stop

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20200506184749.98288-1-glauber@scylladb.com>
2020-05-07 13:24:47 +03:00
Avi Kivity
2b0c317dec test: lib: exception_utils: fix crash with fmt-6.2.0
fmt, the formatting library we use, detects types with conversion
to std::string_view (and formats them as strings) and types that
support operator<<(std::ostream, const T&) (and performs custom
formatting on them). However, if <fmt/ostream.h>, the latter is
not done.

The problem happens with seastar::sstring, which implements both,
and debug mode, which disables inlining. Some translation units
do include <fmt/ostream.h>, and so generate code to do custom
formatting. exception_utils.cc doesn't, and so generates code
to format via string_view conversion. At link time, the
compiler picks one of the generated functions and includes it
in the final binary; it happened to pick one generated outside
exception_utils.cc, using custom formatting.

However, there is also code in fmt to encode which path fmt
chose - string_view or custom. This code is constexpr and so
is evaluated in exception_utils.cc. The result is that the
function to perform formatting of seastar::sstring uses custom
formatting, while the descriptor containing the method used
says it is formatting via string_view. This is enough to cause
a crash.

The problem is limited to debug mode, since in other modes
all this code is inlined, and so is consistent within the
translation unit.

We need a more general fix (hopefully in fmt), but for now a
simple fix is to add the missing include.

Ref https://github.com/fmtlib/fmt/issues/1662
2020-05-07 08:59:02 +03:00
Calle Wilund
08d069f78d messaging_service: Use reloadable TLS certificates
Changes messaging service rpc to use reloadable tls
certificates iff tls is enabled-

Note that this means that the service cannot start
listening at construction time if TLS is active,
and user need to call start_listen_ex to initialize
and actually start the service.

Since "normal" messaging service is actually started
from gms, this route too is made a continuation.
2020-05-04 11:32:21 +00:00
Tomasz Grabiec
3e74dd4df3 sstables: Move all_sstable_versions to version.hh 2020-04-17 11:34:02 +02:00
Avi Kivity
88ade3110f treewide: replace calls to engine().some_api() with some_api()
This removes the need to include reactor.hh, a source of compile
time bloat.

In some places, the call is qualified with seastar:: in order
to resolve ambiguities with a local name.

Includes are adjusted to make everything compile. We end up
having 14 translation units including reactor.hh, primarily for
deprecated things like reactor::at_exit().

Ref #1
2020-04-05 12:46:04 +03:00
Avi Kivity
5e32ecb514 test: sstable-utils: deinline do_make_keys()
This hides a call to engine_is_ready() which is only available in
reactor.hh.

Dependencies are adjusted so tests link.

Ref #1.
2020-04-05 12:46:04 +03:00
Rafael Ávila de Espíndola
3f3634ece1 test: Use feature_config_from_db_config to setup feature_config
This reduces code duplication and uses the same code path that is used
in scylla itself.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200403170235.113558-1-espindola@scylladb.com>
2020-04-03 19:59:00 +02:00
Pekka Enberg
75b55cea88 Merge "Resharding through compact sstables" from Glauber
"
This patchseries is part of my effort to make resharding less special -
and hopefully less problematic.  The next steps are a bit heavy, so I'd
like to, if possible, get this out of the way.

After these two patches, there is no more need to ever call
reshard_sstables: compact_sstables will do, and it will be able to
recognize resharding compactions.

To do that we need to unify the creator function, which is trivially
done by adding a shard parameter to regular compactions as well: they
can just ignore it. I have considered just making the
compaction_descriptor have a virtual create() function and specializing
it, but because we have to store the creator in the compaction object I
decided to keep the virtual function for now.

In a later cleanup step, if we can for instance store the entire
compaction_descriptor object in the compaction object we could do that.

Reviewed-by: Benny Halevy <bhalevy@scylladb.com>
Reviewed-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Reviewed-by: Botond Dénes <bdenes@scylladb.com>
Tests: unit tests (dev), dtest (resharding.py)
"

* 'resharding-through-compact-sstables' of github.com:glommer/scylla:
  resharding: get rid of special reshard_sstables
  compaction: enhance compaction_descriptor with creator and replace function
2020-04-02 14:43:35 +02:00
Konstantin Osipov
9948f548a5 lwt: remove Paxos from experimental list
Always enable lightweight transactions. Remove the check for the command
line switch from the feature service, assuming LWT is always enabled.

Remove the check for LWT from Alternator.

Note that in order for the cluster to work with LWT, all nodes need
to support it.

Rename LWT to UNUSED in db/config.hh, to keep accepting lwt keyword in
--experimental-features command line option, but do nothing with it.

Changes in v2:
* remove enable_lwt feature flag, it's always there

Closes #6102

test: unit (dev, debug)
Message-Id: <20200401071149.41921-1-kostja@scylladb.com>
2020-04-01 09:12:21 +02:00
Glauber Costa
e8801cd77b compaction: enhance compaction_descriptor with creator and replace function
There are many differences between resharding and compaction that are
artificial, arising more from the way we ended up implementing it than
necessity. This patch attempts to pass the creator and replacer functions
through the compaction_descriptor.

There is a difference between the creator function for resharding and
regular compaction: resharding has to pass the shard number on behalf
of which the SSTable is created. However regular compactions can just
ignore this. No need to have a special path just for this.

After this is done, the constructor for the compaction object can be
greatly simplified. In further patches I intend to simplify it a bit
further, but some more cleanup has to happen first.

To make that happen we have to construct a compaction_descriptor object
inside the resharding function. This is temporary: resharding currently
works with a descriptor, but at some point that descriptor is lost and
broken into pieces to be passed to this function. The overarching goal
of this work is exactly to be able to keep that descriptor for as long
as possible, which should simplify things a lot.

Callers are patched, but there are plenty for sstable_datafile_test.cc.
For their benefit, a helper function is provided to keep the previous
signature (test only).

Signed-off-by: Glauber Costa <glauber@scylladb.com>
2020-03-31 19:41:25 -04:00
Piotr Jastrzebski
c44f019eee dummy_sharder: rename dummy_sharding_info.* to dummy_sharder.*
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-03-30 18:42:33 +02:00
Piotr Jastrzebski
e72696a8e6 sharding_info: rename the class to sharder
Also rename all variables that were named si or sinfo
to sharder.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-03-30 18:42:33 +02:00
Piotr Jastrzebski
2e850421a0 i_partitioner:remove embeded sharding_info
sharding_info embeded into partitioner is no longer
used anywhere and can be removed.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-03-30 18:42:33 +02:00
Piotr Jastrzebski
14ad965733 sstable-utils: use sharding_info::shard_of
Create sharding_info with the same parameters as
the partitioner and use it instead of the partitioner.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-03-30 18:42:33 +02:00
Piotr Jastrzebski
dc2e060313 create_token_range_from_keys: use sharding info for shard_of
Replace i_partitioner::shard_of with sharding_info::shard_of

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-03-30 18:42:33 +02:00
Piotr Jastrzebski
41591f15d2 tests: rename dummy_partitioner.* to dummy_sharding_info.*
dummy_partitioner was renamed to dummy_sharding_info in
the previous patch. This patch cleans up the names of
files. It's done in a separate patch to not obstruct
the diff of previous patch.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-03-30 18:42:33 +02:00
Piotr Jastrzebski
031f589dba multishard_combining_reader: use token_for_next_shard from sharding info not partitioner
Previously this function was accessing sharding logic
through partitioner obtained from the schema.

While converting tests, dummy_partitioner is turned into
dummy_sharding_info.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2020-03-30 18:42:25 +02:00
Rafael Ávila de Espíndola
c5795e8199 everywhere: Replace engine().cpu_id() with this_shard_id()
This is a bit simpler and might allow removing a few includes of
reactor.hh.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200326194656.74041-1-espindola@scylladb.com>
2020-03-27 11:40:03 +03:00
Botond Dénes
ec36c7cb2f test: random_schema: remove redundant gc grace period from tombstone expiry
Compaction automatically adds gc grace period to expiry times already,
no need to add it when creating the tombstones. Remove the redundant
additions form the code. The direct impact is really minor as this is
only used in tests, but it might confuse readers who are looking at how
tombstones are created across the codebase.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200323120948.92104-1-bdenes@scylladb.com>
2020-03-23 15:12:25 +02:00
Avi Kivity
0d885dbb00 Merge "Make all headers standalone" from Botond
"
Make sure all headers compile on their own, without requiring any
additional includes externally.

Even though this requirement is not documented in our coding guides it
is still quasi enforced and we semi-regularly get and merge patches
adding missing includes to headers.

This patch-set fixes all headers and adds a `{mode}-headers` target that
can be used to verify each header. This target should be built by
promotion to ensure no new non-conforming code sneaks in.
Individual headers can be verified using the
`build/dev/path/to/header.hh.o` target, that is generated for every
header.

The majority of the headers was just missing `seastarx.hh`. I think we
should just include this via a compiler flag to remove the noise from
our code (in a followup).
"

* 'compiling-headers/v2' of https://github.com/denesb/scylla:
  configure.py: add {mode}-headers phony target
  treewide: add missing headers and/or forward declarations
  test/boost/sstable_test.hh: move generic stuff to test/lib/sstable_utils.hh
  sstables: size_tiered_backlog_tracker: move methods out-of-line
  sstables: date_tiered_compaction_strategy.hh: move methods out-of-line
2020-03-23 13:09:09 +02:00
Botond Dénes
e0284bb9ee treewide: add missing headers and/or forward declarations 2020-03-23 09:29:45 +02:00
Botond Dénes
575466b2cf test/boost/sstable_test.hh: move generic stuff to test/lib/sstable_utils.hh
sstable_test.hh started as collection of utilities shared between the
various `_sstable_test.cc` files. Predictably other tests started using
it as well, among them some that are non boost unit tests. This poses a
problem as if we add the missing boost/test/unit_test.hpp include to
sstable_test.hh these tests will suddenly have missing symbols from
boost::test. To avoid linking boost::test into all these users, extract
utilities more widely used into sstable_utils.hh
2020-03-23 09:29:45 +02:00