Commit Graph

2133 Commits

Author SHA1 Message Date
Calle Wilund
e545b382bd commitlog: coroutinize segment::allocate 2021-07-19 08:17:33 +00:00
Nadav Har'El
5183e0cbe9 Merge 'Fix artificial view update size limit' from Piotr Sarna
The series which split the view update process into smaller parts
accidentally put an artificial 10MB limit on the generated mutation
size, which is wrong: this limit is configurable by users,
and, more importantly, this data was already validated when
it was inserted into the base table. Thus, the limit is lifted.

The series comes with a cql-pytest which failed before the fix and succeeds now. This bug is also covered by the `wide_rows_test.py:TestWideRows_with_LeveledCompactionStrategy.test_large_cell_in_materialized_view` dtest, but it needs over a minute to run, as opposed to cql-pytest's <1 second.

Fixes #9047

Tests: unit(release), dtest(wide_rows_test.py:TestWideRows_with_LeveledCompactionStrategy.test_large_cell_in_materialized_view)

Closes #9048

* github.com:scylladb/scylla:
  cql-pytest: add a materialized views suite with first cases
  db,view: drop the artificial limit on view update mutation size
2021-07-15 17:03:07 +03:00
Piotr Sarna
697e2fc66d db,view: drop the artificial limit on view update mutation size
The series which split the view update process into smaller parts
accidentally put an artificial 10MB limit on the generated mutation
size, which is wrong: this limit is configurable by users,
and, more importantly, this data was already validated when
it was inserted into the base table. Thus, the limit is lifted.

Tests: unit(release), dtest(wide_rows_test)
2021-07-15 14:09:37 +02:00
Botond Dénes
1b7eea0f52 reader_concurrency_semaphore: admission: flip the switch
This patch flips two "switches":
1) It switches admission to be up-front.
2) It changes the admission algorithm.

(1) By now all permits are obtained up-front, so this patch just yanks
out the restricted reader from all reader stacks and simultaneously
switches all `obtain_permit_nowait()` calls to `obtain_permit()`. By
doing this, admission is now waited on when the permit is created.

(2) We switch to an admission algorithm that adds a new aspect to the
existing resource-availability check: the number of used/blocked reads.
Namely, it only admits new reads if, in addition to the necessary amount
of resources being available, all currently used readers are blocked. In
other words, we only admit new reads if all currently admitted reads
require something other than CPU to progress. They are either waiting
on I/O, a remote shard, or attention from their consumers (not used
currently).

We flip these two switches at the same time because up-front admission
means cache reads now need to obtain a permit too. For cache reads the
optimal concurrency is 1. Anything above that just increases latency
(without increasing throughput). So we want to make sure that if a
read hits the cache, it doesn't get any competition for CPU and can run
to completion. We admit new reads only if the read misses and has to go
to disk.

Another change made to accommodate this switch is the replacement of the
replica-side read execution stages with an execution stage inside the
reader concurrency semaphore. This replacement is needed because with
the introduction of up-front admission, reads are not independent of
each other anymore. One read's execution can influence whether later
reads will be admitted or not, and execution stages require
independent operations to work well. By moving the execution stage into
the semaphore, we have an execution stage which is in control of both
admission and running the operations in batches, avoiding the bad
interaction between the two.
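
A minimal sketch of the admission rule, assuming hypothetical counters on the semaphore (the real reader_concurrency_semaphore tracks more state than this):

```
#include <cstdint>

struct admission_state {
    int64_t available_memory;
    int64_t available_count;
    unsigned used_permits;     // admitted reads that may be using the CPU
    unsigned blocked_permits;  // admitted reads waiting on I/O, a remote shard, etc.
};

bool can_admit(const admission_state& s, int64_t memory, int64_t count) {
    const bool have_resources = s.available_memory >= memory
                             && s.available_count >= count;
    // Admit only when every currently used read is blocked on something other
    // than CPU, so a cache-hitting read keeps the CPU and runs to completion.
    const bool all_used_blocked = s.used_permits == s.blocked_permits;
    return have_resources && all_used_blocked;
}
```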
2021-07-14 17:19:02 +03:00
Botond Dénes
7bfa40a2f1 treewide: use make_tracking_only_permit()
For all those reads that don't (won't or can't) pass through admission
currently.
2021-07-14 17:19:02 +03:00
Botond Dénes
f28b5018f2 view/view_update_generator: use obtain_reader_permit() 2021-07-14 16:48:43 +03:00
Botond Dénes
ea2345c944 db/size_estimates_virtual_reader: mark as blocked when obtaining local ranges 2021-07-14 16:48:43 +03:00
Nadav Har'El
1ff1c3735b Merge 'Remove the mutation-based restriction checks' from Piotr Sarna
This series unifies the interface for checking if CQL restrictions are satisfied. Previously, an additional mutation-based approach was added in the materialized views layer, but the decision was reached that it's better to have a single API based on partition slices. With that, the regular selection path gets simplified at the cost of a more complicated view generation path, which is a good tradeoff.
Note that in order to unify the interface, the view layer performs ugly transformations to adjust the input for `is_satisfied_by`. Reviewers, please take a close look at this code (`matches_view_filter`, `clustering_prefix_matches`, `partition_key_matches`), because it looks error-prone and relies on dirty internals of our serialization layer. If somebody has a better suggestion on how to do the transformation, I'm all ears.

Tests: unit(release), manual(playing with materialized views with custom filters)
Fixes #7215

Closes #8979

* github.com:scylladb/scylla:
  db,view,table: drop unneeded time point parameter
  cql3,expr: unify get_value
  cql3,expr: purge mutation-based is_satisfied_by
  db,view: migrate key checks from the deprecated is_satisfied_by
  db,view: migrate checking view filter to new is_satisfied_by
  db,view: add a helper result builder class
  db,view: move make_partition_slice helper function up
2021-07-13 12:42:13 +03:00
Piotr Sarna
a1813c9b34 db,view,table: drop unneeded time point parameter
Now that restriction checking is translated to the partition-slice-style
interface, checking the partition/clustering key restrictions for views
can be performed without the time point parameter.
The parameter is dropped from all relevant call sites.
2021-07-13 10:40:08 +02:00
Piotr Sarna
37fc3f4b5b db,view: migrate key checks from the deprecated is_satisfied_by
The last two users of the mutation-based is_satisfied_by function
were the partition/clustering key checks. These functions are now
translated to use the original API.
2021-07-13 10:40:07 +02:00
Piotr Sarna
d6b0a8338a db,view: migrate checking view filter to new is_satisfied_by
In order to unify the interfaces, the mutation-based flavor of
is_satisfied_by is being deprecated. To be able to remove it,
one of its biggest users, the matches_view_filter() function,
is translated to the other variant.
2021-07-13 10:04:03 +02:00
Piotr Sarna
786db7e9a8 db,view: add a helper result builder class
In order to migrate from mutation-based restriction checks,
code in view.cc needs to have a way of translating results
to partition-slice-based representation.
A slightly simplified builder from multishard_mutation_query.cc
is injected into the view code.
2021-07-13 10:04:03 +02:00
Piotr Sarna
32d87837b1 db,view: move make_partition_slice helper function up
No functional changes; it will be needed by a future patch.
2021-07-13 10:04:02 +02:00
Tomasz Grabiec
e947fac74c database: Fix cache metrics not being registered
Introduced in 6a6403d. The default constructor with dummy_app_stats is
also used by production code.

Fixes #9012
Message-Id: <20210712221447.71902-1-tgrabiec@scylladb.com>
2021-07-13 07:50:44 +03:00
Avi Kivity
222ef17305 build, treewide: enable -Wredundant-move
Returning a function parameter does not get copy elision, but the
parameter is treated as an rvalue and implicitly moved, so an explicit
std::move() is redundant. Enable -Wredundant-move to warn us that the
move is unneeded, and gain slightly more readable code. A few violations
are trivially adjusted.
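
A hedged illustration of the pattern this warning flags (example code, not taken from the tree):

```
#include <string>
#include <utility>

std::string consume(std::string s) {
    // -Wredundant-move: 'return s;' already treats the parameter as an
    // rvalue, so the explicit std::move() adds nothing.
    return std::move(s);
}
```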

Closes #9004
2021-07-11 12:53:02 +03:00
Avi Kivity
9059514335 build, treewide: enable -Wpessimizing-move warning
This warning prevents using std::move() where it can hurt
- on an unnamed temporary or a named automatic variable being
returned from a function. In both cases the value could be
constructed directly in its final destination, but std::move()
prevents it.

Fix the handful of cases (all trivial), and enable the warning.
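
A hedged illustration of what this warning catches (example code, not taken from the tree):

```
#include <string>
#include <utility>

std::string build() {
    std::string s = "hello";
    // -Wpessimizing-move: moving a local on return prevents constructing it
    // directly in the caller's storage (NRVO).
    return std::move(s);
}
```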

Closes #8992
2021-07-08 17:52:34 +03:00
Piotr Sarna
bf0777e97a view: generate view updates in smaller parts
In order to avoid large allocations and overly large mutations
generated from large view updates, the granularity of the process
is reduced from per-partition to smaller chunks.
The view update builder now produces partial updates, no more
than 100 view rows at a time.
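
A minimal sketch of the consumption loop, with hypothetical names (the real view_update_builder API differs in detail):

```
#include <optional>
#include <utility>

// Stand-in for a batch of at most ~100 view rows' worth of mutations.
struct update_batch {};

template <typename Builder, typename Apply>
void apply_in_parts(Builder& builder, Apply apply) {
    // Produce and apply bounded batches instead of one mutation per
    // partition, limiting both allocation size and mutation size.
    while (std::optional<update_batch> batch = builder.build_some()) {
        apply(std::move(*batch));
    }
}
```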
2021-07-08 11:17:27 +02:00
Piotr Sarna
679dc4d824 db,view: move view_update_builder to the header
The builder is going to be used directly by the callers,
which requires making its definition public.
No semantic changes were intended.
2021-07-08 11:17:27 +02:00
Avi Kivity
99d5355007 Merge "Cache sstable indexes in memory" from Tomasz
"
The main goal of this series is to improve the efficiency of reads from large partitions by
reducing the amount of I/O needed to read the sstable index. This is achieved by caching
index file pages and partition index entries in memory.

Currently, the pages are cached by individual reads only for the duration of the read.
This was done to facilitate binary search in the promoted index (intra-partition index).
After this series, all reads share the index file page cache, which stays around even after reads stop.

The page cache is subject to eviction. It uses the same region as the current row cache and shares
the LRU with row cache entries. This means that LRU objects need to be virtualized. This series takes
an easy approach and does this by introducing a virtual base class. This adds overhead to the row
cache entry to store the vtable pointer.

SSTable indexes have a hierarchy. There is a summary, which is a sparse partition key index into the
full partition index. This one is already kept in memory. The partition index is divided by the summary
into pages. Each entry in the partition index contains a promoted index, which is a sparse index into atoms
identified by the clustering key (rows, tombstones).

In order to read the promoted index, the reader needs to read the partition index entry first.
To speed this up, this series also adds caching of partition index entries. This cache survives
reads and is subject to eviction, just like the index file page cache. The unit of caching is
the partition index page. Without this cache, each access to the promoted index would have to be
preceded by parsing the partition index page containing the partition key.
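
A rough sketch of that hierarchy, with hypothetical types (not the actual sstables:: definitions):

```
#include <cstdint>
#include <vector>

// Sparse index into atoms (rows, tombstones) within one partition.
struct promoted_index_entry {
    uint64_t position;  // offset into the data file
};

// One partition's entry in the partition index, with its promoted index.
struct partition_index_entry {
    std::vector<promoted_index_entry> promoted_index;
};

// The unit of caching in the partition index cache.
struct partition_index_page {
    std::vector<partition_index_entry> entries;
};

// Sparse, always-in-memory partition key index pointing at index pages.
struct summary_sketch {
    std::vector<uint64_t> page_offsets;
};
```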

Performance testing results follow.

1) scylla-bench large partition reads

  Populated with:

        perf_fast_forward --run-tests=large-partition-skips --datasets=sb-large-part-ds1 \
            -c1 -m1G --populate --value-size=1024 --rows=10000000

  Single partition, 9G data file, 4MB index file

  Test execution:

    build/release/scylla -c1 -m4G
    scylla-bench -workload uniform -mode read -limit 1 -concurrency 100 -partition-count 1 \
       -clustering-row-count 10000000 -duration 60m

  TL;DR: after: 2x throughput, 0.5x median latency

    Before (c1daf2bb24):

    Results
    Time (avg):	 5m21.033180213s
    Total ops:	 966951
    Total rows:	 966951
    Operations/s:	 3011.997048812112
    Rows/s:		 3011.997048812112
    Latency:
      max:		 74.055679ms
      99.9th:	 63.569919ms
      99th:		 41.320447ms
      95th:		 38.076415ms
      90th:		 37.158911ms
      median:	 34.537471ms
      mean:		 33.195994ms

    After:

    Results
    Time (avg):	 5m14.706669345s
    Total ops:	 2042831
    Total rows:	 2042831
    Operations/s:	 6491.22243800942
    Rows/s:		 6491.22243800942
    Latency:
      max:		 60.096511ms
      99.9th:	 35.520511ms
      99th:		 27.000831ms
      95th:		 23.986175ms
      90th:		 21.659647ms
      median:	 15.040511ms
      mean:		 15.402076ms

2) scylla-bench small partitions

  I tested several scenarios with a varying data set size, e.g. data fully fitting in memory,
  half fitting, and being much larger. The improvement varied a bit but in all cases the "after"
  code performed slightly better.

  Below is a representative run over data set which does not fit in memory.

  scylla -c1 -m4G
  scylla-bench -workload uniform -mode read  -concurrency 400 -partition-count 10000000 \
      -clustering-row-count 1 -duration 60m -no-lower-bound

  Before:

    Time (avg):	 51.072411913s
    Total ops:	 3165885
    Total rows:	 3165885
    Operations/s:	 61988.164024260645
    Rows/s:		 61988.164024260645
    Latency:
      max:		 34.045951ms
      99.9th:	 25.985023ms
      99th:		 23.298047ms
      95th:		 19.070975ms
      90th:		 17.530879ms
      median:	 3.899391ms
      mean:		 6.450616ms

  After:

    Time (avg):	 50.232410679s
    Total ops:	 3778863
    Total rows:	 3778863
    Operations/s:	 75227.58014424688
    Rows/s:		 75227.58014424688
    Latency:
      max:		 37.027839ms
      99.9th:	 24.805375ms
      99th:		 18.219007ms
      95th:		 14.090239ms
      90th:		 12.124159ms
      median:	 4.030463ms
      mean:		 5.315111ms

  The results include the warmup phase which populates the partition index cache, so the hot-cache effect
  is dampened in the statistics. See the 99th percentile. Latency improves after the cache warms up, which
  pulls the percentiles lower.

3) perf_fast_forward --run-tests=large-partition-skips

    Caching is not used here, included to show there are no regressions for the cold cache case.

    TL;DR: No significant change

    perf_fast_forward --run-tests=large-partition-skips --datasets=large-part-ds1 -c1 -m1G

    Config: rows: 10000000, value size: 2000

    Before:

    read    skip      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
    1       0        36.429822            4  10000000     274500         62     274521     274429   153889.2 153883   19696986  153853       0        0        0        0        0        0        0  22.5%
    1       1        36.856236            4   5000000     135662          7     135670     135650   155652.0 155652   19704117  139326       1        0        1        1        0        0        0  38.1%
    1       8        36.347667            4   1111112      30569          0      30570      30569   155652.0 155652   19704117  139071       1        0        1        1        0        0        0  19.5%
    1       16       36.278866            4    588236      16214          1      16215      16213   155652.0 155652   19704117  139073       1        0        1        1        0        0        0  16.6%
    1       32       36.174784            4    303031       8377          0       8377       8376   155652.0 155652   19704117  139056       1        0        1        1        0        0        0  12.3%
    1       64       36.147104            4    153847       4256          0       4256       4256   155652.0 155652   19704117  139109       1        0        1        1        0        0        0  11.1%
    1       256       9.895288            4     38911       3932          1       3933       3930   100869.2 100868    3178298   59944   38912        0        1        1        0        0        0  14.3%
    1       1024      2.599921            4      9757       3753          0       3753       3753    26604.0  26604     801850   15071    9758        0        1        1        0        0        0  14.6%
    1       4096      0.784568            4      2441       3111          1       3111       3109     7982.0   7982     205946    3772    2442        0        1        1        0        0        0  13.8%

    64      1        36.553975            4   9846154     269359         10     269369     269337   155663.8 155652   19704117  139230       1        0        1        1        0        0        0  28.2%
    64      8        36.509694            4   8888896     243467          8     243475     243449   155652.0 155652   19704117  139120       1        0        1        1        0        0        0  26.5%
    64      16       36.466282            4   8000000     219381          4     219385     219374   155652.0 155652   19704117  139232       1        0        1        1        0        0        0  24.8%
    64      32       36.395926            4   6666688     183171          6     183180     183165   155652.0 155652   19704117  139158       1        0        1        1        0        0        0  21.8%
    64      64       36.296856            4   5000000     137753          4     137757     137737   155652.0 155652   19704117  139105       1        0        1        1        0        0        0  17.7%
    64      256      20.590392            4   2000000      97133         18      97151      94996   135248.8 131395    7877402   98335   31282        0        1        1        0        0        0  15.7%
    64      1024      6.225773            4    588288      94492       1436      95434      88748    46066.5  41321    2324378   30360    9193        0        1        1        0        0        0  15.8%
    64      4096      1.856069            4    153856      82893         54      82948      82721    16115.0  16043     583674   11574    2675        0        1        1        0        0        0  16.3%

    After:

    read    skip      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
    1       0        36.429240            4  10000000     274505         38     274515     274417   153887.8 153883   19696986  153849       0        0        0        0        0        0        0  22.4%
    1       1        36.933806            4   5000000     135377         15     135385     135354   155658.0 155658   19704085  139398       1        0        1        1        0        0        0  40.0%
    1       8        36.419187            4   1111112      30509          2      30510      30507   155658.0 155658   19704085  139233       1        0        1        1        0        0        0  22.0%
    1       16       36.353475            4    588236      16181          0      16182      16181   155658.0 155658   19704085  139183       1        0        1        1        0        0        0  19.2%
    1       32       36.251356            4    303031       8359          0       8359       8359   155658.0 155658   19704085  139120       1        0        1        1        0        0        0  14.8%
    1       64       36.203692            4    153847       4249          0       4250       4249   155658.0 155658   19704085  139071       1        0        1        1        0        0        0  13.0%
    1       256       9.965876            4     38911       3904          0       3906       3904   100875.2 100874    3178266   60108   38912        0        1        1        0        0        0  17.9%
    1       1024      2.637501            4      9757       3699          1       3700       3697    26610.0  26610     801818   15071    9758        0        1        1        0        0        0  19.5%
    1       4096      0.806745            4      2441       3026          1       3027       3024     7988.0   7988     205914    3773    2442        0        1        1        0        0        0  18.3%

    64      1        36.611243            4   9846154     268938          5     268942     268921   155669.8 155705   19704085  139330       2        0        1        1        0        0        0  29.9%
    64      8        36.559471            4   8888896     243135         11     243156     243124   155658.0 155658   19704085  139261       1        0        1        1        0        0        0  28.1%
    64      16       36.510319            4   8000000     219116         15     219126     219101   155658.0 155658   19704085  139173       1        0        1        1        0        0        0  26.3%
    64      32       36.439069            4   6666688     182954          9     182964     182943   155658.0 155658   19704085  139274       1        0        1        1        0        0        0  23.2%
    64      64       36.334808            4   5000000     137609         11     137612     137596   155658.0 155658   19704085  139258       2        0        1        1        0        0        0  19.1%
    64      256      20.624759            4   2000000      96971         88      97059      92717   138296.0 131401    7877370   98332   31282        0        1        1        0        0        0  17.2%
    64      1024      6.260598            4    588288      93967       1429      94905      88051    45939.5  41327    2324346   30361    9193        0        1        1        0        0        0  17.8%
    64      4096      1.881338            4    153856      81780        140      81920      81520    16109.8  16092     582714   11617    2678        0        1        1        0        0        0  18.2%

4) perf_fast_forward --run-tests=large-partition-slicing

    Caching enabled, each line shows the median run from many iterations

    TL;DR: We can observe a reduction in IO which translates to a reduction in execution time,
           especially for slicing in the middle of the partition.

    perf_fast_forward --run-tests=large-partition-slicing --datasets=large-part-ds1 -c1 -m1G --keep-cache-across-test-cases

    Config: rows: 10000000, value size: 2000

    Before:

    offset  read      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    allocs   tasks insns/f    cpu
    0       1         0.000491          127         1       2037         24       2109        127        4.0      4        128       2       2        0        1        1        0        0        0       157      80 3058208  15.0%
    0       32        0.000561         1740        32      56995        410      60031      47208        5.0      5        160       3       2        0        1        1        0        0        0       386     111  113353  17.5%
    0       256       0.002052          488       256     124736       7111     144762      89053       16.6     17        672      14       2        0        1        1        0        0        0      2113     446   52669  18.6%
    0       4096      0.016437           61      4096     249199        692     252389     244995       69.4     69       8640      57       5        0        1        1        0        0        0     26638    1717   23321  22.4%
    5000000 1         0.002171          221         1        461          2        466        221       25.0     25        268       3       3        0        1        1        0        0        0       638     376 14311524  10.2%
    5000000 32        0.002392          404        32      13376         48      13528      13015       27.0     27        332       5       3        0        1        1        0        0        0       931     432  489691  11.9%
    5000000 256       0.003659          279       256      69967        764      73130      52563       39.5     41        780      19       3        0        1        1        0        0        0      2689     825   93756  15.8%
    5000000 4096      0.018592           55      4096     220313        433     234214     218803       94.2     94       9484      62       9        0        1        1        0        0        0     27349    2213   26562  21.0%

    After:

    offset  read      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    allocs   tasks insns/f    cpu
    0       1         0.000229          115         1       4371         85       4585        115        2.1      2         64       1       1        1        0        0        0        0        0        90      31 1314749  22.2%
    0       32        0.000277         2174        32     115674       1015     128109      14144        3.0      3         96       2       1        1        0        0        0        0        0       319      62   52508  26.1%
    0       256       0.001786          576       256     143298       5534     179142     113715       14.7     17        544      15       1        1        0        0        0        0        0      2110     453   45419  21.4%
    0       4096      0.015498           61      4096     264289       2006     268850     259342       67.4     67       8576      59       4        1        0        0        0        0        0     26657    1738   22897  23.7%
    5000000 1         0.000415          233         1       2411         15       2456        234        4.1      4        128       2       2        1        0        0        0        0        0       199      72 2644719  16.8%
    5000000 32        0.000635         1413        32      50398        349      51149      46439        6.0      6        192       4       2        1        0        0        0        0        0       458     128  125893  18.6%
    5000000 256       0.002028          486       256     126228       3024     146327      82559       17.8     18       1024      13       4        1        0        0        0        0        0      2123     385   51787  19.6%
    5000000 4096      0.016836           61      4096     243294        814     263434     241660       73.0     73       9344      62       8        1        0        0        0        0        0     26922    1920   24389  22.4%

Future work:

 - Check the impact on non-uniform workloads. Caching sstable indexes takes space away from the row cache
   which may reduce the hit ratio.

 - Reduce memory footprint of partition index cache. Currently, about 8x bloat over the on-disk size.

 - Disable cache population for "bypass cache" reads

 - Add a switch to disable sstable index caching, per-node, maybe per-table

 - Better sstable index format. Current format leads to inefficiency in caching since only some elements of the cached
   page can be hot. A B-tree index would be more efficient. Same applies to the partition index. Only some elements in
   the partition index page can be hot.

 - Add a heuristic for reducing index file IO size when large partitions are anticipated. If we're bound by the disk's
   bandwidth it's wasteful to read the front of the promoted index using 32K IO; better to use 4K, which should cover the
   partition entry, and then let binary search read the rest.

In V2:

 - Fixed a perf_fast_forward regression in the number of IOs used to read a partition index page.
   The reader uses 32K reads, which were split by the page cache into 4K reads.
   Fixed by propagating IO size hints to the page cache and using a single IO to populate it.
   New patch: "cached_file: Issue single I/O for the whole read range on miss"

 - Avoid large allocations to store partition index page entries (due to managed_vector storage).
   There is a unit test which detects this and fails.
   Fixed by implementing chunked_managed_vector, based on chunked_vector.

 - fixed bug in cached_file::evict_gently() where the wrong allocation strategy was used to free btree chunks

 - Simplify region_impl::free_buf() according to Avi's suggestions

 - Fit segment_kind in segment_descriptor::_free_space and lift requirement that _buf_pointers emptiness determines the kind

 - Workaround sigsegv which was most likely due to coroutine miscompilation. Worked around by manipulating local object scope.

 - Wire up system/drop_sstable_caches RESTful API

 - Fix use-after-move on permit for the old scanning ka/la index reader

 - Fixed more cases of double open_data() in tests leading to assert failure

 - Adjusted cached_file class doc to account for changes in behavior.

 - Rebased

Fixes #7079.
Refs #363.
"

* tag 'sstable-index-caching-v2' of github.com:tgrabiec/scylla: (39 commits)
  api: Drop sstable index caches on system/drop_sstable_caches
  cached_file: Issue single I/O for the whole read range on miss
  row_cache: cache_tracker: Do not register metrics when constructed for tests
  sstables, cached_file: Evict cache gently when sstable is destroyed
  sstables: Hide partition_index_cache implementation away from sstables.hh
  sstables: Drop shared_index_lists alias
  sstables: Destroy partition index cache gently
  sstables: Cache partition index pages in LSA and link to LRU
  utils: Introduce lsa::weak_ptr<>
  sstables: Rename index_list to partition_index_page and shared_index_lists to partition_index_cache
  sstables, cached_file: Avoid copying buffers from cache when parsing promoted index
  cached_file: Introduce get_page_units()
  sstables: read: Document that primitive_consumer::read_32() is alloc-free
  sstables: read: Count partition index page evictions
  sstables: Drop the _use_binary_search flag from index entries
  sstables: index_reader: Keep index objects under LSA
  lsa: chunked_managed_vector: Adapt more to managed_vector
  utils: lsa: chunked_managed_vector: Make LSA-aware
  test: chunked_managed_vector_test: Make exception_safe_class standard layout
  lsa: Copy chunked_vector to chunked_managed_vector
  ...
2021-07-07 18:17:10 +03:00
Calle Wilund
ce45ffdffb commitlog: Use defensive copies of segment list in iterations
Fixes #8952

In 5ebf5835b0 we added a segment
prune after flushing, to deal with deadlocks in shutdown.
This means that calls that issue sync/flush-like ops "for-all"
need to operate on a defensive copy of the list.
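
A hedged sketch of the pattern, with hypothetical names (the real code lives in the commitlog segment_manager):

```
#include <seastar/core/do_with.hh>
#include <seastar/core/future.hh>
#include <seastar/core/loop.hh>
#include <vector>

// Iterate over a copy so that a concurrent prune of the segment list cannot
// invalidate the iteration while a sync/flush yields.
template <typename SegPtr>
seastar::future<> sync_all(const std::vector<SegPtr>& segments) {
    auto copy = segments;  // the defensive copy
    return seastar::do_with(std::move(copy), [] (std::vector<SegPtr>& segs) {
        return seastar::parallel_for_each(segs, [] (SegPtr& s) {
            return s->sync();
        });
    });
}
```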

Closes #8980
2021-07-07 13:30:37 +02:00
Avi Kivity
4c01a88c9d logalloc: do not capture backtraces by default in debug mode
logalloc has a built-in leak/double-free sanitizer, with the nice
feature of capturing backtraces to make error reports easy to
track down. But capturing backtraces is itself very expensive.

This patch makes backtrace capture optional, reducing database_test
runtime from 30 minutes to 20 minutes on my machine.

Closes #8978
2021-07-06 00:18:22 +02:00
Tomasz Grabiec
6a6403d19d row_cache: cache_tracker: Do not register metrics when constructed for tests
Some tests will create two cache_tracker instances because of one
being embedded in the sstable test env.

This would lead to double registration of metrics, which raises a
runtime error. Avoid this by not registering metrics in Prometheus in
tests at all.
2021-07-02 19:02:14 +02:00
Tomasz Grabiec
4b51e0bf30 row_cache: Move cache_tracker to a separate header
It will be needed by the sstable layer to get to the LRU and the
LSA region. Split to avoid inclusion of the whole row_cache.hh.
2021-07-02 10:25:58 +02:00
Calle Wilund
a40b6a2f54 commitlog: Use disk file alignment info (with lower value if possible)
Previously, the disk block alignment of segments was hardcoded (due to
really old code). Now we use the value as declared in the actual file
opened. If we are using a previously written file (i.e. o_dsync), we
can even use the sometimes smaller "read" alignment.

Also allow config to completely override this with a disk alignment
config option (not exposed to global config yet, but can be).
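
A hedged sketch of picking the alignment from the opened file; seastar::file does expose these alignments, but the surrounding logic is an assumption:

```
#include <seastar/core/file.hh>
#include <algorithm>
#include <cstdint>

uint64_t segment_alignment(seastar::file& f, bool previously_written) {
    uint64_t align = f.disk_write_dma_alignment();
    if (previously_written) {
        // A previously written (recycled) file can use the sometimes
        // smaller read alignment.
        align = std::min<uint64_t>(align, f.disk_read_dma_alignment());
    }
    return align;
}
```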

v2:
* Use overwrite alignment if doing only overwrite
* Ensure to adjust actual alignment if/when doing file wrapping

v3:
* Kill alignment config param. Useless and unsafe.

Closes #8935
2021-06-29 16:00:49 +03:00
Piotr Jastrzebski
1bdcef6890 features: assume MC_SSTABLE and UNBOUNDED_RANGE_TOMBSTONES are always enabled
These features have been around for over 2 years and every reasonable
deployment should have them enabled.

The only case in which those features could be disabled is when the user
had used the enable_sstables_mc_format config flag to disable the MC sstable
format. This case has been eliminated by removing the
enable_sstables_mc_format config flag.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-06-25 10:12:00 +02:00
Nadav Har'El
4d7f55a29f cql: add configurable restriction of DateTieredCompactionStrategy
DateTieredCompactionStrategy (DTCS) has been un-recommended for a long time
(users should use TimeWindowCompactionStrategy, TWCS, instead). This
patch adds a new configuration option - restrict_dtcs - which can be used
to restrict the ability to use DTCS in CREATE TABLE or ALTER TABLE
statements. This is part of a "safe mode" effort to allow an installation
to restrict operations which are un-recommended or dangerous.

The new restrict_dtcs option has three values: "true", "false", and "warn":

For the time being, "false" is still the default, and means DTCS is not
restricted and can still be used freely. We can easily change this
default in a followup patch.

Setting a value of "true" means that DTCS *is* restricted -
trying to create a table or alter a table with it will fail with an error.

Setting a value of "warn" will allow the create or alter operation, but
will warn the user - both with a warning message which will immediately
appear in cqlsh (for example), and with a log message.
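
A hedged sketch of the tri-state check, with hypothetical names (the real option is plumbed through the config and the CQL statement validation path):

```
#include <iostream>
#include <stdexcept>

enum class restriction { none, warn, fail };

void check_dtcs_allowed(restriction restrict_dtcs) {
    switch (restrict_dtcs) {
    case restriction::fail:  // restrict_dtcs: true
        throw std::invalid_argument(
            "DateTieredCompactionStrategy is restricted; "
            "use TimeWindowCompactionStrategy");
    case restriction::warn:  // restrict_dtcs: warn
        // Surfaced both as a client warning (visible in cqlsh) and a log line.
        std::cerr << "Warning: DateTieredCompactionStrategy is deprecated; "
                     "use TimeWindowCompactionStrategy\n";
        break;
    case restriction::none:  // restrict_dtcs: false (the current default)
        break;
    }
}
```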

Fixes #8914.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210624122411.435361-1-nyh@scylladb.com>
2021-06-24 20:59:27 +03:00
Kamil Braun
a3f3563828 storage_service: check for existing normal token owners before bootstrapping
The bootstrap procedure starts by "waiting for range setup", which means
waiting for a time interval specified by the `ring_delay` parameter (30s
by default) so the node can receive the tokens of other nodes before
introducing its own tokens.

However it may sometimes happen that the node doesn't receive the
tokens. There are no explicit checks for this. But the code may crash in
weird ways if the tokens-received assumption is false, and we are lucky
if it does crash (instead of, for example, allowing the node to
incorrectly bootstrap, causing data loss in the process).

Introduce an explicit check-and-throw-if-false: a bootstrapping node now
checks that there's at least one NORMAL token in the token ring, which
means that it had to have contacted at least one existing node
in the cluster, which means that it received the gossip application
states of all nodes from that node; in particular the tokens of all
nodes.
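
A hedged sketch of the new precondition; the counter argument is a stand-in for however the token ring is actually queried:

```
#include <cstddef>
#include <stdexcept>

void check_can_bootstrap(size_t normal_token_owners) {
    if (normal_token_owners == 0) {
        // We never learned any peer's NORMAL tokens during ring_delay, so we
        // cannot have received the gossip states of the existing cluster;
        // bootstrapping now could pick tokens blindly and lose data.
        throw std::runtime_error(
            "Found no NORMAL token owners; refusing to bootstrap");
    }
}
```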

Also add an assert in the CDC code which relies on that assumption
(and would cause weird division-by-zero errors if the assumption
were false; better to crash on an assert than that).

Ref #8889.

Closes #8896
2021-06-24 13:19:08 +03:00
Michael Livshin
9b9efb2b42 disable caching of the system.large_* tables
The cache of system.large_{partition,rows,cells} accumulates range
tombstones (https://github.com/scylladb/scylla/issues/7750), and those
range tombstones can be evicted only together with their partition
(https://github.com/scylladb/scylla/issues/3288).

Making the system.large_* tables uncached should work around the
problem until #3288 is fixed.

Fixes #8874
Refs #7750
Refs #3288

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Message-Id: <20210623171932.8837-1-michael.livshin@scylladb.com>
2021-06-24 12:26:45 +03:00
Avi Kivity
14252c8b71 Merge 'Commitlog: Handle disk usage and disk footprint discrepancies, ensuring we flush when needed (#8695) (v3)' from Calle Wilund
Fixes #8270

If we have an allocation pattern where we leave large parts of segments "wasted" (typically because the segment has empty space, but cannot hold the mutation being added), we can have a disk usage that is below the threshold, yet still get a disk footprint that is over the limit, causing new segment allocation to stall.

We need to take a few things into account:
1.) Need to include wasted space in the threshold check (see the sketch below). Whether or not the disk is actually used does not matter here.
2.) If we stall a segment alloc, we should just flush immediately. No point in waiting for the timer task.
3.) Need to adjust the thresholds a bit. Depending on sizes, we should probably consider starting to flush once we've used enough space to be in the last available segment, so a new one is hopefully available by the time we hit the limit.
4.) (v2) Must ensure discard/delete routines are executed. Because we can race with background disk syncs, we may need to
    issue segment prunes from end_flush() so we wake up actual file deletion/recycling.
5.) (v2) Shutdown must ensure discard/delete is run after we've disabled the background task etc.; otherwise we might fail to wake up the replenisher and get stuck in the gate.
6.) (v2) Recycling or deleting segments must be consistent, regardless of shutdown. For the same reason as above.
7.) (v3) Signal recycle/delete queues/promise on shutdown (with a recognized marker) to handle the edge case where we only have a single (allocating) segment in the list and cannot wake up the replenisher in any more civilized way.

Also fix an edge case (for tests) where we have too few segments to have an active one (i.e. we need to flush everything).

New attempt at this, should fix intermittent shutdown deadlocks in commitlog_test.
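
A hedged sketch of point 1 above, with hypothetical member names: count per-segment slack toward the flush threshold, not just bytes actually written:

```
#include <cstdint>

struct footprint_check {
    uint64_t on_disk_size = 0;     // bytes written to segment files
    uint64_t wasted_size = 0;      // unusable tail space left in segments
    uint64_t flush_threshold = 0;  // derived from the commitlog size limits

    bool should_flush() const {
        // Whether the wasted bytes ever reach the disk does not matter; they
        // consume segment capacity either way, so count them in the footprint.
        return on_disk_size + wasted_size >= flush_threshold;
    }
};
```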

Closes #8764

* github.com:scylladb/scylla:
  commitlog_test: Add test case for usage/disk size threshold mismatch
  commitlog_test: Improve test assertion
  commitlog: Add waitable future for background sync/flush
  commitlog: abort queues on shutdown
  commitlog: break out "abort" calls into member functions
  commitlog: Do explicit discard+delete in shutdown
  commitlog: Recycle or not should not depend on shutdown state
  commitlog: Issue discard_unused_segments on segment::flush end IFF deletable
  commitlog: Flush all segments if we only have one.
  commitlog: Always force flush if segment allocation is waiting
  commitlog: Include segment wasted (slack) size in footprint check
  commitlog: Adjust (lower) usage threshold
2021-06-24 12:03:26 +03:00
Piotr Dulikowski
de1679b1b9 hints: make hints concurrency configurable and reduce the default
Previously, hinted handoff had a hardcoded concurrency limit - at most
128 hints could be sent from a single shard at once. This commit makes
this limit configurable by adding a new configuration option:
`max_hinted_handoff_concurrency_per_shard`. This option can be updated
at runtime. Additionally, the default per-shard concurrency is lowered
to 8.

The motivation for reducing the concurrency was to mitigate the negative
impact hints may have on performance of the receiving node due to them
not being properly isolated with respect to I/O.

Tests:
- unit(dev)
- dtest(hintedhandoff_additional_test.py)

Refs: #8624

Closes #8646
2021-06-22 15:58:56 +02:00
Calle Wilund
d6113912cd commitlog: Add waitable future for background sync/flush
The commitlog timer issues un-waited syncs on all segments. If such
a sync takes too long we can end up keeping a segment alive across
a shutdown, causing the file to be left on disk even if it is actually
clean.

This adds a future in segment_manager that is "chained" with all
active syncs (hopefully just one), and ensures we wait for it
to complete in shutdown, before pruning and deleting segments.
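
A hedged sketch of the chaining, with hypothetical names:

```
#include <seastar/core/future.hh>

// The chain must also handle failed syncs; error handling is elided here.
struct sync_chain {
    seastar::future<> pending = seastar::make_ready_future<>();

    // Chain each un-waited background sync onto `pending`; shutdown awaits
    // `pending` before pruning and deleting segments.
    void add(seastar::future<> sync) {
        pending = pending.then([s = std::move(sync)]() mutable {
            return std::move(s);
        });
    }
};
```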
2021-06-21 06:01:19 +00:00
Pavel Emelyanov
96131349e8 schema_tables: Remove unused sharded<proxy> argument
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-18 20:19:35 +03:00
Pavel Emelyanov
64bb16af8a view_update_generator: Remove unused struct sstable_with_table
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-18 20:19:35 +03:00
Avi Kivity
b099e7c254 Merge "Untie hints managers and storage service" from Pavel E
"
The storage service is carried along by the storage proxy, the hints
resource manager and the hints managers (two of them) just to
subscribe the hints managers to lifecycle events (and stop
the subscription on shutdown) emitted by the storage service.

This dependency chain can be greatly simplified, since the
storage proxy is already subscribed to lifecycle events and
can kick the managers directly from its hooks.

tests: unit(dev),
       dtest.hintedhandoff_additional_test.hintedhandoff_basic_check_test(dev)
"

* 'br-remove-storage-service-from-hints' of https://github.com/xemul/scylla:
  hints: Drop storage service from managers
  hints: Do not subscribe managers on lifecycle events directly
2021-06-17 17:12:31 +03:00
Pavel Emelyanov
92a4278cd1 hints: Drop storage service from managers
The storage service pointer is only used to (un)subscribe
to (from) lifecycle events. Now that the subscription is gone,
the storage service pointer can go as well.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-17 15:09:36 +03:00
Pavel Emelyanov
acdc568ecf hints: Do not subscribe managers on lifecycle events directly
Managers sit on the storage proxy, which is already subscribed to
lifecycle events, so it can "notify" the hints managers directly.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-17 15:06:26 +03:00
Nadav Har'El
45c2442f49 Merge 'Avoid large allocs in mv update code' from Piotr Sarna
This series addresses #8852 by:
 * migrating to chunked_vector in view update generation code to avoid large allocations
 * reducing the number of futures kept in mutate_MV, tracking how many view updates were already sent

Combined with #8853, I was able to observe only large partition warnings in the logs for the reproducing code, without crashes, large allocations or reactor stall warnings. The reproducing code itself is not part of cql-pytest because I haven't yet figured out how to make it fast and robust.

Tests: unit(release)
Refs  #8852

Closes #8856

* github.com:scylladb/scylla:
  db,view: limit the number of simultaneous view update futures
  db,view: use chunked_vector for view updates
2021-06-17 14:01:38 +03:00
Avi Kivity
4d70f3baee storage_proxy: change unordered_set<inet_address> to small_vector in write path
The write paths in storage_proxy pass replica sets as
std::unordered_set<gms::inet_address>. This is a complex type, with
N+1 allocations for N members, so we change it to a small_vector (via
inet_address_vector_replica_set) which requires just one allocation, and
even zero when up to three replicas are used.

This change is more nuanced than the corresponding change to the read path
abe3d7d7 ("Merge 'storage_proxy: use small_vector for vectors of
inet_address' from Avi Kivity"), for two reasons:

 - there is a quadratic algorithm in
   abstract_write_response_handler::response(): it searches for a replica
   and erases it. Since this happens for every replica, it happens N^2/2
   times.
 - replica sets for writes always include all datacenters, while reads
   usually involve just one datacenter.

So, a write to a keyspace that has 5 datacenters (3 replicas each, 15 in
total) will invoke 15*(15-1)/2 = 105 compares.

We could remove this by sending the index of the replica in the replica
set to the replica and asking it to include the index in the response, but
I think that this is unnecessary. Those 105 compares need to be only
105/15 = 7 times cheaper than the corresponding unordered_set operations,
which they surely are. Handling a response after a cross-datacenter round
trip surely involves L3 cache misses, and a small_vector reduces these
to a minimum compared to an unordered_set with its bucket table, linked
list walking and management, and table rehashing.
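
A hedged illustration of the trade-off, with plain std:: types standing in for inet_address_vector_replica_set:

```
#include <algorithm>
#include <vector>

// Erasing a responding replica from a flat vector is O(N) per response
// (N^2/2 compares over all responses), but each step is a cheap compare over
// contiguous memory, versus the unordered_set's hashing, bucket-list walk
// and node deallocation, which cost cache misses.
template <typename Addr>
void remove_replica(std::vector<Addr>& targets, const Addr& who) {
    auto it = std::find(targets.begin(), targets.end(), who);
    if (it != targets.end()) {
        targets.erase(it);  // shifts the tail; fine for a few dozen replicas
    }
}
```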

Tests using perf_simple_query --write --smp 1 --operations-per-shard 1000000
 --task-quota-ms show two allocations removed (as expected) and a nice
reduction in instructions executed.

before: median 204842.54 tps ( 54.2 allocs/op,  13.2 tasks/op,   49890 insns/op)
after:  median 206077.65 tps ( 52.2 allocs/op,  13.2 tasks/op,   49138 insns/op)

Closes #8847
2021-06-17 13:46:40 +03:00
Avi Kivity
98cdeaf0f2 schema_tables: make the_merge_lock thread_local
the_merge_lock is global, which is fine now because it is only used
in shard 0. However, if we run multiple nodes in a single process,
there will be multiple shard 0s, and the_merge_lock will be accessed
from multiple threads. This won't work.

To fix, make it thread_local. It would be better to make it a member
of some controlling object, but there isn't one.
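
A hedged sketch; the real lock lives in schema_tables.cc, and the semaphore type is an assumption:

```
#include <seastar/core/semaphore.hh>

// thread_local gives each "shard 0" in a multi-node-in-one-process setup its
// own lock, instead of one global object shared across threads.
static thread_local seastar::semaphore the_merge_lock{1};
```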

Closes #8858
2021-06-17 13:41:11 +03:00
Avi Kivity
00ff3c1366 Merge 'treewide: add support for snapshot skip-flush option' from Benny Halevy
The option is provided by nodetool snapshot
https://docs.scylladb.com/operating-scylla/nodetool-commands/snapshot/
```
nodetool [(-h <host> | --host <host>)] [(-p <port> | --port <port>)]
         [(-pp | --print-port)] [(-pw <password> | --password <password>)]
         [(-pwf <passwordFilePath> | --password-file <passwordFilePath>)]
         [(-u <username> | --username <username>)] snapshot
         [(-cf <table> | --column-family <table> | --table <table>)]
         [(-kc <kclist> | --kc.list <kclist>)]
         [(-sf | --skip-flush)] [(-t <tag> | --tag <tag>)] [--] [<keyspaces...>]

-sf / --skip-flush    Do not flush memtables before snapshotting (snapshot will not contain unflushed data)
```

But it is currently ignored by scylla-jmx (scylladb/scylla-jmx#167)
and not supported at the API level.

This patch adds support for the option in advance,
from the API service level, down via snapshot_ctl,
to the table class and the snapshot implementation.

In addition, a corresponding unit test was added to verify
that taking a snapshot with `skip_flush` does not flush the memtable
(at the table::snapshot level).

Refs #8725

Closes #8726

* github.com:scylladb/scylla:
  test: database_test: add snapshot_skip_flush_works
  api: storage_service/snapshots: support skip-flush option
  snapshot: support skip_flush option
  table: snapshot: add skip_flush option
  api: storage_service/snapshots: add sf (skip_flush) option
2021-06-17 13:32:23 +03:00
Nadav Har'El
b6b4df9a47 heat-weighted load balancing: improve handling of near-perfect cache
Consider two nodes with almost-100% cache hit ratio, but not exactly
100%: one has 99.9% cache hits, the second 99.8%. Normally in HWLB we
want to equalize the miss rate in both nodes. So we send the first node
twice the number of requests we send to the second. But unless the disks
are extremely limited, this doesn't make sense: As a numeric example,
consider that we send 2000 requests to the first node and 1000 to the
second, just so the number of misses will be the same - 2 (0.1% and 0.2%
misses, respectively). At such low miss numbers, the assumption that the
disk reads are the slowest part of the operation is wrong, so trying to
equalize only this part is wrong.

So above some threshold hit rate, we should treat all hit rates as
equivalent. In the code we already had such a threshold - max_hit_rate,
but it was set to the incredibly high 0.999. We saw in actual user
runs (see issue #8815) that this threshold was too high - one node
received twice the amount of requests that another did - although both
had near-100% cache hit rates.

So in this patch we lower the max_hit_rate to 0.95. This will have two
consequences:

1. Two nodes with hit rates above 0.95 will be considered to have the
   same hit rate, so they will get equal amount of work - even if one
   has hit rate 0.98 and the other 0.99.

2. A cold node with hit rate 0.0 will get 5% of the work of a node with
   a perfect hit rate (capped at 0.95). This will allow the cold node to
   slowly warm up its cache. Before this patch, if the hot node happened
   to have a hit rate of 0.999 (the previous maximum), the cold node would
   get just 0.1% of the work and remain almost idle and fill its cache
   extremely slowly - which is a waste.
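
A hedged sketch of the clamped, miss-rate-equalizing weight (the real HWLB math is more involved; max_hit_rate is the knob being lowered here):

```
#include <algorithm>

constexpr float max_hit_rate = 0.95f;  // was 0.999

float load_weight(float hit_rate) {
    // Equalizing miss counts sends load inversely proportional to the miss
    // rate. Capping the hit rate bounds the hot/cold load ratio at
    // 1/(1-0.95) = 20, so a completely cold node still gets 5% of a hot
    // node's work and can warm up its cache.
    float miss_rate = 1.0f - std::min(hit_rate, max_hit_rate);
    return 1.0f / miss_rate;
}
```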

Fixes #8815.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210616180732.125295-1-nyh@scylladb.com>
2021-06-17 11:02:08 +02:00
Piotr Sarna
1fb831c8c1 db,view: limit the number of simultaneous view update futures
Previously the view update code generated a continuation for each
view update and stored them all in a vector. In certain cases
the number of updates can grow really large (to millions and beyond),
so it's better to only store a limited number of these futures
at a time.
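
A hedged sketch of one way to bound the in-flight futures with a semaphore; the actual patch tracks how many updates were already sent, and all names here are hypothetical:

```
#include <seastar/core/do_with.hh>
#include <seastar/core/future.hh>
#include <seastar/core/loop.hh>
#include <seastar/core/semaphore.hh>
#include <vector>

// Bound in-flight sends instead of storing one continuation per update.
// Error handling is elided in this sketch.
template <typename Update, typename Send>
seastar::future<> send_bounded(std::vector<Update>& updates, Send send,
                               size_t max_in_flight) {
    return seastar::do_with(seastar::semaphore(max_in_flight),
            [&updates, send, max_in_flight] (seastar::semaphore& sem) {
        return seastar::do_for_each(updates, [&sem, send] (Update& u) {
            // Waits once max_in_flight sends are outstanding.
            return seastar::get_units(sem, 1).then([&u, send] (auto units) {
                // Start the send in the background; its slot is released
                // when it completes.
                (void)send(u).finally([units = std::move(units)] {});
            });
        }).then([&sem, max_in_flight] {
            // Drain: reacquire every slot, i.e. wait for in-flight sends.
            return sem.wait(max_in_flight);
        });
    });
}
```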
2021-06-17 10:20:52 +02:00
Piotr Sarna
a7f7716ecf db,view: use chunked_vector for view updates
The number of view updates can grow large, especially in corner
cases like removing large base partitions. Chunked vector
prevents large allocations.
2021-06-17 10:15:17 +02:00
Tomasz Grabiec
6bdf8c4c46 Merge "raft: second series of preparatory patches for group 0 discovery" from Kostja
Miscellaneous preparatory patches for group 0 discovery.

* scylla-dev/raft-group-0-part-2-v4:
  raft: (service) servers map is gid -> server, not sid -> server
  system_keyspace: raft.group_id and raft_snapshots.group_id are TIMEUUID
  raft: (server) wait for configuration transition to complete
  raft: (server) implement raft::server::get_configuration()
  raft: (service) don't throw from schema state machine
  raft: (service) permit some scylla.raft cells to be empty
  raft: (service) properly handle failure to add a server
  raft: implement is_transient_error()
2021-06-17 00:15:40 +02:00
Calle Wilund
14559b5a86 commitlog: abort queues on shutdown
In case we only have a single segment active when shutting down,
the replenisher can be blocked even though we manually flush-deleted.

Add a signal type and abort queues using it to wake up waiters and
force them to check the shutdown status.
2021-06-16 15:35:56 +00:00
Calle Wilund
227b573cdf commitlog: break out "abort" calls into member functions 2021-06-16 15:35:56 +00:00
Calle Wilund
5cd9691f00 commitlog: Do explicit discard+delete in shutdown
When we are shutting down, before trying to close the gate,
we should issue a discard to ensure the replenish task is woken up.
2021-06-16 15:35:56 +00:00
Calle Wilund
03b8baaa8d commitlog: Recycle or not should not depend on shutdown state
If we are using recycling, we should always use recycle in
delete_segments; otherwise we can deadlock with the replenish
task, since it will be waiting for a segment, then shutdown is set,
we are called, and can't fulfil the alloc -> deadlock.
2021-06-16 15:35:56 +00:00
Calle Wilund
5ebf5835b0 commitlog: Issue discard_unused_segments on segment::flush end IFF deletable
If a segment, when finishing a flush call, is deletable, we should issue
a manual call to the discard function (which moves deletable segments off
the segment list) asap, since we otherwise are dependent on more calls
from flush handlers (memtable flush). And since we could have blocked
segment allocation, this can cause deadlocks, at least in tests.
2021-06-16 15:35:56 +00:00
Calle Wilund
cbddcf46aa commitlog: Flush all segments if we only have one.
Handle test cases with a borked config so we don't deadlock
in cases where we only have one segment in a commitlog.
2021-06-16 15:35:56 +00:00