1939 Commits

Author SHA1 Message Date
Botond Dénes
0bf07cde7b test/manual: migrate off the global test reader semaphore 2021-07-08 16:53:38 +03:00
Botond Dénes
18e0c40c5d test/unit: migrate off the global test reader semaphore 2021-07-08 16:53:38 +03:00
Botond Dénes
37a1e506b1 test/perf: migrate off the global test reader semaphore 2021-07-08 16:53:38 +03:00
Botond Dénes
2454811dd6 test/perf: perf.hh: add reader_concurrency_semaphore_wrapper
A convenient, self-closing wrapper for those perf tests that otherwise
have no way to stop the semaphore and wait for it to stop.
2021-07-08 16:53:38 +03:00
Botond Dénes
0e78399051 test/lib: migrate off the global test reader semaphore 2021-07-08 15:28:39 +03:00
Botond Dénes
5fff314739 test/lib/simple_schema: migrate off the global test reader semaphore 2021-07-08 15:28:39 +03:00
Botond Dénes
d520655730 test/lib/sstable_utils: migrate off the global test reader semaphore 2021-07-08 15:28:39 +03:00
Botond Dénes
3679418e62 test/lib/test_services: migrate off the global test reader semaphore 2021-07-08 15:28:39 +03:00
Botond Dénes
0acc4d63da test/lib/sstable_test_env: add reader_concurrency_semaphore member
To enable tests using the test env to conveniently create permits for
themselves, reducing the pain of migrating to local semaphores.
2021-07-08 15:28:39 +03:00
Botond Dénes
7174d1beee test/lib/cql_test_env: add make_reader_permit()
A convenience method that lets tests using the cql test env create a
permit, reducing the pain of migrating to local semaphores.
2021-07-08 15:28:39 +03:00
Botond Dénes
b739525fb6 test/lib: add reader_concurrency_semaphore.hh
Supplying a convenience semaphore wrapper, which stops the contained
semaphore when destroyed. It also provides a more convenient
`make_permit()`.  This class is intended to make the migration to local
semaphores less painful.
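
A minimal sketch of the self-stopping wrapper idea described above. The
names `toy_semaphore`, `toy_permit` and `semaphore_wrapper` are hypothetical
stand-ins, not the classes added by this commit, and `stop()` is synchronous
here purely to keep the example self-contained.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Toy stand-in for a reader concurrency semaphore (hypothetical, not Scylla's class).
class toy_semaphore {
    std::string _name;
    int _available;
    bool _stopped = false;
public:
    toy_semaphore(std::string name, int count)
        : _name(std::move(name)), _available(count) {}

    // Counts stop() calls across all instances so callers can observe them.
    static int& stop_calls() {
        static int n = 0;
        return n;
    }

    const std::string& name() const { return _name; }
    int available() const { return _available; }
    bool stopped() const { return _stopped; }

    void acquire() { assert(!_stopped && _available > 0); --_available; }
    void release() { ++_available; }
    void stop() { _stopped = true; ++stop_calls(); }
};

// A permit holds one unit of the semaphore for its lifetime.
class toy_permit {
    toy_semaphore* _sem;
public:
    explicit toy_permit(toy_semaphore& sem) : _sem(&sem) { _sem->acquire(); }
    toy_permit(toy_permit&& o) noexcept : _sem(o._sem) { o._sem = nullptr; }
    toy_permit(const toy_permit&) = delete;
    toy_permit& operator=(const toy_permit&) = delete;
    toy_permit& operator=(toy_permit&&) = delete;
    ~toy_permit() { if (_sem) { _sem->release(); } }
};

// Self-closing wrapper: owns a semaphore and stops it on destruction.
class semaphore_wrapper {
    toy_semaphore _sem;
public:
    explicit semaphore_wrapper(std::string name, int count = 10)
        : _sem(std::move(name), count) {}
    ~semaphore_wrapper() { _sem.stop(); }
    semaphore_wrapper(const semaphore_wrapper&) = delete;
    semaphore_wrapper& operator=(const semaphore_wrapper&) = delete;

    toy_semaphore& semaphore() { return _sem; }
    toy_permit make_permit() { return toy_permit(_sem); }
};
```

A test would declare a `semaphore_wrapper` as a local, obtain permits from
`make_permit()`, and rely on scope exit to stop the semaphore. Permits
declared after the wrapper are destroyed before it.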
2021-07-08 15:28:36 +03:00
Botond Dénes
b9a5fd57bf test/boost/sstable_test: migrate row counting tests to seastar thread
To facilitate further patching.
2021-07-08 12:38:21 +03:00
Botond Dénes
fb310ec6e7 test/boost/sstable_test: test_using_reusable_sst(): pass env to func
To facilitate further patching.
2021-07-08 12:38:19 +03:00
Botond Dénes
46d21e842d test/lib/reader_lifecycle_policy: add permit parameter to factory function
The factory method doesn't match the signature of
`reader_lifecycle_policy::make_reader()`; notably, the permit is missing.
Add it, as it is important that the wrapping evictable reader and the
underlying reader share the same permit.
2021-07-08 12:31:36 +03:00
Botond Dénes
2a45d643b6 test/boost/mutation_reader_test: share permit between readers in a read
Permits were designed such that there is one permit per read, being
shared by all readers in that read. Make sure readers created by tests
adhere to this.
2021-07-08 12:31:36 +03:00
Botond Dénes
0f36e5c498 memtable: migrate off the global reader concurrency semaphore
Require the caller of `create_flush_reader()` to pass a permit instead.
2021-07-08 12:31:36 +03:00
Botond Dénes
c4e71fb9b8 reader_concurrency_semaphore: remove default name parameter
Naming the concurrency semaphore is currently optional, unnamed
semaphores defaulting to "Unnamed semaphore". Although the most
important semaphores are named, many still aren't, which makes for a
poor debugging experience when one of these times out.
To prevent this, remove the name parameter defaults from those
constructors that have it and require a unique name to be passed in.
Also update all sites creating a semaphore and make sure they use a
unique name.
2021-07-08 12:31:36 +03:00
Raphael S. Carvalho
1924e8d2b6 treewide: Move compaction code into a new top-level compaction dir
Since compaction is layered on top of sstables, let's move all compaction code
into a new top-level directory.
This change will give me extra motivation to remove all layer violations, like
sstable calling compaction-specific code, and compaction entanglement with
other components like table and storage service.

Next steps:
- remove all layer violations
- move compaction code in sstables namespace into a new one for compaction.
- move compaction unit tests into its own file

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210707194058.87060-1-raphaelsc@scylladb.com>
2021-07-07 23:21:51 +03:00
Avi Kivity
5571ef0d6d compression: define 'class' attribute for compression and deprecate 'sstable_compression'
Cassandra 3.0 deprecated the 'sstable_compression' attribute and added
'class' as a replacement. Follow suit by supporting both.

The SSTABLE_COMPRESSION variable is renamed to SSTABLE_COMPRESSION_DEPRECATED
to detect all uses and prevent future misuse.

To prevent old-version nodes from seeing the new name, the
compression_parameters class preserves the key name when it is
constructed from an options map, and emits the same key name when
asked to generate an options map.
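
The key-name preservation described above could look roughly like the
following sketch. `toy_compression_parameters` is a hypothetical class, not
the real `compression_parameters`. Preferring 'class' when both keys are
present, and defaulting to 'class' when neither is, are assumptions made
for the example.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical sketch; the names below are illustrative, not Scylla's API.
class toy_compression_parameters {
public:
    using options_map = std::map<std::string, std::string>;

    explicit toy_compression_parameters(const options_map& opts) {
        // Assumption: the new key wins if both happen to be present.
        auto it = opts.find("class");
        if (it != opts.end()) {
            _key = "class";
            _compressor = it->second;
            return;
        }
        it = opts.find("sstable_compression");
        if (it != opts.end()) {
            _key = "sstable_compression";  // remember the deprecated spelling
            _compressor = it->second;
        }
    }

    const std::string& compressor() const { return _compressor; }

    // Emits the compressor under the same key name it was constructed with,
    // so a node that only knows the old name never sees the new one.
    options_map get_options() const {
        options_map out;
        if (!_compressor.empty()) {
            out[_key] = _compressor;
        }
        return out;
    }

private:
    std::string _key = "class";  // assumption: default when no key was given
    std::string _compressor;
};
```

Both spellings select the same compressor, but an options map round-trips
unchanged through construction and `get_options()`.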

Existing unit tests are modified to use the new name, and a test
is added to ensure the old name is still supported.

Fixes #8948.

Closes #8949
2021-07-07 19:15:20 +02:00
Avi Kivity
99d5355007 Merge "Cache sstable indexes in memory" from Tomasz
"
The main goal of this series is to improve the efficiency of reads from large partitions by
reducing the amount of I/O needed to read the sstable index. This is achieved by caching
index file pages and partition index entries in memory.

Currently, the pages are cached by individual reads only for the duration of the read.
This was done to facilitate binary search in the promoted index (intra-partition index).
After this series, all reads share the index file page cache, which stays around even after reads stop.

The page cache is subject to eviction. It uses the same region as the current row cache and shares
the LRU with row cache entries. This means that LRU objects need to be virtualized. This series takes
an easy approach and does this by introducing a virtual base class. This adds overhead to each row cache
entry, which now has to store the vtable pointer.
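
As a rough illustration of what a virtualized LRU means, the sketch below
links two unrelated entry types into one list through a common virtual base.
All names are hypothetical, and `std::list` stands in for whatever intrusive
list the real code uses.

```cpp
#include <cassert>
#include <list>
#include <string>

// Common base for anything linked into the shared LRU (hypothetical names).
// The virtual function is what costs each entry a vtable pointer.
struct evictable {
    virtual ~evictable() = default;
    virtual std::string on_evict() = 0;
};

struct toy_row_cache_entry : evictable {
    std::string on_evict() override { return "row"; }
};

struct toy_index_page : evictable {
    std::string on_evict() override { return "index-page"; }
};

class toy_lru {
    std::list<evictable*> _entries;  // front = least recently used
public:
    void add(evictable& e) { _entries.push_back(&e); }

    // Marks an entry as most recently used. O(n) here; an intrusive list
    // would make this O(1).
    void touch(evictable& e) {
        _entries.remove(&e);
        _entries.push_back(&e);
    }

    // Evicts the least recently used entry, whatever its concrete type.
    // Returns the kind that was evicted, or an empty string if the list is empty.
    std::string evict_one() {
        if (_entries.empty()) {
            return "";
        }
        evictable* victim = _entries.front();
        _entries.pop_front();
        return victim->on_evict();
    }
};
```

The eviction loop does not need to know whether the victim is a row cache
entry or an index page. The virtual call dispatches to the right cleanup.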

SSTable indexes have a hierarchy. There is a summary, which is a sparse partition key index into the
full partition index. This one is already kept in memory. The partition index is divided by the summary
into pages. Each entry in the partition index contains a promoted index, which is a sparse index into atoms
identified by the clustering key (rows, tombstones).
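
The three levels can be sketched as nested lookups. This is a simplified
model with hypothetical types: partitions are ordered by plain string
comparison and clustering keys are integers, which is not how real sstables
order or encode them.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Sparse intra-partition index: first clustering key of a block -> data file offset.
using toy_promoted_index = std::map<int, std::size_t>;

// One partition index entry, carrying its promoted index.
struct toy_partition_entry {
    std::string key;
    toy_promoted_index promoted;
};

// The unit the summary points at: a run of partition index entries.
using toy_index_page = std::vector<toy_partition_entry>;

struct toy_sstable_index {
    // Summary: first partition key of each page -> page number.
    std::map<std::string, std::size_t> summary;
    std::vector<toy_index_page> pages;

    // Returns the data offset of the block that may hold (pk, ck), or -1.
    long lookup(const std::string& pk, int ck) const {
        // Level 1: the summary picks the page whose first key is <= pk.
        auto s = summary.upper_bound(pk);
        if (s == summary.begin()) {
            return -1;  // pk sorts before the first page
        }
        --s;
        // Level 2: scan the partition index page for the exact key.
        for (const auto& e : pages[s->second]) {
            if (e.key != pk) {
                continue;
            }
            // Level 3: the promoted index picks the block covering ck.
            auto b = e.promoted.upper_bound(ck);
            if (b == e.promoted.begin()) {
                return -1;
            }
            --b;
            return static_cast<long>(b->second);
        }
        return -1;
    }
};
```

In this model, reaching level 3 always requires parsing the page at level 2
first.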

In order to read the promoted index, the reader needs to read the partition index entry first.
To speed this up, this series also adds caching of partition index entries. This cache survives
reads and is subject to eviction, just like the index file page cache. The unit of caching is
the partition index page. Without this cache, each access to the promoted index would have to be
preceded by parsing the partition index page containing the partition key.

Performance testing results follow.

1) scylla-bench large partition reads

  Populated with:

        perf_fast_forward --run-tests=large-partition-skips --datasets=sb-large-part-ds1 \
            -c1 -m1G --populate --value-size=1024 --rows=10000000

  Single partition, 9G data file, 4MB index file

  Test execution:

    build/release/scylla -c1 -m4G
    scylla-bench -workload uniform -mode read -limit 1 -concurrency 100 -partition-count 1 \
       -clustering-row-count 10000000 -duration 60m

  TL;DR: after: 2x throughput, 0.5 median latency

    Before (c1daf2bb24):

    Results
    Time (avg):	 5m21.033180213s
    Total ops:	 966951
    Total rows:	 966951
    Operations/s:	 3011.997048812112
    Rows/s:		 3011.997048812112
    Latency:
      max:		 74.055679ms
      99.9th:	 63.569919ms
      99th:		 41.320447ms
      95th:		 38.076415ms
      90th:		 37.158911ms
      median:	 34.537471ms
      mean:		 33.195994ms

    After:

    Results
    Time (avg):	 5m14.706669345s
    Total ops:	 2042831
    Total rows:	 2042831
    Operations/s:	 6491.22243800942
    Rows/s:		 6491.22243800942
    Latency:
      max:		 60.096511ms
      99.9th:	 35.520511ms
      99th:		 27.000831ms
      95th:		 23.986175ms
      90th:		 21.659647ms
      median:	 15.040511ms
      mean:		 15.402076ms

2) scylla-bench small partitions

  I tested several scenarios with a varying data set size, e.g. data fully fitting in memory,
  half fitting, and being much larger. The improvement varied a bit but in all cases the "after"
  code performed slightly better.

  Below is a representative run over data set which does not fit in memory.

  scylla -c1 -m4G
  scylla-bench -workload uniform -mode read  -concurrency 400 -partition-count 10000000 \
      -clustering-row-count 1 -duration 60m -no-lower-bound

  Before:

    Time (avg):	 51.072411913s
    Total ops:	 3165885
    Total rows:	 3165885
    Operations/s:	 61988.164024260645
    Rows/s:		 61988.164024260645
    Latency:
      max:		 34.045951ms
      99.9th:	 25.985023ms
      99th:		 23.298047ms
      95th:		 19.070975ms
      90th:		 17.530879ms
      median:	 3.899391ms
      mean:		 6.450616ms

  After:

    Time (avg):	 50.232410679s
    Total ops:	 3778863
    Total rows:	 3778863
    Operations/s:	 75227.58014424688
    Rows/s:		 75227.58014424688
    Latency:
      max:		 37.027839ms
      99.9th:	 24.805375ms
      99th:		 18.219007ms
      95th:		 14.090239ms
      90th:		 12.124159ms
      median:	 4.030463ms
      mean:		 5.315111ms

  The results include the warmup phase which populates the partition index cache, so the hot-cache effect
  is dampened in the statistics. See the 99th percentile. Latency gets better after the cache warms up which
  moves it lower.

3) perf_fast_forward --run-tests=large-partition-skips

    Caching is not used here, included to show there are no regressions for the cold cache case.

    TL;DR: No significant change

    perf_fast_forward --run-tests=large-partition-skips --datasets=large-part-ds1 -c1 -m1G

    Config: rows: 10000000, value size: 2000

    Before:

    read    skip      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
    1       0        36.429822            4  10000000     274500         62     274521     274429   153889.2 153883   19696986  153853       0        0        0        0        0        0        0  22.5%
    1       1        36.856236            4   5000000     135662          7     135670     135650   155652.0 155652   19704117  139326       1        0        1        1        0        0        0  38.1%
    1       8        36.347667            4   1111112      30569          0      30570      30569   155652.0 155652   19704117  139071       1        0        1        1        0        0        0  19.5%
    1       16       36.278866            4    588236      16214          1      16215      16213   155652.0 155652   19704117  139073       1        0        1        1        0        0        0  16.6%
    1       32       36.174784            4    303031       8377          0       8377       8376   155652.0 155652   19704117  139056       1        0        1        1        0        0        0  12.3%
    1       64       36.147104            4    153847       4256          0       4256       4256   155652.0 155652   19704117  139109       1        0        1        1        0        0        0  11.1%
    1       256       9.895288            4     38911       3932          1       3933       3930   100869.2 100868    3178298   59944   38912        0        1        1        0        0        0  14.3%
    1       1024      2.599921            4      9757       3753          0       3753       3753    26604.0  26604     801850   15071    9758        0        1        1        0        0        0  14.6%
    1       4096      0.784568            4      2441       3111          1       3111       3109     7982.0   7982     205946    3772    2442        0        1        1        0        0        0  13.8%

    64      1        36.553975            4   9846154     269359         10     269369     269337   155663.8 155652   19704117  139230       1        0        1        1        0        0        0  28.2%
    64      8        36.509694            4   8888896     243467          8     243475     243449   155652.0 155652   19704117  139120       1        0        1        1        0        0        0  26.5%
    64      16       36.466282            4   8000000     219381          4     219385     219374   155652.0 155652   19704117  139232       1        0        1        1        0        0        0  24.8%
    64      32       36.395926            4   6666688     183171          6     183180     183165   155652.0 155652   19704117  139158       1        0        1        1        0        0        0  21.8%
    64      64       36.296856            4   5000000     137753          4     137757     137737   155652.0 155652   19704117  139105       1        0        1        1        0        0        0  17.7%
    64      256      20.590392            4   2000000      97133         18      97151      94996   135248.8 131395    7877402   98335   31282        0        1        1        0        0        0  15.7%
    64      1024      6.225773            4    588288      94492       1436      95434      88748    46066.5  41321    2324378   30360    9193        0        1        1        0        0        0  15.8%
    64      4096      1.856069            4    153856      82893         54      82948      82721    16115.0  16043     583674   11574    2675        0        1        1        0        0        0  16.3%

    After:

    read    skip      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu
    1       0        36.429240            4  10000000     274505         38     274515     274417   153887.8 153883   19696986  153849       0        0        0        0        0        0        0  22.4%
    1       1        36.933806            4   5000000     135377         15     135385     135354   155658.0 155658   19704085  139398       1        0        1        1        0        0        0  40.0%
    1       8        36.419187            4   1111112      30509          2      30510      30507   155658.0 155658   19704085  139233       1        0        1        1        0        0        0  22.0%
    1       16       36.353475            4    588236      16181          0      16182      16181   155658.0 155658   19704085  139183       1        0        1        1        0        0        0  19.2%
    1       32       36.251356            4    303031       8359          0       8359       8359   155658.0 155658   19704085  139120       1        0        1        1        0        0        0  14.8%
    1       64       36.203692            4    153847       4249          0       4250       4249   155658.0 155658   19704085  139071       1        0        1        1        0        0        0  13.0%
    1       256       9.965876            4     38911       3904          0       3906       3904   100875.2 100874    3178266   60108   38912        0        1        1        0        0        0  17.9%
    1       1024      2.637501            4      9757       3699          1       3700       3697    26610.0  26610     801818   15071    9758        0        1        1        0        0        0  19.5%
    1       4096      0.806745            4      2441       3026          1       3027       3024     7988.0   7988     205914    3773    2442        0        1        1        0        0        0  18.3%

    64      1        36.611243            4   9846154     268938          5     268942     268921   155669.8 155705   19704085  139330       2        0        1        1        0        0        0  29.9%
    64      8        36.559471            4   8888896     243135         11     243156     243124   155658.0 155658   19704085  139261       1        0        1        1        0        0        0  28.1%
    64      16       36.510319            4   8000000     219116         15     219126     219101   155658.0 155658   19704085  139173       1        0        1        1        0        0        0  26.3%
    64      32       36.439069            4   6666688     182954          9     182964     182943   155658.0 155658   19704085  139274       1        0        1        1        0        0        0  23.2%
    64      64       36.334808            4   5000000     137609         11     137612     137596   155658.0 155658   19704085  139258       2        0        1        1        0        0        0  19.1%
    64      256      20.624759            4   2000000      96971         88      97059      92717   138296.0 131401    7877370   98332   31282        0        1        1        0        0        0  17.2%
    64      1024      6.260598            4    588288      93967       1429      94905      88051    45939.5  41327    2324346   30361    9193        0        1        1        0        0        0  17.8%
    64      4096      1.881338            4    153856      81780        140      81920      81520    16109.8  16092     582714   11617    2678        0        1        1        0        0        0  18.2%

4) perf_fast_forward --run-tests=large-partition-slicing

    Caching enabled, each line shows the median run from many iterations

    TL;DR: We can observe reduction in IO which translates to reduction in execution time,
           especially for slicing in the middle of partition.

    perf_fast_forward --run-tests=large-partition-slicing --datasets=large-part-ds1 -c1 -m1G --keep-cache-across-test-cases

    Config: rows: 10000000, value size: 2000

    Before:

    offset  read      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    allocs   tasks insns/f    cpu
    0       1         0.000491          127         1       2037         24       2109        127        4.0      4        128       2       2        0        1        1        0        0        0       157      80 3058208  15.0%
    0       32        0.000561         1740        32      56995        410      60031      47208        5.0      5        160       3       2        0        1        1        0        0        0       386     111  113353  17.5%
    0       256       0.002052          488       256     124736       7111     144762      89053       16.6     17        672      14       2        0        1        1        0        0        0      2113     446   52669  18.6%
    0       4096      0.016437           61      4096     249199        692     252389     244995       69.4     69       8640      57       5        0        1        1        0        0        0     26638    1717   23321  22.4%
    5000000 1         0.002171          221         1        461          2        466        221       25.0     25        268       3       3        0        1        1        0        0        0       638     376 14311524  10.2%
    5000000 32        0.002392          404        32      13376         48      13528      13015       27.0     27        332       5       3        0        1        1        0        0        0       931     432  489691  11.9%
    5000000 256       0.003659          279       256      69967        764      73130      52563       39.5     41        780      19       3        0        1        1        0        0        0      2689     825   93756  15.8%
    5000000 4096      0.018592           55      4096     220313        433     234214     218803       94.2     94       9484      62       9        0        1        1        0        0        0     27349    2213   26562  21.0%

    After:

    offset  read      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    allocs   tasks insns/f    cpu
    0       1         0.000229          115         1       4371         85       4585        115        2.1      2         64       1       1        1        0        0        0        0        0        90      31 1314749  22.2%
    0       32        0.000277         2174        32     115674       1015     128109      14144        3.0      3         96       2       1        1        0        0        0        0        0       319      62   52508  26.1%
    0       256       0.001786          576       256     143298       5534     179142     113715       14.7     17        544      15       1        1        0        0        0        0        0      2110     453   45419  21.4%
    0       4096      0.015498           61      4096     264289       2006     268850     259342       67.4     67       8576      59       4        1        0        0        0        0        0     26657    1738   22897  23.7%
    5000000 1         0.000415          233         1       2411         15       2456        234        4.1      4        128       2       2        1        0        0        0        0        0       199      72 2644719  16.8%
    5000000 32        0.000635         1413        32      50398        349      51149      46439        6.0      6        192       4       2        1        0        0        0        0        0       458     128  125893  18.6%
    5000000 256       0.002028          486       256     126228       3024     146327      82559       17.8     18       1024      13       4        1        0        0        0        0        0      2123     385   51787  19.6%
    5000000 4096      0.016836           61      4096     243294        814     263434     241660       73.0     73       9344      62       8        1        0        0        0        0        0     26922    1920   24389  22.4%

Future work:

 - Check the impact on non-uniform workloads. Caching sstable indexes takes space away from the row cache
   which may reduce the hit ratio.

 - Reduce memory footprint of partition index cache. Currently, about 8x bloat over the on-disk size.

 - Disable cache population for "bypass cache" reads

 - Add a switch to disable sstable index caching, per-node, maybe per-table

 - Better sstable index format. Current format leads to inefficiency in caching since only some elements of the cached
   page can be hot. A B-tree index would be more efficient. Same applies to the partition index. Only some elements in
   the partition index page can be hot.

 - Add heuristic for reducing index file IO size when large partitions are anticipated. If we're bound by disk's
   bandwidth it's wasteful to read the front of promoted index using 32K IO, better use 4K which should cover the
   partition entry and then let binary search read the rest.

In V2:

 - Fixed perf_fast_forward regression in the number of IOs used to read partition index page
   The reader uses 32K reads, which were split by page cache into 4K reads
   Fix by propagating IO size hints to page cache and using single IO to populate it.
   New patch: "cached_file: Issue single I/O for the whole read range on miss"

 - Avoid large allocations to store partition index page entries (due to managed_vector storage).
   There is a unit test which detects this and fails.
   Fixed by implementing chunked_managed_vector, based on chunked_vector.

 - Fixed bug in cached_file::evict_gently() where the wrong allocation strategy was used to free btree chunks

 - Simplify region_impl::free_buf() according to Avi's suggestions

 - Fit segment_kind in segment_descriptor::_free_space and lift requirement that _buf_pointers emptiness determines the kind

 - Work around a SIGSEGV which was most likely due to coroutine miscompilation, by manipulating local object scope.

 - Wire up system/drop_sstable_caches RESTful API

 - Fix use-after-move on permit for the old scanning ka/la index reader

 - Fixed more cases of double open_data() in tests leading to assert failure

 - Adjusted cached_file class doc to account for changes in behavior.

 - Rebased
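
To illustrate the first V2 item (32K reads being split into 4K reads by the
page cache), here is a small sketch of the I/O planning arithmetic. The
function `plan_reads` and its `single_io` flag are hypothetical, and the
sketch assumes none of the touched pages are cached yet.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct io_request {
    std::size_t offset;
    std::size_t length;
};

// Plans the disk reads needed to populate a page cache for the byte range
// [offset, offset + length), assuming every touched page is a miss.
std::vector<io_request> plan_reads(std::size_t offset, std::size_t length,
                                   std::size_t page_size, bool single_io) {
    std::vector<io_request> reads;
    if (length == 0) {
        return reads;
    }
    assert(page_size > 0);
    // Align the range outwards to page boundaries.
    const std::size_t first = offset / page_size * page_size;
    const std::size_t last = (offset + length + page_size - 1) / page_size * page_size;
    if (single_io) {
        // One I/O covering the whole read range.
        reads.push_back({first, last - first});
    } else {
        // One I/O per page, as when the cache splits the read.
        for (std::size_t pos = first; pos < last; pos += page_size) {
            reads.push_back({pos, page_size});
        }
    }
    return reads;
}
```

For a 32K read over 4K pages, the per-page plan issues eight requests and
the single-I/O plan issues one.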

Fixes #7079.
Refs #363.
"

* tag 'sstable-index-caching-v2' of github.com:tgrabiec/scylla: (39 commits)
  api: Drop sstable index caches on system/drop_sstable_caches
  cached_file: Issue single I/O for the whole read range on miss
  row_cache: cache_tracker: Do not register metrics when constructed for tests
  sstables, cached_file: Evict cache gently when sstable is destroyed
  sstables: Hide partition_index_cache implementation away from sstables.hh
  sstables: Drop shared_index_lists alias
  sstables: Destroy partition index cache gently
  sstables: Cache partition index pages in LSA and link to LRU
  utils: Introduce lsa::weak_ptr<>
  sstables: Rename index_list to partition_index_page and shared_index_lists to partition_index_cache
  sstables, cached_file: Avoid copying buffers from cache when parsing promoted index
  cached_file: Introduce get_page_units()
  sstables: read: Document that primitive_consumer::read_32() is alloc-free
  sstables: read: Count partition index page evictions
  sstables: Drop the _use_binary_search flag from index entries
  sstables: index_reader: Keep index objects under LSA
  lsa: chunked_managed_vector: Adapt more to managed_vector
  utils: lsa: chunked_managed_vector: Make LSA-aware
  test: chunked_managed_vector_test: Make exception_safe_class standard layout
  lsa: Copy chunked_vector to chunked_managed_vector
  ...
2021-07-07 18:17:10 +03:00
Pavel Solodovnikov
b959f5d394 test: lib: copy query_options in single_node_cql_env::execute_cql()
`query_processor::execute_direct()` takes a non-const ref
to query options, meaning it's not safe to pass the same
instance to subsequent invocations of `execute_direct()`
in the tests.

Copy default query options at each invocation of `execute_cql()`
so no possible side-effects can occur.
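
A minimal sketch of the pattern, using hypothetical stand-ins
(`toy_query_options`, `toy_cql_env`, and a free function named
`execute_direct`) rather than the real classes. The `consumed` flag is an
invented piece of mutable state that makes the hazard visible.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in; not the real query_options.
struct toy_query_options {
    int page_size = 100;
    bool consumed = false;  // invented state that execution may modify
};

// Takes a non-const reference, so it is free to modify the options.
int execute_direct(const std::string& query, toy_query_options& opts) {
    assert(!opts.consumed && "options instance reused across executions");
    opts.consumed = true;
    return static_cast<int>(query.size());
}

class toy_cql_env {
    toy_query_options _default_options;
public:
    int execute_cql(const std::string& query) {
        // Copy the defaults for every call so that changes made by
        // execute_direct() cannot leak into the next invocation.
        toy_query_options opts = _default_options;
        return execute_direct(query, opts);
    }

    const toy_query_options& default_options() const { return _default_options; }
};
```

Passing `_default_options` directly would trip the assertion on the second
call. With the per-call copy, the defaults stay pristine.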

Tests: unit(dev, debug)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210705094824.243573-2-pa.solodovnikov@scylladb.com>
2021-07-07 11:46:50 +03:00
Nadav Har'El
775a64b003 test/alternator: test for change in CDC preimage
In pull request #8568, the CDC API changed slightly, with preimage data
gaining extra "delete$k" values for columns whose preimage was missing.
In this new test, we verify that this change did not break Alternator.
We didn't expect it to break Alternator, because it just outputs the known
base-table columns and ignores the columns which weren't a real base-table
column - like this "delete$k".

In the test we set up a stream with preimages, ensure that a real column
(note that an LSI key is a real column instead of a map element) has a
null preimage - and see that the preimage is returned as expected,
without fake columns like "delete$k".

The test passes, showing that PR #8568 was ok.
The test also passes, as expected, on DynamoDB.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210504120121.915829-1-nyh@scylladb.com>
2021-07-06 14:53:42 +02:00
Nadav Har'El
76227fafad cql-pytest: use NetworkTopologyStrategy, not SimpleStrategy
All tests in cql-pytest use a test keyspace created with the SimpleStrategy
replication strategy. This was never intentional. We are recommending to
users that they should use NetworkTopologyStrategy instead, and even
want to deprecate SimpleStrategy (this is #8586), so tests should stop
using SimpleStrategy and should start using the same strategy users would
use in their applications - NetworkTopologyStrategy.

Almost all tests are fixed by a single change in conftest.py which
changes how "test_keyspace" is created. But additionally, tests in
test_keyspace.py which explicitly create keyspaces (that's the point of
that test file...) also had to be modified to use NetworkTopologyStrategy.
Note that none of the tests relied on any special features or
implementation details of SimpleStrategy.

This patch is part of the bigger effort to remove reliance on
SimpleStrategy from all tests, of all types - which we will need to do if
we ever want to forbid SimpleStrategy by default. The issue of that effort:
Refs #8638

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210620102341.195533-1-nyh@scylladb.com>
2021-07-06 14:52:46 +02:00
Avi Kivity
2c8d84b864 Merge "Make logging for sstable data corruption useful" from Raphael
"
When a corrupted sstable fails to be read, either on a regular read or in
regular compaction, our logging is not useful: it can't pinpoint
the SSTable that was being read from, and it may not print useful
details about the corruption.

For example, when a compaction fails on data corruption, a cryptic
message like the following is logged:
    compaction_manager - compaction failed: std::runtime_error (compressed chunk failed checksum): retrying

There are two problems with the log above:
    1) it doesn't tell us which sstable is corrupted
    2) it doesn't tell us detailed info about the checksum failure on compressed chunk

With those problems fixed, we now get a much more useful message:
    compaction_manager - compaction failed: sstables::malformed_sstable_exception (Failed to read partition
        from SSTable /home/.../md-74-big-Data.db due to compressed chunk of size 3735 at file offset 406491
        failed checksum, expected=0, actual=1422312584): retrying

tests: mode(dev).
"

* 'log_data_corruption_v2.1' of github.com:raphaelsc/scylla:
  sstables: Attach sstable name to exception triggered in sstable mutation reader
  test/broken_sstable_test: Make test more robust
  sstables: Make log more useful when compressed chunk fails checksum
  sstables: Use correct exception when compressed chunk fails checksum
2021-07-05 20:37:19 +03:00
Avi Kivity
e2f865c739 Merge 'Use expressions to calculate the global-index partition slice' from Dejan Mircevski
Another step towards dropping the `restrictions` class.  When calculating the partition slice of a global-index table, use `expression` objects instead of a `restrictions` subclass.

Refs #7217.

Tests: unit (all dev, some debug)

Closes #8950

* github.com:scylladb/scylla:
  cql3: Use expr for global-index partition slice
  cql3: Fully explain statement_restrictions members
  cql3: Pass schema reference not pointer
  cql3: Replace count_if with find_atom
  cql3: Fix _partition_range_is_simple calculation
  cql3: Add test cases for indexed partition column
2021-07-05 18:04:54 +03:00
Nadav Har'El
12b058abdf Merge 'repair: row_level: clear_gently on close' from Benny Halevy
To prevent a reactor stall as seen in #8926

Fixes #8926

Test: unit(dev)
DTest: repair_additional_test.py:RepairAdditionalTest.repair_same_row_diff_value_3nodes_diff_shard_count_test

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #8928

* github.com:scylladb/scylla:
  repair: row_level: clear_gently: clear_gently each repair_row
  repair: row_level: repair_meta: clear_gently on stop
  repair: row_level_repair: run: stop master repair_meta
  utils: stall_free: implement clear_gently of foreign_ptr
  utils: stall_free: define generic clear_gently methods
2021-07-04 23:00:37 +03:00
Tomasz Grabiec
7d34799f3f sstables: Drop shared_index_lists alias 2021-07-02 19:02:14 +02:00
Tomasz Grabiec
9f957f1cf9 sstables: Cache partition index pages in LSA and link to LRU
As part of this change, the container for partition index pages was
changed from utils::loading_shared_values to intrusive_btree. This is
to avoid reactor stalls which the former induces with a large number
of elements (pages) due to its use of a hashtable under the hood,
which reallocates contiguous storage.
2021-07-02 19:02:14 +02:00
Tomasz Grabiec
b3728f7d9b utils: Introduce lsa::weak_ptr<>
Simplifies managing non-owning references to LSA-managed objects. The
lsa::weak_ptr is a smart pointer which is not invalidated by LSA and
can be used safely in any allocator context. Dereferencing it will always
give a valid reference.

This can be used as a building block for implementing cursors into
LSA-based caches.

Example simple use:

     // LSA-managed
     struct X : public lsa::weakly_referencable<X> {
         int value;
     };

     lsa::weak_ptr<X> x_ptr = with_allocator(region(), [] {
           X* x = current_allocator().construct<X>();
           return x->weak_from_this();
     });

     std::cout << x_ptr->value;
2021-07-02 19:02:14 +02:00
Tomasz Grabiec
2a852cd0c9 sstables: Rename index_list to partition_index_page and shared_index_lists to partition_index_cache
The new names are less confusing.
2021-07-02 19:02:14 +02:00
Tomasz Grabiec
06e373e272 sstables: index_reader: Keep index objects under LSA
In preparation for caching index objects, manage them under LSA.

Implementation notes:

key_view was changed to be a view on managed_bytes_view instead of
bytes, so it now can be fragmented. Old users of key_view now have to
linearize it.  Actual linearization should be rare since partition
keys are typically small.

Index parser is now not constructing the index_entry directly, but
produces value objects which live in the standard allocator space:

  class parsed_promoted_index_entry;
  class parsed_partition_index_entry;

This change was needed to support consumers which don't populate the
partition index cache and don't use LSA,
e.g. sstable::generate_summary(). It's now consumer's responsibility
to allocate index_entry out of parsed_partition_index_entry.
2021-07-02 19:02:14 +02:00
Tomasz Grabiec
78e5b9fd85 utils: lsa: chunked_managed_vector: Make LSA-aware
The max chunk size is set to be 10% of segment size.
2021-07-02 19:02:14 +02:00
Tomasz Grabiec
856e4a539d test: chunked_managed_vector_test: Make exception_safe_class standard layout
Required by managed_vector<> due to its use of offsetof()

In preparation for switching chunked_managed_vector storage to
managed_vector<>.
2021-07-02 19:02:14 +02:00
Tomasz Grabiec
c87ea09535 lsa: Copy chunked_vector to chunked_managed_vector
In preparation for adapting it to LSA. Split into two steps to make
review easier.
2021-07-02 19:02:14 +02:00
Tomasz Grabiec
2b673478aa sstables: index_reader: Do not expose index_entry references
index_entry will be an LSA-managed object. Those have to be accessed
with care, with the LSA region locked.

This patch hides most of direct index_entry accesses inside the
index_reader so that users are safe.
2021-07-02 19:02:13 +02:00
Tomasz Grabiec
a955e7971d sstables: index_reader: Don't store schema reference inside index_entry
To save space.
2021-07-02 19:02:13 +02:00
Tomasz Grabiec
484e06d69b cached_file: Always start at offset 0
All current uses start at offset 0, so simplify the code by assuming it.
2021-07-02 19:02:13 +02:00
Tomasz Grabiec
078a6e422b sstables: Cache all index file reads
After this patch, there is a single index file page cache per
sstable, shared by index readers. The cache survives reads,
which reduces the amount of I/O on subsequent reads.

As part of this, cached_file needed to be adjusted in the following ways.

The page cache may occupy a significant portion of memory. Keeping the
pages in the standard allocator could cause memory fragmentation
problems. To avoid them, the cached_file is changed to keep buffers in LSA
using the lsa_buffer allocation method.

When a page is needed by the seastar I/O layer, it needs to be copied
to a temporary_buffer which is stable, so must be allocated in the
standard allocator space. We copy the page on-demand. Concurrent
requests for the same page will share the temporary_buffer. When a page
is not used, it only lives in the LSA space.

In the subsequent patches cached_file::stream will be adjusted to also support
access via cached_page::ptr_type directly, to avoid materializing a
temporary_buffer.

While a page is used, it is not linked in the LRU so that it is not
freed. This ensures that the storage which is actively consumed
remains stable, either via temporary_buffer (kept alive by its
deleter), or by cached_page::ptr_type directly.
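
The rule above (a page in use is unlinked from the LRU, so eviction cannot
free it) can be illustrated with a small standalone model. This is only a
sketch under simplifying assumptions: page_cache_sketch and its members are
hypothetical names, pages are identified by index only, and nothing here
reflects the actual cached_file or LSA code.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <map>

// Hypothetical, simplified model; not the cached_file implementation.
class page_cache_sketch {
    struct page {
        unsigned use_count = 0;
        std::list<std::size_t>::iterator lru_pos; // meaningful only when use_count == 0
    };
    std::map<std::size_t, page> _pages; // page index -> page state
    std::list<std::size_t> _lru;        // front = least recently used
public:
    // Start using a page: create it if missing and unlink it from the LRU.
    void acquire(std::size_t idx) {
        auto [it, inserted] = _pages.try_emplace(idx);
        page& p = it->second;
        if (!inserted && p.use_count == 0) {
            _lru.erase(p.lru_pos);
        }
        ++p.use_count;
    }
    // Stop using a page: the last user links it back as most recently used.
    void release(std::size_t idx) {
        page& p = _pages.at(idx);
        assert(p.use_count > 0);
        if (--p.use_count == 0) {
            p.lru_pos = _lru.insert(_lru.end(), idx);
        }
    }
    // Evict the least recently used unused page; false if nothing is evictable.
    bool evict_one() {
        if (_lru.empty()) {
            return false;
        }
        _pages.erase(_lru.front());
        _lru.pop_front();
        return true;
    }
    bool contains(std::size_t idx) const { return _pages.count(idx) != 0; }
};
```

In this model eviction only ever walks the LRU list, so a page that is still
acquired is never considered, whatever the memory pressure.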
2021-07-02 19:02:13 +02:00
Tomasz Grabiec
b5ca0eb2a2 lsa: Introduce lsa_buffer
lsa_buffer is similar in spirit to std::unique_ptr<char[]>. It owns
buffers allocated inside LSA segments. It uses an alternative
allocation method which differs from regular LSA allocations in the
following ways:

  1) LSA segments only hold buffers, they don't hold metadata. They
     also don't mix with standard allocations. So a 128K segment can
     hold 32 4K buffers.

  2) The lifetime of an object is managed by lsa_buffer, an owning smart
     pointer, which is automatically updated when buffers are migrated
     to another segment. This makes LSA allocations easier to use and
     off-loads metadata management to the client (which can keep the
     lsa_buffer wherever it wants).

The metadata is kept inside segment_descriptor, in a vector. Each
allocated buffer will have an entangled object there (8 bytes), which
is paired with an entangled object inside lsa_buffer.

The reason to have an alternative allocation method is to efficiently
pack buffers inside LSA segments.
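
The packing figure in point 1 is plain arithmetic and can be checked at
compile time. The helper below is a hypothetical illustration, not part of
the LSA code; only the 128K, 4K and 8-byte figures come from the text above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: how many equally sized buffers fit in a segment
// when the segment stores no per-buffer metadata.
constexpr std::size_t buffers_per_segment(std::size_t segment_size, std::size_t buffer_size) {
    assert(buffer_size != 0 && segment_size % buffer_size == 0);
    return segment_size / buffer_size;
}

// 128K segment, 4K buffers: the whole segment is payload.
static_assert(buffers_per_segment(128 * 1024, 4 * 1024) == 32, "32 buffers per segment");

// The metadata lives outside the segment: one 8-byte entry per buffer.
constexpr std::size_t descriptor_bytes(std::size_t buffer_count) {
    return buffer_count * 8;
}
static_assert(descriptor_bytes(32) == 256, "256 bytes of metadata per full segment");
```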
2021-07-02 19:02:13 +02:00
Dejan Mircevski
53f376b83f cql3: Add test cases for indexed partition column
We didn't have a case when a global index exists on a partition column
and the SELECT statement specifies the full partition key.

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2021-07-02 17:28:56 +02:00
Tomasz Grabiec
f537d1a7e5 tests: sstables: Do not call open_data() twice
make_sstable_containing() already calls open_data(), so does
load(). This will trigger the assertion failure added in a later patch:

   assert(!_cached_index_file);

There is no need to call load() here.
2021-07-02 10:25:58 +02:00
Tomasz Grabiec
627a2ef087 test: cached_file: Add test for eof_error 2021-07-02 10:25:58 +02:00
Tomasz Grabiec
8fbea0b5b7 utils: cached_file: Introduce file wrapper
It's an adaptor between seastar::file and cached_file. It gives a
seastar::file which will serve reads using a given cached_file as a
read-through cache.
2021-07-02 10:25:58 +02:00
Tomasz Grabiec
8e2118069b sstables: cached_file: Account buffers returned by cached_file under read_permit
We want buffers to be accounted only when they are used outside
cached_file. Cached pages should not be accounted because they will
stay around for longer than the read after subsequent commits.
2021-07-02 10:25:58 +02:00
Tomasz Grabiec
a5c72ed899 sstables, database: Keep cache_tracker reference inside sstables_manager
So that sstable code can pick it up for caching (lru and region).
2021-07-02 10:25:58 +02:00
Tomasz Grabiec
7fa4e10aa0 row_cache: Use generic LRU for eviction
In preparation for tracking different kinds of objects, not just
rows_entry, in the LRU, switch to the LRU implementation from
utils/lru.hh which can hold arbitrary element type.
2021-07-02 10:25:58 +02:00
Benny Halevy
9963d15613 utils: stall_free: implement clear_gently of foreign_ptr
clear_gently of the foreign_ptr needs to run on the owning
shard, so provide a specialization from the SmartPointer
implementation.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-07-01 19:16:11 +03:00
Benny Halevy
eca9f45c59 utils: stall_free: define generic clear_gently methods
Define a bunch of clear_gently methods that asynchronously
clear the contents of containers and allow yielding.

This replaces clear_gently(std::list<T>&) used by row level
repair with a more generic template implementation.
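
The idea can be sketched without seastar as follows. This is a synchronous
model, not the code in this patch: clear_gently_sketch and yield_fn are
hypothetical names, and the callback merely stands in for the point where
the real, future-based implementation may yield to the reactor.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>

// Hypothetical stand-in for yielding to the scheduler.
using yield_fn = std::function<void()>;

// Destroy the elements one at a time, giving the caller a chance to yield
// after each one, instead of doing all the work in a single clear() call.
// Returns the number of elements destroyed.
template <typename T>
std::size_t clear_gently_sketch(std::list<T>& c, const yield_fn& maybe_yield) {
    assert(maybe_yield && "a yield hook is required");
    std::size_t destroyed = 0;
    while (!c.empty()) {
        c.pop_back();
        ++destroyed;
        maybe_yield();
    }
    return destroyed;
}
```

The point of the pattern is that the cost of destroying a large container is
split into small steps with a preemption opportunity between them.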

Note that we do not use coroutines in this patch
to facilitate backporting to releases that do not support coroutines
and since a miscompilation bug was hit with clang++ 11 when attempting
to coroutinize this patch (see
https://bugs.llvm.org/show_bug.cgi?id=50345).

Test: stall_free_test(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-07-01 19:00:49 +03:00
Avi Kivity
4209dfd753 Merge "evictable_readers: don't drop static rows, drop assumption about snapshot isolation" from Botond
"
This mini-series fixes two loosely related bugs around reader recreation
in the evictable reader (related by both being around reader
recreation). A unit test is also added which reproduces both of them and
checks that the fixes indeed work. More details in the patches
themselves.
This series replaces the two independent patches sent before:
* [PATCH v1] evictable_reader: always reset static row drop flag
* [PATCH v1] evictable_reader: relax partition key check on reader
  recreation

As they depend on each other, it is easier to add a test if they are in
a series.

Fixes: #8923
Fixes: #8893

Tests: unit(dev, mutation_reader_test:debug)
"

* 'evictable-reader-recreation-more-bugs/v1' of https://github.com/denesb/scylla:
  test: mutation_reader_test: add more test for reader recreation
  evictable_reader: relax partition key check on reader recreation
  evictable_reader: always reset static row drop flag
2021-07-01 14:15:46 +03:00
Botond Dénes
75e8d2d04a test: mutation_reader_test: add more test for reader recreation 2021-06-30 11:21:58 +03:00