Commit Graph

22473 Commits

Author SHA1 Message Date
Avi Kivity
5d99d667ec Merge "Build system improvements for packaging" from Pekka
"
This patch series attempts to decouple package build and release
infrastructure, which is internal to Scylla (the company). The goal of
this series is to make it easy for humans and machines to build the full
Scylla distribution package artifacts, and make it easy to quickly
verify them.

The improvements to the build system are done in the following steps.

1. Make scylla.git a super-module, which has git submodules for
   scylla-jmx and scylla-tools.  A clone of scylla.git is now all that
   is needed to access all source code of all the different components
   that make up a Scylla distribution, which is a preparatory step to
   adding "dist" ninja build target. A scripts/sync-submodules.sh helper
   script is included, which allows easy updating of the submodules to the
   latest head of the respective git repositories.

2. Make builds reproducible by moving the remaining relocatable package
   specific build options from reloc/build_reloc.sh to the build system.
   After this step, you can build the exact same binaries from the git
   repository by using the dbuild version from scylla.git.

3. Add a "dist" target to ninja build, which builds all .rpm and .deb
   packages with one command. To build a release, run:

   $ ./tools/toolchain/dbuild ./configure.py --mode release

   $ ./tools/toolchain/dbuild ninja-build dist

   and you will now have .rpm and .deb packages for all the components of
   a Scylla distribution.

4. Add a "dist-check" target to ninja build for verification of .rpm and
   .deb packages in one command. To verify all the built packages, run:

   $  ninja-build dist-check

   Please note that you must run this step on the host, because the
   target uses Docker under the hood to verify packages by installing
   them on different Linux distributions.

   Currently only CentOS 7 verification is supported.

All these improvements retain backward compatibility. That is, any
existing release infrastructure or other build scripts are completely
unaffected.

Future improvements to consider:

- Package repository generation: add a "ninja repo" command to generate
  .rpm and .deb repositories, which can be uploaded to a web site.
  This makes it possible to build a downloadable Scylla distribution
  from scylla.git. The target requires some configuration, which the
  user has to provide: for example, download URL locations and package
  signing keys.

- Amazon Machine Image (AMI) support: add a "ninja ami" command to
  simplify the steps needed to generate a Scylla distribution AMI.

- Docker image support: add a "ninja docker" command to simplify the
  steps needed to generate a Scylla distribution Docker image.

- Simplify and unify package build: simplify and unify the various shell
  scripts needed to build packages in different git repositories. This
  step will break backward compatibility and can only be done after the
  relevant build scripts and release infrastructure are updated.
"

* 'penberg/packaging/v5' of github.com:penberg/scylla:
  docs: Update packaging documentation
  build: Add "dist-check" target
  scripts/testing: Add "dist-check" for package verification
  build: Add "dist" target
  reloc: Add '--builddir' option to build_deb.sh
  build: Add "-ffile-prefix-map" to cxxflags
  docs: Document sync-submodules.sh script in maintainer.md
  sync-submodules.sh: Add script for syncing submodules
  Add scylla-tools submodule
  Add scylla-jmx submodule
2020-06-18 12:59:52 +03:00
Dejan Mircevski
aec1acd1d5 range_test: Add cases for singular intersection
Intersection was previously not tested for singular ranges.  This
ensures it will always work for singular ranges, too.

Tests: unit(dev)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-06-18 12:38:31 +03:00
Yaron Kaikov
e9d5852b0c dbuild: Add an option to run dbuild using podman
Following https://github.com/scylladb/scylla/pull/5333, we want to be
able to run dbuild using podman or docker by setting an environment
variable named DBUILD_TOOL.

DBUILD_TOOL defaults to docker unless it is explicitly set to podman.

Fixes: https://github.com/scylladb/scylla/pull/6644
2020-06-18 12:13:39 +03:00
Avi Kivity
9322c07c71 Merge "Use binary search in sstable promoted index" from Tomasz
"
The "promoted index" is the sstable format's name for the clustering key index within a given partition.
Large partitions with many rows have it. It's embedded in the partition index entry.

Currently, lookups in the promoted index are done by scanning the index linearly so the lookup
is O(N). For large partitions that's inefficient. It consumes both a lot of CPU and I/O.

We could do better and use binary search in the index. This patch series switches the mc-format
index reader to do that. Other formats use the old way.

The "mc" format promoted index has an extra structure at the end of the index called "offset map".
It's a vector of offsets of consecutive promoted index entries. This allows us to access random
entries in the index without reading the whole index.

The location of the offset entry for a given promoted index entry can be derived by knowing where
the offset vector ends in the index file, so the offset map also doesn't have to be read completely
into memory.

The most tricky part is caching. We need to cache blocks read from the index file to amortize the
cost of binary search:

  - if the promoted index fits in the 32 KiB which was read from the index when looking for
    the partition entry, we don't want to issue any additional I/O to search the promoted index.

  - with large promoted indexes, the last few bisections will fall into the same I/O block and we
    want to reuse that block.

  - we don't want the cache to grow too big, we don't want to cache the whole promoted index
    as the read progresses over the index. Scanning reads may skip multiple times.

This series implements a rather simple approach which meets all the
above requirements and is not worse than the current state of affairs:

   - Each index cursor has its own cache of the index file area which corresponds to the promoted index.
     This is managed by the cached_file class.

   - Each index cursor has its own cache of parsed blocks. This allows the upper bound estimation to
     reuse information obtained during lower bound lookup. This estimation is used to limit
     read-aheads in the data file.

   - Each cursor drops entries that it walked past so that memory footprint stays O(log N)

   - Cached buffers are accounted to read's reader_permit.

Later, we could have a single cache shared by many readers. For that, we need to come up with an
eviction policy.

Fixes #4007.

TESTING RESULTS

 * Point reads, large promoted index:

  Config: rows: 10000000, value size: 2000
  Partition size: 20 GB
  Index size: 7 MB

  Notes:

    - Slicing read into the middle of partition (offset=5000000, read=1) is a clear win for the binary search:

      time: 1.9ms vs 22.9ms
      CPU utilization: 8.9% vs 92.3%
      I/O: 21 reqs / 172 KiB vs 29 reqs / 3'520 KiB

      It's 12x faster, CPU utilization is 10x smaller, and disk utilization is 20x smaller.

    - Slicing at the front (offset=0) is a mixed bag.

      time is similar: 1.8ms
      CPU utilization is 6.7x smaller for bsearch: 8.5% vs 57.7%
      disk bandwidth utilization is smaller for bsearch but uses more IOs: 4 reqs / 320 KiB (scan) vs 17 reqs / 188 KiB (bsearch)

      bsearch uses less bandwidth because the series reduces the buffer size used for index file I/O.

      scan is issuing:

         2 * 128 KB (index page)
         2 * 32 KB (data file)

      bsearch is issuing:

         1 * 64 KB (index page)
         15 * 4 KB (promoted index)
         1 * 64 KB (data file)

      The 1 * 64 KB is chosen dynamically by seastar. Sometimes it chooses 2 * 32 KB (with read-ahead).
      32 KB is the minimum I/O currently.

      Disk utilization could be further improved by changing the way seastar's dynamic I/O adjustments work
      so that it uses 1 * 4 KB when it suffices. This is left for the follow-up.

  Command:

        perf_fast_forward --datasets=large-part-ds1 \
         --run-tests=large-partition-slicing-clustering-keys -c1 --test-case-duration=1

  Before:

    offset  read      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
    0       1         0.001836          172         1        545          9        563        175        4.0      4        320       2       2        0        1        1        0        0        0  57.7%      0
    0       32        0.001858          502        32      17220        126      17776      11526        3.2      3        324       2       1        0        1        1        0        0        0  56.4%      0
    0       256       0.002833          339       256      90374        427      91757      85931        7.0      7        776       3       1        0        1        1        0        0        0  41.1%      0
    0       4096      0.017211           58      4096     237984       2011     241802     233870       66.1     66       8376      59       2        0        1        1        0        0        0  21.4%      0
    5000000 1         0.022952           42         1         44          1         45         41       29.2     29       3520      22       2        0        1        1        0        0        0  92.3%      0
    5000000 32        0.023052           43        32       1388         14       1414       1331       31.1     32       3588      26       2        0        1        1        0        0        0  91.7%      0
    5000000 256       0.024795           41       256      10325        129      10721       9993       43.1     39       4544      29       2        0        1        1        0        0        0  86.4%      0
    5000000 4096      0.038856           27      4096     105414        398     106918     103162       95.2     95      12160      78       5        0        1        1        0        0        0  61.4%      0

 After (v2):

    offset  read      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
    0       1         0.001831          248         1        546         21        581        252       17.6     17        188       2       0        0        1        1        0        0        0   8.5%      0
    0       32        0.001910          535        32      16751        626      17770      13896       17.9     19        160       3       0        0        1        1        0        0        0   8.8%      0
    0       256       0.003545          266       256      72207       2333      89076      62852       26.9     24        764       7       0        0        1        1        0        0        0   9.7%      0
    0       4096      0.016800           56      4096     243812        524     245430     239736       83.6     83       8700      64       0        0        1        1        0        0        0  16.6%      0
    5000000 1         0.001968          351         1        508         19        538        380       21.3     21        172       2       0        0        1        1        0        0        0   8.9%      0
    5000000 32        0.002273          431        32      14077        436      15503      11551       22.7     22        268       3       0        0        1        1        0        0        0   8.9%      0
    5000000 256       0.003889          257       256      65824       2197      81833      57813       34.0     37        652      18       0        0        1        1        0        0        0  11.2%      0
    5000000 4096      0.017115           54      4096     239324        834     241310     231993       88.3     88       8844      65       0        0        1        1        0        0        0  16.8%      0

 After (v1):

    offset  read      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
    0       1         0.001886          259         1        530          4        545        261       18.0     18        376       2       2        0        1        1        0        0        0   9.1%      0
    0       32        0.001954          513        32      16381         93      16844      15618       19.0     19        408       3       2        0        1        1        0        0        0   9.3%      0
    0       256       0.003266          318       256      78393       1820      81567      61663       30.8     26       1272       7       2        0        1        1        0        0        0  10.4%      0
    0       4096      0.017991           57      4096     227666        855     231915     225781       83.1     83       8888      55       5        0        1        1        0        0        0  15.5%      0
    5000000 1         0.002353          232         1        425          2        432        232       23.0     23        396       2       2        0        1        1        0        0        0   8.7%      0
    5000000 32        0.002573          384        32      12437         47      12571        429       25.0     25        460       4       2        0        1        1        0        0        0   8.5%      0
    5000000 256       0.003994          259       256      64101       2904      67924      51427       37.0     35       1484      11       2        0        1        1        0        0        0  10.6%      0
    5000000 4096      0.018567           56      4096     220609        448     227395     219029       89.8     89       9036      59       5        0        1        1        0        0        0  15.1%      0

 * Point reads, small promoted index (two blocks):

  Config: rows: 400, value size: 200
  Partition size: 84 KiB
  Index size: 65 B

  Notes:
     - No significant difference in time
     - the same disk utilization
     - similar CPU utilization

  Command:

      perf_fast_forward --datasets=large-part-ds1 \
         --run-tests=large-partition-slicing-clustering-keys -c1 --test-case-duration=1

  Before:

    offset  read      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
    0       1         0.000279          470         1       3587         31       3829        478        3.0      3         68       2       1        0        1        1        0        0        0  21.1%      0
    0       32        0.000276         3498        32     116038        811     122756     104033        3.0      3         68       2       1        0        1        1        0        0        0  24.0%      0
    0       256       0.000412         2554       256     621044       1778     732150     559221        2.0      2         72       2       0        0        1        1        0        0        0  32.6%      0
    0       4096      0.000510         1901       400     783883       4078     819058     665616        2.0      2         88       2       0        0        1        1        0        0        0  36.4%      0
    200     1         0.000339         2712         1       2951          8       3001       2569        2.0      2         72       2       0        0        1        1        0        0        0  17.8%      0
    200     32        0.000352         2586        32      91019        266      92427      83411        2.0      2         72       2       0        0        1        1        0        0        0  20.8%      0
    200     256       0.000458         2073       200     436503       1618     453945     385501        2.0      2         88       2       0        0        1        1        0        0        0  29.4%      0
    200     4096      0.000458         2097       200     436475       1676     458349     381558        2.0      2         88       2       0        0        1        1        0        0        0  29.0%      0

  After (v1):

    Testing slicing of large partition using clustering keys:
    offset  read      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
    0       1         0.000278          492         1       3598         30       3831        500        3.0      3         68       2       1        0        1        1        0        0        0  19.4%      0
    0       32        0.000275         3433        32     116153        753     122915      92559        3.0      3         68       2       1        0        1        1        0        0        0  22.5%      0
    0       256       0.000458         2576       256     559437       2978     728075     504375        2.1      2         88       2       0        0        1        1        0        0        0  29.0%      0
    0       4096      0.000506         1888       400     790064       3306     822360     623109        2.0      2         88       2       0        0        1        1        0        0        0  36.6%      0
    200     1         0.000382         2493         1       2619         10       2675       2268        2.0      2         88       2       0        0        1        1        0        0        0  16.3%      0
    200     32        0.000398         2393        32      80422        333      84759      22281        2.0      2         88       2       0        0        1        1        0        0        0  19.0%      0
    200     256       0.000459         2096       200     435943       1608     453989     380749        2.0      2         88       2       0        0        1        1        0        0        0  30.5%      0
    200     4096      0.000458         2097       200     436410       1651     455779     382485        2.0      2         88       2       0        0        1        1        0        0        0  29.2%      0

 * Scan with skips, large index:

  Config: rows: 10000000, value size: 2000
  Partition size: 20 GB
  Index size: 7 MB

  Notes:

    - Similar time, slightly worse for binary search: 36.1 s (scan) vs 36.4 s (bsearch)

    - Slightly more I/O for bsearch: 153'932 reqs / 19'703'260 KiB (scan) vs 155'651 reqs / 19'704'088 KiB (bsearch)

      Binary search reads 828 KB more and issues 1719 more I/O requests.
      It does the extra I/O to read the promoted index offset map.

    - similar (low) memory footprint. The danger here is that by caching index blocks which we touch as we scan
      we would end up caching the whole index. But this is protected against by eviction as demonstrated by the
      last "mem" column.

  Command:

    perf_fast_forward --datasets=large-part-ds1 \
       --run-tests=large-partition-skips -c1 --test-case-duration=1

  Before:

      read    skip      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
      1       1        36.103451            4   5000000     138491         38     138601     138453   153932.0 153932   19703260  153561       1        0        1        1        0        0        0  31.5% 502690

  After (v2):

    read    skip      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
    1       1        37.000145            4   5000000     135135          6     135146     135128   155651.0 155651   19704088  138968       0        0        1        1        0        0        0  34.2%      0

  After (v1):

    read    skip      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
    1       1        36.965520            4   5000000     135261         30     135311     135231   155628.0 155628   19704216  139133       1        0        1        1        0        0        0  33.9% 248738

Also in:

  git@github.com:tgrabiec/scylla.git sstable-use-index-offset-map-v2

Tests:

  - unit (all modes)
  - manual using perf_fast_forward
"

* tag 'sstable-use-index-offset-map-v2' of github.com:tgrabiec/scylla:
  sstables: Add promoted index cache metrics
  position_in_partition: Introduce external_memory_usage()
  cached_file, sstables: Add tracing to index binary search and page cache
  sstables: Dynamically adjust I/O size for index reads
  sstables, tests: Allow disabling binary search in promoted index from perf tests
  sstables: mc: Use binary search over the promoted index
  utils: Introduce cached_file
  sstables: clustered_index: Relax scope of validity of entry_info
  sstables: index_entry: Introduce owning promoted_index_block_position
  compound_compat: Allow constructing composite from a view
  sstables: index_entry: Rename promoted_index_block_position to promoted_index_block_position_view
  sstables: mc: Extract parser for promoted index block
  sstables: mc: Extract parser for clustering out of the promoted index block parser
  sstables: consumer: Extract primitive_consumer
  sstables: Abstract the clustering index cursor behavior
  sstables: index_reader: Rearrange to reduce branching and optionals
2020-06-18 12:09:39 +03:00
Pekka Enberg
4d48f22827 docs: Update packaging documentation 2020-06-18 10:20:08 +03:00
Pekka Enberg
9e279ec2a9 build: Add "dist-check" target
This adds a "dist-check" target to ninja build. The target needs to be
run on the host because package verification is done with Docker.
2020-06-18 10:20:08 +03:00
Pekka Enberg
584c7130a1 scripts/testing: Add "dist-check" for package verification
This adds a "dist-check.sh" script in tools/testing, which performs
distribution package verification by installing packages under Docker.
2020-06-18 10:16:46 +03:00
Pekka Enberg
8e1a561fba build: Add "dist" target 2020-06-18 10:16:46 +03:00
Pekka Enberg
7b7c91a34b reloc: Add '--builddir' option to build_deb.sh
The build system will call this script. It needs control over where the
packages are built to allow building packages for the different build
modes.
2020-06-18 09:54:37 +03:00
Pekka Enberg
013f87f388 build: Add "-ffile-prefix-map" to cxxflags
This patch adds "-ffile-prefix-map" to cxxflags for all build modes.
This has two benefits:

1. Relocatable packages no longer have any special build flags, which
   makes deeper integration with the build system possible (e.g.
   targets for packages).

2. Builds are now reproducible, which makes debugging easier in case you
   only have a backtrace, but no artifacts. Rafael explains:

  "BTW, I think I found another argument for why we should always build
   with -ffile-prefix-map=.

   There was a use-after-free test failure on next promotion. I am unable
   to reproduce it locally, so it would be super nice to be able to
   decode the backtrace.

   I was able to do it, but I had to create a
   /jenkins/workspace/scylla-master/next/ directory and build from there
   to get the same results as the bot."

Acked-by: Botond Dénes <bdenes@scylladb.com>
Acked-by: Nadav Har'El <nyh@scylladb.com>
Acked-by: Rafael Avila de Espindola <espindola@scylladb.com>
2020-06-18 09:54:37 +03:00
Pekka Enberg
71da4e6e79 docs: Document sync-submodules.sh script in maintainer.md 2020-06-18 09:54:37 +03:00
Pekka Enberg
e3376472e8 sync-submodules.sh: Add script for syncing submodules 2020-06-18 09:54:37 +03:00
Pekka Enberg
d759d7567b Add scylla-tools submodule 2020-06-18 09:54:37 +03:00
Pekka Enberg
9edf858d30 Add scylla-jmx submodule 2020-06-18 09:54:37 +03:00
Benny Halevy
5926cfc298 CMakeLists.txt: Update to C++20
Following 427398641a

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200618052956.570260-1-bhalevy@scylladb.com>
2020-06-18 09:51:23 +03:00
Pekka Enberg
02b733c22b Revert "dbuild: Add an option to run with 'docker' or 'podman'"
This reverts commit ac7237f991. The logic
is wrong and always picks "podman" if it's installed on the system even
if the user asks for "docker" with the DBUILD_TOOL environment variable.
This wreaks havoc on machines that have both docker and podman packages
installed, but podman is not configured correctly.
2020-06-18 09:22:33 +03:00
Juliusz Stasiewicz
8628ede009 cdc: Fix segfault when stream ID key is too short
When a token is calculated for stream_id, we check that the key is
exactly 16 bytes long. If it's not, `minimum_token` is returned
and the client receives an empty result.

This used to be the expected behavior for empty keys; now it's
extended to keys of any incorrect length.
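The guard can be sketched like this. The names and the hash are hypothetical stand-ins (the real code uses Scylla's partitioner and token types); the point is that any key of the wrong length now takes the safe path:

```cpp
#include <climits>
#include <cstdint>
#include <string>

// Hypothetical sketch: a stream ID token is computed only for 16-byte
// keys; any other length maps to the minimum token, so the read path
// returns an empty result instead of segfaulting.
constexpr int64_t minimum_token_value = INT64_MIN;

int64_t token_for_stream_key(const std::string& key) {
    if (key.size() != 16) {
        // Previously only the empty key was handled; now any wrong length is.
        return minimum_token_value;
    }
    uint64_t h = 0;  // stand-in for the partitioner's real hash
    for (unsigned char c : key) h = h * 31 + c;
    return static_cast<int64_t>(h);
}
```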

Fixes #6570
2020-06-17 18:19:37 +03:00
Nadav Har'El
095ddf0d41 alternator test: use ConsistentRead=True where missing
All tests that write some data and then read it back need to use
ConsistentRead=True, otherwise the test may sporadically fail on a multi-
node cluster.

In the previous patch we fixed the full_query()/full_scan() convenience
functions. In this patch, I audited the calls to the boto3 read methods -
get_item(), batch_get_item(), query(), scan(), and although most of them
did use ConsistentRead=True as needed, I found some missing and this patch
fixes them.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200616080334.825893-1-nyh@scylladb.com>
2020-06-17 14:57:45 +02:00
Nadav Har'El
c298088375 alternator test: use ConsistentRead=True for full_query/scan
Many of the Alternator tests use the convenience functions full_query()/
full_scan() to read from the table. Almost all these tests need to be able
to read their own writes, i.e., want ConsistentRead=True, but none of them
explicitly specified this parameter. Such tests may sporadically fail when
running on cluster with multiple nodes.

So this patch follows a TODO in the code, and makes ConsistentRead=True
the default for the full_*() functions. The caller can still override it
with ConsistentRead=False - and this is necessary in the GSI tests, because
ConsistentRead=True is not allowed in GSIs.

Note that while ConsistentRead=True is now the default for the full_*()
convenience functions, it is still not the default for the lower-level
boto3 functions scan(), query() and get_item() - so usages of those should
be evaluated as well, and any missing ConsistentRead=True should be
added.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200616073821.824784-1-nyh@scylladb.com>
2020-06-17 14:57:45 +02:00
Raphael S. Carvalho
2f680b3458 size_tiered_backlog_tracker: Rename total_bytes
A reader can assume total_bytes and _total_bytes have the same meaning,
but they don't, so let's give the former a more descriptive name.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200616175055.16771-1-raphaelsc@scylladb.com>
2020-06-17 13:39:30 +03:00
Avi Kivity
d2ab6a24a1 Update seastar submodule
* seastar 8f0858cfd7...b515d63735 (2):
  > do_with: replace seastar::apply() calls with std::apply()
  > Merge "Resolve various http fixmes" from Piotr
2020-06-17 12:59:16 +03:00
Nadav Har'El
ba59034402 merge: Use std::string_view in a few more apis
Merged patch series by Rafael Ávila de Espíndola:

The main advantage is that callers now don't have to construct
sstrings. It is also a 0.09% win in text size (from 41804308 to
41766484 bytes) and the tps reported by

perf_simple_query --duration 16 --smp 1  -m4G >> log 2>err

in 500 randomized runs goes up by 0.16% (from 162259 to 162517).

Rafael Ávila de Espíndola (3):
  service: Pass a std::string_view to client_state::set_keyspace
  cql3: Use a flat_hash_map in untyped_result_set_row
  cql3: Pass std::string_view to various untyped_result_set member
    functions

 cql3/untyped_result_set.hh | 30 ++++++++++++++++--------------
 service/client_state.hh    |  2 +-
 cql3/untyped_result_set.cc |  6 +++---
 service/client_state.cc    |  4 ++--
 4 files changed, 22 insertions(+), 20 deletions(-)
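The gist of the change can be sketched as follows; the struct and member names are illustrative simplifications (the real code uses seastar's sstring and Scylla's own types):

```cpp
#include <string>
#include <string_view>
#include <unordered_map>

// Sketch of the API change: member functions that only inspect a column
// name take std::string_view, so callers can pass literals or substrings
// without materializing a temporary sstring at the call site.
struct untyped_result_set_row_sketch {
    std::unordered_map<std::string, int> _columns;

    bool has(std::string_view name) const {
        // The win is at the caller: no temporary is built just to ask a
        // question. (Heterogeneous lookup could avoid this copy too.)
        return _columns.find(std::string(name)) != _columns.end();
    }
};
```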
2020-06-16 20:31:36 +03:00
Avi Kivity
b608af870b dist: debian: do not require root during package build
Debian package builds provide a root environment for the installation
scripts, since that's what typical installation scripts expect. To
avoid providing actual root, a "fakeroot" system is used where syscalls
are intercepted and any effect that requires root (like chown) is emulated.

However, fakeroot sporadically fails for us, aborting the package build.
Since our install scripts don't really require root (when operating in
the --packaging mode), we can just tell dpkg-buildpackage that we don't
need fakeroot. This ought to fix the sporadic failures.

As a side effect, package builds are faster.

Fixes #6655.
2020-06-16 20:27:04 +03:00
Tomasz Grabiec
266e3f33d1 sstables: Add promoted index cache metrics 2020-06-16 16:15:24 +02:00
Tomasz Grabiec
9885d0e806 position_in_partition: Introduce external_memory_usage() 2020-06-16 16:15:24 +02:00
Tomasz Grabiec
58532cdf11 cached_file, sstables: Add tracing to index binary search and page cache 2020-06-16 16:15:24 +02:00
Tomasz Grabiec
ecb6abe717 sstables: Dynamically adjust I/O size for index reads
Currently, index reader uses 128 KiB I/O size with read-ahead. That is
a waste of bandwidth if index entries contain large promoted index and
binary search will be used within the promoted index, which may not
need to access as much.

The read-ahead is wasted both when using binary search and when using
the scanning cursor.

On the other hand, large I/O is optimal if there is no promoted index
and we're going to parse the whole page.

There is no way to predict which case it is up front before reading
the index.

Attaching dynamic adjustments (per-sstable) lets the system auto-adjust
to the workload based on past history.

The large promoted index workload will settle on reading 32 KiB (with
read-ahead). This is still not optimal; we should lower the buffer
size even more, but that requires a seastar change, so it is deferred.
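The adjustment can be modeled roughly as below. The policy, thresholds, and bounds here are illustrative assumptions, not the exact heuristic in the patch:

```cpp
#include <cstddef>

// Rough model of a per-sstable index I/O size adjuster: shrink the buffer
// when recent reads consumed little of it (binary search touching a small
// promoted index slice), grow it back when whole buffers were consumed
// (scanning a page with no promoted index). Bounds are illustrative.
class index_io_size_model {
    static constexpr std::size_t min_size = 32 * 1024;   // current I/O minimum
    static constexpr std::size_t max_size = 128 * 1024;
    std::size_t _size = max_size;                        // start with large reads
public:
    std::size_t next_read_size() const { return _size; }
    void on_read(std::size_t bytes_used) {
        if (bytes_used <= _size / 2 && _size > min_size) _size /= 2;
        else if (bytes_used == _size && _size < max_size) _size *= 2;
    }
};
```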
2020-06-16 16:15:23 +02:00
Tomasz Grabiec
19501d9ef2 sstables, tests: Allow disabling binary search in promoted index from perf tests 2020-06-16 16:15:23 +02:00
Tomasz Grabiec
c0ee997614 sstables: mc: Use binary search over the promoted index
Currently, lookups in the promoted index are done by scanning the index linearly so the lookup
is O(N). For large partitions that's inefficient. It consumes both a lot of CPU and I/O.

We could do better and use binary search in the index. This patch series switches the mc-format
index reader to do that. Other formats use the old way.

The "mc" format promoted index has an extra structure at the end of the index called "offset map".
It's a vector of offsets of consecutive promoted index entries. This allows us to access random
entries in the index without reading the whole index.

The location of the offset entry for a given promoted index entry can be derived by knowing where
the offset vector ends in the index file, so the offset map also doesn't have to be read completely
into memory.

The most tricky part is caching. We need to cache blocks read from the index file to amortize the
cost of binary search:

  - if the promoted index fits in the 32 KiB which was read from the index when looking for
    the partition entry, we don't want to issue any additional I/O to search the promoted index.

  - with large promoted indexes, the last few bisections will fall into the same I/O block and we
    want to reuse that block.

  - we don't want the cache to grow too big, we don't want to cache the whole promoted index
    as the read progresses over the index. Scanning reads may skip multiple times.

This patch implements a rather simple approach which meets all the
above requirements and is not worse than the current state of affairs:

   - Each index cursor has its own cache of the index file area which corresponds to the
     promoted index. This is managed by the cached_file class.

   - Each index cursor has its own cache of parsed blocks. This allows the upper bound estimation to
     reuse information obtained during lower bound lookup. This estimation is used to limit
     read-aheads in the data file.

   - Each cursor drops entries that it has walked past, so the memory footprint stays O(log N)

   - Cached buffers are accounted to read's reader_permit.
2020-06-16 16:15:23 +02:00
Tomasz Grabiec
c95dd67d11 utils: Introduce cached_file
It is a read-through cache of a file.

Will be used to cache contents of the promoted index area from the
index file.

Currently, cached pages are evicted manually using the invalidate_*()
method family, or when the object is destroyed.

The cached_file represents a subset of the file. This satisfies two
requirements. One is page-aligned caching, where pages are aligned
relative to the start of the underlying file; this matches the
requirements the seastar I/O engine places on I/O requests. The other
is an efficient way to populate the cache from an unaligned buffer
which starts in the middle of the file, when we know that we won't
need to access bytes located before the buffer's position (see
populate_front()). If we couldn't assume that, we wouldn't be able to
insert an unaligned buffer into the cache.
2020-06-16 16:15:23 +02:00
Tomasz Grabiec
ab274b8203 sstables: clustered_index: Relax scope of validity of entry_info
entry_info holds views, which may get invalidated when the containing
index blocks are removed. The current implementation of next_entry()
keeps the blocks in memory as long as the cursor is alive, but that
will change in new implementations of the cursor.

Adjust the assumptions of the tests accordingly.
2020-06-16 16:15:23 +02:00
Tomasz Grabiec
ea2fbcc2cd sstables: index_entry: Introduce owning promoted_index_block_position 2020-06-16 16:15:23 +02:00
Tomasz Grabiec
714da3c644 compound_compat: Allow constructing composite from a view 2020-06-16 16:15:23 +02:00
Tomasz Grabiec
f2e52c433f sstables: index_entry: Rename promoted_index_block_position to promoted_index_block_position_view 2020-06-16 16:15:23 +02:00
Tomasz Grabiec
101fd613c5 sstables: mc: Extract parser for promoted index block
It will be reused in binary search over the index.
2020-06-16 16:15:14 +02:00
Tomasz Grabiec
a557c374fd sstables: mc: Extract parser for clustering out of the promoted index block parser
This parser will be used stand-alone when doing a binary search over
promoted index blocks. We will only parse the start key, not the whole
block.
2020-06-16 16:14:31 +02:00
Tomasz Grabiec
95df7126a7 sstables: consumer: Extract primitive_consumer
This change extracts the parser for primitive types out of
continuous_data_consumer so that it can be used stand-alone
or embedded in other parsers.
2020-06-16 16:14:30 +02:00
Tomasz Grabiec
d5bf540079 sstables: Abstract the clustering index cursor behavior
In preparation for supporting more than one algorithm for lookups in
the promoted index, extract relevant logic out of the index_reader
(which is a partition index cursor).

The clustered index cursor implementation is now hidden behind an
abstract interface called clustered_index_cursor.

The current implementation is put into the
scanning_clustered_index_cursor. It's mostly code movement with minor
adjustments.

In order to encapsulate iteration over promoted index entries,
clustered_index_cursor::next_entry() was introduced.

No change in behavior intended in this patch.
2020-06-16 16:14:17 +02:00
Tomasz Grabiec
a858f87b11 sstables: index_reader: Rearrange to reduce branching and optionals
No change in logic.

This will make further refactoring easier.
2020-06-16 16:13:39 +02:00
Yaron Kaikov
ac7237f991 dbuild: Add an option to run with 'docker' or 'podman'
This adds support for configuring whether to run dbuild with 'docker' or
'podman' via a new environment variable, DBUILD_TOOL. While at it, check
whether 'podman' exists, and prefer it by default as the tool for dbuild.
2020-06-16 15:18:46 +03:00
Gleb Natapov
7ca937778d cql transport: do not log broken pipe error when a client closes its side of a connection abruptly
Fixes #5661

Message-Id: <20200615075958.GL335449@scylladb.com>
2020-06-16 13:59:12 +02:00
Nadav Har'El
41a049d906 README: better explanation of dependencies and build
In this patch I rewrote the explanations in both README.md and HACKING.md
about Scylla's dependencies, and about dbuild.

README.md used to mention only dbuild. It now explains better (I think)
why dbuild is needed in the first place, and that the alternative is
explained in HACKING.md.

HACKING.md used to explain *only* install-dependencies.sh - it now explains
why it is needed, what install-dependencies.sh does, and that it ONLY works
on very recent distributions (e.g., Fedora releases older than 32 are not
supported), and also mentions the alternative - dbuild.

Mentions of incorrect requirements (like "gcc > 8.1") were fixed or dropped.

Mention of the archaic 'scripts/scylla_current_repo' script, which used to
be needed to install additional packages on non-Fedora systems, was dropped.
The script itself is also removed.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200616100253.830139-1-nyh@scylladb.com>
2020-06-16 13:26:04 +02:00
Avi Kivity
bd794629f9 range: rename range template family to interval
nonwrapping_range<T> and related templates represent mathematical
intervals, and are different from C++ ranges. This causes confusion,
especially when C++ ranges and the range templates are used together.

As the first step to disentangle this, introduce a new interval.hh
header with the contents of the old range.hh header, renaming as
follows:

  range_bound  -> interval_bound
  nonwrapping_range -> nonwrapping_interval
  wrapping_range -> wrapping_interval
  Range -> Interval (concepts)

The range alias, which previously aliased wrapping_range, did
not get renamed - instead the interval alias now aliases
nonwrapping_interval, which is the natural interval type. I plan
to follow up by making interval the template, and nonwrapping_interval
the alias (or perhaps even remove it).

To avoid churn, a new range.hh header is provided with the old names
as aliases (range, nonwrapping_range, wrapping_range, range_bound,
and Range) with the same meaning as their former selves.

Tests: unit (dev)
2020-06-16 13:36:20 +03:00
Piotr Sarna
3bcc2e8f09 Merge 'hinted handoff: improve segment replay logic' from PiotrD
This series contains two improvements to hint file replay logic
in hints manager:

- During replay of a hint file, keeping track of the first hint that fails
  to be sent is now done via a simple std::optional variable instead of an
  unordered_set. This slightly reduces the complexity of the next replay
  position calculation.
- A corner case is handled: if reading the commitlog fails, but there is no
  error related to sending hints, the starting position wouldn't be updated.
  This could cause us to replay more hints than necessary.

Tests:

- unit(dev)
- dtest(hintedhandoff_additional_test, dev)

* piodul-hints-manager-handle-commitlog-failure-in-replay-position-calculation:
  hinted handoff: use bool instead of send_state_set
  hinted handoff: update replay position on commitlog failure
  hinted handoff: remove rps_set, use first_failed_rp instead
2020-06-16 12:24:55 +02:00
Avi Kivity
6ba7b8f3f5 Update seastar submodule
* seastar 81242ccc3f...8f0858cfd7 (18):
  > Merge 'future, future-utils: stop returning a variadic future from when_all_succeed'
  > file: introduce layered_file_impl, a helper for layered files
  > net: packet: mark move assignment operator as noexcept
  > core: weak_ptr, weakly_referencable: implement empty default constructor
  > circular_buffer: Fix build with gcc 11 (avoid template parameters in d'tor declaration)
  > test: weak_ptr_test: fix static asserts about nothrow constructibility
  > coroutines: Fix clang build
  > cmake: Delete SEASTAR_COROUTINES_TS
  > Merge "future-util: Mark a few more functions as noexcept" from Rafael
  > tests: add a perf test to measure the fair_queue performance
  > Merge "iostream: make iostream stack nothrow move constructible" from Benny
  > future: Move most of rethrow_with_nested out of line.
  > future_test: Add test for nested exceptions in finally
  > core: Add noexcept to unaligned members functions
  > Merge "core: make weak_ptr and checked_ptr default and move nothrow constructible" from Benny
  > core: file: Fix typo in a comment
  > byteorder: Mark functions as noexcept
  > future: replace CanInvoke concepts with std::invocable
2020-06-16 13:19:36 +03:00
Piotr Sarna
e59d41dad6 alternator: use plain function pointer instead of std::function
Since all function handlers are plain functions without any state,
there's no need to wrap them in a 32-byte std::function
when a plain function pointer would suffice.

Reported-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <913c1de7d02c252b40dc0c545989ec83fe74e5a9.1592291413.git.sarna@scylladb.com>
2020-06-16 12:08:21 +03:00
Raphael S. Carvalho
238ba899c0 compaction_manager: use double for backlog everywhere
Avi says:
"The backlog is a large number that changes slowly, so float
might not have enough resolution to track small changes.

For example, if the backlog is 800GB and changes less than 100kB, then
we won't see a change (float resolution is 2^23 ~ 1:8,000,000).

This is outside the normal range of values (usually the backlog changes
a lot more than 100kB per 15-second period), so it will work, but better
to be more careful."

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200615150621.17543-1-raphaelsc@scylladb.com>
2020-06-16 12:05:05 +03:00
Rafael Ávila de Espíndola
3e1307a6d1 cql3: Pass std::string_view to various untyped_result_set member functions
Taking a std::string_view is a bit more flexible.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-06-15 15:47:15 -07:00
Rafael Ávila de Espíndola
3a9b4e7d26 cql3: Use a flat_hash_map in untyped_result_set_row
No functionality changed. This just makes it possible to use
heterogeneous lookups, which the next patch will add.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-06-15 15:46:25 -07:00
Rafael Ávila de Espíndola
65d56095d0 service: Pass a std::string_view to client_state::set_keyspace
No change in the implementation since it was already copying the
string. Taking a std::string_view is just a bit more flexible.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-06-15 15:46:25 -07:00