Commit Graph

22473 Commits

Author SHA1 Message Date
Avi Kivity
5d99d667ec Merge "Build system improvements for packaging" from Pekka
"
This patch series attempts to decouple package build and release
infrastructure, which is internal to Scylla (the company). The goal of
this series is to make it easy for humans and machines to build the full
Scylla distribution package artifacts, and make it easy to quickly
verify them.

The improvements to the build system are done in the following steps.

1. Make scylla.git a super-module, which has git submodules for
   scylla-jmx and scylla-tools.  A clone of scylla.git is now all that
   is needed to access all source code of all the different components
   that make up a Scylla distribution, which is a preparatory step to
   adding "dist" ninja build target. A scripts/sync-submodules.sh helper
   script is included, which allows easy updating of the submodules to the
   latest head of the respective git repositories.

2. Make builds reproducible by moving the remaining relocatable package
   specific build options from reloc/build_reloc.sh to the build system.
   After this step, you can build the exact same binaries from the git
   repository by using the dbuild version from scylla.git.

3. Add a "dist" target to ninja build, which builds all .rpm and .deb
   packages with one command. To build a release, run:

   $ ./tools/toolchain/dbuild ./configure.py --mode release

   $ ./tools/toolchain/dbuild ninja-build dist

   and you will now have .rpm and .deb packages for all the components of
   a Scylla distribution.

4. Add a "dist-check" target to ninja build for verification of .rpm and
   .deb packages in one command. To verify all the built packages, run:

   $  ninja-build dist-check

   Please note that you must run this step on the host, because the
   target uses Docker under the hood to verify packages by installing
   them on different Linux distributions.

   Currently only CentOS 7 verification is supported.

All these improvements retain backward compatibility. That is, any
existing release infrastructure or other build scripts are completely
unaffected.

Future improvements to consider:

- Package repository generation: add a "ninja repo" command to generate
  .rpm and .deb repositories, which can be uploaded to a web site.
  This makes it possible to build a downloadable Scylla distribution
  from scylla.git. The target requires some configuration, which the
  user has to provide: for example, download URL locations and package
  signing keys.

- Amazon Machine Image (AMI) support: add a "ninja ami" command to
  simplify the steps needed to generate a Scylla distribution AMI.

- Docker image support: add a "ninja docker" command to simplify the
  steps needed to generate a Scylla distribution Docker image.

- Simplify and unify package build: simplify and unify the various shell
  scripts needed to build packages in different git repositories. This
  step will break backward compatibility and can only be done after the
  relevant build scripts and release infrastructure are updated.
"

* 'penberg/packaging/v5' of github.com:penberg/scylla:
  docs: Update packaging documentation
  build: Add "dist-check" target
  scripts/testing: Add "dist-check" for package verification
  build: Add "dist" target
  reloc: Add '--builddir' option to build_deb.sh
  build: Add "-ffile-prefix-map" to cxxflags
  docs: Document sync-submodules.sh script in maintainer.md
  sync-submodules.sh: Add script for syncing submodules
  Add scylla-tools submodule
  Add scylla-jmx submodule
2020-06-18 12:59:52 +03:00
Dejan Mircevski
aec1acd1d5 range_test: Add cases for singular intersection
Intersection was previously not tested for singular ranges.  This
ensures it will always work for singular ranges, too.

Tests: unit(dev)

Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
2020-06-18 12:38:31 +03:00
Yaron Kaikov
e9d5852b0c dbuild: Add an option to run dbuild using podman
Following https://github.com/scylladb/scylla/pull/5333, we want to be
able to run dbuild using podman or docker by setting an environment
variable named DBUILD_TOOL.

DBUILD_TOOL defaults to docker unless it is explicitly set to podman.

Fixes: https://github.com/scylladb/scylla/pull/6644
2020-06-18 12:13:39 +03:00
Avi Kivity
9322c07c71 Merge "Use binary search in sstable promoted index" from Tomasz
"
The "promoted index" is the sstable format's name for the clustering key index within a given partition.
Large partitions with many rows have it. It's embedded in the partition index entry.

Currently, lookups in the promoted index are done by scanning the index linearly so the lookup
is O(N). For large partitions that's inefficient. It consumes both a lot of CPU and I/O.

We could do better and use binary search in the index. This patch series switches the mc-format
index reader to do that. Other formats use the old way.

The "mc" format promoted index has an extra structure at the end of the index called "offset map".
It's a vector of offsets of consecutive promoted index entries. This allows us to access random
entries in the index without reading the whole index.

The location of the offset entry for a given promoted index entry can be derived by knowing where
the offset vector ends in the index file, so the offset map also doesn't have to be read completely
into memory.

The most tricky part is caching. We need to cache blocks read from the index file to amortize the
cost of binary search:

  - if the promoted index fits in the 32 KiB which was read from the index when looking for
    the partition entry, we don't want to issue any additional I/O to search the promoted index.

  - with large promoted indexes, the last few bisections will fall into the same I/O block and we
    want to reuse that block.

  - we don't want the cache to grow too big, we don't want to cache the whole promoted index
    as the read progresses over the index. Scanning reads may skip multiple times.

This series implements a rather simple approach which meets all the
above requirements and is not worse than the current state of affairs:

   - Each index cursor has its own cache of the index file area which corresponds to the promoted index.
     This is managed by the cached_file class.

   - Each index cursor has its own cache of parsed blocks. This allows the upper bound estimation to
     reuse information obtained during lower bound lookup. This estimation is used to limit
     read-aheads in the data file.

   - Each cursor drops entries that it walked past so that memory footprint stays O(log N)

   - Cached buffers are accounted to read's reader_permit.

Later, we could have a single cache shared by many readers. For that, we need to come up with an
eviction policy.

Fixes #4007.

TESTING RESULTS

 * Point reads, large promoted index:

  Config: rows: 10000000, value size: 2000
  Partition size: 20 GB
  Index size: 7 MB

  Notes:

    - Slicing read into the middle of partition (offset=5000000, read=1) is a clear win for the binary search:

      time: 1.9ms vs 22.9ms
      CPU utilization: 8.9% vs 92.3%
      I/O: 21 reqs / 172 KiB vs 29 reqs / 3'520 KiB

      It's 12x faster, CPU utilization is 10x smaller, and disk utilization is 20x smaller.

    - Slicing at the front (offset=0) is a mixed bag.

      time is similar: 1.8ms
      CPU utilization is 6.7x smaller for bsearch: 8.5% vs 57.7%
      disk bandwidth utilization is smaller for bsearch but uses more IOs: 4 reqs / 320 KiB (scan) vs 17 reqs / 188 KiB (bsearch)

      bsearch uses less bandwidth because the series reduces the buffer size used for index file I/O.

      scan is issuing:

         2 * 128 KB (index page)
         2 * 32 KB (data file)

      bsearch is issuing:

         1 * 64 KB (index page)
         15 * 4 KB (promoted index)
         1 * 64 KB (data file)

      The 1 * 64 KB is chosen dynamically by seastar. Sometimes it chooses 2 * 32 KB (with read-ahead).
      32 KB is the minimum I/O currently.

      Disk utilization could be further improved by changing the way seastar's dynamic I/O adjustments work
      so that it uses 1 * 4 KB when it suffices. This is left for the follow-up.

  Command:

        perf_fast_forward --datasets=large-part-ds1 \
         --run-tests=large-partition-slicing-clustering-keys -c1 --test-case-duration=1

  Before:

    offset  read      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
    0       1         0.001836          172         1        545          9        563        175        4.0      4        320       2       2        0        1        1        0        0        0  57.7%      0
    0       32        0.001858          502        32      17220        126      17776      11526        3.2      3        324       2       1        0        1        1        0        0        0  56.4%      0
    0       256       0.002833          339       256      90374        427      91757      85931        7.0      7        776       3       1        0        1        1        0        0        0  41.1%      0
    0       4096      0.017211           58      4096     237984       2011     241802     233870       66.1     66       8376      59       2        0        1        1        0        0        0  21.4%      0
    5000000 1         0.022952           42         1         44          1         45         41       29.2     29       3520      22       2        0        1        1        0        0        0  92.3%      0
    5000000 32        0.023052           43        32       1388         14       1414       1331       31.1     32       3588      26       2        0        1        1        0        0        0  91.7%      0
    5000000 256       0.024795           41       256      10325        129      10721       9993       43.1     39       4544      29       2        0        1        1        0        0        0  86.4%      0
    5000000 4096      0.038856           27      4096     105414        398     106918     103162       95.2     95      12160      78       5        0        1        1        0        0        0  61.4%      0

 After (v2):

    offset  read      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
    0       1         0.001831          248         1        546         21        581        252       17.6     17        188       2       0        0        1        1        0        0        0   8.5%      0
    0       32        0.001910          535        32      16751        626      17770      13896       17.9     19        160       3       0        0        1        1        0        0        0   8.8%      0
    0       256       0.003545          266       256      72207       2333      89076      62852       26.9     24        764       7       0        0        1        1        0        0        0   9.7%      0
    0       4096      0.016800           56      4096     243812        524     245430     239736       83.6     83       8700      64       0        0        1        1        0        0        0  16.6%      0
    5000000 1         0.001968          351         1        508         19        538        380       21.3     21        172       2       0        0        1        1        0        0        0   8.9%      0
    5000000 32        0.002273          431        32      14077        436      15503      11551       22.7     22        268       3       0        0        1        1        0        0        0   8.9%      0
    5000000 256       0.003889          257       256      65824       2197      81833      57813       34.0     37        652      18       0        0        1        1        0        0        0  11.2%      0
    5000000 4096      0.017115           54      4096     239324        834     241310     231993       88.3     88       8844      65       0        0        1        1        0        0        0  16.8%      0

 After (v1):

    offset  read      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
    0       1         0.001886          259         1        530          4        545        261       18.0     18        376       2       2        0        1        1        0        0        0   9.1%      0
    0       32        0.001954          513        32      16381         93      16844      15618       19.0     19        408       3       2        0        1        1        0        0        0   9.3%      0
    0       256       0.003266          318       256      78393       1820      81567      61663       30.8     26       1272       7       2        0        1        1        0        0        0  10.4%      0
    0       4096      0.017991           57      4096     227666        855     231915     225781       83.1     83       8888      55       5        0        1        1        0        0        0  15.5%      0
    5000000 1         0.002353          232         1        425          2        432        232       23.0     23        396       2       2        0        1        1        0        0        0   8.7%      0
    5000000 32        0.002573          384        32      12437         47      12571        429       25.0     25        460       4       2        0        1        1        0        0        0   8.5%      0
    5000000 256       0.003994          259       256      64101       2904      67924      51427       37.0     35       1484      11       2        0        1        1        0        0        0  10.6%      0
    5000000 4096      0.018567           56      4096     220609        448     227395     219029       89.8     89       9036      59       5        0        1        1        0        0        0  15.1%      0

 * Point reads, small promoted index (two blocks):

  Config: rows: 400, value size: 200
  Partition size: 84 KiB
  Index size: 65 B

  Notes:
     - No significant difference in time
     - the same disk utilization
     - similar CPU utilization

  Command:

      perf_fast_forward --datasets=large-part-ds1 \
         --run-tests=large-partition-slicing-clustering-keys -c1 --test-case-duration=1

  Before:

    offset  read      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
    0       1         0.000279          470         1       3587         31       3829        478        3.0      3         68       2       1        0        1        1        0        0        0  21.1%      0
    0       32        0.000276         3498        32     116038        811     122756     104033        3.0      3         68       2       1        0        1        1        0        0        0  24.0%      0
    0       256       0.000412         2554       256     621044       1778     732150     559221        2.0      2         72       2       0        0        1        1        0        0        0  32.6%      0
    0       4096      0.000510         1901       400     783883       4078     819058     665616        2.0      2         88       2       0        0        1        1        0        0        0  36.4%      0
    200     1         0.000339         2712         1       2951          8       3001       2569        2.0      2         72       2       0        0        1        1        0        0        0  17.8%      0
    200     32        0.000352         2586        32      91019        266      92427      83411        2.0      2         72       2       0        0        1        1        0        0        0  20.8%      0
    200     256       0.000458         2073       200     436503       1618     453945     385501        2.0      2         88       2       0        0        1        1        0        0        0  29.4%      0
    200     4096      0.000458         2097       200     436475       1676     458349     381558        2.0      2         88       2       0        0        1        1        0        0        0  29.0%      0

  After (v1):

    Testing slicing of large partition using clustering keys:
    offset  read      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
    0       1         0.000278          492         1       3598         30       3831        500        3.0      3         68       2       1        0        1        1        0        0        0  19.4%      0
    0       32        0.000275         3433        32     116153        753     122915      92559        3.0      3         68       2       1        0        1        1        0        0        0  22.5%      0
    0       256       0.000458         2576       256     559437       2978     728075     504375        2.1      2         88       2       0        0        1        1        0        0        0  29.0%      0
    0       4096      0.000506         1888       400     790064       3306     822360     623109        2.0      2         88       2       0        0        1        1        0        0        0  36.6%      0
    200     1         0.000382         2493         1       2619         10       2675       2268        2.0      2         88       2       0        0        1        1        0        0        0  16.3%      0
    200     32        0.000398         2393        32      80422        333      84759      22281        2.0      2         88       2       0        0        1        1        0        0        0  19.0%      0
    200     256       0.000459         2096       200     435943       1608     453989     380749        2.0      2         88       2       0        0        1        1        0        0        0  30.5%      0
    200     4096      0.000458         2097       200     436410       1651     455779     382485        2.0      2         88       2       0        0        1        1        0        0        0  29.2%      0

 * Scan with skips, large index:

  Config: rows: 10000000, value size: 2000
  Partition size: 20 GB
  Index size: 7 MB

  Notes:

    - Similar time, slightly worse for binary search: 36.1 s (scan) vs 36.4 s (bsearch)

    - Slightly more I/O for bsearch: 153'932 reqs / 19'703'260 KiB (scan) vs 155'651 reqs / 19'704'088 KiB (bsearch)

      Binary search reads 828 KB more and issues 1719 more I/O requests.
      It does the extra I/O to read the promoted index offset map.

    - similar (low) memory footprint. The danger here is that by caching index blocks which we touch as we scan
      we would end up caching the whole index. But this is protected against by eviction as demonstrated by the
      last "mem" column.

  Command:

    perf_fast_forward --datasets=large-part-ds1 \
       --run-tests=large-partition-skips -c1 --test-case-duration=1

  Before:

      read    skip      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
      1       1        36.103451            4   5000000     138491         38     138601     138453   153932.0 153932   19703260  153561       1        0        1        1        0        0        0  31.5% 502690

  After (v2):

    read    skip      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
    1       1        37.000145            4   5000000     135135          6     135146     135128   155651.0 155651   19704088  138968       0        0        1        1        0        0        0  34.2%      0

  After (v1):

    read    skip      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    cpu    mem
    1       1        36.965520            4   5000000     135261         30     135311     135231   155628.0 155628   19704216  139133       1        0        1        1        0        0        0  33.9% 248738

Also in:

  git@github.com:tgrabiec/scylla.git sstable-use-index-offset-map-v2

Tests:

  - unit (all modes)
  - manual using perf_fast_forward
"

* tag 'sstable-use-index-offset-map-v2' of github.com:tgrabiec/scylla:
  sstables: Add promoted index cache metrics
  position_in_partition: Introduce external_memory_usage()
  cached_file, sstables: Add tracing to index binary search and page cache
  sstables: Dynamically adjust I/O size for index reads
  sstables, tests: Allow disabling binary search in promoted index from perf tests
  sstables: mc: Use binary search over the promoted index
  utils: Introduce cached_file
  sstables: clustered_index: Relax scope of validity of entry_info
  sstables: index_entry: Introduce owning promoted_index_block_position
  compound_compat: Allow constructing composite from a view
  sstables: index_entry: Rename promoted_index_block_position to promoted_index_block_position_view
  sstables: mc: Extract parser for promoted index block
  sstables: mc: Extract parser for clustering out of the promoted index block parser
  sstables: consumer: Extract primitive_consumer
  sstables: Abstract the clustering index cursor behavior
  sstables: index_reader: Rearrange to reduce branching and optionals
2020-06-18 12:09:39 +03:00
Pekka Enberg
4d48f22827 docs: Update packaging documentation 2020-06-18 10:20:08 +03:00
Pekka Enberg
9e279ec2a9 build: Add "dist-check" target
This adds a "dist-check" target to ninja build. The target needs to be
run on the host because package verification is done with Docker.
2020-06-18 10:20:08 +03:00
Pekka Enberg
584c7130a1 scripts/testing: Add "dist-check" for package verification
This adds a "dist-check.sh" script in tools/testing, which performs
distribution package verification by installing packages under Docker.
2020-06-18 10:16:46 +03:00
Pekka Enberg
8e1a561fba build: Add "dist" target 2020-06-18 10:16:46 +03:00
Pekka Enberg
7b7c91a34b reloc: Add '--builddir' option to build_deb.sh
The build system will call this script. It needs control over where the
packages are built to allow building packages for the different build
modes.
2020-06-18 09:54:37 +03:00
Pekka Enberg
013f87f388 build: Add "-ffile-prefix-map" to cxxflags
This patch adds "-ffile-prefix-map" to cxxflags for all build modes.
This has two benefits:

1. Relocatable packages no longer have any special build flags, which
   makes deeper integration with the build system possible (e.g.
   targets for packages).

2. Builds are now reproducible, which makes debugging easier in case you
   only have a backtrace, but no artifacts. Rafael explains:

  "BTW, I think I found another argument for why we should always build
   with -ffile-prefix-map=.

   There was a use-after-free test failure on next promotion. I am unable
   to reproduce it locally, so it would be super nice to be able to
   decode the backtrace.

   I was able to do it, but I had to create a
   /jenkins/workspace/scylla-master/next/ directory and build from there
   to get the same results as the bot."

Acked-by: Botond Dénes <bdenes@scylladb.com>
Acked-by: Nadav Har'El <nyh@scylladb.com>
Acked-by: Rafael Avila de Espindola <espindola@scylladb.com>
2020-06-18 09:54:37 +03:00
Pekka Enberg
71da4e6e79 docs: Document sync-submodules.sh script in maintainer.md 2020-06-18 09:54:37 +03:00
Pekka Enberg
e3376472e8 sync-submodules.sh: Add script for syncing submodules 2020-06-18 09:54:37 +03:00
Pekka Enberg
d759d7567b Add scylla-tools submodule 2020-06-18 09:54:37 +03:00
Pekka Enberg
9edf858d30 Add scylla-jmx submodule 2020-06-18 09:54:37 +03:00
Benny Halevy
5926cfc298 CMakeLists.txt: Update to C++20
Following 427398641a

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20200618052956.570260-1-bhalevy@scylladb.com>
2020-06-18 09:51:23 +03:00
Pekka Enberg
02b733c22b Revert "dbuild: Add an option to run with 'docker' or 'podman'"
This reverts commit ac7237f991. The logic
is wrong and always picks "podman" if it's installed on the system even
if the user asks for "docker" with the DBUILD_TOOL environment variable.
This wreaks havoc on machines that have both docker and podman packages
installed, but podman is not configured correctly.
2020-06-18 09:22:33 +03:00
Juliusz Stasiewicz
8628ede009 cdc: Fix segfault when stream ID key is too short
When a token is calculated for stream_id, we check that the key is
exactly 16 bytes long. If it's not, `minimum_token` is returned
and the client receives an empty result.

This used to be the expected behavior for empty keys; now it's
extended to keys of any incorrect length.
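The guard can be sketched like this. The names and the hash are hypothetical stand-ins (the real code uses Scylla's partitioner and token types); the point is that any key of the wrong length now takes the safe path:

```cpp
#include <climits>
#include <cstdint>
#include <string>

// Hypothetical sketch: a stream ID token is computed only for 16-byte
// keys; any other length maps to the minimum token, so the read path
// returns an empty result instead of segfaulting.
constexpr int64_t minimum_token_value = INT64_MIN;

int64_t token_for_stream_key(const std::string& key) {
    if (key.size() != 16) {
        // Previously only the empty key was handled; now any wrong length is.
        return minimum_token_value;
    }
    uint64_t h = 0;  // stand-in for the partitioner's real hash
    for (unsigned char c : key) h = h * 31 + c;
    return static_cast<int64_t>(h);
}
```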

Fixes #6570
2020-06-17 18:19:37 +03:00
Nadav Har'El
095ddf0d41 alternator test: use ConsistentRead=True where missing
All tests that write some data and then read it back need to use
ConsistentRead=True, otherwise the test may sporadically fail on a multi-
node cluster.

In the previous patch we fixed the full_query()/full_scan() convenience
functions. In this patch, I audited the calls to the boto3 read methods -
get_item(), batch_get_item(), query(), scan(), and although most of them
did use ConsistentRead=True as needed, I found some missing and this patch
fixes them.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200616080334.825893-1-nyh@scylladb.com>
2020-06-17 14:57:45 +02:00
Nadav Har'El
c298088375 alternator test: use ConsistentRead=True for full_query/scan
Many of the Alternator tests use the convenience functions full_query()/
full_scan() to read from the table. Almost all these tests need to be able
to read their own writes, i.e., want ConsistentRead=True, but none of them
explicitly specified this parameter. Such tests may sporadically fail when
running on cluster with multiple nodes.

So this patch follows a TODO in the code, and makes ConsistentRead=True
the default for the full_*() functions. The caller can still override it
with ConsistentRead=False - and this is necessary in the GSI tests, because
ConsistentRead=True is not allowed in GSIs.

Note that while ConsistentRead=True is now the default for the full_*()
convenience functions, it is still not the default for the lower-level
boto3 functions scan(), query() and get_item() - so usages of those should
be evaluated as well, and any missing ConsistentRead=True should be
added.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200616073821.824784-1-nyh@scylladb.com>
2020-06-17 14:57:45 +02:00
Raphael S. Carvalho
2f680b3458 size_tiered_backlog_tracker: Rename total_bytes
A reader can assume total_bytes and _total_bytes have the same meaning,
but they don't, so let's give the former a more descriptive name.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200616175055.16771-1-raphaelsc@scylladb.com>
2020-06-17 13:39:30 +03:00
Avi Kivity
d2ab6a24a1 Update seastar submodule
* seastar 8f0858cfd7...b515d63735 (2):
  > do_with: replace seastar::apply() calls with std::apply()
  > Merge "Resolve various http fixmes" from Piotr
2020-06-17 12:59:16 +03:00
Nadav Har'El
ba59034402 merge: Use std::string_view in a few more apis
Merged patch series by Rafael Ávila de Espíndola:

The main advantage is that callers now don't have to construct
sstrings. It is also a 0.09% win in text size (from 41804308 to
41766484 bytes) and the tps reported by

perf_simple_query --duration 16 --smp 1  -m4G >> log 2>err

in 500 randomized runs goes up by 0.16% (from 162259 to 162517).

Rafael Ávila de Espíndola (3):
  service: Pass a std::string_view to client_state::set_keyspace
  cql3: Use a flat_hash_map in untyped_result_set_row
  cql3: Pass std::string_view to various untyped_result_set member
    functions

 cql3/untyped_result_set.hh | 30 ++++++++++++++++--------------
 service/client_state.hh    |  2 +-
 cql3/untyped_result_set.cc |  6 +++---
 service/client_state.cc    |  4 ++--
 4 files changed, 22 insertions(+), 20 deletions(-)
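The gist of the change can be sketched as follows; the struct and member names are illustrative simplifications (the real code uses seastar's sstring and Scylla's own types):

```cpp
#include <string>
#include <string_view>
#include <unordered_map>

// Sketch of the API change: member functions that only inspect a column
// name take std::string_view, so callers can pass literals or substrings
// without materializing a temporary sstring at the call site.
struct untyped_result_set_row_sketch {
    std::unordered_map<std::string, int> _columns;

    bool has(std::string_view name) const {
        // The win is at the caller: no temporary is built just to ask a
        // question. (Heterogeneous lookup could avoid this copy too.)
        return _columns.find(std::string(name)) != _columns.end();
    }
};
```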
2020-06-16 20:31:36 +03:00
Avi Kivity
b608af870b dist: debian: do not require root during package build
Debian package builds provide a root environment for the installation
scripts, since that's what typical installation scripts expect. To
avoid providing actual root, a "fakeroot" system is used where syscalls
are intercepted and any effect that requires root (like chown) is emulated.

However, fakeroot sporadically fails for us, aborting the package build.
Since our install scripts don't really require root (when operating in
the --packaging mode), we can just tell dpkg-buildpackage that we don't
need fakeroot. This ought to fix the sporadic failures.

As a side effect, package builds are faster.

Fixes #6655.
2020-06-16 20:27:04 +03:00
Tomasz Grabiec
266e3f33d1 sstables: Add promoted index cache metrics 2020-06-16 16:15:24 +02:00
Tomasz Grabiec
9885d0e806 position_in_partition: Introduce external_memory_usage() 2020-06-16 16:15:24 +02:00
Tomasz Grabiec
58532cdf11 cached_file, sstables: Add tracing to index binary search and page cache 2020-06-16 16:15:24 +02:00
Tomasz Grabiec
ecb6abe717 sstables: Dynamically adjust I/O size for index reads
Currently, index reader uses 128 KiB I/O size with read-ahead. That is
a waste of bandwidth if index entries contain large promoted index and
binary search will be used within the promoted index, which may not
need to access as much.

The read-ahead is wasted both when using binary search and when using
the scanning cursor.

On the other hand, large I/O is optimal if there is no promoted index
and we're going to parse the whole page.

There is no way to predict which case it is up front before reading
the index.

Attaching dynamic adjustments (per-sstable) lets the system auto-adjust
to the workload based on past history.

The large promoted index workload will settle on reading 32 KiB (with
read-ahead). This is still not optimal; we should lower the buffer
size even more, but that requires a seastar change, so it is deferred.
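The adjustment can be modeled roughly as below. The policy, thresholds, and bounds here are illustrative assumptions, not the exact heuristic in the patch:

```cpp
#include <cstddef>

// Rough model of a per-sstable index I/O size adjuster: shrink the buffer
// when recent reads consumed little of it (binary search touching a small
// promoted index slice), grow it back when whole buffers were consumed
// (scanning a page with no promoted index). Bounds are illustrative.
class index_io_size_model {
    static constexpr std::size_t min_size = 32 * 1024;   // current I/O minimum
    static constexpr std::size_t max_size = 128 * 1024;
    std::size_t _size = max_size;                        // start with large reads
public:
    std::size_t next_read_size() const { return _size; }
    void on_read(std::size_t bytes_used) {
        if (bytes_used <= _size / 2 && _size > min_size) _size /= 2;
        else if (bytes_used == _size && _size < max_size) _size *= 2;
    }
};
```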
2020-06-16 16:15:23 +02:00
Tomasz Grabiec
19501d9ef2 sstables, tests: Allow disabling binary search in promoted index from perf tests 2020-06-16 16:15:23 +02:00
Tomasz Grabiec
c0ee997614 sstables: mc: Use binary search over the promoted index
Currently, lookups in the promoted index are done by scanning the index linearly so the lookup
is O(N). For large partitions that's inefficient. It consumes both a lot of CPU and I/O.

We could do better and use binary search in the index. This patch series switches the mc-format
index reader to do that. Other formats use the old way.

The "mc" format promoted index has an extra structure at the end of the index called "offset map".
It's a vector of offsets of consecutive promoted index entries. This allows us to access random
entries in the index without reading the whole index.

The location of the offset entry for a given promoted index entry can be derived by knowing where
the offset vector ends in the index file, so the offset map also doesn't have to be read completely
into memory.

The most tricky part is caching. We need to cache blocks read from the index file to amortize the
cost of binary search:

  - if the promoted index fits in the 32 KiB which was read from the index when looking for
    the partition entry, we don't want to issue any additional I/O to search the promoted index.

  - with large promoted indexes, the last few bisections will fall into the same I/O block and we
    want to reuse that block.

  - we don't want the cache to grow too big, we don't want to cache the whole promoted index
    as the read progresses over the index. Scanning reads may skip multiple times.

This patch implements a rather simple approach which meets all the
above requirements and is not worse than the current state of affairs:

   - Each index cursor has its own cache of the index file area which corresponds to the
     promoted index. This is managed by the cached_file class.

   - Each index cursor has its own cache of parsed blocks. This allows the upper bound estimation to
     reuse information obtained during lower bound lookup. This estimation is used to limit
     read-aheads in the data file.

   - Each cursor drops entries that it has walked past, so the memory footprint stays O(log N)

   - Cached buffers are accounted to read's reader_permit.
2020-06-16 16:15:23 +02:00
Tomasz Grabiec
c95dd67d11 utils: Introduce cached_file
It is a read-through cache of a file.

Will be used to cache contents of the promoted index area from the
index file.

Currently, cached pages are evicted manually using the invalidate_*()
method family, or when the object is destroyed.

The cached_file represents a subset of the file. This satisfies two
requirements. One is page-aligned caching, where pages are aligned
relative to the start of the underlying file; this matches the
requirements the seastar I/O engine places on I/O requests. The other
is an efficient way to populate the cache from an unaligned buffer
which starts in the middle of the file, when we know that we won't
need to access bytes located before the buffer's position (see
populate_front()). If we couldn't assume that, we wouldn't be able to
insert an unaligned buffer into the cache.
2020-06-16 16:15:23 +02:00
Tomasz Grabiec
ab274b8203 sstables: clustered_index: Relax scope of validity of entry_info
entry_info holds views, which may get invalidated when the containing
index blocks are removed. The current implementation of next_entry()
keeps the blocks in memory as long as the cursor is alive, but that
will change in new implementations of the cursor.

Adjust the assumptions of the tests accordingly.
2020-06-16 16:15:23 +02:00
Tomasz Grabiec
ea2fbcc2cd sstables: index_entry: Introduce owning promoted_index_block_position 2020-06-16 16:15:23 +02:00
Tomasz Grabiec
714da3c644 compound_compat: Allow constructing composite from a view 2020-06-16 16:15:23 +02:00
Tomasz Grabiec
f2e52c433f sstables: index_entry: Rename promoted_index_block_position to promoted_index_block_position_view 2020-06-16 16:15:23 +02:00
Tomasz Grabiec
101fd613c5 sstables: mc: Extract parser for promoted index block
It will be reused in binary search over the index.
2020-06-16 16:15:14 +02:00
Tomasz Grabiec
a557c374fd sstables: mc: Extract parser for clustering out of the promoted index block parser
This parser will be used stand-alone when doing a binary search over
promoted index blocks. We will only parse the start key, not the whole
block.
2020-06-16 16:14:31 +02:00
Tomasz Grabiec
95df7126a7 sstables: consumer: Extract primitive_consumer
This change extracts the parser for primitive types out of
continuous_data_consumer so that it can be used stand-alone
or embedded in other parsers.
2020-06-16 16:14:30 +02:00
Tomasz Grabiec
d5bf540079 sstables: Abstract the clustering index cursor behavior
In preparation for supporting more than one algorithm for lookups in
the promoted index, extract relevant logic out of the index_reader
(which is a partition index cursor).

The clustered index cursor implementation is now hidden behind an
abstract interface called clustered_index_cursor.

The current implementation is put into the
scanning_clustered_index_cursor. It's mostly code movement with minor
adjustments.

In order to encapsulate iteration over promoted index entries,
clustered_index_cursor::next_entry() was introduced.

No change in behavior intended in this patch.
2020-06-16 16:14:17 +02:00
Tomasz Grabiec
a858f87b11 sstables: index_reader: Rearrange to reduce branching and optionals
No change in logic.

This will make further refactoring easier.
2020-06-16 16:13:39 +02:00
Yaron Kaikov
ac7237f991 dbuild: Add an option to run with 'docker' or 'podman'
This adds support for configuring whether to run dbuild with 'docker' or
'podman' via a new environment variable, DBUILD_TOOL. While at it, check
whether 'podman' exists, and prefer it by default as the tool for dbuild.
2020-06-16 15:18:46 +03:00
Gleb Natapov
7ca937778d cql transport: do not log broken pipe error when a client closes its side of a connection abruptly
Fixes #5661

Message-Id: <20200615075958.GL335449@scylladb.com>
2020-06-16 13:59:12 +02:00
Nadav Har'El
41a049d906 README: better explanation of dependencies and build
In this patch I rewrote the explanations in both README.md and HACKING.md
about Scylla's dependencies, and about dbuild.

README.md used to mention only dbuild. It now explains better (I think)
why dbuild is needed in the first place, and that the alternative is
explained in HACKING.md.

HACKING.md used to explain *only* install-dependencies.sh - it now explains
why it is needed, what install-dependencies.sh does, and that it ONLY works
on very recent distributions (e.g., Fedora releases older than 32 are not
supported), and also mentions the alternative - dbuild.

Mentions of incorrect requirements (like "gcc > 8.1") were fixed or dropped.

Mention of the archaic 'scripts/scylla_current_repo' script, which used to
be needed to install additional packages on non-Fedora systems, was dropped.
The script itself is also removed.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200616100253.830139-1-nyh@scylladb.com>
2020-06-16 13:26:04 +02:00
Avi Kivity
bd794629f9 range: rename range template family to interval
nonwrapping_range<T> and related templates represent mathematical
intervals, and are different from C++ ranges. This causes confusion,
especially when C++ ranges and the range templates are used together.

As the first step to disentangle this, introduce a new interval.hh
header with the contents of the old range.hh header, renaming as
follows:

  range_bound  -> interval_bound
  nonwrapping_range -> nonwrapping_interval
  wrapping_range -> wrapping_interval
  Range -> Interval (concepts)

The range alias, which previously aliased wrapping_range, did
not get renamed - instead the interval alias now aliases
nonwrapping_interval, which is the natural interval type. I plan
to follow up by making interval the template, and nonwrapping_interval
the alias (or perhaps even remove it).

To avoid churn, a new range.hh header is provided with the old names
as aliases (range, nonwrapping_range, wrapping_range, range_bound,
and Range) with the same meaning as their former selves.

Tests: unit (dev)
2020-06-16 13:36:20 +03:00
Piotr Sarna
3bcc2e8f09 Merge 'hinted handoff: improve segment replay logic' from PiotrD
This series contains two improvements to hint file replay logic
in hints manager:

- During replay of a hint file, keeping track of the first hint that fails
  to be sent is now done via a simple std::optional variable instead of an
  unordered_set. This slightly reduces the complexity of the next replay
  position calculation.
- A corner case is handled: if reading the commitlog fails, but there is no
  error related to sending hints, the starting position wouldn't be updated.
  This could cause us to replay more hints than necessary.

Tests:

- unit(dev)
- dtest(hintedhandoff_additional_test, dev)

* piodul-hints-manager-handle-commitlog-failure-in-replay-position-calculation:
  hinted handoff: use bool instead of send_state_set
  hinted handoff: update replay position on commitlog failure
  hinted handoff: remove rps_set, use first_failed_rp instead
2020-06-16 12:24:55 +02:00
Avi Kivity
6ba7b8f3f5 Update seastar submodule
* seastar 81242ccc3f...8f0858cfd7 (18):
  > Merge 'future, future-utils: stop returning a variadic future from when_all_succeed'
  > file: introduce layered_file_impl, a helper for layered files
  > net: packet: mark move assignment operator as noexcept
  > core: weak_ptr, weakly_referencable: implement empty default constructor
  > circular_buffer: Fix build with gcc 11 (avoid template parameters in d'tor declaration)
  > test: weak_ptr_test: fix static asserts about nothrow constructibility
  > coroutines: Fix clang build
  > cmake: Delete SEASTAR_COROUTINES_TS
  > Merge "future-util: Mark a few more functions as noexcept" from Rafael
  > tests: add a perf test to measure the fair_queue performance
  > Merge "iostream: make iostream stack nothrow move constructible" from Benny
  > future: Move most of rethrow_with_nested out of line.
  > future_test: Add test for nested exceptions in finally
  > core: Add noexcept to unaligned members functions
  > Merge "core: make weak_ptr and checked_ptr default and move nothrow constructible" from Benny
  > core: file: Fix typo in a comment
  > byteorder: Mark functions as noexcept
  > future: replace CanInvoke concepts with std::invocable
2020-06-16 13:19:36 +03:00
Piotr Sarna
e59d41dad6 alternator: use plain function pointer instead of std::function
Since all function handlers are plain functions without any state,
there's no need to wrap them in a 32-byte std::function
when a plain function pointer would suffice.

Reported-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <913c1de7d02c252b40dc0c545989ec83fe74e5a9.1592291413.git.sarna@scylladb.com>
2020-06-16 12:08:21 +03:00
Raphael S. Carvalho
238ba899c0 compaction_manager: use double for backlog everywhere
Avi says:
"The backlog is a large number that changes slowly, so float
might not have enough resolution to track small changes.

For example, if the backlog is 800GB and changes less than 100kB, then
we won't see a change (float resolution is 2^23 ~ 1:8,000,000).

This is outside the normal range of values (usually the backlog changes
a lot more than 100kB per 15-second period), so it will work, but better
to be more careful."

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200615150621.17543-1-raphaelsc@scylladb.com>
2020-06-16 12:05:05 +03:00
Rafael Ávila de Espíndola
3e1307a6d1 cql3: Pass std::string_view to various untyped_result_set member functions
Taking a std::string_view is a bit more flexible.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-06-15 15:47:15 -07:00
Rafael Ávila de Espíndola
3a9b4e7d26 cql3: Use a flat_hash_map in untyped_result_set_row
No functionality changed. This just makes it possible to use
heterogeneous lookups, which the next patch will add.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-06-15 15:46:25 -07:00
Rafael Ávila de Espíndola
65d56095d0 service: Pass a std::string_view to client_state::set_keyspace
No change in the implementation since it was already copying the
string. Taking a std::string_view is just a bit more flexible.

Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
2020-06-15 15:46:25 -07:00