scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-28 12:17:02 +00:00

Author	SHA1	Message	Date
Rafael Ávila de Espíndola	a67f5b2de1	sstable_3_x_test: Call await_background_jobs on every test Now every tests starts by deferring a call to await_background_jobs. That can be verified with: $ git grep -B 1 await_background test/boost/sstable_3_x_test.cc \| grep THREAD \| wc -l 90 $ git grep -A 1 SEASTAR_THREAD_TEST_CASE test/boost/sstable_3_x_test.cc \| grep await_background \| wc -l 90 Thanks to Raphael Carvalho for noticing it. Refs #6624 Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com> Reviewed-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20200619220048.1091630-1-espindola@scylladb.com>	2020-06-22 14:03:13 +03:00
Raphael S. Carvalho	a82afa68aa	test/lib/cql_test_env: reenable auto compaction after `e40aa042a7`, auto compaction is explicitly disabled on all tables being populated and only enabled later on in the boot process. we forgot to update cql_test_env to also reenable auto compaction, so unit tests based on cql_test_env were not compacting at all. database_test, for example, was running out of file descriptors because the number kept growing unboundly due to lack of compaction. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20200618225621.15937-1-raphaelsc@scylladb.com>	2020-06-22 14:03:13 +03:00
Benny Halevy	a3918bdc96	distributed_loader: reenable verify_owner_and_mode when loading new sstables The call to `verify_owner_and_mode` from `flush_upload_dir` fell between the cracks in `b34c0c2ff6` (distributed_loader: rework uploading of SSTables). It causes https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-release/528/testReport/nodetool_additional_test/TestNodetool/nodetool_refresh_with_wrong_upload_modes_test/ to fail like this: ``` /Directory cannot be accessed .* write/ not found in 'Nodetool command '/jenkins/workspace/scylla-master/dtest-release/scylla/.ccm/scylla-repository/7351db7cab7bbf907172940d0bbf8b90afde90ba/scylla-tools-java/bin/nodetool -h 127.0.87.1 -p 7187 refresh -- keyspace1 standard1' failed; exit status: 1; stdout: nodetool: Scylla API server HTTP POST to URL '/storage_service/sstables/keyspace1' failed: Failed to load new sstables: std::filesystem::__cxx11::filesystem_error (error system:13, filesystem error: remove failed: Permission denied [/jenkins/workspace/scylla-master/dtest-release/scylla/.dtest/dtest-rqzo7km7/test/node1/data/keyspace1/standard1-8a57a660b29611eabf0c000000000000/upload/mc-3-big-TOC.txt]) ``` Reenable it in this patch makes the dtest pass again. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20200621140439.85843-1-bhalevy@scylladb.com>	2020-06-22 14:03:13 +03:00
Benny Halevy	aa4b4311e2	configure: do not define SEASTAR_ENABLE_ALLOC_FAILURE_INJECTION in debug mode Seastar uses the default allocator in debug mode so it can't inject allocation failures in this mode. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Test: mutation_test(debug) Message-Id: <20200621131819.72108-1-bhalevy@scylladb.com>	2020-06-22 14:03:13 +03:00
Nadav Har'El	e4eca5211a	docker: add option to start Alternator with HTTPS We already have a docker image option to enable alternator on an unencrypted port, "--alternator-port", but we forgot to also allow the similar option for enabling alternator on an encrypted (HTTPS) port: "--alternator-https-port" so this patch adds the missing option, and documents how to use it. Note that using this option is not enough. When this option is used, Alternator also requires two files, /etc/scylla/scylla.crt and /etc/scylla/scylla.key, to be inserted into the image. These files should contain the SSL certificate, and key, respectively. If these files are missing, you will get an error in the log about the missing file. Fixes #6583. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20200621125219.12274-1-nyh@scylladb.com>	2020-06-22 14:03:13 +03:00
Avi Kivity	7351db7cab	Merge "Reshape upload files and reshard+reshape at boot" from Glauber " This patchset adds a reshape operation to each compaction strategy; that is a strategy-specific way of detecting if SSTables are in-strategy or off-strategy, and in case they are offstrategy moving them to in-strategy. Often times the number of SSTables in a particular slice of the sstable set matters for that decision (number of SSTables in the same time window for TWCS, number of SSTables per tier for STCS, number of L0 SSTables for LCS). We want to be more lenient for operations that keep the node offline, like reshape at boot, but more forgiving for operations like upload, which run in maintenance mode. To accomodate for that the threshold for considering a slice of the SSTable set offstrategy is passed as a parameter Once this patchset is applied, the upload directory will reshape the SSTables before moving them to the main directory (if needed). One side effect of it is that it is no longer necessary to take locks for the refresh operation nor disable writes in the table. With the infrastructure that we have built in the upload directory, we can apply the same set of steps to populate_column_family. Using the sstable_directory to scan the files we can reshard and reshape (usually if we resharded a reshape will be necessary) with the node still offline. This has the benefit of never adding shared SSTables to the table. Applying this patchset will unlock a host of cleanups: - we can get rid of all testing for shared sstables, sstable_need_rewrite, etc. - we can remove the resharding backlog tracker. and many others. Most cleanups are deferred for a later patchset, though. " * 'reshard-reshape-v4' of github.com:glommer/scylla: distributed_loader: reshard before the node is made online distributed_loader: rework uploading of SSTables sstable_directory: add helper to reshape existing unshared sstables compaction_strategy: add method to reshape SSTables compaction: add a new compaction type, Reshape compaction: add a size and throught pretty printer. compaction: add default implementation for some pure functions tests: fix fragile database tests distributed_loader.cc: add a helper function to extract the highest SSTable version found distributed_loader.cc : extract highest_generation_seen code compaction_manager: rename run_resharding_job distributed_loader: assume populate_column_families is run in shard 0 api: do not allow user to meddle with auto compaction too early upload: use custom error handler for upload directory sstable_directory: fix debug message	2020-06-18 17:04:53 +03:00
Glauber Costa	e40aa042a7	distributed_loader: reshard before the node is made online This patch moves the resharding process to use the new directory_with_sstables_handler infrastructure. There is no longer a clear reshard step, and that just becomes a natural part of populate_column_family. In main.cc, a couple of changes are necessary to make that happen. The first one obviously is to stop calling reshard. We also need to make sure that: - The compaction manager is started much earlier, so we can register resharding jobs with it. - auto compactions are disabled in the populate method, so resharding doesn't have to fight for bandwidth with auto compactions. Now that we are resharding through the sstable_directory, the old resharding code can be deleted. There is also no need to deal with the resharding backlog either, because the SSTables are not yet added to the sstable set at this point. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2020-06-18 09:37:18 -04:00
Glauber Costa	b34c0c2ff6	distributed_loader: rework uploading of SSTables Uploading of SSTables is problematic: for historical reasons it takes a lock that may have to wait for ongoing compactions to finish, then it disables writes in the table, and then it goes loading SSTables as if it knew nothing about them. With the sstable_directory infrastructure we can do much better: * we can reshard and reshape the SSTables in place, keeping the number of SSTables in check. Because this is an background process we can be fairly aggressive and set the reshape mode to strict. * we can then move the SSTables directly into the main directory. Because we know they are few in number we can call the more elegant add_sstable_and_invalidate_cache instead of the open coding currently done by load_new_sstables * we know they are not shared (if they were, we resharded them), simplifying the load process even further. The major changes after this patch is applied is that all compactions (resharding and reshape) needed to make the SSTables in-strategy are done in the streaming class, which reduces the impact of this operation on the node. When the SSTables are loaded, subsequent reads will not suffer as we will not be adding shared SSTables in potential high numbers, nor will we reshard in the compaction class. There is also no more need for a lock in the upload process so in the fast path where users are uploading a set of SSTables from a backup this should essentially be instantaneous. The lock, as well as the code to disable and enable table writes is removed. A future improvement is to bypass the staging directory too, in which case the reshaping compaction would already generate the view updates. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2020-06-18 09:37:18 -04:00
Glauber Costa	4d6aacb265	sstable_directory: add helper to reshape existing unshared sstables Before moving SSTables to the main directory, we may need to reshape them into in-strategy. This patch provides helper code that reshapes the SSTables that are known to be unshared local in the sstable directory, and updates the sstable directory with the result. Rehaping can be made more or less aggressive by passing a reshape mode (relaxed or strict), which will influence the amount of SSTables reshape can tolerate to consider a particular slice of the SSTable set offstrategy. Because the compaction expects an std::vector everywhere, we changed our chunked vector for the unshared sstables to a std::vector so we can more easily pass it around without conversions. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2020-06-18 09:37:18 -04:00
Glauber Costa	3c254dd49d	compaction_strategy: add method to reshape SSTables Some SSTable sets are considered to be off-strategy: they are in a shape that is at best not optimal and at worst adversarial to the current compaction strategy. This patch introduces the compaction strategy-specific method get_reshaping_job(). Given an SSTable set, it returns one compaction that can be done to bring the table closer to being in-strategy. The caller can then call this repeatedly until the table is fully in-strategy. As an example of how this is supposed to work, consider TWCS: some SSTables will belong to a single window -> in which case they are already in-strategy and don't need to be compacted, and others span multiple windows in which case they are considered off-strategy and have to be compacted. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2020-06-18 09:37:18 -04:00
Glauber Costa	0467bd0a94	compaction: add a new compaction type, Reshape From the point of view of selecting SSTables and its expected output, Reshaping really is just a normal compaction. However, there are some key differences that we would like to uphold: - Reshaping is done separately from the main SSTable set. It can be done with the node offline, or it can be done in a separate priority class. Either way, we don't want those SSTables to count towards backlog. For reads, because the SSTables are not yet registered in the backlog tracker (if offline or coming from upload), if we were to deduct compaction charges from it we would go negative. For writes, we don't want to deal with backlog management here because we will add the SSTable at once when reshaping is finished. - We don't need to do early replacements. - We would like to clearly mark the Reshaping compactions as such in the logs For the reasons above, it is nicer to add a new Reshape compaction type, a subclass of compaction, that upholds such properties. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2020-06-18 09:37:18 -04:00
Glauber Costa	c4841fa735	compaction: add a size and throught pretty printer. This is so we don't always use MB. Sometimes it is best to report GB, TB, and their equivalent throughput metrics. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2020-06-18 09:37:18 -04:00
Takuya ASADA	76af112c7d	scylla_swap_setup: don't create sparse file fallocate creates sparse file, XFS rejects such file for swapfile: https://bugzilla.redhat.com/show_bug.cgi?id=1129205 Use dd instead. Fixes #6650	2020-06-18 16:26:17 +03:00
Avi Kivity	7129662edb	Update seastar submodule * seastar b515d63735...a6c8105443 (15): > Merge "Move thread_wake_task out of line" from Rafael > future: Fix result_of_apply instantiation > future: Move the function in then/then_wrapped only once > io-queue: Dont leak desc > fair-queue: Keep request queues self-consistent > app: Do not coredump on missing options > future: promise: mark set_value as noexcept > future: future_state: mark set as noexcept > fair_queue_perf: Remove unused captures > file_io_test: Add missing override > Merge "tmp_dir: handle remove failure in do_with_thread" from Benny > api-level: Add missing api_v4 namespace > future: Fix CanApplyTuple > http: use logger instead of stderr for erro reporting > sstring: Generalize make_sstring a bit	2020-06-18 16:16:05 +03:00
Glauber Costa	ef85a2cec5	compaction: add default implementation for some pure functions There are some functions that are today pure that have an obvious implementation (for example on_new_partition, do nothing). We'll add default implementations to the compaction class, which reduces the boilerplate needed to add a new compaction type. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2020-06-18 09:00:28 -04:00
Glauber Costa	96abf80c5e	tests: fix fragile database tests This test wants to make sure that an SSTable with generation number 4, which is incomplete, gets deleted. While that works today, the way the test verifies that is fragile because new SSTables can and will be created, especially in the local directory that sees a lot of activity on startup. It works if generations don't go that far, but with SMP, even a single SSTable in the right shard can end up having generation 4. In practice this isn't an issue today because the code calls cf.update_sstables_known_generation() as soon as it sees a file, before deciding whether or not the file has to be deleted. However this behavior is not guaranteed and is changing. The best way to fix this would be to check if the file is the same, including its inode. But given that this is just a unit test (which is almost always if not always single node), I am just moving to use the peers table instead. Again, we could have created a user table, but it's just not worth the hassle. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2020-06-18 09:00:28 -04:00
Glauber Costa	072d0d3073	distributed_loader.cc: add a helper function to extract the highest SSTable version found Using a map reduce in a shared sstable directory, finds the highest version seen across all shards. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2020-06-18 09:00:28 -04:00
Glauber Costa	baa82b3a26	distributed_loader.cc : extract highest_generation_seen code We'll use it in one more other location so extract it to common code. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2020-06-18 09:00:27 -04:00
Glauber Costa	9902af894a	compaction_manager: rename run_resharding_job It will be used to run any custom job where the caller provides a function. One such example is indeed resharding, but reshaping SSTables can also fall here. The semaphore is also renamed, and we'll allow only one custom job at a time (across all possible types). We also remove the assumption of the scheduling group. The caller has to have already placed the code in the correct CPU scheduling group. The I/O priority class comes from the descriptor. To make sure that we don't regress, we wrap the entire reshard-at-boot code in the compaction class. Currently the setup would be done in the main group, and the actual resharding in the compaction group. Note that this is temporary, as this code is about to change. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2020-06-18 09:00:27 -04:00
Glauber Costa	45f3bc679e	distributed_loader: assume populate_column_families is run in shard 0 This is already the case, since main.cc calls it from shard 0 and relies on it to spread the information to the other shards. We will turn this branch - which is always taken - into an assert for the sake of future-proofing and soon add even more code that relies on this being executed in shard 0. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2020-06-18 09:00:27 -04:00
Glauber Costa	bb07678346	api: do not allow user to meddle with auto compaction too early We are about to use the auto compaction property during the populate/reshard process. If the user toggles it, the database can be left in a bad state. There should be no reason why a user would want to set that up this early. So we'll disallow it. To do that property, it is better if the check of whether or not the storage service is ready to accomodate this request is local to the storage service itself. We then move the logic of set_tables_autocompaction from api to the storage service. The API layer now merely translates the table names and pass it along. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2020-06-18 09:00:25 -04:00
Pekka Enberg	3d128f5b51	scripts: Rename sync-submodules.sh to refresh-submodules.sh Rename the script as per Nadav's suggestion and update documentation within the script. Message-Id: <20200618123446.32496-1-penberg@scylladb.com>	2020-06-18 15:39:23 +03:00
Rafael Ávila de Espíndola	f6e407ecd2	everywhere: Prepare for seastar api v4 (when_all_succeed return value) The seastar api v4 changes the return type of when_all_succeed. This patch adds discard_result when that is best solution to handle the change. This doesn't do the actual update to v4 since there are still a few issues left to fix in seastar. A patch doing just the update will follow. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com> Message-Id: <20200617233150.918110-1-espindola@scylladb.com>	2020-06-18 15:13:56 +03:00
Amnon Heiman	bc854342e7	approx_exponential_histogram: Makes the implementation clearer This patch aim to make the implementation and usage of the approx_exponential_histogram clearer. The approx_exponential_histogram Uses a combination of Min, Max, Precision and number of buckets where the user needs to pick 3. Most of the changes in the patch are about documenting the class and method, but following the review there are two functionality changes: 1. The user would pick: Min, Max and Precision and the number of buckets will be calculated from these values. 2. The template restrictions are now state in a requires so voiolation will be stop at compile time.	2020-06-18 14:18:21 +03:00
Tomasz Grabiec	17ee1a2eed	utils: cached_file: Fix compilation error Fix field initialization order problem. In file included from ./sstables/mc/bsearch_clustered_cursor.hh:28, from sstables/index_reader.hh:32, from sstables/sstables.cc:49: ./utils/cached_file.hh: In constructor 'cached_file::stream::stream(cached_file&, const seastar::io_priority_class&, tracing::trace_state_ptr, cached_file::page_idx_type, cached_file::offset_type)': ./utils/cached_file.hh:119:34: error: 'cached_file::stream::_trace_state' will be initialized after [-Werror=reorder] 119 \| tracing::trace_state_ptr _trace_state; \| ^~~~~~~~~~~~ ./utils/cached_file.hh:117:23: error: 'cached_file::page_idx_type cached_file::stream::_page_idx' [-Werror=reorder] 117 \| page_idx_type _page_idx; \| ^~~~~~~~~ ./utils/cached_file.hh:127:9: error: when initialized here [-Werror=reorder] 127 \| stream(cached_file& cf, const io_priority_class& pc, tracing::trace_state_ptr trace_state, \| ^~~~~~ Message-Id: <1592478082-22505-1-git-send-email-tgrabiec@scylladb.com>	2020-06-18 14:08:29 +03:00
Raphael S. Carvalho	03db448a92	sstables/backlog_tracker: Fix incorrect calculation of Compaction backlog When debugging this for first time c412a7a, I thought the problem, which causes backlog to be negative, was a bug in the implementation of the formula, but it turns out that the bug is actually in the formula itself. Not limiting the scope of this bug to STCS because its tracker is inherited by the trackers of other strategies, meaning they're also affected by this. The backlog for a SSTable is known to be Bi = Ei * log(T / Si) Where T = total Size minus compacted bytes for a table, Ci = Compacted Bytes for a SSTable, Si = Size of a SStable Ei = Ci - Si The problem was that we were assuming T > Si, but it can happen that T is lower than Si if the table in question is decreasing in size. If we rewrite SSTable backlog as Bi = Ei * log (T) - Ei * log(Si) It becomes even clearer why T cannot be lower than Si whatsoever, or the backlog calculation can go wrong because first term becomes lower than the second. Fixing the formula consists of changing it to Bi = Ei * log (T / Ei) Bi = Ei * log (T) - Ei * log (Si - Ci) After this change, the backlog still behave in a very similar way as before, which can be confirmed via this graph: https://user-images.githubusercontent.com/1409139/79627762-71afdf80-8111-11ea-9ebc-0831c4e3d9c6.png Fixes #6021. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20200616174712.16505-1-raphaelsc@scylladb.com>	2020-06-18 13:56:47 +03:00
Avi Kivity	5d99d667ec	Merge "Build system improvements for packaging" from Pekka " This patch series attempts to decouple package build and release infrastructure, which is internal to Scylla (the company). The goal of this series is to make it easy for humans and machines to build the full Scylla distribution package artifacts, and make it easy to quickly verify them. The improvements to build system are done in the following steps. 1. Make scylla.git a super-module, which has git submodules for scylla-jmx and scylla-tools. A clone of scylla.git is now all that is needed to access all source code of all the different components that make up a Scylla distribution, which is a preparational step to adding "dist" ninja build target. A scripts/sync-submodules.sh helper script is included, which allows easy updating of the submodules to the latest head of the respective git repositories. 2. Make builds reproducible by moving the remaining relocatable package specific build options from reloc/build_reloc.sh to the build system. After this step, you can build the exact same binaries from the git repository by using the dbuild version from scylla.git. 3. Add a "dist" target to ninja build, which builds all .rpm and .deb packages with one command. To build a release, run: $ ./tools/toolchain/dbuild ./configure.py --mode release $ ./tools/toolchain/dbuild ninja-build dist and you will now have .rpm and .deb packages to all the components of a Scylla distribution. 4. Add a "dist-check" target to ninja build for verification of .rpm and .deb packages in one command. To verify all the built packages, run: $ ninja-build dist-check Please note that you must run this step on the host, because the target uses Docker under the hood to verify packages by installing them on different Linux distributions. Currently only CentOS 7 verification is supported. All these improvements are done so that backward compatibility is retained. That is, any existing release infrastructure or other build scripts are completely unaffacted. Future improvements to consider: - Package repository generation: add a "ninja repo" command to generate a .rpm and .deb repositories, which can be uploaded to a web site. This makes it possible to build a downloadable Scylla distribution from scylla.git. The target requires some configuration, which user has to provide. For example, download URL locations and package signing keys. - Amazon Machine Image (AMI) support: add a "ninja ami" command to simplify the steps needed to generate a Scylla distribution AMI. - Docker image support: add a "ninja docker" command to simplify the steps needed to generate a Scylla distribution Docker image. - Simplify and unify package build: simplify and unify the various shell scripts needed to build packages in different git repositories. This step will break backward compatiblity and can be done only after relevant build scripts and release infrastructure is updated. " * 'penberg/packaging/v5' of github.com:penberg/scylla: docs: Update packaging documentation build: Add "dist-check" target scripts/testing: Add "dist-check" for package verification build: Add "dist" target reloc: Add '--builddir' option to build_deb.sh build: Add "-ffile-prefix-map" to cxxflags docs: Document sync-submodules.sh script in maintainer.md sync-submodules.sh: Add script for syncing submodules Add scylla-tools submodule Add scylla-jmx submodule	2020-06-18 12:59:52 +03:00
Dejan Mircevski	aec1acd1d5	range_test: Add cases for singular intersection Intersection was previously not tested for singular ranges. This ensures it will always work for singular ranges, too. Tests: unit(dev) Signed-off-by: Dejan Mircevski <dejan@scylladb.com>	2020-06-18 12:38:31 +03:00
Yaron Kaikov	e9d5852b0c	dbuild: Add an option to run dbuild using podman Following https://github.com/scylladb/scylla/pull/5333, we want to be able to run dbuild using podman or docker by setting enviorment variable named: DBUILD_TOOL DBUILD_TOOL will use docker by default unless we explicitly set the tool podmand Fixes: https://github.com/scylladb/scylla/pull/6644	2020-06-18 12:13:39 +03:00
Avi Kivity	9322c07c71	Merge "Use binary search in sstable promoted index" from Tomasz " The "promoted index" is how the sstable format calls the clustering key index within a given partition. Large partitions with many rows have it. It's embedded in the partition index entry. Currently, lookups in the promoted index are done by scanning the index linearly so the lookup is O(N). For large partitions that's inefficient. It consumes both a lot of CPU and I/O. We could do better and use binary search in the index. This patch series switches the mc-format index reader to do that. Other formats use the old way. The "mc" format promoted index has an extra structure at the end of the index called "offset map". It's a vector of offsets of consecutive promoted index entries. This allows us to access random entries in the index without reading the whole index. The location of the offset entry for a given promoted index entry can be derived by knowing where the offset vector ends in the index file, so the offset map also doesn't have to be read completely into the memory. The most tricky part is caching. We need to cache blocks read from the index file to amortize the cost of binary search: - if the promoted index fits in the 32 KiB which was read from the index when looking for the partition entry, we don't want to issue any additional I/O to search the promoted index. - with large promoted indexes, the last few bisections will fall into the same I/O block and we want to reuse that block. - we don't want the cache to grow too big, we don't want to cache the whole promoted index as the read progresses over the index. Scanning reads may skip multiple times. This series implements a rather simple approach which meets all the above requirements and is not worse than the current state of affairs: - Each index cursor has its own cache of the index file area which corresponds to promoted index This is managed by the cached_file class. - Each index cursor has its own cache of parsed blocks. This allows the upper bound estimation to reuse information obtained during lower bound lookup. This estimation is used to limit read-aheads in the data file. - Each cursor drops entries that it walked past so that memory footprint stays O(log N) - Cached buffers are accounted to read's reader_permit. Later, we could have a single cache shared by many readers. For that, we need to come up with eviction policy. Fixes #4007. TESTING RESULTS * Point reads, large promoted index: Config: rows: 10000000, value size: 2000 Partition size: 20 GB Index size: 7 MB Notes: - Slicing read into the middle of partition (offset=5000000, read=1) is a clear win for the binary search: time: 1.9ms vs 22.9ms CPU utilization: 8.9% vs 92.3% I/O: 21 reqs / 172 KiB vs 29 reqs / 3'520 KiB It's 12x faster, CPU utilization is 10x times smaller, disk utilization is 20x smaller. - Slicing at the front (offset=0) is a mixed bag. time is similar: 1.8ms CPU utilization is 6.7x smaller for bsearch: 8.5% vs 57.7% disk bandwidth utilization is smaller for bsearch but uses more IOs: 4 reqs / 320 KiB (scan) vs 17 reqs / 188 KiB (bsearch) bsearch uses less bandwidth because the series reduces buffer size used for index file I/O. scan is issuing: 2 * 128 KB (index page) 2 * 32 KB (data file) bsearch is issuing: 1 * 64 KB (index page) 15 * 4 KB (promoted index) 1 * 64 KB (data file) The 1 * 64 KB is chosen dynamically by seastar. Sometimes it chooses 2 * 32 KB (with read-ahead). 32 KB is the minimum I/O currently. Disk utilization could be further improved by changing the way seastar's dynamic I/O adjustments work so that it uses 1 * 4 KB when it suffices. This is left for the follow-up. Command: perf_fast_forward --datasets=large-part-ds1 \ --run-tests=large-partition-slicing-clustering-keys -c1 --test-case-duration=1 Before: offset read time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu mem 0 1 0.001836 172 1 545 9 563 175 4.0 4 320 2 2 0 1 1 0 0 0 57.7% 0 0 32 0.001858 502 32 17220 126 17776 11526 3.2 3 324 2 1 0 1 1 0 0 0 56.4% 0 0 256 0.002833 339 256 90374 427 91757 85931 7.0 7 776 3 1 0 1 1 0 0 0 41.1% 0 0 4096 0.017211 58 4096 237984 2011 241802 233870 66.1 66 8376 59 2 0 1 1 0 0 0 21.4% 0 5000000 1 0.022952 42 1 44 1 45 41 29.2 29 3520 22 2 0 1 1 0 0 0 92.3% 0 5000000 32 0.023052 43 32 1388 14 1414 1331 31.1 32 3588 26 2 0 1 1 0 0 0 91.7% 0 5000000 256 0.024795 41 256 10325 129 10721 9993 43.1 39 4544 29 2 0 1 1 0 0 0 86.4% 0 5000000 4096 0.038856 27 4096 105414 398 106918 103162 95.2 95 12160 78 5 0 1 1 0 0 0 61.4% 0 After (v2): offset read time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu mem 0 1 0.001831 248 1 546 21 581 252 17.6 17 188 2 0 0 1 1 0 0 0 8.5% 0 0 32 0.001910 535 32 16751 626 17770 13896 17.9 19 160 3 0 0 1 1 0 0 0 8.8% 0 0 256 0.003545 266 256 72207 2333 89076 62852 26.9 24 764 7 0 0 1 1 0 0 0 9.7% 0 0 4096 0.016800 56 4096 243812 524 245430 239736 83.6 83 8700 64 0 0 1 1 0 0 0 16.6% 0 5000000 1 0.001968 351 1 508 19 538 380 21.3 21 172 2 0 0 1 1 0 0 0 8.9% 0 5000000 32 0.002273 431 32 14077 436 15503 11551 22.7 22 268 3 0 0 1 1 0 0 0 8.9% 0 5000000 256 0.003889 257 256 65824 2197 81833 57813 34.0 37 652 18 0 0 1 1 0 0 0 11.2% 0 5000000 4096 0.017115 54 4096 239324 834 241310 231993 88.3 88 8844 65 0 0 1 1 0 0 0 16.8% 0 After (v1): offset read time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu mem 0 1 0.001886 259 1 530 4 545 261 18.0 18 376 2 2 0 1 1 0 0 0 9.1% 0 0 32 0.001954 513 32 16381 93 16844 15618 19.0 19 408 3 2 0 1 1 0 0 0 9.3% 0 0 256 0.003266 318 256 78393 1820 81567 61663 30.8 26 1272 7 2 0 1 1 0 0 0 10.4% 0 0 4096 0.017991 57 4096 227666 855 231915 225781 83.1 83 8888 55 5 0 1 1 0 0 0 15.5% 0 5000000 1 0.002353 232 1 425 2 432 232 23.0 23 396 2 2 0 1 1 0 0 0 8.7% 0 5000000 32 0.002573 384 32 12437 47 12571 429 25.0 25 460 4 2 0 1 1 0 0 0 8.5% 0 5000000 256 0.003994 259 256 64101 2904 67924 51427 37.0 35 1484 11 2 0 1 1 0 0 0 10.6% 0 5000000 4096 0.018567 56 4096 220609 448 227395 219029 89.8 89 9036 59 5 0 1 1 0 0 0 15.1% 0 * Point reads, small promoted index (two blocks): Config: rows: 400, value size: 200 Partition size: 84 KiB Index size: 65 B Notes: - No significant difference in time - the same disk utilization - similar CPU utilization Command: perf_fast_forward --datasets=large-part-ds1 \ --run-tests=large-partition-slicing-clustering-keys -c1 --test-case-duration=1 Before: offset read time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu mem 0 1 0.000279 470 1 3587 31 3829 478 3.0 3 68 2 1 0 1 1 0 0 0 21.1% 0 0 32 0.000276 3498 32 116038 811 122756 104033 3.0 3 68 2 1 0 1 1 0 0 0 24.0% 0 0 256 0.000412 2554 256 621044 1778 732150 559221 2.0 2 72 2 0 0 1 1 0 0 0 32.6% 0 0 4096 0.000510 1901 400 783883 4078 819058 665616 2.0 2 88 2 0 0 1 1 0 0 0 36.4% 0 200 1 0.000339 2712 1 2951 8 3001 2569 2.0 2 72 2 0 0 1 1 0 0 0 17.8% 0 200 32 0.000352 2586 32 91019 266 92427 83411 2.0 2 72 2 0 0 1 1 0 0 0 20.8% 0 200 256 0.000458 2073 200 436503 1618 453945 385501 2.0 2 88 2 0 0 1 1 0 0 0 29.4% 0 200 4096 0.000458 2097 200 436475 1676 458349 381558 2.0 2 88 2 0 0 1 1 0 0 0 29.0% 0 After (v1): Testing slicing of large partition using clustering keys: offset read time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu mem 0 1 0.000278 492 1 3598 30 3831 500 3.0 3 68 2 1 0 1 1 0 0 0 19.4% 0 0 32 0.000275 3433 32 116153 753 122915 92559 3.0 3 68 2 1 0 1 1 0 0 0 22.5% 0 0 256 0.000458 2576 256 559437 2978 728075 504375 2.1 2 88 2 0 0 1 1 0 0 0 29.0% 0 0 4096 0.000506 1888 400 790064 3306 822360 623109 2.0 2 88 2 0 0 1 1 0 0 0 36.6% 0 200 1 0.000382 2493 1 2619 10 2675 2268 2.0 2 88 2 0 0 1 1 0 0 0 16.3% 0 200 32 0.000398 2393 32 80422 333 84759 22281 2.0 2 88 2 0 0 1 1 0 0 0 19.0% 0 200 256 0.000459 2096 200 435943 1608 453989 380749 2.0 2 88 2 0 0 1 1 0 0 0 30.5% 0 200 4096 0.000458 2097 200 436410 1651 455779 382485 2.0 2 88 2 0 0 1 1 0 0 0 29.2% 0 * Scan with skips, large index: Config: rows: 10000000, value size: 2000 Partition size: 20 GB Index size: 7 MB Notes: - Similar time, slightly worse for binary search: 36.1 s (scan) vs 36.4 (bsearch) - Slightly more I/O for bsearch: 153'932 reqs / 19'703'260 KiB (scan) vs 155'651 reqs / 19'704'088 KiB (bsearch) Binary search reads more by 828 KB and by 1719 IOs. It does more I/O to read the the promoted index offset map. - similar (low) memory footprint. The danger here is that by caching index blocks which we touch as we scan we would end up caching the whole index. But this is protected against by eviction as demonstrated by the last "mem" column. Command: perf_fast_forward --datasets=large-part-ds1 \ --run-tests=large-partition-skips -c1 --test-case-duration=1 Before: read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu mem 1 1 36.103451 4 5000000 138491 38 138601 138453 153932.0 153932 19703260 153561 1 0 1 1 0 0 0 31.5% 502690 After (v2): read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu mem 1 1 37.000145 4 5000000 135135 6 135146 135128 155651.0 155651 19704088 138968 0 0 1 1 0 0 0 34.2% 0 After (v1): read skip time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk cpu mem 1 1 36.965520 4 5000000 135261 30 135311 135231 155628.0 155628 19704216 139133 1 0 1 1 0 0 0 33.9% 248738 Also in: git@github.com:tgrabiec/scylla.git sstable-use-index-offset-map-v2 Tests: - unit (all modes) - manual using perf_fast_forward " * tag 'sstable-use-index-offset-map-v2' of github.com:tgrabiec/scylla: sstables: Add promoted index cache metrics position_in_partition: Introduce external_memory_usage() cached_file, sstables: Add tracing to index binary search and page cache sstables: Dynamically adjust I/O size for index reads sstables, tests: Allow disabling binary search in promoted index from perf tests sstables: mc: Use binary search over the promoted index utils: Introduce cached_file sstables: clustered_index: Relax scope of validity of entry_info sstables: index_entry: Introduce owning promoted_index_block_position compound_compat: Allow constructing composite from a view sstables: index_entry: Rename promoted_index_block_position to promoted_index_block_position_view sstables: mc: Extract parser for promoted index block sstables: mc: Extract parser for clustering out of the promoted index block parser sstables: consumer: Extract primitive_consumer sstables: Abstract the clustering index cursor behavior sstables: index_reader: Rearrange to reduce branching and optionals	2020-06-18 12:09:39 +03:00
Pekka Enberg	4d48f22827	docs: Update packaging documentation	2020-06-18 10:20:08 +03:00
Pekka Enberg	9e279ec2a9	build: Add "dist-check" target This adds a "dist-check" target to ninja build. The target needs to be run on the host because package verification is done with Docker.	2020-06-18 10:20:08 +03:00
Pekka Enberg	584c7130a1	scripts/testing: Add "dist-check" for package verification This adds a "dist-check.sh" script in tools/testing, which performs distribution package verification by installing packages under Docker.	2020-06-18 10:16:46 +03:00
Pekka Enberg	8e1a561fba	build: Add "dist" target	2020-06-18 10:16:46 +03:00
Pekka Enberg	7b7c91a34b	reloc: Add '--builddir' option to build_deb.sh The build system will call this script. It needs control over where the packages are built to allow building packages for the different build modes.	2020-06-18 09:54:37 +03:00
Pekka Enberg	013f87f388	build: Add "-ffile-prefix-map" to cxxflags This patch adds "-ffile-prefix-map" to cxxflags for all build modes. This has two benefits: 1, Relocatable packages no longer have any special build flags, which makes deeper integration with the build system possible (e.g. targets for packages). 2 Builds are now reproducible, which makes debugging easier in case you only have a backtrace, but no artifacts. Rafael explains: "BTW, I think I found another argument for why we should always build with -ffile-prefix-map=. There was user after free test failure on next promotion. I am unable to reproduce it locally, so it would be super nice to be able to decode the backtrace. I was able to do it, but I had to create a /jenkins/workspace/scylla-master/next/ directory and build from there to get the same results as the bot." Acked-by: Botond Dénes <bdenes@scylladb.com> Acked-by: Nadav Har'El <nyh@scylladb.com> Acked-by: Rafael Avila de Espindola <espindola@scylladb.com>	2020-06-18 09:54:37 +03:00
Pekka Enberg	71da4e6e79	docs: Document sync-submodules.sh script in maintainer.md	2020-06-18 09:54:37 +03:00
Pekka Enberg	e3376472e8	sync-submodules.sh: Add script for syncing submodules	2020-06-18 09:54:37 +03:00
Pekka Enberg	d759d7567b	Add scylla-tools submodule	2020-06-18 09:54:37 +03:00
Pekka Enberg	9edf858d30	Add scylla-jmx submodule	2020-06-18 09:54:37 +03:00
Benny Halevy	5926cfc298	CMakeLists.txt: Update to C++20 Following `427398641a` Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20200618052956.570260-1-bhalevy@scylladb.com>	2020-06-18 09:51:23 +03:00
Pekka Enberg	02b733c22b	Revert "dbuild: Add an option to run with 'docker' or 'podman'" This reverts commit `ac7237f991`. The logic is wrong and always picks "podman" if it's installed on the system even if user asks for "docker" with the DBUILD_TOOL environment variable. This wreaks havoc on machines that have both docker and podman packages installed, but podman is not configured correctly.	2020-06-18 09:22:33 +03:00
Juliusz Stasiewicz	8628ede009	cdc: Fix segfault when stream ID key is too short When a token is calculated for stream_id, we check that the key is exactly 16 bytes long. If it's not - `minimum_token` is returned and client receives empty result. This used to be the expected behavior for empty keys; now it's extended to keys of any incorrect length. Fixes #6570	2020-06-17 18:19:37 +03:00
Nadav Har'El	095ddf0d41	alternator test: use ConsistentRead=True where missing All tests that write some data and then read it back need to use ConsistentRead=True, otherwise the test may sporadically fail on a multi- node cluster. In the previous patch we fixed the full_query()/full_scan() convenience functions. In this patch, I audited the calls to the boto3 read methods - get_item(), batch_get_item(), query(), scan(), and although most of them did use ConsistentRead=True as needed, I found some missing and this patch fixes them. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20200616080334.825893-1-nyh@scylladb.com>	2020-06-17 14:57:45 +02:00
Nadav Har'El	c298088375	alternator test: use ConsistentRead=True for full_query/scan Many of the Alternator tests use the convenience functions full_query()/ full_scan() to read from the table. Almost all these tests need to be able to read their own writes, i.e., want ConsistentRead=True, but none of them explicitly specified this parameter. Such tests may sporadically fail when running on cluster with multiple nodes. So this patch follows a TODO in the code, and makes ConsistentRead=True the default for the full_() functions. The caller can still override it with ConsistentRead=False - and this is necessary in the GSI tests, because ConsistentRead=True is not allowed in GSIs. Note that while ConsistentRead=True is now the default for the full_() convenience functions, but it is still not the default for the lower level boto3 functions scan(), query() and get_item() - so usages of those should be evaluated as well and missing ConsistentRead=True, if any, should be added. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20200616073821.824784-1-nyh@scylladb.com>	2020-06-17 14:57:45 +02:00
Raphael S. Carvalho	2f680b3458	size_tiered_backlog_tracker: Rename total_bytes Reader can assume total_bytes and _total_bytes have the same meaning, but they don't, so let's give the former a more descriptive name. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20200616175055.16771-1-raphaelsc@scylladb.com>	2020-06-17 13:39:30 +03:00
Avi Kivity	d2ab6a24a1	Update seastar submodule * seastar 8f0858cfd7...b515d63735 (2): > do_with: replace seastar::apply() calls with std::apply() > Merge "Resolve various http fixmes" from Piotr	2020-06-17 12:59:16 +03:00
Glauber Costa	1c70a7c54e	upload: use custom error handler for upload directory SSTables created for the upload directory should be using its custom error handler. There is one user of the custom error handler in tree, which is the current upload directory function. As we will use a free function instead of a lambda in our implementation we also use the opportunity to fix it for consistency. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2020-06-16 19:42:19 -04:00
Glauber Costa	c188aef088	sstable_directory: fix debug message I just noticed while working on the reshape patches that there is an extra format bracket in two of the debug message. As they are debug I've seen them less often than the others and that slipped. Signed-off-by: Glauber Costa <glauber@scylladb.com>	2020-06-16 19:42:19 -04:00
Nadav Har'El	ba59034402	merge: Use std::string_view in a few more apis Merged patch series by Rafael Ávila de Espíndola: The main advantage is that callers now don't have to construct sstrings. It is also a 0.09% win in text size (from 41804308 to 41766484 bytes) and the tps reported by perf_simple_query --duration 16 --smp 1 -m4G >> log 2>err in 500 randomized runs goes up by 0.16% (from 162259 to 162517). Rafael Ávila de Espíndola (3): service: Pass a std::string_view to client_state::set_keyspace cql3: Use a flat_hash_map in untyped_result_set_row cql3: Pass std::string_view to various untyped_result_set member functions cql3/untyped_result_set.hh \| 30 ++++++++++++++++-------------- service/client_state.hh \| 2 +- cql3/untyped_result_set.cc \| 6 +++--- service/client_state.cc \| 4 ++-- 4 files changed, 22 insertions(+), 20 deletions(-)	2020-06-16 20:31:36 +03:00

1 2 3 4 5 ...

22501 Commits