scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-22 17:40:34 +00:00

Author	SHA1	Message	Date
Botond Dénes	6ee0f1f3a7	Merge 'replica/table: add a metric for hypothetical total file size without compression' from Michał Chojnowski This patch adds a metric for pre-compression size of sstable files. This patch adds a per-table metric `scylla_column_family_total_disk_space_before_compression`, which measures the hypothetical total size of sstables on disk, if Data.db was replaced with an uncompressed equivalent. As for the implementation: Before the patch, tables and sstable sets are already tracking their total physical file size. Whenever sstables are added or removed, the size delta is propagated from the sstable up through sstable sets into table_stats. To implement the new metric, we turn the size delta that is getting passed around from a one-dimensional to a two-dimensional value, which includes both the physical and the pre-compression size. New functionality, no backport needed. Closes scylladb/scylladb#26996 * github.com:scylladb/scylladb: replica/table: add a metric for hypothetical total file size without compression replica/table: keep track of total pre-compression file size	2025-11-20 09:10:38 +02:00
Michał Chojnowski	1cfce430f1	replica/table: keep track of total pre-compression file size Every table and sstable set keeps track of the total file size of the sstables it contains. Due to a feature request, we also want to keep track of the hypothetical file size if Data files were uncompressed, to add a metric that shows the compression ratio of sstables. We achieve this by replacing the relevant `uint64_t bytes_on_disk` counters everywhere with a struct that contains both the actual (post-compression) size and the hypothetical pre-compression size. This patch isn't supposed to change any observable behavior. In the next patch, we will use these changes to add a new metric.	2025-11-13 00:49:57 +01:00
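The two-dimensional size delta described in this commit can be sketched roughly as follows. This is only an illustration under assumptions, not the code from the patch: the names `file_size_stats`, `on_disk`, `before_compression` and `table_size_tracker` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical two-dimensional size value: the physical size on disk and
// the size the same files would have if Data.db were uncompressed.
struct file_size_stats {
    uint64_t on_disk = 0;
    uint64_t before_compression = 0;

    file_size_stats& operator+=(const file_size_stats& o) {
        on_disk += o.on_disk;
        before_compression += o.before_compression;
        return *this;
    }
    file_size_stats& operator-=(const file_size_stats& o) {
        on_disk -= o.on_disk;
        before_compression -= o.before_compression;
        return *this;
    }
};

// Hypothetical stand-in for the per-table statistics: the same delta is
// applied to both dimensions whenever an sstable is added or removed.
struct table_size_tracker {
    file_size_stats total;

    void on_sstable_added(const file_size_stats& s) { total += s; }
    void on_sstable_removed(const file_size_stats& s) { total -= s; }

    // Ratio of physical size to pre-compression size (1.0 when empty).
    double compression_ratio() const {
        return total.before_compression == 0
            ? 1.0
            : static_cast<double>(total.on_disk) / static_cast<double>(total.before_compression);
    }
};
```

Keeping both numbers in one value means every existing code path that propagates a size delta carries the new dimension along without extra call sites.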
Taras Veretilnyk	deb8e32e86	sstables: Add integrity option to create_single_key_sstable_reader Added an sstables::integrity_check parameter to create_single_key_sstable_reader methods across its implementations. This allows callers to enable SSTable integrity checks during single-key reads.	2025-10-28 19:27:35 +01:00
Avi Kivity	4d9271df98	Merge 'sstables: introduce sstable version `ms`' from Michał Chojnowski This is yet another part in the BTI index project. Overarching issue: https://github.com/scylladb/scylladb/issues/19191 Previous part: https://github.com/scylladb/scylladb/pull/25626 Next parts: make `ms` the default. Then, general tweaks and improvements. Later, potentially a full `da` format implementation. This patch series introduces a new, Scylla-only sstable format version `ms`, which is like `me`, but with the index components (Summary.db and Index.db) replaced with BTI index components (Partitions.db and Rows.db), as they are in Cassandra 5.0's `da` format version. (Eventually we want to just implement `da`, but there are several other changes (unrelated to the index files) between `me` and `da`. By adding this `ms` as an intermediate step we can adapt the new index formats without dragging all the other changes into the mix (and raising the risk of regressions, which is already high)). The high-level structure of the PR is: 1. Introduce new component types — `Partitions` and `Rows`. 2. Teach `class sstable` to open them when they exist. 3. Teach the sstable writer how to write index data to them. 4. Teach `class sstable` and unit tests how to deal with sstables that have no `Index` or `Summary` (but have `Partitions` and `Rows` instead). 5. Introduce the new sstable version `ms`, specify that it has `Partitions` and `Rows` instead of `Index` and `Summary`. 6. Prepare unit tests for the appearance of `ms`. 7. Enable `ms` in unit tests. 8. Make `ms` enablable via db::config (with a silent fall back to `me` until the new `MS_SSTABLE_FORMAT` cluster feature is enabled). 9. Prepare integration tests for the appearance of `ms`. 10. Enable both `ms` and `me` in tests where we want both versions to be tested. This series doesn't make `ms` the default yet, because that requires teaching Scylla Manager and a few dtests about the new format first. 
It can be enabled by setting `sstable_format: ms` in the config. Per a review request, here is an example from `perf_fast_forward`, demonstrating some motivation for a new format. (Although not the main one. The main motivations are getting rid of restrictions on the RAM:disk ratio, and index read throughput for datasets with tiny partitions). The dataset was populated with `build/release/scylla perf-fast-forward --smp=1 --sstable-format=$VERSION --data-directory=data.$VERSION --column-index-size-in-kb=1 --populate --random-seed=0`. This test involves a partition with 1000000 clustering rows (with 32-bit keys and 100-byte values) and ~500 index blocks, and queries a few particular rows from the partition. Since the branching factor for the BIG promoted index is 2 (it's a binary search), the lookup involves ~11.2 sequential page reads per row. The BTI format has a more reasonable branching factor, so it involves ~2.3 page reads per row. `build/release/scylla perf-fast-forward --smp=1 --data-directory=perf_fast_forward_data/me --run-tests=large-partition-select-few-rows`: ``` offset stride rows iterations avg aio aio (KiB) 500000 1 1 70 18.0 18 128 500001 1 1 647 19.0 19 132 0 1000000 1 748 15.0 15 116 0 500000 2 372 29.0 29 284 0 250000 4 227 56.0 56 504 0 125000 8 116 106.0 106 928 0 62500 16 67 195.0 195 1732 ``` `build/release/scylla perf-fast-forward --smp=1 --data-directory=perf_fast_forward_data/ms --run-tests=large-partition-select-few-rows`: ``` offset stride rows iterations avg aio aio (KiB) 500000 1 1 51 5.1 5 20 500001 1 1 64 5.3 5 20 0 1000000 1 679 4.0 4 16 0 500000 2 492 8.0 8 88 0 250000 4 804 16.0 16 232 0 125000 8 409 31.0 31 516 0 62500 16 97 54.0 54 1056 ``` Index file size comparison for the default `perf_fast_forward` tables with `--random-seed=0`: Large partition table (dominated by intra-partition index): 2.4 MB with `me`, 732 kB with `ms`. For the small partitions table (dominated by inter-partition index): 11 MB with `me`, 8.4 MB with `ms`. 
External tests: I ran SCT test `longevity-mv-si-4days-streaming-test` test on 6 nodes with 30 shards each for 8 hours. No anomalies were observed. New functionality, no backport needed. Closes scylladb/scylladb#26215 * github.com:scylladb/scylladb: test/boost/bloom_filter_test: add test_rebuild_from_temporary_hashes test/cluster: add test_bti_index.py test: prepare bypass_cache_test.py for `ms` sstables sstables/trie/bti_index_reader: add a failure injection in advance_lower_and_check_if_present test/cqlpy/test_sstable_validation.py: prepare the test for `ms` sstables tools/scylla-sstable: add `--sstable-version=?` to `scylla sstable write` db/config: expose "ms" format to the users via database config test: in Python tests, prepare some sstable filename regexes for `ms` sstables: add `ms` to `all_sstable_versions` test/boost/sstable_3_x_test: add `ms` sstables to multi-version tests test/lib/index_reader_assertions: skip some row index checks for BTI indexes test/boost/sstable_inexact_index_test: explicitly use a `me` sstable test/boost/sstable_datafile_test: skip test_broken_promoted_index_is_skipped for `ms` sstables test/resource: add `ms` sample sstable files for relevant tests test/boost/sstable_compaction_test: prepare for `ms` sstables. test/boost/index_reader_test: prepare for `ms` sstables test/boost/bloom_filter_tests: prepare for `ms` sstables test/boost/sstable_datafile_test: prepare for `ms` sstables test/boost/sstable_test: prepare for `ms` sstables. 
sstables: introduce `ms` sstable format version tools/scylla-sstable: default to "preferred" sstable version, not "highest" sstables/mx/reader: use the same hashed_key for the bloom filter and the index reader sstables/trie/bti_index_reader: allow the caller to passing a precalculated murmur hash sstables/trie/bti_partition_index_writer: in add(), get the key hash from the caller sstables/mx: make Index and Summary components optional sstables: open Partitions.db early when it's needed to populate key range for sharding metadata sstables: adapt sstable::set_first_and_last_keys to sstables without Summary sstables: implement an alternative way to rebuild bloom filters for sstables without Index utils/bloom_filter: add `add(const hashed_key&)` sstables: adapt estimated_keys_for_range to sstables without Summary sstables: make `sstable::estimated_keys_for_range` asynchronous sstables/sstable: compute get_estimated_key_count() from Statistics instead of Summary replica/database: add table::estimated_partitions_in_range() sstables/mx: implement sstable::has_partition_key using a regular read sstables: use BTI index for queries, when present and enabled sstables/mx/writer: populate BTI index files sstables: create and open BTI index files, when enabled sstables: introduce Partition and Rows component types sstables/mx/writer: make `_pi_write_m.partition_tombstone` a `sstables::deletion_time`	2025-09-30 09:40:02 +03:00
Michał Chojnowski	4ca215abbc	sstables/mx/reader: use the same hashed_key for the bloom filter and the index reader Partitions.db uses a piece of the murmur hash of the partition key internally. The same hash is used to query the bloom filter. So to avoid computing the hash twice (which involves converting the key into a hashable linearized form) it would make sense to use the same `hashed_key` for both purposes. This is what we do in this patch. We extract the computation of the `hashed_key` from `make_pk_filter` up to its parent `sstable_set_impl::create_single_key_sstable_reader`, and we pass this hash down both to `make_pk_filter` and to the sstable reader. (And we add a pointer to the `hashed_key` as a parameter to all functions along the way, to propagate it). The number of parameters to `mx::make_reader` is getting uncomfortable. Maybe they should be packed into some structs.	2025-09-29 13:01:22 +02:00
Botond Dénes	6ba1d686e6	sstables,compaction: move make_sstable_set() implementations to compactions/ Various compaction strategies still have their respective make_sstable_set() implementation in sstables/sstable_set.cc. Move them to the appropriate .cc files in compaction/, making the compaction module more self contained.	2025-09-29 06:49:14 +03:00
Botond Dénes	86ed627fc4	compaction: move code to namespace compaction The namespace usage in this directory is very inconsistent, with files and classes scattered across: * global namespace * namespace compaction * namespace sstables There are cases where all three are used in the same file. This code used to live in sstables/ and some of it still retains namespace sstables as a heritage of that time. The mismatch between the dir (future module) and the namespace used is confusing, so finish the migration and move all code in compaction/ to namespace compaction too. This patch, although large, is mechanical and only the following kinds of changes are made: * replace namespace sstables {} with namespace compaction {} * add namespace compaction {} * drop/add sstables:: * drop/add compaction:: * move around forward-declarations so they are in the correct namespace context This refactoring revealed some awkward leftover coupling between sstables and compaction, in sstables/sstable_set.cc, where the make_sstable_set() methods of compaction strategies are implemented.	2025-09-25 15:03:56 +03:00
Raphael S. Carvalho	2c4a9ba70c	treewide: Rename table_state to compaction_group_view Since table_state is a view to a compaction group, it makes sense to rename it as so. With upcoming incremental repair, each replica::compaction_group will be actually two compaction groups, so there will be two views for each replica::compaction_group. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-08-08 06:51:28 +03:00
Raphael S. Carvalho	53df911145	replica: Fix range reads spanning sibling tablets We don't guarantee that coordinators will only emit range reads that span a single tablet. Consider this scenario: 1) split is about to be finalized, barrier is executed, completes. 2) coordinator starts a read, uses pre-split erm (split not committed to group0 yet) 3) split is committed to group0, all replicas switch storage. 4) replica-side read is executed, uses a range which spans tablets. We could fix it with two-phase split execution. Rather than pushing the complexity to higher levels, let's fix the incremental selector, which should be able to serve all the tokens owned by a given shard. During split execution, neither of the sibling tablets is going anywhere, since the split runs with the state machine locked, so a single read spanning both sibling tablets works as long as the selector works across tablet boundaries. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-05-27 22:39:40 -03:00
Raphael S. Carvalho	628bec4dbd	sstables: Implement sstable_set_impl::all_sstable_runs() With the upcoming change where table::set_compaction_strategy() might delay the update of the sstable set, ICS might temporarily work with sstable set implementations other than partitioned_sstable_set. ICS relies on all_sstable_runs() during regular compaction, and today it triggers a bad_function_call exception if not overridden by the set implementation. To remove this strong dependency between the compaction strategy and a particular set implementation, let's provide a default implementation of all_sstable_runs(), such that ICS will still work until the set is eventually updated through a process that adds or removes an sstable. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-05-06 10:06:06 -03:00
Raphael S. Carvalho	d5bee4c814	test: Verify partitioned set store split and unsplit correctly Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-04-29 15:47:33 -03:00
Raphael S. Carvalho	c77f710a0c	sstables: Fix quadratic space complexity in partitioned_sstable_set The interval map is very susceptible to quadratic space behavior when it's flooded with many entries overlapping all (or most of) the intervals, since each such entry will be present in all intervals it overlaps with. A trigger we observed was a memtable flush storm, which creates many small "L0" sstables that span roughly the entire token range. Since we cannot rely on insertion order, the solution is to store sstables with such wide ranges in a vector (unleveled). There should be no consequence for single-key reads, since the upper layer applies additional filtering based on the token of the key being queried. For range scans, there can be an increase in memory usage, but not a significant one, because the sstables span a wide range and would have been selected in the combined reader anyway if the range of the scan overlaps with them. In any case, this is a protection against a storm of memtable flushes and shouldn't be the common scenario. It works both with tablets and vnodes, by adjusting the token range spanned by the compaction group accordingly. Fixes #23634. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-04-29 15:47:33 -03:00
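The space blow-up described in this commit can be illustrated with a toy model. The sketch below is not the `partitioned_sstable_set` code: it models the interval map as a list of disjoint intervals, counts one stored reference per (sstable, overlapped interval) pair, and uses a hypothetical `max_overlap` threshold to decide which sstables count as "wide" and go to the unleveled vector instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Inclusive token range; a stand-in for the real token intervals.
struct token_range {
    int64_t first;
    int64_t last;
};

// Number of intervals a given sstable range overlaps.
std::size_t overlap_count(const std::vector<token_range>& intervals, const token_range& s) {
    std::size_t n = 0;
    for (const auto& i : intervals) {
        if (s.first <= i.last && i.first <= s.last) {
            ++n;
        }
    }
    return n;
}

// All sstables go into the interval map: an sstable overlapping k intervals
// is referenced k times, so N wide sstables over M intervals cost N * M.
std::size_t refs_in_interval_map(const std::vector<token_range>& intervals,
                                 const std::vector<token_range>& sstables) {
    std::size_t refs = 0;
    for (const auto& s : sstables) {
        refs += overlap_count(intervals, s);
    }
    return refs;
}

// Wide sstables (more than max_overlap intervals, a hypothetical criterion)
// are stored once in a plain vector; the rest stay in the interval map.
std::size_t refs_with_unleveled_vector(const std::vector<token_range>& intervals,
                                       const std::vector<token_range>& sstables,
                                       std::size_t max_overlap) {
    std::size_t refs = 0;
    for (const auto& s : sstables) {
        const std::size_t k = overlap_count(intervals, s);
        refs += (k > max_overlap) ? 1 : k;
    }
    return refs;
}
```

In this model, wide sstables cost one reference each regardless of how many intervals exist, which is the property the commit relies on.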
Raphael S. Carvalho	21d1e78457	compaction: Wire table_state into make_sstable_set() This will be useful for feeding token range owned by compaction group into sstable set. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-04-29 15:47:33 -03:00
Botond Dénes	c29c696780	readers: mv from_mutations_v2.hh from_mutations.hh Completely mechanical change.	2025-04-16 04:46:08 -04:00
Botond Dénes	b104862702	tree: s/make_mutation_reader_from_mutations_v2/make_mutation_reader_from_mutations/s Completely mechanical change.	2025-04-16 04:46:07 -04:00
Botond Dénes	a9d75c4f9d	readers: mv empty_v2.hh empty.hh Completely mechanical change.	2025-04-16 04:32:56 -04:00
Botond Dénes	05829f98f3	tree: s/make_empty_flat_reader_v2/make_empty_mutation_reader/ Completely mechanical change.	2025-04-16 04:32:56 -04:00
Avi Kivity	ac3d25eb44	sstable_set: incremental_reader_selector: be more careful when filtering out already engaged sstables The incremental reader selector maintains an unordered_set of sstables that are already engaged, and uses std::views::filter to filter those out. It adds the sstable under consideration to the set, and if addition failed (because it's already in) then it filters it out. This breaks if the filter view is executed twice - the first pass will add every sstable to the set, and the second will consider every sstable already filtered. This is what happens with libstdc++ 15 (due to the addition of vector(from_range_t) constructor), which uses the first pass to calculate the vector size and the second pass to insert the elements into a correctly-sized vector. Fix by open-coding the loop. Closes scylladb/scylladb#23597	2025-04-07 12:49:04 +03:00
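The open-coded loop mentioned in this commit can be sketched as below. It is a minimal illustration with integer ids standing in for sstables; `select_not_engaged` is a hypothetical name, not the function in `incremental_reader_selector`.

```cpp
#include <cassert>
#include <unordered_set>
#include <vector>

// Fragile pattern (shown only as a comment): a filter predicate with a
// side effect, e.g.
//   candidates | std::views::filter([&](int id) { return engaged.insert(id).second; })
// gives different answers if the view is traversed more than once, because
// the first traversal already inserted every id into `engaged`.

// Open-coded alternative: each candidate is examined exactly once, so the
// side effect on `engaged` and the selection decision cannot diverge.
std::vector<int> select_not_engaged(const std::vector<int>& candidates,
                                    std::unordered_set<int>& engaged) {
    std::vector<int> selected;
    selected.reserve(candidates.size());
    for (int id : candidates) {
        // insert() reports whether the id was newly added.
        if (engaged.insert(id).second) {
            selected.push_back(id);
        }
    }
    return selected;
}
```

Calling the function a second time with the same candidates returns an empty vector, which is the intended "already engaged" behavior made explicit rather than an accident of how a view is consumed.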
Kefu Chai	9e0e99347f	sstables: explicitly call parent's default constructor in copy constructor When implementing the copy constructor for `sstable_set` (derived from `enable_lw_shared_from_this`), we intentionally need the parent's default constructor rather than its copy constructor. This is because each new `sstable_set` instance maintains its own reference count and owns a clone of the source object's implementation (`x._impl->clone()`). Although this behavior is correct, GCC warns about not calling the parent's copy constructor. This change explicitly calls the parent's default constructor to: 1. Silence GCC warnings 2. Clearly document our intention to use the default constructor 3. Follow best practices for constructor initialization The functionality remains unchanged, but the code is now more explicit about its design and free of compiler warnings. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23083	2025-02-28 13:52:24 +03:00
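The constructor pattern described in this commit can be sketched as follows. The classes are hypothetical stand-ins: `ref_counted_base` plays the role of `enable_lw_shared_from_this`, and `set_impl` and `sketch_set` are simplified placeholders for the real implementation and `sstable_set`.

```cpp
#include <cassert>
#include <memory>

// Hypothetical base that carries a per-object reference count.
class ref_counted_base {
    long _count = 0;
public:
    ref_counted_base() = default;
    // Copying the base would also copy the count, which is not wanted here.
    ref_counted_base(const ref_counted_base&) = default;
    void add_ref() { ++_count; }
    long use_count() const { return _count; }
};

// Hypothetical implementation object that knows how to clone itself.
struct set_impl {
    int value;
    std::unique_ptr<set_impl> clone() const { return std::make_unique<set_impl>(*this); }
};

class sketch_set : public ref_counted_base {
    std::unique_ptr<set_impl> _impl;
public:
    explicit sketch_set(int v) : _impl(std::make_unique<set_impl>(set_impl{v})) {}

    // The copy starts with a fresh reference count (explicit call to the
    // base's default constructor) and owns a clone of the source's impl.
    sketch_set(const sketch_set& x)
        : ref_counted_base()
        , _impl(x._impl->clone()) {}

    set_impl& impl() { return *_impl; }
};
```

Spelling out `ref_counted_base()` in the initializer list documents the intent and avoids the compiler warning about a base class not being initialized by the copy constructor.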
Kefu Chai	9fdbe0e74b	tree: Remove unused boost headers This commit eliminates unused boost header includes from the tree. Removing these unnecessary includes reduces dependencies on the external Boost.Adapters library, leading to faster compile times and a slightly cleaner codebase. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22997	2025-02-25 10:32:32 +03:00
Avi Kivity	5adaf0a605	Merge 'tree: migrate from boost::remove_if() to the standard library based alternatives' from Kefu Chai Replace boost::remove_if() with the standard library's std::erase_if() or std::ranges::remove_if() to reduce external dependencies and simplify the codebase. This change eliminates the requirement for boost::range and makes the implementation more maintainable. --- it's a cleanup, hence no need to backport. Closes scylladb/scylladb#22788 * github.com:scylladb/scylladb: service: migrate from boost::range::remove_if() to std::ranges::remove_if sstable: migrate from boost::remove_if() to std::erase_if()	2025-02-11 14:07:48 +02:00
Kefu Chai	481397317d	sstables, test: migrate from boost::copy() to std::ranges::copy() Replace boost::copy() with the standard library's std::ranges::copy() to reduce external dependencies and simplify the codebase. This change eliminates the requirement for boost::range and makes the implementation more maintainable. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22789	2025-02-11 14:55:25 +03:00
Kefu Chai	ba724a26f4	sstable: migrate from boost::remove_if() to std::erase_if() Replace boost::remove_if() with the standard library's std::erase_if() to reduce external dependencies and simplify the codebase. This change eliminates the requirement for boost::range and makes the implementation more maintainable. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2025-02-11 09:15:14 +08:00
Kefu Chai	6acc5294a4	treewide: migrate from boost::copy_range to std::ranges::to now that we are allowed to use C++23. we now have the luxury of using `std::ranges::to`. in this change, we: - replace `boost::copy_range` to `std::ranges::to` - remove unused `#include` of boost headers Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#21880	2024-12-26 11:46:26 +02:00
Avi Kivity	f3eade2f62	treewide: relicense to ScyllaDB-Source-Available-1.0 Drop the AGPL license in favor of a source-available license. See the blog post [1] for details. [1] https://www.scylladb.com/2024/12/18/why-were-moving-to-a-source-available-license/	2024-12-18 17:45:13 +02:00
Kefu Chai	ce2f80c227	treewide: migrate from boost::make_iterator_range to ranges::subrange Replace boost::make_iterator_range() with std::ranges::subrange. This change improves code modernization and reduces external dependencies: - Replace boost::make_iterator_range() with std::ranges::subrange - Remove boost/range/iterator_range.hpp include - Improve iterator type detection in interval.hh using std::ranges::const_iterator_t<Range> This is part of ongoing efforts to modernize our codebase and minimize external dependencies. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#21787	2024-12-09 21:31:53 +02:00
Avi Kivity	9024e4940c	counters.hh: drop unused boost includes Re-add them to source files that need them. Closes scylladb/scylladb#21738	2024-12-05 12:27:41 +02:00
Kefu Chai	bab12e3a98	treewide: migrate from boost::adaptors::transformed to std::views::transform now that we are allowed to use C++23. we now have the luxury of using `std::views::transform`. in this change, we: - replace `boost::adaptors::transformed` with `std::views::transform` - use `fmt::join()` when appropriate where `boost::algorithm::join()` is not applicable to a range view returned by `std::views::transform`. - use `std::ranges::fold_left()` to accumulate the range returned by `std::views::transform` - use `std::ranges::fold_left()` to get the maximum element in the range returned by `std::views::transform` - use `std::ranges::min()` to get the minimal element in the range returned by `std::views::transform` - use `std::ranges::equal()` to compare the range views returned by `std::views::transform` - remove unused `#include <boost/range/adaptor/transformed.hpp>` - use `std::ranges::subrange()` instead of `boost::make_iterator_range()`, to feed `std::views::transform()` a view range. to reduce the dependency to boost for better maintainability, and leverage standard library features for better long-term support. this change is part of our ongoing effort to modernize our codebase and reduce external dependencies where possible. limitations: there are still a couple places where we are still using `boost::adaptors::transformed` due to the lack of a C++23 alternative for `boost::join()` and `boost::adaptors::uniqued`. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#21700	2024-12-03 09:41:32 +02:00
Kefu Chai	a5ee0c896b	treewide: migrate from boost::adaptors::filtered to std::views::filter Modernize the codebase by replacing Boost range adaptors with C++23 standard library views, reducing external dependencies and leveraging modern C++ language features. Key Changes: - Replace `boost::adaptors::filtered` with `std::views::filter` - Remove `#include <boost/range/adaptor/filtered.hpp>` - Utilize standard library range views Motivation: - Reduce project's external dependency footprint - Leverage standard library's range and view capabilities - Improve long-term code maintainability - Align with modern C++ best practices Implementation Challenges and Considerations: 1. Range Conversion and Move Semantics - `std::ranges::to` adaptor requires rvalue references - Necessitated updates to variable and parameter constness - Example: `cql3/restrictions/statement_restrictions.cc` modified to remove `const` from `common` to enable efficient range conversion 2. Range Iteration and Mutation - Range views may mutate internal state during iteration - Cannot pass ranges by const reference in some scenarios - Solution: Pass ranges by rvalue reference to explicitly indicate state invalidation Limitations: - One instance of `boost::adaptors::filtered` temporarily preserved due to lack of a C++23 alternative for `boost::join()` - A comprehensive replacement will be addressed in a follow-up change This change is part of our ongoing effort to modernize the codebase, reducing external dependencies and adopting modern C++ practices. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#21648	2024-11-26 14:26:50 +02:00
Nikos Dragazis	70dd124a95	sstables: Add integrity option to factories for sstable_set readers Expose the integrity option of the sstable reader factories to the corresponding sstable_set factories, namely: * `sstable_set::make_local_shard_sstable_reader()` * `sstable_set::make_full_scan_reader()` * `sstable_set::make_range_sstable_reader()` This is needed to support integrity checking in compaction. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2024-11-11 20:42:46 +02:00
Avi Kivity	907da210b6	compound_compat: replace use of boost ranges with std ranges To reduce the dependency load, replace use of boost ranges with the std equivalent. Files that lost the indirect boost dependency have it added as a direct dependency.	2024-10-30 19:58:07 +02:00
Kefu Chai	24d14b601b	treewide: s/boost::adaptors::map_values/std::views::values/ now that we are allowed to use C++23. we now have the luxury of using `std::views::values`. in this change, we: - replace `boost::adaptors::map_values` with `std::views::values` - update affected code to work with `std::views::values` - the places where we use `boost::join()` are not changed, because we cannot use `std::views::concat` yet. this helper is only available in C++26. to reduce the dependency to boost for better maintainability, and leverage standard library features for better long-term support. this change is part of our ongoing effort to modernize our codebase and reduce external dependencies where possible. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#21265	2024-10-27 21:32:45 +02:00
Łukasz Paszkowski	484655bf0d	sstable_set: Add optional statistics to make_local_shard_sstable_reader The pointer to combined_reader_statistics is propagated down to make_combined_reader in order to collect statistics. By default, a null pointer is propagated. Note that in case the pointer is valid and the sstable_set consists of exactly one sstable, statistics are skipped as all rows originate from exactly a single sstable file. The existing optimization is crucial `f75154afca`	2024-10-22 08:15:02 +02:00
Łukasz Paszkowski	84912c3155	reader_selector: Extend with maximum reader count The maximum reader count makes it possible to predict the number of readers that can be created with create_new_readers(). This helps to correctly size the vector in the rows_merged statistics when a combined reader is created via make_combined_reader.	2024-10-22 08:15:02 +02:00
Pavel Emelyanov	77eb9ddb0f	sstable_set: Reserve vector of readers When generating readers for the set of sstables, the end size of this vector is known in advance and its storage can be reserved. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#21055	2024-10-11 09:56:17 +03:00
Benny Halevy	5a0f3889e0	treewide: use std::ranges sort functions rather than boost Using the standard library is preferred over boost. In cql3/expr/expression.cc, to_sorted_vector got more of a face-lift and was modernized to also use std::unique and, while at it, to move its input range into the uniquely sorted result vector. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-10-01 14:19:05 +03:00
Kefu Chai	df7f332a58	sstable: s/crawling_sstable_mutation_reader/sstable_full_scan_reader "crawling" is a little bit obscure in this context. so let's rename this class to reflect the fact that this reader only reads the entire content of the sstable. both crawling reader for kl and mx formats are renamed. also, in order to be consistent, all "crawling reader" in variable names are updated as well. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-09-17 10:39:37 +08:00
Avi Kivity	aa1270a00c	treewide: change assert() to SCYLLA_ASSERT() assert() is traditionally disabled in release builds, but not in scylladb. This hasn't caused problems so far, but the latest abseil release includes a commit [1] that causes a 1000 insn/op regression when NDEBUG is not defined. Clearly, we must move towards a build system where NDEBUG is defined in release builds. But we can't just define it blindly without vetting all the assert() calls, as some were written with the expectation that they are enabled in release mode. To solve the conundrum, change all assert() calls to a new SCYLLA_ASSERT() macro in utils/assert.hh. This macro is always defined and is not conditional on NDEBUG, so we can later (after vetting Seastar) enable NDEBUG in release mode. [1] `66ef711d68` Closes scylladb/scylladb#20006	2024-08-05 08:23:35 +03:00
Avi Kivity	fdc1449392	treewide: rename flat_mutation_reader_v2 to mutation_reader flat_mutation_reader_v2 was introduced in a pair of commits in 2021: `e3309322c3` "Clone flat_mutation_reader related classes into v2 variants" `08b5773c12` "Adapt flat_mutation_reader_v2 to the new version of the API" as a replacement for flat_mutation_reader, using range_tombstone_change instead of range_tombstone to represent range tombstones. See those commits for more information. The transition was incremental; the last use of the original flat_mutation_reader was removed in 2022 in commit `026f8cc1e7` "db: Use mutation_partition_v2 in mvcc" In turn, flat_mutation_reader was introduced in 2017 in commit `748205ca75` "Introduce flat_mutation_reader" To transition from a mutation_reader that nested rows within a partition in a separate stream, to a flat reader that streamed partitions and rows in the same stream. Here, we reclaim the original name and rename the awkward flat_mutation_reader_v2 to mutation_reader. Note that mutation_fragment_v2 remains since we still use the original for compatibility, sometimes. Some notes about the transition: - files were also renamed. In one case (flat_mutation_reader_test.cc), the rename target already existed, so we rename to mutation_reader_another_test.cc. - a namespace 'mutation_reader' with two definitions existed (in mutation_reader_fwd.hh). Its contents were folded into the mutation_reader class. As a result, a few #includes had to be adjusted. Closes scylladb/scylladb#19356	2024-06-21 07:12:06 +03:00
Raphael S. Carvalho	0b2ec3063c	sstables: Fix incremental_reader_selector (for range reads) with tablets incremental_reader_selector is the mechanism for incremental consumption of disjoint sstables on range reads. tablet_sstable_set was implemented, such that selector is efficient with tablets. The problem is selector is vnode addicted and will only consider a given set exhausted when maximum token is reached. With tablets, that means a range read on first tablet of a given shard will also consume other tablets living in the same shard. That results in combined reader having to work with empty sstable readers of tablets that don't intersect with the range of the read. It won't cause extra I/O because the underlying sstables don't intersect with the range of the read. It's only unnecessary CPU work, as it involves creating readers (= allocation), feeding them into combined reader, which will in turn invoke the sstable readers only to realize they don't have any data for that range. With 100k tablets (ranges), and 100 tablets per shard, and ~5 sstables per tablet, there will be this amount of readers (empty or not): (100k * ((100^2 + 100) / 2) * avg_sstable_per_tablet=5) = ~2.5 billion. ~5000 times more readers, it can be quite significant additional cpu work, even though I/O dominates the most in scans. It's an inefficiency that we would rather get rid of. The behavior can be observed from logs (there's 1 sstable for each of 4 tablets, but note how readers are created for every single one of them when reading only 1 tablet range): ``` table - make_reader_v2 - range=(-inf, {-4611686018427387905, end}] incremental_reader_selector - create_new_readers(null): selecting on pos {minimum token, w=-1} sstable - make_reader - reader on (-inf, {-4611686018427387905, end}] for sst 3gfx_..._34qn42... 
that has range [{-9151620220812943033, start},{-4813568684827439727, end}] incremental_reader_selector - create_new_readers(null): selecting on pos {-4611686018427387904, w=-1} sstable - make_reader - reader on (-inf, {-4611686018427387905, end}] for sst 3gfx_..._368nk2... that has range [{-4599560452460784857, start},{-78043747517466964, end}] incremental_reader_selector - create_new_readers(null): selecting on pos {0, w=-1} sstable - make_reader - reader on (-inf, {-4611686018427387905, end}] for sst 3gfx_..._38lj42... that has range [{851021166589397842, start},{3516631334339266977, end}] incremental_reader_selector - create_new_readers(null): selecting on pos {4611686018427387904, w=-1} sstable - make_reader - reader on (-inf, {-4611686018427387905, end}] for sst 3gfx_..._3dba82... that has range [{5065088566032249228, start},{9215673076482556375, end}] ``` Fix is about making sure the tablet set won't select past the supplied range of the read. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#18556	2024-05-14 07:43:22 +03:00
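The reader-count estimate quoted in commit 0b2ec3063c can be checked with a few lines of C++. This is only a sketch of the arithmetic as written in the message. The function names are hypothetical and not part of ScyllaDB, and the "ideal" baseline (each tablet range read creating readers only for its own sstables) is an assumption used to reproduce the "~5000 times" figure.

```cpp
#include <cassert>
#include <cstdint>

// Formula as quoted in the commit message:
//   total_tablets * ((tablets_per_shard^2 + tablets_per_shard) / 2) * sstables_per_tablet
// Hypothetical helper, not a ScyllaDB function.
constexpr std::uint64_t readers_before_fix(std::uint64_t total_tablets,
                                           std::uint64_t tablets_per_shard,
                                           std::uint64_t sstables_per_tablet) {
    return total_tablets
         * ((tablets_per_shard * tablets_per_shard + tablets_per_shard) / 2)
         * sstables_per_tablet;
}

// Assumed baseline: every tablet range read only creates readers
// for the sstables of that one tablet.
constexpr std::uint64_t readers_ideal(std::uint64_t total_tablets,
                                      std::uint64_t sstables_per_tablet) {
    return total_tablets * sstables_per_tablet;
}

// 100k tablets, 100 tablets per shard, 5 sstables per tablet.
static_assert(readers_before_fix(100000, 100, 5) == 2525000000ULL,
              "matches the ~2.5 billion quoted in the commit message");
```

With the same inputs, `readers_before_fix(100000, 100, 5) / readers_ideal(100000, 5)` evaluates to 5050, which is where the "~5000 times" figure comes from under the assumed baseline.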
Raphael S. Carvalho	d5a5005afa	sstables: Fix clone semantics for runs in partitioned_sstable_set When a sstable set is cloned, we don't want a change in cloned set propagating to the former one. It happens today with partitioned_sstable_set::_all_runs, because sets are sharing ownership of runs, which is wrong. Let's not violate clone semantics by copying all_runs when cloning. Doesn't affect data correctness as readers work directly with sstables, which are properly cloned. Can result in a crash in ICS when it is estimating pending tasks, but should be very rare in practice. Fixes #17878. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#17879	2024-03-20 08:41:32 +02:00
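The clone problem described in commit d5a5005afa can be illustrated with a minimal C++ sketch. The `run` and `run_set` types below are hypothetical stand-ins, not the real `sstable_run` or `partitioned_sstable_set`; the sketch only contrasts copying the pointers (shared ownership) with copying the runs themselves.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical stand-in for a run of sstables; not ScyllaDB's sstable_run.
struct run {
    std::vector<int> sstable_ids;
};

// Hypothetical stand-in for a set that tracks all of its runs.
struct run_set {
    std::vector<std::shared_ptr<run>> all_runs;

    // Shallow clone: copies the pointers, so both sets share the same runs.
    // A change made through the clone is visible in the original.
    run_set clone_sharing_runs() const {
        return run_set{all_runs};
    }

    // Deep clone: copies every run, so the clone owns independent runs.
    run_set clone_copying_runs() const {
        run_set out;
        out.all_runs.reserve(all_runs.size());
        for (const auto& r : all_runs) {
            out.all_runs.push_back(std::make_shared<run>(*r));
        }
        return out;
    }
};
```

Mutating a run through the result of `clone_sharing_runs()` also changes the original set, which is the behavior the commit calls a violation of clone semantics; with `clone_copying_runs()` the original is unaffected.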
Kefu Chai	09a688d325	sstables: do not use lambda when not necessary before this change, we always reference the return value of `make_reader()`, and the return value's type `flat_mutation_reader_v2` is movable, so we can just pass it by moving away from it. in this change, instead of using a lambda, let's just have the return value of it. simpler this way. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#16835	2024-01-18 15:54:49 +02:00
Raphael S. Carvalho	d1e6dfadea	sstables: Harden estimate_droppable_tombstone_ratio() interface The interface is fragile because the user may incorrectly use the wrong "gc before". Given that sstable knows how to properly calculate "gc before", let's do it in estimate__d__t__r(), leaving no room for mistakes. sstable_run's variant was also changed to conform to new interface, allowing ICS to properly estimate droppable ratio, using GC before that is calculated using each sstable's range. That's important for upcoming tablets, as we want to query only the range that belongs to a particular tablet in the repair history table. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#15931	2023-12-20 19:04:41 +02:00
Benny Halevy	aa70e3a536	dht: fold compatible_ring_position in ring_position.hh Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-11-05 20:01:29 +02:00
Piotr Jastrzebski	9edf6e4653	sstable_set: Remove unused _schema field Signed-off-by: Piotr Jastrzebski <haaawk@gmail.com>	2023-10-04 18:50:23 +02:00
Piotr Jastrzebski	ce2be977a6	sstable_set_impl: Return also schema from make_incremental_selector Define sstable_set_impl::selector_and_schema_t type as a tuple that contains both a newly created selector and a schema that the selector is using. This will allow removal of _schema field from sstable_set class as the only place it was used was make_incremental_selector. Signed-off-by: Piotr Jastrzebski <haaawk@gmail.com>	2023-10-04 18:40:05 +02:00
Piotr Jastrzebski	47917bcf22	filter: hash key once per sstable set not sstable Before this commit the primary key was hashed for bloom filter check for each sstable. This commit makes the key be hashed once per sstable set and reused for bloom filter lookups in all sstables in the set. I tested this change using perf_simple_query with the following modifications: 1. Create more than one sstable to have sstable set of more than one element 2. Try to prevent compactions (I wasn't 100% successful) 3. Use a key that's not present to avoid reading from disk ``` diff --git a/test/perf/perf_simple_query.cc b/test/perf/perf_simple_query.cc index 26dbf1e99..6bd460df2 100644 --- a/test/perf/perf_simple_query.cc +++ b/test/perf/perf_simple_query.cc @@ -105,6 +105,8 @@ std::ostream& operator<<(std::ostream& os, const test_config& cfg) { static void create_partitions(cql_test_env& env, test_config& cfg) { std::cout << "Creating " << cfg.partitions << " partitions..." << std::endl; + // Create 10 sstables each with all the data + for (unsigned count = 0; count < 10; ++count) { for (unsigned sequence = 0; sequence < cfg.partitions; ++sequence) { if (cfg.counters) { execute_counter_update_for_key(env, make_key(sequence)); @@ -117,6 +119,7 @@ static void create_partitions(cql_test_env& env, test_config& cfg) { std::cout << "Flushing partitions..." << std::endl; env.db().invoke_on_all(&replica::database::flush_all_memtables).get(); } + } } static int64_t make_random_seq(test_config& cfg) { @@ -137,8 +140,18 @@ static std::vector<perf_result> test_read(cql_test_env& env, test_config& cfg) { query += " using timeout " + cfg.timeout; } auto id = env.prepare(query).get0(); - return time_parallel([&env, &cfg, id] { - bytes key = make_random_key(cfg); + // Always use the same key that is not present + // to make sure we don't read from disk and make + // the benchmark CPU bounded. 
+ int64_t key_value = 6; + bytes key(bytes::initialized_later(), 5*sizeof(key_value)); + auto i = key.begin(); + write<uint64_t>(i, key_value); + write<uint64_t>(i, key_value); + write<uint64_t>(i, key_value); + write<uint64_t>(i, key_value); + write<uint64_t>(i, key_value); + return time_parallel([&env, id, key] { return env.execute_prepared(id, {{cql3::raw_value::make_value(std::move(key))}}).discard_result(); }, cfg.concurrency, cfg.duration_in_seconds, cfg.operations_per_shard, cfg.stop_on_error); } @@ -423,6 +436,10 @@ static std::vector<perf_result> do_cql_test(cql_test_env& env, test_config& cfg) .with_column("C2", bytes_type) .with_column("C3", bytes_type) .with_column("C4", bytes_type) + // Try to prevent compaction + // to keep the number of sstables high + .set_compaction_enabled(false) + .set_min_compaction_threshold(2000000000) .build(); }).get(); @@ -539,6 +556,11 @@ int scylla_simple_query_main(int argc, char** argv) { const auto enable_cache = app.configuration()["enable-cache"].as<bool>(); std::cout << "enable-cache=" << enable_cache << '\n'; db_cfg->enable_cache(enable_cache); + // Try to prevent compaction + // to keep the number of sstables high + db_cfg->concurrent_compactors(1); + db_cfg->compaction_enforce_min_threshold(true); + db_cfg->compaction_throughput_mb_per_sec(1); cql_test_config cfg(db_cfg); return do_with_cql_env_thread([&app] (auto&& env) { ``` The following command showed 2-3% improvement on my machine but this depends on the length of the key and the number of sstables in the set. ``` ./build/release/scylla perf-simple-query --bypass-cache --flush -c 1 --random-seed=2068087418 --enable-cache false ``` Signed-off-by: Piotr Jastrzebski <haaawk@gmail.com> Closes scylladb/scylladb#15538	2023-09-26 16:27:11 +03:00
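The idea behind commit 47917bcf22, hashing the key once and reusing the hash for every filter in the set, can be sketched as follows. This is not ScyllaDB code: `toy_filter` and `count_candidates` are hypothetical names, and the filter stores exact hashes in a `std::unordered_set` instead of being a real bloom filter, purely to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for a per-sstable filter. A real bloom filter is a
// probabilistic bit array; this toy version remembers exact key hashes.
struct toy_filter {
    std::unordered_set<std::size_t> hashes;

    void add(const std::string& key) {
        hashes.insert(std::hash<std::string>{}(key));
    }

    // Takes an already-hashed key, so the caller decides how often to hash.
    bool may_contain(std::size_t hashed_key) const {
        return hashes.count(hashed_key) != 0;
    }
};

// Hash the key once for the whole set, then reuse the hash for each filter,
// instead of re-hashing the key inside every per-sstable lookup.
std::size_t count_candidates(const std::vector<toy_filter>& set,
                             const std::string& key) {
    const std::size_t hashed_key = std::hash<std::string>{}(key);  // computed once
    std::size_t candidates = 0;
    for (const auto& filter : set) {
        if (filter.may_contain(hashed_key)) {
            ++candidates;
        }
    }
    return candidates;
}
```

The saving grows with the key length and the number of filters consulted, which is consistent with the commit's note that the measured improvement depends on both.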
Raphael S. Carvalho	4b193c04dd	sstables: add sstable_run::run_identifier() Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2023-09-25 17:18:21 -03:00
Raphael S. Carvalho	0fe2630d70	sstables: Make all_sstable_runs() more efficient by exposing frozen shared runs Users of all_sstable_runs() don't want to mutate the runs, but rather work with their content. So let's avoid copy and make the intention explicit with the new frozen_sstable_run used as return type for the interface. This will guarantee that ICS will be able to fetch uncompacting runs efficiently. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2023-09-25 17:18:20 -03:00
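One way to expose runs without copying them or letting callers mutate them, as commit 0fe2630d70 describes, is to hand out shared pointers to const objects. The sketch below assumes that shape; the actual definition of `frozen_sstable_run` is not shown in the commit message, so `frozen_run`, `run`, and `run_set` here are hypothetical.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a run of sstables.
struct run {
    std::vector<int> sstable_ids;
};

// Assumption: a "frozen" run is modeled as a shared pointer to a const run,
// so holders can read the contents but cannot modify them.
using frozen_run = std::shared_ptr<const run>;

class run_set {
    std::vector<frozen_run> _runs;
public:
    void add(run r) {
        _runs.push_back(std::make_shared<const run>(std::move(r)));
    }

    // Returns a reference to the stored runs: no run is copied, and the
    // const element type makes the read-only intent explicit.
    const std::vector<frozen_run>& all_runs() const {
        return _runs;
    }
};
```

A caller can iterate over `all_runs()` and read `sstable_ids`, but an attempt to modify a run through a `frozen_run` does not compile.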
Raphael S. Carvalho	9f6c3369d2	sstables: Simplify sstable_set interface to retrieve runs This interface selects all runs that store at least one of the sstables in the vector. But that's very fragile, to the point that even ICS had to stop using it. A better interface is to return all runs managed by the set and allow compaction manager to do its filtering. We want to use it in ICS to avoid the overhead of rebuilding sstable runs which may be expensive as sorting is performed to guarantee the disjoint invariant. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2023-09-25 17:04:20 -03:00


148 Commits