scylladb

Author	SHA1	Message	Date
Avi Kivity	94c21e5c05	Merge 'sstables: Reduce amount of I/O for clustering-key-bounded reads from large partitions' from Tomasz Grabiec
Single-row reads from a large partition issue 64 KiB of reads to the data file, which is equal to the default span of a promoted index block in the data file. If users want to increase the selectivity of the index to speed up single-row reads, this won't be effective. The reason is that the reader uses the promoted index to look up the start position of the read in the data file, but the end position will in practice extend to the next partition, and the amount of I/O will be determined by the underlying file input stream implementation and its read-ahead heuristics. By default, that results in at least 2 IOs of 32 KiB each.
There is already infrastructure to look up the end position based on the upper bound of the read, in anticipation of sharing the promoted index cache, but it's not effective because it's a non-populating lookup and the upper bound cursor has its own private cached_promoted_index, which is cold when positions are computed. It's non-populating on purpose, to avoid extra index file IO to read the upper bound. In case the upper bound is far enough from the lower bound, this will only increase the cost of the read.
The solution employed here is to warm up the lower bound cursor's cache before positions are computed, and use that cursor for the non-populating lookup of the upper bound. We use the lower bound cursor and the slice's lower bound so that we read the same blocks as later lower-bound slicing would, so that we don't incur extra IO for cases where looking up the upper bound is not worth it, that is when the upper bound is far from the lower bound. If the upper bound is near the lower bound, then warming up using the lower bound will populate cached_promoted_index with blocks which will allow us to locate the upper bound block accurately. This is especially important for single-row reads, where the bounds are around the same key.
In this case we want to read the data file range which belongs to a single promoted index block. It doesn't matter that the upper bound is not exactly the same. They both will likely lie in the same block, and if not, binary search will bring adjacent blocks into cache. Even if upper bound is not near, the binary search will populate the cache with blocks which can be used to narrow down the data file range somewhat.
Fixes #10030.
The change was tested with perf-fast-forward. I populated the data set with `column_index_size_in_kb` set to 1:
  scylla perf-fast-forward --populate --run-tests=large-partition-slicing --column-index-size-in-kb=1
Test run:
  build/release/scylla perf-fast-forward --run-tests=large-partition-select-few-rows -c1 --keep-cache-across-test-cases --test-case-duration=0
This test issues two reads of subsequent keys from the middle of a large partition (1M rows in total). The first read will miss in the index file page cache, the second read will hit.
Notice that before the change, the second read issued 2 aio requests worth of 64 KiB in total. After the change, the second read issued 1 aio worth of 2 KiB. That's because promoted index block is larger than 1 KiB. I verified using logging that the data file range matches a single promoted index block.
Also, the first read which misses in cache is still faster after the change.
Before:
```
running: large-partition-select-few-rows on dataset large-part-ds1
Testing selecting few rows from a large partition:
stride  rows  time (s)  iterations  frags  frag/s  mad f/s  max f/s  min f/s  avg aio  aio  (KiB)  blocked  dropped  idx hit  idx miss  idx blk  c hit  c miss  c blk  allocs  tasks  insns/f  cpu
500000  1     0.009802  1           1      102     0        102      102      21.0     21   196    2        1        0        1         1        0      0       0      568     269    4716050  53.4%
500001  1     0.000321  1           1      3113    0        3113     3113     2.0      2    64     1        0        1        0         0        0      0       0      116     26     555110   45.0%
```
After:
```
running: large-partition-select-few-rows on dataset large-part-ds1
Testing selecting few rows from a large partition:
stride  rows  time (s)  iterations  frags  frag/s  mad f/s  max f/s  min f/s  avg aio  aio  (KiB)  blocked  dropped  idx hit  idx miss  idx blk  c hit  c miss  c blk  allocs  tasks  insns/f  cpu
500000  1     0.009609  1           1      104     0        104      104      20.0     20   137    2        1        0        1         1        0      0       0      561     268    4633407  43.1%
500001  1     0.000217  1           1      4602    0        4602     4602     1.0      1    2      1        0        1        0         0        0      0       0      110     26     313882   64.1%
```
Backports: none, not a regression
Closes scylladb/scylladb#20522
* github.com:scylladb/scylladb:
  perf: perf_fast_forward: Add test case for querying missing rows
  perf-fast-forward: Allow overriding promoted index block size
  perf-fast-forward: Test subsequent key reads from the middle in test_large_partition_select_few_rows
  perf-fast-forward: Allow adding key offset in test_large_partition_select_few_rows
  perf-fast-forward: Use single-partition reads in test_large_partition_select_few_rows
  sstables: bsearch_clustered_cursor: Add more tracing points
  sstables: reader: Log data file range
  sstables: bsearch_clustered_cursor: Unify skip_info logging
  sstables: bsearch_clustered_cursor: Narrow down range using "end" position of the block
  sstables: bsearch_clustered_cursor: Skip even to the first block
  test: sstables: sstable_3_x_test: Improve failure message
  sstables: mx: writer: Never include partition_end marker in promoted index block width
  sstables: Reduce amount of I/O for clustering-key-bounded reads from large partitions
  sstables: clustered_cursor: Track current block	2024-10-28 21:13:23 +02:00
Kefu Chai	24d14b601b	treewide: s/boost::adaptors::map_values/std::views::values/ now that we are allowed to use C++23. we now have the luxury of using `std::views::values`. in this change, we:
- replace `boost::adaptors::map_values` with `std::views::values`
- update affected code to work with `std::views::values`
- the places where we use `boost::join()` are not changed, because we cannot use `std::views::concat` yet. this helper is only available in C++26.
to reduce the dependency to boost for better maintainability, and leverage standard library features for better long-term support. this change is part of our ongoing effort to modernize our codebase and reduce external dependencies where possible. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#21265	2024-10-27 21:32:45 +02:00
Avi Kivity	3124711fc4	Merge 'Report rows_merged in compaction_history rest api and nodetool' from Łukasz Paszkowski
Currently, running the `nodetool compactionhistory` command or using the rest api `curl -X GET --header "Accept: application/json" "http://localhost:10000/compaction_manager/compaction_history"` returns compaction history without the `rows_merged` field. The series computes rows merged during compaction and provides this information to users via both the nodetool command and the rest api. The `rows_merged` field contains information on merged clustering keys across multiple sstable files. For instance, when compacting two sstables of a table consisting of 7 rows where two rows are part of both sstables, the output would have the following format: {1: 5, 2: 2}.
No backport is required. It extends the existing compaction history output.
Fixes https://github.com/scylladb/scylladb/issues/666
Closes scylladb/scylladb#20481
* github.com:scylladb/scylladb:
  test/rest_api: Add tests for compactionhistory
  nodetool: Add rows merged stats into compactionhistory output
  compaction: Update compaction history with collected histogram
  compaction: Remove const qualifier from methods creating sstable readers
  sstable_set: Add optional statistics to make_local_shard_sstable_reader
  make_combined_reader: Add optional parameter, combined_reader_statistics
  reader_selector: Extend with maximum reader count
  mutation_fragment_merger: Create histogram while consuming mutation fragment batches	2024-10-27 21:26:11 +02:00
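The `rows_merged` example in the commit message above can be reproduced with a small Python sketch. It counts how many input sstables each clustering key appears in and then builds the histogram from those counts. The key values and the `rows_merged_histogram` helper are made up for illustration; the actual histogram is built inside the mutation fragment merger, which is not shown here.

```python
from collections import Counter


def rows_merged_histogram(sstables):
    """Map 'number of input sstables a row came from' to 'number of such rows'."""
    # For every key, count how many sstables contain it.
    sources_per_key = Counter(key for sstable in sstables for key in set(sstable))
    # Histogram keyed by the source count.
    return dict(sorted(Counter(sources_per_key.values()).items()))


# Two sstables holding 7 distinct rows in total; rows 3 and 4 are in both.
sstable_a = [1, 2, 3, 4]
sstable_b = [3, 4, 5, 6, 7]
print(rows_merged_histogram([sstable_a, sstable_b]))  # {1: 5, 2: 2}
```

Five rows came from exactly one sstable and two rows were merged from two sstables, which matches the `{1: 5, 2: 2}` format quoted in the message.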
Avi Kivity	ec543e3902	Merge 'Remove all_datadirs vector of strings from table::config' from Pavel Emelyanov
The all_datadirs keeps paths to directories where local sstables can be. In fact, Scylla doesn't put sstables there, but can try to find them on boot and when checking snapshots. The 0th element of this vector, called datadir, had recently been removed by #20675; now it's time to drop all_datadirs as well. The needed paths can be obtained from table's storage options (see #20542) and db::config::data_file_directories option.
Closes scylladb/scylladb#21212
* github.com:scylladb/scylladb:
  sstables: Open-code format_table_directory_name() moved recently
  replica,sstables: Move format_table_directory_name()
  table: Remove all_datadirs
  sstables: Generate table::all_datadirs from db::config and storage_options
  replica: Prepare vector of fs::path-s with table dirs
  table: Check storage options in get_snapshot_details()	2024-10-22 17:21:31 +03:00
Łukasz Paszkowski	484655bf0d	sstable_set: Add optional statistics to make_local_shard_sstable_reader The pointer to combined_reader_statistics is propagated down to make_combined_reader in order to collect statistics. By default, a null pointer is propagated. Note that in case the pointer is valid and the sstable_set consists of exactly one sstable, statistics are skipped as all rows originate from a single sstable file. The existing optimization is crucial (see `f75154afca`).	2024-10-22 08:15:02 +02:00
Łukasz Paszkowski	84912c3155	reader_selector: Extend with maximum reader count The maximum reader count makes it possible to predict the number of readers that can be created with create_new_readers(). This helps to correctly allocate a vector size in the rows_merged statistics when a combined reader is created via make_combined_reader.	2024-10-22 08:15:02 +02:00
Kefu Chai	6ead5a4696	treewide: move log.hh into utils/log.hh the log.hh under the root of the tree was created to keep the backward compatibility when seastar was extracted into a separate library. so log.hh should belong to `utils` directory, as it is based solely on seastar, and can be used by all subsystems. in this change, we move log.hh into utils/log.hh so that it is more modularized. and this also improves the readability, when one sees `#include "utils/log.hh"`, it is obvious that this source file needs the logging system, instead of its own log facility -- please note, we do have two other `log.hh` in the tree. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-10-22 06:54:46 +03:00
Pavel Emelyanov	516a5f06a8	sstables: Open-code format_table_directory_name() moved recently This helper is small enough and it's easier to understand how table directory name is formatted without it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-10-21 15:18:19 +03:00
Pavel Emelyanov	eeb0d637bb	replica,sstables: Move format_table_directory_name() Now this helper is not needed in replica code, as all manipulations of tables' sstables now sit in the sstables/storage.cc. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-10-21 15:17:30 +03:00
Pavel Emelyanov	dedb9d349c	sstables: Generate table::all_datadirs from db::config and storage_options As mentioned in the previous patch, there are several places that need to scan all datafile directories for a given table. This list is currently stored on table.config.all_datadirs, this patch stops using one and instead generates it from db::config::data_file_directories and table's storage options. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-10-21 15:13:27 +03:00
Kefu Chai	5cd619a60c	treewide: s/boost::adaptors::map_keys/std::views::keys/ now that we are allowed to use C++23. we now have the luxury of using `std::views::keys`. in this change, we:
- replace `boost::adaptors::map_keys` with `std::views::keys`
- update affected code to work with `std::views::keys`
to reduce the dependency to boost for better maintainability, and leverage standard library features for better long-term support. this change is part of our ongoing effort to modernize our codebase and reduce external dependencies where possible. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#21198	2024-10-21 12:47:52 +03:00
Avi Kivity	c3be2489ce	treewide: drop includes of <boost/range/adaptors.hpp> This includes way too much, including <boost/regex.hpp>, which is huge. Drop includes of adaptors.hpp and replace by what is needed. Closes scylladb/scylladb#21187	2024-10-20 17:17:11 +03:00
Kefu Chai	5c0db8a49e	sstable_directory: remove extraneous semicolon one semicolon is enough to mark the end of a statement. so let's remove the extraneous one. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#21171	2024-10-18 21:58:04 +03:00
Botond Dénes	568b767ec3	Merge 'schema: convert from boost ranges to std ranges' from Avi Kivity
To reduce dependency load, change uses of boost ranges to std::ranges. The first patch is preparation, replacing a construct that isn't easy to support with std ranges with something simpler.
No backport as this is a code cleanup.
Closes scylladb/scylladb#21122
* github.com:scylladb/scylladb:
  schema: replace boost ranges with std ranges
  schema: precompute all_columns_in_select_order()	2024-10-18 08:42:50 +03:00
Avi Kivity	6fd219d982	sstables: generation_type: deinline from_string() This is not performance sensitive and penalizes everyone by including boost/regex.hpp. Fix by deinlining. Closes scylladb/scylladb#21147	2024-10-17 13:41:15 +03:00
Avi Kivity	820509026f	schema: replace boost ranges with std ranges To reduce dependency load, use std ranges instead of boost ranges. The std::ranges::{lower,upper}_bound don't support heterogeneous lookup, but a more natural solution is to use a projection to search for the name, so we use that and the custom comparator is removed. Many callers are converted as well due to poor interoperability between boost ranges and std ranges.	2024-10-15 16:42:54 +03:00
Avi Kivity	db14a01901	Merge 'Use table id as system.sstables partition key' from Pavel Emelyanov
The system.sstables (a.k.a. sstables registry) primary key is "string location" as partition key and "uuid generation" as clustering one. The "location" part was taken from table.config.datadir value which, in turn, is a string containing the path to on-disk files if the table is located locally, e.g. /var/lib/scylla/data/ks/cf-abc123. Recently [1] the datadir was moved from table config onto storage options, but this string is still used as registry key.
Other than being owned by a table with ID, sstables are accessed by restore-from-object-storage code [2]. To make it work, both storage driver and sstable_directory helper class maintain two formats of object prefixes for sstables components. For S3-backed sstables having a record in registry, the path used is s3://bucket/generation/component. For restore code there are user-provided prefixes that do not match the aforementioned pattern. The selection between those two is now made by checking sstable state, which is not obvious and may cause trouble for tiered storage driver.
This patch changes the registry schema so that partition key becomes "uuid owner" and is set to be table.id() value. This is to stop using the local path by S3 backed sstables. Also this change makes it possible for storage driver and sstable directory to rely on the storage options only to tell different bucket prefix formats from each other.
As a side effect, the make_s3_object_name() helper, that generates the proper object name, becomes explicit for restore-from-S3 usage. Currently it relies on the sstable::filename() calling this->prefix() behind the scenes and the latter to return the user-provided prefix, which is a pretty fragile construction.
No need to backport (and it's not going to be easy to do it), storage options feature is still experimental
Refs #20675 [1]
Refs #20305 [2]
Closes scylladb/scylladb#20998
* github.com:scylladb/scylladb:
  sstables: Flatten S3 object name making
  sstable_directory: Flatten directory lister creation
  treewide: Rename sstable registry location field to be owner
  system_keyspace: Change sstables registry partition key type
  sstables: Keep location variant on s3 backend too
  storage_options: Use variant on S3 options
  sstables: Split sstable::filename() helper
  sstables: Add s3_storage::owner() helper	2024-10-13 20:08:43 +03:00
Pavel Emelyanov	a7042d66e3	sstables: Flatten S3 object name making The s3_storage backend driver has a method that generates object path within the bucket. Depending on options alternative it picks one of two formats:
- for string prefix, it uses it implicitly via sstable::filename() call that calls storage->prefix() which, in turn, returns prefix value
- for registry-backed sstables, the /bucket/generation/component path is generated
This patch brushes this place up. Similarly to previous patch, this change also makes the selection based on the location alternative, not on the sstable state. It's also an idempotent change, as S3 sstables with 'upload' state only appear when restoring from object store, and in this case the string location is in use. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-10-11 14:11:28 +03:00
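The selection described above, picking the object name format from the location alternative rather than from the sstable state, can be sketched roughly as follows. This is a Python illustration only: `object_name_sketch`, its parameters, and the way the string prefix is joined with the component name are assumptions for the example, not the signature or behavior of the real `s3_storage` code. `uuid.UUID` stands in for `table_id`.

```python
import uuid
from typing import Union

# Stand-ins for the two alternatives of the S3 location variant:
# a user-provided string prefix, or the owning table's ID.
Location = Union[str, uuid.UUID]


def object_name_sketch(bucket: str, location: Location,
                       generation: str, component: str) -> str:
    """Pick the object name format based on which alternative is held."""
    if isinstance(location, str):
        # Restore-from-object-store case: the caller supplied the prefix.
        # Joining prefix and component with '/' is an assumption.
        return f"{location}/{component}"
    # Registry-backed case: /bucket/generation/component, as in the message.
    return f"/{bucket}/{generation}/{component}"


print(object_name_sketch("bkt", "backups/ks/cf", "gen-1", "Data.db"))
print(object_name_sketch("bkt", uuid.UUID(int=1), "gen-1", "Data.db"))
```

The point of the sketch is that the branch depends only on the type held by the location, so no sstable state needs to be consulted.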
Pavel Emelyanov	8d5537a439	sstable_directory: Flatten directory lister creation After previous patching, the way components lister is created for S3 storage options became quite hairy. This patch brushes things up to be easier to read. The only "functional" change here is that selection between registry lister and S3 lister is made based on options' location held alternative, not on the sstable state value. That's in fact an idempotent change, the only caller that provides string location on options is the "restore from object store" code that also sets state to be 'upload'. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-10-11 14:11:28 +03:00
Pavel Emelyanov	031893259a	treewide: Rename sstable registry location field to be owner This is sort of continuation of the previous patch. The partition key in the registry is now table_id, not string, and is better called "owner", not "location". This patch is s/location/owner/ over specific places that include field name in the schema, argument names in registry maintenance classes and tests accessing the selected row fields by name. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-10-11 14:11:28 +03:00
Pavel Emelyanov	3315e3a2a9	system_keyspace: Change sstables registry partition key type Today, the system.sstables schema uses string as partition key. Callers, in turn, use table's datadir value to reference entries in it. That's wrong, S3-backed sstables don't have any local paths to work with. The table's ID is better in this role. This patch only changes the field type to be table_id and fixes the callers to provide one. In particular, see init_table_storage() change -- instead of generating a datadir string, it sets table.id() as the options' location. Other fixed places are tests. Internally, this id value is propagated via s3_storage::owner() method, that's fixed as well. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-10-11 13:48:09 +03:00
Pavel Emelyanov	bb13b7bf72	sstables: Keep location variant on s3 backend too Previous patch put variant<string, table_id> as location of S3 options. This patch makes the S3 sstables backend driver keep variant as sstable location. As with the previous patch, driver only keeps variant, but continues using its string alternative internally. This will be changed later on. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-10-11 13:09:47 +03:00
Pavel Emelyanov	1181b6b082	storage_options: Use variant on S3 options Describing S3 storage for sstables nowadays has two options -- via sstables registry entry and by using the direct prefix string. The former is used when putting a keyspace on S3. In this case each sstable has the corresponding entry in the system.sstables table. The latter is used by "restore from object storage" code. In that case, sstables don't have entries in the registry, but are accessed by a specific S3 object path. This patch reflects this difference by making s3_options::location be variant of string prefix and table_id owner. The owner needs more explanation, here it is. Today, the system.sstables schema defines partition key to be "string location" and clustering key to be "UUID generation". The partition key is table's datadir string, but it's wrong to use it this way. Next patches will change the partition key to be table's ID (there's table_id type for it), and before doing it storage options must be prepared to carry it onboard. This patch does it, but the table_id alternative of the location is still unused, the rest of the code keeps using the string location to reference a row in the registry table. Next patches will eventually make use of the table_id value. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-10-11 13:04:52 +03:00
Pavel Emelyanov	ba97072709	sstables: Split sstable::filename() helper To have the filename(type, prefix) one, next patches will provide prefix on their own, to avoid storage->prefix() call. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-10-11 12:47:13 +03:00
Pavel Emelyanov	6f9cb51259	sstables: Add s3_storage::owner() helper This driver uses sstring _location as part of the lookup key in the sstables registry. Next patches will need to change that and put more checks on the registry access, so introduce a helper method beforehand. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-10-11 12:47:12 +03:00
Pavel Emelyanov	77eb9ddb0f	sstable_set: Reserve vector of readers When generating readers for the set of sstables, the end size of this vector is known in advance and its storage can be reserved. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#21055	2024-10-11 09:56:17 +03:00
Benny Halevy	3a12ad96c7	sstables: scylla_metadata: add sstable identifier Keep a copy of the sstable uuid generation in a new scylla_metadata sstable_identifier attribute. If the SSTable happens to have a numerical generation just create a new time-uuid and log a message about that. Dump this new attribute in scylla sstable dump tool. And add a unit test to verify that the written (and then loaded) sstable identifier matches the sstable's generation. The motivation for this change stems from backup deduplication. In essence, an sstable may already have been backed up in a previous snapshot, and we don't want to back it up again if it's already present on external storage. Today this is based on rclone that compares files checksums, but once scylla backs up the sstables using the native object-storage stack (#19890), we would like to use the sstable globally-unique identifier for deduplication. Although the uuid-generation is encoded in the sstable path, the latter may change, e.g. due to intra-node migration, so keep a copy of the original unique identifier in scylla-metadata, and that attribute would survive file-based or intra-node migrations. Fixes scylladb/scylladb#20459 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#21002	2024-10-10 08:52:46 +03:00
Avi Kivity	bb1867c7c7	Merge 'sstables: Add digest checking in the validation path of the sstable layer' from Nikos Dragazis
This PR builds upon the PR for checksum validation (#20207) to further enhance scrub's corruption detection capabilities by validating digests as well. The digest (full checksum) is the checksum over the entire data, as opposed to per-chunk checksums which apply to individual chunks. Until now, digests were not examined on any code paths. This PR integrates digest checking into the compressed/checksummed data sources as an optional feature and enables it only through the validation path of the sstable layer (`sstable::validate()`). The validation path is used by the following tools:
* scrub in validate mode
* `sstable validate`
All other reads, including normal user reads, are unaffected by this change.
The PR consists of:
* Extensions to the compressed and checksummed data sources to support digest checking. The data sources receive the expected digest as a parameter and calculate the actual digest incrementally across multiple get() calls. The check happens on the get() call that reaches EOF and results in an exception if the digest is invalid. A digest check requires reading the whole file range. Therefore, a partial read or skip() is treated as an internal error.
* A new shareable digest component loaded on demand by the validation code. No lifecycle management.
* Grouping of old scrub/validate tests for compressed and uncompressed SSTables to reduce code duplication.
* scrub/validate tests for SSTables with valid checksums but invalid digests, and SSTables with no digests at all.
* scrub/validate tests with 3.x Cassandra SSTables to ensure compatibility.
Refs #19058.
New feature, no backport is needed.
Closes scylladb/scylladb#20720
* github.com:scylladb/scylladb:
  test: Test scrub/validate with SSTables from Cassandra
  compaction: Make quarantine optional for perform_sstable_scrub()
  test: Make random schema optional in scrub_test_framework
  test: Add tests for invalid digests
  test: Merge scrub/validate tests for compressed and uncompressed cases
  sstables: Verify digests on validation path
  sstables: Check if digest component exists
  sstables: Add digest in the SSTable components
  sstables: Add digest check in compressed data source
  sstables: Add digest check in checksummed data source	2024-10-09 21:33:08 +03:00
Botond Dénes	3e468608e7	Merge 'Collect sstables on boot from all datadirs (and don't collect from S3 twice)' from Pavel Emelyanov
There's a long-pending issue in distributed loader. When it populates sstables on boot it loops over table.config.all_datadirs, but ignores the loop cursor (the datadir itself), instead loading sstables from table.config.dir, which is 0th element of all_datadirs. There's a test for that, but it's also broken. Effectively collection happens from table.config.dir several times. For local sstables that's just wasted work and potentially lost sstables (but nobody seems to configure more than 1 datadir anyway). For S3 sstables it's also wasted work and incorrectness.
The fix is for both -- populator and test. The former is to use all_datadirs to construct sstable_directory. To make it happen, creation of sstable_directory now depends on the storage options, the loop is moved into the branch that creates sstable_directory for local storage type. The test fix is to make sure that there are some sstables in a non-default datadir before running population code.
Closes scylladb/scylladb#20819
* github.com:scylladb/scylladb:
  test: Fix test_multiple_data_dirs
  distributed_loader: Indentation fix after previous patch
  distributed_loader: Use correct datadir to collect local sstable
  distributed_loader: Move all-datadirs loop to local storage collecting
  distributed_loader: Collect table subdirs based on its storage options
  distributed_loader: Indentation fix after previous patch
  distributed_loader: Squash loop of collect_subdir into one method
  distributed_loader: Convert map of directories into a vector
  distributed_loader: Make start_subdir() method work with directory
  distributed_loader: Drop local reference variable
  distributed_loader: Split start_subdir()
  distributed_loader: Remove allow-offstrategy argument
  distributed_loader: Make populate() method work with directory
  distributed_loader: Remove check for sstable_directory presense
  distributed_loader: Out-line table_populator() methods
  distributed_loader: Print storage options, not datadir
  distributed_loader: Print prepared message
  sstable_directory: Add sstable_state argument ot one of constructors
  sstable_directory: Add state() method	2024-10-09 14:43:34 +03:00
Laszlo Ersek	934b42c6a8	cmake/check_headers: correct typos Commit `efd65aebb2` ("build: cmake: add check-header target", 2023-11-13) introduced three typos:
- In "cmake/check_headers.cmake", it checked whether the "parsed_args_GLOB_RECURSE" argument was defined, but then it referenced the same under the wrong name "parsed_args_RECURSIVE".
- The above error masked two further typos; namely the duplicate use of "api" and "streaming" each, as targets. With "parsed_args_GLOB_RECURSE" above fixed, CMake now reports these conflicting arguments (target names). They should have been "node_ops" and "sstables", respectively.
Correct the typos. Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com> Closes scylladb/scylladb#20992	2024-10-08 09:38:16 +03:00
Nikos Dragazis	3a3783ee23	sstables: Verify digests on validation path Extend the validation path to perform digest checking on all SSTables. This is achieved by loading the digest component on demand and passing it to the underlying data sources only during validation. The data sources for compressed and uncompressed SSTables were modified in previous patches to support digest checking. Consider digest checking as part of the integrity checking mechanism (i.e., requires `integrity_check::yes`) to ensure it remains disabled for all reads happening outside of the validation path (i.e., `sstable::validate()`). This practically means that digest checking is enabled only for:
* scrub in validate mode
* sstable validate
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2024-10-07 15:21:09 +03:00
Pavel Emelyanov	87d392d071	sstable_directory: Add sstable_state argument ot one of constructors There's one constructor that became unused after `787ea4b1`. Modify it with the 'state' argument so that it could be used later. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-10-07 12:03:36 +03:00
Pavel Emelyanov	b56483ab67	sstable_directory: Add state() method The one will expose sstables state the directory works with. For convenience. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-10-07 11:23:50 +03:00
Nikita Kurashkin	874cafefab	SStables: replace assertion with malformed_sstable_exception for invalid chunk_size This will allow us to see the underlying sstable file. Fixes #20277 Closes scylladb/scylladb#20784	2024-10-04 14:48:35 +03:00
Pavel Emelyanov	6b480589fe	Merge 'treewide: accept list of sstables in "restore" API ' from Kefu Chai
before this change, we enumerate the sstables tracked by the system.sstables table, and restore them when serving requests to "storage_service/restore" API. this works fine with "storage_service/backup" API. but this "restore" API cannot be used as a drop-in replacement of the rclone based API currently used by scylla-manager. in order to fill the gap, in this change:
* add the "prefix" parameter for specifying the shared prefix of sstables
* add the "sstables" parameter for specifying the list of TOC components of sstables
* remove the "snapshot" parameter, as we don't encode the prefix on scylla's end anymore.
* make the "table" parameter mandatory.
Fixes https://github.com/scylladb/scylladb/issues/20461
----
this change is a part of the efforts to bring the native backup/restore to scylla, no need to backport.
Closes scylladb/scylladb#20685
* github.com:scylladb/scylladb:
  treewide: accept list of sstables in "restore" API
  sstable: pass get_storage_option to sstable_directory::load_sstable()
  test/nodetool: add body parameter to `expected_request`
  tools/scylla-nodetool: enable nodetool to write HTTP body	2024-10-04 12:38:08 +03:00
Nikos Dragazis	347f5ee166	sstables: Check if digest component exists Extend `read_digest()` to first check if the digest component exists before attempting to load it from disk. Make `validate_checksums()` throw an error if the component does not exist to preserve its current behavior. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2024-10-03 18:09:05 +03:00
Nikos Dragazis	7e738bcd2d	sstables: Add digest in the SSTable components SSTables store their digest in a Digest file. Add this in the list of SSTable components. In a follow-up patch we will use this component to enable digest checking in the validation path. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2024-10-03 18:09:05 +03:00
Nikos Dragazis	c893f06409	sstables: Add digest check in compressed data source Following the addition of digest check in the checksummed data source, add the same feature to the compressed data source as well. This ensures consistent behavior across any type of SSTable. This is added as an optional feature so that we can preserve the current behavior, that is verify only the per-chunk checksums during normal user reads. To ensure zero cost at runtime when disabled, we introduce the on/off switch as a template parameter. The digest calculation for compressed SSTables depends on the SSTable format, hence the new template argument for the checksum mode. This is consistent with the compressed data sink. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2024-10-03 18:09:01 +03:00
Nikos Dragazis	0df1c01759	sstables: Add digest check in checksummed data source The checksummed data source verifies the checksum of each chunk in the data files of uncompressed SSTables. This is being leveraged by scrub in validation mode. Extend the data source to check the digest (full checksum) as well. Unlike checksums, this is added as an optional feature so that SSTables without a digest can still be validated in a per-chunk basis. To enable this, the caller needs to set the template parameter `check_digest` to true, and provide the expected digest. The data source calculates the digest incrementally through multiple get() calls and compares against the expected digest after reading the whole file range. If there is a mismatch, it throws an exception. Checking the digest requires reading the whole data file. If this cannot be satisfied (e.g., due to partial read or skip()), the data source fails immediately. If the user has successfully read the whole file range, it can be safely assumed that the digest is valid. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2024-10-03 18:08:56 +03:00
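The incremental digest check described in the two commits above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `chunk_source`, `adler32_update`, and the choice of Adler-32 are hypothetical stand-ins, not ScyllaDB's data source classes or its digest algorithm. The sketch only shows the general shape: a running digest updated on every `get()`, compared against the expected value once the whole range has been read, with a `check_digest` template parameter so the disabled mode compiles the extra work away.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <string_view>
#include <utility>
#include <vector>

// Illustrative running checksum (Adler-32); the real digest algorithm may differ.
inline uint32_t adler32_update(uint32_t adler, std::string_view buf) {
    uint32_t a = adler & 0xffff;
    uint32_t b = (adler >> 16) & 0xffff;
    for (unsigned char c : buf) {
        a = (a + c) % 65521;
        b = (b + a) % 65521;
    }
    return (b << 16) | a;
}

// Hypothetical stand-in for a data source that hands out chunks one by one.
template <bool check_digest>
class chunk_source {
    std::vector<std::string_view> _chunks;
    std::size_t _next = 0;
    uint32_t _running = 1;   // Adler-32 initial value
    uint32_t _expected;      // only consulted when check_digest is true
public:
    explicit chunk_source(std::vector<std::string_view> chunks, uint32_t expected_digest = 0)
        : _chunks(std::move(chunks)), _expected(expected_digest) {}

    // Returns the next chunk, or nullopt once the whole range has been read.
    // With check_digest enabled, a mismatch is reported at end of stream.
    std::optional<std::string_view> get() {
        if (_next == _chunks.size()) {
            if constexpr (check_digest) {
                if (_running != _expected) {
                    throw std::runtime_error("digest mismatch");
                }
            }
            return std::nullopt;
        }
        std::string_view chunk = _chunks[_next++];
        if constexpr (check_digest) {
            _running = adler32_update(_running, chunk);
        }
        return chunk;
    }
};
```

A caller that drains `chunk_source<true>` to the end without an exception can treat the digest as verified; `chunk_source<false>` never touches the running digest.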
Tomasz Grabiec	753f6a61fd	sstables: bsearch_clustered_cursor: Add more tracing points	2024-10-03 16:24:18 +02:00
Tomasz Grabiec	95b864497a	sstables: reader: Log data file range	2024-10-03 14:16:05 +02:00
Tomasz Grabiec	41d3ae5e81	sstables: bsearch_clustered_cursor: Unify skip_info logging Now all exit paths which return skip_info will print it in the same way which makes for easier log parsing.	2024-10-03 14:16:05 +02:00
Tomasz Grabiec	1b82d5117a	sstables: bsearch_clustered_cursor: Narrow down range using "end" position of the block This is an optimization. Example: block0: start=aaa, end=aaA block1: start=bbb, end=bbB block2: whatever Before the patch, advance_to("aAA") would skip to block0, and upper bound probe would skip to block1. This way, the reader would read the range of block0 from the data file. After the patch, "end" position is taken into account, so advance_to("aAA") will notice that block0 doesn't contain the position and will skip to block1. This is especially important for dense indexes, as it allows us to skip accessing data file if the search key is missing. It also solves the edge case problem related to the fact that single row reads are using a range with positions which are not equal to the key, but are before(key) and after(key) for the lower bound and upper bound respectively. Before the patch, advance_to(before("bbb")) would skip to block0, because the position is before block1's start. And upper bound probe for after("bbb") would point to block2. This way the read would scan block0 needlessly. After the patch, advance_to(before("bbb")) will skip to block1 because we notice based on "end" that block0 doesn't contain the position. This change also ensures that the start position of the upper bound entry of the after_key(pos), where pos is the last advance_to() position, is warm in cache. This is needed to optimize single-row reads with a dense index so that they always read exactly one promoted index block. For this to work, probe_upper_bound() for the after_key(row) always needs to find the upper bound block in cache.	2024-10-03 14:16:05 +02:00
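The effect of consulting a block's "end" position can be sketched with a toy lookup. This is not the bsearch_clustered_cursor code: `block`, `block_by_start`, and `block_by_start_and_end` are hypothetical names, and plain string keys compared lexicographically stand in for clustering positions (the keys in the test are made up and do not follow the ordering used in the commit's example).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified promoted index block: first and last key it covers.
struct block {
    std::string start;
    std::string end;
};

// Lookup by "start" only: index of the last block whose start <= pos
// (block 0 when pos sorts before every block).
inline std::size_t block_by_start(const std::vector<block>& blocks, const std::string& pos) {
    std::size_t lo = 0, hi = blocks.size();
    while (lo < hi) {
        std::size_t mid = lo + (hi - lo) / 2;
        if (blocks[mid].start <= pos) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return lo == 0 ? 0 : lo - 1;
}

// Lookup that also consults "end": if pos lies past the candidate block's
// last key, the block cannot contain it, so move on to the next block.
// A result equal to blocks.size() means pos is past all blocks.
inline std::size_t block_by_start_and_end(const std::vector<block>& blocks, const std::string& pos) {
    std::size_t i = block_by_start(blocks, pos);
    if (i < blocks.size() && blocks[i].end < pos) {
        ++i;
    }
    return i;
}
```

For a key that falls in the gap between two blocks, the first function selects the earlier block (which would then be read needlessly), while the second selects the later one, so the lower and upper bounds of a lookup for a missing key can coincide.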
Tomasz Grabiec	b03f23a09b	sstables: bsearch_clustered_cursor: Skip even to the first block It was unnecessary to emit a skip info for the first block since it follows immediately the partition start, but it is relevant to the optimization of avoiding data reads for missing keys. This optimization relies on the fact that lower bound position equals upper bound position. If the reader's key is before the first key in the partition and we don't arm the skip info for the first block, lower bound would be equal to the partition start, and upper bound would be equal to the first row's position, which are not equal.	2024-10-03 14:16:05 +02:00
Tomasz Grabiec	7f077893ed	sstables: mx: writer: Never include partition_end marker in promoted index block width Currently, it may happen that the last promoted index block includes the partition_end marker. That's because we first write the partition end marker and then emit the unclosed block. This behavior matches Cassandra (checked in 3.x and 5.0.1). This is problematic for ruling out data file reads based on index. The width field is currently unused, but it will be used later where the width of the last block is used to compute the skip position past the last block for lookups which land after all keys in the partition. If width includes the marker then such a skip would land in the next partition, which is incorrect, as the reader context expects a cell element. Even if that was recognized, it's wrong - if this is not a single partition read (so upper bound is not at the next partition too), then we would read from the wrong (next) partition. We want to be able to make such skips in order to avoid unnecessary data file IO for reads of missing rows. Currently, we would always read the last block even if the key is past its "end" position. Another way to solve this would be to propagate the "past the last block" condition from the index cursor to the reader and let it deal with it, but the logic for that would be complicated. With this fix, there is no special logic required.	2024-10-03 14:09:57 +02:00
Kefu Chai	f9091066b7	treewide: replace boost::irange with std::views::iota where possible when building scylla with the standard library from GCC-14.2, shipped by fedora 41, we have following build failure: ``` /home/kefu/.local/bin/clang++ -DDEBUG -DDEBUG_LSA_SANITIZER -DFMT_SHARED -DSANITIZE -DSCYLLA_BUILD_MODE=debug -DSCYLLA_ENABLE_ERROR_INJECTION -DSEASTAR_API_LEVEL=7 -DSEASTAR_DEBUG -DSEASTAR_DEBUG_PROMISE -DSEASTAR_DEBUG_SHARED_PTR -DSEASTAR_DEFAULT_ALLOCATOR -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SHUFFLE_TASK_QUEUE -DSEASTAR_SSTRING -DSEASTAR_TYPE_ERASE_MORE -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"Debug\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/gen -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/seastar/gen/include -I/home/kefu/dev/scylladb/build/seastar/gen/src -isystem /home/kefu/dev/scylladb/abseil -g -Og -g -gz -std=gnu++23 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb/build=. 
-march=x86-64-v3 -mpclmul -Xclang -fexperimental-assignment-tracking=disabled -Werror=unused-result -fstack-clash-protection -fsanitize=address -fsanitize=undefined -MD -MT CMakeFiles/scylla-main.dir/Debug/init.cc.o -MF CMakeFiles/scylla-main.dir/Debug/init.cc.o.d -o CMakeFiles/scylla-main.dir/Debug/init.cc.o -c /home/kefu/dev/scylladb/init.cc In file included from /home/kefu/dev/scylladb/init.cc:12: In file included from /home/kefu/dev/scylladb/db/config.hh:20: In file included from /home/kefu/dev/scylladb/locator/abstract_replication_strategy.hh:26: /home/kefu/dev/scylladb/locator/tablets.hh:410:30: error: unexpected type name 'size_t': expected expression 410 \| return boost::irange<size_t>(0, tablet_count()) \| boost::adaptors::transformed([] (size_t i) { \| ^ /home/kefu/dev/scylladb/locator/tablets.hh:410:23: error: no member named 'irange' in namespace 'boost' 410 \| return boost::irange<size_t>(0, tablet_count()) \| boost::adaptors::transformed([] (size_t i) { \| ~~~~~~~^ /home/kefu/dev/scylladb/locator/tablets.hh:410:38: error: left operand of comma operator has no effect [-Werror,-Wunused-value] 410 \| return boost::irange<size_t>(0, tablet_count()) \| boost::adaptors::transformed([] (size_t i) { \| ^ 3 errors generated. 
[16/782] Building CXX object CMakeFiles/scylla-main.dir/Debug/keys.cc.o [17/782] Building CXX object CMakeFiles/scylla-main.dir/Debug/counters.cc.o [18/782] Building CXX object CMakeFiles/scylla-main.dir/Debug/partition_slice_builder.cc.o [19/782] Building CXX object CMakeFiles/scylla-main.dir/Debug/mutation_query.cc.o FAILED: CMakeFiles/scylla-main.dir/Debug/mutation_query.cc.o /home/kefu/.local/bin/clang++ -DDEBUG -DDEBUG_LSA_SANITIZER -DFMT_SHARED -DSANITIZE -DSCYLLA_BUILD_MODE=debug -DSCYLLA_ENABLE_ERROR_INJECTION -DSEASTAR_API_LEVEL=7 -DSEASTAR_DEBUG -DSEASTAR_DEBUG_PROMISE -DSEASTAR_DEBUG_SHARED_PTR -DSEASTAR_DEFAULT_ALLOCATOR -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SHUFFLE_TASK_QUEUE -DSEASTAR_SSTRING -DSEASTAR_TYPE_ERASE_MORE -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"Debug\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/gen -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/seastar/gen/include -I/home/kefu/dev/scylladb/build/seastar/gen/src -isystem /home/kefu/dev/scylladb/abseil -g -Og -g -gz -std=gnu++23 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb/build=. 
-march=x86-64-v3 -mpclmul -Xclang -fexperimental-assignment-tracking=disabled -Werror=unused-result -fstack-clash-protection -fsanitize=address -fsanitize=undefined -MD -MT CMakeFiles/scylla-main.dir/Debug/mutation_query.cc.o -MF CMakeFiles/scylla-main.dir/Debug/mutation_query.cc.o.d -o CMakeFiles/scylla-main.dir/Debug/mutation_query.cc.o -c /home/kefu/dev/scylladb/mutation_query.cc In file included from /home/kefu/dev/scylladb/mutation_query.cc:12: In file included from /home/kefu/dev/scylladb/schema/schema_registry.hh:17: In file included from /home/kefu/dev/scylladb/replica/database.hh:11: In file included from /home/kefu/dev/scylladb/locator/abstract_replication_strategy.hh:26: /home/kefu/dev/scylladb/locator/tablets.hh:410:30: error: unexpected type name 'size_t': expected expression 410 \| return boost::irange<size_t>(0, tablet_count()) \| boost::adaptors::transformed([] (size_t i) { \| ^ /home/kefu/dev/scylladb/locator/tablets.hh:410:23: error: no member named 'irange' in namespace 'boost' 410 \| return boost::irange<size_t>(0, tablet_count()) \| boost::adaptors::transformed([] (size_t i) { \| ~~~~~~~^ /home/kefu/dev/scylladb/locator/tablets.hh:410:38: error: left operand of comma operator has no effect [-Werror,-Wunused-value] 410 \| return boost::irange<size_t>(0, tablet_count()) \| boost::adaptors::transformed([] (size_t i) { \| ^ In file included from /home/kefu/dev/scylladb/mutation_query.cc:12: In file included from /home/kefu/dev/scylladb/schema/schema_registry.hh:17: In file included from /home/kefu/dev/scylladb/replica/database.hh:37: In file included from /home/kefu/dev/scylladb/db/snapshot-ctl.hh:20: /home/kefu/dev/scylladb/tasks/task_manager.hh:403:54: error: no member named 'irange' in namespace 'boost' 403 \| co_await coroutine::parallel_for_each(boost::irange(0u, smp::count), [&tm, id, &res, &func] (unsigned shard) -> future<> { \| ~~~~~~~^ 4 errors generated. 
```
so let's take the opportunity to switch from `boost::irange` to `std::views::iota`. in this change, we: - switch from boost::irange to std::views::iota for better standard library compatibility - retain boost::irange where step parameter is used, as std::views::iota doesn't support it - this change partially modernizes our range usage while maintaining existing functionality Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#20924	2024-10-03 10:33:33 +03:00
Tomasz Grabiec	a29501ed67	sstables: Reduce amount of I/O for clustering-key-bounded reads from large partitions Single-row reads from large partition issue 64 KiB reads to the data file, which is equal to the default span of the promoted index block in the data file. If users would want to increase selectivity of the index to speed up single-row reads, this won't be effective. The reason is that the reader uses promoted index to look up the start position in the data file of the read, but end position will in practice extend to the next partition, and amount of I/O will be determined by the underlying file input stream implementation and its read-ahead heuristics. By default, that results in at least 2 IOs 32KB each. There is already infrastructure to lookup end position based on upper bound of the read, but it's not effective because it's a non-populating lookup and the upper bound cursor has its own private cached_promoted_index, which is cold when positions are computed. It's non-populating on purpose, to avoid extra index file IO to read upper bound. In case upper bound is far-enough from the lower bound, this will only increase the cost of the read. The solution employed here is to warm up the lower bound cursor's cache before positions are computed, and use that cursor for non-populating lookup of the upper bound. We use the lower bound cursor and the slice's lower bound so that we read the same blocks as later lower-bound slicing would, so that we don't incur extra IO for cases where looking up upper bound is not worth it, that is when upper bound is far from the lower bound. If upper bound is near lower bound, then warming up using lower bound will populate cached_promoted_index with blocks which will allow us to locate the upper bound block accurately. This is especially important for single-row reads, where the bounds are around the same key. In this case we want to read the data file range which belongs to a single promoted index block. 
It doesn't matter that the upper bound is not exactly the same. They both will likely lie in the same block, and if not, binary search will bring adjacent blocks into cache. Even if upper bound is not near, the binary search will populate the cache with blocks which can be used to narrow down the data file range somewhat. Fixes #10030.
The change was tested with perf-fast-forward. I populated the data set with `column_index_size_in_kb` set to 1:
scylla perf-fast-forward --populate --run-tests=large-partition-slicing --column-index-size-in-kb=1
Test run:
build/release/scylla perf-fast-forward --run-tests=large-partition-select-few-rows -c1 --keep-cache-across-test-cases --test-case-duration=0
This test reads two rows from the middle of a large partition (1M rows), of subsequent keys. The first read will miss in the index file page cache, the second read will hit. Notice that before the change, the second read issued 2 aio requests worth of 64KiB in total. After the change, the second read issued 1 aio worth of 2 KiB. That's because promoted index block is larger than 1 KiB. I verified using logging that the data file range matches a single promoted index block. Also, the first read which misses in cache is still faster after the change.
Before:
running: large-partition-select-few-rows on dataset large-part-ds1
Testing selecting few rows from a large partition:
stride  rows  time (s)  iterations  frags  frag/s  mad f/s  max f/s  min f/s  avg aio  aio  (KiB)  blocked  dropped  idx hit  idx miss  idx blk  c hit  c miss  c blk  allocs  tasks  insns/f    cpu
500000     1  0.009802           1      1     102        0      102      102     21.0   21    196        2        1        0         1        1      0       0      0     568    269  4716050  53.4%
500001     1  0.000321           1      1    3113        0     3113     3113      2.0    2     64        1        0        1         0        0      0       0      0     116     26   555110  45.0%
After:
running: large-partition-select-few-rows on dataset large-part-ds1
Testing selecting few rows from a large partition:
stride  rows  time (s)  iterations  frags  frag/s  mad f/s  max f/s  min f/s  avg aio  aio  (KiB)  blocked  dropped  idx hit  idx miss  idx blk  c hit  c miss  c blk  allocs  tasks  insns/f    cpu
500000     1  0.009609           1      1     104        0      104      104     20.0   20    137        2        1        0         1        1      0       0      0     561    268  4633407  43.1%
500001     1  0.000217           1      1    4602        0     4602     4602      1.0    1      2        1        0        1         0        0      0       0      0     110     26   313882  64.1%
(cherry picked from commit dfb339376aff1ed961b26c4759b1604f7df35e54)	2024-10-01 18:40:34 +02:00
Tomasz Grabiec	41be5d1daf	sstables: clustered_cursor: Track current block Will be needed by the reader to jump to the current block even if we already advanced to it before, when setting up the reader context. We want to advance to lower bound earlier, before the parser skips to the lower bound. We want that in order to set input stream data file range based on index. If we didn't have access to the current block and used the result from advance_to(), the parser will think we're already in the block which has lower_bound when it attempts to skip, and will not skip, falling back to scanning.	2024-10-01 18:40:34 +02:00
Kefu Chai	787ea4b1d4	treewide: accept list of sstables in "restore" API before this change, we enumerate the sstables tracked by the system.sstables table, and restore them when serving requests to "storage_service/restore" API. this works fine with "storage_service/backup" API. but this "restore" API cannot be used as a drop-in replacement of the rclone based API currently used by scylla-manager. in order to fill the gap, in this change: * add the "prefix" parameter for specifying the shared prefix of sstables * add the "sstables" parameter for specifying the list of TOC components of sstables * remove the "snapshot" parameter, as we don't encode the prefix on scylla's end anymore. * make the "table" parameter mandatory. Fixes scylladb/scylladb#20461 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-10-01 23:24:56 +08:00
Kefu Chai	17181c2eca	sstable: pass get_storage_option to sstable_directory::load_sstable() before this change, we always pass `sstable_directory::_storage_opts` to `_manager.make_sstable()` in `sstable_directory::load_sstable()`. but when loading from object storage, we need to customize the storage_options on a per-sstable basis. the way to address this is to allow the caller of `sstable_directory::process_descriptor()` to pass a functor which return the `storage_options` to be used when creating the sstable. so, in this change, we update - sstable_directory::load_sstable() - sstable_directory::process_descriptor() so that they accept another parameter to create the storage_options. in the next commit we will pass a different functor for customizing the storage_options on a per-sstable basis when loading sstables. Refs scylladb/scylladb#20461 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-10-01 23:24:56 +08:00
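The idea of passing a functor that yields the storage options per sstable, instead of always using one shared value, can be sketched as follows. All names here (`storage_options`, `descriptor`, `loaded_sstable`, `storage_options_getter`, `load_all`) are simplified, hypothetical stand-ins and do not reproduce the signatures of `sstable_directory::load_sstable()` or `sstable_directory::process_descriptor()`.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for the real types.
struct storage_options { std::string location; };
struct descriptor { std::string name; };
struct loaded_sstable { std::string name; storage_options opts; };

// The caller supplies a functor that decides which options each sstable gets.
using storage_options_getter = std::function<storage_options(const descriptor&)>;

inline std::vector<loaded_sstable> load_all(const std::vector<descriptor>& descs,
                                            const storage_options_getter& get_opts) {
    std::vector<loaded_sstable> out;
    out.reserve(descs.size());
    for (const auto& d : descs) {
        // Options are resolved per descriptor rather than taken from a member.
        out.push_back(loaded_sstable{d.name, get_opts(d)});
    }
    return out;
}
```

A functor that ignores its argument and returns one shared value reproduces the old behaviour, while a functor that inspects the descriptor allows per-sstable customization.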

3597 Commits