scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-26 11:30:36 +00:00

Author	SHA1	Message	Date
Avi Kivity	f3eade2f62	treewide: relicense to ScyllaDB-Source-Available-1.0 Drop the AGPL license in favor of a source-available license. See the blog post [1] for details. [1] https://www.scylladb.com/2024/12/18/why-were-moving-to-a-source-available-license/	2024-12-18 17:45:13 +02:00
Tomasz Grabiec	8e60a0b831	Merge 'truncate: make TRUNCATE TABLE safe with tablets' from Ferenc Szili Currently truncating a table works by issuing an RPC to all the nodes which call `database::truncate_table_on_all_shards()`, which makes sure that older writes are dropped. It works with tablets, but is not safe. A concurrent replication process may bring back old data. This change makes makes TRUNCATE TABLE a topology operation, so that it excludes with other processes in the system which could interfere with it. More specifically, it makes TRUNCATE a global topology request. Backporting is not needed. Fixes #16411 Closes scylladb/scylladb#19789 * github.com:scylladb/scylladb: docs: docs: topology-over-raft: Document truncate_table request storage_proxy: fix indentation and remove empty catch/rethrow test: add tests for truncate with tablets storage_proxy: use new TRUNCATE for tablets truncate: make TRUNCATE a global topology operation storage_service: move logic of wait_for_topology_request_completion() RPC: add truncate_with_tablets RPC with frozen_topology_guard feature_service: added cluster feature for system.topology schema change system.topology_requests: change schema storage_proxy: propagate group0 client and TSM dependency	2024-12-10 17:50:50 +01:00
Avi Kivity	9024e4940c	counters.hh: drop unused boost includes Re-add them to source files that need them. Closes scylladb/scylladb#21738	2024-12-05 12:27:41 +02:00
Ferenc Szili	7f29b7d8f6	storage_proxy: propagate group0 client and TSM dependency This commit makes storage_proxy::remote dependent on raft_group0_client and topology_state_machine. storage_proxy::remote gets references to these via the call to start_remote(). These references will be needed to call storage_service::truncate_table_with_tablets().	2024-12-04 11:30:06 +01:00
Botond Dénes	f55dc71c3f	Merge 'Use checksummed input streams in `validate_checksums()`' from Nikos Dragazis With commits `ed7d352e7d` and `bb1867c7c7`, we now have input streams for both compressed and uncompressed SSTables that provide seamless checksum and digest checking. The code for these was based on `validate_checksums()`, which implements its own validation logic over raw streams. This has led to some duplicate code. This PR deduplicates the uncompressed case by modifying `validate_checksums()` to use a checksummed input stream instead of a raw stream. The same cannot be done for compressed SSTables though. The reason is that `validate_checksums()` needs to examine the whole data file, even if an invalid chunk is encountered. In the checksummed case we support that by offloading the error handling logic from the data source via a function parameter. In the compressed data source we cannot do that because it needs to return decompressed data and decompression may fail if the data are invalid. This PR also enables `validate_checksums()` to partially verify SSTables with just the per-chunk checksums if the digest is missing. In more detail, this PR consists of: * Port of some integrity checks from `do_validate_uncompressed()` to the checksummed data source. It should now be able to detect corruption due to truncated or appended chunks (expected number of chunks is retrieved from the CRC component). * Introduction of `error_handler` parameter in checksummed data source and `data_stream()`. * Refactoring of `validate_checksums()`. The JSON response of `sstable validate-checksums` was also modified to report a missing digest. * Tests for `validate_checksums()` against SSTables with truncated data, appended data, invalid digests, or no digest. Refs #19058. This PR is a hybrid of cleanup and feature. No backport is needed. Closes scylladb/scylladb#20933 * github.com:scylladb/scylladb: tools/scylla-sstable: Rename valid_checksums -> valid test: Check validate_checksums() with missing digest sstables: Allow validate_checksums() to report missing digests sstables: Refactor validate_checksums() to use checksummed data stream sstables: Add error_handler parameter to data_stream() sstables: Add error handler in checksummed data source sstables: Check for excessive chunks in checksummed data source sstables: Check for premature EOF in checksummed data source test: test_validate_checksums: Check SSTable with invalid digest test: test_validate_checksums: Check SSTable with appended data test: test_validate_checksums: Complement test for truncated SSTable	2024-12-04 10:46:18 +02:00
Avi Kivity	841481c202	Merge "move storage proxy and adjacent services to identify hosts by ids" from Gleb " This rather large patch series moves storage proxy and some adjacent services (like migration manager) to use host ids to identify nodes rather than ips. Messaging service gains a capability to address nodes by host ids (which allows dropping translations from topology coordinator code that worked on host ids already) and also makes sure that a node with incorrect host id will reject a message (can happen during address changes). The series gets rid of the raft address map completely and replaces it with the gossiper address map which is managed by the gossiper since translation is now done in the layer below raft. Fixes: scylladb/scylladb#6403 perf-simple-query -- smp 1 -m 1G output Before: enable-cache=1 Running test with config: {partitions=10000, concurrency=100, mode=read, frontend=cql, query_single_key=no, counters=no} Disabling auto compaction Creating 10000 partitions... 64336.82 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 41291 insns/op, 24485 cycles/op, 0 errors) 62669.58 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 41277 insns/op, 24695 cycles/op, 0 errors) 69172.12 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 41326 insns/op, 24463 cycles/op, 0 errors) 56706.60 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 41143 insns/op, 24513 cycles/op, 0 errors) 56416.65 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 41186 insns/op, 24851 cycles/op, 0 errors) throughput: mean=61860.35 standard-deviation=5395.48 median=62669.58 median-absolute-deviation=5153.75 maximum=69172.12 minimum=56416.65 instructions_per_op: mean=41244.62 standard-deviation=76.90 median=41276.94 median-absolute-deviation=58.55 maximum=41326.19 minimum=41142.80 cpu_cycles_per_op: mean=24601.35 standard-deviation=167.39 median=24512.64 median-absolute-deviation=116.65 maximum=24851.45 minimum=24462.70 After: enable-cache=1 Running test with config: {partitions=10000, concurrency=100, mode=read, frontend=cql, query_single_key=no, counters=no} Disabling auto compaction Creating 10000 partitions... 65237.35 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 40733 insns/op, 23145 cycles/op, 0 errors) 59283.09 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 40624 insns/op, 23948 cycles/op, 0 errors) 70851.03 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 40625 insns/op, 23027 cycles/op, 0 errors) 70549.61 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 40650 insns/op, 23266 cycles/op, 0 errors) 68634.96 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 40622 insns/op, 22935 cycles/op, 0 errors) throughput: mean=66911.21 standard-deviation=4814.60 median=68634.96 median-absolute-deviation=3638.40 maximum=70851.03 minimum=59283.09 instructions_per_op: mean=40650.89 standard-deviation=47.55 median=40624.60 median-absolute-deviation=27.11 maximum=40733.37 minimum=40622.33 cpu_cycles_per_op: mean=23264.16 standard-deviation=402.12 median=23145.29 median-absolute-deviation=237.63 maximum=23947.96 minimum=22934.59 CI: https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/13531/ SCT (longevity-100gb-4h with nemesis_selector: ['topology_changes']): https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/gleb/job/move-to-host-id/3/ Tested mixed cluster manually. " * 'gleb/move-to-host-id-v2' of github.com:scylladb/scylla-dev: (55 commits) group0: drop unused field from replace_info struct test: rename raft_address_map_test to address_map_test and move if from raft tests raft_address_map: remove raft address map topology coordinator: do not modify expire state for left/new nodes any more in raft address map topology coordinator: drop expiring entries in gossiper address map on error injections since raft one is no longer used group0: drop raft address map dependency from raft_rpc group0: move raft_ticker_type definition from raft_address_map.hh storage_service: do not update raft address map on gossiper events group0: drop raft address map dependency from raft_server_with_timeouts group0: move group0 upgrade code to host ids repair: drop raft address map dependency group0: remove unused raft address map getter from raft_group0 group0: drop raft address map from group0_state_machine dependency since it is not used there any more group0: remove dependency on raft address map from group0_state_id_handler gossiper: add get_application_state_ptr that searches by host_id gossiper: change get_live_token_owners to return host ids view: move view building to host id hints: use host id to send hints storage_proxy: remove id_vector_to_addr since it is no longer used db: consistency_level: change is_sufficient_live_nodes to work on host ids ...	2024-12-03 18:18:48 +02:00
Kefu Chai	bab12e3a98	treewide: migrate from boost::adaptors::transformed to std::views::transform now that we are allowed to use C++23. we now have the luxury of using `std::views::transform`. in this change, we: - replace `boost::adaptors::transformed` with `std::views::transform` - use `fmt::join()` when appropriate where `boost::algorithm::join()` is not applicable to a range view returned by `std::view::transform`. - use `std::ranges::fold_left()` to accumulate the range returned by `std::view::transform` - use `std::ranges::fold_left()` to get the maximum element in the range returned by `std::view::transform` - use `std::ranges::min()` to get the minimal element in the range returned by `std::view::transform` - use `std::ranges::equal()` to compare the range views returned by `std::view::transform` - remove unused `#include <boost/range/adaptor/transformed.hpp>` - use `std::ranges::subrange()` instead of `boost::make_iterator_range()`, to feed `std::views::transform()` a view range. to reduce the dependency to boost for better maintainability, and leverage standard library features for better long-term support. this change is part of our ongoing effort to modernize our codebase and reduce external dependencies where possible. limitations: there are still a couple places where we are still using `boost::adaptors::transformed` due to the lack of a C++23 alternative for `boost::join()` and `boost::adaptors::uniqued`. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#21700	2024-12-03 09:41:32 +02:00
Gleb Natapov	96309224ff	raft_address_map: remove raft address map It is no longer used.	2024-12-02 10:31:14 +02:00
Gleb Natapov	c65f64cc5f	storage_service: do not update raft address map on gossiper events Raft address map is not use any longer to resolve addresses anyway, so drop dependency on it from raft_ip_address_updater and rename it to reflect that it is no longer raft address map specific.	2024-12-02 10:31:13 +02:00
Gleb Natapov	1c5a7826dc	storage_service: pass gossip_address_map It will be used in the following patches.	2024-12-02 10:31:11 +02:00
Gleb Natapov	79358278f2	service: raft: move raft pinger to sending messages by host id This allows us to drop dependency on raft_address_map from direct_fd_pinger.	2024-12-02 10:31:11 +02:00
Gleb Natapov	76aa41dfcf	messaging_service: pass gossip_address_map to the mm and introduce send by id functions The function looks up provided host id in gossip_address_map and throws unknown_address if the mapping is not available. Otherwise it sends the message by IP found.	2024-12-01 12:12:30 +02:00
Gleb Natapov	ca2544e57e	gossiper: introduce gossip address map Introduce new address map that will be populated by the gossiper. Create in during initialization and pass it to the gossiper.	2024-12-01 12:12:29 +02:00
Kefu Chai	f436edfa22	mutation: remove unused "#include"s these unused includes are identified by clang-include-cleaner. after auditing the source files, all of the reports have been confirmed. please note, because `mutation/mutation.hh` does not include `seastar/coroutine/maybe_yield.hh` anymore, and quite a few source files were relying on this header to bring in the declaration of `maybe_yield()`, we have to include this header in the places where this symbol is used. the same applies to `seastar/core/when_all.hh`. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-11-29 14:01:44 +08:00
Nikos Dragazis	320c9d17d4	test: test_validate_checksums: Check SSTable with invalid digest Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2024-11-25 14:37:00 +02:00
Kefu Chai	33a0e5b892	treewide: replace boost::find_if with std::ranges::find_if now that we are allowed to use C++23. we now have the luxury of using `std::ranges::find_if`. in this change, we: - replace `boost::find_if` with `std::ranges::find_if` - remove all `#include <boost/range/algorithm/find_if.hpp>` to reduce the dependency to boost for better maintainability, and leverage standard library features for better long-term support. this change is part of our ongoing effort to modernize our codebase and reduce external dependencies where possible. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-11-19 10:50:01 +08:00
Nadav Har'El	e72aabae7f	test: change sstring_view to std::string_view Our "sstring_view" is an historic alias for the standard std::string_view. The test/ directory used this old alias in a few of random places, let's change them to use the standard type name. Refs #4062. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2024-11-18 16:26:20 +02:00
Kefu Chai	00810e6a01	treewide: include seastar/core/format.hh instead of seastar/core/print.hh The later includes the former and in addition to `seastar::format()`, `print.hh` also provides helpers like `seastar::fprint()` and `seastar::print()`, which are deprecated and not used by scylladb. Previously, we include `seastar/core/print.hh` for using `seastar::format()`. and in seastar 5b04939e, we extracted `seastar::format()` into `seastar/core/format.hh`. this allows us to include a much smaller header. In this change, we just include `seastar/core/format.hh` in place of `seastar/core/print.hh`. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#21574	2024-11-14 17:45:07 +02:00
Lakshmi Narayanan Sreethar	67dad99ab5	compaction::table_state: implement `get_token_range_after_split()` wrapper Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2024-11-11 12:24:00 +05:30
Michał Jadwiszczak	7bad8378c7	main,cql_test_env: start group0_service before view_builder In scylladb/scylladb#19745, view_builder was migrated to group0 and since then it is dependent on group0_service. Because of this, group0_service should be initialized/destroyed before/after view_builder. Fixes scylladb/scylladb#20772 Co-authored-by: Dawid Mędrek <dawid.medrek@scylladb.com>	2024-11-07 14:08:11 +01:00
Avi Kivity	2531dc2d80	schema_registry: stop including replica/database.hh database.hh is a hotspot that changes often (or its dependencies do). Avoid including it to reduce recompilations. Closes scylladb/scylladb#21407	2024-11-04 13:16:27 +01:00
Botond Dénes	e942c074f2	compaction/compaction_manager: make _tasks an intrusive list _tasks is currently std::list<shared_ptr<compaction_task_executor>>, but it has no role in keeping the instances alive, this is done by the fibers which create the task (and pin a shared ptr instance). This lends itself to an intrusive list, avoiding that extra allocation upon push_back(). Using an intrusive list also makes it simpler and much cheaper (O(1) vs. O(N)) to remove tasks from the _tasks list. This will be made use of in the next patch. Code using _task has to be updated because the value_type changes from shared_ptr<compaction_task_executor> to compaction_task_executor&.	2024-11-03 10:17:11 +02:00
Kefu Chai	64122b3df3	treewide: s/boost::transform/std::ranges::transform/ now that we are allowed to use C++23. we now have the luxury of using `std::ranges::transform`. in this change, we: - replace `boost::transform` with `std::ranges::transform` - update affected code to work with `std::ranges::transform` to reduce the dependency to boost for better maintainability, and leverage standard library features for better long-term support. this change is part of our ongoing effort to modernize our codebase and reduce external dependencies where possible. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#21318	2024-11-01 08:15:14 +02:00
Calle Wilund	c4361037f7	cql_test_env/gossip: Prevent double shutdown call crash Fixes scylladb/scylladb#21159 When an exception is thrown in sstable write etc such that storage_manager::isolate is initiated, we start a shutdown chain for message service, gossip etc. These are synced (properly) in storage_manager::stop, but if we somehow call gossiper::shutdown outside the normal service::stop cycle, we can end up running the method simultaneously, intertwined (missing the guard because of the state change between check and set). We then end up co_awaiting an invalid future (_failure_detector_loop_done) - a second wait. Fixed by a.) Remove superfluous gossiper::shutdown in cql_test_env. This was added in `20496ed`, ages ago. However, it should not be needed nowadays. b.) Ensure _failure_detector_loop_done is always waitable. Just to be sure. Closes scylladb/scylladb#21379	2024-10-31 10:11:20 +01:00
Avi Kivity	907da210b6	compound_compat: replace use of boost ranges with std ranges To reduce the dependency load, replace use of boost ranges with the std equivalent. Files that lost the indirect boost dependency have it added as a direct dependency.	2024-10-30 19:58:07 +02:00
Tomasz Grabiec	f3869dadc6	Merge 'compound: replace boost ranges with std ranges' from Avi Kivity Continue standardization on std::ranges. Since compound contains a custom iterator, we first have to upgrade it to C++20 iterator concepts. Cleanup / minor refactoring, so no backport. Closes scylladb/scylladb#21320 * github.com:scylladb/scylladb: compound: replace boost ranges with std ranges compound: upgrade iterator to be an std::forward_iterator	2024-10-30 11:02:51 +01:00
Kefu Chai	54d438168a	build: cmake: explicitly mark convenience libraries as STATIC before this change, these [convenience libraries](https://www.gnu.org/software/automake/manual/html_node/Libtool-Convenience-Libraries.html) were implicitly built as static libraries by default, but weren't explicitly marked as STATIC in CMake. While this worked with default settings, it could cause issues if `BUILD_SHARED_LIBS` is enabled. So before we are ready for building these components as shared libraries, let's mark all convenience libraries as STATIC for consistency and to prevent potential issues before we properly support shared library builds. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#21274	2024-10-29 10:22:19 +01:00
Avi Kivity	94c21e5c05	Merge 'sstables: Reduce amount of I/O for clustering-key-bounded reads from large partitions' from Tomasz Grabiec Single-row reads from large partition issue 64 KiB reads to the data file, which is equal to the default span of the promoted index block in the data file. If users would want to increase selectivity of the index to speed up single-row reads, this won't be effective. The reason is that the reader uses promoted index to look up the start position in the data file of the read, but end position will in practice extend to the next partition, and amount of I/O will be determined by the underlying file input stream implementation and its read-ahead heuristics. By default, that results in at least 2 IOs 32KB each. There is already infrastructure to lookup end position based on upper bound of the read, in anticipation for sharing the promoted index cache, but it's not effective becasue it's a non-populating lookup and the upper bound cursor has its own private cached_promoted_index, which is cold when positions are computed. It's non-populating on purpose, to avoid extra index file IO to read upper bound. In case upper bound is far-enough from the lower bound, this will only increase the cost of the read. The solution employed here is to warm up the lower bound cursor's cache before positions are computed, and use that cursor for non-populating lookup of the upper bound. We use the lower bound cursor and the slice's lower bound so that we read the same blocks as later lower-bound slicing would, so that we don't incur extra IO for cases where looking up upper bound is not worth it, that is when upper bound is far from the lower bound. If upper bound is near lower bound, then warming up using lower bound will populate cached_promoted_index with blocks which will allow us to locate the upper bound block accurately. This is especially important for single-row reads, where the bounds are around the same key. In this case we want to read the data file range which belongs to a single promoted index block. It doesn't matter that the upper bound is not exactly the same. They both will likely lie in the same block, and if not, binary search will bring adjacent blocks into cache. Even if upper bound is not near, the binary search will populate the cache with blocks which can be used to narrow down the data file range somewhat. Fixes #10030. The change was tested with perf-fast-forward. I populated the data set with `column_index_size_in_kb` set to 1 scylla perf-fast-forward --populate --run-tests=large-partition-slicing --column-index-size-in-kb=1 Test run: build/release/scylla perf-fast-forward --run-tests=large-partition-select-few-rows -c1 --keep-cache-across-test-cases --test-case-duration=0 This test issues two reads of subsequent keys from the middle of a large partition (1M rows in total). The first read will miss in the index file page cache, the second read will hit. Notice that before the change, the second read issued 2 aio requests worth of 64KiB in total. After the change, the second read issued 1 aio worth of 2 KiB. That's because promoted index block is larger than 1 KiB. I verified using logging that the data file range matches a single promoted index block. Also, the first read which misses in cache is still faster after the change. Before: ``` running: large-partition-select-few-rows on dataset large-part-ds1 Testing selecting few rows from a large partition: stride rows time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk allocs tasks insns/f cpu 500000 1 0.009802 1 1 102 0 102 102 21.0 21 196 2 1 0 1 1 0 0 0 568 269 4716050 53.4% 500001 1 0.000321 1 1 3113 0 3113 3113 2.0 2 64 1 0 1 0 0 0 0 0 116 26 555110 45.0% ``` After: ``` running: large-partition-select-few-rows on dataset large-part-ds1 Testing selecting few rows from a large partition: stride rows time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk allocs tasks insns/f cpu 500000 1 0.009609 1 1 104 0 104 104 20.0 20 137 2 1 0 1 1 0 0 0 561 268 4633407 43.1% 500001 1 0.000217 1 1 4602 0 4602 4602 1.0 1 2 1 0 1 0 0 0 0 0 110 26 313882 64.1% ``` Backports: none, not a regression Closes scylladb/scylladb#20522 * github.com:scylladb/scylladb: perf: perf_fast_forward: Add test case for querying missing rows perf-fast-forward: Allow overriding promoted index block size perf-fast-forward: Test subsequent key reads from the middle in test_large_partition_select_few_rows perf-fast-forward: Allow adding key offset in test_large_partition_select_few_rows perf-fast-forward: Use single-partition reads in test_large_partition_select_few_rows sstables: bsearch_clustered_cursor: Add more tracing points sstables: reader: Log data file range sstables: bsearch_clustered_cursor: Unify skip_info logging sstables: bsearch_clustered_cursor: Narrow down range using "end" position of the block sstables: bsearch_clustered_cursor: Skip even to the first block test: sstables: sstable_3_x_test: Improve failure message sstables: mx: writer: Never include partition_end marker in promoted index block width sstables: Reduce amount of I/O for clustering-key-bounded reads from large partitions sstables: clustered_cursor: Track current block	2024-10-28 21:13:23 +02:00
Avi Kivity	d3dae09316	compound: replace boost ranges with std ranges Standardize on the standard range library. The serialize_value(initializer_list) overload is disambiguated not to call itself. Apparently it wasn't called before. Since std::ranges::subrange does not provide operator==, replace it with std::ranges::equals().	2024-10-28 18:35:41 +02:00
Avi Kivity	61d7f1f6a5	compound: upgrade iterator to be an std::forward_iterator compound::iterator isn't far from a forward_iterator, and if we want to use it with std::ranges, we have to upgrade it. This is because std::ranges::subrange() only provides front() for forward ranges, and we do use this front(). Boost apparently isn't as strict. To make it a forward_range, we have to drop operator-> and make operator* return a value (similar to std::views::tranform), since forward iterators require that pointers and references be stable, and this iterator returns a pointer to one of its members. We also add an iterator_concept member to declare the compatibility to std::ranges.	2024-10-28 17:16:36 +02:00
Kefu Chai	24d14b601b	treewide: s/boost::adaptors::map_values/std::views::values/ now that we are allowed to use C++23. we now have the luxury of using `std::views::values`. in this change, we: - replace `boost::adaptors::map_values` with `std::views::values` - update affected code to work with `std::views::values` - the places where we use `boost::join()` are not changed, because we cannot use `std::views::concat` yet. this helper is only available in C++26. to reduce the dependency to boost for better maintainability, and leverage standard library features for better long-term support. this change is part of our ongoing effort to modernize our codebase and reduce external dependencies where possible. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#21265	2024-10-27 21:32:45 +02:00
Pavel Emelyanov	7595ef7303	test: Squash test::change_generation_number() into test::store() No other usages of the former helper other than immediatelly followed by the latter, no point in keepint it around. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-10-24 11:29:17 +03:00
Pavel Emelyanov	e885b0e6cd	test: Squash test::change_dir() into test::store() No other usages of the former helper other than immediatelly followed by the latter, no point in keepint it around. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-10-24 11:28:39 +03:00
Pavel Emelyanov	874cf2ea6f	test: Coroutinize sstables::test::store() Ahead of future changes Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-10-24 11:28:07 +03:00
Nadav Har'El	5fd3177057	Merge 'mv: add a dedicated read concurrency semaphore for view update read before writes' from Wojciech Mitros When writing to some tables with materialized views, we need to read from the base table first to perform a delete of the old view row. When doing so, the memory used for the read is tracked by the user read concurrency semaphore. When we have a large number of such reads, we may use up all of the semaphore units, causing the following reads to be queued. When we have some user reads coming at the same time, these reads can have very high latency due to the write workload on the base table. We want to avoid this, so that the write workload doesn't have a high impact on the latency of the read workload. This is fixed in this patch by adding a separate read concurrency semaphore just for view update read-before-writes. With the new semaphore, even if there are many view update read-before-writes, they will be queued on a different semaphore than the user reads, and they won't impact their latency. The second issue fixed by this patch is the concurrency of the view updates that is currently unlimited. Because of that view updates may take up so much memory that they we may run out of memory. This is fixed by using the read admission on the view update concurrency semaphore. This limits the number of concurrent view update reads to max_count_concurrent_view_update_reads, all other incoming view update reads are queued using just a small chunk of memory. Without this, the reads would also get queued after exceeding view_update_reader_concurrency_semaphore_serialize_limit_multiplier, but they would take much more memory while staying in the queue. The new semaphore has half the capacity of the regular user read concurrency semahpore and is currently used only for user writes - is't used independently of the scheduling group on which we base the read semaphore selection, but we use a different code path for streaming (not database::do_apply) and we shouldn't have view updates in system writes or during compaction. This patch also adds a test to confirm that the view update workload doesn't impact the read latency, as well as a test which confirms that we do not run out of memory even under heavy view udpate workload. The issue of view updates causing increased latencies most often occurs in the following scenario: * we have a medium to high write workload to a table with a materialized view which requires reading from the base table before sending the update to delete the old rows * we have any read workload * one replica is slower or is handling more writes due to an imbalance of data distribution * we write with a cl<ALL, the mentioned replica is replying to write requests slower while new ones keep being sent to it. * each write performs a read first taking resources from the user read concurrency semaphore, so when enough writes accumulate the reads using the semaphore start getting queued * the queue is shared by regular reads and view update reads. When there's enough view update reads in the queue, regular reads start getting increased latencies An sct test (perf-regression-latency-mv-read-concurrency) was prepared to somewhat resemble this scenario: * the tables were prepared satisfying the conditions above * we use a medium write workload and a very low read workload * the imbalance is achieved by writing to just a few (10) partitions - some replicas (and shards) can have twice or more used partitions than others. We also keep writing to a limited (though high) number of rows, to cause overwrites which require reading before sending the view update * to minimize the test case, we use a cluster of 3 nodes and rf=2, we write with cl=ONE to have background replica writes and read with cl=ALL to wait for the slower replica to respond. In the test above: * without the fix, the latency of reads increases over 50s * with the fix, the latency of reads stays below 20ms Fixes https://github.com/scylladb/scylladb/issues/8873 Fixes https://github.com/scylladb/scylladb/issues/15805 The patch is not that small and it isn't fixing a regression, so no backports Closes scylladb/scylladb#20887 * github.com:scylladb/scylladb: test: add test for high view update concurrency causing bad_allocs test: add test for high view update concurrency degrading read latency mv: add a dedicated read concurrency semaphore for view update read before writes	2024-10-22 22:17:23 +03:00
Kefu Chai	6ead5a4696	treewide: move log.hh into utils/log.hh the log.hh under the root of the tree was created keep the backward compatibility when seastar was extracted into a separate library. so log.hh should belong to `utils` directory, as it is based solely on seastar, and can be used all subsystems. in this change, we move log.hh into utils/log.hh to that it is more modularized. and this also improves the readability, when one see `#include "utils/log.hh"`, it is obvious that this source file needs the logging system, instead of its own log facility -- please note, we do have two other `log.hh` in the tree. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-10-22 06:54:46 +03:00
Kefu Chai	5cd619a60c	treewide: s/boost::adaptors::map_keys/std::views::keys/ now that we are allowed to use C++23. we now have the luxury of using `std::views::keys`. in this change, we: - replace `boost::adaptors::map_keys` with `std::views::keys` - update affected code to work with `std::views::keys` to reduce the dependency to boost for better maintainability, and leverage standard library features for better long-term support. this change is part of our ongoing effort to modernize our codebase and reduce external dependencies where possible. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#21198	2024-10-21 12:47:52 +03:00
Wojciech Mitros	242079d70b	mv: add a dedicated read concurrency semaphore for view update read before writes When writing to some tables with materialized views, we need to read from the base table first to perform a delete of the old view row. When doing so, the memory used for the read is tracked by the user read concurrency semaphore. When we have a large number of such reads, we may use up all of the semaphore units, causing the following reads to be queued. When we have some user reads coming at the same time, these reads can have very high latency due to the write workload on the base table. We want to avoid this, so that the write workload doesn't have a high impact on the latency of the read workload. This is fixed in this patch by adding a separate read concurrency semaphore just for view update read-before-writes. With the new semaphore, even if there are many view update read-before-writes, they will be queued on a different semaphore than the user reads, and they won't impact their latency. The second issue fixed by this patch is the concurrency of the view updates that is currently unlimited. Because of that view updates may take up so much memory that they we may run out of memory. This is fixed by using the read admission on the view update concurrency semaphore. This limits the number of concurrent view update reads to max_count_concurrent_view_update_reads, all other incoming view update reads are queued using just a small chunk of memory. Without this, the reads would also get queued after exceeding view_update_reader_concurrency_semaphore_serialize_limit_multiplier, but they would take much more memory while staying in the queue. The new semaphore has half the capacity of the regular user read concurrency semahpore and is currently used only for user writes - is't used independently of the scheduling group on which we base the read semaphore selection, but we use a different code path for streaming (not database::do_apply) and we shouldn't have view updates in system writes or during compaction. Fixes https://github.com/scylladb/scylladb/issues/8873 Fixes https://github.com/scylladb/scylladb/issues/15805	2024-10-21 11:02:06 +02:00
Avi Kivity	c3be2489ce	treewide: drop includes of <boost/range/adaptors.hpp> This includes way too much, including <boost/regex.hpp>, which is huge. Drop includes of adaptors.hpp and replace by what is needed. Closes scylladb/scylladb#21187	2024-10-20 17:17:11 +03:00
Pavel Emelyanov	b11d50f591	Merge 'multishard reader: make it safe to create with admitted permits' from Botond Dénes Passing an admitted permit -- i.e. one with count resources on it -- to the multishard reader, will possibly result in a deadlock, because the permit of the multishard reader is destroyed after the permits of its child readers. Therefore its semaphore resources won't be automatically released until children acquire their own resources. This creates a dependency (an edge in the "resource allocation graph"), where the semaphore used by the multishard reader depends on the semaphores used by children. When such dependencies create a cycle, and permits are acquired by different reads in just the right order, a deadlock will happen. Users of the multishard reader have to be aware of this gotcha -- and of course they aren't. This is small wonder, considering that not even the documentation on the multishard reader mentions this problem. To work around this, the user has to call `reader_permit::release_base_resources()` on the permit, before passing it to the multishard reader. On multiple occasions, developers (including the very author of the multishard reader), forgot or didn't know about this and this resulted in deadlocks down the line. This is a design-flaw of the multishard reader, which is addressed in this PR, after which, it is safe to pass admitted or not admitted permits to the multishard reader, it will handle the call to `release_base_resources()` if needed. After fixing the problem in the multishard reader, the existing calls to `release_base_resources()` on permits passed to multishard readers are removed. A test is added which reproduces the problem and ensures we don't regress. Refs: https://github.com/scylladb/scylladb/issues/20885 (partial fix, there is another deadlock in that issue, which this PR doesn't fix) This fixes (indirectly) a regression introduced by `d98708013c` so it has to be backported to 6.2 Closes scylladb/scylladb#21058 * github.com:scylladb/scylladb: test/boost/mutation_test: add test for multishard permit safety test/lib/reader_lifecycle_policy: add semaphore factory to constructor test/lib/reader_lifecycle_policy: rename factory_function repair/row_level: drop now unneeded release_base_resource() calls readers/multishard: make multishard reader safe to create with admitted permits	2024-10-18 13:30:21 +03:00
Botond Dénes	5a3fd69374	test/lib/reader_lifecycle_policy: add semaphore factory to constructor Allowing callers to specify how the semaphore is created and stopped, instead of doing so via boolean flags like it is done currently. This method doesn't scale, so use a factory instead.	2024-10-17 08:47:50 -04:00
Botond Dénes	c8598e21e8	test/lib/reader_lifecycle_policy: rename factory_function To reader_factor_function. We are about to add a new factory function parameters, so the current factory_function has to be renamed to something more specific.	2024-10-17 08:47:50 -04:00
Avi Kivity	820509026f	schema: replace boost ranges with std ranges To reduce dependency load, use std ranges instead of boost ranges. The std::ranges::{lower,upper}_bound don't support heterogeneous lookup, but a more natural solution is to use a projection to search for the name, so we use that and the custom comparator is removed. Many callers are converted as well due to poor interoperability between boost ranges and std ranges.	2024-10-15 16:42:54 +03:00
Tomasz Grabiec	3e438d23e1	Merge 'Check system.tablets update before putting it into the table' from Pavel Emelyanov Having tablet metadata with more than 1 pending replica will prevent this metadata from being (re)loaded due to sanity check on load. This patch fails the operation which tries to save the wrong metadata with a similar sanity check. For that, changes submitted to raft are validated, and if it's topology_change that affects system.tablets, the new "replicas" and "new_replicas" values are checked similarly to how they will be on (re)load. fixes #20043 Closes scylladb/scylladb#21020 * github.com:scylladb/scylladb: tablets: Validate system.tablets update group0_client: Introduce change validation group0_client: Add shared_token_metadata dependency	2024-10-15 00:38:59 +02:00
Pavel Emelyanov	031893259a	treewide: Rename sstable registry location field to be owner This is sort of continuation of the previous patch. The partition key in the registry is now table_id, not string, and is better called "owner", not "location". This patch is s/location/owner/ over specific places that include field name in the schema, argument names in registry maintenance classes and tests accessing the selected row fields by name. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-10-11 14:11:28 +03:00
Pavel Emelyanov	3315e3a2a9	system_keyspace: Change sstables registry partition key type Today, the system.sstables schema uses string as partition key. Callers, in turn, use table's datadir value to reference entries in it. That's wrong, S3-backed sstables don't have any local paths to work with. The table's ID is better in this role. This patch only changes the field type to be table_id and fixes the callers to provide one. In particular, see init_table_storage() change -- instead of generating a datadir string, it sets table.id() as the options' location. Other fixed places are tests. Internally, this id value is propagated via s3_storage::owner() method, that's fixed as well. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-10-11 13:48:09 +03:00
Pavel Emelyanov	1181b6b082	storage_options: Use variant on S3 options Describing S3 storage for an sstables nowadays has two options -- via sstables registry entry and by using the direct prefix string. The former is used when putting a keyspace on S3. In this case each sstable has the corresponding entry in the system.sstables table. The latter is used by "restore from object storage" code. In that case, sstables don't have entries in the registry, but are accessed by a specific S3 object path. This patch reflects this difference by making s3_options::location be variant of string prefix and table_id owner. The owner needs more explanation, here it is. Today, the system.sstables schema defines partition key to be "string location" and clustering key to be "UUID generation". The partition key is table's datadir string, but it's wrong to use it this way. Next patches will change the partition key to be table's ID (there's table_id type for it), and before doing it storage options must be prepared to carry it onboard. This patch does it, but the table_id alternative of the location is still unused, the rest of the code keeps using the string location to reference a row in the registry table. Next patches will eventually make use of the table_id value. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-10-11 13:04:52 +03:00
Pavel Emelyanov	f09fe4f351	group0_client: Add shared_token_metadata dependency It will be needed later to get tablet_metadata from. The dependency is "OK", shared_token_metadata is low-level sharded service. Client already references db::system_keyspace, which in turn references replica::database which, finally, references token_metadata Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-10-10 12:27:46 +03:00
Botond Dénes	86fd9ce8fd	schema/schema: break circular dependency with replica::database The schema module (everything in schema/) is supposed to be towards the leafs in the ScyllaDB inter-module dependency graph. In other words, it should not depend on many other modules. On the other hand, almost the entire codebase depends on the schema module itself. Currently there is a circular dependency between schema and replica::database, as the latter is a required argument for schema::describe(). This is bad, not just because of the dependency mess it introduces, but also because now schema::describe() can only be used by code which has a reference to the database handy. This patch breaks this circular dependency, by introducing the schema_describe_helper interface and providing an implementation for it in database.hh. There is another circular dependency: schema <-> replica::table. This is not addressed by this patch. Closes scylladb/scylladb#20893	2024-10-10 10:07:26 +03:00
Avi Kivity	7dad248ac7	Merge 'Fix sstables registry mock' from Pavel Emelyanov There are two issues in it. First, listing the registry with a consumer callback passes wrong argument to the consumer. Second, the primary key of the registry is wrong. Both issues don't show up, because existing tests that use mock don't read from it, only write. Tests that read from registry are python tests that start scylla and thus use real registry. Closes scylladb/scylladb#20946 * github.com:scylladb/scylladb: test: Use corrcet key in sstables registry mock test: Pass entry status to mock registry consumer	2024-10-07 13:56:26 +03:00

1 2 3 4 5 ...

1380 Commits