scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-25 09:11:10 +00:00

Author	SHA1	Message	Date
Avi Kivity	0ae22a09d4	LICENSE: Update to version 1.1 Updated terms of non-commercial use (must be a never-customer).	2026-04-12 19:46:33 +03:00
Botond Dénes	13ff9c4394	db,compaction: use utils::chunked_vector for cache invalidation ranges Instead of dht::partition_ranges_vector, which is an std::vector<> and have been seen to cause large allocations when calculating ranges to be invalidated after compaction: seastar_memory - oversized allocation: 147456 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at [Backtrace #0] void seastar::backtrace<seastar::current_backtrace_tasklocal()::$_0>(seastar::current_backtrace_tasklocal()::$_0&&, bool) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:89 (inlined by) seastar::current_backtrace_tasklocal() at ./build/release/seastar/./seastar/src/util/backtrace.cc:99 seastar::current_tasktrace() at ./build/release/seastar/./seastar/src/util/backtrace.cc:136 seastar::current_backtrace() at ./build/release/seastar/./seastar/src/util/backtrace.cc:169 seastar::memory::cpu_pages::warn_large_allocation(unsigned long) at ./build/release/seastar/./seastar/src/core/memory.cc:840 seastar::memory::cpu_pages::check_large_allocation(unsigned long) at ./build/release/seastar/./seastar/src/core/memory.cc:903 (inlined by) seastar::memory::cpu_pages::allocate_large(unsigned int, bool) at ./build/release/seastar/./seastar/src/core/memory.cc:910 (inlined by) seastar::memory::allocate_large(unsigned long, bool) at ./build/release/seastar/./seastar/src/core/memory.cc:1533 (inlined by) seastar::memory::allocate_slowpath(unsigned long) at ./build/release/seastar/./seastar/src/core/memory.cc:1679 seastar::memory::allocate(unsigned long) at ././seastar/src/core/memory.cc:1698 (inlined by) operator new(unsigned long) at ././seastar/src/core/memory.cc:2440 (inlined by) std::__new_allocator<interval<dht::ring_position>>::allocate(unsigned long, void const) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/new_allocator.h:151 (inlined by) std::allocator<interval<dht::ring_position>>::allocate(unsigned long) at 
/usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/allocator.h:203 (inlined by) std::allocator_traits<std::allocator<interval<dht::ring_position>>>::allocate(std::allocator<interval<dht::ring_position>>&, unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/alloc_traits.h:614 (inlined by) std::_Vector_base<interval<dht::ring_position>, std::allocator<interval<dht::ring_position>>>::_M_allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/stl_vector.h:387 (inlined by) std::vector<interval<dht::ring_position>, std::allocator<interval<dht::ring_position>>>::reserve(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/vector.tcc:79 dht::to_partition_ranges(utils::chunked_vector<interval<dht::token>, 131072ul> const&, seastar::bool_class<utils::can_yield_tag>) at ./dht/i_partitioner.cc:347 compaction::compaction::get_ranges_for_invalidation(std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable>>> const&) at ./compaction/compaction.cc:619 (inlined by) compaction::compaction::get_compaction_completion_desc(std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable>>>, std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable>>>) at ./compaction/compaction.cc:719 (inlined by) compaction::regular_compaction::replace_remaining_exhausted_sstables() at ./compaction/compaction.cc:1362 compaction::compaction::finish(std::chrono::time_point<db_clock, std::chrono::duration<long, std::ratio<1l, 1000l>>>, std::chrono::time_point<db_clock, std::chrono::duration<long, std::ratio<1l, 1000l>>>) at ./compaction/compaction.cc:1021 compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0::operator()() at ./compaction/compaction.cc:1960 (inlined by) 
compaction::compaction_result std::__invoke_impl<compaction::compaction_result, compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(std::__invoke_other, compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/invoke.h:63 (inlined by) std::__invoke_result<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>::type std::__invoke<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/invoke.h:98 (inlined by) decltype(auto) std::__apply_impl<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0, std::tuple<>>(compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&, std::tuple<>&&, std::integer_sequence<unsigned long, ...>) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/tuple:2920 (inlined by) decltype(auto) std::apply<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0, std::tuple<>>(compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&, std::tuple<>&&) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/tuple:2935 (inlined by) seastar::future<compaction::compaction_result> seastar::futurize<compaction::compaction_result>::apply<compaction::compaction::run(std::unique_ptr<compaction::compaction, 
std::default_delete<compaction::compaction>>)::$_0>(compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&, std::tuple<>&&) at ././seastar/include/seastar/core/future.hh:1930 (inlined by) seastar::futurize<std::invoke_result<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>::type>::type seastar::async<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(seastar::thread_attributes, compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&)::'lambda'()::operator()() const at ././seastar/include/seastar/core/thread.hh:267 (inlined by) seastar::noncopyable_function<void ()>::direct_vtable_for<seastar::futurize<std::invoke_result<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>::type>::type seastar::async<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(seastar::thread_attributes, compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&)::'lambda'()>::call(seastar::noncopyable_function<void ()> const) at ././seastar/include/seastar/util/noncopyable_function.hh:138 seastar::noncopyable_function<void ()>::operator()() const at ./build/release/seastar/./seastar/include/seastar/util/noncopyable_function.hh:224 (inlined by) seastar::thread_context::main() at ./build/release/seastar/./seastar/src/core/thread.cc:318 dht::partition_ranges_vector is used on the hot path, so just convert the problematic user -- cache invalidation -- to use utils::chunked_vector<dht::partition_range> instead. Fixes: SCYLLADB-121 Closes scylladb/scylladb#28855	2026-03-09 22:04:54 +02:00
Avi Kivity	7230a04799	dht, sstables: replace vector with chunked_vector when computing sstable shards sstable::compute_shards_for_this_sstable() has a temporary of type std::vector<dht::token_range> (aka dht::partition_range_vector), which allocates a contiguous 300k when loading an sstable from disk. This causes large allocation warnings (it doesn't really stress the allocator since this typically happens during startup, but best to clear the warning anyway). Fix this by changing the container to be chunked_vector. It is passed to dht::ring_position_range_vector_sharder, but since we're the only user, we can change that class to accept the new type. Fixes #24198. Closes scylladb/scylladb#26353	2025-10-02 00:47:42 +02:00
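Both container changes above rest on the same idea: a growing std::vector needs one contiguous buffer for all elements, while a chunked container caps the size of each allocation. The following is a toy sketch of that idea only. It is not the real utils::chunked_vector; the name `toy_chunked_vector` and the chunk size parameter are hypothetical.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Toy illustration only: NOT the real utils::chunked_vector.
// Elements live in fixed-capacity chunks, so no single allocation for
// element storage exceeds ChunkElems * sizeof(T) bytes.
template <typename T, std::size_t ChunkElems>
class toy_chunked_vector {
    // The index of chunk pointers is still contiguous, but it holds one
    // pointer per chunk rather than one slot per element.
    std::vector<std::unique_ptr<std::vector<T>>> _chunks;
    std::size_t _size = 0;
public:
    void push_back(T value) {
        if (_size % ChunkElems == 0) {
            // The last chunk is full (or there is none yet): start a new one.
            auto chunk = std::make_unique<std::vector<T>>();
            chunk->reserve(ChunkElems);
            _chunks.push_back(std::move(chunk));
        }
        _chunks.back()->push_back(std::move(value));
        ++_size;
    }
    T& operator[](std::size_t i) {
        return (*_chunks[i / ChunkElems])[i % ChunkElems];
    }
    const T& operator[](std::size_t i) const {
        return (*_chunks[i / ChunkElems])[i % ChunkElems];
    }
    std::size_t size() const { return _size; }
    std::size_t chunk_count() const { return _chunks.size(); }
};
```

With a chunk capacity of 4, pushing 10 elements produces 3 chunks, and indexing works as for a flat vector.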
Petr Gusev	801bf42ea2	sharder: add try_get_shard_for_reads method Currently, we use storage_proxy/get_cas_shard -> sharder.shard_for_reads to decide which shard to use for LWT code execution on both replicas and the coordinator. If the coordinator is not a replica, shard_for_reads returns 0, the 'default' shard. This behavior has at least two problems: * Shard 0 may become overloaded, because all LWT coordinators that are not replicas will be served on it. * The zero shard does not match shard_for_reads on replicas, which hinders the "same shard for client and server" RPC-level optimization. To fix this, we need to know whether the current node hosts a replica for the tablet corresponding to the given token. Currently, there is no API we could use for this. For historical reasons, sharder::shard_for_reads returns 0 when the node does not host the shard, which leads to ambiguity. This commit introduces try_get_shard_for_reads, which returns a disengaged std::optional when the tablet is not present on the local node. We leave the shard_for_reads method in the base sharder class; it calls try_get_shard_for_reads and returns zero by default. We need to rename the tablet_sharder private methods shard_for_reads and shard_for_writes so that they don't conflict with sharder::shard_for_reads.	2025-07-29 11:35:54 +02:00
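The ambiguity described in this commit can be sketched as follows. This is a simplified illustration under stated assumptions: `toy_sharder`, `toy_tablet_sharder`, and the `token` and `shard_id` aliases are hypothetical stand-ins, and the real dht::sharder interface has more methods and different signatures.

```cpp
#include <optional>

using shard_id = unsigned;
using token = long;  // hypothetical stand-in for dht::token

// Hypothetical, simplified interface.
class toy_sharder {
public:
    virtual ~toy_sharder() = default;
    // Disengaged when the local node hosts no replica for the token.
    virtual std::optional<shard_id> try_get_shard_for_reads(token t) const = 0;
    // Legacy-style accessor: falls back to shard 0, so a caller cannot tell
    // "hosted on shard 0" apart from "not hosted here".
    shard_id shard_for_reads(token t) const {
        return try_get_shard_for_reads(t).value_or(0);
    }
};

// Toy sharder that owns a single token range [first, last] on one shard.
class toy_tablet_sharder final : public toy_sharder {
    token _first;
    token _last;
    shard_id _shard;
public:
    toy_tablet_sharder(token first, token last, shard_id shard)
        : _first(first), _last(last), _shard(shard) {}
    std::optional<shard_id> try_get_shard_for_reads(token t) const override {
        if (t < _first || t > _last) {
            return std::nullopt;  // tablet is not present on this node
        }
        return _shard;
    }
};
```

A caller that needs to distinguish the two cases checks the optional; a caller that only needs the old behavior keeps using `shard_for_reads`.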
Avi Kivity	16fb68bb5e	interval: rename start_ref() back to start() (and end_ref() etc). To reduce noise, rename start_ref() back to its original name start(), after it was changed in the previous patch to force an audit of all calls.	2025-06-14 21:26:16 +03:00
Avi Kivity	3363bc41e2	interval: rename start() to start_ref() (and end() etc). We are about to change start() to return a proxy object rather than a `const interval_bound<T>&`. This is generally transparent, except in one case: `auto x = i.start()`. With the current implementation, we'll copy object referred to and assign it to x. With the planned implementation, the proxy object will be assigned to `x`, but it will keep referring to `i`. To prevent such problems, rename start() to start_ref() and end() to end_ref(). This forces us to audit all calls, and redirect calls that will break to new start_copy() and end_copy() methods.	2025-06-14 21:26:16 +03:00
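The hazard with `auto x = i.start()` can be reproduced with toy types. The sketch below is illustrative only: `toy_bound`, `toy_bound_proxy`, and `toy_interval` are hypothetical and do not mirror the real interval<T> implementation.

```cpp
// Hypothetical stand-ins; not the real interval<T> or interval_bound<T>.
struct toy_bound {
    int value;
};

// A proxy that refers back into the interval instead of owning data.
struct toy_bound_proxy {
    const toy_bound* target;
    int value() const { return target->value; }
};

struct toy_interval {
    toy_bound _start;
    const toy_bound& start_ref() const { return _start; }      // returns a reference
    toy_bound_proxy start_proxy() const { return {&_start}; }  // returns a proxy
    toy_bound start_copy() const { return _start; }            // explicit copy
};

inline int snapshot_via_reference() {
    toy_interval i{{1}};
    auto x = i.start_ref();  // auto drops the reference: x is an independent copy
    i._start.value = 2;
    return x.value;          // still sees the old value
}

inline int snapshot_via_proxy() {
    toy_interval i{{1}};
    auto x = i.start_proxy();  // x is a proxy that keeps referring to i
    i._start.value = 2;
    return x.value();          // sees the new value
}

inline int snapshot_via_copy() {
    toy_interval i{{1}};
    auto x = i.start_copy();  // the copy is explicit, regardless of return type
    i._start.value = 2;
    return x.value;
}
```

The same line of caller code silently changes meaning when the return type changes from a reference to a proxy, which is why the rename forces an audit of every call.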
Raphael S. Carvalho	494ed6b887	dht: Add overlap_ratio() for token range Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-04-29 15:47:33 -03:00
Kefu Chai	a20cd6539c	cql3, dht: Remove redundant std::move() calls These redundant `std::move()` calls were identified by GCC-14. In general, copy elision applies to these places, so adding `std::move()` is not only unnecessary but can actually prevent the compiler from performing copy elision, as it causes the return statement to fail to satisfy the requirements for copy elision optimization. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#23063	2025-03-04 10:36:49 +03:00
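A small sketch of why such a `std::move()` can hurt, using a hypothetical `move_counter` type that counts move constructions. It uses a returned temporary because elision is guaranteed there since C++17; for named locals elision is permitted but not guaranteed. Compilers may warn about the second function, which is the point.

```cpp
#include <utility>

// Hypothetical helper type that counts how often it is move-constructed.
struct move_counter {
    static inline int moves = 0;
    move_counter() = default;
    move_counter(const move_counter&) = default;
    move_counter(move_counter&&) noexcept { ++moves; }
};

inline move_counter make_plain() {
    // Returning a prvalue: the object is constructed directly in the caller.
    return move_counter{};
}

inline move_counter make_with_move() {
    // std::move() turns the operand into an xvalue, so the return object
    // must be move-constructed from a materialized temporary.
    return std::move(move_counter{});
}

inline int moves_for_plain() {
    move_counter::moves = 0;
    move_counter c = make_plain();
    (void)c;
    return move_counter::moves;
}

inline int moves_for_std_move() {
    move_counter::moves = 0;
    move_counter c = make_with_move();
    (void)c;
    return move_counter::moves;
}
```

Here `moves_for_plain()` performs no move construction, while `moves_for_std_move()` performs exactly one.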
Kefu Chai	7215d4bfe9	utils: do not include unused headers these unused includes were identified by clang-include-cleaner. after auditing these source files, all of the reports have been confirmed. please note, because quite a few source files relied on `utils/to_string.hh` to pull in the specialization of `fmt::formatter<std::optional<T>>`, after removing `#include <fmt/std.h>` from `utils/to_string.hh`, we have to include `fmt/std.h` directly. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2025-01-14 07:56:39 -05:00
Avi Kivity	f3eade2f62	treewide: relicense to ScyllaDB-Source-Available-1.0 Drop the AGPL license in favor of a source-available license. See the blog post [1] for details. [1] https://www.scylladb.com/2024/12/18/why-were-moving-to-a-source-available-license/	2024-12-18 17:45:13 +02:00
Avi Kivity	f744007e13	dht: auto_refreshing_sharder.hh: don't include database.hh database.hh is a heavyweight include file with a lot of fan-in. auto_refreshing_sharder.hh has a lot of fan out. The combination means a large dependency load. Deinline the class and use forward declarations to avoid the #include. There is no expected performance impact because all the functions are virtual. Ref #1 Note: this shouldn't belong in dht, but be injected by a higher layer, but this isn't addressed by the patch. Closes scylladb/scylladb#21768	2024-12-06 23:11:52 +01:00
Kefu Chai	bab12e3a98	treewide: migrate from boost::adaptors::transformed to std::views::transform now that we are allowed to use C++23, we have the luxury of using `std::views::transform`. in this change, we: - replace `boost::adaptors::transformed` with `std::views::transform` - use `fmt::join()` when appropriate where `boost::algorithm::join()` is not applicable to a range view returned by `std::views::transform`. - use `std::ranges::fold_left()` to accumulate the range returned by `std::views::transform` - use `std::ranges::fold_left()` to get the maximum element in the range returned by `std::views::transform` - use `std::ranges::min()` to get the minimal element in the range returned by `std::views::transform` - use `std::ranges::equal()` to compare the range views returned by `std::views::transform` - remove unused `#include <boost/range/adaptor/transformed.hpp>` - use `std::ranges::subrange()` instead of `boost::make_iterator_range()`, to feed `std::views::transform()` a view range. to reduce the dependency on boost for better maintainability, and leverage standard library features for better long-term support. this change is part of our ongoing effort to modernize our codebase and reduce external dependencies where possible. limitations: there are still a couple of places where we are still using `boost::adaptors::transformed` due to the lack of a C++23 alternative for `boost::join()` and `boost::adaptors::uniqued`. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#21700	2024-12-03 09:41:32 +02:00
Kefu Chai	6ead5a4696	treewide: move log.hh into utils/log.hh the log.hh under the root of the tree was created to keep backward compatibility when seastar was extracted into a separate library. so log.hh should belong to the `utils` directory, as it is based solely on seastar, and can be used by all subsystems. in this change, we move log.hh into utils/log.hh so that it is more modularized. this also improves readability: when one sees `#include "utils/log.hh"`, it is obvious that this source file needs the logging system, instead of its own log facility -- please note, we do have two other `log.hh` in the tree. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-10-22 06:54:46 +03:00
Kefu Chai	5cd619a60c	treewide: s/boost::adaptors::map_keys/std::views::keys/ now that we are allowed to use C++23. we now have the luxury of using `std::views::keys`. in this change, we: - replace `boost::adaptors::map_keys` with `std::views::keys` - update affected code to work with `std::views::keys` to reduce the dependency to boost for better maintainability, and leverage standard library features for better long-term support. this change is part of our ongoing effort to modernize our codebase and reduce external dependencies where possible. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#21198	2024-10-21 12:47:52 +03:00
Kefu Chai	3e84d43f93	treewide: use seastar::format() or fmt::format() explicitly before this change, we rely on `using namespace seastar` to use `seastar::format()` without qualifying the `format()` with its namespace. this works fine until we changed the type of the format string parameter of `seastar::format()` from `const char*` to `fmt::format_string<...>`. this change practically invited `seastar::format()` to the club of `std::format()` and `fmt::format()`, where all members accept a templated parameter as their `fmt` parameter. and `seastar::format()` is not the best candidate anymore. although argument-dependent lookup (ADL for short) favors the function which is in the same namespace as its parameter, `using namespace` makes `seastar::format()` more competitive, so both `std::format()` and `seastar::format()` are considered as candidates. that is what is happening in scylladb at quite a few call sites of `format()`, hence ADL is not able to tell which function is the winner in the name lookup: ``` /__w/scylladb/scylladb/mutation/mutation_fragment_stream_validator.cc:265:12: error: call to 'format' is ambiguous 265 | return format("{} ({}.{} {})", _name_view, s.ks_name(), s.cf_name(), s.id()); | ^~~~~~ /usr/bin/../lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/format:4290:5: note: candidate function [with _Args = <const std::basic_string_view<char> &, const seastar::basic_sstring<char, unsigned int, 15> &, const seastar::basic_sstring<char, unsigned int, 15> &, const utils::tagged_uuid<table_id_tag> &>] 4290 | format(format_string<_Args...> __fmt, _Args&&... __args) | ^ /__w/scylladb/scylladb/seastar/include/seastar/core/print.hh:143:1: note: candidate function [with A = <const std::basic_string_view<char> &, const seastar::basic_sstring<char, unsigned int, 15> &, const seastar::basic_sstring<char, unsigned int, 15> &, const utils::tagged_uuid<table_id_tag> &>] 143 | format(fmt::format_string<A...> fmt, A&&... a) { | ^ ``` in this change, we change all `format()` to either `fmt::format()` or `seastar::format()` with following rules: - if the caller expects an `sstring` or `std::string_view`, change to `seastar::format()` - if the caller expects an `std::string`, change to `fmt::format()`. because, `sstring::operator std::basic_string` would incur a deep copy. we will need another change to enable scylladb to compile with the latest seastar. namely, to pass the format string as a templated parameter down to helper functions which format their parameters. to minimize the scope of this change, let's include that change when bumping up the seastar submodule. as that change will depend on the seastar change. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-09-11 23:21:40 +03:00
Avi Kivity	aa1270a00c	treewide: change assert() to SCYLLA_ASSERT() assert() is traditionally disabled in release builds, but not in scylladb. This hasn't caused problems so far, but the latest abseil release includes a commit [1] that causes a 1000 insn/op regression when NDEBUG is not defined. Clearly, we must move towards a build system where NDEBUG is defined in release builds. But we can't just define it blindly without vetting all the assert() calls, as some were written with the expectation that they are enabled in release mode. To solve the conundrum, change all assert() calls to a new SCYLLA_ASSERT() macro in utils/assert.hh. This macro is always defined and is not conditional on NDEBUG, so we can later (after vetting Seastar) enable NDEBUG in release mode. [1] `66ef711d68` Closes scylladb/scylladb#20006	2024-08-05 08:23:35 +03:00
Benny Halevy	850f298ccd	everywhere: use dht::token is_{minimum,maximum} The is_minimum/is_maximum predicates are more efficient than comparing with the {minimum,maximum}_token values, respectively, since the is_* functions need to check only the token kind. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-07-20 21:21:42 +03:00
Kefu Chai	51d686ea9f	dht: remove unused operator<< since we've switched almost all callers of the operator<< to {fmt}, let's drop the unused operator<<:s. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-06-18 11:26:20 +08:00
Tomasz Grabiec	26f2e6aa8e	dht: is_single_shard: Avoid deprecated sharder API All current uses are used in the read path.	2024-05-16 00:28:47 +02:00
Tomasz Grabiec	c9e6b4dca7	dht: split_range_to_single_shard: Work with static_sharder only In preparation for intra-node tablet migration, to avoid using deprecated sharder APIs. This function is used for generating sstable sharding metadata. For tablets, it is not invoked, so we can safely work with the static sharder. The call site already passes static_sharder only.	2024-05-16 00:28:47 +02:00
Tomasz Grabiec	c380aecf64	dht: ring_position_range_sharder: Avoid deprecated sharder APIs In preparation for tablet intra-node migration. Existing uses are for reads, so it's safe to use shard_for_reads(): - in multishard reader - in forward_service The ring_position_range_vector_sharder is used when computing sstable shards, which for intra-node migration should use the view for reads. If we haven't completed streaming, sstables should be attached to the old shard (used by reads). When in write-both-read-new stage, streaming is complete, reads are using the new shard, and we should attach sstables to the new shard. When not in intra-node migration, the view for reads on the pending node will return the pending shard even if read selector is "read old". So if pending node restarts during streaming, we will attach sstables to the shard which is used by writes even though we're using the selector for reads.	2024-05-16 00:28:47 +02:00
Tomasz Grabiec	dd4a086b87	selective_token_sharder: Avoid use of deprecated sharder API I analyzed all the uses and all except the alternator/ttl.cc seem to be interested in the result for the purpose of reading. Alternator is not supported with tablets yet, so the use was annotated with a relevant issue.	2024-05-16 00:28:47 +02:00
Tomasz Grabiec	9da3bd84c7	dht: Extract dht::static_sharder Before the patch, dht::sharder could be instantiated and it would behave like a static sharder. This is not safe with regards to extensions of the API because if a derived implementation forgets to override some method, it would incorrectly default to the implementation from static sharder. Better to fail the compilation in this case, so extract static sharder logic to dht::static_sharder class and make all methods in dht::sharder pure virtual. This also allows us to have algorithms indicate that they only work with static sharder by accepting the type, and have compile-time safety for this requirement. schema::get_sharder() is changed to return the static_sharder&.	2024-05-16 00:28:47 +02:00
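The compile-time safety argued for in this commit can be illustrated with a minimal sketch. All names below (`toy_sharder`, `toy_static_sharder`, `static_only_algorithm`) are hypothetical, and the modulo mapping is a placeholder rather than ScyllaDB's actual sharding function.

```cpp
#include <type_traits>

using shard_id = unsigned;
using token = unsigned long;  // hypothetical stand-in for dht::token

// Interface only: every method is pure virtual, so a derived class that
// forgets an override cannot be instantiated.
class toy_sharder {
public:
    virtual ~toy_sharder() = default;
    virtual shard_id shard_for_reads(token t) const = 0;
    virtual shard_id shard_for_writes(token t) const = 0;
};

// Static sharding logic lives in its own concrete class.
class toy_static_sharder final : public toy_sharder {
    unsigned _shard_count;
public:
    explicit toy_static_sharder(unsigned shard_count) : _shard_count(shard_count) {}
    // Placeholder mapping, chosen only to keep the sketch short.
    shard_id shard_for_reads(token t) const override {
        return static_cast<shard_id>(t % _shard_count);
    }
    shard_id shard_for_writes(token t) const override {
        return shard_for_reads(t);
    }
};

// An algorithm that is valid only for static sharding says so in its signature.
inline shard_id static_only_algorithm(const toy_static_sharder& s, token t) {
    return s.shard_for_reads(t);
}

static_assert(std::is_abstract_v<toy_sharder>);
static_assert(!std::is_abstract_v<toy_static_sharder>);
```

Passing any other sharder type to `static_only_algorithm` fails to compile, which is the kind of guarantee the commit message describes.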
Tomasz Grabiec	10a4903d0c	dht: Deprecate old sharder API: shard_of/next_shard/token_for_next_shard Require users to specify whether we want shard for reads or for writes by switching to appropriate non-deprecated variant. For example, shard_of() can be replaced with shard_for_reads() or shard_for_writes(). The next_shard/token_for_next_shard APIs have only for-reads variant, and the act of switching will be a testimony to the fact that the code is valid for intra-node migration.	2024-05-16 00:28:47 +02:00
Tomasz Grabiec	aafeacc8d9	dht, auto_refreshing_sharder: Allow overriding write selector During streaming for intra-node migration we want to write only to the new shard. To achieve that, allow altering write selector in sharder::shard_for_writes() and per-instance of auto_refreshing_sharder.	2024-05-16 00:28:46 +02:00
Tomasz Grabiec	6946ad2a45	sharding: Prepare for intra-node-migration Tablet sharder is adjusted to handle intra-migration where a tablet can have two replicas on the same host. For reads, sharder uses the read selector to resolve the conflict. For writes, the write selector is used. The old shard_of() API is kept to represent shard for reads, and new method is introduced to query the shards for writing: shard_for_writes(). All writers should be switched to that API, which is not done in this patch yet. The request handler on replica side acts as a second-level coordinator, using sharder to determine routing to shards. A given sharder has a scope of a single topology version, a single effective_replication_map_ptr, which should be kept alive during writes.	2024-05-16 00:28:46 +02:00
Botond Dénes	4993d0e30a	dht/i_partitioner: to_partition_range(): don't assume input is fully inclusive Consider the inclusiveness of the token-range's start and end bounds and copy the flag to the output bounds, instead of assuming they are always inclusive.	2024-02-21 02:08:48 -05:00
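The fix amounts to carrying each bound's inclusiveness across the conversion instead of hard-coding it. A minimal sketch with hypothetical toy types follows; the real dht::token_range and dht::partition_range types are richer than this.

```cpp
#include <optional>

// Hypothetical stand-ins for the token-range and partition-range bound types.
struct toy_token_bound {
    long token;
    bool inclusive;
};
struct toy_token_range {
    std::optional<toy_token_bound> start;
    std::optional<toy_token_bound> end;
};

struct toy_partition_bound {
    long token;
    bool inclusive;
};
struct toy_partition_range {
    std::optional<toy_partition_bound> start;
    std::optional<toy_partition_bound> end;
};

inline std::optional<toy_partition_bound>
convert_bound(const std::optional<toy_token_bound>& b) {
    if (!b) {
        return std::nullopt;  // an open bound stays open
    }
    // Copy the flag; assuming `true` here would be the pre-fix behavior.
    return toy_partition_bound{b->token, b->inclusive};
}

inline toy_partition_range to_partition_range_sketch(const toy_token_range& r) {
    return toy_partition_range{convert_bound(r.start), convert_bound(r.end)};
}
```

An exclusive start bound remains exclusive after conversion, and a missing end bound remains missing.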
Avi Kivity	605bf6e221	range.hh: retire range.hh was deprecated in `bd794629f9` (2020) since its names conflict with the C++ library concept of an iterator range. The name ::range also mapped to the dangerous wrapping_interval rather than nonwrapping_interval. Complete the deprecation by removing range.hh and replacing all the aliases by the names they point to from the interval library. Note this now exposes uses of wrapping intervals as they are now explicit. The unit tests are renamed and range.hh is deleted. Closes scylladb/scylladb#17428	2024-02-21 00:24:25 +02:00
Kefu Chai	8e8b73fa82	dht: add formatter for partition_range_view and i_partition before this change, we rely on the default-generated fmt::formatter created from operator<<, but fmt v10 dropped the default-generated formatter. in this change, we define formatters for `partition_range_view` and `i_partition`, and drop their operator<<:s. Refs #13245 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#17331	2024-02-15 09:46:03 +02:00
Kefu Chai	b1e4513c2d	dht: add formatter for dht::ring_position before this change, we rely on the default-generated fmt::formatter created from operator<<, but fmt v10 dropped the default-generated formatter. in this change, we define formatters for `dht::ring_position`, and drop its operator<<. Refs #13245 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#17194	2024-02-07 09:30:45 +02:00
Kefu Chai	e56e74df0a	db: add formatter for dht::ring_position_ext before this change, we rely on the default-generated fmt::formatter created from operator<<, but fmt v10 dropped the default-generated formatter. in this change, we define formatters for `dht::ring_position_ext`, and drop its operator<<. Refs #13245 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-02-02 18:37:56 +08:00
Kefu Chai	bb3ba81b15	db: add formatter for dht::ring_position_view before this change, we rely on the default-generated fmt::formatter created from operator<<, but fmt v10 dropped the default-generated formatter. in this change, we define formatters for `dht::ring_position_view`, and drop its operator<<. Refs #13245 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-02-02 18:36:17 +08:00
Kefu Chai	0cbf8f75f0	db: add formatter for dht::decorated_key and repair_sync_boundary before this change, we rely on the default-generated fmt::formatter created from operator<<, but fmt v10 dropped the default-generated formatter. in this change, we define formatters for dht::decorated_key and repair_sync_boundary. please note, before this change, repair_sync_boundary was using the operator<< based formatter of `dht::decorated_key`, so we are updating both of them in a single commit. because we still use the homebrew generic formatter of vector<> to format vector<repair_sync_boundary> and vector<dht::decorated_key>, their operator<< are preserved. Refs #13245 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#16994	2024-01-29 11:11:41 +02:00
Kefu Chai	1ce58595aa	dht: do not include unused headers these unused includes were identified by clangd. see https://clangd.llvm.org/guides/include-cleaner#unused-include-warning for more details on the "Unused include" warning. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#16891	2024-01-21 16:56:16 +02:00
Nadav Har'El	1bcaeb89c7	view: revert cleanup filter that doesn't work with tablets This patch reverts commit `10f8f13b90` from November 2022. That commit added to the "view update generator", the code which builds view updates for staging sstables, a filter that ignores ranges that do not belong to this node. However, 1. I believe this filter was never necessary, because the view update code already silently ignores base updates which do not belong to this replica (see get_view_natural_endpoint()). After all, the view update needs to know that this replica is the Nth owner of the base update to send its update to the Nth view replica, but if no such N exists, no view update is sent. 2. The code introduced for that filter used a per-keyspace replication map, which was ok for vnodes but no longer works for tablets, and causes the operation using it to fail. 3. The filter was used every time the "view update generator" was used, regardless of whether any cleanup is necessary or not, so every such operation would fail with tablets. So for example the dtest test_mvs_populating_from_existing_data fails with tablets: * This test has view building in parallel with automatic tablet movement. * Tablet movement is streaming. * When streaming happens before view building has finished, the streamed sstables get "view update generator" run on them. This causes the problematic code to be called. Before this patch, the dtest test_mvs_populating_from_existing_data fails when tablets are enabled. After this patch, it passes. Fixes #16598 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2024-01-14 13:24:44 +02:00
Benny Halevy	232918eef0	dht: move token_comparator to token.{cc,hh} Move the `token_comparator` definition and implementation to token.{hh,cc}, respectively, since they are independent of i_partitioner. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-11-05 20:01:15 +02:00
Tomasz Grabiec	29cbdb812b	dht: Rename dht::shard_of() to dht::static_shard_of() This is in order to prevent new incorrect uses of dht::shard_of() to be accidentally added. Also, makes sure that all current uses are caught by the compiler and require an explicit rename.	2023-06-21 00:58:24 +02:00
Tomasz Grabiec	fb0bdcec0c	storage_proxy: Avoid multishard reader for tablets Currently, the coordinator splits the partition range at vnode (or tablet) boundaries and then tries to merge adjacent ranges which target the same replica. This is an optimization which makes less sense with tablets, which are supposed to be of substantial size. If we don't merge the ranges, then with tablets we can avoid using the multishard reader on the replica side, since each tablet lives on a single shard. The main reason to avoid a multishard reader is avoiding its complexity, and avoiding adapting it to work with tablet sharding. Currently, the multishard reader implementation makes several assumptions about shard assignment which do not hold with tablets. It assumes that shards are assigned in a round-robin fashion.	2023-06-21 00:58:24 +02:00
Tomasz Grabiec	28b972a588	dht: Make split_range_to_single_shard() prepared for tablet sharder The function currently assumes that shard assignment for subsequent tokens is round robin, which will not be the case for tablets. This can lead to incorrect split calculation or infinite loop. Another assumption was that subsequent splits returned by the sharder have distinct shards. This also doesn't hold for tablets, which may return the same shard for subsequent tokens. This assumption was embedded in the following line: start_token = sharder.token_for_next_shard(end_token, shard); If the range which starts with end_token is also owned by "shard", token_for_next_shard() would skip over it.	2023-06-21 00:58:24 +02:00
Tomasz Grabiec	390bcf3fae	dht: Take sharder externally in splitting functions We need those functions to work with tablet sharder, which is not accessible through schema::get_sharder(). In order to propagate the right sharder, those functions need to take it externally rather than from the schema object. The sharder will come from the effective_replication_map attached to the table object. Those splitting functions are used when generating sharding metadata of an sstable. We need to keep this sharding metadata consistent with tablet mapping to shards in order for node restart to detect that those sstables belong to a single shard and that resharding is not necessary. Resharding of sstables based on tablet metadata is not implemented yet and will abort after this series. Keeping sharding metadata accurate for tablets is only necessary until compaction group integration is finished. After that, we can use the sstable token range to determine the owning tablet and thus the owning shard. Before that, we can't, because a single sstable may contain keys from different tablets, and the whole key range may overlap with keys which belong to other shards.	2023-06-21 00:58:24 +02:00
Tomasz Grabiec	e8dd5e34c3	dht: sharder: Introduce next_shard() The logic was extracted from ring_position_range_sharder::next(), and the latter was changed to rely on sharder::next_shard(). The tablet sharder will have a different implementation for next_shard(). This way, ring_position_range_sharder can work with both current sharder and the tablet sharder.	2023-06-21 00:58:24 +02:00
Kamil Braun	30cc07b40d	Merge 'Introduce tablets' from Tomasz Grabiec This PR introduces an experimental feature called "tablets". Tablets are a way to distribute data in the cluster, which is an alternative to the current vnode-based replication. Vnode-based replication strategy tries to evenly distribute the global token space shared by all tables among nodes and shards. With tablets, the aim is to start from a different side. Divide resources of replica-shard into tablets, with a goal of having a fixed target tablet size, and then assign those tablets to serve fragments of tables (also called tablets). This will allow us to balance the load in a more flexible manner, by moving individual tablets around. Also, unlike with vnode ranges, tablet replicas live on a particular shard on a given node, which will allow us to bind raft groups to tablets. Those goals are not yet achieved with this PR, but it lays the ground for this. Things achieved in this PR: - You can start a cluster and create a keyspace whose tables will use tablet-based replication. This is done by setting `initial_tablets` option: ``` CREATE KEYSPACE test WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 3, 'initial_tablets': 8}; ``` All tables created in such a keyspace will be tablet-based. Tablet-based replication is a trait, not a separate replication strategy. Tablets don't change the spirit of replication strategy, it just alters the way in which data ownership is managed. In theory, we could use it for other strategies as well like EverywhereReplicationStrategy. Currently, only NetworkTopologyStrategy is augmented to support tablets. 
- You can create and drop tablet-based tables (no DDL language changes) - DML / DQL work with tablet-based tables Replicas for tablet-based tables are chosen from tablet metadata instead of token metadata Things which are not yet implemented: - handling of views, indexes, CDC created on tablet-based tables - sharding is done using the old method, it ignores the shard allocated in tablet metadata - node operations (topology changes, repair, rebuild) are not handling tablet-based tables - not integrated with compaction groups - tablet allocator piggy-backs on tokens to choose replicas. Eventually we want to allocate based on current load, not statically Closes #13387 * github.com:scylladb/scylladb: test: topology: Introduce test_tablets.py raft: Introduce 'raft_server_force_snapshot' error injection locator: network_topology_strategy: Support tablet replication service: Introduce tablet_allocator locator: Introduce tablet_aware_replication_strategy locator: Extract maybe_remove_node_being_replaced() dht: token_metadata: Introduce get_my_id() migration_manager: Send tablet metadata as part of schema pull storage_service: Load tablet metadata when reloading topology state storage_service: Load tablet metadata on boot and from group0 changes db, migration_manager: Notify about tablet metadata changes via migration_listener::on_update_tablet_metadata() migration_notifier: Introduce before_drop_keyspace() migration_manager: Make prepare_keyspace_drop_announcement() return a future<> test: perf: Introduce perf-tablets test: Introduce tablets_test test: lib: Do not override table id in create_table() utils, tablets: Introduce external_memory_usage() db: tablets: Add printers db: tablets: Add persistence layer dht: Use last_token_of_compaction_group() in split_token_range_msb() locator: Introduce tablet_metadata dht: Introduce first_token() dht: Introduce next_token() storage_proxy: Improve trace-level logging locator: token_metadata: Fix confusing comment on ring_range() 
dht, storage_proxy: Abstract token space splitting Revert "query_ranges_to_vnodes_generator: fix for exclusive boundaries" db: Exclude keyspace with per-table replication in get_non_local_strategy_keyspaces_erms() db: Introduce get_non_local_vnode_based_strategy_keyspaces() service: storage_proxy: Avoid copying keyspace name in write handler locator: Introduce per-table replication strategy treewide: Use replication_strategy_ptr as a shorter name for abstract_replication_strategy::ptr_type locator: Introduce effective_replication_map locator: Rename effective_replication_map to vnode_effective_replication_map locator: effective_replication_map: Abstract get_pending_endpoints() db: Propagate feature_service to abstract_replication_strategy::validate_options() db: config: Introduce experimental "TABLETS" feature db: Log replication strategy for debugging purposes db: Log full exception on error in do_parse_schema_tables() db: keyspace: Remove non-const replication strategy getter config: Reformat	2023-04-27 09:40:18 +02:00
Kefu Chai	5a11d67709	dht: token: s/tri_compare/operator<=>/ now that C++20 is able to generate the default-generated comparing operators for us. there is no need to define them manually. and, `std::rel_ops::*` are deprecated in C++20. also, use `foo <=> bar` instead of `tri_compare(foo, bar)` for better readability. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-04-26 14:09:57 +08:00
Kefu Chai	cc87e10f40	dht: print pk in decorated_key with "pk" prefix this change ensures that `dk._key` is formatted with the "pk" prefix. as in `3738fcb`, the `operator<<` for partition_key was removed. so the compiler has to find an alternative when trying to fulfill the needs when this operator<< is called. fortunately, from the compiler's perspective, `partition_key` has an `operator managed_bytes_view`, and this operator does not have the explicit specifier, and, `managed_bytes_view` does support `operator<<`. so this ends up with a change in the format of `decorated_key` when it is printed using `operator<<`. the code compiles. but unfortunately, the behavior is changed, and it breaks scylla-dtest/cdc_tracing_info_test.py where the partition_key is supposed to be printed like "pk{010203}" instead of "010203". the latter is how `managed_bytes_view` is formatted. a test is added accordingly to avoid future changes which break the dtest. Fixes scylladb#13628 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #13653	2023-04-25 09:53:47 +02:00
Tomasz Grabiec	fa8ad9a585	dht: Use last_token_of_compaction_group() in split_token_range_msb()	2023-04-24 10:49:37 +02:00
Kefu Chai	c37f4e5252	treewide: use fmt::join() when appropriate now that fmtlib provides fmt::join(). see https://fmt.dev/latest/api.html#_CPPv4I0EN3fmt4joinE9join_viewIN6detail10iterator_tI5RangeEEN6detail10sentinel_tI5RangeEEERR5Range11string_view there is no need to reinvent the wheel. so in this change, the homebrew join() is replaced with fmt::join(). as fmt::join() returns a join_view, this could improve the performance under certain circumstances where the fully materialized string is not needed. please note, the goal of this change is to use fmt::join(), and this change does not intend to improve the performance of existing implementation based on "operator<<" unless the new implementation is much more complicated. we will address the unnecessarily materialized strings in a follow-up commit. some noteworthy things related to this change: * unlike the existing `join()`, `fmt::join()` returns a view. so we have to materialize the view if what we expect is a `sstring` * `fmt::format()` does not accept a view, so we cannot pass the return value of `fmt::join()` to `fmt::format()` * fmtlib does not format a typed pointer, i.e., it does not format, for instance, a `const std::string*`. but operator<<() always prints a typed pointer. so if we want to format a typed pointer, we either need to cast the pointer to `void*` or use `fmt::ptr()`. * fmtlib is not able to pick up the overload of `operator<<(std::ostream& os, const column_definition* cd)`, so we have to use a wrapper class of `maybe_column_definition` for printing a pointer to `column_definition`. since the overload is only used by the two overloads of `statement_restrictions::add_single_column_parition_key_restriction()`, the operator<< for `const column_definition*` is dropped. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-03-16 20:34:18 +08:00
Kefu Chai	df63e2ba27	types: move types.{cc,hh} into types they are part of the CQL type system, and are "closer" to types. let's move them into "types" directory. the building systems are updated accordingly. the source files referencing `types.hh` were updated using following command: ``` find . -name "*.{cc,hh}" -exec sed -i 's/\"types.hh\"/\"types\/types.hh\"/' {} + ``` the source files under sstables include "types.hh", which is indeed the one located under "sstables", so include "sstables/types.hh" instead, so it's more explicit. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes #12926	2023-02-19 21:05:45 +02:00
Kefu Chai	0cb842797a	treewide: do not define/capture unused variables these warnings are found by Clang-17 after removing `-Wno-unused-lambda-capture` and '-Wno-unused-variable' from the list of disabled warnings in `configure.py`. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-02-15 22:57:18 +02:00
Benny Halevy	95a8e0b21d	table: make_compaction_groups: calculate compaction_group token ranges Add dht::split_token_range_msb that returns a token_range_vector with ranges split using a given number of most-significant bits. When creating the table's compaction groups, use dht::split_token_range_msb to calculate the token_range owned by each compaction_group. Refs #12594 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2023-01-22 22:54:26 +02:00
Benny Halevy	57ff3f240f	dht: optimize subtract_ranges Take advantage of the fact that both ranges and ranges_to_subtract are deoverlapped and sorted to reduce the calculation complexity from quadratic to linear. Fixes #11922 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-11-21 15:48:28 +02:00
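
The last entry above (57ff3f240f) relies on both inputs being sorted and free of overlaps to subtract ranges in linear time. A minimal sketch of that idea follows. It is not the ScyllaDB implementation: `simple_range` and `subtract_sorted` are hypothetical names, and the sketch assumes closed integer intervals, whereas the real code works on token ranges that can also have open or unbounded ends.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical closed interval [start, end] over integer tokens.
// Assumption for illustration only; not a ScyllaDB type.
struct simple_range {
    int64_t start;
    int64_t end;
};

// Returns the parts of `ranges` not covered by `to_subtract`.
// Precondition: both vectors are sorted by start and contain no overlaps.
std::vector<simple_range> subtract_sorted(const std::vector<simple_range>& ranges,
                                          const std::vector<simple_range>& to_subtract) {
    std::vector<simple_range> out;
    std::size_t j = 0;  // first subtrahend that can still affect the current range
    for (const auto& r : ranges) {
        // Subtrahends ending before r starts cannot affect r or any later range.
        while (j < to_subtract.size() && to_subtract[j].end < r.start) {
            ++j;
        }
        int64_t cur = r.start;   // first token of r not yet emitted or removed
        bool exhausted = false;  // true once a subtrahend covers the tail of r
        std::size_t k = j;
        while (k < to_subtract.size() && to_subtract[k].start <= r.end) {
            const auto& s = to_subtract[k];
            if (s.start > cur) {
                out.push_back({cur, s.start - 1});  // gap before the subtrahend
            }
            if (s.end >= r.end) {
                exhausted = true;  // s may also overlap the next range, so keep it
                break;
            }
            cur = s.end + 1;
            ++k;
        }
        if (!exhausted) {
            out.push_back({cur, r.end});
        }
        j = k;  // the subtrahend cursor never moves backwards
    }
    return out;
}
```

Because the cursor `j` only moves forward, each subtrahend is revisited only for the ranges it actually overlaps, which keeps the total work proportional to the combined input size rather than to its product.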


172 Commits