Commit Graph

172 Commits

Author SHA1 Message Date
Avi Kivity
0ae22a09d4 LICENSE: Update to version 1.1
Updated terms of non-commercial use (must be a never-customer).
2026-04-12 19:46:33 +03:00
Botond Dénes
13ff9c4394 db,compaction: use utils::chunked_vector for cache invalidation ranges
Instead of dht::partition_ranges_vector, which is an std::vector<> and
have been seen to cause large allocations when calculating ranges to be
invalidated after compaction:

    seastar_memory - oversized allocation: 147456 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at
    [Backtrace #0]
    void seastar::backtrace<seastar::current_backtrace_tasklocal()::$_0>(seastar::current_backtrace_tasklocal()::$_0&&, bool) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:89
     (inlined by) seastar::current_backtrace_tasklocal() at ./build/release/seastar/./seastar/src/util/backtrace.cc:99
    seastar::current_tasktrace() at ./build/release/seastar/./seastar/src/util/backtrace.cc:136
    seastar::current_backtrace() at ./build/release/seastar/./seastar/src/util/backtrace.cc:169
    seastar::memory::cpu_pages::warn_large_allocation(unsigned long) at ./build/release/seastar/./seastar/src/core/memory.cc:840
    seastar::memory::cpu_pages::check_large_allocation(unsigned long) at ./build/release/seastar/./seastar/src/core/memory.cc:903
     (inlined by) seastar::memory::cpu_pages::allocate_large(unsigned int, bool) at ./build/release/seastar/./seastar/src/core/memory.cc:910
     (inlined by) seastar::memory::allocate_large(unsigned long, bool) at ./build/release/seastar/./seastar/src/core/memory.cc:1533
     (inlined by) seastar::memory::allocate_slowpath(unsigned long) at ./build/release/seastar/./seastar/src/core/memory.cc:1679
    seastar::memory::allocate(unsigned long) at ././seastar/src/core/memory.cc:1698
     (inlined by) operator new(unsigned long) at ././seastar/src/core/memory.cc:2440
     (inlined by) std::__new_allocator<interval<dht::ring_position>>::allocate(unsigned long, void const*) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/new_allocator.h:151
     (inlined by) std::allocator<interval<dht::ring_position>>::allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/allocator.h:203
     (inlined by) std::allocator_traits<std::allocator<interval<dht::ring_position>>>::allocate(std::allocator<interval<dht::ring_position>>&, unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/alloc_traits.h:614
     (inlined by) std::_Vector_base<interval<dht::ring_position>, std::allocator<interval<dht::ring_position>>>::_M_allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/stl_vector.h:387
     (inlined by) std::vector<interval<dht::ring_position>, std::allocator<interval<dht::ring_position>>>::reserve(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/vector.tcc:79
    dht::to_partition_ranges(utils::chunked_vector<interval<dht::token>, 131072ul> const&, seastar::bool_class<utils::can_yield_tag>) at ./dht/i_partitioner.cc:347
    compaction::compaction::get_ranges_for_invalidation(std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable>>> const&) at ./compaction/compaction.cc:619
     (inlined by) compaction::compaction::get_compaction_completion_desc(std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable>>>, std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable>>>) at ./compaction/compaction.cc:719
     (inlined by) compaction::regular_compaction::replace_remaining_exhausted_sstables() at ./compaction/compaction.cc:1362
    compaction::compaction::finish(std::chrono::time_point<db_clock, std::chrono::duration<long, std::ratio<1l, 1000l>>>, std::chrono::time_point<db_clock, std::chrono::duration<long, std::ratio<1l, 1000l>>>) at ./compaction/compaction.cc:1021
    compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0::operator()() at ./compaction/compaction.cc:1960
     (inlined by) compaction::compaction_result std::__invoke_impl<compaction::compaction_result, compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(std::__invoke_other, compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/invoke.h:63
     (inlined by) std::__invoke_result<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>::type std::__invoke<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/invoke.h:98
     (inlined by) decltype(auto) std::__apply_impl<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0, std::tuple<>>(compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&, std::tuple<>&&, std::integer_sequence<unsigned long, ...>) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/tuple:2920
     (inlined by) decltype(auto) std::apply<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0, std::tuple<>>(compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&, std::tuple<>&&) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/tuple:2935
     (inlined by) seastar::future<compaction::compaction_result> seastar::futurize<compaction::compaction_result>::apply<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&, std::tuple<>&&) at ././seastar/include/seastar/core/future.hh:1930
     (inlined by) seastar::futurize<std::invoke_result<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>::type>::type seastar::async<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(seastar::thread_attributes, compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&)::'lambda'()::operator()() const at ././seastar/include/seastar/core/thread.hh:267
     (inlined by) seastar::noncopyable_function<void ()>::direct_vtable_for<seastar::futurize<std::invoke_result<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>::type>::type seastar::async<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(seastar::thread_attributes, compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&)::'lambda'()>::call(seastar::noncopyable_function<void ()> const*) at ././seastar/include/seastar/util/noncopyable_function.hh:138
    seastar::noncopyable_function<void ()>::operator()() const at ./build/release/seastar/./seastar/include/seastar/util/noncopyable_function.hh:224
     (inlined by) seastar::thread_context::main() at ./build/release/seastar/./seastar/src/core/thread.cc:318

dht::partition_ranges_vector is used on the hot path, so just convert
the problematic user -- cache invalidation -- to use
utils::chunked_vector<dht::partition_range> instead.

Fixes: SCYLLADB-121

Closes scylladb/scylladb#28855
2026-03-09 22:04:54 +02:00
Avi Kivity
7230a04799 dht, sstables: replace vector with chunked_vector when computing sstable shards
sstable::compute_shards_for_this_sstable() has a temporary of type
std::vector<dht::token_range> (aka dht::partition_range_vector), which
allocates a contiguous 300k when loading an sstable from disk. This
causes large allocation warnings (it doesn't really stress the allocator
since this typically happens during startup, but best to clear the warning
anyway).

Fix this by changing the container to by chunked_vector. It is passed
to dht::ring_position_range_vector_sharder, but since we're the only
user, we can change that class to accept the new type.

Fixes #24198.

Closes scylladb/scylladb#26353
2025-10-02 00:47:42 +02:00
Petr Gusev
801bf42ea2 sharder: add try_get_shard_for_reads method
Currently, we use storage_proxy/get_cas_shard ->
sharder.shard_for_reads to decide which shard to use for LWT code
execution on both replicas and the coordinator.

If the coordinator is not a replica, shard_for_reads returns 0 —
the 'default' shard. This behavior has at least two problems:
* Shard 0 may become overloaded, because all LWT coordinators that are
not replicas will be served on it.
* The zero shard does not match shard_for_reads on replicas, which
hinders the "same shard for client and server" RPC-level optimization.

To fix this, we need to know whether the current node hosts a replica
for the tablet corresponding to the given token. Currently, there is
no API we could use for this. For historical reasons,
sharder::shard_for_reads returns 0 when the node does not host the
shard, which leads to ambiguity.

This commit introduces try_get_shard_for_reads, which returns a
disengaged std::optional when the tablet is not present on
the local node.

We leave shard_for_reads method in the base sharder class, it calls
try_get_shard_for_reads and returns zero by default. We need to rename
tablet_sharder private methods shard_for_reads and shard_for_writes
so that they don't conflict with the sharder::shard_for_reads.
2025-07-29 11:35:54 +02:00
Avi Kivity
16fb68bb5e interval: rename start_ref() back to start() (and end_ref() etc).
To reduce noise, rename start_ref() back to its original name start(),
after it was changed in the previous patch to force an audit of all calls.
2025-06-14 21:26:16 +03:00
Avi Kivity
3363bc41e2 interval: rename start() to start_ref() (and end() etc).
We are about to change start() to return a proxy object rather
than a `const interval_bound<T>&`. This is generally transparent,
except in one case: `auto x = i.start()`. With the current implementation,
we'll copy object referred to and assign it to x. With the planned
implementation, the proxy object will be assigned to `x`, but it
will keep referring to `i`.

To prevent such problems, rename start() to start_ref() and end()
to end_ref(). This forces us to audit all calls, and redirect calls
that will break to new start_copy() and end_copy() methods.
2025-06-14 21:26:16 +03:00
Raphael S. Carvalho
494ed6b887 dht: Add overlap_ratio() for token range
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-04-29 15:47:33 -03:00
Kefu Chai
a20cd6539c cql3, dht: Remove redundant std::move() calls
These redundant `std::move()` calls were identified by GCC-14.
In general, copy elision applies to these places, so adding
`std::move()` is not only unnecessary but can actually prevent
the compiler from performing copy elision, as it causes the
return statement to fail to satisfy the requirements for
copy elision optimization.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23063
2025-03-04 10:36:49 +03:00
Kefu Chai
7215d4bfe9 utils: do not include unused headers
these unused includes were identifier by clang-include-cleaner. after
auditing these source files, all of the reports have been confirmed.

please note, because quite a few source files relied on
`utils/to_string.hh` to pull in the specialization of
`fmt::formatter<std::optional<T>>`, after removing
`#include <fmt/std.h>` from `utils/to_string.hh`, we have to
include `fmt/std.h` directly.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2025-01-14 07:56:39 -05:00
Avi Kivity
f3eade2f62 treewide: relicense to ScyllaDB-Source-Available-1.0
Drop the AGPL license in favor of a source-available license.
See the blog post [1] for details.

[1] https://www.scylladb.com/2024/12/18/why-were-moving-to-a-source-available-license/
2024-12-18 17:45:13 +02:00
Avi Kivity
f744007e13 dht: auto_refreshing_sharder.hh: don't include database.hh
database.hh is a heavyweight include file with a lot of fan-in.
auto_refreshing_sharder.hh has a lot of fan out. The combination
means a large dependency load.

Deinline the class and use forward declarations to avoid the #include.

There is no expected performance impact because all the functions are
virtual.

Ref #1

Note: this shouldn't belong in dht, but be injected by a higher layer,
but this isn't addressed by the patch.

Closes scylladb/scylladb#21768
2024-12-06 23:11:52 +01:00
Kefu Chai
bab12e3a98 treewide: migrate from boost::adaptors::transformed to std::views::transform
now that we are allowed to use C++23. we now have the luxury of using
`std::views::transform`.

in this change, we:

- replace `boost::adaptors::transformed` with `std::views::transform`
- use `fmt::join()` when appropriate where `boost::algorithm::join()`
  is not applicable to a range view returned by `std::view::transform`.
- use `std::ranges::fold_left()` to accumulate the range returned by
  `std::view::transform`
- use `std::ranges::fold_left()` to get the maximum element in the
  range returned by `std::view::transform`
- use `std::ranges::min()` to get the minimal element in the range
  returned by `std::view::transform`
- use `std::ranges::equal()` to compare the range views returned
  by `std::view::transform`
- remove unused `#include <boost/range/adaptor/transformed.hpp>`
- use `std::ranges::subrange()` instead of `boost::make_iterator_range()`,
  to feed `std::views::transform()` a view range.

to reduce the dependency to boost for better maintainability, and
leverage standard library features for better long-term support.

this change is part of our ongoing effort to modernize our codebase
and reduce external dependencies where possible.

limitations:

there are still a couple places where we are still using
`boost::adaptors::transformed` due to the lack of a C++23 alternative
for `boost::join()` and `boost::adaptors::uniqued`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21700
2024-12-03 09:41:32 +02:00
Kefu Chai
6ead5a4696 treewide: move log.hh into utils/log.hh
the log.hh under the root of the tree was created keep the backward
compatibility when seastar was extracted into a separate library.
so log.hh should belong to `utils` directory, as it is based solely
on seastar, and can be used all subsystems.

in this change, we move log.hh into utils/log.hh to that it is more
modularized. and this also improves the readability, when one see
`#include "utils/log.hh"`, it is obvious that this source file
needs the logging system, instead of its own log facility -- please
note, we do have two other `log.hh` in the tree.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-10-22 06:54:46 +03:00
Kefu Chai
5cd619a60c treewide: s/boost::adaptors::map_keys/std::views::keys/
now that we are allowed to use C++23. we now have the luxury of using
`std::views::keys`.

in this change, we:

- replace `boost::adaptors::map_keys` with `std::views::keys`
- update affected code to work with `std::views::keys`

to reduce the dependency to boost for better maintainability, and
leverage standard library features for better long-term support.

this change is part of our ongoing effort to modernize our codebase
and reduce external dependencies where possible.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21198
2024-10-21 12:47:52 +03:00
Kefu Chai
3e84d43f93 treewide: use seastar::format() or fmt::format() explicitly
before this change, we rely on `using namespace seastar` to use
`seastar::format()` without qualifying the `format()` with its
namespace. this works fine until we changed the parameter type
of format string `seastar::format()` from `const char*` to
`fmt::format_string<...>`. this change practically invited
`seastar::format()` to the club of `std::format()` and `fmt::format()`,
where all members accept a templated parameter as its `fmt`
parameter. and `seastar::format()` is not the best candidate anymore.
despite that argument-dependent lookup (ADT for short) favors the
function which is in the same namespace as its parameter, but
`using namespace` makes `seastar::format()` more competitive,
so both `std::format()` and `seastar::format()` are considered
as the condidates.

that is what is happening scylladb in quite a few caller sites of
`format()`, hence ADT is not able to tell which function the winner
in the name lookup:

```
/__w/scylladb/scylladb/mutation/mutation_fragment_stream_validator.cc:265:12: error: call to 'format' is ambiguous
  265 |     return format("{} ({}.{} {})", _name_view, s.ks_name(), s.cf_name(), s.id());
      |            ^~~~~~
/usr/bin/../lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/format:4290:5: note: candidate function [with _Args = <const std::basic_string_view<char> &, const seastar::basic_sstring<char, unsigned int, 15> &, const seastar::basic_sstring<char, unsigned int, 15> &, const utils::tagged_uuid<table_id_tag> &>]
 4290 |     format(format_string<_Args...> __fmt, _Args&&... __args)
      |     ^
/__w/scylladb/scylladb/seastar/include/seastar/core/print.hh:143:1: note: candidate function [with A = <const std::basic_string_view<char> &, const seastar::basic_sstring<char, unsigned int, 15> &, const seastar::basic_sstring<char, unsigned int, 15> &, const utils::tagged_uuid<table_id_tag> &>]
  143 | format(fmt::format_string<A...> fmt, A&&... a) {
      | ^
```

in this change, we

change all `format()` to either `fmt::format()` or `seastar::format()`
with following rules:
- if the caller expects an `sstring` or `std::string_view`, change to
  `seastar::format()`
- if the caller expects an `std::string`, change to `fmt::format()`.
  because, `sstring::operator std::basic_string` would incur a deep
  copy.

we will need another change to enable scylladb to compile with the
latest seastar. namely, to pass the format string as a templated
parameter down to helper functions which format their parameters.
to miminize the scope of this change, let's include that change when
bumping up the seastar submodule. as that change will depend on
the seastar change.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-09-11 23:21:40 +03:00
Avi Kivity
aa1270a00c treewide: change assert() to SCYLLA_ASSERT()
assert() is traditionally disabled in release builds, but not in
scylladb. This hasn't caused problems so far, but the latest abseil
release includes a commit [1] that causes a 1000 insn/op regression when
NDEBUG is not defined.

Clearly, we must move towards a build system where NDEBUG is defined in
release builds. But we can't just define it blindly without vetting
all the assert() calls, as some were written with the expectation that
they are enabled in release mode.

To solve the conundrum, change all assert() calls to a new SCYLLA_ASSERT()
macro in utils/assert.hh. This macro is always defined and is not conditional
on NDEBUG, so we can later (after vetting Seastar) enable NDEBUG in release
mode.

[1] 66ef711d68

Closes scylladb/scylladb#20006
2024-08-05 08:23:35 +03:00
Benny Halevy
850f298ccd everywhere: use dht::token is_{minimum,maximum}
The is_minimum/is_maximum predicates are more
efficient than comparing the the m{minimum,maximum}_token
values, respectrively. since the is_* functions
need to check only the token kind.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-07-20 21:21:42 +03:00
Kefu Chai
51d686ea9f dht: remove unused operator<<
since we've switched almost all callers of the operator<< to {fmt},
let's drop the unused operator<<:s.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-06-18 11:26:20 +08:00
Tomasz Grabiec
26f2e6aa8e dht: is_single_shard: Avoid deprecated sharder API
All current uses are used in the read path.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
c9e6b4dca7 dht: split_range_to_single_shard: Work with static_sharder only
In preparation for intra-node tablet migration, to avoid
using deprecated sharder APIs.

This function is used for generating sstable sharding metadata.
For tablets, it is not invoked, so we can safely work with the
static sharder. The call site already passes static_sharder only.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
c380aecf64 dht: ring_position_range_sharder: Avoid deprecated sharder APIs
In preparation for tablet intra-node migration.

Existing uses are for reads, so it's safe to use shard_for_reads():
  - in multishard reader
  - in forward_service

The ring_position_range_vector_sharder is used when computing sstable
shards, which for intra-node migration should use the view for
reads. If we haven't completed streaming, sstables should be attached
to the old shard (used by reads). When in write-both-read-new stage,
streaming is complete, reads are using the new shard, and we should
attach sstables to the new shard.

When not in intra-node migration, the view for reads on the pending
node will return the pending shard even if read selector is "read old".
So if pending node restarts during streaming, we will attach to sstables
to the shard which is used by writes even though we're using the selector
for reads.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
dd4a086b87 selective_token_sharder: Avoid use of deprecated sharder API
I analyzed all the uses and all except the alternator/ttl.cc seem to
be interested in the result for the purpose of reading.

Alternator is not supported with tablets yet, so the use was annotated
with a relevant issue.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
9da3bd84c7 dht: Extract dht::static_sharder
Before the patch, dht::sharder could be instantiated and it would
behave like a static sharder. This is not safe with regards to
extensions of the API because if a derived implementation forgets to
override some method, it would incorrectly default to the
implementation from static sharder. Better to fail the compilation in
this case, so extract static sharder logic to dht::static_sharder
class and make all methods in dht::sharder pure virtual.

This also allows us to have algorithms indicate that they only work
with static sharder by accepting the type, and have compile-time
safety for this requirement.

schema::get_sharder() is changed to return the static_sharder&.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
10a4903d0c dht: Deprecate old sharder API: shard_of/next_shard/token_for_next_shard
Require users to specify whether we want shard for reads or for writes
by switching to appropriate non-deprecated variant.

For example, shard_of() can be replaced with shard_for_reads() or
shard_for_writes().

The next_shard/token_for_next_shard APIs have only for-reads variant,
and the act of switching will be a testimony to the fact that the code
is valid for intra-node migration.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
aafeacc8d9 dht, auto_refreshing_sharder: Allow overriding write selector
During streaming for intra-node migration we want to write only to the
new shard. To achieve that, allow altering write selector in
sharder::shard_for_writes() and per-instance of
auto_refreshing_sharder.
2024-05-16 00:28:46 +02:00
Tomasz Grabiec
6946ad2a45 sharding: Prepare for intra-node-migration
Tablet sharder is adjusted to handle intra-migration where a tablet
can have two replicas on the same host. For reads, sharder uses the
read selector to resolve the conflict. For writes, the write selector
is used.

The old shard_of() API is kept to represent shard for reads, and new
method is introduced to query the shards for writing:
shard_for_writes(). All writers should be switched to that API, which
is not done in this patch yet.

The request handler on replica side acts as a second-level
coordinator, using sharder to determine routing to shards. A given
sharder has a scope of a single topology version, a single
effective_replication_map_ptr, which should be kept alive during
writes.
2024-05-16 00:28:46 +02:00
Botond Dénes
4993d0e30a dht/i_partitioner: to_partition_range(): don't assume input is fully inclusive
Consider the inclusiveness of the token-range's start and end bounds and
copy the flag to the output bounds, instead of assuming they are always
inclusive.
2024-02-21 02:08:48 -05:00
Avi Kivity
605bf6e221 range.hh: retire
range.hh was deprecated in bd794629f9 (2020) since its names
conflict with the C++ library concept of an iterator range. The name
::range also mapped to the dangerous wrapping_interval rather than
nonwrapping_interval.

Complete the deprecation by removing range.hh and replacing all the
aliases by the names they point to from the interval library. Note
this now exposes uses of wrapping intervals as they are now explicit.

The unit tests are renamed and range.hh is deleted.

Closes scylladb/scylladb#17428
2024-02-21 00:24:25 +02:00
Kefu Chai
8e8b73fa82 dht: add formatter for paritition_range_view and i_partition
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.

in this change, we define formatters for
`partition_range_view` and `i_partition`, and drop their operator<<:s.

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#17331
2024-02-15 09:46:03 +02:00
Kefu Chai
b1e4513c2d dht: add formatter for dht::ring_position
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.

in this change, we define formatters for `dht::ring_posittion`,
and drop its operator<<.

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#17194
2024-02-07 09:30:45 +02:00
Kefu Chai
e56e74df0a db: add formatter for dht::ring_position_ext
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.

in this change, we define formatters for `dht::ring_position_ext`,
and drop its operator<<.

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-02-02 18:37:56 +08:00
Kefu Chai
bb3ba81b15 db: add formatter for dht::ring_position_view
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.

in this change, we define formatters for `dht::ring_position_view`,
and drop its operator<<.

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-02-02 18:36:17 +08:00
Kefu Chai
0cbf8f75f0 db: add formatter for dht::decorated_key and repair_sync_boundary
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.

in this change, we define formatters for dht::decorated_key and
repair_sync_boundary.

please note, before this change, repair_sync_boundary was using
the operator<< based formatter of `dht::decorated_key`, so we are
updating both of them in a single commit.

because we still use the homebrew generic formatter of vector<>
in to format vector<repair_sync_boundary> and vector<dht::decorated_key>,
so their operator<< are preserved.

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16994
2024-01-29 11:11:41 +02:00
Kefu Chai
1ce58595aa dht: do not include unused headers
these unused includes were identified by clangd. see
https://clangd.llvm.org/guides/include-cleaner#unused-include-warning
for more details on the "Unused include" warning.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16891
2024-01-21 16:56:16 +02:00
Nadav Har'El
1bcaeb89c7 view: revert cleanup filter that doesn't work with tablets
This patch reverts commit 10f8f13b90 from
November 2022. That commit added to the "view update generator", the code
which builds view updates for staging sstables, a filter that ignores
ranges that do not belong to this node. However,

1. I believe this filter was never necessary, because the view update
   code already silently ignores base updates which do not belong to
   this replica (see get_view_natural_endpoint()). After all, the view
   update needs to know that this replica is the Nth owner of the base
   update to send its update to the Nth view replica, but if no such
   N exists, no view update is sent.

2. The code introduced for that filter used a per-keyspace replication
   map, which was ok for vnodes but no longer works for tablets, and
   causes the operation using it to fail.

3. The filter was used every time the "view update generator" was used,
   regardless of whether any cleanup is necessary or not, so every
   such operation would fail with tablets. So for example the dtest
   test_mvs_populating_from_existing_data fails with tablets:
     * This test has view building in parallel with automatic tablet
       movement.
     * Tablet movement is streaming.
     * When streaming happens before view building has finished, the
       streamed sstables get "view update generator" run on them.
       This causes the problematic code to be called.

Before this patch, the dtest test_mvs_populating_from_existing_data
fails when tablets are enabled. After this patch, it passes.

Fixes #16598

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-01-14 13:24:44 +02:00
Benny Halevy
232918eef0 dht: move token_comperator to token.{cc,hh}
Move the `token_comparator` definition and
implementation to token.{hh,cc}, respectively
since they are independent of i_partitioner.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-11-05 20:01:15 +02:00
Tomasz Grabiec
29cbdb812b dht: Rename dht::shard_of() to dht::static_shard_of()
This is in order to prevent new incorrect uses of dht::shard_of() to
be accidentally added. Also, makes sure that all current uses are
caught by the compiler and require an explicit rename.
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
fb0bdcec0c storage_proxy: Avoid multishard reader for tablets
Currently, the coordinator splits the partition range at vnode (or
tablet) boundaries and then tries to merge adjacent ranges which
target the same replica. This is an optimization which makes less
sense with tablets, which are supposed to be of substantial size. If
we don't merge the ranges, then with tablets we can avoid using the
multishard reader on the replica side, since each tablet lives on a
single shard.

The main reason to avoid a multishard reader is avoiding its
complexity, and avoiding adapting it to work with tablet
sharding. Currently, the multishard reader implementation makes
several assumptions about shard assignment which do not hold with
tablets. It assumes that shards are assigned in a round-robin fashion.
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
28b972a588 dht: Make split_range_to_single_shard() prepared for tablet sharder
The function currently assumes that shard assignment for subsequent
tokens is round robin, which will not be the case for tablets. This
can lead to incorrect split calculation or infinite loop.

Another assumption was that subsequent splits returned by the sharder
have distinct shards. This also doesn't hold for tablets, which may
return the same shard for subsequent tokens. This assumption was
embedded in the following line:

  start_token = sharder.token_for_next_shard(end_token, shard);

If the range which starts with end_token is also owned by "shard",
token_for_next_shard() would skip over it.
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
390bcf3fae dht: Take sharder externally in splitting functions
We need those functions to work with tablet sharder, which is not
accessible through schema::get_sharder(). In order to propagate the
right sharder, those functions need to take it externally rather from
the schema object. The sharder will come from the
effective_replication_map attached to the table object.

Those splitting functions are used when generating sharding metadata
of an sstable. We need to keep this sharding metadata consistent with
tablet mapping to shards in order for node restart to detect that
those sstables belong to a single shard and that resharding is not
necessary. Resharding of sstables based on tablet metadata is not
implemented yet and will abort after this series.

Keeping sharding metadata accurate for tablets is only necessary until
compaction group integration is finished. After that, we can use the
sstable token range to determine the owning tablet and thus the owning
shard. Before that, we can't, because a single sstable may contain
keys from different tablets, and the whole key range may overlap with
keys which belong to other shards.
2023-06-21 00:58:24 +02:00
Tomasz Grabiec
e8dd5e34c3 dht: sharder: Introduce next_shard()
The logic was extracted from ring_position_range_sharder::next(), and
the latter was changed to rely on sharder::next_shard().

The tablet sharder will have a different implementation for
next_shard(). This way, ring_position_range_sharder can work with both
current sharder and the tablet sharder.
2023-06-21 00:58:24 +02:00
Kamil Braun
30cc07b40d Merge 'Introduce tablets' from Tomasz Grabiec
This PR introduces an experimental feature called "tablets". Tablets are
a way to distribute data in the cluster, which is an alternative to the
current vnode-based replication. Vnode-based replication strategy tries
to evenly distribute the global token space shared by all tables among
nodes and shards. With tablets, the aim is to start from a different
side. Divide resources of replica-shard into tablets, with a goal of
having a fixed target tablet size, and then assign those tablets to
serve fragments of tables (also called tablets). This will allow us to
balance the load in a more flexible manner, by moving individual tablets
around. Also, unlike with vnode ranges, tablet replicas live on a
particular shard on a given node, which will allow us to bind raft
groups to tablets. Those goals are not yet achieved with this PR, but it
lays the ground for this.

Things achieved in this PR:

  - You can start a cluster and create a keyspace whose tables will use
    tablet-based replication. This is done by setting `initial_tablets`
    option:

    ```
        CREATE KEYSPACE test WITH replication = {'class': 'NetworkTopologyStrategy',
                        'replication_factor': 3,
                        'initial_tablets': 8};
    ```

    All tables created in such a keyspace will be tablet-based.

    Tablet-based replication is a trait, not a separate replication
    strategy. Tablets don't change the spirit of replication strategy, it
    just alters the way in which data ownership is managed. In theory, we
    could use it for other strategies as well like
    EverywhereReplicationStrategy. Currently, only NetworkTopologyStrategy
    is augmented to support tablets.

  - You can create and drop tablet-based tables (no DDL language changes)

  - DML / DQL work with tablet-based tables

    Replicas for tablet-based tables are chosen from tablet metadata
    instead of token metadata

Things which are not yet implemented:

  - handling of views, indexes, CDC created on tablet-based tables
  - sharding is done using the old method, it ignores the shard allocated in tablet metadata
  - node operations (topology changes, repair, rebuild) are not handling tablet-based tables
  - not integrated with compaction groups
  - tablet allocator piggy-backs on tokens to choose replicas.
    Eventually we want to allocate based on current load, not statically

Closes #13387

* github.com:scylladb/scylladb:
  test: topology: Introduce test_tablets.py
  raft: Introduce 'raft_server_force_snapshot' error injection
  locator: network_topology_strategy: Support tablet replication
  service: Introduce tablet_allocator
  locator: Introduce tablet_aware_replication_strategy
  locator: Extract maybe_remove_node_being_replaced()
  dht: token_metadata: Introduce get_my_id()
  migration_manager: Send tablet metadata as part of schema pull
  storage_service: Load tablet metadata when reloading topology state
  storage_service: Load tablet metadata on boot and from group0 changes
  db, migration_manager: Notify about tablet metadata changes via migration_listener::on_update_tablet_metadata()
  migration_notifier: Introduce before_drop_keyspace()
  migration_manager: Make prepare_keyspace_drop_announcement() return a future<>
  test: perf: Introduce perf-tablets
  test: Introduce tablets_test
  test: lib: Do not override table id in create_table()
  utils, tablets: Introduce external_memory_usage()
  db: tablets: Add printers
  db: tablets: Add persistence layer
  dht: Use last_token_of_compaction_group() in split_token_range_msb()
  locator: Introduce tablet_metadata
  dht: Introduce first_token()
  dht: Introduce next_token()
  storage_proxy: Improve trace-level logging
  locator: token_metadata: Fix confusing comment on ring_range()
  dht, storage_proxy: Abstract token space splitting
  Revert "query_ranges_to_vnodes_generator: fix for exclusive boundaries"
  db: Exclude keyspace with per-table replication in get_non_local_strategy_keyspaces_erms()
  db: Introduce get_non_local_vnode_based_strategy_keyspaces()
  service: storage_proxy: Avoid copying keyspace name in write handler
  locator: Introduce per-table replication strategy
  treewide: Use replication_strategy_ptr as a shorter name for abstract_replication_strategy::ptr_type
  locator: Introduce effective_replication_map
  locator: Rename effective_replication_map to vnode_effective_replication_map
  locator: effective_replication_map: Abstract get_pending_endpoints()
  db: Propagate feature_service to abstract_replication_strategy::validate_options()
  db: config: Introduce experimental "TABLETS" feature
  db: Log replication strategy for debugging purposes
  db: Log full exception on error in do_parse_schema_tables()
  db: keyspace: Remove non-const replication strategy getter
  config: Reformat
2023-04-27 09:40:18 +02:00
Kefu Chai
5a11d67709 dht: token: s/tri_compare/operator<=>/
now that C++20 is able to generate the default-generated comparing
operators for us. there is no need to define them manually. and,
`std::rel_ops::*` are deprecated in C++20.

also, use `foo <=> bar` instead of `tri_compare(foo, bar)` for better
readability.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-04-26 14:09:57 +08:00
Kefu Chai
cc87e10f40 dht: print pk in decorated_key with "pk" prefix
this change ensures that `dk._key` is formatted with the "pk" prefix.
as in 3738fcb, the `operator<<` for partition_key was removed. so the
compiler has to find an alternative when trying to fulfill the needs
when this operator<< is called. fortunately, from the compiler's
perspective, `partition_key` has an `operator managed_bytes_view`, and
this operator does not have the explicit specifier, and,
`managed_bytes_view` does support `operator<<`. so this ends up with a
change in the format of `decorated_key` when it is printed using
`operator<<`. the code compiles. but unfortunately, the behavior is
changed, and it breaks scylla-dtest/cdc_tracing_info_test.py where the
partition_key is supposed to be printed like "pk{010203}" instead of
"010203". the latter is how `managed_bytes_view` is formatted.

a test is added accordingly to avoid future changes which break the
dtest.

Fixes scylladb#13628
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #13653
2023-04-25 09:53:47 +02:00
Tomasz Grabiec
fa8ad9a585 dht: Use last_token_of_compaction_group() in split_token_range_msb() 2023-04-24 10:49:37 +02:00
Kefu Chai
c37f4e5252 treewide: use fmt::join() when appropriate
now that fmtlib provides fmt::join(). see
https://fmt.dev/latest/api.html#_CPPv4I0EN3fmt4joinE9join_viewIN6detail10iterator_tI5RangeEEN6detail10sentinel_tI5RangeEEERR5Range11string_view
there is not need to revent the wheel. so in this change, the homebrew
join() is replaced with fmt::join().

as fmt::join() returns an join_view(), this could improve the
performance under certain circumstances where the fully materialized
string is not needed.

please note, the goal of this change is to use fmt::join(), and this
change does not intend to improve the performance of existing
implementation based on "operator<<" unless the new implementation is
much more complicated. we will address the unnecessarily materialized
strings in a follow-up commit.

some noteworthy things related to this change:

* unlike the existing `join()`, `fmt::join()` returns a view. so we
  have to materialize the view if what we expect is a `sstring`
* `fmt::format()` does not accept a view, so we cannot pass the
  return value of `fmt::join()` to `fmt::format()`
* fmtlib does not format a typed pointer, i.e., it does not format,
  for instance, a `const std::string*`. but operator<<() always print
  a typed pointer. so if we want to format a typed pointer, we either
  need to cast the pointer to `void*` or use `fmt::ptr()`.
* fmtlib is not able to pick up the overload of
  `operator<<(std::ostream& os, const column_definition* cd)`, so we
  have to use a wrapper class of `maybe_column_definition` for printing
  a pointer to `column_definition`. since the overload is only used
  by the two overloads of
  `statement_restrictions::add_single_column_parition_key_restriction()`,
  the operator<< for `const column_definition*` is dropped.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-03-16 20:34:18 +08:00
Kefu Chai
df63e2ba27 types: move types.{cc,hh} into types
they are part of the CQL type system, and are "closer" to types.
let's move them into "types" directory.

the building systems are updated accordingly.

the source files referencing `types.hh` were updated using following
command:

```
find . -name "*.{cc,hh}" -exec sed -i 's/\"types.hh\"/\"types\/types.hh\"/' {} +
```

the source files under sstables include "types.hh", which is
indeed the one located under "sstables", so include "sstables/types.hh"
instea, so it's more explicit.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #12926
2023-02-19 21:05:45 +02:00
Kefu Chai
0cb842797a treewide: do not define/capture unused variables
these warnings are found by Clang-17 after removing
`-Wno-unused-lambda-capture` and '-Wno-unused-variable' from
the list of disabled warnings in `configure.py`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-02-15 22:57:18 +02:00
Benny Halevy
95a8e0b21d table: make_compaction_groups: calculate compaction_group token ranges
Add dht::split_token_range_msb that returns a token_range_vector
with ranges split using a given number of most-significant bits.

When creating the table's compaction groups, use dht::split_token_range_msb
to calculate the token_range owned by each compaction_group.

Refs #12594

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2023-01-22 22:54:26 +02:00
Benny Halevy
57ff3f240f dht: optimize subtract_ranges
Take advantage of the fact that both ranges and
ranges_to_subtract are deoverlapped and sorted by
to reduce the calculation complexity from
quadratic to linear.

Fixes #11922

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-11-21 15:48:28 +02:00