before this change, we rely on `using namespace seastar` to use
`seastar::format()` without qualifying the `format()` with its
namespace. this works fine until we changed the parameter type
of format string `seastar::format()` from `const char*` to
`fmt::format_string<...>`. this change practically invited
`seastar::format()` to the club of `std::format()` and `fmt::format()`,
where all members accept a templated parameter as its `fmt`
parameter. and `seastar::format()` is not the best candidate anymore.
despite that argument-dependent lookup (ADT for short) favors the
function which is in the same namespace as its parameter, but
`using namespace` makes `seastar::format()` more competitive,
so both `std::format()` and `seastar::format()` are considered
as the condidates.
that is what is happening scylladb in quite a few caller sites of
`format()`, hence ADT is not able to tell which function the winner
in the name lookup:
```
/__w/scylladb/scylladb/mutation/mutation_fragment_stream_validator.cc:265:12: error: call to 'format' is ambiguous
265 | return format("{} ({}.{} {})", _name_view, s.ks_name(), s.cf_name(), s.id());
| ^~~~~~
/usr/bin/../lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/format:4290:5: note: candidate function [with _Args = <const std::basic_string_view<char> &, const seastar::basic_sstring<char, unsigned int, 15> &, const seastar::basic_sstring<char, unsigned int, 15> &, const utils::tagged_uuid<table_id_tag> &>]
4290 | format(format_string<_Args...> __fmt, _Args&&... __args)
| ^
/__w/scylladb/scylladb/seastar/include/seastar/core/print.hh:143:1: note: candidate function [with A = <const std::basic_string_view<char> &, const seastar::basic_sstring<char, unsigned int, 15> &, const seastar::basic_sstring<char, unsigned int, 15> &, const utils::tagged_uuid<table_id_tag> &>]
143 | format(fmt::format_string<A...> fmt, A&&... a) {
| ^
```
in this change, we
change all `format()` to either `fmt::format()` or `seastar::format()`
with following rules:
- if the caller expects an `sstring` or `std::string_view`, change to
`seastar::format()`
- if the caller expects an `std::string`, change to `fmt::format()`.
because, `sstring::operator std::basic_string` would incur a deep
copy.
we will need another change to enable scylladb to compile with the
latest seastar. namely, to pass the format string as a templated
parameter down to helper functions which format their parameters.
to miminize the scope of this change, let's include that change when
bumping up the seastar submodule. as that change will depend on
the seastar change.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
in in {fmt} before v10, it provides the specialization of `fmt::formatter<..>`
for `std::string_view` as well as the specialization of `fmt::formatter<..>`
for `fmt::string_view` which is an implementation builtin in {fmt} for
compatibility of pre-C++17. and this type is used even if the code is
compiled with C++ stadandard greater or equal to C++17. also, before v10,
the `fmt::formatter<std::string_view>::format()` is defined so it accepts
`std::string_view`. after v10, `fmt::formatter<std::string_view>` still
exists, but it is now defined using `format_as()` machinery, so it's
`format()` method does not actually accept `std::string_view`, it
accepts `fmt::string_view`, as the former can be converted to
`fmt::string_view`.
this is why we can inherit from `fmt::formatter<std::string_view>` and
use `formatter<std::string_view>::format(foo, ctx);` to implement the
`format()` method with {fmt} v9, but we cannot do this with {fmt} v10,
and we would have following compilation failure:
```
FAILED: service/CMakeFiles/service.dir/RelWithDebInfo/topology_state_machine.cc.o
/home/kefu/.local/bin/clang++ -DFMT_DEPRECATED_OSTREAM -DFMT_SHARED -DSCYLLA_BUILD_MODE=release -DSEASTAR_API_LEVEL=7 -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SSTRING -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"RelWithDebInfo\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/gen -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/seastar/gen/include -I/home/kefu/dev/scylladb/build/seastar/gen/src -ffunction-sections -fdata-sections -O3 -g -gz -std=gnu++20 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-enum-constexpr-conversion -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb=. -march=westmere -mllvm -inline-threshold=2500 -fno-slp-vectorize -U_FORTIFY_SOURCE -Werror=unused-result -MD -MT service/CMakeFiles/service.dir/RelWithDebInfo/topology_state_machine.cc.o -MF service/CMakeFiles/service.dir/RelWithDebInfo/topology_state_machine.cc.o.d -o service/CMakeFiles/service.dir/RelWithDebInfo/topology_state_machine.cc.o -c /home/kefu/dev/scylladb/service/topology_state_machine.cc
/home/kefu/dev/scylladb/service/topology_state_machine.cc:254:41: error: no matching member function for call to 'format'
254 | return formatter<std::string_view>::format(it->second, ctx);
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~
/usr/include/fmt/core.h:2759:22: note: candidate function template not viable: no known conversion from 'seastar::basic_sstring<char, unsigned int, 15>' to 'const fmt::basic_string_view<char>' for 1st argument
2759 | FMT_CONSTEXPR auto format(const T& val, FormatContext& ctx) const
| ^ ~~~~~~~~~~~~
```
because the inherited `format()` method actually comes from
`fmt::formatter<fmt::string_view>`. to reduce the confusion, in this
change, we just inherit from `fmt::format<string_view>`, where
`string_view` is actually `fmt::string_view`. this follows
the document at
https://fmt.dev/latest/api.html#formatting-user-defined-types,
and since there is less indirection under the hood -- we do not
use the specialization created by `FMT_FORMAT_AS` which inherit
from `formatter<fmt::string_view>`, hopefully this can improve
the compilation speed a little bit. also, this change addresses
the build failure with {fmt} v10.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#18299
before this change, we rely on the default-generated fmt::formatter created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, `fmt::formatter` is added for following types for backward compatibility with {fmt} < 10:
* `utils::bad_exception_container_access`
* `cdc::no_generation_data_exception`
* classes derived from `sstables::malformed_sstable_exception`
* classes derived from `cassandra_exception`
Refs https://github.com/scylladb/scylladb/issues/13245Closesscylladb/scylladb#17944
* github.com:scylladb/scylladb:
cdc: add fmt::formatter for exception types in data_dictionary.hh
utils: add fmt::formatter for utils::bad_exception_container_access
sstables: add fmt::formatter for classes derived from sstables::malformed_sstable_exception
exceptions: add fmt::formatter for classes derived from cassandra_exception
cdc: add fmt::formatter for cdc::no_generation_data_exception
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, `fmt::formatter<cdc::no_generation_data_exception>` is
added for backward compatibility with {fmt} < 10.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
There is no need to map this node's inet_address to host_id.
The storage_service can easily just pass the local host_id.
While at it, get the other node's host_id directly
from their endpoint_state instead of looking it up
yet again in the gossiper, using the nodes' address.
Refs #12283
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
After moving the creation of uuid out of
make_new_generation_description, this function only calls the
topology_description_generator's constructor and its generate
method. We could remove this function, but we instead simplify
the code by removing the topology_description_generator class.
We can do this refactor because make_new_generation_description
is the only place using it. We inline its generate method into
make_new_generation_description and turn its private methods into
static functions.
In the future commit, we change how we initialize uuid of the
new CDC generation in the Raft-based topology. It forces us to
move this initialization out of the make_new_generation_data
function shared between Raft-based and gossiper-based topologies.
We also rename make_new_generation_data to
make_new_generation_description since it only returns
cdc::topology_description now.
We make CDC_GENERATIONS_V3 single-partition by adding the key
column and changing the clustering key from range_end to
(id, range_end). This is the first step to enabling the efficient
clearing of obsolete CDC generation data, which we need to prevent
Raft-topology snapshots from endlessly growing as we introduce new
generations over time. The next step is to change the type of the id
column to timeuuid. We do it in the following commits.
After making CDC_GENERATIONS_V3 single-partition, there is no easy
way of preserving the num_ranges column. As it is used only for
sanity checking, we remove it to simplify the implementation.
In the following commits, we modify the CDC_GENERATIONS_V3 schema
to enable efficient clearing of obsolete CDC generation data.
These modifications make the current get_cdc_generation_mutations
work only for the CDC_GENERATIONS_V2 schema, and we need a new
function for CDC_GENERATIONS_V3, so we add the "_v2" suffix.
In the following commits, we add the CDC generation optimality
check to storage_service::raft_check_and_repair_cdc_streams so
that it doesn't create new CDC generations when unnecessary. Since
generation_service::check_and_repair_cdc_streams already has
this check, we extract it to the new is_cdc_generation_optimal
function to not duplicate the code.
`cdc::generation_service::make_new_cdc_generation` would create a new
CDC generation and insert it into the `CDC_GENERATIONS_V2` table these
days. For Raft-based topology chnages we'll do the data insertion
somewhere else - in topology coordinator code. So extract the parts for
calculating the CDC generation to free-standing functions (these are
almost pure calculations, modulo accessing RNG).
The `CDC_GENERATIONS_V3` table schema is a copy-paste of the
`CDC_GENERATIONS_V2` schema. The difference is that V2 lives in
`system_distributed_keyspace` and writes to it are distributed using
regular `storage_proxy` replication mechanisms based on the token ring.
The V3 table lives in `system_keyspace` and any mutations written to it
will go through group 0.
Also extend the `TOPOLOGY` schema with new columns:
- `new_cdc_generation_data_uuid` will be stored as part of a bootstrapping
node's `ring_slice`, it stores UUID of a newly introduced CDC
generation which is used as partition key for the `CDC_GENERATIONS_V3`
table to access this new generation's data. It's a regular column,
meaning that every row (corresponding to a node) will have its own.
- `current_cdc_generation_uuid` and `current_cdc_generation_timestamp`
together form the ID of the newest CDC generation in the cluster.
(the uuid is the data key for `CDC_GENERATIONS_V3`, the timestamp is
when the CDC generation starts operating). Those are static columns
since there's a single newest CDC generation.
The function would generate a mutation timestamp for itself, take it as
parameter instead. We'll use timestamps provided by Group 0 APIs when
creating CDC generations during Group 0- based topology changes.
It was a `static` function inside system_distributed_keyspace. Later it
will be used for another table living in system_keyspace, so move it
outside, to the CDC generations module, and make it accessible from
other places.
the default generated operator<=> is exactly the same as the
handcrafted one. so let compiler do its job. also, since
operator<=> is defaulted, there is no need to define operator==
anymore, so drop it as well.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.
Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.
The changes we applied mechanically with a script, except to
licenses/README.md.
Closes#9937
The database, keyspace, and table classes represent the replica-only
part of the objects after which they are named. Reading from a table
doesn't give you the full data, just the replica's view, and it is not
consistent since reconciliation is applied on the coordinator.
As a first step in acknowledging this, move the related files to
a replica/ subdirectory.
The generation service already has all it needs to do it. This
keeps storage_service smaller and less aware about cdc internals.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It has everything needed onboard. Only two arguments are required -- the
booststrap tokens and whether or not to inject a delay.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
A node with this commit, when creating a new CDC generation (during
bootstrap, upgrade, or when running checkAndRepairCdcStreams command)
will check for the CDC_GENERATIONS_V2 feature and:
- If the feature is enabled create the generation in the v2 format
and insert it into the new internal table. This is safe because
a node joins the feature only if it understands the new format.
- Otherwise create it in the v1 format, limiting its size as before,
and insert it into the old table.
The second case should only happen if we perform bootstrap or run
checkAndRepairCdcStreams in the middle of an upgrade procedure. On fully
upgraded clusters the feature shall be enabled, causing all new
generations to use the new format.
This is a new type of CDC generation identifiers. Compared to old IDs,
additionally to the timestamp it contains an UUID.
These new identifiers will allow a safer and more efficient algorithm of
introducing new generations into a cluster (introduced in a later commit).
For now, nodes keep using the old identifier format when creating new
generations and whenever they learn about a new CDC generation from gossip
they assume that it also is stored in the v1 format. But they do know how
to (de)serialize the second format and how to persist new identifiers in
local tables.
This is a follow-up to the previous commit.
Each CDC generation has a timestamp which denotes a logical point in time
when this generation starts operating. That same timestamp is
used to identify the CDC generation. We use this identification scheme
to exchange CDC generations around the cluster.
However, the fact that a generation's timestamp is used as an ID for
this generation is an implementation detail of the currently used method
of managing CDC generations.
Places in the code that deal with the timestamp, e.g. functions which
take it as an argument (such as handle_cdc_generation) are often
interested in the ID aspect, not the "when does the generation start
operating" aspect. They don't care that the ID is a `db_clock::time_point`.
They may sometimes want to retrieve the time point given the ID (such as
do_handle_cdc_generation when it calls `cdc::metadata::insert`),
but they don't care about the fact that the time point actually IS the ID.
In the future we may actually change the specific type of the ID if we
modify the generation management algorithms.
This commit is an intermediate step that will ease the transition in the
future. It introduces a new type, `cdc::generation_id`. Inside it contains
the timestamp, so:
1. if a piece of code doesn't care about the timestamp, it just passes
the ID around
2. if it does care, it can simply access it using the `get_ts` function.
The fact that `get_ts` simply accesses the ID's only field is an
implementation detail.
Using the occasion, we change the `do_handle_cdc_generation_intercept...`
function to be a standard function, not a coroutine. It turns out that -
depending on the shape of the passed-in argument - the function would
sometimes miscompile (the compiled code would not copy the argument to the
coroutine frame).
Each CDC generation always has a timestamp, but the fact that the
timestamp identifies the generation is an implementation detail.
We abstract away from this detail by using a more generic naming scheme:
a generation "identifier" (whatever that is - a timestamp or something
else).
It's possible that a CDC generation will be identified by more than a
timestamp in the (near) future.
The actual string gossiped by nodes in their application state is left
as "CDC_STREAMS_TIMESTAMP" for backward compatibility.
Some stale comments have been updated.
This function retrieves the persisted timestamp of the last known CDC
generation (which this node is currently gossiping to other nodes).
It checks that the timestamp is present; if not, it throws an error.
The check is unnecessary. It's used only in a quite esoteric place
(start_gossiping, which implements an almost-never-used API call),
and it's fine if the timestamp is gone - in start_gossiping,
we can start gossiping the tokens without the CDC generation timestamp
(well, if the timestamp is not present in system tables, something
weird must have happened, but that doesn't mean we can't resume
gossiping - fixing CDC generation management in such a case is
a separate problem).
Nodes automatically ensure that the latest CDC generation's list of
streams is present in the streams description table. When a new
generation appears, we only need to update the table for this
generation; old generations are already inserted.
However, we've changed the description table (from
`cdc_streams_descriptions` to `cdc_streams_descriptions_v2`). The
existing mechanism only ensures that the latest generation appears in
the new description table. This commit adds an additional procedure that
rewrites the older generations as well, if we find that it is necessary
to do so (i.e. when some CDC log tables may contain data in these
generations).
Currently, whole topology description for CDC is stored in a single row.
This means that for a large cluster of strong machines (say 100 nodes 64
cpus each), the size of the topology description can reach 32MB.
This causes multiple problems. First of all, there's a hard limit on
mutation size that can be written to Scylla. It's related to commit log
block size which is 16MB by default. Mutations bigger than that can't be
saved. Moreover, such big partitions/rows cause reactor stalls and
negatively influence latency of other requests.
This patch limits the size of topology description to about 4MB. This is
done by reducing the number of CDC streams per vnode and can lead to CDC
data not being fully colocated with Base Table data on shards. It can
impact performance and consistency of data.
This is just a quick fix to make it easily backportable. A full solution
to the problem is under development.
For more details see #7961, #7993 and #7985.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
The meaning of the parameter changes from defining whether the function
is called in testing environment to deciding whether a delay should be
added to a timestamp of a newly created CDC generation.
This is a preparation for improvement in the following patch that does
not always add delay to every node but only to non-first node.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Following patch enables CDC by default and this means CDC has to work
will all the clusters now.
There is a problematic case when existing cluster with no CDC support
is stopped, all the binaries are updated to newer version with
CDC enabled by default. In such case, nodes know that they are already
members of the cluster but they can't find any CDC generation so they
will try to create one. This creation may fail due to lack of QUORUM
for the write.
Before this patch such situation would lead to node failing to start.
After the change, the node will start but CDC generation will be
missing. This will mean CDC won't be able to work on such cluster before
nodetool checkAndRepairCdcStreams is run to fix the CDC generation.
We still fail to bootstrap if the creation of CDC generation fails.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Fixes#7345Fixes#7346
Do a more efficient collection skip when doing paging, instead of
iterating the full sets.
Ensure some semblance of sanity in the parent-child relationship
between shards by ensuring token order sorting and finding the
apparent previous ID coverting the approximate range of new gen.
Fix endsequencenumber generation by looking at whether we are
last gen or not, instead of the (not filled in) 'expired' column.
Commit a6ad70d3da changed the format of
stream IDs: the lower 8 bytes were previously generated randomly, now
some of them have semantics. In particular, the least significant byte
contains a version (stream IDs might evolve with further releases).
This is a backward-incompatible change: the code won't properly handle
stream IDs with all lower 8 bytes generated randomly. To protect us from
subtle bugs, the code has an assertion that checks the stream ID's
version.
This means that if an experimental user used CDC before the change and
then upgraded, they might hit the assertion when a node attempts to
retrieve a CDC generation with old stream IDs from the CDC description
tables and then decode it.
In effect, the user won't even be able to start a node.
Similarly as with the case described in
d89b7a0548, the simplest fix is to rename
the tables. This fix must get merged in before CDC goes out of
experimental.
Now, if the user upgrades their cluster from a pre-rename version, the
node will simply complain that it can't obtain the CDC generation
instead of preventing the cluster from working. The user will be able to
use CDC after running checkAndRepairCDCStreams.
Since a new table is added to the system_distributed keyspace, the
cluster's schema has changed, so sstables and digests need to be
regenerated for schema_digest_test.
Fixes#6948
Changes the stream_id format from
<token:64>:<rand:64>
to
<token:64>:<rand:38><index:22><version:4>
The code will attempt to assert version match when
presented with a stored id (i.e. construct from bytes).
This means that ID:s created by previous (experimental)
versions will break.
Moves the ID encoding fully into the ID class, and makes
the code path private for the topology generation code
path.
Removes some superflous accessors but adds accessors for
token, version and index. (For alternator etc).
Commit 968177da04 has changed the schema
of cdc_topology_description and cdc_description tables in the
system_distributed keyspace.
Unfortunately this was a backwards-incompatible change: these tables
would always be created, irrespective of whether or not "experimental"
was enabled. They just wouldn't be populated with experimental=off.
If the user now tries to upgrade Scylla from a version before this change
to a version after this change, it will work as long as CDC is protected
b the experimental flag and the flag is off.
However, if we drop the flag, or if the user turns experimental on,
weird things will happen, such as nodes refusing to start because they
try to populate cdc_topology_description while assuming a different schema
for this table.
The simplest fix for this problem is to rename the tables. This fix must
get merged in before CDC goes out of experimental.
If the user upgrades his cluster from a pre-rename version, he will simply
have two garbage tables that he is free to delete after upgrading.
sstables and digests need to be regenerated for schema_digest_test since
this commit effectively adds new tables to the system_distributed keyspace.
This doesn't result in schema disagreement because the table is
announced to all nodes through the migration manager.
In new CDC Log format stream_id is represented by a single
blob column so it makes sense to store it in the same form
everywhere - including internal CDC tables.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
New CDC Log format stores stream ids as blobs.
It makes sense to keep them internally in the same form.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
CDC can get all it needs from a config and does not need
partitioner.
For base table specific operations CDC is using partitioner
from that table (obtained with schema::get_partitioner).
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>