The database::get_snapshot_details() method returns a collection of all snapshots for every ks.cf, and several auxiliary *snapshot_details* structures have grown around it. This PR keeps only one "details" structure and cleans up the way it propagates from the database up to the respective API calls.
Closes scylladb/scylladb#18317
* github.com:scylladb/scylladb:
snapshot_ctl: Brush up true_snapshots_size() internals
snapshot_ctl: Remove unused details struct
snapshot_ctl: No double recoding of details
database,snapshots: Move database::snapshot_details into snapshot_ctl
database,snapshots: Make database::get_snapshot_details() return map, not vector
table,snapshots: Move table::snapshot_details into snapshot_ctl
Currently database::get_snapshot_details() returns a collection of
snapshots. The snapshot_ctl converts this collection into a
similar-looking one with slightly different structures inside. The
resulting collection is converted one more time on the API layer into
another similar-looking map.
This patch removes the intermediate conversion.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
So that it's in-sync with table::get_snapshot_details(). Next patches
will improve this place even further.
Also, there can be many snapshots and vector can grow large, but that's
less of an issue here.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
since we do not rely on FMT_DEPRECATED_OSTREAM to define the
fmt::formatter for us anymore, let's stop defining `FMT_DEPRECATED_OSTREAM`.
in this change,
* utils: drop the range formatters in to_string.hh and to_string.c, as
we don't use them anymore. and the tests for them in
test/boost/string_format_test.cc are removed accordingly.
* utils: use fmt to print chunk_vector and small_vector. as
we are not able to print the elements using operator<< anymore
after switching to {fmt} formatters.
* test/boost: specialize fmt::detail::is_std_string_like<bytes>
due to a bug in {fmt} v9, {fmt} fails to format a range whose
element type is `basic_sstring<int8_t>`, as it considers it
as a string-like type, but `basic_sstring<int8_t>`'s char type
is signed char, not char. this issue does not exist in {fmt} v10,
so, in this change, we add a workaround to explicitly specialize
the type trait to ensure that {fmt} formats this type using its
`fmt::formatter` specialization instead of trying to format it
as a string. also, {fmt}'s generic ranges formatter calls the
pair formatter's `set_brackets()` and `set_separator()` methods
when printing the range, but the operator<< based formatter does not
provide these methods, so we have to include this change in the change
switching to {fmt}; otherwise the change specializing
`fmt::detail::is_std_string_like<bytes>` won't compile.
* test/boost: in tests, we use `BOOST_REQUIRE_EQUAL()` and its friends
for comparing values. but without the operator<< based formatters,
Boost.Test would not be able to print them. after removing
the homebrew formatters, we need to use the generic
`boost_test_print_type()` helper to do this job. so we are
including `test_utils.hh` in tests so that we can print
the formattable types.
* treewide: add `#include "utils/to_string.hh"` where
`fmt::formatter<optional<>>` is used.
* configure.py: do not define FMT_DEPRECATED_OSTREAM
* cmake: do not define FMT_DEPRECATED_OSTREAM
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we include `fmt/ranges.h` and/or `fmt/std.h`
for formatting the container types, like vector, map,
optional and variant using {fmt} instead of the homebrew
formatter based on operator<<.
with this change, the changes adding fmt::formatter and
the changes using ostream formatter explicitly, we are
allowed to drop `FMT_DEPRECATED_OSTREAM` macro.
Refs scylladb#13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
before v10, {fmt} provides a specialization of `fmt::formatter<..>`
for `std::string_view` as well as a specialization of `fmt::formatter<..>`
for `fmt::string_view`, which is a built-in implementation in {fmt} for
compatibility with pre-C++17 compilers. this type is used even if the code
is compiled with a C++ standard greater than or equal to C++17. also, before
v10, `fmt::formatter<std::string_view>::format()` is defined so it accepts
`std::string_view`. after v10, `fmt::formatter<std::string_view>` still
exists, but it is now defined using the `format_as()` machinery, so its
`format()` method does not actually accept `std::string_view`; it
accepts `fmt::string_view`, as the former can be converted to
`fmt::string_view`.
this is why we can inherit from `fmt::formatter<std::string_view>` and
use `formatter<std::string_view>::format(foo, ctx);` to implement the
`format()` method with {fmt} v9, but we cannot do this with {fmt} v10,
and we would have following compilation failure:
```
FAILED: service/CMakeFiles/service.dir/RelWithDebInfo/topology_state_machine.cc.o
/home/kefu/.local/bin/clang++ -DFMT_DEPRECATED_OSTREAM -DFMT_SHARED -DSCYLLA_BUILD_MODE=release -DSEASTAR_API_LEVEL=7 -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SSTRING -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"RelWithDebInfo\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/gen -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/seastar/gen/include -I/home/kefu/dev/scylladb/build/seastar/gen/src -ffunction-sections -fdata-sections -O3 -g -gz -std=gnu++20 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-enum-constexpr-conversion -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb=. -march=westmere -mllvm -inline-threshold=2500 -fno-slp-vectorize -U_FORTIFY_SOURCE -Werror=unused-result -MD -MT service/CMakeFiles/service.dir/RelWithDebInfo/topology_state_machine.cc.o -MF service/CMakeFiles/service.dir/RelWithDebInfo/topology_state_machine.cc.o.d -o service/CMakeFiles/service.dir/RelWithDebInfo/topology_state_machine.cc.o -c /home/kefu/dev/scylladb/service/topology_state_machine.cc
/home/kefu/dev/scylladb/service/topology_state_machine.cc:254:41: error: no matching member function for call to 'format'
254 | return formatter<std::string_view>::format(it->second, ctx);
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~
/usr/include/fmt/core.h:2759:22: note: candidate function template not viable: no known conversion from 'seastar::basic_sstring<char, unsigned int, 15>' to 'const fmt::basic_string_view<char>' for 1st argument
2759 | FMT_CONSTEXPR auto format(const T& val, FormatContext& ctx) const
| ^ ~~~~~~~~~~~~
```
because the inherited `format()` method actually comes from
`fmt::formatter<fmt::string_view>`. to reduce the confusion, in this
change, we just inherit from `fmt::formatter<string_view>`, where
`string_view` is actually `fmt::string_view`. this follows
the document at
https://fmt.dev/latest/api.html#formatting-user-defined-types,
and since there is less indirection under the hood -- we do not
use the specialization created by `FMT_FORMAT_AS` which inherits
from `formatter<fmt::string_view>` -- hopefully this can improve
the compilation speed a little bit. also, this change addresses
the build failure with {fmt} v10.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#18299
If the coordinator node is killed, restarted, or otherwise becomes
inoperable during a topology operation, a new coordinator should be
elected, the operation should be aborted, and the cluster should be
rolled back.
Error injection will be used to kill the coordinator before streaming
starts.
Closes scylladb/scylladb#16197
instead of using `operator<<`, use `fmt::print()` to
format and print, so we can ditch the `operator<<`-based formatters.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#18259
since Boost.Test relies on operator<< or `boost_test_print_type()`
to print the value of variables being compared, instead of defining
the fallback formatter of `boost_test_print_type()` for each
individual test, let's define it in `test/lib/test_utils.hh`, so
that it can be shared across tests.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#18260
The cql-pytest framework allows running tests also against Cassandra,
but developers need to install Cassandra on their own because modern
distributions such as Fedora no longer carry a Cassandra package.
This patch adds clear and easy to follow (I think) instructions on how
to download a pre-compiled Cassandra, or alternatively how to download
and build Cassandra from source - and how either can be used with the
test/cql-pytest/run-cassandra script.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#18138
For view builder draining there's dedicated deferred action in main while all other services that need to be drained do it via storage_service. The latter is to unify shutdown for services and to make `nodetool drain` drain everything, not just some part of those. This PR makes view builder drain look the same. As a side effect it also moves `mark_existing_views_as_built` from storage service to view builder and generalizes this marking code inside view builder itself.
refs: #2737
refs: #2795
Closes scylladb/scylladb#16558
* github.com:scylladb/scylladb:
storage_service: Drain view builder on drain too
view_builder: Generalize mark_as_built(view_ptr) method
view_builder: Move mark_existing_views_as_built from storage service
storage_service: Add view_builder& reference
main,cql_test_env: Move view_builder start up (and make unconditional)
Some tests want to ignore out_of_range exception in continuation and go
the longer route for that
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#18216
In testing, we've observed multiple cases where nodes would fail to
observe updated application states of other nodes in gossiper.
For example:
- in scylladb/scylladb#16902, a node would finish bootstrapping and enter
NORMAL state, propagating this information through gossiper. However,
other nodes would never observe that the node entered NORMAL state,
still thinking that it is in joining state. This would lead to further
bad consequences down the line.
- in scylladb/scylladb#15393, a node got stuck in bootstrap, waiting for
schema versions to converge. Convergence would never be achieved and the
test eventually timed out. The node was observing outdated schema state
of some existing node in gossip.
I created a test that would bootstrap 3 nodes, then wait until they all
observe each other as NORMAL, with timeout. Unfortunately, thousands of
runs of this test on different machines failed to reproduce the problem.
After banging my head against the wall failing to reproduce, I decided
to sprinkle randomized sleeps across multiple places in gossiper code
and finally: the test started catching the problem in about 1 in 1000
runs.
With additional logging and additional head-banging, I determined
the root cause.
The following scenario can happen, 2 nodes are sufficient, let's call
them A and B:
- Node B calls `add_local_application_state` to update its gossiper
state, for example, to propagate its new NORMAL status.
- `add_local_application_state` takes a copy of the endpoint_state, and
updates the copy:
```
auto local_state = *ep_state_before;
for (auto& p : states) {
auto& state = p.first;
auto& value = p.second;
value = versioned_value::clone_with_higher_version(value);
local_state.add_application_state(state, value);
}
```
`clone_with_higher_version` bumps `version` inside
gms/version_generator.cc.
- `add_local_application_state` calls `gossiper.replicate(...)`
- `replicate` works in 2 phases to achieve exception safety: in first
phase it copies the updated `local_state` to all shards into a
separate map. In second phase the values from separate map are used to
overwrite the endpoint_state map used for gossiping.
Due to the cross-shard calls of the first phase, there is a yield before
the second phase. *During this yield* the following happens:
- `gossiper::run()` loop on B executes and bumps node B's `heart_beat`.
This uses the monotonic version_generator, so it uses a higher version
than the ones we used for states added above. Let's call this new version
X. Note that X is larger than the versions used by application_states
added above.
- now node B handles a SYN or ACK message from node A, creating
an ACK or ACK2 message in response. This message contains:
- old application states (NOT including the update described above,
because `replicate` is still sleeping before phase 2),
- but bumped heart_beat == X from `gossiper::run()` loop,
and sends the message.
- node A receives the message and remembers that the max
version across all states (including heart_beat) of node B is X.
This means that it will no longer request or apply states from node B
with versions smaller than X.
- `gossiper.replicate(...)` on B wakes up, and overwrites
endpoint_state with the ones it saved in phase 1. In particular it
reverts heart_beat back to smaller value, but the larger problem is that it
saves updated application_states that use versions smaller than X.
- now when node B sends the updated application_states in ACK or ACK2
message to node A, node A will ignore them, because their versions are
smaller than X. Or node B will never send them, because whenever node
A requests states from node B, it only requests states with versions >
X. Either way, node A will fail to observe new states of node B.
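The interleaving above can be replayed in a few lines. The following is a minimal single-threaded sketch, not ScyllaDB code: `version_generator`, `endpoint_state`, `peer_view` and `run_race` are hypothetical stand-ins, and the yield inside `replicate` is modelled only by the order of statements.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for the monotonic generator in gms/version_generator.cc.
struct version_generator {
    int next = 1;
    int get_next_version() { return next++; }
};

// Simplified endpoint state: a heart_beat version plus versioned application states.
struct endpoint_state {
    int heart_beat = 0;
    std::map<std::string, int> app_state_versions;  // state name -> version

    int max_version() const {
        int m = heart_beat;
        for (const auto& entry : app_state_versions) {
            if (entry.second > m) m = entry.second;
        }
        return m;
    }
};

// What node A remembers about node B: the highest version seen so far.
struct peer_view {
    int max_seen = 0;
    std::map<std::string, int> known;

    // Apply only states that are newer than anything seen before.
    void apply(const endpoint_state& remote) {
        const int seen_before = max_seen;
        for (const auto& entry : remote.app_state_versions) {
            if (entry.second > seen_before) known[entry.first] = entry.second;
        }
        if (remote.max_version() > max_seen) max_seen = remote.max_version();
    }
};

// Returns true if node A ends up observing B's STATUS update.
bool run_race(bool bump_heart_beat_during_yield) {
    version_generator gen;
    endpoint_state b_live;                        // what B gossips right now
    b_live.heart_beat = gen.get_next_version();
    peer_view a;
    a.apply(b_live);

    // add_local_application_state: copy, then add a state with a fresh version.
    endpoint_state b_copy = b_live;
    b_copy.app_state_versions["STATUS"] = gen.get_next_version();

    // replicate() phase 1 yields here.
    if (bump_heart_beat_during_yield) {
        b_live.heart_beat = gen.get_next_version();  // this is version X
        a.apply(b_live);                             // A now remembers X
    }

    // replicate() phase 2: overwrite the live state with the saved copy.
    b_live = b_copy;
    a.apply(b_live);
    return a.known.count("STATUS") > 0;
}
```

Calling `run_race(false)` models the case without a heart_beat bump during the yield, and `run_race(true)` models the problematic interleaving in which node A never records the STATUS update.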
If I understand correctly, this is a regression introduced in
38c2347a3c, which introduced a yield in
`replicate`. Before that, the updated state would be saved atomically on
shard 0, there could be no `heart_beat` bump in-between making a copy of
the local state, updating it, and then saving it.
With the description above, it's easy to make a consistent
reproducer for the problem -- introduce a longer sleep in
`add_local_application_state` before second phase of replicate, to
increase the chance that gossiper loop will execute and bump heart_beat
version during the yield. Further commit adds a test based on that.
The fix is to bump the heart_beat under local endpoint lock, which is
also taken by `replicate`.
The PR also adds a regression test.
Fixes: scylladb/scylladb#15393
Fixes: scylladb/scylladb#15602
Fixes: scylladb/scylladb#16668
Fixes: scylladb/scylladb#16902
Fixes: scylladb/scylladb#17493
Fixes: scylladb/scylladb#18118
Ref: scylladb/scylla-enterprise#3720
Closes scylladb/scylladb#18184
* github.com:scylladb/scylladb:
test: reproducer for missing gossiper updates
gossiper: lock local endpoint when updating heart_beat
For that the test case is modified to have 3 nodes and 2 replicas on
start. Existing test cases are changed slightly in the way "from" host
is detected.
Also, the final check for data presence is modified to check that hosts
in "replicas" have data and other hosts don't have it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Copied from the add_replica counterpart
TODO: Generalize common parts of move_tablet and add_|del_tablet_replica
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Storage service will need to drain the view builder on its own drain.
Also, on cluster join it marks existing views as built, although that
is the view builder's job. Both will be fixed by the next patches; this
one is a prerequisite.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Just starting sharded<view_builder> is lightweight; its constructor does
nothing but initialize on-board variables. The real work takes off in
view_builder::start(), which is not moved.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The current test in boost/cql_query_large_test::test_large_data only checks whether notifications for large rows and cells are written into the system keyspace. It doesn't check this for partitions.
This change adds this check for partitions.
Closes scylladb/scylladb#18189
* github.com:scylladb/scylladb:
test/boost: added test for large row count warning
test/boost: add test for writing large partition notifications
Repair memory limit includes only the size of frozen mutation
fragments in repair row. The size of other members of repair
row may grow uncontrollably and cause out of memory.
Modify what's counted to repair memory limit.
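To illustrate the difference between the two accounting schemes, here is a minimal sketch. `repair_row_sketch` and its members are hypothetical and much simpler than the real `repair_row`; the point is only that a size estimate should include the object itself and every buffer it owns, not just the frozen fragment.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical, simplified row; the real repair_row has different members.
struct repair_row_sketch {
    std::vector<char> frozen_fragment;         // serialized mutation fragment
    std::optional<std::string> partition_key;  // illustrative "other member"
    std::vector<char> hash_input;              // another illustrative member

    // Narrow accounting: only the frozen fragment is counted.
    std::size_t fragment_only_size() const {
        return frozen_fragment.size();
    }

    // Fuller accounting: the object itself plus the payload of each member.
    // This is an estimate; allocator overhead and spare capacity are ignored.
    std::size_t size() const {
        std::size_t total = sizeof(*this) + frozen_fragment.size() + hash_input.size();
        if (partition_key) {
            total += partition_key->size();
        }
        return total;
    }
};
```

With a small fragment and large auxiliary members, `fragment_only_size()` stays small while `size()` grows with everything the row holds, which is the gap the memory limit would otherwise miss.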
Fixes: #16710.
Closes scylladb/scylladb#17785
* github.com:scylladb/scylladb:
test: add test for repair_row::size()
repair: fix memory accounting in repair_row
When altering rf for a keyspace, all tablets in this ks will get more replicas. Part of this process is rebuilding tablets onto new node(s). This PR extends the tablets transition code to support rebuilding a tablet on a new replica.
fixes: #18030
Closes scylladb/scylladb#18082
* github.com:scylladb/scylladb:
test: Check data presense as well
test: Test how tablets are copied between nodes
test: Add sanity test for tablet migration
api: Add method to add replica to a tablet
tablet: Make leaving replica optional
The current test in boost/cql_query_large_test::test_large_data only checks whether notifications for large rows and cells are written into the system keyspace. It doesn't check this for partitions.
This change adds this check for partitions.
Other than making sure that system.tablets is updated with correct
replica set, it's also good to check that the data is present on the
respective nodes.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
A few months ago, in merge d3c1be9107,
we decided that if Scylla has the experimental "tablets" feature enabled,
new Alternator tables should use this feature by default - exactly like
this is the default for new CQL tables.
Sadly, it has now been decided to reverse this decision: we do not yet
trust LWT on tablets enough, and since Alternator often (if not always)
relies on LWT, we want Alternator tables to continue to use vnodes - not
tablets.
The fix is trivial - just changing the default. No test needed to change
because anyway, all Alternator tests work correctly on Scylla with the
tablets experimental feature disabled. I added a new test to enshrine
the fact that Alternator does not use tablets.
An unfortunate result of this patch will be that Alternator tables
created on versions with this patch (e.g., Scylla 6.0) will not use
tablets and will continue to not use tablets even if Scylla is upgraded
(currently, the use of tablets is decided at table creation time, and
there is no way to "upgrade" a vnode-based table to be tablet based).
This patch should be reverted as soon as LWT support matures on tablets.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#18157
`database::find_column_family()` throws no_such_column_family
if an unknown ks.cf is fed to it. and we call into this function
without checking for the existence of ks.cf first. since
"/storage_service/tablets/move" is a public interface, we should
translate this error to a better http error.
in this change, we check for the existence of the given ks.cf, and
throw an exception so that it can be caught by seastar::httpd::routers,
and converted to an HTTP error.
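The idea of checking up front and throwing something the HTTP layer understands can be sketched without Seastar. Everything below is hypothetical: `http_error`, `validate_table`, `move_tablet_status`, the message text and the 400 status code are assumptions made for illustration, not the actual handler.

```cpp
#include <cassert>
#include <set>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical exception carrying an HTTP status; it stands in for whatever
// exception type the HTTP routing layer knows how to translate.
struct http_error : std::runtime_error {
    int status;
    http_error(int status_code, const std::string& msg)
        : std::runtime_error(msg), status(status_code) {}
};

using table_set = std::set<std::pair<std::string, std::string>>;

// Validate ks.cf first, so an unknown table becomes a client error
// instead of an internal exception escaping the handler.
void validate_table(const table_set& known, const std::string& ks, const std::string& cf) {
    if (known.count({ks, cf}) == 0) {
        throw http_error(400, "Can't find a column family " + ks + "." + cf);
    }
}

// Returns the HTTP status a handler would report for the request.
int move_tablet_status(const table_set& known, const std::string& ks, const std::string& cf) {
    try {
        validate_table(known, ks, cf);
        return 200;  // the real handler would go on to submit the move here
    } catch (const http_error& e) {
        return e.status;
    }
}
```

A request naming an existing table passes validation, while an unknown ks.cf yields the client-error status instead of propagating an internal lookup failure.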
Fixes #17198
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closes scylladb/scylladb#17217
This patches the previously introduced test by introducing the 'action'
test parameter and tweaking the final checking assertions around tablet
replicas read from system.tablets
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It just checks that after api call to move_tablet the resulting replica
is in expected state. This test will be later expanded to check for
rebuild transition.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The new API submits rebuild transition with new replicas set to be old
(current) replicas plus the provided one. It looks and acts like the
move_tablet API call with several changes:
- lacks the "source" replica argument
- submits "rebuild" transition kind
- cross racks checks are not performed
The 'force' argument is inherited from move_tablet, but is unused now
and is left for future use.
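As a rough illustration of how the request differs from move_tablet, the sketch below builds the new replica set as "current replicas plus the provided one" and tags it with a rebuild kind. All type and function names here are hypothetical and do not mirror the real tablet structures.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical replica descriptor: a host identifier and a shard number.
struct replica_sketch {
    std::string host;
    unsigned shard;
};

// Hypothetical transition kinds; only the two mentioned in the text.
enum class transition_kind_sketch { migration, rebuild };

struct transition_request_sketch {
    transition_kind_sketch kind;
    std::vector<replica_sketch> new_replicas;
};

// Unlike a move, there is no "source" replica to drop: the new set is the
// current set with the added replica appended.
transition_request_sketch make_add_replica_request(const std::vector<replica_sketch>& current,
                                                   const replica_sketch& added) {
    transition_request_sketch req{transition_kind_sketch::rebuild, current};
    req.new_replicas.push_back(added);
    return req;
}
```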
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In test_exception_safety_of_update_from_memtable, we have a potential
throw from external_updater.
external_updater is supposed to be infallible.
Scylla currently aborts when an external_updater throws, so a throw from
there just fails the test.
This isn't intended. We aren't testing external_updater in this test.
Fixes #18163
Closes scylladb/scylladb#18171
Before this patch, the selection of the auth version depended
on the consistent topology feature, but this feature is disabled
during the raft recovery procedure. We therefore need to persist
the version somewhere so that we do not switch back to v1, as this
is not supported.
During recovery, auth works in read-only mode; writes
will fail.
Fixes https://github.com/scylladb/scylladb/issues/17736
Closes scylladb/scylladb#18039
* github.com:scylladb/scylladb:
auth: keep auth version in scylla_local
auth: coroutinize service::start
They result in poor distribution and poor cardinality, interfering with
tests which want to generate N partitions or rows.
Fixes: #17821
Closes scylladb/scylladb#17856
Before this patch, the selection of the auth version depended
on the consistent topology feature, but this feature is disabled
during the raft recovery procedure. We therefore need to persist
the version somewhere so that we do not switch back to v1, as this
is not supported.
During recovery, auth works in read-only mode; writes
will fail.
Added support to track and limit the memory usage by sstable components. A reclaimable component of an SSTable is one from which memory can be reclaimed. SSTables and their managers now track such reclaimable memory and limit the component memory usage accordingly. A new configuration variable defines the memory reclaim threshold. If the total memory of the reclaimable components exceeds this limit, memory will be reclaimed to keep the usage under the limit. This PR considers only the bloom filters as reclaimable and adds support to track and limit them as required.
The feature can be manually verified by doing the following :
1. run a single-node single-shard 1GB cluster
2. create a table with bloom-filter-false-positive-chance of 0.001 (to intentionally cause large bloom filter)
3. populate with tiny partitions
4. watch the bloom filter metrics get capped at 100MB
The default value of the `components_memory_reclaim_threshold` config variable which controls the reclamation process is `.1`. This can also be reduced further during manual tests to easily hit the threshold and verify the feature.
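The threshold arithmetic can be sketched as follows. This is a simplified model under stated assumptions: `reclaim_until_under_limit` is a hypothetical helper, and evicting the largest filter first is an assumption made for illustration; the source does not say in which order components are reclaimed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Drops entries from `filters` (sizes in bytes) until their total fits under
// available_memory * threshold. Returns the total still held afterwards.
// Assumption: the largest filter is reclaimed first.
std::size_t reclaim_until_under_limit(std::vector<std::size_t>& filters,
                                      std::size_t available_memory,
                                      double threshold) {
    const auto limit = static_cast<std::size_t>(available_memory * threshold);
    std::size_t total = 0;
    for (std::size_t s : filters) {
        total += s;
    }
    std::sort(filters.begin(), filters.end());  // ascending, largest at the back
    while (total > limit && !filters.empty()) {
        total -= filters.back();
        filters.pop_back();
    }
    return total;
}
```

With 1GB of available memory and the default threshold of `.1`, the limit works out to roughly 100MB, which matches the cap mentioned in the manual verification steps.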
Fixes #17747
Closes scylladb/scylladb#17771
* github.com:scylladb/scylladb:
test_bloom_filter.py: disable reclaiming memory from components
sstable_datafile_test: add tests to verify auto reclamation of components
test/lib: allow overriding available memory via test_env_config
sstables_manager: support reclaiming memory from components
sstables_manager: store available memory size
sstables_manager: add variable to track component memory usage
db/config: add a new variable to limit memory used by table components
sstable_datafile_test: add testcase to verify reclamation from sstables
sstables: support reclaiming memory from components
This reverts commit 97b203b1af.
since Seastar provides the formatter, it's not necessary to vendor it in
scylladb anymore.
Refs #13245
Closes scylladb/scylladb#18114
Disabled reclaiming memory from sstable components in the testcase as it
interferes with the false positive calculation.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
The cluster manager library doesn't set the asan/ubsan options
to abort on error and create core dumps; this makes debugging much
harder.
Fix by preparing the environment correctly.
Fixes scylladb/scylladb#17510
Closes scylladb/scylladb#17511
currently, our homebrew formatter formats `std::map` like
```
{{k1, v1}, {k2, v2}}
```
while {fmt} formats a map like:
```
{k1: v1, k2: v2}
```
and if the type of key/value is string, {fmt} quotes it, so a
compaction strategy option is formatted like
```
{"max_threshold": "1"}
```
before switching the formatter to the ones supported by {fmt},
let's update the test to match with the new format. this should
reduce the overhead of reviewing the change of switching the
formatter. we can revert this change, and use a simpler approach
after the change of formatter lands.
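The two layouts can be compared side by side with a small standard-library sketch. It does not use {fmt} or the homebrew formatter; `format_homebrew` and `format_fmt_style` are hypothetical helpers that only reproduce the output shapes described above, and the quoted variant does no escaping.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

using options = std::map<std::string, std::string>;

// Old homebrew layout: {{k1, v1}, {k2, v2}}
std::string format_homebrew(const options& m) {
    std::ostringstream os;
    os << "{";
    bool first = true;
    for (const auto& [k, v] : m) {
        if (!first) os << ", ";
        os << "{" << k << ", " << v << "}";
        first = false;
    }
    os << "}";
    return os.str();
}

// {fmt}-style layout with quoted strings: {"k1": "v1", "k2": "v2"}
// Note: no escaping of quotes inside keys or values is attempted here.
std::string format_fmt_style(const options& m) {
    std::ostringstream os;
    os << "{";
    bool first = true;
    for (const auto& [k, v] : m) {
        if (!first) os << ", ";
        os << '"' << k << "\": \"" << v << '"';
        first = false;
    }
    os << "}";
    return os.str();
}
```

A test that has to accept either formatter can build both strings for the same option map and match the error message against each of them.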
Closes scylladb/scylladb#18058
* github.com:scylladb/scylladb:
test/cql-pytest: match error message formated using {fmt}
test/cql-pytest: extract scylla_error() for not allowed options test
Test.py uses `ring_delay_ms = 0` by default. CDC creates generation's timestamp by adding `ring_delay_ms` to it.
In this test, nodes are learning about new generations (introduced by upgrade procedure and then by node bootstrap) concurrently with doing writes that should go to these generations.
Because of `ring_delay_ms = 0`, the generation could have been committed when it should have already been in use.
This can be seen in the following logs from a node:
```
ERROR 2024-03-22 12:29:55,431 [shard 0:strm] cdc - just learned about a CDC generation newer than the one used the last time streams were retrieved. This generation, or some newer one, should have been used instead (new generation's timestamp: 2024/03/22 12:29:54, last time streams were retrieved: 2024/03/22 12:29:55). The new generation probably arrived too late due to a network partition and we've made a write using the wrong set streams.
```
Creating writes during such a generation can result in assigning them a wrong generation or a failure. Failure may occur if it hits the short time window when `generation_service::handle_cdc_generation(cdc::generation_id_v2)` has executed
`svc._cdc_metadata.prepare(...)` but `_cdc_metadata.insert(...)` has not yet been executed. With a nonzero ring_delay_ms it's not a problem, because during this time window, the generation should not be in use.
Write can fail with the following response from a node:
```
cdc: attempted to get a stream from a generation that we know about, but weren't able to retrieve (generation timestamp: 2024/03/22 12:29:54, write timestamp: 2024/03/22 12:29:55). Make sure that the replicas which contain this generation's data are alive and reachable from this node.
```
Set ring_delay_ms to 15000 for the debug mode and 5000 in other modes. Wait for the last generation to be in use and sleep one second to make sure there are writes to the CDC table in this generation.
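The timing argument can be made concrete with a small model. This is a sketch under an assumed simplification: the generation's timestamp is taken to be its creation time plus the ring delay, and a node "learns too late" when the generation reaches it after that timestamp. The function names are hypothetical.

```cpp
#include <cassert>
#include <chrono>

using ms = std::chrono::milliseconds;

// Assumed model: a generation becomes active at creation_time + ring_delay.
ms generation_timestamp(ms creation_time, ms ring_delay) {
    return creation_time + ring_delay;
}

// A node learns about the generation too late if the news arrives after
// the generation should already have been in use.
bool learned_too_late(ms creation_time, ms ring_delay, ms propagation_delay) {
    const ms learned_at = creation_time + propagation_delay;
    return learned_at > generation_timestamp(creation_time, ring_delay);
}
```

With a zero ring delay, any propagation delay at all puts the node behind the generation, as in the log above where the generation's timestamp precedes the last stream retrieval by one second. With a ring delay of 5000 ms, a one-second propagation delay still leaves the node ahead of the activation time.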
Fixes scylladb/scylladb#17977
Reapply b4144d14c6.
Closes scylladb/scylladb#17998
* github.com:scylladb/scylladb:
test.py: test_topology_upgrade_basic: make ring_delay_ms nonzero
Reapply "test.py: adjust the test for topology upgrade to write to and read from CDC tables"
currently, our homebrew formatter formats `std::map` like
    {{k1, v1}, {k2, v2}}
while {fmt} formats a map like:
    {k1: v1, k2: v2}
and if the type of key/value is string, {fmt} quotes it, so a
compaction strategy option is formatted like
    {"max_threshold": "1"}
before switching the formatter to the ones supported by {fmt},
let's update the test to match with the new format. this should
reduce the overhead of reviewing the change of switching the
formatter. we can revert this change, and use a simpler approach
after the change of formatter lands.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
currently, our homebrew formatter formats `std::map` like
    {{k1, v1}, {k2, v2}}
while {fmt} formats a map like:
    {k1: v1, k2: v2}
and if the type of key/value is string, {fmt} quotes it, so a
compaction strategy option is formatted like
    {"max_threshold": "1"}
as we are switching to the formatters provided by {fmt}, it would be
better to support its convention directly.
so, in this change, to prepare the change, before migrating to
{fmt}, let's refactor the test to support both formats by
extracting a helper to format the error message, so that we can
change it to emit both formats.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Test.py uses `ring_delay_ms = 0` by default. CDC creates generation's timestamp
by adding `ring_delay_ms` to it.
In this test, nodes are learning about new generations (introduced by upgrade
procedure and then by node bootstrap) concurrently with doing writes that
should go to these generations.
Because of `ring_delay_ms = 0`, the generation could have been committed when
it should have already been in use.
This can be seen in the following logs from a node:
```
ERROR 2024-03-22 12:29:55,431 [shard 0:strm] cdc - just learned about a CDC generation newer than the one used the last time streams were retrieved. This generation, or some newer one, should have been used instead (new generation's timestamp: 2024/03/22 12:29:54, last time streams were retrieved: 2024/03/22 12:29:55). The new generation probably arrived too late due to a network partition and we've made a write using the wrong set streams.
```
Creating writes during such a generation can result in assigning them a wrong
generation or a failure. Failure may occur if it hits the short time window when
`generation_service::handle_cdc_generation(cdc::generation_id_v2)` has executed
`svc._cdc_metadata.prepare(...)` but `_cdc_metadata.insert(...)` has not yet
been executed. With a nonzero ring_delay_ms it's not a problem, because during
this time window, the generation should not be in use.
Write can fail with the following response from a node:
```
cdc: attempted to get a stream from a generation that we know about, but weren't able to retrieve (generation timestamp: 2024/03/22 12:29:54, write timestamp: 2024/03/22 12:29:55). Make sure that the replicas which contain this generation's data are alive and reachable from this node.
```
Set ring_delay_ms to 15000 for the debug mode and 5000 in other modes.
Wait for the last generation to be in use and sleep one second to make sure
there are writes to the CDC table in this generation.
Fixes #17977
this series includes test related changes to enable us to drop `FMT_DEPRECATED_OSTREAM` deprecated in {fmt} v10.
Refs #13245
Closes scylladb/scylladb#18054
* github.com:scylladb/scylladb:
test: unit: add fmt::formatter for test_data in tests
test/lib: do not print with fmt::to_string()
test/boost: print runtime_error using e.what()
* 'gleb/raft_snapshot_rpc-v3' of github.com:scylladb/scylla-dev:
raft topology: drop RAFT_PULL_TOPOLOGY_SNAPSHOT RPC
Use correct limit for raft commands throughout the code.