Waiting for gossip to settle slows down the bootstrap of the cluster.
It is safe to disable it if the topology is based on Raft.
Fixesscylladb/scylladb#16055Closesscylladb/scylladb#17960
in 2f0f53ac, we added logging of parsed command line options so that we
can see how scylla is launched in case it fails to boot. but when scylla
is called interactively in console. this echo is a little bit annoying.
see following console session
```console
$ scylla --help-loggers
Scylla version 5.5.0~dev-0.20240419.3c9651adf297 with build-id 7dd6a110e608535e5c259a03548eda6517ab4bde starting ...
command used: "./RelWithDebInfo/scylla --help-loggers"
pid: 996503
parsed command line options: [help-loggers]
Available loggers:
BatchStatement
LeveledManifest
alter_keyspace
alter_table
...
```
so in this change, we check if the stdin is associated with a terminal
device, if that the case, we don't print the scylla version, parsed
command line and pid. and the interactive session looks like:
```console
$ scylla --help-loggers
Available loggers:
BatchStatement
LeveledManifest
alter_keyspace
alter_table
```
no more distracting information printed. the original behavior
can be tested like:
```console
$ : | ./RelWithDebInfo/scylla --help-loggers
```
assuming scylla is always launched with systemd, which connects
stdin to /dev/null. see
https://www.freedesktop.org/software/systemd/man/latest/systemd.exec.html#Logging%20and%20Standard%20Input/Output
. so this behavior is preserved with this change.
Refs #4203
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#18309
This hourly reevaluation is there to help tablets that have very low
write activity, which can go a long time without flushing a memtable,
and it's important to reevaluate compaction as data can get expired.
Today it can happen that we reevaluate a table that is being compacted
actively, which is waste of cpu as the reevaluation will happen anyway
when there are changes to sstable set. This waste can be amplified with
a significant tablet count in a given shard.
Eventually, we could make the revaluation time per table based on
expiration histogram, but until we get there, let's avoid this waste
by only reevaluating tables that are compaction idle for more than 1h.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#18280
in order to reduce the confusion like:
> I cannot find foobar in the list, is it supported?
also, take this opportunity to use "console" instead of "shell" for
rendering the code block. it's a better fit in this case. since we
are using pygment for syntax highlighting,
see https://pygments.org/docs/lexers/#pygments.lexers.shell.BashSessionLexer
for details on the "console" lexer.
and add a prompt before the command line, so that "console" lexer
can render the command line and output better.
also, add a note explaining that user should refer the output of
`scylla` to see the list of logger classes.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#18311
clang-19 complains with `-Wdeprecated-this-capture`:
```
/home/kefu/dev/scylladb/service/storage_service.cc:5837:22: error: implicit capture of 'this' with a capture default of '=' is deprecated [-Werror,-Wdeprecated-this-capture]
5837 | auto* node = get_token_metadata().get_topology().find_node(dst.host);
| ^
/home/kefu/dev/scylladb/service/storage_service.cc:5830:44: note: add an explicit capture of 'this' to capture '*this' by reference
5830 | co_await transit_tablet(table, token, [=] (const locator::tablet_map& tmap, api::timestamp_type write_timestamp) {
| ^
| , this
```
since https://open-std.org/JTC1/SC22/WG21/docs/papers/2018/p0806r2.html
was approved, see https://eel.is/c++draft/depr.capture.this. and newer
versions of C++ compilers implemented it, so we need to capture `this`
explicitly to be more standard compliant, and to be more future-proof
in this regard.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#18306
in in {fmt} before v10, it provides the specialization of `fmt::formatter<..>`
for `std::string_view` as well as the specialization of `fmt::formatter<..>`
for `fmt::string_view` which is an implementation builtin in {fmt} for
compatibility of pre-C++17. and this type is used even if the code is
compiled with C++ stadandard greater or equal to C++17. also, before v10,
the `fmt::formatter<std::string_view>::format()` is defined so it accepts
`std::string_view`. after v10, `fmt::formatter<std::string_view>` still
exists, but it is now defined using `format_as()` machinery, so it's
`format()` method does not actually accept `std::string_view`, it
accepts `fmt::string_view`, as the former can be converted to
`fmt::string_view`.
this is why we can inherit from `fmt::formatter<std::string_view>` and
use `formatter<std::string_view>::format(foo, ctx);` to implement the
`format()` method with {fmt} v9, but we cannot do this with {fmt} v10,
and we would have following compilation failure:
```
FAILED: service/CMakeFiles/service.dir/RelWithDebInfo/topology_state_machine.cc.o
/home/kefu/.local/bin/clang++ -DFMT_DEPRECATED_OSTREAM -DFMT_SHARED -DSCYLLA_BUILD_MODE=release -DSEASTAR_API_LEVEL=7 -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SSTRING -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"RelWithDebInfo\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/gen -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/seastar/gen/include -I/home/kefu/dev/scylladb/build/seastar/gen/src -ffunction-sections -fdata-sections -O3 -g -gz -std=gnu++20 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-enum-constexpr-conversion -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb=. -march=westmere -mllvm -inline-threshold=2500 -fno-slp-vectorize -U_FORTIFY_SOURCE -Werror=unused-result -MD -MT service/CMakeFiles/service.dir/RelWithDebInfo/topology_state_machine.cc.o -MF service/CMakeFiles/service.dir/RelWithDebInfo/topology_state_machine.cc.o.d -o service/CMakeFiles/service.dir/RelWithDebInfo/topology_state_machine.cc.o -c /home/kefu/dev/scylladb/service/topology_state_machine.cc
/home/kefu/dev/scylladb/service/topology_state_machine.cc:254:41: error: no matching member function for call to 'format'
254 | return formatter<std::string_view>::format(it->second, ctx);
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~
/usr/include/fmt/core.h:2759:22: note: candidate function template not viable: no known conversion from 'seastar::basic_sstring<char, unsigned int, 15>' to 'const fmt::basic_string_view<char>' for 1st argument
2759 | FMT_CONSTEXPR auto format(const T& val, FormatContext& ctx) const
| ^ ~~~~~~~~~~~~
```
because the inherited `format()` method actually comes from
`fmt::formatter<fmt::string_view>`. to reduce the confusion, in this
change, we just inherit from `fmt::format<string_view>`, where
`string_view` is actually `fmt::string_view`. this follows
the document at
https://fmt.dev/latest/api.html#formatting-user-defined-types,
and since there is less indirection under the hood -- we do not
use the specialization created by `FMT_FORMAT_AS` which inherit
from `formatter<fmt::string_view>`, hopefully this can improve
the compilation speed a little bit. also, this change addresses
the build failure with {fmt} v10.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#18299
This is complete version of #12786, since I take over the issue from @mykaul.
Update from original version are:
- Support ARM64 build (disable BOLT for now since it doesn't functioning)
- Changed toolchain settings to get current scylla able to build (support WASM, etc)
- Stop git clone scylladb repo, create git-archive of current scylla directory and import it
- Update Clang to 17.0.6
- Save entire clang directory for BUILD mode, not just /usr/bin/clang binary
- Implemented INSTALL_PREBUILT mode to install prebuilt image which built in BUILD mode
Note that this patch drops cross-build support of frozen toolchain, since building clang and scylla multiple time in qemu-user-static will very slow, it's not usable.
Instead, we should build the image for each architecture natively.
----
This is a different way attempting to combine building an optimized clang (using LTO, PGO and BOLT, based on compiling ScyllaDB) to dbuild. Per Avi's request, there are 3 options: skip this phase (which is the current default), build it and build + install it to the default path.
Fixes: #10985Fixes: scylladb/scylla-enterprise#2539Closesscylladb/scylladb#17196
* github.com:scylladb/scylladb:
toolchain: support building an optimized clang
configure.py: add --build-dir option
This commit includes updates related to replacing system_auth with system_auth_v2.
- The keyspace name system_auth is renamed to system_auth_v2.
- The procedures are updated to account for system_auth_v2.
- No longer required system_auth RF changes are removed from procedures.
- The information is added that if the consistent topology updates feature
was not enabled upon upgrade from 5.4, there are limitations or additional
steps to do (depending on the procedure).
The files with that kind of information are to be found in _common folders
and included as needed.
- The upgrade guide has been updated to reflect system_auth_v2 and related impacts.
Closesscylladb/scylladb#18077
when compiling the tree with clang-19, it complains:
```
/home/kefu/dev/scylladb/service/topology_coordinator.cc:1968:31: error: variable 'reject' set but not used [-Werror,-Wunused-but-set-variable]
1968 | if (auto* reject = std::get_if<join_node_response_params::rejected>(&validation_result)) {
| ^
1 error generated.
```
so, despite that we evaluate the assignment statement to see it
evaluates to true or false, the compiler still believes that the
variable is not used. probably, the value of the statement is not
dependent on the value of the value being assigned. either way,
let's use `std::holds_alternative<..>` instead of `std::get_if<..>`,
to silence this warning, and the code is a little bit more compacted,
in the sense of less tokens in the `if` statement.
in order to be self-contained, we take the opportunity to
include `<variant>` in this source file, as a function declared
in this header is used.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#18291
This patch updates the seastar submodule to get the latest safety patch for the metric layer.
The latest patch allows manipulating metric_families early in the
start-up process and is safer if someone chooses to aggregate summaries.
* seastar f3058414...8fabb30a (4):
> stall-analyser: improve stall pattern matching
> TLS: Move background BYE handshake to engine::run_in_background
> metrics.cc: Safer set_metric_family_configs
> src/core/metrics.cc: handle SUMMARY add operator
Closesscylladb/scylladb#18293
Recently there had been added add_tablet_replica and del_tablet_replica API calls that copy big portion of the existing move_tablet API call's logic. This PR generalizes the common parts
Closesscylladb/scylladb#18272
* github.com:scylladb/scylladb:
tablets: Generalize transition mutations preparation
tablets: Generalize tablet-already-in-transition check
tablets: Generalize raft communications for tablet transition API calls
tablets: Drop src vs dst equality check from move_tablet()
If coordinator node was killed, restarted, become not
operatable during topology operation, new coordinator should be elected,
operation should be aborted and cluster should be rolled back
Error injection will be used to kill the coordinator before streaming
starts
Closesscylladb/scylladb#16197
Current code uses non-raft path to pull the schema, which violates
group0 linearizability because the node will have latest schema but
miss group0 updates of other system tables. In particular,
system.tablets. This manifests as repair errors due to missing
tablet_map for a given table when trying to access it. Tablet map is
always created together with the table in the same group0 command.
When a node is bootstrapping, repair calls sync_schema() to make
sure local schema is up to date. This races with group0 catch up,
and if sync_schema() wins, repair may fail on misssing tablet map.
Fix by making sync_schema() do a group0 read barrier when in raft
mode.
Fixes#18002Closesscylladb/scylladb#18175
Just like all the other commands already have it. These commands didn't have documentation at the point where they were implemented, hence the missing doc link.
The links don't work yet, but they will work once we release 6.0 and the current master documentation is promoted to stable.
Closesscylladb/scylladb#18147
* github.com:scylladb/scylladb:
tools/scylla-nodetool: fix typo: Fore -> For
tools/scylla-nodetool: add doc link for getsstables and sstableinfo commands
Currently, we use the sum of the estimated_partitions from each
participant node as the estimated_partitions for sstable produced by
repair. This way, the estimated_partitions is the biggest possible
number of partitions repair would write.
Since repair will write only the difference between repair participant
nodes, using the biggest possible estimation will overestimate the
partitions written by repair, most of the time.
The problem is that overestimated partitions makes the bloom filter
consume more memory. It is observed that it causes OOM in the field.
This patch changes the estimation to use a fraction of the average
partitions per node instead of sum. It is still not a perfect estimation
but it already improves memory usage significantly.
Fixes#18140Closesscylladb/scylladb#18141
Tablet transition handlers prepare two mutations -- one for tablets
table, that sets transition state, transition mode and few others; and
another one for topology table that "activates" the tablet_migration
state for topology coordinator.
The latter is common to all three handlers.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Continuation of the previous patch -- there's a common sanity check of
tablet transition API handlers, namely that this tablet is not in
transition already.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are three transition calls -- move, add replica and del replica --
and all three work similarly. In a loop they try to get guard for raft
operation, then perform sanity checks on topology state, then prepare
mutations and then try to apply them to raft. After the loop finishes
all three wait for transition for the given tablet to complete.
This patch generalizes the raft kicking loop and the transition
completion waiting code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The code here looks like this
if src.host == dst.host
throw "Local migration not possible"
if src == dst
co_return;
The 2nd check is apparently never satisfied -- if src == dst this means
that src.host == dst.host and it should have thrown already
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
instead of using `operator<<`, use `fmt::print()` to
format and print, so we can ditch the `operator<<`-based formatters.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#18259
since Boost.Test relies on operator<< or `boost_test_print_type()`
to print the value of variables being compared, instead of defining
the fallback formatter of `boost_test_print_type()` for each
individual test, let's define it in `test/lib/test_utils.hh`, so
that it can be shared across tests.
Refs #13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#18260
there is chance that `utils/small_vector.hh` does not include
`using namespace seastar`, and even if it does, we should not rely
on it. but if it does not, checkhh would fail. so let's include
"seastarx.hh" in this header, so it is self-contained.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#18265
The problem this series solves is correctly ignoring DOWN nodes state
when replacing a node.
When a node is replaced and there are other nodes that are down, the
replacing node is told to ignore those DOWN nodes using the
`ignore_dead_nodes_for_replace` option.
Since the replacing node is bootstrapping it starts with an empty
system.peers table so it has no notion about any node state and it
learns about all other nodes via gossip shadow round done in
`storage_service::prepare_replacement_info`.
Normally, since the DOWN nodes to ignore already joined the ring, the
remaining node will have their endpoint state already in gossip, but if
the whole cluster was restarted while those DOWN nodes did not start,
the remaining nodes will only have a partial endpoint state from them,
which is loaded from system.peers.
Currently, the partial endpoint state contains only `HOST_ID` and
`TOKENS`, and in particular it lacks `STATUS`, `DC`, and `RACK`.
The first part of this series loads also `DC` and `RACK` from
system.peers to make them available to the replacing node as they are
crucial for building a correct replication map with network topology
replication strategy.
But still, without a `STATUS` those nodes are not considered as normal
token owners yet, and they do not go through handle_state_normal which
adds them to the topology and token_metadata.
The second part of this series uses the endpoint state retrieved in the
gossip shadow round to explicitly add the ignored nodes' state to
topology (including dc and rack) and token_metadata (tokens) in
`prepare_replacement_info`. If there are more DOWN nodes that are not
explicitly ignored replace will fail (as it should).
Fixesscylladb/scylladb#15787Closesscylladb/scylladb#15788
* github.com:scylladb/scylladb:
storage_service: join_token_ring: load ignored nodes state if replacing
storage_service: replacement_info: return ignore_nodes state
locator: host_id_or_endpoint: keep value as variant
gms: endpoint_state: add getters for host_id, dc_rack, and tokens
storage_service: topology_state_load: set local STATUS state using add_saved_endpoint
gossiper: add_saved_endpoint: set dc and rack
gossiper: add_saved_endpoint: fixup indentation
gossiper: add_saved_endpoint: make host_id mandatory
gossiper: add load_endpoint_state
gossiper: start_gossiping: log local state
It tries to call container().invoke_on_all() the hard way.
Calling it directly is not possible, because there's no
sharded::invoke_on_all() const overload
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#18202
The cql-pytest framework allows running tests also against Cassandra,
but developers need to install Cassandra on their own because modern
distributions such as Fedora no longer carry a Cassandra package.
This patch adds clear and easy to follow (I think) instructions on how
to download a pre-compiled Cassadra, or alternatively how to download
and build Cassandra from source - and how either can be used with the
test/cql-pytest/run-cassandra script.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#18138
For view builder draining there's dedicated deferred action in main while all other services that need to be drained do it via storage_service. The latter is to unify shutdown for services and to make `nodetool drain` drain everything, not just some part of those. This PR makes view builder drain look the same. As a side effect it also moves `mark_existing_views_as_built` from storage service to view builder and generalizes this marking code inside view builder itself.
refs: #2737
refs: #2795Closesscylladb/scylladb#16558
* github.com:scylladb/scylladb:
storage_service: Drain view builder on drain too
view_builder: Generalize mark_as_built(view_ptr) method
view_builder: Move mark_existing_views_as_built from storage service
storage_service: Add view_builder& reference
main,cql_test_env: Move view_builder start up (and make unconditional)
The partitions_bigger_than_threshold is incremented only if the previous
check detects that the partition exceeds a threshold by its size. It's
done with an extra if, but it can be done without (explicit) condition
as bool type is guaranteed by the standard to convert into integers as
true = 1 and false = 0
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#18217
Currentl code counts the number of keys in it just to see if this number
is non-zero. Using .contains() method is better fit here
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#18219
The sharded<database> is used as a invoke_in_all() method provider,
there's no real need in database itself. Simple smp::invoke_on_all()
would work just as good.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#18221
Some tests want to ignore out_of_range exception in continuation and go
the longer route for that
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#18216
When constructing a vector with partition key data, the size of that
vector is known beforehand
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#18239
There's a helper map-reducer that accepts a function to call on
commitlog. All callers accumulate statistics with it, so the commitlog
argument is const pointer.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#18238
In testing, we've observed multiple cases where nodes would fail to
observe updated application states of other nodes in gossiper.
For example:
- in scylladb/scylladb#16902, a node would finish bootstrapping and enter
NORMAL state, propagating this information through gossiper. However,
other nodes would never observe that the node entered NORMAL state,
still thinking that it is in joining state. This would lead to further
bad consequences down the line.
- in scylladb/scylladb#15393, a node got stuck in bootstrap, waiting for
schema versions to converge. Convergence would never be achieved and the
test eventually timed out. The node was observing outdated schema state
of some existing node in gossip.
I created a test that would bootstrap 3 nodes, then wait until they all
observe each other as NORMAL, with timeout. Unfortunately, thousands of
runs of this test on different machines failed to reproduce the problem.
After banging my head against the wall failing to reproduce, I decided
to sprinkle randomized sleeps across multiple places in gossiper code
and finally: the test started catching the problem in about 1 in 1000
runs.
With additional logging and additional head-banging, I determined
the root cause.
The following scenario can happen, 2 nodes are sufficient, let's call
them A and B:
- Node B calls `add_local_application_state` to update its gossiper
state, for example, to propagate its new NORMAL status.
- `add_local_application_state` takes a copy of the endpoint_state, and
updates the copy:
```
auto local_state = *ep_state_before;
for (auto& p : states) {
auto& state = p.first;
auto& value = p.second;
value = versioned_value::clone_with_higher_version(value);
local_state.add_application_state(state, value);
}
```
`clone_with_higher_version` bumps `version` inside
gms/version_generator.cc.
- `add_local_application_state` calls `gossiper.replicate(...)`
- `replicate` works in 2 phases to achieve exception safety: in first
phase it copies the updated `local_state` to all shards into a
separate map. In second phase the values from separate map are used to
overwrite the endpoint_state map used for gossiping.
Due to the cross-shard calls of the 1 phase, there is a yield before
the second phase. *During this yield* the following happens:
- `gossiper::run()` loop on B executes and bumps node B's `heart_beat`.
This uses the monotonic version_generator, so it uses a higher version
then the ones we used for states added above. Let's call this new version
X. Note that X is larger than the versions used by application_states
added above.
- now node B handles a SYN or ACK message from node A, creating
an ACK or ACK2 message in response. This message contains:
- old application states (NOT including the update described above,
because `replicate` is still sleeping before phase 2),
- but bumped heart_beat == X from `gossiper::run()` loop,
and sends the message.
- node A receives the message and remembers that the max
version across all states (including heart_beat) of node B is X.
This means that it will no longer request or apply states from node B
with versions smaller than X.
- `gossiper.replicate(...)` on B wakes up, and overwrites
endpoint_state with the ones it saved in phase 1. In particular it
reverts heart_beat back to smaller value, but the larger problem is that it
saves updated application_states that use versions smaller than X.
- now when node B sends the updated application_states in ACK or ACK2
message to node A, node A will ignore them, because their versions are
smaller than X. Or node B will never send them, because whenever node
A requests states from node B, it only requests states with versions >
X. Either way, node A will fail to observe new states of node B.
If I understand correctly, this is a regression introduced in
38c2347a3c, which introduced a yield in
`replicate`. Before that, the updated state would be saved atomically on
shard 0, there could be no `heart_beat` bump in-between making a copy of
the local state, updating it, and then saving it.
With the description above, it's easy to make a consistent
reproducer for the problem -- introduce a longer sleep in
`add_local_application_state` before second phase of replicate, to
increase the chance that gossiper loop will execute and bump heart_beat
version during the yield. Further commit adds a test based on that.
The fix is to bump the heart_beat under local endpoint lock, which is
also taken by `replicate`.
The PR also adds a regression test.
Fixes: scylladb/scylladb#15393Fixes: scylladb/scylladb#15602Fixes: scylladb/scylladb#16668Fixes: scylladb/scylladb#16902Fixes: scylladb/scylladb#17493Fixes: scylladb/scylladb#18118
Ref: scylladb/scylla-enterprise#3720Closesscylladb/scylladb#18184
* github.com:scylladb/scylladb:
test: reproducer for missing gossiper updates
gossiper: lock local endpoint when updating heart_beat
By default the suitename in the junit files generated by pytest
is named `pytest` for all suites instead of the suite, ex. `topology_experimental_raft`
With this change, the junit files will use the real suitename
This change doesn't affect the Test Report in Jenkins, but it
raised part of the other task of publishing the test results to
elasticsearch https://github.com/scylladb/scylla-pkg/pull/3950
where we parse the XMLs and we need the correct suitename
Closesscylladb/scylladb#18172
When altering rf for a keyspace, all tablets in this ks may have less replicas. Part of this process is removing replicas from some node(s). This PR extends the tablets rebuild transition to handle this case by making pending_replica optional.
fixes: #18176Closesscylladb/scylladb#18203
* github.com:scylladb/scylladb:
test: Tune up tablet-transition test to check del_replica
api: Add method to delete replica from tablet
tablet: Make pending replica optional
For that the test case is modified to have 3 nodes and 2 replicas on
start. Existing test cases are changed slightly in the way "from" host
is detected.
Also, the final check for data presense is modified to check that hosts
in "replicas" have data and other hosts don't have it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Copied from the add_replica counterpart
TODO: Generalize common parts of move_tablet and add_|del_tablet_replica
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Just like leaving replica could be optional when adding replica to
tablet, the pending replica can be optional too if we're removing a
replica from tablet
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Fixesscylladb/scylladb#18083
* seastar cd8a9133...f3058414 (18):
> src/core/metrics.cc: rewrite set_metric_family_configs
> include/seastar/core/metrics_api.hh: Revert d2929c2ade5bd0125a73d53280c82ae5da86218e
> sstring: include <fmt/format.h> instead of <fmt/ostream.h>
> seastar.cc: include used header
> tls: include used header of <unordered_set>
> docs: remove unused parameter from handle_connection function of echo-HTTP-server tutorial example
> stall-analyser: use 0 for the default value of --width
> http: Move parsed params and urls
> scripts: use raw string to avoid invalid escape sequences
> timed_out_error: add fmt::formatter for timed_out_error
> scripts/stall-analyser: change default branch-threshold to 3%
> scripts/stall-analyser: resolve string escape sequence warning
> io_queue: Use static vector for fair groups too
> io_queue: Use static vector to store fair queues
> stall-analyser: add space around '=' in param list
> stall-analyser: add a space between 'var: Type' in type annotation
> stall-analyser: move variables closer to where they are used
> memory: drop support for compilers that don't support aligned new
Closesscylladb/scylladb#18235
When a node bootstraps or replaces a node after full cluster
shutdown and restart, some nodes may be down.
Existing nodes in the cluster load the down nodes TOKENS
(and recently, in this series, also DC and RACK) from system.peers
and then populate locator::topology and token_metadata
accordingly with the down nodes' tokens in storage_service::join_cluster.
However, a bootstrapping/replacing node has no persistent knowledge
of the down nodes, and it learns about their existance only from gossip.
But since the down nodes have unknown status, they never go
through `handle_state_normal` (in gossiper mode) and therefore
they are not accounted as normal token owners.
This is handled by `topology_state_load`, but not with
gossip-based node operations.
This patch updates the ignored nodes (for replace) state in topology
and token_metadata as if they were loaded from system tables,
after calling `prepare_replacement_info` when raft topology changes are
disabled, based on the endpoint_state retrieved in the shadow round
initiated in prepare_replacement_info.
Fixesscylladb/scylladb#15787
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Instead of `parse_node_list` resolving host ids to inet_address
let `prepare_replacement_info` get host_id_or_endpoint from
parse_node_list and prepare `loaded_endpoint_state` for
the ignored nodes so it can be used later by the callers.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than allowing to keep both
host_id and endpoint, keep only one of them
and provide resolve functions that use the
token_metadata to resolve the host_id into
an inet_address or vice verse.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>