Commit Graph

42251 Commits

Author SHA1 Message Date
Mikołaj Grzebieluch
65cfb9b4e0 storage_service: skip wait_for_gossip_to_settle if topology changes are based on raft
Waiting for gossip to settle slows down the bootstrap of the cluster.
It is safe to disable it if the topology is based on Raft.

Fixes scylladb/scylladb#16055

Closes scylladb/scylladb#17960
2024-04-20 17:56:51 +02:00
Kefu Chai
5ab527e669 main: do not echo parsed options when calling scylla interactively
in 2f0f53ac, we added logging of parsed command line options so that we
can see how scylla is launched in case it fails to boot. but when scylla
is called interactively in console. this echo is a little bit annoying.
see following console session
```console
$ scylla --help-loggers
Scylla version 5.5.0~dev-0.20240419.3c9651adf297 with build-id 7dd6a110e608535e5c259a03548eda6517ab4bde starting ...
command used: "./RelWithDebInfo/scylla --help-loggers"
pid: 996503
parsed command line options: [help-loggers]
Available loggers:
    BatchStatement
    LeveledManifest
    alter_keyspace
    alter_table
...
```

so in this change, we check if the stdin is associated with a terminal
device, if that the case, we don't print the scylla version, parsed
command line and pid. and the interactive session looks like:

```console
$ scylla --help-loggers
Available loggers:
    BatchStatement
    LeveledManifest
    alter_keyspace
    alter_table
```
no more distracting information printed. the original behavior
can be tested like:

```console
$ : | ./RelWithDebInfo/scylla --help-loggers
```

assuming scylla is always launched with systemd, which connects
stdin to /dev/null. see
https://www.freedesktop.org/software/systemd/man/latest/systemd.exec.html#Logging%20and%20Standard%20Input/Output
. so this behavior is preserved with this change.

Refs #4203

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18309
2024-04-19 15:00:05 +03:00
Raphael S. Carvalho
223214439b compaction: Disconsider active tables in the hourly compaction reevaluation
This hourly reevaluation is there to help tablets that have very low
write activity, which can go a long time without flushing a memtable,
and it's important to reevaluate compaction as data can get expired.
Today it can happen that we reevaluate a table that is being compacted
actively, which is waste of cpu as the reevaluation will happen anyway
when there are changes to sstable set. This waste can be amplified with
a significant tablet count in a given shard.
Eventually, we could make the revaluation time per table based on
expiration histogram, but until we get there, let's avoid this waste
by only reevaluating tables that are compaction idle for more than 1h.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#18280
2024-04-19 14:33:40 +03:00
Kefu Chai
a5dae74aee doc: update nodetool setlogginglevel sample output with most recent loggers list
in order to reduce the confusion like:

> I cannot find foobar in the list, is it supported?

also, take this opportunity to use "console" instead of "shell" for
rendering the code block. it's a better fit in this case. since we
are using pygment for syntax highlighting,
see https://pygments.org/docs/lexers/#pygments.lexers.shell.BashSessionLexer
for details on the "console" lexer.

and add a prompt before the command line, so that "console" lexer
can render the command line and output better.

also, add a note explaining that user should refer the output of
`scylla` to see the list of logger classes.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18311
2024-04-19 13:25:39 +03:00
Kefu Chai
c04654e865 storage_service: capture this explicitly
clang-19 complains with `-Wdeprecated-this-capture`:

```
/home/kefu/dev/scylladb/service/storage_service.cc:5837:22: error: implicit capture of 'this' with a capture default of '=' is deprecated [-Werror,-Wdeprecated-this-capture]
 5837 |         auto* node = get_token_metadata().get_topology().find_node(dst.host);
      |                      ^
/home/kefu/dev/scylladb/service/storage_service.cc:5830:44: note: add an explicit capture of 'this' to capture '*this' by reference
 5830 |     co_await transit_tablet(table, token, [=] (const locator::tablet_map& tmap, api::timestamp_type write_timestamp) {
      |                                            ^
      |                                             , this
```

since https://open-std.org/JTC1/SC22/WG21/docs/papers/2018/p0806r2.html
was approved, see https://eel.is/c++draft/depr.capture.this. and newer
versions of C++ compilers implemented it, so we need to capture `this`
explicitly to be more standard compliant, and to be more future-proof
in this regard.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18306
2024-04-19 10:05:57 +03:00
Kefu Chai
168ade72f8 treewide: replace formatter<std::string_view> with formatter<string_view>
in in {fmt} before v10, it provides the specialization of `fmt::formatter<..>`
for `std::string_view` as well as the specialization of `fmt::formatter<..>`
for `fmt::string_view` which is an implementation builtin in {fmt} for
compatibility of pre-C++17. and this type is used even if the code is
compiled with C++ stadandard greater or equal to C++17. also, before v10,
the `fmt::formatter<std::string_view>::format()` is defined so it accepts
`std::string_view`. after v10, `fmt::formatter<std::string_view>` still
exists, but it is now defined using `format_as()` machinery, so it's
`format()` method does not actually accept `std::string_view`, it
accepts `fmt::string_view`, as the former can be converted to
`fmt::string_view`.

this is why we can inherit from `fmt::formatter<std::string_view>` and
use `formatter<std::string_view>::format(foo, ctx);` to implement the
`format()` method with {fmt} v9, but we cannot do this with {fmt} v10,
and we would have following compilation failure:

```
FAILED: service/CMakeFiles/service.dir/RelWithDebInfo/topology_state_machine.cc.o
/home/kefu/.local/bin/clang++ -DFMT_DEPRECATED_OSTREAM -DFMT_SHARED -DSCYLLA_BUILD_MODE=release -DSEASTAR_API_LEVEL=7 -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SSTRING -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"RelWithDebInfo\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/gen -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/seastar/gen/include -I/home/kefu/dev/scylladb/build/seastar/gen/src -ffunction-sections -fdata-sections -O3 -g -gz -std=gnu++20 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-enum-constexpr-conversion -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb=. -march=westmere -mllvm -inline-threshold=2500 -fno-slp-vectorize -U_FORTIFY_SOURCE -Werror=unused-result -MD -MT service/CMakeFiles/service.dir/RelWithDebInfo/topology_state_machine.cc.o -MF service/CMakeFiles/service.dir/RelWithDebInfo/topology_state_machine.cc.o.d -o service/CMakeFiles/service.dir/RelWithDebInfo/topology_state_machine.cc.o -c /home/kefu/dev/scylladb/service/topology_state_machine.cc
/home/kefu/dev/scylladb/service/topology_state_machine.cc:254:41: error: no matching member function for call to 'format'
  254 |     return formatter<std::string_view>::format(it->second, ctx);
      |            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~
/usr/include/fmt/core.h:2759:22: note: candidate function template not viable: no known conversion from 'seastar::basic_sstring<char, unsigned int, 15>' to 'const fmt::basic_string_view<char>' for 1st argument
 2759 |   FMT_CONSTEXPR auto format(const T& val, FormatContext& ctx) const
      |                      ^      ~~~~~~~~~~~~
```

because the inherited `format()` method actually comes from
`fmt::formatter<fmt::string_view>`. to reduce the confusion, in this
change, we just inherit from `fmt::format<string_view>`, where
`string_view` is actually `fmt::string_view`. this follows
the document at
https://fmt.dev/latest/api.html#formatting-user-defined-types,
and since there is less indirection under the hood -- we do not
use the specialization created by `FMT_FORMAT_AS` which inherit
from `formatter<fmt::string_view>`, hopefully this can improve
the compilation speed a little bit. also, this change addresses
the build failure with {fmt} v10.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18299
2024-04-19 07:44:07 +03:00
Avi Kivity
6e487a49aa Merge 'toolchain: support building an optimized clang' from Takuya ASADA
This is complete version of #12786, since I take over the issue from @mykaul.
Update from original version are:
- Support ARM64 build (disable BOLT for now since it doesn't functioning)
- Changed toolchain settings to get current scylla able to build (support WASM, etc)
- Stop git clone scylladb repo, create git-archive of current scylla directory and import it
-  Update Clang to 17.0.6
- Save entire clang directory for BUILD mode, not just /usr/bin/clang binary
- Implemented INSTALL_PREBUILT mode to install prebuilt image which built in BUILD mode

Note that this patch drops cross-build support of frozen toolchain, since building clang and scylla multiple time in qemu-user-static will very slow, it's not usable.
Instead, we should build the image for each architecture natively.

----

This is a different way attempting to combine building an optimized clang (using LTO, PGO and BOLT, based on compiling ScyllaDB) to dbuild. Per Avi's request, there are 3 options: skip this phase (which is the current default), build it and build + install it to the default path.

Fixes: #10985
Fixes: scylladb/scylla-enterprise#2539

Closes scylladb/scylladb#17196

* github.com:scylladb/scylladb:
  toolchain: support building an optimized clang
  configure.py: add --build-dir option
2024-04-18 19:20:23 +00:00
Anna Stuchlik
a3481a4566 doc: document the system_auth_v2 feature
This commit includes updates related to replacing system_auth with system_auth_v2.

- The keyspace name system_auth is renamed to system_auth_v2.
- The procedures are updated to account for system_auth_v2.
- No longer required system_auth RF changes are removed from procedures.
- The information is added that if the consistent topology updates feature
  was not enabled upon upgrade from 5.4, there are limitations or additional
  steps to do (depending on the procedure).
  The files with that kind of information are to be found in _common folders
  and included as needed.
- The upgrade guide has been updated to reflect system_auth_v2 and related impacts.

Closes scylladb/scylladb#18077
2024-04-18 18:33:49 +02:00
Kefu Chai
21b03d2ce3 topology_coordinator: remove unused variable
when compiling the tree with clang-19, it complains:

```
/home/kefu/dev/scylladb/service/topology_coordinator.cc:1968:31: error: variable 'reject' set but not used [-Werror,-Wunused-but-set-variable]
 1968 |                     if (auto* reject = std::get_if<join_node_response_params::rejected>(&validation_result)) {
      |                               ^
1 error generated.
```
so, despite that we evaluate the assignment statement to see it
evaluates to true or false, the compiler still believes that the
variable is not used. probably, the value of the statement is not
dependent on the value of the value being assigned. either way,
let's use `std::holds_alternative<..>` instead of `std::get_if<..>`,
to silence this warning, and the code is a little bit more compacted,
in the sense of less tokens in the `if` statement.

in order to be self-contained, we take the opportunity to
include `<variant>` in this source file, as a function declared
in this header is used.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18291
2024-04-18 18:04:56 +03:00
Amnon Heiman
e8410848a8 Update seastar submodule
This patch updates the seastar submodule to get the latest safety patch for the metric layer.
The latest patch allows manipulating metric_families early in the
start-up process and is safer if someone chooses to aggregate summaries.

* seastar f3058414...8fabb30a (4):
  > stall-analyser: improve stall pattern matching
  > TLS: Move background BYE handshake to engine::run_in_background
  > metrics.cc: Safer set_metric_family_configs
  > src/core/metrics.cc: handle SUMMARY add operator

Closes scylladb/scylladb#18293
2024-04-18 18:02:28 +03:00
Tomasz Grabiec
393cb54c01 Merge 'Generalize tablet transition API calls' from Pavel Emelyanov
Recently there had been added add_tablet_replica and del_tablet_replica API calls that copy big portion of the existing move_tablet API call's logic. This PR generalizes the common parts

Closes scylladb/scylladb#18272

* github.com:scylladb/scylladb:
  tablets: Generalize transition mutations preparation
  tablets: Generalize tablet-already-in-transition check
  tablets: Generalize raft communications for tablet transition API calls
  tablets: Drop src vs dst equality check from move_tablet()
2024-04-18 14:42:10 +02:00
Anna Stuchlik
ad81f9f56a doc: replace Scylla with ScyllaDB in Glossary
This commit replaces "Scylla" with "ScyllaDB" on the Glossary page.

The product has been rebranded as "ScyllaDB".

Closes scylladb/scylladb#18296
2024-04-18 14:59:23 +03:00
Kamil Braun
9c2a836607 Revert "Merge 'Drain view_builder in generic drain' from ScyllaDB"
This reverts commit 298a7fcbf2, reversing
changes made to 5cf53e670d.

The change made CI flaky.

Fixes: scylladb/scylladb#18278
2024-04-18 11:50:41 +02:00
Aleksandr Bykov
e8833c6f2a test: Kill coordinator during topology operation
If coordinator node was killed, restarted, become not
operatable during topology operation, new coordinator should be elected,
operation should be aborted and cluster should be rolled back

Error injection will be used to kill the coordinator before streaming
starts

Closes scylladb/scylladb#16197
2024-04-17 17:24:20 +02:00
Tomasz Grabiec
c6c8347493 migration_manager: Pull all of group0 state on repair
Current code uses non-raft path to pull the schema, which violates
group0 linearizability because the node will have latest schema but
miss group0 updates of other system tables. In particular,
system.tablets. This manifests as repair errors due to missing
tablet_map for a given table when trying to access it. Tablet map is
always created together with the table in the same group0 command.

When a node is bootstrapping, repair calls sync_schema() to make
sure local schema is up to date. This races with group0 catch up,
and if sync_schema() wins, repair may fail on misssing tablet map.

Fix by making sync_schema() do a group0 read barrier when in raft
mode.

Fixes #18002

Closes scylladb/scylladb#18175
2024-04-17 16:21:05 +02:00
Nadav Har'El
e78fc75323 Merge 'tools/scylla-nodetool: add doc link for getsstables and sstableinfo commands' from Botond Dénes
Just like all the other commands already have it. These commands didn't have documentation at the point where they were implemented, hence the missing doc link.

The links don't work yet, but they will work once we release 6.0 and the current master documentation is promoted to stable.

Closes scylladb/scylladb#18147

* github.com:scylladb/scylladb:
  tools/scylla-nodetool: fix typo: Fore -> For
  tools/scylla-nodetool: add doc link for getsstables and sstableinfo commands
2024-04-17 15:15:56 +03:00
Asias He
642f9a1966 repair: Improve estimated_partitions to reduce memory usage
Currently, we use the sum of the estimated_partitions from each
participant node as the estimated_partitions for sstable produced by
repair. This way, the estimated_partitions is the biggest possible
number of partitions repair would write.

Since repair will write only the difference between repair participant
nodes, using the biggest possible estimation will overestimate the
partitions written by repair, most of the time.

The problem is that overestimated partitions makes the bloom filter
consume more memory. It is observed that it causes OOM in the field.

This patch changes the estimation to use a fraction of the average
partitions per node instead of sum. It is still not a perfect estimation
but it already improves memory usage significantly.

Fixes #18140

Closes scylladb/scylladb#18141
2024-04-17 14:31:38 +03:00
Pavel Emelyanov
1b2cd56bcc tablets: Generalize transition mutations preparation
Tablet transition handlers prepare two mutations -- one for tablets
table, that sets transition state, transition mode and few others; and
another one for topology table that "activates" the tablet_migration
state for topology coordinator.

The latter is common to all three handlers.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-17 12:01:51 +03:00
Pavel Emelyanov
3beccb8165 tablets: Generalize tablet-already-in-transition check
Continuation of the previous patch -- there's a common sanity check of
tablet transition API handlers, namely that this tablet is not in
transition already.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-17 12:01:02 +03:00
Pavel Emelyanov
14923812ad tablets: Generalize raft communications for tablet transition API calls
There are three transition calls -- move, add replica and del replica --
and all three work similarly. In a loop they try to get guard for raft
operation, then perform sanity checks on topology state, then prepare
mutations and then try to apply them to raft. After the loop finishes
all three wait for transition for the given tablet to complete.

This patch generalizes the raft kicking loop and the transition
completion waiting code.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-17 11:59:03 +03:00
Pavel Emelyanov
c4d538320e tablets: Drop src vs dst equality check from move_tablet()
The code here looks like this

    if src.host == dst.host
        throw "Local migration not possible"

    if src == dst
        co_return;

The 2nd check is apparently never satisfied -- if src == dst this means
that src.host == dst.host and it should have thrown already

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-17 11:57:10 +03:00
Kefu Chai
e431e7dc16 test: paritioner_test: print using fmt::print()
instead of using `operator<<`, use `fmt::print()` to
format and print, so we can ditch the `operator<<`-based formatters.

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18259
2024-04-17 07:13:20 +03:00
Kefu Chai
0ff28b2a2a test: extract boost_test_print_type() into test_utils.hh
since Boost.Test relies on operator<< or `boost_test_print_type()`
to print the value of variables being compared, instead of defining
the fallback formatter of `boost_test_print_type()` for each
individual test, let's define it in `test/lib/test_utils.hh`, so
that it can be shared across tests.

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18260
2024-04-17 07:12:39 +03:00
Kefu Chai
2bb8e7c3c3 utils: include "seastarx.hh" in composite_abort_source.hh
there is chance that `utils/small_vector.hh` does not include
`using namespace seastar`, and even if it does, we should not rely
on it. but if it does not, checkhh would fail. so let's include
"seastarx.hh" in this header, so it is self-contained.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18265
2024-04-17 07:11:01 +03:00
David Garcia
6707bc673c docs: update theme 1.7
Closes scylladb/scylladb#18252
2024-04-16 13:48:11 +02:00
Kamil Braun
eb9ba914a3 Merge 'Set dc and rack in gossiper when loaded from system.peers and load the ignored nodes state for replace' from Benny Halevy
The problem this series solves is correctly ignoring DOWN nodes state
when replacing a node.

When a node is replaced and there are other nodes that are down, the
replacing node is told to ignore those DOWN nodes using the
`ignore_dead_nodes_for_replace` option.

Since the replacing node is bootstrapping it starts with an empty
system.peers table so it has no notion about any node state and it
learns about all other nodes via gossip shadow round done in
`storage_service::prepare_replacement_info`.

Normally, since the DOWN nodes to ignore already joined the ring, the
remaining node will have their endpoint state already in gossip, but if
the whole cluster was restarted while those DOWN nodes did not start,
the remaining nodes will only have a partial endpoint state from them,
which is loaded from system.peers.

Currently, the partial endpoint state contains only `HOST_ID` and
`TOKENS`, and in particular it lacks `STATUS`, `DC`, and `RACK`.

The first part of this series loads also `DC` and `RACK` from
system.peers to make them available to the replacing node as they are
crucial for building a correct replication map with network topology
replication strategy.

But still, without a `STATUS` those nodes are not considered as normal
token owners yet, and they do not go through handle_state_normal which
adds them to the topology and token_metadata.

The second part of this series uses the endpoint state retrieved in the
gossip shadow round to explicitly add the ignored nodes' state to
topology (including dc and rack) and token_metadata (tokens) in
`prepare_replacement_info`.  If there are more DOWN nodes that are not
explicitly ignored replace will fail (as it should).

Fixes scylladb/scylladb#15787

Closes scylladb/scylladb#15788

* github.com:scylladb/scylladb:
  storage_service: join_token_ring: load ignored nodes state if replacing
  storage_service: replacement_info: return ignore_nodes state
  locator: host_id_or_endpoint: keep value as variant
  gms: endpoint_state: add getters for host_id, dc_rack, and tokens
  storage_service: topology_state_load: set local STATUS state using add_saved_endpoint
  gossiper: add_saved_endpoint: set dc and rack
  gossiper: add_saved_endpoint: fixup indentation
  gossiper: add_saved_endpoint: make host_id mandatory
  gossiper: add load_endpoint_state
  gossiper: start_gossiping: log local state
2024-04-16 10:27:36 +02:00
Pavel Emelyanov
2c3d6fe72f storage_proxy: Simplify create_hint_sync_point() code
It tries to call container().invoke_on_all() the hard way.
Calling it directly is not possible, because there's no
sharded::invoke_on_all() const overload

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#18202
2024-04-16 07:26:06 +03:00
Nadav Har'El
a175e34375 cql-pytest: add instructions on how to get Cassandra
The cql-pytest framework allows running tests also against Cassandra,
but developers need to install Cassandra on their own because modern
distributions such as Fedora no longer carry a Cassandra package.

This patch adds clear and easy to follow (I think) instructions on how
to download a pre-compiled Cassadra, or alternatively how to download
and build Cassandra from source - and how either can be used with the
test/cql-pytest/run-cassandra script.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#18138
2024-04-16 07:23:36 +03:00
Botond Dénes
298a7fcbf2 Merge 'Drain view_builder in generic drain' from ScyllaDB
For view builder draining there's dedicated deferred action in main while all other services that need to be drained do it via storage_service. The latter is to unify shutdown for services and to make `nodetool drain` drain everything, not just some part of those. This PR makes view builder drain look the same. As a side effect it also moves `mark_existing_views_as_built` from storage service to view builder and generalizes this marking code inside view builder itself.

refs: #2737
refs: #2795

Closes scylladb/scylladb#16558

* github.com:scylladb/scylladb:
  storage_service: Drain view builder on drain too
  view_builder: Generalize mark_as_built(view_ptr) method
  view_builder: Move mark_existing_views_as_built from storage service
  storage_service: Add view_builder& reference
  main,cql_test_env: Move view_builder start up (and make unconditional)
2024-04-16 07:21:42 +03:00
Pavel Emelyanov
5cf53e670d replica: Remove unused ex variable from table::take_snapshot
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#18215
2024-04-16 07:16:38 +03:00
Pavel Emelyanov
f17c594d21 large_data_handler: If-less statistics increment
The partitions_bigger_than_threshold is incremented only if the previous
check detects that the partition exceeds a threshold by its size. It's
done with an extra if, but it can be done without (explicit) condition
as bool type is guaranteed by the standard to convert into integers as
true = 1 and false = 0

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#18217
2024-04-16 07:16:05 +03:00
Pavel Emelyanov
0f70d276d2 tools/scylla-sstable: Use shorter check is unordered_set contains a key
Currentl code counts the number of keys in it just to see if this number
is non-zero. Using .contains() method is better fit here

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#18219
2024-04-16 07:14:48 +03:00
Pavel Emelyanov
1df7c2a0e9 topology_coordinator: Mark retake_node() const
Runaway from 4d83a8c12c

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#18218
2024-04-16 07:13:07 +03:00
Pavel Emelyanov
05c4042511 api/lsa: Don't use database to perform invoke-on-all
The sharded<database> is used as a invoke_in_all() method provider,
there's no real need in database itself. Simple smp::invoke_on_all()
would work just as good.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#18221
2024-04-16 07:12:40 +03:00
Pavel Emelyanov
4a6291dce5 test/sstable: Use .handle_exception_type() shortcut
Some tests want to ignore out_of_range exception in continuation and go
the longer route for that

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#18216
2024-04-16 07:11:35 +03:00
Pavel Emelyanov
1612aa01ca cql3: Reserve vector with pk columns
When constructing a vector with partition key data, the size of that
vector is known beforehand

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#18239
2024-04-16 07:06:07 +03:00
Pavel Emelyanov
f3edde7d2e api: Qualify callback commitlog* argument with const
There's a helper map-reducer that accepts a function to call on
commitlog. All callers accumulate statistics with it, so the commitlog
argument is const pointer.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#18238
2024-04-16 07:02:31 +03:00
Botond Dénes
162c9ad6f6 Merge 'gossiper: lock local endpoint when updating heart_beat' from Kamil Braun
In testing, we've observed multiple cases where nodes would fail to
observe updated application states of other nodes in gossiper.

For example:
- in scylladb/scylladb#16902, a node would finish bootstrapping and enter
NORMAL state, propagating this information through gossiper. However,
other nodes would never observe that the node entered NORMAL state,
still thinking that it is in joining state. This would lead to further
bad consequences down the line.
- in scylladb/scylladb#15393, a node got stuck in bootstrap, waiting for
schema versions to converge. Convergence would never be achieved and the
test eventually timed out. The node was observing outdated schema state
of some existing node in gossip.

I created a test that would bootstrap 3 nodes, then wait until they all
observe each other as NORMAL, with timeout. Unfortunately, thousands of
runs of this test on different machines failed to reproduce the problem.

After banging my head against the wall failing to reproduce, I decided
to sprinkle randomized sleeps across multiple places in gossiper code
and finally: the test started catching the problem in about 1 in 1000
runs.

With additional logging and additional head-banging, I determined
the root cause.

The following scenario can happen, 2 nodes are sufficient, let's call
them A and B:
- Node B calls `add_local_application_state` to update its gossiper
  state, for example, to propagate its new NORMAL status.
- `add_local_application_state` takes a copy of the endpoint_state, and
  updates the copy:
```
            auto local_state = *ep_state_before;
            for (auto& p : states) {
                auto& state = p.first;
                auto& value = p.second;
                value = versioned_value::clone_with_higher_version(value);
                local_state.add_application_state(state, value);
            }
```
  `clone_with_higher_version` bumps `version` inside
  gms/version_generator.cc.
- `add_local_application_state` calls `gossiper.replicate(...)`
- `replicate` works in 2 phases to achieve exception safety: in first
  phase it copies the updated `local_state` to all shards into a
  separate map. In second phase the values from separate map are used to
  overwrite the endpoint_state map used for gossiping.

  Due to the cross-shard calls of the 1 phase, there is a yield before
  the second phase. *During this yield* the following happens:
- `gossiper::run()` loop on B executes and bumps node B's `heart_beat`.
  This uses the monotonic version_generator, so it uses a higher version
  then the ones we used for states added above. Let's call this new version
  X. Note that X is larger than the versions used by application_states
  added above.
- now node B handles a SYN or ACK message from node A, creating
  an ACK or ACK2 message in response. This message contains:
    - old application states (NOT including the update described above,
      because `replicate` is still sleeping before phase 2),
    - but bumped heart_beat == X from `gossiper::run()` loop,
  and sends the message.
- node A receives the message and remembers that the max
  version across all states (including heart_beat) of node B is X.
  This means that it will no longer request or apply states from node B
  with versions smaller than X.
- `gossiper.replicate(...)` on B wakes up, and overwrites
  endpoint_state with the ones it saved in phase 1. In particular it
  reverts heart_beat back to smaller value, but the larger problem is that it
  saves updated application_states that use versions smaller than X.
- now when node B sends the updated application_states in ACK or ACK2
  message to node A, node A will ignore them, because their versions are
  smaller than X. Or node B will never send them, because whenever node
  A requests states from node B, it only requests states with versions >
  X. Either way, node A will fail to observe new states of node B.

If I understand correctly, this is a regression introduced in
38c2347a3c, which introduced a yield in
`replicate`. Before that, the updated state would be saved atomically on
shard 0, there could be no `heart_beat` bump in-between making a copy of
the local state, updating it, and then saving it.

With the description above, it's easy to make a consistent
reproducer for the problem -- introduce a longer sleep in
`add_local_application_state` before second phase of replicate, to
increase the chance that gossiper loop will execute and bump heart_beat
version during the yield. Further commit adds a test based on that.

The fix is to bump the heart_beat under local endpoint lock, which is
also taken by `replicate`.

The PR also adds a regression test.

Fixes: scylladb/scylladb#15393
Fixes: scylladb/scylladb#15602
Fixes: scylladb/scylladb#16668
Fixes: scylladb/scylladb#16902
Fixes: scylladb/scylladb#17493
Fixes: scylladb/scylladb#18118
Ref: scylladb/scylla-enterprise#3720

Closes scylladb/scylladb#18184

* github.com:scylladb/scylladb:
  test: reproducer for missing gossiper updates
  gossiper: lock local endpoint when updating heart_beat
2024-04-16 06:46:24 +03:00
Tzach Livyatan
289793d964 Update Driver root page
The right term is Amazon DynamoDB not AWS DynamoDB
See https://aws.amazon.com/dynamodb/

Closes scylladb/scylladb#18214
2024-04-16 06:41:28 +03:00
Beni Peled
223275b4d1 test.py: add the pytest junit_suite_name parameter
By default the suitename in the junit files generated by pytest
is named `pytest` for all suites instead of the suite, ex. `topology_experimental_raft`
With this change, the junit files will use the real suitename

This change doesn't affect the Test Report in Jenkins, but it
raised part of the other task of publishing the test results to
elasticsearch https://github.com/scylladb/scylla-pkg/pull/3950
where we parse the XMLs and we need the correct suitename

Closes scylladb/scylladb#18172
2024-04-15 21:07:00 +03:00
Tomasz Grabiec
95d93c1668 Merge 'Extend tablet_transition_kind::rebuild to remove replicas' from Pavel Emelyanov
When altering rf for a keyspace, all tablets in this ks may have less replicas. Part of this process is removing replicas from some node(s). This PR extends the tablets rebuild transition to handle this case by making pending_replica optional.

fixes: #18176

Closes scylladb/scylladb#18203

* github.com:scylladb/scylladb:
  test: Tune up tablet-transition test to check del_replica
  api: Add method to delete replica from tablet
  tablet: Make pending replica optional
2024-04-15 21:01:03 +03:00
Pavel Emelyanov
c60639d582 sstables: Coroutinize drop_caches() method
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#18220
2024-04-15 17:22:59 +03:00
Pavel Emelyanov
b06b85c270 test: Tune up tablet-transition test to check del_replica
For that the test case is modified to have 3 nodes and 2 replicas on
start. Existing test cases are changed slightly in the way "from" host
is detected.

Also, the final check for data presense is modified to check that hosts
in "replicas" have data and other hosts don't have it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-15 16:31:07 +03:00
Pavel Emelyanov
8bad828208 api: Add method to delete replica from tablet
Copied from the add_replica counterpart

TODO: Generalize common parts of move_tablet and add_|del_tablet_replica

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-15 16:31:07 +03:00
Pavel Emelyanov
725b2863d2 tablet: Make pending replica optional
Just like leaving replica could be optional when adding replica to
tablet, the pending replica can be optional too if we're removing a
replica from tablet

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-15 16:31:07 +03:00
Amnon Heiman
06dc56df01 Update seastar submodule
Fixes scylladb/scylladb#18083

* seastar cd8a9133...f3058414 (18):
  > src/core/metrics.cc: rewrite set_metric_family_configs
  > include/seastar/core/metrics_api.hh: Revert d2929c2ade5bd0125a73d53280c82ae5da86218e
  > sstring: include <fmt/format.h> instead of <fmt/ostream.h>
  > seastar.cc: include used header
  > tls: include used header of <unordered_set>
  > docs: remove unused parameter from handle_connection function of echo-HTTP-server tutorial example
  > stall-analyser: use 0 for the default value of --width
  > http: Move parsed params and urls
  > scripts: use raw string to avoid invalid escape sequences
  > timed_out_error: add fmt::formatter for timed_out_error
  > scripts/stall-analyser: change default branch-threshold to 3%
  > scripts/stall-analyser: resolve string escape sequence warning
  > io_queue: Use static vector for fair groups too
  > io_queue: Use static vector to store fair queues
  > stall-analyser: add space around '=' in param list
  > stall-analyser: add a space between 'var: Type' in type annotation
  > stall-analyser: move variables closer to where they are used
  > memory: drop support for compilers that don't support aligned new

Closes scylladb/scylladb#18235
2024-04-15 15:19:59 +02:00
Tomasz Grabiec
2ceef1d600 scripts: tablet-mon.py: Support for annotating tablets by table id
Closes scylladb/scylladb#18225
2024-04-15 15:19:59 +02:00
Benny Halevy
655d624e01 storage_service: join_token_ring: load ignored nodes state if replacing
When a node bootstraps or replaces a node after full cluster
shutdown and restart, some nodes may be down.

Existing nodes in the cluster load the down nodes TOKENS
(and recently, in this series, also DC and RACK) from system.peers
and then populate locator::topology and token_metadata
accordingly with the down nodes' tokens in storage_service::join_cluster.

However, a bootstrapping/replacing node has no persistent knowledge
of the down nodes, and it learns about their existance only from gossip.
But since the down nodes have unknown status, they never go
through `handle_state_normal` (in gossiper mode) and therefore
they are not accounted as normal token owners.
This is handled by `topology_state_load`, but not with
gossip-based node operations.

This patch updates the ignored nodes (for replace) state in topology
and token_metadata as if they were loaded from system tables,
after calling `prepare_replacement_info` when raft topology changes are
disabled, based on the endpoint_state retrieved in the shadow round
initiated in prepare_replacement_info.

Fixes scylladb/scylladb#15787

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-04-14 15:45:55 +03:00
Benny Halevy
e4c3c07510 storage_service: replacement_info: return ignore_nodes state
Instead of `parse_node_list` resolving host ids to inet_address
let `prepare_replacement_info` get host_id_or_endpoint from
parse_node_list and prepare `loaded_endpoint_state` for
the ignored nodes so it can be used later by the callers.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-04-14 15:43:19 +03:00
Benny Halevy
7c2bd8dc34 locator: host_id_or_endpoint: keep value as variant
Rather than allowing to keep both
host_id and endpoint, keep only one of them
and provide resolve functions that use the
token_metadata to resolve the host_id into
an inet_address or vice verse.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-04-14 15:25:50 +03:00