41212 Commits

Author SHA1 Message Date
Nadav Har'El
21e7deafeb alternator, mv: fix case of two new key columns in GSI
A materialized view in CQL allows AT MOST ONE view key column that
wasn't a key column in the base table. This is because if there were
two or more of those, the "liveness" (timestamp, ttl) of these different
columns can change at every update, and it's not possible to pick what
liveness to use for the view row we create.

We made an exception to this rule for Alternator: DynamoDB's API allows
creating a GSI whose partition key and range key are both regular columns
in the base table, and we must support this. We claim that the fact that
Alternator allows neither TTL (Alternator's "TTL" is a different feature)
nor user-defined timestamps, does allow picking the liveness for the view
row we create. But we did it wrong!

We claimed in a comment - and implemented in the code before this patch -
that in Alternator we can assume that both GSI key columns will have the
*same* liveness, and in particular timestamp. But this is only true if
one modifies both columns together! In fact, in general it is not true:
We can have two non-key attributes 'a' and 'b' which are the GSI's key
columns, and we can modify *only* b, without modifying a, in which case
the timestamp of the view modification should be b's newer timestamp,
not a's older one. The existing code took a's timestamp, assuming it
will be the same as b's, which is incorrect. The result was that if
we repeatedly modify only b, all view updates will receive the same
timestamp (a's old timestamp), and a deletion will always win over
all the modifications. This patch includes a reproducing test written by
a user (@Zak-Kent) that demonstrates how after a view row is deleted
it doesn't get recreated - because all the modifications use the same
timestamp.

The fix is, as suggested above, to use the *higher* of the two
timestamps of both base-regular-column GSI key columns as the timestamp
for the new view rows or view row deletions. The reproducer that
failed before this patch passes with it. As usual, the reproducer
passes on AWS DynamoDB as well, proving that the test is correct and
should really work.
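
The timestamp choice described above can be sketched as follows. This is a minimal Python illustration of the rule only, not Scylla's C++ view-update code; `Cell`, `old_view_timestamp` and `view_row_timestamp` are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    value: str
    timestamp: int  # write timestamp of the base-table cell


def old_view_timestamp(a: Cell, b: Cell) -> int:
    # Behavior before the patch: assume both columns share a's timestamp.
    return a.timestamp


def view_row_timestamp(a: Cell, b: Cell) -> int:
    # Behavior after the patch: use the higher of the two timestamps.
    return max(a.timestamp, b.timestamp)


# 'a' is written once; 'b' is then modified repeatedly on its own.
a = Cell("x", timestamp=10)
for ts in (20, 30, 40):
    b = Cell("y", timestamp=ts)
    print(old_view_timestamp(a, b), view_row_timestamp(a, b))
```

With the old rule every view update carries timestamp 10, while the new rule follows b's newer timestamps.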

Fixes #17119

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#17172
2024-02-12 13:17:29 +02:00
Nadav Har'El
341af86167 test/cql-pytest: reproducer for GROUP BY regression
This patch adds a simple reproducer for a regression in Scylla 5.4 caused
by commit 432cb02, breaking LIMIT support in GROUP BY.

Refs #17237

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#17275
2024-02-12 13:09:52 +02:00
Kefu Chai
57df20eef8 configure.py: use un-deprecated module
PEP 632 deprecates the distutils module, and it is removed in Python 3.12.
we are actually using the one vendored by setuptools if we are running
3.12, so let's use shutil for finding the ninja executable.
see https://peps.python.org/pep-0632/
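
A minimal sketch of looking up an executable with `shutil.which()` instead of distutils. The helper name `find_ninja` and the candidate names tried are assumptions for illustration, not the actual configure.py code.

```python
import shutil


def find_ninja(candidates=("ninja", "ninja-build")):
    """Return the path of the first candidate found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


print(find_ninja())
```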

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#17271
2024-02-12 13:05:35 +02:00
Kamil Braun
7d73c40125 Merge 'test.py: tablets: Fix flakiness of test_tablet_missing_data_repair' from Tomasz Grabiec
Reimplements stop/start sequence using rolling_restart() which is safe
with regard to UP status propagation and not prone to sudden
connection drop which may cause later CQL queries to time out. It also
ensures that CQL is up on all the remaining nodes when the with_down
callback is executed.

The test was observed to fail in CI like this:

```
  cassandra.cluster.NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 127.157.135.26:9042 datacenter1>: ConnectionException('Pool for 127.157.135.26:9042 is shutdown')})
  ...
      @pytest.mark.repair
      @pytest.mark.asyncio
      async def test_tablet_missing_data_repair(manager: ManagerClient):
  ...
          for idx in range(0,3):
              s = servers[idx].server_id
              await manager.server_stop_gracefully(s, timeout=120)
  >           await check()
```

Hopefully: Fixes #17107

Closes scylladb/scylladb#17252

* github.com:scylladb/scylladb:
  test: py: tablets: Fix flakiness of test_tablet_missing_data_repair
  test: pylib: manager_client: Wait for driver to catch up in rolling_restart()
  test: pylib: manager_client: Accept callback in rolling_restart() to execute with node down
2024-02-12 11:52:09 +01:00
Botond Dénes
f068d1a6fa query: do not kill unpaged queries when they reach the tombstone-limit
We introduced the tombstone-limit (query_tombstone_page_limit) to allow
paged queries to return incomplete/empty pages in the face of large
tombstone spans. This works by cutting the page after the tombstone-limit
number of tombstones has been processed. If the read is unpaged, it is
killed instead. This was a mistake. First, it doesn't really make sense:
the tombstone limit was introduced to allow paged queries to process large
tombstone spans without timing out, and it does not help unpaged queries.
Furthermore, the tombstone-limit can kill internal queries done on
behalf of user queries, because all our internal queries are unpaged.
This can cause denial of service.

So in this patch we disable the tombstone-limit for unpaged queries
altogether: they are allowed to continue even after having processed the
configured limit of tombstones.
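
The behavior after this patch can be modeled roughly as below. This is a simplified Python sketch under stated assumptions: `read_page` and its `(is_tombstone, value)` input format are hypothetical, and the real logic lives in Scylla's C++ query path.

```python
def read_page(entries, tombstone_limit, paged):
    """Scan (is_tombstone, value) entries.

    Returns (live_values, page_was_cut). Only paged reads are cut short
    at the tombstone limit; unpaged reads scan everything.
    """
    live = []
    tombstones = 0
    for is_tombstone, value in entries:
        if is_tombstone:
            tombstones += 1
            if paged and tombstones >= tombstone_limit:
                return live, True  # short (possibly empty) page
        else:
            live.append(value)
    return live, False


entries = [(True, None)] * 5 + [(False, "v")]
print(read_page(entries, tombstone_limit=3, paged=True))
print(read_page(entries, tombstone_limit=3, paged=False))
```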

Fixes: #17241

Closes scylladb/scylladb#17242
2024-02-12 12:34:04 +02:00
Kefu Chai
9b85d1aebf configure.py, cmake: do not pass -Wignored-qualifiers explicitly
we recently added -Wextra to configure.py, and this option enables
a bunch of warning options, including `-Wignored-qualifiers`. so
there is no need to enable this specific warning anymore. this change
removes this option from both `configure.py` and the CMake build system.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#17272
2024-02-12 12:32:00 +02:00
Avi Kivity
c14571af16 Update seastar submodule
Because Seastar now defaults to C++23, we downgrade it explicitly to
C++20.

* seastar 289ad5e593...5d3ee98073 (10):
  > Update supported C++ standards to C++23 and C++20 (dropping C++17)
  > docker: install clang-tools-18
  > http: add handler_base::verify_mandatory_params()
  > coroutine/exception: document return_exception_ptr()
  > http: use structured-binding when appropriate
  > test/http: Read full server response before sending next
  > doc/lambda-coroutine-fiasco: fix a syntax error
  > util/source_location-compat: use __cpp_consteval
  > Fix incorrect class name in documentation.
  > Add support for missing HTTP PATCH method.

Closes scylladb/scylladb#17268
2024-02-12 12:21:47 +02:00
Patryk Wrobel
9fccd968d3 test_tablets.py: implement test_tablet_count_metric_per_shard
This change introduces a new test that verifies the
functionality related to tablet_count metric.

It checks if tablet_count metric is correctly reported
and updated when new tables are created, when tables
are dropped and when `move_tablet` is executed.

Refs: scylladb#16131
Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>

Closes scylladb/scylladb#17165
2024-02-12 11:49:38 +02:00
Kefu Chai
54995fcac0 test/manual: do not include unused headers
these unused includes were identified by clangd. see
https://clangd.llvm.org/guides/include-cleaner#unused-include-warning
for more details on the "Unused include" warning.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#17255
2024-02-12 11:49:38 +02:00
Asias He
a0e46a6b47 repair: Fix rpc::source and rpc::optional parameter order in rpc message
In a mixed cluster (5.4.1-20231231.3d22f42cf9c3 and
5.5.0~dev-20240119.b1ba904c4977), in the rolling upgrade test, we saw
repair never finishing.

The following was observed:

rpc - client 127.0.0.2:65273 msg_id 5524:  caught exception while
processing a message: std::out_of_range (deserialization buffer
underflow)

It turns out the repair rpc message was not compatible between the two
versions. Even with an rpc stream verb, new rpc parameters must come
after the rpc::source<> parameter. The rpc::source<> parameter is not
special: it does not have to be the last parameter.

For example, it should be:

void register_repair_get_row_diff_with_rpc_stream(
std::function<future<rpc::sink<repair_row_on_wire_with_cmd>> (
const rpc::client_info& cinfo, uint32_t repair_meta_id,
rpc::source<repair_hash_with_cmd> source, rpc::optional<shard_id> dst_cpu_id_opt)>&& func);

not:

void register_repair_get_row_diff_with_rpc_stream(
std::function<future<rpc::sink<repair_row_on_wire_with_cmd>> (
const rpc::client_info& cinfo, uint32_t repair_meta_id,
rpc::optional<shard_id> dst_cpu_id_opt, rpc::source<repair_hash_with_cmd> source)>&& func);

Fixes #16941

Closes scylladb/scylladb#17156
2024-02-12 09:50:30 +02:00
Nadav Har'El
13e16475fa cql-pytest: fix skipping of tests on Cassandra or old Scylla
Recently we added a trick to allow running cql-pytests either with or
without tablets. A single fixture test_keyspace uses two separate
fixtures test_keyspace_tablets or test_keyspace_vnodes, as requested.

The problem is that even if test_keyspace doesn't use its
test_keyspace_tablets fixture (it doesn't, if the test isn't
parameterized to ask for tablets explicitly), it's still a fixture,
and it causes the test to be skipped. This causes every test to be
skipped when running on Cassandra or old Scylla which doesn't support
tablets.

The fix is simple - the internal fixture test_keyspace_tablets should
yield None instead of skipping. It is the caller, test_keyspace, which
now skips the test if tablets are requested but test_keyspace_tablets
is None.
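
The pattern can be sketched as follows. To keep the sketch runnable on its own it is modeled with plain functions rather than real pytest fixtures: `Skipped` stands in for the exception raised by `pytest.skip()`, and the function names and keyspace names are hypothetical.

```python
class Skipped(Exception):
    """Stand-in for the exception raised by pytest.skip()."""


def keyspace_tablets(server_supports_tablets):
    # Inner "fixture": report absence with None instead of skipping.
    return "ks_tablets" if server_supports_tablets else None


def keyspace_vnodes():
    return "ks_vnodes"


def keyspace(want_tablets, server_supports_tablets):
    # Outer "fixture": the only place that decides to skip.
    tablets_ks = keyspace_tablets(server_supports_tablets)
    if want_tablets:
        if tablets_ks is None:
            raise Skipped("tablets requested but not supported")
        return tablets_ks
    return keyspace_vnodes()


# A test that does not ask for tablets still runs without tablet support.
print(keyspace(want_tablets=False, server_supports_tablets=False))
```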

Fixes #17266

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#17267
2024-02-11 21:03:25 +02:00
Kefu Chai
f990ea9678 tools/scylla-nodetool: implement describecluster
Refs #15588
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#17240
2024-02-11 20:21:07 +02:00
Avi Kivity
14bf09f447 Merge 'utils: managed_bytes: optimize memory usage for small buffers' from Michał Chojnowski
managed_bytes is implemented as a chain of blob_storage objects.
Each blob_storage contains 24 bytes of metadata. But in the most
common case -- when there is only a single element in the chain --
16 bytes of this metadata is trivial/unused.

This is a regrettable waste because managed_bytes is used for every
database cell in the memtables and cache. It means that every value
of size >= 7 bytes (smaller ones fit in the inline storage of
managed_bytes) receives 16 bytes of useless overhead.

To correct that, this series adds to managed_bytes an alternative storage
layout -- used for buffers small enough to fit in one fragment -- which only
stores the necessary minimum of metadata. (That is: a pointer to the parent,
to facilitate moving the storage during memory defragmentation).

This saves 16 bytes on every cell greater than 15 bytes, which includes e.g.
every live cell with a value bigger than 6 bytes, which likely applies to most cells.

Before:
```
$ build/release/scylla perf-simple-query --duration 10
median 218692.88 tps ( 61.1 allocs/op,  13.1 tasks/op,   41762 insns/op,        0 errors)
$ build/release/scylla perf-simple-query --duration 10 --write
median 173511.46 tps ( 58.3 allocs/op,  13.2 tasks/op,   53258 insns/op,        0 errors)
$ build/release/test/perf/mutation_footprint_test -c1 --row-count=20 --partition-count=100 --data-size=8 --column-count=16
 - in cache:     2580222
 - in memtable:  2549852
```

After:
```
$ build/release/scylla perf-simple-query --duration 10
median 218780.89 tps ( 61.1 allocs/op,  13.1 tasks/op,   41763 insns/op,        0 errors)
$ build/release/scylla perf-simple-query --duration 10 --write
median 173105.78 tps ( 58.3 allocs/op,  13.2 tasks/op,   52913 insns/op,        0 errors)
$ build/release/test/perf/mutation_footprint_test -c1 --row-count=20 --partition-count=100 --data-size=8 --column-count=16
 - in cache:     2068238
 - in memtable:  2037696
```
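
As a sanity check, the footprint numbers above can be converted into a per-cell saving. This sketch assumes the test creates partition-count x row-count x column-count cells (100 x 20 x 16); that cell count is an assumption derived from the command line, not something the test output states.

```python
# Assumed number of cells: partitions * rows per partition * columns per row.
cells = 100 * 20 * 16

before = {"cache": 2580222, "memtable": 2549852}
after = {"cache": 2068238, "memtable": 2037696}


def saved_per_cell(where):
    """Bytes saved per cell, from the footprint numbers quoted above."""
    return (before[where] - after[where]) / cells


for where in ("cache", "memtable"):
    print(where, round(saved_per_cell(where), 2))
```

Both figures come out at about 16 bytes per cell, matching the 16 bytes of metadata the series removes.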

Closes scylladb/scylladb#14263

* github.com:scylladb/scylladb:
  utils: managed_bytes: optimize memory usage for small buffers
  utils: managed_bytes: rewrite managed_bytes methods in terms of managed_bytes_view
2024-02-11 16:43:40 +02:00
Kefu Chai
cfb2c2c758 db: add formatter for gc_clock::time_point
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.

in this change, we define formatters for `gc_clock::time_point`,
and drop its operator<<.

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#17254
2024-02-11 16:39:25 +02:00
Kefu Chai
33224cc10b sstables/storage: avoid unnecessary type cast
the type of `_dir` was changed to fs::path back in 637dd730, there
is no need to cast `_dir` to fs::path anymore.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#17256
2024-02-11 16:37:05 +02:00
Benny Halevy
2ed29e31db gms: inet_address: make constructors explicit
In particular, `inet_address(const sstring& addr)` is
dangerous, since a function like
`topology::get_datacenter(inet_address ep)`
might accidentally convert a `sstring` argument
into an `inet_address` (which would most likely
throw an obscure std::invalid_argument if the datacenter
name does not look like an inet_address).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#17260
2024-02-11 15:44:13 +02:00
Benny Halevy
136df58cbc data_value: delete data_value(T*) constructor
Currently, since the data_value(bool) ctor
is implicit, pointers of any kind are implicitly
convertible to data_value via intermediate conversion
to `bool`.

This is error prone, since it allows unsafe comparison
between e.g. an `sstring` with `some*` by implicit
conversion of both sides to `data_value`.

For example:
```
    sstring name = "dc1";
    struct X {
        sstring s;
    };
    X x(name);
    auto p = &x;
    if (name == p) {}
```

Refs #17261

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#17262
2024-02-11 15:42:55 +02:00
Kefu Chai
d7a404e1ec alternator: add formatter for alternator::calculate_value_caller
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.

in this change, we define formatters for `alternator::calculate_value_caller`,
and drop its operator<<.

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#17259
2024-02-11 11:49:46 +02:00
Michał Chojnowski
5a3e4a1cc0 utils: managed_bytes: optimize memory usage for small buffers
managed_bytes is implemented as a chain of blob_storage objects.
Each blob_storage contains 24 bytes of metadata. But in the most
common case -- when there is only a single element in the chain --
16 bytes of this metadata is trivial/unused.

This is a regrettable waste because managed_bytes is used for every
database cell in the memtables and cache. It means that every value
of size >= 7 bytes (smaller ones fit in the inline storage of
managed_bytes) receives 16 bytes of useless overhead.

To correct that, this patch adds to managed_bytes an alternative storage
layout -- used for buffers small enough to fit in one contiguous
fragment -- which only stores the necessary minimum of metadata.
(That is: a pointer to the parent, to facilitate moving the storage during
memory defragmentation).
2024-02-09 20:56:20 +01:00
Tomasz Grabiec
1eedc85990 test: py: tablets: Fix flakiness of test_tablet_missing_data_repair
Reimplement stop/start sequence using rolling_restart() which is safe
with regard to UP status propagation and not prone to sudden
connection drop which may cause later CQL queries to time out. It also
ensures that CQL is up on all the remaining nodes when the with_down
callback is executed.

Hopefully: Fixes #17107
2024-02-09 20:37:06 +01:00
Tomasz Grabiec
27ed2d94fc test: pylib: manager_client: Wait for driver to catch up in rolling_restart()
For sanity of the developers who want to execute CQL queries after
rolling restarts.
2024-02-09 20:35:41 +01:00
Tomasz Grabiec
3ce4ec796a test: pylib: manager_client: Accept callback in rolling_restart() to execute with node down
2024-02-09 20:35:41 +01:00
Pavel Emelyanov
7a710425f0 streaming: Open-code on-stack lambda
It just wraps one if, no benefit in keeping it this way

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#17250
2024-02-09 20:31:09 +01:00
Petr Gusev
4554653ad9 storage_proxy: add a test for stop_remote
This patch adds a reproducer test for an issue #16382.
See scylladb/seastar#2044 for details of the problem.

The test is enabled only in dev mode since it requires
error injection mechanism. The patch adds a new injection
into storage_proxy::handle_read to simulate the problem
scenario - the node is shutting down and there are some
unfinished pending replica requests.

Closes scylladb/scylladb#16776
2024-02-09 17:23:13 +01:00
Michał Chojnowski
277a31f0ae utils: managed_bytes: rewrite managed_bytes methods in terms of managed_bytes_view
Some methods of managed_bytes contain the logic needed to read/write the
contents of managed_bytes, even though this logic is already present in
managed_bytes_{,mutable}_view.

Reimplementing those methods by using the views as intermediates allows us to
remove some code and makes the responsibilities cleaner -- after the change,
managed_bytes contains the logic of allocating and freeing the storage,
while views provide read/write access to the storage.

This change will simplify the next patch which changes the internals of
managed_bytes.
2024-02-09 17:00:33 +01:00
Botond Dénes
ba89b86913 Update tools/java submodule
* tools/java c75ce2c1...5e11ed17 (1):
  > bin/nodetool-wrapper: pass all args to nodetool for testings its ability
2024-02-09 16:34:47 +01:00
Raphael S. Carvalho
daa82f406c test_tablets: Enable table debug log in split test
If the test fails, it's helpful to see how split completion was
handled.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#17236
2024-02-09 14:38:24 +02:00
Botond Dénes
c7d9708092 Merge 'repair: delete table reference from repair related classes' from Aleksandra Martyniuk
row_level_repair and repair_meta keep a reference to a table.
If the table is dropped during repair, its object is destructed, leaving
a dangling reference.

Delete {row_level_repair,repair_meta}::_cf and replace their usages.

Fixes: #17233.

Closes scylladb/scylladb#17234

* github.com:scylladb/scylladb:
  repair: delete _cf from repair_meta
  repair: delete _cf from row_level_repair
2024-02-09 13:16:43 +02:00
Kamil Braun
e9e24f47ec Merge 'raft topology: implement upgrade and recovery procedure' from Piotr Dulikowski
This PR implements a procedure that upgrades existing clusters to use
raft-based topology operations. The procedure does not start
automatically, it must be triggered manually by the administrator after
making sure that no topology operations are currently running.

Upgrade is triggered by sending a `POST
/storage_service/raft_topology/upgrade` request. This starts the
topology coordinator, which drives the rest of the process: it
builds the `system.topology` state based on information observed in
gossip and tells all nodes to switch to raft mode. Then, the topology
coordinator runs normally.

Upgrade progress is tracked in a new static column `upgrade_state` in
`system.topology`.

The procedure also serves as an extension to the current recovery
procedure on raft. The current recovery procedure requires restarting
nodes in a special mode which disables raft, performing `nodetool
removenode` on the dead nodes, cleaning up some state on the nodes and
restarting them so that they automatically rebuild the group 0. Raft
topology fits into the existing procedure by falling back to legacy topology
operations after disabling raft. After rebuilding the group 0, upgrade
needs to be triggered again.

Because upgrade is manual and it might not be convenient for
administrators to run it right after upgrading the cluster, we allow the
cluster to operate in legacy topology operations mode until upgrade,
which includes allowing new nodes to join. In order to allow it, nodes
now ask the cluster about the mode they should use to join before
proceeding by using a new `JOIN_NODE_QUERY` RPC.

The procedure is explained in more detail in `topology-over-raft.md`.

Fixes: https://github.com/scylladb/scylladb/issues/15008

Closes scylladb/scylladb#17077

* github.com:scylladb/scylladb:
  test/topology_custom: upgrade/recovery tests for topology on raft
  cdc/generation_service: in legacy mode, fall back to raft tables
  system_keyspace: add read_cdc_generation_opt
  cdc/generation_service: turn off gossip notifications in raft topo mode
  cql_test_env: move raft_topology_change_enabled var earlier
  group0_state_machine: pull snapshot after raft topology feature enabled
  storage_service: disable persistent feature enabler on upgrade
  storage_service: replicate raft features to system.peers
  storage_service: gossip tokens and cdc generation in raft topology mode
  API: add api for triggering and monitoring topology-on-raft upgrade
  storage_service: infer which topology operations to use on startup
  storage_service: set the topology kind value based on group 0 state
  raft_group0: expose link to the upgrade doc in the header
  feature_service: fall back to checking legacy features on startup
  storage_service: add fiber for tracking the topology upgrade progress
  gms: feature_service: add SUPPORTS_CONSISTENT_TOPOLOGY_CHANGES
  topology_coordinator: implement core upgrade logic
  topology_coordinator: extract top-level error handling logic
  storage_service: initialize discovery leader's state earlier
  topology_coordinator: allow for custom sharding info in prepare_and_broadcast_cdc_generation_data
  topology_coordinator: allow for custom sharding info in prepare_new_cdc_generation_data
  topology_coordinator: remove outdated fixme in prepare_new_cdc_generation_data
  topology_state_machine: introduce upgrade_state
  storage_service: disallow topology ops when upgrade is in progress
  raft_group0_client: add in_recovery method
  storage_service: introduce join_node_query verb
  raft_group0: make discover_group0 public
  raft_group0: filter current node's IP in discover_group0
  raft_group0: remove my_id arg from discover_group0
  storage_service: make _raft_topology_change_enabled more advanced
  docs: document raft topology upgrade and recovery
2024-02-09 11:54:53 +01:00
Kefu Chai
c1c96bbc16 api/storage_service: drop /storage_service/describe_ring/ API
per its description, "`/storage_service/describe_ring/`" returns the
token ranges of an arbitrary keyspace. actually, it returns the
first keyspace which is of non-local-vnode-based-strategy. this API
is not used by nodetool, neither is it exercised in dtest.
scylla-manager has a wrapper for this API though, but that wrapper
is not used anywhere.

in this change, this API is dropped.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#17197
2024-02-09 12:49:21 +02:00
Kefu Chai
c07de1fad1 topology_coordinator: s/sate/state/
fix a typo in the logging message.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#17201
2024-02-09 10:27:33 +01:00
Kefu Chai
876478b84f storage_service: allow concurrent tablet migration in tablets/move API
Currently it waits for topology state machine to be idle, so it allows
one tablet to be moved at a time. We should allow it to start migration
if the current transition state is

- topology::transition_state::tablet_migration or
- topology::transition_state::tablet_draining

to allow starting parallel tablet movement. That will be useful when
scripting a custom rebalancing algorithm.

in this change, we wait until the topology state machine is idle or
it is at either of the above two states.
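
The relaxed wait condition amounts to the predicate sketched below. This is an illustrative Python model, not the C++ code: `None` stands for an idle state machine, and `OTHER` is a hypothetical placeholder for any other transition state.

```python
from enum import Enum, auto
from typing import Optional


class TransitionState(Enum):
    TABLET_MIGRATION = auto()
    TABLET_DRAINING = auto()
    OTHER = auto()  # hypothetical placeholder for any other transition


def can_start_tablet_move(state: Optional[TransitionState]) -> bool:
    """True if a tablet move may start; None models an idle state machine."""
    return state is None or state in (
        TransitionState.TABLET_MIGRATION,
        TransitionState.TABLET_DRAINING,
    )


print(can_start_tablet_move(TransitionState.TABLET_MIGRATION))
```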

Fixes #16437
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#17203
2024-02-08 21:47:15 +01:00
Piotr Dulikowski
4d4976feb0 test/topology_custom: upgrade/recovery tests for topology on raft
Adds three tests for the new upgrade procedure:

- test_topology_upgrade - upgrades a cluster operating in legacy mode to
  use raft topology operations,
- test_topology_recovery_basic - performs recovery on a three-node
  cluster, no node removal is done,
- test_topology_majority_loss - simulates a majority loss scenario, i.e.
  removes two nodes out of three, performs recovery to rebuild the
  raft topology state and re-adds the two nodes.
2024-02-08 19:12:28 +01:00
Piotr Dulikowski
d04b3338ce cdc/generation_service: in legacy mode, fall back to raft tables
When a node enters recovery after being in raft topology mode, topology
operations switch back to legacy mode. We want CDC to keep working when
that happens, so we need the legacy code to be able to access
generations created back in raft mode - so that the node can still
properly serve writes to CDC log tables.

In order to make this possible, modify the legacy logic to also look for
a cdc generation in raft tables, if it is not found in legacy tables.
2024-02-08 19:12:28 +01:00
Piotr Dulikowski
fb02453686 system_keyspace: add read_cdc_generation_opt
The `system_keyspace::read_cdc_generation` loads a cdc generation from
the system tables. One of its preconditions is that the generation
exists - this precondition is quite easy to satisfy in raft mode, and
the function was designed to be used solely in that mode.

In legacy mode, however, when we revert from raft mode through
recovery, it might be necessary to use generations created in raft mode
for some time. In order to make the function useful as a fallback in
case lookup of a generation in legacy mode fails, introduce a relaxed
variant of `read_cdc_generation` which returns std::nullopt if the
generation does not exist.
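
The difference between the strict and the relaxed variant, and how the relaxed one can serve as a fallback, can be sketched as below. This is a Python model with dictionaries standing in for system tables; only the names `read_cdc_generation` and `read_cdc_generation_opt` come from the commit, while `legacy_lookup` and the data layout are hypothetical.

```python
from typing import Optional


def read_cdc_generation(raft_table: dict, gen_id: str) -> list:
    # Strict variant: the caller must guarantee the generation exists.
    return raft_table[gen_id]


def read_cdc_generation_opt(raft_table: dict, gen_id: str) -> Optional[list]:
    # Relaxed variant: None plays the role of std::nullopt.
    return raft_table.get(gen_id)


def legacy_lookup(legacy_table: dict, raft_table: dict, gen_id: str) -> Optional[list]:
    # Legacy path: try legacy tables first, then fall back to raft tables.
    gen = legacy_table.get(gen_id)
    if gen is None:
        gen = read_cdc_generation_opt(raft_table, gen_id)
    return gen


legacy = {"g1": ["stream-a"]}
raft = {"g2": ["stream-b"]}
print(legacy_lookup(legacy, raft, "g2"))
```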
2024-02-08 19:12:28 +01:00
Piotr Dulikowski
77a8f5e3d6 cdc/generation_service: turn off gossip notifications in raft topo mode
In raft topology mode CDC information is propagated through group 0.
Prevent the generation service from reacting to gossiper notifications
after we made the switch to raft mode.
2024-02-08 19:12:28 +01:00
Piotr Dulikowski
29e286ee03 cql_test_env: move raft_topology_change_enabled var earlier
We will need to pass it to cdc::generation_service::config in the next
commit, so move it a bit earlier.
2024-02-08 19:12:28 +01:00
Piotr Dulikowski
07aba3abc4 group0_state_machine: pull snapshot after raft topology feature enabled
Pulling a snapshot of the raft topology is done via a new rpc verb
(RAFT_PULL_TOPOLOGY_SNAPSHOT). If the recipient runs an older version of
scylla and does not understand the verb, sending it will result in an
error. We usually use cluster features to avoid such situations, but in
the case when a node joins the cluster, it doesn't have access to
features yet. Therefore, we need to enable pulling snapshots in two
situations:

- when the SUPPORTS_CONSISTENT_TOPOLOGY_CHANGES feature becomes enabled,
- when starting the group 0 server while joining a cluster that uses
  raft-based topology.
2024-02-08 19:12:28 +01:00
Piotr Dulikowski
53932420f8 storage_service: disable persistent feature enabler on upgrade
When starting in legacy mode, a gossip event listener called persistent
feature enabler is registered. This listener marks a feature as enabled
when it notices, in gossip, that all nodes declare support for the
feature.

With raft-based topology, features are managed in group 0 instead and do
not rely on the persistent feature enabler at all. Make the listener
look at the raft_topology_change_enabled() method and prevent it from
enabling more features after that method starts returning true.
2024-02-08 19:12:28 +01:00
Piotr Dulikowski
4fdd3e014a storage_service: replicate raft features to system.peers
This is necessary for cluster features to work after we switch from raft
topology mode to legacy topology mode during recovery, because
information in system.peers is used during legacy cluster feature check
and when enabling features.
2024-02-08 19:12:28 +01:00
Piotr Dulikowski
08865a0bd7 storage_service: gossip tokens and cdc generation in raft topology mode
A mixed raft/legacy cluster can happen when entering recovery mode, i.e.
when the group 0 upgrade state is set to 0 and a rolling restart is
performed. Legacy nodes expect at least information about tokens,
otherwise an internal error occurs in the handle_state_normal function.
Therefore, make nodes that use raft topology behave well with respect to
other nodes.
2024-02-08 19:12:28 +01:00
Piotr Dulikowski
a672383c2a API: add api for triggering and monitoring topology-on-raft upgrade
Implements the /storage_service/raft_topology/upgrade route. The route
supports two methods: POST, which triggers the cluster-wide upgrade to
topology-on-raft, and GET which reports the status of the upgrade.
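
A minimal sketch of building the two requests with Python's standard library. The base URL (host and port of the node's REST API) is an assumption and must be adjusted to the deployment; the response format is not described here, so the sketch only constructs the requests and leaves sending them commented out.

```python
import urllib.request

BASE_URL = "http://localhost:10000"  # assumption: address of the node's REST API
UPGRADE_PATH = "/storage_service/raft_topology/upgrade"


def upgrade_request(trigger: bool) -> urllib.request.Request:
    """POST triggers the upgrade; GET asks for its status."""
    method = "POST" if trigger else "GET"
    return urllib.request.Request(BASE_URL + UPGRADE_PATH, method=method)


req = upgrade_request(trigger=False)
print(req.get_method(), req.full_url)

# To actually send it against a running node:
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode())
```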
2024-02-08 19:12:28 +01:00
Piotr Dulikowski
0bfcf7d4c6 storage_service: infer which topology operations to use on startup
Adds a piece of logic to storage_service::join_cluster which chooses the
mode in which it will boot.

If the experimental raft topology flag is disabled, it will fall back to
legacy node operations.

When the node starts for the first time, it will perform group 0
discovery. If the node creates a cluster, it will start it in raft
topology mode. If it joins an existing one, it will ask the node chosen
by the discovery algorithm about which joining method to use.

If the node is already a part of the cluster, it will base its decision
on the group0 state.
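
The decision described above can be summarized in a small sketch. This is a Python model of the prose only; the function name, parameters and the "raft"/"legacy" strings are hypothetical, and the real logic is in storage_service::join_cluster.

```python
from typing import Optional


def choose_topology_mode(raft_flag_enabled: bool,
                         first_boot: bool,
                         creates_cluster: bool = False,
                         peer_answer: Optional[str] = None,
                         group0_uses_raft: bool = False) -> str:
    """Return "raft" or "legacy" following the rules in the commit message."""
    if not raft_flag_enabled:
        return "legacy"
    if first_boot:
        if creates_cluster:
            return "raft"
        # Joining an existing cluster: use what the discovered node said.
        if peer_answer is None:
            raise ValueError("joining node needs an answer from the cluster")
        return peer_answer
    # Already a member: decide from the group 0 state.
    return "raft" if group0_uses_raft else "legacy"


print(choose_topology_mode(True, first_boot=True, creates_cluster=True))
```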
2024-02-08 19:12:28 +01:00
Piotr Dulikowski
1e0aae8576 storage_service: set the topology kind value based on group 0 state
When booting for the first time, the node determines whether to use raft
mode or not by asking the cluster, or by going straight to raft mode
when it creates a new cluster by itself. This happens before joining
group 0. However, right after joining group 0, the `upgrade_state`
column from `system.topology` is supposed to control which operations
the node is supposed to be using.

In order to have a single source of control over the flag (either
storage_service code or group 0 code), the
`_manage_topology_change_kind_from_group0` flag is added which controls
whether the `_topology_change_kind_enabled` flag is controlled from
group 0 or not.
2024-02-08 19:12:28 +01:00
Piotr Dulikowski
5392bac85b raft_group0: expose link to the upgrade doc in the header
So that it can be referenced from other files.
2024-02-08 19:12:28 +01:00
Piotr Dulikowski
3513a07d8a feature_service: fall back to checking legacy features on startup
When checking features on startup (i.e. whether support for any feature
was revoked in an unsafe way), it might happen that upgrade to raft
topology didn't finish yet. In that case, instead of loading an empty
set of features - which supposedly represents the set of features that
were enabled until last boot - we should fall back to loading the set
from the legacy `enabled_features` key in `system.scylla_local`.
2024-02-08 19:12:28 +01:00
Piotr Dulikowski
d5a2837658 storage_service: add fiber for tracking the topology upgrade progress
The topology coordinator fiber is not started if a node starts in legacy
topology mode. We need to start the raft state monitor fiber after all
preconditions for starting upgrade to raft topology are met.

Add a fiber which is spawned only in legacy mode that will wait until:

- The schema-on-raft upgrade finishes,
- The SUPPORTS_CONSISTENT_CLUSTER_MANAGEMENT feature is enabled,
- The upgrade is triggered by the user.

and, after that, will spawn the raft state monitor fiber.
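
The waiting fiber can be modeled with asyncio as a rough analogy. Scylla uses Seastar fibers in C++, so this sketch only illustrates the control flow: `wait_then_start` and the three events are hypothetical stand-ins for the three preconditions listed above.

```python
import asyncio


async def wait_then_start(preconditions, start_monitor):
    """Wait for every precondition event, then start the monitor."""
    for event in preconditions:
        await event.wait()
    start_monitor()


async def demo():
    # One event per precondition: schema-on-raft upgrade finished,
    # cluster feature enabled, upgrade triggered by the user.
    events = [asyncio.Event() for _ in range(3)]
    started = []
    task = asyncio.create_task(
        wait_then_start(events, lambda: started.append(True)))
    for event in events:
        await asyncio.sleep(0)  # let the waiting task make progress
        event.set()
    await task
    return started


print(asyncio.run(demo()))
```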
2024-02-08 19:12:28 +01:00
Piotr Dulikowski
2ecb8641b1 gms: feature_service: add SUPPORTS_CONSISTENT_TOPOLOGY_CHANGES
All nodes supporting raft topology is a prerequisite
for starting upgrade to raft topology. The newly introduced feature will
track this prerequisite.
2024-02-08 19:12:28 +01:00
Piotr Dulikowski
a55797fd41 topology_coordinator: implement core upgrade logic
Implement topology coordinator's logic responsible for building the
group 0 state related to topology.
2024-02-08 19:12:28 +01:00
Piotr Dulikowski
b3369611bc topology_coordinator: extract top-level error handling logic
...to a separate method. It will be reused in another method that will
be introduced in the next commit.
2024-02-08 19:09:35 +01:00