Commit Graph

737 Commits

Author SHA1 Message Date
Aleksandra Martyniuk
31ea74b96e db: system_keyspace: change version of topology_requests schema
In 880058073b a new column (request_type)
was added to the topology_requests table, but the table's schema version
wasn't changed. Because of that, during a cluster upgrade the old and
the new versions of the table coexist but cannot be distinguished.

Add an offset to the schema version of the topology_requests table if it
contains the request_type column.

Fixes: #20299.

Closes scylladb/scylladb#20402
2024-09-11 16:36:35 +03:00
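To illustrate the idea above, a minimal sketch of offsetting a table's schema version when an extra column is present; the types and the helper below are hypothetical, not the actual system_keyspace code.

```
// Hypothetical sketch: derive a distinct, deterministic schema version
// for the table variant that contains the request_type column, so the
// old and new schemas are distinguishable during a rolling upgrade.
#include <cstdint>

struct table_version {
    uint64_t msb, lsb;   // stands in for the real UUID-based version
};

table_version topology_requests_version(table_version base, bool has_request_type) {
    if (has_request_type) {
        base.lsb += 1;   // fixed offset for the extended schema
    }
    return base;
}
```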
Michael Litvak
c1f3517a75 view_builder: improve migration to v2 with intermediate phase
Add an intermediate phase to the view builder migration to v2 in which we
write to both the old and the new table, so as not to lose writes during
the migration.
We add an additional view builder version, v1_5, between v1 and v2, where
we write to both tables. We perform a barrier before moving to v2 to
ensure all operations on the old table have completed.
2024-09-05 15:42:35 +03:00
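A self-contained sketch of the dual-write phase described above, with illustrative names rather than the actual view_builder code:

```
#include <iostream>
#include <string>

enum class view_builder_version { v1, v1_5, v2 };

void write_old_table(const std::string& s) { std::cout << "old: " << s << "\n"; }
void write_new_table(const std::string& s) { std::cout << "new: " << s << "\n"; }

// v1 writes only the old table, v2 only the new one; the intermediate
// v1_5 writes both, so no update is lost while nodes still disagree
// about the active version. A barrier then drains v1_5 writers before
// the switch to v2.
void record_build_status(view_builder_version v, const std::string& status) {
    if (v == view_builder_version::v1 || v == view_builder_version::v1_5) {
        write_old_table(status);
    }
    if (v == view_builder_version::v1_5 || v == view_builder_version::v2) {
        write_new_table(status);
    }
}
```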
Michael Litvak
f3887cd80b db:system_keyspace: add view_builder_version to scylla_local
Add a new scylla_local parameter view_builder_version, and functions to
read and mutate the value.
The version value defaults to v1 if it doesn't exist in the table.
2024-09-05 15:41:04 +03:00
Michael Litvak
bf4a58bf91 db:system_keyspace: add view_build_status_v2 table
Add the table system.view_build_status_v2 with the same schema as
system_distributed.view_build_status.
2024-09-05 15:41:04 +03:00
Patryk Jędrzejczak
02bb70da19 treewide: support zero-token nodes in the recovery mode
Before we implement the manual recovery tool, we must support
zero-token nodes in the recovery mode. This means that two topology
operations involving zero-token nodes must work in the gossip-based
topology:
- removing a dead zero-token node,
- restarting a live zero-token node.
This patch makes the changes necessary for them to work.
2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak
22d907e721 treewide: introduce support for zero-token nodes in Raft topology
We revive the `join_ring` option. We support it only in the
Raft-based topology, as we plan to remove the gossip-based topology
when we fix the last blocker - the implementation of the manual
recovery tool. In the Raft-based topology, a node can be assigned
tokens only once when it joins the cluster. Hence, we disallow
joining the ring later, which is possible in Cassandra.

The main idea behind the solution is simple. We make the unsupported
special case of zero tokens a supported normal case. Nodes with zero
tokens assigned are called "zero-token nodes" from now on.

From the topology point of view, zero-token nodes are the same as
token-owning nodes. They can be in the same states, etc. From the
data point of view, they are different. They are not members of
the token ring, so they are not present in
`token_metadata::_normal_token_owners`. Hence, they are ignored in
all non-local replication strategies. The tablet load balancer also
ignores them.

Topology operations involving zero-token nodes are simplified:
- `add` and `replace` finish in the `join_group0` state, so creating
a new CDC generation and streaming are skipped,
- `removenode` and `decommission` skip streaming,
- `rebuild` does not even contact the topology coordinator, as there
is nothing to rebuild.

Also, if the topology operation involves a token-owning node,
zero-token nodes are ignored in streaming.

Zero-token nodes can be used as coordinator-only nodes, just like in
Cassandra. They can handle requests just like token-owning nodes.

The main motivation behind zero-token nodes is that they can prevent
Raft majority loss efficiently. Zero-token nodes are group 0
voters, but they can run on much weaker and cheaper machines because
they do not replicate data and, by default, do not handle client
requests (drivers ignore them). For example, if there are two DCs, one
with 4 nodes and one with 5 nodes, adding a DC with 2 zero-token nodes
means every DC contains fewer than half of the 11 nodes, so losing any
single DC still leaves a majority of at least 6 voters.

Another way of preventing Raft majority loss is changing the
voter set, which is tracked by scylladb/scylladb#18793. That approach
can be used together with zero-token nodes. In the example above, if
we chose equal numbers of voters in both DCs, then a DC with one
zero-token node would be sufficient. However, in the typical setup of
2 DCs with the same number of nodes, it is enough to add a DC with
only one zero-token node without changing the voter set.

Zero-token nodes could also be used as load balancers in
Alternator.
2024-08-29 10:37:07 +02:00
Patryk Jędrzejczak
ba016c9af7 system_keyspace: load_topology_state: remove assertion impossible to hit
We store tokens in a non-frozen set, which doesn't distinguish an
empty set from no value. Hence, hitting this assertion is impossible.
2024-08-29 10:37:07 +02:00
Laszlo Ersek
9e95f3a198 db/system_keyspace: promise a tagged integer from increment_and_get_generation()
Internally, increment_and_get_generation() produces a
"gms::generation_type" value.

In turn, all callers of increment_and_get_generation() -- namely
scylla_main() [main.cc] and single_node_cql_env::run_in_thread()
[test/lib/cql_test_env.cc] -- pass the resolved value to
storage_service::init_address_map() and storage_service::join_cluster(),
both of which take a "gms::generation_type".

Therefore it is pointless to "untag" the generation value temporarily
between the producer and the consumers. Correct the return type of
increment_and_get_generation().

Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>
2024-08-14 13:35:08 +02:00
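A minimal sketch of the "tagged integer" pattern behind gms::generation_type (the real definition lives in the gms code; this is only an illustration):

```
#include <cstdint>

// A distinct wrapper type: a plain int64_t no longer converts to a
// generation implicitly, so producers and consumers agree on the tag.
class generation_type {
    int64_t _value;
public:
    explicit generation_type(int64_t v) : _value(v) {}
    int64_t value() const { return _value; }
};

// Returning the tagged type end-to-end avoids the pointless "untag":
// the caller receives a generation_type and passes it straight through.
generation_type increment_and_get(generation_type prev) {
    return generation_type(prev.value() + 1);
}
```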
Avi Kivity
aa1270a00c treewide: change assert() to SCYLLA_ASSERT()
assert() is traditionally disabled in release builds, but not in
scylladb. This hasn't caused problems so far, but the latest abseil
release includes a commit [1] that causes a 1000 insn/op regression when
NDEBUG is not defined.

Clearly, we must move towards a build system where NDEBUG is defined in
release builds. But we can't just define it blindly without vetting
all the assert() calls, as some were written with the expectation that
they are enabled in release mode.

To solve the conundrum, change all assert() calls to a new SCYLLA_ASSERT()
macro in utils/assert.hh. This macro is always defined and is not conditional
on NDEBUG, so we can later (after vetting Seastar) enable NDEBUG in release
mode.

[1] 66ef711d68

Closes scylladb/scylladb#20006
2024-08-05 08:23:35 +03:00
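A sketch of an always-on assert macro in the spirit of SCYLLA_ASSERT (see utils/assert.hh for the real definition); unlike assert(), it does not compile away when NDEBUG is defined:

```
#include <cstdio>
#include <cstdlib>

#define MY_ASSERT(expr)                                             \
    do {                                                            \
        if (!(expr)) {                                              \
            std::fprintf(stderr, "%s:%d: assertion `%s' failed\n",  \
                         __FILE__, __LINE__, #expr);                \
            std::abort();                                           \
        }                                                           \
    } while (0)
```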
Aleksandra Martyniuk
c64cb98bcf db: node_ops: filter topology request entries
system_keyspace::get_topology_request_entries returns entries for
requests that are running or have finished after a specified time.

In the task manager's node ops tasks, set the time so that entries are
shown for task_ttl seconds after they have finished.
2024-07-23 13:35:02 +02:00
Aleksandra Martyniuk
94282b5214 db: service: modify methods to get topology_requests data
Modify get_topology_request_state (and wait_for_topology_request_completion)
so that it doesn't call on_internal_error when the request_id isn't
in the topology_requests table and require_entry == false.

Add other methods to get topology request entry.
2024-07-23 13:35:01 +02:00
Aleksandra Martyniuk
880058073b db: service: add request type column to topology_requests
The topology_requests table will be used by task manager node ops tasks,
but it lacks info about the request type, which the tasks require.

Add a request_type column to topology_requests.
2024-07-23 13:35:01 +02:00
Kefu Chai
36239ec592 db/system_keyspace: drop thrift_version from system.local table
so we don't create new sstables with this unused column, but we
can still open old sstables of this table that were created with
the old schema.

Refs #3811
Refs #18416

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-06-07 09:23:10 +08:00
Kefu Chai
ad649be1bf treewide: drop thrift support
thrift support has been deprecated since ScyllaDB 5.2

> Thrift API - legacy ScyllaDB (and Apache Cassandra) API is
> deprecated and will be removed in followup release. Thrift has
> been disabled by default.

so let's drop it. in this change,

* thrift protocol support is dropped
* all references to thrift support in document are dropped
* the "thrift_version" column in system.local table is
  preserved for backward compatibility, as we could load
  from an existing system.local table which still contains
  this clolumn, so we need to write this column as well.
* "/storage_service/rpc_server" is only preserved for
  backward compatibility with java-based nodetool.
* `rpc_port` and `start_rpc` options are preserved, but
  they are marked as "Unused", so that the new release
  of scylladb can consume existing scylla.yaml configurations
  which might contain these settings. by marking them
  deprecated, users will be warned and can update
  their configurations before we actually remove them
  in the next major release.

Fixes #3811
Fixes #18416
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-06-07 06:44:59 +08:00
Pavel Emelyanov
83d491af02 config: Remove experimental TABLETS feature
... and replace it with a boolean enable_tablets option. All the places
in the code are patched to check the latter option instead of the
former feature.

The option is OFF by default, but the default scylla.yaml file sets this
to true, so that newly installed clusters turn tablets ON.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#18898
2024-05-30 18:03:51 +03:00
Piotr Smaron
59d3fd615f Extend system.topology with 3 new columns to store data required to process alter ks global topo req
Because ALTER KS will result in creating a global topo req, we'll have
to pass the request data to the topology coordinator's state machine, and the
easiest way to do it is through the system.topology table, which is going to
be extended with 3 extra columns carrying all the data required to
execute ALTER KS from within the topology coordinator.
2024-05-28 13:55:11 +02:00
Marcin Maliszkiewicz
2ab143fb40 db: auth: move auth tables to system keyspace
A separate keyspace that also behaves like a system keyspace brings
little benefit while creating compatibility problems,
such as schema digest mismatches during rollback. So we decided
to move the auth tables into the system keyspace.

Fixes https://github.com/scylladb/scylladb/issues/18098

Closes scylladb/scylladb#18769
2024-05-26 22:30:42 +03:00
Botond Dénes
96a7ed7efb Merge 'sstables: add dead row count when issuing warning to system.large_partitions' from Ferenc Szili
This is the second half of the fix for issue #13968. The first half was already merged in PR #18346.

Scylla issues warnings for partitions containing more rows than a configured threshold. The warning is issued by inserting a row into the `system.large_partitions` table. This row contains the information about the partition for which the warning is issued: keyspace, table, sstable, partition key and size, compaction time and the number of rows in the partition. A previous PR #18346 also added range tombstone count to this row.

This change adds a new counter for dead rows to the large_partitions table.

This change also adds cluster feature protection for writing into these new counters. This is needed in case a cluster is in the process of being upgraded to this new version, an upgraded node writes data with the new schema into `system.large_partitions`, and a node is then rolled back to an old version. This node will then revert the schema to the old version, but the written sstables will still contain data with the new counters, causing any readers of this table to throw errors when they encounter these cells.

This is an enhancement, and backporting is not needed.

Fixes #13968

Closes scylladb/scylladb#18458

* github.com:scylladb/scylladb:
  sstable: added test for counting dead rows
  sstable: added docs for system.large_partitions.dead_rows
  sstable: added cluster feature for dead rows and range tombstones
  sstable: write dead_rows count to system.large_partitions
  sstable: added counter for dead rows
2024-05-09 08:26:43 +03:00
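A sketch of the cluster-feature gating described above, with hypothetical names: the new cell is only written once the whole cluster advertises support, so a rolled-back node never encounters cells it cannot parse.

```
#include <cstdint>
#include <optional>

struct large_partition_entry {
    uint64_t rows = 0;
    std::optional<uint64_t> dead_rows;    // new, feature-gated counter
};

large_partition_entry make_warning_entry(uint64_t rows, uint64_t dead,
                                         bool cluster_supports_dead_rows) {
    large_partition_entry e;
    e.rows = rows;
    if (cluster_supports_dead_rows) {
        e.dead_rows = dead;               // safe: every node is upgraded
    }
    return e;
}
```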
Benny Halevy
ebff5f5d70 everywhere: include seastar headers using angle brackets
Seastar is an external library, therefore its headers should be
included using the system-include syntax.

Closes scylladb/scylladb#18513
2024-05-06 10:00:31 +03:00
Ferenc Szili
b06af5b2b9 sstable: write dead_rows count to system.large_partitions 2024-05-02 11:49:10 +02:00
Kamil Braun
d8313dda43 Merge 'db: config: move consistent-topology-changes out of experimental and make it the default for new clusters' from Patryk Jędrzejczak
We move consistent cluster management out of experimental and
make it the default for new clusters in 6.0. In code, we make the
`consistent-topology-changes` flag unused and assumed to be true.

In 6.0, the topology upgrade procedure will be manual and
voluntary, so some clusters will still be using the gossip-based
topology even though they support the raft-based topology.
Therefore, we need to continue testing the gossip-based topology.
This is possible by using the `force-gossip-topology-changes` flag
introduced in scylladb/scylladb#18284.

Ref scylladb/scylladb#17802

Closes scylladb/scylladb#18285

* github.com:scylladb/scylladb:
  docs: raft.rst: update after removing consistent-topology-changes
  treewide: fix indentation after the previous patch
  db: config: make consistent-topology-changes unused
  test: lib: single_node_cql_env: restart a node in noninitial run_in_thread calls
  test: test_read_required_hosts: run with force-gossip-topology-changes
  storage_service: join_cluster: replace force_gossip_based_join with force-gossip-topology-changes
  storage_service: join_token_ring: fix finish_setup_after_join calls
2024-04-26 14:45:29 +02:00
Botond Dénes
d566eec89a Merge 'treewide: remove {dclocal_,}read_repair_chance options' from Kefu Chai
dclocal_read_repair_chance and read_repair_chance have been removed in Cassandra 3.11 and 4.x, see
https://issues.apache.org/jira/browse/CASSANDRA-13910. if we expose these properties via DDL, Cassandra would fail to consume the CQL statement creating the table when performing migration from Scylla to Cassandra 4.x, as the latter does not understand these properties anymore.

currently the default values of `dclocal_read_repair_chance` and `read_repair_chance` are both "0", so they are practically disabled unless the user deliberately sets them to a value greater than 0.

also, as a side effect, Cassandra 4.x has better support of Python3. the cqlsh shipped along with Cassandra 3.11.16 only supports python2.7, see
https://github.com/apache/cassandra/blob/cassandra-3.11.16/bin/cqlsh.py it errors out if the system only provides python3 with the error of
```
No appropriate python interpreter found.
```
but modern linux systems do not provide python2 anymore.

so, in this change, we deprecate these two options.

Fixes #3502
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18087

* github.com:scylladb/scylladb:
  docs: drop documents related to {,dclocal_}read_repair_chance
  treewide: remove {dclocal_,}read_repair_chance options
2024-04-26 10:48:47 +03:00
Patryk Jędrzejczak
0d428a3857 treewide: fix indentation after the previous patch 2024-04-25 14:33:21 +02:00
Patryk Jędrzejczak
3a34bb18cd db: config: make consistent-topology-changes unused
We make the `consistent-topology-changes` experimental feature
unused and assumed to be true in 6.0. We remove code branches that
executed if `consistent-topology-changes` was disabled.
2024-04-25 14:33:21 +02:00
Kefu Chai
c323c93fa4 treewide: remove {dclocal_,}read_repair_chance options
dclocal_read_repair_chance and read_repair_chance have been removed
in Cassandra 3.11 and 4.x, see
https://issues.apache.org/jira/browse/CASSANDRA-13910.
if we expose the properties via DDL, Cassandra would fail to consume
the CQL statement creating the table when performing migration
from Scylla to Cassandra 4.x, as the latter does not understand
these properties anymore.

currently the default values of `dclocal_read_repair_chance` and
`read_repair_chance` are both "0", so they are practically disabled
unless the user deliberately sets them to a value greater than 0.

also, as a side effect, Cassandra 4.x has better support of
Python3. the cqlsh shipped along with Cassandra 3.11.16 only
supports python2.7, see
https://github.com/apache/cassandra/blob/cassandra-3.11.16/bin/cqlsh.py
it errors out if the system only provides python3 with the error
of

```
No appropriate python interpreter found.
```
but modern linux systems do not provide python2 anymore.

so, in this change, we deprecate these two options.

Fixes #3502
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-04-25 17:15:27 +08:00
Ferenc Szili
98bec4e02a sstable: large data handler needs to count range tombstones as rows
When issuing warnings about partitions with a number of rows above a configured threshold,
the large partitions handler does not take the number of range tombstone markers
into account in the total row count. This fix adds the number of range tombstone markers to the
total number of rows and saves this total in system.large_partitions.rows (if it is above
the threshold). It also adds a new column, range_tombstones, to the system.large_partitions
table, which contains only the number of range tombstone markers for the given partition.

This PR fixes the first part of issue #13968.
It does not cover distinguishing between live and dead rows. A subsequent PR will handle that.
2024-04-22 15:24:18 +02:00
Kefu Chai
a439ebcfce treewide: include fmt/ranges.h and/or fmt/std.h
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.

in this change, we include `fmt/ranges.h` and/or `fmt/std.h`
for formatting container types like vector, map,
optional and variant using {fmt} instead of the homebrew
formatter based on operator<<.
together with the changes adding fmt::formatter and the
changes using the ostream formatter explicitly, this
allows us to drop the `FMT_DEPRECATED_OSTREAM` macro.

Refs scylladb#13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-04-19 22:56:16 +08:00
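A small, compilable example of why the include matters with fmt v10: formatting a container requires `fmt/ranges.h`, since fmt no longer synthesizes a formatter from operator<<.

```
#include <fmt/ranges.h>
#include <vector>

int main() {
    std::vector<int> v{1, 2, 3};
    fmt::print("{}\n", v);   // prints: [1, 2, 3]
}
```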
Benny Halevy
1061455442 gossiper: add load_endpoint_state
Pack the topology-related data loaded from system.peers
in `gms::load_endpoint_state`, to be used in a following
patch for `add_saved_endpoint`.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-04-14 15:06:56 +03:00
Piotr Dulikowski
baae811142 Merge 'auth: keep auth version in scylla_local' from Marcin Maliszkiewicz
Before the patch, the selection of the auth version depended
on the consistent topology feature, but during the raft recovery
procedure this feature is disabled, so we need to persist
the version somewhere to avoid switching back to v1, which
is not supported.

During recovery auth works in read-only mode, writes
will fail.

Fixes https://github.com/scylladb/scylladb/issues/17736

Closes scylladb/scylladb#18039

* github.com:scylladb/scylladb:
  auth: keep auth version in scylla_local
  auth: coroutinize service::start
2024-04-03 12:25:56 +02:00
Marcin Maliszkiewicz
562caaf6c6 auth: keep auth version in scylla_local
Before the patch, the selection of the auth version depended
on the consistent topology feature, but during the raft recovery
procedure this feature is disabled, so we need to persist
the version somewhere to avoid switching back to v1, which
is not supported.

During recovery auth works in read-only mode, writes
will fail.
2024-04-02 19:04:21 +02:00
Kamil Braun
d274f63d89 Merge 'Add support for "initial-token" parameter in raft mode' from Gleb
Fixes scylladb/scylladb#17893

* 'gleb/initial-token-v1' of github.com:scylladb/scylla-dev:
  dht: drop unused parameter from get_random_bootstrap_tokens() function
  test: add test for initial_token parameter
  topology coordinator: use provided initial_token parameter to choose bootstrap tokens
  topology coordinator: propagate initial_token option to the coordinator
2024-03-27 11:41:06 +01:00
Gleb Natapov
6ab78e13c6 topology coordinator: propagate initial_token option to the coordinator
The patch propagates the initial_token option to the topology coordinator,
where it is added to the join request parameters.
2024-03-26 18:43:16 +02:00
Pavel Emelyanov
8bf9098663 system_keyspace: Consolidate node-state vs tokens checks
When loading topology state, nodes are checked for whether the
"tokens" field is set. The check is based on node state and is
spread across the loading method; consolidate these checks.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#17957
2024-03-26 14:55:46 +02:00
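A hypothetical sketch of such a consolidation; the states and the helper below are illustrative, not the actual storage_service code:

```
enum class node_state { none, bootstrapping, normal, decommissioning, left };

// One helper encodes which node states are expected to carry tokens,
// instead of repeating the check throughout the loading method.
bool state_expects_tokens(node_state s) {
    switch (s) {
    case node_state::normal:
    case node_state::decommissioning:
        return true;     // token-owning member states (illustrative)
    default:
        return false;    // joining/left nodes carry no tokens
    }
}
```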
Michał Jadwiszczak
dab909b1d1 service:qos: store information about sl data migration
Save information about whether the service levels data was migrated to the v2 table.
The information is stored in the `system.scylla_local` table. It's
written with a raft command and included in the raft snapshot.
2024-03-21 23:14:57 +01:00
Michał Jadwiszczak
8e242f5acd db:system_keyspace: add SERVICE_LEVELS_V2 table
The table has the same schema as `system_distributed.service_levels`.
However, it's created all at once (unlike the old table, which creates
the base table first and then adds other columns) because `system` tables
are local to the node.
2024-03-21 23:14:57 +01:00
Tomasz Grabiec
61b3453552 raft topology: Keep nodes in the left state in topology
Those nodes will be kept in tablet replica sets for a while after a node
replace is done, until the new replica is rebuilt. So we need to know
those nodes' location (dc, rack) for two reasons:

 1) algorithms which work with replica sets filter nodes based on
 their location. For example, the materialized views code, which pairs
 base replicas with view replicas, filters by datacenter first.

 2) the tablet scheduler needs to identify each node's location in order
 to make decisions about new replica placement.

It's ok to not know the IP, and we don't keep it. Those nodes will not
be present in the IP-based replica sets, e.g. those returned by
get_natural_endpoints(), only in host_id-based replica
sets. storage_proxy request coordination is not affected.

Nodes in the left state are still not present in the token ring, and are
not considered members of the ring (datacenter endpoints exclude them).

In the future we could make the change even more transparent by only
loading locator::node* for those nodes and keeping node* in tablet
replica sets.

We load topology information only for left nodes which are actually
referenced by any tablet. To achieve that, the topology loading code
queries system.tablets for the set of hosts. This set is then passed to
the system.topology loading method, which decides whether to load a
replica_state for a left node or not.
2024-03-15 11:05:29 +01:00
Piotr Smaron
ad2d039e3d db: move all group 0 tables to schema commitlog
This is to have durability for the group0 tables.
But also because I need it specifically to make
`system.topology` & `system_schema.scylla_keyspaces`
mutations under a single raft command in https://github.com/scylladb/scylladb/pull/16723

Fixes: #15596

Closes scylladb/scylladb#17783
2024-03-14 13:33:30 +01:00
Marcin Maliszkiewicz
5a6d4dbc37 storage_service: add support for auth-v2 raft snapshots
This patch adds new RPC for pulling snapshot of auth tables.
2024-03-01 16:25:14 +01:00
Marcin Maliszkiewicz
9144d8203b db: add system_auth_v2 keyspace
The new keyspace is added similarly to the system_schema keyspace:
it's registered via system_keyspace::make, which calls
all_tables to build its schema.

A dummy table 'roles' is added, since keyspaces are currently
registered by walking through their tables. Full table schemas
will be added in subsequent commits.

Change can be observed via cqlsh:

cassandra@cqlsh> describe keyspaces;

system_auth_v2  system_schema       system         system_distributed_everywhere
system_auth     system_distributed  system_traces

cassandra@cqlsh> describe keyspace system_auth_v2;

CREATE KEYSPACE system_auth_v2 WITH replication = {'class': 'LocalStrategy'}  AND durable_writes = true;

CREATE TABLE system_auth_v2.roles (
    role text PRIMARY KEY
) WITH bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
    AND comment = 'comment'
    AND compaction = {'class': 'SizeTieredCompactionStrategy'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.0
    AND default_time_to_live = 0
    AND gc_grace_seconds = 604800
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';
2024-03-01 10:40:29 +01:00
Patryk Jędrzejczak
b8aa74f539 raft topology: clean only obsolete CDC generations' data
Currently, we may clear a CDC generation's data from
CDC_GENERATIONS_V3 if it is not the last committed generation
and it is at least 24 hours old (according to the topology
coordinator's clock). However, after allowing writes to the
previous CDC generations, this condition became incorrect. We
might clear data of a generation that could still be written to.

The new solution is to clear data of the generations that
finished operating more than 24 hours ago. The rationale behind
it is in the new comment in
`topology_coordinator::clean_obsolete_cdc_generations`.

The previous solution used the clean-up candidate. After
introducing `committed_cdc_generations`, it became unneeded.
The last obsolete generation can be computed in
`topology_coordinator::clean_obsolete_cdc_generations`. Therefore,
we remove all the code that handles the clean-up candidate.

After changing how we clear CDC generations' data,
`test_current_cdc_generation_is_not_removed` became obsolete.
The tested feature is not present in the code anymore.

`test_dependency_on_timestamps` became the only test case covering
the CDC generation's data clearing. We adjust it after the changes.
2024-02-20 12:35:18 +01:00
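A sketch of the new cleanup rule under simplified assumptions: a generation stops operating when its successor's timestamp passes, and it becomes safe to clear 24 hours after that moment.

```
#include <chrono>

using timestamp = std::chrono::system_clock::time_point;

// successor_ts is the timestamp of the next committed generation,
// i.e. the moment this generation finished operating.
bool generation_is_obsolete(timestamp successor_ts, timestamp now) {
    return now - successor_ts > std::chrono::hours(24);
}
```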
Patryk Jędrzejczak
18cff1aa6a system_keyspace: load_topology_state: fix indentation
Broken in the previous patch.
2024-02-20 12:35:18 +01:00
Patryk Jędrzejczak
e145e758eb raft topology: store committed CDC generations' IDs in the topology
When we create a CDC generation and ring-delay is non-zero, the
timestamp of the new generation is in the future. Hence, we can
have multiple generations that can be written to. However, if we
add a new node to the cluster with the Raft-based topology, it
receives only the last committed generation. So, this node will
be rejecting writes considered correct by the other nodes until
the last committed generation starts operating.

In scylladb/scylladb#17134, we have allowed sending writes to the
previous CDC generations. So, the situation became even more
complicated. We need to adjust the Raft-based topology to ensure
all required generations are loaded into memory and their data
isn't cleared too early.

This patch is the first step of the adjustment. We replace
`current_cdc_generation_{uuid, timestamp}` with the set containing
IDs of all committed generations - `committed_cdc_generations`.
This set is sorted by timestamps, just like
`unpublished_cdc_generations`.

This patch is mostly refactoring. The last generation in
`committed_cdc_generations` is the equivalent of the previous
`current_cdc_generation_{uuid, timestamp}`. The other generations
are irrelevant for now. They will be used in the following patches.

After introducing `committed_cdc_generations`, a newly committed
generation is also unpublished (it was current and unpublished
before the patch). We introduce `add_new_committed_cdc_generation`,
which updates both sets of generations so that we don't have to
call `add_committed_cdc_generation` and
`add_unpublished_cdc_generation` together. It's easy to forget
that both of them are necessary. Before this patch, there was
no call to `add_unpublished_cdc_generation` in
`topology_coordinator::build_coordinator_state`. It was a bug
reported in scylladb/scylladb#17288. This patch fixes it.

This patch also removes "the current generation" notion from the
Raft-based topology. For the Raft-based topology, the current
generation was the last committed generation. However, for the
`cdc::metadata`, it was the generation operating now. These two
generations could be different, which was confusing. For the
`cdc::metadata`, the current generation is relevant as it is
handled differently, but for the Raft-based topology, it isn't.
Therefore, we change only the Raft-based topology. The generation
called "current" is called "the last committed" from now.
2024-02-20 12:35:16 +01:00
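A sketch (with simplified types) of the refactoring described above: all committed generations live in a timestamp-sorted set, and one helper updates the committed and unpublished sets together so neither update is forgotten.

```
#include <chrono>
#include <set>
#include <utility>

// (timestamp, id) pairs sort by timestamp first, like the real sets.
using cdc_generation_id = std::pair<std::chrono::milliseconds, int>;

struct topology_cdc_state {
    std::set<cdc_generation_id> committed_cdc_generations;
    std::set<cdc_generation_id> unpublished_cdc_generations;

    // The last element of committed_cdc_generations plays the role of
    // the former "current" generation.
    void add_new_committed_cdc_generation(cdc_generation_id gen) {
        committed_cdc_generations.insert(gen);
        unpublished_cdc_generations.insert(gen);  // newly committed => unpublished
    }
};
```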
Kamil Braun
50ebce8acc Merge 'Purge old ip on change' from Petr Gusev
When a node changes IP address we need to remove its old IP from `system.peers` and gossiper.

We do this in `sync_raft_topology_nodes` when the new IP is saved into `system.peers`, to avoid losing the mapping if the node crashes between deleting the old IP and saving the new one. We also handle the possible duplicates in this case by dropping them on the read path when the node is restarted.

The PR also fixes the problem with old IPs getting resurrected when a node changes its IP address.
The following scenario is possible: a node `A` changes its IP from `ip1` to `ip2` with a restart; other nodes are not yet aware of `ip2`, so they keep gossiping `ip1`. After the restart, `A` receives `ip1` in a gossip message and calls `handle_major_state_change`, since it considers it a new node. Then the `on_join` event is called on the gossiper notification handlers; we receive this event in `raft_ip_address_updater`, and it reverts the IP of node `A` back to `ip1`.

To fix this we ensure that the new gossiper generation number is used when a node registers its IP address in `raft_address_map` at startup.

The `test_change_ip` is adjusted to ensure that the old IPs are properly removed in all cases, even if the node crashes.

Fixes #16886
Fixes #16691
Fixes #17199

Closes scylladb/scylladb#17162

* github.com:scylladb/scylladb:
  test_change_ip: improve the test
  raft_ip_address_updater: remove stale IPs from gossiper
  raft_address_map: add my ip with the new generation
  system_keyspace::update_peer_info: check ep and host_id are not empty
  system_keyspace::update_peer_info: make host_id an explicit parameter
  system_keyspace::update_peer_info: remove any_set flag optimisation
  system_keyspace: remove duplicate ips for host_id
  system_keyspace: peers table: use coroutines
  storage_service::raft_ip_address_updater: log gossiper event name
  raft topology: ip change: purge old IP
  on_endpoint_change: coroutinize the lambda around sync_raft_topology_nodes
2024-02-15 17:40:29 +01:00
Petr Gusev
2bf75c1a4e system_keyspace::update_peer_info: check ep and host_id are not empty 2024-02-15 13:21:04 +04:00
Petr Gusev
86410d71d1 system_keyspace::update_peer_info: make host_id an explicit parameter
The host_id field should always be set, so it's more
appropriate to pass it as a separate parameter.

The function storage_service::get_peer_info_for_update
is updated. It shouldn't look for the host_id app
state in the passed map; instead, the callers should
get the host_id on their own.
2024-02-15 13:21:04 +04:00
Petr Gusev
e0072f7cb3 system_keyspace::update_peer_info: remove any_set flag optimisation
This optimization never worked -- there were four usages of
the update_peer_info function and in all of them some of
the peer_info fields were set or should be set:
* sync_raft_topology_nodes/process_normal_node: e.g. tokens is set
* sync_raft_topology_nodes/process_transition_node: host_id is set
* handle_state_normal: tokens is set
* storage_service::on_change: get_peer_info_for_update could potentially
return a peer_info with all fields set to empty, but this shouldn't
be possible, host_id should always be set.

Moreover, there is a bug here: we extract host_id from the
states_ parameter, which represents the gossiper application
states that have been changed. This parameter contains host_id
only if a node changes its IP address; in all other cases host_id
is unset. This means we could end up with a record with an empty
host_id, if it wasn't previously set by some other means.

We are going to fix this bug in the next commit.
2024-02-15 13:21:04 +04:00
Petr Gusev
4a14988735 system_keyspace: remove duplicate ips for host_id
When a node changes its IP, we call sync_raft_topology_nodes
from raft_ip_address_updater::on_endpoint_change with
the old IP value in the prev_ip parameter.
It's possible that the node crashes right after
we insert a new IP for the host_id, but before we
remove the old IP. In this commit we fix the
possible inconsistency by removing the system.peers
record with the older timestamp. This is what the new
peers_table_read_fixup function is responsible for.

We call this function in all system_keyspace methods
that read the system.peers table. The function
loads the table in memory, decides if some rows
are stale by comparing their timestamps and
removes them.

The new function also removes the records with no
host_id, so we no longer need the get_host_id function.

We'll add a test for the problem this commit fixes
in the next commit.
2024-02-15 13:21:04 +04:00
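A sketch of the read-path fixup with simplified types (the real function also deletes the stale rows; this one only filters the in-memory result): for each host_id keep only the row with the newest timestamp, and drop rows that have no host_id at all.

```
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

struct peer_row {
    std::string ip;
    std::optional<std::string> host_id;
    int64_t timestamp;                    // write timestamp of the row
};

std::vector<peer_row> fixup_peers(std::vector<peer_row> rows) {
    std::map<std::string, peer_row> newest;       // host_id -> freshest row
    for (auto& r : rows) {
        if (!r.host_id) {
            continue;                             // stale row without host_id
        }
        auto [it, inserted] = newest.try_emplace(*r.host_id, r);
        if (!inserted && r.timestamp > it->second.timestamp) {
            it->second = r;                       // keep the newer IP mapping
        }
    }
    std::vector<peer_row> out;
    for (auto& [id, row] : newest) {
        out.push_back(std::move(row));
    }
    return out;
}
```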
Petr Gusev
fa8718085a system_keyspace: peers table: use coroutines
This is a refactoring commit with no observable
changes in behaviour.

We switch the functions to coroutines; it will
be easier to work with them this way in the
next commit. Also, we add more const-s
along the way.
2024-02-15 13:21:04 +04:00
Gleb Natapov
9b52dc4560 topology coordinator: make ignored_nodes list global and permanent
Currently the ignored_nodes list is part of a request (removenode or
replace) and exists only while the request is handled. This patch
changes it to be global and to exist outside of any request. Nodes stay
in the list until they are eventually removed and moved to the "left" state.
If a node is specified in the ignore-dead-nodes option for any command,
it will be ignored for all other operations that support ignored_nodes
(like tablet migration).
2024-02-13 16:15:35 +02:00
Piotr Dulikowski
fb02453686 system_keyspace: add read_cdc_generation_opt
`system_keyspace::read_cdc_generation` loads a CDC generation from
the system tables. One of its preconditions is that the generation
exists - this precondition is quite easy to satisfy in raft mode, and
the function was designed to be used solely in that mode.

In legacy mode however, in case when we revert from raft mode through
recovery, it might be necessary to use generations created in raft mode
for some time. In order to make the function useful as a fallback in
case lookup of a generation in legacy mode fails, introduce a relaxed
variant of `read_cdc_generation` which returns std::nullopt if the
generation does not exist.
2024-02-08 19:12:28 +01:00
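A sketch of the strict/relaxed pair with hypothetical signatures: the `_opt` variant returns std::nullopt for a missing generation, so legacy mode can fall back gracefully, while the strict variant keeps erroring out.

```
#include <optional>
#include <stdexcept>

struct cdc_generation { /* topology description elided */ };

// Relaxed variant: absence is a normal outcome the caller can handle.
std::optional<cdc_generation> read_cdc_generation_opt(bool exists) {
    if (!exists) {
        return std::nullopt;
    }
    return cdc_generation{};
}

// Strict variant: absence violates a precondition, as in raft mode.
cdc_generation read_cdc_generation(bool exists) {
    if (auto g = read_cdc_generation_opt(exists)) {
        return *g;
    }
    throw std::runtime_error("CDC generation does not exist");
}
```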