scylladb/service
Avi Kivity c1c0c96643 Merge '[Backport 2026.2] QOS: self-heal stale V1-to-V2 migration state on upgrade' from Scylladb[bot]
service_levels: self-heal stale v1 marker after raft topology upgrade

This PR handles an upgrade corner case where a node may already be using
raft topology, while `system.scylla_local` still marks service levels as v1.

The problem was introduced by commit 2917ec5d51
("service:qos: service levels migration"), which added the service-levels
migration from `system_distributed.service_levels` to
`system.service_levels_v2` as part of the raft topology upgrade.

However, if the cluster had no service levels configured, there was no data
to migrate. In that case, the migration path could leave the local version
marker unchanged, so the node would later observe an inconsistent state:

  * raft topology is already enabled;
  * service levels are still marked as v1 in `system.scylla_local`.
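The inconsistent combination above can be expressed as a simple predicate. This is only a minimal sketch: `local_state_snapshot`, `marker_version`, and `is_stale_service_levels_state` are hypothetical names chosen for illustration and are not taken from the ScyllaDB sources, and the real marker in `system.scylla_local` is not modeled here.

```cpp
// Hypothetical model of the service-levels version marker; the real
// representation stored in system.scylla_local is not modeled here.
enum class marker_version { v1, v2 };

// Hypothetical snapshot of the two pieces of local state the text describes.
struct local_state_snapshot {
    bool raft_topology_enabled;   // assumed flag: node already uses raft topology
    marker_version service_levels; // assumed view of the local version marker
};

// True only for the stale combination: raft topology is on, marker is still v1.
inline bool is_stale_service_levels_state(const local_state_snapshot& s) {
    return s.raft_topology_enabled && s.service_levels == marker_version::v1;
}
```

A node with raft topology disabled, or with the marker already at v2, is not considered stale by this predicate.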

Such clusters can be left in a stale state and fail startup during upgrade to
2026.2.

This PR makes the upgrade path self-healing.

The first commit restores `service_level_controller::migrate_to_v2()`, giving
us a group0-based path for writing the service-levels v2 state even after raft
topology is already in use.

The second commit wires this path into startup. When the node detects the
stale raft-topology + service-levels-v1 state, it retries the migration a
bounded number of times and updates the version marker to v2 instead of
failing startup.
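The startup behavior described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the PR: `node_state`, `sl_version`, `self_heal_on_startup`, the `try_migrate` callback, and the `max_attempts` parameter are all hypothetical, and what happens once the retries are exhausted (here: return `false`) is an assumption the description does not specify.

```cpp
#include <functional>

// Hypothetical model of the service-levels version marker.
enum class sl_version { v1, v2 };

// Hypothetical view of the node's local state at startup.
struct node_state {
    bool raft_topology_enabled;
    sl_version marker;
};

// Sketch of a bounded self-heal loop. try_migrate stands in for the
// group0-based migration and returns true on success (assumption).
// Returns true if startup may proceed.
inline bool self_heal_on_startup(node_state& st,
                                 const std::function<bool()>& try_migrate,
                                 int max_attempts) {
    const bool stale = st.raft_topology_enabled && st.marker == sl_version::v1;
    if (!stale) {
        return true; // nothing to heal
    }
    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        if (try_migrate()) {
            st.marker = sl_version::v2; // update the marker only after success
            return true;
        }
    }
    return false; // assumed outcome when every attempt fails
}
```

The marker is updated only after a successful migration attempt, so a failed attempt leaves the node in the same detectable stale state and the next attempt (or the next restart) can try again.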

With this change, clusters that were left in this stale state can recover
automatically during upgrade to 2026.

Fixes: SCYLLADB-2038

backport: 2026.2, 2026.1; this functionality is needed when upgrading older servers

- (cherry picked from commit ac0a19aab8)
- (cherry picked from commit c2014f7e50)
- (cherry picked from commit 6188bf3e01)

Parent PR: #29749

Closes scylladb/scylladb#29905

* github.com:scylladb/scylladb:
  test/auth_cluster: simulate v1 state in self-heal test
    When skip_service_levels_v2_initialization is used, write an explicit
    v1 service level version marker while skipping v2 initialization.
    This lets the restart test exercise self-healing from v1 to v2.
  qos: self-heal stale service levels version on startup
  qos: reintroduce service levels v2 migration self-heal
2026-05-17 19:33:58 +03:00