Commit Graph

121 Commits

Author SHA1 Message Date
Alex
c2014f7e50 qos: self-heal stale service levels version on startup
Add self_heal_service_levels_version() and use it during startup when
  the node is already on raft topology but service levels are still marked
  as v1.

  In that stale state, migrate service levels to v2 through group0 instead
  of failing startup.
2026-05-13 17:55:20 +03:00
Alex
ac0a19aab8 qos: reintroduce service levels v2 migration self-heal
migrate_to_v2() was removed after gossip-based service level migration
  support was dropped, since upgraded nodes were expected to already use
  service levels v2.

  However, clusters affected by the old migration bug may reach raft topology
  while system.scylla_local still has a stale service level version. Restore
  the migration helper so startup can self-heal those nodes by writing the v2
  state through group0.
2026-05-13 10:16:02 +03:00
Gleb Natapov
24171ce62b db/system_distributed_keyspace: drop old service_levels table
Service level management moved to raft and old table is no longer
supported.
2026-04-15 15:48:48 +03:00
Avi Kivity
0ae22a09d4 LICENSE: Update to version 1.1
Updated terms of non-commercial use (must be a never-customer).
2026-04-12 19:46:33 +03:00
Gleb Natapov
fae5282c82 service level: fix crash during migration to driver server level
Before b59b3d4 the migration code checked that service level controller
is on v2 version before migration and the check also implicitly checked
that _sl_data_accessor field is already initialized, but now that the
check is gone the migration can start before service level controller is
fully initialized. Re add the check, but to a different place.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1049

Closes scylladb/scylladb#29021
2026-03-13 11:24:26 +01:00
Gleb Natapov
c30907b8f2 service level: remove unused get_user_scheduling_group function 2026-03-12 14:28:26 +02:00
Gleb Natapov
a934d8391d service level: drop async find_effective_service_level
find_cached_effective_service_level does exactly same thing now and it
is synchronous.
2026-03-12 14:28:26 +02:00
Gleb Natapov
f888f2dced service level: remove remnants of version 1 service level
can_use_effective_service_level_cache() always returns true now, so the
function can be dropped entirely and all the code that assumes it may
return false can be dropped as well.
2026-03-12 12:27:52 +02:00
Gleb Natapov
b59b3d4f8a service level: remove version 1 service level code 2026-03-10 10:46:48 +02:00
Gleb Natapov
be153a4eb7 service_level_controller: drop service level upgrade code
We do not allow upgrade from a version that is not updated yet, so the
code is not used any longer.
2026-03-10 10:09:38 +02:00
Gleb Natapov
4e072977d4 group0: drop in_recovery function and its uses
Legacy recovery procedure is no longer supported and the code can be
dropped.
2026-03-10 10:09:38 +02:00
Dario Mirovic
e5218157de service: qos: handle special scheduling group case for maintenance socket
service_level_controller has special handling for maintenance socket connections.
If the current user is not a named user, it should use the default scheduling group.

The reason is that the maintenance socket can communicate with Scylla before
auth_integration is registered.

The guard is already present, but it was omitted in get_cached_user_scheduling_group.

This also fixes flakiness in test_maintenance_socket.py tests.

Refs SCYLLADB-409
2026-03-03 22:31:35 +01:00
Dario Mirovic
dc9a90d7cb service: qos: use _auth_integration as condition for using _auth_integration
Maintenance socket connections can be established before _auth_integration is
initialized. The fix introduced with scylladb/scylladb#26856 PR check for
the value of user variable. For maintenance socket connections it will be an
anonymous user, and will fall back to using default scheduling group.

This patch changes the criteria for using default scheduling group from
the user variable to checking the _auth_integration variable itself:
- If _auth_integration is not initialized, use default scheduling group
- If _auth_integration is initialized, let it choose the scheduling group

Refs SCYLLADB-409
2026-03-03 22:31:35 +01:00
Pavel Emelyanov
18b5a49b0c Populate all sl:* groups into dedicated top-level supergroup
This patch changes the layout of user-facing scheduling groups from

/
`- statement
`- sl:default
`- sl:*
`- other groups (compaction, streaming, etc.)

into

/
`- user (supergroup)
   `- statement
   `- sl:default
   `- sl:*
`- other groups (compaction, streaming, etc.)

The new supergroup has 1000 static shares and is name-less, in a sense
that it only have a variable in the code to refer to and is not exported
via metrics (should be fixed in seastar if we want to).

The moved groups don't change their names or shares, only move inside
the scheduling hierarchy.

The goal of the change is to improve resource consumption of sl:*
groups. Right now activities in low-shares service levels are scheduled
on-par with e.g. streaming activity, which is considered to be low-prio
one. By moving all sl:* groups into their own supergroup with 1000
shares changes the meaning of sl:* shares. From now on these shares
values describe preirities of service level between each-other, and the
user activities compete with the rest of the system with 1000 shares,
regardless of how many service levels are there.

Unit tests keep their user groups under root supergroup (for simplicity)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28235
2026-01-21 14:14:48 +02:00
Alex
17c9d640fe server: Fix switch_tenant problem, When running on a V2 server, service-level data comes from service level cache. Because of this, we can use synchronized function to get the schedualing group.
Since we are transitioning to a Raft-based architecture where all servers will be V2, we can safely implement this fix specifically for that case.
This change adds get_cached_user_scheduling_group functionality and moves its usage out of switch_tenant function in update_scheduling_group_v2 usage.
2025-12-14 16:27:40 +02:00
Dawid Mędrek
c0f7622d12 service/qos: Do not crash Scylla if auth_integration absent
If the user connects to Scylla via the maintenance socket, it may happen
that `auth_integration` has not been registered in the service level
controller yet. One example is maintenance mode when that will never
happen; another when the connection occurs before Scylla is fully
initialized.

To avoid unnecessary crashes, we add new branches if the passed user is
absent or if it corresponds to the anonymous role. Since the role
corresponding to a connection via the maintenance socket is the anonymous
role, that solves the problem.

In those cases, we completely circumvent any calls to `auth_integration`
and handle them separately. The modified methods are:

* `get_user_scheduling_group`,
* `with_user_service_level`,
* `describe_service_levels`.

For the first two, the new behavior is in line with the previous
implementation of those functions. The last behaves differently now,
but since it's a soft error, crashing the node is not necessary anyway.
We throw an exception instead, whose error message should give the user
a hint of what might be wrong.

The other uses of `auth_integration` within the service level controller
are not problematic:

* `find_effective_service_level`,
* `find_cached_effective_service_level`.

They take the name of a role as their argument. Since the anonymous role
doesn't have a name, it's not possible to call them with it.

Fixes scylladb/scylladb#26816
2025-11-10 19:21:36 +01:00
Avi Kivity
55d4d39ae3 Merge 'transport: service_level_controller: create and use driver service level' from Andrzej Jackowski
This is a cherry-pick of https://github.com/scylladb/scylladb/pull/25412 commits, as the changes were reverted in 364316dd2f2212bbbb446eaa2a4b0bd53d125ad5 due to https://github.com/scylladb/scylladb/issues/26163.
The underlying problem (https://github.com/scylladb/scylladb/issues/26190) was fixed in seastar (https://github.com/scylladb/seastar/pull/2994), so https://github.com/scylladb/scylladb/pull/25412 commits are restored without changes (only rebase conflicts were resolved).

===
This patch series:
 - Increases the number of allowed scheduling groups to allow creation of `sl:driver`
 - Implements `create_driver_service_level` that creates `sl:driver` with shares=200 if it wasn't already created
 - Implements creation of `sl:driver` for new systems and tests in `raft_initialize_discovery_leader`
 - Modifies `topology_coordinator` to use  create `sl:driver` after upgrades.
 - Implements using `sl:driver` for new connections in `transport/server`
 - Adds to `transport/server` recognition of driver's control connections and forcing them to keep using `sl:driver`.
 - Adds tests to verify the new functionality
 - Modifies existing tests to let them pass after `sl:driver` is added
 - Modifies the documentation to contain new `sl:driver`

The changes were evaluated by a test with the following scenario ([test_connections-sl-driver.py](https://github.com/user-attachments/files/22021273/test_connections-sl-driver.py)):
 - Start ScyllaDB with one node
 - Create 1000 keyspaces, 1 table in each keyspace
 - Start `cassandra-stress` (`-rate threads=50  -mode native cql3`)
 - Run connection storm with 1000 session (100 python processes, 10 sessions each)

The maximum latency during connection storm dropped **from 224.94ms to 41.43ms** (those numbers are average from 20 test executions, were max latency was in [140ms, 361ms] before change and [31.4ms, 61.5ms] after).

The snippet of cassandra-stress output from the moment of connection storm:
Before:
```
type       total ops,    op/s,    pk/s,   row/s,    mean,     med,     .95,     .99,    .999,     max,   time,   stderr, errors,  gc: #,  max ms,  sum ms,  sdv ms,      mb
...
total,        789206,   85887,   85887,   85887,     0.6,     0.3,     2.0,     2.0,     2.5,     5.0,    9.0,  0.09679,      0,      0,       0,       0,       0,       0
total,        909322,  120116,  120116,  120116,     0.4,     0.2,     1.9,     2.0,     2.1,     3.1,   10.0,  0.09053,      0,      0,       0,       0,       0,       0
total,        964392,   55070,   55070,   55070,     0.9,     0.4,     2.0,     4.5,     7.7,    18.9,   11.0,  0.09203,      0,      0,       0,       0,       0,       0
total,        975705,   11313,   11313,   11313,     4.4,     3.5,     6.5,    24.5,    82.7,    83.0,   12.0,  0.11713,      0,      0,       0,       0,       0,       0
total,        987548,   11843,   11843,   11843,     4.2,     3.5,     6.5,    33.7,    48.6,    51.5,   13.0,  0.13366,      0,      0,       0,       0,       0,       0
total,        995422,    7874,    7874,    7874,     6.3,     4.0,     7.7,    85.6,   112.9,   113.5,   14.0,  0.14753,      0,      0,       0,       0,       0,       0
total,       1007228,   11806,   11806,   11806,     4.3,     3.5,     6.5,    29.1,    43.8,    87.1,   15.0,  0.15598,      0,      0,       0,       0,       0,       0
total,       1012840,    5612,    5612,    5612,     8.2,     5.0,    11.5,   121.8,   166.6,   170.1,   16.0,  0.16535,      0,      0,       0,       0,       0,       0
total,       1016186,    3346,    3346,    3346,    13.4,     7.4,    20.1,   204.9,   207.6,   210.4,   17.0,  0.17405,      0,      0,       0,       0,       0,       0
total,       1025462,    9276,    9276,    9276,     6.3,     3.9,     9.6,    74.6,   206.8,   210.0,   18.0,  0.17800,      0,      0,       0,       0,       0,       0
total,       1035979,   10517,   10517,   10517,     4.8,     3.5,     6.7,    38.5,    82.6,    83.0,   19.0,  0.18120,      0,      0,       0,       0,       0,       0
total,       1047488,   11509,   11509,   11509,     4.3,     3.5,     6.0,    32.6,    72.3,    74.0,   20.0,  0.18334,      0,      0,       0,       0,       0,       0
total,       1077456,   29968,   29968,   29968,     1.7,     1.6,     2.9,     3.6,     7.0,     8.2,   21.0,  0.17943,      0,      0,       0,       0,       0,       0
total,       1105490,   28034,   28034,   28034,     1.8,     1.8,     3.5,     4.6,     5.3,    13.8,   22.0,  0.17609,      0,      0,       0,       0,       0,       0
total,       1132221,   26731,   26731,   26731,     1.9,     1.8,     3.8,     5.2,     8.4,    11.1,   23.0,  0.17314,      0,      0,       0,       0,       0,       0
total,       1162149,   29928,   29928,   29928,     1.7,     1.7,     3.0,     4.5,     8.0,     9.1,   24.0,  0.16950,      0,      0,       0,       0,       0,       0
...
```

After:
```
type       total ops,    op/s,    pk/s,   row/s,    mean,     med,     .95,     .99,    .999,     max,   time,   stderr, errors,  gc: #,  max ms,  sum ms,  sdv ms,      mb
...
total,        822863,   94379,   94379,   94379,     0.5,     0.3,     2.0,     2.0,     2.1,     3.7,    9.0,  0.06669,      0,      0,       0,       0,       0,       0
total,        937337,  114474,  114474,  114474,     0.4,     0.2,     2.0,     2.0,     2.1,     3.4,   10.0,  0.06301,      0,      0,       0,       0,       0,       0
total,        986630,   49293,   49293,   49293,     1.0,     1.0,     2.0,     2.1,    17.9,    19.0,   11.0,  0.07318,      0,      0,       0,       0,       0,       0
total,       1026734,   40104,   40104,   40104,     1.2,     1.0,     2.0,     2.2,     6.3,     7.1,   12.0,  0.08410,      0,      0,       0,       0,       0,       0
total,       1066124,   39390,   39390,   39390,     1.3,     1.0,     2.0,     2.2,     2.6,     3.4,   13.0,  0.09108,      0,      0,       0,       0,       0,       0
total,       1103082,   36958,   36958,   36958,     1.3,     1.1,     2.1,     2.5,     3.1,     4.2,   14.0,  0.09643,      0,      0,       0,       0,       0,       0
total,       1141987,   38905,   38905,   38905,     1.3,     1.0,     2.0,     2.4,    11.4,    12.7,   15.0,  0.09894,      0,      0,       0,       0,       0,       0
total,       1180023,   38036,   38036,   38036,     1.3,     1.0,     2.0,     3.7,     5.6,     7.1,   16.0,  0.10070,      0,      0,       0,       0,       0,       0
total,       1216481,   36458,   36458,   36458,     1.4,     1.0,     2.1,     3.6,     4.7,     5.0,   17.0,  0.10210,      0,      0,       0,       0,       0,       0
total,       1256819,   40338,   40338,   40338,     1.2,     1.0,     2.0,     2.2,     3.5,     5.4,   18.0,  0.10173,      0,      0,       0,       0,       0,       0
total,       1295122,   38303,   38303,   38303,     1.3,     1.0,     2.0,     2.4,    21.0,    21.1,   19.0,  0.10136,      0,      0,       0,       0,       0,       0
total,       1334743,   39621,   39621,   39621,     1.3,     1.0,     2.0,     2.3,     3.3,     4.0,   20.0,  0.10055,      0,      0,       0,       0,       0,       0
total,       1375579,   40836,   40836,   40836,     1.2,     1.0,     2.0,     2.1,     3.4,     5.7,   21.0,  0.09927,      0,      0,       0,       0,       0,       0
total,       1415576,   39997,   39997,   39997,     1.2,     1.0,     2.0,     2.3,     3.2,     4.1,   22.0,  0.09807,      0,      0,       0,       0,       0,       0
total,       1449268,   33692,   33692,   33692,     1.5,     1.4,     2.5,     3.2,     4.2,     5.6,   23.0,  0.09800,      0,      0,       0,       0,       0,       0
total,       1471873,   22605,   22605,   22605,     2.2,     2.0,     4.8,     5.9,     7.0,     7.9,   24.0,  0.10015,      0,      0,       0,       0,       0,       0
...
```

Fixes: https://github.com/scylladb/scylladb/issues/24411

This is a new feature, so no backport needed.

Closes scylladb/scylladb#26411

* github.com:scylladb/scylladb:
  docs: workload-prioritization: add driver service level
  test: add test to verify use of `sl:driver`
  transport: use `sl:driver` to handle driver's control connections
  transport: whitespace only change in update_scheduling_group
  transport: call update_scheduling_group for non-auth connections
  generic_server: transport: start using `sl:driver` for new connections
  test: add test_desc_* for driver service level
  test: service_levels: add tests for sl:driver creation and removal
  test: add reload_raft_topology_state() to ScyllaRESTAPIClient
  service_level_controller: automatically create `sl:driver`
  service_level_controller: methods to create driver service level
  service_level_controller: handle special sl:driver in DESC output
  topology_coordinator: add service_level_controller reference
  system_keyspace: add service_level_driver_created
  test: add MAX_USER_SERVICE_LEVELS
2025-10-09 17:28:39 +03:00
Andrzej Jackowski
923559f46a service_level_controller: methods to create driver service level
This commit implements `get_create_driver_service_level_mutations`
and `migrate_to_driver_service_level` in service_level_controller.
Both methods create `sl:driver` with shares=200 and store this fact
in `system.scylla_local`. Both methods will be used later in this
patch series for automatic creation of sl:driver.

Refs: scylladb/scylladb#24411
2025-10-08 08:24:38 +02:00
Andrzej Jackowski
2d296a2f9b service_level_controller: handle special sl:driver in DESC output
Later in this patch series, `sl:driver` will be added as a special
service level created automatically by the system. It needs special
handling in `DESC SCHEMA ...` to ensure that during backup restore:
  1. CREATE SERVICE LEVEL does not fail if `sl:driver` already exists
  2. If `sl:driver` exists, its configuration is fully restored (emit
     ALTER SERVICE LEVEL).
  3. If `sl:driver` was removed, the information is retained (emit
     DROP SERVICE LEVEL instead of CREATE/ALTER).

Refs: scylladb/scylladb#24411
2025-10-08 08:24:33 +02:00
Michael Litvak
ad1a5b7e42 service/qos: set long timeout for auth queries on SL cache update
pass an appropriate query state for auth queries called from service
level cache reload. we use the function qos_query_state to select a
query_state based on caller context - for internal queries, we set a
very long timeout.

the service level cache reload is called from group0 reload. we want it
to have a long timeout instead of the default 5 seconds for auth
queries, because we don't have strict latency requirement on the one
hand, and on the other hand a timeout exception is undesired in the
group0 reload logic and can break group0 on the node.

Fixes scylladb/scylladb#25290
2025-09-25 16:55:29 +02:00
Avi Kivity
1258e7c165 Revert "Merge 'transport: service_level_controller: create and use driver service level' from Andrzej Jackowski"
This reverts commit fe7e63f109, reversing
changes made to b5f3f2f4c5. It is causing
test.py failures around cqlpy.

Fixes #26163

Closes scylladb/scylladb#26174
2025-09-22 09:32:46 +03:00
Andrzej Jackowski
6a911bff3f service_level_controller: methods to create driver service level
This commit implements `get_create_driver_service_level_mutations`
and `migrate_to_driver_service_level` in service_level_controller.
Both methods create `sl:driver` with shares=200 and store this fact
in `system.scylla_local`. Both methods will be used later in this
patch series for automatic creation of sl:driver.

Refs: scylladb/scylladb#24411
2025-09-18 09:28:32 +02:00
Andrzej Jackowski
5cb4577800 service_level_controller: handle special sl:driver in DESC output
Later in this patch series, `sl:driver` will be added as a special
service level created automatically by the system. It needs special
handling in `DESC SCHEMA ...` to ensure that during backup restore:
  1. CREATE SERVICE LEVEL does not fail if `sl:driver` already exists
  2. If `sl:driver` exists, its configuration is fully restored (emit
     ALTER SERVICE LEVEL).
  3. If `sl:driver` was removed, the information is retained (emit
     DROP SERVICE LEVEL instead of CREATE/ALTER).

Refs: scylladb/scylladb#24411
2025-09-18 09:28:32 +02:00
Dawid Mędrek
fc1c41536c service/qos: Move effective SL cache to auth_integration
Since `auth_integration` manages effective service levels, let's move
the relevant cache from `service_level_controller` to it.
2025-08-26 18:41:48 +02:00
Dawid Mędrek
dd5a35dc67 service/qos: Add auth::service to auth_integration
The new service, `auth_integration`, has taken over the responsibility
over managing effective service levels from `service_level_controller`.
However, before these changes, it still accessed `auth::service` via
the service level controller. Let's change that.

Note that we also remove a check that `auth::service` has been
initialized. It's not necessary anymore because the lifetime of
`auth_integration` is strictly nested within the lifetime of `auth::service`.

In actuality, `service_level_controller` should lose its reference to
`auth::service` completely. All of the management over effective service
levels has already been moved to `auth_integration`. However, the
referernce is still needed when dropping a distributed service level
because we need to update the corresponding attribute for relevant
roles.

That should not lead to invalid accesses, though. Dropping a service level
should not be possible when `auth::service` is not initialized.
2025-08-26 18:41:43 +02:00
Dawid Mędrek
e929279d74 service/qos: Reload effective SL cache conditionally
Since `service_level_controller` outlives `auth_integration`, it may
happen that we try to access it when it has already been deinitialized.
To prevent that, we only try to reload or clear the effective service
level cache when the object is still alive.

These changes solve an existing problem with an invalid memory access.
For more context, see issue scylladb/scylladb#24792.

We provide a reproducer test that consistently fails before these
changes but passes after them.

Fixes scylladb/scylladb#24792
2025-08-26 18:41:40 +02:00
Dawid Mędrek
34afb6cdd9 service/qos: Add gate to auth_integration
We add a named gate to `auth_integration` that will aid us in synchronizing
ongoing tasks with stopping the service.
2025-08-26 18:41:37 +02:00
Dawid Mędrek
7d0086b093 service/qos: Introduce auth_integration
We introduce a new type, `auth_integration`, that will be used internally
by `service_level_controller`. Its purpose is to take over the responsibility
over managing effective service levels.

The main problem of the current implementation of service level controller
is its dependency on `auth::service` whose lifetime is strictly nested
within the lifetime of service level controller. That may and already have
led to invalid memory accesses; for an example, see issue
scylladb/scylladb#24792.

Our strategy is to split service level controller into smaller parts and
ensure that we access `auth::service` only when it's valid to do so.
This commit is the first step towards that.

We don't change anything in the logic yet, just add the new type. Further
adjustments will be made in following commits.
2025-08-26 18:41:34 +02:00
Michał Jadwiszczak
10214e13bd storage_service, group0_state_machine: move SL cache update from topology_state_load() to load_snapshot()
Currently the service levels cache is unnecessarily updated in every
call of `topology_state_load()`.
But it is enough to reload it only when a snapshot is loaded.
(The cache is also already updated when there is a change to one of
`service_levels_v2`, `role_members`, `role_attributes` tables.)

Fixes scylladb/scylladb#25114
Fixes scylladb/scylladb#23065

Closes scylladb/scylladb#25116
2025-08-01 13:41:08 +02:00
Piotr Dulikowski
2bb800c004 qos: don't populate effective service level cache until auth is migrated to raft
Right now, service levels are migrated in one group0 command and auth
is migrated in the next one. This has a bad effect on the group0 state
reload logic - modifying service levels in group0 causes the effective
service levels cache to be recalculated, and to do so we need to fetch
information about all roles. If the reload happens after SL upgrade and
before auth upgrade, the query for roles will be directed to the legacy
auth tables in system_auth - and the query, being a potentially remote
query, has a timeout. If the query times out, it will throw
an exception which will break the group0 apply fiber and the node will
need to be restarted to bring it back to work.

In order to solve this issue, make sure that the service level module
does not start populating and using the service level cache until both
service levels and auth are migrated to raft. This is achieved by adding
the check both to the cache population logic and the effective service
level getter - they now look at service level's accessor new method,
`can_use_effective_service_level_cache` which takes a look at the auth
version.

Fixes: scylladb/scylladb#24963
2025-07-29 11:37:37 +02:00
Benny Halevy
3feb759943 everywhere: use utils::chunked_vector for list of mutations
Currently, we use std::vector<*mutation> to keep
a list of mutations for processing.
This can lead to large allocation, e.g. when the vector
size is a function of the number of tables.

Use a chunked vector instead to prevent oversized allocations.

`perf-simple-query --smp 1` results obtained for fixed 400MHz frequency
and PGO disabled:

Before (read path):
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...

89055.97 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39417 insns/op,   18003 cycles/op,        0 errors)
103372.72 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39380 insns/op,   17300 cycles/op,        0 errors)
98942.27 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39413 insns/op,   17336 cycles/op,        0 errors)
103752.93 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39407 insns/op,   17252 cycles/op,        0 errors)
102516.77 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39403 insns/op,   17288 cycles/op,        0 errors)
throughput:
	mean=   99528.13 standard-deviation=6155.71
	median= 102516.77 median-absolute-deviation=3844.59
	maximum=103752.93 minimum=89055.97
instructions_per_op:
	mean=   39403.99 standard-deviation=14.25
	median= 39406.75 median-absolute-deviation=9.30
	maximum=39416.63 minimum=39380.39
cpu_cycles_per_op:
	mean=   17435.81 standard-deviation=318.24
	median= 17300.40 median-absolute-deviation=147.59
	maximum=18002.53 minimum=17251.75
```

After (read path)
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...
59755.04 tps ( 66.2 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39466 insns/op,   22834 cycles/op,        0 errors)
71854.16 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39417 insns/op,   17883 cycles/op,        0 errors)
82149.45 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39411 insns/op,   17409 cycles/op,        0 errors)
49640.04 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.3 tasks/op,   39474 insns/op,   19975 cycles/op,        0 errors)
54963.22 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.3 tasks/op,   39474 insns/op,   18235 cycles/op,        0 errors)
throughput:
	mean=   63672.38 standard-deviation=13195.12
	median= 59755.04 median-absolute-deviation=8709.16
	maximum=82149.45 minimum=49640.04
instructions_per_op:
	mean=   39448.38 standard-deviation=31.60
	median= 39466.17 median-absolute-deviation=25.75
	maximum=39474.12 minimum=39411.42
cpu_cycles_per_op:
	mean=   19267.01 standard-deviation=2217.03
	median= 18234.80 median-absolute-deviation=1384.25
	maximum=22834.26 minimum=17408.67
```

`perf-simple-query --smp 1 --write` results obtained for fixed 400MHz frequency
and PGO disabled:

Before (write path):
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=write, query_single_key=no, counters=no}
Disabling auto compaction
63736.96 tps ( 59.4 allocs/op,  16.4 logallocs/op,  14.3 tasks/op,   49667 insns/op,   19924 cycles/op,        0 errors)
64109.41 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   49992 insns/op,   20084 cycles/op,        0 errors)
56950.47 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50005 insns/op,   20501 cycles/op,        0 errors)
44858.42 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50014 insns/op,   21947 cycles/op,        0 errors)
28592.87 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50027 insns/op,   27659 cycles/op,        0 errors)
throughput:
	mean=   51649.63 standard-deviation=15059.74
	median= 56950.47 median-absolute-deviation=12087.33
	maximum=64109.41 minimum=28592.87
instructions_per_op:
	mean=   49941.18 standard-deviation=153.76
	median= 50005.24 median-absolute-deviation=73.01
	maximum=50027.07 minimum=49667.05
cpu_cycles_per_op:
	mean=   22023.01 standard-deviation=3249.92
	median= 20500.74 median-absolute-deviation=1938.76
	maximum=27658.75 minimum=19924.32
```

After (write path)
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=write, query_single_key=no, counters=no}
Disabling auto compaction
53395.93 tps ( 59.4 allocs/op,  16.5 logallocs/op,  14.3 tasks/op,   50326 insns/op,   21252 cycles/op,        0 errors)
46527.83 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50704 insns/op,   21555 cycles/op,        0 errors)
55846.30 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50731 insns/op,   21060 cycles/op,        0 errors)
55669.30 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50735 insns/op,   21521 cycles/op,        0 errors)
52130.17 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50757 insns/op,   21334 cycles/op,        0 errors)
throughput:
	mean=   52713.91 standard-deviation=3795.38
	median= 53395.93 median-absolute-deviation=2955.40
	maximum=55846.30 minimum=46527.83
instructions_per_op:
	mean=   50650.57 standard-deviation=182.46
	median= 50731.38 median-absolute-deviation=84.09
	maximum=50756.62 minimum=50325.87
cpu_cycles_per_op:
	mean=   21344.42 standard-deviation=202.86
	median= 21334.00 median-absolute-deviation=176.37
	maximum=21554.61 minimum=21060.24
```

Fixes #24815

Improvement for rare corner cases. No backport required

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#24919
2025-07-13 19:13:11 +03:00
Dawid Mędrek
ac9062644f cql3: Represent create_statement using managed_string
When describing a table, we need to do it carefully: if some
columns were dropped, we must specify that explicitly by

```
ALTER TABLE {table} DROP {column} USING TIMESTAMP ...
```

in the result of the DESCRIBE statement. Failing to do so
could lead to data resurrection.

However, if a table has been altered many, many times,
we might end up with a huge create statement. Constructing
it could, in turn, trigger an oversized allocation.
Some tests ran into that very problem in fact.

In this commit, we want to mitigate the problem: instead of
allocating a contiguous chunk of memory for the create
statement, we use `fragmented_ostringstream` and `managed_string`
to possibly keep data scattered in memory. It makes handling
`cql3::description` less convenient in the code, but since
the struct is pretty much immediately serialized after
creating it, it's a very good trade-off.

We provide a reproducer. It consistently passes with this commit,
while having about 50% chance of failure before it (based on my
own experiments). Playing with the parameters of the test
doesn't seem to improve that chance, so let's keep it as-is.

Fixes scylladb/scylladb#24018
2025-07-01 12:58:02 +02:00
Avi Kivity
f195c05b0d untyped_result_set: mark get_blob() as returning unfragmented data
Blobs can be large, and unfragmented blobs can easily exceed 128k
(as seen in #23903). Rename get_blob() to get_blob_unfragmented()
to warn users.

Note that most uses are fine as the blobs are really short strings.

Closes scylladb/scylladb#24102
2025-05-26 09:40:34 +02:00
Kefu Chai
aca00118fb service: fix misspellings
these misspellings were flagged by codespell.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#23334
2025-03-18 22:21:45 +02:00
Gleb Natapov
8a747fbc2a treewide: drop endpoint life cycle subscribers that do nothing
Provide default implementation for them instead. Will be easier to rework them later.
2025-03-11 12:09:22 +02:00
Abhinav Jha
e491950c47 raft topology: Add support for raft topology system tables initialization to happen before group0 initialization
In the current scenario, topology_change_kind variable, was been handled using
 _manage_topology_change_kind_from_group0 variable. This method was brittle
and had some bugs(e.g. for restart case, it led to a time gap between group0
server start and topology_change_kind being managed via group0)

Post _manage_topology_change_kind_from_group0 removal, careful management of
topology_change_kind variable was needed for maintaining correct
topology_change_kind in all scenarios. So this PR also performs a refactoring
to populate all init data to system tables even before group0 creation(via
raft_initialize_discovery_leader function). Now because raft_initialize_discovery_leader
happens before the group 0 creation, we write mutations directly to system
tables instead of a group 0 command. Hence, post group0 creation, the node
can read the correct values from system tables and correct values are
maintained throughout.

Added a new function initialize_done_topology_upgrade_state which takes
care of updating the correct upgrade state to system tables before starting
group0 server. This ensures that the node can read the correct values from
system tables and correct values are maintained throughout.

By moving raft_initialize_discovery_leader logic to happen before starting
group0 server, and not as group0 command post server start, we also get rid
of the potential problem of init group0 command not being the 1st command on
the server. Hence ensuring full integrity as expected by programmer.

Fixes: scylladb/scylladb#21114
2025-02-14 16:56:17 +05:30
Gleb Natapov
8a0fea5fef locator: topology: drop is_me ip overload along with remaning users 2025-01-16 16:37:06 +02:00
Kefu Chai
353b522ca0 treewide: migrate from boost::adaptors::reversed to std::views::reverse
now that we are allowed to use C++23. we now have the luxury of using
`std::views::reverse`.

- replace `boost::adaptors::transformed` with `std::views::transform`
- remove unused `#include <boost/range/adaptor/reversed.hpp>`

this change is part of our ongoing effort to modernize our codebase
and reduce external dependencies where possible.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2025-01-07 13:22:00 +02:00
Kefu Chai
e4463b11af treewide: replace boost::algorithm::join() with fmt::join()
Replace usages of `boost::algorithm::join()` with `fmt::join()` to improve
performance and reduce dependency on Boost. `fmt::join()` allows direct
formatting of ranges and tuples with custom separators without creating
intermediate strings.

When formatting comma-separated values into another string, fmt::join()
avoids the overhead of temporary string creation that
`boost::algorithm::join()` requires. This change also helps streamline
our dependencies by leveraging the existing fmt library instead of
Boost.Algorithm.

To avoid the ambiguity, some caller sites were updated to call
`seastar::format()` explicitly.

See also

- boost::algorithm::join():
  https://www.boost.org/doc/libs/1_87_0/doc/html/string_algo/reference.html#doxygen.join_8hpp
- fmt::join():
  https://fmt.dev/11.0/api/#ranges-api

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22082
2025-01-07 12:45:05 +02:00
Piotr Dulikowski
ce4032dfc0 qos: include number of shares in DESCRIBE
Now, the CREATE statements generated for each service level by the
DESCRIBE SCHEMA WITH INTERNALS statement will account for the service
level's shares.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
6d90a933cd transport/server: use scheduling group assigned to current user
Now, when the user logs in and the connection becomes authenticated, the
processing loop of the connection is switched to the scheduling group
that corresponds to the service level assigned to the logged in user.
The scheduling group is also updated when the service level assigned to
this user changes.

Starting from this commit, the scheduling groups managed by the service
level controller are actually being used by user workload.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
f1b9737e07 messaging_service: use separate set of connections per service levels
In order to make sure that the scheduling group carries over RPC, and
also to prevent priority inversion issues between different service
levels, modify the messaging service to use separate RPC connections for
each service level in order to serve user traffic.

The above is achieved by reusing the existing concept of "tenants" in
messaging service: when a new service level (or, more accurately,
service-level specific scheduling group) is first used in an RPC, a
new tenant is created.

In addition, extend the service level controller to be able to quickly
look up the service level name of the currently active scheduling group
in order to speed up the logic for choosing the tenant.
2025-01-02 07:13:34 +01:00
Piotr Dulikowski
4cfd26efaf qos: manage and assign scheduling groups to service levels
Introduce the core logic of workload prioritization, responsible for
assigning scheduling groups to service levels.

The service level controller maintains a pool of scheduling groups for
the currently present service levels, as well as a pool of unused
scheduling groups which were previously used by some service level that
was deleted during node's lifetime.

When a new service level is created, the SL controller either assigns a
scheduling group from the unused SG pool, or creates a new one if the
pool is empty. The scheduling group is renamed to "sl:<scheduling group
name>".

When updating shares of a service level (and also when creating a new
service level), the shares of the corresponding scheduling group are
synchronized with those of the service level.

When a service level is deleted, its group is released to the
aforementioned pool of unused scheduling groups and the prefix of its
name is changed from "sl:" to "sl_deleted:".

For now, these scheduling groups are not used by any user operations.
This will be changed in subsequent commits.
2025-01-02 07:13:34 +01:00
Avi Kivity
eb62593f2c treewide: use angle brackets when including seastar headers
We treat Seastar as a "system" library, and those are included
with angle brackets.

Closes scylladb/scylladb#21959
2024-12-20 16:16:28 +02:00
Avi Kivity
f3eade2f62 treewide: relicense to ScyllaDB-Source-Available-1.0
Drop the AGPL license in favor of a source-available license.
See the blog post [1] for details.

[1] https://www.scylladb.com/2024/12/18/why-were-moving-to-a-source-available-license/
2024-12-18 17:45:13 +02:00
Michael Litvak
53224d90be service/qos: increase timeout of internal get_service_levels queries
The function get_service_levels is used to retrieve all service levels
and it is called from multiple different contexts.
Importantly, it is called internally from the context of group0 state reload,
where it should be executed with a long timeout, similarly to other
internal queries, because a failure of this function affects the entire
group0 client, and a longer timeout can be tolerated.
The function is also called in the context of the user command LIST
SERVICE LEVELS, and perhaps other contexts, where a shorter timeout is
preferred.

The commit introduces a function parameter to indicate whether the
context is internal or not. For internal context, a long timeout is
chosen for the query. Otherwise, the timeout is shorter, the same as
before. When the distinction is not important, a default value is
chosen which maintains the same behavior.

The main purpose is to fix the case where the timeout is too short and causes
a failure that propagates and fails the group0 client.

Fixes scylladb/scylladb#20483

Closes scylladb/scylladb#21748
2024-12-09 13:20:32 +01:00
Piotr Dulikowski
c601f7a359 Merge 'transport/server: revert using async function in for_each_gently()' from Michał Jadwiszczak
This patch reverts 324b3c43c0 and adds synchronous versions of `service_level_controller::find_effective_service_level()` and `client_state::maybe_update_per_service_level_params()`.

It isn't safe to do asynchronous calls in `for_each_gently`, as the
connection may be disconnected while a call in callback preempts.

Fixes scylladb/scylladb#21801

Closes scylladb/scylladb#21761

* github.com:scylladb/scylladb:
  Revert "generic_server: use async function in `for_each_gently()`"
  transport/server: use synchronous calls in `for_each_gently` callback
  service/client_state: add synchronous method to update service level params
  qos/service_level_controller: add `find_cached_effective_service_level`
2024-12-06 08:48:41 +01:00
Michał Jadwiszczak
0a17eca5a1 qos/service_level_controller: add find_cached_effective_service_level
The method is a synchronous equivalent of
`find_effective_service_level`.
It uses recently introduced effective service level cache, so retrieve
user's effective service level is done by quick lookup to the cache.
2024-12-03 10:46:39 +01:00
Kefu Chai
bab12e3a98 treewide: migrate from boost::adaptors::transformed to std::views::transform
now that we are allowed to use C++23. we now have the luxury of using
`std::views::transform`.

in this change, we:

- replace `boost::adaptors::transformed` with `std::views::transform`
- use `fmt::join()` when appropriate where `boost::algorithm::join()`
  is not applicable to a range view returned by `std::view::transform`.
- use `std::ranges::fold_left()` to accumulate the range returned by
  `std::view::transform`
- use `std::ranges::fold_left()` to get the maximum element in the
  range returned by `std::view::transform`
- use `std::ranges::min()` to get the minimal element in the range
  returned by `std::view::transform`
- use `std::ranges::equal()` to compare the range views returned
  by `std::view::transform`
- remove unused `#include <boost/range/adaptor/transformed.hpp>`
- use `std::ranges::subrange()` instead of `boost::make_iterator_range()`,
  to feed `std::views::transform()` a view range.

to reduce the dependency to boost for better maintainability, and
leverage standard library features for better long-term support.

this change is part of our ongoing effort to modernize our codebase
and reduce external dependencies where possible.

limitations:

there are still a couple places where we are still using
`boost::adaptors::transformed` due to the lack of a C++23 alternative
for `boost::join()` and `boost::adaptors::uniqued`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21700
2024-12-03 09:41:32 +02:00
Kefu Chai
5e391eee25 treewide: use coroutine::parallel_for_each(range) when appropriate
`coroutine::parallel_for_each` accepts both a range and a pair of
iterators. let's use the former when appropriate. it is simpler this way.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21684
2024-11-27 21:00:47 +02:00