This patch removes ability of a cluster to upgrade from not having
group0 to having one. This ability is used in gossiper based recovery
procedure that is deprecated and removed in this version. Also remove
tests that uses the procedure.
The connection's `cpu_concurrency_t` struct tracks the state of a connection
to manage the admission of new requests and prevent CPU overload during
connection storms. When a connection holds units (allowed only 0 or 1), it is
considered to be in the "CPU state" and contributes to the concurrency limits
used when accepting new connections.
The bug stems from the fact that `counted_data_source_impl::get` and
`counted_data_sink_impl::put` calls can interleave during execution. This
occurs because of `should_parallelize` and `_ready_to_respond`, the latter being
a future chain that can run in the background while requests are being read.
Consequently, while reading request (N), the system may concurrently be
writing the response for request (N-1) on the same connection.
This interleaving allows `return_all()` to be called twice before the
subsequent `consume_units()` is invoked. While the second `return_all()` call
correctly returns 0 units, the matching `consume_units()` call would
mistakenly take an extra unit from the semaphore. Over time, a connection
blocked on a read operation could end up holding an unreturned semaphore
unit. If this pattern repeats across multiple connections, the semaphore
units are eventually depleted, preventing the server from accepting any
new connections.
The fix ensures that we always consume the exact number of units that were
previously returned. With this change, interleaved operations behave as
follows:
get() return_all — returns 1 unit
put() return_all — returns 0 units
get() consume_units — takes back 1 unit
put() consume_units — takes back 0 units
Logically, the networking phase ends when the first network operation
concludes. But more importantly, when a network operation
starts, we no longer hold any units.
Other solutions are possible but the chosen one seems to be the
simplest and safest to backport.
Fixes SCYLLADB-485
Backport: all supported affected versions, bug introduced with initial feature implementation in: ed3e4f33fdClosesscylladb/scylladb#28530
* github.com:scylladb/scylladb:
test: auth_cluster: add test for hanged AUTHENTICATING connections
transport: fix connection code to consume only initially taken semaphore units
test_maintenance_socket with new way of running is flaky. Looks like the
driver tries to reconnect with an old maintenance socket from previous
driver and fails. This PR adds white list for connection that stabilize
the test
test_no_removed_node_event_on_ip_change was flaky on CI, while the issue
never reproduced locally. The assumption that under load we have race
condition and trying to check the logs before message is arrived. Small
for loop to retry added to avoid such situation.
Closesscylladb/scylladb#28635
The usual Scylla shutdown in a cluster test takes ~2.1s. 2s come from
```
co_await sleep(std::chrono::milliseconds(_gcfg.shutdown_announce_ms));
```
as the default value of `shutdown_announce_in_ms` is 2000. This sleep
makes every `server_stop_gracefully` call 2s slower. There are ~300 such
calls in cluster tests (note that some come from `rolling_restart`). So,
it looks like this sleep makes cluster tests 300 * 2s = 10min slower.
Indeed, `./test.py --mode=dev cluster` takes 61min instead of 71min
on the potwor machine (the one in the Warsaw office) without it.
We set `shutdown_announce_in_ms` to 0 for all cluster tests to make them
faster.
The sleep is completely unnecessary in tests. Removing it could introduce
flakiness, but if that's the case, then the test for which it happens is
incorrect in the first place. Tests shouldn't assume that all nodes
receive and handle the shutdown message in 2s. They should use functions
like `server_not_sees_other_server` instead, which are faster and more
reliable.
Improvement of the tests running time, so no backport. The fix of
`test_tablets_parallel_decommission` may have to be backported to
2026.1, but it can be done manually.
Closesscylladb/scylladb#28464
* github.com:scylladb/scylladb:
test: pylib: scylla_cluster: set shutdown_announce_in_ms to 0
test: test_tablets_parallel_decommission: prevent group0 majority loss
test: delete test_service_levels_work_during_recovery
The test becomes flaky in one of the following commits. However, there is
no need to fix it, as we should delete it anyway. We are in the process of
removing the gossip-based topology from the code base, which includes the
recovery mode. We don't have to rewrite the test to use the new Raft-based
recovery procedure, as there is nothing interesting to test (no regression
to legacy service levels).
The patch marks force-gossip-topology-changes as deprecated and removes
tests that use it. There is one test (test_different_group0_ids) which
is marked as xfail instead since it looks like gossiper mode was used
there as a way to easily achieve a certain state, so more investigation
is needed if the tests can be fixed to use raft mode instead.
Closesscylladb/scylladb#28383
Finishing the deprecation of the skip_mode function in favor of
pytest.mark.skip_mode. This PR is only cleaning and migrating leftover tests
that are still used and old way of skip_mode.
Closesscylladb/scylladb#28299
Before this change, the test function `_verify_tasks_processed_metrics`
verified that after service level reconfiguration, a given number of
`scylla_scheduler_tasks_processed` were processed by a given scheduling
group. Moreover, the check verified that another scheduling group
didn't process a high number of requests. The second check was vulnerable
to flakiness, because sometimes additional load caused extensive work
in the second scheduling group (e.g. password hashing in `sl:driver`
due to new connections being created).
To avoid test failures, this commit changes which metric is verified:
instead of `scylla_scheduler_tasks_processed`, the metric
`scylla_transport_cql_requests_count` is checked. This prevents similar
problems, because there is no reason for a high number of
requests to be processed by the second scheduling group. Moreover,
it allows decreasing the number of requests that are sent for
verification, and thus speeds up the test.
Fixes: scylladb/scylladb#27715Closesscylladb/scylladb#28318
Function skip_mode works only on function and only in cluster test. This if OK
when we need to skip one test, but it's not possible to use it with pytestmark
to automatically mark all tests in the file. The goal of this PR is to migrate
skip_mode to be dynamic pytest.mark that can be used as ordinary mark.
Closesscylladb/scylladb#27853
[avi: apply to test/cluster/test_tablets.py::test_table_creation_wakes_up_balancer]
`_verify_tasks_processed_metrics()` is used to check that the correct
service level is used to process requests. It takes two service levels
as arguments and executes numerous requests. After that, the number
of tasks processed by one of the service levels is expected to rise
by at least the number of executed requests. In contrast,
the second service level is expected to process fewer tasks than
the number of requests.
Unfortunately, background noise may cause some tasks to be executed
on the service level that is not supposed to process requests.
This patch increases the number of executed requests to eliminate
the chance of noise causing test failures.
Additionally, this commit extends logging to make future investigation
easier.
Fixes: https://github.com/scylladb/scylladb/issues/27715
No backport, fix for test on master.
Closesscylladb/scylladb#27735
* github.com:scylladb/scylladb:
test: remove unused `get_processed_tasks_for_group`
test: increase num of requests in driver_service_level tests
The function `get_processed_tasks_for_group` was defined twice in
`test_raft_service_levels.py`. This change removes the unused
definition to avoid confusion and clean up the code.
`_verify_tasks_processed_metrics()` is used to check that the correct
service level is used to process requests. It takes two service levels
as arguments and executes numerous requests. After that, the number
of tasks processed by one of the service levels is expected to rise
by at least the number of executed requests. In contrast,
the second service level is expected to process fewer tasks than
the number of requests.
Unfortunately, background noise may cause some tasks to be executed
on the service level that is not supposed to process requests.
This patch increases the number of executed requests to eliminate
the chance of noise causing test failures.
Additionally, this commit extends logging to make future investigation
easier.
Fixes: scylladb/scylladb#27715
The cluster framework supports this setup, while cqlpy only runs against Scylla instances in raft mode.
Therefore, the test was moved from cqlpy to the cluster-based framework.
This commit both adds the test in cluster/ and removes the old version in cqlpy/.
Pass a ManagerClient instead of a `cql` session to
`wait_until_driver_service_level_created`. This makes it easier
to add additional functionality to the helper later (e.g. waiting for
a Raft read barrier in a subsequent commit).
Refs: scylladb/scylladb#27019
The primary goal of this test is to reproduce scylladb/scylladb#26040
so the fix (278019c328) can be backported
to older branches.
Scenario: connect via CQL as an anonymous user and verify that the
`sl:default` scheduling group is used. Before the fix for #26040
`main` scheduling group was incorrectly used instead of `sl:default`.
Control connections may legitimately use `sl:driver`, so the test
accepts those occurrences while still asserting that regular anonymous
queries use `sl:default`.
This adds explicit coverage on master. After scylladb#24411 was
implemented, some other tests started to fail when scylladb#26040
was unfixed. However, none of the tests asserted this exact behavior.
Refs: scylladb/scylladb#26040
Refs: scylladb/scylladb#26581Closesscylladb/scylladb#26589
`sl:driver` is expected to be used for new and control connections,
but other connections that run user load should not use it after
the user is authenticated.
Refs: scylladb/scylladb#24411
This commit:
- Increases the number of allowed scheduling groups to allow the
creation of `sl:driver`.
- Adds the `DRIVER_SERVICE_LEVEL` feature, which prevents creating
`sl:driver` until all nodes have increased the number of
scheduling groups.
- Starts using `get_create_driver_service_level_mutations`
to unconditionally create `sl:driver` on
`raft_initialize_discovery_leader`. The purpose of this code
path is ensuring existence of `sl:driver` in new system and tests.
- Starts using `migrate_to_driver_service_level` to create `sl:driver`
if it is not already present. The creation of `sl:driver` is
managed by `topology_coordinator`, similar to other system keyspace
updates, such as the `view_builder` migration. The purpose of this
code path is handling upgrades.
- Modifies related tests to pass after `sl:driver` is added.
Later in this patch series, `sl:driver` will be used by
`transport/server` to handle selected traffic, such as the driver's
schema and topology fetches.
Refs: scylladb/scylladb#24411
Previously, tests used the hardcoded value 7 for the maximum number of
user service levels. This commit introduces a named variable that can
be shared across tests to avoid cases where this magic number goes
out of sync.
We would like to have an additional service level
available for users of the Vector Store service,
which would allow us to de/prioritize vector
operations as needed. To allow that, we increase
the number of scheduling groups from 19 to 20
and adjust the related test accordingly.
Closesscylladb/scylladb#26316
`sl:driver` is expected to be used for new and control connections,
but other connections that run user load should not use it after
the user is authenticated.
Refs: scylladb/scylladb#24411
This commit:
- Increases the number of allowed scheduling groups to allow the
creation of `sl:driver`.
- Adds the `DRIVER_SERVICE_LEVEL` feature, which prevents creating
`sl:driver` until all nodes have increased the number of
scheduling groups.
- Starts using `get_create_driver_service_level_mutations`
to unconditionally create `sl:driver` on
`raft_initialize_discovery_leader`. The purpose of this code
path is ensuring existence of `sl:driver` in new system and tests.
- Starts using `migrate_to_driver_service_level` to create `sl:driver`
if it is not already present. The creation of `sl:driver` is
managed by `topology_coordinator`, similar to other system keyspace
updates, such as the `view_builder` migration. The purpose of this
code path is handling upgrades.
- Modifies related tests to pass after `sl:driver` is added.
Later in this patch series, `sl:driver` will be used by
`transport/server` to handle selected traffic, such as the driver's
schema and topology fetches.
Refs: scylladb/scylladb#24411
Previously, tests used the hardcoded value 7 for the maximum number of
user service levels. This commit introduces a named variable that can
be shared across tests to avoid cases where this magic number goes
out of sync.
Since `service_level_controller` outlives `auth_integration`, it may
happen that we try to access it when it has already been deinitialized.
To prevent that, we only try to reload or clear the effective service
level cache when the object is still alive.
These changes solve an existing problem with an invalid memory access.
For more context, see issue scylladb/scylladb#24792.
We provide a reproducer test that consistently fails before these
changes but passes after them.
Fixesscylladb/scylladb#24792
Some cluster tests use `cluster_con` when they need a different load
balancing policy or auth provider. However, no test uses a port
other than 9042 or enables SSL, but all tests must pass `9042, False`
because these parameters don't have default values. This makes the code
more verbose. Also, it's quite obvious that 9042 stands for port, but
it's not obvious what `False` is related to, so there is a need to check
the definition of `cluster_con` while reading any test that uses it.
No reason to backport, it's only a minor refactoring.
Closesscylladb/scylladb#25516
The test creates all driver sessions by itself. As a consequence, all
sessions use the default request timeout of 10s. This can be too low for
the debug mode, as observed in scylladb/scylla-enterprise#5601.
In this commit, we change the test to use `cluster_con`, so that the
sessions have the request timeout set to 200s from now on.
Fixesscylladb/scylla-enterprise#5601
This commit changes only the test and is a CI stability improvement,
so it should be backported all the way to 2024.2. 2024.1 doesn't have
this test.
Closesscylladb/scylladb#25510
Right now, service levels are migrated in one group0 command and auth is migrated in the next one. This has a bad effect on the group0 state reload logic - modifying service levels in group0 causes the effective service levels cache to be recalculated, and to do so we need to fetch information about all roles. If the reload happens after SL upgrade and before auth upgrade, the query for roles will be directed to the legacy auth tables in system_auth - and the query, being a potentially remote query, has a timeout. If the query times out, it will throw an exception which will break the group0 apply fiber and the node will need to be restarted to bring it back to work.
In order to solve this issue, make sure that the service level module does not start populating and using the service level cache until both service levels and auth are migrated to raft. This is achieved by adding the check both to the cache population logic and the effective service level getter - they now look at service level's accessor new method, `can_use_effective_service_level_cache` which takes a look at the auth version.
Fixes: scylladb/scylladb#24963
Should be backported to all versions which support upgrade to topology over raft - the issue described here may put the cluster into a state which is difficult to get out of (group0 apply fiber can break on multiple nodes, which necessitates their restart).
Closesscylladb/scylladb#25188
* github.com:scylladb/scylladb:
test: sl: verify that legacy auth is not queried in sl to raft upgrade
qos: don't populate effective service level cache until auth is migrated to raft
decrease request timeout.
In debug mode, queries may sometimes take longer than the default 30 seconds.
To address this, the timeout value `request_timeout_on_shutdown_in_seconds`
during tests is aligned with other request timeouts.
Change request timeout for tests from 180s to 90s since we must keep the request
timeout during shutdown significantly lower than the graceful shutdown timeout(2m),
or else a request timeout would cause a graceful shutdown timeout and fail a test.
Adjust `test_service_levels_upgrade`: right before upgrade to topology
on raft, enable an error injection which triggers when the standard role
manager is about to query the legacy auth tables in the
system_auth keyspace. The preceding commit which fixes
scylladb/scylladb#24963 makes sure that the legacy tables are not
queried during upgrade to topology on raft, so the error injection does
not trigger and does not cause a problem; without that commit, the test
fails.
By default, cluster tests have skip_wait_for_gossip_to_settle=0 and
ring_delay_ms=0. In tests with gossip topology, it may lead to a race,
where nodes see different state of each other.
In case of test_auth_v2_migration, there are three nodes. If the first
node already knows that the third node is NORMAL, and the second node
does not, the system_auth tables can return incomplete results.
To avoid such a race, this commit adds a check that all nodes see other
nodes as NORMAL before any writes are done.
Refs: #24163Closesscylladb/scylladb#24185
This series add a new config option: `tablets_mode_for_new_keyspaces` that replaces the existing
`enable_tablets` option. It can be set to the following values:
disabled: New keyspaces use vnodes by default, unless enabled by the tablets={'enabled':true} option
enabled: New keyspaces use tablets by default, unless disabled by the tablets={'disabled':true} option
enforced: New keyspaces must use tablets. Tablets cannot be disabled using the CREATE KEYSPACE option
`tablets_mode_for_new_keyspaces=disabled` or `tablets_mode_for_new_keyspaces=enabled` control whether
tablets are disabled or enabled by default for new keyspaces, respectively.
In either cases, tablets can be opted-in or out using the `tablets={'enabled':...}`
keyspace option, when the keyspace is created.
`tablets_mode_for_new_keyspaces=enforced` enables tablets by default for new keyspaces,
like `tablets_mode_for_new_keyspaces=enabled`.
However, it does not allow to opt-out when creating
new keyspaces by setting `tablets = {'enabled': false}`
Refs scylladb/scylla-enterprise#4355
* Requires backport to 2025.1
Closesscylladb/scylladb#22273
* github.com:scylladb/scylladb:
boost/tablets_test: verify failure to create keyspace with tablets and non network replication strategy
tablets: enforce tablets using tablets_mode_for_new_keyspaces=enforced config option
db/config: add tablets_mode_for_new_keyspaces option
The new option deprecates the existing `enable_tablets` option.
It will be extended in the next patch with a 3rd value: "enforced"
while will enable tablets by default for new keyspace but
without the posibility to opt out using the `tablets = {'enabled':
false}` keyspace schema option.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This patch makes it clear which Raft recovery procedure is used in
each test.
Tests with "This test uses the gossip-based recovery procedure." are
the tests that use the gossip-based topology. This tests should be
deleted once we make the Raft-based topology mandatory.
Tests with the new FIXME are the tests that use the Raft-based
topology. They should be changed to use the Raft-based recovery
procedure or removed if they don't test anything important with
the new procedure.