Commit Graph

11801 Commits

Author SHA1 Message Date
Nadav Har'El
d01915131a test/cqlpy: make test_indexing_paging_and_aggregation much faster
Currently, test_secondary_index.py::test_indexing_paging_and_aggregation
is very slow, and the slowest test in the test/cqlpy framework: It takes
around 13 seconds on dev build, and because it is CPU-bound (doesn't sleep),
it is much slower on debug builds. The reason for this slowness is that it
needs to set up and read over 10,000 rows which is the default
select_internal_page_size.

But after the patches in pull request (#25368), we can configure
select_internal_page_size, so in this patch we change the test to
temporarily reduce this option to just 50, and then the test can reach
the same code paths with just 142 rows instead of 20120 rows before this
patch.

As a result, the test should now be 140 times faster than it was before.
In practice, because of some fixed overheads (the test creates several
tables and indexes), in dev build mode the test run speedup is "only"
26-fold (to around half a second).

I verified that removing the code added in bb08af7 indeed makes the new
shorter test fail - and this is the only test in test_secondary_index.py
that starts to fail besides test_index_paging_group_by which is also
related (so my revert didn't just break secondary indexing completely).
So the shorter test is still a good regression test.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#28268
2026-02-20 15:44:53 +02:00
Botond Dénes
6c04e02f66 Merge 'Fix restoration test's validation of streaming directions' from Pavel Emelyanov
The test_restore_with_streaming_scopes among other things checks how data streams flow while restoring. Whether or not to check the streams is decided based on the min tablet count value, which is compared with a hardcoded 512. This value of 512 matched the tablet count used by this test until it was "optimized" by #27839, where this number changed to 5 and streaming checks became off.

Good news is that the very same checks are still performed by test_refresh_with_streaming_scopes. But it's better to have a working restoration test anyway.

Minor test fix, not backporting

Closes scylladb/scylladb#28607

* github.com:scylladb/scylladb:
  test: Fix the condition for streaming directions validation
  test: Split test_backup.py::check_data_is_back() into two
2026-02-20 15:42:10 +02:00
Botond Dénes
6f88c0dbd3 Merge ' test_tablets_parallel_decommission: Fix flakiness due to delayed task appearance' from Tomasz Grabiec
Currently, the test assumes that when
'topology_coordinator_pause_before_processing_backlog: waiting' is
logged, the task for decommission must be there. This was based on the
assumption that topology coordinator is idle and decommission request
wakes it up. But if the server is slow enough, it may still be running
the load balancer in reaction to table creation, and block on that
injection point before decommission request was added.

Fix by waiting for the task to appear rather than the injection.

Fixes SCYLLADB-715

Only 2026.1 vulnerable.

Closes scylladb/scylladb#28688

* github.com:scylladb/scylladb:
  test_tablets_parallel_decommission: Fix flakiness due to delayed task appearance
  test: cluster: task_manager_client: Introduce wait_task_appears()
  tests: pylib: util: Add exponential backoff to wait_for
2026-02-20 15:05:36 +02:00
Pavel Emelyanov
c96420c015 tests: Re-use manager.get_server_exe()
There's a bunch of incremental repair tests that want to call scylla
sstable command. For that they try to find where scylla binary by
scanning /proc directory (see local_process_id and get_scylla_path
helpers).

There's shorter way -- just call manager.get_server_exe().

Same for backup-restore test.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28676
2026-02-20 14:59:30 +02:00
Pavel Emelyanov
a4a0d75eee test/object_store: Parametrize test_simple_backup_and_restore()
There are three tests and a function with a pair of boolean parameters
called by those. It's less code if the function becomes a test with
parameters.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28677
2026-02-20 14:57:30 +02:00
Pavel Emelyanov
a2e1293f86 test/object_store: Squash two simple-backup tests together
The test_backup_simple creates a ks/cf, takes a snapshot, backs it up,
then checks that the files were uploaded. The test_backup_move does the
same, but also plays with 'move_files' parameter to be true/false.

In fact, the "move" test was the copy of "simple" one that dropepd check
for scheduling group being "streaming" (backup with --move-files can
check the same, it's not bad), and check for destination bucket to
contain needed files (same here -- checking that files arrived to bucket
after --move-files is good).

In the end of the day, after the change backup test is run two times,
instead of three, and performs extra checks for --move-files case.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28606
2026-02-20 14:49:30 +02:00
Patryk Jędrzejczak
d399a197f5 Merge 'raft: Await instead of returning future in wait_for_state_change' from Dawid Mędrek
The `try-catch` expression is pretty much useless in its current form. If we return the future, the awaiting will only be performed by the caller, completely circumventing the exception handling.

As a result, instead of handling `raft::request_aborted` with a proper error message, the user will face `seastar::abort_requested_exception` whose message is cryptic at best. It doesn't even point to the root of the problem.

Fixes SCYLLADB-665

Backport: This is a small improvement and may help when debugging, so let's backport it to all supported versions.

Closes scylladb/scylladb#28624

* https://github.com/scylladb/scylladb:
  test: raft: Add test_aborting_wait_for_state_change
  raft: Describe exception types for wait_for_state_change and wait_for_leader
  raft: Await instead of returning future in wait_for_state_change
2026-02-20 12:17:22 +01:00
Andrzej Jackowski
eb5a564df2 test: move dtest/guardrails_test.py to test_guardrails.py
This commit moves `guardrails_test.py`, prepared in the previous
commit of this patch series, to `test/cluster/test_guardrails.py`.
It also cleans up `suite.yaml`.
2026-02-20 11:39:52 +01:00
Andrzej Jackowski
9df426d2ae test: prepare guardrails_test.py to be moved to test/cluster/
Disable `test/cluster/dtest/guardrails_test.py` in `suite.yaml` and
make it compatible with the `test/cluster/` framework. This will
allow moving this file from `test/cluster/dtest/` to `test/cluster/`
in the next commit of this patch series.

There are two motivations for moving the test:
 - Execution time reduction (from 12s to 9s in 'dev' in my env)
 - Facilitate adding new tests to the `guardrails_test.py` file
2026-02-20 11:39:43 +01:00
Botond Dénes
51a25c8af3 test/boost/batchlog_manager_test: add tests for v1 batchlog
The v1 table is used while upgrading from a pre-v2 version. We need
tests to ensure it still works.
2026-02-20 07:03:46 +02:00
Botond Dénes
83344dacbd test/boost/batchlog_manager_test: make prepare_batches() work with both v1 and v2
Make the actual table name a parameter and add logic to adapt to the
variant used.
Also add dump_to_log::yes to is_rows() invokation to help debuging
tests.
2026-02-20 07:03:46 +02:00
Botond Dénes
2956714e19 test/boost/batchlog_manager_test: fix indentation 2026-02-20 07:03:46 +02:00
Botond Dénes
23732227fe test/boost/batchlog_manager_test: extract prepare_batches() method
To be shared between multiple tests in future commits.
Indentation is left broken.
2026-02-20 07:03:46 +02:00
Botond Dénes
af26956bb4 test/lib/cql_assertions: is_rows(): add dump parameter
When set to true, the query results will be logged by the testlog logger
with  debug level. A huge help when debugging failures around cql
assertions: seeing the actual query result is often enough to
immediately understand why the test failed.
2026-02-20 07:03:46 +02:00
Botond Dénes
ca2bbbad97 db/batchlog_manager: make structs stats public
Need to rename stats() -> get_stats() because it shadows the now
exported type name.
2026-02-20 07:03:46 +02:00
Botond Dénes
ac059dadc6 db/batchlog_manager: add feature_service dependency
Will be needed to check for batchlog_v2 feature.
2026-02-20 07:03:46 +02:00
Avi Kivity
27a5502f14 Merge 'Reapply "main: test: add future and abort_source to after_init_func"' from Marcin Maliszkiewicz
The patchset fixes abort_source implementation for perf-alternator and perf-cql-raw. It moves
run_standalone function to common code in perf.hh with necessary templating.

We also add extensive testing so that it's more difficult to break the tooling in the future.

Fixes SCYLLADB-560
Backport: no, internal tooling improvement

Closes scylladb/scylladb#28541

* github.com:scylladb/scylladb:
  test: cluster: add tests for perf tools
  test: perf: fix port race condition on startup in connect workload
  test: perf: prepare benchmarks to bind to custom host
  test: perf: make perf-alterantor remote port configurable
  test: perf: fix ASAN leak warnings in perf-alternator
  Reapply "main: test: add future and abort_source to after_init_func"
2026-02-19 19:12:46 +02:00
Marcin Maliszkiewicz
22c3d8d609 Merge 'db/config: enable table audit by default' from Piotr Smaron
In https://github.com/scylladb/scylladb/pull/27262 table audit has been
re-enabled by default in `scylla.yaml`, logging certain categories to a table,
which should make new Scylla deployments have audit enabled.
Now, in the next release, we also want to enable audit in `db/config.cc`,
which should enable audit for all deployments, which don't explicitly configure
audit otherwise in `scylla.yaml` (or via cmd line).
BTW. Because this commit aligns audit's default config values in `db/config.cc`
to those of `scylla.yaml`, `docs/reference/configuration-parameters.rst`, which
is based on `db/config.cc` will start showing that table audit is the default.

Refs: https://github.com/scylladb/scylladb/issues/28355
Refs: https://scylladb.atlassian.net/browse/SCYLLADB-222

No backport: table audit has been enabled in 2026.1 in `scylla.yaml`,
and should be always on starting from the next release,
which is the release we're currently merging to (2026.2).

Closes scylladb/scylladb#28376

* github.com:scylladb/scylladb:
  docs: decommission: note audit ks may require ALTERing
  docs: mention table audit enabled by default
  audit: disable DDL by default
  db/config: enable table audit by default
  test/cluster: fix `test_table_desc_read_barrier` assertion
  test/cluster: adjust audit in tests involving decommissioning its ks
  audit_test: fix incorrect config in `test_audit_type_none`
2026-02-19 16:30:11 +01:00
Patryk Jędrzejczak
45115415fb Merge 'Parametrize and merge several restoration test cases' from Pavel Emelyanov
There are four tests that check how restore with primary-replica-only option works in various scopes and topologies. Cases that check same-racks and same-datacenters are very very similar, so are those that check different-racks and different-datacenters. Parametrizing them and merging saves lots of code (+30 lines, -116 lines)

It's probably worth merging the resulting same-domain with different-domain tests, because the similarity is still large in both, but the result becomes too if-y, so not done here. Maybe later.

Improving tests, not backporting

Closes scylladb/scylladb#28569

* https://github.com/scylladb/scylladb:
  test: Merge test_restore_primary_replica_different_... tests
  test: Merge test_restore_primary_replica_same_... tests
  test: Don't specify expected_replicas in test_restore_primary_replica_different_dc_scope_all
  test: Remove local r_servers variable from test_restore_primary_replica_different_dc_scope_all
2026-02-19 15:42:55 +01:00
Pavel Emelyanov
26372e65df Merge 's3_perf: Fix the s3 perf test' from Ernest Zaslavsky
Fix the build of the test and the upload operation flow

No need to backport since it is only a test we barely use

Closes scylladb/scylladb#28595

* github.com:scylladb/scylladb:
  s3_perf: fix upload operation flow
  s3_perf: fix the CMake build
2026-02-19 15:31:43 +02:00
Dawid Mędrek
fae71f79c2 test: raft: Add test_aborting_wait_for_state_change 2026-02-19 14:21:01 +01:00
Karol Nowacki
ca7f9a8baf vector_search: fix TLS server name with IP
SNI works only with DNS hostnames. Adding an IP address causes warnings
on the server side.
This change adds SNI only if it is not an IP address.

This change has no unit tests, as this behavior is not critical,
since it causes a warning on the server side.
The critical part, that the server name is verified, is already covered.

Fixes: VECTOR-528
2026-02-19 13:00:03 +01:00
Marcin Maliszkiewicz
de4e5e10af test: perf: fix prepared statements logic in perf-simple-query
Due to lack of checks present in process_execute_internal from
transport/server.cc needs_authorization bool was always set to true
doing some extra work (check_access()) for each request.

We mirror the logic in this patch in test env which perf-simple-query
uses. This can also potentially improve runtime of unittests (marginally).

Note that bug is only in perf tool not scylla itself, the fix
decreases insns/op by around 10%:
Before: 41065 insns/op
After: 37452 insns/op
Command: ./build/release/scylla perf-simple-query --duration 5 --smp 1

Fixes https://github.com/scylladb/scylladb/issues/27941

Closes scylladb/scylladb#28704
2026-02-19 12:42:07 +02:00
Ernest Zaslavsky
45d824e0fe s3_perf: fix upload operation flow
Correct the upload operation logic. The previous flow incorrectly
checked for the test file on S3 even when performing operations that do
not download the file, such as uploads.
2026-02-19 11:14:59 +02:00
Botond Dénes
b637e17b19 db/config: don't use RBNO for scaling
Remove bootstrap and decomission from allowed_repair_based_node_ops.
Using RBNO over streaming for these operations has no benefits, as they
are not exposed to the out-of-date replica problem that replace,
removenode and rebuild are.
On top of that, RBNO is known to have problems with empty user tables.
Using streaming for boostrap and decomission is safe and faster
than RBNO in all condition, especially when the table is small.

One test needs adjustment as it relies on RBNO being used for all node
ops.

Fixes: SCYLLADB-105

Closes scylladb/scylladb#28080
2026-02-19 09:51:09 +01:00
Marcin Maliszkiewicz
3417d50add test: cluster: add tests for perf tools
It checks if all workloads can be properly
executed with succesfull startup and teardown.

Especially testing alternator in remote mode is important
because it's invoked like this during pgo training in pgo.py.

Test runtime:
Release - 24s
Debug - 1m 15s

Test time consists mostly of Scylla startup in various modes.
2026-02-19 09:33:10 +01:00
Marcin Maliszkiewicz
c69534504c test: perf: fix port race condition on startup in connect workload
Other workloads at startup call prepopulate() which connects
with retry loop therefore it waits until cql port is open.

This commit adds a single place where we will wait for port
for all workloads.

Timeout is set to 5 minutes so that even slowest machines
are able to start.
2026-02-19 09:33:10 +01:00
Marcin Maliszkiewicz
828f2fbdb1 test: perf: prepare benchmarks to bind to custom host
This is usefull for tests where we use local networks
like 127.5.5.5 to avoid port and host collisions.
2026-02-19 09:33:10 +01:00
Marcin Maliszkiewicz
9f2b97bef4 test: perf: make perf-alterantor remote port configurable
It could be a usefull option to have.
2026-02-19 09:33:10 +01:00
Marcin Maliszkiewicz
f5a212e91e test: perf: fix ASAN leak warnings in perf-alternator
Those were intentional as test process is short lived
but when we add automated tests in the following commits
we expect clean exit, with 0 exit code.
2026-02-19 09:33:10 +01:00
Marcin Maliszkiewicz
0c76c73e34 Reapply "main: test: add future and abort_source to after_init_func"
This reverts commit ceec703bb7.

The commit was fixed with abort source handling for alternator
standalone path so it's safe to reapply.
2026-02-19 09:33:10 +01:00
Ernest Zaslavsky
4026b54a5e s3_perf: fix the CMake build
Fix the CMake build of the perf_s3_client by adding the
necessary linkage with the jsoncpp library.
2026-02-18 17:12:08 +02:00
Piotr Smaron
95ee4a562c test/cluster: fix test_table_desc_read_barrier assertion
The test `assertion desc_schema[0] == desc_schema[1]` does a direct
list comparison, which is order-sensitive. Before enabling audit by default,
both nodes would return only the test keyspace/table, so the order
didn't matter. With audit enabled, there will be multiple keyspaces,
and they can be returned in different order by different nodes.
2026-02-18 15:14:57 +01:00
Piotr Smaron
2e12b83366 test/cluster: adjust audit in tests involving decommissioning its ks
When table audit is enabled, Scylla creates the "audit" ks with
NetworkTopologyStrategy and RF=3. During node decommission, streaming can fail
for the audit ks with "zero replica after the removal" when all nodes from a DC
are removed, and so we have to ALTER audit ks to either zero the number of its
replicas, to allow for a clear decommission, or have them in the 2nd DC.

BTW. https://github.com/scylladb/scylladb/issues/27395 is the same change, but
in dtests repository.
2026-02-18 15:14:55 +01:00
Piotr Smaron
0cf20fa15a audit_test: fix incorrect config in test_audit_type_none
Passing Python `None` to setup is incorrect, because config updates are sent
as a dict and `None` is treated as "unset" - meaning: use Scylla's default.
Using the explicit string "none" to guarantee that audit is disabled.
2026-02-18 15:12:26 +01:00
Asias He
1be80c9e86 repair: Skip auto repair for tables using RF one
There is no point running repair for tables using RF one. Row level
repair will skip it but the auto repair scheduler will keep scheduling
such repairs since repair_time could not be updated.

Skip such repairs at the scheduler level for auto repair.

If the request is issued by user, we will have to schedule such
repair otherwise the user request will never be finished.

Fixes SCYLLADB-561

Closes scylladb/scylladb#28640
2026-02-18 14:32:50 +02:00
Botond Dénes
3bfd47da4b Merge 'transport: fix connection code to consume only initially taken semaphore units' from Marcin Maliszkiewicz
The connection's `cpu_concurrency_t` struct tracks the state of a connection
to manage the admission of new requests and prevent CPU overload during
connection storms. When a connection holds units (allowed only 0 or 1), it is
considered to be in the "CPU state" and contributes to the concurrency limits
used when accepting new connections.

The bug stems from the fact that `counted_data_source_impl::get` and
`counted_data_sink_impl::put` calls can interleave during execution. This
occurs because of `should_parallelize` and `_ready_to_respond`, the latter being
a future chain that can run in the background while requests are being read.
Consequently, while reading request (N), the system may concurrently be
writing the response for request (N-1) on the same connection.

This interleaving allows `return_all()` to be called twice before the
subsequent `consume_units()` is invoked. While the second `return_all()` call
correctly returns 0 units, the matching `consume_units()` call would
mistakenly take an extra unit from the semaphore. Over time, a connection
blocked on a read operation could end up holding an unreturned semaphore
unit. If this pattern repeats across multiple connections, the semaphore
units are eventually depleted, preventing the server from accepting any
new connections.

The fix ensures that we always consume the exact number of units that were
previously returned. With this change, interleaved operations behave as
follows:

get() return_all     — returns 1 unit
put() return_all     — returns 0 units
get() consume_units  — takes back 1 unit
put() consume_units  — takes back 0 units

Logically, the networking phase ends when the first network operation
concludes. But more importantly, when a network operation
starts, we no longer hold any units.

Other solutions are possible but the chosen one seems to be the
simplest and safest to backport.

Fixes SCYLLADB-485
Backport: all supported affected versions, bug introduced with initial feature implementation in: ed3e4f33fd

Closes scylladb/scylladb#28530

* github.com:scylladb/scylladb:
  test: auth_cluster: add test for hanged AUTHENTICATING connections
  transport: fix connection code to consume only initially taken semaphore units
2026-02-18 13:48:49 +02:00
Piotr Dulikowski
b9db3c9c75 Merge 'Add consistent permissions cache' from Marcin Maliszkiewicz
This patchset replaces permissions cache based on loading_cache with a new unified (permissions and roles), full, coherent auth cache.

Reason for the change is that we want to improve scenarios under stress and simplify operation manuals. New cache doesn't require any tweaking. And it behaves particularly better in scenarios with lots of schema entities (e.g. tables) combined with unprepared queries. Old cache can generate few thousands of extra internal tps due to cache refresh.

Benchmark of unprepared statements (just to populate the cache) with 1000 tables shows 3k tps of internal reads reduction and 9.1% reduction of median instructions per op. So many tables were used to show resource impact, cache could be filled with other resource types to show the same improvement.

Backport: no, it's a new feature.
Fixes https://github.com/scylladb/scylladb/issues/7397
Fixes https://github.com/scylladb/scylladb/issues/3693
Fixes https://github.com/scylladb/scylladb/issues/2589
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-147

Closes scylladb/scylladb#28078

* github.com:scylladb/scylladb:
  test: boost: add auth cache tests
  auth: add cache size metrics
  docs: conf: update permissions cache documentation
  auth: remove old permissions cache
  auth: use unified cache for permissions
  auth: ldap: add permissions reload to unified cache
  auth: add permissions cache to auth/cache
  auth: add service::revoke_all as main entry point
  auth: explicitly life-extend resource in auth_migration_listener
2026-02-18 12:03:20 +01:00
Tomasz Grabiec
af0b5d0894 Merge 'tablets global barrier: acknowledge barrier_and_drain from all nodes' from Petr Gusev
Before this series, the `global_barrier` used during tablet migration did not guarantee that `barrier_and_drain` was acknowledged by tablet replicas. As a result, if a request coordinator was fenced out, stale requests from previous topology versions could still execute on replicas in parallel with new requests from incompatible topology versions. For example, stale requests from `tablet_transition_stage::streaming` could run concurrently with new requests from `tablet_transition_stage::use_new`. This caused several issues, including [#26864](https://github.com/scylladb/scylladb/issues/26864) and [#26375](https://github.com/scylladb/scylladb/issues/26375).

This PR fixes the problem in two steps:
* Replicas now hold an erm strong pointer while handling RPCs from coordinators.
* The tablet barrier is updated to require `barrier_and_drain` acknowledgments from all nodes.

A description of alternative solutions and various tradeoffs can be found in [this document](https://docs.google.com/document/d/1tpDtPOsrGaZGBYkdwOKApQv4eMzrBydMM1GaYYmaPgg/edit?pli=1&tab=t.0#heading=h.vidfy0hrz5j7).

[A previous attempt on this changes](https://github.com/scylladb/scylladb/pull/27185).

Fixes [scylladb/scylladb#26864](https://github.com/scylladb/scylladb/issues/26864)
Fixes [scylladb/scylladb#26375](https://github.com/scylladb/scylladb/issues/26375)

backport: needs backport to 2025.4 (fixes #26864 for tablets LWT)

Closes scylladb/scylladb#27492

* github.com:scylladb/scylladb:
  tests: extract get_topology_version helper
  global tablets barrier: require all nodes to ack barrier_and_drain
  topology_coordinator: pass raft_topology_cmd by value
  storage_proxy: hold erms in replica handlers
  token_metadata: improve stale versions diagnostics
2026-02-18 11:45:56 +01:00
Ernest Zaslavsky
196f7cad93 nodetool: fix handling of "--primary-replica-only" argument
The "--primary-replica-only" ("-pro") flag was previously ignored by
the `restore` operation. This patch ensures the argument is parsed and
applied correctly.

Closes scylladb/scylladb#28490
2026-02-18 12:21:27 +02:00
Ernest Zaslavsky
afac984632 s3_client: reorganize tests in part_size_calculation_test
just group all BOOST_REQUIRE_EXCEPTION tests in one block and
remove artificial scopes
2026-02-18 12:12:04 +02:00
Ernest Zaslavsky
1a20877afe s3_client: switch using s3 limits constants in tests
instead of using magic numbers, switch using s3 limit constants to
make it clearer what and why is tested
2026-02-18 12:12:04 +02:00
Pavel Emelyanov
b01adf643c Merge 'init: fix infinite loop on npos wrap with updated Seastar' from Emil Maskovsky
Fixes parsing of comma-separated seed lists in "init.cc" and "cql_test_env.cc" to use the standard `split_comma_separated_list` utility, avoiding manual `npos` arithmetic. The previous code relied on `npos` being `uint32_t(-1)`, which would not overflow in `uint64_t` target and exit the loop as expected. With Seastar's upcoming change to make `npos` `size_t(-1)`, this would wrap around to zero and cause an infinite loop.

Switch to `split_comma_separated_list` standardized way of tokenization that is also used in other places in the code. Empty tokens are handled as before. This prevents startup hangs and test failures when Seastar is updated.

The other commit also removes the unnecessary creation of temporary `gms::inet_address()` objects when calling `std::set<gms::inet_address>::emplace()`.

Refs: https://github.com/scylladb/seastar/pull/3236

No backport: The problem will only appear in master after the Seastar will be upgraded. The old code works with the Seastar before https://github.com/scylladb/seastar/pull/3236 (although by accident because of different integer bitsizes).

Closes scylladb/scylladb#28573

* github.com:scylladb/scylladb:
  init: fix infinite loop on npos wrap with updated Seastar
  init: remove unnecessary object creation in emplace calls
2026-02-18 11:46:26 +03:00
Nadav Har'El
a1475dbeb9 test/cqlpy: make test testMapWithLargePartition faster
Right now the slowest test in the test/cqlpy directory is

   cassandra_tests/validation/entities/collections_test.py::
      testMapWithLargePartition

This test (translated from Cassandra's unit test), just wants to verify
that we can write and flush a partition with a single large map - with
200 items totalling around 2MB in size.

200 items totalling 2MB is large, but not huge, and is not the reason
why this test was so so slow (around 9 seconds). It turns out that most
of the test time was spent in Python code, preparing a 2MB random string
the slowest possible way. But there is no need for this string to be
random at all - we only care about the large size of the value, not the
specific characters in it!

Making the characters written in this text constant instead of random
made it 20 times fast - it now takes less than half a second.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#28271
2026-02-18 10:12:16 +03:00
Raphael S. Carvalho
5b550e94a6 streaming: Release space incrementally during file streaming
File streaming only releases the file descriptors of a tablet being
streamed in the very streaming end. Which means that if the streaming
tablet has compaction on largest tier finished after streaming
started, there will be always ~2x space amplification for that
single tablet. Since there can be up to 4 tablets being migrated
away, it can add up to a significant amount, since nodes are pushed
to a substantial usage of available space (~90%).

We want to optimize this by dropping reference to a sstable after
it was fully streamed. This way, we reduce the chances of hitting
2x space amplification for a given tablet.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#28505
2026-02-18 10:10:40 +03:00
Pavel Emelyanov
89d8ae5cb6 Merge 'http: prepare http clients retry machinery refactoring' from Ernest Zaslavsky
Today S3 client has well established and well testes (hopefully) http request retry strategy, in the rest of clients it looks like we are trying to achieve the same writing the same code over and over again and of course missing corner cases that already been addressed in the S3 client.
This PR aims to extract the code that could assist other clients to detect the retryability of an error originating from the http client, reuse the built in seastar http client retryability and to minimize the boilerplate of http client exception handling

No backport needed since it is only refactoring of the existing code

Closes scylladb/scylladb#28250

* github.com:scylladb/scylladb:
  exceptions: add helper to build a chain of error handlers
  http: extract error classification code
  aws_error: extract `retryable` from aws_error
2026-02-18 10:06:37 +03:00
Pavel Emelyanov
2f10fd93be Merge 's3_client: Fix s3 part size and number of parts calculation' from Ernest Zaslavsky
- Correct `calc_part_size` function since it could return more than 10k parts
- Add tests
- Add more checks in `calc_part_size` to comply with S3 limits

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-640
Must be ported back to 2025.3/4 and 2026.1 since we may encounter this bug in production clusters

Closes scylladb/scylladb#28592

* github.com:scylladb/scylladb:
  s3_client: add more constrains to the calc_part_size
  s3_client: add tests for calc_part_size
  s3_client: correct multipart part-size logic to respect 10k limit
2026-02-18 10:04:53 +03:00
Tomasz Grabiec
d33d38139f test_tablets_parallel_decommission: Fix flakiness due to delayed task appearance
Currently, the test assumes that when
'topology_coordinator_pause_before_processing_backlog: waiting' is
logged, the task for decommission must be there. This was based on the
assumption that topology coordinator is idle and decommission request
wakes it up. But if the server is slow enough, it may still be running
the load balancer in reaction to table creation, and block on that
injection point before decommission request was added.

Fix by waiting for the task to appear rather than the injection.

Fixes SCYLLADB-715
2026-02-18 01:02:50 +01:00
Tomasz Grabiec
2454de4f8f test: cluster: task_manager_client: Introduce wait_task_appears() 2026-02-18 01:02:44 +01:00
Tomasz Grabiec
e14eca46af tests: pylib: util: Add exponential backoff to wait_for
Allows balancing the trade-off between fast execution in case the
condition is satisfied quickly and not adding load when it's not.
2026-02-18 01:02:19 +01:00