Commit Graph

11137 Commits

Author SHA1 Message Date
Michael Litvak
bd66edee5c logstor: truncate table
implement freeing all segments of a table for table truncate.

first do barrier to flush all active and mixed segments and put all the
table's data in compaction groups, then stop compaction for the table,
then free the table's segments and remove the live entries from the
index.
2026-03-18 19:24:27 +01:00
Michael Litvak
37c485e3d1 test: logstor: add separator and compaction tests 2026-03-18 19:24:27 +01:00
Michael Litvak
31aefdc07d logstor: segment and separator barrier
add barrier operation that forces switch of the active segment and
separator, and waits for all existing segments to close and all
separators to flush.
2026-03-18 19:24:27 +01:00
Michael Litvak
600ec82bec logstor: separator
initial implementation of the separator. it replaces "mixed" segments
(segments that have records from different groups) with per-group segments.

every write is written to the active segment and to a buffer in the
active separator. the active separator has in-memory buffers by group.
at some threshold number of segments we switch the active segment and
separator atomically, and start flushing the separator.

the separator is flushed by writing the buffers into new non-mixed
segments, adding them to a compaction group, and freeing the mixed
segments.
2026-03-18 19:24:27 +01:00
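The separator flow described in 600ec82bec can be modeled roughly as below. This is a toy Python sketch, not the logstor code: the class name `SeparatorModel`, the parameters `switch_threshold` and `segment_capacity`, and the lists standing in for on-disk segments are all assumptions made for illustration.

```python
from collections import defaultdict

class SeparatorModel:
    """Toy model: writes go to a mixed active segment and to per-group buffers."""

    def __init__(self, switch_threshold=2, segment_capacity=3):
        self.switch_threshold = switch_threshold    # hypothetical: mixed segments per separator
        self.segment_capacity = segment_capacity    # hypothetical: records per segment
        self.active_segment = []                    # current mixed segment
        self.closed_mixed = []                      # full mixed segments awaiting the flush
        self.buffers = defaultdict(list)            # active separator: group_id -> records
        self.compaction_groups = defaultdict(list)  # group_id -> non-mixed segments

    def write(self, group_id, record):
        # Every write lands in both the active segment and the separator buffer.
        self.active_segment.append((group_id, record))
        self.buffers[group_id].append(record)
        if len(self.active_segment) == self.segment_capacity:
            self.closed_mixed.append(self.active_segment)
            self.active_segment = []
            if len(self.closed_mixed) >= self.switch_threshold:
                self._switch_and_flush()

    def _switch_and_flush(self):
        # Swap out the separator and the mixed segments together, then flush the old ones.
        old_buffers, self.buffers = self.buffers, defaultdict(list)
        old_mixed, self.closed_mixed = self.closed_mixed, []
        for group_id, records in old_buffers.items():
            self.compaction_groups[group_id].append(list(records))  # new non-mixed segment
        old_mixed.clear()  # free the mixed segments
```

In this model the flush runs synchronously inside `write()`; the commit message only says flushing starts after the switch, so the real ordering and concurrency may differ.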
Michael Litvak
5a16980845 logstor: recovery: initial
initial and basic recovery implementation.
* find all files, read their segments and populate the index with the
  newest record for each key.
* find which segments are used and build the usage histogram
2026-03-18 19:24:26 +01:00
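As a rough illustration of the recovery steps in 5a16980845, the sketch below rebuilds an index that keeps only the newest record per key and then counts live records per segment. How logstor actually decides which record is newest is not stated in the commit; the `write_seq` field, the tuple layout, and the function name `recover` are assumptions.

```python
from collections import Counter

def recover(segments):
    """segments: dict of segment_id -> list of (key, write_seq) records.

    Returns (index, live_counts): index maps key -> (segment_id, write_seq) of the
    newest record; live_counts maps segment_id -> number of live records in it.
    """
    index = {}
    for segment_id, records in segments.items():
        for key, write_seq in records:
            current = index.get(key)
            # Keep the record with the highest (assumed) write sequence number.
            if current is None or write_seq > current[1]:
                index[key] = (segment_id, write_seq)
    # Only records still referenced by the index count as used space.
    live_counts = Counter(segment_id for segment_id, _ in index.values())
    return index, live_counts
```

A usage histogram like the one mentioned in the commit could then be derived by bucketing `live_counts` values.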
Michael Litvak
521fca5c92 logstor: index: buckets
divide the primary index to buckets, each bucket containing a btree. the
bucket is determined by using bits from the key hash.
2026-03-18 19:24:26 +01:00
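The bucketing idea in 521fca5c92 might look like the following sketch. The bucket count, the choice of which hash bits to use, and the class name `BucketedIndex` are assumptions; SHA-1 is used only as a readily available 20-byte stand-in for the hash logstor uses, and a plain dict stands in for the per-bucket btree.

```python
import hashlib

BUCKET_BITS = 4  # hypothetical: 2**4 = 16 buckets

def key_hash(key: bytes) -> bytes:
    # SHA-1 is only a stand-in for a 20-byte digest; it is not what logstor uses.
    return hashlib.sha1(key).digest()

def bucket_of(key: bytes) -> int:
    # Take the top BUCKET_BITS bits of the first digest byte.
    return key_hash(key)[0] >> (8 - BUCKET_BITS)

class BucketedIndex:
    def __init__(self):
        # A dict per bucket stands in for the per-bucket btree.
        self.buckets = [dict() for _ in range(1 << BUCKET_BITS)]

    def put(self, key: bytes, location):
        self.buckets[bucket_of(key)][key_hash(key)] = location

    def get(self, key: bytes):
        return self.buckets[bucket_of(key)].get(key_hash(key))
```

Keying each bucket by the digest alone mirrors the stated assumption that hash collisions do not occur.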
Michael Litvak
ddd72a16b0 logstor: add group_id
add group_id value to each log record that is passed with the mutation
when writing it.

the group_id will be used to group log records in segments, such that a
segment will contain records only from a single group.

this will be useful for tablet migration. we want for each tablet to
have their own segments with all their records, so we can migrate them
efficiently by copying these segments.

the group_id value is set to a value equivalent to the tablet id.
2026-03-18 19:24:26 +01:00
Michael Litvak
5f649dd39f logstor: use RIPEMD-160 for index key
use a 20-byte hash function for the index key to make hash collisions
very unlikely. we assume there are no hash collisions.
2026-03-18 19:24:26 +01:00
Michael Litvak
a521bcbcee test: add test_logstor.py
add basic tests for key-value tables with logstor storage
2026-03-18 19:24:26 +01:00
Michael Litvak
1ae1f37ec1 api: add logstor compaction trigger endpoint
add a new api endpoint that triggers logstor compaction.
2026-03-18 19:24:26 +01:00
Michael Litvak
2128b1b15c replica: add logstor to db
Add a single logstor instance in the database that is used for writing
and reading to tables with kv storage
2026-03-18 19:24:26 +01:00
Michael Litvak
9172cc172e schema: add logstor cf property
add a schema property for tables with logstor storage
2026-03-18 19:24:26 +01:00
Michael Litvak
0b1343747f logstor: initial commit
initial implementation of the logstor storage engine for key-value
tables that supports writes, reads and basic compaction.

main components:
* logstor: this is the main interface to users that supports writing and
  reading back mutations, and manages the internal components.
* index: the primary index in-memory that maps a key to a location on
  disk.
* write buffer: writes go initially to a write buffer. it accumulates
  multiple records in a buffer and writes them to the segment manager in
  4k sized blocks.
* segment manager: manages the storage - files, segments, compaction. it
  manages file and segment allocation, and writes 4k aligned buffers to
  the active segment sequentially. it tracks the used space in each
  segment. the compaction finds segments with low space usage and writes
  them to new segments, and frees the old segments.
2026-03-18 19:24:26 +01:00
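The write-buffer behaviour described in 0b1343747f (accumulate records, hand them over in 4k blocks) can be sketched as below. The zero-byte padding, the class name `WriteBuffer`, and the helper `pad_to_block` are assumptions; the commit states only that writes are done in 4k-sized, 4k-aligned blocks.

```python
BLOCK_SIZE = 4096

def pad_to_block(data: bytes, block_size: int = BLOCK_SIZE) -> bytes:
    """Pad data with zero bytes up to the next multiple of block_size."""
    remainder = len(data) % block_size
    if remainder == 0:
        return data
    return data + b"\x00" * (block_size - remainder)

class WriteBuffer:
    def __init__(self):
        self.pending = bytearray()

    def add(self, record: bytes):
        # Accumulate records until the next flush.
        self.pending += record

    def flush(self) -> bytes:
        """Return the accumulated records as one block-aligned buffer."""
        out = pad_to_block(bytes(self.pending))
        self.pending.clear()
        return out
```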
Botond Dénes
172c786079 Merge 'perf-alternator: wait for alternator port before running workload' from Marcin Maliszkiewicz
This patch is mostly for the purpose of running pgo CI job.

We may receive connection error if asyncio.sleep(5) in
pgo.py is not sufficient waiting time.

In pgo.py we do wait for port but only for cql,
anyway it's better to have high level check than
trying to wait for alternator port there.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1071
Backport: 2026.1 - it failed on CI for that build

Closes scylladb/scylladb#29063

* github.com:scylladb/scylladb:
  perf: add abort_source support to wait-for-port loops
  perf-alternator: wait for alternator port before running workload
2026-03-17 18:38:11 +02:00
Botond Dénes
5d868dcc55 Merge 's3_client: fix s3::range max value for object size' from Ernest Zaslavsky
- fix s3::range max value for object size, which is 50TiB and not 5TiB.
- refactor constants to make it accessible for all interested parties, also reuse these constants in tests

No need to backport, doubt we will encounter an object larger than 5TiB

Closes scylladb/scylladb#28601

* github.com:scylladb/scylladb:
  s3_client: reorganize tests in part_size_calculation_test
  s3_client: switch using s3 limits constants in tests
  s3_client: fix the s3::range max object size
  s3_client: remove "aws" prefix from object limits constants
  s3_client: make s3 object limits accessible
2026-03-17 16:34:42 +02:00
Dawid Mędrek
a8dd13731f Merge 'Improve debuggability of test/cluster/test_data_resurrection_in_memtable.py' from Botond Dénes
This test was observed to fail in CI recently but there is not enough information in the logs to figure out what went wrong. This PR makes a few improvements to make the next investigation easier, should it be needed:
* storage-service: add table name to mutation write failure error messages.
* database: the `database_apply` error injection used to cause trouble, catching writes to bystander tables, making tests flaky. To eliminate this, it gained a filter to apply only to non-system keyspaces. Unfortunately, this still allows it to catch writes to the trace tables. While this should not fail the test, it reduces observability, as some traces disappear. Improve this error injection to only apply to selected table. Also merge it with the `database_apply_wait` error injection, to streamline the code a bit.
* test/test_data_resurrection_in_memtable.py: dump data from the table before the checks for expected data, so if checks fail, the data in the table is known.

Refs: SCYLLADB-812
Refs: SCYLLADB-870
Fixes: SCYLLADB-1050 (by restricting `database_apply` error injection, so it doesn't affect writes to system traces)

Backport: test related improvement, no backport

Closes scylladb/scylladb#28899

* github.com:scylladb/scylladb:
  test/cluster/test_data_resurrection_in_memtable.py: dump rows before check
  replica/database: consolidate the two database_apply error injections
  service/storage_proxy: add name of table to error message for write errors
2026-03-17 13:35:19 +01:00
Botond Dénes
318aa07158 Merge ' test/alternator: use module-scope fixtures in test_streams.py ' from Nadav Har'El
Previously, all stream-table fixtures in test_streams.py used scope="function",
forcing a fresh table to be created for every test, slowing down the test a bit
(though not much), and discouraging writing small new tests.

This was a workaround for a DynamoDB quirk (that Alternator doesn't have):
LATEST shard iterators have a time slack and may point slightly before the true
stream head, causing leftover events from a previous test to appear in the next
test's reads.

The first two tests in this series fix small problems that turn up once we start
sharing test tables in test_streams.py. The final patch fixes the "LATEST" problem
and enables sharing the test table by using "module" scope fixtures instead of
"function".

After this series, test_streams.py run time went down a bit, from 20.2 seconds to 17.7 seconds.

Closes scylladb/scylladb#28972

* github.com:scylladb/scylladb:
  test/alternator: speed up test_streams.py by using module-scope fixtures
  test/alternator: test_streams.py don't use fixtures in 4 tests
  test/alternator: fix do_test() in test_streams.py
2026-03-17 13:56:16 +02:00
Botond Dénes
dbe70cddca test/boost/querier_cache_test: make test_time_based_cache_eviction less sensitive to timing
This test relies on the cache entry being evicted after 200ms past the
TTL. This may not happen on a busy CI machine. Make the test less
reliant on timing by using eventually_true().
Simplify the test by dropping the second entry, it doesn't add anything
to the test.

Fixes: SCYLLADB-811

Closes scylladb/scylladb#28958
2026-03-17 10:32:23 +01:00
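The idea behind replacing a fixed wait with `eventually_true()` in dbe70cddca is to poll a condition until it holds or a deadline passes. The helper used by the test is a C++ one; the Python sketch below shows only the general pattern, and its signature and backoff values are assumptions.

```python
import time

def eventually_true(predicate, timeout=5.0, initial_delay=0.001):
    """Poll predicate until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout
    delay = initial_delay
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(delay)
        delay = min(delay * 2, 0.1)  # back off, but keep polling reasonably often
```

A check written this way passes as soon as the eviction happens, instead of failing when a busy machine misses a fixed 200ms window.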
Botond Dénes
0fd51c4adb test/nodetool: rest_api_mock_server: add retry for status code 404
This fixture starts the mock server and immediately connects to it to
set up the expected requests. The connection attempt might be too early,
so there is a retry loop with a timeout. The loop currently checks for
requests.exceptions.ConnectionError. We've seen a case where the
connection is successful but the request fails with 404. The mock
started the server but didn't set up the routes yet. Add a retry for HTTP
404 to handle this.

Fixes: SCYLLADB-966

Closes scylladb/scylladb#29003
2026-03-17 10:30:23 +01:00
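The retry logic described in 0fd51c4adb could be sketched as follows. To stay self-contained, the sketch does not use the `requests` library: `send_request` is a hypothetical callable returning an HTTP status code, and the built-in `ConnectionError` stands in for the connection exception the real fixture catches.

```python
import time

class ServerNotReady(Exception):
    """Raised when the mock server does not become usable before the deadline."""

def setup_with_retry(send_request, timeout=5.0, period=0.01):
    """Call send_request() until it stops failing with ConnectionError or HTTP 404."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            status = send_request()
            if status != 404:  # 404: server is up but routes are not registered yet
                return status
        except ConnectionError:
            pass               # server is not accepting connections yet
        if time.monotonic() >= deadline:
            raise ServerNotReady("mock server did not become ready in time")
        time.sleep(period)
```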
Botond Dénes
035aa90d4b Merge 'Alternator: add per-table batch latency metrics and test coverage' from Amnon Heiman
This series fixes a metrics visibility gap in Alternator and adds regression coverage.

Until now, BatchGetItem and BatchWriteItem updated global latency histograms but did not consistently update per-table latency histograms. As a result, table-level latency dashboards could miss batch traffic.

It updates the batch read/write paths to compute request duration once and record it in both global and per-table latency metrics.

Add the missing tests, including a metric-agnostic helper and a dedicated per-table latency test that verifies latency counters increase for item and batch operations.

This change is metrics-only (no API/behavior change for requests) and improves observability consistency between global and per-table views.

Fixes #28721

**We assume the alternator per-table metrics exist, but the batch ones are not updated**

Closes scylladb/scylladb#28732

* github.com:scylladb/scylladb:
  test(alternator): add per-table latency coverage for item and batch ops
  alternator: track per-table latency for batch get/write operations
2026-03-16 17:18:00 +02:00
Botond Dénes
9de8d6798e Merge 'reader_concurrency_semaphore: skip preemptive abort for permits waiting for memory' from Łukasz Paszkowski
Permits in the `waiting_for_memory` state represent already-executing reads that are blocked on memory allocation. Preemptively aborting them is wasteful -- these reads have already consumed resources and made progress, so they should be allowed to complete.

Restrict the preemptive abort check in maybe_admit_waiters() to only apply to permits in the `waiting_for_admission` state, and tighten the state validation in `on_preemptive_aborted()` accordingly.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1016

Backport not needed. The commit introducing replica load shedding is not part of 2026.1

Closes scylladb/scylladb#29025

* github.com:scylladb/scylladb:
  reader_concurrency_semaphore: skip preemptive abort for permits waiting for memory
  reader_concurrency_semaphore_test: detect memory leak on preemptive abort of waiting_for_memory permit
2026-03-16 17:14:25 +02:00
Marcin Maliszkiewicz
9318c80203 perf: add abort_source support to wait-for-port loops
Check abort_source on each retry iteration in
wait_for_alternator and wait_for_cql so the
wait can be interrupted on shutdown.

Didn't use sleep_abortable as the sleep is very short
anyway.
2026-03-16 16:14:10 +01:00
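The pattern in 9318c80203 (check for abort on every retry iteration, keep the plain short sleep) can be illustrated with the sketch below. The actual code is C++ and uses an abort_source; here a `threading.Event` plays that role, and the names `wait_for_port`, `is_open`, and `Aborted` are hypothetical.

```python
import threading
import time

class Aborted(Exception):
    """Raised when the wait is interrupted by a shutdown request."""

def wait_for_port(is_open, abort: threading.Event, period=0.01):
    """Retry is_open() until it succeeds, checking the abort flag on every iteration."""
    while True:
        if abort.is_set():
            raise Aborted("wait interrupted by shutdown")
        if is_open():
            return
        time.sleep(period)  # short plain sleep, so an abortable sleep adds little
```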
Calle Wilund
a5df2e79a7 storage_service: Wait for snapshot/backup before decommission
Fixes: SCYLLADB-244

Disables snapshot control such that any active ops finish/fail
before proceeding with decommission.
Note: snapshot control provided as argument, not member ref
due to storage_service being used from both main and cql_test_env.
(The latter has no snapshot_ctl to provide).

Could do the snapshot lockout on API level, but want to do
pre-checks before this.

Note: this just disables backup/snapshot fully. Could re-enable
after decommission, but this seems somewhat pointless.

v2:
* Add log message to snapshot shutdown
* Make test use log waiting instead of timeouts

Closes scylladb/scylladb#28980
2026-03-16 17:12:57 +02:00
Marcin Maliszkiewicz
edf0148bee perf-alternator: wait for alternator port before running workload
This patch is mostly for the purpose of running pgo CI job.

We may receive connection error if asyncio.sleep(5) in
pgo.py is not sufficient waiting time.

In pgo.py we do wait for port but only for cql,
anyway it's better to have high level check than
trying to wait for alternator port there.
2026-03-16 16:07:52 +01:00
bitpathfinder
85d5073234 test: Fix non-awaited coroutine in test_gossiper_empty_self_id_on_shadow_round
The line with the error was not actually needed and has therefore been removed.

Fixes: SCYLLADB-906

Closes scylladb/scylladb#28884
2026-03-16 17:07:36 +02:00
Botond Dénes
3e4e0c57b8 Merge 'Relax rf-rack-valid-keyspace option in backup/restore tests' from Pavel Emelyanov
Some tests, when creating a cluster, configure nodes with the rf-rack-valid option, because sometimes they want to have it OFF. For that, the option is explicitly carried around, but the cluster-creating helper can infer this option itself from the provided topology and replication factor.

Removing this option simplifies the code and (a nicer outcome) the test "signature" that's used e.g. on the command line to run a specific test.

Improving tests, not backporting

Closes scylladb/scylladb#28860

* github.com:scylladb/scylladb:
  test: Relax topology_rf_validity parameter for some tests
  test: Auto detect rf-rack-valid option in create_cluster()
2026-03-16 17:06:46 +02:00
Patryk Jędrzejczak
526e5986fe test: test_raft_no_quorum: decrease group0_raft_op_timeout_in_ms after quorum loss
`test_raft_no_quorum.py::test_cannot_add_new_node` is currently flaky in dev
mode. The bootstrap of the first node can fail due to `add_entry()` timing
out (with the 1s timeout set by the test case).

Other test cases in this test file could fail in the same way as well, so we
need a general fix. We don't want to increase the timeout in dev mode, as it
would slow down the test. The solution is to keep the timeout unchanged, but
set it only after quorum is lost. This prevents unexpected timeouts of group0
operations with almost no impact on the test running time.

A note about the new `update_group0_raft_op_timeout` function: waiting for
the log seems to be necessary only for
`test_quorum_lost_during_node_join_response_handler`, but let's do it
for all test cases just in case (including `test_can_restart` that shouldn't
be flaky currently).

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-913

Closes scylladb/scylladb#28998
2026-03-16 16:58:15 +02:00
Artsiom Mishuta
755d528135 test.py: fix warnings
changes in this commit:
1) rename class from 'TestContext' to 'Context' so pytest will not consider this class as a test

2) extend pytest filterwarnings list to ignore warnings from external libs

3) use datetime.datetime.now(datetime.UTC) instead of datetime.datetime.utcnow()

4) use ResultSet.one() instead of ResultSet[0]

Fixes SCYLLADB-904
Fixes SCYLLADB-908
Related SCYLLADB-902

Closes scylladb/scylladb#28956
2026-03-15 12:00:10 +02:00
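Item 3 of 755d528135 swaps the deprecated `datetime.datetime.utcnow()` for a timezone-aware call. A minimal sketch of the replacement follows; it uses `datetime.timezone.utc`, of which `datetime.UTC` (Python 3.11+) is an alias, so that the snippet also runs on older interpreters.

```python
import datetime

# Timezone-aware "now" in UTC. utcnow() returns a naive datetime (tzinfo is None)
# and is deprecated since Python 3.12.
aware_now = datetime.datetime.now(datetime.timezone.utc)

# The aware value carries its UTC offset explicitly.
print(aware_now.tzinfo)
print(aware_now.utcoffset())
```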
Piotr Dulikowski
d8b283e1fb Merge 'Add CQL forwarding for strongly consistent tables' from Wojciech Mitros
In this series we add support for forwarding strongly consistent CQL requests to suitable replicas, so that clients can issue reads/writes to any node and have the request executed on an appropriate tablet replica (and, for writes, on the Raft leader). We return the same CQL response as what the user would get while sending the request to the correct replica and we perform the same logging/stats updates on the request coordinator as if the coordinator was the appropriate replica.

The core mechanism of forwarding a strongly consistent request is sending an RPC containing the user's cql request frame to the appropriate replica and returning back a ready, serialized `cql_transport::response`. We do this in the CQL server - it is most prepared for handling these types and forwarding a request containing a CQL frame allows us to reuse near-top-level methods for CQL request handling in the new RPC handler (such as the general `process`)

For sending the RPC, the CQL server needs to know which node it should forward the request to. This requires knowledge about the tablet raft group members and leader. We obtain this information during the execution of a `cql3/strong_consistency` statement, and we return this information back to the CQL server using the generalized `bounce_to_shard` `response_message`, where we now store the information about either a shard, or a specific replica to which we should forward. Similarly to `bounce_to_shard`, we need to handle this `result_message` in a loop - a replica may move during statement execution, or the Raft leader can change. We also use it for forwarding strongly consistent writes when we're not a member of the affected tablet raft group - in that case we need to forward the statement twice - once to any replica of the affected tablet, then that replica can find the leader and return this information to the coordinator, which allows the second request to be directed to the leader.

This feature also allows passing through exception messages which happened on the target replica while executing the statement. For that, many methods of the `cql_transport::cql_server::connection` for creating error responses needed to be moved to `cql_transport::cql_server`. And for final exception handling on the coordinator, we added additional error info to the RPC response, so that the handling can be performed without having the `result_message::exception` or `exception_ptr` itself.

Fixes [SCYLLADB-71](https://scylladb.atlassian.net/browse/SCYLLADB-71)

[SCYLLADB-71]: https://scylladb.atlassian.net/browse/SCYLLADB-71?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#27517

* github.com:scylladb/scylladb:
  test: add tests for CQL forwarding
  transport: enable CQL forwarding for strong consistency statements
  transport: add remote statement preparation for CQL forwarding
  transport: handle redirect responses in CQL forwarding
  transport: add exception handling for forwarded CQL requests
  transport: add basic CQL request forwarding
  idl: add a representation of client_state for forwarding
  cql_server: handle query, execute, batch in one case
  transport: inline process_on_shard in cql_server::process
  transport: extract process() to cql_server
  transport: add messaging_service to cql_server
  transport: add response reconstruction helpers for forwarding
  transport: generalize the bounce result message for bouncing to other nodes
  strong consistency: redirect requests to live replicas from the same rack
  transport: pass foreign_ptr into sleep_until_timeout_passes and move it to cql_server
  transport: extract the error handling from process_request_one
  transport: move error response helpers from connection to cql_server
2026-03-13 15:03:10 +01:00
Pavel Emelyanov
d544d8602d test: Relax topology_rf_validity parameter for some tests
Tests that call create_cluster() helper no longer need to carry the
rf-validity parameter. This simplifies the code and test signature.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-13 14:30:32 +03:00
Pavel Emelyanov
313985fed7 test: Auto detect rf-rack-valid option in create_cluster()
The helper accepts it as a boolean argument, but it can easily derive
it from the provided topology.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-13 14:30:32 +03:00
Łukasz Paszkowski
4c4d043a3b reader_concurrency_semaphore: skip preemptive abort for permits waiting for memory
Permits in the `waiting_for_memory` state represent already-executing
reads that are blocked on memory allocation. Preemptively aborting
them is wasteful -- these reads have already consumed resources and
made progress, so they should be allowed to complete.

Restrict the preemptive abort check in maybe_admit_waiters() to only
apply to permits in the `waiting_for_admission` state, and tighten
the state validation in `on_preemptive_aborted()` accordingly.

Adjust the following tests:
+ test_reader_concurrency_semaphore_abort_preemptively_aborted_permit
  no longer relies on requesting memory
+ test_reader_concurrency_semaphore_preemptive_abort_requested_memory_leak
  adjusted to the fix

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1016
2026-03-13 09:50:05 +01:00
Botond Dénes
fc8cebd671 Merge 'Verify components digests during component load and scrub in validate mode' from Taras Veretilnyk
This PR adds integrity verification for SSTable component files during loading. When component digests are present in Scylla metadata, the loader now validates each component's CRC32 digest against the stored expected value, catching silent corruption of component files. Index, Rows and Partitions component digests are also validated during scrub in validate mode.

Added corruption tests that write an SSTable, flip a bit in a specific component file, then verify that reloading the SSTable detects the corruption and throws the expected exception.

Depends on https://github.com/scylladb/scylladb/pull/28338

Backport is not required, this is a new feature

Fixes https://github.com/scylladb/scylladb/issues/20103

Closes scylladb/scylladb#28761

* github.com:scylladb/scylladb:
  test/cqlpy: test --ignore-component-digest-mismatch flag in scylla sstable upgrade
  docs: document --ignore-component-digest-mismatch flag for scylla sstable upgrade
  sstables: propagate ignore_component_digest_mismatch config to all load sites
  sstables: add option to ignore component digest mismatches
  sstable_compaction_test: Add scrub validate test for corrupted index
  sstables: add tests for component digest validation on corrupted SSTables
  sstables: validate index components digests during SSTable scrub in validate mode
  sstables: verify component digests on SSTable load
  sstables: add digest_file_random_access_reader for CRC32 digest computation
2026-03-13 09:55:55 +02:00
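The check and the corruption test described in fc8cebd671 can be sketched in a few lines. This is only an illustration of the technique: the function names, the exception type, and the `ignore_mismatch` parameter (loosely modeled on the `--ignore-component-digest-mismatch` flag named in the patch list) are assumptions, and `zlib.crc32` stands in for however Scylla computes the digest.

```python
import zlib

class DigestMismatch(Exception):
    """Raised when a component's CRC32 does not match the stored digest."""

def verify_component(data: bytes, expected_crc32: int, ignore_mismatch: bool = False) -> bool:
    """Compare the CRC32 of a component's bytes with the stored digest."""
    actual = zlib.crc32(data) & 0xFFFFFFFF
    if actual == expected_crc32:
        return True
    if ignore_mismatch:
        return False
    raise DigestMismatch(f"expected {expected_crc32:#010x}, got {actual:#010x}")

def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with a single bit flipped, as a corruption test would."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)
```

Because CRC32 detects every single-bit error, a test that flips one bit and reloads is guaranteed to hit the mismatch path.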
Avi Kivity
ae8a418744 Merge 'Await async calls in test tablets migration' from Benny Halevy
Fix several test cases that did not await async tasks:
- test_restart_leaving_replica_during_cleanup
- test_restart_in_cleanup_stage_after_cleanup
- test_tablet_back_and_forth_migration
- test_staging_backlog_is_preserved_with_file_based_streaming

Fixes SCYLLADB-910

* Minor fixes, no backport needed

Closes scylladb/scylladb#28908

* github.com:scylladb/scylladb:
  test_tablets_migration: test_staging_backlog_is_preserved_with_file_based_streaming: convert for loop to asyncio.gather
  test_tablets_migration: test_tablet_back_and_forth_migration: await move_tablet
  test_tablets_migration: test_restart_in_cleanup_stage_after_cleanup: await move_task
  test_tablets_migration: test_restart_leaving_replica_during_cleanup: await move_task
  test_tablets_migration: drop unused imports from cassandra.query
2026-03-13 00:20:29 +02:00
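The class of bug fixed in ae8a418744 is easy to reproduce in isolation: calling a coroutine function without `await` only creates a coroutine object, and its body never runs. The sketch below contrasts that with awaiting a single call and gathering several; `move_tablet` here is a hypothetical stand-in, not the test suite's helper.

```python
import asyncio

async def move_tablet(tablet_id, log):
    # Hypothetical stand-in for an async test helper.
    await asyncio.sleep(0)
    log.append(tablet_id)

async def buggy(log):
    # Without await, this only creates a coroutine object; the body never runs.
    coro = move_tablet(1, log)
    coro.close()  # closed explicitly here only to avoid the "never awaited" warning

async def fixed(log):
    # Await a single call, or gather several to run them concurrently.
    await move_tablet(1, log)
    await asyncio.gather(*(move_tablet(i, log) for i in (2, 3, 4)))

buggy_log, fixed_log = [], []
asyncio.run(buggy(buggy_log))
asyncio.run(fixed(fixed_log))
```

After running, `buggy_log` stays empty while `fixed_log` contains all four tablet ids.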
Avi Kivity
b228eb26e6 Merge 'dbuild: Use slirp4netns network in dbuild nested containers' from Calle Wilund
Fixes #25084

Add slirp4netns and use it for nested containers. This will allow nested container port aliasing, helping CI stability.

Note: this contains an updated Dockerfile for the dbuild image, but because of the chicken-and-egg problem, the dbuild script for now force-installs slirp4netns before anything else.

Updates the mock server handling to use ephemeral ports and query from container, ensuring we don't get port collisions. (boost as well as pytest).

Includes a timeout increase and a tweak to our scylla_cluster handling, ensuring we don't deadlock when the pipe size is less than required for our sys notify messages.

Closes scylladb/scylladb#28727

* github.com:scylladb/scylladb:
  gcs_fixture: Change to use docker helper
  aws_kms_fixture: Modify to use docker helper
  test/lib/proc_util: Add docker helper
  pytest: use ephemeral port publish for docker mock servers
  dbuild: Use container network in dbuild nested containers
  scylla_cluster: Read notify sock in background to prevent deadlock
2026-03-12 23:49:25 +02:00
Nadav Har'El
ad832c263e test/cluster: mark test_alternator_concurrent_rmw_same_partition_different_server not strictly xfail
A few days ago, in commit 7b30a39 we added to pytest.ini the option
xfail_strict. This option causes every time a test XPASSes, i.e., an xfail
test actually passes - to be considered an error and fail the test.

But some tests demonstrate a timing-related bug and do not reproduce the
bug every single time. An example we noticed in one CI run is:

test/cluster/test_alternator.py::test_alternator_concurrent_rmw_same_partition_different_server

This test reproduces a timing-related bug (if you do an LWT write to
one partition on to two different coordinators "at the same time", you
can get a failure), but only most of the time, not 100% of the time.

The solution is to add "strict=False" for the xfail marker on this specific
test. This undoes the xfail_strict for this specific test, accepting that
this specific test can either pass or fail. Note that this does NOT make
this test worthless - we still see this test failing most of the time, and
when a developer finally fixes this issue, the test will begin to pass all
the time.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-941
Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#29016
2026-03-12 23:46:23 +02:00
Avi Kivity
e2eeef3e01 Merge 'service level: remove remnants of version 1 service level' from Gleb Natapov
can_use_effective_service_level_cache() always returns true now, so the function can be dropped entirely and all the code that assumes it may return false can be dropped as well. Also drop async versions of find_effective_service_level and get_user_scheduling_group since they are unused.

No need to backport, code removal.

Closes scylladb/scylladb#29002

* github.com:scylladb/scylladb:
  service level: make maybe_update_per_service_level_params synchronous
  service level: remove unused get_user_scheduling_group function
  service level: drop async find_effective_service_level
  service level: remove remnants of version 1 service level
2026-03-12 23:39:41 +02:00
Avi Kivity
e8a6706d6e Merge 'shorten some sleeps to speed up bootstrap in tests' from Patryk Jędrzejczak
This PR shortens two sleeps from 1s to 100ms to speed up bootstrap in tests.
The changed sleeps are:
- the pause duration in group0 discovery,
- the retry period in `wait_for_cql`.

Refs: https://scylladb.atlassian.net/browse/SCYLLADB-918

No backport: performance improvements mostly relevant to tests.

Closes scylladb/scylladb#29020

* github.com:scylladb/scylladb:
  test: pylib: util: wait for CQL being ready with a shorter period
  group0: discovery: shorten the pause duration
2026-03-12 21:17:05 +02:00
Wojciech Mitros
32974770b0 test: add tests for CQL forwarding
Add basic cluster tests for CQL forwarding.
The test cases include:
- basic reads and writes
- prepared statements with binds
- forwarding from a non-replica
- exception passthrough during forwarding (using an injection)
- re-preparing a statement on the target node, even if the user
  query is also an EXECUTE request on a prepared statement
- verification of metric updates

The existing test_basic_write_read was modified so that a few extra
cases could be validated on the same cluster.
2026-03-12 19:43:35 +01:00
Wojciech Mitros
916a9995c1 transport: enable CQL forwarding for strong consistency statements
We enable CQL forwarding by starting to return the bounce_to_node
result message in redirect_statement() instead of throwing. The
forwarding code introduced in the preceding patches reacts to these
messages, allowing the requests to be forwarded.

With the update, some tests assuming that requests can't be forwarded
need to be adjusted, so we do that as well.
2026-03-12 19:43:35 +01:00
Avi Kivity
76b6784c1a Merge 'cql3: track CQL parsing memory cost and use it for admission control' from Marcin Maliszkiewicz
Use rolling_max_tracker to record gross bytes allocated during each
CQL parse.  The rolling maximum is then added to the memory estimate
for incoming QUERY and PREPARE requests so that the admission control
in the CQL transport layer accounts for parsing overhead.

The measured memory footprint serves as an upper bound rather than an
exact number, but its purpose is to prevent OOMs under heavy load of
unprepared statements.

In a benchmark, a node with 1G of memory shows a decrease in non-LSA memory
usage from a peak of 320MB (our coordinator budget is 10% of 1G) to 96MB,
while tps drops from 1.2 kops to 0.8 kops. The drop in tps is expected as
memory admission kicks in trying to prevent OOM.

This is phase 1 of OOM prevention, potential next steps:
- add second admission in query_processor::get_statement trying to prevent potential thundering herd problem
- decrease cql_server memory pool size
- count reads in the memory pool
- add per service level memory pool and a shared one

Related https://scylladb.atlassian.net/browse/SCYLLADB-740
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-938

Backport: no, new feature, but we may reconsider if some customer needs it

Closes scylladb/scylladb#28919

* github.com:scylladb/scylladb:
  cql3: track CQL parsing memory cost and use it for admission control
  utils: add rolling max tracker
2026-03-12 19:59:52 +02:00
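One way to picture the approach in 76b6784c1a is a tracker that keeps the maximum over recent samples and adds it to each request's memory estimate. The sketch below is not the `rolling_max_tracker` from the patch, whose design is not described here: it assumes a window counted in samples and uses the standard monotonic-deque technique, and `estimate_request_memory` is a hypothetical helper.

```python
from collections import deque

class RollingMax:
    """Maximum over the most recent `window` samples (monotonic-deque technique)."""

    def __init__(self, window: int):
        self.window = window
        self.count = 0
        self.candidates = deque()  # (sample_index, value), values strictly decreasing

    def add(self, value: int) -> None:
        # Older samples that are not larger than the new one can never be the max again.
        while self.candidates and self.candidates[-1][1] <= value:
            self.candidates.pop()
        self.candidates.append((self.count, value))
        self.count += 1
        # Drop the front candidate once it falls out of the window.
        if self.candidates[0][0] <= self.count - 1 - self.window:
            self.candidates.popleft()

    def current(self) -> int:
        return self.candidates[0][1] if self.candidates else 0

def estimate_request_memory(request_size: int, parse_cost: RollingMax) -> int:
    # Hypothetical estimate: request bytes plus the recent worst-case parse cost.
    return request_size + parse_cost.current()
```

Using the recent maximum rather than an average matches the commit's framing of the figure as an upper bound.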
Wojciech Mitros
e44820ba1f transport: generalize the bounce result message for bouncing to other nodes
In the following patches, we'll start allowing forwarding requests to strongly
consistent tables so that they'll get executed on the suitable tablet Raft group
members. For that we'll reuse the approach that we already have for bouncing
requests to other shards - we'll try to execute a request locally, and the
result of that will be a bounce message with another replica as the target.
In this patch we generalize the former bounce_to_shard result message so that
it will be able to specify the target of the bounce as another shard or specific
replica.
We also rename it to result_message::bounce so that it stops implying that only
another shard may be its target.
Aside from the host_id and the shard, the new message also includes the timeout,
because in the service handling the forwarding we won't have the access to it,
and it's needed for specifying how long we should wait for the forwarded
requests. It also includes information on whether this is a write request,
to return the correct timeout response in case the deadline is exceeded.
We will return other hosts in the new bounce message when executing requests to
strongly consistent tables when we can't handle the request because we aren't
a suitable replica. We can't handle this message yet, so we don't return it
anywhere and we still assume that every bounce message is a bounce to the same
host.
2026-03-12 17:48:57 +01:00
Wojciech Mitros
b4d66fda2e strong consistency: redirect requests to live replicas from the same rack
Forwarding CQL requests is not implemented yet, but we're already
prepared to return the target to forward to when trying to execute
strongly consistent requests. Currently, if we're not a replica
of the affected tablet, we redirect the request to the first replica
in the list.
This is not optimal, because this replica may be down or it may be
in another rack, making us perform cross-rack requests during forwarding.
Instead, we should forward the request to the replica from the same
rack and handle the case where the replica is down.

In this patch we change the replica selection for forwarding strongly
consistent requests, so that when the coordinator isn't a replica, it
redirects the request to the replica from the same rack.

If the replica from the same rack is down, or there is no replica in
our rack, we choose the next closest replica (preferring same-DC replicas
over other DCs). If no replica is alive, the query fails - the driver
should retry when some replica comes back up.
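
The selection order described above can be sketched as follows. This is a
simplified Python model, not the code from this patch; the Replica type and
pick_forward_target() are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Replica:
    host_id: str
    dc: str
    rack: str
    alive: bool


def pick_forward_target(replicas: Sequence[Replica],
                        local_dc: str,
                        local_rack: str) -> Optional[Replica]:
    """Return the closest live replica, or None if no replica is alive."""
    def distance(r: Replica) -> int:
        if r.dc == local_dc and r.rack == local_rack:
            return 0   # same rack: preferred
        if r.dc == local_dc:
            return 1   # same DC, different rack
        return 2       # another DC: last resort

    live = [r for r in replicas if r.alive]
    # min() keeps the first of equally distant replicas, so list order
    # breaks ties; default=None covers the "no replica is alive" case.
    return min(live, key=distance, default=None)
```

A None result corresponds to the case where the query fails and the driver
is expected to retry.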
2026-03-12 17:48:54 +01:00
Andrzej Jackowski
3b9cd52a95 reader_concurrency_semaphore_test: detect memory leak on preemptive abort of waiting_for_memory permit
A permit in `waiting_for_memory` state can be preemptively aborted by
maybe_admit_waiters(). This is wrong: such permits have already been
admitted and are actively processing a read — they are merely blocked
waiting for memory under serialize-limit pressure.

When `on_preemptive_aborted()` fires on a `waiting_for_memory` permit,
it does not clear `_requested_memory`. A subsequent `request_memory()`
call accumulates on top of the stale value, causing `on_granted_memory()`
to consume more than resource_units tracks.

This commit adds a test that confirms that scenario by counting
internal_errors.
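
The accounting mismatch can be illustrated with a toy model. This is a
Python sketch, not the reader_concurrency_semaphore code: ToyPermit and its
fields are hypothetical and only mimic the sequence described above.

```python
class ToyPermit:
    """Toy stand-in for a permit; only models the memory bookkeeping."""

    def __init__(self, clear_on_abort: bool) -> None:
        self.clear_on_abort = clear_on_abort
        self.requested = 0   # plays the role of _requested_memory
        self.consumed = 0    # what the permit actually consumes
        self.tracked = 0     # what the caller's resource units account for

    def request_memory(self, amount: int) -> None:
        self.requested += amount

    def on_preemptive_aborted(self) -> None:
        # The problematic path leaves the pending request in place.
        if self.clear_on_abort:
            self.requested = 0

    def on_granted_memory(self, amount: int) -> None:
        self.consumed += self.requested  # includes any stale leftover
        self.tracked += amount           # caller only tracks this request
        self.requested = 0


def run(clear_on_abort: bool) -> tuple[int, int]:
    p = ToyPermit(clear_on_abort)
    p.request_memory(100)
    p.on_preemptive_aborted()   # abort while waiting for memory
    p.request_memory(50)
    p.on_granted_memory(50)
    return p.consumed, p.tracked
```

In this model, run(False) ends with more memory consumed than tracked, while
run(True) keeps the two equal.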
2026-03-12 17:09:34 +01:00
Alex
7fd39ba586 test/cluster: strengthen raft voters multi-DC test and tune debug runtime
The test_raft_voters_multidc_kill_dc scenario had become weaker after group0 voter count was made always odd.
  In particular, the old num_nodes == 1 case (dc1=2, dc2=1, dc3=1) could pass even without the intended balancing logic, because with 3 voters total we naturally get one voter per DC.

  This change restores coverage of the original intent:

  - Replace num_nodes parametrization with explicit DC triples.
  - Use (3, 1, 1) to force a meaningful asymmetric topology where voter placement logic is required.
  - Keep a larger topology case (6, 3, 3) for broader coverage.
  - Mark (6, 3, 3) as skip_mode(debug) with reason:
    larger topology case is too slow in debug on minipcs.

  Also updated comments/docstring to match the new setup.

Fixes: SCYLLADB-794

backport: None, it is done to deflake minipcs that will start working only on master

Closes scylladb/scylladb#29000
2026-03-12 17:07:45 +01:00
Patryk Jędrzejczak
c50cf32793 test: pylib: util: wait for CQL being ready with a shorter period
`wait_for_cql` is used in hundreds, if not thousands, of places in tests.
We shouldn't waste up to 1s for every call.

Also, the 1s period is clearly too long compared to the bootstrap time,
which is usually 0-3s in dev mode.

The following test speeds up from 50s to 42s with the change:
```
for _ in range(10):
    servers = await manager.servers_add(3)
    await manager.get_ready_cql(servers)
```
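
For illustration, a generic polling helper with a configurable period might
look like the sketch below. It is not the `wait_for_cql` implementation from
test/pylib; `wait_until_ready` and its parameters are hypothetical. With a
period of `p`, each call wastes up to `p` seconds after the server becomes
ready, which is why a shorter period helps.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def wait_until_ready(is_ready: Callable[[], Awaitable[bool]],
                           period: float,
                           timeout: float) -> int:
    """Poll is_ready() every `period` seconds; return the number of polls."""
    deadline = time.monotonic() + timeout
    polls = 0
    while True:
        polls += 1
        if await is_ready():
            return polls
        if time.monotonic() >= deadline:
            raise TimeoutError(f"not ready after {timeout}s")
        # Readiness is only noticed at the next poll, so up to `period`
        # seconds are wasted per call.
        await asyncio.sleep(period)
```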
2026-03-12 15:40:19 +01:00
Gleb Natapov
c67f876893 service level: make maybe_update_per_service_level_params synchronous
It does not call async functions any more.
2026-03-12 15:53:08 +02:00
Benny Halevy
b3fec20960 test_tablets_migration: test_staging_backlog_is_preserved_with_file_based_streaming: convert for loop to asyncio.gather
Currently the test iterates on all servers and calls manager.api.disable_injection
but it doesn't await those calls.
Use asyncio.gather to await all calls in parallel.

Co-authored-by: Copilot CLI
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
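
A minimal sketch of the pattern, assuming a stand-in coroutine instead of
the real manager.api.disable_injection (whose signature is not shown here):
calling a coroutine function without awaiting it only creates a coroutine
object, so nothing runs; asyncio.gather awaits all of the calls
concurrently.

```python
import asyncio


async def disable_injection(server: str, name: str, log: list) -> None:
    """Hypothetical stand-in for an async per-server API call."""
    await asyncio.sleep(0)
    log.append((server, name))


async def disable_on_all(servers: list[str], name: str) -> list:
    log: list = []
    # Buggy form: the loop below would create coroutine objects that are
    # never awaited, so no call would actually execute:
    #     for s in servers:
    #         disable_injection(s, name, log)
    # Fixed form: await all calls in parallel.
    await asyncio.gather(*(disable_injection(s, name, log) for s in servers))
    return log
```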
2026-03-12 15:26:40 +02:00
Benny Halevy
61d5a2df02 test_tablets_migration: test_tablet_back_and_forth_migration: await move_tablet
Co-authored-by: Copilot CLI
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-03-12 15:26:40 +02:00
Benny Halevy
b8655748a2 test_tablets_migration: test_restart_in_cleanup_stage_after_cleanup: await move_task
Co-authored-by: Copilot CLI
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-03-12 15:26:40 +02:00