Commit Graph

52099 Commits

Author SHA1 Message Date
Andrzej Jackowski
ec42fdfd01 test: add cluster tests for write CL guardrails
Most of the functionality is tested in cqlpy tests located in
`test_guardrail_write_consistency_level.py`. Add two tests
that require the cluster framework:

- `test_invalid_write_cl_guardrail_config` checks the node startup
  path when incorrect `write_consistency_levels_warned` and
  `write_consistency_levels_disallowed` values are used.
- `test_write_cl_default` checks the behavior of the default
  configuration using a multi-node cluster.

Tests execution time:
 - Dev: 10s
 - Debug: 18s

Refs: SCYLLADB-259
2026-03-04 08:00:17 +01:00
Andrzej Jackowski
446539f12f test: implement test_guardrail_write_consistency_level
Implement basic tests for write consistency level guardrails,
verifying that they work for each type of write request (inserts,
updates, deletes, logged batches, unlogged batches, conditional batches,
and counter operations).

All tests are marked as Scylla-only because they currently don't
pass with Cassandra due to differences in handling superusers (see:
SCYLLADB-882).

Tests execution time:
 - Dev: 3s
 - Debug: 14s

Refs: SCYLLADB-259
Refs: SCYLLADB-882
2026-03-04 08:00:13 +01:00
Andrzej Jackowski
bb359b3b78 cql3: start using write CL guardrails
Enable verification of write consistency level guardrails in
`modification_statement` and `batch_statement`.

Neither guardrail is enabled by default, so as not to disrupt clusters
that are currently using any of the CLs for writes. The warning
guardrail may seem harmless, as it only adds a warning to the CQL
response; however, enabling it can significantly increase network
traffic (as a warning message is added to each response) and also
decrease throughput due to additional allocations required to prepare
the warning. Therefore, both guardrails should be enabled with care.
The newly added `writes_per_consistency_level` metric, which is
incremented unconditionally, can help decide whether a guardrail can
be safely enabled in an existing cluster.

This commit adds additional `if` instructions on the critical path.
However, based on the `perf_simple_query` benchmark for writes,
the difference is marginal (~40 additional instructions, which is
a relative difference smaller than 0.001).

BEFORE:
```
291443.35 tps ( 53.3 allocs/op,  16.0 logallocs/op,  14.2 tasks/op,   48067 insns/op,   18885 cycles/op,        0 errors)
throughput:
	mean=   289743.07 standard-deviation=6075.60
	median= 291424.69 median-absolute-deviation=1702.56
	maximum=292498.27 minimum=261920.06
instructions_per_op:
	mean=   48072.30 standard-deviation=21.15
	median= 48074.49 median-absolute-deviation=12.07
	maximum=48119.87 minimum=48019.89
cpu_cycles_per_op:
	mean=   18884.09 standard-deviation=56.43
	median= 18877.33 median-absolute-deviation=14.71
	maximum=19155.48 minimum=18821.57
```

AFTER:
```
290108.83 tps ( 53.3 allocs/op,  16.0 logallocs/op,  14.2 tasks/op,   48121 insns/op,   18988 cycles/op,        0 errors)
throughput:
	mean=   289105.08 standard-deviation=3626.58
	median= 290018.90 median-absolute-deviation=1072.25
	maximum=291110.44 minimum=274669.98
instructions_per_op:
	mean=   48117.57 standard-deviation=18.58
	median= 48114.51 median-absolute-deviation=12.08
	maximum=48162.18 minimum=48087.18
cpu_cycles_per_op:
	mean=   18953.43 standard-deviation=28.76
	median= 18945.82 median-absolute-deviation=20.84
	maximum=19023.93 minimum=18916.46
```

Fixes: SCYLLADB-259
2026-03-04 07:26:00 +01:00
Andrzej Jackowski
371cdb3c81 cql3/query_processor: implement metrics to track CL of writes
Add `write_consistency_levels_disallowed_violations` and
`write_consistency_levels_warned_violations` metrics to track
violations of write_consistency_levels guardrails.

Add `writes_per_consistency_level` to track what CL is used by
writes, regardless of the guardrails configuration.
Data gathered by this metric can be used to decide whether enabling
a particular write consistency level guardrail in a particular
existing cluster is safe.

Refs: SCYLLADB-259
2026-03-03 21:18:11 +01:00
Andrzej Jackowski
3606934458 db: cql3/query_processor: add write_consistency_levels enum_sets
Add enum_sets to query_processor that track the configuration
values of `write_consistency_levels_warned` and
`write_consistency_levels_disallowed`.

Refs: SCYLLADB-259
2026-03-03 20:28:57 +01:00
Andrzej Jackowski
e2c4b0a733 config: add write_consistency_levels_* guardrails configuration
Add guardrails configuration that can be used later in this
patch series.

Refs: SCYLLADB-259
2026-02-25 10:30:03 +01:00
Andrei Chekun
1b92b140ee test.py: improve stdout output for boost test
The current way of checking the boost's stdout can have a race
condition when pytest will try to read the file before it was really
flushed. So this PR should eliminate this possibility.

Closes scylladb/scylladb#28783
2026-02-25 00:50:25 +01:00
Ferenc Szili
f70ca9a406 load_stats: implement the optimized sum of tablet sizes
PR #28703 was merged into master but not with the latest version of the
changes. This patch is an incremental fix for this.

Currently, the elements of the tablet_sizes_per_shard vector are
incremented in separate shards. This is prone to false sharing of cache
lines, and ping-pong of memory, which leads to reduced performance.

In this patch, in order to avoid cache line collisions while updating
the sum of tablet sizes per shard, we align the counter to 64 bytes.

Fixes: SCYLLADB-678

Closes scylladb/scylladb#28757
2026-02-24 22:19:25 +01:00
Alex
5557770b59 test_mv_build_during_shutdown started two async CREATE MATERIALIZED VIEW operations and never awaited them (asyncio.gather(...) without await).
This pr adds await for each one of the tasks to wait for the MV schema to be added successfully
and then to start the server shutdown
With this change we dont need will not get the shutdown races.

Closes scylladb/scylladb#28774
2026-02-24 17:25:05 +01:00
Anna Stuchlik
64b1798513 doc: remove reduntant Java-related information
This commit removes:
- Instructions to install scylla-jmx (and all references)
- The Java 11 requirement for Ubuntu.

Fixes https://github.com/scylladb/scylladb/issues/28249
Fixes https://github.com/scylladb/scylladb/issues/28252

Closes scylladb/scylladb#28254
2026-02-24 14:37:39 +01:00
Anna Stuchlik
e2333a57ad doc: remove the tablets limitation for Alternator
This commit removes the information that Alternator doesn't support tablets.
The limitation is no longer valid.

Fixes SCYLLADB-778

Closes scylladb/scylladb#28781
2026-02-24 14:24:30 +02:00
Andrzej Jackowski
cd4caed3d3 test: fix configuration of test_autoretrain_dict
`test_autoretrain_dict` sporadically fails because the default
compression algorithm was changed after the test was written.

`9ffa62a986815709d0a09c705d2d0caf64776249` was an attempt to fix it by
changing the compression configuration during node startup. However,
the configuration change had an incorrect YAML format and was
ignored by ScyllaDB. This commit fixes it.

Fixes: scylladb/scylladb#28204

Closes scylladb/scylladb#28746
2026-02-24 12:08:44 +01:00
Botond Dénes
067bb5f888 test/scylla_gdb: skip coroutine tests if coroutine frame is not found
For a while, we have seen coroutine related tests (those that use the
coroutine_task fixture) fail occasionally, because no coroutine frame is
found. Multiple attempts were made to make this problem self-diagnosing
and dump enough information to be able to debug this post-mortem. To no
avail so far. A lot of time was invested into this this benign issue:
See the long discussion at https://github.com/scylladb/scylladb/issues/22501.

It is not known if the bug is in gdb, or the gdb script trying to find
the coroutine frame. In any case, both are only used for debugging, so
we can tolerate occasional failures -- we are forced to do so when
working with gdb anyway.
Instead of piling on more effor there, just skip these tests when the
problem occurs. This solves the CI flakyness.

Fixes: #22501

Closes scylladb/scylladb#28745
2026-02-24 10:12:03 +01:00
Marcin Maliszkiewicz
d5684b98c8 test: cluster: add continue-after-error to perf tool tests
Add --continue-after-error true to perf-cql-raw and perf-alternator
tests, and --stop-on-error false to perf-simple-query test, so that
tests don't abort on the first error.

Reason for this is that tests are flaky with example failure:
Perf test failed: std::runtime_error (server returned ERROR to EXECUTE)

When CPU is starved on CI we can return timeouts and/or other errors.

The change should make tests more robust on the expense of smaller test
scope. But those tests were written mostly to test startup sequence
as it differs from Scylla's starup.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-759

Closes scylladb/scylladb#28767
2026-02-24 11:08:34 +02:00
Avi Kivity
0add130b9d lua: avoid undefined behavior when converting doubles to integers
Lua doesn't have separate integer and floating point numbers,
so we check if a number can fit in an integer and if so convert
it to an integer.

The conversion routine invokes undefined behavior (and even
acknowledges it!). More recent compilers changed their behavior
when casting infinities, breaking test_user_function_double_return
which tests this conversion.

Fix by tightening the conversion to not invoke undefined behavior.

Closes scylladb/scylladb#28503
2026-02-24 10:41:21 +02:00
Botond Dénes
1d5b8cc562 Merge 'Fix use after free in auth cache' from Marcin Maliszkiewicz
This patchset:
- ensures the loading semaphore is acquired in cross-shard callbacks
- fixes iterator invalidation problem when reloading all cached permissions

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-780
Backport: no, affected code not released yet

Closes scylladb/scylladb#28766

* github.com:scylladb/scylladb:
  auth: cache: fix permissions iterator invalidation in reload_all_permissions
  auth/cache: acquire _loading_sem in cross-shard callbacks
2026-02-24 10:35:46 +02:00
Pavel Emelyanov
5a5eb67144 vector_search/dns: Use newer seastar get_host_by_name API
The hostent::addr_list is deprecated in favor of address_entry::addr
field that contains the very same addresses.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28565
2026-02-23 21:28:43 +02:00
Pavel Emelyanov
6b02b50e3d Merge 'object_storage: add retryable machinery to object storage' from Ernest Zaslavsky
- add an overload to the rest http client to accept retry strategy instance as an argument
- remove hand rolled error handling from object storage client and replace with common machinery that supports handling and retrying when appropriate

No backport neede since it is only refactoring

Closes scylladb/scylladb#28161

* github.com:scylladb/scylladb:
  object_storage: add retryable machinery to object storage
  rest_client: add `simple_send` overload
2026-02-23 21:28:51 +03:00
Botond Dénes
dcd8de86ee Merge 'docs: update a documentation of adding/removing DC and rebuilding a node' from Aleksandra Martyniuk
Describe a procedure to convert tablet keyspace replication factor
to rack list. Update the procedures of adding and removing a node
to consider tablet keyspaces.

Fixes: [SCYLLADB-398](https://scylladb.atlassian.net/browse/SCYLLADB-398)
Fixes: https://github.com/scylladb/scylladb/issues/28306.
Fixes: https://github.com/scylladb/scylladb/issues/28307.
Fixes: https://github.com/scylladb/scylladb/issues/28270.

Needs backport to all live branches as they all include tablets.

[SCYLLADB-398]: https://scylladb.atlassian.net/browse/SCYLLADB-398?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#28521

* github.com:scylladb/scylladb:
  docs: update nodetool rebuild docs
  docs: update a procedure of decommissioning a DC
  docs: update a procedure of adding a DC
  docs: describe upgrade to enforce_rack_list option
  docs: describe conversion to rack-list RF
2026-02-23 16:15:16 +02:00
Andrei Chekun
6ae58c6fa6 test.py: move storage tests to cluster subdirectory
Move the storage test suite from test/storage/ to test/cluster/storage/
to consolidate related cluster-based tests.This removes the standalone
test/storage/suite.yaml as the tests will use the cluster's test configuration.
Initially these tests were in cluster, but to use unshare at first
iteration they were moved outside. Now they are using another way to
handle volumes without unshare, they should be in cluster

Closes scylladb/scylladb#28634
2026-02-23 16:14:15 +02:00
Marcin Maliszkiewicz
c5dc086baf Merge 'vector_search: return NaN for similarity_cosine with all-zero vectors' from Dawid Pawlik
The ANN vector queries with all-zero vectors are allowed even on vector indexes with similarity function set to cosine.
When enabling the rescoring option, those queries would fail as the rescoring calls `similarity_cosine` function underneath, causing an `InvalidRequest` exception as all-zero vectors were not allowed matching Cassandra's behaviour.

To eliminate the discrepancy we want the all-zero vector `similarity_cosine` calls to pass, but return the NaN as the cosine similarity for zero vectors is mathematically incorrect. We decided not to use arbitrary values contrary to USearch, for which the distance (not to be confused with similarity) is defined as cos(0, 0) = 0, cos(0, x) = 1 while supporting the range of values [0, 2].
If we wanted to convert that to similarity, that would mean sim_cos(0, x) = 0.5, which does not support mathematical reasoning why that would be more similar than for example vectors marking obtuse angles.
It's safe to assume that all-zero vectors for cosine similarity shouldn't make any impact, therefore we return NaN and eliminate them from best results.

Adjusted the tests accordingly to check both proper Cassandra and Scylla's behaviour.

Fixes: SCYLLADB-456

Backport to 2026.1 needed, as it fixes the bug for ANN vector queries using rescoring introduced there.

Closes scylladb/scylladb#28609

* github.com:scylladb/scylladb:
  test/vector_search: add reproducer for rescoring with zero vectors
  vector_search: return NaN for similarity_cosine with all-zero vectors
2026-02-23 13:10:44 +01:00
Aleksandra Martyniuk
9ccc95808f docs: update nodetool rebuild docs
Update nodetool rebuild docs to mention that the command does not
work for tablet keyspaces.

Fixes: https://github.com/scylladb/scylladb/issues/28270.
2026-02-23 12:45:01 +01:00
Aleksandra Martyniuk
e4c42acd8f docs: update a procedure of decommissioning a DC
Update a procedure of decommissioning a DC for tablet keyspaces.

Fixes: https://github.com/scylladb/scylladb/issues/28307.
2026-02-23 12:45:01 +01:00
Aleksandra Martyniuk
1c764cf6ea docs: update a procedure of adding a DC
Update a procedure of adding a DC for tablet keyspaces.

Fixes: https://github.com/scylladb/scylladb/issues/28306.
2026-02-23 12:45:01 +01:00
Aleksandra Martyniuk
e08ac60161 docs: describe upgrade to enforce_rack_list option 2026-02-23 12:44:57 +01:00
Aleksandra Martyniuk
eefe66b2b2 docs: describe conversion to rack-list RF
Fixes: SCYLLADB-398
2026-02-23 12:41:33 +01:00
Marcin Maliszkiewicz
54dca90e8c Merge 'test: move dtest/guardrails_test.py to test_guardrails.py' from Andrzej Jackowski
This patch series moves `test/cluster/dtest/guardrails_test.py`
to `test/cluster/test_guardrails.py`, and migrates it from `cluster/dtest/`
to `cluster/` framework.

There are two motivations for moving the test:
 - Execution time reduction (from 12s to 9s in 'dev' in my env)
 - Facilitate adding new tests to the `guardrails_test.py` file

No backport, `dtest/guardrails_test.py` is only on master

Closes scylladb/scylladb#28737

* github.com:scylladb/scylladb:
  test: move dtest/guardrails_test.py to test_guardrails.py
  test: prepare guardrails_test.py to be moved to test/cluster/
2026-02-23 12:34:43 +01:00
Marcin Maliszkiewicz
1293b94039 auth: cache: fix permissions iterator invalidation in reload_all_permissions
The inner loops in reload_all_permissions iterate role's permissions
and _anonymous_permissions maps across yield points. Concurrent
load_permissions calls (which don't hold _loading_sem) can emplace
into those same maps during a yield, potentially triggering a rehash
that invalidates the active iterator.

We want to avoid adding semaphore acquire in load_permissions
because it's on a common path (get_permissions).

Fixing by snapshotting the keys into a vector before iterating with
yields, so no long-lived map iterator is held across suspension
points.
2026-02-23 12:14:22 +01:00
Piotr Dulikowski
a4c389413c Merge 'Hardens MV shutdown behavior by fixing lifecycle tracking for detached view-builder callbacks' from Alex Dathskovsky
This series hardens MV shutdown behavior by fixing lifecycle tracking for detached view-builder callbacks and aligning update handling with the same async dispatch style used by create/drop.

Patch 1 refactors on_update_view to use a dedicated coroutine dispatcher (dispatch_update_view), keeping update logic serialized under the existing view-builder lock and consistent with the callback architecture already used for create/drop paths.

Patch 2 adds explicit callback lifetime coordination in view_builder:

  - introduce a seastar::gate member
  - acquire _ops_gate.hold() when launching detached create/update/drop dispatch futures
  - keep the hold alive until each detached future resolves
  - close the gate during view_builder::drain() so shutdown waits for in-flight callback work before final teardown

Together, these changes reduce shutdown race exposure in MV event handling while preserving existing behavior for normal operation.

Testing:
  - pytest --test-py-init test/cluster/mv (47 passed, 7 skipped)

backport: not required started happening in master

fixes: SCYLLADB-687

Closes scylladb/scylladb#28648

* github.com:scylladb/scylladb:
  db/view: gate detached view-builder callbacks during shutdown
  db:view: refactor on_update_view to use coroutine dispatcher
2026-02-23 11:28:37 +01:00
Marcin Maliszkiewicz
75d4bc26d3 auth/cache: acquire _loading_sem in cross-shard callbacks
distribute_role() modifies _roles on non-zero shards via
invoke_on_others() without holding _loading_sem. Similarly, load_all()'s
invoke_on_others() callback calls prune_all() without the semaphore.
When these run concurrently with reload_all_permissions(), which
iterates _roles across yield points, an insertion can trigger
absl::flat_hash_map::resize(), freeing the backing storage while
an iterator still references it.

Fix by acquiring _loading_sem on the target shard in both
distribute_role()'s and load_all()'s invoke_on_others callbacks,
serializing all _roles mutations with coroutines that iterate
the map.
2026-02-23 10:30:03 +01:00
Ernest Zaslavsky
321d4caf0c object_storage: add retryable machinery to object storage
remove hand rolled error handling from object storage client
and replace with common machinery that supports exception
handling and retrying when appropriate
2026-02-22 14:00:44 +02:00
Ernest Zaslavsky
24972da26d rest_client: add simple_send overload
add an overload to rest client `simple_send` to accept a retry_strategy for http's make_request
2026-02-22 14:00:44 +02:00
Patryk Jędrzejczak
e8efcae991 Merge 'Use standard ks/cf/data creation methods in object_store/test_basic.py test' from Pavel Emelyanov
The test uses create_ks_and_cf helper duplicating the existing code that does the same. This PR patches basic tests to use standard facilities. Also it prepares the ground for testing keyspace storage options with rf=3

Cleaning tests, not backporting

Closes scylladb/scylladb#28600

* https://github.com/scylladb/scylladb:
  test/object_store: Remove create_ks_and_cf() helper
  test/object_store: Replace create_ks_and_cf() usage with standard methods
  test/object_store: Shift indentation right for test cases
2026-02-20 15:53:38 +01:00
Nadav Har'El
d01915131a test/cqlpy: make test_indexing_paging_and_aggregation much faster
Currently, test_secondary_index.py::test_indexing_paging_and_aggregation
is very slow, and the slowest test in the test/cqlpy framework: It takes
around 13 seconds on dev build, and because it is CPU-bound (doesn't sleep),
it is much slower on debug builds. The reason for this slowness is that it
needs to set up and read over 10,000 rows which is the default
select_internal_page_size.

But after the patches in pull request (#25368), we can configure
select_internal_page_size, so in this patch we change the test to
temporarily reduce this option to just 50, and then the test can reach
the same code paths with just 142 rows instead of 20120 rows before this
patch.

As a result, the test should now be 140 times faster than it was before.
In practice, because of some fixed overheads (the test creates several
tables and indexes), in dev build mode the test run speedup is "only"
26-fold (to around half a second).

I verified that removing the code added in bb08af7 indeed makes the new
shorter test fail - and this is the only test in test_secondary_index.py
that starts to fail besides test_index_paging_group_by which is also
related (so my revert didn't just break secondary indexing completely).
So the shorter test is still a good regression test.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#28268
2026-02-20 15:44:53 +02:00
Avi Kivity
92bc5568c5 tools: toolchain: build sanitizers for future toolchain
The future toolchain did not build the sanitizers, so debug
executables did not link. Fix by not disabling the sanitizers.

Closes scylladb/scylladb#28733
2026-02-20 15:44:24 +02:00
Botond Dénes
6c04e02f66 Merge 'Fix restoration test's validation of streaming directions' from Pavel Emelyanov
The test_restore_with_streaming_scopes among other things checks how data streams flow while restoring. Whether or not to check the streams is decided based on the min tablet count value, which is compared with a hardcoded 512. This value of 512 matched the tablet count used by this test until it was "optimized" by #27839, where this number changed to 5 and streaming checks became off.

Good news is that the very same checks are still performed by test_refresh_with_streaming_scopes. But it's better to have a working restoration test anyway.

Minor test fix, not backporting

Closes scylladb/scylladb#28607

* github.com:scylladb/scylladb:
  test: Fix the condition for streaming directions validation
  test: Split test_backup.py::check_data_is_back() into two
2026-02-20 15:42:10 +02:00
Botond Dénes
6f88c0dbd3 Merge ' test_tablets_parallel_decommission: Fix flakiness due to delayed task appearance' from Tomasz Grabiec
Currently, the test assumes that when
'topology_coordinator_pause_before_processing_backlog: waiting' is
logged, the task for decommission must be there. This was based on the
assumption that topology coordinator is idle and decommission request
wakes it up. But if the server is slow enough, it may still be running
the load balancer in reaction to table creation, and block on that
injection point before decommission request was added.

Fix by waiting for the task to appear rather than the injection.

Fixes SCYLLADB-715

Only 2026.1 vulnerable.

Closes scylladb/scylladb#28688

* github.com:scylladb/scylladb:
  test_tablets_parallel_decommission: Fix flakiness due to delayed task appearance
  test: cluster: task_manager_client: Introduce wait_task_appears()
  tests: pylib: util: Add exponential backoff to wait_for
2026-02-20 15:05:36 +02:00
Pavel Emelyanov
c96420c015 tests: Re-use manager.get_server_exe()
There's a bunch of incremental repair tests that want to call scylla
sstable command. For that they try to find where scylla binary by
scanning /proc directory (see local_process_id and get_scylla_path
helpers).

There's shorter way -- just call manager.get_server_exe().

Same for backup-restore test.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28676
2026-02-20 14:59:30 +02:00
Pavel Emelyanov
a4a0d75eee test/object_store: Parametrize test_simple_backup_and_restore()
There are three tests and a function with a pair of boolean parameters
called by those. It's less code if the function becomes a test with
parameters.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28677
2026-02-20 14:57:30 +02:00
Pavel Emelyanov
a2e1293f86 test/object_store: Squash two simple-backup tests together
The test_backup_simple creates a ks/cf, takes a snapshot, backs it up,
then checks that the files were uploaded. The test_backup_move does the
same, but also plays with 'move_files' parameter to be true/false.

In fact, the "move" test was the copy of "simple" one that dropepd check
for scheduling group being "streaming" (backup with --move-files can
check the same, it's not bad), and check for destination bucket to
contain needed files (same here -- checking that files arrived to bucket
after --move-files is good).

In the end of the day, after the change backup test is run two times,
instead of three, and performs extra checks for --move-files case.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28606
2026-02-20 14:49:30 +02:00
Botond Dénes
7e90ed657c Merge 'Fix client_options docs' from Karol Baryła
https://github.com/scylladb/scylladb/pull/25746 added a new column to `system.clients`: `client_options frozen<map<text, text>>`. This column stores all options sent by the client in the `STARTUP` message.
This PR also added `CLIENT_OPTIONS` to the list of values sent in `SUPPORTED` message, and documented that drivers can send their configuration (as JSON) in `STARTUP` under this key.

Documentation for the new column was not added to the description of `system.clients` table, and documentation about the new `STARTUP` key was added in `protocol-extensions.md`, but in the section about shard awareness extension.

This PR adds missing `system.clients` column description, moves the documentation of `CLIENT_OPTIONS` into its own section, and expands it a bit.

Backport: none, because this fixes internal documentation.

Closes scylladb/scylladb#28126

* github.com:scylladb/scylladb:
  protocol-extensions.md: Fix client_options docs
  system_keyspace.md: Add client_options column
  system_keyspace.md: Fix order in system.clients
2026-02-20 14:23:34 +02:00
Pavel Emelyanov
525cb5b3eb table: Use fmt::to_string() to stringify compation group ID
Doing it with format("{}", foo) is correct, but to_string is
a bit more lightweight.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28630
2026-02-20 14:13:15 +02:00
Patryk Jędrzejczak
d399a197f5 Merge 'raft: Await instead of returning future in wait_for_state_change' from Dawid Mędrek
The `try-catch` expression is pretty much useless in its current form. If we return the future, the awaiting will only be performed by the caller, completely circumventing the exception handling.

As a result, instead of handling `raft::request_aborted` with a proper error message, the user will face `seastar::abort_requested_exception` whose message is cryptic at best. It doesn't even point to the root of the problem.

Fixes SCYLLADB-665

Backport: This is a small improvement and may help when debugging, so let's backport it to all supported versions.

Closes scylladb/scylladb#28624

* https://github.com/scylladb/scylladb:
  test: raft: Add test_aborting_wait_for_state_change
  raft: Describe exception types for wait_for_state_change and wait_for_leader
  raft: Await instead of returning future in wait_for_state_change
2026-02-20 12:17:22 +01:00
Andrzej Jackowski
eb5a564df2 test: move dtest/guardrails_test.py to test_guardrails.py
This commit moves `guardrails_test.py`, prepared in the previous
commit of this patch series, to `test/cluster/test_guardrails.py`.
It also cleans up `suite.yaml`.
2026-02-20 11:39:52 +01:00
Andrzej Jackowski
9df426d2ae test: prepare guardrails_test.py to be moved to test/cluster/
Disable `test/cluster/dtest/guardrails_test.py` in `suite.yaml` and
make it compatible with the `test/cluster/` framework. This will
allow moving this file from `test/cluster/dtest/` to `test/cluster/`
in the next commit of this patch series.

There are two motivations for moving the test:
 - Execution time reduction (from 12s to 9s in 'dev' in my env)
 - Facilitate adding new tests to the `guardrails_test.py` file
2026-02-20 11:39:43 +01:00
Raphael S. Carvalho
f33f324f77 mutation_compactor: Fix tombstone GC metrics to account for only expired
There are 3 metrics (that goes in every compaction_history entry):
total_tombstone_purge_attempt
total_tombstone_purge_failure_due_to_overlapping_with_memtable
total_tombstone_purge_failure_due_to_overlapping_with_uncompacting_sstable

When a tombstone is not expired (e.g. doesn't satisfy "gc_before" or
grace period), it can be currently accounted as failure due to
overlapping with either memtable or uncompacting sstable.
So those 2 last metrics have noise of *unexpired* tombstones.

What we should do is to only account for expired tombstones in all
those 3 metrics. We lose the info of knowing the amount of tombstones
processed by compaction, now we'll only know about the expired ones.
But those metrics were primarily added for explaining why expired
tombstones cannot be removed.
We could have alternatively added a new field
purge_failure_due_to_being_unexpired or something, but
it requires adding a new field to compaction_history.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-737.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#28669
2026-02-20 10:43:58 +02:00
Botond Dénes
0bf4c68af5 Merge 'docs: fix link to docker build readme in the README.MD' from Marcin Szopa
Links were pointing to the `debian` subdirectory. However, there docker build was refactored to use `redhat`: 1abf981a73, see https://github.com/scylladb/scylladb/pull/22910

No backport, just a README link fixes.

Closes scylladb/scylladb#28699

* github.com:scylladb/scylladb:
  docs: fix path to the build_docker.sh which was moved from debian to redhat subdirectory
  docs: fix link to docker build README.MD
2026-02-20 08:21:46 +02:00
Avi Kivity
66bef0ed36 lua, tools: adjust for lua 5.5 lua_newstate seed parameter
Lua 5.5 adds a seed parameter to lua_newstate(), provide it
with a strong random seed.

Closes scylladb/scylladb#28734
2026-02-20 06:52:37 +02:00
Avi Kivity
27a5502f14 Merge 'Reapply "main: test: add future and abort_source to after_init_func"' from Marcin Maliszkiewicz
The patchset fixes abort_source implementation for perf-alternator and perf-cql-raw. It moves
run_standalone function to common code in perf.hh with necessary templating.

We also add extensive testing so that it's more difficult to break the tooling in the future.

Fixes SCYLLADB-560
Backport: no, internal tooling improvement

Closes scylladb/scylladb#28541

* github.com:scylladb/scylladb:
  test: cluster: add tests for perf tools
  test: perf: fix port race condition on startup in connect workload
  test: perf: prepare benchmarks to bind to custom host
  test: perf: make perf-alterantor remote port configurable
  test: perf: fix ASAN leak warnings in perf-alternator
  Reapply "main: test: add future and abort_source to after_init_func"
2026-02-19 19:12:46 +02:00
Dawid Mędrek
c9d192c684 Merge 'raft ropology: prevent crashes of multiple nodes' from Patryk Jędrzejczak
Some assertions in the Raft-based topology are likely to cause crashes of
multiple nodes due to the consistent nature of the Raft-based code. If the
failing assertion is executed in the code run by each follower (e.g., the code
reloading the in-memory topology state machine), then all nodes can crash. If
the failing assertion is executed only by the leader (e.g., the topology
coordinator fiber), then multiple consecutive group0 leaders will chain-crash
until there is no group0 majority.

Crashing multiple nodes is much more severe than necessary. It's enough to
prevent the topology state machine from making more progress. This will
naturally happen after throwing a runtime error. The problematic fiber will be
killed or will keep failing in a loop. Note that it should be safe to block
the topology state machine, but not the whole group0, as the topology state
machine is mostly isolated from the rest of group0.

We replace some occurrences of `on_fatal_internal_error` and `SCYLLA_ASSERT`
with `on_internal_error`. These are not all occurrences, as some fatal
assertions make sense, for example, in the bootstrap procedure.

We also raise an internal error to prevent a segmentation fault in a few places.

Fixes #27987

Backporting this PR is not required, but we can consider it at least for 2026.1
because:
- it is LTS,
- the changes are low-risk,
- there shouldn't be many conflicts.

Closes scylladb/scylladb#28558

* github.com:scylladb/scylladb:
  raft topology: prevent accessing nullptr returned by topology::find
  raft topology: make some assertions non-crashing
2026-02-19 16:50:03 +01:00