Compare commits

...

292 Commits

Author SHA1 Message Date
copilot-swe-agent[bot]
5515d5fb7d test/alternator: fix delete_item_no_ts test, add LWT rejection tests for delete ops, simplify assertions, update docs
Co-authored-by: nyh <584227+nyh@users.noreply.github.com>
2026-03-05 17:31:41 +00:00
copilot-swe-agent[bot]
328c263aed alternator: add custom timestamp support to DeleteItem and BatchWriteItem DeleteRequest
Co-authored-by: nyh <584227+nyh@users.noreply.github.com>
2026-03-05 16:16:27 +00:00
copilot-swe-agent[bot]
a6d36a480d test/alternator: add explicit only_rmw_uses_lwt isolation to test_table_ts and test_table_ts_ss
Co-authored-by: nyh <584227+nyh@users.noreply.github.com>
2026-02-25 22:17:17 +00:00
copilot-swe-agent[bot]
85354ae26a test/alternator: restore test_table_ts_lwt with always isolation; update docs
- Restore test_table_ts_lwt fixture with system:write_isolation=always to
  explicitly test that the timestamp attribute is rejected in LWT_ALWAYS mode
- Add test_timestamp_attribute_lwt_always_rejected which verifies that even
  a plain PutItem with a timestamp is rejected when always_use_lwt is set
- Keep test_timestamp_attribute_with_condition_rejected using test_table_ts
  (with the test runner's default only_rmw_uses_lwt isolation) to test
  that a ConditionExpression triggers LWT rejection
- Update docs: fix item 4 (non-numeric now rejected), improve Limitations
  section to clearly state always_use_lwt is incompatible with the feature
  and recommend system:write_isolation=only_rmw_uses_lwt

Co-authored-by: nyh <584227+nyh@users.noreply.github.com>
2026-02-25 22:12:07 +00:00
copilot-swe-agent[bot]
786fa68faa test/alternator: address review comments on test_timestamp_attribute
- Use scope="module" instead of testpy_test_fixture_scope in fixtures
- Rename test_table_ts_sc to test_table_ts_ss (ss = string+string keys)
- Remove test_table_ts_lwt; use test_table_ts for LWT-rejection test
  (the test server runs with only_rmw_uses_lwt, so conditions trigger LWT)
- Add comment that fixtures make tests implicitly Scylla-only
- Change non-numeric timestamp attribute behavior: reject with
  ValidationException instead of silently storing (test + C++ implementation)
- Add test_timestamp_attribute_microseconds: verifies the timestamp unit
  is microseconds and tests interaction with default server timestamps
- Add import time for the new microseconds test

Co-authored-by: nyh <584227+nyh@users.noreply.github.com>
2026-02-25 21:58:34 +00:00
copilot-swe-agent[bot]
a57f781852 test: refactor test_timestamp_attribute to use shared fixtures
Co-authored-by: nyh <584227+nyh@users.noreply.github.com>
2026-02-25 21:38:52 +00:00
copilot-swe-agent[bot]
7f79b90e91 Fix timestamp_attribute: non-numeric handling, tag visibility, and clean up
Co-authored-by: nyh <584227+nyh@users.noreply.github.com>
2026-02-25 20:35:48 +00:00
copilot-swe-agent[bot]
175b8a8a5e Add system:timestamp_attribute feature for custom write timestamps in Alternator
Co-authored-by: nyh <584227+nyh@users.noreply.github.com>
2026-02-25 20:27:26 +00:00
copilot-swe-agent[bot]
c2ef8075ee Initial plan 2026-02-25 20:04:37 +00:00
Botond Dénes
9dff9752b4 Merge 'Fix regression in Alternator TTL with tablets and node going down' from Nadav Har'El
Recently we suffered a regression in how Alternator TTL behaves when a node goes down while tablets are used.

Usually, expiration of data in a particular tablet is handled by this tablet's "primary replica". However, if that node is down, we want another node to perform these expirations until the primary replica goes back online. We created a function `tablet_map::get_secondary_replica()` to select that "other node". We don't care too much what the "secondary replica" means, but we do care that it's different from the primary replica - if it's the same, the expiration of that tablet will never be done.

It turns out that recently, in commits 817fdad and d88036d, the implementation of get_primary_replica() changed without a corresponding change to get_secondary_replica(). After those changes, the two functions are mismatched, and sometimes return the same node for both primary and secondary replica.

Unfortunately, although we had a dtest for the handling of a dead node in Alternator TTL, it failed to reproduce this bug, so this regression was missed - nothing else besides Alternator TTL ever used the get_secondary_replica() function.

So in this series, in addition to fixing the bug, we add two tests that reproduce it (they fail before the fix and pass with it):

1. A unit test that checks that get_secondary_replica() always returns a different node from get_primary_replica()
2. A cluster test based on the original dtest, which does reproduce this bug in Alternator TTL where some of the data was never expired (but only failed in release build, for an unknown reason).

Fixes SCYLLADB-777.

Closes scylladb/scylladb#28771

* github.com:scylladb/scylladb:
  test: add unit test for tablet_map::get_secondary_replica()
  test, alternator: add test for TTL expiration with a node down
  locator: fix get_secondary_replica() to match get_primary_replica()
2026-02-25 10:13:55 +02:00
Andrei Chekun
1b92b140ee test.py: improve stdout output for boost test
The current way of checking boost's stdout has a race condition:
pytest may try to read the file before it has actually been flushed.
This PR eliminates that possibility.

Closes scylladb/scylladb#28783
2026-02-25 00:50:25 +01:00
Ferenc Szili
f70ca9a406 load_stats: implement the optimized sum of tablet sizes
PR #28703 was merged into master but not with the latest version of the
changes. This patch is an incremental fix for this.

Currently, the elements of the tablet_sizes_per_shard vector are
incremented in separate shards. This is prone to false sharing of cache
lines, and ping-pong of memory, which leads to reduced performance.

In this patch, in order to avoid cache line collisions while updating
the sum of tablet sizes per shard, we align the counter to 64 bytes.
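
For illustration, a minimal sketch of the padding approach described above, assuming a hypothetical per-shard counter type (not the actual load_stats code): each shard increments its own cache-line-aligned slot, so concurrent updates do not contend on the same cache line.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch: one cache-line-aligned counter per shard, so increments
// from different shards land on different cache lines (no false sharing).
struct alignas(64) per_shard_bytes {
    uint64_t value = 0;
};

struct tablet_size_sums {
    std::vector<per_shard_bytes> per_shard; // indexed by shard id

    explicit tablet_size_sums(unsigned shards) : per_shard(shards) {}

    void add(unsigned shard, uint64_t tablet_size) {
        per_shard[shard].value += tablet_size; // only this shard touches this slot
    }

    uint64_t total() const {
        uint64_t sum = 0;
        for (const auto& s : per_shard) {
            sum += s.value;
        }
        return sum;
    }
};
```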

Fixes: SCYLLADB-678

Closes scylladb/scylladb#28757
2026-02-24 22:19:25 +01:00
Alex
5557770b59 test_mv_build_during_shutdown started two async CREATE MATERIALIZED VIEW operations and never awaited them (asyncio.gather(...) without await).
This PR adds an await for each of the tasks, waiting for the MV schema
to be added successfully before starting the server shutdown.
With this change we no longer hit the shutdown races.

Closes scylladb/scylladb#28774
2026-02-24 17:25:05 +01:00
Anna Stuchlik
64b1798513 doc: remove redundant Java-related information
This commit removes:
- Instructions to install scylla-jmx (and all references)
- The Java 11 requirement for Ubuntu.

Fixes https://github.com/scylladb/scylladb/issues/28249
Fixes https://github.com/scylladb/scylladb/issues/28252

Closes scylladb/scylladb#28254
2026-02-24 14:37:39 +01:00
Anna Stuchlik
e2333a57ad doc: remove the tablets limitation for Alternator
This commit removes the information that Alternator doesn't support tablets.
The limitation is no longer valid.

Fixes SCYLLADB-778

Closes scylladb/scylladb#28781
2026-02-24 14:24:30 +02:00
Andrzej Jackowski
cd4caed3d3 test: fix configuration of test_autoretrain_dict
`test_autoretrain_dict` sporadically fails because the default
compression algorithm was changed after the test was written.

`9ffa62a986815709d0a09c705d2d0caf64776249` was an attempt to fix it by
changing the compression configuration during node startup. However,
the configuration change had an incorrect YAML format and was
ignored by ScyllaDB. This commit fixes it.

Fixes: scylladb/scylladb#28204

Closes scylladb/scylladb#28746
2026-02-24 12:08:44 +01:00
Botond Dénes
067bb5f888 test/scylla_gdb: skip coroutine tests if coroutine frame is not found
For a while, we have seen coroutine related tests (those that use the
coroutine_task fixture) fail occasionally, because no coroutine frame is
found. Multiple attempts were made to make this problem self-diagnosing
and dump enough information to be able to debug this post-mortem. To no
avail so far. A lot of time was invested into this benign issue:
See the long discussion at https://github.com/scylladb/scylladb/issues/22501.

It is not known if the bug is in gdb, or the gdb script trying to find
the coroutine frame. In any case, both are only used for debugging, so
we can tolerate occasional failures -- we are forced to do so when
working with gdb anyway.
Instead of piling on more effort there, just skip these tests when the
problem occurs. This solves the CI flakiness.

Fixes: #22501

Closes scylladb/scylladb#28745
2026-02-24 10:12:03 +01:00
Marcin Maliszkiewicz
d5684b98c8 test: cluster: add continue-after-error to perf tool tests
Add --continue-after-error true to perf-cql-raw and perf-alternator
tests, and --stop-on-error false to perf-simple-query test, so that
tests don't abort on the first error.

Reason for this is that tests are flaky with example failure:
Perf test failed: std::runtime_error (server returned ERROR to EXECUTE)

When CPU is starved on CI we can return timeouts and/or other errors.

The change should make the tests more robust at the expense of a smaller
test scope. But those tests were written mostly to test the startup
sequence, as it differs from Scylla's startup.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-759

Closes scylladb/scylladb#28767
2026-02-24 11:08:34 +02:00
Avi Kivity
0add130b9d lua: avoid undefined behavior when converting doubles to integers
Lua doesn't have separate integer and floating point numbers,
so we check if a number can fit in an integer and if so convert
it to an integer.

The conversion routine invokes undefined behavior (and even
acknowledges it!). More recent compilers changed their behavior
when casting infinities, breaking test_user_function_double_return
which tests this conversion.

Fix by tightening the conversion to not invoke undefined behavior.
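
As an illustration, a standalone sketch of a UB-free double-to-integer check (my own assumptions, not the actual Scylla/Lua glue code): the cast only happens for finite values provably inside the int64 range, and only when it round-trips exactly.

```cpp
#include <cmath>
#include <cstdint>
#include <optional>

// Hypothetical sketch: convert a Lua number to an integer only when the
// conversion is exact and well defined; otherwise keep it as a double.
std::optional<int64_t> lua_number_to_integer(double d) {
    // Reject NaN/infinity and out-of-range values before casting; casting
    // such values to an integer is undefined behavior.
    if (!std::isfinite(d)) {
        return std::nullopt;
    }
    constexpr double min = -9223372036854775808.0; // -2^63, exactly representable
    constexpr double max = 9223372036854775808.0;  //  2^63, exclusive upper bound
    if (d < min || d >= max) {
        return std::nullopt;
    }
    int64_t i = static_cast<int64_t>(d);
    // Only treat it as an integer if the round-trip is exact.
    if (static_cast<double>(i) != d) {
        return std::nullopt;
    }
    return i;
}
```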

Closes scylladb/scylladb#28503
2026-02-24 10:41:21 +02:00
Botond Dénes
1d5b8cc562 Merge 'Fix use after free in auth cache' from Marcin Maliszkiewicz
This patchset:
- ensures the loading semaphore is acquired in cross-shard callbacks
- fixes iterator invalidation problem when reloading all cached permissions

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-780
Backport: no, affected code not released yet

Closes scylladb/scylladb#28766

* github.com:scylladb/scylladb:
  auth: cache: fix permissions iterator invalidation in reload_all_permissions
  auth/cache: acquire _loading_sem in cross-shard callbacks
2026-02-24 10:35:46 +02:00
Pavel Emelyanov
5a5eb67144 vector_search/dns: Use newer seastar get_host_by_name API
hostent::addr_list is deprecated in favor of the address_entry::addr
field, which contains the very same addresses.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28565
2026-02-23 21:28:43 +02:00
Pavel Emelyanov
6b02b50e3d Merge 'object_storage: add retryable machinery to object storage' from Ernest Zaslavsky
- add an overload to the rest http client to accept retry strategy instance as an argument
- remove hand rolled error handling from object storage client and replace with common machinery that supports handling and retrying when appropriate

No backport needed since it is only refactoring

Closes scylladb/scylladb#28161

* github.com:scylladb/scylladb:
  object_storage: add retryable machinery to object storage
  rest_client: add `simple_send` overload
2026-02-23 21:28:51 +03:00
Nadav Har'El
e463d528fe test: add unit test for tablet_map::get_secondary_replica()
This patch adds a unit test for tablet_map::get_secondary_replica().
It was never officially defined how the "primary" and "secondary"
replicas were chosen, and their implementation changed over time,
but the one invariant that this test verifies is that the secondary
replica and the primary replica must be a different node.

This test reproduces issue SCYLLADB-777, where we discovered that
the get_primary_replica() changed without a corresponding change to
get_secondary_replica(). So before the previous patch, this test failed,
and after the previous patch - it passes.
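
A hedged sketch of the invariant this unit test verifies; `tablet_ids()` and the `.host` field are assumed accessors used for illustration, not necessarily the real API.

```cpp
#include <cassert>

// Illustrative only: for every tablet, the secondary replica chosen for
// Alternator TTL work must live on a different host than the primary replica.
template <typename TabletMap>
void check_secondary_differs_from_primary(const TabletMap& tmap) {
    for (auto tid : tmap.tablet_ids()) {
        auto primary = tmap.get_primary_replica(tid);
        auto secondary = tmap.get_secondary_replica(tid);
        // If both point at the same node, that tablet's expiration work would
        // never be picked up while the primary is down.
        assert(primary.host != secondary.host);
    }
}
```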

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-23 16:19:43 +02:00
Nadav Har'El
0c7f499750 test, alternator: add test for TTL expiration with a node down
We have many single-node functional tests for Alternator TTL in
test/alternator/test_ttl.py. This patch adds a multi-node test in
test/cluster/test_alternator.py. The new test verifies that:

 1. Even though Alternator TTL splits the work of scanning and expiring
    items between nodes, all the items get correctly expired.
 2. When one node is down, all the items still expire because the
    "secondary" owner of each token range takes over expiring the
    items in this range while the "primary" owner is down.

This new test is actually a port of a test we already had in dtest
(alternator_ttl_tests.py::test_multinode_expiration). This port is
faster and smaller then the original (fewer nodes, fewer rows), but it
still found a regression (SCYLLADB-777) that dtest missed - the new test
failed when running with tablets and in release build mode.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-23 16:19:43 +02:00
Nadav Har'El
9ab3d5b946 locator: fix get_secondary_replica() to match get_primary_replica()
The function tablet_map::get_secondary_replica() is used by Alternator
TTL to choose a node different from get_primary_replica(). Unfortunately,
recently (commits 817fdad and d88037d) the implementation of the latter
function changed, without changing the former. So this patch changes
the former to match.

The next two patches will have two tests that fail before this patch,
and pass with it:

1. A unit test that checks that get_secondary_replica() returns a
   different node than get_primary_replica().

2. An Alternator TTL test that checks that when a node is down,
   expirations still happen because the secondary replica takes over
   the primary replica's work.

Fixes SCYLLADB-777

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-23 16:19:30 +02:00
Botond Dénes
dcd8de86ee Merge 'docs: update a documentation of adding/removing DC and rebuilding a node' from Aleksandra Martyniuk
Describe a procedure to convert tablet keyspace replication factor
to rack list. Update the procedures of adding and removing a node
to consider tablet keyspaces.

Fixes: [SCYLLADB-398](https://scylladb.atlassian.net/browse/SCYLLADB-398)
Fixes: https://github.com/scylladb/scylladb/issues/28306.
Fixes: https://github.com/scylladb/scylladb/issues/28307.
Fixes: https://github.com/scylladb/scylladb/issues/28270.

Needs backport to all live branches as they all include tablets.

[SCYLLADB-398]: https://scylladb.atlassian.net/browse/SCYLLADB-398?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#28521

* github.com:scylladb/scylladb:
  docs: update nodetool rebuild docs
  docs: update a procedure of decommissioning a DC
  docs: update a procedure of adding a DC
  docs: describe upgrade to enforce_rack_list option
  docs: describe conversion to rack-list RF
2026-02-23 16:15:16 +02:00
Andrei Chekun
6ae58c6fa6 test.py: move storage tests to cluster subdirectory
Move the storage test suite from test/storage/ to test/cluster/storage/
to consolidate related cluster-based tests. This removes the standalone
test/storage/suite.yaml, as the tests will use the cluster's test configuration.
Initially these tests were in cluster, but they were moved outside in the
first iteration to use unshare. Now that they handle volumes another way,
without unshare, they belong in cluster.

Closes scylladb/scylladb#28634
2026-02-23 16:14:15 +02:00
Marcin Maliszkiewicz
c5dc086baf Merge 'vector_search: return NaN for similarity_cosine with all-zero vectors' from Dawid Pawlik
The ANN vector queries with all-zero vectors are allowed even on vector indexes with similarity function set to cosine.
When the rescoring option is enabled, those queries would fail because the rescoring calls the `similarity_cosine` function underneath, causing an `InvalidRequest` exception, as all-zero vectors were not allowed (matching Cassandra's behaviour).

To eliminate the discrepancy we want the all-zero vector `similarity_cosine` calls to pass but return NaN, as the cosine similarity for zero vectors is mathematically undefined. We decided not to use arbitrary values, unlike USearch, for which the distance (not to be confused with similarity) is defined as cos(0, 0) = 0 and cos(0, x) = 1, while supporting the range of values [0, 2].
If we converted that to similarity, it would mean sim_cos(0, x) = 0.5, and there is no mathematical reasoning for why that would be more similar than, for example, vectors at obtuse angles.
It's safe to assume that all-zero vectors shouldn't make any impact on cosine similarity, therefore we return NaN and eliminate them from the best results.
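
Roughly the semantics described above, as a standalone sketch (not the actual ScyllaDB implementation): when either vector has zero magnitude, return NaN instead of an arbitrary value or an error.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Standalone sketch: cosine similarity is undefined for an all-zero vector,
// so return NaN rather than an arbitrary value (vectors assumed equal length).
float similarity_cosine(const std::vector<float>& a, const std::vector<float>& b) {
    float dot = 0, norm_a = 0, norm_b = 0;
    for (size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        norm_a += a[i] * a[i];
        norm_b += b[i] * b[i];
    }
    if (norm_a == 0 || norm_b == 0) {
        return std::numeric_limits<float>::quiet_NaN();
    }
    return dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
}
```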

Adjusted the tests accordingly to check both proper Cassandra and Scylla's behaviour.

Fixes: SCYLLADB-456

Backport to 2026.1 needed, as it fixes the bug for ANN vector queries using rescoring introduced there.

Closes scylladb/scylladb#28609

* github.com:scylladb/scylladb:
  test/vector_search: add reproducer for rescoring with zero vectors
  vector_search: return NaN for similarity_cosine with all-zero vectors
2026-02-23 13:10:44 +01:00
Aleksandra Martyniuk
9ccc95808f docs: update nodetool rebuild docs
Update nodetool rebuild docs to mention that the command does not
work for tablet keyspaces.

Fixes: https://github.com/scylladb/scylladb/issues/28270.
2026-02-23 12:45:01 +01:00
Aleksandra Martyniuk
e4c42acd8f docs: update a procedure of decommissioning a DC
Update a procedure of decommissioning a DC for tablet keyspaces.

Fixes: https://github.com/scylladb/scylladb/issues/28307.
2026-02-23 12:45:01 +01:00
Aleksandra Martyniuk
1c764cf6ea docs: update a procedure of adding a DC
Update a procedure of adding a DC for tablet keyspaces.

Fixes: https://github.com/scylladb/scylladb/issues/28306.
2026-02-23 12:45:01 +01:00
Aleksandra Martyniuk
e08ac60161 docs: describe upgrade to enforce_rack_list option 2026-02-23 12:44:57 +01:00
Aleksandra Martyniuk
eefe66b2b2 docs: describe conversion to rack-list RF
Fixes: SCYLLADB-398
2026-02-23 12:41:33 +01:00
Marcin Maliszkiewicz
54dca90e8c Merge 'test: move dtest/guardrails_test.py to test_guardrails.py' from Andrzej Jackowski
This patch series moves `test/cluster/dtest/guardrails_test.py`
to `test/cluster/test_guardrails.py`, and migrates it from `cluster/dtest/`
to `cluster/` framework.

There are two motivations for moving the test:
 - Execution time reduction (from 12s to 9s in 'dev' in my env)
 - Facilitate adding new tests to the `guardrails_test.py` file

No backport, `dtest/guardrails_test.py` is only on master

Closes scylladb/scylladb#28737

* github.com:scylladb/scylladb:
  test: move dtest/guardrails_test.py to test_guardrails.py
  test: prepare guardrails_test.py to be moved to test/cluster/
2026-02-23 12:34:43 +01:00
Marcin Maliszkiewicz
1293b94039 auth: cache: fix permissions iterator invalidation in reload_all_permissions
The inner loops in reload_all_permissions iterate each role's permissions
map and the _anonymous_permissions map across yield points. Concurrent
load_permissions calls (which don't hold _loading_sem) can emplace
into those same maps during a yield, potentially triggering a rehash
that invalidates the active iterator.

We want to avoid adding semaphore acquire in load_permissions
because it's on a common path (get_permissions).

Fix by snapshotting the keys into a vector before iterating with
yields, so no long-lived map iterator is held across suspension
points.
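
A hedged sketch of the "snapshot the keys" pattern described above (names and the loader are hypothetical, not the real auth cache code): no map iterator survives a co_await, and every entry is looked up again after resuming.

```cpp
#include <vector>
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>

// Hypothetical sketch: copy the keys first, then re-find each entry after the
// suspension point, so a concurrent emplace (and rehash) cannot invalidate
// anything we still hold.
template <typename Map, typename Loader>
seastar::future<> reload_all(Map& permissions, Loader load_one) {
    std::vector<typename Map::key_type> keys;
    keys.reserve(permissions.size());
    for (const auto& [key, value] : permissions) {
        keys.push_back(key);                    // plain copies, no yields here
    }
    for (const auto& key : keys) {
        auto reloaded = co_await load_one(key); // suspension point
        auto it = permissions.find(key);        // re-lookup after resuming
        if (it != permissions.end()) {          // entry may have been pruned meanwhile
            it->second = std::move(reloaded);
        }
    }
}
```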
2026-02-23 12:14:22 +01:00
Piotr Dulikowski
a4c389413c Merge 'Hardens MV shutdown behavior by fixing lifecycle tracking for detached view-builder callbacks' from Alex Dathskovsky
This series hardens MV shutdown behavior by fixing lifecycle tracking for detached view-builder callbacks and aligning update handling with the same async dispatch style used by create/drop.

Patch 1 refactors on_update_view to use a dedicated coroutine dispatcher (dispatch_update_view), keeping update logic serialized under the existing view-builder lock and consistent with the callback architecture already used for create/drop paths.

Patch 2 adds explicit callback lifetime coordination in view_builder:

  - introduce a seastar::gate member
  - acquire _ops_gate.hold() when launching detached create/update/drop dispatch futures
  - keep the hold alive until each detached future resolves
  - close the gate during view_builder::drain() so shutdown waits for in-flight callback work before final teardown

Together, these changes reduce shutdown race exposure in MV event handling while preserving existing behavior for normal operation.
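
A hedged sketch of the gate pattern the second patch describes (member and function names are illustrative, not the real view_builder): each detached dispatch keeps a gate holder alive until its future resolves, and drain() closes the gate so shutdown waits for in-flight callbacks.

```cpp
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <seastar/core/gate.hh>

// Illustrative only; the real view_builder members and callbacks differ.
class view_builder_sketch {
    seastar::gate _ops_gate;
public:
    void on_create_view() {
        // Launch the detached dispatch while holding the gate, so drain()
        // waits for it before final teardown.
        auto holder = _ops_gate.hold();
        (void)dispatch_create_view().finally([holder = std::move(holder)] {});
    }

    seastar::future<> drain() {
        co_await _ops_gate.close(); // waits for all in-flight detached callbacks
    }

private:
    seastar::future<> dispatch_create_view();
};
```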

Testing:
  - pytest --test-py-init test/cluster/mv (47 passed, 7 skipped)

backport: not required, the issue started happening in master

fixes: SCYLLADB-687

Closes scylladb/scylladb#28648

* github.com:scylladb/scylladb:
  db/view: gate detached view-builder callbacks during shutdown
  db:view: refactor on_update_view to use coroutine dispatcher
2026-02-23 11:28:37 +01:00
Marcin Maliszkiewicz
75d4bc26d3 auth/cache: acquire _loading_sem in cross-shard callbacks
distribute_role() modifies _roles on non-zero shards via
invoke_on_others() without holding _loading_sem. Similarly, load_all()'s
invoke_on_others() callback calls prune_all() without the semaphore.
When these run concurrently with reload_all_permissions(), which
iterates _roles across yield points, an insertion can trigger
absl::flat_hash_map::resize(), freeing the backing storage while
an iterator still references it.

Fix by acquiring _loading_sem on the target shard in both
distribute_role()'s and load_all()'s invoke_on_others callbacks,
serializing all _roles mutations with coroutines that iterate
the map.
2026-02-23 10:30:03 +01:00
Ernest Zaslavsky
321d4caf0c object_storage: add retryable machinery to object storage
remove hand rolled error handling from object storage client
and replace with common machinery that supports exception
handling and retrying when appropriate
2026-02-22 14:00:44 +02:00
Ernest Zaslavsky
24972da26d rest_client: add simple_send overload
add an overload to rest client `simple_send` to accept a retry_strategy for http's make_request
2026-02-22 14:00:44 +02:00
Patryk Jędrzejczak
e8efcae991 Merge 'Use standard ks/cf/data creation methods in object_store/test_basic.py test' from Pavel Emelyanov
The test uses the create_ks_and_cf helper, duplicating existing code that does the same. This PR patches the basic tests to use standard facilities. It also prepares the ground for testing keyspace storage options with rf=3.

Cleaning tests, not backporting

Closes scylladb/scylladb#28600

* https://github.com/scylladb/scylladb:
  test/object_store: Remove create_ks_and_cf() helper
  test/object_store: Replace create_ks_and_cf() usage with standard methods
  test/object_store: Shift indentation right for test cases
2026-02-20 15:53:38 +01:00
Nadav Har'El
d01915131a test/cqlpy: make test_indexing_paging_and_aggregation much faster
Currently, test_secondary_index.py::test_indexing_paging_and_aggregation
is very slow - the slowest test in the test/cqlpy framework: it takes
around 13 seconds on a dev build, and because it is CPU-bound (it doesn't sleep),
it is much slower on debug builds. The reason for this slowness is that it
needs to set up and read over 10,000 rows, which is the default
select_internal_page_size.

But after the patches in pull request (#25368), we can configure
select_internal_page_size, so in this patch we change the test to
temporarily reduce this option to just 50, and then the test can reach
the same code paths with just 142 rows instead of 20120 rows before this
patch.

As a result, the test should now be 140 times faster than it was before.
In practice, because of some fixed overheads (the test creates several
tables and indexes), in dev build mode the test run speedup is "only"
26-fold (to around half a second).

I verified that removing the code added in bb08af7 indeed makes the new
shorter test fail - and this is the only test in test_secondary_index.py
that starts to fail besides test_index_paging_group_by which is also
related (so my revert didn't just break secondary indexing completely).
So the shorter test is still a good regression test.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#28268
2026-02-20 15:44:53 +02:00
Avi Kivity
92bc5568c5 tools: toolchain: build sanitizers for future toolchain
The future toolchain did not build the sanitizers, so debug
executables did not link. Fix by not disabling the sanitizers.

Closes scylladb/scylladb#28733
2026-02-20 15:44:24 +02:00
Botond Dénes
6c04e02f66 Merge 'Fix restoration test's validation of streaming directions' from Pavel Emelyanov
The test_restore_with_streaming_scopes test, among other things, checks how data streams flow while restoring. Whether or not to check the streams is decided based on the min tablet count value, which is compared with a hardcoded 512. This value of 512 matched the tablet count used by this test until it was "optimized" by #27839, where this number changed to 5 and the streaming checks were effectively turned off.

Good news is that the very same checks are still performed by test_refresh_with_streaming_scopes. But it's better to have a working restoration test anyway.

Minor test fix, not backporting

Closes scylladb/scylladb#28607

* github.com:scylladb/scylladb:
  test: Fix the condition for streaming directions validation
  test: Split test_backup.py::check_data_is_back() into two
2026-02-20 15:42:10 +02:00
Botond Dénes
6f88c0dbd3 Merge ' test_tablets_parallel_decommission: Fix flakiness due to delayed task appearance' from Tomasz Grabiec
Currently, the test assumes that when
'topology_coordinator_pause_before_processing_backlog: waiting' is
logged, the task for decommission must be there. This was based on the
assumption that the topology coordinator is idle and the decommission request
wakes it up. But if the server is slow enough, it may still be running
the load balancer in reaction to table creation, and block on that
injection point before the decommission request was added.

Fix by waiting for the task to appear rather than the injection.

Fixes SCYLLADB-715

Only 2026.1 vulnerable.

Closes scylladb/scylladb#28688

* github.com:scylladb/scylladb:
  test_tablets_parallel_decommission: Fix flakiness due to delayed task appearance
  test: cluster: task_manager_client: Introduce wait_task_appears()
  tests: pylib: util: Add exponential backoff to wait_for
2026-02-20 15:05:36 +02:00
Pavel Emelyanov
c96420c015 tests: Re-use manager.get_server_exe()
There's a bunch of incremental repair tests that want to call the scylla
sstable command. For that they try to find where the scylla binary is by
scanning the /proc directory (see the local_process_id and get_scylla_path
helpers).

There's a shorter way -- just call manager.get_server_exe().

Same for backup-restore test.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28676
2026-02-20 14:59:30 +02:00
Pavel Emelyanov
a4a0d75eee test/object_store: Parametrize test_simple_backup_and_restore()
There are three tests and a function with a pair of boolean parameters
that they call. It's less code if the function itself becomes a test with
parameters.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28677
2026-02-20 14:57:30 +02:00
Pavel Emelyanov
a2e1293f86 test/object_store: Squash two simple-backup tests together
The test_backup_simple creates a ks/cf, takes a snapshot, backs it up,
then checks that the files were uploaded. The test_backup_move does the
same, but also exercises the 'move_files' parameter with both true and false.

In fact, the "move" test was the copy of "simple" one that dropepd check
for scheduling group being "streaming" (backup with --move-files can
check the same, it's not bad), and check for destination bucket to
contain needed files (same here -- checking that files arrived to bucket
after --move-files is good).

At the end of the day, after the change the backup test is run two times
instead of three, and performs extra checks for the --move-files case.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28606
2026-02-20 14:49:30 +02:00
Botond Dénes
7e90ed657c Merge 'Fix client_options docs' from Karol Baryła
https://github.com/scylladb/scylladb/pull/25746 added a new column to `system.clients`: `client_options frozen<map<text, text>>`. This column stores all options sent by the client in the `STARTUP` message.
This PR also added `CLIENT_OPTIONS` to the list of values sent in `SUPPORTED` message, and documented that drivers can send their configuration (as JSON) in `STARTUP` under this key.

Documentation for the new column was not added to the description of `system.clients` table, and documentation about the new `STARTUP` key was added in `protocol-extensions.md`, but in the section about shard awareness extension.

This PR adds missing `system.clients` column description, moves the documentation of `CLIENT_OPTIONS` into its own section, and expands it a bit.

Backport: none, because this fixes internal documentation.

Closes scylladb/scylladb#28126

* github.com:scylladb/scylladb:
  protocol-extensions.md: Fix client_options docs
  system_keyspace.md: Add client_options column
  system_keyspace.md: Fix order in system.clients
2026-02-20 14:23:34 +02:00
Pavel Emelyanov
525cb5b3eb table: Use fmt::to_string() to stringify compaction group ID
Doing it with format("{}", foo) is correct, but to_string is
a bit more lightweight.
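
For reference, the two equivalent spellings; fmt::to_string() skips the format-string processing (the variable here is just a stand-in for the real compaction group ID type).

```cpp
#include <fmt/format.h>
#include <string>

int main() {
    int compaction_group_id = 42;                            // stand-in for the real ID type
    std::string a = fmt::format("{}", compaction_group_id);  // goes through the format string
    std::string b = fmt::to_string(compaction_group_id);     // same result, a bit lighter
    return a == b ? 0 : 1;
}
```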

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28630
2026-02-20 14:13:15 +02:00
Patryk Jędrzejczak
d399a197f5 Merge 'raft: Await instead of returning future in wait_for_state_change' from Dawid Mędrek
The `try-catch` expression is pretty much useless in its current form. If we return the future, the awaiting will only be performed by the caller, completely circumventing the exception handling.

As a result, instead of handling `raft::request_aborted` with a proper error message, the user will face `seastar::abort_requested_exception` whose message is cryptic at best. It doesn't even point to the root of the problem.

Fixes SCYLLADB-665

Backport: This is a small improvement and may help when debugging, so let's backport it to all supported versions.

Closes scylladb/scylladb#28624

* https://github.com/scylladb/scylladb:
  test: raft: Add test_aborting_wait_for_state_change
  raft: Describe exception types for wait_for_state_change and wait_for_leader
  raft: Await instead of returning future in wait_for_state_change
2026-02-20 12:17:22 +01:00
Andrzej Jackowski
eb5a564df2 test: move dtest/guardrails_test.py to test_guardrails.py
This commit moves `guardrails_test.py`, prepared in the previous
commit of this patch series, to `test/cluster/test_guardrails.py`.
It also cleans up `suite.yaml`.
2026-02-20 11:39:52 +01:00
Andrzej Jackowski
9df426d2ae test: prepare guardrails_test.py to be moved to test/cluster/
Disable `test/cluster/dtest/guardrails_test.py` in `suite.yaml` and
make it compatible with the `test/cluster/` framework. This will
allow moving this file from `test/cluster/dtest/` to `test/cluster/`
in the next commit of this patch series.

There are two motivations for moving the test:
 - Execution time reduction (from 12s to 9s in 'dev' in my env)
 - Facilitate adding new tests to the `guardrails_test.py` file
2026-02-20 11:39:43 +01:00
Raphael S. Carvalho
f33f324f77 mutation_compactor: Fix tombstone GC metrics to account for only expired
There are 3 metrics (that go into every compaction_history entry):
total_tombstone_purge_attempt
total_tombstone_purge_failure_due_to_overlapping_with_memtable
total_tombstone_purge_failure_due_to_overlapping_with_uncompacting_sstable

When a tombstone is not expired (e.g. doesn't satisfy "gc_before" or the
grace period), it can currently be accounted as a failure due to
overlapping with either a memtable or an uncompacting sstable.
So those last 2 metrics contain noise from *unexpired* tombstones.

What we should do is only account for expired tombstones in all
those 3 metrics. We lose the information about the total number of
tombstones processed by compaction; now we'll only know about the expired
ones. But those metrics were primarily added to explain why expired
tombstones cannot be removed.
We could have alternatively added a new field
purge_failure_due_to_being_unexpired or something, but
it requires adding a new field to compaction_history.
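
A hedged sketch of the accounting change (struct and field names are illustrative, not the real mutation_compactor code): unexpired tombstones are skipped up front, so the three counters only describe tombstones that were actually eligible for purging.

```cpp
#include <cstdint>

// Illustrative types only.
struct gc_stats {
    uint64_t total_tombstone_purge_attempt = 0;
    uint64_t total_tombstone_purge_failure_due_to_overlapping_with_memtable = 0;
    uint64_t total_tombstone_purge_failure_due_to_overlapping_with_uncompacting_sstable = 0;
};

struct tombstone_info {
    bool expired;                   // satisfies gc_before / grace period
    bool overlaps_memtable;
    bool overlaps_uncompacting_sstable;
};

void account_tombstone_gc(gc_stats& s, const tombstone_info& t) {
    if (!t.expired) {
        return; // unexpired: no longer counted as an attempt or a failure
    }
    s.total_tombstone_purge_attempt++;
    if (t.overlaps_memtable) {
        s.total_tombstone_purge_failure_due_to_overlapping_with_memtable++;
    } else if (t.overlaps_uncompacting_sstable) {
        s.total_tombstone_purge_failure_due_to_overlapping_with_uncompacting_sstable++;
    }
}
```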

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-737.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#28669
2026-02-20 10:43:58 +02:00
Botond Dénes
0bf4c68af5 Merge 'docs: fix link to docker build readme in the README.MD' from Marcin Szopa
Links were pointing to the `debian` subdirectory. However, the docker build was refactored to use `redhat`: 1abf981a73, see https://github.com/scylladb/scylladb/pull/22910

No backport, just a README link fixes.

Closes scylladb/scylladb#28699

* github.com:scylladb/scylladb:
  docs: fix path to the build_docker.sh which was moved from debian to redhat subdirectory
  docs: fix link to docker build README.MD
2026-02-20 08:21:46 +02:00
Avi Kivity
66bef0ed36 lua, tools: adjust for lua 5.5 lua_newstate seed parameter
Lua 5.5 adds a seed parameter to lua_newstate(), provide it
with a strong random seed.

Closes scylladb/scylladb#28734
2026-02-20 06:52:37 +02:00
Avi Kivity
27a5502f14 Merge 'Reapply "main: test: add future and abort_source to after_init_func"' from Marcin Maliszkiewicz
The patchset fixes the abort_source implementation for perf-alternator and perf-cql-raw. It moves
the run_standalone function to common code in perf.hh with the necessary templating.

We also add extensive testing so that it's more difficult to break the tooling in the future.

Fixes SCYLLADB-560
Backport: no, internal tooling improvement

Closes scylladb/scylladb#28541

* github.com:scylladb/scylladb:
  test: cluster: add tests for perf tools
  test: perf: fix port race condition on startup in connect workload
  test: perf: prepare benchmarks to bind to custom host
  test: perf: make perf-alternator remote port configurable
  test: perf: fix ASAN leak warnings in perf-alternator
  Reapply "main: test: add future and abort_source to after_init_func"
2026-02-19 19:12:46 +02:00
Dawid Mędrek
c9d192c684 Merge 'raft topology: prevent crashes of multiple nodes' from Patryk Jędrzejczak
Some assertions in the Raft-based topology are likely to cause crashes of
multiple nodes due to the consistent nature of the Raft-based code. If the
failing assertion is executed in the code run by each follower (e.g., the code
reloading the in-memory topology state machine), then all nodes can crash. If
the failing assertion is executed only by the leader (e.g., the topology
coordinator fiber), then multiple consecutive group0 leaders will chain-crash
until there is no group0 majority.

Crashing multiple nodes is much more severe than necessary. It's enough to
prevent the topology state machine from making more progress. This will
naturally happen after throwing a runtime error. The problematic fiber will be
killed or will keep failing in a loop. Note that it should be safe to block
the topology state machine, but not the whole group0, as the topology state
machine is mostly isolated from the rest of group0.

We replace some occurrences of `on_fatal_internal_error` and `SCYLLA_ASSERT`
with `on_internal_error`. These are not all occurrences, as some fatal
assertions make sense, for example, in the bootstrap procedure.

We also raise an internal error to prevent a segmentation fault in a few places.

Fixes #27987

Backporting this PR is not required, but we can consider it at least for 2026.1
because:
- it is LTS,
- the changes are low-risk,
- there shouldn't be many conflicts.

Closes scylladb/scylladb#28558

* github.com:scylladb/scylladb:
  raft topology: prevent accessing nullptr returned by topology::find
  raft topology: make some assertions non-crashing
2026-02-19 16:50:03 +01:00
Marcin Maliszkiewicz
22c3d8d609 Merge 'db/config: enable table audit by default' from Piotr Smaron
In https://github.com/scylladb/scylladb/pull/27262 table audit has been
re-enabled by default in `scylla.yaml`, logging certain categories to a table,
which should make new Scylla deployments have audit enabled.
Now, in the next release, we also want to enable audit in `db/config.cc`,
which should enable audit for all deployments that don't explicitly configure
audit otherwise in `scylla.yaml` (or via the command line).
BTW. Because this commit aligns audit's default config values in `db/config.cc`
to those of `scylla.yaml`, `docs/reference/configuration-parameters.rst`, which
is based on `db/config.cc`, will start showing that table audit is the default.

Refs: https://github.com/scylladb/scylladb/issues/28355
Refs: https://scylladb.atlassian.net/browse/SCYLLADB-222

No backport: table audit has been enabled in 2026.1 in `scylla.yaml`,
and should be always on starting from the next release,
which is the release we're currently merging to (2026.2).

Closes scylladb/scylladb#28376

* github.com:scylladb/scylladb:
  docs: decommission: note audit ks may require ALTERing
  docs: mention table audit enabled by default
  audit: disable DDL by default
  db/config: enable table audit by default
  test/cluster: fix `test_table_desc_read_barrier` assertion
  test/cluster: adjust audit in tests involving decommissioning its ks
  audit_test: fix incorrect config in `test_audit_type_none`
2026-02-19 16:30:11 +01:00
Pavel Emelyanov
b4b9b547ce replica: Remove unused sched groups from keyspace and table configs
Compaction and statement groups are carried over on those configs, but
are in fact unused. Drop both.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28540
2026-02-19 15:47:31 +01:00
Patryk Jędrzejczak
45115415fb Merge 'Parametrize and merge several restoration test cases' from Pavel Emelyanov
There are four tests that check how restore with the primary-replica-only option works in various scopes and topologies. The cases that check same-racks and same-datacenters are very similar, as are those that check different-racks and different-datacenters. Parametrizing and merging them saves lots of code (+30 lines, -116 lines).

It's probably worth merging the resulting same-domain and different-domain tests too, because they are still largely similar, but the result becomes too if-y, so it's not done here. Maybe later.

Improving tests, not backporting

Closes scylladb/scylladb#28569

* https://github.com/scylladb/scylladb:
  test: Merge test_restore_primary_replica_different_... tests
  test: Merge test_restore_primary_replica_same_... tests
  test: Don't specify expected_replicas in test_restore_primary_replica_different_dc_scope_all
  test: Remove local r_servers variable from test_restore_primary_replica_different_dc_scope_all
2026-02-19 15:42:55 +01:00
Pavel Emelyanov
26372e65df Merge 's3_perf: Fix the s3 perf test' from Ernest Zaslavsky
Fix the build of the test and the upload operation flow

No need to backport since it is only a test we barely use

Closes scylladb/scylladb#28595

* github.com:scylladb/scylladb:
  s3_perf: fix upload operation flow
  s3_perf: fix the CMake build
2026-02-19 15:31:43 +02:00
Avi Kivity
7ec710c250 Merge 'tablets: Reduce per-shard migration concurrency to 2' from Tomasz Grabiec
Tablet migration keeps sstable snapshot during streaming, which may
cause temporary increase in disk utilization if compaction is running
concurrently. SSTables compacted away are kept on disk until streaming
is done with them. The more tablets we allow to migrate concurrently,
the higher disk space can rise. When the target tablet size is
configured correcly, every tablet should own about 1% of disk
space. So concurrency of 4 shouldn't put us at risk. But target tablet
size is not chosen dynamically yet, and it may not be aligned with
disk capacity.

Also, tablet sizes can temporarily grow above the target, up to 2x
before the split starts, and some more because splits take a while to
complete.

To reduce the impact from this, reduce concurrency of
migration. Concurrency of 2 should still be enough to saturate
resources on the leaving shard.

Also, reducing concurrency means that load balancing is more
responsive to preemption. There will be less bandwidth sharing, so
scheduled migrations complete faster. This is important for scale-out,
where we bootstrap a node and want to start migrations to that new
node as soon as possible.

Refs scylladb/siren#15317

Closes scylladb/scylladb#28563

* github.com:scylladb/scylladb:
  tablets, config: Reduce migration concurrency to 2
  tablets: load_balancer: Always accept migration if the load is 0
  config, tablets: Make tablet migration concurrency configurable
2026-02-19 15:31:43 +02:00
Dawid Mędrek
fae71f79c2 test: raft: Add test_aborting_wait_for_state_change 2026-02-19 14:21:01 +01:00
Dawid Mędrek
e4f2b62019 raft: Describe exception types for wait_for_state_change and wait_for_leader
The methods of `raft::server` are abortable and if the passed
`abort_source` is triggered, they throw `raft::request_aborted`.
We document that.

Although `raft::server` is an interface, this is consistent with
the descriptions of its other methods.
2026-02-19 12:47:14 +01:00
Dawid Mędrek
c36623baad raft: Await instead of returning future in wait_for_state_change
The `try-catch` expression is pretty much useless in its current form.
If we return the future, the awaiting will only be performed by the
caller, completely circumventing the exception handling.

As a result, instead of handling `raft::request_aborted` with a proper
error message, the user will face `seastar::abort_requested_exception`
whose message is cryptic at best. It doesn't even point to the root
of the problem.
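
A hedged standalone sketch of the difference (do_wait() and the translated error are placeholders for the real raft::server internals):

```cpp
#include <stdexcept>
#include <seastar/core/abort_source.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>

seastar::future<> do_wait(); // placeholder for the underlying abortable wait

// Broken: the future is returned, so any exception only materializes when the
// caller awaits it -- this try/catch never runs.
seastar::future<> wait_for_state_change_broken() {
    try {
        return do_wait();
    } catch (const seastar::abort_requested_exception&) {
        return seastar::make_exception_future<>(std::runtime_error("request aborted"));
    }
}

// Fixed: co_await inside the try block, so the handler actually runs and can
// rethrow a descriptive error (a stand-in for raft::request_aborted).
seastar::future<> wait_for_state_change_fixed() {
    try {
        co_await do_wait();
    } catch (const seastar::abort_requested_exception&) {
        throw std::runtime_error("wait_for_state_change aborted");
    }
}
```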

Fixes SCYLLADB-665
2026-02-19 12:47:14 +01:00
Marcin Maliszkiewicz
de4e5e10af test: perf: fix prepared statements logic in perf-simple-query
Due to the lack of checks that are present in process_execute_internal in
transport/server.cc, the needs_authorization bool was always set to true,
doing some extra work (check_access()) for each request.

In this patch we mirror that logic in the test env which perf-simple-query
uses. This can also potentially improve the runtime of unit tests (marginally).

Note that the bug is only in the perf tool, not Scylla itself; the fix
decreases insns/op by around 10%:
Before: 41065 insns/op
After: 37452 insns/op
Command: ./build/release/scylla perf-simple-query --duration 5 --smp 1

Fixes https://github.com/scylladb/scylladb/issues/27941

Closes scylladb/scylladb#28704
2026-02-19 12:42:07 +02:00
Avi Kivity
58a662b9db dist: refresh container base image (ubi9-minimal)
Using an outdated image can cause problems when `microdnf update`
runs, if the distribution doesn't maintain good update hygiene.
Although, I suspect that when update failures happen they're really
caused by propagation delay of packages to mirrors.

Fix by using --pull=always to get a fresh image.

Ref https://scylladb.atlassian.net/browse/SCYLLADB-714

Closes scylladb/scylladb#28680
2026-02-19 12:42:43 +03:00
Ferenc Szili
f1bc17bd4c load_stats: fix race condition when computing sum_tablet_sizes
In storage_service::load_stats_for_tablet_based_tables(), we are passing
a reference to sum_tablet_sizes to the lambda which increments this value
on each shard via map_reduce0(). This means we could have a race
condition because this is executed on separate threads/CPUs.

This patch fixes the problem by collecting the per-shard sums into a
vector, then summing them up.
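
A hedged sketch of the safe aggregation pattern, with an illustrative `some_service` stand-in rather than the real storage_service code: each shard computes its own sum locally and the reduce step adds them up, instead of every shard writing through a shared reference.

```cpp
#include <cstdint>
#include <functional>
#include <seastar/core/future.hh>
#include <seastar/core/sharded.hh>

// Illustrative stand-in for the sharded service holding per-shard tablet stats.
struct some_service {
    uint64_t local_tablet_bytes() const { return 0; }
};

// Each shard returns its own value; the reduction happens on the calling shard,
// so there is no concurrent write to a shared variable.
seastar::future<uint64_t> sum_tablet_sizes(seastar::sharded<some_service>& svc) {
    return svc.map_reduce0(
        [] (some_service& local) { return local.local_tablet_bytes(); },
        uint64_t(0),
        std::plus<uint64_t>());
}
```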

Refs: SCYLLADB-678

Closes scylladb/scylladb#28703
2026-02-19 12:29:25 +03:00
Avi Kivity
dee868b71a interval: avoid clang 23 warning on throw statement in potentially noexcept function
interval_data's move constructor is conditionally noexcept. It
contains a throw statement for the case where the underlying type's
move constructor can throw; that throw statement is never executed
if we're in the noexcept branch. Clang 23 however doesn't understand
that, and warns about throwing in a noexcept function.

Fix that by rewriting the logic using seastar::defer(). In the
noexcept case, the optimizer should eliminate it as dead code.

Closes scylladb/scylladb#28710
2026-02-19 12:24:20 +03:00
Ernest Zaslavsky
45d824e0fe s3_perf: fix upload operation flow
Correct the upload operation logic. The previous flow incorrectly
checked for the test file on S3 even when performing operations that do
not download the file, such as uploads.
2026-02-19 11:14:59 +02:00
Botond Dénes
b637e17b19 db/config: don't use RBNO for scaling
Remove bootstrap and decommission from allowed_repair_based_node_ops.
Using RBNO over streaming for these operations has no benefits, as they
are not exposed to the out-of-date replica problem that replace,
removenode and rebuild are.
On top of that, RBNO is known to have problems with empty user tables.
Using streaming for bootstrap and decommission is safe and faster
than RBNO in all conditions, especially when the table is small.

One test needs adjustment as it relies on RBNO being used for all node
ops.

Fixes: SCYLLADB-105

Closes scylladb/scylladb#28080
2026-02-19 09:51:09 +01:00
Calle Wilund
8e71a6f52a gcp: Add handling of 429 (too many requests) to exponential backoff
Fixes: SCYLLADB-611

Adds http error code 429 to codes handled by exponential backoff.

Closes scylladb/scylladb#28588
2026-02-19 09:42:39 +01:00
Marcin Maliszkiewicz
3417d50add test: cluster: add tests for perf tools
It checks that all workloads can be properly
executed, with successful startup and teardown.

Testing alternator in remote mode is especially important
because it's invoked like this during PGO training in pgo.py.

Test runtime:
Release - 24s
Debug - 1m 15s

Test time consists mostly of Scylla startup in various modes.
2026-02-19 09:33:10 +01:00
Marcin Maliszkiewicz
c69534504c test: perf: fix port race condition on startup in connect workload
Other workloads call prepopulate() at startup, which connects
with a retry loop and therefore waits until the CQL port is open.

This commit adds a single place where we wait for the port
for all workloads.

The timeout is set to 5 minutes so that even the slowest machines
are able to start.
2026-02-19 09:33:10 +01:00
Marcin Maliszkiewicz
828f2fbdb1 test: perf: prepare benchmarks to bind to custom host
This is useful for tests where we use local networks
like 127.5.5.5 to avoid port and host collisions.
2026-02-19 09:33:10 +01:00
Marcin Maliszkiewicz
9f2b97bef4 test: perf: make perf-alternator remote port configurable
It could be a useful option to have.
2026-02-19 09:33:10 +01:00
Marcin Maliszkiewicz
f5a212e91e test: perf: fix ASAN leak warnings in perf-alternator
Those were intentional, as the test process is short-lived,
but when we add automated tests in the following commits
we expect a clean exit with a 0 exit code.
2026-02-19 09:33:10 +01:00
Marcin Maliszkiewicz
0c76c73e34 Reapply "main: test: add future and abort_source to after_init_func"
This reverts commit ceec703bb7.

The commit was fixed with abort source handling for alternator
standalone path so it's safe to reapply.
2026-02-19 09:33:10 +01:00
Piotr Dulikowski
7d6f734a51 dictionary compression: add missing co_awaits on get_units
There is a handful of places in the code related to dictionary
compression which call get_units to acquire semaphore units, but the
returned future is not awaited, seemingly by mistake. The result of
get_units is assigned to a variable - which is reasonable at a glance
because the semaphore units need to be assigned to a variable in order
to control their scope - but at the same time if co_await is mistakenly
omitted, like here, doing so will silence the nodiscard check of
seastar::future and, effectively, the get_units call will be nearly
useless. Unfortunately, this is an easy mistake to make.

Fix the places in the code that acquire semaphore units via get_units
but never await the future returned by it. I found them by manual code
inspection, so I hope that I didn't miss any.
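
The mistake and the fix, reduced to a hedged standalone sketch using seastar's get_units():

```cpp
#include <seastar/core/coroutine.hh>
#include <seastar/core/semaphore.hh>

seastar::future<> critical_section();

// BUG: without co_await, `units` holds an unresolved future, not the units;
// the code below runs without ever waiting for the semaphore.
seastar::future<> with_units_broken(seastar::semaphore& sem) {
    auto units = seastar::get_units(sem, 1);
    co_await critical_section();
}

// Fixed: awaiting yields the semaphore_units object, whose scope now controls
// when the unit is actually held and released.
seastar::future<> with_units_fixed(seastar::semaphore& sem) {
    auto units = co_await seastar::get_units(sem, 1);
    co_await critical_section();
}
```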

Closes scylladb/scylladb#28581
2026-02-18 16:40:40 +01:00
Ernest Zaslavsky
4026b54a5e s3_perf: fix the CMake build
Fix the CMake build of the perf_s3_client by adding the
necessary linkage with the jsoncpp library.
2026-02-18 17:12:08 +02:00
Piotr Smaron
797c5cd401 docs: decommission: note audit ks may require ALTERing
With the audit feature enabled, it's not immediately obvious that its
pseudo-system keyspace `audit` may require adjusting its RF across DCs
before decommissioning a node, and this should be documented.
2026-02-18 15:14:57 +01:00
Piotr Smaron
65eec6d8e7 docs: mention table audit enabled by default
Also align the documentation with the current audit settings.
2026-02-18 15:14:57 +01:00
Piotr Smaron
c30607d80b audit: disable DDL by default
The DDL audit category doesn't make sense when it's enabled by default on its
own, as no DDL statements are going to be audited if the
audit_keyspaces/audit_tables setting is empty. This may be counter-intuitive
to our users, who may expect to actually see these statements logged if we're
enabling this by default. Also, it doesn't make sense to enable a setting by
default if it has no effect.
Additionally, list all possible audit categories for the user's
convenience.
2026-02-18 15:14:57 +01:00
Piotr Smaron
08dc1008ba db/config: enable table audit by default
In https://github.com/scylladb/scylladb/pull/27262 table audit has been
re-enabled by default in `scylla.yaml`, logging certain categories to a table,
which should make new Scylla deployments have audit enabled.
Now, in the next release, we also want to enable audit in `db/config.cc`,
which should enable audit for all deployments that don't explicitly configure
audit otherwise in `scylla.yaml` (or via the command line).
BTW. Because this commit aligns audit's default config values in `db/config.cc`
to those of `scylla.yaml`, `docs/reference/configuration-parameters.rst`, which
is based on `db/config.cc`, will start showing that table audit is the default.

Refs: https://github.com/scylladb/scylladb/issues/28355
Refs: https://scylladb.atlassian.net/browse/SCYLLADB-222
2026-02-18 15:14:57 +01:00
Piotr Smaron
95ee4a562c test/cluster: fix test_table_desc_read_barrier assertion
The test `assertion desc_schema[0] == desc_schema[1]` does a direct
list comparison, which is order-sensitive. Before enabling audit by default,
both nodes would return only the test keyspace/table, so the order
didn't matter. With audit enabled, there will be multiple keyspaces,
and they can be returned in a different order by different nodes.
2026-02-18 15:14:57 +01:00
Piotr Smaron
2e12b83366 test/cluster: adjust audit in tests involving decommissioning its ks
When table audit is enabled, Scylla creates the "audit" ks with
NetworkTopologyStrategy and RF=3. During node decommission, streaming can fail
for the audit ks with "zero replica after the removal" when all nodes from a DC
are removed, and so we have to ALTER the audit ks to either zero the number of its
replicas, to allow for a clean decommission, or keep them in the 2nd DC.

BTW. https://github.com/scylladb/scylladb/issues/27395 is the same change, but
in dtests repository.
2026-02-18 15:14:55 +01:00
Piotr Smaron
0cf20fa15a audit_test: fix incorrect config in test_audit_type_none
Passing Python `None` to setup is incorrect, because config updates are sent
as a dict and `None` is treated as "unset" - meaning: use Scylla's default.
Use the explicit string "none" to guarantee that audit is disabled.
2026-02-18 15:12:26 +01:00
Asias He
1be80c9e86 repair: Skip auto repair for tables using RF one
There is no point running repair for tables using RF one. Row level
repair will skip it, but the auto repair scheduler will keep scheduling
such repairs since repair_time can never be updated.

Skip such repairs at the scheduler level for auto repair.

If the request is issued by the user, we still have to schedule such a
repair; otherwise the user request will never finish.
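
A hedged sketch of the scheduler-level check (names are illustrative):

```cpp
// Illustrative: the auto-repair scheduler skips RF=1 tables, but a
// user-initiated repair is still scheduled so the request can complete.
bool should_schedule_repair(bool user_initiated, unsigned replication_factor) {
    if (replication_factor == 1 && !user_initiated) {
        return false; // nothing to sync against; repair_time would never advance
    }
    return true;
}
```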

Fixes SCYLLADB-561

Closes scylladb/scylladb#28640
2026-02-18 14:32:50 +02:00
Andrzej Jackowski
4221d9bbfd docs: improve examples in Handling Audit Failures section
This commit introduces four changes:
 - In the `table` example, singular forms (node, partition) are changed to
   plural forms (nodes, partitions). Currently, the default `table`
   audit configuration is RF=3 and writes use CL=ONE. Therefore,
   a `table` audit log write failure should not be caused by a single
   node unavailability, and plural forms are more adequate.
 - In the `table` example, unreachability due to network issues is
   mentioned because with RF=3, audit failure due to network problems
   is more likely to happen than a simultaneous failure of three
   nodes (such network failures happened in SCYLLADB-706).
 - In the `syslog` example, a slash `/` is changed to `or`, so `table`
   and `syslog` examples have similar structure.
 - As the `syslog` line is already being changed, I also change `unix`
   to `Unix`, as the capitalized form is the correct one.

Refs SCYLLADB-706

Closes scylladb/scylladb#28702
2026-02-18 13:10:01 +01:00
Botond Dénes
3bfd47da4b Merge 'transport: fix connection code to consume only initially taken semaphore units' from Marcin Maliszkiewicz
The connection's `cpu_concurrency_t` struct tracks the state of a connection
to manage the admission of new requests and prevent CPU overload during
connection storms. When a connection holds units (allowed only 0 or 1), it is
considered to be in the "CPU state" and contributes to the concurrency limits
used when accepting new connections.

The bug stems from the fact that `counted_data_source_impl::get` and
`counted_data_sink_impl::put` calls can interleave during execution. This
occurs because of `should_parallelize` and `_ready_to_respond`, the latter being
a future chain that can run in the background while requests are being read.
Consequently, while reading request (N), the system may concurrently be
writing the response for request (N-1) on the same connection.

This interleaving allows `return_all()` to be called twice before the
subsequent `consume_units()` is invoked. While the second `return_all()` call
correctly returns 0 units, the matching `consume_units()` call would
mistakenly take an extra unit from the semaphore. Over time, a connection
blocked on a read operation could end up holding an unreturned semaphore
unit. If this pattern repeats across multiple connections, the semaphore
units are eventually depleted, preventing the server from accepting any
new connections.

The fix ensures that we always consume the exact number of units that were
previously returned. With this change, interleaved operations behave as
follows:

get() return_all     — returns 1 unit
put() return_all     — returns 0 units
get() consume_units  — takes back 1 unit
put() consume_units  — takes back 0 units

Logically, the networking phase ends when the first network operation
concludes. But more importantly, when a network operation
starts, we no longer hold any units.
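
A hedged sketch of the invariant the fix enforces (the class is illustrative, not the real cpu_concurrency_t): remember how many units were actually handed back, and take back exactly that many, so interleaved get()/put() calls can never consume an extra unit.

```cpp
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <seastar/core/semaphore.hh>

// Illustrative only; the real cpu_concurrency_t differs.
class cpu_concurrency_sketch {
    seastar::semaphore& _sem;
    size_t _held = 1;   // a connection holds either 0 or 1 units
public:
    explicit cpu_concurrency_sketch(seastar::semaphore& sem) : _sem(sem) {}

    // When a network operation starts: give the units back and remember how
    // many were actually returned (legitimately 0 if already returned).
    size_t return_all() {
        size_t returned = _held;
        _sem.signal(_held);
        _held = 0;
        return returned;
    }

    // When the network operation ends: take back exactly what was returned by
    // the matching return_all(), never an extra unit on behalf of an
    // interleaved call on the same connection.
    seastar::future<> consume_units(size_t returned) {
        co_await _sem.wait(returned);
        _held += returned;
    }
};
```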

Other solutions are possible but the chosen one seems to be the
simplest and safest to backport.

Fixes SCYLLADB-485
Backport: all supported affected versions, bug introduced with initial feature implementation in: ed3e4f33fd

Closes scylladb/scylladb#28530

* github.com:scylladb/scylladb:
  test: auth_cluster: add test for hanged AUTHENTICATING connections
  transport: fix connection code to consume only initially taken semaphore units
2026-02-18 13:48:49 +02:00
Marcin Szopa
9217f85e99 docs: fix path to the build_docker.sh which was moved from debian to redhat subdirectory 2026-02-18 12:19:27 +01:00
Marcin Szopa
e66bf4a6f5 docs: fix link to docker build README.MD
Link was pointing to the old location of the README. It was moved in 1abf981a73.
2026-02-18 12:12:46 +01:00
Piotr Dulikowski
b9db3c9c75 Merge 'Add consistent permissions cache' from Marcin Maliszkiewicz
This patchset replaces the permissions cache based on loading_cache with a new unified (permissions and roles), full, coherent auth cache.

The reason for the change is that we want to improve scenarios under stress and simplify operation manuals. The new cache doesn't require any tweaking, and it behaves particularly better in scenarios with lots of schema entities (e.g. tables) combined with unprepared queries. The old cache can generate a few thousand extra internal tps due to cache refresh.

A benchmark of unprepared statements (just to populate the cache) with 1000 tables shows a 3k tps reduction in internal reads and a 9.1% reduction in median instructions per op. So many tables were used to show the resource impact; the cache could be filled with other resource types to show the same improvement.

Backport: no, it's a new feature.
Fixes https://github.com/scylladb/scylladb/issues/7397
Fixes https://github.com/scylladb/scylladb/issues/3693
Fixes https://github.com/scylladb/scylladb/issues/2589
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-147

Closes scylladb/scylladb#28078

* github.com:scylladb/scylladb:
  test: boost: add auth cache tests
  auth: add cache size metrics
  docs: conf: update permissions cache documentation
  auth: remove old permissions cache
  auth: use unified cache for permissions
  auth: ldap: add permissions reload to unified cache
  auth: add permissions cache to auth/cache
  auth: add service::revoke_all as main entry point
  auth: explicitly life-extend resource in auth_migration_listener
2026-02-18 12:03:20 +01:00
Tomasz Grabiec
af0b5d0894 Merge 'tablets global barrier: acknowledge barrier_and_drain from all nodes' from Petr Gusev
Before this series, the `global_barrier` used during tablet migration did not guarantee that `barrier_and_drain` was acknowledged by tablet replicas. As a result, if a request coordinator was fenced out, stale requests from previous topology versions could still execute on replicas in parallel with new requests from incompatible topology versions. For example, stale requests from `tablet_transition_stage::streaming` could run concurrently with new requests from `tablet_transition_stage::use_new`. This caused several issues, including [#26864](https://github.com/scylladb/scylladb/issues/26864) and [#26375](https://github.com/scylladb/scylladb/issues/26375).

This PR fixes the problem in two steps:
* Replicas now hold an erm strong pointer while handling RPCs from coordinators.
* The tablet barrier is updated to require `barrier_and_drain` acknowledgments from all nodes.

A description of alternative solutions and various tradeoffs can be found in [this document](https://docs.google.com/document/d/1tpDtPOsrGaZGBYkdwOKApQv4eMzrBydMM1GaYYmaPgg/edit?pli=1&tab=t.0#heading=h.vidfy0hrz5j7).

[A previous attempt at these changes](https://github.com/scylladb/scylladb/pull/27185).

Fixes [scylladb/scylladb#26864](https://github.com/scylladb/scylladb/issues/26864)
Fixes [scylladb/scylladb#26375](https://github.com/scylladb/scylladb/issues/26375)

backport: needs backport to 2025.4 (fixes #26864 for tablets LWT)

Closes scylladb/scylladb#27492

* github.com:scylladb/scylladb:
  tests: extract get_topology_version helper
  global tablets barrier: require all nodes to ack barrier_and_drain
  topology_coordinator: pass raft_topology_cmd by value
  storage_proxy: hold erms in replica handlers
  token_metadata: improve stale versions diagnostics
2026-02-18 11:45:56 +01:00
Pavel Emelyanov
0c443d5764 gms: Use newer seastar get_host_by_name API
The hostent::addr_list is deprecated in favor of address_entry::addr
field that contains the very same addresses.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28566
2026-02-18 12:24:35 +02:00
Pavel Emelyanov
5b740afe9a database: Remove streaming sched group getter
All users of it had been updated to get the streaming group elsewhere,
so this getter is no longer needed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28527
2026-02-18 12:23:35 +02:00
Avi Kivity
c5a1f44731 tools: toolchain: switch from ccache to sccache
sccache combines the functions of ccache and distcc, and
promises to support C++20 modules in the future. Switch
to sccache in anticipation of modules support.

The documentation is adjusted since cache will be
persistent for sccache without further work.

Closes scylladb/scylladb#28524
2026-02-18 12:23:12 +02:00
Botond Dénes
36167a155e Merge 'Remove map_to_key_value() helpers from API' from Pavel Emelyanov
There are some places that get a `map<foo, bar>` and return it to the caller as `"key": string(foo), "value": string(bar)` json. For that there's the `map_to_key_value()` helper in api.hh that re-formats the map into a vector of json elements and returns it, letting seastar json-ize that vector.

Recently a stream_range_as_array() helper appeared in seastar; it helps stream any range without converting it into an intermediate collection. Some of the hottest users of `map_to_key_value()` had already been converted; this PR converts the few remaining ones and removes the helper in question to encourage further usage of stream_range_as_array().

Code cleanup, not backporting

Closes scylladb/scylladb#28491

* github.com:scylladb/scylladb:
  api: Remove map_to_key_value() helpers
  api: Streamify view_build_statuses handler
  api: Streamify few more storage_service/ handlers
  api: Add map_to_json() helper
  api: Coroutinize view_build_statuses handler
2026-02-18 12:22:00 +02:00
Ernest Zaslavsky
196f7cad93 nodetool: fix handling of "--primary-replica-only" argument
The "--primary-replica-only" ("-pro") flag was previously ignored by
the `restore` operation. This patch ensures the argument is parsed and
applied correctly.

Closes scylladb/scylladb#28490
2026-02-18 12:21:27 +02:00
Pavel Emelyanov
bce43c6b20 api: Remove unused (lost) local variable
Lost when the get_range_to_endpoint_map handler was implemented for real (48c3c94aa6)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28489
2026-02-18 12:20:30 +02:00
Alex
c44ad31d44 db/view: gate detached view-builder callbacks during shutdown
Detached migration callbacks (on_create_view, on_update_view, on_drop_view)
  can race with view_builder::drain() teardown.

  Add a lifetime gate to view_builder and wire callback launches through
  _ops_gate.hold() so each detached dispatch future is tracked until it
  completes (finally keeps the hold alive). During shutdown, drain()
  now waits for all tracked callback work with _ops_gate.close().

  This ensures drain does not proceed past callback lifetime while shutdown is in
  progress, and ignores only gate_closed_exception at callback entry as the
  expected shutdown path.
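For illustration, a minimal sketch of this gating pattern using assumed class and method names rather than the real view_builder code: callbacks hold the gate, and drain() closes it and waits.

```
#include <seastar/core/gate.hh>
#include <seastar/core/future.hh>
#include <utility>

class view_builder_like {
    seastar::gate _ops_gate;

    seastar::future<> dispatch_create_view() {           // placeholder for the real work
        return seastar::make_ready_future<>();
    }

public:
    // Launch a detached callback, keeping the gate held until it finishes.
    void on_create_view() {
        try {
            auto holder = _ops_gate.hold();               // throws if the gate is already closed
            (void)dispatch_create_view().finally([h = std::move(holder)] {});
        } catch (const seastar::gate_closed_exception&) {
            // Shutdown already in progress: dropping the callback is the
            // expected shutdown path.
        }
    }

    // During shutdown: wait for every tracked callback to complete.
    seastar::future<> drain() {
        return _ops_gate.close();
    }
};
```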
2026-02-18 11:56:41 +02:00
Pavel Emelyanov
b01adf643c Merge 'init: fix infinite loop on npos wrap with updated Seastar' from Emil Maskovsky
Fixes parsing of comma-separated seed lists in "init.cc" and "cql_test_env.cc" to use the standard `split_comma_separated_list` utility, avoiding manual `npos` arithmetic. The previous code relied on `npos` being `uint32_t(-1)`: incremented in 64-bit (`uint64_t`) arithmetic it does not wrap, so the loop exited as expected. With Seastar's upcoming change to make `npos` `size_t(-1)`, the increment wraps around to zero and causes an infinite loop.

Switch to the standardized `split_comma_separated_list` tokenization utility that is also used in other places in the code. Empty tokens are handled as before. This prevents startup hangs and test failures when Seastar is updated.

The other commit also removes the unnecessary creation of temporary `gms::inet_address()` objects when calling `std::set<gms::inet_address>::emplace()`.

Refs: https://github.com/scylladb/seastar/pull/3236

No backport: the problem will only appear in master after Seastar is upgraded. The old code works with Seastar versions before https://github.com/scylladb/seastar/pull/3236 (although only by accident, because of the different integer bit sizes).
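For illustration, a minimal standalone sketch of the hazard (not the actual init.cc code): the naive `end + 1` step only terminates by accident when `npos` is `uint32_t(-1)`, and needs an explicit `npos` check once `npos` becomes `size_t(-1)`.

```
#include <iostream>
#include <string>
#include <vector>

// Naive comma splitting. The old code had no explicit npos check: with
// npos == uint32_t(-1), `end + 1` evaluated in 64-bit arithmetic is a huge
// but non-zero value, so the loop happened to terminate. With
// npos == size_t(-1), `end + 1` wraps to 0 and the loop restarts forever.
std::vector<std::string> split_on_commas(const std::string& s) {
    std::vector<std::string> out;
    size_t start = 0;
    while (start <= s.size()) {
        size_t end = s.find(',', start);          // npos when no more commas
        out.push_back(s.substr(start, end - start));
        if (end == std::string::npos) {
            break;                                // explicit exit, no reliance on wrap-around
        }
        start = end + 1;
    }
    return out;
}

int main() {
    for (const auto& token : split_on_commas("seed1,seed2,,seed3")) {
        std::cout << '[' << token << "]\n";
    }
}
```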

Closes scylladb/scylladb#28573

* github.com:scylladb/scylladb:
  init: fix infinite loop on npos wrap with updated Seastar
  init: remove unnecessary object creation in emplace calls
2026-02-18 11:46:26 +03:00
Aleksandra Martyniuk
100ccd61f8 tasks: increase tasks_vt_get_children timeout
test_node_ops_tasks.py::test_get_children fails due to timeout of
tasks_vt_get_children injection in debug mode. Compared to a successful
run, no clear root cause stands out.

Extend the message timeout of tasks_vt_get_children from 10s to 60s.

Fixes: #28295.

Closes scylladb/scylladb#28599
2026-02-18 11:39:19 +03:00
Dani Tweig
aac0f57836 .github/workflows: add SMI to milestone sync Jira project keys
What changed
Updated .github/workflows/call_sync_milestone_to_jira.yml to include SMI in jira_project_keys

Why (Requirements Summary)
Adding SMI to create releases in the SMI Jira project based on new milestones from scylladb.git.
This will create a new release in the SMI Jira project when a milestone is added to scylladb.git.

Fixes: PM-190

Closes scylladb/scylladb#28585
2026-02-18 09:35:37 +02:00
Nadav Har'El
a1475dbeb9 test/cqlpy: make test testMapWithLargePartition faster
Right now the slowest test in the test/cqlpy directory is

   cassandra_tests/validation/entities/collections_test.py::
      testMapWithLargePartition

This test (translated from Cassandra's unit test), just wants to verify
that we can write and flush a partition with a single large map - with
200 items totalling around 2MB in size.

200 items totalling 2MB is large, but not huge, and is not the reason
why this test was so slow (around 9 seconds). It turns out that most
of the test time was spent in Python code, preparing a 2MB random string
in the slowest possible way. But there is no need for this string to be
random at all - we only care about the large size of the value, not the
specific characters in it!

Making the characters written in this text constant instead of random
made it 20 times faster - it now takes less than half a second.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#28271
2026-02-18 10:12:16 +03:00
Raphael S. Carvalho
5b550e94a6 streaming: Release space incrementally during file streaming
File streaming only releases the file descriptors of a tablet being
streamed at the very end of streaming. This means that if the streamed
tablet had a compaction on its largest tier finish after streaming
started, there will always be ~2x space amplification for that
single tablet. Since up to 4 tablets can be migrated away at a time,
this can add up to a significant amount, given that nodes are pushed
to substantial usage of the available space (~90%).

We want to optimize this by dropping the reference to an sstable after
it has been fully streamed. This way, we reduce the chances of hitting
2x space amplification for a given tablet.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#28505
2026-02-18 10:10:40 +03:00
Avi Kivity
f3cbd76d93 build: install cassandra-stress RPM with no signature check
Fedora 45 tightened the default installation checks [1]. As a result
the cassandra-stress rpm we provide no longer installs.

Install it with --no-gpgchecks as a workaround. It's our own package
so we trust it. Later we'll sign it properly.

We install its dependencies via the normal methods so they're still
checked.

[1] https://fedoraproject.org/wiki/Changes/Enforcing_signature_checking_by_default

Closes scylladb/scylladb#28687
2026-02-18 10:08:13 +03:00
Pavel Emelyanov
89d8ae5cb6 Merge 'http: prepare http clients retry machinery refactoring' from Ernest Zaslavsky
Today the S3 client has a well-established and (hopefully) well-tested HTTP request retry strategy; in the rest of the clients it looks like we are trying to achieve the same thing by writing the same code over and over again, and of course missing corner cases that have already been addressed in the S3 client.
This PR extracts code that can assist other clients in detecting the retryability of an error originating from the HTTP client, reuses the built-in seastar HTTP client retryability, and minimizes the boilerplate of HTTP client exception handling.

No backport needed since it is only refactoring of the existing code

Closes scylladb/scylladb#28250

* github.com:scylladb/scylladb:
  exceptions: add helper to build a chain of error handlers
  http: extract error classification code
  aws_error: extract `retryable` from aws_error
2026-02-18 10:06:37 +03:00
Pavel Emelyanov
2f10fd93be Merge 's3_client: Fix s3 part size and number of parts calculation' from Ernest Zaslavsky
- Correct the `calc_part_size` function, since it could return more than 10k parts (see the sketch after this list)
- Add tests
- Add more checks in `calc_part_size` to comply with S3 limits
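For illustration, a rough sketch of part-size selection under the usual S3 limits; the constants and function name here are assumptions, not the actual s3_client code. S3 allows at most 10,000 parts per upload, with a 5 MiB minimum and 5 GiB maximum part size.

```
#include <algorithm>
#include <cstdint>
#include <stdexcept>

constexpr uint64_t MiB = uint64_t(1) << 20;
constexpr uint64_t GiB = uint64_t(1) << 30;
constexpr uint64_t s3_max_parts = 10'000;
constexpr uint64_t s3_min_part_size = 5 * MiB;
constexpr uint64_t s3_max_part_size = 5 * GiB;

uint64_t calc_part_size(uint64_t object_size) {
    // Smallest part size that still fits the whole object into 10k parts (ceiling division).
    uint64_t part_size = (object_size + s3_max_parts - 1) / s3_max_parts;
    part_size = std::max(part_size, s3_min_part_size);
    if (part_size > s3_max_part_size) {
        throw std::invalid_argument("object too large for a multipart upload");
    }
    return part_size;
}
```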

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-640
Must be ported back to 2025.3/4 and 2026.1 since we may encounter this bug in production clusters

Closes scylladb/scylladb#28592

* github.com:scylladb/scylladb:
  s3_client: add more constraints to the calc_part_size
  s3_client: add tests for calc_part_size
  s3_client: correct multipart part-size logic to respect 10k limit
2026-02-18 10:04:53 +03:00
Tomasz Grabiec
d33d38139f test_tablets_parallel_decommission: Fix flakiness due to delayed task appearance
Currently, the test assumes that when
'topology_coordinator_pause_before_processing_backlog: waiting' is
logged, the task for decommission must be there. This was based on the
assumption that the topology coordinator is idle and the decommission
request wakes it up. But if the server is slow enough, it may still be
running the load balancer in reaction to table creation, and block on
that injection point before the decommission request was added.

Fix by waiting for the task to appear rather than for the injection.

Fixes SCYLLADB-715
2026-02-18 01:02:50 +01:00
Tomasz Grabiec
2454de4f8f test: cluster: task_manager_client: Introduce wait_task_appears() 2026-02-18 01:02:44 +01:00
Tomasz Grabiec
e14eca46af tests: pylib: util: Add exponential backoff to wait_for
Allows balancing the trade-off between fast execution in case the
condition is satisfied quickly and not adding load when it's not.
2026-02-18 01:02:19 +01:00
Szymon Malewski
668d6fe019 vector: Improve similarity functions performance
Improves performance of deserialization of vector data for calculating similarity functions.
Instead of deserializing vector data into a std::vector<data_value>, we deserialize directly into a std::vector<float>
and then pass it to the similarity functions as a std::span<const float>.
This avoids the overhead of data_value allocations and conversions.
Example QPS of `SELECT id, similarity_cosine({vector<float, 1536>}, {vector<float, 1536>}) ...`:
client concurrency 1: before: ~135 QPS, after: ~1005 QPS
client concurrency 20: before: ~280 QPS, after: ~2097 QPS
Measured using https://github.com/zilliztech/VectorDBBench (modified to call above query without ANN search)
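For illustration, a rough sketch of the direct deserialization, assuming the CQL wire format for vector<float, N> (a concatenation of big-endian IEEE-754 floats); names are assumptions, not the actual Scylla code. A std::span<const float> view over the result can then be handed to the similarity functions without further copies.

```
#include <bit>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <span>
#include <vector>

// Deserialize a serialized vector<float, N> straight into std::vector<float>.
std::vector<float> deserialize_float_vector(std::span<const std::byte> bytes) {
    std::vector<float> out(bytes.size() / sizeof(float));
    for (size_t i = 0; i < out.size(); ++i) {
        uint32_t be;
        std::memcpy(&be, bytes.data() + i * sizeof(float), sizeof(be));
        // Network (big-endian) order to host order on little-endian machines.
        if constexpr (std::endian::native == std::endian::little) {
            be = (be >> 24) | ((be >> 8) & 0xff00u) | ((be << 8) & 0xff0000u) | (be << 24);
        }
        out[i] = std::bit_cast<float>(be);
    }
    return out;   // callers pass std::span<const float>(out) to the similarity functions
}
```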

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-471

Closes scylladb/scylladb#28615
2026-02-18 00:33:34 +02:00
Calle Wilund
ab4e4a8ac7 commitlog: Always abort replenish queue on loop exit
Fixes #28678

If the replenish loop exits the sleep condition with an empty queue
when "_shutdown" is already set, a waiter might get stuck, unsignalled,
waiting for segments even though we are exiting.

Simply move the queue abort so it is always done on loop exit.

Closes scylladb/scylladb#28679
2026-02-17 23:46:47 +02:00
Emil Maskovsky
6b98f44485 init: fix infinite loop on npos wrap with updated Seastar
Fixes parsing of comma-separated seed lists in "init.cc" and
"cql_test_env.cc" to use the standard `split_comma_separated_list`
utility, avoiding manual `npos` arithmetic. The previous code relied on
`npos` being `uint32_t(-1)`, which would not overflow in `uint64_t`
target and exit the loop as expected. With Seastar's upcoming change
to make `npos` `size_t(-1)`, this would wrap around to zero and cause
an infinite loop.

Switch to `split_comma_separated_list` standardized way of tokenization
that is also used in other places in the code. Empty tokens are handled
as before. This prevents startup hangs and test failures when Seastar is
updated.

Refs: scylladb/seastar#3236
2026-02-17 17:57:13 +00:00
Emil Maskovsky
bda0fc9d93 init: remove unnecessary object creation in emplace calls
Simplifies code by directly passing constructor arguments to emplace,
avoiding redundant temporary gms::inet_address() object creation.
Improves clarity and potentially performance in affected areas.
2026-02-17 17:57:12 +00:00
Marcin Maliszkiewicz
741969cf4c test: boost: add auth cache tests
The cache is covered already with general auth
dtests but some cases are more tricky and easier
to express directly as calls to cache class.
For such tests boost test file was added.
2026-02-17 18:18:40 +01:00
Marcin Maliszkiewicz
c11eb73a59 auth: add cache size metrics 2026-02-17 18:18:40 +01:00
Marcin Maliszkiewicz
a059798de9 docs: conf: update permissions cache documentation 2026-02-17 18:18:40 +01:00
Marcin Maliszkiewicz
a23e503e7b auth: remove old permissions cache 2026-02-17 17:56:27 +01:00
Marcin Maliszkiewicz
9d9184e5b7 auth: use unified cache for permissions 2026-02-17 17:56:27 +01:00
Marcin Maliszkiewicz
7eedf50c12 auth: ldap: add permissions reload to unified cache
The LDAP server may change role-chain assignments without notifying
Scylla. As a result, effective permissions can change, so some form of
polling is required.

Currently, this is handled via cache expiration. However, the unified
cache is designed to be consistent and does not support expiration.
To provide an equivalent mechanism for LDAP, we will periodically
reload the permissions portion of the new cache at intervals matching
the previously configured expiration time.
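For illustration, a rough sketch of such a periodic reload; the class and method names are assumptions, not the actual auth code.

```
#include <seastar/core/timer.hh>
#include <seastar/core/lowres_clock.hh>
#include <seastar/core/future.hh>
#include <chrono>

class ldap_permissions_reloader {
    seastar::timer<seastar::lowres_clock> _reload_timer;

    // Placeholder for the real work: re-query LDAP role chains and refresh
    // the permissions portion of the unified cache.
    seastar::future<> reload_permissions() {
        return seastar::make_ready_future<>();
    }

public:
    void start(std::chrono::milliseconds interval) {
        _reload_timer.set_callback([this] {
            // Fire-and-forget; the real code would track and log failures.
            (void)reload_permissions();
        });
        _reload_timer.arm_periodic(interval);   // interval matches the old cache expiration time
    }

    void stop() {
        _reload_timer.cancel();
    }
};
```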
2026-02-17 17:56:27 +01:00
Marcin Maliszkiewicz
10996bd0fb auth: add permissions cache to auth/cache
We want to get rid of the loading cache because its periodic
refresh logic generates a lot of internal load when there
are many entries. Also, our operational procedures involve tweaking
its config, while the new unified cache is supposed to work out
of the box.
2026-02-17 17:56:27 +01:00
Marcin Maliszkiewicz
03c4e4bb10 auth: add service::revoke_all as main entry point
In the following commit we'll need to add some
cache-related logic (removing resource permissions).
This logic doesn't depend on the authorizer, so it should
be managed by the service itself.
2026-02-17 17:56:27 +01:00
Marcin Maliszkiewicz
070d0bfc4c auth: explicitly life-extend resource in auth_migration_listener
Otherwise it's easy to trigger a use-after-free when the code changes slightly.
2026-02-17 17:56:27 +01:00
Marcin Maliszkiewicz
3b98451776 test: auth_cluster: add test for hung AUTHENTICATING connections
Test runtime:
Release - 2s
Debug - 5s
2026-02-17 17:55:48 +01:00
Marcin Maliszkiewicz
0376d16ad3 transport: fix connection code to consume only initially taken semaphore units
The connection's cpu_concurrency_t struct tracks the state of a connection
to manage the admission of new requests and prevent CPU overload during
connection storms. When a connection holds units (allowed only 0 or 1), it is
considered to be in the "CPU state" and contributes to the concurrency limits
used when accepting new connections.

The bug stems from the fact that `counted_data_source_impl::get` and
`counted_data_sink_impl::put` calls can interleave during execution. This
occurs because of `should_parallelize` and `_ready_to_respond`, the latter being
a future chain that can run in the background while requests are being read.
Consequently, while reading request (N), the system may concurrently be
writing the response for request (N-1) on the same connection.

This interleaving allows `return_all()` to be called twice before the
subsequent `consume_units()` is invoked. While the second `return_all()` call
correctly returns 0 units, the matching `consume_units()` call would
mistakenly take an extra unit from the semaphore. Over time, a connection
blocked on a read operation could end up holding an unreturned semaphore
unit. If this pattern repeats across multiple connections, the semaphore
units are eventually depleted, preventing the server from accepting any
new connections.

The fix ensures that we always consume the exact number of units that were
previously returned. With this change, interleaved operations behave as
follows:

get() return_all     — returns 1 unit
put() return_all     — returns 0 units
get() consume_units  — takes back 1 unit
put() consume_units  — takes back 0 units

Logically, the networking phase ends when the first network operation
concludes. But more importantly, when a network operation
starts, we no longer hold any units.

Other solutions are possible but the chosen one seems to be the
simplest and safest to backport.

Fixes SCYLLADB-485
2026-02-17 17:55:48 +01:00
Dani Tweig
5dc06647e9 .github: add workflow to auto-close issues from ScyllaDB associates
Added .github/workflows/close_issue_for_scylla_employee.yml workflow file to automatically close issues opened by ScyllaDB associates

We want to allow external users to open issues in the scylladb repo, but we would like ScyllaDB associates to open issues in Jira instead. If a ScyllaDB associate opens an issue in the scylladb.git repo by mistake, the issue will be closed automatically with an appropriate comment explaining that it should be opened in Jira.

This is a new github action, and does not require any code backport.

Fixes: PM-64

Closes scylladb/scylladb#28212
2026-02-17 17:18:32 +02:00
Dani Tweig
bb8a2c3a26 .github/workflow/:Add milestone sync to Jira based on GitHub Action
What changed
Added new workflow file .github/workflows/call_jira_sync_pr_milestone.yml

Why (Requirements Summary)
Adds a GitHub Action that is triggered when a milestone is set on or removed from a PR.
When a milestone is added (milestoned event), it calls main_jira_sync_pr_milestone_set.yml from github-automation.git, which adds the version to the 'Fix Versions' field in the relevant linked Jira issue.
When a milestone is removed (demilestoned event), it calls main_jira_sync_pr_milestone_removed.yml from github-automation.git, which removes the version from the 'Fix Versions' field in the relevant linked Jira issue.
Testing was performed in staging.git and the STAG Jira project.

Fixes: PM-177

Closes scylladb/scylladb#28575
2026-02-17 16:41:03 +02:00
Botond Dénes
2e087882fa Merge 'GCS object storage. Fix incompatibility issues with "real" GCS' from Calle Wilund
Fixes #28398
Fixes #28399

When used as path elements in google storage paths, the object names need to be URL encoded. This was missed due to

a.) the tests not really using prefixes that include non-URL-valid chars (i.e. / etc)
and
b.) the mock server used for most testing not enforcing this particular aspect.

Modified unit tests to use prefixes for all names, so that when running against real GS, any errors like this will show.

"Real" GCS also behaves a bit differently than the mock when listing with a pager:
it will not give a pager token for the last page, only the penultimate one.
Handling for this is added.

Needs backport to the releases that have (though might not really use) the feature, as it is technically possible to use google storage for backup and whatnot there, and it should work as expected.

Closes scylladb/scylladb#28400

* github.com:scylladb/scylladb:
  utils/gcp/object_storage: URL-encode object names in URLs
  utils::gcp::object_storage: Fix list object pager end condition detection
2026-02-17 16:40:02 +02:00
Andrei Chekun
1b5789cd63 test.py: refactor manager fixture
The current manager flow has a flaw: it triggers pytest.fail when
it finds errors on teardown, regardless of whether the test has already
failed. This creates an additional record in the JUnit report with the
same name, and Jenkins will not be able to show the logs correctly. To
avoid this, this PR changes the logic slightly.
Now the manager checks whether the test failed, to avoid two failures
for the same test in the report.
If the test passed, the manager checks the cluster status and fails if
something is wrong with it. There is no need to check the cluster status
if the test failed.
If the test passed and the cluster status is OK, but there are unexpected
errors in the logs, the test fails as well. This check gathers all
information about the errors and potential stack traces, and only fails
the test if it has not already failed, to avoid a double entry in the report.

Closes scylladb/scylladb#28633
2026-02-17 14:35:18 +01:00
Dawid Mędrek
5b5222d72f Merge 'test: make test_different_group0_ids work with the Raft-based topology' from Patryk Jędrzejczak
The test was marked with xfail in #28383, as it needed to be updated to
work with the Raft-based topology. We are doing that in this patch.

With the Raft-based topology, there is no reason to check that nodes with
different group0 IDs cannot merge their topology/token_metadata. That is
clearly impossible, as doing any topology change requires being in the
same group0. So, the original regression test doesn't make sense.

We can still test that nodes with different group0 IDs cannot gossip with
each other, so we keep the test. It's very fast anyway.

No backport, test update.

Closes scylladb/scylladb#28571

* github.com:scylladb/scylladb:
  test: run test_different_group0_ids in all modes
  test: make test_different_group0_ids work with the Raft-based topology
2026-02-17 13:56:41 +01:00
Dawid Mędrek
1b80f6982b Merge 'test: make the load balancer simulator tablet size aware' from Ferenc Szili
Currently, the load balancing simulator computes node, shard and tablet load based on tablet count.

This patch changes the load balancing simulator to be tablet size aware. It generates random tablet sizes with a normal distribution, and a mean value of `default_target_tablet_size`, and reports the computed load for nodes and tables based on tablet size sum, instead of tablet count.

This is the last patch in the Size Based Load Balancing series:

- First part for tablet size collection via load_stats: scylladb/scylladb#26035
- Second part reconcile load_stats: scylladb/scylladb#26152
- The third part for load_sketch changes: scylladb/scylladb#26153
- The fourth part which performs tablet load balancing based on tablet size: scylladb/scylladb#26254
- The fifth part changes the load balancing simulator: scylladb/scylladb#26438

This is a new feature and backport is not needed.

Closes scylladb/scylladb#26438

* github.com:scylladb/scylladb:
  test, simulator: compute load based on tablet size instead of count
  test, simulator: generate tablet sizes and update load_stats
  test, simulator: postpone creation of load_stats_ptr
2026-02-17 13:29:37 +01:00
Avi Kivity
ffde2414e8 cql3: grammar: remove special case for vector similarity functions in selectors
In b03d520aff ("cql3: introduce similarity functions syntax") we
added vector similarity functions to the grammar. The grammar had to
be modified because we wanted to support literals as vector similarity
function arguments, and the general function syntax in selectors
did not allow that.

In cc03f5c89d ("cql3: support literals and bind variables in
selectors") we extended the selector function call grammar to allow
literals as function arguments.

Here, we remove the special case for vector similarity functions as
the general case in function calls covers all the possibilities the
special case does.

As a side effect, the vector similarity function names are no longer
reserved.

Note: the grammar change fixes an inconsistency with how the vector
similarity functions were evaluated: typically, when a USE statement
is in effect, an unqualified function is first matched against functions
in the keyspace, and only if there is no match is the system keyspace
checked. But with the previous implementation vector similarity functions
ignored the USE keyspace and always matched only the system keyspace.

This small inconsistency doesn't matter in practice because user defined
functions are still experimental, and no one would name a UDF to conflict
with a system function, but it is still good to fix it.

Closes scylladb/scylladb#28481
2026-02-17 12:40:21 +01:00
Ernest Zaslavsky
30699ed84b api: report restore params
Report the restore parameters once the API call for restore is invoked.

Closes scylladb/scylladb#28431
2026-02-17 14:27:21 +03:00
Andrei Chekun
767789304e test.py: improve C++ fail summary in pytest
Currently, if a test fails, pytest outputs only some basic information
about the failure. With this change, it will output the last 300 lines of the
boost/seastar test output.
Also capture the output of failed tests in the JUnit report, so it
will be present in the report on Jenkins.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-449

Closes scylladb/scylladb#28535
2026-02-17 14:25:28 +03:00
Pavel Emelyanov
6d4af84846 Merge 'test: increase open file limit for sstable tests' from Avi Kivity
In ebda2fd4db ("test: cql_test_env: increase file descriptor limit"),
we raised the open file limit for cql_test_env. Here, we raise it for sstables::test_env
as well, to fix a couple of twcs resharding tests failing outside dbuild. These tests
open 256 sstables, and with 2 files/sstable + resharding work it is understandable
that they overflow the 1024 limit.

No backport: this is a quality of life improvement for developers running outside dbuild, but they can use dbuild for branches.

Closes scylladb/scylladb#28646

* github.com:scylladb/scylladb:
  test: sstables::test_env: adjust file open limit
  test: extract cql_test_env's adjust_rlimit() for reuse
2026-02-17 14:19:43 +03:00
Avi Kivity
41925083dc test: minio: tune sync setting
Disable O_DSYNC in minio to avoid unnecessary slowdown in S3
tests.

Closes scylladb/scylladb#28579
2026-02-17 14:19:27 +03:00
Avi Kivity
f03491b589 Update seastar submodule
* seastar f55dc7eb...d2953d2a (13):
  > io_tester: Revive IO bandwidth configuration
  > Merge 'io_tester: add vectorized I/O support' from Travis Downs
    doc: add vectorized I/O options to io-tester.md
    io_tester: add vectorized I/O support
  > Merge 'Remove global scheduling group ID bitmap' from Pavel Emelyanov
    reactor: Drop sched group IDs bitmap
    reactor: Allocate scheduling group on shard-0 first
    reactor: Detach init_scheduling_group_specific_data()
    reactor: Coroutinize create_scheduling_group()
  > set_iterator: increase compatibility with C++ ranges
  > test: fix race condition in test_connection_statistics
  > Add Claude Code project instructions
  > reactor: Unfriend pollable_fd via pollable_fd_state::make()
  > Merge 'rpc_tester: introduce rpc_streaming job based on streaming API' from Jakub Czyszczoń
    apps: rpc_tester: Add STREAM_UNIDIRECTIONAL job We introduce an unidirectional streaming to the rpc_streaming job.
    apps: rpc_tester: Add STREAM_BIDIRECTIONAL job This commit extends the rpc_tester with rpc_streaming job that uses rpc::sink<> and rpc::source<> to stream data between the client and the server.
  > treewide: remove remnants of SEASTAR_MODULE
  > test: Tune abort-accept test to use more readable async()
  > build: support sccache as a compiler cache (#3205)
  > posix-stack: Reuse parent class _reuseport from child
  > Merge 'reactor_backend: Fix another busy spin bug in the epoll backend' from Stephan Dollberg
    tests: Add unit test for epoll busy spin bug
    reactor_backend: Fix another busy spin bug in epoll

Closes scylladb/scylladb#28513
2026-02-17 13:13:22 +02:00
Jakub Smolar
189b056605 scylla_gdb: use run_ctx to handle Scylla exe and remove pexpect
The previous implementation of the Scylla lifecycle brought flakiness to the test.
This change leaves lifecycle management up to PythonTest.run_ctx,
which implements more stability logic for setup/teardown.

Replace pexpect-driven GDB interaction with GDB batch mode:
- Avoids DeprecationWarning: "This process is multi-threaded, use of forkpty()
may lead to deadlocks in the child.", which ultimately caused CI deadlocks.
- Removes timeout-driven flakiness on slow systems - no interactive waits/timeouts.
- Produces cleaner, more direct assertions around command execution and output.
- Trade-off: batch mode adds ~10s per command per test,
but with --dist=worksteal this is ~10% overall runtime increase across the suite.

Closes scylladb/scylladb#28484
2026-02-17 11:36:20 +01:00
Łukasz Paszkowski
f45465b9f6 test_out_of_space_prevention.py: Lower the critical disk utilization threshold
After PR https://github.com/scylladb/scylladb/pull/28396 reduced
the test volumes to 20MiB to speed up test_out_of_space_prevention.py,
keeping the original 0.8 critical disk utilization threshold can make
the tests flaky: transient disk usage (e.g. commitlog segment churn)
can push the node into ENOSPC during the run.

These tests do not write much data, so reduce the critical disk
utilization threshold to 0.5. With 20MiB volumes this leaves ~10MiB
of headroom for temporary growth during the test.

Fixes: https://github.com/scylladb/scylladb/issues/28463

Closes scylladb/scylladb#28593
2026-02-16 15:10:18 +02:00
Andrei Chekun
e26cf0b2d6 test/cluster: fix two flaky tests
test_maintenance_socket is flaky with the new way of running. It looks like
the driver tries to reconnect with an old maintenance socket from the previous
driver and fails. This PR adds a whitelist for connections that stabilizes
the test.
test_no_removed_node_event_on_ip_change was flaky on CI, while the issue
never reproduced locally. The assumption is that under load we have a race
condition and try to check the logs before the message has arrived. A small
retry loop is added to avoid this situation.
Closes scylladb/scylladb#28635
2026-02-16 14:50:54 +02:00
Patryk Jędrzejczak
0693091aff test: test_restart_leaving_replica_during_cleanup: reconnect driver after restart
The test can currently fail like this:
```
>           await cql.run_async(f"ALTER TABLE {ks}.test WITH tablets = {{'min_tablet_count': 1}}")
E           cassandra.cluster.NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 127.158.27.9:9042 datacenter1>: <Error from server: code=0000 [Server error] message="Failed to apply group 0 change due to concurrent modification">})
```
The following happens:
- node A is restarted and becomes the group0 leader,
- the driver sends the ALTER TABLE request to node B,
- the request hits group 0 concurrent modification error 10 times and fails
  because node A performs tablet migrations at the same time.

What is unexpected is that even though the driver session uses the default
retry policy, the driver doesn't retry the request on node A. The request
is guaranteed to succeed on node A because it's the only node adding group0
entries.

The driver doesn't retry the request on node A because of a missing
`wait_for_cql_and_get_hosts` call. We add it in this commit. We also reconnect
the driver just in case to prevent hitting scylladb/python-driver#295.

Moreover, we can revert the workaround from
4c9efc08d8, as the fix from this commit also
prevents DROP KEYSPACE failures.

The commit has been tested in byo with `_concurrent_ddl_retries{0}` to
verify that node A really can't hit group 0 concurrent modification error
and always receives the ALTER TABLE request from the driver. All 300 runs in
each build mode passed.

Fixes #25938

Closes scylladb/scylladb#28632
2026-02-16 12:56:18 +01:00
Marcin Maliszkiewicz
6a4aef28ae Merge 'test: explicitly set compression algorithm in test_autoretrain_dict' from Andrzej Jackowski
When `test_autoretrain_dict` was originally written, the default
`sstable_compression_user_table_options` was `LZ4Compressor`. The
test assumed (correctly) that initially the compression doesn't use
a trained dictionary, and later in the test scenario, it changed
the algorithm to one with a dictionary.

However, the default `sstable_compression_user_table_options` is now
`LZ4WithDictsCompressor`, so the old assumption is no longer correct.
As a result, the assertion that data is initially not compressed well
may or may not fail depending on dictionary training timing.

To fix this, this commit explicitly sets `ZstdCompressor`
as the initial `sstable_compression_user_table_options`, ensuring that
the assumption that initial compression is without a dictionary
is always met.

Note: `ZstdCompressor` differs from the former default `LZ4Compressor`.
However, it's a better choice — the test aims to show the benefit of
using a dictionary, not the benefit of Zstd over LZ4 (and the test uses
ZstdWithDictsCompressor as the algorithm with the dictionary).

Fixes: https://github.com/scylladb/scylladb/issues/28204
Backport: 2025.4, as test already failed there (and also backport to 2026.1 to make everything consistent).

Closes scylladb/scylladb#28625

* github.com:scylladb/scylladb:
  test: explicitly set compression algorithm in test_autoretrain_dict
  test: remove unneeded semicolons from python test
2026-02-16 11:38:24 +01:00
Ernest Zaslavsky
034c6fbd87 s3_client: limit multipart upload concurrency
Prevent launching hundreds or thousands of fibers during multipart uploads
by capping concurrent part submissions to 16.
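For illustration, a rough sketch of the capping using Seastar's max_concurrent_for_each; the helper names (upload_part, upload_all_parts) are assumptions, not the actual s3_client code.

```
#include <seastar/core/loop.hh>
#include <seastar/core/do_with.hh>
#include <seastar/core/future.hh>
#include <numeric>
#include <vector>

// Placeholder for the real per-part upload request.
seastar::future<> upload_part(unsigned part_number) {
    return seastar::make_ready_future<>();
}

seastar::future<> upload_all_parts(unsigned part_count) {
    std::vector<unsigned> parts(part_count);
    std::iota(parts.begin(), parts.end(), 1u);        // S3 part numbers start at 1
    return seastar::do_with(std::move(parts), [] (auto& parts) {
        // At most 16 upload_part() fibers are in flight at any time.
        return seastar::max_concurrent_for_each(parts, 16, [] (unsigned n) {
            return upload_part(n);
        });
    });
}
```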

Closes scylladb/scylladb#28554
2026-02-16 13:32:58 +03:00
Botond Dénes
9f57d6285b Merge 'test: improve error reporting and retries in get_scylla_2025_1_executable' from Marcin Maliszkiewicz
Harden get_scylla_2025_1_executable() by improving error reporting when subprocesses fail,
increasing curl's retry count for more resilient downloads, and enabling --retry-all-errors to retry on all failures.

Fixes https://github.com/scylladb/scylladb/issues/27745
Backport: no, it's not a bug fix

Closes scylladb/scylladb#28628

* github.com:scylladb/scylladb:
  test: pylib: retry on all errors in get_scylla_2025_1_executable curl's call
  test: pylib: increase curl's number of retries when downloading scylla
  test: pylib: improve error reporting in get_scylla_2025_1_executable
2026-02-16 10:09:17 +02:00
Petr Gusev
c785d242a7 tests: extract get_topology_version helper
This is a refactoring commit. We need to load the cluster version
for a host in several places, so extract a helper for this.
2026-02-16 08:57:42 +01:00
Petr Gusev
ffe3262e8d global tablets barrier: require all nodes to ack barrier_and_drain
Previously, global_tablet_token_metadata_barrier() could proceed with
fencing even if some nodes did not acknowledge the barrier_and_drain.
This could cause problems:
* In scylladb/scylladb#26864, replica locks did not provide mutual
exclusion, because “fenced out” requests from old topology versions
could run in parallel with requests using newer versions.
* In scylladb/scylladb#26375, the barrier could succeed even though we
did not wait for closed sessions to become unused. This could leave
aborted repair or streaming tasks running concurrently after a tablet
transition was aborted, and thus running concurrently with the next
transition.

In this commit we add a parameter drain_all_nodes: bool to
the global_token_metadata_barrier function. If this parameter is set,
the barrier waits for all nodes to acknowledge the barrier_and_drain
round of RPCs. If any of the nodes are not accessible or throw an error,
such errors are rethrown to the caller. We set this parameter only in
global_tablet_token_metadata_barrier since for topology migrations
the old behavior should be preserved. In case of errors, the tablet
migration is blocked until the problem goes away by itself or the
problematic node is added to the ignore_nodes list.

The test_fenced_out_on_tablet_migration_while_handling_paxos_verb is
removed: with tablets, we now drain all nodes, so after a successful
barrier_and_drain round there can be no coordinators with an old
topology version. The fence_token check after executing a request on
a replica is therefore unnecessary for tablets, but still required for
vnodes, where topology changes do not wait for all nodes.
Topology fencing is covered by test_fence_lwt_during_bootstrap.

Fixes scylladb/scylladb#26864
Fixes scylladb/scylladb#26375
2026-02-16 08:57:42 +01:00
Petr Gusev
06f88b43e5 topology_coordinator: pass raft_topology_cmd by value
It's just a single enum. Passing by reference risks
use-after-free if a temporary command is created on
the stack in a non-coroutine function.
2026-02-16 08:57:42 +01:00
Petr Gusev
df73f723a6 storage_proxy: hold erms in replica handlers
Add explicit erm-holding variables in all replica-side RPC handlers.
This is required to ensure that tablet migration waits for in-flight
replica requests even if a non-replica coordinator has been fenced out.

Holding erms on the replica side may increase the global-barrier wait
time, since the barrier must drain these requests. We believe this
is acceptable because:
* We already hold erms during replica-side request execution, but in
an ad-hoc, non-systemic way in lower layers of storage_proxy
(e.g. in sp::mutate_locally and do_query_tablets).
* Replica requests are bounded by replica-side timeouts, so the
global-barrier wait time cannot exceed the maximum of these timeouts.

For Paxos verbs, we use token_metadata_guard, which wraps the ERM and
automatically refreshes it when tablet migration does not affect the
current token; see the token_metadata_guard comments for details.
We use this guard only for Paxos verbs because regular reads and writes
already hold raw erms in storage_proxy and on the coordinators.

The erms must be held in all RPC handlers that support fencing — that
is, those with a fencing_token parameter in storage_proxy.idl.
Counter updates already hold erms in
mutate_counter_on_leader_and_replicate.

Fix test_tablets2::test_timed_out_reader_after_cleanup: the tablets
barrier now waits for all nodes. As a result, the replica read
is expected to finish, rather than fail due to the tablet having
moved as it did previously. The test is renamed to
test_tablets_barrier_waits_for_replica_erms to better reflect its
purpose.

Refs scylladb/scylladb#26864
2026-02-16 08:57:42 +01:00
Petr Gusev
e39f4b399c token_metadata: improve stale versions diagnostics
Before waiting on stale_versions_in_use(), we log the stale versions
the barrier_and_drain handler will wait for, along with the number of
token_metadata references representing each version.
To achieve this, we store a pointer to token_metadata in
version_tracker, traverse the _trackers list, and output all items
with a version smaller than the latest. Since token_metadata
contains the version_tracker instance, it is guaranteed to remain
alive during traversal. To count references, token_metadata now
inherits from enable_lw_shared_from_this.

This helps diagnose tablet migration stalls and allows more
deterministic tests: when a barrier is expected to block, we can
verify that the log contains the expected stale versions rather
than checking that the barrier_and_drain is blocked on
stale_versions_in_use() for a fixed amount of time.
2026-02-16 08:57:42 +01:00
Andrei Chekun
8c5c1096c2 test: ensure that the table used in cqlpy/test_tools has at least 3 PKs
One of the tests checks that the number of PKs is more than 2, but
the method that creates the table can return a table with fewer keys. This
leads to flakiness; to avoid it, this PR ensures that the table will have at
least 3 PKs

Closes scylladb/scylladb#28636
2026-02-16 09:50:58 +02:00
Anna Mikhlin
33cf97d688 .github/workflows: ignore quoted comments for trigger CI
Prevent CI from being triggered when the trigger-ci command appears inside
quoted (>) comment text.

Fixes: https://scylladb.atlassian.net/browse/RELENG-271

Closes scylladb/scylladb#28604
2026-02-16 09:33:16 +02:00
Andrei Chekun
e144d5b0bb test.py: fix JUnit double test case records
Move the hook for overwriting the XML reporter to be the first, to
avoid double records.

Closes scylladb/scylladb#28627
2026-02-15 19:02:24 +02:00
Alex
75e25493c1 db:view: refactor on_update_view to use coroutine dispatcher
on_update_view() currently runs its serialized logic inline via with_semaphore()
  from a detached callback path, while create/drop already use dedicated async
  dispatchers.

  Refactor update handling to follow the same pattern:

  - add dispatch_update_view(sstring ks_name, sstring view_name)
  - move update logic into that coroutine
  - acquire the existing view-builder lock via get_or_adopt_view_builder_lock()
  - keep existing behavior for missing base/view state
  - keep background invocation semantics from on_update_view()

  This aligns the update/create/drop flows, keeps async lifecycle handling, and is a first step towards fixing the shutdown issue.
2026-02-15 18:50:32 +02:00
Avi Kivity
a365e2deaa test: sstables::test_env: adjust file open limit
The twcs compaction tests open more than 1024 files (not
so good), and will fail in a user session with the default
soft limit (1024).

Attempt to raise the limit so the tests pass. On a modern
systemd installation the hard limit is >500,000, so this
will work.

There's no problem in dbuild since it raises the file limit
globally.
2026-02-15 14:27:37 +02:00
Avi Kivity
bab3afab88 test: extract cql_test_env's adjust_rlimit() for reuse
The sstable-oriented sstable::test_env would also like to use
it, so extract it into a neutral place.
2026-02-15 14:26:46 +02:00
Jenkins Promoter
69249671a7 Update pgo profiles - aarch64 2026-02-15 05:22:17 +02:00
Jenkins Promoter
27aaafb8aa Update pgo profiles - x86_64 2026-02-15 04:26:36 +02:00
Piotr Dulikowski
9c1e310b0d Merge 'vector_search: Fix flaky vector_store_client_https_rewrite_ca_cert' from Karol Nowacki
Most likely, the root cause of the flaky test was that the TLS handshake hung for an extended period (60s). This caused
the test case to fail because the ANN request duration exceeded the test case timeout.

The PR introduces two changes:

* Mitigation of the hanging TLS handshake: This issue likely occurred because the test performed certificate rewrites
simultaneously with ANN requests that utilize those certificates.
* Production code fix: This addresses a bug where the TLS handshake itself was not covered by the connection timeout.
Since tls::connect does not perform the handshake immediately, the handshake only occurs during the first write
operation, potentially bypassing connect timeout.

Fixes: #28012

Backport to 2026.01 and 2025.04 is needed, as these branches are also affected and may experience CI flakiness due to this test.

Closes scylladb/scylladb#28617

* github.com:scylladb/scylladb:
  vector_search: Fix missing timeout on TLS handshake
  vector_search: test: Fix flaky cert rewrite test
2026-02-13 19:03:50 +01:00
Patryk Jędrzejczak
aebc108b1b test: run test_different_group0_ids in all modes
CI currently fails in release and debug modes if the PR only changes
a test that runs only in dev mode. There is no reason to wait for the CI fix,
as there is no reason to run this test only in dev mode in the first
place. The test is very fast.
2026-02-13 13:30:29 +01:00
Patryk Jędrzejczak
59746ea035 test: make test_different_group0_ids work with the Raft-based topology
The test was marked with xfail in #28383, as it needed to be updated to
work with the Raft-based topology. We are doing that in this patch.

With the Raft-based topology, there is no reason to check that nodes with
different group0 IDs cannot merge their topology/token_metadata. That is
clearly impossible, as doing any topology change requires being in the
same group0. So, the original regression test doesn't make sense.

We can still test that nodes with different group0 IDs cannot gossip with
each other, so we keep the test. It's very fast anyway.
2026-02-13 13:30:28 +01:00
Marcin Maliszkiewicz
1b0a68d1de test: pylib: retry on all errors in get_scylla_2025_1_executable curl's call
It's difficult to say whether our download backend would always return
a transient error correctly so that curl could retry. Instead, it's
more robust to always retry on error.
2026-02-12 16:18:52 +01:00
Marcin Maliszkiewicz
8ca834d4a4 test: pylib: increase curl's number of retries when downloading scylla
By default curl does exponential backoff, and we want to keep that,
but there is a time cap of 10 minutes, so with 40 retries we'd wait
a long time; instead we set the cap to 60 seconds.

Total waiting time (excluding request-receiving time):
before - 17m
after - 35m
2026-02-12 16:18:52 +01:00
Marcin Maliszkiewicz
70366168aa test: pylib: improve error reporting in get_scylla_2025_1_executable
Curl and the other tools this function calls will now log an error
at the place they fail instead of doing a plain assert.
2026-02-12 16:18:52 +01:00
Andrzej Jackowski
9ffa62a986 test: explicitly set compression algorithm in test_autoretrain_dict
When `test_autoretrain_dict` was originally written, the default
`sstable_compression_user_table_options` was `LZ4Compressor`. The
test assumed (correctly) that initially the compression doesn't use
a trained dictionary, and later in the test scenario, it changed
the algorithm to one with a dictionary.

However, the default `sstable_compression_user_table_options` is now
`LZ4WithDictsCompressor`, so the old assumption is no longer correct.
As a result, the assertion that data is initially not compressed well
may or may not fail depending on dictionary training timing.

To fix this, this commit explicitly sets `ZstdCompressor`
as the initial `sstable_compression_user_table_options`, ensuring that
the assumption that initial compression is without a dictionary
is always met.

Note: `ZstdCompressor` differs from the former default `LZ4Compressor`.
However, it's a better choice — the test aims to show the benefit of
using a dictionary, not the benefit of Zstd over LZ4 (and the test uses
ZstdWithDictsCompressor as the algorithm with the dictionary).

Fixes: scylladb/scylladb#28204
2026-02-12 14:58:39 +01:00
Andrzej Jackowski
e63cfc38b3 test: remove unneeded semicolons from python test 2026-02-12 14:49:17 +01:00
Patryk Jędrzejczak
8e9c7397c5 raft topology: prevent accessing nullptr returned by topology::find
It's better to raise an internal error than cause a segmentation fault on
possibly multiple nodes.
2026-02-12 13:10:04 +01:00
Patryk Jędrzejczak
e21ecf69de raft topology: make some assertions non-crashing
Some assertions in the Raft-based topology are likely to cause crashes of
multiple nodes due to the consistent nature of the Raft-based code. If the
failing assertion is executed in the code run by each follower (e.g., the code
reloading the in-memory topology state machine), then all nodes can crash. If
the failing assertion is executed only by the leader (e.g., the topology
coordinator fiber), then multiple consecutive group0 leaders will chain-crash
until there is no group0 majority.

Crashing multiple nodes is much more severe than necessary. It's enough to
prevent the topology state machine from making more progress. This will
naturally happen after throwing a runtime error. The problematic fiber will be
killed or will keep failing in a loop. Note that it should be safe to block
the topology state machine, but not the whole group0, as the topology state
machine is mostly isolated from the rest of group0.

We replace some occurrences of `on_fatal_internal_error` and `SCYLLA_ASSERT`
with `on_internal_error`. These are not all occurrences, as some fatal
assertions make sense, for example, in the bootstrap procedure.
2026-02-12 13:10:03 +01:00
Ferenc Szili
d7cfaf3f84 test, simulator: compute load based on tablet size instead of count
This patch changes the load balancing simulator so that it computes
table load based on tablet sizes instead of tablet count.

best_shard_overcommit measured the minimal allowed overcommit in cases
where the number of tablets cannot be evenly distributed across
all the available shards. This is still the case, but instead of
computing it as an integer div_ceil() of the average shard load,
it is now computed by allocating the tablet sizes using the
largest-tablet-first method. From these, we can get the lowest
overcommit for the given set of nodes, shards and tablet sizes.
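For illustration, a rough sketch of one way to realize the largest-tablet-first allocation described above (a greedy largest-first placement); the function name and details are assumptions, not the simulator's actual code.

```
#include <algorithm>
#include <cstdint>
#include <functional>
#include <queue>
#include <vector>

// Place the largest tablets first, each onto the currently least-loaded shard,
// then report max shard load over average shard load as the lowest achievable
// overcommit for this set of tablet sizes.
double best_shard_overcommit(std::vector<uint64_t> tablet_sizes, size_t n_shards) {
    std::sort(tablet_sizes.begin(), tablet_sizes.end(), std::greater<>());
    std::priority_queue<uint64_t, std::vector<uint64_t>, std::greater<>> shard_load;
    for (size_t i = 0; i < n_shards; ++i) {
        shard_load.push(0);
    }
    uint64_t total = 0;
    uint64_t max_load = 0;
    for (uint64_t size : tablet_sizes) {
        uint64_t load = shard_load.top() + size;   // least loaded shard gets the tablet
        shard_load.pop();
        shard_load.push(load);
        max_load = std::max(max_load, load);
        total += size;
    }
    double average = double(total) / double(n_shards);
    return average > 0 ? double(max_load) / average : 1.0;
}
```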
2026-02-12 12:54:55 +01:00
Ferenc Szili
216443c050 test, simulator: generate tablet sizes and update load_stats
This change adds a random tablet size generator. The tablet sizes are
created in load_stats.

Further changes to the load balance simulator:

- apply_plan() updates the load_stats after a migration plan is issued by the
load balancer,

- add a command line option which controls the tablet size deviation factor.
2026-02-12 12:54:55 +01:00
Ferenc Szili
e31870a02d test, simulator: postpone creation of load_stats_ptr
With size based load balancing, we will have to move the tablet size in
load_stats after each internode migration issued by balance_tablets().
This will be done in a subsequent commit in apply_plan() which is
called from rebalance_tablets().

Currently, rebalance_tablets() is passed a load_stats_ptr which is
defined as:

using load_stats_ptr = lw_shared_ptr<const load_stats>;

Because this is a pointer to const, apply_plan() can't modify it.

So, we pass a reference to load_stats to rebalance_tablets() and create
a load_stats_ptr from it for each call to balance_tablets().
2026-02-12 12:54:55 +01:00
Aleksandra Martyniuk
f955a90309 test: fix test_remove_node_violating_rf_rack_with_rack_list
test_remove_node_violating_rf_rack_with_rack_list creates a cluster
with four nodes. One of the nodes is excluded, then another one is
stopped, excluded, and removed. If the two stopped nodes were both
voters, the majority is lost and the cluster loses its raft leader.
As a result, the node cannot be removed and the operation times out.

Add the 5th node to the cluster. This way the majority is always up.

Fixes: https://github.com/scylladb/scylladb/issues/28596.

Closes scylladb/scylladb#28610
2026-02-12 12:58:48 +02:00
Ferenc Szili
4ca40929ef test: add read barrier to test_balance_empty_tablets
The test creates a single node cluster, then creates 3 tables which
remain empty. Then it adds another node with half the disk capacity of
the first one, and then it waits for the balancer to migrate tablets to
the newly added node by calling the quiesce topology API. The number of
tablets on the smaller node should be exactly half the number of tablets
on the larger node.

After waiting for quiesce topology, we could have a situation where we
query the number of tablets from the node which still hasn't processed
the last tablet migrations and updated system.tablets.

This patch adds a read barrier so that both nodes see the same tablets
metadata before we query the number of tablets.

Fixes: SCYLLADB-603

Closes scylladb/scylladb#28598
2026-02-12 11:16:34 +02:00
Karol Nowacki
079fe17e8b vector_search: Fix missing timeout on TLS handshake
Currently the TLS handshake in the vector search client does not have a timeout.
This is because tls::connect does not perform handshake itself; the handshake
is deferred until the first read/write operation is performed. This can lead to long
hangs on ANN requests.

This commit calls tls::check_session_is_resumed() after tls::connect
to force the handshake to happen immediately and to run under with_timeout.
2026-02-12 10:08:37 +01:00
Karol Nowacki
aef5ff7491 vector_search: test: Fix flaky cert rewrite test
The test is flaky most likely because when TLS certificate rewrite
happens simultaneously with an ANN request, the handshake can hang for a
long time (~60s). This leads to a timeout in the test case.

This change introduces a checkpoint in the test so that it will
wait for the certificate rewrite to happen before sending an ANN request,
which should prevent the handshake from hanging and make the test more reliable.

Fixes: #28012
2026-02-12 09:58:54 +01:00
Piotr Dulikowski
38c4a14a5b Merge 'test: cluster: Fix test_sync_point' from Dawid Mędrek
The test `test_sync_point` had a few shortcomings that made it flaky
or simply wrong:

1. We were verifying that hints were written by checking the size of
   in-flight hints. However, that could potentially lead to problems
   in rare situations.

   For instance, if all of the hints failed to be written to disk, the
   size of in-flight hints would drop to zero, but creating a sync point
   would correspond to the empty state.

   In such a situation, we should fail immediately and indicate what
   the cause was.

2. A sync point corresponds to the hints that have already been written
   to disk. The number of those is tracked by the metric `written`.
   It's a much more reliable way to make sure that hints have been
   written to the commitlog. That ensures that the sync point we'll
   create will really correspond to those hints.

3. The auxiliary function `wait_for` used in the test works like this:
   it executes the passed callback and looks at the result. If it's
   `None`, it retries it. Otherwise, the callback is deemed to have
   finished its execution and no further retries will be attempted.

   Before this commit, we simply returned a bool, and so the code was
   wrong. We improve it.

---

Note that this fixes scylladb/scylladb#28203, which was a manifestation
of scylladb/scylladb#25879. We created a sync point that corresponded
to the empty state, and so it immediately resolved, even when node 3
was still dead.

As a bonus, we rewrite the auxiliary code responsible for fetching
metrics and manipulating sync points. Now it's asynchronous and
uses the existing standard mechanisms available to developers.

Furthermore, we reduce the time needed for executing
`test_sync_point` by 27 seconds.

---

The total difference in time needed to execute the whole test file
(on my local machine, in dev mode):

Before:

    CPU utilization: 0.9%

    real    2m7.811s
    user    0m25.446s
    sys     0m16.733s

After:

    CPU utilization: 1.1%

    real    1m40.288s
    user    0m25.218s
    sys     0m16.566s

---

Refs scylladb/scylladb#25879
Fixes scylladb/scylladb#28203

Backport: This improves the stability of our CI, so let's
          backport it to all supported versions.

Closes scylladb/scylladb#28602

* github.com:scylladb/scylladb:
  test: cluster: Reduce wait time in test_sync_point
  test: cluster: Fix test_sync_point
  test: cluster: Await sync points asynchronously
  test: cluster: Create sync points asynchronously
  test: cluster: Fetch hint metrics asynchronously
2026-02-12 09:34:09 +01:00
Dawid Pawlik
4e32502bb3 test/vector_search: add reproducer for rescoring with zero vectors
Add a reproducer for the SCYLLADB-456 issue: an exception
on ANN vector queries with rescoring and cosine similarity.
2026-02-11 13:41:09 +01:00
Dawid Pawlik
af0889d194 vector_search: return NaN for similarity_cosine with all-zero vectors
ANN vector queries with all-zero vectors are allowed even on vector
indexes whose similarity function is set to cosine.
With the rescoring option enabled, those queries would fail because rescoring
calls the `similarity_cosine` function underneath, which raised an `InvalidRequest`
exception, as all-zero vectors were not allowed there, matching Cassandra's behaviour.

To eliminate the discrepancy we let `similarity_cosine` calls with all-zero vectors
succeed, but return NaN, since cosine similarity is mathematically undefined for zero vectors.
We decided not to use arbitrary values, unlike USearch, where the distance
(not to be confused with similarity) is defined as cos(0, 0) = 0 and cos(0, x) = 1,
within the supported range of values [0, 2].
Converted to similarity, that would mean sim_cos(0, x) = 0.5, and there is no
mathematical reason why a zero vector should be considered more similar than,
for example, vectors at obtuse angles.
It is safe to assume that all-zero vectors should not affect cosine-similarity
results, therefore we return NaN and exclude them from the best results.
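
To recap why the value is undefined, here is a plain-Python sketch (illustration only, not the server implementation): cosine similarity divides the dot product by the product of the vector norms, and an all-zero vector makes that denominator zero.

```
import math

def similarity_cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Undefined for zero vectors: return NaN instead of an arbitrary
        # value, so such rows fall out of the "best results".
        return math.nan
    return dot / (norm_a * norm_b)

assert math.isnan(similarity_cosine([0.0, 0.0], [1.0, 2.0]))
```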

Adjusted the tests accordingly to check both Cassandra's and Scylla's behaviour.

Fixes: SCYLLADB-456
2026-02-11 12:31:47 +01:00
Pavel Emelyanov
2a3a56850c test: Fix the condition for streaming directions validation
Commit ea8a661119 tried to reduce the dataset for restoration tests.
While doing so, it effectively disabled part of itself -- the checks for
streaming directions were never run after this change. The thing is that
this check only runs if the restored tablet count matches a hardcoded value
of 512. This was the real dataset size of the test before the
aforementioned commit, but after it the count changed to other values, and
the comparison with 512 was always False.

Fix it with a local variable to prevent such mistakes in the future.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-11 12:55:27 +03:00
Pavel Emelyanov
f187dceb1a test: Split test_backup.py::check_data_is_back() into two
This method does two things -- checks that the data is indeed back, and
validates streaming directions. The latter is not quite about "data is
back", so it is better to have it as an explicit dedicated method.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-11 12:54:20 +03:00
Dawid Mędrek
f83f911bae test: cluster: Reduce wait time in test_sync_point
If everything is OK, the sync point will not resolve with node 3 dead.
As a result, the waiting will use all of the time we allocate for it,
i.e. 30 seconds. That's a lot of time.

There's no easy way to verify that the sync point will NOT resolve, but
let's at least reduce the waiting to 3 seconds. If there's a bug, it
should be enough to trigger it at some point, while reducing the average
time needed for CI.
2026-02-10 17:05:02 +01:00
Dawid Mędrek
a256ba7de0 test: cluster: Fix test_sync_point
The test had a few shortcomings that made it flaky or simply wrong:

1. We were verifying that hints were written by checking the size of
   in-flight hints. However, that could potentially lead to problems
   in rare situations.

   For instance, if all of the hints failed to be written to disk, the
   size of in-flight hints would drop to zero, and a sync point created
   at that moment would correspond to the empty state.

   In such a situation, we should fail immediately and indicate what
   the cause was.

2. A sync point corresponds to the hints that have already been written
   to disk. The number of those is tracked by the metric `written`.
   It's a much more reliable way to make sure that hints have been
   written to the commitlog. That ensures that the sync point we'll
   create will really correspond to those hints.

3. The auxiliary function `wait_for` used in the test works like this:
   it executes the passed callback and looks at the result. If it's
   `None`, it retries it. Otherwise, the callback is deemed to have
   finished its execution and no further retries will be attempted.

   Before this commit, the callback simply returned a bool, so `wait_for`
   treated `False` as a final result instead of retrying, and the check was
   wrong. We improve it.

Note that this fixes scylladb/scylladb#28203, which was a manifestation
of scylladb/scylladb#25879. We created a sync point that corresponded
to the empty state, and so it immediately resolved, even when node 3
was still dead.

Refs scylladb/scylladb#25879
Fixes scylladb/scylladb#28203
2026-02-10 17:05:02 +01:00
Dawid Mędrek
c5239edf2a test: cluster: Await sync points asynchronously
There's a dedicated HTTP API for communicating with the cluster, so
let's use it instead of yet another custom solution.
2026-02-10 17:05:02 +01:00
Dawid Mędrek
ac4af5f461 test: cluster: Create sync points asynchronously
There's a dedicated HTTP API for communicating with the nodes, so let's
use it instead of yet another custom solution.
2026-02-10 17:05:01 +01:00
Dawid Mędrek
628e74f157 test: cluster: Fetch hint metrics asynchronously
There's a dedicated API for fetching metrics now. Let's use it instead
of developing yet another solution that's also worse.
2026-02-10 17:04:59 +01:00
Pavel Emelyanov
875fd03882 test/object_store: Remove create_ks_and_cf() helper
Now all test cases use standard facilities to create the data they test

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-10 15:59:05 +03:00
Pavel Emelyanov
94176f7477 test/object_store: Replace create_ks_and_cf() usage with standard methods
To create a keyspace there's the new_test_keyspace helper.
The table is created with a single cql.run_async with an explicit schema.
The dataset is populated with a single parallel INSERT as well.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-10 15:58:05 +03:00
Pavel Emelyanov
6665cda23f test/object_store: Shift indentation right for test cases
This is a preparatory patch. The next one will need to replace

  foo()
  bar()

with

  with something() as s:
      foo()
      bar()

Effectively -- it only adds the `with something()` line. To avoid shifting the
whole file right together with that future change, do the shift here.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-10 15:56:27 +03:00
Ernest Zaslavsky
960adbb439 s3_client: add more constraints to the calc_part_size
Enforce more checks on part size and object size as defined in
"Amazon S3 multipart upload limits", see
https://docs.aws.amazon.com/AmazonS3/latest/userguide/qfacts.html and
https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingObjects.html
2026-02-10 13:15:07 +02:00
Ernest Zaslavsky
6280cb91ca s3_client: add tests for calc_part_size
Introduce tests that validate the corrected multipart part-size
calculation, including boundary conditions and error cases.
2026-02-10 13:13:26 +02:00
Ernest Zaslavsky
289e910cec s3_client: correct multipart part-size logic to respect 10k limit
The previous calculation could produce more than 10,000 parts for large
uploads because we mixed values in bytes and MiB when determining the
part size. This could result in selecting a part size small enough that the
part count still exceeded the AWS multipart upload limit. The updated logic now ensures
the number of parts never exceeds the allowed maximum.

This change also aligns the implementation with the code comment: we
prefer a 50 MiB part size because it provides the best performance, and
we use it whenever it fits within the 10,000-part limit. If it does not,
we increase the part size (in bytes, aligned to MiB) to stay within the
limit.
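
Roughly, the corrected logic looks like the following Python sketch (for illustration only; the real implementation is C++, and the 5 GiB/5 TiB caps are the documented AWS limits rather than values taken from this patch):

```
MiB = 1024 * 1024
PREFERRED_PART_SIZE = 50 * MiB            # best-performing part size
MAX_PARTS = 10_000                        # AWS multipart upload limit
MAX_PART_SIZE = 5 * 1024 * MiB            # 5 GiB per-part limit
MAX_OBJECT_SIZE = 5 * 1024 * 1024 * MiB   # 5 TiB object size limit

def calc_part_size(object_size: int) -> int:
    if object_size > MAX_OBJECT_SIZE:
        raise ValueError("object too large for S3 multipart upload")
    # Prefer 50 MiB whenever it keeps us within the 10,000-part limit.
    if object_size <= PREFERRED_PART_SIZE * MAX_PARTS:
        return PREFERRED_PART_SIZE
    # Otherwise pick the smallest part size (in whole MiB) giving <= 10,000 parts.
    part_size = -(-object_size // MAX_PARTS)   # ceiling division, in bytes
    part_size = -(-part_size // MiB) * MiB     # align up to a MiB boundary
    if part_size > MAX_PART_SIZE:
        raise ValueError("cannot satisfy the 10,000-part limit")
    return part_size
```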
2026-02-10 13:13:25 +02:00
Ernest Zaslavsky
7142b1a08d exceptions: add helper to build a chain of error handlers
Generalize error handling by creating an exception dispatcher which allows writing error handlers by sequentially applying handlers, the same way one would write `catch ()` blocks.
2026-02-09 08:48:41 +02:00
Ernest Zaslavsky
7fd62f042e http: extract error classification code
Move the HTTP-client-related error classification code to a common location for future reuse.
2026-02-09 08:48:41 +02:00
Ernest Zaslavsky
5beb7a2814 aws_error: extract retryable from aws_error
Move aws::retryable to a common location to reuse it later in other HTTP-based clients.
2026-02-09 08:48:41 +02:00
Pavel Emelyanov
44715a2d45 api: Remove map_to_key_value() helpers
All the callers had already been patched to stream their results
directly as json.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-09 08:52:50 +03:00
Pavel Emelyanov
dcbb5cb45b api: Streamify view_build_statuses handler
Similarly to the previous patch, the handler can stream the map of build
statuses. Unlike the previous patch, it doesn't need to fmt::format() the key
and value, as these are strings already.

It could be a map_to_json<string, string> partial specialization, but
there's so far only one caller, so probably not worth it yet.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-09 08:52:50 +03:00
Pavel Emelyanov
73512a59ff api: Streamify few more storage_service/ handlers
Like the get_token_endpoint handler, which streams the map it got from storage
service, get_ownership and get_effective_ownership can do the same.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-09 08:52:49 +03:00
Pavel Emelyanov
a4bd9037b3 api: Add map_to_json() helper
The get_token_endpoint handler converts an iterator of std::map into the
generated maplist_mapper type. The next patch will do the same for more
handlers, so it's good to have a helper converter for it.

As a nice side effect, it's possible to avoid a multiline lambda argument
to stream_range_as_array().

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-09 08:52:49 +03:00
Pavel Emelyanov
63cafab56c api: Coroutinize view_build_statuses handler
Further patching will be nicer if this handler is a coroutine

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-09 08:52:49 +03:00
Pawel Pery
81d11a23ce Revert "Merge 'vector_search: add validator tests' from Pawel Pery"
This reverts commit bcd1758911, reversing
changes made to b2c2a99741.

There is a design decision not to introduce an additional test
orchestration tool for scylladb.git (see the comments on #27499). One
commit has already been reverted in 55c7bc7. The last CI runs made the
validator test flaky, so it is time to remove all remaining validator tests.

This needs a backport to 2026.1 to remove the remaining validator tests from there.

Fixes: VECTOR-497

Closes scylladb/scylladb#28568
2026-02-08 16:29:58 +02:00
Avi Kivity
bb99bfe815 test: scylla_gdb: tighten check for Error output from gdb
When running a gdb command, we check that the string 'Error'
does not appear within the output. However, if the command output
includes the string 'Error' as part of its normal operation, this
generates a false positive. In fact, the task_histogram can include
the string 'error::Error' from the Rust core::error module.

Allow for that and only match 'Error' that isn't 'error::Error'.
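
One way to express that in Python (an illustrative regex, not necessarily the one used by scylla_gdb): a negative lookbehind so that 'Error' preceded by 'error::' does not count as a failure.

```
import re

# Match 'Error' unless it is part of the Rust 'error::Error' symbol.
ERROR_RE = re.compile(r"(?<!error::)Error")

assert ERROR_RE.search("gdb command output: Error while running hook")
assert not ERROR_RE.search("core::error::Error vtable in task_histogram")
```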

Fixes #28516.

Closes scylladb/scylladb#28574
2026-02-08 09:48:23 +02:00
Pavel Emelyanov
1b1aae8a0d test: Merge test_restore_primary_replica_different_... tests
The difference is very small:

@@ -1,18 +1,18 @@
 @pytest.mark.asyncio
 async def test_restore_primary_replica_different_...(manager: ManagerClient, object_storage):
     ''' Comment '''

-    topology = topo(rf = 2, nodes = 2, racks = 2, dcs = 1)
-    scope = "dc"
+    topology = topo(rf = 1, nodes = 2, racks = 1, dcs = 2)
+    scope = "all"
     ks = 'ks'
     cf = 'cf'

-    servers, host_ids = await create_cluster(topology, True, manager, logger, object_storage)
+    servers, host_ids = await create_cluster(topology, False, manager, logger, object_storage)

     await manager.disable_tablet_balancing()
     cql = manager.get_cql()
@@ -41,7 +41,6 @@ async def test_restore_primary_replica_d
         log = await manager.server_open_log(s.server_id)
         res = await log.grep(r'INFO.*sstables_loader - load_and_stream:.*target_node=([0-9a-z-]+),.*num_bytes_sent=([0-9]+)')
         streamed_to = set([ r[1].group(1) for r in res ])
-        logger.info(f'{s.ip_addr} {host_ids[s.server_id]} streamed to {streamed_to}')
+        logger.info(f'{s.ip_addr} {host_ids[s.server_id]} streamed to {streamed_to}, expected {servers}')
         assert len(streamed_to) == 2

The (removed in the above example) test description comments differ only
in their usage of "rack" and "dc" words.

Squashing them into one parametrized test makes perfect sense.
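
Schematically, the merge looks something like this (a sketch using the values visible in the diff above; fixture names and the parametrization shape are illustrative, not the exact resulting code):

```
import pytest

@pytest.mark.asyncio
@pytest.mark.parametrize("topology_args, create_arg, scope", [
    ((2, 2, 2, 1), True,  "dc"),    # former "different rack" variant
    ((1, 2, 1, 2), False, "all"),   # former "different dc" variant
])
async def test_restore_primary_replica_different(manager, object_storage,
                                                 topology_args, create_arg, scope):
    rf, nodes, racks, dcs = topology_args
    ...  # the shared body of the two former tests
```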

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-06 14:14:43 +03:00
Pavel Emelyanov
70988f9b61 test: Merge test_restore_primary_replica_same_... tests
The difference is very tiny:

@@ -1,12 +1,12 @@
 @pytest.mark.asyncio
 async def test_restore_primary_replica_same_...(manager: ManagerClient, object_storage):
     ''' comment '''

-    topology = topo(rf = 4, nodes = 8, racks = 2, dcs = 1)
-    scope = "rack"
+    topology = topo(rf = 4, nodes = 8, racks = 2, dcs = 2)
+    scope = "dc"
     ks = 'ks'
     cf = 'cf'

@@ -42,7 +42,7 @@ async def test_restore_primary_replica_s
         for r in res:
             nodes_by_operation[r[1].group(1)].append(r[1].group(2))

-        scope_nodes = set([ str(host_ids[s.server_id]) for s in servers if s.rack == servers[i].rack ])
+        scope_nodes = set([ str(host_ids[s.server_id]) for s in servers if s.datacenter == servers[i].datacenter ])
         for op, nodes in nodes_by_operation.items():
             logger.info(f'Operation {op} streamed to nodes {nodes}')
             assert len(nodes) == 1, "Each streaming operation should stream to exactly one primary replica"

The (removed in the above example) test description comments differ only
in their usage of "rack" and "dc" words.

Squashing them into one parametrized test makes perfect sense.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-06 14:12:23 +03:00
Pavel Emelyanov
98b4092153 test: Don't specify expected_replicas in test_restore_primary_replica_different_dc_scope_all
If not specified, the call would use the dcs * rf default, which matches this
test's parameters perfectly

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-06 14:11:47 +03:00
Pavel Emelyanov
1ac1e90b16 test: Remove local r_servers variable from test_restore_primary_replica_different_dc_scope_all
It merely copies the `servers` variable; there is no need for it

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-06 14:11:22 +03:00
Anna Stuchlik
dc8f7c9d62 doc: replace the OS Support page with a link to the new location
We've moved that page to another place; see https://github.com/scylladb/scylladb/issues/28561.
This commit replaces the page with the link to the new location
and adds a redirection.

Fixes https://github.com/scylladb/scylladb/issues/28561

Closes scylladb/scylladb#28562
2026-02-06 11:38:21 +02:00
Tomasz Grabiec
41930c0176 tablets, config: Reduce migration concurrency to 2
Tablet migration keeps an sstable snapshot during streaming, which may
cause a temporary increase in disk utilization if compaction is running
concurrently. SSTables compacted away are kept on disk until streaming
is done with them. The more tablets we allow to migrate concurrently,
the higher disk usage can rise. When the target tablet size is
configured correctly, every tablet should own about 1% of disk
space, so a concurrency of 4 shouldn't put us at risk. But the target tablet
size is not chosen dynamically yet, and it may not be aligned with
disk capacity.

Also, tablet sizes can temporarily grow above the target, up to 2x
before the split starts, and some more because splits take a while to
complete.

To reduce the impact from this, reduce the concurrency of
migration. A concurrency of 2 should still be enough to saturate
resources on the leaving shard.

Also, reducing concurrency means that load balancing is more
responsive to preemption. There will be less bandwidth sharing, so
scheduled migrations complete faster. This is important for scale-out,
where we bootstrap a node and want to start migrations to that new
node as soon as possible.

Refs scylladb/siren#15317
2026-02-06 00:42:19 +01:00
Tomasz Grabiec
56e40e90c9 tablets: load_balancer: Always accept migration if the load is 0
Different transitions have different weights, and limits are
configurable.  We don't want a situation where a high-cost migration
is cut off by limits and the system can make no progress.

For example, repair uses weight 2 for read concurrency.  Migrating
co-located tablets scales the cost by the number of co-located
tablets.
2026-02-06 00:42:18 +01:00
Tomasz Grabiec
39492596c2 config, tablets: Make tablet migration concurrency configurable
We're about to reduce it. It's better not to have it hard-coded in
case we change our minds again.
2026-02-06 00:42:18 +01:00
Avi Kivity
7a3ce5f91e test: minio: disable web console
minio starts a web console on a random port. This was seen to interfere
with the nodetool tests when the web console port clashed with the mock
API port.

Fix by disabling the web console.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-496

Closes scylladb/scylladb#28492
2026-02-05 20:11:32 +02:00
Nikos Dragazis
5d1e6243af test/cluster: Remove short_tablet_stats_refresh_interval injection
The test `test_size_based_load_balancing.py::test_balance_empty_tablets`
waits for tablet load stats to be refreshed and uses the
`short_tablet_stats_refresh_interval` injection to speed up the refresh
interval.

This injection has no effect; it was replaced by the
`tablet_load_stats_refresh_interval_in_seconds` config option (patch: 1d6808aec4),
so the test currently waits for 60 seconds (default refresh interval).

Use the config option. This reduces the execution time to ~8 seconds.

Fixes SCYLLADB-556.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>

Closes scylladb/scylladb#28536
2026-02-05 20:11:32 +02:00
Pavel Emelyanov
10c278fff7 database: Remove _flush_sg member from replica::database
This field is only used to initialize the following _memtable_controller
one. It's simpler just to do that initialization with whatever value the
field itself would be initialized with, and drop the field.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28539
2026-02-05 13:02:35 +02:00
Petr Hála
a04dbac369 open-coredump: Change to use new backtrace
* This is a breaking change, which removes compatibility with the old backtrace
    - See https://staging.backtrace.scylladb.com/api/docs#/default/search_by_build_id_search_build_id_post for the APIDoc
* Add timestamp field to log
* Tested locally

Closes scylladb/scylladb#28325
2026-02-05 11:50:47 +02:00
Marcin Maliszkiewicz
0753d9fae5 Merge 'test: remove xfail marker from a few passing tests' from Nadav Har'El
This patch fixes the few remaining cases of XPASS in test/cqlpy and test/alternator.
These are tests which, when written, reproduced a bug and therefore were marked "xfail", but some time later the bug was fixed and we either did not notice it was ever fixed, or just forgot to remove the xfail marker.

Removing the no-longer-needed xfail markers is good for test hygiene, but more importantly is needed to avoid regressions in those already-fixed areas (if a test is already marked xfail, it can start to fail in a new way and we wouldn't notice).

Backport not needed, xpass doesn't bother anyone.

Closes scylladb/scylladb#28441

* github.com:scylladb/scylladb:
  test/cqlpy: remove xfail from tests for fixed issue 7972
  test/cqlpy: remove xfail from tests for fixed issue 10358
  test/cqlpy: remove xfail from passing test testInvalidNonFrozenUDTRelation
  test/alternator: remove xfail from passing test_update_item_increases_metrics_for_new_item_size_only
2026-02-05 10:10:43 +01:00
Marcin Maliszkiewicz
6eca74b7bb Merge 'More Alternator tests for BatchWriteItem' from Nadav Har'El
The goal of this small pull request is to reproduce issue #28439, which found a bug in the Alternator Streams output when BatchWriteItem is called to write multiple items in the same partition, and always_use_lwt write isolation mode is used.

* The first patch reproduces this specific bug in Alternator Streams.
* The second patch adds missing (Fixes #28171) tests for BatchWriteItem in different write modes, and shows that BatchWriteItem itself works correctly - the bug is just in Alternator Streams' reporting of this write.

Closes scylladb/scylladb#28528

* github.com:scylladb/scylladb:
  test/alternator: add test for BatchWriteItem with different write isolations
  test/alternator: reproducer for Alternator Streams bug
2026-02-05 10:07:29 +01:00
Yaron Kaikov
b30ecb72d5 ci: fix PR number extraction for unlabeled events
When the workflow is triggered by removing the 'conflicts' label
(pull_request_target unlabeled event), github.event.issue.number is
not available. Use github.event.pull_request.number as fallback.

Fixes: https://scylladb.atlassian.net/browse/RELENG-245

Closes scylladb/scylladb#28543
2026-02-05 08:41:43 +02:00
Michał Hudobski
6b9fcc6ca3 auth: add CDC streams and timestamps to vector search permissions
It turns out that the cdc driver requires permissions to two additional system tables. This patch adds them to VECTOR_SEARCH_INDEXING and modifies the unit tests. The integration with vector store was tested manually, integration tests will be added in vector-store repository in a follow up PR.

Fixes: SCYLLADB-522

Closes scylladb/scylladb#28519
2026-02-04 09:10:08 +01:00
Nadav Har'El
47e827262f test/alternator: add test for BatchWriteItem with different write isolations
Alternator's various write operations have different code paths for the
different write isolation modes. Because most of the test suite runs in
only a single write mode (currently - only_rmw_uses_lwt), we already
introduced a test file test/alternator/test_write_isolation.py for
checking the different write operations in *all* four write isolation
modes.

But we missed testing one write operation - BatchWriteItem. This
operation isn't very "interesting" because it doesn't support *any*
read-modify-option option (it doesn't support UpdateExpression,
ConditionExpression or ReturnValues), but even without those, the
pure write code still has different code paths with and without LWT,
and should be tested. So we add the missing test here - and it passes.

In issue #28439 we discovered a bug that can be seen in Alternator
Streams in the case of BatchWriteItem with multiple writes to the
same partition and always_use_lwt mode. The fact that the test added
here passes shows that the bug is NOT in BatchWriteItem itself, which
works correctly in this case - but only in the Alternator Streams layer.

Fixes #28171

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-04 09:24:29 +02:00
Nadav Har'El
c63f43975f test/alternator: reproducer for Alternator Streams bug
This patch adds a reproducer for an Alternator Streams bug described in
issue #28439, where the stream returns the wrong events (and fewer of
them) in the following specific combination of circumstances:

1. A BatchWriteItem operation writing multiple items to the *same*
   partition.

2. The "always_use_lwt" write isolation mode is used. (the bug doesn't
   occur in other write isolation modes).

We didn't catch this bug earlier because the Alternator Streams test
we had for BatchWriteItem had multiple items in multiple partitions,
and we missed the multiple-items-in-one-partition case. Moreover,
today we run all the tests in only_rmw_uses_lwt mode (in the past,
we did use always_use_lwt, but changed recently in commit e7257b1393
following commit 76a766c that changed test.py).

As issue #28439 explains, the underlying cause of the bug is that the
always_use_lwt causes the multiple items to be written with the same
timestamp, which confused the Alternator Streams code reading the CDC
log. The bug is not in BatchWriteItem itself, or in ScyllaDB CDC, but
just in the Alternator Streams layer.

The test in this patch is parameterized to run on each of the four
write isolation modes, and currently fails (and so is marked xfail) just
for the one mode 'always_use_lwt'. The test is scylla_only, as its
purpose is to check the different write isolation modes - which don't
exist in AWS DynamoDB.

Refs #28439

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-04 09:17:48 +02:00
Radosław Cybulski
03ff091bee alternator: improve events output when test failed
Improve event printing when a test in test_streams.py fails.
The new code will print both expected and received events (keys, previous
image, new image and type).
The new code will explicitly mark at which output event the comparison failed.

Fixes #28455

Closes scylladb/scylladb#28476
2026-02-03 21:55:07 +02:00
Anna Stuchlik
a427ad3bf9 doc: remove the link to the Open Source blog post
Fixes https://github.com/scylladb/scylladb/issues/28486

Closes scylladb/scylladb#28518
2026-02-03 14:15:16 +01:00
Botond Dénes
3adf8b58c4 Merge 'test: pylib: scylla_cluster: set shutdown_announce_in_ms to 0' from Patryk Jędrzejczak
The usual Scylla shutdown in a cluster test takes ~2.1s. 2s come from
```
co_await sleep(std::chrono::milliseconds(_gcfg.shutdown_announce_ms));
```
as the default value of `shutdown_announce_in_ms` is 2000. This sleep
makes every `server_stop_gracefully` call 2s slower. There are ~300 such
calls in cluster tests (note that some come from `rolling_restart`). So,
it looks like this sleep makes cluster tests 300 * 2s = 10min slower.
Indeed, `./test.py --mode=dev cluster` takes 61min instead of 71min
on the potwor machine (the one in the Warsaw office) without it.

We set `shutdown_announce_in_ms` to 0 for all cluster tests to make them
faster.

The sleep is completely unnecessary in tests. Removing it could introduce
flakiness, but if that's the case, then the test for which it happens is
incorrect in the first place. Tests shouldn't assume that all nodes
receive and handle the shutdown message in 2s. They should use functions
like `server_not_sees_other_server` instead, which are faster and more
reliable.

An improvement of the tests' running time, so no backport. The fix of
`test_tablets_parallel_decommission` may have to be backported to
2026.1, but it can be done manually.

Closes scylladb/scylladb#28464

* github.com:scylladb/scylladb:
  test: pylib: scylla_cluster: set shutdown_announce_in_ms to 0
  test: test_tablets_parallel_decommission: prevent group0 majority loss
  test: delete test_service_levels_work_during_recovery
2026-02-03 08:19:05 +02:00
Pavel Emelyanov
19ea05692c view_build_worker: Do not switch scheduling groups inside work_on_view_building_tasks
The handler appeared back in c9e710dca3. In this commit it performed the
"core" part of the task -- the do_build_range() method -- inside the
streaming sched group. The setup code was seemingly copied from the
view_builder::do_build_step() method and inherited the explicit switch of the
scheduling group.

The switch looks both justified and not. On one hand, it makes it
explicit that the activity runs in the streaming scheduling group. On the
other hand, the verb already uses RPC index 1, which is negotiated to
run in the streaming group anyway. On the "third hand", even though
explicit, the switch happens too late, as there is a lot of other
activity performed by the handler that seems to also belong to the
same scheduling group, but which is not switched into explicitly.

By and large, it seems better to avoid the explicit switch and rely on
the RPC-level negotiation-based sched group switching.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28397
2026-02-03 07:00:32 +02:00
Anna Stuchlik
77480c9d8f doc: fix the links on the repair-related pages
This is a follow-up to https://github.com/scylladb/scylladb/pull/28199.

This commit fixes the syntax of the internal links.

Fixes https://github.com/scylladb/scylladb/issues/28486

Closes scylladb/scylladb#28487
2026-02-03 06:54:08 +02:00
Botond Dénes
64b38a2d0a Merge 'Use gossiper scheduling group where needed' from Pavel Emelyanov
This is the continuation of #28363, this time about getting the gossiper scheduling group via the database.
Several places that do it already have the gossiper at hand and should rather get the group from it.
Eventually, this will allow getting rid of database::get_gossip_scheduling_group().

Refining inter-components API, not backporting

Closes scylladb/scylladb#28412

* github.com:scylladb/scylladb:
  gossiper: Export its scheduling group for those who need it
  migration_manager: Reorder members
2026-02-03 06:51:31 +02:00
Nadav Har'El
48b01e72fa test/alternator: add test verifying that keys only allow S/B/N type
Recently we had a question whether key columns can have any supported
type. I knew that actually they can't - key columns can have only
the types S(tring), B(inary) or N(umber), and that is all. But it turns
out we never had a test that confirms this understanding is true.

We did have a test for it for GSI key types already,
test_gsi.py::test_gsi_invalid_key_types, but we didn't have one for the
base table. So in this patch we add this missing test, and confirm that,
indeed, both DynamoDB and Alternator refuse a key attribute with any
type other than S, B or N.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#28479
2026-02-03 06:49:02 +02:00
Andrei Chekun
ed9a96fdb7 test.py: modify logic for adding function_path in JUnit
The current logic checks for a failure only during the test phase, and will
miss the cases when a failure happens in another phase. This PR eliminates
this, so every phase will have a modified node reporter to enrich the
JUnit XML report with the custom attribute function_path.

Closes scylladb/scylladb#28462
2026-02-03 06:42:18 +02:00
Andrei Chekun
3a422e82b4 test.py: fix the file name in test summary
The current logic always assumes that the error happened in the test file,
but that is not always true. This PR will show the error from the boost
logger, where the error actually happened.

Closes scylladb/scylladb#28429
2026-02-03 06:38:21 +02:00
Benny Halevy
84caa94340 gossiper: add_expire_time_for_endpoint: replace fmt::localtime with gmtime in log printout
1. fmt::localtime is deprecated.
2. We should really print times in UTC, especially on the cloud.
3. The current log message does not print the timezone, so it'd be unclear
   to anyone reading the log message whether the expiration time is in the
   local timezone or in GMT/UTC.

Fixes the following warning:
```
gms/gossiper.cc:2428:28: warning: 'localtime' is deprecated [-Wdeprecated-declarations]
 2428 |             endpoint, fmt::localtime(clk::to_time_t(expire_time)), expire_time.time_since_epoch().count(),
      |                            ^
/usr/include/fmt/chrono.h:538:1: note: 'localtime' has been explicitly marked deprecated here
  538 | FMT_DEPRECATED inline auto localtime(std::time_t time) -> std::tm {
      | ^
/usr/include/fmt/base.h:207:28: note: expanded from macro 'FMT_DEPRECATED'
  207 | #  define FMT_DEPRECATED [[deprecated]]
      |                            ^
```

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#28434
2026-02-03 06:36:53 +02:00
Pavel Emelyanov
8c42704c72 storage_service: Check raft rpc scheduling group from debug namespace
Some storage_service rpc verbs may check that a handler is executed
inside the gossiper scheduling group. For that, the expected group is
grabbed from the database.

This patch puts the gossiper sched group into the debug namespace and makes
this check use it from there. It removes one more place that uses
the database as a config provider.

Refs #28410

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28427
2026-02-03 06:34:03 +02:00
Asias He
b5c3587588 repair: Add request type in the tablet repair log
So we can know if the repair is an auto repair or a user repair.

Fixes SCYLLADB-395

Closes scylladb/scylladb#28425
2026-02-03 06:26:58 +02:00
Nadav Har'El
a63ad48b0f test/cqlpy: remove xfail from tests for fixed issue 7972
The test test_to_json_double used to fail due to #7972, but this issue
was already fixed in Scylla 5.1 and we didn't notice.
So remove the xfail marker from this test, and also update another test
which still xfails but no longer due to this issue.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-02 23:49:32 +02:00
Nadav Har'El
10b81c1e97 test/cqlpy: remove xfail from tests for fixed issue 10358
The tests testWithUnsetValues and testFilteringWithoutIndices used to fail
due to #10358, but this issue was already fixed three years ago, when the
UNSET-checking code was cleaned up, and the test is now passing.
So remove the xfail marker from these tests.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-02 23:49:31 +02:00
Nadav Har'El
508bb97089 test/cqlpy: remove xfail from passing test testInvalidNonFrozenUDTRelation
The test testInvalidNonFrozenUDTRelation used to fail due to #10632
(an incorrectly-printed column name in an error message) and was marked
"xfail". But this issue has already been fixed two years ago, and
the test is now passing. So remove the xfail marker.
2026-02-02 23:49:31 +02:00
Nadav Har'El
3682c06157 test/alternator: remove xfail from passing test_update_item_increases_metrics_for_new_item_size_only
The test test_metrics.py::test_update_item_increases_metrics_for_new_item_size_only
tests whether the Alternator metrics report the exactly-DynamoDB-compatible
WCU number. It is parameterized with two cases - one that uses
alternator_force_read_before_write and one which doesn't.

The case that uses alternator_force_read_before_write is expected to
measure the "accurate" WCU, and currently it doesn't, so the test
rightly xfails.
But the case that doesn't use alternator_force_read_before_write is not
expected to measure the "accurate" WCU and has a different expectation,
so this case actually passes. But because the entire test is marked
xfail, it is reported as "XPASS" - unexpected pass.

Fix this by marking only the "True" case with xfail, while the "False"
case is not marked. After this patch, the True case continues to XFAIL
and the False case passes normally, instead of XPASS.
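
In pytest terms this is done by attaching the marker to a single parameter (a generic illustration, not the exact test code):

```
import pytest

@pytest.mark.parametrize("force_read_before_write", [
    pytest.param(True, marks=pytest.mark.xfail(reason="accurate WCU not reported yet")),
    False,
])
def test_update_item_increases_metrics_for_new_item_size_only(force_read_before_write):
    ...
```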

Also removed a sentence promising that the failing case will be solved
"by the next PR". Clearly this didn't happen. Maybe we even have such
a PR open (?), but it won't be "the next PR" even if merged today.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-02-02 23:49:31 +02:00
Nadav Har'El
df69dbec2a Merge ' cql3/statements/describe_statement: hide paxos state tables ' from Michał Jadwiszczak
Paxos state tables are internal tables fully managed by Scylla;
they shouldn't be exposed to the user, nor should they be backed up.

This commit hides these tables from all listings, and if such a table
is directly described with `DESC ks."tbl$paxos"`, the description is generated
within a comment and a note for the user is added.

Fixes https://github.com/scylladb/scylladb/issues/28183

LWT on tablets and paxos state tables are present in 2025.4, so the patch should be backported to this version.

Closes scylladb/scylladb#28230

* github.com:scylladb/scylladb:
  test/cqlpy: add reproducer for hidden Paxos table being shown by DESC
  cql3/statements/describe_statement: hide paxos state tables
2026-02-02 21:22:59 +02:00
Nadav Har'El
f23e796e76 alternator: fix typos in comments and variable names
Copilot found these typos in comments and variable names in alternator/,
so we might as well fix them.

There are no functional changes in this patch.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#28447
2026-02-02 19:16:43 +03:00
Marcin Maliszkiewicz
88c4ca3697 Merge 'test: migrate guardrails_test.py from scylla-dtest' from Andrzej Jackowski
This patch series copies `guardrails_test.py` from scylla-dtest, fixes it and enables it.

The motivation is to unify the test execution of the guardrails tests, as some tests (`cqlpy/test_guardrail_...`) were already in the scylladb repo, and some were in `scylla-dtest`.

Fixes: SCYLLADB-255

No backport, just test migration

Closes scylladb/scylladb#28454

* github.com:scylladb/scylladb:
  test: refactor test_all_rf_limits in guardrails_test.py
  test: specify exceptions being caught in guardrails_test.py
  test: enable guardrails_test.py
  test: add wait_other_notice to test_default_rf in guardrails_test.py
  test: copy guardrails_test.py from scylla-dtest
2026-02-02 16:54:13 +01:00
Avi Kivity
acc54cf304 tools: toolchain: adapt future toolchain to loss of toxiproxy in Fedora
Next Fedora will likely not have toxiproxy packaged [1]. Adapt
by installing it directly. To avoid changing the current toolchain,
add a ./install-dependencies --future option. This will allow us
to easily go back to the packages if the Fedora bug is fixed.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=2426954

Closes scylladb/scylladb#28444
2026-02-02 17:02:19 +02:00
Avi Kivity
419636ca8f test: ldap: regularize toxiproxy command-line options
Modern toxiproxy interprets `-h` as help and requires the subcommand
subject (e.g. the proxy name) to be after the subcommand switches.
Arrange the command line in the way it likes, and spell out the
subcommands to be more comprehensible.

Closes scylladb/scylladb#28442
2026-02-02 17:00:58 +02:00
Botond Dénes
2b3f3d9ba7 Merge 'test.py: support boost labels in test.py' from Artsiom Mishuta
related PR: https://github.com/scylladb/scylladb/pull/27527

This PR changes the test.py logic of parsing boost test cases to use -- --list_json_content
and passes boost labels as pytest markers.

Using -- --list_json_content is not ideal and currently requires implementing several [workarounds](https://github.com/scylladb/scylladb/pull/27527#issuecomment-3765499812), but having the ability to support boost labels in pytest is worth it, because now we can apply the tiering mechanism to the boost tests as well.

Fixes SCYLLADB-246

Closes scylladb/scylladb#28232

* github.com:scylladb/scylladb:
  test: add nightly label
  test.py: support boost labels in test.py
2026-02-02 16:55:29 +02:00
Dawid Mędrek
68981cc90b Merge 'raft topology: generate notification about released nodes only once' from Piotr Dulikowski
Hints destined for some other node can only be drained after the other node is no longer a replica of any vnode or tablet. When tablets are present, a node might still technically be a replica of some tablets after it has moved to the left state. When it no longer is a replica of any tablet, it becomes "released" and the storage service generates a notification about it. Hinted handoff listens to this notification and kicks off draining hints after getting it.

The current implementation of the "released" notification would trigger every time the raft topology state is reloaded and a left node without any tokens is present in the raft topology. Although draining hints is idempotent, generating duplicate notifications is wasteful and recently became very noisy after the verbosity of the draining-related log messages was increased in 44de563. The verbosity increase itself makes sense, as draining is supposed to be a rare operation, but the duplicate notification bug now needs to be addressed.

Fix the duplicate notification problem by passing the list of previously released nodes to the `storage_service::raft_topology_update_ip` function and filtering based on it. If this function processes the topology state for the first time, it will not produce any notifications. This is fine as hinted handoff is prepared to detect "released" nodes during the startup sequence in main.cc and start draining the hints there, if needed.

Fixes: scylladb/scylladb#28301
Refs: scylladb/scylladb#25031

The log messages added in 44de563 cause a lot of noise during topology operations and tablet migrations, so the fix should be backported to all affected versions (2025.4 and 2026.1).

Closes scylladb/scylladb#28367

* github.com:scylladb/scylladb:
  storage_service: fix indentation after previous patch
  raft topology: generate notification about released nodes only once
  raft topology: extract "released" nodes calculation to external function
2026-02-02 15:39:15 +01:00
Jenkins Promoter
c907fc6789 Update pgo profiles - aarch64 2026-02-02 14:56:49 +02:00
Dawid Mędrek
b0afd3aa63 Merge 'storage_service: set up topology properly in maintenance mode' from Patryk Jędrzejczak
We currently make the local node the only token owner (that owns the
whole ring) in maintenance mode, but we don't update the topology properly.
The node is present in the topology, but in the `none` state. That's how
it's inserted by `tm.get_topology().set_host_id_cfg(host_id);` in
`scylla_main`. As a result, the node started in maintenance mode crashes
in the following way in the presence of a vnodes-based keyspace with the
NetworkTopologyStrategy:
```
scylla: locator/network_topology_strategy.cc:207:
    locator::natural_endpoints_tracker::natural_endpoints_tracker(
    const token_metadata &, const network_topology_strategy::dc_rep_factor_map &):
    Assertion `!_token_owners.empty() && !_racks.empty()' failed.
```
Both `_token_owners` and `_racks` are empty. The reason is that
`_tm.get_datacenter_token_owners()` and
`_tm.get_datacenter_racks_token_owners()` called above filter out nodes
in the `none` state.

This bug basically made maintenance mode unusable in customer clusters.

We fix it by changing the node state to `normal`.

We also extend `test_maintenance_mode` to provide a reproducer for this bug.

Fixes #27988

This PR must be backported to all branches, as maintenance mode is
currently unusable everywhere.

Closes scylladb/scylladb#28322

* github.com:scylladb/scylladb:
  test: test_maintenance_mode: enable maintenance mode properly
  test: test_maintenance_mode: shutdown cluster connections
  test: test_maintenance_mode: run with different keyspace options
  test: test_maintenance_mode: check that group0 is disabled by creating a keyspace
  test: test_maintenance_mode: get rid of the conditional skip
  test: test_maintenance_mode: remove the redundant value from the query result
  storage_proxy: skip validate_read_replica in maintenance mode
  storage_service: set up topology properly in maintenance mode
2026-02-02 13:28:19 +01:00
Andrzej Jackowski
298aca7da8 test: refactor test_all_rf_limits in guardrails_test.py
Before this commit, `test_all_rf_limits` was implemented in a
repetitive manner, making it harder to understand how the guardrails
were tested. This commit refactors the test to reduce code redundancy
and verify the guardrails more explicitly.
2026-02-02 10:49:12 +01:00
Andrzej Jackowski
136db260ca test: specify exceptions being caught in guardrails_test.py
Before this commit, the test caught a broad `Exception`. This change
specifies the expected exceptions to avoid a situation where the product
or test is broken and it goes undetected.
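
For example (a generic sketch, assuming the Python driver's InvalidRequest error and a cql fixture, not the test's actual code), expecting a specific exception keeps unrelated breakage visible:

```
import pytest
from cassandra import InvalidRequest

def test_rf_guardrail_rejected(cql):
    # A broad `except Exception` would also swallow connection or syntax
    # errors; pytest.raises with the specific exception does not.
    with pytest.raises(InvalidRequest):
        cql.execute("CREATE KEYSPACE ks WITH replication = "
                    "{'class': 'NetworkTopologyStrategy', 'replication_factor': 100}")
```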
2026-02-02 10:48:07 +01:00
Patryk Jędrzejczak
ec2f99b3d1 test: pylib: scylla_cluster: set shutdown_announce_in_ms to 0
The usual Scylla shutdown in a cluster test takes ~2.1s. 2s come from
```
co_await sleep(std::chrono::milliseconds(_gcfg.shutdown_announce_ms));
```
as the default value of `shutdown_announce_in_ms` is 2000. This sleep
makes every `server_stop_gracefully` call 2s slower. There are ~300 such
calls in cluster tests (note that some come from `rolling_restart`). So,
it looks like this sleep makes cluster tests 300 * 2s = 10min slower.
Indeed, `./test.py --mode=dev cluster` takes 61min instead of 71min
on the potwor machine (the one in the Warsaw office) without it.

We set `shutdown_announce_in_ms` to 0 for all cluster tests to make them
faster.

The sleep is completely unnecessary in tests. Removing it could introduce
flakiness, but if that's the case, then the test for which it happens is
incorrect in the first place. Tests shouldn't assume that all nodes
receive and handle the shutdown message in 2s. They should use functions
like `server_not_sees_other_server` instead, which are faster and more
reliable.
2026-02-02 10:39:55 +01:00
Patryk Jędrzejczak
1f28a55448 test: test_tablets_parallel_decommission: prevent group0 majority loss
Both of the changed test cases stop two out of four nodes when there are
three group0 voters in the cluster. If one of the two live nodes is
a non-voter (node 1, specifically, as node 0 is the leader), a temporary
majority loss occurs, which can cause the following operations to fail.
In the case of `test_tablets_are_rebuilt_in_parallel`, the `exclude_node`
API can fail. In the case of `test_remove_is_canceled_if_there_is_node_down`,
removenode can fail with an unexpected error message:
```
"service::raft_operation_timeout_error (group
[46dd9cf1-fe21-11f0-baa0-03429f562ff5] raft operation [read_barrier] timed out)"
```

Somehow, these test cases are currently not flaky, but they become flaky in
the following commit.

We can consider backporting this commit to 2026.1 to prevent flakiness.
2026-02-02 10:39:55 +01:00
Patryk Jędrzejczak
bcf0114e90 test: delete test_service_levels_work_during_recovery
The test becomes flaky in one of the following commits. However, there is
no need to fix it, as we should delete it anyway. We are in the process of
removing the gossip-based topology from the code base, which includes the
recovery mode. We don't have to rewrite the test to use the new Raft-based
recovery procedure, as there is nothing interesting to test (no regression
to legacy service levels).
2026-02-02 10:39:54 +01:00
Artsiom Mishuta
af2d7a146f test: add nightly label
Add the nightly label for the test
test_foreign_reader_as_mutation_source
as an example of using boost labels as pytest markers.

Command to test:
./tools/toolchain/dbuild  pytest --test-py-init --collect-only -q -m=nightly test/boost

output:
boost/mutation_reader_test.cc::test_foreign_reader_as_mutation_source.debug.1
boost/mutation_reader_test.cc::test_foreign_reader_as_mutation_source.release.1
boost/mutation_reader_test.cc::test_foreign_reader_as_mutation_source.dev.1
2026-02-02 10:30:38 +01:00
Gleb Natapov
08268eee3f topology: disable force-gossip-topology-changes option
The patch marks force-gossip-topology-changes as deprecated and removes
tests that use it. There is one test (test_different_group0_ids) which
is marked as xfail instead, since it looks like gossiper mode was used
there as a way to easily achieve a certain state, so more investigation
is needed to see whether the test can be fixed to use raft mode instead.

Closes scylladb/scylladb#28383
2026-02-02 09:56:32 +01:00
Avi Kivity
ceec703bb7 Revert "main: test: add future and abort_source to after_init_func"
This reverts commit 7bf7ff785a. The commit
tried to add clean shutdown to `scylla perf` paths, but forgot at least
`scylla perf-alternator --workload wr` which now crashes on uninitialized
`c.as`.

Fixes #28473

Closes scylladb/scylladb#28478
2026-02-02 09:22:24 +01:00
Avi Kivity
cc03f5c89d cql3: support literals and bind variables in selectors
Add support for literals in the SELECT clause. This allows
SELECT fn(column, 4) or SELECT fn(column, ?).

Note, "SELECT 7 FROM tab" becomes valid in the grammar, but is still
not accepted because of failed type inference - we cannot infer the
type of 7, and don't have a favored type for literals (like C favors
int). We might relax this later.

In the WHERE clause (and, in Cassandra, also in the SELECT clause), type hints
can also resolve type ambiguity: (bigint)7 or (text)?. But this is
deferred to a later patch.

A few changes to the grammar are needed on top of adding a `value`
alternative to `unaliasedSelector`:

 - vectorSimilarityArg gained access to `value` via `unaliasedSelector`,
   so it loses that alternate to avoid ambiguity. We may drop
   `vectorSimilarityArg` later.
 - COUNT(1) became ambiguous via the general function path (since
   function arguments can now be literals), so we remove this case
   from the COUNT special cases, remaining with count(*).
 - SELECT JSON and SELECT DISTINCT became "ambiguous enough" for
   ANTLR to complain, though as far as I can tell `value` does not
   add real ambiguity. The solution is to commit early (via "=>") to
   a parsing path.

Due to the loss of count(1) recognition in the parser, we have to
special-case it in prepare. We may relax it to count any expression
later, like modern Cassandra and SQL.

Testing is awkward because of the type inference problem at the top level.
We test via the set_intersection() function and via lua functions.

Example:

```
cqlsh> CREATE FUNCTION ks.sum(a int, b int) RETURNS NULL ON NULL INPUT RETURNS int  LANGUAGE LUA AS 'return a + b';
cqlsh> SELECT ks.sum(1, 2) FROM system.local;

 ks.sum(1, 2)
--------------
            3

(1 rows)
cqlsh>
```

(There are no suitable system functions!)

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-296

Closes scylladb/scylladb#28256
2026-02-02 00:06:13 +02:00
Patryk Jędrzejczak
68b105b21c db: virtual tables: add the rack column to cluster_status
`system.cluster_status` is missing the rack info compared to `nodetool status`,
which is supposed to be equivalent. It has probably been an omission.

Closes scylladb/scylladb#28457
2026-02-01 20:36:53 +01:00
Pavel Emelyanov
6f3f30ee07 storage_service: Use stream_manager group for streaming
The handler of raft_topology_cmd::command::stream_ranges switches to the
streaming scheduling group to perform data streaming in it. It grabs the
group from the database db_config, which is not great. There's a stream
manager at hand in the storage service handlers; since the handler is using its
functionality, it should use _its_ scheduling group.

This will help split the streaming scheduling group into more
elaborate groups under the maintenance supergroup: SCYLLADB-351

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28363
2026-02-01 20:42:37 +02:00
Marcin Maliszkiewicz
b8c75673c8 main: remove confusing duplicated auth start message
Before, we could observe two identical
"starting auth service" messages in the log:
one from checkpoint(), the other from notify().
We remove the second one to stay consistent
with other services.

Closes scylladb/scylladb#28349
2026-02-01 13:57:53 +02:00
Avi Kivity
6676953555 Merge 'test: perf: add option to write results to json in perf-cql-raw and perf-alternator' from Marcin Maliszkiewicz
Adds --json-result option to perf-cql-raw and perf-alternator, the same as perf-simple-query has.
It is useful for automating test runs.

Related: https://scylladb.atlassian.net/browse/SCYLLADB-434

Backport: no, the original benchmark is not backported

Closes scylladb/scylladb#28451

* github.com:scylladb/scylladb:
  test: perf: add example commands to perf-alternator and perf-cql-raw
  test: perf: add option to write results to json in perf-cql-raw
  test: perf: add option to write results to json in perf-alternator
  test: perf: move write_json_result to a common file
2026-02-01 13:57:10 +02:00
Artsiom Mishuta
e216504113 test.py: support boost labels in test.py
related PR: https://github.com/scylladb/scylladb/pull/27527

This PR changes the test.py logic of parsing boost test cases to use -- --list_json_content
and passes boost labels as pytest markers

fixes: https://github.com/scylladb/scylladb/issues/25415
2026-02-01 11:31:26 +01:00
Tomasz Grabiec
b93472d595 Merge 'load_stats: fix problem with load_stats refresh throwing no_such_column_family' from Ferenc Szili
When the topology coordinator refreshes load_stats, it caches load_stats for every node. In case the node becomes unresponsive, and fresh load_stats can not be read from the node, the cached version of load_stats will be used. This is to allow the load balancer to have at least some information about the table sizes and disk capacities of the host.

During load_stats refresh, we aggregate the table sizes from all the nodes. This procedure calls db.find_column_family() for each table_id found in load_stats. This function will throw if the table is not found. This will cause load_stats refresh to fail.

It is also possible for a table to have been dropped between the time load_stats has been prepared on the host, and the time it is processed on the topology coordinator. This would also cause an exception in the refresh procedure.

This patch fixes the problem by checking whether the table still exists.

Fixes: #28359

Closes scylladb/scylladb#28440

* github.com:scylladb/scylladb:
  test: add test and reproducer for load_stats refresh exception
  load_stats: handle dropped tables when refreshing load_stats
2026-01-31 21:12:19 +01:00
Ferenc Szili
92dbde54a5 test: add test and reproducer for load_stats refresh exception
This patch adds a test and reproducer for the issue where the load_stats
refresh procedure throws exceptions if any of the tables have been
dropped since load_stats was produced.
2026-01-30 15:11:29 +01:00
Patryk Jędrzejczak
7e7b9977c5 test: test_maintenance_mode: enable maintenance mode properly
The same issue as the one fixed in
394207fd69.
This one didn't cause real problems, but it's still cleaner to fix it.
2026-01-30 12:55:17 +01:00
Patryk Jędrzejczak
6c547e1692 test: test_maintenance_mode: shutdown cluster connections
Leaked connections are known to cause inter-test issues.
2026-01-30 12:55:17 +01:00
Patryk Jędrzejczak
867a1ca346 test: test_maintenance_mode: run with different keyspace options
We extend the test to provide a reproducer for #27988 and to avoid
similar bugs in the future.

The test slows down from ~14s to ~19s on my local machine in dev
mode. It seems reasonable.
2026-01-30 12:55:17 +01:00
Patryk Jędrzejczak
53f58b85b7 test: test_maintenance_mode: check that group0 is disabled by creating a keyspace
In the following commit, we make the test run with multiple keyspaces,
and the old check becomes inconvenient. We also move it below, to the
part of the code that won't be executed for each keyspace.

Additionally, we check if the error message is as expected.
2026-01-30 12:55:17 +01:00
Patryk Jędrzejczak
408c6ea3ee test: test_maintenance_mode: get rid of the conditional skip
This skip has already caused trouble.
After 0668c642a2, the skip was always hit, and
the test was silently doing nothing. This made us miss #26816 for a long
time. The test was fixed in 222eab45f8, but we
should get rid of the skip anyway.

We increase the number of writes from 256 to 1000 to make the chance of not
finding the key on server A even lower. If that still happens, it must be
due to a bug, so we fail the test. We also make the test insert rows until
server A is a replica of one row. The expected number of inserted rows is
a small constant, so it should, in theory, make the test faster and cleaner
(we need one row on server A, so we insert exactly one such row).

It's possible to make the test fully deterministic, by e.g., hardcoding
the key and tokens of all nodes via `initial_token`, but I'm afraid it would
make the test "too deterministic" and could hide a bug.
2026-01-30 12:55:17 +01:00
Patryk Jędrzejczak
c92962ca45 test: test_maintenance_mode: remove the redundant value from the query result 2026-01-30 12:55:17 +01:00
Patryk Jędrzejczak
9d4a5ade08 storage_proxy: skip validate_read_replica in maintenance mode
In maintenance mode, the local node adds only itself to the topology. However,
the effective replication map of a keyspace with tablets enabled contains all
tablet replicas. It gets them from the tablets map, not the topology. Hence,
`network_topology_strategy::sanity_check_read_replicas` hits
```
throw std::runtime_error(format("Requested location for node {} not in topology. backtrace {}", id, lazy_backtrace()));
```
for tablet replicas other than the local node.

As a result, all requests to a keyspace with tablets enabled and RF > 1 fail
in debug mode (`validate_read_replica` does nothing in other modes). We don't
want to skip maintenance mode tests in debug mode, so we skip the check in
maintenance mode.

We move the `is_debug_build()` check because:
- `validate_read_replicas` is a static function with no access to the config,
- we want the `!_db.local().get_config().maintenance_mode()` check to be
  dropped by the compiler in non-debug builds.

We also suppress `-Wunneeded-internal-declaration` with `[[maybe_unused]]`.
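
A minimal sketch of the resulting check ordering, using plain stand-in types
rather than the real storage_proxy code (the config struct and helper names
here are illustrative only):

```
// Stand-in for the build-mode check; in the real code this is a constexpr helper.
constexpr bool is_debug_build() {
#ifndef NDEBUG
    return true;
#else
    return false;
#endif
}

struct config {
    bool maintenance_mode = false;
};

// Static helper with no access to the config, like validate_read_replica.
[[maybe_unused]] static void validate_read_replica() {
    // ...would throw if the replica is not present in the topology...
}

void maybe_validate(const config& cfg) {
    // is_debug_build() comes first: in non-debug builds the whole branch,
    // including the maintenance-mode lookup, is dropped by the compiler.
    if (is_debug_build() && !cfg.maintenance_mode) {
        validate_read_replica();
    }
}
```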
2026-01-30 12:55:17 +01:00
Patryk Jędrzejczak
a08c53ae4b storage_service: set up topology properly in maintenance mode
We currently make the local node the only token owner (that owns the
whole ring) in maintenance mode, but we don't update the topology properly.
The node is present in the topology, but in the `none` state. That's how
it's inserted by `tm.get_topology().set_host_id_cfg(host_id);` in
`scylla_main`. As a result, the node started in maintenance mode crashes
in the following way in the presence of a vnodes-based keyspace with the
NetworkTopologyStrategy:
```
scylla: locator/network_topology_strategy.cc:207:
    locator::natural_endpoints_tracker::natural_endpoints_tracker(
    const token_metadata &, const network_topology_strategy::dc_rep_factor_map &):
    Assertion `!_token_owners.empty() && !_racks.empty()' failed.
```
Both `_token_owners` and `_racks` are empty. The reason is that
`_tm.get_datacenter_token_owners()` and
`_tm.get_datacenter_racks_token_owners()` called above filter out nodes
in the `none` state.

This bug basically made maintenance mode unusable in customer clusters.

We fix it by changing the node state to `normal`. We also update its
rack, datacenter, and shard count. The rack and datacenter are somehow already
present in the topology, but there is nothing wrong with updating them again.
The shard count is missing as well, so we update it too to avoid other
issues.
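
A schematic illustration of why the `none` state broke things, with plain
stand-in types rather than the real locator API (the names below are
illustrative, not Scylla's):

```
#include <map>
#include <string>
#include <vector>

enum class node_state { none, normal };

struct node {
    std::string dc;
    std::string rack;
    node_state state = node_state::none;
};

std::vector<const node*> datacenter_token_owners(const std::map<std::string, node>& nodes) {
    std::vector<const node*> owners;
    for (const auto& [id, n] : nodes) {
        if (n.state == node_state::normal) { // a node left in `none` is invisible here
            owners.push_back(&n);
        }
    }
    // Empty for a maintenance-mode node stuck in `none`, so the
    // `!_token_owners.empty() && !_racks.empty()` assertion fires.
    return owners;
}
```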

Fixes #27988
2026-01-30 12:55:16 +01:00
Andrzej Jackowski
625f292417 test: enable guardrails_test.py
After guardrails_test.py has been migrated to test.py and fixed in
previous commits of this patch series, it can finally be enabled.

Fixes: SCYLLADB-255
2026-01-30 11:51:46 +01:00
Andrzej Jackowski
576ad29ddb test: add wait_other_notice to test_default_rf in guardrails_test.py
This commit adds `wait_other_notice=True` to `cluster.populate` in
`guardrails_test.py`. Without this, `test_default_rf` sometimes fails
because setting `NetworkTopologyStrategy` fails before
the node knows about all the other DCs.

Refs: SCYLLADB-255
2026-01-30 11:51:46 +01:00
Andrzej Jackowski
64c774c23a test: copy guardrails_test.py from scylla-dtest
This commit copies guardrails_test.py from the dtest repository and
(temporarily) disables it, as it requires improvements in the following
commits of this patch series before it can be enabled.

Refs: SCYLLADB-255
2026-01-30 11:51:40 +01:00
Marcin Maliszkiewicz
e18b519692 cql3: remove find_schema call from select check_access
The schema is already a member of the select statement; avoiding
the call saves around 400 CPU instructions on the select
request hot path.
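
A simplified sketch of the change, with hypothetical class shapes rather than
the real cql3 types:

```
#include <memory>
#include <string>

struct schema {
    std::string ks_name;
    std::string cf_name;
};
using schema_ptr = std::shared_ptr<const schema>;

class select_statement {
    schema_ptr _schema; // resolved once, at statement preparation time
public:
    explicit select_statement(schema_ptr s) : _schema(std::move(s)) {}

    void check_access() const {
        // Before: the keyspace/table pair was looked up again via find_schema().
        // After: the member is used directly, avoiding the lookup on the hot path.
        const schema& s = *_schema;
        // ... permission checks against s.ks_name / s.cf_name ...
        (void)s;
    }
};
```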

Closes scylladb/scylladb#28328
2026-01-30 11:49:09 +01:00
Ferenc Szili
71be10b8d6 load_stats: handle dropped tables when refreshing load_stats
When the topology coordinator refreshes load_stats, it caches the load_stats
of every node. If a node becomes unresponsive and fresh
load_stats cannot be read from it, the cached version of
load_stats is used. This allows the load balancer to
have at least some information about the table sizes and disk capacities
of the host.

During load_stats refresh, we aggregate the table sizes from all the
nodes. This procedure calls db.find_column_family() for each table_id
found in load_stats; that function throws if the table is not found,
which causes the load_stats refresh to fail.

It is also possible for a table to have been dropped between the time
load_stats has been prepared on the host, and the time it is processed
on the topology coordinator. This would also cause an exception in the
refresh procedure.

This patch fixes this problem by checking if the table still exists.
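
A minimal sketch of the guard, using a hypothetical existence check instead of
the real database API:

```
#include <cstdint>
#include <string>
#include <unordered_map>
#include <unordered_set>

using table_id = std::string; // stand-in for the real UUID-based table id type

uint64_t aggregate_sizes(const std::unordered_map<table_id, uint64_t>& load_stats,
                         const std::unordered_set<table_id>& existing_tables) {
    uint64_t total = 0;
    for (const auto& [id, size] : load_stats) {
        if (!existing_tables.contains(id)) {
            continue; // table dropped since load_stats was produced - skip it
        }
        total += size;
    }
    return total;
}
```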
2026-01-30 09:48:59 +01:00
Marcin Maliszkiewicz
80e627c64b test: perf: add example commands to perf-alternator and perf-cql-raw 2026-01-30 08:48:19 +01:00
Pawel Pery
f49c9e896a vector_search: allow full secondary indexes syntax while creating the vector index
The Vector Search feature needs to support creating vector indexes with an
additional filtering column. There will be two types of indexes: global, which
indexes vectors per table, and local, which indexes vectors per partition key.
The new syntaxes are based on ScyllaDB's Global Secondary Index and Local
Secondary Index. Vector indexes don't use secondary index functionality in any
way - all indexing, filtering and data processing will be done on the Vector
Store side.

This patch allows creating vector indexes using this CQL syntax:

```
CREATE TABLE IF NOT EXISTS cycling.comments_vs (
  commenter text,
  comment text,
  comment_vector VECTOR <FLOAT, 5>,
  created_at timestamp,
  discussion_board_id int,
  country text,
  lang text,
  PRIMARY KEY ((commenter, discussion_board_id), created_at)
);

CREATE CUSTOM INDEX IF NOT EXISTS global_ann_index
  ON cycling.comments_vs(comment_vector, country, lang) USING 'vector_index'
  WITH OPTIONS = { 'similarity_function': 'DOT_PRODUCT' };

CREATE CUSTOM INDEX IF NOT EXISTS local_ann_index
  ON cycling.comments_vs((commenter, discussion_board_id), comment_vector, country, lang)
  USING 'vector_index'
  WITH OPTIONS = { 'similarity_function': 'DOT_PRODUCT' };
```

Currently, if we run these queries to create the indexes, we receive the following errors:

```
InvalidRequest: Error from server: code=2200 [Invalid query] message="Vector index can only be created on a single column"
InvalidRequest: Error from server: code=2200 [Invalid query] message="Local index definition must contain full partition key only. Redundant column: XYZ"
```

This commit refactors `vector_index::check_target` to correctly validate the
columns that make up the index. The Vector Store currently supports filtering
by native types, so the column types are checked. The first column in the list
must be a vector (the index is built from these vectors), so it is checked as
well.

The allowed column types are the native types, excluding counter (it is not
possible to create a table with both a counter and a vector) and duration (it
is not possible to compare durations correctly; this type is not allowed even
in secondary indexes).

This commit adds a cqlpy test that checks the errors reported while creating indexes.

Fixes: SCYLLADB-298

This needs to be backported to version 2026.1 as this is a fix for filtering support.

Closes scylladb/scylladb#28366
2026-01-30 01:14:31 +02:00
Avi Kivity
3d1558be7e test: remove xfail markers from SELECT JSON count(*) tests
These were marked xfail due to #8077 (the column name was wrong),
but it was fixed long ago for 5.4 (exact commit not known).

Remove the xfail markers to prevent regressions.

Closes scylladb/scylladb#28432
2026-01-29 21:56:00 +02:00
Marcin Maliszkiewicz
ea29e4963e test: perf: add option to write results to json in perf-cql-raw 2026-01-29 10:56:03 +01:00
Marcin Maliszkiewicz
d974ee1e21 test: perf: add option to write results to json in perf-alternator 2026-01-29 10:55:52 +01:00
Marcin Maliszkiewicz
a74b442c65 test: perf: move write_json_result to a common file
The implementation is going to be shared with
perf-alternator and perf-cql-raw.
2026-01-29 10:54:11 +01:00
Pavel Emelyanov
5ce12f2404 gossiper: Export its scheduling group for those who need it
There are several places in the code that need to explicitly switch into the
gossiper scheduling group. For that, they currently ask the database to
provide the group, but it's better to get the gossiper scheduling group from
the gossiper itself, all the more so since all those places already have the
gossiper at hand.
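
A schematic sketch of the idea, with plain stand-ins instead of the real
Seastar/Scylla types (the accessor name is illustrative):

```
struct scheduling_group { // stand-in for seastar::scheduling_group
    int id = 0;
};

class gossiper {
    scheduling_group _sg;
public:
    explicit gossiper(scheduling_group sg) : _sg(sg) {}
    // Accessor exported by this patch, so callers with a gossiper reference
    // no longer need to ask the database for the group.
    scheduling_group get_scheduling_group() const { return _sg; }
};
```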

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-28 18:29:33 +03:00
Pavel Emelyanov
0da1a222fc migration_manager: Reorder members
This is to initialize dependency references, in particular gossiper&,
before _group0_barrier. The latter will need to access this->_gossiper
in the next patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-28 18:29:33 +03:00
Calle Wilund
87aa6c8387 utils/gcp/object_storage: URL-encode object names in URLs
Fixes #28398

When used as path elements in Google storage paths, the object names
need to be URL-encoded. Because (a) the tests did not really use prefixes
containing non-URL-valid chars (e.g. '/') and (b) the mock server used for most
testing does not enforce this particular aspect, this was missed.

Modified the unit tests to use prefixes for all names, so that when run
against real GS, any errors like this will show up.
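
For illustration only (this is not the code added to utils/gcp/object_storage),
a generic percent-encoding helper of the kind the fix relies on:

```
#include <cctype>
#include <string>

std::string url_encode(const std::string& name) {
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : name) {
        if (std::isalnum(c) || c == '-' || c == '_' || c == '.' || c == '~') {
            out += static_cast<char>(c);
        } else {
            out += '%';
            out += hex[c >> 4];
            out += hex[c & 0xF]; // '/' and other reserved chars become %2F etc.
        }
    }
    return out;
}
```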
2026-01-27 18:01:21 +01:00
Calle Wilund
a896d8d5e3 utils::gcp::object_storage: Fix list object pager end condition detection
Fixes #28399

When iterating with the pager, the mock server and real GCS behave differently.
The latter will not return a pager token for the last page, only for the penultimate one.

This needs to be handled.
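
A sketch of the resulting termination rule, with hypothetical types rather than
the real pager class:

```
#include <functional>
#include <optional>
#include <string>
#include <vector>

struct page {
    std::vector<std::string> objects;
    std::optional<std::string> next_token;
};

std::vector<std::string> list_all(const std::function<page(const std::optional<std::string>&)>& fetch) {
    std::vector<std::string> all;
    std::optional<std::string> token;
    while (true) {
        page p = fetch(token);
        all.insert(all.end(), p.objects.begin(), p.objects.end());
        if (!p.next_token || p.objects.empty()) {
            break; // no token (real GCS) or an empty final page (mock server)
        }
        token = p.next_token;
    }
    return all;
}
```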
2026-01-27 17:57:17 +01:00
Piotr Dulikowski
29da20744a storage_service: fix indentation after previous patch 2026-01-27 15:49:01 +01:00
Piotr Dulikowski
d28c841fa9 raft topology: generate notification about released nodes only once
Hints destined for some other node can only be drained after the other
node is no longer a replica of any vnode or tablet. When tablets
are present, a node might still technically be a replica of some tablets
after it has moved to the left state. When it is no longer a replica of any
tablet, it becomes "released" and the storage service generates a
notification about it. Hinted handoff listens to this notification and
kicks off draining the hints after receiving it.

The current implementation of the "released" notification would trigger
every time the raft topology state is reloaded and a left node without any
tokens is present in the raft topology. Although draining hints is
idempotent, generating duplicate notifications is wasteful, and it recently
became very noisy after the verbosity of the draining-related log
messages was increased in 44de563. The verbosity increase itself makes sense,
as draining is supposed to be a rare operation, but the duplicate
notification bug now needs to be addressed.

Fix the duplicate notification problem by passing the list of previously
released nodes to the `storage_service::raft_topology_update_ip`
function and filtering based on it. If this function processes the
topology state for the first time, it will not produce any
notifications. This is fine as hinted handoff is prepared to detect
"released" nodes during the startup sequence in main.cc and start
draining the hints there, if needed.
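
A minimal sketch of the filtering step, with a hypothetical helper rather than
the real storage_service code:

```
#include <string>
#include <unordered_set>
#include <vector>

using host_id = std::string; // stand-in for locator::host_id

std::vector<host_id> newly_released(const std::unordered_set<host_id>& previously_released,
                                    const std::unordered_set<host_id>& released_now) {
    std::vector<host_id> result;
    for (const auto& id : released_now) {
        if (!previously_released.contains(id)) {
            result.push_back(id); // notify hinted handoff only once per node
        }
    }
    return result;
}
```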

Fixes: #28301
Refs: #25031
2026-01-27 15:48:05 +01:00
Piotr Dulikowski
10e9672852 raft topology: extract "released" nodes calculation to external function
In the following commits, we will need to compare the set of released
nodes before and after a reload of the raft topology state. Moving the logic
that calculates such a set to a separate function will make it easier to
do.
2026-01-27 14:37:43 +01:00
Nadav Har'El
9baaddb613 test/cqlpy: add reproducer for hidden Paxos table being shown by DESC
This patch adds a reproducer test showing issue #28183 - that when LWT
is used, hidden tables "...$paxos" are created but they are unexpectedly
shown by DESC TABLES, DESC SCHEMA and DESC KEYSPACE.

The new test was failing (in three places) on Scylla, as those internal
(and illegally-named) tables are listed, and passes on Cassandra
(which doesn't add hidden tables for LWT).

The commit also contains another test, which verifies that a direct
description of a paxos state table is wrapped in a comment.

Refs #28183.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-01-22 10:43:59 +01:00
Michał Jadwiszczak
f89a8c4ec4 cql3/statements/describe_statement: hide paxos state tables
Paxos state tables are internal tables fully managed by Scylla;
they shouldn't be exposed to the user, nor should they be backed up.

This commit hides this kind of table from all listings, and if such a table
is directly described with `DESC ks."tbl$paxos"`, the description is generated
within a comment and a note for the user is added.
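
A minimal sketch of the listing filter, with hypothetical helpers rather than
the real describe_statement code:

```
#include <string>
#include <string_view>
#include <vector>

bool is_paxos_state_table(std::string_view name) {
    constexpr std::string_view suffix = "$paxos";
    return name.size() > suffix.size() && name.ends_with(suffix);
}

std::vector<std::string> visible_tables(const std::vector<std::string>& names) {
    std::vector<std::string> out;
    for (const auto& n : names) {
        if (!is_paxos_state_table(n)) {
            out.push_back(n); // only user-visible tables are listed by DESC
        }
    }
    return out;
}
```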

Fixes scylladb/scylladb#28183
2026-01-20 15:58:08 +01:00
Karol Baryła
2c471ec57a protocol-extensions.md: Fix client_options docs
When this column and the relevant SUPPORTED key were added, the
documentation was mistakenly put in the section about the shard awareness
extension. This commit moves the documentation into a dedicated section.

I also expanded it to describe both the new column and the new SUPPORTED
key.
2026-01-13 11:49:00 +01:00
Karol Baryła
30d4d3248d system_keyspace.md: Add client_options column
It was recently introduced, but the documentation was not updated.
2026-01-13 11:35:52 +01:00
Karol Baryła
a0a6140436 system_keyspace.md: Fix order in system.clients
The scheduling_group column is placed after protocol_version in the current
version.
2026-01-13 11:33:34 +01:00
290 changed files with 5790 additions and 4658 deletions

View File

@@ -0,0 +1,22 @@
name: Sync Jira Based on PR Milestone Events
on:
pull_request_target:
types: [milestoned, demilestoned]
permissions:
contents: read
pull-requests: read
jobs:
jira-sync-milestone-set:
if: github.event.action == 'milestoned'
uses: scylladb/github-automation/.github/workflows/main_jira_sync_pr_milestone_set.yml@main
secrets:
caller_jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}
jira-sync-milestone-removed:
if: github.event.action == 'demilestoned'
uses: scylladb/github-automation/.github/workflows/main_jira_sync_pr_milestone_removed.yml@main
secrets:
caller_jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}

View File

@@ -1,4 +1,4 @@
name: Call Jira release creation for new milestone
name: Call Jira release creation for new milestone
on:
milestone:
@@ -9,6 +9,6 @@ jobs:
uses: scylladb/github-automation/.github/workflows/main_sync_milestone_to_jira_release.yml@main
with:
# Comma-separated list of Jira project keys
jira_project_keys: "SCYLLADB,CUSTOMER"
jira_project_keys: "SCYLLADB,CUSTOMER,SMI"
secrets:
caller_jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}

View File

@@ -0,0 +1,62 @@
name: Close issues created by Scylla associates
on:
issues:
types: [opened, reopened]
permissions:
issues: write
jobs:
comment-and-close:
runs-on: ubuntu-latest
steps:
- name: Comment and close if author email is scylladb.com
uses: actions/github-script@v7
with:
github-token: ${{ secrets.GITHUB_TOKEN }}
script: |
const issue = context.payload.issue;
const actor = context.actor;
// Get user data (only public email is available)
const { data: user } = await github.rest.users.getByUsername({
username: actor,
});
const email = user.email || "";
console.log(`Actor: ${actor}, public email: ${email || "<none>"}`);
// Only continue if email exists and ends with @scylladb.com
if (!email || !email.toLowerCase().endsWith("@scylladb.com")) {
console.log("User is not a scylladb.com email (or email not public); skipping.");
return;
}
const owner = context.repo.owner;
const repo = context.repo.repo;
const issue_number = issue.number;
const body = "Issues in this repository are closed automatically. Scylla associates should use Jira to manage issues.\nPlease move this issue to Jira https://scylladb.atlassian.net/jira/software/c/projects/SCYLLADB/list";
// Add the comment
await github.rest.issues.createComment({
owner,
repo,
issue_number,
body,
});
console.log(`Comment added to #${issue_number}`);
// Close the issue
await github.rest.issues.update({
owner,
repo,
issue_number,
state: "closed",
state_reason: "not_planned"
});
console.log(`Issue #${issue_number} closed.`);

View File

@@ -9,16 +9,34 @@ on:
jobs:
trigger-jenkins:
if: (github.event.comment.user.login != 'scylladbbot' && contains(github.event.comment.body, '@scylladbbot') && contains(github.event.comment.body, 'trigger-ci')) || github.event.label.name == 'conflicts'
if: (github.event_name == 'issue_comment' && github.event.comment.user.login != 'scylladbbot') || github.event.label.name == 'conflicts'
runs-on: ubuntu-latest
steps:
- name: Validate Comment Trigger
if: github.event_name == 'issue_comment'
id: verify_comment
shell: bash
run: |
BODY=$(cat << 'EOF'
${{ github.event.comment.body }}
EOF
)
CLEAN_BODY=$(echo "$BODY" | grep -v '^[[:space:]]*>')
if echo "$CLEAN_BODY" | grep -qi '@scylladbbot' && echo "$CLEAN_BODY" | grep -qi 'trigger-ci'; then
echo "trigger=true" >> $GITHUB_OUTPUT
else
echo "trigger=false" >> $GITHUB_OUTPUT
fi
- name: Trigger Scylla-CI-Route Jenkins Job
if: github.event_name == 'pull_request_target' || steps.verify_comment.outputs.trigger == 'true'
env:
JENKINS_USER: ${{ secrets.JENKINS_USERNAME }}
JENKINS_API_TOKEN: ${{ secrets.JENKINS_TOKEN }}
JENKINS_URL: "https://jenkins.scylladb.com"
run: |
PR_NUMBER=${{ github.event.issue.number }}
PR_NUMBER=${{ github.event.issue.number || github.event.pull_request.number }}
PR_REPO_NAME=${{ github.event.repository.full_name }}
curl -X POST "$JENKINS_URL/job/releng/job/Scylla-CI-Route/buildWithParameters?PR_NUMBER=$PR_NUMBER&PR_REPO_NAME=$PR_REPO_NAME" \
--user "$JENKINS_USER:$JENKINS_API_TOKEN" --fail -i -v

View File

@@ -43,7 +43,7 @@ For further information, please see:
[developer documentation]: HACKING.md
[build documentation]: docs/dev/building.md
[docker image build documentation]: dist/docker/debian/README.md
[docker image build documentation]: dist/docker/redhat/README.md
## Running Scylla

View File

@@ -618,7 +618,7 @@ conditional_operator_type get_conditional_operator(const rjson::value& req) {
// Check if the existing values of the item (previous_item) match the
// conditions given by the Expected and ConditionalOperator parameters
// (if they exist) in the request (an UpdateItem, PutItem or DeleteItem).
// This function can throw an ValidationException API error if there
// This function can throw a ValidationException API error if there
// are errors in the format of the condition itself.
bool verify_expected(const rjson::value& req, const rjson::value* previous_item) {
const rjson::value* expected = rjson::find(req, "Expected");

View File

@@ -45,7 +45,7 @@ bool consumed_capacity_counter::should_add_capacity(const rjson::value& request)
}
void consumed_capacity_counter::add_consumed_capacity_to_response_if_needed(rjson::value& response) const noexcept {
if (_should_add_to_reponse) {
if (_should_add_to_response) {
auto consumption = rjson::empty_object();
rjson::add(consumption, "CapacityUnits", get_consumed_capacity_units());
rjson::add(response, "ConsumedCapacity", std::move(consumption));

View File

@@ -28,9 +28,9 @@ namespace alternator {
class consumed_capacity_counter {
public:
consumed_capacity_counter() = default;
consumed_capacity_counter(bool should_add_to_reponse) : _should_add_to_reponse(should_add_to_reponse){}
consumed_capacity_counter(bool should_add_to_response) : _should_add_to_response(should_add_to_response){}
bool operator()() const noexcept {
return _should_add_to_reponse;
return _should_add_to_response;
}
consumed_capacity_counter& operator +=(uint64_t bytes);
@@ -44,7 +44,7 @@ public:
uint64_t _total_bytes = 0;
static bool should_add_capacity(const rjson::value& request);
protected:
bool _should_add_to_reponse = false;
bool _should_add_to_response = false;
};
class rcu_consumed_capacity_counter : public consumed_capacity_counter {

View File

@@ -7,6 +7,7 @@
*/
#include <fmt/ranges.h>
#include <cstdlib>
#include <seastar/core/on_internal_error.hh>
#include "alternator/executor.hh"
#include "alternator/consumed_capacity.hh"
@@ -108,6 +109,16 @@ const sstring TABLE_CREATION_TIME_TAG_KEY("system:table_creation_time");
// configured by UpdateTimeToLive to be the expiration-time attribute for
// this table.
extern const sstring TTL_TAG_KEY("system:ttl_attribute");
// If this tag is present, it stores the name of an attribute whose numeric
// value (in microseconds since the Unix epoch) is used as the write timestamp
// for PutItem and UpdateItem operations. When the named attribute is present
// in a PutItem or UpdateItem request, its value is used as the timestamp of
// the write, and the attribute itself is NOT stored in the item. This allows
// users to control write ordering for last-write-wins semantics. Because LWT
// does not allow setting a custom write timestamp, operations using this
// feature are incompatible with conditions (which require LWT), and with
// the LWT_ALWAYS write isolation mode; such operations are rejected.
static const sstring TIMESTAMP_TAG_KEY("system:timestamp_attribute");
// This will be set to 1 in a case, where user DID NOT specify a range key.
// The way GSI / LSI is implemented by Alternator assumes user specified keys will come first
// in materialized view's key list. Then, if needed missing keys are added (current implementation
@@ -237,7 +248,7 @@ static void validate_is_object(const rjson::value& value, const char* caller) {
}
// This function assumes the given value is an object and returns requested member value.
// If it is not possible an api_error::validation is thrown.
// If it is not possible, an api_error::validation is thrown.
static const rjson::value& get_member(const rjson::value& obj, const char* member_name, const char* caller) {
validate_is_object(obj, caller);
const rjson::value* ret = rjson::find(obj, member_name);
@@ -249,7 +260,7 @@ static const rjson::value& get_member(const rjson::value& obj, const char* membe
// This function assumes the given value is an object with a single member, and returns this member.
// In case the requirements are not met an api_error::validation is thrown.
// In case the requirements are not met, an api_error::validation is thrown.
static const rjson::value::Member& get_single_member(const rjson::value& v, const char* caller) {
if (!v.IsObject() || v.MemberCount() != 1) {
throw api_error::validation(format("{}: expected an object with a single member.", caller));
@@ -682,7 +693,7 @@ static std::optional<int> get_int_attribute(const rjson::value& value, std::stri
}
// Sets a KeySchema object inside the given JSON parent describing the key
// attributes of the the given schema as being either HASH or RANGE keys.
// attributes of the given schema as being either HASH or RANGE keys.
// Additionally, adds to a given map mappings between the key attribute
// names and their type (as a DynamoDB type string).
void executor::describe_key_schema(rjson::value& parent, const schema& schema, std::unordered_map<std::string,std::string>* attribute_types, const std::map<sstring, sstring> *tags) {
@@ -916,7 +927,7 @@ future<rjson::value> executor::fill_table_description(schema_ptr schema, table_s
sstring index_name = cf_name.substr(delim_it + 1);
rjson::add(view_entry, "IndexName", rjson::from_string(index_name));
rjson::add(view_entry, "IndexArn", generate_arn_for_index(*schema, index_name));
// Add indexes's KeySchema and collect types for AttributeDefinitions:
// Add index's KeySchema and collect types for AttributeDefinitions:
executor::describe_key_schema(view_entry, *vptr, key_attribute_types, db::get_tags_of_table(vptr));
// Add projection type
rjson::value projection = rjson::empty_object();
@@ -1337,13 +1348,14 @@ void rmw_operation::set_default_write_isolation(std::string_view value) {
// Alternator uses tags whose keys start with the "system:" prefix for
// internal purposes. Those should not be readable by ListTagsOfResource,
// nor writable with TagResource or UntagResource (see #24098).
// Only a few specific system tags, currently only "system:write_isolation"
// and "system:initial_tablets", are deliberately intended to be set and read
// by the user, so are not considered "internal".
// Only a few specific system tags, currently only "system:write_isolation",
// "system:initial_tablets", and "system:timestamp_attribute", are deliberately
// intended to be set and read by the user, so are not considered "internal".
static bool tag_key_is_internal(std::string_view tag_key) {
return tag_key.starts_with("system:")
&& tag_key != rmw_operation::WRITE_ISOLATION_TAG_KEY
&& tag_key != INITIAL_TABLETS_TAG_KEY;
&& tag_key != INITIAL_TABLETS_TAG_KEY
&& tag_key != TIMESTAMP_TAG_KEY;
}
enum class update_tags_action { add_tags, delete_tags };
@@ -2298,8 +2310,11 @@ public:
// After calling pk_from_json() and ck_from_json() to extract the pk and ck
// components of a key, and if that succeeded, call check_key() to further
// check that the key doesn't have any spurious components.
static void check_key(const rjson::value& key, const schema_ptr& schema) {
if (key.MemberCount() != (schema->clustering_key_size() == 0 ? 1 : 2)) {
// allow_extra_attribute: set to true when the key may contain one extra
// non-key attribute (e.g., the timestamp pseudo-attribute for DeleteItem).
static void check_key(const rjson::value& key, const schema_ptr& schema, bool allow_extra_attribute = false) {
const unsigned expected = (schema->clustering_key_size() == 0 ? 1 : 2) + (allow_extra_attribute ? 1 : 0);
if (key.MemberCount() != expected) {
throw api_error::validation("Given key attribute not in schema");
}
}
@@ -2346,6 +2361,57 @@ void validate_value(const rjson::value& v, const char* caller) {
// any writing happens (if one of the commands has an error, none of the
// writes should be done). LWT makes it impossible for the parse step to
// generate "mutation" objects, because the timestamp still isn't known.
// Convert a DynamoDB number (big_decimal) to an api::timestamp_type
// (microseconds since the Unix epoch). Fractional microseconds are truncated.
// Returns nullopt if the value is negative or zero.
static std::optional<api::timestamp_type> bigdecimal_to_timestamp(const big_decimal& bd) {
if (bd.unscaled_value() <= 0) {
return std::nullopt;
}
if (bd.scale() == 0) {
// Fast path: integer value, no decimal adjustment needed
return static_cast<api::timestamp_type>(bd.unscaled_value());
}
// General case: adjust for decimal scale.
// big_decimal stores value as unscaled_value * 10^(-scale).
// scale > 0 means divide by 10^scale (truncate fractional part).
// scale < 0 means multiply by 10^|scale| (add trailing zeros).
auto str = bd.unscaled_value().str();
if (bd.scale() > 0) {
int len = str.length();
if (len <= bd.scale()) {
return std::nullopt; // Number < 1
}
str = str.substr(0, len - bd.scale());
} else {
if (bd.scale() < -18) {
// Too large to represent as int64_t
return std::nullopt;
}
for (int i = 0; i < -bd.scale(); i++) {
str.push_back('0');
}
}
long long result = strtoll(str.c_str(), nullptr, 10);
if (result <= 0) {
return std::nullopt;
}
return static_cast<api::timestamp_type>(result);
}
// Try to extract a write timestamp from a DynamoDB-typed value.
// The value should be a number ({"N": "..."}), representing microseconds
// since the Unix epoch. Returns nullopt if the value is not a valid number
// or doesn't represent a valid timestamp.
static std::optional<api::timestamp_type> try_get_timestamp(const rjson::value& attr_value) {
std::optional<big_decimal> n = try_unwrap_number(attr_value);
if (!n) {
return std::nullopt;
}
return bigdecimal_to_timestamp(*n);
}
class put_or_delete_item {
private:
partition_key _pk;
@@ -2361,11 +2427,17 @@ private:
// that length can have different meaning depends on the operation but the
// the calculation of length in bytes to WCU is the same.
uint64_t _length_in_bytes = 0;
// If the table has a system:timestamp_attribute tag, and the named
// attribute was found in the item with a valid numeric value, this holds
// the extracted timestamp. The attribute is not added to _cells.
std::optional<api::timestamp_type> _custom_timestamp;
public:
struct delete_item {};
struct put_item {};
put_or_delete_item(const rjson::value& key, schema_ptr schema, delete_item);
put_or_delete_item(const rjson::value& item, schema_ptr schema, put_item, std::unordered_map<bytes, std::string> key_attributes);
put_or_delete_item(const rjson::value& key, schema_ptr schema, delete_item,
const std::optional<bytes>& timestamp_attribute = std::nullopt);
put_or_delete_item(const rjson::value& item, schema_ptr schema, put_item, std::unordered_map<bytes, std::string> key_attributes,
const std::optional<bytes>& timestamp_attribute = std::nullopt);
// put_or_delete_item doesn't keep a reference to schema (so it can be
// moved between shards for LWT) so it needs to be given again to build():
mutation build(schema_ptr schema, api::timestamp_type ts) const;
@@ -2380,11 +2452,32 @@ public:
bool is_put_item() noexcept {
return _cells.has_value();
}
// Returns the custom write timestamp extracted from the timestamp attribute,
// if any. If not set, the caller should use api::new_timestamp() instead.
std::optional<api::timestamp_type> custom_timestamp() const noexcept {
return _custom_timestamp;
}
};
put_or_delete_item::put_or_delete_item(const rjson::value& key, schema_ptr schema, delete_item)
put_or_delete_item::put_or_delete_item(const rjson::value& key, schema_ptr schema, delete_item, const std::optional<bytes>& timestamp_attribute)
: _pk(pk_from_json(key, schema)), _ck(ck_from_json(key, schema)) {
check_key(key, schema);
if (timestamp_attribute) {
// The timestamp attribute may be provided as a "pseudo-key": it is
// not a real key column, but can be included in the "Key" object to
// carry the custom write timestamp. If found, extract the timestamp
// and don't store it in the item.
const rjson::value* ts_val = rjson::find(key, to_string_view(*timestamp_attribute));
if (ts_val) {
if (auto t = try_get_timestamp(*ts_val)) {
_custom_timestamp = t;
} else {
throw api_error::validation(fmt::format(
"The '{}' attribute used as a write timestamp must be a positive number (microseconds since epoch)",
to_string_view(*timestamp_attribute)));
}
}
}
check_key(key, schema, _custom_timestamp.has_value());
}
// find_attribute() checks whether the named attribute is stored in the
@@ -2435,7 +2528,7 @@ std::unordered_map<bytes, std::string> si_key_attributes(data_dictionary::table
// case, this function simply won't be called for this attribute.)
//
// This function checks if the given attribute update is an update to some
// GSI's key, and if the value is unsuitable, a api_error::validation is
// GSI's key, and if the value is unsuitable, an api_error::validation is
// thrown. The checking here is similar to the checking done in
// get_key_from_typed_value() for the base table's key columns.
//
@@ -2471,7 +2564,8 @@ static inline void validate_value_if_index_key(
}
}
put_or_delete_item::put_or_delete_item(const rjson::value& item, schema_ptr schema, put_item, std::unordered_map<bytes, std::string> key_attributes)
put_or_delete_item::put_or_delete_item(const rjson::value& item, schema_ptr schema, put_item, std::unordered_map<bytes, std::string> key_attributes,
const std::optional<bytes>& timestamp_attribute)
: _pk(pk_from_json(item, schema)), _ck(ck_from_json(item, schema)) {
_cells = std::vector<cell>();
_cells->reserve(item.MemberCount());
@@ -2480,6 +2574,17 @@ put_or_delete_item::put_or_delete_item(const rjson::value& item, schema_ptr sche
validate_value(it->value, "PutItem");
const column_definition* cdef = find_attribute(*schema, column_name);
validate_attr_name_length("", column_name.size(), cdef && cdef->is_primary_key());
// If this is the timestamp attribute, it must be a valid numeric value
// (microseconds since epoch). Use it as the write timestamp and do not
// store it in the item data. Reject the write if the value is non-numeric.
if (timestamp_attribute && column_name == *timestamp_attribute) {
if (auto t = try_get_timestamp(it->value)) {
_custom_timestamp = t;
// The attribute is consumed as timestamp, not stored in _cells.
continue;
}
throw api_error::validation(fmt::format("The '{}' attribute used as a write timestamp must be a positive number (microseconds since epoch)", to_string_view(*timestamp_attribute)));
}
_length_in_bytes += column_name.size();
if (!cdef) {
// This attribute may be a key column of one of the GSI or LSI,
@@ -2671,6 +2776,13 @@ rmw_operation::rmw_operation(service::storage_proxy& proxy, rjson::value&& reque
// _pk and _ck will be assigned later, by the subclass's constructor
// (each operation puts the key in a slightly different location in
// the request).
const auto tags_ptr = db::get_tags_of_table(_schema);
if (tags_ptr) {
auto it = tags_ptr->find(TIMESTAMP_TAG_KEY);
if (it != tags_ptr->end() && !it->second.empty()) {
_timestamp_attribute = to_bytes(it->second);
}
}
}
std::optional<mutation> rmw_operation::apply(foreign_ptr<lw_shared_ptr<query::result>> qr, const query::partition_slice& slice, api::timestamp_type ts, cdc::per_request_options& cdc_opts) {
@@ -2815,6 +2927,21 @@ future<executor::request_return_type> rmw_operation::execute(service::storage_pr
.alternator = true,
.alternator_streams_increased_compatibility = schema()->cdc_options().enabled() && proxy.data_dictionary().get_config().alternator_streams_increased_compatibility(),
};
// If the operation uses a custom write timestamp (from the
// system:timestamp_attribute tag), LWT is incompatible because LWT
// requires the timestamp to be set by the Paxos protocol. Reject the
// operation if it would need to use LWT.
if (has_custom_timestamp()) {
bool would_use_lwt = _write_isolation == write_isolation::LWT_ALWAYS ||
(needs_read_before_write &&
_write_isolation != write_isolation::FORBID_RMW &&
_write_isolation != write_isolation::UNSAFE_RMW);
if (would_use_lwt) {
throw api_error::validation(
"Using the system:timestamp_attribute is not compatible with "
"conditional writes or the 'always' write isolation policy.");
}
}
if (needs_read_before_write) {
if (_write_isolation == write_isolation::FORBID_RMW) {
throw api_error::validation("Read-modify-write operations are disabled by 'forbid_rmw' write isolation policy. Refer to https://github.com/scylladb/scylla/blob/master/docs/alternator/alternator.md#write-isolation-policies for more information.");
@@ -2913,7 +3040,8 @@ public:
put_item_operation(parsed::expression_cache& parsed_expression_cache, service::storage_proxy& proxy, rjson::value&& request)
: rmw_operation(proxy, std::move(request))
, _mutation_builder(rjson::get(_request, "Item"), schema(), put_or_delete_item::put_item{},
si_key_attributes(proxy.data_dictionary().find_table(schema()->ks_name(), schema()->cf_name()))) {
si_key_attributes(proxy.data_dictionary().find_table(schema()->ks_name(), schema()->cf_name())),
_timestamp_attribute) {
_pk = _mutation_builder.pk();
_ck = _mutation_builder.ck();
if (_returnvalues != returnvalues::NONE && _returnvalues != returnvalues::ALL_OLD) {
@@ -2945,6 +3073,9 @@ public:
check_needs_read_before_write(_condition_expression) ||
_returnvalues == returnvalues::ALL_OLD;
}
bool has_custom_timestamp() const noexcept {
return _mutation_builder.custom_timestamp().has_value();
}
virtual std::optional<mutation> apply(std::unique_ptr<rjson::value> previous_item, api::timestamp_type ts, cdc::per_request_options& cdc_opts) const override {
if (!verify_expected(_request, previous_item.get()) ||
!verify_condition_expression(_condition_expression, previous_item.get())) {
@@ -2962,7 +3093,10 @@ public:
} else {
_return_attributes = {};
}
return _mutation_builder.build(_schema, ts);
// Use the custom timestamp from the timestamp attribute if available,
// otherwise use the provided timestamp.
api::timestamp_type effective_ts = _mutation_builder.custom_timestamp().value_or(ts);
return _mutation_builder.build(_schema, effective_ts);
}
virtual ~put_item_operation() = default;
};
@@ -3014,7 +3148,7 @@ public:
parsed::condition_expression _condition_expression;
delete_item_operation(parsed::expression_cache& parsed_expression_cache, service::storage_proxy& proxy, rjson::value&& request)
: rmw_operation(proxy, std::move(request))
, _mutation_builder(rjson::get(_request, "Key"), schema(), put_or_delete_item::delete_item{}) {
, _mutation_builder(rjson::get(_request, "Key"), schema(), put_or_delete_item::delete_item{}, _timestamp_attribute) {
_pk = _mutation_builder.pk();
_ck = _mutation_builder.ck();
if (_returnvalues != returnvalues::NONE && _returnvalues != returnvalues::ALL_OLD) {
@@ -3045,6 +3179,9 @@ public:
check_needs_read_before_write(_condition_expression) ||
_returnvalues == returnvalues::ALL_OLD;
}
bool has_custom_timestamp() const noexcept override {
return _mutation_builder.custom_timestamp().has_value();
}
virtual std::optional<mutation> apply(std::unique_ptr<rjson::value> previous_item, api::timestamp_type ts, cdc::per_request_options& cdc_opts) const override {
if (!verify_expected(_request, previous_item.get()) ||
!verify_condition_expression(_condition_expression, previous_item.get())) {
@@ -3065,7 +3202,10 @@ public:
if (_consumed_capacity._total_bytes == 0) {
_consumed_capacity._total_bytes = 1;
}
return _mutation_builder.build(_schema, ts);
// Use the custom timestamp from the timestamp attribute if available,
// otherwise use the provided timestamp.
api::timestamp_type effective_ts = _mutation_builder.custom_timestamp().value_or(ts);
return _mutation_builder.build(_schema, effective_ts);
}
virtual ~delete_item_operation() = default;
};
@@ -3252,10 +3392,13 @@ future<> executor::do_batch_write(
// Do a normal write, without LWT:
utils::chunked_vector<mutation> mutations;
mutations.reserve(mutation_builders.size());
api::timestamp_type now = api::new_timestamp();
api::timestamp_type default_ts = api::new_timestamp();
bool any_cdc_enabled = false;
for (auto& b : mutation_builders) {
mutations.push_back(b.second.build(b.first, now));
// Use custom timestamp from the timestamp attribute if available,
// otherwise use the default timestamp for all items in this batch.
api::timestamp_type ts = b.second.custom_timestamp().value_or(default_ts);
mutations.push_back(b.second.build(b.first, ts));
any_cdc_enabled |= b.first->cdc_options().enabled();
}
return _proxy.mutate(std::move(mutations),
@@ -3355,6 +3498,16 @@ future<executor::request_return_type> executor::batch_write_item(client_state& c
std::unordered_set<primary_key, primary_key_hash, primary_key_equal> used_keys(
1, primary_key_hash{schema}, primary_key_equal{schema});
// Look up the timestamp attribute tag once per table (shared by all
// PutRequests and DeleteRequests for this table).
std::optional<bytes> ts_attr;
const auto tags_ptr = db::get_tags_of_table(schema);
if (tags_ptr) {
auto tag_it = tags_ptr->find(TIMESTAMP_TAG_KEY);
if (tag_it != tags_ptr->end() && !tag_it->second.empty()) {
ts_attr = to_bytes(tag_it->second);
}
}
for (auto& request : it->value.GetArray()) {
auto& r = get_single_member(request, "RequestItems element");
const auto r_name = rjson::to_string_view(r.name);
@@ -3363,7 +3516,8 @@ future<executor::request_return_type> executor::batch_write_item(client_state& c
validate_is_object(item, "Item in PutRequest");
auto&& put_item = put_or_delete_item(
item, schema, put_or_delete_item::put_item{},
si_key_attributes(_proxy.data_dictionary().find_table(schema->ks_name(), schema->cf_name())));
si_key_attributes(_proxy.data_dictionary().find_table(schema->ks_name(), schema->cf_name())),
ts_attr);
mutation_builders.emplace_back(schema, std::move(put_item));
auto mut_key = std::make_pair(mutation_builders.back().second.pk(), mutation_builders.back().second.ck());
if (used_keys.contains(mut_key)) {
@@ -3374,7 +3528,7 @@ future<executor::request_return_type> executor::batch_write_item(client_state& c
const rjson::value& key = get_member(r.value, "Key", "DeleteRequest");
validate_is_object(key, "Key in DeleteRequest");
mutation_builders.emplace_back(schema, put_or_delete_item(
key, schema, put_or_delete_item::delete_item{}));
key, schema, put_or_delete_item::delete_item{}, ts_attr));
auto mut_key = std::make_pair(mutation_builders.back().second.pk(),
mutation_builders.back().second.ck());
if (used_keys.contains(mut_key)) {
@@ -3548,7 +3702,7 @@ static bool hierarchy_filter(rjson::value& val, const attribute_path_map_node<T>
return true;
}
// Add a path to a attribute_path_map. Throws a validation error if the path
// Add a path to an attribute_path_map. Throws a validation error if the path
// "overlaps" with one already in the filter (one is a sub-path of the other)
// or "conflicts" with it (both a member and index is requested).
template<typename T>
@@ -3983,6 +4137,10 @@ public:
virtual ~update_item_operation() = default;
virtual std::optional<mutation> apply(std::unique_ptr<rjson::value> previous_item, api::timestamp_type ts, cdc::per_request_options& cdc_opts) const override;
bool needs_read_before_write() const;
// Returns true if the timestamp attribute is being set in this update
// (via AttributeUpdates PUT or UpdateExpression SET). Used to detect
// whether a custom write timestamp will be used.
bool has_custom_timestamp() const noexcept;
private:
void delete_attribute(bytes&& column_name, const std::unique_ptr<rjson::value>& previous_item, const api::timestamp_type ts, deletable_row& row,
@@ -4117,6 +4275,44 @@ update_item_operation::needs_read_before_write() const {
(_returnvalues != returnvalues::NONE && _returnvalues != returnvalues::UPDATED_NEW);
}
bool
update_item_operation::has_custom_timestamp() const noexcept {
if (!_timestamp_attribute) {
return false;
}
// Check if the timestamp attribute is being set via AttributeUpdates PUT
// with a valid numeric value.
if (_attribute_updates) {
std::string_view ts_attr = to_string_view(*_timestamp_attribute);
for (auto it = _attribute_updates->MemberBegin(); it != _attribute_updates->MemberEnd(); ++it) {
if (rjson::to_string_view(it->name) == ts_attr) {
const rjson::value* action = rjson::find(it->value, "Action");
if (action && rjson::to_string_view(*action) == "PUT" && it->value.HasMember("Value")) {
// Only consider it a custom timestamp if the value is numeric
if (try_get_timestamp((it->value)["Value"])) {
return true;
}
}
break;
}
}
}
// Check if the timestamp attribute is being set via UpdateExpression SET.
// We can't check the actual value type without resolving the expression
// (which requires previous_item), so we conservatively return true if the
// attribute appears in a SET action, and handle the non-numeric case in apply().
// A non-numeric value will cause apply() to throw a ValidationException.
if (!_update_expression.empty()) {
std::string ts_attr(to_string_view(*_timestamp_attribute));
auto it = _update_expression.find(ts_attr);
if (it != _update_expression.end() && it->second.has_value()) {
const auto& action = it->second.get_value();
return std::holds_alternative<parsed::update_expression::action::set>(action._action);
}
}
return false;
}
// action_result() returns the result of applying an UpdateItem action -
// this result is either a JSON object or an unset optional which indicates
// the action was a deletion. The caller (update_item_operation::apply()
@@ -4392,6 +4588,17 @@ inline void update_item_operation::apply_attribute_updates(const std::unique_ptr
throw api_error::validation(format("UpdateItem cannot update key column {}", rjson::to_string_view(it->name)));
}
std::string action = rjson::to_string((it->value)["Action"]);
// If this is the timestamp attribute being PUT, it must be a valid
// numeric value (microseconds since epoch). Use it as the write
// timestamp and skip storing it. Reject if the value is non-numeric.
if (_timestamp_attribute && column_name == *_timestamp_attribute && action == "PUT") {
if (it->value.HasMember("Value")) {
if (try_get_timestamp((it->value)["Value"])) {
continue;
}
throw api_error::validation(fmt::format("The '{}' attribute used as a write timestamp must be a positive number (microseconds since epoch)", to_string_view(*_timestamp_attribute)));
}
}
if (action == "DELETE") {
// The DELETE operation can do two unrelated tasks. Without a
// "Value" option, it is used to delete an attribute. With a
@@ -4495,6 +4702,20 @@ inline void update_item_operation::apply_update_expression(const std::unique_ptr
if (cdef && cdef->is_primary_key()) {
throw api_error::validation(fmt::format("UpdateItem cannot update key column {}", column_name));
}
// If this is the timestamp attribute being set via UpdateExpression SET,
// it must be a valid numeric value (microseconds since epoch). Use it as
// the write timestamp and skip storing it. Reject if non-numeric.
if (_timestamp_attribute && to_bytes(column_name) == *_timestamp_attribute &&
actions.second.has_value() &&
std::holds_alternative<parsed::update_expression::action::set>(actions.second.get_value()._action)) {
std::optional<rjson::value> result = action_result(actions.second.get_value(), previous_item.get());
if (result) {
if (try_get_timestamp(*result)) {
continue; // Skip - already used as timestamp
}
throw api_error::validation(fmt::format("The '{}' attribute used as a write timestamp must be a positive number (microseconds since epoch)", to_string_view(*_timestamp_attribute)));
}
}
if (actions.second.has_value()) {
// An action on a top-level attribute column_name. The single
// action is actions.second.get_value(). We can simply invoke
@@ -4543,6 +4764,44 @@ std::optional<mutation> update_item_operation::apply(std::unique_ptr<rjson::valu
return {};
}
// If the table has a timestamp attribute, look for it in the update
// (AttributeUpdates PUT or UpdateExpression SET). If found with a valid
// numeric value, use it as the write timestamp instead of the provided ts.
api::timestamp_type effective_ts = ts;
if (_timestamp_attribute) {
bool found_ts = false;
if (_attribute_updates) {
std::string_view ts_attr = to_string_view(*_timestamp_attribute);
for (auto it = _attribute_updates->MemberBegin(); it != _attribute_updates->MemberEnd(); ++it) {
if (rjson::to_string_view(it->name) == ts_attr) {
const rjson::value* action = rjson::find(it->value, "Action");
if (action && rjson::to_string_view(*action) == "PUT" && it->value.HasMember("Value")) {
if (auto t = try_get_timestamp((it->value)["Value"])) {
effective_ts = *t;
found_ts = true;
}
}
break;
}
}
}
if (!found_ts && !_update_expression.empty()) {
std::string ts_attr(to_string_view(*_timestamp_attribute));
auto it = _update_expression.find(ts_attr);
if (it != _update_expression.end() && it->second.has_value()) {
const auto& action = it->second.get_value();
if (std::holds_alternative<parsed::update_expression::action::set>(action._action)) {
std::optional<rjson::value> result = action_result(action, previous_item.get());
if (result) {
if (auto t = try_get_timestamp(*result)) {
effective_ts = *t;
}
}
}
}
}
}
// In the ReturnValues=ALL_NEW case, we make a copy of previous_item into
// _return_attributes and parts of it will be overwritten by the new
// updates (in do_update() and do_delete()). We need to make a copy and
@@ -4571,10 +4830,10 @@ std::optional<mutation> update_item_operation::apply(std::unique_ptr<rjson::valu
auto& row = m.partition().clustered_row(*_schema, _ck);
auto modified_attrs = attribute_collector();
if (!_update_expression.empty()) {
apply_update_expression(previous_item, ts, row, modified_attrs, any_updates, any_deletes);
apply_update_expression(previous_item, effective_ts, row, modified_attrs, any_updates, any_deletes);
}
if (_attribute_updates) {
apply_attribute_updates(previous_item, ts, row, modified_attrs, any_updates, any_deletes);
apply_attribute_updates(previous_item, effective_ts, row, modified_attrs, any_updates, any_deletes);
}
if (!modified_attrs.empty()) {
auto serialized_map = modified_attrs.to_mut().serialize(*attrs_type());
@@ -4585,7 +4844,7 @@ std::optional<mutation> update_item_operation::apply(std::unique_ptr<rjson::valu
// marker. An update with only DELETE operations must not add a row marker
// (this was issue #5862) but any other update, even an empty one, should.
if (any_updates || !any_deletes) {
row.apply(row_marker(ts));
row.apply(row_marker(effective_ts));
} else if (_returnvalues == returnvalues::ALL_NEW && !previous_item) {
// There was no pre-existing item, and we're not creating one, so
// don't report the new item in the returned Attributes.

View File

@@ -50,7 +50,7 @@ public:
_operators.emplace_back(i);
check_depth_limit();
}
void add_dot(std::string(name)) {
void add_dot(std::string name) {
_operators.emplace_back(std::move(name));
check_depth_limit();
}
@@ -85,7 +85,7 @@ struct constant {
}
};
// "value" is is a value used in the right hand side of an assignment
// "value" is a value used in the right hand side of an assignment
// expression, "SET a = ...". It can be a constant (a reference to a value
// included in the request, e.g., ":val"), a path to an attribute from the
// existing item (e.g., "a.b[3].c"), or a function of other such values.
@@ -205,7 +205,7 @@ public:
// The supported primitive conditions are:
// 1. Binary operators - v1 OP v2, where OP is =, <>, <, <=, >, or >= and
// v1 and v2 are values - from the item (an attribute path), the query
// (a ":val" reference), or a function of the the above (only the size()
// (a ":val" reference), or a function of the above (only the size()
// function is supported).
// 2. Ternary operator - v1 BETWEEN v2 and v3 (means v1 >= v2 AND v1 <= v3).
// 3. N-ary operator - v1 IN ( v2, v3, ... )

View File

@@ -18,6 +18,7 @@
#include "executor.hh"
#include "tracing/trace_state.hh"
#include "keys/keys.hh"
#include "bytes.hh"
namespace alternator {
@@ -72,6 +73,11 @@ protected:
clustering_key _ck = clustering_key::make_empty();
write_isolation _write_isolation;
mutable wcu_consumed_capacity_counter _consumed_capacity;
// If the table has a "system:timestamp_attribute" tag, this holds the
// name of the attribute (converted to bytes) whose numeric value should
// be used as the write timestamp instead of the current time. The
// attribute itself is NOT stored in the item data.
std::optional<bytes> _timestamp_attribute;
// All RMW operations can have a ReturnValues parameter from the following
// choices. But note that only UpdateItem actually supports all of them:
enum class returnvalues {
@@ -113,6 +119,9 @@ public:
// Convert the above apply() into the signature needed by cas_request:
virtual std::optional<mutation> apply(foreign_ptr<lw_shared_ptr<query::result>> qr, const query::partition_slice& slice, api::timestamp_type ts, cdc::per_request_options& cdc_opts) override;
virtual ~rmw_operation() = default;
// Returns true if the operation will use a custom write timestamp (from the
// system:timestamp_attribute tag). Subclasses override this as needed.
virtual bool has_custom_timestamp() const noexcept { return false; }
const wcu_consumed_capacity_counter& consumed_capacity() const noexcept { return _consumed_capacity; }
schema_ptr schema() const { return _schema; }
const rjson::value& request() const { return _request; }

View File

@@ -55,7 +55,7 @@ partition_key pk_from_json(const rjson::value& item, schema_ptr schema);
clustering_key ck_from_json(const rjson::value& item, schema_ptr schema);
position_in_partition pos_from_json(const rjson::value& item, schema_ptr schema);
// If v encodes a number (i.e., it is a {"N": [...]}, returns an object representing it. Otherwise,
// If v encodes a number (i.e., it is a {"N": [...]}), returns an object representing it. Otherwise,
// raises ValidationException with diagnostic.
big_decimal unwrap_number(const rjson::value& v, std::string_view diagnostic);

View File

@@ -141,7 +141,7 @@ future<executor::request_return_type> executor::describe_time_to_live(client_sta
// expiration_service is a sharded service responsible for cleaning up expired
// items in all tables with per-item expiration enabled. Currently, this means
// Alternator tables with TTL configured via a UpdateTimeToLive request.
// Alternator tables with TTL configured via an UpdateTimeToLive request.
//
// Here is a brief overview of how the expiration service works:
//
@@ -593,7 +593,7 @@ static future<> scan_table_ranges(
if (retries >= 10) {
// Don't get stuck forever asking the same page, maybe there's
// a bug or a real problem in several replicas. Give up on
// this scan an retry the scan from a random position later,
// this scan and retry the scan from a random position later,
// in the next scan period.
throw runtime_exception("scanner thread failed after too many timeouts for the same page");
}
@@ -767,7 +767,7 @@ static future<bool> scan_table(
// by tasking another node to take over scanning of the dead node's primary
// ranges. What we do here is that this node will also check expiration
// on its *secondary* ranges - but only those whose primary owner is down.
auto tablet_secondary_replica = tablet_map.get_secondary_replica(*tablet); // throws if no secondary replica
auto tablet_secondary_replica = tablet_map.get_secondary_replica(*tablet, erm->get_topology()); // throws if no secondary replica
if (tablet_secondary_replica.host == my_host_id && tablet_secondary_replica.shard == this_shard_id()) {
if (!gossiper.is_alive(tablet_primary_replica.host)) {
co_await scan_tablet(*tablet, proxy, abort_source, page_sem, expiration_stats, scan_ctx, tablet_map);

View File

@@ -30,7 +30,7 @@ namespace alternator {
// expiration_service is a sharded service responsible for cleaning up expired
// items in all tables with per-item expiration enabled. Currently, this means
// Alternator tables with TTL configured via a UpdateTimeToLeave request.
// Alternator tables with TTL configured via an UpdateTimeToLive request.
class expiration_service final : public seastar::peering_sharded_service<expiration_service> {
public:
// Object holding per-shard statistics related to the expiration service.
@@ -52,7 +52,7 @@ private:
data_dictionary::database _db;
service::storage_proxy& _proxy;
gms::gossiper& _gossiper;
// _end is set by start(), and resolves when the the background service
// _end is set by start(), and resolves when the background service
// started by it ends. To ask the background service to end, _abort_source
// should be triggered. stop() below uses both _abort_source and _end.
std::optional<future<>> _end;

View File

@@ -12,7 +12,7 @@
"operations":[
{
"method":"POST",
"summary":"Reset cache",
"summary":"Resets authorized prepared statements cache",
"type":"void",
"nickname":"authorization_cache_reset",
"produces":[

View File

@@ -23,31 +23,6 @@
namespace api {
template<class T>
std::vector<T> map_to_key_value(const std::map<sstring, sstring>& map) {
std::vector<T> res;
res.reserve(map.size());
for (const auto& [key, value] : map) {
res.push_back(T());
res.back().key = key;
res.back().value = value;
}
return res;
}
template<class T, class MAP>
std::vector<T>& map_to_key_value(const MAP& map, std::vector<T>& res) {
res.reserve(res.size() + std::size(map));
for (const auto& [key, value] : map) {
T val;
val.key = fmt::to_string(key);
val.value = fmt::to_string(value);
res.push_back(val);
}
return res;
}
template <typename T, typename S = T>
T map_sum(T&& dest, const S& src) {
for (const auto& i : src) {

View File

@@ -515,6 +515,15 @@ void set_sstables_loader(http_context& ctx, routes& r, sharded<sstables_loader>&
auto sstables = parsed.GetArray() |
std::views::transform([] (const auto& s) { return sstring(rjson::to_string_view(s)); }) |
std::ranges::to<std::vector>();
apilog.info("Restore invoked with following parameters: keyspace={}, table={}, endpoint={}, bucket={}, prefix={}, sstables_count={}, scope={}, primary_replica_only={}",
keyspace,
table,
endpoint,
bucket,
prefix,
sstables.size(),
scope,
primary_replica_only);
auto task_id = co_await sst_loader.local().download_new_sstables(keyspace, table, prefix, std::move(sstables), endpoint, bucket, scope, primary_replica_only);
co_return json::json_return_type(fmt::to_string(task_id));
});
@@ -527,13 +536,15 @@ void unset_sstables_loader(http_context& ctx, routes& r) {
}
void set_view_builder(http_context& ctx, routes& r, sharded<db::view::view_builder>& vb, sharded<gms::gossiper>& g) {
ss::view_build_statuses.set(r, [&ctx, &vb, &g] (std::unique_ptr<http::request> req) {
ss::view_build_statuses.set(r, [&ctx, &vb, &g] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto keyspace = validate_keyspace(ctx, req);
auto view = req->get_path_param("view");
return vb.local().view_build_statuses(std::move(keyspace), std::move(view), g.local()).then([] (std::unordered_map<sstring, sstring> status) {
std::vector<storage_service_json::mapper> res;
return make_ready_future<json::json_return_type>(map_to_key_value(std::move(status), res));
});
co_return json::json_return_type(stream_range_as_array(co_await vb.local().view_build_statuses(std::move(keyspace), std::move(view), g.local()), [] (const auto& i) {
storage_service_json::mapper res;
res.key = i.first;
res.value = i.second;
return res;
}));
});
cf::get_built_indexes.set(r, [&vb](std::unique_ptr<http::request> req) -> future<json::json_return_type> {
@@ -571,6 +582,16 @@ static future<json::json_return_type> describe_ring_as_json_for_table(const shar
co_return json::json_return_type(stream_range_as_array(co_await ss.local().describe_ring_for_table(keyspace, table), token_range_endpoints_to_json));
}
namespace {
template <typename Key, typename Value>
storage_service_json::mapper map_to_json(const std::pair<Key, Value>& i) {
storage_service_json::mapper val;
val.key = fmt::to_string(i.first);
val.value = fmt::to_string(i.second);
return val;
}
}
static
future<json::json_return_type>
rest_get_token_endpoint(http_context& ctx, sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {
@@ -588,12 +609,7 @@ rest_get_token_endpoint(http_context& ctx, sharded<service::storage_service>& ss
throw bad_param_exception("Either provide both keyspace and table (for tablet table) or neither (for vnodes)");
}
co_return json::json_return_type(stream_range_as_array(token_endpoints, [](const auto& i) {
storage_service_json::mapper val;
val.key = fmt::to_string(i.first);
val.value = fmt::to_string(i.second);
return val;
}));
co_return json::json_return_type(stream_range_as_array(token_endpoints, &map_to_json<dht::token, gms::inet_address>));
}
static
@@ -677,7 +693,6 @@ rest_get_range_to_endpoint_map(http_context& ctx, sharded<service::storage_servi
table_id = validate_table(ctx.db.local(), keyspace, table);
}
std::vector<ss::maplist_mapper> res;
co_return stream_range_as_array(co_await ss.local().get_range_to_address_map(keyspace, table_id),
[](const std::pair<dht::token_range, inet_address_vector_replica_set>& entry){
ss::maplist_mapper m;
@@ -1308,10 +1323,7 @@ rest_get_ownership(http_context& ctx, sharded<service::storage_service>& ss, std
throw httpd::bad_param_exception("storage_service/ownership cannot be used when a keyspace uses tablets");
}
return ss.local().get_ownership().then([] (auto&& ownership) {
std::vector<storage_service_json::mapper> res;
return make_ready_future<json::json_return_type>(map_to_key_value(ownership, res));
});
co_return json::json_return_type(stream_range_as_array(co_await ss.local().get_ownership(), &map_to_json<gms::inet_address, float>));
}
static
@@ -1328,10 +1340,7 @@ rest_get_effective_ownership(http_context& ctx, sharded<service::storage_service
}
}
return ss.local().effective_ownership(keyspace_name, table_name).then([] (auto&& ownership) {
std::vector<storage_service_json::mapper> res;
return make_ready_future<json::json_return_type>(map_to_key_value(ownership, res));
});
co_return json::json_return_type(stream_range_as_array(co_await ss.local().effective_ownership(keyspace_name, table_name), &map_to_json<gms::inet_address, float>));
}
static
@@ -1341,7 +1350,7 @@ rest_estimate_compression_ratios(http_context& ctx, sharded<service::storage_ser
apilog.warn("estimate_compression_ratios: called before the cluster feature was enabled");
throw std::runtime_error("estimate_compression_ratios requires all nodes to support the SSTABLE_COMPRESSION_DICTS cluster feature");
}
auto ticket = get_units(ss.local().get_do_sample_sstables_concurrency_limiter(), 1);
auto ticket = co_await get_units(ss.local().get_do_sample_sstables_concurrency_limiter(), 1);
auto ks = api::req_param<sstring>(*req, "keyspace", {}).value;
auto cf = api::req_param<sstring>(*req, "cf", {}).value;
apilog.debug("estimate_compression_ratios: called with ks={} cf={}", ks, cf);
@@ -1407,7 +1416,7 @@ rest_retrain_dict(http_context& ctx, sharded<service::storage_service>& ss, serv
apilog.warn("retrain_dict: called before the cluster feature was enabled");
throw std::runtime_error("retrain_dict requires all nodes to support the SSTABLE_COMPRESSION_DICTS cluster feature");
}
auto ticket = get_units(ss.local().get_do_sample_sstables_concurrency_limiter(), 1);
auto ticket = co_await get_units(ss.local().get_do_sample_sstables_concurrency_limiter(), 1);
auto ks = api::req_param<sstring>(*req, "keyspace", {}).value;
auto cf = api::req_param<sstring>(*req, "cf", {}).value;
apilog.debug("retrain_dict: called with ks={} cf={}", ks, cf);

View File

@@ -17,7 +17,6 @@ target_sources(scylla_auth
password_authenticator.cc
passwords.cc
permission.cc
permissions_cache.cc
resource.cc
role_or_anonymous.cc
roles-metadata.cc

View File

@@ -8,6 +8,7 @@
#include "auth/cache.hh"
#include "auth/common.hh"
#include "auth/role_or_anonymous.hh"
#include "auth/roles-metadata.hh"
#include "cql3/query_processor.hh"
#include "cql3/untyped_result_set.hh"
@@ -18,6 +19,8 @@
#include <seastar/core/abort_source.hh>
#include <seastar/coroutine/maybe_yield.hh>
#include <seastar/core/format.hh>
#include <seastar/core/metrics.hh>
#include <seastar/core/do_with.hh>
namespace auth {
@@ -27,7 +30,21 @@ cache::cache(cql3::query_processor& qp, abort_source& as) noexcept
: _current_version(0)
, _qp(qp)
, _loading_sem(1)
, _as(as) {
, _as(as)
, _permission_loader(nullptr)
, _permission_loader_sem(8) {
namespace sm = seastar::metrics;
_metrics.add_group("auth_cache", {
sm::make_gauge("roles", [this] { return _roles.size(); },
sm::description("Number of roles currently cached")),
sm::make_gauge("permissions", [this] {
return _cached_permissions_count;
}, sm::description("Total number of permission sets currently cached across all roles"))
});
}
void cache::set_permission_loader(permission_loader_func loader) {
_permission_loader = std::move(loader);
}
lw_shared_ptr<const cache::role_record> cache::get(const role_name_t& role) const noexcept {
@@ -38,6 +55,83 @@ lw_shared_ptr<const cache::role_record> cache::get(const role_name_t& role) cons
return it->second;
}
future<permission_set> cache::get_permissions(const role_or_anonymous& role, const resource& r) {
std::unordered_map<resource, permission_set>* perms_cache;
lw_shared_ptr<role_record> role_ptr;
if (is_anonymous(role)) {
perms_cache = &_anonymous_permissions;
} else {
const auto& role_name = *role.name;
auto role_it = _roles.find(role_name);
if (role_it == _roles.end()) {
// The role might have been deleted while some connections still reference it.
// Those connections should no longer have access to anything.
return make_ready_future<permission_set>(permissions::NONE);
}
role_ptr = role_it->second;
perms_cache = &role_ptr->cached_permissions;
}
if (auto it = perms_cache->find(r); it != perms_cache->end()) {
return make_ready_future<permission_set>(it->second);
}
// keep alive role_ptr as it holds perms_cache (except anonymous)
return do_with(std::move(role_ptr), [this, &role, &r, perms_cache] (auto& role_ptr) {
return load_permissions(role, r, perms_cache);
});
}
future<permission_set> cache::load_permissions(const role_or_anonymous& role, const resource& r, std::unordered_map<resource, permission_set>* perms_cache) {
SCYLLA_ASSERT(_permission_loader);
auto units = co_await get_units(_permission_loader_sem, 1, _as);
// Check again: perhaps we were blocked and another call already loaded
// the permissions. This protects against a storm of cache misses.
if (auto it = perms_cache->find(r); it != perms_cache->end()) {
co_return it->second;
}
auto perms = co_await _permission_loader(role, r);
add_permissions(*perms_cache, r, perms);
co_return perms;
}
future<> cache::prune(const resource& r) {
auto units = co_await get_units(_loading_sem, 1, _as);
_anonymous_permissions.erase(r);
for (auto& it : _roles) {
// Pruning can run concurrently with other functions, but at worst it
// causes an extra reload of cached_permissions via get_permissions.
remove_permissions(it.second->cached_permissions, r);
co_await coroutine::maybe_yield();
}
}
future<> cache::reload_all_permissions() noexcept {
SCYLLA_ASSERT(_permission_loader);
auto units = co_await get_units(_loading_sem, 1, _as);
auto copy_keys = [] (const std::unordered_map<resource, permission_set>& m) {
std::vector<resource> keys;
keys.reserve(m.size());
for (const auto& [res, _] : m) {
keys.push_back(res);
}
return keys;
};
const role_or_anonymous anon;
for (const auto& res : copy_keys(_anonymous_permissions)) {
_anonymous_permissions[res] = co_await _permission_loader(anon, res);
}
for (auto& [role, entry] : _roles) {
auto& perms_cache = entry->cached_permissions;
auto r = role_or_anonymous(role);
for (const auto& res : copy_keys(perms_cache)) {
perms_cache[res] = co_await _permission_loader(r, res);
}
}
logger.debug("Reloaded auth cache with {} entries", _roles.size());
}
future<lw_shared_ptr<cache::role_record>> cache::fetch_role(const role_name_t& role) const {
auto rec = make_lw_shared<role_record>();
rec->version = _current_version;
@@ -105,7 +199,7 @@ future<lw_shared_ptr<cache::role_record>> cache::fetch_role(const role_name_t& r
future<> cache::prune_all() noexcept {
for (auto it = _roles.begin(); it != _roles.end(); ) {
if (it->second->version != _current_version) {
_roles.erase(it++);
remove_role(it++);
co_await coroutine::maybe_yield();
} else {
++it;
@@ -129,7 +223,7 @@ future<> cache::load_all() {
const auto name = r.get_as<sstring>("role");
auto role = co_await fetch_role(name);
if (role) {
_roles[name] = role;
add_role(name, role);
}
co_return stop_iteration::no;
};
@@ -142,11 +236,32 @@ future<> cache::load_all() {
co_await distribute_role(name, role);
}
co_await container().invoke_on_others([this](cache& c) -> future<> {
auto units = co_await get_units(c._loading_sem, 1, c._as);
c._current_version = _current_version;
co_await c.prune_all();
});
}
future<> cache::gather_inheriting_roles(std::unordered_set<role_name_t>& roles, lw_shared_ptr<cache::role_record> role, const role_name_t& name) {
if (!role) {
// The role might have been removed or not yet added; either way
// its members will be handled by another top-level call to this function.
co_return;
}
for (const auto& member_name : role->members) {
bool is_new = roles.insert(member_name).second;
if (!is_new) {
continue;
}
lw_shared_ptr<cache::role_record> member_role;
auto r = _roles.find(member_name);
if (r != _roles.end()) {
member_role = r->second;
}
co_await gather_inheriting_roles(roles, member_role, member_name);
}
}
future<> cache::load_roles(std::unordered_set<role_name_t> roles) {
if (legacy_mode(_qp)) {
co_return;
@@ -154,27 +269,41 @@ future<> cache::load_roles(std::unordered_set<role_name_t> roles) {
SCYLLA_ASSERT(this_shard_id() == 0);
auto units = co_await get_units(_loading_sem, 1, _as);
std::unordered_set<role_name_t> roles_to_clear_perms;
for (const auto& name : roles) {
logger.info("Loading role {}", name);
auto role = co_await fetch_role(name);
if (role) {
_roles[name] = role;
add_role(name, role);
co_await gather_inheriting_roles(roles_to_clear_perms, role, name);
} else {
_roles.erase(name);
if (auto it = _roles.find(name); it != _roles.end()) {
auto old_role = it->second;
remove_role(it);
co_await gather_inheriting_roles(roles_to_clear_perms, old_role, name);
}
}
co_await distribute_role(name, role);
}
co_await container().invoke_on_all([&roles_to_clear_perms] (cache& c) -> future<> {
for (const auto& name : roles_to_clear_perms) {
c.clear_role_permissions(name);
co_await coroutine::maybe_yield();
}
});
}
future<> cache::distribute_role(const role_name_t& name, lw_shared_ptr<role_record> role) {
auto role_ptr = role.get();
co_await container().invoke_on_others([&name, role_ptr](cache& c) {
co_await container().invoke_on_others([&name, role_ptr](cache& c) -> future<> {
auto units = co_await get_units(c._loading_sem, 1, c._as);
if (!role_ptr) {
c._roles.erase(name);
return;
c.remove_role(name);
co_return;
}
auto role_copy = make_lw_shared<role_record>(*role_ptr);
c._roles[name] = std::move(role_copy);
c.add_role(name, std::move(role_copy));
});
}
@@ -185,4 +314,40 @@ bool cache::includes_table(const table_id& id) noexcept {
|| id == db::system_keyspace::role_permissions()->id();
}
void cache::add_role(const role_name_t& name, lw_shared_ptr<role_record> role) {
if (auto it = _roles.find(name); it != _roles.end()) {
_cached_permissions_count -= it->second->cached_permissions.size();
}
_cached_permissions_count += role->cached_permissions.size();
_roles[name] = std::move(role);
}
void cache::remove_role(const role_name_t& name) {
if (auto it = _roles.find(name); it != _roles.end()) {
remove_role(it);
}
}
void cache::remove_role(roles_map::iterator it) {
_cached_permissions_count -= it->second->cached_permissions.size();
_roles.erase(it);
}
void cache::clear_role_permissions(const role_name_t& name) {
if (auto it = _roles.find(name); it != _roles.end()) {
_cached_permissions_count -= it->second->cached_permissions.size();
it->second->cached_permissions.clear();
}
}
void cache::add_permissions(std::unordered_map<resource, permission_set>& cache, const resource& r, permission_set perms) {
if (cache.emplace(r, perms).second) {
++_cached_permissions_count;
}
}
void cache::remove_permissions(std::unordered_map<resource, permission_set>& cache, const resource& r) {
_cached_permissions_count -= cache.erase(r);
}
} // namespace auth
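For orientation, cache::load_permissions() above uses a standard double-checked load under a semaphore: check the map, wait for a unit of _permission_loader_sem, check again (another fiber may have loaded the entry while we waited), and only then call the loader. A minimal, self-contained sketch of that pattern with placeholder types (demo_cache and load_from_backend are illustrative, not Scylla code):

#include <seastar/core/semaphore.hh>
#include <seastar/core/coroutine.hh>
#include <unordered_map>
#include <string>

struct demo_cache {
    std::unordered_map<std::string, int> values;
    seastar::semaphore load_sem{8};   // mirrors _permission_loader_sem

    seastar::future<int> get(std::string key) {
        if (auto it = values.find(key); it != values.end()) {
            co_return it->second;                      // fast path: already cached
        }
        auto units = co_await seastar::get_units(load_sem, 1);
        if (auto it = values.find(key); it != values.end()) {
            co_return it->second;                      // loaded by someone else while we waited
        }
        int loaded = co_await load_from_backend(key);  // placeholder for the real loader
        values.emplace(key, loaded);
        co_return loaded;
    }

    seastar::future<int> load_from_backend(std::string key) {
        co_return int(key.size());                     // stand-in for a real query
    }
};

In the patch the loader is installed by service::start() via set_permission_loader(), and the semaphore bounds how many loader calls can run at once when many requests miss on the same role change.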

View File

@@ -17,11 +17,14 @@
#include <seastar/core/sharded.hh>
#include <seastar/core/shared_ptr.hh>
#include <seastar/core/semaphore.hh>
#include <seastar/core/metrics_registration.hh>
#include <absl/container/flat_hash_map.h>
#include "auth/permission.hh"
#include "auth/common.hh"
#include "auth/resource.hh"
#include "auth/role_or_anonymous.hh"
namespace cql3 { class query_processor; }
@@ -31,6 +34,7 @@ class cache : public peering_sharded_service<cache> {
public:
using role_name_t = sstring;
using version_tag_t = char;
using permission_loader_func = std::function<future<permission_set>(const role_or_anonymous&, const resource&)>;
struct role_record {
bool can_login = false;
@@ -40,11 +44,19 @@ public:
sstring salted_hash;
std::unordered_map<sstring, sstring> attributes;
std::unordered_map<sstring, permission_set> permissions;
private:
friend cache;
// cached permissions include the effects of the role's inheritance
std::unordered_map<resource, permission_set> cached_permissions;
version_tag_t version; // used for seamless cache reloads
};
explicit cache(cql3::query_processor& qp, abort_source& as) noexcept;
lw_shared_ptr<const role_record> get(const role_name_t& role) const noexcept;
void set_permission_loader(permission_loader_func loader);
future<permission_set> get_permissions(const role_or_anonymous& role, const resource& r);
future<> prune(const resource& r);
future<> reload_all_permissions() noexcept;
future<> load_all();
future<> load_roles(std::unordered_set<role_name_t> roles);
static bool includes_table(const table_id&) noexcept;
@@ -52,14 +64,31 @@ public:
private:
using roles_map = absl::flat_hash_map<role_name_t, lw_shared_ptr<role_record>>;
roles_map _roles;
// The anonymous permissions map exists mainly for compatibility with
// higher layers, which use role_or_anonymous to get permissions.
std::unordered_map<resource, permission_set> _anonymous_permissions;
version_tag_t _current_version;
cql3::query_processor& _qp;
semaphore _loading_sem;
semaphore _loading_sem; // protects iteration of _roles map
abort_source& _as;
permission_loader_func _permission_loader;
semaphore _permission_loader_sem; // protects against reload storms on a single role change
metrics::metric_groups _metrics;
size_t _cached_permissions_count = 0;
future<lw_shared_ptr<role_record>> fetch_role(const role_name_t& role) const;
future<> prune_all() noexcept;
future<> distribute_role(const role_name_t& name, const lw_shared_ptr<role_record> role);
future<> gather_inheriting_roles(std::unordered_set<role_name_t>& roles, lw_shared_ptr<cache::role_record> role, const role_name_t& name);
void add_role(const role_name_t& name, lw_shared_ptr<role_record> role);
void remove_role(const role_name_t& name);
void remove_role(roles_map::iterator it);
void clear_role_permissions(const role_name_t& name);
void add_permissions(std::unordered_map<resource, permission_set>& cache, const resource& r, permission_set perms);
void remove_permissions(std::unordered_map<resource, permission_set>& cache, const resource& r);
future<permission_set> load_permissions(const role_or_anonymous& role, const resource& r, std::unordered_map<resource, permission_set>* perms_cache);
};
} // namespace auth

View File

@@ -88,10 +88,16 @@ static const class_registrator<
ldap_role_manager::ldap_role_manager(
std::string_view query_template, std::string_view target_attr, std::string_view bind_name, std::string_view bind_password,
uint32_t permissions_update_interval_in_ms,
utils::observer<uint32_t> permissions_update_interval_in_ms_observer,
cql3::query_processor& qp, ::service::raft_group0_client& rg0c, ::service::migration_manager& mm, cache& cache)
: _std_mgr(qp, rg0c, mm, cache), _group0_client(rg0c), _query_template(query_template), _target_attr(target_attr), _bind_name(bind_name)
, _bind_password(bind_password)
, _connection_factory(bind(std::mem_fn(&ldap_role_manager::reconnect), std::ref(*this))) {
, _permissions_update_interval_in_ms(permissions_update_interval_in_ms)
, _permissions_update_interval_in_ms_observer(std::move(permissions_update_interval_in_ms_observer))
, _connection_factory(bind(std::mem_fn(&ldap_role_manager::reconnect), std::ref(*this)))
, _cache(cache)
, _cache_pruner(make_ready_future<>()) {
}
ldap_role_manager::ldap_role_manager(cql3::query_processor& qp, ::service::raft_group0_client& rg0c, ::service::migration_manager& mm, cache& cache)
@@ -100,6 +106,8 @@ ldap_role_manager::ldap_role_manager(cql3::query_processor& qp, ::service::raft_
qp.db().get_config().ldap_attr_role(),
qp.db().get_config().ldap_bind_dn(),
qp.db().get_config().ldap_bind_passwd(),
qp.db().get_config().permissions_update_interval_in_ms(),
qp.db().get_config().permissions_update_interval_in_ms.observe([this] (const uint32_t& v) { _permissions_update_interval_in_ms = v; }),
qp,
rg0c,
mm,
@@ -119,6 +127,22 @@ future<> ldap_role_manager::start() {
return make_exception_future(
std::runtime_error(fmt::format("error getting LDAP server address from template {}", _query_template)));
}
_cache_pruner = futurize_invoke([this] () -> future<> {
while (true) {
try {
co_await seastar::sleep_abortable(std::chrono::milliseconds(_permissions_update_interval_in_ms), _as);
} catch (const seastar::sleep_aborted&) {
co_return; // ignore
}
co_await _cache.container().invoke_on_all([] (cache& c) -> future<> {
try {
co_await c.reload_all_permissions();
} catch (...) {
mylog.warn("Cache reload all permissions failed: {}", std::current_exception());
}
});
}
});
return _std_mgr.start();
}
@@ -175,7 +199,11 @@ future<conn_ptr> ldap_role_manager::reconnect() {
future<> ldap_role_manager::stop() {
_as.request_abort();
return _std_mgr.stop().then([this] { return _connection_factory.stop(); });
return std::move(_cache_pruner).then([this] {
return _std_mgr.stop();
}).then([this] {
return _connection_factory.stop();
});
}
future<> ldap_role_manager::create(std::string_view name, const role_config& config, ::service::group0_batch& mc) {
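The _cache_pruner fiber started in start() above follows a common Seastar shape: loop, sleep for the configured interval with sleep_abortable, exit cleanly when the abort_source fires, otherwise do the periodic work; stop() then waits on the saved future before shutting the rest down. A stripped-down sketch of that loop (the interval and the body are placeholders for _permissions_update_interval_in_ms and reload_all_permissions):

#include <seastar/core/abort_source.hh>
#include <seastar/core/sleep.hh>
#include <seastar/core/coroutine.hh>
#include <chrono>

// Illustrative background refresher, mirroring the _cache_pruner fiber above.
seastar::future<> periodic_refresh(seastar::abort_source& as) {
    while (true) {
        try {
            co_await seastar::sleep_abortable(std::chrono::seconds(2), as);
        } catch (const seastar::sleep_aborted&) {
            co_return; // stop() requested an abort; end the fiber cleanly
        }
        // placeholder for the periodic work (reload_all_permissions in the patch)
    }
}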

View File

@@ -10,6 +10,7 @@
#pragma once
#include <seastar/core/abort_source.hh>
#include <seastar/core/future.hh>
#include <stdexcept>
#include "ent/ldap/ldap_connection.hh"
@@ -34,14 +35,22 @@ class ldap_role_manager : public role_manager {
seastar::sstring _target_attr; ///< LDAP entry attribute containing the Scylla role name.
seastar::sstring _bind_name; ///< Username for LDAP simple bind.
seastar::sstring _bind_password; ///< Password for LDAP simple bind.
uint32_t _permissions_update_interval_in_ms;
utils::observer<uint32_t> _permissions_update_interval_in_ms_observer;
mutable ldap_reuser _connection_factory; // Potentially modified by query_granted().
seastar::abort_source _as;
cache& _cache;
seastar::future<> _cache_pruner;
public:
ldap_role_manager(
std::string_view query_template, ///< LDAP query template as described in Scylla documentation.
std::string_view target_attr, ///< LDAP entry attribute containing the Scylla role name.
std::string_view bind_name, ///< LDAP bind credentials.
std::string_view bind_password, ///< LDAP bind credentials.
uint32_t permissions_update_interval_in_ms,
utils::observer<uint32_t> permissions_update_interval_in_ms_observer,
cql3::query_processor& qp, ///< Passed to standard_role_manager.
::service::raft_group0_client& rg0c, ///< Passed to standard_role_manager.
::service::migration_manager& mm, ///< Passed to standard_role_manager.

View File

@@ -1,38 +0,0 @@
/*
* Copyright (C) 2017-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include "auth/permissions_cache.hh"
#include <fmt/ranges.h>
#include "auth/authorizer.hh"
#include "auth/service.hh"
namespace auth {
permissions_cache::permissions_cache(const utils::loading_cache_config& c, service& ser, logging::logger& log)
: _cache(c, log, [&ser, &log](const key_type& k) {
log.debug("Refreshing permissions for {}", k.first);
return ser.get_uncached_permissions(k.first, k.second);
}) {
}
bool permissions_cache::update_config(utils::loading_cache_config c) {
return _cache.update_config(std::move(c));
}
void permissions_cache::reset() {
_cache.reset();
}
future<permission_set> permissions_cache::get(const role_or_anonymous& maybe_role, const resource& r) {
return do_with(key_type(maybe_role, r), [this](const auto& k) {
return _cache.get(k);
});
}
}

View File

@@ -1,66 +0,0 @@
/*
* Copyright (C) 2017-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once
#include <iostream>
#include <utility>
#include <fmt/core.h>
#include <seastar/core/future.hh>
#include "auth/permission.hh"
#include "auth/resource.hh"
#include "auth/role_or_anonymous.hh"
#include "utils/log.hh"
#include "utils/hash.hh"
#include "utils/loading_cache.hh"
namespace std {
inline std::ostream& operator<<(std::ostream& os, const pair<auth::role_or_anonymous, auth::resource>& p) {
fmt::print(os, "{{role: {}, resource: {}}}", p.first, p.second);
return os;
}
}
namespace db {
class config;
}
namespace auth {
class service;
class permissions_cache final {
using cache_type = utils::loading_cache<
std::pair<role_or_anonymous, resource>,
permission_set,
1,
utils::loading_cache_reload_enabled::yes,
utils::simple_entry_size<permission_set>,
utils::tuple_hash>;
using key_type = typename cache_type::key_type;
cache_type _cache;
public:
explicit permissions_cache(const utils::loading_cache_config&, service&, logging::logger&);
future <> stop() {
return _cache.stop();
}
bool update_config(utils::loading_cache_config);
void reset();
future<permission_set> get(const role_or_anonymous&, const resource&);
};
}

View File

@@ -64,11 +64,11 @@ static const sstring superuser_col_name("super");
static logging::logger log("auth_service");
class auth_migration_listener final : public ::service::migration_listener {
authorizer& _authorizer;
service& _service;
cql3::query_processor& _qp;
public:
explicit auth_migration_listener(authorizer& a, cql3::query_processor& qp) : _authorizer(a), _qp(qp) {
explicit auth_migration_listener(service& s, cql3::query_processor& qp) : _service(s), _qp(qp) {
}
private:
@@ -92,14 +92,14 @@ private:
return;
}
// Do it in the background.
(void)do_with(::service::group0_batch::unused(), [this, &ks_name] (auto& mc) mutable {
return _authorizer.revoke_all(auth::make_data_resource(ks_name), mc);
(void)do_with(auth::make_data_resource(ks_name), ::service::group0_batch::unused(), [this] (auto& r, auto& mc) mutable {
return _service.revoke_all(r, mc);
}).handle_exception([] (std::exception_ptr e) {
log.error("Unexpected exception while revoking all permissions on dropped keyspace: {}", e);
});
(void)do_with(::service::group0_batch::unused(), [this, &ks_name] (auto& mc) mutable {
return _authorizer.revoke_all(auth::make_functions_resource(ks_name), mc);
(void)do_with(auth::make_functions_resource(ks_name), ::service::group0_batch::unused(), [this] (auto& r, auto& mc) mutable {
return _service.revoke_all(r, mc);
}).handle_exception([] (std::exception_ptr e) {
log.error("Unexpected exception while revoking all permissions on functions in dropped keyspace: {}", e);
});
@@ -111,9 +111,8 @@ private:
return;
}
// Do it in the background.
(void)do_with(::service::group0_batch::unused(), [this, &ks_name, &cf_name] (auto& mc) mutable {
return _authorizer.revoke_all(
auth::make_data_resource(ks_name, cf_name), mc);
(void)do_with(auth::make_data_resource(ks_name, cf_name), ::service::group0_batch::unused(), [this] (auto& r, auto& mc) mutable {
return _service.revoke_all(r, mc);
}).handle_exception([] (std::exception_ptr e) {
log.error("Unexpected exception while revoking all permissions on dropped table: {}", e);
});
@@ -126,9 +125,8 @@ private:
return;
}
// Do it in the background.
(void)do_with(::service::group0_batch::unused(), [this, &ks_name, &function_name] (auto& mc) mutable {
return _authorizer.revoke_all(
auth::make_functions_resource(ks_name, function_name), mc);
(void)do_with(auth::make_functions_resource(ks_name, function_name), ::service::group0_batch::unused(), [this] (auto& r, auto& mc) mutable {
return _service.revoke_all(r, mc);
}).handle_exception([] (std::exception_ptr e) {
log.error("Unexpected exception while revoking all permissions on dropped function: {}", e);
});
@@ -138,9 +136,8 @@ private:
// in non legacy path revoke is part of schema change statement execution
return;
}
(void)do_with(::service::group0_batch::unused(), [this, &ks_name, &aggregate_name] (auto& mc) mutable {
return _authorizer.revoke_all(
auth::make_functions_resource(ks_name, aggregate_name), mc);
(void)do_with(auth::make_functions_resource(ks_name, aggregate_name), ::service::group0_batch::unused(), [this] (auto& r, auto& mc) mutable {
return _service.revoke_all(r, mc);
}).handle_exception([] (std::exception_ptr e) {
log.error("Unexpected exception while revoking all permissions on dropped aggregate: {}", e);
});
@@ -157,7 +154,6 @@ static future<> validate_role_exists(const service& ser, std::string_view role_n
}
service::service(
utils::loading_cache_config c,
cache& cache,
cql3::query_processor& qp,
::service::raft_group0_client& g0,
@@ -166,25 +162,17 @@ service::service(
std::unique_ptr<authenticator> a,
std::unique_ptr<role_manager> r,
maintenance_socket_enabled used_by_maintenance_socket)
: _loading_cache_config(std::move(c))
, _permissions_cache(nullptr)
, _cache(cache)
: _cache(cache)
, _qp(qp)
, _group0_client(g0)
, _mnotifier(mn)
, _authorizer(std::move(z))
, _authenticator(std::move(a))
, _role_manager(std::move(r))
, _migration_listener(std::make_unique<auth_migration_listener>(*_authorizer, qp))
, _permissions_cache_cfg_cb([this] (uint32_t) { (void) _permissions_cache_config_action.trigger_later(); })
, _permissions_cache_config_action([this] { update_cache_config(); return make_ready_future<>(); })
, _permissions_cache_max_entries_observer(_qp.db().get_config().permissions_cache_max_entries.observe(_permissions_cache_cfg_cb))
, _permissions_cache_update_interval_in_ms_observer(_qp.db().get_config().permissions_update_interval_in_ms.observe(_permissions_cache_cfg_cb))
, _permissions_cache_validity_in_ms_observer(_qp.db().get_config().permissions_validity_in_ms.observe(_permissions_cache_cfg_cb))
, _migration_listener(std::make_unique<auth_migration_listener>(*this, qp))
, _used_by_maintenance_socket(used_by_maintenance_socket) {}
service::service(
utils::loading_cache_config c,
cql3::query_processor& qp,
::service::raft_group0_client& g0,
::service::migration_notifier& mn,
@@ -193,7 +181,6 @@ service::service(
maintenance_socket_enabled used_by_maintenance_socket,
cache& cache)
: service(
std::move(c),
cache,
qp,
g0,
@@ -257,7 +244,14 @@ future<> service::start(::service::migration_manager& mm, db::system_keyspace& s
co_await _role_manager->ensure_superuser_is_created();
}
co_await when_all_succeed(_authorizer->start(), _authenticator->start()).discard_result();
_permissions_cache = std::make_unique<permissions_cache>(_loading_cache_config, *this, log);
if (!_used_by_maintenance_socket) {
// Maintenance socket mode can't cache permissions because it uses a
// different authorizer. We can't mix cached permissions with those from
// normal mode, as they could differ.
_cache.set_permission_loader(std::bind(
&service::get_uncached_permissions,
this, std::placeholders::_1, std::placeholders::_2));
}
co_await once_among_shards([this] {
_mnotifier.register_listener(_migration_listener.get());
return make_ready_future<>();
@@ -269,9 +263,7 @@ future<> service::stop() {
// Only one of the shards has the listener registered, but let's try to
// unregister on each one just to make sure.
return _mnotifier.unregister_listener(_migration_listener.get()).then([this] {
if (_permissions_cache) {
return _permissions_cache->stop();
}
_cache.set_permission_loader(nullptr);
return make_ready_future<>();
}).then([this] {
return when_all_succeed(_role_manager->stop(), _authorizer->stop(), _authenticator->stop()).discard_result();
@@ -283,21 +275,8 @@ future<> service::ensure_superuser_is_created() {
co_await _authenticator->ensure_superuser_is_created();
}
void service::update_cache_config() {
auto db = _qp.db();
utils::loading_cache_config perm_cache_config;
perm_cache_config.max_size = db.get_config().permissions_cache_max_entries();
perm_cache_config.expiry = std::chrono::milliseconds(db.get_config().permissions_validity_in_ms());
perm_cache_config.refresh = std::chrono::milliseconds(db.get_config().permissions_update_interval_in_ms());
if (!_permissions_cache->update_config(std::move(perm_cache_config))) {
log.error("Failed to apply permissions cache changes. Please read the documentation of these parameters");
}
}
void service::reset_authorization_cache() {
_permissions_cache->reset();
_qp.reset_cache();
}
@@ -322,7 +301,10 @@ service::get_uncached_permissions(const role_or_anonymous& maybe_role, const res
}
future<permission_set> service::get_permissions(const role_or_anonymous& maybe_role, const resource& r) const {
return _permissions_cache->get(maybe_role, r);
if (legacy_mode(_qp) || _used_by_maintenance_socket) {
return get_uncached_permissions(maybe_role, r);
}
return _cache.get_permissions(maybe_role, r);
}
future<bool> service::has_superuser(std::string_view role_name, const role_set& roles) const {
@@ -447,6 +429,11 @@ future<bool> service::exists(const resource& r) const {
return make_ready_future<bool>(false);
}
future<> service::revoke_all(const resource& r, ::service::group0_batch& mc) const {
co_await _authorizer->revoke_all(r, mc);
co_await _cache.prune(r);
}
future<std::vector<cql3::description>> service::describe_roles(bool with_hashed_passwords) {
std::vector<cql3::description> result{};
@@ -801,7 +788,7 @@ future<> revoke_permissions(
}
future<> revoke_all(const service& ser, const resource& r, ::service::group0_batch& mc) {
return ser.underlying_authorizer().revoke_all(r, mc);
return ser.revoke_all(r, mc);
}
future<std::vector<permission_details>> list_filtered_permissions(

View File

@@ -20,7 +20,6 @@
#include "auth/authenticator.hh"
#include "auth/authorizer.hh"
#include "auth/permission.hh"
#include "auth/permissions_cache.hh"
#include "auth/cache.hh"
#include "auth/role_manager.hh"
#include "auth/common.hh"
@@ -75,8 +74,6 @@ public:
/// peering_sharded_service inheritance is needed to be able to access shard local authentication service
/// given an object from another shard. Used for bouncing lwt requests to correct shard.
class service final : public seastar::peering_sharded_service<service> {
utils::loading_cache_config _loading_cache_config;
std::unique_ptr<permissions_cache> _permissions_cache;
cache& _cache;
cql3::query_processor& _qp;
@@ -94,20 +91,12 @@ class service final : public seastar::peering_sharded_service<service> {
// Only one of these should be registered, so we end up with some unused instances. Not the end of the world.
std::unique_ptr<::service::migration_listener> _migration_listener;
std::function<void(uint32_t)> _permissions_cache_cfg_cb;
serialized_action _permissions_cache_config_action;
utils::observer<uint32_t> _permissions_cache_max_entries_observer;
utils::observer<uint32_t> _permissions_cache_update_interval_in_ms_observer;
utils::observer<uint32_t> _permissions_cache_validity_in_ms_observer;
maintenance_socket_enabled _used_by_maintenance_socket;
abort_source _as;
public:
service(
utils::loading_cache_config,
cache& cache,
cql3::query_processor&,
::service::raft_group0_client&,
@@ -123,7 +112,6 @@ public:
/// of the instances themselves.
///
service(
utils::loading_cache_config,
cql3::query_processor&,
::service::raft_group0_client&,
::service::migration_notifier&,
@@ -138,8 +126,6 @@ public:
future<> ensure_superuser_is_created();
void update_cache_config();
void reset_authorization_cache();
///
@@ -181,6 +167,13 @@ public:
future<bool> exists(const resource&) const;
///
/// Revoke all permissions granted to any role for a particular resource.
///
/// \throws \ref unsupported_authorization_operation if revoking permissions is not supported.
///
future<> revoke_all(const resource&, ::service::group0_batch&) const;
///
/// Produces descriptions that can be used to restore the state of auth. That encompasses
/// roles, role grants, and permission grants.

View File

@@ -814,8 +814,7 @@ generation_service::generation_service(
config cfg, gms::gossiper& g, sharded<db::system_distributed_keyspace>& sys_dist_ks,
sharded<db::system_keyspace>& sys_ks,
abort_source& abort_src, const locator::shared_token_metadata& stm, gms::feature_service& f,
replica::database& db,
std::function<bool()> raft_topology_change_enabled)
replica::database& db)
: _cfg(std::move(cfg))
, _gossiper(g)
, _sys_dist_ks(sys_dist_ks)
@@ -824,7 +823,6 @@ generation_service::generation_service(
, _token_metadata(stm)
, _feature_service(f)
, _db(db)
, _raft_topology_change_enabled(std::move(raft_topology_change_enabled))
{
}
@@ -878,16 +876,7 @@ future<> generation_service::on_join(gms::inet_address ep, locator::host_id id,
future<> generation_service::on_change(gms::inet_address ep, locator::host_id id, const gms::application_state_map& states, gms::permit_id pid) {
assert_shard_zero(__PRETTY_FUNCTION__);
if (_raft_topology_change_enabled()) {
return make_ready_future<>();
}
return on_application_state_change(ep, id, states, gms::application_state::CDC_GENERATION_ID, pid, [this] (gms::inet_address ep, locator::host_id id, const gms::versioned_value& v, gms::permit_id) {
auto gen_id = gms::versioned_value::cdc_generation_id_from_string(v.value());
cdc_log.debug("Endpoint: {}, CDC generation ID change: {}", ep, gen_id);
return legacy_handle_cdc_generation(gen_id);
});
return make_ready_future<>();
}
future<> generation_service::check_and_repair_cdc_streams() {

View File

@@ -79,17 +79,12 @@ private:
std::optional<cdc::generation_id> _gen_id;
future<> _cdc_streams_rewrite_complete = make_ready_future<>();
/* Returns true if raft topology changes are enabled.
* Can only be called from shard 0.
*/
std::function<bool()> _raft_topology_change_enabled;
public:
generation_service(config cfg, gms::gossiper&,
sharded<db::system_distributed_keyspace>&,
sharded<db::system_keyspace>& sys_ks,
abort_source&, const locator::shared_token_metadata&,
gms::feature_service&, replica::database& db,
std::function<bool()> raft_topology_change_enabled);
gms::feature_service&, replica::database& db);
future<> stop();
~generation_service();

View File

@@ -299,13 +299,11 @@ batch_size_fail_threshold_in_kb: 1024
# max_hint_window_in_ms: 10800000 # 3 hours
# Validity period for permissions cache (fetching permissions can be an
# expensive operation depending on the authorizer, CassandraAuthorizer is
# one example). Defaults to 10000, set to 0 to disable.
# Validity period for authorized statements cache. Defaults to 10000, set to 0 to disable.
# Will be disabled automatically for AllowAllAuthorizer.
# permissions_validity_in_ms: 10000
# Refresh interval for permissions cache (if enabled).
# Refresh interval for authorized statements cache.
# After this interval, cache entries become eligible for refresh. Upon next
# access, an async reload is scheduled and the old value returned until it
# completes. If permissions_validity_in_ms is non-zero, then this also must have
@@ -566,15 +564,16 @@ commitlog_total_space_in_mb: -1
# prometheus_address: 1.2.3.4
# audit settings
# By default, Scylla does not audit anything.
# Table audit is enabled by default.
# 'audit' config option controls if and where to output audited events:
# - "none": auditing is disabled (default)
# - "table": save audited events in audit.audit_log column family
# - "none": auditing is disabled
# - "table": save audited events in audit.audit_log column family (default)
# - "syslog": send audited events via syslog (depends on OS, but usually to /dev/log)
audit: "table"
#
# List of statement categories that should be audited.
audit_categories: "DCL,DDL,AUTH,ADMIN"
# Possible categories are: QUERY, DML, DCL, DDL, AUTH, ADMIN
audit_categories: "DCL,AUTH,ADMIN"
#
# List of tables that should be audited.
# audit_tables: "<keyspace_name>.<table_name>,<keyspace_name>.<table_name>"

View File

@@ -730,28 +730,6 @@ vector_search_tests = set([
'test/vector_search/rescoring_test'
])
vector_search_validator_bin = 'vector-search-validator/bin/vector-search-validator'
vector_search_validator_deps = set([
'test/vector_search_validator/build-validator',
'test/vector_search_validator/Cargo.toml',
'test/vector_search_validator/crates/validator/Cargo.toml',
'test/vector_search_validator/crates/validator/src/main.rs',
'test/vector_search_validator/crates/validator-scylla/Cargo.toml',
'test/vector_search_validator/crates/validator-scylla/src/lib.rs',
'test/vector_search_validator/crates/validator-scylla/src/cql.rs',
])
vector_store_bin = 'vector-search-validator/bin/vector-store'
vector_store_deps = set([
'test/vector_search_validator/build-env',
'test/vector_search_validator/build-vector-store',
])
vector_search_validator_bins = set([
vector_search_validator_bin,
vector_store_bin,
])
wasms = set([
'wasm/return_input.wat',
'wasm/test_complex_null_values.wat',
@@ -785,7 +763,7 @@ other = set([
'iotune',
])
all_artifacts = apps | cpp_apps | tests | other | wasms | vector_search_validator_bins
all_artifacts = apps | cpp_apps | tests | other | wasms
arg_parser = argparse.ArgumentParser('Configure scylla', add_help=False, formatter_class=argparse.ArgumentDefaultsHelpFormatter)
arg_parser.add_argument('--out', dest='buildfile', action='store', default='build.ninja',
@@ -1196,6 +1174,7 @@ scylla_core = (['message/messaging_service.cc',
'utils/gz/crc_combine.cc',
'utils/gz/crc_combine_table.cc',
'utils/http.cc',
'utils/http_client_error_processing.cc',
'utils/rest/client.cc',
'utils/s3/aws_error.cc',
'utils/s3/client.cc',
@@ -1213,6 +1192,7 @@ scylla_core = (['message/messaging_service.cc',
'utils/azure/identity/default_credentials.cc',
'utils/gcp/gcp_credentials.cc',
'utils/gcp/object_storage.cc',
'utils/gcp/object_storage_retry_strategy.cc',
'gms/version_generator.cc',
'gms/versioned_value.cc',
'gms/gossiper.cc',
@@ -1297,7 +1277,6 @@ scylla_core = (['message/messaging_service.cc',
'auth/passwords.cc',
'auth/password_authenticator.cc',
'auth/permission.cc',
'auth/permissions_cache.cc',
'auth/service.cc',
'auth/standard_role_manager.cc',
'auth/ldap_role_manager.cc',
@@ -1667,6 +1646,7 @@ for t in sorted(perf_tests):
deps['test/boost/combined_tests'] += [
'test/boost/aggregate_fcts_test.cc',
'test/boost/auth_cache_test.cc',
'test/boost/auth_test.cc',
'test/boost/batchlog_manager_test.cc',
'test/boost/cache_algorithm_test.cc',
@@ -2585,11 +2565,10 @@ def write_build_file(f,
description = RUST_LIB $out
''').format(mode=mode, antlr3_exec=args.antlr3_exec, fmt_lib=fmt_lib, test_repeat=args.test_repeat, test_timeout=args.test_timeout, rustc_wrapper=rustc_wrapper, **modeval))
f.write(
'build {mode}-build: phony {artifacts} {wasms} {vector_search_validator_bins}\n'.format(
'build {mode}-build: phony {artifacts} {wasms}\n'.format(
mode=mode,
artifacts=str.join(' ', ['$builddir/' + mode + '/' + x for x in sorted(build_artifacts - wasms - vector_search_validator_bins)]),
artifacts=str.join(' ', ['$builddir/' + mode + '/' + x for x in sorted(build_artifacts - wasms)]),
wasms = str.join(' ', ['$builddir/' + x for x in sorted(build_artifacts & wasms)]),
vector_search_validator_bins=str.join(' ', ['$builddir/' + x for x in sorted(build_artifacts & vector_search_validator_bins)]),
)
)
if profile_recipe := modes[mode].get('profile_recipe'):
@@ -2619,7 +2598,7 @@ def write_build_file(f,
continue
profile_dep = modes[mode].get('profile_target', "")
if binary in other or binary in wasms or binary in vector_search_validator_bins:
if binary in other or binary in wasms:
continue
srcs = deps[binary]
# 'scylla'
@@ -2730,11 +2709,10 @@ def write_build_file(f,
)
f.write(
'build {mode}-test: test.{mode} {test_executables} $builddir/{mode}/scylla {wasms} {vector_search_validator_bins} \n'.format(
'build {mode}-test: test.{mode} {test_executables} $builddir/{mode}/scylla {wasms}\n'.format(
mode=mode,
test_executables=' '.join(['$builddir/{}/{}'.format(mode, binary) for binary in sorted(tests)]),
wasms=' '.join([f'$builddir/{binary}' for binary in sorted(wasms)]),
vector_search_validator_bins=' '.join([f'$builddir/{binary}' for binary in sorted(vector_search_validator_bins)]),
)
)
f.write(
@@ -2902,19 +2880,6 @@ def write_build_file(f,
'build compiler-training: phony {}\n'.format(' '.join(['{mode}-compiler-training'.format(mode=mode) for mode in default_modes]))
)
f.write(textwrap.dedent(f'''\
rule build-vector-search-validator
command = test/vector_search_validator/build-validator $builddir
rule build-vector-store
command = test/vector_search_validator/build-vector-store $builddir
'''))
f.write(
'build $builddir/{vector_search_validator_bin}: build-vector-search-validator {}\n'.format(' '.join([dep for dep in sorted(vector_search_validator_deps)]), vector_search_validator_bin=vector_search_validator_bin)
)
f.write(
'build $builddir/{vector_store_bin}: build-vector-store {}\n'.format(' '.join([dep for dep in sorted(vector_store_deps)]), vector_store_bin=vector_store_bin)
)
f.write(textwrap.dedent(f'''\
build dist-unified-tar: phony {' '.join([f'$builddir/{mode}/dist/tar/{scylla_product}-unified-{scylla_version}-{scylla_release}.{arch}.tar.gz' for mode in default_modes])}
build dist-unified: phony dist-unified-tar

View File

@@ -389,8 +389,10 @@ selectStatement returns [std::unique_ptr<raw::select_statement> expr]
bool is_ann_ordering = false;
}
: K_SELECT (
( K_JSON { statement_subtype = raw::select_statement::parameters::statement_subtype::JSON; } )?
( K_DISTINCT { is_distinct = true; } )?
( (K_JSON K_DISTINCT)=> K_JSON { statement_subtype = raw::select_statement::parameters::statement_subtype::JSON; }
| (K_JSON selectClause K_FROM)=> K_JSON { statement_subtype = raw::select_statement::parameters::statement_subtype::JSON; }
)?
( (K_DISTINCT selectClause K_FROM)=> K_DISTINCT { is_distinct = true; } )?
sclause=selectClause
)
K_FROM (
@@ -425,13 +427,13 @@ selector returns [shared_ptr<raw_selector> s]
unaliasedSelector returns [uexpression tmp]
: ( c=cident { tmp = unresolved_identifier{std::move(c)}; }
| v=value { tmp = std::move(v); }
| K_COUNT '(' countArgument ')' { tmp = make_count_rows_function_expression(); }
| K_WRITETIME '(' c=cident ')' { tmp = column_mutation_attribute{column_mutation_attribute::attribute_kind::writetime,
unresolved_identifier{std::move(c)}}; }
| K_TTL '(' c=cident ')' { tmp = column_mutation_attribute{column_mutation_attribute::attribute_kind::ttl,
unresolved_identifier{std::move(c)}}; }
| f=functionName args=selectionFunctionArgs { tmp = function_call{std::move(f), std::move(args)}; }
| f=similarityFunctionName args=vectorSimilarityArgs { tmp = function_call{std::move(f), std::move(args)}; }
| K_CAST '(' arg=unaliasedSelector K_AS t=native_type ')' { tmp = cast{.style = cast::cast_style::sql, .arg = std::move(arg), .type = std::move(t)}; }
)
( '.' fi=cident { tmp = field_selection{std::move(tmp), std::move(fi)}; }
@@ -446,23 +448,9 @@ selectionFunctionArgs returns [std::vector<expression> a]
')'
;
vectorSimilarityArgs returns [std::vector<expression> a]
: '(' ')'
| '(' v1=vectorSimilarityArg { a.push_back(std::move(v1)); }
( ',' vn=vectorSimilarityArg { a.push_back(std::move(vn)); } )*
')'
;
vectorSimilarityArg returns [uexpression a]
: s=unaliasedSelector { a = std::move(s); }
| v=value { a = std::move(v); }
;
countArgument
: '*'
| i=INTEGER { if (i->getText() != "1") {
add_recognition_error("Only COUNT(1) is supported, got COUNT(" + i->getText() + ")");
} }
/* COUNT(1) is also allowed, it is recognized via the general function(args) path */
;
whereClause returns [uexpression clause]
@@ -1706,10 +1694,6 @@ functionName returns [cql3::functions::function_name s]
: (ks=keyspaceName '.')? f=allowedFunctionName { $s.keyspace = std::move(ks); $s.name = std::move(f); }
;
similarityFunctionName returns [cql3::functions::function_name s]
: f=allowedSimilarityFunctionName { $s = cql3::functions::function_name::native_function(std::move(f)); }
;
allowedFunctionName returns [sstring s]
: f=IDENT { $s = $f.text; std::transform(s.begin(), s.end(), s.begin(), ::tolower); }
| f=QUOTED_NAME { $s = $f.text; }
@@ -1718,11 +1702,6 @@ allowedFunctionName returns [sstring s]
| K_COUNT { $s = "count"; }
;
allowedSimilarityFunctionName returns [sstring s]
: f=(K_SIMILARITY_COSINE | K_SIMILARITY_EUCLIDEAN | K_SIMILARITY_DOT_PRODUCT)
{ $s = $f.text; std::transform(s.begin(), s.end(), s.begin(), ::tolower); }
;
functionArgs returns [std::vector<expression> a]
: '(' ')'
| '(' t1=term { a.push_back(std::move(t1)); }
@@ -2419,10 +2398,6 @@ K_MUTATION_FRAGMENTS: M U T A T I O N '_' F R A G M E N T S;
K_VECTOR_SEARCH_INDEXING: V E C T O R '_' S E A R C H '_' I N D E X I N G;
K_SIMILARITY_EUCLIDEAN: S I M I L A R I T Y '_' E U C L I D E A N;
K_SIMILARITY_COSINE: S I M I L A R I T Y '_' C O S I N E;
K_SIMILARITY_DOT_PRODUCT: S I M I L A R I T Y '_' D O T '_' P R O D U C T;
// Case-insensitive alpha characters
fragment A: ('a'|'A');
fragment B: ('b'|'B');

View File

@@ -10,6 +10,7 @@
#include "expr-utils.hh"
#include "evaluate.hh"
#include "cql3/functions/functions.hh"
#include "cql3/functions/aggregate_fcts.hh"
#include "cql3/functions/castas_fcts.hh"
#include "cql3/functions/scalar_function.hh"
#include "cql3/column_identifier.hh"
@@ -1047,8 +1048,47 @@ prepare_function_args_for_type_inference(std::span<const expression> args, data_
return partially_prepared_args;
}
// Special case for count(1): recognize it as the countRows() function. Note this is quite
// artificial; we might relax it to the more general count(expression) later.
static
std::optional<expression>
try_prepare_count_rows(const expr::function_call& fc, data_dictionary::database db, const sstring& keyspace, const schema* schema_opt, lw_shared_ptr<column_specification> receiver) {
return std::visit(overloaded_functor{
[&] (const functions::function_name& name) -> std::optional<expression> {
auto native_name = name;
if (!native_name.has_keyspace()) {
native_name = name.as_native_function();
}
// Collapse count(1) into countRows()
if (native_name == functions::function_name::native_function("count")) {
if (fc.args.size() == 1) {
if (auto uc_arg = expr::as_if<expr::untyped_constant>(&fc.args[0])) {
if (uc_arg->partial_type == expr::untyped_constant::type_class::integer
&& uc_arg->raw_text == "1") {
return expr::function_call{
.func = functions::aggregate_fcts::make_count_rows_function(),
.args = {},
};
} else {
throw exceptions::invalid_request_exception(format("count() expects a column or the literal 1 as an argument, got {}", fc.args[0]));
}
}
}
}
return std::nullopt;
},
[] (const shared_ptr<functions::function>&) -> std::optional<expression> {
// Already prepared, nothing to do
return std::nullopt;
},
}, fc.func);
}
std::optional<expression>
prepare_function_call(const expr::function_call& fc, data_dictionary::database db, const sstring& keyspace, const schema* schema_opt, lw_shared_ptr<column_specification> receiver) {
if (auto prepared = try_prepare_count_rows(fc, db, keyspace, schema_opt, receiver)) {
return prepared;
}
// Try to extract a column family name from the available information.
// Most functions can be prepared without information about the column family, usually just the keyspace is enough.
// One exception is the token() function - in order to prepare system.token() we have to know the partition key of the table,

View File

@@ -10,9 +10,41 @@
#include "types/types.hh"
#include "types/vector.hh"
#include "exceptions/exceptions.hh"
#include <span>
#include <bit>
namespace cql3 {
namespace functions {
namespace detail {
std::vector<float> extract_float_vector(const bytes_opt& param, size_t dimension) {
if (!param) {
throw exceptions::invalid_request_exception("Cannot extract float vector from null parameter");
}
const size_t expected_size = dimension * sizeof(float);
if (param->size() != expected_size) {
throw exceptions::invalid_request_exception(
fmt::format("Invalid vector size: expected {} bytes for {} floats, got {} bytes",
expected_size, dimension, param->size()));
}
std::vector<float> result;
result.reserve(dimension);
bytes_view view(*param);
for (size_t i = 0; i < dimension; ++i) {
// read_simple handles network byte order (big-endian) conversion
uint32_t raw = read_simple<uint32_t>(view);
result.push_back(std::bit_cast<float>(raw));
}
return result;
}
} // namespace detail
namespace {
// The computations of similarity scores match the exact formulas of Cassandra's (jVector's) implementation to ensure compatibility.
@@ -22,14 +54,14 @@ namespace {
// You should only use this function if you need to preserve the original vectors and cannot normalize
// them in advance.
float compute_cosine_similarity(const std::vector<data_value>& v1, const std::vector<data_value>& v2) {
float compute_cosine_similarity(std::span<const float> v1, std::span<const float> v2) {
double dot_product = 0.0;
double squared_norm_a = 0.0;
double squared_norm_b = 0.0;
for (size_t i = 0; i < v1.size(); ++i) {
double a = value_cast<float>(v1[i]);
double b = value_cast<float>(v2[i]);
double a = v1[i];
double b = v2[i];
dot_product += a * b;
squared_norm_a += a * a;
@@ -37,7 +69,7 @@ float compute_cosine_similarity(const std::vector<data_value>& v1, const std::ve
}
if (squared_norm_a == 0 || squared_norm_b == 0) {
throw exceptions::invalid_request_exception("Function system.similarity_cosine doesn't support all-zero vectors");
return std::numeric_limits<float>::quiet_NaN();
}
// The cosine similarity is in the range [-1, 1].
@@ -46,12 +78,12 @@ float compute_cosine_similarity(const std::vector<data_value>& v1, const std::ve
return (1 + (dot_product / (std::sqrt(squared_norm_a * squared_norm_b)))) / 2;
}
float compute_euclidean_similarity(const std::vector<data_value>& v1, const std::vector<data_value>& v2) {
float compute_euclidean_similarity(std::span<const float> v1, std::span<const float> v2) {
double sum = 0.0;
for (size_t i = 0; i < v1.size(); ++i) {
double a = value_cast<float>(v1[i]);
double b = value_cast<float>(v2[i]);
double a = v1[i];
double b = v2[i];
double diff = a - b;
sum += diff * diff;
@@ -65,12 +97,12 @@ float compute_euclidean_similarity(const std::vector<data_value>& v1, const std:
// Assumes that both vectors are L2-normalized.
// This similarity is intended as an optimized way to perform cosine similarity calculation.
float compute_dot_product_similarity(const std::vector<data_value>& v1, const std::vector<data_value>& v2) {
float compute_dot_product_similarity(std::span<const float> v1, std::span<const float> v2) {
double dot_product = 0.0;
for (size_t i = 0; i < v1.size(); ++i) {
double a = value_cast<float>(v1[i]);
double b = value_cast<float>(v2[i]);
double a = v1[i];
double b = v2[i];
dot_product += a * b;
}
@@ -136,13 +168,15 @@ bytes_opt vector_similarity_fct::execute(std::span<const bytes_opt> parameters)
return std::nullopt;
}
const auto& type = arg_types()[0];
data_value v1 = type->deserialize(*parameters[0]);
data_value v2 = type->deserialize(*parameters[1]);
const auto& v1_elements = value_cast<std::vector<data_value>>(v1);
const auto& v2_elements = value_cast<std::vector<data_value>>(v2);
// Extract dimension from the vector type
const auto& type = static_cast<const vector_type_impl&>(*arg_types()[0]);
size_t dimension = type.get_dimension();
float result = SIMILARITY_FUNCTIONS.at(_name)(v1_elements, v2_elements);
// Optimized path: extract floats directly from bytes, bypassing data_value overhead
std::vector<float> v1 = detail::extract_float_vector(parameters[0], dimension);
std::vector<float> v2 = detail::extract_float_vector(parameters[1], dimension);
float result = SIMILARITY_FUNCTIONS.at(_name)(v1, v2);
return float_type->decompose(result);
}
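The cosine helper above maps the usual cosine similarity from [-1, 1] into [0, 1] via (1 + cos)/2, and now returns NaN for all-zero vectors instead of throwing. A small standalone check of that formula (an independent re-implementation for illustration, not the Scylla code): identical vectors score 1.0 and orthogonal vectors score 0.5.

#include <cmath>
#include <cstddef>
#include <cstdio>
#include <span>
#include <vector>

// Self-contained copy of the scaled-cosine formula used above: (1 + cos)/2 in [0, 1].
static float scaled_cosine(std::span<const float> a, std::span<const float> b) {
    double dot = 0, na = 0, nb = 0;
    for (size_t i = 0; i < a.size(); ++i) {
        dot += double(a[i]) * b[i];
        na  += double(a[i]) * a[i];
        nb  += double(b[i]) * b[i];
    }
    if (na == 0 || nb == 0) {
        return std::nanf("");   // all-zero vector, as in the patched code
    }
    return (1 + dot / std::sqrt(na * nb)) / 2;
}

int main() {
    std::vector<float> x{1.f, 0.f}, y{0.f, 1.f};
    std::printf("%f\n", scaled_cosine(x, x));   // identical vectors  -> 1.0
    std::printf("%f\n", scaled_cosine(x, y));   // orthogonal vectors -> 0.5
}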

View File

@@ -11,6 +11,7 @@
#include "native_scalar_function.hh"
#include "cql3/assignment_testable.hh"
#include "cql3/functions/function_name.hh"
#include <span>
namespace cql3 {
namespace functions {
@@ -19,7 +20,7 @@ static const function_name SIMILARITY_COSINE_FUNCTION_NAME = function_name::nati
static const function_name SIMILARITY_EUCLIDEAN_FUNCTION_NAME = function_name::native_function("similarity_euclidean");
static const function_name SIMILARITY_DOT_PRODUCT_FUNCTION_NAME = function_name::native_function("similarity_dot_product");
using similarity_function_t = float (*)(const std::vector<data_value>&, const std::vector<data_value>&);
using similarity_function_t = float (*)(std::span<const float>, std::span<const float>);
extern thread_local const std::unordered_map<function_name, similarity_function_t> SIMILARITY_FUNCTIONS;
std::vector<data_type> retrieve_vector_arg_types(const function_name& name, const std::vector<shared_ptr<assignment_testable>>& provided_args);
@@ -33,5 +34,14 @@ public:
virtual bytes_opt execute(std::span<const bytes_opt> parameters) override;
};
namespace detail {
// Extract float vector directly from serialized bytes, bypassing data_value overhead.
// This is an internal API exposed for testing purposes.
// Vector<float, N> wire format: N floats as big-endian uint32_t values, 4 bytes each.
std::vector<float> extract_float_vector(const bytes_opt& param, size_t dimension);
} // namespace detail
} // namespace functions
} // namespace cql3
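extract_float_vector() declared above expects the vector<float, N> wire format described in the comment: N IEEE-754 floats, each stored as a big-endian 32-bit word. A hedged, standalone sketch of producing such bytes, for example to feed a unit test (it uses std::vector<uint8_t> rather than Scylla's bytes type):

#include <bit>
#include <cstdint>
#include <vector>

// Serialize floats in the big-endian, 4-bytes-per-element layout that
// extract_float_vector() parses. Illustrative only; real tests would go
// through the vector type's serializer.
std::vector<uint8_t> to_wire(const std::vector<float>& v) {
    std::vector<uint8_t> out;
    out.reserve(v.size() * sizeof(float));
    for (float f : v) {
        uint32_t raw = std::bit_cast<uint32_t>(f);
        // emit the most significant byte first (network byte order)
        out.push_back(uint8_t(raw >> 24));
        out.push_back(uint8_t(raw >> 16));
        out.push_back(uint8_t(raw >> 8));
        out.push_back(uint8_t(raw));
    }
    return out;
}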

View File

@@ -23,6 +23,7 @@
#include "index/vector_index.hh"
#include "schema/schema.hh"
#include "service/client_state.hh"
#include "service/paxos/paxos_state.hh"
#include "types/types.hh"
#include "cql3/query_processor.hh"
#include "cql3/cql_statement.hh"
@@ -329,6 +330,19 @@ future<std::vector<description>> table(const data_dictionary::database& db, cons
"*/",
*table_desc.create_statement);
table_desc.create_statement = std::move(os).to_managed_string();
} else if (service::paxos::paxos_store::try_get_base_table(name)) {
// The Paxos state table is managed internally by Scylla and shouldn't be exposed to the user.
// It may still be described, as a comment, to ease administrative work, but it's hidden from all listings.
fragmented_ostringstream os{};
fmt::format_to(os.to_iter(),
"/* Do NOT execute this statement! It's only for informational purposes.\n"
" A paxos state table is created automatically when enabling LWT on a base table.\n"
"\n{}\n"
"*/",
*table_desc.create_statement);
table_desc.create_statement = std::move(os).to_managed_string();
}
result.push_back(std::move(table_desc));
@@ -364,7 +378,7 @@ future<std::vector<description>> table(const data_dictionary::database& db, cons
future<std::vector<description>> tables(const data_dictionary::database& db, const lw_shared_ptr<keyspace_metadata>& ks, std::optional<bool> with_internals = std::nullopt) {
auto& replica_db = db.real_database();
auto tables = ks->tables() | std::views::filter([&replica_db] (const schema_ptr& s) {
return !cdc::is_log_for_some_table(replica_db, s->ks_name(), s->cf_name());
return !cdc::is_log_for_some_table(replica_db, s->ks_name(), s->cf_name()) && !service::paxos::paxos_store::try_get_base_table(s->cf_name());
}) | std::ranges::to<std::vector<schema_ptr>>();
std::ranges::sort(tables, std::ranges::less(), std::mem_fn(&schema::cf_name));

View File

@@ -259,11 +259,9 @@ uint32_t select_statement::get_bound_terms() const {
future<> select_statement::check_access(query_processor& qp, const service::client_state& state) const {
try {
const data_dictionary::database db = qp.db();
auto&& s = db.find_schema(keyspace(), column_family());
auto cdc = db.get_cdc_base_table(*s);
auto& cf_name = s->is_view()
? s->view_info()->base_name()
auto cdc = qp.db().get_cdc_base_table(*_schema);
auto& cf_name = _schema->is_view()
? _schema->view_info()->base_name()
: (cdc ? cdc->cf_name() : column_family());
const schema_ptr& base_schema = cdc ? cdc : _schema;
bool is_vector_indexed = secondary_index::vector_index::has_vector_index(*base_schema);

View File

@@ -1986,13 +1986,13 @@ future<> db::commitlog::segment_manager::replenish_reserve() {
}
continue;
} catch (shutdown_marker&) {
_reserve_segments.abort(std::current_exception());
break;
} catch (...) {
clogger.warn("Exception in segment reservation: {}", std::current_exception());
}
co_await sleep(100ms);
}
_reserve_segments.abort(std::make_exception_ptr(shutdown_marker()));
}
future<std::vector<db::commitlog::descriptor>>

View File

@@ -1201,13 +1201,13 @@ db::config::config(std::shared_ptr<db::extensions> exts)
"* org.apache.cassandra.auth.CassandraRoleManager: Stores role data in the system_auth keyspace;\n"
"* com.scylladb.auth.LDAPRoleManager: Fetches role data from an LDAP server.")
, permissions_validity_in_ms(this, "permissions_validity_in_ms", liveness::LiveUpdate, value_status::Used, 10000,
"How long permissions in cache remain valid. Depending on the authorizer, such as CassandraAuthorizer, fetching permissions can be resource intensive. Permissions caching is disabled when this property is set to 0 or when AllowAllAuthorizer is used. The cached value is considered valid as long as both its value is not older than the permissions_validity_in_ms "
"How long authorized statements cache entries remain valid. The cached value is considered valid as long as both its value is not older than the permissions_validity_in_ms "
"and the cached value has been read at least once during the permissions_validity_in_ms time frame. If any of these two conditions doesn't hold the cached value is going to be evicted from the cache.\n"
"\n"
"Related information: Object permissions")
, permissions_update_interval_in_ms(this, "permissions_update_interval_in_ms", liveness::LiveUpdate, value_status::Used, 2000,
"Refresh interval for permissions cache (if enabled). After this interval, cache entries become eligible for refresh. An async reload is scheduled every permissions_update_interval_in_ms time period and the old value is returned until it completes. If permissions_validity_in_ms has a non-zero value, then this property must also have a non-zero value. It's recommended to set this value to be at least 3 times smaller than the permissions_validity_in_ms.")
, permissions_cache_max_entries(this, "permissions_cache_max_entries", liveness::LiveUpdate, value_status::Used, 1000,
"Refresh interval for authorized statements cache. After this interval, cache entries become eligible for refresh. An async reload is scheduled every permissions_update_interval_in_ms time period and the old value is returned until it completes. If permissions_validity_in_ms has a non-zero value, then this property must also have a non-zero value. It's recommended to set this value to be at least 3 times smaller than the permissions_validity_in_ms. This option additionally controls the permissions refresh interval for LDAP.")
, permissions_cache_max_entries(this, "permissions_cache_max_entries", liveness::LiveUpdate, value_status::Unused, 1000,
"Maximum cached permission entries. Must have a non-zero value if permissions caching is enabled (see a permissions_validity_in_ms description).")
, server_encryption_options(this, "server_encryption_options", value_status::Used, {/*none*/},
"Enable or disable inter-node encryption. You must also generate keys and provide the appropriate key and trust store locations and passwords. The available options are:\n"
@@ -1272,7 +1272,7 @@ db::config::config(std::shared_ptr<db::extensions> exts)
, ignore_dead_nodes_for_replace(this, "ignore_dead_nodes_for_replace", value_status::Used, "", "List dead nodes to ignore for replace operation using a comma-separated list of host IDs. E.g., scylla --ignore-dead-nodes-for-replace 8d5ed9f4-7764-4dbd-bad8-43fddce94b7c,125ed9f4-7777-1dbn-mac8-43fddce9123e")
, override_decommission(this, "override_decommission", value_status::Deprecated, false, "Set true to force a decommissioned node to join the cluster (cannot be set if consistent-cluster-management is enabled).")
, enable_repair_based_node_ops(this, "enable_repair_based_node_ops", liveness::LiveUpdate, value_status::Used, true, "Set true to enable repair based node operations instead of streaming based.")
, allowed_repair_based_node_ops(this, "allowed_repair_based_node_ops", liveness::LiveUpdate, value_status::Used, "replace,removenode,rebuild,bootstrap,decommission", "A comma separated list of node operations which are allowed to enable repair based node operations. The operations can be bootstrap, replace, removenode, decommission and rebuild.")
, allowed_repair_based_node_ops(this, "allowed_repair_based_node_ops", liveness::LiveUpdate, value_status::Used, "replace,removenode,rebuild", "A comma separated list of node operations which are allowed to enable repair based node operations. The operations can be bootstrap, replace, removenode, decommission and rebuild.")
, enable_compacting_data_for_streaming_and_repair(this, "enable_compacting_data_for_streaming_and_repair", liveness::LiveUpdate, value_status::Used, true, "Enable the compacting reader, which compacts the data for streaming and repair (load'n'stream included) before sending it to, or synchronizing it with peers. Can reduce the amount of data to be processed by removing dead data, but adds CPU overhead.")
, enable_tombstone_gc_for_streaming_and_repair(this, "enable_tombstone_gc_for_streaming_and_repair", liveness::LiveUpdate, value_status::Used, false,
"If the compacting reader is enabled for streaming and repair (see enable_compacting_data_for_streaming_and_repair), allow it to garbage-collect tombstones."
@@ -1498,7 +1498,7 @@ db::config::config(std::shared_ptr<db::extensions> exts)
, index_cache_fraction(this, "index_cache_fraction", liveness::LiveUpdate, value_status::Used, 0.2,
"The maximum fraction of cache memory permitted for use by index cache. Clamped to the [0.0; 1.0] range. Must be small enough to not deprive the row cache of memory, but should be big enough to fit a large fraction of the index. The default value 0.2 means that at least 80\% of cache memory is reserved for the row cache, while at most 20\% is usable by the index cache.")
, consistent_cluster_management(this, "consistent_cluster_management", value_status::Deprecated, true, "Use RAFT for cluster management and DDL.")
, force_gossip_topology_changes(this, "force_gossip_topology_changes", value_status::Used, false, "Force gossip-based topology operations in a fresh cluster. Only the first node in the cluster must use it. The rest will fall back to gossip-based operations anyway. This option should be used only for testing. Note: gossip topology changes are incompatible with tablets.")
, force_gossip_topology_changes(this, "force_gossip_topology_changes", value_status::Deprecated, false, "Force gossip-based topology operations in a fresh cluster. Only the first node in the cluster must use it. The rest will fall back to gossip-based operations anyway. This option should be used only for testing. Note: gossip topology changes are incompatible with tablets.")
, recovery_leader(this, "recovery_leader", liveness::LiveUpdate, value_status::Used, utils::null_uuid(), "Host ID of the node restarted first while performing the Manual Raft-based Recovery Procedure. Warning: this option disables some guardrails for the needs of the Manual Raft-based Recovery Procedure. Make sure you unset it at the end of the procedure.")
, wasm_cache_memory_fraction(this, "wasm_cache_memory_fraction", value_status::Used, 0.01, "Maximum total size of all WASM instances stored in the cache as fraction of total shard memory.")
, wasm_cache_timeout_in_ms(this, "wasm_cache_timeout_in_ms", value_status::Used, 5000, "Time after which an instance is evicted from the cache.")
@@ -1527,17 +1527,21 @@ db::config::config(std::shared_ptr<db::extensions> exts)
"Allows target tablet size to be configured. Defaults to 5G (in bytes). Maintaining tablets at reasonable sizes is important to be able to " \
"redistribute load. A higher value means tablet migration throughput can be reduced. A lower value may cause number of tablets to increase significantly, " \
"potentially resulting in performance drawbacks.")
, tablet_streaming_read_concurrency_per_shard(this, "tablet_streaming_read_concurrency_per_shard", liveness::LiveUpdate, value_status::Used, 2,
"Maximum number of tablets which may be leaving a shard at the same time. Effecting only on topology coordinator. Set to the same value on all nodes.")
, tablet_streaming_write_concurrency_per_shard(this, "tablet_streaming_write_concurrency_per_shard", liveness::LiveUpdate, value_status::Used, 2,
"Maximum number of tablets which may be pending on a shard at the same time. Effecting only on topology coordinator. Set to the same value on all nodes.")
, replication_strategy_warn_list(this, "replication_strategy_warn_list", liveness::LiveUpdate, value_status::Used, {locator::replication_strategy_type::simple}, "Controls which replication strategies to warn about when creating/altering a keyspace. Doesn't affect the pre-existing keyspaces.")
, replication_strategy_fail_list(this, "replication_strategy_fail_list", liveness::LiveUpdate, value_status::Used, {}, "Controls which replication strategies are disallowed to be used when creating/altering a keyspace. Doesn't affect the pre-existing keyspaces.")
, service_levels_interval(this, "service_levels_interval_ms", liveness::LiveUpdate, value_status::Used, 10000, "Controls how often service levels module polls configuration table")
, audit(this, "audit", value_status::Used, "none",
, audit(this, "audit", value_status::Used, "table",
"Controls the audit feature:\n"
"\n"
"\tnone : No auditing enabled.\n"
"\tsyslog : Audit messages sent to Syslog.\n"
"\ttable : Audit messages written to column family named audit.audit_log.\n")
, audit_categories(this, "audit_categories", liveness::LiveUpdate, value_status::Used, "DCL,DDL,AUTH", "Comma separated list of operation categories that should be audited.")
, audit_categories(this, "audit_categories", liveness::LiveUpdate, value_status::Used, "DCL,AUTH,ADMIN", "Comma separated list of operation categories that should be audited.")
, audit_tables(this, "audit_tables", liveness::LiveUpdate, value_status::Used, "", "Comma separated list of table names (<keyspace>.<table>) that will be audited.")
, audit_keyspaces(this, "audit_keyspaces", liveness::LiveUpdate, value_status::Used, "", "Comma separated list of keyspaces that will be audited. All tables in those keyspaces will be audited")
, audit_unix_socket_path(this, "audit_unix_socket_path", value_status::Used, "/dev/log", "The path to the unix socket used for writing to syslog. Only applicable when audit is set to syslog.")

View File

@@ -542,6 +542,8 @@ public:
named_value<double> tablets_initial_scale_factor;
named_value<unsigned> tablets_per_shard_goal;
named_value<uint64_t> target_tablet_size_in_bytes;
named_value<unsigned> tablet_streaming_read_concurrency_per_shard;
named_value<unsigned> tablet_streaming_write_concurrency_per_shard;
named_value<std::vector<enum_option<replication_strategy_restriction_t>>> replication_strategy_warn_list;
named_value<std::vector<enum_option<replication_strategy_restriction_t>>> replication_strategy_fail_list;

View File

@@ -1714,7 +1714,9 @@ std::unordered_set<dht::token> decode_tokens(const set_type_impl::native_type& t
std::unordered_set<dht::token> tset;
for (auto& t: tokens) {
auto str = value_cast<sstring>(t);
SCYLLA_ASSERT(str == dht::token::from_sstring(str).to_sstring());
if (str != dht::token::from_sstring(str).to_sstring()) {
on_internal_error(slogger, format("decode_tokens: invalid token string '{}'", str));
}
tset.insert(dht::token::from_sstring(str));
}
return tset;
@@ -3191,7 +3193,7 @@ future<service::topology> system_keyspace::load_topology_state(const std::unorde
};
}
} else if (must_have_tokens(nstate)) {
on_fatal_internal_error(slogger, format(
on_internal_error(slogger, format(
"load_topology_state: node {} in {} state but missing ring slice", host_id, nstate));
}
}
@@ -3273,7 +3275,7 @@ future<service::topology> system_keyspace::load_topology_state(const std::unorde
// Currently, at most one node at a time can be in transitioning state.
if (!map->empty()) {
const auto& [other_id, other_rs] = *map->begin();
on_fatal_internal_error(slogger, format(
on_internal_error(slogger, format(
"load_topology_state: found two nodes in transitioning state: {} in {} state and {} in {} state",
other_id, other_rs.state, host_id, nstate));
}
@@ -3331,8 +3333,7 @@ future<service::topology> system_keyspace::load_topology_state(const std::unorde
format("SELECT count(range_end) as cnt FROM {}.{} WHERE key = '{}' AND id = ?",
NAME, CDC_GENERATIONS_V3, cdc::CDC_GENERATIONS_V3_KEY),
gen_id.id);
SCYLLA_ASSERT(gen_rows);
if (gen_rows->empty()) {
if (!gen_rows || gen_rows->empty()) {
on_internal_error(slogger, format(
"load_topology_state: last committed CDC generation time UUID ({}) present, but data missing", gen_id.id));
}

View File

@@ -215,6 +215,8 @@ public:
static constexpr auto BUILT_VIEWS = "built_views";
static constexpr auto SCYLLA_VIEWS_BUILDS_IN_PROGRESS = "scylla_views_builds_in_progress";
static constexpr auto CDC_LOCAL = "cdc_local";
static constexpr auto CDC_TIMESTAMPS = "cdc_timestamps";
static constexpr auto CDC_STREAMS = "cdc_streams";
// auth
static constexpr auto ROLES = "roles";

View File

@@ -2308,6 +2308,7 @@ future<> view_builder::drain() {
vlogger.info("Draining view builder");
_as.request_abort();
co_await _mnotifier.unregister_listener(this);
co_await _ops_gate.close();
co_await _vug.drain();
co_await _sem.wait();
_sem.broken();
@@ -2742,30 +2743,48 @@ void view_builder::on_create_view(const sstring& ks_name, const sstring& view_na
}
// Do it in the background, serialized and broadcast from shard 0.
static_cast<void>(dispatch_create_view(ks_name, view_name).handle_exception([ks_name, view_name] (std::exception_ptr ep) {
static_cast<void>(with_gate(_ops_gate, [this, ks_name = ks_name, view_name = view_name] () mutable {
return dispatch_create_view(std::move(ks_name), std::move(view_name));
}).handle_exception([ks_name, view_name] (std::exception_ptr ep) {
vlogger.warn("Failed to dispatch view creation {}.{}: {}", ks_name, view_name, ep);
}));
}
void view_builder::on_update_view(const sstring& ks_name, const sstring& view_name, bool) {
future<> view_builder::dispatch_update_view(sstring ks_name, sstring view_name) {
if (should_ignore_tablet_keyspace(_db, ks_name)) {
return;
co_return;
}
[[maybe_unused]] auto sem_units = co_await get_or_adopt_view_builder_lock(std::nullopt);
auto view = view_ptr(_db.find_schema(ks_name, view_name));
auto step_it = _base_to_build_step.find(view->view_info()->base_id());
if (step_it == _base_to_build_step.end()) {
co_return; // In case all the views for this CF have finished building already.
}
auto status_it = std::ranges::find_if(step_it->second.build_status, [view] (const view_build_status& bs) {
return bs.view->id() == view->id();
});
if (status_it != step_it->second.build_status.end()) {
status_it->view = std::move(view);
}
}
void view_builder::on_update_view(const sstring& ks_name, const sstring& view_name, bool) {
// Do it in the background, serialized.
(void)with_semaphore(_sem, view_builder_semaphore_units, [ks_name, view_name, this] {
auto view = view_ptr(_db.find_schema(ks_name, view_name));
auto step_it = _base_to_build_step.find(view->view_info()->base_id());
if (step_it == _base_to_build_step.end()) {
return;// In case all the views for this CF have finished building already.
static_cast<void>(with_gate(_ops_gate, [this, ks_name = ks_name, view_name = view_name] () mutable {
return dispatch_update_view(std::move(ks_name), std::move(view_name));
}).handle_exception([ks_name, view_name] (std::exception_ptr ep) {
try {
std::rethrow_exception(ep);
} catch (const seastar::gate_closed_exception&) {
vlogger.warn("Ignoring gate_closed_exception during view update {}.{}", ks_name, view_name);
} catch (const seastar::broken_named_semaphore&) {
vlogger.warn("Ignoring broken_named_semaphore during view update {}.{}", ks_name, view_name);
} catch (const replica::no_such_column_family&) {
vlogger.warn("Ignoring no_such_column_family during view update {}.{}", ks_name, view_name);
}
auto status_it = std::ranges::find_if(step_it->second.build_status, [view] (const view_build_status& bs) {
return bs.view->id() == view->id();
});
if (status_it != step_it->second.build_status.end()) {
status_it->view = std::move(view);
}
}).handle_exception_type([] (replica::no_such_column_family&) { });
}));
}
future<> view_builder::dispatch_drop_view(sstring ks_name, sstring view_name) {
@@ -2827,7 +2846,9 @@ void view_builder::on_drop_view(const sstring& ks_name, const sstring& view_name
}
// Do it in the background, serialized and broadcast from shard 0.
static_cast<void>(dispatch_drop_view(ks_name, view_name).handle_exception([ks_name, view_name] (std::exception_ptr ep) {
static_cast<void>(with_gate(_ops_gate, [this, ks_name = ks_name, view_name = view_name] () mutable {
return dispatch_drop_view(std::move(ks_name), std::move(view_name));
}).handle_exception([ks_name, view_name] (std::exception_ptr ep) {
vlogger.warn("Failed to dispatch view drop {}.{}: {}", ks_name, view_name, ep);
}));
}

View File

@@ -16,6 +16,7 @@
#include <seastar/core/abort_source.hh>
#include <seastar/core/future.hh>
#include <seastar/core/gate.hh>
#include <seastar/core/semaphore.hh>
#include <seastar/core/condition-variable.hh>
#include <seastar/core/sharded.hh>
@@ -190,6 +191,7 @@ class view_builder final : public service::migration_listener::only_view_notific
// Guard the whole startup routine with a semaphore so that it's not intercepted by
// `on_drop_view`, `on_create_view`, or `on_update_view` events.
seastar::named_semaphore _sem{view_builder_semaphore_units, named_semaphore_exception_factory{"view builder"}};
seastar::gate _ops_gate;
seastar::abort_source _as;
future<> _step_fiber = make_ready_future<>();
// Used to coordinate between shards the conclusion of the build process for a particular view.
@@ -284,6 +286,7 @@ private:
future<> mark_as_built(view_ptr);
void setup_metrics();
future<> dispatch_create_view(sstring ks_name, sstring view_name);
future<> dispatch_update_view(sstring ks_name, sstring view_name);
future<> dispatch_drop_view(sstring ks_name, sstring view_name);
future<> handle_seed_view_build_progress(const sstring& ks_name, const sstring& view_name);
future<> handle_create_view_local(const sstring& ks_name, const sstring& view_name, view_builder_units_opt units);

View File

@@ -588,11 +588,7 @@ future<> view_building_worker::do_build_range(table_id base_id, std::vector<tabl
utils::get_local_injector().inject("do_build_range_fail",
[] { throw std::runtime_error("do_build_range failed due to error injection"); });
// Run the view building in the streaming scheduling group
// so that it doesn't impact other tasks with higher priority.
seastar::thread_attributes attr;
attr.sched_group = _db.get_streaming_scheduling_group();
return seastar::async(std::move(attr), [this, base_id, views_ids = std::move(views_ids), last_token, &as] {
return seastar::async([this, base_id, views_ids = std::move(views_ids), last_token, &as] {
gc_clock::time_point now = gc_clock::now();
auto base_cf = _db.find_column_family(base_id).shared_from_this();
reader_permit permit = _db.get_reader_concurrency_semaphore().make_tracking_only_permit(nullptr, "build_views_range", db::no_timeout, {});

View File

@@ -67,6 +67,7 @@ public:
return schema_builder(system_keyspace::NAME, "cluster_status", std::make_optional(id))
.with_column("peer", inet_addr_type, column_kind::partition_key)
.with_column("dc", utf8_type)
.with_column("rack", utf8_type)
.with_column("up", boolean_type)
.with_column("draining", boolean_type)
.with_column("excluded", boolean_type)
@@ -111,7 +112,9 @@ public:
// Not all entries in gossiper are present in the topology
auto& node = tm.get_topology().get_node(hostid);
sstring dc = node.dc_rack().dc;
sstring rack = node.dc_rack().rack;
set_cell(cr, "dc", dc);
set_cell(cr, "rack", rack);
set_cell(cr, "draining", node.is_draining());
set_cell(cr, "excluded", node.is_excluded());
}
@@ -1345,8 +1348,8 @@ public:
private:
static schema_ptr build_schema() {
auto id = generate_legacy_id(system_keyspace::NAME, "cdc_timestamps");
return schema_builder(system_keyspace::NAME, "cdc_timestamps", std::make_optional(id))
auto id = generate_legacy_id(system_keyspace::NAME, system_keyspace::CDC_TIMESTAMPS);
return schema_builder(system_keyspace::NAME, system_keyspace::CDC_TIMESTAMPS, std::make_optional(id))
.with_column("keyspace_name", utf8_type, column_kind::partition_key)
.with_column("table_name", utf8_type, column_kind::partition_key)
.with_column("timestamp", reversed_type_impl::get_instance(timestamp_type), column_kind::clustering_key)
@@ -1428,8 +1431,8 @@ public:
}
private:
static schema_ptr build_schema() {
auto id = generate_legacy_id(system_keyspace::NAME, "cdc_streams");
return schema_builder(system_keyspace::NAME, "cdc_streams", std::make_optional(id))
auto id = generate_legacy_id(system_keyspace::NAME, system_keyspace::CDC_STREAMS);
return schema_builder(system_keyspace::NAME, system_keyspace::CDC_STREAMS, std::make_optional(id))
.with_column("keyspace_name", utf8_type, column_kind::partition_key)
.with_column("table_name", utf8_type, column_kind::partition_key)
.with_column("timestamp", timestamp_type, column_kind::clustering_key)

View File

@@ -12,5 +12,6 @@ namespace debug {
seastar::sharded<replica::database>* volatile the_database = nullptr;
seastar::scheduling_group streaming_scheduling_group;
seastar::scheduling_group gossip_scheduling_group;
}

View File

@@ -18,6 +18,7 @@ namespace debug {
extern seastar::sharded<replica::database>* volatile the_database;
extern seastar::scheduling_group streaming_scheduling_group;
extern seastar::scheduling_group gossip_scheduling_group;
}

View File

@@ -12,7 +12,7 @@ Do the following in the top-level Scylla source directory:
2. Run `ninja dist-dev` (with the same mode name as above) to prepare
the distribution artifacts.
3. Run `./dist/docker/debian/build_docker.sh --mode dev`
3. Run `./dist/docker/redhat/build_docker.sh --mode dev`
This creates a docker image as a **file**, in the OCI format, and prints
its name, looking something like:

View File

@@ -70,7 +70,7 @@ bcp() { buildah copy "$container" "$@"; }
run() { buildah run "$container" "$@"; }
bconfig() { buildah config "$@" "$container"; }
container="$(buildah from docker.io/redhat/ubi9-minimal:latest)"
container="$(buildah from --pull=always docker.io/redhat/ubi9-minimal:latest)"
packages=(
"build/dist/$config/redhat/RPMS/$arch/$product-$version-$release.$arch.rpm"

View File

@@ -1,6 +1,10 @@
### a dictionary of redirections
#old path: new path
# Move the OS Support page
/stable/getting-started/os-support.html: https://docs.scylladb.com/stable/versioning/os-support-per-version.html
# Remove an outdated KB
/stable/kb/perftune-modes-sync.html: /stable/kb/index.html

View File

@@ -142,10 +142,6 @@ want modify a non-top-level attribute directly (e.g., a.b[3].c) need RMW:
Alternator implements such requests by reading the entire top-level
attribute a, modifying only a.b[3].c, and then writing back a.
Currently, Alternator doesn't use Tablets. That's because Alternator relies
on LWT (lightweight transactions), and LWT is not supported in keyspaces
with Tablets enabled.
```{eval-rst}
.. toctree::
:maxdepth: 2

View File

@@ -213,3 +213,71 @@ Alternator table, the following features will not work for this table:
* Enabling Streams with CreateTable or UpdateTable doesn't work
(results in an error).
See <https://github.com/scylladb/scylla/issues/23838>.
## Custom write timestamps
DynamoDB doesn't allow clients to set the write timestamp of updates. All
updates use the current server time as their timestamp, and ScyllaDB uses
these timestamps for last-write-wins conflict resolution when concurrent
writes reach different replicas.
ScyllaDB Alternator extends this with the `system:timestamp_attribute` tag,
which allows specifying a custom write timestamp for each PutItem,
UpdateItem, DeleteItem, or BatchWriteItem request. To use this feature:
1. Tag the table (at CreateTable time or using TagResource) with
`system:timestamp_attribute` set to the name of an attribute that will
hold the custom write timestamp.
2. When performing a PutItem or UpdateItem, include the named attribute
in the request with a numeric value. The value represents the write
timestamp in **microseconds since the Unix epoch** (this is the same
unit used internally by ScyllaDB for timestamps).
For a DeleteItem or a BatchWriteItem DeleteRequest, include the named
attribute in the `Key` parameter (it will be stripped from the key
before use).
3. The named attribute is **not stored** in the item data - it only
controls the write timestamp. If you also want to record the timestamp
as data, use a separate attribute for that purpose.
4. If the named attribute is absent, the write proceeds normally using the
current server time as the timestamp. If the named attribute is present
but has a non-numeric value, the write is rejected with a ValidationException.
### Limitations
- **Incompatible with conditions**: If the write includes a ConditionExpression
(or uses the `Expected` legacy condition), LWT is needed and the operation
is rejected with a ValidationException, because LWT requires the write
timestamp to be set by the Paxos protocol, not by the client.
- **Incompatible with `always` write isolation**: Tables using the `always`
(or `always_use_lwt`) write isolation policy cannot use the timestamp
attribute feature at all, because every write uses LWT in that mode.
When using `system:timestamp_attribute`, consider tagging the table with
`system:write_isolation=only_rmw_uses_lwt` (or `forbid_rmw`) so that
unconditional writes do not use LWT.
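For example, a minimal boto3 sketch (not part of the change above) of applying the recommended isolation tag to an existing table; `dynamodb` is assumed to be a boto3 DynamoDB service resource pointed at Alternator, and `table_arn` is a placeholder for the table's ARN:

```python
# Tag an existing table so that unconditional writes avoid LWT.
dynamodb.meta.client.tag_resource(
    ResourceArn=table_arn,
    Tags=[{'Key': 'system:write_isolation', 'Value': 'only_rmw_uses_lwt'}],
)
```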
### Example use case
This feature is useful for ingesting data from multiple sources where each
record has a known logical timestamp. By setting the `system:timestamp_attribute`
tag, you can ensure that the record with the highest logical timestamp always
wins, regardless of ingestion order:
```python
# Create table with timestamp attribute
dynamodb.create_table(
TableName='my_table',
...
Tags=[{'Key': 'system:timestamp_attribute', 'Value': 'write_ts'}]
)
# Write a record with a specific timestamp (in microseconds since epoch)
table.put_item(Item={
'pk': 'my_key',
'data': 'new_value',
'write_ts': Decimal('1700000000000000'), # Nov 14, 2023 in microseconds
})
```
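Item 2 above also covers deletions: for DeleteItem (or a BatchWriteItem DeleteRequest), the timestamp attribute travels inside `Key` and is stripped before the key is used. A minimal sketch, reusing the hypothetical `write_ts` attribute from the example above:

```python
import time
from decimal import Decimal

# Delete with an explicit write timestamp. 'write_ts' is not part of the real key;
# it only carries the timestamp and is stripped by Alternator before the key is used.
table.delete_item(Key={
    'pk': 'my_key',
    'write_ts': Decimal(int(time.time() * 1_000_000)),  # microseconds since the Unix epoch
})
```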

View File

@@ -187,6 +187,23 @@ You can create a keyspace with tablets enabled with the ``tablets = {'enabled':
the keyspace schema with ``tablets = { 'enabled': false }`` or
``tablets = { 'enabled': true }``.
.. _keyspace-rf-rack-valid-to-enforce-rack-list:
Enforcing Rack-List Replication for Tablet Keyspaces
------------------------------------------------------------------
The ``rf_rack_valid_keyspaces`` option is a legacy setting that ensures that all keyspaces with tablets enabled are
:term:`RF-rack-valid <RF-rack-valid keyspace>`.
Requiring every tablet keyspace to use the rack-list replication factor exclusively is enough to guarantee that the keyspace is
:term:`RF-rack-valid <RF-rack-valid keyspace>`. It reduces restrictions and provides stronger guarantees compared
to the ``rf_rack_valid_keyspaces`` option.
To enforce rack lists in tablet keyspaces, use the ``enforce_rack_list`` option. It can be set only if all tablet keyspaces use
rack lists. To ensure that, follow the :ref:`conversion to rack-list replication factor <conversion-to-rack-list-rf>` procedure.
After that, restart all nodes in the cluster with ``enforce_rack_list`` enabled and ``rf_rack_valid_keyspaces`` disabled. Make
sure to avoid setting or updating the replication factor (with CREATE KEYSPACE or ALTER KEYSPACE) while nodes are being restarted.
.. _tablets-limitations:
Limitations and Unsupported Features

View File

@@ -200,8 +200,6 @@ for two cases. One is setting replication factor to 0, in which case the number
The other is when the numeric replication factor is equal to the current number of replicas
for a given datacenter, in which case the current rack list is preserved.
Altering from a numeric replication factor to a rack list is not supported yet.
Note that when ``ALTER`` ing keyspaces and supplying ``replication_factor``,
auto-expansion will only *add* new datacenters for safety, it will not alter
existing datacenters or remove any even if they are no longer in the cluster.
@@ -424,6 +422,21 @@ Altering from a rack list to a numeric replication factor is not supported.
Keyspaces which use rack lists are :term:`RF-rack-valid <RF-rack-valid keyspace>` if each rack in the rack list contains at least one node (excluding :doc:`zero-token nodes </architecture/zero-token-nodes>`).
.. _conversion-to-rack-list-rf:
Conversion to rack-list replication factor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To migrate a keyspace from a numeric replication factor to a rack-list replication factor, provide the rack-list replication factor explicitly in an ``ALTER KEYSPACE`` statement. The number of racks in the list must be equal to the numeric replication factor. The replication factor can be converted in any number of DCs at once. In a statement that converts the replication factor, no replication factor updates (increase or decrease) are allowed in any DC.
.. code-block:: cql
CREATE KEYSPACE Excelsior
WITH replication = { 'class' : 'NetworkTopologyStrategy', 'dc1' : 3, 'dc2' : 1} AND tablets = { 'enabled': true };
ALTER KEYSPACE Excelsior
WITH replication = { 'class' : 'NetworkTopologyStrategy', 'dc1' : ['RAC1', 'RAC2', 'RAC3'], 'dc2' : ['RAC4']} AND tablets = { 'enabled': true };
.. _drop-keyspace-statement:
DROP KEYSPACE

View File

@@ -25,6 +25,8 @@ Querying data from data is done using a ``SELECT`` statement:
: | CAST '(' `selector` AS `cql_type` ')'
: | `function_name` '(' [ `selector` ( ',' `selector` )* ] ')'
: | COUNT '(' '*' ')'
: | literal
: | bind_marker
: )
: ( '.' `field_name` | '[' `term` ']' )*
where_clause: `relation` ( AND `relation` )*
@@ -35,6 +37,8 @@ Querying data from data is done using a ``SELECT`` statement:
operator: '=' | '<' | '>' | '<=' | '>=' | IN | NOT IN | CONTAINS | CONTAINS KEY
ordering_clause: `column_name` [ ASC | DESC ] ( ',' `column_name` [ ASC | DESC ] )*
timeout: `duration`
literal: number | 'string' | boolean | NULL | tuple_literal | list_literal | map_literal
bind_marker: '?' | ':' `identifier`
For instance::
@@ -81,6 +85,13 @@ A :token:`selector` can be one of the following:
- A casting, which allows you to convert a nested selector to a (compatible) type.
- A function call, where the arguments are selectors themselves.
- A call to the :ref:`COUNT function <count-function>`, which counts all non-null results.
- A literal value (constant).
- A bind variable (`?` or `:name`).
Note that due to a quirk of the type system, literals and bind markers cannot be
used as top-level selectors, as the parser cannot infer their type. However, they can be used
when nested inside functions, as the function formal parameter types provide the
necessary context.
Aliases
```````
@@ -281,7 +292,8 @@ For example::
ORDER BY embedding ANN OF [0.1, 0.2, 0.3, 0.4] LIMIT 5;
Vector queries also support filtering with ``WHERE`` clauses on columns that are part of the primary key.
Vector queries also support filtering with ``WHERE`` clauses on columns that are part of the primary key
or on columns provided in the definition of the index.
For example::

View File

@@ -140,17 +140,83 @@ Vector Index :label-note:`ScyllaDB Cloud`
`ScyllaDB Cloud documentation <https://cloud.docs.scylladb.com/stable/vector-search/>`_.
ScyllaDB supports creating vector indexes on tables, allowing queries on the table to use those indexes for efficient
similarity search on vector data.
similarity search on vector data. A vector index can be global, indexing vectors per table, or local,
indexing vectors per partition.
The vector index is the only custom type index supported in ScyllaDB. It is created using
the ``CUSTOM`` keyword and specifying the index type as ``vector_index``. Example:
the ``CUSTOM`` keyword and specifying the index type as ``vector_index``. It is also possible to
add additional columns to the index for filtering the search results. The first column specified
in a global vector index definition must be the vector column, and any subsequent columns are
treated as filtering columns. A local vector index requires that the partition key of the base
table is also the partition key of the index and that the vector column is the first of the
columns that follow it.
Example of a simple index:
.. code-block:: cql
CREATE CUSTOM INDEX vectorIndex ON ImageEmbeddings (embedding)
CREATE CUSTOM INDEX vectorIndex ON ImageEmbeddings (embedding)
USING 'vector_index'
WITH OPTIONS = {'similarity_function': 'COSINE', 'maximum_node_connections': '16'};
The vector column (``embedding``) is indexed to enable similarity search using
a global vector index. Additional filtering can be performed on the primary key
columns of the base table.
Example of a global vector index with additional filtering:
.. code-block:: cql
CREATE CUSTOM INDEX vectorIndex ON ImageEmbeddings (embedding, category, info)
USING 'vector_index'
WITH OPTIONS = {'similarity_function': 'COSINE', 'maximum_node_connections': '16'};
The vector column (``embedding``) is indexed to enable similarity search using
a global index. Additional columns are added for filtering the search results.
The filtering is possible on ``category``, ``info`` and all primary key columns
of the base table.
Example of a local vector index:
.. code-block:: cql
CREATE CUSTOM INDEX vectorIndex ON ImageEmbeddings ((id, created_at), embedding, category, info)
USING 'vector_index'
WITH OPTIONS = {'similarity_function': 'COSINE', 'maximum_node_connections': '16'};
The vector column (``embedding``) is indexed for similarity search (a local
index) and additional columns are added for filtering the search results. The
filtering is possible on ``category``, ``info`` and all primary key columns of
the base table. The columns ``id`` and ``created_at`` must be the partition key
of the base table.
Vector indexes support additional filtering columns of native data types
(excluding counter and duration). The indexed column itself must be a vector
column, while the extra columns can be used to filter search results.
The supported types are:
* ``ascii``
* ``bigint``
* ``blob``
* ``boolean``
* ``date``
* ``decimal``
* ``double``
* ``float``
* ``inet``
* ``int``
* ``smallint``
* ``text``
* ``varchar``
* ``time``
* ``timestamp``
* ``timeuuid``
* ``tinyint``
* ``uuid``
* ``varint``
The following options are supported for vector indexes. All of them are optional.
+------------------------------+----------------------------------------------------------------------------------------------------------+---------------+

View File

@@ -108,6 +108,4 @@ check the statement and throw if it is disallowed, similar to what
Obviously, an audit definition must survive a server restart and stay
consistent among all nodes in a cluster. We'll accomplish both by
storing audits in a system table. They will be cached in memory the
same way `permissions_cache` caches table contents in `permission_set`
objects resident in memory.
storing audits in a system table.

View File

@@ -39,6 +39,17 @@ Both client and server use the same string identifiers for the keys to determine
negotiated extension set, judging by the presence of a particular key in the
SUPPORTED/STARTUP messages.
## Client options
The `client_options` column in the `system.clients` table stores all data sent by the
client in the STARTUP request, as a `map<text, text>`. This column may be useful
for debugging and monitoring purposes.
Drivers can send additional data in STARTUP, e.g. the load balancing policy, retry
policy, timeouts, and other configuration.
Such data should be sent in the `CLIENT_OPTIONS` key, as JSON. The recommended
structure of this JSON will be decided in the future.
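As an illustration (not part of the change above), a short Python sketch using the Python driver to inspect what each connected client reported; the contact point is a placeholder:

```python
from cassandra.cluster import Cluster

# Connect to a node and list the STARTUP options recorded for each client connection.
cluster = Cluster(['127.0.0.1'])
session = cluster.connect()
for row in session.execute("SELECT address, port, client_options FROM system.clients"):
    print(row.address, row.port, row.client_options)
cluster.shutdown()
```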
## Intranode sharding
This extension allows the driver to discover how Scylla internally
@@ -74,8 +85,6 @@ The keys and values are:
as an indicator to which shard client wants to connect. The desired shard number
is calculated as: `desired_shard_no = client_port % SCYLLA_NR_SHARDS`.
Its value is a decimal representation of type `uint16_t`, by default `19142`.
- `CLIENT_OPTIONS` is a string containing a JSON object representation that
contains CQL Driver configuration, e.g. load balancing policy, retry policy, timeouts, etc.
Currently, one `SCYLLA_SHARDING_ALGORITHM` is defined,
`biased-token-round-robin`. To apply the algorithm,

View File

@@ -563,17 +563,18 @@ CREATE TABLE system.clients (
address inet,
port int,
client_type text,
client_options frozen<map<text, text>>,
connection_stage text,
driver_name text,
driver_version text,
hostname text,
protocol_version int,
scheduling_group text,
shard_id int,
ssl_cipher_suite text,
ssl_enabled boolean,
ssl_protocol text,
username text,
scheduling_group text,
PRIMARY KEY (address, port, client_type)
) WITH CLUSTERING ORDER BY (port ASC, client_type ASC)
~~~
@@ -581,4 +582,7 @@ CREATE TABLE system.clients (
Currently only CQL clients are tracked. The table used to be present on disk (in data
directory) before and including version 4.5.
The `client_options` column stores all data sent by the client in the STARTUP request.
This column is useful for debugging and monitoring purposes.
## TODO: the rest

View File

@@ -156,7 +156,7 @@ How do I check the current version of ScyllaDB that I am running?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* On a regular system or VM (running Ubuntu, CentOS, or RedHat Enterprise): :code:`$ scylla --version`
Check the :doc:`Operating System Support Guide </getting-started/os-support>` for a list of supported operating systems and versions.
Check the `Operating System Support Guide <https://docs.scylladb.com/stable/versioning/os-support-per-version.html>`_ for a list of supported operating systems and versions.
* On a docker node: :code:`$ docker exec -it Node_Z scylla --version`

View File

@@ -3,9 +3,9 @@
Automatic Repair
================
Traditionally, launching `repairs </operating-scylla/procedures/maintenance/repair>`_ in a ScyllaDB cluster is left to an external process, typically done via `Scylla Manager <https://manager.docs.scylladb.com/stable/repair/index.html>`_.
Traditionally, launching :doc:`repairs </operating-scylla/procedures/maintenance/repair>` in a ScyllaDB cluster is left to an external process, typically done via `Scylla Manager <https://manager.docs.scylladb.com/stable/repair/index.html>`_.
Automatic repair offers built-in scheduling in ScyllaDB itself. If the time since the last repair is greater than the configured repair interval, ScyllaDB will start a repair for the tablet `tablet </architecture/tablets>`_ automatically.
Automatic repair offers built-in scheduling in ScyllaDB itself. If the time since the last repair is greater than the configured repair interval, ScyllaDB will start a repair for the :doc:`tablet table </architecture/tablets>` automatically.
Repairs are spread over time and among nodes and shards, to avoid load spikes or any adverse effects on user workloads.
To enable automatic repair, add this to the configuration (``scylla.yaml``):
@@ -20,4 +20,4 @@ More featureful configuration methods will be implemented in the future.
To disable, set ``auto_repair_enabled_default: false``.
Automatic repair relies on `Incremental Repair </features/incremental-repair>`_ and as such it only works with `tablet </architecture/tablets>`_ tables.
Automatic repair relies on :doc:`Incremental Repair </features/incremental-repair>` and as such it only works with :doc:`tablet </architecture/tablets>` tables.

View File

@@ -3,7 +3,7 @@
Incremental Repair
==================
ScyllaDB's standard `repair </operating-scylla/procedures/maintenance/repair>`_ process scans and processes all the data on a node, regardless of whether it has changed since the last repair. This operation can be resource-intensive and time-consuming. The Incremental Repair feature provides a much more efficient and lightweight alternative for maintaining data consistency.
ScyllaDB's standard :doc:`repair </operating-scylla/procedures/maintenance/repair>` process scans and processes all the data on a node, regardless of whether it has changed since the last repair. This operation can be resource-intensive and time-consuming. The Incremental Repair feature provides a much more efficient and lightweight alternative for maintaining data consistency.
The core idea of incremental repair is to repair only the data that has been written or changed since the last repair was run. It intelligently skips data that has already been verified, dramatically reducing the time, I/O, and CPU resources required for the repair operation.
@@ -51,7 +51,7 @@ Benefits of Incremental Repair
* **Reduced Resource Usage:** Consumes significantly less CPU, I/O, and network bandwidth compared to a full repair.
* **More Frequent Repairs:** The efficiency of incremental repair allows you to run it more frequently, ensuring a higher level of data consistency across your cluster at all times.
Tables using Incremental Repair can schedule repairs in ScyllaDB itself, with `Automatic Repair </features/automatic-repair>`_.
Tables using Incremental Repair can schedule repairs in ScyllaDB itself, with :doc:`Automatic Repair </features/automatic-repair>`.
Notes
-----

View File

@@ -18,7 +18,7 @@ Getting Started
:class: my-panel
* :doc:`ScyllaDB System Requirements Guide</getting-started/system-requirements/>`
* :doc:`OS Support by Platform and Version</getting-started/os-support/>`
* `OS Support by Platform and Version <https://docs.scylladb.com/stable/versioning/os-support-per-version.html>`_
.. panel-box::
:title: Install and Configure ScyllaDB

View File

@@ -10,7 +10,6 @@ Install ScyllaDB |CURRENT_VERSION|
/getting-started/install-scylla/launch-on-azure
/getting-started/installation-common/scylla-web-installer
/getting-started/install-scylla/install-on-linux
/getting-started/installation-common/install-jmx
/getting-started/install-scylla/run-in-docker
/getting-started/installation-common/unified-installer
/getting-started/installation-common/air-gapped-install
@@ -24,9 +23,9 @@ Keep your versions up-to-date. The two latest versions are supported. Also, alwa
:id: "getting-started"
:class: my-panel
* :doc:`Launch ScyllaDB |CURRENT_VERSION| on AWS </getting-started/install-scylla/launch-on-aws>`
* :doc:`Launch ScyllaDB |CURRENT_VERSION| on GCP </getting-started/install-scylla/launch-on-gcp>`
* :doc:`Launch ScyllaDB |CURRENT_VERSION| on Azure </getting-started/install-scylla/launch-on-azure>`
* :doc:`Launch ScyllaDB on AWS </getting-started/install-scylla/launch-on-aws>`
* :doc:`Launch ScyllaDB on GCP </getting-started/install-scylla/launch-on-gcp>`
* :doc:`Launch ScyllaDB on Azure </getting-started/install-scylla/launch-on-azure>`
.. panel-box::
@@ -35,8 +34,7 @@ Keep your versions up-to-date. The two latest versions are supported. Also, alwa
:class: my-panel
* :doc:`Install ScyllaDB with Web Installer (recommended) </getting-started/installation-common/scylla-web-installer>`
* :doc:`Install ScyllaDB |CURRENT_VERSION| Linux Packages </getting-started/install-scylla/install-on-linux>`
* :doc:`Install scylla-jmx Package </getting-started/installation-common/install-jmx>`
* :doc:`Install ScyllaDB Linux Packages </getting-started/install-scylla/install-on-linux>`
* :doc:`Install ScyllaDB Without root Privileges </getting-started/installation-common/unified-installer>`
* :doc:`Air-gapped Server Installation </getting-started/installation-common/air-gapped-install>`
* :doc:`ScyllaDB Developer Mode </getting-started/installation-common/dev-mod>`

View File

@@ -17,7 +17,7 @@ This article will help you install ScyllaDB on Linux using platform-specific pac
Prerequisites
----------------
* Ubuntu, Debian, CentOS, or RHEL (see :doc:`OS Support by Platform and Version </getting-started/os-support>`
* Ubuntu, Debian, CentOS, or RHEL (see `OS Support by Platform and Version <https://docs.scylladb.com/stable/versioning/os-support-per-version.html>`_
for details about supported versions and architecture)
* Root or ``sudo`` access to the system
* Open :ref:`ports used by ScyllaDB <networking-ports>`
@@ -94,16 +94,6 @@ Install ScyllaDB
apt-get install scylla{,-server,-kernel-conf,-node-exporter,-conf,-python3,-cqlsh}=2025.3.1-0.20250907.2bbf3cf669bb-1
#. (Ubuntu only) Set Java 11.
.. code-block:: console
sudo apt-get update
sudo apt-get install -y openjdk-11-jre-headless
sudo update-java-alternatives --jre-headless -s java-1.11.0-openjdk-amd64
.. group-tab:: Centos/RHEL
#. Install the EPEL repository.
@@ -157,14 +147,6 @@ Install ScyllaDB
sudo yum install scylla-5.2.3
(Optional) Install scylla-jmx
-------------------------------
scylla-jmx is an optional package and is not installed by default.
If you need JMX server, see :doc:`Install scylla-jmx Package </getting-started/installation-common/install-jmx>`.
.. include:: /getting-started/_common/setup-after-install.rst
Next Steps

View File

@@ -1,78 +0,0 @@
======================================
Install scylla-jmx Package
======================================
scylla-jmx is an optional package and is not installed by default.
If you need JMX server, you can still install it from scylla-jmx GitHub page.
.. tabs::
.. group-tab:: Debian/Ubuntu
#. Download .deb package from scylla-jmx page.
Access to https://github.com/scylladb/scylla-jmx, select latest
release from "releases", download a file end with ".deb".
#. (Optional) Transfer the downloaded package to the install node.
If the pc from which you downloaded the package is different from
the node where you install scylladb, you will need to transfer
the files to the node.
#. Install scylla-jmx package.
.. code-block:: console
sudo apt install -y ./scylla-jmx_<version>_all.deb
.. group-tab:: Centos/RHEL
#. Download .rpm package from scylla-jmx page.
Access to https://github.com/scylladb/scylla-jmx, select latest
release from "releases", download a file end with ".rpm".
#. (Optional) Transfer the downloaded package to the install node.
If the pc from which you downloaded the package is different from
the node where you install scylladb, you will need to transfer
the files to the node.
#. Install scylla-jmx package.
.. code-block:: console
sudo yum install -y ./scylla-jmx-<version>.noarch.rpm
.. group-tab:: Install without root privileges
#. Download .tar.gz package from scylla-jmx page.
Access to https://github.com/scylladb/scylla-jmx, select latest
release from "releases", download a file end with ".tar.gz".
#. (Optional) Transfer the downloaded package to the install node.
If the pc from which you downloaded the package is different from
the node where you install scylladb, you will need to transfer
the files to the node.
#. Install scylla-jmx package.
.. code:: console
tar xpf scylla-jmx-<version>.noarch.tar.gz
cd scylla-jmx
./install.sh --nonroot
Next Steps
-----------
* :doc:`Configure ScyllaDB </getting-started/system-configuration>`
* Manage your clusters with `ScyllaDB Manager <https://manager.docs.scylladb.com/>`_
* Monitor your cluster and data with `ScyllaDB Monitoring <https://monitoring.docs.scylladb.com/>`_
* Get familiar with ScyllaDBs :doc:`command line reference guide </operating-scylla/nodetool>`.
* Learn about ScyllaDB at `ScyllaDB University <https://university.scylladb.com/>`_

View File

@@ -10,7 +10,7 @@ Prerequisites
--------------
Ensure that your platform is supported by the ScyllaDB version you want to install.
See :doc:`OS Support by Platform and Version </getting-started/os-support/>`.
See `OS Support by Platform and Version <https://docs.scylladb.com/stable/versioning/os-support-per-version.html>`_.
Install ScyllaDB with Web Installer
---------------------------------------

View File

@@ -12,7 +12,8 @@ the package manager (dnf and apt).
Prerequisites
---------------
Ensure your platform is supported by the ScyllaDB version you want to install.
See :doc:`OS Support </getting-started/os-support>` for information about supported Linux distributions and versions.
See `OS Support <https://docs.scylladb.com/stable/versioning/os-support-per-version.html>`_
for information about supported Linux distributions and versions.
Note that if you're on CentOS 7, only root offline installation is supported.
@@ -48,11 +49,6 @@ Download and Install
./install.sh --nonroot --python3 ~/scylladb/python3/bin/python3
#. (Optional) Install scylla-jmx
scylla-jmx is an optional package and is not installed by default.
If you need JMX server, see :doc:`Install scylla-jmx Package </getting-started/installation-common/install-jmx>`.
Configure and Run ScyllaDB
----------------------------

View File

@@ -1,26 +0,0 @@
OS Support by Linux Distributions and Version
==============================================
The following matrix shows which Linux distributions, containers, and images
are :ref:`supported <os-support-definition>` with which versions of ScyllaDB.
.. datatemplate:json:: /_static/data/os-support.json
:template: platforms.tmpl
``*`` 2024.1.9 and later
All releases are available as a Docker container, EC2 AMI, GCP, and Azure images.
.. _os-support-definition:
By *supported*, it is meant that:
- A binary installation package is available.
- The download and install procedures are tested as part of the ScyllaDB release process for each version.
- An automated install is included from :doc:`ScyllaDB Web Installer for Linux tool </getting-started/installation-common/scylla-web-installer>` (for the latest versions).
You can `build ScyllaDB from source <https://github.com/scylladb/scylladb#build-prerequisites>`_
on other x86_64 or aarch64 platforms, without any guarantees.

View File

@@ -8,12 +8,12 @@ ScyllaDB Requirements
:hidden:
system-requirements
OS Support <os-support>
OS Support <https://docs.scylladb.com/stable/versioning/os-support-per-version.html>
Cloud Instance Recommendations <cloud-instance-recommendations>
scylla-in-a-shared-environment
* :doc:`System Requirements</getting-started/system-requirements/>`
* :doc:`OS Support by Platform and Version</getting-started/os-support/>`
* `OS Support by Platform and Version <https://docs.scylladb.com/stable/versioning/os-support-per-version.html>`_
* :doc:`Cloud Instance Recommendations AWS, GCP, and Azure </getting-started/cloud-instance-recommendations>`
* :doc:`Running ScyllaDB in a Shared Environment </getting-started/scylla-in-a-shared-environment>`

View File

@@ -8,7 +8,7 @@ Supported Platforms
===================
ScyllaDB runs on 64-bit Linux. The x86_64 and AArch64 architectures are supported (AArch64 support includes AWS EC2 Graviton).
See :doc:`OS Support by Platform and Version </getting-started/os-support>` for information about
See `OS Support by Platform and Version <https://docs.scylladb.com/stable/versioning/os-support-per-version.html>`_ for information about
supported operating systems, distros, and versions.
See :doc:`Cloud Instance Recommendations for AWS, GCP, and Azure </getting-started/cloud-instance-recommendations>` for information

View File

@@ -1,43 +0,0 @@
====================================================
Increase Permission Cache to Avoid Non-paged Queries
====================================================
**Topic: Mitigate non-paged queries coming from connection authentications**
**Audience: ScyllaDB administrators**
Issue
-----
If you create lots of roles and give them lots of permissions your nodes might spike with non-paged queries.
Root Cause
----------
``permissions_cache_max_entries`` is set to 1000 by default. This setting may not be high enough for bigger deployments with lots of tables, users, and roles with permissions.
Solution
--------
Open the scylla.yaml configuration for editing and adjust the following parameters:
``permissions_cache_max_entries`` - increase this value to suit your needs. See the example below.
``permissions_update_interval_in_ms``
``permissions_validity_in_ms``
Note:: ``permissions_update_interval_in_ms`` and ``permissions_validity_in_ms`` can be set to also make the authentication records come from cache instead of lookups, which generate non-paged queries
Example
-------
Considering with ``permissions_cache_max_entries`` there is no maximum value, it's just limited by your memory.
The cache consumes memory as it caches all records from the list of users and their associated roles (similar to a cartesian product).
Every user, role, and permissions(7 types) on a per table basis are cached.
If for example, you have 1 user with 1 role and 1 table, the table will have 7 permission types and 7 entries 1 * 1 * 1 * 7 = 7.
When expanded to 5 users, 5 roles, and 10 tables this will be 5 * 5 * 10 * 7 = 1750 entries, which is above the default cache value of 1000. The entries that go over the max value (750 entries) will be non-paged queries for every new connection from the client (and clients tend to reconnect often).
In cases like this, you may want to consider trading your memory for not stressing the entire cluster with ``auth`` queries.

View File

@@ -38,7 +38,6 @@ Knowledge Base
* :doc:`If a query does not reveal enough results </kb/cqlsh-results>`
* :doc:`How to Change gc_grace_seconds for a Table </kb/gc-grace-seconds>` - How to change the ``gc_grace_seconds`` parameter and prevent data resurrection.
* :doc:`How to flush old tombstones from a table </kb/tombstones-flush>` - How to remove old tombstones from SSTables.
* :doc:`Increase Cache to Avoid Non-paged Queries </kb/increase-permission-cache>` - How to increase the ``permissions_cache_max_entries`` setting.
* :doc:`How to Safely Increase the Replication Factor </kb/rf-increase>`
* :doc:`Facts about TTL, Compaction, and gc_grace_seconds <ttl-facts>`
* :doc:`Efficient Tombstone Garbage Collection in ICS <garbage-collection-ics>`

View File

@@ -25,7 +25,8 @@ Before you run ``nodetool decommission``:
starting the removal procedure.
* Make sure that the number of nodes remaining in the DC after you decommission a node
will be the same or higher than the Replication Factor configured for the keyspace
in this DC. If the number of remaining nodes is lower than the RF, the decommission
in this DC. Note that, for example, the audit feature, which is enabled by default, may require
adjusting the ``audit`` keyspace. If the number of remaining nodes is lower than the RF, the decommission
request may fail.
In such a case, ALTER the keyspace to reduce the RF before running ``nodetool decommission``.

View File

@@ -25,4 +25,8 @@ For Example:
nodetool rebuild <source-dc-name>
The ``nodetool rebuild`` command works only for vnode keyspaces. For tablet keyspaces, use ``nodetool cluster repair`` instead.
See :doc:`Data Distribution with Tablets </architecture/tablets/>`.
.. include:: nodetool-index.rst

View File

@@ -155,7 +155,6 @@ Add New DC
UN 54.235.9.159 109.75 KB 256 ? 39798227-9f6f-4868-8193-08570856c09a RACK1
UN 54.146.228.25 128.33 KB 256 ? 7a4957a1-9590-4434-9746-9c8a6f796a0c RACK1
.. TODO possibly provide additional information WRT how ALTER works with tablets
#. When all nodes are up and running ``ALTER`` the following Keyspaces in the new nodes:
@@ -171,26 +170,68 @@ Add New DC
DESCRIBE KEYSPACE mykeyspace;
CREATE KEYSPACE mykeyspace WITH replication = { 'class' : 'NetworkTopologyStrategy', '<exiting_dc>' : 3};
CREATE KEYSPACE mykeyspace WITH replication = { 'class' : 'NetworkTopologyStrategy', '<existing_dc>' : 3};
ALTER Command
.. code-block:: cql
ALTER KEYSPACE mykeyspace WITH replication = { 'class' : 'NetworkTopologyStrategy', '<exiting_dc>' : 3, <new_dc> : 3};
ALTER KEYSPACE system_distributed WITH replication = { 'class' : 'NetworkTopologyStrategy', '<exiting_dc>' : 3, <new_dc> : 3};
ALTER KEYSPACE system_traces WITH replication = { 'class' : 'NetworkTopologyStrategy', '<exiting_dc>' : 3, <new_dc> : 3};
ALTER KEYSPACE mykeyspace WITH replication = { 'class' : 'NetworkTopologyStrategy', '<existing_dc>' : 3, '<new_dc>' : 3};
ALTER KEYSPACE system_distributed WITH replication = { 'class' : 'NetworkTopologyStrategy', '<existing_dc>' : 3, '<new_dc>' : 3};
ALTER KEYSPACE system_traces WITH replication = { 'class' : 'NetworkTopologyStrategy', '<existing_dc>' : 3, '<new_dc>' : 3};
After
.. code-block:: cql
DESCRIBE KEYSPACE mykeyspace;
CREATE KEYSPACE mykeyspace WITH REPLICATION = {'class: 'NetworkTopologyStrategy', <exiting_dc>:3, <new_dc>: 3};
CREATE KEYSPACE system_distributed WITH replication = { 'class' : 'NetworkTopologyStrategy', '<exiting_dc>' : 3, <new_dc> : 3};
CREATE KEYSPACE system_traces WITH replication = { 'class' : 'NetworkTopologyStrategy', '<exiting_dc>' : 3, <new_dc> : 3};
CREATE KEYSPACE mykeyspace WITH REPLICATION = {'class': 'NetworkTopologyStrategy', '<existing_dc>' : 3, '<new_dc>' : 3};
CREATE KEYSPACE system_distributed WITH replication = { 'class' : 'NetworkTopologyStrategy', '<existing_dc>' : 3, '<new_dc>' : 3};
CREATE KEYSPACE system_traces WITH replication = { 'class' : 'NetworkTopologyStrategy', '<existing_dc>' : 3, '<new_dc>' : 3};
#. Run ``nodetool rebuild`` on each node in the new datacenter, specify the existing datacenter name in the rebuild command.
For tablet keyspaces, update the replication factor one by one:
.. code-block:: cql
DESCRIBE KEYSPACE mykeyspace2;
CREATE KEYSPACE mykeyspace2 WITH replication = { 'class' : 'NetworkTopologyStrategy', '<existing_dc>' : 3} AND tablets = { 'enabled': true };
.. code-block:: cql
ALTER KEYSPACE mykeyspace2 WITH replication = { 'class' : 'NetworkTopologyStrategy', '<existing_dc>' : 3, '<new_dc>' : 1} AND tablets = { 'enabled': true };
ALTER KEYSPACE mykeyspace2 WITH replication = { 'class' : 'NetworkTopologyStrategy', '<existing_dc>' : 3, '<new_dc>' : 2} AND tablets = { 'enabled': true };
ALTER KEYSPACE mykeyspace2 WITH replication = { 'class' : 'NetworkTopologyStrategy', '<existing_dc>' : 3, '<new_dc>' : 3} AND tablets = { 'enabled': true };
.. note::
If the ``rf_rack_valid_keyspaces`` option is set, a tablet keyspace needs to use a rack-list replication factor so that a new DC (rack) can be added. See :ref:`the conversion procedure <conversion-to-rack-list-rf>`. In this case, to add a datacenter:
Before
.. code-block:: cql
DESCRIBE KEYSPACE mykeyspace3;
CREATE KEYSPACE mykeyspace3 WITH replication = { 'class' : 'NetworkTopologyStrategy', '<existing_dc>' : ['<existing_rack1>', '<existing_rack2>', '<existing_rack3>']} AND tablets = { 'enabled': true };
Add all the nodes to the new datacenter, and then alter the keyspace to add the new racks one by one:
.. code-block:: cql
ALTER KEYSPACE mykeyspace3 WITH replication = { 'class' : 'NetworkTopologyStrategy', '<existing_dc>' : ['<existing_rack1>', '<existing_rack2>', '<existing_rack3>'], '<new_dc>' : ['<new_rack1>']} AND tablets = { 'enabled': true };
ALTER KEYSPACE mykeyspace3 WITH replication = { 'class' : 'NetworkTopologyStrategy', '<existing_dc>' : ['<existing_rack1>', '<existing_rack2>', '<existing_rack3>'], '<new_dc>' : ['<new_rack1>', '<new_rack2>']} AND tablets = { 'enabled': true };
ALTER KEYSPACE mykeyspace3 WITH replication = { 'class' : 'NetworkTopologyStrategy', '<existing_dc>' : ['<existing_rack1>', '<existing_rack2>', '<existing_rack3>'], '<new_dc>' : ['<new_rack1>', '<new_rack2>', '<new_rack3>']} AND tablets = { 'enabled': true };
After
.. code-block:: cql
DESCRIBE KEYSPACE mykeyspace3;
CREATE KEYSPACE mykeyspace3 WITH REPLICATION = {'class': 'NetworkTopologyStrategy', '<existing_dc>' : ['<existing_rack1>', '<existing_rack2>', '<existing_rack3>'], '<new_dc>' : ['<new_rack1>', '<new_rack2>', '<new_rack3>']} AND tablets = { 'enabled': true };
Consider :ref:`upgrading rf_rack_valid_keyspaces option to enforce_rack_list option <keyspace-rf-rack-valid-to-enforce-rack-list>` to ensure all tablet keyspaces use rack lists.
#. If any vnode keyspace was altered, run ``nodetool rebuild`` on each node in the new datacenter, specifying the existing datacenter name in the rebuild command.
For example:
@@ -198,7 +239,7 @@ Add New DC
The rebuild ensures that the new nodes that were just added to the cluster will recognize the existing datacenters in the cluster.
#. Run a full cluster repair, using :doc:`nodetool repair -pr </operating-scylla/nodetool-commands/repair>` on each node, or using `ScyllaDB Manager ad-hoc repair <https://manager.docs.scylladb.com/stable/repair>`_
#. If any vnode keyspace was altered, run a full cluster repair, using :doc:`nodetool repair -pr </operating-scylla/nodetool-commands/repair>` on each node, or using `ScyllaDB Manager ad-hoc repair <https://manager.docs.scylladb.com/stable/repair>`_
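For example, on each node (``-pr`` limits the repair to the node's primary token ranges, which is why the command must be run on every node):
.. code-block:: shell
nodetool repair -pr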
#. If you are using ScyllaDB Monitoring, update the `monitoring stack <https://monitoring.docs.scylladb.com/stable/install/monitoring_stack.html#configure-scylla-nodes-from-files>`_ to monitor it. If you are using ScyllaDB Manager, make sure you install the `Manager Agent <https://manager.docs.scylladb.com/stable/install-scylla-manager-agent.html>`_ and Manager can access the new DC.

View File

@@ -40,12 +40,14 @@ Prerequisites
Procedure
---------
#. Run the ``nodetool repair -pr`` command on each node in the data-center that is going to be decommissioned. This will verify that all the data is in sync between the decommissioned data-center and the other data-centers in the cluster.
#. If there are vnode keyspaces in this DC, run the ``nodetool repair -pr`` command on each node in the data-center that is going to be decommissioned. This will verify that all the data is in sync between the decommissioned data-center and the other data-centers in the cluster.
For example:
If the ASIA-DC cluster is to be removed, then, run the ``nodetool repair -pr`` command on all the nodes in the ASIA-DC
#. If there are tablet keyspaces in this DC, run ``nodetool cluster repair`` on an arbitrary node. This repair ensures that any updates stored only on the about-to-be-decommissioned replicas are propagated to the other replicas before the replicas in the decommissioned datacenter are dropped.
#. ALTER every cluster KEYSPACE, so that the keyspaces will no longer replicate data to the decommissioned data-center.
For example:
@@ -73,6 +75,44 @@ Procedure
cqlsh> ALTER KEYSPACE nba WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'US-DC' : 3, 'ASIA-DC' : 0, 'EUROPE-DC' : 3};
For tablet keyspaces, update the replication factor one by one:
.. code-block:: shell
cqlsh> DESCRIBE nba2
cqlsh> CREATE KEYSPACE nba2 WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'US-DC' : 3, 'ASIA-DC' : 2, 'EUROPE-DC' : 3} AND tablets = { 'enabled': true };
.. code-block:: shell
cqlsh> ALTER KEYSPACE nba2 WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'US-DC' : 3, 'ASIA-DC' : 1, 'EUROPE-DC' : 3} AND tablets = { 'enabled': true };
cqlsh> ALTER KEYSPACE nba2 WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'US-DC' : 3, 'ASIA-DC' : 0, 'EUROPE-DC' : 3} AND tablets = { 'enabled': true };
.. note::
If the ``rf_rack_valid_keyspaces`` option is set, a tablet keyspace needs to use a rack-list replication factor so that the DC can be removed. See :ref:`the conversion procedure <conversion-to-rack-list-rf>`. In this case, to remove a datacenter:
.. code-block:: shell
cqlsh> DESCRIBE nba3
cqlsh> CREATE KEYSPACE nba3 WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'US-DC' : ['RAC1', 'RAC2', 'RAC3'], 'ASIA-DC' : ['RAC4', 'RAC5'], 'EUROPE-DC' : ['RAC6', 'RAC7', 'RAC8']} AND tablets = { 'enabled': true };
.. code-block:: shell
cqlsh> ALTER KEYSPACE nba3 WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'US-DC' : ['RAC1', 'RAC2', 'RAC3'], 'ASIA-DC' : ['RAC4'], 'EUROPE-DC' : ['RAC6', 'RAC7', 'RAC8']} AND tablets = { 'enabled': true };
cqlsh> ALTER KEYSPACE nba3 WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'US-DC' : ['RAC1', 'RAC2', 'RAC3'], 'ASIA-DC' : [], 'EUROPE-DC' : ['RAC6', 'RAC7', 'RAC8']} AND tablets = { 'enabled': true };
Consider :ref:`upgrading rf_rack_valid_keyspaces option to enforce_rack_list option <keyspace-rf-rack-valid-to-enforce-rack-list>` to ensure all tablet keyspaces use rack lists.
.. note::
If table audit is enabled, the ``audit`` keyspace is automatically created with ``NetworkTopologyStrategy``.
You must also alter the ``audit`` keyspace to remove replicas from the decommissioned data-center. For example:
.. code-block:: shell
cqlsh> ALTER KEYSPACE audit WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'US-DC' : 3, 'ASIA-DC' : 0, 'EUROPE-DC' : 3};
Failure to do so will result in decommission errors such as "zero replica after the removal".
#. Run :doc:`nodetool decommission </operating-scylla/nodetool-commands/decommission>` on every node in the data center that is to be removed.
Refer to :doc:`Remove a Node from a ScyllaDB Cluster - Down Scale </operating-scylla/procedures/cluster-management/remove-node>` for further information.

View File

@@ -52,18 +52,14 @@ Row-level repair improves ScyllaDB in two ways:
* keeping the data in a temporary buffer.
* using the cached data to calculate the checksum and send it to the replicas.
See also
* `ScyllaDB Manager documentation <https://manager.docs.scylladb.com/>`_
* `Blog: ScyllaDB Open Source 3.1: Efficiently Maintaining Consistency with Row-Level Repair <https://www.scylladb.com/2019/08/13/scylla-open-source-3-1-efficiently-maintaining-consistency-with-row-level-repair/>`_
See also the `ScyllaDB Manager documentation <https://manager.docs.scylladb.com/>`_.
Incremental Repair
------------------
Built on top of `Row-level Repair <row-level-repair_>`_ and `Tablets </architecture/tablets>`_, Incremental Repair enables frequent and quick repairs. For more details, see `Incremental Repair </features/incremental-repair>`_.
Built on top of :ref:`Row-level Repair <row-level-repair>` and :doc:`Tablets </architecture/tablets>`, Incremental Repair enables frequent and quick repairs. For more details, see :doc:`Incremental Repair </features/incremental-repair>`.
Automatic Repair
----------------
Built on top of `Incremental Repair </features/incremental-repair>`_, `Automatic Repair </features/automatic-repair>`_ offers repair scheduling and execution directly in ScyllaDB, without external processes.
Built on top of :doc:`Incremental Repair </features/incremental-repair>`, :doc:`Automatic Repair </features/automatic-repair>` offers repair scheduling and execution directly in ScyllaDB, without external processes.

View File

@@ -14,11 +14,11 @@ Enable ScyllaDB :doc:`Authentication </operating-scylla/security/authentication>
Enabling Audit
---------------
By default, auditing is **enabled**. Enabling auditing is controlled by the ``audit:`` parameter in the ``scylla.yaml`` file.
By default, table auditing is **enabled**. Enabling auditing is controlled by the ``audit:`` parameter in the ``scylla.yaml`` file.
You can set the following options:
* ``none`` - Audit is disabled (default).
* ``table`` - Audit is enabled, and messages are stored in a Scylla table.
* ``none`` - Audit is disabled.
* ``table`` - Audit is enabled, and messages are stored in a Scylla table (default).
* ``syslog`` - Audit is enabled, and messages are sent to Syslog.
* ``syslog,table`` - Audit is enabled, and messages are stored in a Scylla table and sent to Syslog.
@@ -32,7 +32,7 @@ The audit can be tuned using the following flags or ``scylla.yaml`` entries:
================== ================================== ========================================================================================================================
Flag Default Value Description
================== ================================== ========================================================================================================================
audit_categories "DCL,DDL,AUTH,ADMIN" Comma-separated list of statement categories that should be audited
audit_categories "DCL,AUTH,ADMIN" Comma-separated list of statement categories that should be audited
------------------ ---------------------------------- ------------------------------------------------------------------------------------------------------------------------
audit_tables “” Comma-separated list of table names that should be audited, in the format of <keyspacename>.<tablename>
------------------ ---------------------------------- ------------------------------------------------------------------------------------------------------------------------
@@ -86,9 +86,7 @@ Storing Audit Messages in Syslog
.. code-block:: shell
# audit setting
# by default, Scylla does not audit anything.
# It is possible to enable auditing to the following places:
# - audit.audit_log column family by setting the flag to "table"
# 'audit' config option controls if and where to output audited events:
audit: "syslog"
#
# List of statement categories that should be audited.
@@ -159,9 +157,7 @@ For example:
.. code-block:: shell
# audit setting
# by default, Scylla does not audit anything.
# It is possible to enable auditing to the following places:
# - audit.audit_log column family by setting the flag to "table"
# 'audit' config option controls if and where to output audited events:
audit: "table"
#
# List of statement categories that should be audited.
@@ -215,8 +211,8 @@ Handling Audit Failures
In some cases, auditing may not be possible, for example, when:
* A table is used as the audits backend, and the audit partition where the audit row is saved is not available because the node that holds this partition is down.
* Syslog is used as the audits backend, and the Syslog sink (a regular unix socket) is unresponsive/unavailable.
* A table is used as the audits backend, and the partitions where the audit rows are saved are unavailable because the nodes holding those partitions are down or unreachable due to network issues.
* Syslog is used as the audits backend, and the Syslog sink (a regular Unix socket) is unresponsive or unavailable.
If the audit fails and audit messages are not stored in the configured audits backend, you can still review the audit log in the regular ScyllaDB logs.

View File

@@ -14,7 +14,7 @@ if necessary.
This guide covers upgrading ScyllaDB on Red Hat Enterprise Linux (RHEL),
CentOS, Debian, and Ubuntu.
See :doc:`OS Support by Platform and Version </getting-started/os-support>`
See `OS Support by Platform and Version <https://docs.scylladb.com/stable/versioning/os-support-per-version.html>`_
for information about supported versions.
It also applies to the ScyllaDB official image on EC2, GCP, or Azure.

View File

@@ -17,7 +17,7 @@ This document describes a step-by-step procedure for upgrading from |SCYLLA_NAME
to |SCYLLA_NAME| |NEW_VERSION| and rollback to version |SRC_VERSION| if necessary.
This guide covers upgrading ScyllaDB on Red Hat Enterprise Linux (RHEL), CentOS, Debian,
and Ubuntu. See :doc:`OS Support by Platform and Version </getting-started/os-support>`
and Ubuntu. See `OS Support by Platform and Version <https://docs.scylladb.com/stable/versioning/os-support-per-version.html>`_
for information about supported versions.
It also applies when using the ScyllaDB official image on EC2, GCP, or Azure.
@@ -199,9 +199,6 @@ You should take note of the current version in case you want to |ROLLBACK|_ the
#. Run ``sudo /opt/scylladb/scylla-machine-image/scylla_cloud_io_setup``.
If you need JMX server, see
:doc:`Install scylla-jmx Package </getting-started/installation-common/install-jmx>`
and get new version.
Start the node
--------------

View File

@@ -284,6 +284,7 @@ future<rjson::value> encryption::gcp_host::impl::gcp_auth_post_with_retry(std::s
}
[[fallthrough]];
case httpclient::reply_status::request_timeout:
case httpclient::reply_status::too_many_requests:
if (retry < max_retries) {
// service unavailable etc -> backoff + retry
do_backoff = true;

View File

@@ -2424,8 +2424,8 @@ bool gossiper::is_enabled() const {
void gossiper::add_expire_time_for_endpoint(locator::host_id endpoint, clk::time_point expire_time) {
auto now_ = now();
auto diff = std::chrono::duration_cast<std::chrono::seconds>(expire_time - now_).count();
logger.info("Node {} will be removed from gossip at [{:%Y-%m-%d %T}]: (expire = {}, now = {}, diff = {} seconds)",
endpoint, fmt::localtime(clk::to_time_t(expire_time)), expire_time.time_since_epoch().count(),
logger.info("Node {} will be removed from gossip at [{:%Y-%m-%d %T %z}]: (expire = {}, now = {}, diff = {} seconds)",
endpoint, fmt::gmtime(clk::to_time_t(expire_time)), expire_time.time_since_epoch().count(),
now_.time_since_epoch().count(), diff);
_expire_time_endpoint_map[endpoint] = expire_time;
}

View File

@@ -153,6 +153,8 @@ public:
}
const std::set<inet_address>& get_seeds() const noexcept;
seastar::scheduling_group get_scheduling_group() const noexcept { return _gcfg.gossip_scheduling_group; }
public:
static clk::time_point inline now() noexcept { return clk::now(); }
public:

View File

@@ -23,11 +23,11 @@ static_assert(std::is_nothrow_move_constructible_v<gms::inet_address>);
future<gms::inet_address> gms::inet_address::lookup(sstring name, opt_family family, opt_family preferred) {
return seastar::net::dns::get_host_by_name(std::move(name), family).then([preferred](seastar::net::hostent&& h) {
for (auto& addr : h.addr_list) {
if (!preferred || addr.in_family() == preferred) {
return gms::inet_address(addr);
for (auto& ent : h.addr_entries) {
if (!preferred || ent.addr.in_family() == preferred) {
return gms::inet_address(ent.addr);
}
}
return gms::inet_address(h.addr_list.front());
return gms::inet_address(h.addr_entries.front().addr);
});
}

View File

@@ -17,11 +17,11 @@
#include "index/secondary_index.hh"
#include "index/secondary_index_manager.hh"
#include "types/concrete_types.hh"
#include "types/types.hh"
#include "utils/managed_string.hh"
#include <seastar/core/sstring.hh>
#include <boost/algorithm/string.hpp>
namespace secondary_index {
static void validate_positive_option(int max, const sstring& value_name, const sstring& value) {
@@ -147,17 +147,88 @@ std::optional<cql3::description> vector_index::describe(const index_metadata& im
}
void vector_index::check_target(const schema& schema, const std::vector<::shared_ptr<cql3::statements::index_target>>& targets) const {
if (targets.size() != 1) {
throw exceptions::invalid_request_exception("Vector index can only be created on a single column");
}
auto target = targets[0];
auto c_def = schema.get_column_definition(to_bytes(target->column_name()));
if (!c_def) {
throw exceptions::invalid_request_exception(format("Column {} not found in schema", target->column_name()));
}
auto type = c_def->type;
if (!type->is_vector() || static_cast<const vector_type_impl*>(type.get())->get_elements_type()->get_kind() != abstract_type::kind::float_kind) {
throw exceptions::invalid_request_exception(format("Vector indexes are only supported on columns of vectors of floats", target->column_name()));
struct validate_visitor {
const class schema& schema;
bool& is_vector;
/// Vector indexes support filtering on native types that can be used as primary key columns.
/// There is no counter (it cannot be used with vector columns)
/// and no duration (it cannot be used as a primary key or in secondary indexes).
static bool is_supported_filtering_column(abstract_type const & kind_type) {
switch (kind_type.get_kind()) {
case abstract_type::kind::ascii:
case abstract_type::kind::boolean:
case abstract_type::kind::byte:
case abstract_type::kind::bytes:
case abstract_type::kind::date:
case abstract_type::kind::decimal:
case abstract_type::kind::double_kind:
case abstract_type::kind::float_kind:
case abstract_type::kind::inet:
case abstract_type::kind::int32:
case abstract_type::kind::long_kind:
case abstract_type::kind::short_kind:
case abstract_type::kind::simple_date:
case abstract_type::kind::time:
case abstract_type::kind::timestamp:
case abstract_type::kind::timeuuid:
case abstract_type::kind::utf8:
case abstract_type::kind::uuid:
case abstract_type::kind::varint:
return true;
default:
break;
}
return false;
}
void validate(cql3::column_identifier const& column, bool is_vector) const {
auto const& c_name = column.to_string();
auto const* c_def = schema.get_column_definition(column.name());
if (c_def == nullptr) {
throw exceptions::invalid_request_exception(format("Column {} not found in schema", c_name));
}
auto type = c_def->type;
if (is_vector) {
auto const* vector_type = dynamic_cast<const vector_type_impl*>(type.get());
if (vector_type == nullptr) {
throw exceptions::invalid_request_exception("Vector indexes are only supported on columns of vectors of floats");
}
auto elements_type = vector_type->get_elements_type();
if (elements_type->get_kind() != abstract_type::kind::float_kind) {
throw exceptions::invalid_request_exception("Vector indexes are only supported on columns of vectors of floats");
}
return;
}
if (!is_supported_filtering_column(*type)) {
throw exceptions::invalid_request_exception(format("Unsupported vector index filtering column {} type", c_name));
}
}
void operator()(const std::vector<::shared_ptr<cql3::column_identifier>>& columns) const {
for (const auto& column : columns) {
// CQL only allows a secondary local index to have multiple target columns when they are partition key columns.
// Vectors cannot be partition key columns, and they are not supported as filtering columns,
// so we can assume here that these are non-vector filtering columns.
validate(*column, false);
}
}
void operator()(const ::shared_ptr<cql3::column_identifier>& column) {
validate(*column, is_vector);
// The first column is the vector column, the rest mustn't be vectors.
is_vector = false;
}
};
bool is_vector = true;
for (const auto& target : targets) {
std::visit(validate_visitor{.schema = schema, .is_vector = is_vector}, target->value);
}
}

init.cc
View File

@@ -11,7 +11,6 @@
#include "seastarx.hh"
#include "db/config.hh"
#include <boost/algorithm/string/trim.hpp>
#include <seastar/core/coroutine.hh>
#include "sstables/sstable_compressor_factory.hh"
#include "gms/feature_service.hh"
@@ -30,11 +29,7 @@ std::set<gms::inet_address> get_seeds_from_db_config(const db::config& cfg,
std::set<gms::inet_address> seeds;
if (seed_provider.parameters.contains("seeds")) {
size_t begin = 0;
size_t next = 0;
sstring seeds_str = seed_provider.parameters.find("seeds")->second;
while (begin < seeds_str.length() && begin != (next=seeds_str.find(",",begin))) {
auto seed = boost::trim_copy(seeds_str.substr(begin,next-begin));
for (const auto& seed : utils::split_comma_separated_list(seed_provider.parameters.at("seeds"))) {
try {
seeds.emplace(gms::inet_address::lookup(seed, family, preferred).get());
} catch (...) {
@@ -46,11 +41,10 @@ std::set<gms::inet_address> get_seeds_from_db_config(const db::config& cfg,
seed,
std::current_exception());
}
begin = next+1;
}
}
if (seeds.empty()) {
seeds.emplace(gms::inet_address("127.0.0.1"));
seeds.emplace("127.0.0.1");
}
startlog.info("seeds={{{}}}, listen_address={}, broadcast_address={}",
fmt::join(seeds, ", "), listen, broadcast_address);
@@ -102,13 +96,6 @@ std::set<sstring> get_disabled_features_from_db_config(const db::config& cfg, st
if (!cfg.check_experimental(db::experimental_features_t::feature::STRONGLY_CONSISTENT_TABLES)) {
disabled.insert("STRONGLY_CONSISTENT_TABLES"s);
}
if (cfg.force_gossip_topology_changes()) {
if (cfg.enable_tablets_by_default()) {
throw std::runtime_error("Tablets cannot be enabled with gossip topology changes. Use either --tablets-mode-for-new-keyspaces=enabled|enforced or --force-gossip-topology-changes, but not both.");
}
startlog.warn("The tablets feature is disabled due to forced gossip topology changes");
disabled.insert("TABLETS"s);
}
if (!cfg.table_digest_insensitive_to_expiry()) {
disabled.insert("TABLE_DIGEST_INSENSITIVE_TO_EXPIRY"s);
}

View File

@@ -150,7 +150,6 @@ fedora_packages=(
llvm
openldap-servers
openldap-devel
toxiproxy
cyrus-sasl
fipscheck
cpp-jwt-devel
@@ -158,7 +157,10 @@ fedora_packages=(
podman
buildah
https://github.com/scylladb/cassandra-stress/releases/download/v3.18.1/cassandra-stress-java21-3.18.1-1.noarch.rpm
# for cassandra-stress
java-openjdk-headless
snappy
elfutils
jq
@@ -295,6 +297,7 @@ print_usage() {
echo " --print-pip-runtime-packages Print required pip packages for Scylla"
echo " --print-pip-symlinks Print list of pip provided commands which need to install to /usr/bin"
echo " --print-node-exporter-filename Print node_exporter filename"
echo " --future Install dependencies for future toolchain (Fedora rawhide based)"
exit 1
}
@@ -302,6 +305,7 @@ PRINT_PYTHON3=false
PRINT_PIP=false
PRINT_PIP_SYMLINK=false
PRINT_NODE_EXPORTER=false
FUTURE=false
while [ $# -gt 0 ]; do
case "$1" in
"--print-python3-runtime-packages")
@@ -320,6 +324,10 @@ while [ $# -gt 0 ]; do
PRINT_NODE_EXPORTER=true
shift 1
;;
"--future")
FUTURE=true
shift 1
;;
*)
print_usage
;;
@@ -350,6 +358,10 @@ if $PRINT_NODE_EXPORTER; then
exit 0
fi
if ! $FUTURE; then
fedora_packages+=(toxiproxy)
fi
umask 0022
./seastar/install-dependencies.sh
@@ -377,6 +389,10 @@ elif [ "$ID" = "fedora" ]; then
exit 1
fi
dnf install -y "${fedora_packages[@]}" "${fedora_python3_packages[@]}"
# Fedora 45 tightened key checks, and cassandra-stress is not signed yet.
dnf install --no-gpgchecks -y https://github.com/scylladb/cassandra-stress/releases/download/v3.18.1/cassandra-stress-java21-3.18.1-1.noarch.rpm
PIP_DEFAULT_ARGS="--only-binary=:all: -v"
pip_constrained_packages=""
for package in "${!pip_packages[@]}"
@@ -447,3 +463,11 @@ if [ ! -z "${CURL_ARGS}" ]; then
else
echo "Minio server and client are up-to-date, skipping download"
fi
if $FUTURE ; then
toxiproxy_version="v2.12.0"
for bin in toxiproxy-cli toxiproxy-server; do
curl -fSL -o "/usr/local/bin/${bin}" "https://github.com/Shopify/toxiproxy/releases/download/${toxiproxy_version}/${bin}-linux-$(go_arch)"
chmod +x "/usr/local/bin/${bin}"
done
fi

View File

@@ -8,6 +8,7 @@
#include <boost/date_time/gregorian/greg_date.hpp>
#include <boost/date_time/posix_time/posix_time.hpp>
#include <random>
#include "lua.hh"
#include "lang/lua_scylla_types.hh"
#include "exceptions/exceptions.hh"
@@ -28,6 +29,14 @@
# define LUA_504_PLUS(x...)
#endif
// Lua 5.5 added a seed parameter to lua_newstate
#if LUA_VERSION_NUM >= 505
# define LUA_505_PLUS(x...) x
#else
# define LUA_505_PLUS(x...)
#endif
using namespace seastar;
using namespace lua;
@@ -126,7 +135,11 @@ static void debug_hook(lua_State* l, lua_Debug* ar) {
static lua_slice_state new_lua(const lua::runtime_config& cfg) {
auto a_state = std::make_unique<alloc_state>(cfg.max_bytes, cfg.max_contiguous);
std::unique_ptr<lua_State, lua_closer> l{lua_newstate(lua_alloc, a_state.get())};
#if LUA_VERSION_NUM >= 505
static thread_local std::default_random_engine rng{std::random_device{}()};
auto seed = rng();
#endif
std::unique_ptr<lua_State, lua_closer> l{lua_newstate(lua_alloc, a_state.get() LUA_505_PLUS(, seed))};
if (!l) {
throw std::runtime_error("could not create lua state");
}
@@ -270,17 +283,6 @@ concept CanHandleLuaTypes = requires(Func f) {
{ f(*static_cast<const lua_table*>(nullptr)) } -> std::same_as<lua_visit_ret_type<Func>>;
};
// This is used to test if a double fits in a long long, so
// we expect overflows. Prevent the sanitizer from complaining.
#ifdef __clang__
[[clang::no_sanitize("undefined")]]
#endif
static
long long
cast_to_long_long_allow_overflow(double v) {
return (long long)v;
}
template <typename Func>
requires CanHandleLuaTypes<Func>
static auto visit_lua_value(lua_State* l, int index, Func&& f) {
@@ -291,9 +293,10 @@ static auto visit_lua_value(lua_State* l, int index, Func&& f) {
auto operator()(const long long& v) { return f(utils::multiprecision_int(v)); }
auto operator()(const utils::multiprecision_int& v) { return f(v); }
auto operator()(const double& v) {
long long v2 = cast_to_long_long_allow_overflow(v);
if (v2 == v) {
return (*this)(v2);
auto min = double(std::numeric_limits<long long>::min());
auto max = double(std::numeric_limits<long long>::max());
if (min <= v && v <= max && std::trunc(v) == v) {
return (*this)((long long)v);
}
// FIXME: We could use frexp to produce a decimal instead of a double
return f(v);

View File

@@ -616,12 +616,16 @@ tablet_replica tablet_map::get_primary_replica(tablet_id id, const locator::topo
return maybe_get_primary_replica(id, replicas, topo, [&] (const auto& _) { return true; }).value();
}
tablet_replica tablet_map::get_secondary_replica(tablet_id id) const {
if (get_tablet_info(id).replicas.size() < 2) {
tablet_replica tablet_map::get_secondary_replica(tablet_id id, const locator::topology& topo) const {
const auto& orig_replicas = get_tablet_info(id).replicas;
if (orig_replicas.size() < 2) {
throw std::runtime_error(format("No secondary replica for tablet id {}", id));
}
const auto& replicas = get_tablet_info(id).replicas;
return replicas.at((size_t(id)+1) % replicas.size());
tablet_replica_set replicas = orig_replicas;
std::ranges::sort(replicas, tablet_replica_comparator(topo));
// This formula must match the one in get_primary_replica(),
// just with + 1.
return replicas.at((size_t(id) + size_t(id) / replicas.size() + 1) % replicas.size());
}
std::optional<tablet_replica> tablet_map::maybe_get_selected_replica(tablet_id id, const topology& topo, const tablet_task_info& tablet_task_info) const {

View File

@@ -648,9 +648,10 @@ public:
/// Returns the primary replica for the tablet
tablet_replica get_primary_replica(tablet_id id, const locator::topology& topo) const;
/// Returns the secondary replica for the tablet, which is assumed to be directly following the primary replica in the replicas vector
/// Returns the secondary replica for the tablet: the replica that immediately follows the primary
/// replica in the topology-sorted replica list.
/// \throws std::runtime_error if the tablet has less than 2 replicas.
tablet_replica get_secondary_replica(tablet_id id) const;
tablet_replica get_secondary_replica(tablet_id id, const locator::topology& topo) const;
// Returns the replica that matches hosts and dcs filters for tablet_task_info.
std::optional<tablet_replica> maybe_get_selected_replica(tablet_id id, const topology& topo, const tablet_task_info& tablet_task_info) const;

View File

@@ -1170,6 +1170,17 @@ token_metadata::set_version_tracker(version_tracker_t tracker) {
_impl->set_version_tracker(std::move(tracker));
}
version_tracker::version_tracker(utils::phased_barrier::operation op, const token_metadata& tm)
: _op(std::move(op))
, _version(tm.get_version())
, _tm(&tm)
{
}
long version_tracker::version_use_count() const {
return _tm->use_count();
}
version_tracker::~version_tracker() {
if (_expired_at) {
auto now = std::chrono::steady_clock::now();
@@ -1181,8 +1192,8 @@ version_tracker::~version_tracker() {
}
}
version_tracker shared_token_metadata::new_tracker(token_metadata::version_t version) {
auto tracker = version_tracker(_versions_barrier.start(), version);
version_tracker shared_token_metadata::new_tracker(const token_metadata& tm) {
auto tracker = version_tracker(_versions_barrier.start(), tm);
_trackers.push_front(tracker);
return tracker;
}
@@ -1198,6 +1209,18 @@ void shared_token_metadata::clear_and_dispose(std::unique_ptr<token_metadata_imp
}
}
std::unordered_map<service::topology::version_t, int> shared_token_metadata::describe_stale_versions() {
std::unordered_map<service::topology::version_t, int> result;
const auto active_version = _shared.get()->get_version();
for (const auto& t: _trackers) {
const auto v = t.version();
if (v < active_version) {
result.emplace(v, t.version_use_count());
}
}
return result;
}
void shared_token_metadata::set(mutable_token_metadata_ptr tmptr) noexcept {
if (_shared->get_ring_version() >= tmptr->get_ring_version()) {
on_internal_error(tlogger, format("shared_token_metadata: must not set non-increasing ring_version: {} -> {}", _shared->get_ring_version(), tmptr->get_ring_version()));
@@ -1211,7 +1234,7 @@ void shared_token_metadata::set(mutable_token_metadata_ptr tmptr) noexcept {
tmptr->set_shared_token_metadata(*this);
_shared = std::move(tmptr);
_shared->set_version_tracker(new_tracker(_shared->get_version()));
_shared->set_version_tracker(new_tracker(*_shared));
for (auto&& v : _trackers) {
if (v.version() != _shared->get_version()) {

View File

@@ -112,6 +112,7 @@ public:
private:
utils::phased_barrier::operation _op;
service::topology::version_t _version;
const token_metadata* _tm = nullptr;
link_type _link;
// When engaged it means the version is no longer latest and should be released soon as to
@@ -120,8 +121,7 @@ private:
std::chrono::steady_clock::duration _log_threshold;
public:
version_tracker() = default;
version_tracker(utils::phased_barrier::operation op, service::topology::version_t version)
: _op(std::move(op)), _version(version) {}
version_tracker(utils::phased_barrier::operation op, const token_metadata& tm);
version_tracker(version_tracker&&) noexcept = default;
version_tracker& operator=(version_tracker&& o) noexcept {
if (this != &o) {
@@ -137,6 +137,8 @@ public:
return _version;
}
long version_use_count() const;
void mark_expired(std::chrono::steady_clock::duration log_threshold) {
if (!_expired_at) {
_expired_at = std::chrono::steady_clock::now();
@@ -172,7 +174,7 @@ private:
friend class token_metadata_impl;
};
class token_metadata final {
class token_metadata final: public enable_lw_shared_from_this<token_metadata> {
shared_token_metadata* _shared_token_metadata = nullptr;
std::unique_ptr<token_metadata_impl> _impl;
private:
@@ -410,7 +412,7 @@ class shared_token_metadata : public peering_sharded_service<shared_token_metada
boost::intrusive::constant_time_size<false>>;
version_tracker_list_type _trackers;
private:
version_tracker new_tracker(token_metadata::version_t);
version_tracker new_tracker(const token_metadata& tm);
public:
// used to construct the shared object as a sharded<> instance
// lock_func returns semaphore_units<>
@@ -419,7 +421,7 @@ public:
, _lock_func(std::move(lock_func))
, _versions_barrier("shared_token_metadata::versions_barrier")
{
_shared->set_version_tracker(new_tracker(_shared->get_version()));
_shared->set_version_tracker(new_tracker(*_shared));
}
shared_token_metadata(const shared_token_metadata& x) = delete;
@@ -446,6 +448,9 @@ public:
_stall_detector_threshold = threshold;
}
// Returns a map version -> use_count
std::unordered_map<service::topology::version_t, int> describe_stale_versions();
future<> stale_versions_in_use() const {
return _stale_versions_in_use.get_future();
}

main.cc
View File

@@ -1150,6 +1150,7 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
dbcfg.memtable_scheduling_group = create_scheduling_group("memtable", "mt", 1000).get();
dbcfg.memtable_to_cache_scheduling_group = create_scheduling_group("memtable_to_cache", "mt2c", 200).get();
dbcfg.gossip_scheduling_group = create_scheduling_group("gossip", "gms", 1000).get();
debug::gossip_scheduling_group = dbcfg.gossip_scheduling_group;
dbcfg.commitlog_scheduling_group = create_scheduling_group("commitlog", "clog", 1000).get();
dbcfg.schema_commitlog_scheduling_group = create_scheduling_group("schema_commitlog", "sclg", 1000).get();
dbcfg.available_memory = memory::stats().total_memory();
@@ -2041,8 +2042,7 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
cdc_config.ring_delay = std::chrono::milliseconds(cfg->ring_delay_ms());
cdc_config.dont_rewrite_streams = cfg->cdc_dont_rewrite_streams();
cdc_generation_service.start(std::move(cdc_config), std::ref(gossiper), std::ref(sys_dist_ks), std::ref(sys_ks),
std::ref(stop_signal.as_sharded_abort_source()), std::ref(token_metadata), std::ref(feature_service), std::ref(db),
[&ss] () -> bool { return ss.local().raft_topology_change_enabled(); }).get();
std::ref(stop_signal.as_sharded_abort_source()), std::ref(token_metadata), std::ref(feature_service), std::ref(db)).get();
auto stop_cdc_generation_service = defer_verbose_shutdown("CDC Generation Management service", [] {
cdc_generation_service.stop().get();
});
@@ -2071,13 +2071,7 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
gossiper.local().unregister_(mm.local().shared_from_this()).get();
});
utils::loading_cache_config perm_cache_config;
perm_cache_config.max_size = cfg->permissions_cache_max_entries();
perm_cache_config.expiry = std::chrono::milliseconds(cfg->permissions_validity_in_ms());
perm_cache_config.refresh = std::chrono::milliseconds(cfg->permissions_update_interval_in_ms());
auto start_auth_service = [&mm] (sharded<auth::service>& auth_service, std::any& stop_auth_service, const char* what) {
supervisor::notify(fmt::format("starting {}", what));
auth_service.invoke_on_all(&auth::service::start, std::ref(mm), std::ref(sys_ks)).get();
stop_auth_service = defer_verbose_shutdown(what, [&auth_service] {
@@ -2105,7 +2099,7 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
maintenance_auth_config.authenticator_java_name = sstring{auth::allow_all_authenticator_name};
maintenance_auth_config.role_manager_java_name = sstring{auth::maintenance_socket_role_manager_name};
maintenance_auth_service.start(perm_cache_config, std::ref(qp), std::ref(group0_client), std::ref(mm_notifier), std::ref(mm), maintenance_auth_config, maintenance_socket_enabled::yes, std::ref(auth_cache)).get();
maintenance_auth_service.start(std::ref(qp), std::ref(group0_client), std::ref(mm_notifier), std::ref(mm), maintenance_auth_config, maintenance_socket_enabled::yes, std::ref(auth_cache)).get();
cql_maintenance_server_ctl.emplace(maintenance_auth_service, mm_notifier, gossiper, qp, service_memory_limiter, sl_controller, lifecycle_notifier, *cfg, maintenance_cql_sg_stats_key, maintenance_socket_enabled::yes, dbcfg.statement_scheduling_group);
@@ -2372,7 +2366,7 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
auth_config.authenticator_java_name = qualified_authenticator_name;
auth_config.role_manager_java_name = qualified_role_manager_name;
auth_service.start(std::move(perm_cache_config), std::ref(qp), std::ref(group0_client), std::ref(mm_notifier), std::ref(mm), auth_config, maintenance_socket_enabled::no, std::ref(auth_cache)).get();
auth_service.start(std::ref(qp), std::ref(group0_client), std::ref(mm_notifier), std::ref(mm), auth_config, maintenance_socket_enabled::no, std::ref(auth_cache)).get();
std::any stop_auth_service;
// Has to be called after node joined the cluster (join_cluster())

View File

@@ -272,25 +272,27 @@ private:
bool can_purge_tombstone(const tombstone& t, is_shadowable is_shadowable, const gc_clock::time_point deletion_time) {
max_purgeable::can_purge_result purge_res { };
std::optional<bool> expired;
if (_tombstone_gc_state.cheap_to_get_gc_before(_schema)) {
// if retrieval of grace period is cheap, can_gc() will only be
// called for tombstones that are older than grace period, in
// order to avoid unnecessary bloom filter checks when calculating
// max purgeable timestamp.
purge_res.can_purge = satisfy_grace_period(deletion_time);
expired = purge_res.can_purge = satisfy_grace_period(deletion_time);
if (purge_res.can_purge) {
purge_res = can_gc(t, is_shadowable);
}
} else {
purge_res = can_gc(t, is_shadowable);
if (purge_res.can_purge) {
purge_res.can_purge = satisfy_grace_period(deletion_time);
expired = purge_res.can_purge = satisfy_grace_period(deletion_time);
}
}
if constexpr (sstable_compaction()) {
if (!_tombstone_stats || !t) {
// Tombstone GC stats only account for expired tombstones (those eligible for GC).
if (!_tombstone_stats || !t || !expired.value_or(satisfy_grace_period(deletion_time))) {
return purge_res.can_purge;
}

View File

@@ -1,3 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a4710f1f0b0bb329721c21d133618e811e820f2e70553b0aca28fb278bff89c9
size 6492280
oid sha256:9034610470ff645fab03da5ad6c690e5b41f3307ea4b529c7e63b0786a1289ed
size 6539600

View File

@@ -1,3 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2433f7a1fc5cda0dd990ab59587eb6046dca0fe1ae48d599953d1936fe014ed9
size 6492176
oid sha256:0c4bbf51dbe01d684ea5b9a9157781988ed499604d2fde90143bad0b9a5594f0
size 6543944

View File

@@ -459,7 +459,7 @@ future<> server_impl::wait_for_state_change(seastar::abort_source* as) {
}
try {
return as ? _state_change_promise->get_shared_future(*as) : _state_change_promise->get_shared_future();
co_await (as ? _state_change_promise->get_shared_future(*as) : _state_change_promise->get_shared_future());
} catch (abort_requested_exception&) {
throw request_aborted(fmt::format(
"Aborted while waiting for state change on server: {}, latest applied entry: {}, current state: {}", _id, _applied_idx, _fsm->current_state()));

View File

@@ -252,6 +252,10 @@ public:
//
// The caller may pass a pointer to an abort_source to make the function abortable.
// If it passes nullptr, the function is unabortable.
//
// Exceptions:
// raft::request_aborted
// Thrown if abort is requested before the operation finishes.
virtual future<> wait_for_state_change(seastar::abort_source* as) = 0;
// The returned future is resolved when a leader is elected for the current term.
@@ -262,6 +266,10 @@ public:
//
// The caller may pass a pointer to an abort_source to make the function abortable.
// If it passes nullptr, the function is unabortable.
//
// Exceptions:
// raft::request_aborted
// Thrown if abort is requested before the operation finishes.
virtual future<> wait_for_leader(seastar::abort_source* as) = 0;
// Manually trigger snapshot creation and log truncation.

View File

@@ -103,8 +103,8 @@ thread_local dirty_memory_manager default_dirty_memory_manager;
inline
flush_controller
make_flush_controller(const db::config& cfg, backlog_controller::scheduling_group& sg, std::function<double()> fn) {
return flush_controller(sg, cfg.memtable_flush_static_shares(), 50ms, cfg.unspooled_dirty_soft_limit(), std::move(fn));
make_flush_controller(const db::config& cfg, const database_config& dbcfg, std::function<double()> fn) {
return flush_controller(dbcfg.memtable_scheduling_group, cfg.memtable_flush_static_shares(), 50ms, cfg.unspooled_dirty_soft_limit(), std::move(fn));
}
keyspace::keyspace(config cfg, locator::effective_replication_map_factory& erm_factory)
@@ -394,8 +394,7 @@ database::database(const db::config& cfg, database_config dbcfg, service::migrat
, _system_dirty_memory_manager(*this, 10 << 20, cfg.unspooled_dirty_soft_limit(), default_scheduling_group())
, _dirty_memory_manager(*this, dbcfg.available_memory * 0.50, cfg.unspooled_dirty_soft_limit(), dbcfg.statement_scheduling_group)
, _dbcfg(dbcfg)
, _flush_sg(dbcfg.memtable_scheduling_group)
, _memtable_controller(make_flush_controller(_cfg, _flush_sg, [this, limit = float(_dirty_memory_manager.throttle_threshold())] {
, _memtable_controller(make_flush_controller(_cfg, _dbcfg, [this, limit = float(_dirty_memory_manager.throttle_threshold())] {
auto backlog = (_dirty_memory_manager.unspooled_dirty_memory()) / limit;
if (_dirty_memory_manager.has_extraneous_flushes_requested()) {
backlog = std::max(backlog, _memtable_controller.backlog_of_shares(200));
@@ -1504,12 +1503,10 @@ keyspace::make_column_family_config(const schema& s, const database& db) const {
cfg.compaction_concurrency_semaphore = _config.compaction_concurrency_semaphore;
cfg.cf_stats = _config.cf_stats;
cfg.enable_incremental_backups = _config.enable_incremental_backups;
cfg.compaction_scheduling_group = _config.compaction_scheduling_group;
cfg.memory_compaction_scheduling_group = _config.memory_compaction_scheduling_group;
cfg.memtable_scheduling_group = _config.memtable_scheduling_group;
cfg.memtable_to_cache_scheduling_group = _config.memtable_to_cache_scheduling_group;
cfg.streaming_scheduling_group = _config.streaming_scheduling_group;
cfg.statement_scheduling_group = _config.statement_scheduling_group;
cfg.enable_metrics_reporting = db_config.enable_keyspace_column_family_metrics();
cfg.enable_node_aggregated_table_metrics = db_config.enable_node_aggregated_table_metrics();
cfg.tombstone_warn_threshold = db_config.tombstone_warn_threshold();
@@ -2453,12 +2450,10 @@ database::make_keyspace_config(const keyspace_metadata& ksm, system_keyspace is_
cfg.cf_stats = &_cf_stats;
cfg.enable_incremental_backups = _enable_incremental_backups;
cfg.compaction_scheduling_group = _dbcfg.compaction_scheduling_group;
cfg.memory_compaction_scheduling_group = _dbcfg.memory_compaction_scheduling_group;
cfg.memtable_scheduling_group = _dbcfg.memtable_scheduling_group;
cfg.memtable_to_cache_scheduling_group = _dbcfg.memtable_to_cache_scheduling_group;
cfg.streaming_scheduling_group = _dbcfg.streaming_scheduling_group;
cfg.statement_scheduling_group = _dbcfg.statement_scheduling_group;
cfg.enable_metrics_reporting = _cfg.enable_keyspace_column_family_metrics();
cfg.view_update_memory_semaphore_limit = max_memory_pending_view_updates();
@@ -3782,7 +3777,7 @@ future<utils::chunked_vector<temporary_buffer<char>>> database::sample_data_file
&result,
chunk_size
] (database& local_db, state_by_shard& local_state) -> future<> {
auto ticket = get_units(local_db._sample_data_files_local_concurrency_limiter, 1);
auto ticket = co_await get_units(local_db._sample_data_files_local_concurrency_limiter, 1);
// In `chosen_chunks`, the sorted array of chosen chunk offsets (in the "global chunk list"),
// find the range of offsets which belongs to us.

Some files were not shown because too many files have changed in this diff.