Compare commits


372 Commits

Author SHA1 Message Date
copilot-swe-agent[bot]
173fb1e6d3 Clarify audit all-keyspaces exclusivity
Co-authored-by: ptrsmrn <124208650+ptrsmrn@users.noreply.github.com>
2026-03-17 15:32:11 +00:00
copilot-swe-agent[bot]
e252bb1550 Revise audit all-keyspaces design
Co-authored-by: ptrsmrn <124208650+ptrsmrn@users.noreply.github.com>
2026-03-17 15:04:40 +00:00
copilot-swe-agent[bot]
5713b5efd1 Finalize audit design doc clarifications
Co-authored-by: ptrsmrn <124208650+ptrsmrn@users.noreply.github.com>
2026-03-17 14:49:48 +00:00
copilot-swe-agent[bot]
979ec5ada8 Polish audit design doc review feedback
Co-authored-by: ptrsmrn <124208650+ptrsmrn@users.noreply.github.com>
2026-03-17 14:47:26 +00:00
copilot-swe-agent[bot]
67503a350b Refine audit prototype design details
Co-authored-by: ptrsmrn <124208650+ptrsmrn@users.noreply.github.com>
2026-03-17 14:45:38 +00:00
copilot-swe-agent[bot]
a90490c3cf Clarify audit design doc semantics
Co-authored-by: ptrsmrn <124208650+ptrsmrn@users.noreply.github.com>
2026-03-17 14:44:28 +00:00
copilot-swe-agent[bot]
6f957ea4e0 Add audit prototype design doc
Co-authored-by: ptrsmrn <124208650+ptrsmrn@users.noreply.github.com>
2026-03-17 14:43:44 +00:00
copilot-swe-agent[bot]
f6605f7b66 Initial plan
2026-03-17 14:37:56 +00:00
Asias He
6cb263bab0 repair: Prevent CPU stall during cross-shard row copy and destruction
When handling `repair_stream_cmd::end_of_current_rows`, passing the
foreign list directly to `put_row_diff_handler` triggered a massive
synchronous deep copy on the destination shard. Additionally, destroying
the list triggered a synchronous deallocation on the source shard. This
blocked the reactor and triggered the CPU stall detector.

This commit fixes the issue by introducing `clone_gently()` to copy the
list elements one by one, and leveraging the existing
`utils::clear_gently()` to destroy them. Both utilize
`seastar::coroutine::maybe_yield()` to allow the reactor to breathe
during large cross-shard transfers and cleanups.
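The two helpers can be pictured with a minimal asyncio sketch (the actual code is Seastar C++; the signatures and the `yield_every` batch size here are illustrative, with `asyncio.sleep(0)` standing in for `seastar::coroutine::maybe_yield()`):

```python
import asyncio

async def clone_gently(items, yield_every=1024):
    """Copy a large list element by element, yielding to the event loop
    periodically so the copy cannot monopolize a reactor iteration."""
    out = []
    for i, item in enumerate(items):
        out.append(item)
        if (i + 1) % yield_every == 0:
            await asyncio.sleep(0)  # let other tasks run
    return out

async def clear_gently(items, yield_every=1024):
    """Destroy elements incrementally instead of one big synchronous free."""
    while items:
        items.pop()
        if len(items) % yield_every == 0:
            await asyncio.sleep(0)
```

Without the periodic yields, copying or freeing millions of elements runs as one uninterruptible task, which is exactly what trips a stall detector.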

Fixes SCYLLADB-403

Closes scylladb/scylladb#28979
2026-03-17 11:05:15 +02:00
Botond Dénes
035aa90d4b Merge 'Alternator: add per-table batch latency metrics and test coverage' from Amnon Heiman
This series fixes a metrics visibility gap in Alternator and adds regression coverage.

Until now, BatchGetItem and BatchWriteItem updated global latency histograms but did not consistently update per-table latency histograms. As a result, table-level latency dashboards could miss batch traffic.

The series updates the batch read/write paths to compute the request duration once and record it in both global and per-table latency metrics.

It adds the missing tests, including a metric-agnostic helper and a dedicated per-table latency test that verifies latency counters increase for item and batch operations.

This change is metrics-only (no API/behavior change for requests) and improves observability consistency between global and per-table views.
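The fix pattern, sketched in Python with hypothetical histogram stand-ins (the real metrics are Scylla latency histograms maintained in C++):

```python
import time
from collections import defaultdict

global_latencies = []                    # stand-in for the global histogram
per_table_latencies = defaultdict(list)  # stand-in for per-table histograms

def record_batch_latency(table, start):
    """Compute the request duration once, then record it in both the
    global and the per-table latency metric, so table-level dashboards
    no longer miss batch traffic."""
    duration = time.monotonic() - start
    global_latencies.append(duration)
    per_table_latencies[table].append(duration)
    return duration
```

Measuring once and recording twice keeps the two views consistent by construction.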

Fixes #28721

**We assume the alternator per-table metrics exist, but the batch ones are not updated**

Closes scylladb/scylladb#28732

* github.com:scylladb/scylladb:
  test(alternator): add per-table latency coverage for item and batch ops
  alternator: track per-table latency for batch get/write operations
2026-03-16 17:18:00 +02:00
Michał Hudobski
40d180a7ef docs: update vector search filtering to reflect primary key support only
Remove outdated references to filtering on columns provided in the
index definition, and remove the note about equal relations (= and IN)
being the only supported operations. Vector search filtering currently
supports WHERE clauses on primary key columns only.

Closes scylladb/scylladb#28949
2026-03-16 17:16:16 +02:00
Botond Dénes
9de8d6798e Merge 'reader_concurrency_semaphore: skip preemptive abort for permits waiting for memory' from Łukasz Paszkowski
Permits in the `waiting_for_memory` state represent already-executing reads that are blocked on memory allocation. Preemptively aborting them is wasteful -- these reads have already consumed resources and made progress, so they should be allowed to complete.

Restrict the preemptive abort check in maybe_admit_waiters() to only apply to permits in the `waiting_for_admission` state, and tighten the state validation in `on_preemptive_aborted()` accordingly.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1016

Backport not needed. The commit introducing replica load shedding is not part of 2026.1

Closes scylladb/scylladb#29025

* github.com:scylladb/scylladb:
  reader_concurrency_semaphore: skip preemptive abort for permits waiting for memory
  reader_concurrency_semaphore_test: detect memory leak on preemptive abort of waiting_for_memory permit
2026-03-16 17:14:25 +02:00
Calle Wilund
a5df2e79a7 storage_service: Wait for snapshot/backup before decommission
Fixes: SCYLLADB-244

Disables snapshot control such that any active ops finish/fail
before proceeding with decommission.
Note: snapshot control provided as argument, not member ref
due to storage_service being used from both main and cql_test_env.
(The latter has no snapshot_ctl to provide).

Could do the snapshot lockout on API level, but want to do
pre-checks before this.

Note: this just disables backup/snapshot fully. Could re-enable
after decommission, but this seems somewhat pointless.

v2:
* Add log message to snapshot shutdown
* Make test use log waiting instead of timeouts

Closes scylladb/scylladb#28980
2026-03-16 17:12:57 +02:00
bitpathfinder
85d5073234 test: Fix non-awaited coroutine in test_gossiper_empty_self_id_on_shadow_round
The line with the error was not actually needed and has therefore been removed.

Fixes: SCYLLADB-906

Closes scylladb/scylladb#28884
2026-03-16 17:07:36 +02:00
Botond Dénes
3e4e0c57b8 Merge 'Relax rf-rack-valid-keyspace option in backup/restore tests' from Pavel Emelyanov
Some tests, when creating a cluster, configure nodes with the rf-rack-valid option, because sometimes they want to have it OFF. For that, the option is explicitly carried around, but the cluster-creating helper can infer this option itself from the provided topology and replication factor.

Removing this option simplifies the code and (a nicer outcome) the test "signature" that's used e.g. on the command line to run a specific test.

Improving tests, not backporting

Closes scylladb/scylladb#28860

* github.com:scylladb/scylladb:
  test: Relax topology_rf_validity parameter for some tests
  test: Auto detect rf-rack-valid option in create_cluster()
2026-03-16 17:06:46 +02:00
Patryk Jędrzejczak
526e5986fe test: test_raft_no_quorum: decrease group0_raft_op_timeout_in_ms after quorum loss
`test_raft_no_quorum.py::test_cannot_add_new_node` is currently flaky in dev
mode. The bootstrap of the first node can fail due to `add_entry()` timing
out (with the 1s timeout set by the test case).

Other test cases in this test file could fail in the same way as well, so we
need a general fix. We don't want to increase the timeout in dev mode, as it
would slow down the test. The solution is to keep the timeout unchanged, but
set it only after quorum is lost. This prevents unexpected timeouts of group0
operations with almost no impact on the test running time.

A note about the new `update_group0_raft_op_timeout` function: waiting for
the log seems to be necessary only for
`test_quorum_lost_during_node_join_response_handler`, but let's do it
for all test cases just in case (including `test_can_restart` that shouldn't
be flaky currently).

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-913

Closes scylladb/scylladb#28998
2026-03-16 16:58:15 +02:00
Dani Tweig
bc0952781a Update Jira sync calling workflow to consolidated view
Replaced multiple per-action workflow jobs with a single consolidated
call to main_pr_events_jira_sync.yml. Added 'edited' event trigger.
This makes CI actions in PRs more readable and workflow execution faster.

Fixes: PM-253

Closes scylladb/scylladb#29042
2026-03-16 08:25:32 +02:00
Artsiom Mishuta
755d528135 test.py: fix warnings
Changes in this commit:
1) rename the class 'TestContext' to 'Context' so pytest will not consider it a test class

2) extend the pytest filterwarnings list to ignore warnings from external libs

3) use datetime.datetime.now(datetime.UTC) instead of datetime.datetime.utcnow()

4) use ResultSet.one() instead of ResultSet[0]

Fixes SCYLLADB-904
Fixes SCYLLADB-908
Related SCYLLADB-902

Closes scylladb/scylladb#28956
2026-03-15 12:00:10 +02:00
Piotr Dulikowski
d8b283e1fb Merge 'Add CQL forwarding for strongly consistent tables' from Wojciech Mitros
In this series we add support for forwarding strongly consistent CQL requests to suitable replicas, so that clients can issue reads/writes to any node and have the request executed on an appropriate tablet replica (and, for writes, on the Raft leader). We return the same CQL response the user would get when sending the request directly to the correct replica, and we perform the same logging/stats updates on the request coordinator as if the coordinator were the appropriate replica.

The core mechanism of forwarding a strongly consistent request is sending an RPC containing the user's cql request frame to the appropriate replica and getting back a ready, serialized `cql_transport::response`. We do this in the CQL server - it is the component best prepared for handling these types, and forwarding a request containing a CQL frame allows us to reuse near-top-level methods for CQL request handling in the new RPC handler (such as the general `process`).

For sending the RPC, the CQL server needs to obtain the information about which node it should forward the request to. This requires knowledge about the tablet raft group members and leader. We obtain this information during the execution of a `cql3/strong_consistency` statement, and we return it to the CQL server using the generalized `bounce_to_shard` `response_message`, where we now store the information about either a shard or a specific replica to which we should forward. Similarly to `bounce_to_shard`, we need to handle this `result_message` in a loop - a replica may move during statement execution, or the Raft leader can change. We also use it for forwarding strongly consistent writes when we're not a member of the affected tablet raft group - in that case we need to forward the statement twice: once to any replica of the affected tablet; that replica can then find the leader and return this information to the coordinator, which allows the second request to be directed to the leader.

This feature also allows passing through exception messages which happened on the target replica while executing the statement. For that, many methods of the `cql_transport::cql_server::connection` for creating error responses needed to be moved to `cql_transport::cql_server`. And for final exception handling on the coordinator, we added additional error info to the RPC response, so that the handling can be performed without having the `result_message::exception` or `exception_ptr` itself.

Fixes [SCYLLADB-71](https://scylladb.atlassian.net/browse/SCYLLADB-71)

[SCYLLADB-71]: https://scylladb.atlassian.net/browse/SCYLLADB-71?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#27517

* github.com:scylladb/scylladb:
  test: add tests for CQL forwarding
  transport: enable CQL forwarding for strong consistency statements
  transport: add remote statement preparation for CQL forwarding
  transport: handle redirect responses in CQL forwarding
  transport: add exception handling for forwarded CQL requests
  transport: add basic CQL request forwarding
  idl: add a representation of client_state for forwarding
  cql_server: handle query, execute, batch in one case
  transport: inline process_on_shard in cql_server::process
  transport: extract process() to cql_server
  transport: add messaging_service to cql_server
  transport: add response reconstruction helpers for forwarding
  transport: generalize the bounce result message for bouncing to other nodes
  strong consistency: redirect requests to live replicas from the same rack
  transport: pass foreign_ptr into sleep_until_timeout_passes and move it to cql_server
  transport: extract the error handling from process_request_one
  transport: move error response helpers from connection to cql_server
2026-03-13 15:03:10 +01:00
Tomasz Grabiec
518470e89e Merge 'load_stats: improve tablet filtering for load stats' from Ferenc Szili
When computing table sizes via load_stats to determine if a split/merge is needed, we are filtering tablets which are being migrated, in order to avoid counting them twice (both on leaving and pending replica) in the total table size. The tablets are filtered so that they are counted on the leaving replica until the streaming stage, and on the pending replica after the streaming stage.

Currently, the procedure for collecting tablet sizes for load balancing also uses this same filter. This should be changed, because the load balancer needs to have as much information about tablet sizes as possible, and could ignore a node due to missing tablet sizes for tablets in the `write_both_read_new` and `use_new` stages.

For tablet size collection, we should include all the tablets which are currently taking up disk space. This means:
- on leaving replica, include all tablets until the `cleanup` stage
- on pending replica, include all tablets starting with the `write_both_read_new` and later stages
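The inclusion rule above can be sketched with an illustrative, simplified stage order (the real stage enum lives in ScyllaDB's tablet code and has more stages):

```python
# Simplified, illustrative migration stage order; the real enum has more stages.
STAGES = ["write_both_read_old", "streaming", "write_both_read_new",
          "use_new", "cleanup"]

def counts_toward_disk_usage(stage, on_leaving_replica):
    """A tablet counts toward disk usage on the leaving replica for all
    stages before cleanup, and on the pending replica from
    write_both_read_new onward."""
    idx = STAGES.index(stage)
    if on_leaving_replica:
        return idx < STAGES.index("cleanup")
    return idx >= STAGES.index("write_both_read_new")
```

Note that during `write_both_read_new` and `use_new` a tablet occupies disk space on both replicas, so both sides report it.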

While this is an improvement, it causes problems with some of the tests, and therefore needs to be backported to 2026.1

Fixes: SCYLLADB-829

Closes scylladb/scylladb#28587

* github.com:scylladb/scylladb:
  load_stats: add filtering for tablet sizes
  load_stats: move tablet filtering for table size computation
  load_stats: bring the comment and code in sync
2026-03-13 13:08:11 +01:00
Pavel Emelyanov
d544d8602d test: Relax topology_rf_validity parameter for some tests
Tests that call create_cluster() helper no longer need to carry the
rf-validity parameter. This simplifies the code and test signature.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-13 14:30:32 +03:00
Pavel Emelyanov
313985fed7 test: Auto detect rf-rack-valid option in create_cluster()
The helper accepts it as a boolean argument, but it can easily estimate
one from the provided topology.
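A hedged sketch of such an estimate; the rule below (RF per DC is zero or matches the rack count) is only an assumption for illustration, not the helper's actual logic:

```python
def infer_rf_rack_valid(racks_per_dc, rf_per_dc):
    """Hypothetical inference: the rf-rack-valid option can stay on when,
    in every DC, the replication factor is 0 or equals the number of
    racks, so each rack holds exactly one replica. Assumed rule, not
    the real create_cluster() logic."""
    return all(rf == 0 or rf == len(racks_per_dc[dc])
               for dc, rf in rf_per_dc.items())
```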

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-13 14:30:32 +03:00
Gleb Natapov
fae5282c82 service level: fix crash during migration to driver server level
Before b59b3d4, the migration code checked that the service level controller
was on v2 before migrating, and that check also implicitly ensured that the
_sl_data_accessor field was already initialized. Now that the check is gone,
the migration can start before the service level controller is fully
initialized. Re-add the check, but in a different place.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1049

Closes scylladb/scylladb#29021
2026-03-13 11:24:26 +01:00
Łukasz Paszkowski
4c4d043a3b reader_concurrency_semaphore: skip preemptive abort for permits waiting for memory
Permits in the `waiting_for_memory` state represent already-executing
reads that are blocked on memory allocation. Preemptively aborting
them is wasteful -- these reads have already consumed resources and
made progress, so they should be allowed to complete.

Restrict the preemptive abort check in maybe_admit_waiters() to only
apply to permits in the `waiting_for_admission` state, and tighten
the state validation in `on_preemptive_aborted()` accordingly.
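The resulting check, reduced to a Python sketch (state names follow the commit message; the real code is the C++ reader_concurrency_semaphore):

```python
from enum import Enum, auto

class PermitState(Enum):
    waiting_for_admission = auto()  # read has not started executing yet
    waiting_for_memory = auto()     # read is executing, blocked on memory
    active = auto()

def may_preemptively_abort(state):
    """Only a permit still waiting for admission is cheap to abort;
    one waiting for memory has already consumed resources and made
    progress, so it is allowed to complete."""
    return state is PermitState.waiting_for_admission
```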

Adjust the following tests:
+ test_reader_concurrency_semaphore_abort_preemptively_aborted_permit
  no longer relies on requesting memory
+ test_reader_concurrency_semaphore_preemptive_abort_requested_memory_leak
  adjusted to the fix

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1016
2026-03-13 09:50:05 +01:00
Dani Tweig
aa46a0f4e0 Add VECTOR to the list of synced milestones in scylladb.git
- Added VECTOR to the comma-separated list of Jira project keys in `call_sync_milestone_to_jira.yml`.
- The `jira_project_keys` value changed from `SCYLLADB,CUSTOMER,SMI,RELENG` to `SCYLLADB,CUSTOMER,SMI,RELENG,VECTOR`.

- The VECTOR project needs to sync with scylladb.git milestones, so that when a GitHub milestone is created or closed in scylladb/scylladb, the corresponding Jira release is also created or released in the VECTOR project.
- Previously only SCYLLADB, CUSTOMER, SMI, and RELENG projects were synced.

Fixes: PM-220

Closes scylladb/scylladb#29014
2026-03-13 09:58:41 +02:00
Botond Dénes
fc8cebd671 Merge 'Verify components digests during component load and scrub in validate mode' from Taras Veretilnyk
This PR adds integrity verification for SSTable component files during loading. When component digests are present in Scylla metadata, the loader now validates each component's CRC32 digest against the stored expected value, catching silent corruption of component files. Index, Rows and Partitions component digests are also validated during scrub in validate mode.

Added corruption tests that write an SSTable, flip a bit in a specific component file, then verify that reloading the SSTable detects the corruption and throws the expected exception.
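The verification idea, as a standalone Python sketch (helper names are illustrative; the real loader compares CRC32 digests stored in Scylla metadata against each component file):

```python
import zlib

def compute_crc32(path, chunk_size=64 * 1024):
    """Stream a component file through CRC32 in fixed-size chunks."""
    crc = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            crc = zlib.crc32(chunk, crc)
    return crc

def verify_component(path, expected_crc):
    """Raise if the file's digest does not match the stored expectation,
    mirroring how a single flipped bit must be detected on load."""
    actual = compute_crc32(path)
    if actual != expected_crc:
        raise ValueError(f"digest mismatch for {path}: "
                         f"expected {expected_crc:#010x}, got {actual:#010x}")
```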

Depends on https://github.com/scylladb/scylladb/pull/28338

Backport is not required, this is a new feature

Fixes https://github.com/scylladb/scylladb/issues/20103

Closes scylladb/scylladb#28761

* github.com:scylladb/scylladb:
  test/cqlpy: test --ignore-component-digest-mismatch flag in scylla sstable upgrade
  docs: document --ignore-component-digest-mismatch flag for scylla sstable upgrade
  sstables: propagate ignore_component_digest_mismatch config to all load sites
  sstables: add option to ignore component digest mismatches
  sstable_compaction_test: Add scrub validate test for corrupted index
  sstables: add tests for component digest validation on corrupted SSTables
  sstables: validate index components digests during SSTable scrub in validate mode
  sstables: verify component digests on SSTable load
  sstables: add digest_file_random_access_reader for CRC32 digest computation
2026-03-13 09:55:55 +02:00
Avi Kivity
ae8a418744 Merge 'Await async calls in test tablets migration' from Benny Halevy
Fix several test cases that did not await async tasks:
- test_restart_leaving_replica_during_cleanup
- test_restart_in_cleanup_stage_after_cleanup
- test_tablet_back_and_forth_migration
- test_staging_backlog_is_preserved_with_file_based_streaming

Fixes SCYLLADB-910

* Minor fixes, no backport needed

Closes scylladb/scylladb#28908

* github.com:scylladb/scylladb:
  test_tablets_migration: test_staging_backlog_is_preserved_with_file_based_streaming: convert for loop to asyncio.gather
  test_tablets_migration: test_tablet_back_and_forth_migration: await move_tablet
  test_tablets_migration: test_restart_in_cleanup_stage_after_cleanup: await move_task
  test_tablets_migration: test_restart_leaving_replica_during_cleanup: await move_task
  test_tablets_migration: drop unused imports from cassandra.query
2026-03-13 00:20:29 +02:00
Avi Kivity
b228eb26e6 Merge 'dbuild: Use slirp4netns network in dbuild nested containers' from Calle Wilund
Fixes #25084

Add slirp4netns and use for nested containers. This will allow nested container port aliasing, helping CI stability.

Note: this contains an updated Dockerfile for the dbuild image, but due to the chicken-and-egg problem, for now we force-install slirp4netns before anything else in the dbuild script.

Updates the mock server handling to use ephemeral ports and query them from the container, ensuring we don't get port collisions (boost as well as pytest).

Includes a timeout increase, and a tweak to our scylla_cluster handling, ensuring we don't deadlock when the pipe size is less than required for our sys notify messages.

Closes scylladb/scylladb#28727

* github.com:scylladb/scylladb:
  gcs_fixture: Change to use docker helper
  aws_kms_fixture: Modify to use docker helper
  test/lib/proc_util: Add docker helper
  pytest: use ephemeral port publish for docker mock servers
  dbuild: Use container network in dbuild nested containers
  scylla_cluster: Read notify sock in background to prevent deadlock
2026-03-12 23:49:25 +02:00
Nadav Har'El
ad832c263e test/cluster: mark test_alternator_concurrent_rmw_same_partition_different_server not strictly xfail
A few days ago, in commit 7b30a39, we added the xfail_strict option to
pytest.ini. This option causes every XPASS, i.e., an xfail test that
actually passes, to be considered an error and fail the test.

But some tests demonstrate a timing-related bug and do not reproduce the
bug every single time. An example we noticed in one CI run is:

test/cluster/test_alternator.py::test_alternator_concurrent_rmw_same_partition_different_server

This test reproduces a timing-related bug (if you do an LWT write to
one partition on to two different coordinators "at the same time", you
can get a failure), but only most of the time, not 100% of the time.

The solution is to add "strict=False" for the xfail marker on this specific
test. This undoes the xfail_strict for this specific test, accepting that
this specific test can either pass or fail. Note that this does NOT make
this test worthless - we still see this test failing most of the time, and
when a developer finally fixes this issue, the test will begin to pass all
the time.
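The resulting marker looks like this (the test body is elided here; only the marker matters):

```python
import pytest

# strict=False opts this one test out of pytest.ini's xfail_strict:
# an unexpected pass (XPASS) is reported but no longer fails the run.
@pytest.mark.xfail(strict=False, reason="timing-related bug; reproduces most of the time")
def test_alternator_concurrent_rmw_same_partition_different_server():
    ...
```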

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-941
Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#29016
2026-03-12 23:46:23 +02:00
Avi Kivity
03186ce60d Merge 'Cleanup after auth v1 and default superuser code removal' from Marcin Maliszkiewicz
This is a short cleanup after the recent removal of default cassandra superuser creation and of the auth-v1 code.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1036

Backport: no, just code cleanup

Closes scylladb/scylladb#29004

* github.com:scylladb/scylladb:
  auth: remove DEFAULT_SUPERUSER_NAME constant and dead DEFAULT_USER_PASSWORD
  auth: use configurable default_superuser in describe_roles
  auth: move default_superuser to common, remove _superuser member
  auth: use LOCAL_ONE for all auth queries
  auth: remove get_auth_ks_name indirection
2026-03-12 23:44:32 +02:00
Avi Kivity
e2eeef3e01 Merge 'service level: remove remnants of version 1 service level' from Gleb Natapov
can_use_effective_service_level_cache() always returns true now, so the function can be dropped entirely and all the code that assumes it may return false can be dropped as well. Also drop async versions of find_effective_service_level and get_user_scheduling_group since they are unused.

No need to backport: code removal.

Closes scylladb/scylladb#29002

* github.com:scylladb/scylladb:
  service level: make maybe_update_per_service_level_params synchronous
  service level: remove unused get_user_scheduling_group function
  service level: drop async find_effective_service_level
  service level: remove remnants of version 1 service level
2026-03-12 23:39:41 +02:00
Botond Dénes
eed3a6d407 sstables/mx/writer: move post-cell write yield to collection write loop
Introduced by 54bddeb3b5, the yield was
added to write_cell(), to also help the general case where there is no
collection. Arguably this was unnecessary and this patch moves the yield
to write_collection(), to the cell write loop instead, so regular cells
don't have to poll the preempt flag.

Closes scylladb/scylladb#29013
2026-03-12 21:26:35 +02:00
Avi Kivity
e8a6706d6e Merge 'shorten some sleeps to speed up bootstrap in tests' from Patryk Jędrzejczak
This PR shortens two sleeps from 1s to 100ms to speed up bootstrap in tests.
The changed sleeps are:
- the pause duration in group0 discovery,
- the retry period in `wait_for_cql`.

Refs: https://scylladb.atlassian.net/browse/SCYLLADB-918

No backport: performance improvements mostly relevant to tests.

Closes scylladb/scylladb#29020

* github.com:scylladb/scylladb:
  test: pylib: util: wait for CQL being ready with a shorter period
  group0: discovery: shorten the pause duration
2026-03-12 21:17:05 +02:00
Wojciech Mitros
32974770b0 test: add tests for CQL forwarding
Add basic cluster tests for CQL forwarding.
The test cases include:
- basic reads and writes
- prepared statements with binds
- forwarding from a non-replica
- exception passthrough during forwarding (using an injection)
- re-preparing a statement on the target node, even if the user
  query is also an EXECUTE request on a prepared statement
- verification of metric updates

The existing test_basic_write_read was modified so that a few extra
cases could be validated on the same cluster.
2026-03-12 19:43:35 +01:00
Wojciech Mitros
916a9995c1 transport: enable CQL forwarding for strong consistency statements
We enable CQL forwarding by starting to return the bounce_to_node
result message in redirect_statement() instead of throwing. The
forwarding code introduced in the preceding patches reacts to these
messages, allowing the requests to be forwarded.

With the update, some tests assuming that requests can't be forwarded
need to be adjusted, so we do that as well.
2026-03-12 19:43:35 +01:00
Wojciech Mitros
21a7b036a5 transport: add remote statement preparation for CQL forwarding
During forwarding of CQL EXECUTE requests, the target node may
not have the prepared statement in its cache. If we do have this
statement as a coordinator, instead of returning PREPARED NOT FOUND
to the client, we want to prepare the statement ourselves on the target
node.
For that, we add a new FORWARD_CQL_PREPARE RPC. We use the new RPC
after getting the prepared_not_found status during forwarding. When
we try to forward a request, we always have the query string (we
decide whether to forward based on this query), so we can always use
the new RPC when getting the prepared_not_found status.
After receiving the response, we try forwarding the EXECUTE request
again.
2026-03-12 19:43:35 +01:00
Wojciech Mitros
96a5e1c7ce transport: handle redirect responses in CQL forwarding
During CQL forwarding, when the target node can't handle the request,
it will find another node which can execute the request or which knows
where the request can be executed. We return this information in
responses to CQL forwarding, and in this patch, we add handling of
this kind of a response.
After getting a redirect response, we retry forwarding to the returned
host/shard until success or timeout. This can happen many times during
a single request, when we first forward to a replica and later to the
coordinator, or when a replica/coordinator migrated while we were
performing the forwarding.
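The retry loop can be sketched like this (Python, with a hypothetical `Redirect` type standing in for the real RPC redirect response):

```python
import time

class Redirect(Exception):
    """Hypothetical stand-in for the redirect response returned when the
    target node cannot execute the request itself."""
    def __init__(self, target):
        self.target = target

def forward_with_redirects(execute_on, first_target, timeout_s=2.0):
    """Retry forwarding to whatever host/shard the redirect points at,
    until the request succeeds or the timeout expires. A request may be
    redirected several times, e.g. to a replica first and then to the
    leader, or again when a replica migrates mid-request."""
    deadline = time.monotonic() + timeout_s
    target = first_target
    while time.monotonic() < deadline:
        try:
            return execute_on(target)
        except Redirect as r:
            target = r.target  # follow the redirect and retry
    raise TimeoutError("request could not be forwarded before the timeout")
```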
2026-03-12 19:43:31 +01:00
Wojciech Mitros
8816d3038c transport: add exception handling for forwarded CQL requests
When a forwarded request fails on the remote node, we can't use the
exception handling that happens in process_request_one because we
don't go through this code path. Instead, we use the previously
extracted cql_server::handle_exception handler, which performs
all accounting on the forwarded-to node, and which prepares the
response. For the read_failure_exception_with_timeout exception,
we need to perform the sleep on the source node, so we return the
timeout in the forwarding response and use it on the source node
to know how long to sleep without any extra calculations.

The handle_forward_execute() method is extracted from the inline handler
lambda to make the error catching wrapper cleaner.
2026-03-12 19:41:37 +01:00
Wojciech Mitros
23bff5dfef transport: add basic CQL request forwarding
Add the infrastructure for forwarding CQL requests to other nodes.
When a process() call results in a node bounce (as opposed to a shard
bounce), the coordinator serializes the request and sends it via the
FORWARD_CQL_EXECUTE RPC verb to the target node.

In this patch we omit several features that allow handling more
scenarios that can happen when trying to forward a CQL request,
but the RPC request and response are already prepared for them.
They will be handled in the following commits.
2026-03-12 19:41:35 +01:00
Avi Kivity
76b6784c1a Merge 'cql3: track CQL parsing memory cost and use it for admission control' from Marcin Maliszkiewicz
Use rolling_max_tracker to record gross bytes allocated during each
CQL parse.  The rolling maximum is then added to the memory estimate
for incoming QUERY and PREPARE requests so that the admission control
in the CQL transport layer accounts for parsing overhead.
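The idea of a rolling max, sketched in Python with a monotonic deque (the real `rolling_max_tracker` in utils may use a different window definition, e.g. time-based):

```python
from collections import deque

class RollingMaxTracker:
    """Track the maximum of the last `window` recorded samples in O(1)
    amortized time using a monotonically decreasing deque."""
    def __init__(self, window):
        self.window = window
        self._seq = 0
        self._dq = deque()  # (seq, value) pairs, values strictly decreasing

    def record(self, value):
        # Drop older samples that can never be the maximum again.
        while self._dq and self._dq[-1][1] <= value:
            self._dq.pop()
        self._dq.append((self._seq, value))
        # Expire samples that fell out of the window.
        while self._dq[0][0] <= self._seq - self.window:
            self._dq.popleft()
        self._seq += 1

    def max(self):
        return self._dq[0][1] if self._dq else 0
```

The tracked maximum is then added to the per-request memory estimate, so admission control reserves enough headroom for parsing.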

The measured memory footprint serves as an upper bound rather than an
exact number, but its purpose is to prevent OOMs under
unprepared-statement-heavy load.

In a benchmark, a 1G-memory node shows a decrease of non-LSA memory usage
from a peak of 320MB (our coordinator budget is 10% of 1G) to 96MB, while
tps drops from 1.2 kops to 0.8 kops. The drop in tps is expected as
memory admission kicks in to prevent OOM.

This is phase 1 of OOM prevention, potential next steps:
- add second admission in query_processor::get_statement trying to prevent potential thundering herd problem
- decrease cql_server memory pool size
- count reads in the memory pool
- add per service level memory pool and a shared one

Related https://scylladb.atlassian.net/browse/SCYLLADB-740
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-938

Backport: no, new feature, but we may reconsider if some customer needs it

Closes scylladb/scylladb#28919

* github.com:scylladb/scylladb:
  cql3: track CQL parsing memory cost and use it for admission control
  utils: add rolling max tracker
2026-03-12 19:59:52 +02:00
Wojciech Mitros
170b82ddca idl: add a representation of client_state for forwarding
In the following patches, when we start allowing CQL requests to be
forwarded to other nodes, we'll need to use the same client state
for executing the request on the destination node as we had on the
source. client_state contains many fields and we need to create
a new instance of it when we start handling the forwarded request,
so to prepare for the forwarding RPC, we add a serializable format
of the client_state as an IDL struct. The new class is missing some
fields that are not used while executing requests, and some whose
value is determined by the fact that the client state is used for
a forwarded request.
These include:
- driver name, driver version, client options - not used for executing
requests. Instead, we use these as data sources for the virtual
"clients" system table.
- auth_state - must be READY - we reached a bounce message, so we were
able to try executing the request locally
- _control_connection - used for altering a cql_server::connection, which
we don't have on the target node
- _default_timeout_config - used when updating service levels, also only
per-connection
- workload_type - used for deciding whether to allow shedding at the
start of processing the request, and for getting per-connection service
level params (for an API)
2026-03-12 17:48:58 +01:00
Wojciech Mitros
b4a7fefe20 cql_server: handle query, execute, batch in one case
Currently we perform the same steps when handling query, execute
and batch CQL requests. So instead of creating multiple functions
performing these steps, we can handle them all in one fallthrough
case in cql_server::connection::process_request_one.
2026-03-12 17:48:58 +01:00
Wojciech Mitros
dadb87047c transport: inline process_on_shard in cql_server::process
The process_on_shard method is relatively short: it's only used
in the process() method, and the Process concept that it uses
is as long as the function itself. This area will be made more
complex by the following patches for CQL forwarding, so we simplify
it by inlining process_on_shard into cql_server::process.
2026-03-12 17:48:58 +01:00
Wojciech Mitros
24cdc3a10d transport: extract process() to cql_server
Move process() and process_on_shard() from cql_server::connection to
cql_server. The process() method is no longer a template - instead, it
takes an opcode parameter and uses get_process_fn_for_opcode() to select
the appropriate internal processing function.

The process_query, process_execute, and process_batch wrappers on
connection now delegate to _server.process() with the appropriate opcode.

This refactoring is preparation for CQL request forwarding, where
process() will need to be called from a context other than connection
- the forwarding RPC handler.
2026-03-12 17:48:57 +01:00
Wojciech Mitros
0e3469e89c transport: add messaging_service to cql_server
The messaging service will be used by cql_server to register RPC
handlers for forwarding CQL requests between nodes.
We pass it through the controller to cql_server.
2026-03-12 17:48:57 +01:00
Wojciech Mitros
1376caf980 transport: add response reconstruction helpers for forwarding
Expose response::flags() and response::extract_body(), and add a new constructor.
These will be needed for creating a cql_transport::response from the response body
returned during CQL forwarding.
2026-03-12 17:48:57 +01:00
Wojciech Mitros
e44820ba1f transport: generalize the bounce result message for bouncing to other nodes
In the following patches, we'll start allowing forwarding requests to strongly
consistent tables so that they'll get executed on the suitable tablet Raft group
members. For that we'll reuse the approach that we already have for bouncing
requests to other shards - we'll try to execute a request locally, and the
result of that will be a bounce message with another replica as the target.
In this patch we generalize the former bounce_to_shard result message so that
it will be able to specify the target of the bounce as another shard or specific
replica.
We also rename it to result_message::bounce so that it stops implying that only
another shard may be its target.
Aside from the host_id and the shard, the new message also includes the timeout,
because in the service handling the forwarding we won't have access to it,
and it's needed for specifying how long we should wait for the forwarded
requests. It also includes information on whether this is a write request,
so that we can return the correct timeout response in case the deadline is exceeded.
We will return other hosts in the new bounce message when executing requests to
strongly consistent tables that we can't handle because we aren't a suitable
replica. We can't handle this message yet, so we don't return it anywhere and
we still assume that every bounce message is a bounce to the same host.
2026-03-12 17:48:57 +01:00
Wojciech Mitros
b4d66fda2e strong consistency: redirect requests to live replicas from the same rack
Forwarding CQL requests is not implemented yet, but we're already
prepared to return the target to forward to when trying to execute
strongly consistent requests. Currently, if we're not a replica
of the affected tablet, we redirect the request to the first replica
in the list.
This is not optimal, because this replica may be down or it may be
in another rack, making us perform cross-rack requests during forwarding.
Instead, we should forward the request to a replica from the same
rack and handle the case where that replica is down.

In this patch we change the replica selection for forwarding strongly
consistent requests, so that when the coordinator isn't a replica, it
redirects the request to the replica from the same rack.

If the replica from the same rack is down, or there is no replica in
our rack, we choose the next closest replica (preferring same-DC replicas
over other DCs). If no replica is alive, the query fails - the driver
should retry when some replica comes back up.
2026-03-12 17:48:54 +01:00
Andrzej Jackowski
3b9cd52a95 reader_concurrency_semaphore_test: detect memory leak on preemptive abort of waiting_for_memory permit
A permit in `waiting_for_memory` state can be preemptively aborted by
maybe_admit_waiters(). This is wrong: such permits have already been
admitted and are actively processing a read — they are merely blocked
waiting for memory under serialize-limit pressure.

When `on_preemptive_aborted()` fires on a `waiting_for_memory` permit,
it does not clear `_requested_memory`. A subsequent `request_memory()`
call accumulates on top of the stale value, causing `on_granted_memory()`
to consume more than resource_units tracks.

This commit adds a test that confirms that scenario by counting
internal_errors.
2026-03-12 17:09:34 +01:00
Alex
7fd39ba586 test/cluster: strengthen raft voters multi-DC test and tune debug runtime
The test_raft_voters_multidc_kill_dc scenario had become weaker after group0 voter count was made always odd.
  In particular, the old num_nodes == 1 case (dc1=2, dc2=1, dc3=1) could pass even without the intended balancing logic, because with 3 voters total we naturally get one voter per DC.

  This change restores coverage of the original intent:

  - Replace num_nodes parametrization with explicit DC triples.
  - Use (3, 1, 1) to force a meaningful asymmetric topology where voter placement logic is required.
  - Keep a larger topology case (6, 3, 3) for broader coverage.
  - Mark (6, 3, 3) as skip_mode(debug) with reason:
    larger topology case is too slow in debug on minipcs.

  Also updated comments/docstring to match the new setup.

Fixes: SCYLLADB-794

backport: None, it is done to deflake minipcs that will start working only on master

Closes scylladb/scylladb#29000
2026-03-12 17:07:45 +01:00
Wojciech Mitros
309abc44d9 transport: pass foreign_ptr into sleep_until_timeout_passes and move it to cql_server
Change sleep_until_timeout_passes() to accept a foreign_ptr<std::unique_ptr<response>>.
We can easily create the foreign_ptr for the responses created in the CQL server,
but we'll need this when we get responses when forwarding CQL statements - the responses
may come from other shards.
We also move it from cql_server::connection to cql_server, because for forwarded CQL
requests, we'll need to handle it at the cql_server level.
The method also loses its const qualifier - the abort_source that we pass into
sleep_abortable needs to be non-const. Apparently, we could still use it in a const
method of cql_server::connection because we passed it as _server._abort_source which
caused the const qualifier to be lost.
2026-03-12 16:03:14 +01:00
Marcin Maliszkiewicz
975cd60e05 ldap: fix use-after-move crash in ldap_reuser::reap()
After stop() moved _reaper, in-flight with_connection() callbacks could
still call reap(), which accessed the moved-from future causing a
SIGSEGV in future_base::detach_promise(). Add a seastar::gate so
stop() waits for all in-flight operations before moving _reaper.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1043

Closes scylladb/scylladb#29015
2026-03-12 16:48:45 +02:00
Patryk Jędrzejczak
c50cf32793 test: pylib: util: wait for CQL being ready with a shorter period
`wait_for_cql` is used in hundreds, if not thousands, of places in tests.
We shouldn't waste up to 1s for every call.

Also, the 1s period is clearly too long compared to the bootstrap time,
which is usually 0-3s in dev mode.

The following test speeds up from 50s to 42s with the change:
```
for _ in range(10):
    servers = await manager.servers_add(3)
    await manager.get_ready_cql(servers)
```
2026-03-12 15:40:19 +01:00
Patryk Jędrzejczak
f85628a9a0 group0: discovery: shorten the pause duration
Nodes currently pause group0 discovery for 1s. This case is always hit while
adding multiple nodes in parallel to an empty cluster by all nodes except the
one that becomes the group0 leader.

This is fine in production, but in tests, the slowdown is quite significant.
Every `manager.servers_add(n)` call for n > 1 becomes 1s slower when the
cluster is empty. Many cluster tests are affected.

In this commit, we decrease the sleep duration from 1s to 100ms to speed up
tests. The consequence of this change is that nodes might perform more steps
in group0 discovery, but the increase in CPU usage and network traffic should
be negligible.
2026-03-12 15:40:18 +01:00
Gleb Natapov
c67f876893 service level: make maybe_update_per_service_level_params synchronous
It does not call async functions any more.
2026-03-12 15:53:08 +02:00
Benny Halevy
b3fec20960 test_tablets_migration: test_staging_backlog_is_preserved_with_file_based_streaming: convert for loop to asyncio.gather
Currently the test iterates on all servers and calls manager.api.disable_injection
but it doesn't await those calls.
Use asyncio.gather to await all calls in parallel.

Co-authored-by: Copilot CLI
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-03-12 15:26:40 +02:00
Benny Halevy
61d5a2df02 test_tablets_migration: test_tablet_back_and_forth_migration: await move_tablet
Co-authored-by: Copilot CLI
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-03-12 15:26:40 +02:00
Benny Halevy
b8655748a2 test_tablets_migration: test_restart_in_cleanup_stage_after_cleanup: await move_task
Co-authored-by: Copilot CLI
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-03-12 15:26:40 +02:00
Benny Halevy
10dccc2c4e test_tablets_migration: test_restart_leaving_replica_during_cleanup: await move_task
Co-authored-by: Copilot CLI
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-03-12 15:26:40 +02:00
Benny Halevy
c9d653fb1e test_tablets_migration: drop unused imports from cassandra.query
Co-authored-by: Copilot CLI
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-03-12 15:26:40 +02:00
Gleb Natapov
c30907b8f2 service level: remove unused get_user_scheduling_group function 2026-03-12 14:28:26 +02:00
Gleb Natapov
a934d8391d service level: drop async find_effective_service_level
find_cached_effective_service_level does exactly the same thing now and it
is synchronous.
2026-03-12 14:28:26 +02:00
Botond Dénes
15cfa5beeb mutation/collection_mutation: don't copy the serialized collection
serialize_collection_mutation() copies the serialized collection into
the returned collection_mutation object. Change to move to avoid the
copy.

Fixes: SCYLLADB-1041

Closes scylladb/scylladb#29010
2026-03-12 13:57:40 +02:00
Gleb Natapov
f888f2dced service level: remove remnants of version 1 service level
can_use_effective_service_level_cache() always returns true now, so the
function can be dropped entirely and all the code that assumes it may
return false can be dropped as well.
2026-03-12 12:27:52 +02:00
Nadav Har'El
27f0510280 test/alternator: test_gzip_request_oversized now passes on AWS
The Alternator test test_compressed_request.py::test_gzip_request_oversized
checks that a very large request that compresses to a small size is still
rejected. This test passed on Alternator, but used to fail on DynamoDB
because DynamoDB didn't reject this case. This was a bug in DynamoDB
(a "decompression bomb" vulnerability), and after I reported it, it
was fixed.

So now this test does pass on DynamoDB (after a small modification to
allow for different error codes). So remove its scylla_only marker,
and make the comment true to the current state.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#28820
2026-03-12 10:41:56 +01:00
Marcin Maliszkiewicz
b277d9d9aa cql3: track CQL parsing memory cost and use it for admission control
Use rolling_max_tracker to record gross bytes allocated during each
CQL parse.  The rolling maximum is then added to the memory estimate
for incoming QUERY and PREPARE requests so that the admission control
in the CQL transport layer accounts for parsing overhead.

The measured memory footprint serves as an upper bound rather than an
exact number, but its purpose is to prevent OOMs under load heavy with
unprepared statements.

In a benchmark, a node with 1G of memory shows a decrease in peak non-LSA
memory usage from 320MB (our coordinator budget is 10% of 1G) to 96MB,
while tps drops from 1.2 kops to 0.8 kops. The drop in tps is expected,
as memory admission kicks in trying to prevent OOM.
2026-03-12 10:16:10 +01:00
Botond Dénes
0b19a6de85 tombstone_gc: tombstone_gc_state::for_tests(): remove unused param
Closes scylladb/scylladb#28923
2026-03-12 10:01:42 +01:00
Marcin Maliszkiewicz
2d22eea2f9 Merge 'cql3: Replace SCYLLA_ASSERT and abort by throwing_assert' from Nadav Har'El
In this patch we replace every single use of SCYLLA_ASSERT(), abort() and assert() in the cql3/ directory by throwing_assert().

The problem with SCYLLA_ASSERT()/abort()/assert() is that when it fails, it crashes Scylla. This is almost always a bad idea (see #7871 discussing why), but it's even riskier in front-end code like cql3/: In front-end code, there is a risk that due to a bug in our code, a specific user request can cause Scylla to crash. A malicious user can send this query to all nodes and crash the entire cluster. When the user is not malicious, it causes a small problem (a failing request) to become a much worse crash - and worse, the user has no idea which request is causing this crash and the crash will repeat if the same request is tried again.

All of this is solved by using the new throwing_assert(), which is the same as SCYLLA_ASSERT() but throws an exception (using on_internal_error()) instead of crashing. The exception will prevent the code path with the invalid assumption from continuing, but will result in only the current user request being aborted, with a clear error message reporting the internal server error due to an assertion failure.

I reviewed all the changes that I did in these patches to check that (to the best of my understanding) none of the assertions in cql3/ involve the sort of serious corruption that might require crashing the Scylla node entirely.

throwing_assert() also improves logging of assertion failures compared to the original SCYLLA_ASSERT()/abort() - SCYLLA_ASSERT() printed a message to stderr which in many installations is lost, and abort() often prints no message at all. But throwing_assert() uses Scylla's standard logger, and also includes a backtrace in the log message.

Fixes #13970 (Exorcise assertions from CQL code paths)
Refs #7871 (Exorcise assertions from Scylla)

Closes scylladb/scylladb#28847

* github.com:scylladb/scylladb:
  cql3: remove unnecessary assert()
  cql3: replace abort() by throwing_assert()
  cql3: Replace SCYLLA_ASSERT by throwing_assert
2026-03-12 09:09:24 +01:00
Szymon Malewski
3116db6c2d test: fix testJsonOrdering
The `test/cqlpy/cassandra_tests/validation/entities/json_test.py::testJsonOrdering` was failing because of differences between Cassandra and Scylla in printing
JSON floating point values - e.g. Cassandra prints 30.0, where Scylla prints 30.
Both are valid, so in this patch, instead of comparing strings, we compare parsed JSON using `EquivalentJson`.

Fixes #28467

Closes scylladb/scylladb#28924
2026-03-12 09:07:08 +01:00
Marcin Maliszkiewicz
5b2a07b408 utils: add rolling max tracker
We will use it later to track parser memory
usage via per query samples.

Tests runtime in dev: 1.6s
2026-03-12 08:56:41 +01:00
Marcin Maliszkiewicz
54ef8fca57 auth: remove DEFAULT_SUPERUSER_NAME constant and dead DEFAULT_USER_PASSWORD
DEFAULT_SUPERUSER_NAME is no longer referenced after removing the
role_part special-casing in describe_roles. DEFAULT_USER_PASSWORD was
dead code too.
2026-03-12 08:46:00 +01:00
Marcin Maliszkiewicz
029410e159 auth: use configurable default_superuser in describe_roles
Replace the hardcoded meta::DEFAULT_SUPERUSER_NAME comparison with
default_superuser(_qp) which reads from the auth_superuser_name
config option. This makes the IF NOT EXISTS clause in DESCRIBE
output correct for clusters with a non-default superuser name.
2026-03-12 08:45:47 +01:00
Nadav Har'El
09a399ae3c Merge 'Replace estimated_histogram with approx_exponential_histogram - alternator' from Amnon Heiman
_"A journey of a thousand miles begins with a single step" Lao Tzu_

ScyllaDB uses estimated_histogram in many places.
We already have a more efficient alternative: approx_exponential_histogram. It is both CPU and
memory-efficient and can be exported as Prometheus native histograms.

Its main limitation (which has its benefits) is that the bucket layout is fixed at compile time, so
histograms with different configurations cannot be mixed.

The end goal is to replace all uses of estimated_histogram in the codebase.
That migration needs a few small API adjustments, so I am splitting the work
into steps for easier review.

This series is the first step. It introduces a base template for fixed-size
estimated histograms, and switches the Alternator's estimated_histogram with the template.
This change is self-contained and valuable on its own, while keeping the scope limited.

Minor adjustments were made to the code and tests so that the tests would pass.

Follow-up PRs will apply the same pattern to the rest of the code.

**New feature no need to backport**

Closes scylladb/scylladb#28987

* github.com:scylladb/scylladb:
  alternator: migrate to operation_size_kb histograms
  test/alternator/test_metrics.py: Update the bucket in the histogram search
  alternator: Use batch_histogram for batch size histograms
  estimated_histogram.hh: adds estimated_histogram_with_max
2026-03-12 00:06:16 +02:00
Wojciech Mitros
b1bd206147 transport: extract the error handling from process_request_one
When we forward CQL statements, we'll need to handle the errors
on the destination node. Only for read_failure_exception_with_timeout
exception, we'll still need to wait until timeout passes on the
source node.
For that we extract the exception handling to a separate method.
Additionally, we separate the waiting and all other handling,
so that all handling aside from waiting will be reusable after
forwarding, and we'll also be able to sleep on the source node
if necessary.
2026-03-11 19:40:47 +01:00
Wojciech Mitros
6184b1d5ea transport: move error response helpers from connection to cql_server
These methods are used only in the error handler in the cql server,
and outside of 3 cases, they don't need any information from the
cql_server::connection. We move them from cql_server::connection
to cql_server, so that they can be used in the following patches
for methods for CQL request forwarding where we'll have no instance
of cql_server::connection on the node forwarded to.

After the change the methods require no access to the server's
or connection's fields, so we also make them static methods.
2026-03-11 19:40:47 +01:00
Amnon Heiman
1339a44163 alternator: migrate to operation_size_kb histograms
Switch Alternator operation-size metrics from the legacy estimated
histogram implementation to estimated_histogram_with_max<512> and export
them through the native approx-exponential histogram path.

Add a dedicated operation-size histogram type alias based on
estimated_histogram_with_max<512>.
Replace all per-operation size histograms (GetItem/PutItem/DeleteItem/
UpdateItem/BatchGetItem/BatchWriteItem) with the new type.

Remove the custom legacy histogram-to-metrics adapter and use
to_metrics_histogram() for operation size metrics, aligning export
behavior with other approx-exponential histograms.
Update Alternator metrics tests to compute expected le bucket boundaries using
approx-exponential bucket math (including deduplication of equal
bounds), so assertions match the new exported histogram schema.
Update bucket helper signatures to use (max, precision) parameters and keep
+Inf handling unchanged.

Replace byte-to-KB ceiling conversion with plain integer division (bytes
/ 1024): histogram export already reports each bucket by its upper bound
(le), so rounding input values up before bucketing is unnecessary and
would over-shift borderline samples into higher buckets.
2026-03-11 17:29:14 +02:00
Marcin Maliszkiewicz
adc840919b auth: move default_superuser to common, remove _superuser member
Move default_superuser() to auth::meta in common.{hh,cc} and remove the
cached _superuser member from both standard_role_manager and
password_authenticator. The superuser name comes from config which is
immutable at runtime, so caching it is unnecessary.
2026-03-11 16:28:38 +01:00
Marcin Maliszkiewicz
993e06c1ae auth: use LOCAL_ONE for all auth queries
Removes auth-v1 hack for cassandra superuser as auth-v1
code no longer exists.

Also, CL is not really used when querying raft-replicated
tables (like the auth ones), but LOCAL_ONE is the least confusing
one.
2026-03-11 16:27:15 +01:00
Marcin Maliszkiewicz
6d1153687a auth: remove get_auth_ks_name indirection
Replace get_auth_ks_name(qp) with db::system_keyspace::NAME directly.
The function always returned the constant "system" and its qp
parameter was unused.
2026-03-11 16:26:47 +01:00
David
79f9967eaa docs: update theme 1.9
Motivation

Upgrades Sphinx to 9.x, MyST Parser to 5.x, Python to 3.11-3.14, Node.js to 22, and replaces Poetry with uv for dependency management.

Changelog: https://github.com/scylladb/sphinx-scylladb-theme/blob/master/docs/source/upgrade/CHANGELOG.md#190---26-february-2026
How to test

* Make sure you are using Python 3.11-3.14:
* python --version
* Install uv:
* make setupenv
* Build the docs:
* make preview
* Docs should render without errors at http://127.0.0.1:5500

Closes scylladb/scylladb#28971
2026-03-11 16:56:51 +02:00
Aleksandra Martyniuk
2e68f48068 nodetool: cluster repair: do not fail if a table was dropped
nodetool cluster repair without additional params repairs all tablet
keyspaces in a cluster. Currently, if a table is dropped while
the command is running, all tables are repaired but the command finishes
with a failure.

Modify nodetool cluster repair. If a table wasn't specified
(i.e. all tables are repaired), the command finishes successfully
even if a table was dropped.

If a table was specified and it does not exist (e.g. because it was
dropped before the repair was requested), then the behavior remains
unchanged.

Fixes: SCYLLADB-568.

Closes scylladb/scylladb#28739
2026-03-11 16:35:04 +02:00
Dani Tweig
45d7d9a96c .github/workflow: also call call_sync_milestone_to_jira.yml for close milestone event
What changed

* Added closed to milestone event types in
  call_sync_milestone_to_jira.yml (types: [created] -> types: [created, closed])
* Added VECTOR to the list of Jira project keys being synced
  (jira_project_keys: SCYLLADB,CUSTOMER,SMI,RELENG -> jira_project_keys: SCYLLADB,CUSTOMER,SMI,RELENG,VECTOR)

Why (Requirements Summary)

* The call_sync_milestone_to_jira.yml workflow only triggered on
  milestone creation. When a GitHub milestone is closed, the
  corresponding Jira versions (in SCYLLADB, CUSTOMER, SMI, RELENG
  projects) should be marked as released. Adding the closed trigger
  enables the called workflow (main_sync_milestone_to_jira_release.yml
  in github-automation) to handle both creating and releasing Jira
  versions from GitHub milestone events.
* Added the VECTOR project so its Jira versions are also created/released
  when milestones are created or closed in scylladb.git.
* This is consistent with the same change already applied to the staging
  and scylla-machine-image repos.

Fixes: PM-216

Update call_sync_milestone_to_jira.yml in scylladb.git - add close trigger and VECTOR project sync

Closes scylladb/scylladb#28981
2026-03-11 15:56:55 +02:00
Amnon Heiman
69fbcd32bd test/alternator/test_metrics.py: Update the bucket in the histogram search 2026-03-11 15:24:05 +02:00
Amnon Heiman
50af1f3671 alternator: Use batch_histogram for batch size histograms
Switch batch-related histograms to estimated_histogram_with_max.
This results in better memory consumption and improved efficiency.
2026-03-11 15:21:25 +02:00
Amnon Heiman
b22162c719 estimated_histogram.hh: adds estimated_histogram_with_max
This patch adds the estimated_histogram_with_max template that will be a
base for specific estimated_histograms, eventually replacing the current
struct implementation.

Introduce estimated_histogram_with_max<Max> as a reusable wrapper around
approx_exponential_histogram<1, Max, 4>, providing merge support and the
same add helpers used by the existing estimated_histogram type.

Add estimated_histogram_with_max_merge()

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2026-03-11 15:02:37 +02:00
Radosław Cybulski
fe8117feee alternator: fix shard's parent calculation for vnodes
Fix an invalid condition when searching for a parent shard when the table
is based on vnodes. Each shard has an associated `last token` - the
token that marks the end of the range of tokens it consumes (inclusive).
Additional assumptions are that the whole token space is used and that
(for vnodes) the token space wraps around.

Previously code looked like this:
    auto pid = std::upper_bound(..., [](const dht::token& t, const cdc::stream_id& id) {
                    return t < id.token();
    });
    if (pid != pids.begin()) {
        pid = std::prev(pid);
    }

An `upper_bound` call with `t < id.token()` means it is looking for
the first iterator for which `t < id.token()` is true,
which effectively means the first position whose value is bigger
than the searched value. Then we move the iterator backward once if possible.
Assuming token space <-2, 2> and parents [0, 2], when we search for:
- -1 -> we will get 0, it's first, so we can't move backward, so 0 (ok)
- 0 -> we will get 2, it's not first, so we go back and we return 0 (ok)
- 1 -> we will get 2, it's not first, so we go back and we return 0
      (not ok - should be 2)

The fix is to replace it with `std::lower_bound` and remove the conditional
backward motion. Since we have a guarantee that the whole token space is used,
if `std::lower_bound` returns `end()`, then we have a wraparound case
and we need to pick `begin()` as the result.

Fixes #28354
Fixes: SCYLLADB-537

Closes scylladb/scylladb#28382
2026-03-11 14:51:42 +02:00
Calle Wilund
bc544eb08e gcs_fixture: Change to use docker helper 2026-03-11 12:32:02 +01:00
Calle Wilund
eb2dfe04e1 aws_kms_fixture: Modify to use docker helper 2026-03-11 12:32:02 +01:00
Calle Wilund
4a8afd9649 test/lib/proc_util: Add docker helper
Adds a boost test equivalent of dockerized_service to
handle launching a dockerized mock service using an ephemeral port,
querying the port, and returning the process.
2026-03-11 12:32:02 +01:00
Calle Wilund
3e8a9a0beb pytest: use ephemeral port publish for docker mock servers
Changes dockerized_service to use ephemeral port publish, and
query the published port from podman/docker.
Modifies client code to use the slightly changed usage syntax.
2026-03-11 12:32:01 +01:00
Piotr Dulikowski
d9a277453e Merge 'cql3: pin prepared cache entry in prepare() to avoid invalid weak handle race' from Alex Dathskovsky
query_processor::prepare() could race with prepared statement invalidation: after loading from the prepared cache, we converted the cached object to a checked weak pointer and then continued asynchronous work (including error-injection waitpoints). If invalidation happened in that window, the weak handle could no longer be promoted and the prepare path could fail nondeterministically.

This change keeps a strong cache entry reference alive across the whole critical section in prepare() by using a pinned cache accessor (get_pinned()), and only deriving the weak handle while the entry is pinned. This removes the lifetime gap without adding retry loops.

  Test coverage was extended in test/cluster/test_prepare_race.py:

  - reproduces the invalidation-during-prepare window with injection,
  - verifies prepare completes successfully,
  - then invalidates again and executes the same stale client prepared object,
  - confirms the driver transparently re-requests/re-prepares and execution succeeds.

  This change introduces:

  - no behavior change for normal prepare flow besides stronger lifetime guarantees,
  - no new protocol semantics,
  - preserves existing cache invalidation logic,
  - adds explicit cluster-level regression coverage for both the race and driver reprepare path.
  - pushes the re-prepare operation towards the driver: the server will return an unprepared error the first time and the driver will have to re-prepare during the execution stage

Fixes: https://github.com/scylladb/scylladb/issues/27657

Backport to active branches recommended: No node crash, but user-visible PREPARE failures under rare schema-invalidation race; low-risk timeout-bounded retry improves robustness.

Closes scylladb/scylladb#28952

* github.com:scylladb/scylladb:
  transport/messages: hold pinned prepared entry in PREPARE result
  cql3: pin prepared cache entry in prepare() to avoid invalid weak handle race
2026-03-11 12:09:23 +01:00
Calle Wilund
e3e940bc47 dbuild: Use container network in dbuild nested containers
Remove the host network setting, ensuring we use private networks
(slirp4netns). This will allow nested container port aliasing,
helping CI stability (can use ephemeral ports and container
introspection).

This also makes the nested podman setup non-conditional,
since we only run podman containers inside dbuild, and need
the setup regardless of whether the host container is docker or not.
2026-03-11 12:05:51 +01:00
Calle Wilund
8a56eafd39 scylla_cluster: Read notify sock in background to prevent deadlock
Starts a thread to process scylla notify messages (NOTIFY_SOCKET)
instead of just processing them inline, non-blocking. This is because
the pipe created may be too small to hold enough messages for us to
reach the point where we would otherwise even read from said pipe,
which is what allows the other end (scylla) to proceed.
2026-03-11 11:59:00 +01:00
Patryk Jędrzejczak
37aeba9c8c Merge 'raft: add global read barrier to group0_batch::commit and switch auth and service levels' from Marcin Maliszkiewicz
This series adds a global read barrier to raft_group0_client, ensuring that Raft group0 mutations are applied on all live nodes before returning to the caller.

Currently, after a group0_batch::commit, the mutations are only guaranteed to be applied on the leader. Other nodes may still be catching up, leading to stale reads. This patch introduces a broadcast read barrier mechanism. Calling  send_group0_read_barrier_to_live_members after committing will cause the coordinator to send a read barrier RPC to all live nodes (discovered via gossiper) and waits for them to complete. This is best effort attempt to get cluster-wide visibility of the committed state before the response is returned to the user.

Auth and service levels write paths are switched to use this new mechanism.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-650

Backport: no, new feature

Closes scylladb/scylladb#28731

* https://github.com/scylladb/scylladb:
  test: add tests for global group0_batch barrier feature
  qos: switch service levels write paths to use global group0_batch barrier
  auth: switch write paths to use global group0_batch barrier
  raft: add function to broadcast read barrier request
  raft: add gossiper dependency to raft_group0_client
  raft: add read barrier RPC
2026-03-11 10:37:19 +01:00
Botond Dénes
54bddeb3b5 sstables/mx/writer: yield after writing a cell
With the goal of avoiding stalls on writing large collections, like
below:
    ++[0#1/1 100%] addr=0x5422d1e total=32 count=1 avg=32:
    |              seastar::backtrace<seastar::backtrace_buffer::append_backtrace_oneline()::{lambda(seastar::frame)#1}> at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:85
    ++           - addr=0x541b6d4:
    |              seastar::backtrace_buffer::append_backtrace_oneline at ./build/release/seastar/./seastar/src/core/reactor.cc:811
    |              (inlined by) seastar::print_with_backtrace at ./build/release/seastar/./seastar/src/core/reactor.cc:838
    ++           - addr=0x541afb7:
    |              seastar::internal::cpu_stall_detector::generate_trace at ./build/release/seastar/./seastar/src/core/reactor.cc:1479
    ++           - addr=0x541b86c:
    |              seastar::internal::cpu_stall_detector::maybe_report at ./build/release/seastar/./seastar/src/core/reactor.cc:1214
    |              (inlined by) seastar::internal::cpu_stall_detector::on_signal at ./build/release/seastar/./seastar/src/core/reactor.cc:1234
    |              (inlined by) seastar::reactor::block_notifier at ./build/release/seastar/./seastar/src/core/reactor.cc:1548
    /opt/scylladb/libreloc/libc.so.6: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=f83d43b9b4b0ed5c2bd0a1613bf33e08ee054c93, for GNU/Linux 3.2.0, not stripped

    ++           - addr=/opt/scylladb/libreloc/libc.so.6+0x1a28f:
    |              sigpending at ??:0
    ++           - addr=0x1760bf6:
    |              std::basic_string_view<signed char, std::char_traits<signed char> >::remove_prefix at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/string_view:302
    |              (inlined by) managed_bytes_basic_view<(mutable_view)0>::remove_prefix at ././utils/managed_bytes.hh:421
    |              (inlined by) _Z11read_simpleIlTk14FragmentedView24managed_bytes_basic_viewIL12mutable_view0EEET_RT0_ at ././utils/fragment_range.hh:365
    |              (inlined by) _ZL9get_fieldIlTk14FragmentedView24managed_bytes_basic_viewIL12mutable_view0EEQsr3stdE12is_trivial_vIT_EES3_T0_j at ././mutation/atomic_cell.hh:62
    |              (inlined by) atomic_cell_type::timestamp at ././mutation/atomic_cell.hh:103
    |              (inlined by) basic_atomic_cell_view<(mutable_view)0>::timestamp at ././mutation/atomic_cell.hh:232
    |              (inlined by) sstables::mc::writer::write_cell at ./sstables/mx/writer.cc:1101
    |              (inlined by) sstables::mc::writer::write_collection(bytes_ostream&, clustering_key_prefix const*, column_definition const&, collection_mutation_view, sstables::mc::writer::row_time_properties const&, bool)::$_0::operator() at ./sstables/mx/writer.cc:1233
    |              (inlined by) collection_mutation_view::with_deserialized<sstables::mc::writer::write_collection(bytes_ostream&, clustering_key_prefix const*, column_definition const&, collection_mutation_view, sstables::mc::writer::row_time_properties const&, bool)::$_0> at ././mutation/collection_mutation.hh:97
    |              (inlined by) sstables::mc::writer::write_collection at ./sstables/mx/writer.cc:1221
    ++           - addr=0x1677af3:
    |              sstables::mc::writer::write_cells at ./sstables/mx/writer.cc:1261
    |              (inlined by) sstables::mc::writer::write_row_body at ./sstables/mx/writer.cc:1287
    |              (inlined by) sstables::mc::writer::write_clustered at ./sstables/mx/writer.cc:1377
    |              (inlined by) _ZN8sstables2mc6writer15write_clusteredI14clustering_rowQ9ClusteredIT_EEEvRKS4_9tombstone at ./sstables/mx/writer.cc:766
    |              (inlined by) sstables::mc::writer::consume at ./sstables/mx/writer.cc:1425

Putting the yield in write_cell() instead of in write_collection() means
that writing any row benefits from the added yield point in the middle.
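
A toy model of that placement decision (the real yield is seastar's maybe_yield inside sstables/mx/writer.cc): putting the preemption check in the per-cell function means every caller of write_cells, not just the collection path, picks up a yield point per cell.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified model: count yield checks to show that placing the yield in
// write_cell() gives one preemption opportunity per cell written.
struct writer_model {
    size_t yields = 0;
    void maybe_yield() { ++yields; }   // stand-in for the reactor yield check
    void write_cell(int /*cell*/) {
        // ... serialize the cell ...
        maybe_yield();                 // yield point after every cell
    }
    void write_cells(const std::vector<int>& cells) {
        for (int c : cells) write_cell(c);
    }
};
```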

Refs: SCYLLADB-964

Closes scylladb/scylladb#28948
2026-03-11 10:34:55 +01:00
Botond Dénes
475220b9c9 Merge 'Remove the rest of pre raft topology code' from Gleb Natapov
Remove the rest of the code that assumes that either group0 does not exist yet or a cluster is still not upgraded to raft topology. Neither of those is supported any more.

No need to backport since we remove functionality here.

Closes scylladb/scylladb#28841

* github.com:scylladb/scylladb:
  service level: remove version 1 service level code
  features: move GROUP0_SCHEMA_VERSIONING to deprecated features list
  migration_manager: remove unused forward definitions
  test: remove unused code
  auth: drop auth_migration_listener since it does nothing now
  schema: drop schema_registry_entry::maybe_sync() function
  schema: drop make_table_deleting_mutations since it should not be needed with raft
  schema: remove calculate_schema_digest function
  schema: drop recalculate_schema_version function and its uses
  migration_manager: drop check for group0_schema_versioning feature
  cdc: drop usage of cdc_local table and v1 generation definition
  storage_service: no need to add yourself to the topology during reboot since raft state loading already did it
  storage_service: remove unused functions
  group0: drop with_raft() function from group0_guard since it always returns true now
  gossiper: do not gossip TOKENS and CDC_GENERATION_ID any more
  gossiper: drop tokens from loaded_endpoint_state
  gossiper: remove unused functions
  storage_service: do not pass loaded_peer_features to join_topology()
  storage_service: remove unused fields from replacement_info
  gossiper: drop is_safe_for_restart() function and its use
  storage_service: remove unused variables from join_topology
  gossiper: remove the code that was only used in gossiper topology
  storage_service: drop the check for raft mode from recovery code
  cdc: remove legacy code
  test: remove unused injection points
  auth: remove legacy auth mode and upgrade code
  treewide: remove schema pull code since we never pull schema any more
  raft topology: drop upgrade_state and its type from the topology state machine since it is not used any longer
  group0: hoist the checks for an illegal upgrade into main.cc
  api: drop get_topology_upgrade_state and always report upgrade status as done
  service_level_controller: drop service level upgrade code
  test: drop run_with_raft_recovery parameter to cql_test_env
  group0: get rid of group0_upgrade_state
  storage_service: drop topology_change_kind as it is no longer needed
  storage_service: drop check_ability_to_perform_topology_operation since no upgrades can happen any more
  service_storage: remove unused functions
  storage_service: remove non raft rebuild code
  storage_service: set topology change kind only once
  group0: drop in_recovery function and its uses
  group0: rename use_raft to maintenance_mode and make it sync
2026-03-11 10:24:20 +02:00
Piotr Dulikowski
38a2829f69 Merge 'Return HTTP error description in Vector Store client' from Szymon Wasik
The `service_error` struct: 6dc2c42f8b/service/vector_store_client.hh (L64)
currently stores just the error status code, so whenever an HTTP error occurs, only the code can be forwarded to the client. For example, see here: 6dc2c42f8b/service/vector_store_client.cc (L580)
As a result, the full description of the error is missing from the drivers' output, which forces the user to look into the Scylla server logs.

The objective of this PR is to extend the support for HTTP errors in Vector Store client to handle messages as well.

Moreover, it removes the quadratic reallocation in response_content_to_sstring() helper function that is used for getting the response in case of error.

Fixes: VECTOR-189

Closes scylladb/scylladb#26139

* github.com:scylladb/scylladb:
  vector_search: Avoid quadratic reallocation in response_content_to_sstring
  vector_store_client: Return HTTP error description, not just code
2026-03-11 09:19:27 +01:00
Calle Wilund
6d8ac23731 test_encryption: Use maximum replication in _smoke_test
Refs: SCYLLADB-557

We should use full replication in KS/CF creation and population,
for at least two reasons:
1.) Ensure we wait fully for and write to all nodes
2.) Make test more "real", behaving like a proper cluster

Closes scylladb/scylladb#28959
2026-03-11 09:54:57 +02:00
Nadav Har'El
00a819bcd8 cql3: remove unnecessary assert()
In cql3/, there was one call to assert() (not SCYLLA_ASSERT or
throwing_assert), and it was:

        const auto shard_num = smp::count;
        assert(shard_num > 0)

Rather than converting this assert() to throwing_assert() as I did in
previous patches, I decided to outright remove it: Seastar guarantees
that smp::count is not zero. Many other places in the code use
smp::count assuming that it is correct, no other place bothers to assert
it isn't zero.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-03-11 09:43:24 +02:00
Nadav Har'El
34eec020b3 cql3: replace abort() by throwing_assert()
After the previous patch replaced all SCYLLA_ASSERT() calls by
throwing_assert(), this patch also replaces all calls to abort().

All these abort() calls are supposedly cases that can never happen,
but if they ever do happen because of a bug, none of these places
absolutely needs to crash - an exception that aborts the current
operation should be enough.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-03-11 09:43:11 +02:00
Nadav Har'El
c87d6407ed cql3: Replace SCYLLA_ASSERT by throwing_assert
In this patch we replace every single use of SCYLLA_ASSERT() in the cql3/
directory by throwing_assert().

The problem with SCYLLA_ASSERT() is that when it fails, it crashes Scylla.
This is almost always a bad idea (see #7871 discussing why), but it's even
riskier in front-end code like cql3/: In front-end code, there is a risk
that due to a bug in our code, a specific user request can cause Scylla
to crash. A malicious user can send this query to all nodes and crash
the entire cluster. When the user is not malicious, it causes a small
problem (a failing request) to become a much worse crash - and worse,
the user has no idea which request is causing this crash and the crash
will repeat if the same request is tried again.

All of this is solved by using the new throwing_assert(), which is the
same as SCYLLA_ASSERT() but throws an exception (using on_internal_error())
instead of crashing. The exception will prevent the code path with the
invalid assumption from continuing, but will result in only the current
user request being aborted, with a clear error message reporting the
internal server error due to an assertion failure.
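
A hedged sketch of the idea only (the real throwing_assert routes through on_internal_error and Scylla's logger, with a backtrace): a failed assertion becomes an exception that aborts just the current request, not the node.

```cpp
#include <cassert>
#include <sstream>
#include <stdexcept>

// Illustrative macro, not the actual Scylla implementation: on failure,
// throw with file/line context instead of calling abort().
#define throwing_assert(cond) \
    do { \
        if (!(cond)) { \
            std::ostringstream os; \
            os << "internal error: assertion '" #cond "' failed at " \
               << __FILE__ << ":" << __LINE__; \
            throw std::runtime_error(os.str()); \
        } \
    } while (0)
```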

I reviewed all the changes that I did in this patch to check that (to the
best of my understanding) none of the assertions in cql3/ involve the
sort of serious corruption that might require crashing the Scylla node
entirely.

throwing_assert() also improves logging of assertion failures compared
to the original SCYLLA_ASSERT() - SCYLLA_ASSERT() printed a message to
stderr which in many installations is lost, whereas throwing_assert()
uses Scylla's standard logger, and also includes a backtrace in the
log message.

Fixes #13970 (Exorcise assertions from CQL code paths)
Refs #7871 (Exorcise assertions from Scylla)

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-03-11 09:41:20 +02:00
Botond Dénes
99fa912f1b Merge 'Generalize streaming scopes tests' from Pavel Emelyanov
There are two tests that exercise streaming scopes on restore and greatly duplicate each other -- test_restore_with_streaming_scopes from the cluster/object_store suite and test_refresh_with_streaming_scopes from the cluster suite.

This patch generalizes both into a do_test_streaming_scopes() non-test function

Closes scylladb/scylladb#28874

* github.com:scylladb/scylladb:
  test: Re-sort comments around do_test_streaming_scopes()
  test: Split do_load_sstables()
  test: Drop load_fn argument from do_load_sstables()
  test: Re-use do_test_streaming_scopes() in refresh test
  test: Introduce SSTablesOnLocalStorage
  test: Introduce SSTablesOnObjectStorage
  test: Move test_restore_with_streaming_scopes() into do_test_streaming_scopes()
2026-03-11 09:35:21 +02:00
Dmitriy Kruglov
cee44716db docs: add cluster platform migration procedure
Document how to migrate a ScyllaDB cluster to different instance
types using the add-and-replace node cycling approach.

Closes: QAINFRA-42

Closes scylladb/scylladb#28458
2026-03-11 09:31:35 +02:00
Nadav Har'El
401dc1894c test/alternator,cqlpy: avoid xfail_strict against DynamoDB/Cassandra
Recently, in commit 7b30a39, we added to pytest.ini the option xfail_strict.
This option causes every XPASS, i.e., an XFAIL test that actually
passes, to be considered an error and fail the test.

While this has some benefits, it's a big problem when running tests
against a reference implementation like DynamoDB or Cassandra: We
typically mark a test "xfail" if the test shows a known bug - i.e., if
the test fails on Scylla but passes on the reference system (DynamoDB
or Cassandra). This means that when running "test/cqlpy/run-cassandra"
or "test/alternator/run --aws", we expect to see many tests XPASS,
and now this will cause these runs to "fail".

So in this patch we add the xfail_strict=false to cqlpy/run-cassandra
and alternator/run --aws. This option is not added to cqlpy/run or
to alternator/run without --aws, and also doesn't affect test.py or
Jenkins.

P.S. This is another nail in the coffin of doing "cd test/alternator;
pytest --aws". You should get used to running Alternator tests through
test/alternator/run, even if you don't need to run Scylla (the "--aws"
option doesn't run Scylla).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#28973
2026-03-11 09:29:30 +02:00
Robert Bindar
29619e48d7 replica/table: calculate manifest tablet_count from tablet map
During tests I noticed that if the number of tablets is very small,
say 2, and the number of nodes is 3 (2 shards per node), then when
counting storage groups on each shard, one shard may end up holding 0
groups whilst the other holds 1 group, and on some nodes even both
shards hold 0 groups.
Taking the minimum among shards was thus showing in manifests a tablet
count of 0 for all 3 nodes, which is incorrect.
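
A hypothetical sketch of the corrected approach (types and names invented): derive the per-node count from the tablet map's replica sets, instead of taking the minimum of per-shard storage-group counts, which reports 0 whenever any shard holds no group.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Invented stand-in for a tablet map entry's replica.
struct tablet_replica { int host; int shard; };

// Count tablets whose replica set contains the given host, regardless of
// how the replicas are spread across that host's shards.
size_t tablet_count_for_host(
        const std::vector<std::vector<tablet_replica>>& tablet_map, int host) {
    size_t count = 0;
    for (const auto& replicas : tablet_map) {
        count += std::any_of(replicas.begin(), replicas.end(),
            [&](const tablet_replica& r) { return r.host == host; });
    }
    return count;
}
```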

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>

Closes scylladb/scylladb#28978
2026-03-11 09:27:04 +02:00
Botond Dénes
3fed6f9eff Merge 'service: tasks: scan all tablets in tablet_virtual_task::wait' from Aleksandra Martyniuk
Currently, for repair tasks tablet_virtual_task::wait gathers the
ids of tablets that are to be repaired. The gathered set is later
used to check if the repair is still ongoing.

However, if the tablets are resized (split or merged), the gathered
set becomes irrelevant. Thus, we may end up with an invalid tablet id
error being thrown.

Wait until repair is done for all tablets in the table.

Fixes: https://github.com/scylladb/scylladb/issues/28202

Backport to 2026.1 needed as it contains the change introducing the issue d51b1fea94

Closes scylladb/scylladb#28323

* github.com:scylladb/scylladb:
  service: fix indentation
  test: add test_tablet_repair_wait
  service: remove status_helper::tablets
  service: tasks: scan all tablets in tablet_virtual_task::wait
2026-03-11 09:24:07 +02:00
Raphael S. Carvalho
cc5b1acadf Improve log when sstable load fails due to missing tablet replica
A bug or some bad operator intervention can lead to an sstable existing on a
node after the tablet replica was moved to a different node.

This will result in sstable loading during boot failing, requiring operator
intervention.
The log today just dumps the name of the "orphaned" sstable, but whoever
investigates it might want to know which process (repair, memtable, whatever)
generated that sstable, whether the sstable was created locally or remotely,
and the current replica set of the underlying tablet.
From the original identifier, we can know the exact time the sstable was
created on its original node. From the current id, we know the time it
was created on the current node.
All this info can help the investigator correlate with events on other nodes
(including actions from the coordinator) to get closer to the root cause.

The new log will look like this:

"Unable to load SSTable .../me-3gyg_1fsw_2u0u826b00b71vc46o-big-Data.db
(originated from compaction with id 913f41c0-18c2-11f1-8f08-cb8521b3f330
on host e483238c-2287-4022-8bc4-b4f1c4cb2b0d)
of tablet 6 (replica set: [e483238c-2287-4022-8bc4-b4f1c4cb2b0d:0])"

Refs https://scylladb.atlassian.net/browse/SCYLLADB-788.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#28921
2026-03-11 06:20:34 +02:00
Avi Kivity
b17e1259e3 Merge 'types: optimize vector deserialization for high-dimensional vectors' from Szymon Wasik
Vector deserialization is an operation whose performance is critical for
the vector similarity search feature because it is frequently executed
during the rescoring operation. Some of the identified performance
bottlenecks for it include:

1. Per-element virtual dispatch in deserialize(): each of the N elements
   went through visit() which switches on ~28 type variants. For a
   1024-dimension float vector, that's 1024 redundant type switches when
   the element type is the same for all of them.

2. Redundant work in split_fragmented(): value_length_if_fixed() was
   called inside the loop (N virtual calls), and no reserve() was done
   on the output vector causing repeated reallocations.

This series fixes both:

- Introduce deserialize_vector_visitor that dispatches on the element
  type once for the entire vector, then loops inside the resolved
  handler. Simple numeric types (float, int, etc.) call
  deserialize_value() directly with no virtual dispatch per element.
  String types (ascii, utf8) get a dedicated handler that skips
  make_empty() (sstring has no empty_t constructor). Complex types
  (list, map, tuple, etc.) fall back to per-element dispatch.

- In split_fragmented(), reserve the output vector to _dimension and
  cache value_length_if_fixed() before the loop.

Benchmark results (1024-dim float vector, release build, -O3 -flto):

  deserialize:      15.73 us -> 11.70 us  (1.34x, 26% faster)
  split_fragmented: 10.34 us ->  7.45 us  (1.39x, 28% faster)

References: SCYLLADB-471

Backport: none, unless we observe some critical performance improvement for quantization.

Closes scylladb/scylladb#28618

* github.com:scylladb/scylladb:
  types: optimize reading vector fragments
  types: optimize vector deserialization for high-dimensional vectors
2026-03-11 00:39:46 +02:00
Dawid Mędrek
167feabe1a cql3: Reject user-provided timestamps for strongly consistent tables
Similarly to LWTs, we reject queries with user-provided timestamps
when they target strongly consistent tables.

Such statements could force us to rewrite history, and that contradicts
the philosophy of linearizability we aim for.

Fixes SCYLLADB-879

Closes scylladb/scylladb#28867
2026-03-10 22:11:39 +02:00
Marcin Maliszkiewicz
8ae80a32c0 Update seastar submodule
* seastar d2953d2a...4d268e0e (32):
  > Merge 'prometheus: support multiple __name__ filters and prefixed names' from Travis Downs
    doc: update prometheus.md with __name__ filter enhancements
    prometheus: support prefixed names in __name__ filter
    prometheus: add benchmarks for name filter performance
    prometheus: support multiple __name__ query parameters
    prometheus: move write_body_args to header
  > fair_queue: Subtract from _queued_capacity on pop_front()
  > memory: expose cumulative allocated bytes statistic
  > Merge 'Add ability to configure IO bandwidth limit for supergroup' from Pavel Emelyanov
    test: IO bandwidth throttler unit tests
    code: Add ability to configure IO bandwidth limit for supergroup
    io_queue: Have more than one throttler per class
    io_queue: Introduce bandwidth_throttler helper class
    io_queue: Nest io_group::priority_class_data-s
    io_queue: Update class bandwidth on group's class data
    io_queue: Make io_group::priority_class_data::tokens() static
    fair_queue: Introduce group (un)plugging
  > Fix _shard_to_numa_node_mapping double population
  > Use exception parameter in log_timer_callback_exception()
  > Fix wakeup_granularity() fallback debug-fs reading
  > test_fixture: Fix SEASTAR_FIXTURE_THREAD_TEST_CASE thread not propagated
  > build: support tuning -ffile-prefix-map
  > test: Remove unused C::dup() method of testing class
  > src/core/reactor: introduce reactor::get_backend_name()
  > util/process: add pid() accessor
  > Merge 'Add source location to task and tasktrace object' from Radosław Cybulski
    coroutine.hh: disable source_location for GCC to avoid ICE
    reactor: improve do_dump_task_queue reporting
    Use source_location in `do_dump_task_queue`
    Update backtrace with source locations of resume points
    Add calls to update resume_point
    Add a std::source_location (resume_point) to task object.
  > Merge 'Refine posix file .dup() implementation' from Pavel Emelyanov
    file: Templatize posix_file_handle_impl
    file: Don't dup() non-read-only files
    file: Split ..._impl::dup() implementations
    test: Add a simple test for dup()
  > Merge 'Deprecate reactor::make_pollable_fd(socket_address, int)' from Pavel Emelyanov
    reactor: Deprecate make_pollable_fd()
    net/posix: Create file_desc for sockets in-place
    reactor,net: Keep sock_need_nonblock boolean on posix_network_stack
    net/posix: Re-format constructor initializer lists
  > Merge 'test: add fuzz testing infrastructure and sstring fuzzer' from Travis Downs
    test: add fuzz tests to CI workflow
    test: add sstring differential fuzzer
    test: add fuzz testing infrastructure
  > Introduce "integrated queue length" metrics and use it for IO classes (#3210)
  > reactor: Remove get_sg_data(unsigned) overload
  > memcached: Stop using scattered_message
  > reactor: Mark uptime() method const
  > alien: Remove deprecated run_on and submit_to calls
  > file: make open_flags and access_flags constexpr
  > scheduling: Unfriend some methods from scheduling_group
  > reactor: Move _dying bit to epoll backend
  > file: coroutinize the with_file templates
  > configure: validate --cook ingredient names
  > fix trailing whitespace
  > Merge 'Estimate timing overhead, allow failing if it is too high' from Travis Downs
    perf_tests: document overhead column and threshold options
    perf_tests: add measurement overhead tracking and warnings
    perf_tests: remove inline/hot attributes from time_measurement methods
    perf_tests: move time_measurement class to implementation file
    perf_tests: move perf counters into time_measurement singleton
  > rpc: log handler type
  > Merge 'Add pre-commit with trailing whitespace hook' from Travis Downs
    Add GitHub Actions workflow for pre-commit enforcement
    Add pre-commit setup documentation to HACKING.md
    Add pre-commit configuration with trailing-whitespace hook
    Remove trailing whitespace from source files
  > posix-stack: Make internal::posix_connect() resolve exceptions into futures
  > sstring: fix npos to be size_t for consistency with std::string

Closes scylladb/scylladb#28954
2026-03-10 22:06:58 +02:00
Szymon Wasik
7fae78d2b0 types: optimize reading vector fragments
There was redundant work in split_fragmented(): value_length_if_fixed() was
called inside the loop (N virtual calls), and no reserve() was done
on the output vector causing repeated reallocations.

This patch reserves the output vector to _dimension and
caches value_length_if_fixed() before the loop.

Additionally, split read_vector_element() into two specialized functions:
read_vector_element_fixed() and read_vector_element_variable(), and hoist
the branch on fixed_len outside the loop in split_fragmented() and
deserialize_loop(). This avoids a conditional branch per element in the
hot path.
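
A contiguous-buffer sketch of the two changes (simplified; the real code walks fragmented managed_bytes views): the element length is cached before the loop rather than re-queried per element, and the output is reserved up front so an N-element vector needs one allocation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative split of a serialized fixed-length-element vector: hoist the
// element-length lookup and reserve() out of the per-element loop.
std::vector<std::vector<unsigned char>>
split_fixed_elements(const std::vector<unsigned char>& serialized,
                     size_t dimension, size_t element_len) {
    std::vector<std::vector<unsigned char>> out;
    out.reserve(dimension);                       // single allocation for N slots
    const unsigned char* p = serialized.data();   // element_len cached, not re-queried
    for (size_t i = 0; i < dimension; ++i, p += element_len) {
        out.emplace_back(p, p + element_len);     // copy one element's bytes
    }
    return out;
}
```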

Benchmark results (1024-dim float vector, release build, -O3 -flto):
10.34 us ->  7.45 us  (1.39x, 28% faster)
2026-03-10 20:17:31 +01:00
Taras Veretilnyk
579269b3c5 test/cqlpy: test --ignore-component-digest-mismatch flag in scylla sstable upgrade
Verify that scylla sstable upgrade fails when an sstable has a
corrupted Statistics component digest, and succeeds when the
--ignore-component-digest-mismatch flag is provided.
2026-03-10 19:24:05 +01:00
Taras Veretilnyk
fc4c82b962 docs: document --ignore-component-digest-mismatch flag for scylla sstable upgrade 2026-03-10 19:24:05 +01:00
Taras Veretilnyk
7214f5a0b6 sstables: propagate ignore_component_digest_mismatch config to all load sites
Add ignore_component_digest_mismatch option to db::config (default false).
When set, sstable loading logs a warning instead of throwing on component
digest mismatches, allowing a node to start up despite corrupted non-vital
components or bugs in digest calculation.

Propagate the config to all production sstable load paths:
- distributed_loader (node startup, upload dir processing)
- storage_service (tablet storage cloning)
- sstables_loader (load-and-stream, download tasks, attach)
- stream_blob (tablet streaming)
2026-03-10 19:24:05 +01:00
Taras Veretilnyk
c123f637ea sstables: add option to ignore component digest mismatches
Add `ignore_component_digest_mismatch` option to `sstable_open_config`
that logs a warning instead of throwing `malformed_sstable_exception`
on component digest mismatch. This is useful for recovering sstables
with corrupted non-vital components or working around bugs in digest
calculation.
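
An illustrative sketch of the control flow only (names invented, not the actual sstables code): on a digest mismatch, either raise the usual malformed-sstable error or, when the new option is set, downgrade it to a warning and keep loading.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Invented stand-in for the relevant part of sstable_open_config.
struct open_config_model {
    bool ignore_component_digest_mismatch = false;  // default: strict
};

void check_digest(uint32_t expected, uint32_t actual,
                  const open_config_model& cfg,
                  std::vector<std::string>& warnings) {
    if (expected == actual) return;
    if (cfg.ignore_component_digest_mismatch) {
        warnings.push_back("component digest mismatch ignored");  // warn, keep loading
    } else {
        throw std::runtime_error("component digest mismatch");    // refuse to load
    }
}
```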

Expose the option in scylla-sstable via the
`--ignore-component-digest-mismatch` flag for the upgrade operation.
2026-03-10 19:24:05 +01:00
Taras Veretilnyk
95420014ea sstable_compaction_test: Add scrub validate test for corrupted index
Generalize corrupt_sstable() and scrub_validate_corrupted_file() to
accept a component_type parameter, defaulting to Data, so they can be
reused for corrupting other components.
2026-03-10 19:24:05 +01:00
Taras Veretilnyk
a3912cf7f1 sstables: add tests for component digest validation on corrupted SSTables
Add tests that verify SSTable component digest validation detects
corruption on load. Each test writes an SSTable, corrupts a specific
component file by flipping a bit, then asserts that reloading the
SSTable throws malformed_sstable_exception with the expected digest
mismatch message.
2026-03-10 19:24:05 +01:00
Taras Veretilnyk
e78a3d2c44 sstables: validate index components digests during SSTable scrub in validate mode 2026-03-10 19:24:05 +01:00
Taras Veretilnyk
9decbdeab0 sstables: verify component digests on SSTable load
Add integrity verification for SSTable component files by validating
their CRC32 digests against the expected values stored in Scylla
metadata during SSTable loading.

The following components are validated on load: TOC, Scylla metadata,
CompressionInfo, Statistics, Summary, and Filter.
2026-03-10 19:24:05 +01:00
Taras Veretilnyk
478c1eaec5 sstables: add digest_file_random_access_reader for CRC32 digest computation
Add a new reader that wraps file_random_access_reader and computes a
running CRC32 digest over the data as it is read. The digest accumulates
across sequential read_exactly() calls and is reset on seek(), since a
non-sequential read invalidates the running checksum.
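
A toy model of the reader's contract (a simple byte-sum stands in for CRC32, and the buffer replaces the file): the digest accumulates across sequential read_exactly() calls and is reset by seek(), since a non-sequential read invalidates the running checksum.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Invented in-memory model of digest_file_random_access_reader.
class digest_reader_model {
    const std::vector<uint8_t>& _data;
    size_t _pos = 0;
    uint32_t _digest = 0;
public:
    explicit digest_reader_model(const std::vector<uint8_t>& d) : _data(d) {}
    std::vector<uint8_t> read_exactly(size_t n) {
        std::vector<uint8_t> out(_data.begin() + _pos, _data.begin() + _pos + n);
        for (uint8_t b : out) _digest += b;   // feed the running digest
        _pos += n;
        return out;
    }
    void seek(size_t pos) {
        _pos = pos;
        _digest = 0;   // the running digest is meaningless after a jump
    }
    uint32_t digest() const { return _digest; }
};
```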
2026-03-10 19:24:05 +01:00
Szymon Wasik
6c0ef8eb92 types: optimize vector deserialization for high-dimensional vectors
One of the performance bottlenecks while deserializing vectors was
per-element virtual dispatch in deserialize(): each of the N elements
went through visit() which switches on ~28 type variants. For a
1024-dimension float vector, that's 1024 redundant type switches when
the element type is the same for all of them.

This patch introduces deserialize_vector_visitor that dispatches on the element
type once for the entire vector, then loops inside the resolved
handler. Simple numeric types (float, int, etc.) call
deserialize_value() directly with no virtual dispatch per element.
String types (ascii, utf8) get a dedicated handler that skips
make_empty() (sstring has no empty_t constructor). Complex types
(list, map, tuple, etc.) fall back to per-element dispatch.
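
A simplified illustration of the "dispatch once, loop inside" shape for the float case (the real code dispatches over ~28 abstract_type variants and reads fragmented buffers): the type switch happens before the loop, leaving a branch-free per-element body.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Resolved handler for a float-element vector: no per-element type switch.
std::vector<float> deserialize_float_vector(const unsigned char* buf, size_t dim) {
    std::vector<float> out;
    out.reserve(dim);
    for (size_t i = 0; i < dim; ++i) {
        float v;
        std::memcpy(&v, buf + i * sizeof(float), sizeof(float));  // direct read
        out.push_back(v);
    }
    return out;
}
```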

Benchmark results (1024-dim float vector, release build, -O3 -flto):
15.73 us -> 11.70 us  (1.34x, 26% faster)
2026-03-10 18:21:34 +01:00
Andrzej Jackowski
9247dff8c2 reader_concurrency_semaphore: fix leak workaround
`e4da0afb8d5491bf995cbd1d7a7efb966c79ac34` introduces a protection
against resources that are "made up" of thin air to
`reader_concurrency_semaphore`. If there are more `_resources` than
the `_initial_resources`, it means there is a negative leak, and
`on_internal_error_noexcept` is called. In addition to it,
`_resources` is set to `std::max(_resources, _initial_resources)`.

However, the commit message of `e4da0afb8d5491bf995cbd1d7a7efb966c79ac34`
states the opposite: "The detection also clamps the
_resources to _initial_resources, to prevent any damage".

Before this commit, the protection mechanism didn't clamp
`_resources` to `_initial_resources` but instead kept `_resources` high,
possibly even growing indefinitely. This commit changes `std::max` to
`std::min` to make the code behave as intended.
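
A minimal model of the corrected clamp (invented struct; the real counters live in reader_concurrency_semaphore and the error path calls on_internal_error_noexcept): with std::max the inflated value survived, while std::min clamps it back down to the initial amount.

```cpp
#include <algorithm>
#include <cassert>

// Toy counters: a negative leak means _resources exceeds _initial_resources.
struct semaphore_model {
    int initial_resources;
    int resources;
    bool clamp_if_leaked() {
        if (resources > initial_resources) {   // resources "made up" of thin air
            resources = std::min(resources, initial_resources);  // the fix
            return true;   // real code also reports an internal error here
        }
        return false;
    }
};
```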

Refs: SCYLLADB-163

Closes scylladb/scylladb#28982
2026-03-10 18:57:31 +02:00
Szymon Wasik
74d86d3fe9 vector_search: Avoid quadratic reallocation in response_content_to_sstring
Pre-compute the total size and allocate a single uninitialized sstring
before copying the buffers, following the pattern from Seastar's
read_entire_stream_contiguous(). This avoids iterative reallocation
which is O(n^2) for large responses.
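
A std::string sketch of the pattern (the real code builds a seastar::sstring from temporary_buffers): sum the buffer sizes, allocate once, then copy, instead of appending with repeated reallocation.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// One up-front allocation sized to the total, then linear-time copies.
std::string concat_buffers(const std::vector<std::string>& buffers) {
    size_t total = std::accumulate(buffers.begin(), buffers.end(), size_t{0},
        [](size_t acc, const std::string& b) { return acc + b.size(); });
    std::string result;
    result.reserve(total);          // single allocation up front
    for (const auto& b : buffers) {
        result.append(b);           // copies into pre-sized storage
    }
    return result;
}
```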
2026-03-10 17:45:55 +01:00
Szymon Wasik
d27610f138 vector_store_client: Return HTTP error description, not just code
This simple patch adds support for storing the HTTP error description
that the Vector Store client receives from the vector store. Until now
it was just printed to the log but not returned, so it was not
forwarded to the drivers, which forced users to access the ScyllaDB
server logs to understand what is wrong with the Vector Store.

This patch also updates formatter to print the message next to the
error code.

Fixes: VECTOR-189
2026-03-10 17:22:30 +01:00
Botond Dénes
81e214237f Merge 'Add digests for all sstable components in scylla metadata' from Taras Veretilnyk
This pull request adds support for calculation and storing CRC32 digests for all SSTable components.
This change replaces plain file_writer with crc32_digest_file_writer for all SSTable components that should be checksummed. The resulting component digests are stored in the sstable structure
and later persisted to disk as part of the Scylla metadata component during writer::consume_end_of_stream.
Several test cases were introduced to verify the expected behaviour.

Additionally, this PR adds a new rewrite component mechanism for safe sstable component rewriting.
Previously, rewriting an sstable component (e.g., via rewrite_statistics) created a temporary file that was renamed to the final name after sealing. This allowed crash recovery by simply removing the temporary file on startup.

However, with component digests stored in scylla_metadata (#20100),
replacing a component like Statistics requires atomically updating both the component
and scylla_metadata with the new digest - impossible with POSIX rename.

The new mechanism creates a clone sstable with a fresh generation:

- Hard-links all components from the source except the component being rewritten and scylla_metadata
- Copies original sstable components pointer and recognized components from the source
- Invokes a modifier callback to adjust the new sstable before rewriting
- Writes the modified component along with updated scylla_metadata containing the new digest
- Seals the new sstable with a temporary TOC
- Replaces the old sstable atomically, the same way as it is done in compaction

This is built on the rewrite_sstables compaction framework to support batch operations (e.g., following incremental repair).
In case of any failure during the whole process, the sstable will be
automatically deleted on node startup thanks to temporary TOC persistence.

Backport is not required, it is a new feature

Fixes https://github.com/scylladb/scylladb/issues/20100, https://github.com/scylladb/scylladb/issues/27453

Closes scylladb/scylladb#28338

* github.com:scylladb/scylladb:
  docs: document components_digests subcomponent and trailing digest in Scylla.db
  sstable_compaction_test: Add tests for perform_component_rewrite
  sstable_test: add verification testcases of SSTable components digests persistence
  sstables: store digest of all sstable components in scylla metadata
  sstables: replace rewrite_statistics with new rewrite component mechanism
  sstables: add new rewrite component mechanism for safe sstable component rewriting
  compaction: add compaction_group_view method to specify sstable version
  sstables: add null_data_sink and serialized_checksum for checksum-only calculation
  sstables: extract default write open flags into a constant
  sstables: Add write_simple_with_digest for component checksumming
  sstables: Extract file writer closing logic into separate methods
  sstables: Implement CRC32 digest-only writer
2026-03-10 16:02:53 +02:00
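The clone-and-seal sequence described in the commit above can be sketched in Python. The component names, digest format, and directory layout here are illustrative assumptions, not ScyllaDB's actual on-disk format:

```python
import os
import tempfile
import zlib

def rewrite_component(src_dir, components, rewritten, new_content, new_dir):
    """Clone an sstable into new_dir with one component rewritten.

    Sketch only: file names and metadata layout are hypothetical.
    """
    os.makedirs(new_dir)
    for name in components:
        # Hard-link everything except the rewritten component, the
        # metadata (which gets a new digest), and the TOC (sealed last).
        if name in (rewritten, "scylla_metadata", "TOC"):
            continue
        os.link(os.path.join(src_dir, name), os.path.join(new_dir, name))
    # Write the modified component and metadata holding its new digest.
    with open(os.path.join(new_dir, rewritten), "w") as f:
        f.write(new_content)
    digest = zlib.crc32(new_content.encode())
    with open(os.path.join(new_dir, "scylla_metadata"), "w") as f:
        f.write(f"{rewritten}:{digest}\n")
    # Seal with a temporary TOC first; a crash before the final rename
    # leaves TOC.tmp behind, so startup deletes the incomplete clone.
    toc_tmp = os.path.join(new_dir, "TOC.tmp")
    with open(toc_tmp, "w") as f:
        f.write("\n".join(components) + "\n")
    os.rename(toc_tmp, os.path.join(new_dir, "TOC"))  # atomic seal
```

Since unchanged components are hard links, they share storage with the source; only the rewritten component and the metadata are actually written.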
Aleksandra Martyniuk
e02b0d763c service: fix indentation 2026-03-10 14:44:52 +01:00
Aleksandra Martyniuk
02257d1429 test: add test_tablet_repair_wait
Add a test that checks that tablet_virtual_task::wait does not fail when
tablets are merged.
2026-03-10 14:42:27 +01:00
Aleksandra Martyniuk
0e0070f118 service: remove status_helper::tablets
Currently, status_helper::tablets, which keeps a vector of processed
tablet ids, is used only in tablet_virtual_task::get_status_helper,
so there is no point in returning it. Also, in get_status_helper,
it is used only to determine if any tablets are processed.

Remove status_helper::tablets. Use a flag instead of the vector
in get_status_helper.
2026-03-10 14:42:27 +01:00
Aleksandra Martyniuk
e5928497ce service: tasks: scan all tablets in tablet_virtual_task::wait
Currently, for repair tasks tablet_virtual_task::wait gathers the
ids of tablets that are to be repaired. The gathered set is later
used to check if the repair is still ongoing.

However, if the tablets are resized (split or merged), the gathered
set becomes stale. Thus, we may end up with an invalid tablet id
error being thrown.

Wait until repair is done for all tablets in the table.
2026-03-10 14:42:21 +01:00
Andrei Chekun
c36df5ecf4 test.py: eliminate driver exception
There is a race condition in the driver that raises a RuntimeException.
This pollutes the output, so this PR just silences that exception.

Fixes: SCYLLADB-900

Closes scylladb/scylladb#28957
2026-03-10 14:31:36 +02:00
Alex
3ac4e258e8 transport/messages: hold pinned prepared entry in PREPARE result
result_message::prepared now owns a strong pinned prepared-cache entry instead of relying only on a weak pointer view. This closes the remaining lifetime gap after query_processor::prepare() returns, so users of the returned PREPARE message cannot observe an invalidated weak handle during subsequent processing.

- update result_message::prepared::cql constructor to accept pinned entry
- construct weak view from owned pinned entry inside the message
- pass pinned cache entry from query_processor::prepare() into the message constructor
2026-03-10 14:17:57 +02:00
Alex
27051d9a7c cql3: pin prepared cache entry in prepare() to avoid invalid weak handle race
query_processor::prepare() could race with prepared statement invalidation: after loading from the prepared cache, we converted the cached object to a checked weak pointer and then continued asynchronous work (including error-injection waitpoints). If invalidation happened in that window, the weak handle could no longer be promoted and the prepare path could fail nondeterministically.

This change keeps a strong cache entry reference alive across the whole critical section in prepare() by using a pinned cache accessor (get_pinned()), and only derives the weak handle while the entry is pinned. This removes the lifetime gap without adding retry loops.

Test coverage was extended in test/cluster/test_prepare_race.py:

- reproduces the invalidation-during-prepare window with injection,
- verifies prepare completes successfully,
- then invalidates again and executes the same stale client prepared object,
- confirms the driver transparently re-requests/re-prepares and execution succeeds.

This change introduces:

- no behavior change for the normal prepare flow besides stronger lifetime guarantees,
- no new protocol semantics,
- preserves existing cache invalidation logic,
- adds explicit cluster-level regression coverage for both the race and the driver re-prepare path,
- pushes the re-prepare operation towards the driver: the server will return an unprepared error the first time, and the driver will have to re-prepare during the execution stage.
2026-03-10 14:17:57 +02:00
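The pin-then-derive-weak pattern from these two commits can be illustrated with a small asyncio/weakref sketch; the class and method names are hypothetical stand-ins for the prepared cache, not the actual ScyllaDB API:

```python
import asyncio
import weakref

class PreparedEntry:
    def __init__(self, stmt):
        self.stmt = stmt

class PreparedCache:
    """Toy cache: get_pinned() hands out a strong reference."""
    def __init__(self):
        self._entries = {}
    def insert(self, key, entry):
        self._entries[key] = entry
    def invalidate(self, key):
        self._entries.pop(key, None)
    def get_pinned(self, key):
        return self._entries.get(key)   # strong ref pins the entry

async def prepare(cache, key):
    pinned = cache.get_pinned(key)      # pin across the async gap
    weak = weakref.ref(pinned)          # derive weak view while pinned
    await asyncio.sleep(0)              # invalidation may run here
    entry = weak()                      # promotion succeeds: still pinned
    assert entry is not None
    return entry

async def demo():
    cache = PreparedCache()
    cache.insert("q", PreparedEntry("SELECT 1"))
    task = asyncio.ensure_future(prepare(cache, "q"))
    await asyncio.sleep(0)              # let prepare() reach its await
    cache.invalidate("q")               # race: entry leaves the cache
    return await task                   # prepare still completes
```

Without the strong `pinned` reference, the invalidation in the await window would be free to destroy the entry and the weak promotion could fail.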
Piotr Dulikowski
37f8cdf485 Merge 'test.py: fix unawaited ScyllaLogFile.grep() coroutines' from Andrei Chekun
Fixed several places where ScyllaLogFile.grep() was called without await, resulting in checking coroutine objects for truthiness instead of actual log matches.

Fixes: SCYLLADB-903

No backport, framework fix and one test fix.

Closes scylladb/scylladb#28909

* github.com:scylladb/scylladb:
  test.py: fix unawaited ScyllaLogFile.grep() coroutines
  tests: fix test_group0_recovers_after_partial_command_application
2026-03-10 12:29:23 +01:00
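The bug class fixed above is easy to reproduce: calling an async method without await yields a coroutine object, which is always truthy, so the "found a match" branch is taken regardless of the log contents. A minimal sketch, with a hypothetical stand-in for ScyllaLogFile:

```python
import asyncio

class ScyllaLogFileSketch:
    """Hypothetical stand-in for the test framework's ScyllaLogFile."""
    async def grep(self, pattern):
        return []  # pretend nothing in the log matched

async def main():
    log = ScyllaLogFileSketch()
    coro = log.grep("error")    # bug: grep() called without await
    buggy = bool(coro)          # a coroutine object is always truthy
    coro.close()                # silence the "never awaited" warning
    fixed = bool(await log.grep("error"))  # actual matches: [] is falsy
    return buggy, fixed

buggy, fixed = asyncio.run(main())
```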
Dario Mirovic
f72081194c db: use prefix tombstones in DROP TABLE schema mutations
When dropping a table, make_drop_table_or_view_mutations() creates
a point tombstone in system_schema.columns for every column in the table.

The clustering key of system_schema.columns is (table_name, column_name).
A clustering key with only the table_name component acts as a prefix
tombstone. That tombstone covers all columns belonging to that table.
This approach is already used by make_table_deleting_mutations() during
CREATE TABLE.

Apply the same prefix tombstone approach to DROP TABLE for the columns,
view_virtual_columns, computed_columns, and dropped_columns schema tables.
This reduces tombstone accumulation in schema table sstables.

In test_max_cells test case, which repeatedly creates and drops a table
with 32768 columns, overall test time improved from ~180s to ~157s, which
is ~12.7% improvement.

Refs SCYLLADB-815

Closes scylladb/scylladb#28976
2026-03-10 11:59:00 +01:00
Gleb Natapov
b59b3d4f8a service level: remove version 1 service level code 2026-03-10 10:46:48 +02:00
Gleb Natapov
b633ec1779 features: move GROUP0_SCHEMA_VERSIONING to deprecated features list 2026-03-10 10:46:48 +02:00
Gleb Natapov
40ec0d4942 migration_manager: remove unused forward definitions 2026-03-10 10:46:48 +02:00
Gleb Natapov
aa9eb0ef8c test: remove unused code 2026-03-10 10:46:48 +02:00
Gleb Natapov
4660f908f9 auth: drop auth_migration_listener since it does nothing now 2026-03-10 10:46:48 +02:00
Gleb Natapov
74b5a8d43d schema: drop schema_registry_entry::maybe_sync() function
Schema is synced through group0 now. Drop all the tests of the function
as well.
2026-03-10 10:46:47 +02:00
Gleb Natapov
b9f3281af6 schema: drop make_table_deleting_mutations since it should not be needed with raft
Also remove the test since it is no longer relevant
2026-03-10 10:46:47 +02:00
Gleb Natapov
f76199e5c2 schema: remove calculate_schema_digest function
It is used by the test only, so remove the test and its data as well.
2026-03-10 10:46:47 +02:00
Gleb Natapov
08e33ad7f7 schema: drop recalculate_schema_version function and its uses
There is no need to recalculate schema version any more since it is set
by group0.
2026-03-10 10:46:39 +02:00
Gleb Natapov
7bb334a5dd migration_manager: drop check for group0_schema_versioning feature
We do not allow upgrading from a version that does not have it any
longer.
2026-03-10 10:39:59 +02:00
Gleb Natapov
4402b030ae cdc: drop usage of cdc_local table and v1 generation definition 2026-03-10 10:39:59 +02:00
Gleb Natapov
6769615ff1 storage_service: no need to add yourself to the topology during reboot since raft state loading already did it 2026-03-10 10:39:59 +02:00
Gleb Natapov
33fbda9f3b storage_service: remove unused functions 2026-03-10 10:39:58 +02:00
Gleb Natapov
0e3e7be335 group0: drop with_raft() function from group0_guard since it always returns true now
Also drop the code that assumed that the function can return false.
2026-03-10 10:39:58 +02:00
Gleb Natapov
4e56ca3c76 gossiper: do not gossip TOKENS and CDC_GENERATION_ID any more
They were used by legacy topology and cdc code only.
2026-03-10 10:39:58 +02:00
Gleb Natapov
77f8f952b2 gossiper: drop tokens from loaded_endpoint_state 2026-03-10 10:39:58 +02:00
Gleb Natapov
706754dc24 gossiper: remove unused functions 2026-03-10 10:39:58 +02:00
Gleb Natapov
8ee4cdd4b7 storage_service: do not pass loaded_peer_features to join_topology()
They are not used there any longer.
2026-03-10 10:39:58 +02:00
Gleb Natapov
24c01f2289 storage_service: remove unused fields from replacement_info 2026-03-10 10:39:58 +02:00
Gleb Natapov
2d8722d204 gossiper: drop is_safe_for_restart() function and its use
The function checks that the node's state is not left or removed in
gossiper during restart, but with raft topology a removed node will
not be able to contact the cluster to get this information since it will
be banned.
2026-03-10 10:39:58 +02:00
Gleb Natapov
6f739a8ee4 storage_service: remove unused variables from join_topology 2026-03-10 10:39:58 +02:00
Gleb Natapov
d35b83bec8 gossiper: remove the code that was only used in gossiper topology
The topology state machine is always present now and can be passed to
the gossiper during creation.
2026-03-10 10:39:58 +02:00
Gleb Natapov
390eb46c1a storage_service: drop the check for raft mode from recovery code
In non-raft mode the node will not boot at all, so the check is
now redundant.
2026-03-10 10:39:58 +02:00
Gleb Natapov
6a7e850161 cdc: remove legacy code
The patch removes test/boost/cdc_generation_test.cc since it unit-tests
the cdc::limit_number_of_streams_if_needed function, which is removed here.
2026-03-10 10:38:57 +02:00
Gleb Natapov
0b508c5f96 test: remove unused injection points
Also remove the test_auth_raft_command_split test, which has been irrelevant
since 5ba7d1b116 because it does not use the function that injects a
max-sized command after the commit.
2026-03-10 10:09:39 +02:00
Gleb Natapov
1d188f0394 auth: remove legacy auth mode and upgrade code
A system needs to be upgraded to use v2 auth before moving to this
ScyllaDB version; otherwise the boot will fail.
2026-03-10 10:09:39 +02:00
Gleb Natapov
02fc4ad0a9 treewide: remove schema pull code since we never pull schema any more
Schema pull was used by legacy schema code, which has not been supported
for a long time, and during legacy recovery, which is no longer supported
either. It can be dropped now.
2026-03-10 10:09:39 +02:00
Gleb Natapov
0cf726c81f raft topology: drop upgrade_state and its type from the topology state machine since it is not used any longer 2026-03-10 10:09:39 +02:00
Gleb Natapov
60a861c518 group0: hoist the checks for an illegal upgrade into main.cc
The checks are spread around now, but having them in one place and done
as early as possible simplifies the logic.
2026-03-10 10:09:39 +02:00
Gleb Natapov
1ff98c89e3 api: drop get_topology_upgrade_state and always report upgrade status as done
A non-upgraded version will no longer boot.
2026-03-10 10:09:38 +02:00
Gleb Natapov
be153a4eb7 service_level_controller: drop service level upgrade code
We do not allow upgrade from a version that is not updated yet, so the
code is not used any longer.
2026-03-10 10:09:38 +02:00
Gleb Natapov
61cc091364 test: drop run_with_raft_recovery parameter to cql_test_env
It is unused.
2026-03-10 10:09:38 +02:00
Gleb Natapov
00083b42a7 group0: get rid of group0_upgrade_state
Simplify the code by getting rid of group0_upgrade_state: upgrade is no
longer supported, so there is no need to track its state. A non-upgraded
node will simply not boot; to detect that, the patch checks the state
directly from the system table.
2026-03-10 10:09:38 +02:00
Gleb Natapov
d4b55de214 storage_service: drop topology_change_kind as it is no longer needed
The mode is always raft, so no need to keep a variable that tracks that.
2026-03-10 10:09:38 +02:00
Gleb Natapov
68ea6aa0a6 storage_service: drop check_ability_to_perform_topology_operation since no upgrades can happen any more 2026-03-10 10:09:38 +02:00
Gleb Natapov
06652948f3 service_storage: remove unused functions
raft_topology_change_enabled and upgrade_state_to_topology_op_kind are
not used any more. Remove the code.
2026-03-10 10:09:38 +02:00
Gleb Natapov
e8c72b7ba0 storage_service: remove non raft rebuild code
Only raft is supported now.
2026-03-10 10:09:38 +02:00
Gleb Natapov
49ebab971d storage_service: set topology change kind only once
The only supported mode is topology_change_kind::raft, so always set it in
storage_service::join_cluster during join or regular boot. Drop the check
for legacy mode from raft_group0::setup_group0_if_exist since the mode
will not be set at this point any longer. The wrong upgrade will still
be detected in storage_service::join_cluster where topology.upgrade_state
is checked directly.
2026-03-10 10:09:38 +02:00
Gleb Natapov
4e072977d4 group0: drop in_recovery function and its uses
Legacy recovery procedure is no longer supported and the code can be
dropped.
2026-03-10 10:09:38 +02:00
Gleb Natapov
770762edd8 group0: rename use_raft to maintenance_mode and make it sync
group0_upgrade_state::recovery is now used only in maintenance mode,
so rename the function to indicate that. Also, there is no longer a
preemption point in the function, so it can be a regular function, not
a coroutine.
2026-03-10 10:09:33 +02:00
Pavel Emelyanov
61af7c8300 test: Re-sort comments around do_test_streaming_scopes()
The test description of the refresh test is very elaborate and it's
worth having it as the description of the streaming scopes test itself.
Callers of the helper can go with smaller descriptions.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-10 10:00:09 +03:00
Pavel Emelyanov
5ce3597c25 test: Split do_load_sstables()
This helper does two things -- sorts sstables per server according to
scope in use and calls sstables_storage.restore(). The code looks better
if the sorting of sstables stays in a helper and the call for .restore()
is moved to the caller.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-10 10:00:09 +03:00
Pavel Emelyanov
8c1fb2b39a test: Drop load_fn argument from do_load_sstables()
Now all callers provide the sstables_storage argument and the load_fn is
effectively unused.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-10 09:59:08 +03:00
Pavel Emelyanov
59051ccc28 test: Re-use do_test_streaming_scopes() in refresh test
Now it's possible to replace the whole body of the
test_refresh_with_streaming_scopes() test by calling the corresponding
helper function from backup/restore test module. This helper does
exactly the same, and the SSTablesOnLocalStorage class provides the
necessary save/restore implementations.

One more thing to mention -- the refreshing test for some reason only
wants to run with restored min-tablet-count equal to the original one.
The do_test_streaming_scopes() needs to account for that, as it runs the
tests for more options.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-10 09:59:07 +03:00
Pavel Emelyanov
f6f1cb0391 test: Introduce SSTablesOnLocalStorage
This class implements some of the sstables manipulations performed by
test_refresh_with_streaming_scopes(). It's here to facilitate next patch
that will use it to call do_test_streaming_scopes() helper.

This patch moves two blocks of code out of the test into this new class.

The shutil.rmtree(tmpbackup) is seemingly lost, but it really isn't --
the tmpbackup variable holds a name of a _subdir_ inside servers'
workdirs. This path doesn't really exist on disk on its own, so removing
it is a no-op.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-10 09:58:40 +03:00
Pavel Emelyanov
dae4da1810 test: Introduce SSTablesOnObjectStorage
The class in question performs two operations for
do_test_streaming_scopes(): it saves sstables and restores them. The
current caller of the helper is the test_restore_with_streaming_scopes()
test, which needs to back up sstables on object storage and restore them
from there with the restoration API. The SSTablesOnObjectStorage class
does exactly that.

The change in do_load_sstables() that checks for sstables_storage being
non-None is needed to keep test_refresh_with_streaming_scopes() working --
that test doesn't provide sstables_storage (yet), in which case the
function will call the load_fn callback. The next patch will eliminate it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-10 09:58:39 +03:00
Pavel Emelyanov
5a033dea47 test: Move test_restore_with_streaming_scopes() into do_test_streaming_scopes()
The body of this test is duplicated by the
test_refresh_with_streaming_scopes() test from another module. Keeping it
in a non-test top-level function will help generalize these two tests.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-10 09:57:53 +03:00
Nadav Har'El
d78ea3d498 test/cqlpy: mark test_unbuilt_index_not_used not strictly xfail
A few days ago, in commit 7b30a3981b we added to pytest.ini the option
xfail_strict. This option causes every time a test XPASSes, i.e., an
xfail test actually passes - to be considered an error and fail the test.

But some tests demonstrate a timing-related bug and do not reproduce the
bug every single time. An example we noticed in one CI run is:

    test/cqlpy/test_secondary_index.py::test_unbuilt_index_not_used

This test reproduces a timing-related bug (if you read from a secondary
index "too quickly" you can get wrong results), but only about 90% of the
time, not 100% of the time.

The solution is to add "strict=False" for the xfail marker on this
specific test. This undoes the xfail_strict for this specific test,
accepting that this specific test can either pass or fail. Note that
this does NOT make this test worthless - we still see this test failing
most of the time, and when a developer finally fixes this issue, the
test will begin to pass all the time.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-956
(we'll probably need to follow up this fix with the same fix for other
xfail tests that can sometime pass).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#28942
2026-03-09 22:48:20 +02:00
Avi Kivity
01ddc17ab9 Merge 'mv: allow skipping view updates when a collection is unmodified' from Wojciech Mitros
When we generate view updates, we check whether we can skip the
entire view update if all columns selected by the view are unmodified.
However, for collection columns, we only check if they were unset
before and after the update.
In this patch we add a check for the actual collection contents.
We perform this check for both virtual and non-virtual selections.
When the column is only a virtual column in the view, it would be
enough to check the liveness of each collection cell, however for
that we'd need to deserialize the entire collection anyway, which
should be effectively as expensive as comparing all of its bytes.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-808

Closes scylladb/scylladb#28839

* github.com:scylladb/scylladb:
  mv: allow skipping view updates when a collection is unmodified
  mv: allow skipping view updates if an empty collection remains unset
2026-03-09 22:46:01 +02:00
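The old and new collection checks described above can be contrasted with a small sketch; collections are modelled here as plain tuples standing in for serialized bytes, and the function names are illustrative:

```python
def collection_unset_check(col, before, after):
    """Old collection check: compare only unset-before vs unset-after,
    so identical non-empty contents still forced a view update."""
    return col not in before and col not in after

def can_skip_view_update(view_columns, before, after):
    """New check: compare the actual (serialized) collection contents
    for every column the view selects."""
    return all(before.get(c) == after.get(c) for c in view_columns)
```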
Botond Dénes
13ff9c4394 db,compaction: use utils::chunked_vector for cache invalidation ranges
Instead of dht::partition_ranges_vector, which is an std::vector<> and
has been seen to cause large allocations when calculating the ranges to
be invalidated after compaction:

    seastar_memory - oversized allocation: 147456 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at
    [Backtrace #0]
    void seastar::backtrace<seastar::current_backtrace_tasklocal()::$_0>(seastar::current_backtrace_tasklocal()::$_0&&, bool) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:89
     (inlined by) seastar::current_backtrace_tasklocal() at ./build/release/seastar/./seastar/src/util/backtrace.cc:99
    seastar::current_tasktrace() at ./build/release/seastar/./seastar/src/util/backtrace.cc:136
    seastar::current_backtrace() at ./build/release/seastar/./seastar/src/util/backtrace.cc:169
    seastar::memory::cpu_pages::warn_large_allocation(unsigned long) at ./build/release/seastar/./seastar/src/core/memory.cc:840
    seastar::memory::cpu_pages::check_large_allocation(unsigned long) at ./build/release/seastar/./seastar/src/core/memory.cc:903
     (inlined by) seastar::memory::cpu_pages::allocate_large(unsigned int, bool) at ./build/release/seastar/./seastar/src/core/memory.cc:910
     (inlined by) seastar::memory::allocate_large(unsigned long, bool) at ./build/release/seastar/./seastar/src/core/memory.cc:1533
     (inlined by) seastar::memory::allocate_slowpath(unsigned long) at ./build/release/seastar/./seastar/src/core/memory.cc:1679
    seastar::memory::allocate(unsigned long) at ././seastar/src/core/memory.cc:1698
     (inlined by) operator new(unsigned long) at ././seastar/src/core/memory.cc:2440
     (inlined by) std::__new_allocator<interval<dht::ring_position>>::allocate(unsigned long, void const*) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/new_allocator.h:151
     (inlined by) std::allocator<interval<dht::ring_position>>::allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/allocator.h:203
     (inlined by) std::allocator_traits<std::allocator<interval<dht::ring_position>>>::allocate(std::allocator<interval<dht::ring_position>>&, unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/alloc_traits.h:614
     (inlined by) std::_Vector_base<interval<dht::ring_position>, std::allocator<interval<dht::ring_position>>>::_M_allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/stl_vector.h:387
     (inlined by) std::vector<interval<dht::ring_position>, std::allocator<interval<dht::ring_position>>>::reserve(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/vector.tcc:79
    dht::to_partition_ranges(utils::chunked_vector<interval<dht::token>, 131072ul> const&, seastar::bool_class<utils::can_yield_tag>) at ./dht/i_partitioner.cc:347
    compaction::compaction::get_ranges_for_invalidation(std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable>>> const&) at ./compaction/compaction.cc:619
     (inlined by) compaction::compaction::get_compaction_completion_desc(std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable>>>, std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable>>>) at ./compaction/compaction.cc:719
     (inlined by) compaction::regular_compaction::replace_remaining_exhausted_sstables() at ./compaction/compaction.cc:1362
    compaction::compaction::finish(std::chrono::time_point<db_clock, std::chrono::duration<long, std::ratio<1l, 1000l>>>, std::chrono::time_point<db_clock, std::chrono::duration<long, std::ratio<1l, 1000l>>>) at ./compaction/compaction.cc:1021
    compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0::operator()() at ./compaction/compaction.cc:1960
     (inlined by) compaction::compaction_result std::__invoke_impl<compaction::compaction_result, compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(std::__invoke_other, compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/invoke.h:63
     (inlined by) std::__invoke_result<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>::type std::__invoke<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/invoke.h:98
     (inlined by) decltype(auto) std::__apply_impl<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0, std::tuple<>>(compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&, std::tuple<>&&, std::integer_sequence<unsigned long, ...>) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/tuple:2920
     (inlined by) decltype(auto) std::apply<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0, std::tuple<>>(compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&, std::tuple<>&&) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/tuple:2935
     (inlined by) seastar::future<compaction::compaction_result> seastar::futurize<compaction::compaction_result>::apply<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&, std::tuple<>&&) at ././seastar/include/seastar/core/future.hh:1930
     (inlined by) seastar::futurize<std::invoke_result<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>::type>::type seastar::async<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(seastar::thread_attributes, compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&)::'lambda'()::operator()() const at ././seastar/include/seastar/core/thread.hh:267
     (inlined by) seastar::noncopyable_function<void ()>::direct_vtable_for<seastar::futurize<std::invoke_result<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>::type>::type seastar::async<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(seastar::thread_attributes, compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&)::'lambda'()>::call(seastar::noncopyable_function<void ()> const*) at ././seastar/include/seastar/util/noncopyable_function.hh:138
    seastar::noncopyable_function<void ()>::operator()() const at ./build/release/seastar/./seastar/include/seastar/util/noncopyable_function.hh:224
     (inlined by) seastar::thread_context::main() at ./build/release/seastar/./seastar/src/core/thread.cc:318

dht::partition_ranges_vector is used on the hot path, so just convert
the problematic user -- cache invalidation -- to use
utils::chunked_vector<dht::partition_range> instead.

Fixes: SCYLLADB-121

Closes scylladb/scylladb#28855
2026-03-09 22:04:54 +02:00
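The idea behind utils::chunked_vector can be sketched in Python: storage is a list of bounded chunks, so growth never requires one contiguous allocation proportional to the total size. The chunk size below is arbitrary, not the real constant:

```python
class ChunkedVector:
    """Sketch of utils::chunked_vector: storage is a list of bounded
    chunks, so appending never needs one large contiguous allocation."""
    CHUNK = 1024  # elements per chunk; the real constant differs

    def __init__(self):
        self._chunks = [[]]

    def push_back(self, x):
        if len(self._chunks[-1]) == self.CHUNK:
            self._chunks.append([])  # one more small allocation
        self._chunks[-1].append(x)

    def __len__(self):
        return (len(self._chunks) - 1) * self.CHUNK + len(self._chunks[-1])

    def __getitem__(self, i):
        return self._chunks[i // self.CHUNK][i % self.CHUNK]
```

In C++ this avoids the oversized-allocation warning in the backtrace above, at the cost of one extra indirection per element access.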
Andrei Chekun
8acba40c84 test.py: fix unawaited ScyllaLogFile.grep() coroutines
Fixed several places where ScyllaLogFile.grep() was called without
await, resulting in checking coroutine objects for truthiness instead
of actual log matches.

Fixes: SCYLLADB-903
2026-03-09 19:41:07 +01:00
Andrei Chekun
224a11be65 tests: fix test_group0_recovers_after_partial_command_application
Because the log grep was not awaited, this issue was masked.
After adding await for the log grep, the test started to fail. This PR
fixes the test.
2026-03-09 19:41:07 +01:00
Marcin Maliszkiewicz
96a2b0e634 test: add tests for global group0_batch barrier feature
Runtime: 16s in dev mode
2026-03-09 15:15:59 +01:00
Marcin Maliszkiewicz
6723ced684 qos: switch service levels write paths to use global group0_batch barrier
This ensures that we return from these functions only after
all live nodes have applied our mutations.
2026-03-09 15:15:59 +01:00
Marcin Maliszkiewicz
fe79fdf090 auth: switch write paths to use global group0_batch barrier
This ensures that we return from these functions only after
all live nodes have applied our mutations.
2026-03-09 15:15:59 +01:00
Marcin Maliszkiewicz
4c8681a927 raft: add function to broadcast read barrier request
This function ensures that all alive nodes executed
read barrier.

It will be useful for the following commits, which eventually
delay returning a response to the user until the mutations are applied
on other nodes, so that the user may perceive better data consistency
across nodes.
2026-03-09 15:15:59 +01:00
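The broadcast described above can be sketched with asyncio; the RPC itself is a hypothetical placeholder for the real raft read barrier call:

```python
import asyncio

async def read_barrier_rpc(node):
    """Hypothetical RPC: ask `node` to execute a raft read barrier,
    i.e. wait until it has applied all commands committed so far."""
    await asyncio.sleep(0)  # stands in for the network round trip
    return node

async def broadcast_read_barrier(live_nodes):
    # Return only once every live node has executed the barrier, so a
    # subsequent read on any of them observes our mutations.
    return await asyncio.gather(*(read_barrier_rpc(n) for n in live_nodes))
```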
Marcin Maliszkiewicz
cbae84a926 raft: add gossiper dependency to raft_group0_client
In a following commit, raft_group0_client will send a read
barrier RPC to all alive nodes; it takes the list of nodes
from the gossiper.
2026-03-09 15:15:59 +01:00
Marcin Maliszkiewicz
8422fbca9f raft: add read barrier RPC
The RPC does read barrier on a destination node.

It will be issued in the following commits
to live nodes to ensure that a command was applied
everywhere.
2026-03-09 15:15:59 +01:00
Michał Chojnowski
ff60a5f1e5 cql3: suggest ALTER MATERIALIZED VIEW to users trying to use ALTER TABLE on a view
When a user tries to use ALTER TABLE on a materialized view,
the resulting error message is `Cannot use ALTER TABLE on Materialized View`.

The intention behind this error is that ALTER MATERIALIZED VIEW should
be used instead.

But we observed that some users interpret this error message as a general
"You cannot do any ALTER on this thing".

This patch enhances the error message (and others similar to it)
to prevent the confusion.

Closes scylladb/scylladb#28831
2026-03-09 15:07:21 +01:00
Botond Dénes
1e41db5948 Merge 'service: tasks: return successful status if a table was dropped' from Aleksandra Martyniuk
tablet_virtual_task::wait throws if a table on which a tablet operation
was working is dropped.

Treat the tablet operation as successful if a table is dropped.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-494

Needs backport to all live releases

Closes scylladb/scylladb#28933

* github.com:scylladb/scylladb:
  test: add test_tablet_repair_wait_with_table_drop
  service: tasks: return successful status if a table was dropped
2026-03-09 16:04:44 +02:00
Piotr Dulikowski
23ed0d4df8 Merge 'vector_search: fix TLS server name with IP' from Karol Nowacki
SNI works only with DNS hostnames. Adding an IP address causes warnings
on the server side.
This change adds SNI only if it is not an IP address.

This change has no unit tests, as this behavior is not critical,
since it causes a warning on the server side.
The critical part, that the server name is verified, is already covered.

This PR also adds warning logs to improve future troubleshooting of connections to the vector-store nodes.

Fixes: VECTOR-528

Backports to 2025.04 and 2026.01 are required, as these branches are also affected.

Closes scylladb/scylladb#28637

* github.com:scylladb/scylladb:
  vector_search: fix TLS server name with IP
  vector_search: add warn log for failed ann requests
2026-03-09 15:03:22 +01:00
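The hostname-vs-IP decision above can be sketched with the standard ipaddress module; the function name is illustrative, not the actual ScyllaDB code:

```python
import ipaddress

def tls_server_name(host):
    """Return the SNI server_name to send, or None when `host` is an IP
    address literal (SNI is defined only for DNS hostnames)."""
    try:
        ipaddress.ip_address(host)
        return None   # literal IPv4/IPv6: skip SNI, avoid server warnings
    except ValueError:
        return host   # DNS name: use it for SNI and certificate checks
```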
Asias He
e0483f6001 test: Fix coordinator assumption in do_test_tablet_incremental_repair_merge_error
The first node in the cluster is not guaranteed to be the coordinator
node. Hardcoding node 0 as the coordinator causes test flakiness.  This
patch dynamically finds the actual coordinator node and targets it for
error injection, log checking, and restarts.

Additionally, inject `tablet_force_tablet_count_decrease_once` across
all servers to force the tablet merge process to trigger once.

Fixes SCYLLADB-865

Closes scylladb/scylladb#28945
2026-03-09 15:27:45 +02:00
Marcin Maliszkiewicz
b6a7484520 docs: note eventual visibility of auth changes
Mention that role and permission changes are durable but may
not be immediately visible on other nodes due to asynchronous
replication.

Fixes: SCYLLADB-651

Closes scylladb/scylladb#28900
2026-03-09 14:07:10 +01:00
Piotr Dulikowski
42d70baad3 db: view: mutate_MV: don't hold keyspace ref across preemption
Currently, the view_update_generator::mutate_MV function acquires a
reference to the keyspace relevant to the operation, then it calls
max_concurrent_for_each and uses that reference inside the lambda passed
to that function. max_concurrent_for_each can preempt and there is no
mechanism that makes sure that the keyspace is alive until the view
updates are generated, so it is possible that the keyspace is freed by
the time the reference is used.

Fix the issue by precomputing the necessary information based on the
keyspace reference right away, and then passing that information by
value to the other parts of the code. It turns out that we only need to
know whether the keyspace uses tablets and whether it uses a network
topology strategy.
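The pattern can be sketched in Python (an illustration only; the real code is C++, and `KeyspaceFacts` is a made-up name): snapshot the two needed facts by value before any preemption point, then drop the keyspace reference:

```python
import asyncio
from dataclasses import dataclass

@dataclass(frozen=True)
class KeyspaceFacts:
    # The two facts the update path actually needs, copied by value.
    uses_tablets: bool
    network_topology: bool

async def generate_view_updates(facts: KeyspaceFacts, updates):
    out = []
    for u in updates:
        await asyncio.sleep(0)  # preemption point: keyspace may be freed here
        out.append((u, facts.uses_tablets, facts.network_topology))
    return out

async def main():
    keyspace = {"uses_tablets": True, "network_topology": False}
    # Snapshot before any await, then drop the reference.
    facts = KeyspaceFacts(keyspace["uses_tablets"], keyspace["network_topology"])
    del keyspace
    return await generate_view_updates(facts, ["u1", "u2"])

result = asyncio.run(main())
assert result == [("u1", True, False), ("u2", True, False)]
```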

Fixes: scylladb/scylladb#28925

Closes scylladb/scylladb#28928
2026-03-09 15:04:26 +02:00
Łukasz Paszkowski
826fd5d6c3 test/storage: harden out-of-space prevention tests around restart and disk-utilization transitions
The tests in test_out_of_space_prevention.py are flaky. Three issues contribute:

1. After creating/removing the blob file that simulates disk pressure,
   the tests immediately checked derived state (e.g., "compaction_manager
   - Drained") without first confirming the disk space monitor had detected
   the utilization change. Fix: explicitly wait for "Reached/Dropped below
   critical disk utilization level" right after creating/removing the blob
   file, before checking downstream effects.

2. Several tests called `manager.driver_connect()` or omitted reconnection
   entirely after `server_restart()` / `server_start()`. The pre-existing
   driver session can silently reconnect multiple times, causing subsequent
   CQL queries to fail. Fix: call `reconnect_driver()` after every node restart.
   Additionally, call `wait_for_cql_and_get_hosts()` where CQL is used afterward,
   to ensure all connection pools are established.

3. Some log assertions used marks captured before a restart, so they could
   match pre-restart messages or miss messages emitted in the correct post-restart
   window. Fix: refresh marks at the right points.

Apart from that, the patch fixes a typo: autotoogle -> autotoggle.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-655

Closes scylladb/scylladb#28626
2026-03-09 14:45:09 +02:00
Calle Wilund
ef795eda5b test_encryption: Fix test_system_auth_encryption
Fixes: SCYLLADB-915

The test was quite broken: it was not waiting for coroutines, and a
bunch of its checks were no longer even close to valid (this is a
ported dtest, and not a very good one).

Closes scylladb/scylladb#28887
2026-03-09 14:38:31 +02:00
Marcin Maliszkiewicz
f177259316 Merge 'vector_search: small improvements' from Karol Nowacki
vector_search: small improvements

This PR addresses several minor code quality issues and style inconsistencies within the vector_search module.

No backport is needed as these improvements are not visible to the end user.

Closes scylladb/scylladb#28718

* github.com:scylladb/scylladb:
  vector_search: fix names of private members
  vector_search: remove unused global variable
2026-03-09 11:42:35 +01:00
Botond Dénes
6bba4f7ca1 Merge 'test: cluster: util: sleep for 0.01s between writes in do_writes' from Patryk Jędrzejczak
Tests use `start_writes` as a simple write workload to test that writes
succeed when they should (e.g., there is no availability loss), but not to
test performance. There is no reason to overload the CPU, which can lead to
test failures.

I suspect this function to be the cause of SCYLLADB-929, where the failures
of `test_raft_recovery_user_data` (that creates multiple write workloads
with `start_writes`) indicated that the machine was overloaded.
The relevant observations:
- two runs failed at the same time in debug mode,
- there were many reactor stalls and RPC timeouts in the logs (leading to
  unexpected events like servers marking each other down and group0
  leader changes).

I didn't prove that `start_writes` really caused this, but adding this sleep
should be a good change, even if I'm wrong.

The number of writes performed by the test decreases 30-50 times with the
sleep.

Note that some other util functions like `start_writes_to_cdc_table` have
such a sleep.
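A minimal sketch of the throttled write loop, assuming an asyncio-style test helper (names and the placeholder statement are illustrative, not the actual `do_writes` signature):

```python
import asyncio

async def do_writes(execute, stop, delay=0.01):
    """Issue writes in a loop, yielding ~10ms between iterations so the
    workload exercises availability without saturating the CPU."""
    count = 0
    while not stop.is_set():
        await execute(f"INSERT ... VALUES ({count})")  # placeholder statement
        count += 1
        await asyncio.sleep(delay)  # the fix: throttle instead of busy-looping
    return count

async def main():
    stop = asyncio.Event()
    async def fake_execute(stmt):  # stand-in for cql.run_async
        pass
    task = asyncio.create_task(do_writes(fake_execute, stop, delay=0.001))
    await asyncio.sleep(0.05)
    stop.set()
    return await task

n = asyncio.run(main())
assert 1 <= n < 1000  # throttled: far fewer writes than a busy loop would issue
```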

This PR also contains some minor updates to `test_raft_recovery_user_data`.

Fixes SCYLLADB-929

No backport:
- the failures were observed only in master CI,
- no proof that the change fixes the issue, so backports could be a waste
  of time.

Closes scylladb/scylladb#28917

* github.com:scylladb/scylladb:
  test: test_raft_recovery_user_data: replace asyncio.gather with gather_safely
  test: test_raft_recovery_user_data: use the exclude_node API
  test: test_raft_recovery_user_data: drop tablet_load_stats_cfg
  test: cluster: util: sleep for 0.01s between writes in do_writes
2026-03-09 12:12:04 +02:00
Nadav Har'El
47e8206482 test/alternator: test, and document, Alternator's data encoding
This patch adds a test file, test/alternator/test_encoding.py, testing
how Alternator stores its data in the underlying CQL database. We test
how tables are named, and how attributes of different types are encoded
into CQL.

The test, which begins with a long comment, also doubles as developer-
oriented *documention* on how Alternator encodes its data in the CQL
database. This documentation is not intended for end-users - we do
not want to officially support reading or writing Alternator tables
through CQL. But once in a while, this information can come in handy
for developers.

More importantly, this test will also serve as a regression test,
verifying that Alternator's encoding doesn't change unintentionally.
If we make an unintentional change to the way that Alternator stores
its data, this can break upgrades: The new code might not be able to
read or write the old table with its old encoding. So it's important
to make sure we never make such unintentional changes to the encoding
of Alternator's data. If we ever do make *intentional* changes to
Alternator's data encoding, we will need to fix the relevant test;
But also not forget to make sure that the new code is able to read
the old encoding as well.

The new tests use both "dynamodb" (Alternator) and "cql" fixtures,
to test how CQL sees the Alternator tables. So naturally these tests
are marked "scylla_only" and skipped on DynamoDB.

Fixes #19770.

Closes scylladb/scylladb#28866
2026-03-09 10:50:09 +01:00
Andrzej Jackowski
6fb5ab78eb db/config: move guardrails config to one place and reorder
The motivations for this patch are as follows:
 - Guardrails should follow similar conventions, e.g. for config names,
   metrics names, testing. Keeping guardrails together makes it easier
   to find and compare existing guardrails when new guardrails are
   implemented.
 - The configuration is used to auto-generate the documentation
   (particularly, the `configuration-parameters` page). Currently,
   the order of parameters in the documentation is inconsistent (e.g.
   `minimum_replication_factor_fail_threshold` before
   `minimum_replication_factor_warn_threshold` but
   `maximum_replication_factor_fail_threshold` after
   `maximum_replication_factor_warn_threshold`), which can be confusing
   to customers.

Fixes: SCYLLADB-256

Closes scylladb/scylladb#28932
2026-03-09 10:50:00 +01:00
Patryk Jędrzejczak
46b7170347 Merge 'test/pylib: centralize timeout scaling and propagate build_mode in LWT helpers' from Alex Dathskovsky
This series improves timeout handling consistency across the test framework and makes build-mode effects explicit in LWT tests, starting with an LWT test that got flaky.

1. Centralize timeout scaling
Introduce a scale_timeout(timeout) fixture in runner.py to provide a single, consistent mechanism for scaling test timeouts based on build mode.
Previously, timeout adjustments were done in an ad-hoc manner across different helpers and tests. Centralizing the logic:
- Ensures consistent behavior across the test suite
- Simplifies maintenance and reasoning about timeout behavior
- Reduces duplication and per-test scaling logic
This becomes increasingly important as tests run on heterogeneous hardware configurations, where different build modes (especially debug) can significantly impact execution time.
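A hedged sketch of what such centralized scaling might look like (the scaling factors and the `make_scale_timeout` factory name are assumptions, not the actual runner.py implementation):

```python
# Hypothetical per-mode factors; the real fixture may use different values.
MODE_FACTOR = {"release": 1.0, "dev": 2.0, "debug": 5.0}

def make_scale_timeout(build_mode: str):
    """Factory behind a scale_timeout pytest fixture: returns a callable
    that scales a base timeout by the current build mode."""
    factor = MODE_FACTOR.get(build_mode, 1.0)
    def scale_timeout(timeout: float) -> float:
        return timeout * factor
    return scale_timeout

scale = make_scale_timeout("debug")
assert scale(30) == 150.0  # a 30s default becomes 150s under debug
assert make_scale_timeout("release")(30) == 30.0
```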

2. Make scale_timeout explicit in LWT helpers
Propagate scale_timeout explicitly through BaseLWTTester and Worker, validating it at construction time instead of relying on implicit pytest fixture injection inside helper classes.
Additionally:
- Update wait_for_phase_ops() and wait_for_tablet_count() to use scale_timeout_by_mode() for consistent polling behavior across modes
- Update all LWT test call sites to pass build_mode explicitly
- Increase default timeout values, as the previous defaults were too short and prone to flakiness, particularly under slower configurations such as debug builds
Overall, this series improves determinism, reduces flakiness, and makes the interaction between build mode and test timing explicit and maintainable.

backport: not required, just an enhancement for the test.py infra

Closes scylladb/scylladb#28840

* https://github.com/scylladb/scylladb:
  test/auth_cluster: align service-level timeout expectations with scaled config
  test/lwt: propagate scale_timeout through LWT helpers; scale resize waits
    Pass scale_timeout explicitly through BaseLWTTester and Worker, validating it at construction time instead of relying on implicit pytest fixture injection inside helper classes. Update wait_for_phase_ops() and wait_for_tablet_count() to use scale_timeout_by_mode() so polling behavior remains consistent across build modes. Adjust LWT test call sites to pass scale_timeout explicitly. Increase default timeout values, as the previous defaults were too short and prone to flakiness under slower configurations (notably debug/dev builds).
  test/pylib: introduce scale_timeout fixture helper
2026-03-09 10:28:19 +01:00
Patryk Jędrzejczak
4c8dba15f1 Merge 'strong_consistency/state_machine: ensure and upgrade mutations schema' from Michał Jadwiszczak
This patch fixes 2 issues within strong consistency state machine:
- it might happen that apply is called before the schema is delivered to the node
- on the other hand, the apply may be called after the schema was changed and purged from the schema registry

The first problem is fixed by doing `group0.read_barrier()` before applying the mutations.
The second one is solved by upgrading the mutations using column mappings in case the version of the mutations' schema is older.

Fixes SCYLLADB-428

Strong consistency is in experimental phase, no need to backport.

Closes scylladb/scylladb#28546

* https://github.com/scylladb/scylladb:
  test/cluster/test_strong_consistency: add reproducer for old schema during apply
  test/cluster/test_strong_consistency: add reproducer for missing schema during apply
  test/cluster/test_strong_consistency: extract common function
  raft_group_registry: allow to drop append entries requests for specific raft group
  strong_consistency/state_machine: find and hold schemas of applying mutations
  strong_consistency/state_machine: pull necessary dependencies
  db/schema_tables: add `get_column_mapping_if_exists()`
2026-03-09 09:49:22 +01:00
Marcin Maliszkiewicz
4150c62f29 Merge 'test_proxy_protocol: fix flaky system.clients visibility checks' from Piotr Smaron
`test_proxy_protocol_port_preserved_in_system_clients` failed because it
didn't see the just created connection in system.clients immediately. The
last lines of the stacktrace are:
```
            # Complete CQL handshake
            await do_cql_handshake(reader, writer)

            # Now query system.clients using the driver to see our connection
            cql = manager.get_cql()
            rows = list(cql.execute(
                f"SELECT address, port FROM system.clients WHERE address = '{fake_src_addr}' ALLOW FILTERING"
            ))

            # We should find our connection with the fake source address and port
>           assert len(rows) > 0, f"Expected to find connection from {fake_src_addr} in system.clients"
E           AssertionError: Expected to find connection from 203.0.113.200 in system.clients
E           assert 0 > 0
E            +  where 0 = len([])
```
Explanation: we first await for the hand-made connection to be completed,
then, via another connection, we're querying system.clients, and we don't
get this hand-made connection in the resultset.
The solution is to replace the bare cql.execute() calls with await
wait_for_results(), a helper that polls via cql.run_async() until the
expected row count is reached (30 s timeout, 100 ms period).
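A sketch of such a polling helper (illustrative only; the real `wait_for_results` signature in the test may differ):

```python
import asyncio
import time

async def wait_for_results(run_async, query, expected_rows,
                           timeout=30.0, period=0.1):
    """Poll `query` until it returns at least `expected_rows` rows,
    mirroring the helper described above (30s timeout, 100ms period)."""
    deadline = time.monotonic() + timeout
    while True:
        rows = await run_async(query)
        if len(rows) >= expected_rows:
            return rows
        if time.monotonic() > deadline:
            raise TimeoutError(f"expected {expected_rows} rows for: {query}")
        await asyncio.sleep(period)

# Simulate a row that becomes visible only on the third poll.
polls = {"n": 0}
async def fake_run_async(query):
    polls["n"] += 1
    return [("203.0.113.200", 54321)] if polls["n"] >= 3 else []

rows = asyncio.run(wait_for_results(fake_run_async, "SELECT ...", 1, period=0.01))
assert rows == [("203.0.113.200", 54321)]
assert polls["n"] == 3
```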

Fixes: SCYLLADB-819

The flaky test is present on master and in previous release, so backporting only there.

Closes scylladb/scylladb#28849

* github.com:scylladb/scylladb:
  test_proxy_protocol: introduce extra logging to aid debugging
  test_proxy_protocol: fix flaky system.clients visibility checks
2026-03-09 08:37:57 +01:00
Yaron Kaikov
977bdd6260 .github/workflows/trigger-scylla-ci: fix heredoc injection in trigger-scylla-ci workflow
Move all ${{ }} expression interpolations into env: blocks so they are
passed as environment variables instead of being expanded directly into
shell scripts. This prevents an attacker from escaping the heredoc in
the Validate Comment Trigger step and executing arbitrary commands on
the runner.

The Verify Org Membership step is hardened in the same way for
defense-in-depth.

Refs: GHSA-9pmq-v59g-8fxp
Fixes: SCYLLADB-954

Closes scylladb/scylladb#28935
2026-03-08 21:34:51 +02:00
Wojciech Mitros
0008976e2f mv: allow skipping view updates when a collection is unmodified
When we generate view updates, we check whether we can skip the
entire view update if all columns selected by the view are unmodified.
However, for collection columns, we only check if they were unset
before and after the update.
In this patch we add a check for the actual collection contents.
We perform this check for both virtual and non-virtual selections.
When the column is only a virtual column in the view, it would be
enough to check the liveness of each collection cell; however, for
that we'd need to deserialize the entire collection anyway, which
should be effectively as expensive as comparing all of its bytes.
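The byte-comparison idea reduces to a simple check, sketched here in Python (the real code operates on serialized C++ cell data; this is only an analogy):

```python
from typing import Optional

def can_skip_view_update(before: Optional[bytes], after: Optional[bytes]) -> bool:
    """Simplified decision: skip when the serialized collection is
    byte-identical (or absent) both before and after the base update."""
    return before == after

assert can_skip_view_update(None, None)            # unset before and after
assert can_skip_view_update(b"\x01ab", b"\x01ab")  # identical contents
assert not can_skip_view_update(b"\x01ab", b"\x01ac")
assert not can_skip_view_update(None, b"\x01ab")   # collection appeared
```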

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-808
2026-03-08 16:23:22 +01:00
Wojciech Mitros
7d1e0a2e4d mv: allow skipping view updates if an empty collection remains unset
Currently, when we generate view updates, we skip the view update if
all columns selected by the view are unchanged in the base table update.
However, this does not apply to collection columns - if the base table
has a regular collection column, we never allow skipping view update
generation, and the reason for that is simply a missing implementation.
We can easily relax this for the case where the collection was missing
before and after the update - in this commit we move the check for
collections after the check for missing cells.
2026-03-08 16:22:27 +01:00
Artsiom Mishuta
fda68811e8 test.py: fix strict-config argument.
The ini-level strict_config was removed/never existed as a config key in pytest 8 — it's only a command-line flag (and is back in pytest 9).
In pytest 8.3.5, the equivalent is the --strict-config CLI flag, not an ini option.

Fixes SCYLLADB-955

Closes scylladb/scylladb#28939
2026-03-08 16:09:29 +02:00
Taras Veretilnyk
739dd59ebc docs: document components_digests subcomponent and trailing digest in Scylla.db
Document the new `components_digests` subcomponent (tag 12) added to the
Scylla.db metadata component, which stores CRC32 digests of all checksummed
SSTable component files. Also document the trailing CRC32 digest that
stores the digest of the scylla metadata itself.
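Computing such a per-component CRC32 digest can be illustrated in Python (component names here are illustrative; the on-disk encoding under tag 12 is not shown):

```python
import zlib

def crc32_digest(data: bytes) -> int:
    """CRC32 over a component's bytes, as an unsigned 32-bit value."""
    return zlib.crc32(data) & 0xFFFFFFFF

# Per-component digests as they might be collected for the metadata,
# keyed by component name (names are illustrative).
components = {"Statistics.db": b"stats-bytes", "Index.db": b"index-bytes"}
digests = {name: crc32_digest(body) for name, body in components.items()}

assert digests["Statistics.db"] == crc32_digest(b"stats-bytes")
assert all(0 <= d <= 0xFFFFFFFF for d in digests.values())
```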
2026-03-06 21:58:15 +01:00
Taras Veretilnyk
2b1c37396a sstable_compaction_test: Add tests for perform_component_rewrite
Add two test cases to verify the correctness of the perform_component_rewrite
functionality:
- test_perform_component_rewrite_single_sstable: Tests rewriting the Statistics
  component of a single sstable
- test_perform_component_rewrite_multiple_sstables: Tests rewriting 5 out of 10
  sstables
2026-03-06 21:58:15 +01:00
Taras Veretilnyk
591d13e942 sstable_test: add verification test cases for SSTable component digest persistence
Adds a generic test helper that writes a random SSTable, reloads it, and
verifies that the persisted CRC32 digest for each component matches the
digest computed from disk. This covers all checksummed components.
2026-03-06 21:58:15 +01:00
Taras Veretilnyk
54af4a26ca sstables: store digest of all sstable components in scylla metadata
This change replaces plain file_writer with crc32_digest_file_writer
for all SSTable components that should be checksummed. The resulting component
digests are stored in scylla metadata component.

This also extends the new component rewrite mechanism to rewrite the
metadata with the updated digest together with the component.
2026-03-06 21:58:10 +01:00
Dawid Mędrek
5feed00caa Merge 'raft: read_barrier: update local commit_idx to read_idx when it's safe' from Patryk Jędrzejczak
When the local entry with `read_idx` belongs to the current term, it's
safe to update the local `commit_idx` to `read_idx`.

The motivation for this change is to speed up read barriers. `wait_for_apply`
executed at the end of `read_barrier` is delayed until the follower learns
that the entry with `read_idx` is committed. It usually happens quickly in
the `read_quorum` message. However, non-voters don't receive this message,
so they have to wait for `append_entries`. If no new entries are being
added, `append_entries` can come only from `fsm::tick_leader()`. For group0,
this happens once every 100ms.
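The rule can be sketched as follows (an illustration of the safety condition only, not the actual Raft implementation):

```python
def maybe_update_commit_idx_for_read(log_terms, current_term,
                                     commit_idx, read_idx):
    """If the local entry at read_idx carries the current term, it is
    safe to advance the local commit_idx to read_idx."""
    if read_idx < len(log_terms) and log_terms[read_idx] == current_term:
        return max(commit_idx, read_idx)
    return commit_idx

# log_terms[i] = term of the local entry at index i
log_terms = [1, 1, 2, 3, 3]
assert maybe_update_commit_idx_for_read(log_terms, 3, 2, 4) == 4  # term matches: safe
assert maybe_update_commit_idx_for_read(log_terms, 3, 2, 2) == 2  # older term: keep
assert maybe_update_commit_idx_for_read(log_terms, 3, 2, 9) == 2  # entry not local yet
```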

The issue above significantly slows down cluster setups in tests. Nodes
join group0 as non-voters, and then they are met with several read barriers
just after a write to group0. One example is `global_token_metadata_barrier`
in `write_both_read_new` performed just after `update_topology_state` in
`write_both_read_old`.

I tested the performance impact of this change with the following test:
```python
for _ in range(10):
    await manager.servers_add(3)
```
It consistently takes 44-45s with the change and 50-51s without the change
in dev mode.

No backport:
- non-critical performance improvement mostly relevant in tests,
- the change requires some soak time in master.

Closes scylladb/scylladb#28891

* github.com:scylladb/scylladb:
  raft: server: fix the repeating typo
  raft: clarify the comment about read_barrier_reply
  raft: read_barrier: update local commit_idx to read_idx when it's safe
  raft: log: clarify the specification of term_for
2026-03-06 18:50:08 +01:00
Aleksandra Martyniuk
40dca578c5 test: add test_tablet_repair_wait_with_table_drop 2026-03-06 15:08:29 +01:00
Piotr Smaron
f12e4ea42b test_proxy_protocol: introduce extra logging to aid debugging
In case of an error, we want to see the contents of the system.clients
table to have a better understanding of what happened - whether the
row(s) are really missing or maybe they are there, but 1 digit doesn't
match or the row is half-written.
We'll therefore query for the whole table on the CQL side, and then
filter out the rows we want to later proceed with on the python side.
This way we can dump the contents of the whole system.clients table if
something goes south.
2026-03-06 14:50:12 +01:00
Piotr Smaron
d8cf2c5f23 test_proxy_protocol: fix flaky system.clients visibility checks
`test_proxy_protocol_port_preserved_in_system_clients` failed because it
didn't see the just created connection in system.clients immediately. The
last lines of the stacktrace are:
```
            # Complete CQL handshake
            await do_cql_handshake(reader, writer)

            # Now query system.clients using the driver to see our connection
            cql = manager.get_cql()
            rows = list(cql.execute(
                f"SELECT address, port FROM system.clients WHERE address = '{fake_src_addr}' ALLOW FILTERING"
            ))

            # We should find our connection with the fake source address and port
>           assert len(rows) > 0, f"Expected to find connection from {fake_src_addr} in system.clients"
E           AssertionError: Expected to find connection from 203.0.113.200 in system.clients
E           assert 0 > 0
E            +  where 0 = len([])
```
Explanation: we first await for the hand-made connection to be completed,
then, via another connection, we're querying system.clients, and we don't
get this hand-made connection in the resultset.
The solution is to replace the bare cql.execute() calls with await
wait_for_results(), a helper that polls via cql.run_async() until the
expected row count is reached (30 s timeout, 100 ms period).

Fixes: SCYLLADB-819
2026-03-06 14:49:59 +01:00
Aleksandra Martyniuk
dd634c329f service: tasks: return successful status if a table was dropped
tablet_virtual_task::wait throws if a table on which a tablet operation
was working is dropped.

Treat the tablet operation as successful if a table is dropped.
2026-03-06 14:37:44 +01:00
Botond Dénes
4fdc0a5316 Merge 'Relax test's check_mutation_replicas() argument list' from Pavel Emelyanov
The function accepts a long list of arguments, some of which are not really needed. Also, some callers can be relaxed to not pass values for arguments that have defaults.

Improving tests, not backporting

Closes scylladb/scylladb#28861

* github.com:scylladb/scylladb:
  test: Remove passing default "expected_replicas" to check_mutation_replicas()
  test: Remove scope and primary-replica-only arguments from check_mutation_replicas() helper
2026-03-06 11:25:00 +02:00
Szymon Malewski
d817e56e87 vector_similarity_fcts.cc: fix strict aliasing violation in extract_float_vector
Previous code performed endian conversion by bulk-copying raw bytes
into a std::vector<float> and then iterating over it via a
reinterpret_cast<uint32_t*> pointer. Accessing float storage through a
uint32_t* violates C++ strict aliasing rules, giving the compiler
freedom to reorder or elide the stores, causing undefined behavior.

Replace the two-pass approach with a single-pass loop using
seastar::consume_be<uint32_t>() and std::bit_cast<float>(), which is
both well-defined and auto-vectorizable.
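A Python analogue of the single-pass, well-defined decoding (the actual fix uses seastar::consume_be and std::bit_cast in C++; struct.unpack plays the same role here without any type punning):

```python
import struct

def extract_float_vector(raw: bytes) -> list:
    """Decode a big-endian float32 vector in one pass; struct.unpack
    performs the byte-order conversion with well-defined semantics."""
    if len(raw) % 4:
        raise ValueError("length must be a multiple of 4")
    return list(struct.unpack(f">{len(raw) // 4}f", raw))

raw = struct.pack(">3f", 1.0, -2.5, 0.0)
assert extract_float_vector(raw) == [1.0, -2.5, 0.0]
```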

Follow-up #28754

Closes scylladb/scylladb#28912
2026-03-06 09:15:45 +01:00
Artsiom Mishuta
5d7a73cc5b test.py: add support for non_gating tests
Add support for non_gating tests (the opposite of gating in dtest
terminology) in the test.py codebase.

These tests will not (and should not) be run by any current gating job
(ci/next/nightly).

Closes scylladb/scylladb#28902
2026-03-06 09:39:32 +02:00
Andrei Chekun
01498a00d5 test.py: make HostRegistry singleton
HostRegistry is initialized in several places in the framework, which
can lead to overlapping IPs; even though the probability is low, it's
not zero. This PR makes the host registry initialize once for the
master thread and pytest. To avoid communication with workers, each
worker gets its own subnet to use solely for its own purposes. This
simplifies the solution while providing a way to avoid overlapping IPs.

Closes scylladb/scylladb#28520
2026-03-06 09:25:29 +02:00
Artsiom Mishuta
2be4d8074d test.py: disable XFAIL tests on CI run
This PR disables running XFAIL tests on the CI run to speed it up.

The tests will continue to run in the "nightly" job and FAIL on an
unexpected pass, and will continue to run in the "NEXT" job and NOT
FAIL on an unexpected pass.

Closes scylladb/scylladb#28886
2026-03-06 09:12:06 +02:00
Szymon Malewski
f9d213547f cql3: selection: fix add_column_for_post_processing for ORDER BY
The purpose of `add_column_for_post_processing` is to add columns that are required for processing a query,
but are not part of the SELECT clause and shouldn't be returned. They are added to the final result set, but later are not serialized.
Mainly it is used for filtering and grouping columns, with a special case of `WHERE primary_key IN ...  ORDER BY ...` when the whole result set needs additional final sorting,
and ordering columns must be added as well.
There was a bug that manifested in #9435, #8100 and was actually identified in #22061.
In case of a selection with processing (e.g. functions involved), a result set row is formed in two stages.
Initially it is a list of columns fetched from replicas - on which filtering and grouping is performed.
After that the actual selection is resolved and the final number of columns can change.
Ordering is performed on this final shape, but the ordering column index returned by `add_column_for_post_processing` referred to the initial shape.
If the selection referred to the same column twice (e.g. `v, TTL(v)` as in #9435), the final row was longer than the initial one and ordering referred to an incorrect column.
If a function in the selection referred to multiple columns (e.g. as_json(.., ..), which #8100 effectively uses), the final row was shorter
and ordering tried to use a non-existing column.

This patch fixes the problem by making sure that the column index into the final result set is used for ordering.
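The index mismatch can be illustrated with plain lists (column names here are made up for illustration and do not reflect Scylla's internal naming):

```python
# Initial shape fetched from replicas: one copy of each needed column.
initial = ["pk", "v"]            # ordering column "v" is at index 1

# Final shape after resolving the selection `v, TTL(v)` plus the
# post-processing ordering column: "v" appears twice, so positions
# shift relative to the initial shape.
final = ["v", "ttl_v", "order_v"]

def index_in(shape, name):
    return shape.index(name)

# The bug: an index computed against the initial shape, applied to the
# final shape, points at a different (or missing) column.
buggy_idx = index_in(initial, "v")          # 1 in the initial shape
assert final[buggy_idx] == "ttl_v"          # wrong column in the final shape

# The fix: compute the ordering index against the final result set.
fixed_idx = index_in(final, "order_v")
assert final[fixed_idx] == "order_v"
```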

The previously crashing test `cassandra_tests/validation/entities/json_test.py::testJsonOrdering` no longer has to be skipped, but it now fails due to issue #28467.

Fixes #9435
Fixes #8100
Fixes #22061

Closes scylladb/scylladb#28472
2026-03-05 19:22:34 +02:00
Patryk Jędrzejczak
c8c57850d9 test: test_raft_recovery_user_data: replace asyncio.gather with gather_safely 2026-03-05 17:13:52 +01:00
Patryk Jędrzejczak
c3aa4ed23c test: test_raft_recovery_user_data: use the exclude_node API
The API is now available.
2026-03-05 17:13:52 +01:00
Patryk Jędrzejczak
dd75687251 test: test_raft_recovery_user_data: drop tablet_load_stats_cfg
The issue has been fixed.
2026-03-05 17:13:52 +01:00
Patryk Jędrzejczak
52940c4f31 test: cluster: util: sleep for 0.01s between writes in do_writes
Tests use `start_writes` as a simple write workload to test that writes
succeed when they should (e.g., there is no availability loss), but not to
test performance. There is no reason to overload the CPU, which can lead to
test failures.

I suspect this function to be the cause of SCYLLADB-929, where the failures
of `test_raft_recovery_user_data` (that creates multiple write workloads
with `start_writes`) indicated that the machine was overloaded.
The relevant observations:
- two runs failed at the same time in debug mode,
- there were many reactor stalls and RPC timeouts in the logs (leading to
  unexpected events like servers marking each other down and group0
  leader changes).

I didn't prove that `start_writes` really caused this, but adding this sleep
should be a good change, even if I'm wrong.

The number of writes performed by the test decreases 30-50 times with the
sleep.

Note that some other util functions like `start_writes_to_cdc_table` have
such a sleep.

Fixes SCYLLADB-929
2026-03-05 17:13:40 +01:00
Calle Wilund
ab3d3d8638 build: add slirp4netns to dependencies
Needed for port-forwarded podman-in-podman containers

[avi:
  - move from Dockerfile to install-dependencies.sh so non-container
  builds also get it
  - regenerate frozen toolchain with optimized clang from

    https://devpkg.scylladb.com/clang/clang-21.1.8-Fedora-43-aarch64.tar.gz
    https://devpkg.scylladb.com/clang/clang-21.1.8-Fedora-43-x86_64.tar.gz
]

Closes scylladb/scylladb#28870
2026-03-05 17:44:17 +02:00
Michał Jadwiszczak
37bbbd3a27 test/cluster/test_strong_consistency: add reproducer for old schema during apply 2026-03-05 13:50:20 +01:00
Michał Jadwiszczak
6aef4d3541 test/cluster/test_strong_consistency: add reproducer for missing schema during apply 2026-03-05 13:50:16 +01:00
Michał Jadwiszczak
4795f5840f test/cluster/test_strong_consistency: extract common function 2026-03-05 13:47:43 +01:00
Michał Jadwiszczak
3548b7ad38 raft_group_registry: allow to drop append entries requests for specific raft group
Similar to `raft_drop_incoming_append_entries`, the new error injection
`raft_drop_incoming_append_entries_for_specified_group` skips the
handler for the `raft_append_entries` RPC, but allows specifying the id
of the raft group for which the requests should be dropped.

The id of the raft group should be passed in the error injection
parameters under the `value` key.
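A sketch of the drop decision (illustrative Python; the injection names match the commit, but the dict-based lookup is an assumption about how the handler consults its parameters):

```python
def should_drop_append_entries(active_injections: dict, group_id: str) -> bool:
    """Decide whether an incoming raft_append_entries RPC for group_id
    should be dropped, given the active injections and their parameters."""
    if "raft_drop_incoming_append_entries" in active_injections:
        return True  # unconditional variant drops for every group
    params = active_injections.get(
        "raft_drop_incoming_append_entries_for_specified_group")
    # The target group id is carried under the `value` parameter key.
    return params is not None and params.get("value") == group_id

inj = {"raft_drop_incoming_append_entries_for_specified_group": {"value": "g1"}}
assert should_drop_append_entries(inj, "g1")        # targeted group: dropped
assert not should_drop_append_entries(inj, "g2")    # other groups unaffected
assert should_drop_append_entries({"raft_drop_incoming_append_entries": {}}, "g2")
```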
2026-03-05 13:47:43 +01:00
Michał Jadwiszczak
b0cffb2e81 strong_consistency/state_machine: find and hold schemas of applying mutations
It might happen that a strong consistency command arrives at a node:
- before it knows about the schema
- after the schema was changed and the old version was removed from
  memory

To fix the first case, it's enough to perform a read barrier on group0.
In the second case, we can use the column mapping to upgrade the
mutation to the newer schema.

Also, we should hold pointers to schemas until we finish `_db.apply()`,
so the schema is valid for the whole time.
And potentially we should hold multiple pointers because commands passed
to `state_machine::apply()` may contain mutations to different schema
versions.

This commit relies on the fact that the tablet raft group and its state
machine are created only after the table is created locally on the node.

Fixes SCYLLADB-428
2026-03-05 13:47:40 +01:00
Tomasz Grabiec
b90fe19a42 Merge 'service: assert that tables updated via group0 use schema commitlog' from Aleksandra Martyniuk
Set enable_schema_commitlog for each group0 tables.

Assert that group0 tables use schema commitlog in ensure_group0_schema
(per each command).

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-914.

Needs backport to all live releases as all are vulnerable

Closes scylladb/scylladb#28876

* github.com:scylladb/scylladb:
  test: add test_group0_tables_use_schema_commitlog
  db: service: remove group0 tables from schema commitlog schema initializer
  service: ensure that tables updated via group0 use schema commitlog
  db: schema: remove set_is_group0_table param
2026-03-05 13:28:13 +01:00
Botond Dénes
509f2af8db Merge 'repair: Fix rwlock in compaction_state and lock holder lifecycle' from Raphael Raph Carvalho
Consider this:

- repair takes the lock holder
- tablet merge fiber destroys the compaction group and the compaction state
- repair fails
- repair destroys the lock holder

This is observed in the test:

```
repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036] Repair 1 out of 1 tablets: table=sec_index.users range=(432345564227567615,504403158265495551] replicas=[0e9d51a5-9c99-4d6e-b9db-ad36a148b0ea:15, 498e354c-1254-4d8d-a565-2f5c6523845a:9, 5208598c-84f0-4526-bb7f-573728592172:28]

...

repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036]: Started to repair 1 out of 1 tables in keyspace=sec_index, table=users, table_id=ea2072d0-ccd9-11f0-8dba-c5ab01bffb77, repair_reason=repair
repair - Enable incremental repair for table=sec_index.users range=(432345564227567615,504403158265495551]
table - Disabled compaction for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
table - Got unrepaired compaction and repair lock for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
table - Disabled compaction for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
table - Got unrepaired compaction and repair lock for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036]: get_sync_boundary: got error from node=0e9d51a5-9c99-4d6e-b9db-ad36a148b0ea, keyspace=sec_index, table=users, range=(432345564227567615,504403158265495551], error=seastar::rpc::remote_verb_error (Compaction state for table [0x60f008fa34c0] not found)
compaction_manager - Stopping 1 tasks for 1 ongoing compactions for table sec_index.users compaction_group=238 due to tablet merge
compaction_manager - Stopping 1 tasks for 1 ongoing compactions for table sec_index.users compaction_group=238 due to tablet merge

....

scylla[10793] Segmentation fault on shard 28, in scheduling group streaming
```

The rwlock in compaction_state could be destroyed before the lock holder
of the rwlock is destroyed. This causes a use-after-free when the lock
holder is destroyed.

To fix it, users of the repair lock will now be waited for when a
compaction group is being stopped.
That way, compaction group - which controls the lifetime of rwlock -
cannot be destroyed while the lock is held.
Additionally, the merge completion fiber - that might remove groups -
is properly serialized with incremental repair.
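The lifetime rule above can be sketched as a small asyncio model; class and method names here are hypothetical illustrations, not the actual Seastar/ScyllaDB code:

```python
import asyncio

class CompactionGroup:
    """Illustrative model: the object owning the rwlock may only be
    torn down once every repair-lock holder has finished."""
    def __init__(self):
        self._holders = 0
        self._idle = asyncio.Event()
        self._idle.set()

    async def hold_repair_lock(self, work):
        # Register as a lock holder for the duration of `work`.
        self._holders += 1
        self._idle.clear()
        try:
            await work()
        finally:
            self._holders -= 1
            if self._holders == 0:
                self._idle.set()

    async def stop(self):
        # Wait for all repair-lock users before the group (and the
        # rwlock it owns) can be destroyed.
        await self._idle.wait()
```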

The issue can be reproduced consistently using a sanitizer build and cannot
be reproduced after the fix.

Fixes #27365

Closes scylladb/scylladb#28823

* github.com:scylladb/scylladb:
  repair: Fix rwlock in compaction_state and lock holder lifecycle
  repair: Prevent repair lock holder leakage after table drop
2026-03-05 14:18:25 +02:00
Patryk Jędrzejczak
f1978d8a22 raft: server: fix the repeating typo 2026-03-05 13:06:08 +01:00
Patryk Jędrzejczak
5a43695f6a raft: clarify the comment about read_barrier_reply
The comment could be misleading. It could suggest that the returned index is
already safe to read. That's not necessarily true. The entry with the
returned index could, for example, be dropped by the leader if the leader's
entry with this index had a different term.
2026-03-05 13:06:08 +01:00
Patryk Jędrzejczak
1ae2ae50a6 raft: read_barrier: update local commit_idx to read_idx when it's safe
When the local entry with `read_idx` belongs to the current term, it's
safe to update the local `commit_idx` to `read_idx`. The argument for
safety is in the new comment above `maybe_update_commit_idx_for_read`.
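A minimal sketch of that safety condition, in Python pseudocode rather than the actual fsm code:

```python
def maybe_update_commit_idx_for_read(commit_idx, read_idx, term_at, current_term):
    # Advancing commit_idx to read_idx is safe only when the local entry
    # at read_idx carries the current term: such an entry can no longer
    # be replaced by a leader of this term.
    if term_at(read_idx) == current_term:
        return max(commit_idx, read_idx)
    return commit_idx
```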

The motivation for this change is to speed up read barriers. `wait_for_apply`
executed at the end of `read_barrier` is delayed until the follower learns
that the entry with `read_idx` is committed. It usually happens quickly in
the `read_quorum` message. However, non-voters don't receive this message,
so they have to wait for `append_entries`. If no new entries are being
added, `append_entries` can come only from `fsm::tick_leader()`. For group0,
this happens once every 100ms.

The issue above significantly slows down cluster setups in tests. Nodes
join group0 as non-voters, and then they are met with several read barriers
just after a write to group0. One example is `global_token_metadata_barrier`
in `write_both_read_new` performed just after `update_topology_state` in
`write_both_read_old`.

Writing a test for this change would be difficult, so we trust the nemesis
tests to do the job. They have already found consistency issues in read
barriers. See #10578.
2026-03-05 13:06:08 +01:00
Patryk Jędrzejczak
1cbd0da519 raft: log: clarify the specification of term_for
When `idx > last_idx()`, the function does an out-of-bounds access to `_log`.
This may seem to contradict the current specification.
2026-03-05 13:06:07 +01:00
Michał Jadwiszczak
33a16940be strong_consistency/state_machine: pull necessary dependencies
Both migration manager and system keyspace will be used in next commit.
The first one is needed to execute group0 read barrier and we need
system keyspace to get column mappings.
2026-03-05 12:33:17 +01:00
Alex
b32ef8ecd5 test/auth_cluster: align service-level timeout expectations with scaled config
Use scale_timeout_by_mode() in make_scylla_conf() to derive
  request_timeout_in_ms in test/pylib/scylla_cluster.py.

  Update test_connections_parameters_auto_update in
  test/cluster/auth_cluster/test_raft_service_levels.py to expect the
  mode-specific timeout string returned by the REST endpoint after this
  scaling change.
2026-03-05 13:32:15 +02:00
Alex
a66565cc42 test/lwt: propagate scale_timeout through LWT helpers; scale resize waits
Pass scale_timeout explicitly through BaseLWTTester and Worker, validating it at construction time instead of relying on implicit pytest fixture injection inside helper classes.
Update wait_for_phase_ops() and wait_for_tablet_count() to use scale_timeout_by_mode() so polling behavior remains consistent across build modes.
Adjust LWT test call sites to pass scale_timeout explicitly.
Increase default timeout values, as the previous defaults were too short and prone to flakiness under slower configurations (notably debug/dev builds).
2026-03-05 13:07:09 +02:00
Alex
73f1a65203 test/pylib: introduce scale_timeout fixture helper
Introduce scale_timeout(mode) to centralize test timeout scaling logic based on build mode; the function returns a callable that scales a timeout for the given mode.
This ensures consistent timeout behavior across test helpers and eliminates ad-hoc per-test scaling adjustments.
Centralizing the logic improves maintainability and makes timeout behavior easier to reason about.
This becomes increasingly important as we run tests on heterogeneous hardware configurations.
Different build modes (especially debug) can significantly affect execution time, and having a single scaling mechanism helps keep test stability predictable across environments.

No functional change beyond unifying existing timeout scaling behavior.
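A sketch of what such a helper might look like; the per-mode factors below are assumptions for illustration, not the actual values used in test/pylib:

```python
# Hypothetical scaling factors per build mode; the real values live in
# test/pylib and may differ.
MODE_FACTORS = {"release": 1, "dev": 2, "debug": 5}

def scale_timeout(mode: str):
    """Return a callable that scales a base timeout for the given mode."""
    factor = MODE_FACTORS.get(mode, 1)
    def scale(seconds: float) -> float:
        return seconds * factor
    return scale
```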
2026-03-05 13:07:09 +02:00
Anna Stuchlik
855c503c63 doc: fix the unified installer instructions
This commit updates the documentation for the unified installer.

- The Open Source example is replaced with version 2025.1 (Source Available, currently supported, LTS).
- The info about CentOS 7 is removed (no longer supported).
- Java 8 is removed.
- The example for cassandra-stress is removed (as it was already removed on other installation pages).

Fixes https://github.com/scylladb/scylladb/issues/28150

Closes scylladb/scylladb#28152
2026-03-05 12:57:06 +02:00
Michał Jadwiszczak
d25be9e389 db/schema_tables: add get_column_mapping_if_exists()
In scenarios where we want to first check whether a column mapping exists,
and we don't want to do flow control with exceptions, it is very wasteful
to do
```
if (column_mapping_exists()) {
  get_column_mapping();
}
```
especially in a hot path like `state_machine::apply()`, because this will
execute 2 internal queries.

This commit introduces a `get_column_mapping_if_exists()` function,
which simply wraps the result of `get_column_mapping()` in an optional and
doesn't throw an exception if the mapping doesn't exist.
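The same pattern, sketched in Python with hypothetical names and a stand-in store rather than the actual C++ API:

```python
from typing import Optional

# Stand-in for the schema-tables storage backing the queries.
_column_mappings = {("ks", "users"): {"name": "text"}}

def get_column_mapping(ks: str, table: str) -> dict:
    # Exception-based variant: throws when the mapping is absent.
    mapping = _column_mappings.get((ks, table))
    if mapping is None:
        raise LookupError(f"no column mapping for {ks}.{table}")
    return mapping

def get_column_mapping_if_exists(ks: str, table: str) -> Optional[dict]:
    # Single lookup, no exception-based flow control.
    return _column_mappings.get((ks, table))
```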
2026-03-05 11:55:57 +01:00
Artsiom Mishuta
7b30a3981b test.py: enable strict_config,xfail_strict,strict-markers
this commit enables 3 strict pytest options:

strict_config - any warnings encountered while parsing the pytest section of the configuration file raise errors.
xfail_strict - tests marked with @pytest.mark.xfail that actually succeed fail the test suite by default.
strict-markers - markers not registered in the markers section of the configuration file raise errors.

and fixes the errors that surfaced after enabling these options
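For reference, the equivalent settings in a pytest configuration file look roughly like this (the marker list is illustrative); `xfail_strict` is an ini option, while the other two are command-line flags usually placed in `addopts`:

```ini
[pytest]
addopts = --strict-config --strict-markers
xfail_strict = true
markers =
    slow: marks a test as slow
```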

Closes scylladb/scylladb#28859
2026-03-05 12:54:26 +02:00
Dawid Mędrek
7564a56dc8 Merge 'tombstone_gc: allow using repair-mode tombstone gc with RF=1 tables' from Botond Dénes
Currently, repair-mode tombstone-gc cannot be used on tables with RF=1. We want to make repair-mode the default for all tablet tables (and more, see https://github.com/scylladb/scylladb/issues/22814), but currently a keyspace created with RF=1 and later altered to RF>1 will end up using timeout-mode tombstone gc. This is because the repair-mode tombstone-gc code relies on repair history to determine the gc-before time for keys/ranges. RF=1 tables cannot run repairs so they will have empty repair history and consequently won't be able to purge tombstones.
This PR solves this by keeping a registry of RF=1 tables and consulting this registry when creating `tombstone_gc_state` objects. If the table is RF=1, tombstone-gc will work as if the table used immediate-mode tombstone-gc. The registry is updated on each replication update. As soon as the table is not RF=1 anymore, the tombstone-gc reverts to the natural repair-mode behaviour.
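The registry consultation can be sketched like this (Python, with hypothetical names standing in for the C++ types):

```python
class TombstoneGcState:
    def __init__(self, rf1_tables: set):
        # Updated on each replication change.
        self._rf1_tables = rf1_tables

    def effective_mode(self, table: str, configured_mode: str) -> str:
        # RF=1 tables cannot run repair, so repair-mode degrades to
        # immediate-mode until the table's RF grows past 1.
        if configured_mode == "repair" and table in self._rf1_tables:
            return "immediate"
        return configured_mode
```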

After this PR, tombstone-gc defaults to repair-mode for all tables, regardless of RF and tablets/vnodes.

Fixes: SCYLLADB-106.

New feature, no backport required.

Closes scylladb/scylladb#22945

* github.com:scylladb/scylladb:
  test/{boost,cluster}: add test for tombstone gc mode=repair with RF=1
  tombstone_gc: allow use of repair-mode for RF=1 tables
  replica/table: update rf=1 table registry in shared tombstone-gc state
  tombstone_gc: tombstone_gc_before_getter: consider RF when getting gc before time
  tombstone_gc: unpack per_table_history_maps
  tombstone_gc: extract _group0_gc_time from per_table_history_map
  tombstone_gc: drop tombstone_gc_state(nullptr) ctor and operator bool()
  test/lib/random_schema: use timeout-mode tombstone_gc
  tombstone_gc_options: add C++ friendly constructor
  test: move away from tombstone_gc_state(nullptr) ctor
  treewide: move away from tombstone_gc_state(nullptr) ctor
  sstable: move away from tombstone_gc_mode::operator bool()
  replica/table: add get_tombstone_gc_state()
  compaction: use tombstone_gc_state with value semantics
  db/row_cache: use tombstone_gc_state with value semantics
  tombstone_gc: introduce tombstone_gc_state::for_tests()
2026-03-05 11:50:31 +01:00
Piotr Dulikowski
a2669e9983 test: test_mv_merge_allowed: add mistakenly omitted awaits
The test test_mv_merge_allowed asserts in two places that the tablet
count is 2. It does so by calling an async function but, mistakenly, the
returned coroutine was not awaited. The coroutine is, apparently, truthy
so the assertions always passed.

Fix the test to properly await the coroutines in the assertions.
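The bug class is easy to reproduce: a coroutine object is always truthy, so an assertion on an un-awaited call can never fail. A small illustration (names are made up):

```python
import asyncio

async def tablet_count() -> int:
    return 2

async def main():
    # Bug: missing await -- the coroutine object itself is truthy,
    # so this assertion passes regardless of the real count.
    coro = tablet_count()
    assert coro
    coro.close()  # suppress the "coroutine was never awaited" warning

    # Fix: await the coroutine and assert on its result.
    assert await tablet_count() == 2

asyncio.run(main())
```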

Fixes: SCYLLADB-905

Closes scylladb/scylladb#28875
2026-03-05 11:29:23 +01:00
Avi Kivity
5ae40caa6d dist: tune tcp_mem to 3% of total memory in scylla-kernel-conf package
tcp_mem defaults to 9% of total memory. ScyllaDB defaults to 93%. The
sum is more than 100%.

Fix by tuning tcp_mem to 3% of total memory.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-734

Closes scylladb/scylladb#28700
2026-03-05 12:51:04 +03:00
Patryk Jędrzejczak
bb1a798c2c Merge 'raft: Throw stopped_error if server aborted' from Dawid Mędrek
This PR solves a series of similar problems related to executing
methods on an already aborted `raft::server`. They materialize
in various ways:

* For `add_entry` and `modify_config`, a `raft::not_a_leader` with
  a null ID will be returned IF forwarding is disabled. This wasn't
  a big problem because forwarding has always been enabled for group0,
  but it's something that's nice to fix. It might be relevant for
  strong consistency, which will heavily rely on this code.

* For `wait_for_leader` and `wait_for_state_change`, the calls may
  hang and never resolve. A more detailed scenario is provided in a
  commit message.

For the last two methods, we also extend their descriptions to indicate
the new possible exception type, `raft::stopped_error`. This change is
correct since either we enter the functions and throw the exception
immediately (if the server has already been aborted), or it will be
thrown upon the call to `raft::server::abort`.

We fix both issues. A few reproducer tests have been included to verify
that the calls finish and throw the appropriate errors.

Fixes SCYLLADB-841

Backport: Although the hanging problems haven't been spotted so far
          (at least to the best of my knowledge), it's best to avoid
          running into a problem like that, so let's backport the
          changes to all supported versions. They're small enough.

Closes scylladb/scylladb#28822

* https://github.com/scylladb/scylladb:
  raft: Make methods throw stopped_error if server aborted
  raft: Throw stopped_error if server aborted
  test: raft: Introduce get_default_cluster
2026-03-05 10:47:39 +01:00
Marcin Maliszkiewicz
c3f59e4fa1 Merge 'cql3: implement write_consistency_levels guardrails' from Andrzej Jackowski
This patch series implements `write_consistency_levels_warned` and `write_consistency_levels_disallowed` guardrails, allowing the configuration of which consistency levels are unwanted for writes. The motivation for these guardrails is to forbid writing with consistency levels that don't provide high durability guarantees (like CL=ANY, ONE, or LOCAL_ONE).

Neither guardrail is enabled by default, so as not to disrupt clusters that are currently using any of the CLs for writes. The warning guardrail may seem harmless, as it only adds a warning to the CQL response; however, enabling it can significantly increase network traffic (as a warning message is added to each response) and also decrease throughput due to additional allocations required to prepare the warning. Therefore, both guardrails should be enabled with care. The newly added `writes_per_consistency_level` metric, which is incremented unconditionally, can help decide whether a guardrail can be safely enabled in an existing cluster.

This commit adds additional `if` checks on the critical path. However, based on the `perf_simple_query` benchmark for writes, the difference is marginal (~40 additional instructions per op, a relative difference smaller than 0.1%).

BEFORE:
```
291443.35 tps ( 53.3 allocs/op,  16.0 logallocs/op,  14.2 tasks/op,   48067 insns/op,   18885 cycles/op,        0 errors)
throughput:
 mean=   289743.07 standard-deviation=6075.60
 median= 291424.69 median-absolute-deviation=1702.56
 maximum=292498.27 minimum=261920.06
instructions_per_op:
 mean=   48072.30 standard-deviation=21.15
 median= 48074.49 median-absolute-deviation=12.07
 maximum=48119.87 minimum=48019.89
cpu_cycles_per_op:
 mean=   18884.09 standard-deviation=56.43
 median= 18877.33 median-absolute-deviation=14.71
 maximum=19155.48 minimum=18821.57
```

AFTER:
```
290108.83 tps ( 53.3 allocs/op,  16.0 logallocs/op,  14.2 tasks/op,   48121 insns/op,   18988 cycles/op,        0 errors)
throughput:
 mean=   289105.08 standard-deviation=3626.58
 median= 290018.90 median-absolute-deviation=1072.25
 maximum=291110.44 minimum=274669.98
instructions_per_op:
 mean=   48117.57 standard-deviation=18.58
 median= 48114.51 median-absolute-deviation=12.08
 maximum=48162.18 minimum=48087.18
cpu_cycles_per_op:
 mean=   18953.43 standard-deviation=28.76
 median= 18945.82 median-absolute-deviation=20.84
 maximum=19023.93 minimum=18916.46
```
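The guardrail check itself reduces to simple set membership on the write path. A sketch under assumed names and assumed set contents (the real configuration lives in scylla.yaml and both guardrails default to disabled):

```python
# Assumed contents for illustration; both guardrails ship empty/disabled.
write_consistency_levels_warned = {"ANY", "ONE", "LOCAL_ONE"}
write_consistency_levels_disallowed = set()

def check_write_consistency_level(cl: str) -> list:
    """Return client warnings; raise if the CL is disallowed for writes."""
    if cl in write_consistency_levels_disallowed:
        raise ValueError(f"write consistency level {cl} is disallowed")
    if cl in write_consistency_levels_warned:
        return [f"write consistency level {cl} provides low durability"]
    return []
```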

Fixes: SCYLLADB-259
Refs: SCYLLADB-739
No backport, it's a new feature

Closes scylladb/scylladb#28570

* github.com:scylladb/scylladb:
  scylla.yaml: add write CL guardrails to scylla.yaml
  scylla.yaml: reorganize guardrails config to be in one place
  test: add cluster tests for write CL guardrails
  test: implement test_guardrail_write_consistency_level
  cql3: start using write CL guardrails
  cql3/query_processor: implement metrics to track CL of writes
  db: cql3/query_processor: add write_consistency_levels enum_sets
  config: add write_consistency_levels_* guardrails configuration
2026-03-05 09:55:38 +01:00
Yauheni Khatsianevich
aa85f5a9c3 test: migrating alternator ttl tests to scylla repo
Migrate alternator_ttl_tests.py to the scylla repo as part of
deprecating the dtest framework.
Migrated tests:
- test_ttl_with_load_and_decommission

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-869

Closes scylladb/scylladb#28858
2026-03-05 10:04:14 +02:00
Nadav Har'El
8e32d97be6 test/alternator: fix run script
The test/alternator/run script currently fails: Scylla fails to boot,
complaining that "--alternator-ttl-period-in-seconds" is specified
twice (which is, unfortunately, not allowed). The problem is that
recently we started to set this option in test/cqlpy/run.py, for
CQL's new row-level TTL, so it is no longer needed in
test/alternator/run - and in fact not allowed, so we must remove it.

This patch only affects the script test/alternator/run, and has no
effect on running tests through test.py or Jenkins.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#28868
2026-03-05 10:06:38 +03:00
Botond Dénes
9b2242c752 test/cluster/test_repair.py: fix test_repair_timtestamp_difference
This test forgot to await its check() calls, which is the pass-condition
of the test. Once the await was added, the test started failing. Turns
out, the test was broken, but this was never discovered, because due to
the missing await, the errors were not propagated.
This patch adds the missing await and fixes the discovered problems:
* Use cql.run_async() instead of cql.execute()
* Fix json path for timestamp
* Missing flush/compact

Fixes: SCYLLADB-911

Closes scylladb/scylladb#28883
2026-03-05 10:04:49 +03:00
Nadav Har'El
af07718fff test/cqlpy: fix "run --release" for versions 5.4 or older
Recently we started to rely on the options "--auth-superuser-name"
and "--auth-superuser-salted-password" to ensure that a
cassandra/cassandra user exists for tests - without those options
a default superuser no longer exists.

This broke "test/cqlpy/run --release" for old releases, earlier
than 5.4 (in the enterprise stream, 2024.1 or earlier), because
those old releases didn't have this option.

So in this patch we fix the "--release" logic that removes these
options from the command line when running these old versions.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#28894
2026-03-05 09:59:46 +03:00
Botond Dénes
5e7b966d37 Merge 'Remove prepare_snapshot_for_backup() helper from backup/restore tests' from Pavel Emelyanov
The helper in question duplicates the functionality of the `take_snapshot()` helper from the same file. The only difference is that it additionally creates a keyspace:table with yet another helper, but that helper is also going to be removed (as a continuation of #28600 and #28608)

Enhancing tests, not backporting

Closes scylladb/scylladb#28834

* github.com:scylladb/scylladb:
  test_backup: Remove prepare_snapshot_for_backup()
  test_backup: Patch test_simple_backup_and_restore to use take_snapshot()
  test_backup: Patch backup tests to use take_snapshot()
  test_backup: Add helper to take snapshot on a single server
2026-03-05 06:54:07 +02:00
Dani Tweig
25fc8ef14c Add RELENG to milestone-to-Jira sync project keys
Closes scylladb/scylladb#28889
2026-03-05 06:51:21 +02:00
Calle Wilund
35aab75256 test_internode_compression: Add await for "run" coro:s
Fixes: SCYLLADB-907

Closes scylladb/scylladb#28885
2026-03-05 06:50:33 +02:00
Patryk Jędrzejczak
2a3476094e storage_service: raft_topology_cmd_handler: fix use-after-free
8e9c7397c5 made `rs` a reference, which can
lead to use-after-free. The `normal_nodes` map containing the referenced
value can be destroyed before the last use of `rs` when the topology state
is reloaded after a context switch on some `co_await`. The following move
assignment in `storage_service::topology_state_load` causes this:
```
_topology_state_machine._topology = co_await _sys_ks.local().load_topology_state(tablet_hosts);
```

This issue has been discovered in next-2026.1 CI after queueing the
backport of #28558. `test_truncate_during_topology_change` failed after
ASan reported a heap-use-after-free in
```
co_await _repair.local().bootstrap_with_repair(get_token_metadata_ptr(), rs.ring.value().tokens, session);
```
This test enables `delay_bootstrap_120s`, which makes the bug much more
likely to reproduce, but it could happen elsewhere.

No backport needed, as the only backport of #28558 hasn't been merged yet.
The backport PR will cherry-pick this commit.

Closes scylladb/scylladb#28772
2026-03-05 01:50:22 +01:00
Aleksandra Martyniuk
156c29f962 test: add test_group0_tables_use_schema_commitlog 2026-03-04 17:25:06 +01:00
Aleksandra Martyniuk
5306e26b83 db: service: remove group0 tables from schema commitlog schema initializer
Remove group0 tables from schema commitlog schema initializer.
The schema commitlog of group0 tables is ensured by set_is_group0_table.
2026-03-04 17:25:06 +01:00
Aleksandra Martyniuk
690b2c4142 service: ensure that tables updated via group0 use schema commitlog
Set enable_schema_commitlog for each group0 table.

Assert that group0 tables use schema commitlog in ensure_group0_schema
(per each command).

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-914.
2026-03-04 17:25:04 +01:00
Aleksandra Martyniuk
6b3b174704 db: schema: remove set_is_group0_table param
set_is_group0_table takes an enabled flag, based on which it decides
whether it's a group0 table. The method is called only with enabled = true.

Drop the param. For non-group0 tables, nothing should be set.
2026-03-04 17:24:34 +01:00
Aleksandra Martyniuk
57f1e46204 test: cluster: tasks: await set_task_ttl
Await set_task_ttl in test_tablet_repair_task_children.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-912

Closes scylladb/scylladb#28882
2026-03-04 17:58:37 +02:00
Dawid Mędrek
d44fc00c4c raft: Make methods throw stopped_error if server aborted
After the previous changes in `raft::server::{add_entry, modify_config}`
(cf. SCYLLADB-841), we also go through other methods of `raft::server`
and verify that they handle the aborted state properly.

I found two methods that do not:

(A) `wait_for_leader`
(B) `wait_for_state_change`

What happened before these changes?

In case (A), the dangerous scenario occurred when `_leader_promise` was
empty on entering the function. In that case, we would construct the
promise and wait on the corresponding future. However, if the server
had been already aborted before the call, the future would never
resolve and we'd be effectively stuck.

Case (B) is fully analogous: instead of `_leader_promise`, we'd work
with `_state_change_promise`.

There's probably a more proper solution to this problem, but since I'm
not familiar with the internal code of Raft, I fix it this way. We can
improve it further in the future.

We provide two simple validation tests. They verify that after aborting
a `raft::server`, the calls:

* do not hang (the tests would time out otherwise),
* throw raft::stopped_error.

Fixes SCYLLADB-841
2026-03-04 16:28:11 +01:00
Dawid Mędrek
c200d6ab4f raft: Throw stopped_error if server aborted
Before the change, calling `add_entry` or `modify_config` on an already
aborted Raft server could result in an error `not_a_leader` containing
a null server ID. It was possible precisely when forwarding was
disabled in the server configuration.

`not_a_leader` is supposed to return the ID of the current leader,
so that was wrong. Furthermore, the description of the function
specified that if a server is aborted, then it should throw
`stopped_error`.

We fix that issue. A few small reproducer tests were provided to verify
that the functions behave correctly with and without forwarding enabled.

Refs SCYLLADB-841
2026-03-04 16:28:08 +01:00
Marcin Maliszkiewicz
9697b6013f Merge 'test: add missing awaits in test_client_routes_upgrade' from Andrzej Jackowski
Two calls in test_client_routes_upgrade were missing `await`,
so they were never actually executed. This caused Python
to emit RuntimeWarning about unawaited coroutines, and more
importantly, the test skipped important verification steps, which
could mask real bugs or cause flakiness.

Additionally, increase 10s timeouts to 60s to avoid flakiness in slow
environments. Although these tests haven't failed so far, similar
issues have already been observed in other tests with too-short
timeouts.

Fixes: [SCYLLADB-909](https://scylladb.atlassian.net/browse/SCYLLADB-909)

Backport to 2026.1, as the test is also there.

[SCYLLADB-909]: https://scylladb.atlassian.net/browse/SCYLLADB-909?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#28877

* github.com:scylladb/scylladb:
  test: increase timeouts in test_client_routes.py
  test: add missing awaits in test_client_routes_upgrade
2026-03-04 15:26:34 +01:00
Szymon Malewski
212bd6ae1a vector: Vectorize loops in similarity functions
The main loops iterating over vector components were not vectorized due to:
- "cannot prove it is safe to reorder floating-point operations"
- "Cannot vectorize early exit loop with more than one early exit"
The first issue is fixed by adding `#pragma clang fp contract(fast) reassociate(on)`, which allows the compiler to reorder floating-point operations.
The second issue is solved by refactoring the operations in the affected loop.
Additionally, using float operations instead of double increases throughput; numerical accuracy is not the main consideration in vector search scenarios.

Performance measured:
- scylla built using dbuild
- using https://github.com/zilliztech/VectorDBBench (modified to call `SELECT id, similarity_cosine({vector<float, 1536>}, {vector<float, 1536>}) ...` without ANN search):
- client concurrency 20
before: ~2250 QPS
`float` operations: ~2350 QPS
`compute_cosine_similarity` vectorization: ~2500QPS
`extract_float_vector` vectorization: ~3000QPS

Follow-up https://github.com/scylladb/scylladb/pull/28615
Ref https://scylladb.atlassian.net/browse/SCYLLADB-764

Closes scylladb/scylladb#28754
2026-03-04 15:14:53 +01:00
Andrzej Jackowski
221b78cb81 test: increase timeouts in test_client_routes.py
Increase 10s timeouts to 60s to avoid flakiness in slow
environments. Although these tests haven't failed so far, similar
issues have already been observed in other tests with too-short
timeouts.

Test execution time is unaffected; the entire suite in `dev` takes ~30s
before and after this change.
2026-03-04 13:40:30 +01:00
Andrzej Jackowski
527c4141da test: add missing awaits in test_client_routes_upgrade
Two calls in test_client_routes_upgrade were missing `await`,
so they were never actually executed. This caused Python
to emit RuntimeWarning about unawaited coroutines, and more
importantly, the test skipped important verification steps, which
could mask real bugs or cause flakiness.

Fixes: SCYLLADB-909
2026-03-04 13:34:37 +01:00
Piotr Smaron
a31cb18324 db: fix UB in system.clients row sorting
The comparator used to sort per-IP client rows was not a strict-weak-ordering (it could return true in both directions for some pairs), which makes `std::ranges::sort` behavior undefined. A concrete pair that breaks it (and is realistic in system.clients):
a = (port=9042, client_type="cql")
b = (port=10000, client_type="alternator")
With the current comparator:
cmp(a,b) = (9042 < 10000) || ("cql" < "alternator") = true || false = true
cmp(b,a) = (10000 < 9042) || ("alternator" < "cql") = false || true = true
So both directions are true, meaning there is no valid ordering that sort can achieve.

The fix is to sort lexicographically by (port, client_type) to match the table's clustering key and ensure deterministic ordering.
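The failure mode is easy to demonstrate outside C++; here is a Python model of the broken comparator and the lexicographic fix:

```python
# Rows modeled as (port, client_type), as in the commit's example.
a = (9042, "cql")
b = (10000, "alternator")

def broken_less(x, y):
    # OR-ing the per-field comparisons is not a strict weak ordering:
    # both directions can report "less than" for the same pair.
    return x[0] < y[0] or x[1] < y[1]

assert broken_less(a, b) and broken_less(b, a)  # contradiction

# The fix: compare lexicographically by (port, client_type), matching
# the table's clustering key. Python tuples already compare that way.
rows = sorted([b, a])
```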

Closes scylladb/scylladb#28844
2026-03-04 14:10:49 +03:00
Avi Kivity
c331796d28 Merge 'Support Min < Precision for approx_exponential_histogram' from Amnon Heiman
This series closes a gap in the approx_exponential_histogram implementation to
cover integer values starting from small Min values.

While the original implementation was focused on durations, where this limitation
was not an issue, over time, there has been a growing need for histograms that
cover smaller values, such as the number of SSTables or the number of items in a
batch.

The reason for the original limitation is inherent to the exponential histogram
math. The previous code required Min to be at least Precision to avoid negative
bit shifts in the exponential calculations.

After this series, approx_exponential_histogram allows Min to be smaller than
Precision by scaling values during indexing. The value is shifted left by
max(log2(Precision) - log2(Min), 0) bits, and the existing
exponential math is applied. Bucket limits are then scaled back to the original
units. This keeps insertion and retrieval O(1) without runtime branching, at the
cost of repeated bucket limits for some values in the Min to Precision range.
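The scaling step can be sketched for power-of-two Min and Precision; this is illustrative Python, not the actual estimated_histogram.hh code:

```python
def scale_shift(min_value: int, precision: int) -> int:
    # max(log2(Precision) - log2(Min), 0); exact for powers of two.
    return max(precision.bit_length() - min_value.bit_length(), 0)

def scaled(value: int, min_value: int, precision: int) -> int:
    # Values are shifted left before the exponential indexing math;
    # bucket limits are shifted right by the same amount afterwards.
    return value << scale_shift(min_value, precision)
```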

Additional tests cover the new behavior.
Relates to #2785

** New feature, no need to backport. **

Closes scylladb/scylladb#28371

* github.com:scylladb/scylladb:
  estimated_histogram_test.cc: add to_metrics_histogram test
  histogram_metrics_helper.hh: Support Min < Precision
  estimated_histogram_test.cc: Add tests for approx_exponential_histogram with Min<Precision
  estimated_histogram.hh: support Min less than Precision histograms
2026-03-04 12:43:26 +02:00
Szymon Malewski
4c4673e8f9 test: vector_similarity: Fix similarity value checks
The `isclose` function checks whether the returned similarity floats are close enough to the expected value, but it doesn't `assert` by itself.
Several tests missed that `assert`, effectively always passing.
With this patch, similarity value checks are wrapped in a helper function `assert_similarity` with a predefined tolerance.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-877

Closes scylladb/scylladb#28748
2026-03-04 09:53:32 +01:00
Marcin Maliszkiewicz
c7d3f80863 Merge 'auth: do not create default 'cassandra:cassandra' superuser' from Dario Mirovic
This patch series removes creation of default 'cassandra:cassandra' superuser on system start.

Disable creation of a superuser with default 'cassandra:cassandra' credentials to improve security. The current flow requires clients to create another superuser and then drop the default 'cassandra:cassandra' role. For those who do, there is a time window where the default credentials exist. For those who do not, that role stays. We want to improve security by forcing the client either to use config to specify values for the default superuser name and password, or to use cqlsh over a maintenance socket connection to explicitly create/alter a superuser role.

The patch series:
- Enable role modification over the maintenance socket
- Stop using default 'cassandra' value for default superuser, skipping creation instead

Design document: https://scylladb.atlassian.net/wiki/spaces/RND/pages/165773327/Drop+default+cassandra+superuser

Fixes scylladb/scylla-enterprise#5657

This is an improvement. It does not need a backport.

Closes scylladb/scylladb#27215

* github.com:scylladb/scylladb:
  config: enable maintenance socket in workdir by default
  docs: auth: do not specify password with -p option
  docs: update documentation related to default superuser
  test: maintenance socket role management
  test: cluster: add logs to test_maintenance_socket.py
  test: pylib: fix connect_driver handling when adding and starting server
  auth: do not create default 'cassandra:cassandra' superuser
  auth: remove redundant DEFAULT_USER_NAME from password authenticator
  auth: enable role management operations via maintenance socket
  client_state: add has_superuser method
  client_state: add _bypass_auth_checks flag
  auth: let maintenance_socket_role_manager know if node is in maintenance mode
  auth: remove class registrator usage
  auth: instantiate auth service with factory functors
  auth: add service constructor with factory functors
  auth: add transitional.hh file
  service: qos: handle special scheduling group case for maintenance socket
  service: qos: use _auth_integration as condition for using _auth_integration
2026-03-04 09:43:57 +01:00
Piotr Dulikowski
85dcbfae9a Merge 'hint: Don't switch group in database::apply_hint()' from Pavel Emelyanov
The method is called from storage_proxy::mutate_hint(), which is in turn called from hint_mutation::apply_locally(). The latter is either called directly by the hint sender, which already runs in the streaming group, or via the RPC HINT_MUTATION handler, which uses index 1 and negotiates the streaming group as well.

To be sure, add a debugging check for current group being the expected one.

Code cleanup, not backporting

Closes scylladb/scylladb#28545

* github.com:scylladb/scylladb:
  hint: Don't switch group in database::apply_hint()
  hint_sender: Switch to sender group on stop either
2026-03-04 09:36:38 +01:00
Pavel Emelyanov
5793e305b5 test_backup: Remove prepare_snapshot_for_backup()
It's now unused

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-04 11:33:43 +03:00
Pavel Emelyanov
ffbd9a3218 test_backup: Patch test_simple_backup_and_restore to use take_snapshot()
This change needs a bit more care, as the test collects files from the
snapshot directory several times. Before being patched to use the helper,
it collected _all_ the files. Now the helper only provides TOCs, but
that's fine -- the only check that relied on the full set can simply
re-collect TOCs and compare the new set with the old one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-04 11:33:43 +03:00
Pavel Emelyanov
c1b0ac141b test_backup: Patch backup tests to use take_snapshot()
Some of those tests need to update the hard-coded 'backup' snapshot name
to use the one provided by take_snapshot() helper. Other than that, the
patching is pretty straightforward.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-04 11:33:43 +03:00
Pavel Emelyanov
ea17c26fd9 test_backup: Add helper to take snapshot on a single server
The take_snapshot() helper returns a dict(server: list[string]). When
there's only one server to work with, it's handier to get a single list
of sstables directly.

Next patches will make use of that helper.
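The convenience the helper provides can be sketched in Python; the name and shape below are illustrative, not the actual pylib API:

```python
def single_server_sstables(snapshot: dict[str, list[str]]) -> list[str]:
    """Collapse the {server: [sstable, ...]} result of a take_snapshot()-style
    helper into a plain list when exactly one server is involved."""
    if len(snapshot) != 1:
        raise ValueError(f"expected exactly one server, got {len(snapshot)}")
    (files,) = snapshot.values()
    return files
```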

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-04 11:31:39 +03:00
Botond Dénes
e7487c21e4 test/{boost,cluster}: add test for tombstone gc mode=repair with RF=1 2026-03-04 09:45:38 +02:00
Botond Dénes
5998a859f7 tombstone_gc: allow use of repair-mode for RF=1 tables
Modify the method which calculates the default gc mode, as well as the
one which validates whether repair-mode can be used at all, so that both
accept the use of repair-mode on RF=1 tables.

This de-facto changes the default tombstone-gc to repair-mode for all
tables. Documentation is updated accordingly.
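A toy model of the relaxed rule (purely illustrative; the real calculation lives in the tombstone_gc code):

```python
def default_tombstone_gc_mode(rf: int, allow_rf1_repair: bool = True) -> str:
    """Pick the default tombstone-gc mode for a table with the given RF.
    Before this change repair-mode required RF > 1; now RF = 1 qualifies
    too, making repair-mode the de-facto default."""
    min_rf = 1 if allow_rf1_repair else 2
    return "repair" if rf >= min_rf else "timeout"
```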

Some tests need adjusting:
* cqlpy/test_select_from_mutation_fragments.py: disable GC for some test
  cases because this patch makes tombstones they write subject to GC
  when using defaults.
* test/cluster/test_mv.py::test_mv_tombstone_gc_not_inherited used
  repair-mode as a non-default for the base table and expected the MV to
  revert to default. Another mode has to be used as the non-default
  (immediate).
* test/cqlpy/test_tools.py::test_scylla_sstable_dump_schema: don't
  compare tombstone_gc schema extension when comparing dumped schema vs.
  original. The tool's schema loader doesn't have access to the keyspace
  definition so it will come up with different defaults for
  tombstone-gc.
* test/boost/row_cache_test.cc::test_populating_cache_with_expired_and_nonexpired_tombstones
  sets tombstone expiry assuming the tombstone-gc timeout-mode default.
  Change the CREATE TABLE statement to set the expected mode.
2026-03-04 09:44:24 +02:00
Andrzej Jackowski
c0e94828de scylla.yaml: add write CL guardrails to scylla.yaml
Disabled by default. This change is introduced only to document the
guardrail.

Refs: SCYLLADB-259
2026-03-04 08:00:17 +01:00
Andrzej Jackowski
038f89ede4 scylla.yaml: reorganize guardrails config to be in one place
Also change the format of the section header and add "#" to empty
lines, so that in the future no one splits the section by adding new
configs.
2026-03-04 08:00:17 +01:00
Andrzej Jackowski
ec42fdfd01 test: add cluster tests for write CL guardrails
Most of the functionality is tested in cqlpy tests located in
`test_guardrail_write_consistency_level.py`. Add two tests
that require the cluster framework:

- `test_invalid_write_cl_guardrail_config` checks the node startup
  path when incorrect `write_consistency_levels_warned` and
  `write_consistency_levels_disallowed` values are used.
- `test_write_cl_default` checks the behavior of the default
  configuration using a multi-node cluster.

Tests execution time:
 - Dev: 10s
 - Debug: 18s

Refs: SCYLLADB-259
2026-03-04 08:00:17 +01:00
Andrzej Jackowski
446539f12f test: implement test_guardrail_write_consistency_level
Implement basic tests for write consistency level guardrails,
verifying that they work for each type of write request (inserts,
updates, deletes, logged batches, unlogged batches, conditional batches,
and counter operations).

All tests are marked as Scylla-only because they currently don't
pass with Cassandra due to differences in handling superusers (see:
SCYLLADB-882).

Tests execution time:
 - Dev: 3s
 - Debug: 14s

Refs: SCYLLADB-259
Refs: SCYLLADB-882
2026-03-04 08:00:13 +01:00
Avi Kivity
85bd6d0114 Merge 'Add multiple-shard persistent metadata storage for strongly consistent tables' from Wojciech Mitros
In this series we introduce new system tables and use them for storing the raft metadata
for strongly consistent tables. In contrast to the previously used raft group0 tables, the
new tables can store data on any shard. The tables also allow specifying the shard where
each partition should reside, which enables the tablets of strongly consistent tables to have
their raft group metadata co-located on the same shard as the tablet replica.

The new tables have almost the same schemas as the raft group0 tables. However, they
have an additional column in their partition keys. The additional column is the shard
that specifies where the data should be located. While a tablet and its corresponding
raft group server reside on some shard, the server now issues all writes and reads
to the metadata tables using its shard in addition to the group_id.

The extra partition key column is used by the new partitioner and sharder which allow
this special shard routing. The partitioner encodes the shard in the token and the
sharder decodes the shard from the token. This approach for routing avoids any
additional lookups (for the tablet mapping) during operations on the new tables
and it also doesn't require keeping any state. It also doesn't interact negatively
with resharding - as long as tablets (and their corresponding raft metadata) occupy
some shard, we do not allow starting the node with a shard count lower than the
id of this shard. When increasing the shard count, the routing does not change,
similarly to how tablet allocation doesn't change.
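The stateless routing idea can be sketched as follows (the bit layout here is made up for illustration; the actual fixed_shard_partitioner encoding differs):

```python
SHARD_BITS = 10  # illustrative width, not Scylla's actual token layout

def encode_token(shard: int, payload: int) -> int:
    """Partitioner side: embed the owning shard in the token itself."""
    assert 0 <= shard < (1 << SHARD_BITS)
    return (payload << SHARD_BITS) | shard

def shard_of(token: int) -> int:
    """Sharder side: recover the shard with no mapping lookup and no state."""
    return token & ((1 << SHARD_BITS) - 1)
```

Because the shard is part of the token, increasing the shard count leaves the routing of existing tokens unchanged.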

To use the new tables, a new implementation of `raft::persistence` is added. Currently,
it's almost an exact copy of the `raft_sys_table_storage` which just uses the new tables,
but in the future we can modify it with changes specific to metadata (or mutation)
storage for strongly consistent tables. The new storage is used in the `groups_manager`,
which combined with the removal of some `this_shard_id() == 0` checks, allows strongly
consistent tables to be used on all shards.

This approach to ensuring that reads/writes to the new tables end up on the correct shards
won the complexity/usability/performance trade-off against a few other approaches we've considered.
They include:
1. Making the Raft server read/write directly to the database, skipping the sharder, on its shard, while using
the default partitioner/sharder. This approach could let us avoid changing the schema and there should be
no problems for reads and writes performed by the Raft server. However, in this approach we would
place data in the tables in conflict with the placement determined by the sharder. As a result, any read going through
the sharder could miss the rows it was supposed to read. Even when reading all shards to find a specific value,
there is a risk of polluting the cache - the rows loaded on incorrect shards may persist in the cache for an unknown
amount of time. The cache may also mistakenly remember that a row is missing, even though it's actually present,
just on an incorrect shard.
Some of the issues with this approach could be worked around using another sharder which always returns
this_shard_id() when asked about a shard. It's not clear how such a sharder would implement a method like
`token_for_next_shard`, and how much simpler it would be compared to the current "identity" sharder.
2. Using a sharder depending on the current allocation of tablets on the node. This approach relies on the
knowledge of group_id -> shard mapping at any point in time in the cluster. For this approach we'd also
need to either add a custom partitioner which encodes the group_id in the token, or we'd need to track the
token(group_id) -> shard mapping. This approach has the benefit over the one used in the series of keeping
the partition key as just group_id. However, it requires more logic and access to the live state of the node
in the sharder, and it's not static: the same token may be sharded differently depending on the state of the
node. That shouldn't occur in practice, but if we changed the state of the node before adjusting the table data,
we would be unable to access/fix the stale data without artificially also changing the state of the node.
3. Using metadata tables co-located with the strongly consistent tables. This approach could simplify the
metadata migrations in the future, however it would require additional schema management of all co-located
metadata tables, and it's not even obvious what could be used as the partition key in these tables - some
metadata is per-raft-group, so we couldn't reuse the partition key of the strongly consistent table for it. And
finding and remembering a partition key that is routed to a specific shard is not a simple task. Finally, splits
and merges will most likely need special handling for metadata anyway, so we wouldn't even make use of
co-located table's splits and merges.

Fixes [SCYLLADB-361](https://scylladb.atlassian.net/browse/SCYLLADB-361)

[SCYLLADB-361]: https://scylladb.atlassian.net/browse/SCYLLADB-361?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#28509

* github.com:scylladb/scylladb:
  docs: add strong consistency doc
  test/cluster: add tests for strongly-consistent tables' metadata persistence
  raft: enable multi-shard raft groups for strongly consistent tablets
  test/raft: add unit tests for raft_groups_storage
  raft: add raft_groups_storage persistence class
  db: add system tables for strongly consistent tables' raft groups
  dht: add fixed_shard_partitioner and fixed_shard_sharder
  raft: add group_id -> shard mapping to raft_group_registry
  schema: add with_sharder overload accepting static_sharder reference
2026-03-04 08:55:43 +02:00
Piotr Dulikowski
2fb981413a Merge 'vector_search: test: fix HTTPS client test flakiness' from Karol Nowacki
The default 100ms timeout for client readiness in tests is too
aggressive. In some test environments, this is not enough time for
client creation, which involves address resolution and TLS certificate
reading, leading to flaky tests.

This commit increases the default client creation timeout to 10 seconds.
This makes the tests more robust, especially in slower execution
environments, and prevents similar flakiness in other test cases.

Fixes: VECTOR-547, SCYLLADB-802, SCYLLADB-825, SCYLLADB-826

Backport to 2025.4 and 2026.1, as the same problem occurs on these branches and can potentially make the CI flaky there as well.

Closes scylladb/scylladb#28846

* github.com:scylladb/scylladb:
  vector_search: test: include ANN error in assertion
  vector_search: test: fix HTTPS client test flakiness
2026-03-04 08:55:43 +02:00
Wojciech Mitros
38f02b8d76 mv: remove dead code in view_updates::can_skip_view_updates
When we create a materialized view, we consider 2 cases:
1. the view's primary key contains a column that is not
in the primary key of the base table
2. the view's primary key doesn't contain such a column

In the 2nd case, we add all columns from the base table
to the schema of the view (as virtual columns). As a result,
all of these columns are effectively "selected" in
view_updates::can_skip_view_updates. Same thing happens when
we add new columns to the base table using ALTER.
Because of this, we can never have !column_is_selected and
!has_base_non_pk_columns_in_view_pk at the same time. And
thus, the check (!column_is_selected
&& _base_info.has_base_non_pk_columns_in_view_pk) is always
the same as (!column_is_selected).
Because we immediately return after this check, the tail of
this function is also never reached: all checks after the
(column_is_selected) one are dead. Also, the condition
(!column_is_selected && base_has_nonexpiring_marker) is always
false at the point it is evaluated. This in turn makes
`base_has_nonexpiring_marker` unused, so we delete it as well.
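The collapse of the compound condition can be checked mechanically. Given the invariant above (a column can only be unselected when the view pk contains a base non-pk column), the conjunction equals its first operand on every reachable state:

```python
from itertools import product

def reachable(column_is_selected: bool, has_base_non_pk_in_view_pk: bool) -> bool:
    # Invariant from the commit message: when the view pk contains no base
    # non-pk column, every base column is (virtually) selected, so the
    # state (not selected, no such column) never occurs.
    return column_is_selected or has_base_non_pk_in_view_pk

for selected, has_col in product([False, True], repeat=2):
    if not reachable(selected, has_col):
        continue  # skip the impossible state
    assert (not selected and has_col) == (not selected)
```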

It's worth considering why we even had
`base_has_nonexpiring_marker` if it's effectively unused. We
initially introduced it in bd52e05ae2 and we (incorrectly)
used it to allow skipping view updates even if the liveness of
virtual columns changed. Soon after, in 5f85a7a821, we
started categorizing virtual columns as column_is_selected == true
and we moved the liveness checks for virtual columns to the
`if (column_is_selected)` clause, before the `base_has_nonexpiring_marker`
check. We changed this because even if we have a nonexpiring marker
right now, it may be changed in the future, in which case the liveness
of the view row will depend on liveness of the virtual columns and
we'll need to have the view updates from the time the row marker was
nonexpiring.

Closes scylladb/scylladb#28838
2026-03-04 08:55:43 +02:00
Geoff Montee
0eb5603ebd Docs: describe the system tables
Fixes issue #12818 with the following docs changes:

docs/dev/system_keyspace.md: Added missing system tables, added table of contents (TOC), added categories

Closes scylladb/scylladb#27789
2026-03-04 08:55:43 +02:00
Botond Dénes
f156bcddab Merge 'test: decrease strain in test_startup_response' from Marcin Maliszkiewicz
For 2025.3 and 2025.4 this test runs an order of magnitude
slower in debug mode, potentially due to passwords::check
running in an alien thread and overwhelming the CPU (this is
fixed in newer versions).

Decreasing the number of connections in test makes it fast
again, without breaking reproducibility.

As an additional measure, we double the timeout.

The fix is now cherry-picked to master as the test sometimes
fails there too.

(cherry picked from commit 1f1fc2c2ac)

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-795

backport: 2026.1, already on other stable branches

Closes scylladb/scylladb#28848

* github.com:scylladb/scylladb:
  test: add more logs to test_startup_no_auth_response
  test: decrease strain in test_startup_response
2026-03-04 08:55:43 +02:00
Andrzej Jackowski
bb359b3b78 cql3: start using write CL guardrails
Enable verification of write consistency level guardrails in
`modification_statement` and `batch_statement`.

Neither guardrail is enabled by default, so as not to disrupt clusters
that are currently using any of the CLs for writes. The warning
guardrail may seem harmless, as it only adds a warning to the CQL
response; however, enabling it can significantly increase network
traffic (as a warning message is added to each response) and also
decrease throughput due to additional allocations required to prepare
the warning. Therefore, both guardrails should be enabled with care.
The newly added `writes_per_consistency_level` metric, which is
incremented unconditionally, can help decide whether a guardrail can
be safely enabled in an existing cluster.
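A minimal sketch of the guardrail decision (illustrative only, not the actual cql3 code):

```python
def check_write_cl(cl: str, warned: frozenset, disallowed: frozenset):
    """Return (allowed, warning) for a write at consistency level `cl`.
    Both sets are empty by default, so out of the box nothing changes."""
    if cl in disallowed:
        return False, f"write consistency level {cl} is disallowed"
    if cl in warned:
        return True, f"write consistency level {cl} is discouraged"
    return True, None
```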

This commit adds additional `if` instructions on the critical path.
However, based on the `perf_simple_query` benchmark for writes,
the difference is marginal (~40 additional instructions, which is
a relative difference smaller than 0.001).

BEFORE:
```
291443.35 tps ( 53.3 allocs/op,  16.0 logallocs/op,  14.2 tasks/op,   48067 insns/op,   18885 cycles/op,        0 errors)
throughput:
	mean=   289743.07 standard-deviation=6075.60
	median= 291424.69 median-absolute-deviation=1702.56
	maximum=292498.27 minimum=261920.06
instructions_per_op:
	mean=   48072.30 standard-deviation=21.15
	median= 48074.49 median-absolute-deviation=12.07
	maximum=48119.87 minimum=48019.89
cpu_cycles_per_op:
	mean=   18884.09 standard-deviation=56.43
	median= 18877.33 median-absolute-deviation=14.71
	maximum=19155.48 minimum=18821.57
```

AFTER:
```
290108.83 tps ( 53.3 allocs/op,  16.0 logallocs/op,  14.2 tasks/op,   48121 insns/op,   18988 cycles/op,        0 errors)
throughput:
	mean=   289105.08 standard-deviation=3626.58
	median= 290018.90 median-absolute-deviation=1072.25
	maximum=291110.44 minimum=274669.98
instructions_per_op:
	mean=   48117.57 standard-deviation=18.58
	median= 48114.51 median-absolute-deviation=12.08
	maximum=48162.18 minimum=48087.18
cpu_cycles_per_op:
	mean=   18953.43 standard-deviation=28.76
	median= 18945.82 median-absolute-deviation=20.84
	maximum=19023.93 minimum=18916.46
```

Fixes: SCYLLADB-259
2026-03-04 07:26:00 +01:00
Asias He
225b10b683 repair: Fix rwlock in compaction_state and lock holder lifecycle
Consider this:

- repair takes the lock holder
- tablet merge fiber destroys the compaction group and the compaction state
- repair fails
- repair destroys the lock holder

This is observed in the test:

```
repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036] Repair 1 out of 1 tablets: table=sec_index.users range=(432345564227567615,504403158265495551] replicas=[0e9d51a5-9c99-4d6e-b9db-ad36a148b0ea:15, 498e354c-1254-4d8d-a565-2f5c6523845a:9, 5208598c-84f0-4526-bb7f-573728592172:28]

...

repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036]: Started to repair 1 out of 1 tables in keyspace=sec_index, table=users, table_id=ea2072d0-ccd9-11f0-8dba-c5ab01bffb77, repair_reason=repair
repair - Enable incremental repair for table=sec_index.users range=(432345564227567615,504403158265495551]
table - Disabled compaction for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
table - Got unrepaired compaction and repair lock for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
table - Disabled compaction for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
table - Got unrepaired compaction and repair lock for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036]: get_sync_boundary: got error from node=0e9d51a5-9c99-4d6e-b9db-ad36a148b0ea, keyspace=sec_index, table=users, range=(432345564227567615,504403158265495551], error=seastar::rpc::remote_verb_error (Compaction state for table [0x60f008fa34c0] not found)
compaction_manager - Stopping 1 tasks for 1 ongoing compactions for table sec_index.users compaction_group=238 due to tablet merge
compaction_manager - Stopping 1 tasks for 1 ongoing compactions for table sec_index.users compaction_group=238 due to tablet merge

....

scylla[10793] Segmentation fault on shard 28, in scheduling group streaming
```

The rwlock in compaction_state could be destroyed before the lock holder
of the rwlock is destroyed. This causes a use-after-free when the lock
holder is destroyed.

To fix it, users of the repair lock are now waited for when a compaction
group is being stopped.
That way the compaction group - which controls the lifetime of the rwlock -
cannot be destroyed while the lock is held.
Additionally, the merge completion fiber - which might remove groups -
is properly serialized with incremental repair.
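The "wait for remaining users before tearing the object down" pattern resembles Seastar's gate; a minimal threading-based Python sketch (illustrative only, not Seastar code):

```python
import threading

class Gate:
    """close() blocks until every holder has left, so the guarded object
    (here: the compaction group that owns the rwlock) cannot be destroyed
    while it is still in use."""
    def __init__(self) -> None:
        self._cv = threading.Condition()
        self._holders = 0
        self._closed = False

    def enter(self) -> None:
        with self._cv:
            if self._closed:
                raise RuntimeError("gate closed")
            self._holders += 1

    def leave(self) -> None:
        with self._cv:
            self._holders -= 1
            self._cv.notify_all()

    def close(self) -> None:
        with self._cv:
            self._closed = True
            self._cv.wait_for(lambda: self._holders == 0)
```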

The issue can be reproduced consistently using a sanitizer build and
cannot be reproduced after the fix.

Fixes #27365

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2026-03-03 21:05:15 -03:00
Raphael S. Carvalho
1d8903d9f7 repair: Prevent repair lock holder leakage after table drop
Prevent the repair lock holder from being leaked in repair_service when a
table is dropped midway.
The leakage might result in a use-after-free later, since the repair lock
itself will be gone after the table drop.
The RPC verb that removes the lock on the success path will not be called
by the coordinator after the table was dropped.

Refs #27365.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-896.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2026-03-03 21:05:10 -03:00
Dario Mirovic
06af4480ea config: enable maintenance socket in workdir by default
We want to enable the maintenance socket by default.
This will prevent users from having to restart a server to enable it.
There is also little point in having a maintenance socket that is turned
off; we want users to use it, and after this patch series they will have
to. Note that while config seeding exists, we do not encourage it
for production deployments.

This patch changes the default maintenance_socket value from ignore to
workdir, enabling the maintenance socket without specifying an explicit path.

Refs SCYLLADB-409
2026-03-04 00:01:07 +01:00
Dario Mirovic
6e83fb5029 docs: auth: do not specify password with -p option
Specifying the password with the -p option is considered unsafe:
the password will be saved in bash history.
The preferred approach is to enter the password when prompted.
Any approach that passes the password via command-line arguments
makes it visible in the process options (ps command), whether the
password is passed directly or via an environment variable.

Refs SCYLLADB-409
2026-03-04 00:01:07 +01:00
Dario Mirovic
afafb8a8fa docs: update documentation related to default superuser
Update create superuser procedure:
- Remove notes about default `cassandra` superuser
- Add create superuser using existing superuser section
- Update create superuser by using `scylla.yaml` config
- Add create superuser using maintenance socket

Update password reset procedure:
- Add maintenance socket approach
- Remove the old approach with deleting all the roles

Update enabling authentication with downtime and during runtime:
- Mention creating new superuser over the maintenance socket
- Remove default superuser usage

Update enable authorization:
- Mention creating new superuser over the maintenance socket
- Remove mention of default superuser

Reasoning for deletion of the old approach:
- [old] Needs cluster downtime, removes all roles, needs recreation of roles,
  needs maintenance socket anyways, if config values are not used for superuser
- [new] No cluster downtime, possibly one node restart to enable maintenance
  socket, faster

Refs SCYLLADB-409
2026-03-04 00:01:07 +01:00
Dario Mirovic
3db74aaf5f test: maintenance socket role management
Introduce a test that covers:
- Server startup without credentials config seeding with no roles created
- Await maintenance socket role management to be enabled
- `CREATE ROLE`, `ALTER ROLE`, and `DROP ROLE` statement execution success

All the tests in the test_maintenance_socket.py module take 2-3 seconds
to execute.

Explicitly shut down Cluster objects to prevent 'RuntimeError: cannot
schedule new futures after shutdown'.

Refs SCYLLADB-409
2026-03-03 23:57:50 +01:00
Dario Mirovic
f74fe22386 test: cluster: add logs to test_maintenance_socket.py
Add logs to test_maintenance_socket.py test test_maintenance_socket.
This approach offers additional visibility in case of test failure.
Such logs will be added to new tests in a follow up patch in this
patch series.

Refs SCYLLADB-409
2026-03-03 23:42:25 +01:00
Dario Mirovic
0e5ddec2a8 test: pylib: fix connect_driver handling when adding and starting server
When connect_driver=False, the expected server up state should be
capped to HOST_ID_QUERIED. This is to avoid waiting for CQL readiness,
which requires a superuser to be present.

This logic was only in ScyllaCluster.server_start. ManagerClient.server_add
with start=True and connect_driver=False would still wait for CQL and hang
if no superuser is present. The workaround was to call
ManagerClient.server_add(start=False, connect_driver=False) followed by
ManagerClient.server_start(connect_driver=False).

This patch moves the capping from ScyllaCluster.server_start to
ManagerClient.server_add and ManagerClient.server_start, where connect_driver
is processed. ScyllaCluster only receives the already resolved
expected_server_up_state value.

Refs SCYLLADB-409
2026-03-03 23:42:25 +01:00
Dario Mirovic
fd17dcbec8 auth: do not create default 'cassandra:cassandra' superuser
Changes the behavior of default superuser creation.
Previously, without configuration, 'cassandra:cassandra' credentials
were used. Now default superuser creation is skipped when not configured.

The two ways to create a default superuser are:
- Config file - auth_superuser_name and auth_superuser_salted_password fields
- Maintenance socket - connect over maintenance socket and CREATE/ALTER ROLE ...

Behavior changes:

Old behavior:
- No config - 'cassandra:cassandra' created
- auth_superuser_name only - <name>:cassandra created
- auth_superuser_salted_password only - 'cassandra:<password>' created
- Both specified - '<name>:<password>' created

New behavior:
- No config - no default superuser
    - Requires maintenance socket setup
- auth_superuser_name only - '<name>:' created WITHOUT password
    - Requires maintenance socket setup
- auth_superuser_salted_password only - no default superuser
- Both specified - '<name>:<password>' created

Fixes SCYLLADB-409
2026-03-03 23:42:25 +01:00
Dario Mirovic
9dc1deccf3 auth: remove redundant DEFAULT_USER_NAME from password authenticator
Remove redundant DEFAULT_USER_NAME from password_authenticator.cc file.
It is just a copy of meta::DEFAULT_SUPERUSER_NAME.

Refs SCYLLADB-409
2026-03-03 23:42:25 +01:00
Dario Mirovic
45628cf041 auth: enable role management operations via maintenance socket
Introduce maintenance_socket_authenticator and rework
maintenance_socket_role_manager to support role management operations.

Maintenance auth service uses allow_all_authenticator. To allow
role modification statements over the maintenance socket connections,
we need to treat the maintenance socket connections as superusers and
give them proper access rights.

Possible approaches are:
1. Modify allow_all_authenticator with conditional logic that
   password_authenticator already does
2. Modify password_authenticator with conditional logic specific
   for the maintenance socket connections
3. Extend password_authenticator, overriding the methods that differ

Option 3 is chosen: maintenance_socket_authenticator extends
password_authenticator with authentication disabled.

The maintenance_socket_role_manager is reworked to lazily create a
standard_role_manager once the node joins the cluster, delegating role
operations to it. In maintenance mode role operations remain disabled.

Refs SCYLLADB-409
2026-03-03 23:41:05 +01:00
Dario Mirovic
6a1edab2ac client_state: add has_superuser method
Encapsulate the superuser check in client_state so that it
respects _bypass_auth_checks. Connections that bypass auth
(internal callers and the maintenance socket) are always
considered superusers.

Migrate existing call sites from auth::has_superuser(service, user)
to client_state.has_superuser(). Also add _bypass_auth_checks
handling to ensure_not_anonymous().

Refs SCYLLADB-409
2026-03-03 22:31:35 +01:00
Dario Mirovic
d765b5b309 client_state: add _bypass_auth_checks flag
Authorization checks were previously skipped based on the
_is_internal flag. This couples two concerns: marking client
state as internal and bypassing authorization.

Introduce _bypass_auth_checks to handle only the authorization
bypass. Internal client state sets it to true, preserving current
behavior. External client state accepts it as a constructor
parameter, defaulting to false.

This will allow maintenance socket connections to skip
authorization without being marked as internal.

Refs SCYLLADB-409
2026-03-03 22:31:35 +01:00
Dario Mirovic
b68656b59f auth: let maintenance_socket_role_manager know if node is in maintenance mode
This patch is part of the preparations for dropping the 'cassandra:cassandra'
default superuser. When that is implemented, maintenance_socket_role_manager
will have two modes of work:
1. in maintenance mode, where role operations are forbidden
2. in normal mode, where role operations are allowed

To execute the role operations, the node has to join a cluster.
In maintenance mode the node does not join a cluster.

This patch lets maintenance_socket_role_manager know whether it is
working in maintenance mode, so it can return an appropriate error
message when execution of role operations is requested.

Refs SCYLLADB-409
2026-03-03 22:31:35 +01:00
Dario Mirovic
3bef493a35 auth: remove class registrator usage
This patch removes class registrator usage in the auth module.
It is no longer used after switching to factory functor initialization
of the auth service.

Several role manager, authenticator, and authorizer name variables
are removed as well and hardcoded inside the qualified_java_name method,
since that is the only place they are ever used.

Refs SCYLLADB-409
2026-03-03 22:31:35 +01:00
Dario Mirovic
eab24ff3b0 auth: instantiate auth service with factory functors
The auth service is currently instantiated with the constructor that
accepts service_config, which then uses the class registrator to
instantiate the authorizer, authenticator, and role manager.

This patch switches to instantiating the auth service via the constructor
that accepts factory functors. This is a step towards removing
usage of the class registrator.

Refs SCYLLADB-409
2026-03-03 22:31:35 +01:00
Dario Mirovic
bfff07eacb auth: add service constructor with factory functors
Auth service can be initialized:
- [current] by passing instantiated authorizer, authenticator, role manager
- [current] by passing service_config, which then uses class registrator to instantiate authorizer, authenticator, role manager
    - This approach is easy to use with sharded services
- [new] by passing factory functors which instantiate authorizer, authenticator, role manager
    - This approach is also easy to use with sharded services

Refs SCYLLADB-409
2026-03-03 22:31:35 +01:00
Dario Mirovic
e8e00c874b auth: add transitional.hh file
In a follow-up patch in this series the class registrator will be removed.
The transitional.hh file is needed to expose the authenticator and authorizer.

Refs SCYLLADB-409
2026-03-03 22:31:35 +01:00
Dario Mirovic
e5218157de service: qos: handle special scheduling group case for maintenance socket
service_level_controller has special handling for maintenance socket connections.
If the current user is not a named user, it should use the default scheduling group.

The reason is that the maintenance socket can communicate with Scylla before
auth_integration is registered.

The guard is already present elsewhere, but it was omitted from get_cached_user_scheduling_group.

This also fixes flakiness in test_maintenance_socket.py tests.

Refs SCYLLADB-409
2026-03-03 22:31:35 +01:00
Dario Mirovic
dc9a90d7cb service: qos: use _auth_integration as condition for using _auth_integration
Maintenance socket connections can be established before _auth_integration is
initialized. The fix introduced in PR scylladb/scylladb#26856 checks
the value of the user variable. For maintenance socket connections it will be an
anonymous user, which falls back to using the default scheduling group.

This patch changes the criteria for using default scheduling group from
the user variable to checking the _auth_integration variable itself:
- If _auth_integration is not initialized, use default scheduling group
- If _auth_integration is initialized, let it choose the scheduling group

Refs SCYLLADB-409
2026-03-03 22:31:35 +01:00
Andrzej Jackowski
371cdb3c81 cql3/query_processor: implement metrics to track CL of writes
Add `write_consistency_levels_disallowed_violations` and
`write_consistency_levels_warned_violations` metrics to track
violations of write_consistency_levels guardrails.

Add `writes_per_consistency_level` to track what CL is used by
writes, regardless of the guardrails configuration.
Data gathered by this metric can be used to decide whether enabling
a particular write consistency level guardrail in a particular
existing cluster is safe.

Refs: SCYLLADB-259
2026-03-03 21:18:11 +01:00
Andrzej Jackowski
3606934458 db: cql3/query_processor: add write_consistency_levels enum_sets
Add enum_sets to query_processor that track the configuration
values of `write_consistency_levels_warned` and
`write_consistency_levels_disallowed`.

Refs: SCYLLADB-259
2026-03-03 20:28:57 +01:00
Dawid Mędrek
7fd083e329 test: raft: Introduce get_default_cluster
We introduce a function that creates a Raft cluster with the parameters
usually used by the tests. This avoids code duplication, especially after
new tests are introduced in the following commits.

Note that the test `test_aborting_wait_for_state_change` has changed:
the previous complex configuration was unnecessary for it (I wrote it).
2026-03-03 18:50:21 +01:00
Pavel Emelyanov
b768753c0f test: Remove passing default "expected_replicas" to check_mutation_replicas()
The value of None is default, callers don't need to specify it
explicitly

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-03 17:28:21 +03:00
Pavel Emelyanov
b8ae9ede63 test: Remove scope and primary-replica-only arguments from check_mutation_replicas() helper
These two are only used to print into logs on error. However, their
values can be found from previous logs and test execution context.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-03 17:26:25 +03:00
Karol Nowacki
45477d9c6b vector_search: test: include ANN error in assertion
When the test fails, the assertion message does not include
the error from the ANN request.

This change enhances the assertion to include the specific ANN error,
making it easier to diagnose test failures.
2026-03-03 14:19:20 +01:00
Karol Nowacki
ab6c222fc4 vector_search: test: fix HTTPS client test flakiness
The default 100ms timeout for client readiness in tests is too
aggressive. In some test environments, this is not enough time for
client creation, which involves address resolution and TLS certificate
reading, leading to flaky tests.

This commit increases the default client creation timeout to 10 seconds.
This makes the tests more robust, especially in slower execution
environments, and prevents similar flakiness in other test cases.

Fixes: VECTOR-547, SCYLLADB-802
2026-03-03 14:19:20 +01:00
Botond Dénes
4f5310bc72 replica/table: update rf=1 table registry in shared tombstone-gc state
On every update of the ERM, update the state of the current table in the
registry of RF=1 tables in shared tombstone gc state.
Ensures that tombstone gc stops collection of tombstones in immediate
mode as soon as the table starts transitioning away from RF=1.
2026-03-03 14:09:28 +02:00
Botond Dénes
7c2c63ab43 tombstone_gc: tombstone_gc_before_getter: consider RF when getting gc before time
Currently we cannot use repair-mode tombstone gc on RF=1 tables, because
such tables don't need repair and so there won't be repair history to
use to produce gc_before times.
Introduce shared_tombstone_gc_state::_rf_one_tables which will keep a
registry of RF=1 tables. Keeping this up to date is left to outside code
(table.cc). Consult the registry to determine whether a table is RF=1 or
not, so the repair history check can be elided for RF=1 tables.
It is not yet wired into the table code.
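The consult-the-registry logic can be modeled roughly like this (a Python sketch under assumed semantics; the class and method names are hypothetical stand-ins for the C++ shared_tombstone_gc_state):

```python
MIN_TIME = float("-inf")

class SharedTombstoneGcState:
    """Rough model of shared tombstone-gc state with an RF=1 table registry."""

    def __init__(self):
        # Kept up to date by outside code (table.cc in the real implementation).
        self._rf_one_tables = set()

    def update_table_rf(self, table_id, rf):
        if rf == 1:
            self._rf_one_tables.add(table_id)
        else:
            self._rf_one_tables.discard(table_id)

    def get_gc_before(self, table_id, repair_history, query_time):
        # RF=1 tables never need repair, so the repair-history check is
        # skipped and tombstones can be collected immediately.
        if table_id in self._rf_one_tables:
            return query_time
        # Otherwise only tombstones older than the last repair may be GC'd.
        return repair_history.get(table_id, MIN_TIME)
```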
2026-03-03 14:09:28 +02:00
Botond Dénes
074006749c tombstone_gc: unpack per_table_history_maps
It is now a class with a single member, replace usage with that of the
member (through an alias to reduce churn).
2026-03-03 14:09:28 +02:00
Botond Dénes
d6e2d44759 tombstone_gc: extract _group0_gc_time from per_table_history_map
Doesn't belong there. Also, having it as a separate member of
shared_tombstone_gc_state makes updating _group0_gc_time cheaper, as the
update doesn't have to do a copy-mutate-swap of the history maps.
2026-03-03 14:09:28 +02:00
Botond Dénes
5fd9fc3056 tombstone_gc: drop tombstone_gc_state(nullptr) ctor and operator bool()
Both are ambiguous and all users were migrated away to more meaningful
alternatives. They are now unused, drop them.
2026-03-03 14:09:28 +02:00
Botond Dénes
a785c0cf41 test/lib/random_schema: use timeout-mode tombstone_gc
This is the current de-facto default for all tests using random schema
and some are apparently relying on it. Make this explicit so the
impending change of this default to repair does not upset these tests.
2026-03-03 14:09:28 +02:00
Botond Dénes
d10e622a3b tombstone_gc_options: add C++ friendly constructor
So one can create options with strong types, instead of from a map of
strings.
2026-03-03 14:09:28 +02:00
Botond Dénes
6004e84f18 test: move away from tombstone_gc_state(nullptr) ctor
Use for_tests() instead (or no_gc() where appropriate).
2026-03-03 14:09:28 +02:00
Botond Dénes
3c34598d88 treewide: move away from tombstone_gc_state(nullptr) ctor
It is ambiguous; use the no-gc or gc-all factories instead, as
appropriate.
A special note for mutation::compacted(): according to the comment above
it, it doesn't drop expired tombstones but as it is currently, it
actually does. Change the tombstone gc param for the underlying call to
compact_for_compaction() to uphold the comment. This is used in tests
mostly, so no fallout expected.

Tests are handled in the next commit, to reduce noise.

Two tests in mutation_test.cc have to be updated:

* test_compactor_range_tombstone_spanning_many_pages
  has to be updated in this commit, as it uses
  mutation_partition::compact_for_query() as well as
  compact_for_query(). The test passes default constructed
  tombstone_gc() to the latter while the former now uses no-gc
  creating a mismatch in tombstone gc behaviour, resulting in test
  failure. Update the test to also pass no-gc to compact_for_query().

* test_query_digest similarly uses mutation_partition::query_mutation()
  and another compaction method, having to match the no-gc now used in
  query_mutation().
2026-03-03 14:09:28 +02:00
Botond Dénes
04b001daa6 sstable: move away from tombstone_gc_mode::operator bool()
It is ambiguous, use tombstone_gc_mode::is_gc_enabled() instead.
Note that the two has slightly different meanings, operator bool()
returned true when repair-history related functionality was enabled.
This is fine, because the only two users are logs, where the two
meanings are close enough. All other users were eliminated or migrated
already, taking the change in meaning into account.
2026-03-03 14:09:28 +02:00
Botond Dénes
6364e35403 replica/table: add get_tombstone_gc_state()
Shorthand for
get_compaction_manager().get_shared_tombstone_gc_state().get_tombstone_gc_state().
2026-03-03 14:09:28 +02:00
Botond Dénes
f3ee6a0bd1 compaction: use tombstone_gc_state with value semantics
Instead of passing around references to it, pass around values.

This object is now designed to be used as a value-type, after recent
refactoring.
2026-03-03 14:09:27 +02:00
Botond Dénes
83e20d920e db/row_cache: use tombstone_gc_state with value semantics
Instead of keeping a pointer to it. Replace nullptr with
tombstone_gc_state::no_gc().

This object is now designed to be used as a value-type, after recent
refactoring.
2026-03-03 14:09:27 +02:00
Botond Dénes
041ab593c7 tombstone_gc: introduce tombstone_gc_state::for_tests()
To replace the usage of tombstone_gc_state(nullptr) in tests
specifically. This more verbose factory method will hopefully convey
that this is not to be used in production code. The nullptr constructor
doesn't convey this and in fact it was used in production code
here-and-there.
2026-03-03 14:09:27 +02:00
Artsiom Mishuta
5c84a76b28 test.py: setup pytest logger
This commit introduces pure pytest logging into a file.

Previously, test.py used pytest as a script (not a framework) and just captured pytest stdout and logged that data by itself.

This commit sets up a log file format that additionally displays the Python processName, threadName and taskName, because test.py test cases use them; without this information it is hard to investigate issues connected with parallelism inside the test cases themselves.

In addition, the commit splits the logging of different pytest workers (xdist) into different files. If a pytest worker has no failed tests, the log file for that worker is deleted.

There is also additional logging for failures: a separate file is written per test failure, containing the error itself (stacktrace) and all logs captured from stdout and stderr during the test run. With --save-log-on-success a separate file is written per passing test as well.

All this new functionality works with the new xdist scheduler (--test-py-init=True)
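A per-worker log setup of this shape can be sketched with the standard logging module (the format string and file naming here are illustrative, not test.py's actual configuration; taskName additionally requires Python 3.12+):

```python
import logging

# processName/threadName help untangle xdist workers and in-test threads;
# taskName (available from Python 3.12) does the same for asyncio tasks.
FMT = "%(asctime)s %(processName)s %(threadName)s %(levelname)s %(name)s: %(message)s"

def make_worker_handler(worker_id, log_dir="."):
    # One log file per xdist worker, e.g. ./gw0.log. delay=True postpones
    # opening the file until the first record is emitted, so a worker with
    # nothing to log never creates a file.
    handler = logging.FileHandler(f"{log_dir}/{worker_id}.log", delay=True)
    handler.setFormatter(logging.Formatter(FMT))
    return handler
```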

Fixes SCYLLADB-713

Closes scylladb/scylladb#28705
2026-03-03 11:49:01 +01:00
Dimitrios Symonidis
80b74d7df2 tablet options: Add max_tablet_count tablet option to enforce tablet count upper bounds
Introduced a new max_tablet_count tablet option that caps the maximum number of tablets a table can have. This feature is designed primarily for backup and restore workflows.
During backup, when load balancing is disabled for snapshot consistency, the current tablet count is recorded in the backup manifest.
During restore, max_tablet_count is set to this recorded value, ensuring the restored table's tablet count never exceeds the original snapshot's tablet distribution.
This guarantee enables efficient file-based SSTable streaming during restore, as each SSTable remains fully contained within a single tablet boundary.
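The backup/restore interplay described above can be sketched as follows (a minimal Python model; the manifest keys and function names are hypothetical, not Scylla's actual format):

```python
def backup_manifest_entry(table, tablet_count):
    # During backup, load balancing is disabled for snapshot consistency
    # and the current tablet count is recorded in the manifest.
    return {"table": table, "tablet_count": tablet_count}

def restore_tablet_options(manifest_entry):
    # During restore, max_tablet_count is set to the recorded value, so
    # the restored table never exceeds the snapshot's tablet distribution
    # and each SSTable stays within a single tablet boundary.
    return {"max_tablet_count": manifest_entry["tablet_count"]}

def effective_tablet_count(desired, options):
    # max_tablet_count acts as an upper bound on any desired count.
    cap = options.get("max_tablet_count")
    return desired if cap is None else min(desired, cap)
```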

Closes scylladb/scylladb#28450
2026-03-03 11:19:24 +03:00
Marcin Maliszkiewicz
d95939d69a test: add more logs to test_startup_no_auth_response
When test fails with assert connections_observed
we would like to know if it was unable to connect
or execute query in attempt_good_connection
2026-03-02 14:53:46 +01:00
Marcin Maliszkiewicz
91126eb2fb test: decrease strain in test_startup_response
For 2025.3 and 2025.4 this test runs order of magnitude
slower in debug mode. Potentially due to passwords::check
running in alien thread and overwhelming the CPU (this is
fixed in newer versions).

Decreasing the number of connections in test makes it fast
again, without breaking reproducibility.

As additional measure we double the timeout.

The fix is now cherry-picked to master as sometimes
test fails there too.

(cherry picked from commit 1f1fc2c2ac)
2026-03-02 14:46:51 +01:00
Karol Nowacki
647172d4b8 vector_search: fix names of private members
According to coding style in Scylla,
member variables are prefixed with underscore.
2026-03-02 14:08:16 +01:00
Karol Nowacki
f2308b000f vector_search: remove unused global variable 2026-03-02 14:08:07 +01:00
Taras Veretilnyk
5bbc44ed12 sstables: replace rewrite_statistics with new rewrite component mechanism
This commit migrates all callers of rewrite_statistics to the new
rewrite component mechanism.
2026-02-26 22:38:55 +01:00
Taras Veretilnyk
51c345aaf6 sstables: add new rewrite component mechanism for safe sstable component rewriting
Previously, rewriting an sstable component (e.g., via rewrite_statistics) created a temporary file that was renamed
to the final name after sealing. This allowed crash recovery by simply removing the temporary file on startup.

However, this approach won't work once component digests are stored in scylla_metadata,
as replacing a component like Statistics will require atomically updating both the component
and scylla_metadata with the new digest—impossible with POSIX rename.

The new mechanism creates a clone sstable with a fresh generation:
- Hard-links all components from the source except the component being rewritten (and scylla metadata, if update_sstable_id is true)
- Copies original sstable components pointer and recognized components from the source
- Invokes a modifier callback to adjust the new sstable before rewriting
- Writes the modified component. If update_sstable_id is true, reads scylla metadata, generates new sstable_id and rewrites it.
- Seals the new sstable with a temporary TOC
- Replaces the old sstable atomically, the same way as it is done in compaction

This is built on the rewrite_sstables compaction framework to support batch operations (e.g., following incremental repair).
If anything fails during the process, the sstable will be automatically deleted on node startup,
because its TOC is still temporary.

This prepares the infrastructure for component digests. Once digests are introduced in scylla_metadata
this mechanism will be extended to also rewrite scylla metadata with the updated digest alongside the modified component, ensuring atomic updates of both.
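The clone-rewrite-seal flow above can be demonstrated at the file level (a simplified sketch with hypothetical names; the real mechanism operates on sstable components through the rewrite_sstables compaction framework, not raw directories):

```python
import os

def rewrite_component(src_dir, dst_dir, components, target, modify):
    """Clone an 'sstable' directory, rewriting a single component.

    Hard-links every component except the target (so no data is copied),
    writes the modified target, and writes the TOC last: an incomplete
    clone (missing TOC) is detectable and removable on startup, mirroring
    the temporary-TOC handshake described above.
    """
    os.makedirs(dst_dir)
    for name in components:
        if name in (target, "TOC"):
            continue
        os.link(os.path.join(src_dir, name), os.path.join(dst_dir, name))
    # Invoke the modifier callback to produce the new component contents.
    with open(os.path.join(src_dir, target), "rb") as f:
        data = modify(f.read())
    with open(os.path.join(dst_dir, target), "wb") as f:
        f.write(data)
    # "Seal" the clone by writing the TOC last.
    with open(os.path.join(dst_dir, "TOC"), "w") as f:
        f.write("\n".join(components))
```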
2026-02-26 22:38:55 +01:00
Taras Veretilnyk
4aa0a3acf9 compaction: add compaction_group_view method to specify sstable version
Add make_sstable() overload that accepts sstable_version_types parameter
to compaction_group_view interface and all implementations.
This will be useful in the rewrite component mechanism, as we
need to preserve the sstable version when creating the new sstable for the replacement.
2026-02-26 22:38:55 +01:00
Taras Veretilnyk
16ea7a8c1c sstables: add null_data_sink and serialized_checksum for checksum-only calculation
Introduce a null_data_sink and make_digest_calculator implementation that discards
all writes, enabling checksum calculation without file I/O. This allows the
existing checksummed_file_writer to be used for digest computation
without writing data to disk.

This will be used in a future commit to calculate the scylla metadata
component checksum before writing it to disk, allowing the component
to store its own checksum.
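The null-sink idea reduces to a writer that feeds a rolling CRC32 while discarding the bytes. A minimal Python analogue (the class name is illustrative; Scylla's version plugs into seastar's data_sink interface):

```python
import zlib

class ChecksummedNullWriter:
    """Accumulates a CRC32 digest while discarding all written data."""

    def __init__(self):
        self._crc = 0
        self.bytes_written = 0  # position tracking only, no file I/O

    def write(self, data: bytes):
        # zlib.crc32 is incremental: pass the previous value to continue it.
        self._crc = zlib.crc32(data, self._crc)
        self.bytes_written += len(data)

    def digest(self) -> int:
        return self._crc
```

This lets a component be serialized once to learn its checksum before the real on-disk write happens.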
2026-02-26 22:38:51 +01:00
Ferenc Szili
f22d75a57e load_stats: add filtering for tablet sizes
This patch adds filtering for tablet sizes collected in load_stats.
This is needed to improve the chances that the balancer will have all
the tablet sizes for the node, and that way avoid having the node
ignored during balancing.
2026-02-26 15:17:39 +01:00
Ferenc Szili
d0a5a1d5d0 load_stats: move tablet filtering for table size computation
This patch moves the table size tablet filtering code from a lambda in
storage_service::load_stats_for_tablet_based_tables() to the code
section where it will be used:
tablet_storage_group_manager::table_load_stats()

This is needed to better accommodate the next commit which will add
code for filtering tablets for tablet sizes.
2026-02-26 11:07:53 +01:00
Ferenc Szili
9cd2a04e79 load_stats: bring the comment and code in sync
Table sizes are collected in load stats, and are filtered according to
the migration stage, so as to avoid double accounting of tablet sizes.
The comment which explains the cut-off migration stage (cleanup) and
the cut-off stage in the code (streaming) are not the same.

This patch fixes that.
2026-02-26 11:03:33 +01:00
Amnon Heiman
5db971c2f9 estimated_histogram_test.cc: add to_metrics_histogram test
Add a test that exercises to_metrics_histogram when Min is smaller than
Precision.

The test verifies duplicate integer bounds are collapsed,
counts remain cumulative, and native histogram metadata is still present
with the expected schema and min id.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2026-02-26 09:00:52 +02:00
Amnon Heiman
0b4f28ae21 histogram_metrics_helper.hh: Support Min < Precision
to_metrics_histogram now collapses duplicate integer bucket bounds
caused by Min less than Precision scaling while always keeping native
histogram metadata.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2026-02-26 09:00:38 +02:00
Amnon Heiman
af6371c11f estimated_histogram_test.cc: Add tests for approx_exponential_histogram with Min<Precision
Add four test cases to verify the hybrid linear/exponential bucketing:
- test_histogram_min_1_bucket_limits: Validates bucket lower limits
- test_histogram_min_1_basic: Tests value insertion and bucket distribution
- test_histogram_min_1_statistics: Tests min(), max(), quantile(), and mean()
- test_histogram_min_2_precision_4: Test min == 2 and precision 4.

These tests cover the new Min<Precision mode with Precision=4, verifying both the
linear range and exponential range.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2026-02-26 08:53:06 +02:00
Amnon Heiman
6c21e5f80c estimated_histogram.hh: support Min less than Precision histograms
approx_exponential_histogram is a pseudo-exponential histogram implementation
that can insert and retrieve values into and from buckets in O(1) time.
The implementation uses power of two ranges and splits them linearly into
buckets. The number of buckets per power of two range is called Precision.

The original implementation aimed at covering large value ranges had a
limitation. The histogram Min value had to be greater than or equal to
Precision. As a result code that needs histograms for small integer values
could not use this implementation efficiently.

This change addresses that gap by handling the case where Min is less than
Precision. For Min smaller than Precision the value is scaled by a power of
two factor during indexing so the existing exponential math can be reused
without runtime branching. Bucket limits are scaled back to the original
units which can lead to repeated bucket limits in the Min to Precision
range for integer values.

Example with Min = 2 and Precision = 4: bucket lower limits are
2, 2, 3, 3, 4, 5, 6, 7, 8, 10, 12, 14, and so on.

Implementation details:
- Introduce SHIFT based on log2(Precision) - log2(Min), when positive
- Scale Min and Max by SHIFT for all exponential calculations
- Compute NUM_BUCKETS using the standard log2(Max / Min) formula
- Use the scaled value in find_bucket_index to avoid fractional bucket steps
- Return bucket limits by scaling back to original units
- Relax the constraint from Min >= Precision to allow any Min less than Max, still a power of two

This change maintains backward compatibility with existing histograms
while enabling efficient tracking of small integer values.
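The SHIFT-based scaling can be modeled in a few lines of Python (a sketch of the arithmetic, assuming Min and Precision are powers of two as the commit requires; function names are illustrative). It reproduces the Min = 2, Precision = 4 example above:

```python
def shift_for(min_val, precision):
    # SHIFT = log2(Precision) - log2(Min), when positive; 0 otherwise.
    return max(0, precision.bit_length() - min_val.bit_length())

def bucket_lower_limit(i, min_val, precision):
    shift = shift_for(min_val, precision)
    min_scaled = min_val << shift          # scaled Min (== Precision when Min < Precision)
    base = min_scaled << (i // precision)  # start of this power-of-two range
    step = base // precision               # linear step inside the range
    limit_scaled = base + (i % precision) * step
    # Scaling back to original units can repeat integer limits for small values.
    return limit_scaled >> shift

def find_bucket_index(value, min_val, precision):
    shift = shift_for(min_val, precision)
    scaled = value << shift                # scaled value avoids fractional steps
    min_scaled = min_val << shift
    if scaled < min_scaled:
        return 0
    r = scaled.bit_length() - min_scaled.bit_length()  # power-of-two range index
    base = min_scaled << r
    return r * precision + (scaled - base) * precision // base
```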

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2026-02-26 00:46:14 +02:00
Amnon Heiman
aca5284b13 test(alternator): add per-table latency coverage for item and batch ops
Add missing tests for per-table Alternator latency metrics to ensure recent
per-table latency accounting is actually validated.

Changes in this patch:

- Refactor the latency assertion helper into check_sets_latency_by_metric(),
  parameterized by metric name.
- Keep existing behavior by implementing check_sets_latency() as a wrapper
  over scylla_alternator_op_latency.
- Add test_item_latency_per_table() to verify that
  scylla_alternator_table_op_latency_count increases for:
  PutItem, GetItem, DeleteItem, UpdateItem, BatchWriteItem,
  and BatchGetItem.

This closes a test gap where only global latency metrics were checked, while
per-table latency metrics were not covered.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2026-02-25 20:51:18 +02:00
Amnon Heiman
29e0b4e08c alternator: track per-table latency for batch get/write operations
Batch operations were updating only global latency histograms, which left
table-level latency metrics incomplete.

This change computes request duration once at the end of each operation and
reuses it to update both global and per-table latency stats:

Latencies are stored per table used.

This aligns batch read/write metric behavior with other operations and improves
per-table observability.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2026-02-25 20:51:18 +02:00
Wojciech Mitros
d1ff8f1db3 docs: add strong consistency doc
Add a new docs/dev document for the strongly consistent tables feature.
For now, it only contains information about the Raft metadata persistence,
but it should be updated as more of the strong-consistency components are
added.
2026-02-25 12:34:58 +01:00
Wojciech Mitros
97dc88d6b6 test/cluster: add tests for strongly-consistent tables' metadata persistence
In this patch we add various tests for checking how strongly consistent
tables work while allowing their tablets to reside on non-0 shards and
while using the new persistent storage for their raft metadata.
The tests verify that:
- strongly consistent tables' tablets can be allocated on different shards
and we can write/read from them
- the raft metadata is persistent across restarts even with disruptions
- the sharder correctly routes metadata queries to specified shards
- we can correctly perform multi-shard reads from the metadata tables
- we can read using just the group_id (without shard) using ALLOW FILTERING

For the tests we add logging to the sharder and partitioner and we add
some extra logs for observability.
2026-02-25 12:34:58 +01:00
Wojciech Mitros
f841c0522d raft: enable multi-shard raft groups for strongly consistent tablets
In this patch we allow strongly consistent tables to have tablets on
shards different than 0.
For that, we remove the checks for shard 0 for the non-group0 raft
groups, and we allow the tablet allocator to place tablets of
strongly consistent tables on shards other than 0.
We also start using the new storage (raft::persistence) for strongly
consistent tables, added in the preceding commits.
2026-02-25 12:34:58 +01:00
Wojciech Mitros
ffe32e8e4d test/raft: add unit tests for raft_groups_storage
Most functions of the new storage for raft groups for strongly
consistent tables are the same as for the system raft table
storage, so we reuse the tests for them to test the new storage.

We add additional tests for checking the new raft groups partitioner
and sharder, and for verifying that writes using storages for different
shards do not affect the data read on different shards.

We also add a test for checking the snapshot_descriptor present after
the storage bootstrap - for both system and strongly consistent storages
we check that the storage contains the initial descriptor.
2026-02-25 12:34:58 +01:00
Wojciech Mitros
16977d7aa0 raft: add raft_groups_storage persistence class
Add raft_groups_storage, a raft::persistence implementation for
strongly consistent tablet groups.

Currently, it's almost an exact copy of the raft_sys_table_storage that
uses the new raft tables for strongly consistent tables (raft_groups,
raft_groups_snapshots, raft_groups_snapshot_config) which have
a (shard, group_id) partition key.

In the future, the mutation, term and commit_idx data will be stored
differently for strongly consistent tables than for group0, which
will differentiate this class from the original raft_sys_table_storage.

The storage is created for each raft group server and it takes a shard
parameter at construction time to ensure all queries target the correct
partition (and thus shard).
2026-02-25 12:34:58 +01:00
Wojciech Mitros
654fe4b1ca db: add system tables for strongly consistent tables' raft groups
Add three new system tables for storing raft state for strongly
consistent tablets, corresponding to the tables for group0:

- system.raft_groups: Stores the raft log, term/vote, snapshot_id,
  and commit_idx for each tablet's raft group.

- system.raft_groups_snapshots: Stores snapshot descriptors
  (index, term) for each group.

- system.raft_groups_snapshot_config: Stores the raft configuration
  (current and previous voters) for each snapshot.

These tables use a (shard, group_id) composite partition key with
the newly added raft_groups_partitioner and raft_groups_sharder, ensuring
data is co-located with the tablet replica that owns the raft group.

The tables are only created when the STRONGLY_CONSISTENT_TABLES experimental
feature is enabled.
2026-02-25 12:34:58 +01:00
Wojciech Mitros
cb0caea8bf dht: add fixed_shard_partitioner and fixed_shard_sharder
Add a custom partitioner and sharder that will be used for Raft tables
for strongly consistent tables. These tables will have partition keys
of the form (shard, group_id) and the partitioner creates tokens that
encode the target shard in the high 16 bits.

Token layout:
  [shard: 16 bits][partition key hash: 48 bits]

This encoding guarantees that raft group data will be located on the
same shard as the tablet replica corresponding to that raft group, as long
as we use the tablet replica's shard as the value in the partition key.

Storing the shard directly in the partition key avoids additional lookups
for request routing to the incoming new raft tables.
For even more simplicity, we avoid any discrepancy between uint64_t and
int64_t by limiting the acceptable shard ids to at most 32767 (leaving
the top bit 0), which results in the same token value whether it is
interpreted as uint64_t or int64_t.

The sharder decodes the shard by extracting the high bits, which is
shard-count independent. This allows the partition key:shard mapping
to remain the same even during smp changes (only increases are allowed,
the same limitation as for tablets).
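The token layout described above is simple enough to write down directly (a Python sketch of the bit manipulation; constant names are illustrative):

```python
SHARD_BITS = 16
HASH_BITS = 64 - SHARD_BITS  # 48 bits for the partition key hash

def make_token(shard: int, key_hash: int) -> int:
    # Shard is limited to 15 usable bits (<= 32767) so the top bit stays 0
    # and the token has the same value as uint64_t and as int64_t.
    assert 0 <= shard <= 0x7FFF
    return (shard << HASH_BITS) | (key_hash & ((1 << HASH_BITS) - 1))

def shard_of(token: int) -> int:
    # Decoding only extracts the high bits, so it is independent of the
    # current shard count; the mapping survives smp increases.
    return token >> HASH_BITS
```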
2026-02-25 12:34:51 +01:00
Andrzej Jackowski
e2c4b0a733 config: add write_consistency_levels_* guardrails configuration
Add guardrails configuration that can be used later in this
patch series.

Refs: SCYLLADB-259
2026-02-25 10:30:03 +01:00
Wojciech Mitros
c1b3fec11a raft: add group_id -> shard mapping to raft_group_registry
To handle RPC from other nodes, we need to be able to redirect the
requests for each raft group to the shard that owns it. We need to
be able to do the redirection on all shards, so to achieve that, on
all shards we need to store the information about which shard is
occupied by each Raft group server.
For that we add a group_id -> shard mapping to the raft_group_registry.
The mapping is filled out when starting raft servers, it's emptied
when we abort raft servers. We use it when registering RPC verb handlers,
so that regardless of the shard handling the RPC, the work on the raft
group can be performed on the corresponding shard.
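The registry's lifecycle and redirection role can be modeled like this (a Python sketch; the class shape and submit_to stand-in are illustrative, the real code uses seastar's cross-shard submit_to):

```python
def submit_to(shard, work):
    # Stand-in for seastar's cross-shard submit_to(); runs inline here.
    return work(shard)

class RaftGroupRegistry:
    """Model of the group_id -> shard mapping kept on every shard."""

    def __init__(self):
        self._group_to_shard = {}

    def start_server(self, group_id, shard):
        # The mapping is filled out when starting raft servers...
        self._group_to_shard[group_id] = shard

    def abort_server(self, group_id):
        # ...and emptied when raft servers are aborted.
        self._group_to_shard.pop(group_id, None)

    def handle_rpc(self, group_id, work):
        # An RPC may arrive on any shard; look up the owner and redirect
        # so the work runs on the shard that owns the raft group.
        shard = self._group_to_shard[group_id]
        return submit_to(shard, work)
```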
2026-02-23 15:34:56 +01:00
Karol Nowacki
ca7f9a8baf vector_search: fix TLS server name with IP
SNI works only with DNS hostnames. Adding an IP address causes warnings
on the server side.
This change adds SNI only if it is not an IP address.

This change has no unit tests, as this behavior is not critical,
since it causes a warning on the server side.
The critical part, that the server name is verified, is already covered.

Fixes: VECTOR-528
2026-02-19 13:00:03 +01:00
Karol Nowacki
6205aad601 vector_search: add warn log for failed ann requests
In order to simplify troubleshooting connection problems, this patch
adds an extra warn log that prints the error for the vector search
request whenever it fails.
2026-02-19 13:00:03 +01:00
Taras Veretilnyk
f140ab0332 sstables: extract default write open flags into a constant
Extract the commonly used `open_flags::wo | open_flags::create |
open_flags::exclusive` into a reusable constant
`sstable_write_open_flags` to reduce duplication.
2026-02-13 14:27:01 +01:00
Taras Veretilnyk
c8281b7b8b sstables: Add write_simple_with_digest for component checksumming
Introduce new methods to write SSTable components while calculating
and returning their CRC32 checksums. This adds:

- make_digests_component_file_writer(): creates a crc32_digest_file_writer
  for component writing with checksum tracking
- write_simple_with_digest() and do_write_simple_with_digest(): write
  components and return the full checksum value
2026-02-13 14:27:01 +01:00
Taras Veretilnyk
1bf934c77c sstables: Extract file writer closing logic into separate methods
Refactor the consume_end_of_stream() method by extracting the inline
file writer closing logic into dedicated methods:
- close_index_writer()
- close_partitions_writer()
- close_rows_writer()
2026-02-13 14:27:01 +01:00
Taras Veretilnyk
dec5e48666 sstables: Implement CRC32 digest-only writer
Introduce a template parameter to the checksummed file writer to support
digest-only calculation without storing chunk checksums.
This will be needed in the future to calculate digests of other components.
2026-02-13 14:27:00 +01:00
Wojciech Mitros
c5a44b0f88 schema: add with_sharder overload accepting static_sharder reference
Add a schema_builder::with_sharder() overload that accepts a const
reference to dht::static_sharder. This allows schemas to use custom
sharder instances instead of only static sharder configurations.

This is needed to support tables that use custom partitioning and
sharding strategies, such as the incoming raft metadata tables for
strongly consistent tables.
2026-02-10 10:52:00 +01:00
Pavel Emelyanov
83e64b516a hint: Don't switch group in database::apply_hint()
The method is called from storage_proxy::mutate_hint() which is in turn
called from hint_mutation::apply_locally(). The latter is either called
directly by the hint sender, which already runs in the streaming group, or
via the RPC HINT_MUTATION handler, which uses index 1 and negotiates the
streaming group as well.

To be sure, add a debugging check for current group being the expected one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-09 08:54:51 +03:00
Pavel Emelyanov
727f1be11c hint_sender: Switch to sender group on stop either
Currently the sender only switches group for hint sending on start. It's
worth doing the same on stop too for consistency. There's nothing to
compete with at this point.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-09 08:54:51 +03:00
2660 changed files with 14629 additions and 12937 deletions

View File

@@ -1,6 +1,6 @@
version: 2
updates:
- package-ecosystem: "pip"
- package-ecosystem: "uv"
directory: "/docs"
schedule:
interval: "daily"

View File

@@ -1,8 +1,8 @@
name: Sync Jira Based on PR Events
name: Sync Jira Based on PR Events
on:
pull_request_target:
types: [opened, ready_for_review, review_requested, labeled, unlabeled, closed]
types: [opened, edited, ready_for_review, review_requested, labeled, unlabeled, closed]
permissions:
contents: read
@@ -10,32 +10,9 @@ permissions:
issues: write
jobs:
jira-sync-pr-opened:
if: github.event.action == 'opened'
uses: scylladb/github-automation/.github/workflows/main_jira_sync_pr_opened.yml@main
secrets:
caller_jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}
jira-sync-in-review:
if: github.event.action == 'ready_for_review' || github.event.action == 'review_requested'
uses: scylladb/github-automation/.github/workflows/main_jira_sync_in_review.yml@main
secrets:
caller_jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}
jira-sync-add-label:
if: github.event.action == 'labeled'
uses: scylladb/github-automation/.github/workflows/main_jira_sync_add_label.yml@main
secrets:
caller_jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}
jira-status-remove-label:
if: github.event.action == 'unlabeled'
uses: scylladb/github-automation/.github/workflows/main_jira_sync_remove_label.yml@main
secrets:
caller_jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}
jira-status-pr-closed:
if: github.event.action == 'closed'
uses: scylladb/github-automation/.github/workflows/main_jira_sync_pr_closed.yml@main
jira-sync:
uses: scylladb/github-automation/.github/workflows/main_pr_events_jira_sync.yml@main
with:
caller_action: ${{ github.event.action }}
secrets:
caller_jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}

View File

@@ -1,14 +1,14 @@
name: Call Jira release creation for new milestone
name: Call Jira release creation for new milestone
on:
milestone:
types: [created]
types: [created, closed]
jobs:
sync-milestone-to-jira:
uses: scylladb/github-automation/.github/workflows/main_sync_milestone_to_jira_release.yml@main
with:
# Comma-separated list of Jira project keys
jira_project_keys: "SCYLLADB,CUSTOMER,SMI"
jira_project_keys: "SCYLLADB,CUSTOMER,SMI,RELENG,VECTOR"
secrets:
caller_jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}

View File

@@ -19,6 +19,8 @@ on:
jobs:
release:
permissions:
pages: write
id-token: write
contents: write
runs-on: ubuntu-latest
steps:
@@ -31,7 +33,9 @@ jobs:
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.10"
python-version: "3.12"
- name: Install uv
uses: astral-sh/setup-uv@v6
- name: Set up env
run: make -C docs FLAG="${{ env.FLAG }}" setupenv
- name: Build docs

View File

@@ -29,7 +29,9 @@ jobs:
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.10"
python-version: "3.12"
- name: Install uv
uses: astral-sh/setup-uv@v6
- name: Set up env
run: make -C docs FLAG="${{ env.FLAG }}" setupenv
- name: Build docs

View File

@@ -14,14 +14,20 @@ jobs:
steps:
- name: Verify Org Membership
id: verify_author
env:
EVENT_NAME: ${{ github.event_name }}
PR_AUTHOR: ${{ github.event.pull_request.user.login }}
PR_ASSOCIATION: ${{ github.event.pull_request.author_association }}
COMMENT_AUTHOR: ${{ github.event.comment.user.login }}
COMMENT_ASSOCIATION: ${{ github.event.comment.author_association }}
shell: bash
run: |
if [[ "${{ github.event_name }}" == "pull_request_target" ]]; then
AUTHOR="${{ github.event.pull_request.user.login }}"
ASSOCIATION="${{ github.event.pull_request.author_association }}"
if [[ "$EVENT_NAME" == "pull_request_target" ]]; then
AUTHOR="$PR_AUTHOR"
ASSOCIATION="$PR_ASSOCIATION"
else
AUTHOR="${{ github.event.comment.user.login }}"
ASSOCIATION="${{ github.event.comment.author_association }}"
AUTHOR="$COMMENT_AUTHOR"
ASSOCIATION="$COMMENT_ASSOCIATION"
fi
if [[ "$ASSOCIATION" == "MEMBER" || "$ASSOCIATION" == "OWNER" ]]; then
echo "member=true" >> $GITHUB_OUTPUT
@@ -33,13 +39,11 @@ jobs:
- name: Validate Comment Trigger
if: github.event_name == 'issue_comment'
id: verify_comment
env:
COMMENT_BODY: ${{ github.event.comment.body }}
shell: bash
run: |
BODY=$(cat << 'EOF'
${{ github.event.comment.body }}
EOF
)
CLEAN_BODY=$(echo "$BODY" | grep -v '^[[:space:]]*>')
CLEAN_BODY=$(echo "$COMMENT_BODY" | grep -v '^[[:space:]]*>')
if echo "$CLEAN_BODY" | grep -qi '@scylladbbot' && echo "$CLEAN_BODY" | grep -qi 'trigger-ci'; then
echo "trigger=true" >> $GITHUB_OUTPUT


@@ -13,7 +13,8 @@
#include <string_view>
#include "alternator/auth.hh"
#include <fmt/format.h>
#include "auth/password_authenticator.hh"
#include "db/consistency_level_type.hh"
#include "db/system_keyspace.hh"
#include "service/storage_proxy.hh"
#include "alternator/executor.hh"
#include "cql3/selection/selection.hh"
@@ -25,8 +26,8 @@ namespace alternator {
static logging::logger alogger("alternator-auth");
future<std::string> get_key_from_roles(service::storage_proxy& proxy, auth::service& as, std::string username) {
schema_ptr schema = proxy.data_dictionary().find_schema(auth::get_auth_ks_name(as.query_processor()), "roles");
future<std::string> get_key_from_roles(service::storage_proxy& proxy, std::string username) {
schema_ptr schema = proxy.data_dictionary().find_schema(db::system_keyspace::NAME, "roles");
partition_key pk = partition_key::from_single_value(*schema, utf8_type->decompose(username));
dht::partition_range_vector partition_ranges{dht::partition_range(dht::decorate_key(*schema, pk))};
std::vector<query::clustering_range> bounds{query::clustering_range::make_open_ended_both_sides()};
@@ -39,7 +40,7 @@ future<std::string> get_key_from_roles(service::storage_proxy& proxy, auth::serv
auto partition_slice = query::partition_slice(std::move(bounds), {}, query::column_id_vector{salted_hash_col->id, can_login_col->id}, selection->get_query_options());
auto command = ::make_lw_shared<query::read_command>(schema->id(), schema->version(), partition_slice,
proxy.get_max_result_size(partition_slice), query::tombstone_limit(proxy.get_tombstone_limit()));
auto cl = auth::password_authenticator::consistency_for_user(username);
auto cl = db::consistency_level::LOCAL_ONE;
service::client_state client_state{service::client_state::internal_tag()};
service::storage_proxy::coordinator_query_result qr = co_await proxy.query(schema, std::move(command), std::move(partition_ranges), cl,


@@ -20,6 +20,6 @@ namespace alternator {
using key_cache = utils::loading_cache<std::string, std::string, 1>;
future<std::string> get_key_from_roles(service::storage_proxy& proxy, auth::service& as, std::string username);
future<std::string> get_key_from_roles(service::storage_proxy& proxy, std::string username);
}


@@ -3463,7 +3463,11 @@ future<executor::request_return_type> executor::batch_write_item(client_state& c
if (should_add_wcu) {
rjson::add(ret, "ConsumedCapacity", std::move(consumed_capacity));
}
_stats.api_operations.batch_write_item_latency.mark(std::chrono::steady_clock::now() - start_time);
auto duration = std::chrono::steady_clock::now() - start_time;
_stats.api_operations.batch_write_item_latency.mark(duration);
for (const auto& w : per_table_wcu) {
w.first->api_operations.batch_write_item_latency.mark(duration);
}
co_return rjson::print(std::move(ret));
}
@@ -4974,7 +4978,12 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
if (!some_succeeded && eptr) {
co_await coroutine::return_exception_ptr(std::move(eptr));
}
_stats.api_operations.batch_get_item_latency.mark(std::chrono::steady_clock::now() - start_time);
auto duration = std::chrono::steady_clock::now() - start_time;
_stats.api_operations.batch_get_item_latency.mark(duration);
for (const table_requests& rs : requests) {
lw_shared_ptr<stats> per_table_stats = get_stats_from_schema(_proxy, *rs.schema);
per_table_stats->api_operations.batch_get_item_latency.mark(duration);
}
if (is_big(response)) {
co_return make_streamed(std::move(response));
} else {


@@ -411,8 +411,8 @@ future<std::string> server::verify_signature(const request& req, const chunked_c
}
}
auto cache_getter = [&proxy = _proxy, &as = _auth_service] (std::string username) {
return get_key_from_roles(proxy, as, std::move(username));
auto cache_getter = [&proxy = _proxy] (std::string username) {
return get_key_from_roles(proxy, std::move(username));
};
return _key_cache.get_ptr(user, cache_getter).then_wrapped([this, &req, &content,
user = std::move(user),
@@ -771,7 +771,7 @@ future<executor::request_return_type> server::handle_api_request(std::unique_ptr
if (!username.empty()) {
client_state.set_login(auth::authenticated_user(username));
}
co_await client_state.maybe_update_per_service_level_params();
client_state.maybe_update_per_service_level_params();
tracing::trace_state_ptr trace_state = maybe_trace_query(client_state, username, op, content, _max_users_query_size_in_trace_output.get());
tracing::trace(trace_state, "{}", op);


@@ -14,20 +14,6 @@
namespace alternator {
const char* ALTERNATOR_METRICS = "alternator";
static seastar::metrics::histogram estimated_histogram_to_metrics(const utils::estimated_histogram& histogram) {
seastar::metrics::histogram res;
res.buckets.resize(histogram.bucket_offsets.size());
uint64_t cumulative_count = 0;
res.sample_count = histogram._count;
res.sample_sum = histogram._sample_sum;
for (size_t i = 0; i < res.buckets.size(); i++) {
auto& v = res.buckets[i];
v.upper_bound = histogram.bucket_offsets[i];
cumulative_count += histogram.buckets[i];
v.count = cumulative_count;
}
return res;
}
static seastar::metrics::label column_family_label("cf");
static seastar::metrics::label keyspace_label("ks");
@@ -151,21 +137,21 @@ static void register_metrics_with_optional_table(seastar::metrics::metric_groups
seastar::metrics::make_counter("batch_item_count", seastar::metrics::description("The total number of items processed across all batches"), labels,
stats.api_operations.batch_get_item_batch_total)(op("BatchGetItem")).aggregate(aggregate_labels).set_skip_when_empty(),
seastar::metrics::make_histogram("batch_item_count_histogram", seastar::metrics::description("Histogram of the number of items in a batch request"), labels,
[&stats]{ return estimated_histogram_to_metrics(stats.api_operations.batch_get_item_histogram);})(op("BatchGetItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
[&stats]{ return to_metrics_histogram(stats.api_operations.batch_get_item_histogram);})(op("BatchGetItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
seastar::metrics::make_histogram("batch_item_count_histogram", seastar::metrics::description("Histogram of the number of items in a batch request"), labels,
[&stats]{ return estimated_histogram_to_metrics(stats.api_operations.batch_write_item_histogram);})(op("BatchWriteItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
[&stats]{ return to_metrics_histogram(stats.api_operations.batch_write_item_histogram);})(op("BatchWriteItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
seastar::metrics::make_histogram("operation_size_kb", seastar::metrics::description("Histogram of item sizes involved in a request"), labels,
[&stats]{ return estimated_histogram_to_metrics(stats.operation_sizes.get_item_op_size_kb);})(op("GetItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
[&stats]{ return to_metrics_histogram(stats.operation_sizes.get_item_op_size_kb);})(op("GetItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
seastar::metrics::make_histogram("operation_size_kb", seastar::metrics::description("Histogram of item sizes involved in a request"), labels,
[&stats]{ return estimated_histogram_to_metrics(stats.operation_sizes.put_item_op_size_kb);})(op("PutItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
[&stats]{ return to_metrics_histogram(stats.operation_sizes.put_item_op_size_kb);})(op("PutItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
seastar::metrics::make_histogram("operation_size_kb", seastar::metrics::description("Histogram of item sizes involved in a request"), labels,
[&stats]{ return estimated_histogram_to_metrics(stats.operation_sizes.delete_item_op_size_kb);})(op("DeleteItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
[&stats]{ return to_metrics_histogram(stats.operation_sizes.delete_item_op_size_kb);})(op("DeleteItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
seastar::metrics::make_histogram("operation_size_kb", seastar::metrics::description("Histogram of item sizes involved in a request"), labels,
[&stats]{ return estimated_histogram_to_metrics(stats.operation_sizes.update_item_op_size_kb);})(op("UpdateItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
[&stats]{ return to_metrics_histogram(stats.operation_sizes.update_item_op_size_kb);})(op("UpdateItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
seastar::metrics::make_histogram("operation_size_kb", seastar::metrics::description("Histogram of item sizes involved in a request"), labels,
[&stats]{ return estimated_histogram_to_metrics(stats.operation_sizes.batch_get_item_op_size_kb);})(op("BatchGetItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
[&stats]{ return to_metrics_histogram(stats.operation_sizes.batch_get_item_op_size_kb);})(op("BatchGetItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
seastar::metrics::make_histogram("operation_size_kb", seastar::metrics::description("Histogram of item sizes involved in a request"), labels,
[&stats]{ return estimated_histogram_to_metrics(stats.operation_sizes.batch_write_item_op_size_kb);})(op("BatchWriteItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
[&stats]{ return to_metrics_histogram(stats.operation_sizes.batch_write_item_op_size_kb);})(op("BatchWriteItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
});
seastar::metrics::label expression_label("expression");


@@ -16,6 +16,8 @@
#include "cql3/stats.hh"
namespace alternator {
using batch_histogram = utils::estimated_histogram_with_max<128>;
using op_size_histogram = utils::estimated_histogram_with_max<512>;
// Object holding per-shard statistics related to Alternator.
// While this object is alive, these metrics are also registered to be
@@ -76,34 +78,34 @@ public:
utils::timed_rate_moving_average_summary_and_histogram batch_get_item_latency;
utils::timed_rate_moving_average_summary_and_histogram get_records_latency;
utils::estimated_histogram batch_get_item_histogram{22}; // a histogram that covers the range 1 - 100
utils::estimated_histogram batch_write_item_histogram{22}; // a histogram that covers the range 1 - 100
batch_histogram batch_get_item_histogram;
batch_histogram batch_write_item_histogram;
} api_operations;
// Operation size metrics
struct {
// Item size statistics collected per table and aggregated per node.
// Each histogram covers the range 0 - 446. Resolves #25143.
// Each histogram covers the range 0 - 512. Resolves #25143.
// A size is the retrieved item's size.
utils::estimated_histogram get_item_op_size_kb{30};
op_size_histogram get_item_op_size_kb;
// A size is the maximum of the new item's size and the old item's size.
utils::estimated_histogram put_item_op_size_kb{30};
op_size_histogram put_item_op_size_kb;
// A size is the deleted item's size. If the deleted item's size is
// unknown (i.e. read-before-write wasn't necessary and it wasn't
// forced by a configuration option), it won't be recorded on the
// histogram.
utils::estimated_histogram delete_item_op_size_kb{30};
op_size_histogram delete_item_op_size_kb;
// A size is the maximum of existing item's size and the estimated size
// of the update. This will be changed to the maximum of the existing item's
// size and the new item's size in a subsequent PR.
utils::estimated_histogram update_item_op_size_kb{30};
op_size_histogram update_item_op_size_kb;
// A size is the sum of the sizes of all items per table. This means
// that a single BatchGetItem / BatchWriteItem updates the histogram
// for each table that it has items in.
// The sizes are the retrieved items' sizes grouped per table.
utils::estimated_histogram batch_get_item_op_size_kb{30};
op_size_histogram batch_get_item_op_size_kb;
// The sizes are the written items' sizes grouped per table.
utils::estimated_histogram batch_write_item_op_size_kb{30};
op_size_histogram batch_write_item_op_size_kb;
} operation_sizes;
// Count of authentication and authorization failures, counted if either
// alternator_enforce_authorization or alternator_warn_authorization are
@@ -140,7 +142,7 @@ public:
cql3::cql_stats cql_stats;
// Enumeration of expression types only for stats
// if needed it can be extended e.g. per operation
enum expression_types {
UPDATE_EXPRESSION,
CONDITION_EXPRESSION,
@@ -164,7 +166,7 @@ struct table_stats {
void register_metrics(seastar::metrics::metric_groups& metrics, const stats& stats);
inline uint64_t bytes_to_kb_ceil(uint64_t bytes) {
return (bytes + 1023) / 1024;
return (bytes) / 1024;
}
}


@@ -33,6 +33,8 @@
#include "data_dictionary/data_dictionary.hh"
#include "utils/rjson.hh"
static logging::logger elogger("alternator-streams");
/**
* Base template type to implement rapidjson::internal::TypeHelper<...>:s
* for types that are ostreamable/string constructible/castable.
@@ -428,6 +430,25 @@ using namespace std::chrono_literals;
// Dynamo docs says no data shall live longer than 24h.
static constexpr auto dynamodb_streams_max_window = 24h;
// find the parent shard in previous generation for the given child shard
// takes care of wrap-around case in vnodes
// prev_streams must be sorted by token
const cdc::stream_id& find_parent_shard_in_previous_generation(db_clock::time_point prev_timestamp, const utils::chunked_vector<cdc::stream_id> &prev_streams, const cdc::stream_id &child) {
if (prev_streams.empty()) {
// something is really wrong - streams are empty
// let's try internal_error in hope it will be notified and fixed
on_internal_error(elogger, fmt::format("streams are empty for cdc generation at {} ({})", prev_timestamp, prev_timestamp.time_since_epoch().count()));
}
auto it = std::lower_bound(prev_streams.begin(), prev_streams.end(), child.token(), [](const cdc::stream_id& id, const dht::token& t) {
return id.token() < t;
});
if (it == prev_streams.end()) {
// wrap around case - take first
it = prev_streams.begin();
}
return *it;
}
future<executor::request_return_type> executor::describe_stream(client_state& client_state, service_permit permit, rjson::value request) {
_stats.api_operations.describe_stream++;
@@ -578,16 +599,8 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
auto shard = rjson::empty_object();
if (prev != e) {
auto& pids = prev->second.streams;
auto pid = std::upper_bound(pids.begin(), pids.end(), id.token(), [](const dht::token& t, const cdc::stream_id& id) {
return t < id.token();
});
if (pid != pids.begin()) {
pid = std::prev(pid);
}
if (pid != pids.end()) {
rjson::add(shard, "ParentShardId", shard_id(prev->first, *pid));
}
auto &pid = find_parent_shard_in_previous_generation(prev->first, prev->second.streams, id);
rjson::add(shard, "ParentShardId", shard_id(prev->first, pid));
}
last.emplace(ts, id);


@@ -243,7 +243,7 @@
"GOSSIP_DIGEST_SYN",
"GOSSIP_DIGEST_ACK2",
"GOSSIP_SHUTDOWN",
"DEFINITIONS_UPDATE",
"UNUSED__DEFINITIONS_UPDATE",
"TRUNCATE",
"UNUSED__REPLICATION_FINISHED",
"MIGRATION_REQUEST",


@@ -122,9 +122,9 @@ future<> unset_thrift_controller(http_context& ctx) {
return ctx.http_server.set_routes([&ctx] (routes& r) { unset_thrift_controller(ctx, r); });
}
future<> set_server_storage_service(http_context& ctx, sharded<service::storage_service>& ss, service::raft_group0_client& group0_client) {
return ctx.http_server.set_routes([&ctx, &ss, &group0_client] (routes& r) {
set_storage_service(ctx, r, ss, group0_client);
future<> set_server_storage_service(http_context& ctx, sharded<service::storage_service>& ss, sharded<db::snapshot_ctl>& ssc, service::raft_group0_client& group0_client) {
return ctx.http_server.set_routes([&ctx, &ss, &ssc, &group0_client] (routes& r) {
set_storage_service(ctx, r, ss, ssc, group0_client);
});
}


@@ -98,7 +98,7 @@ future<> set_server_config(http_context& ctx, db::config& cfg);
future<> unset_server_config(http_context& ctx);
future<> set_server_snitch(http_context& ctx, sharded<locator::snitch_ptr>& snitch);
future<> unset_server_snitch(http_context& ctx);
future<> set_server_storage_service(http_context& ctx, sharded<service::storage_service>& ss, service::raft_group0_client&);
future<> set_server_storage_service(http_context& ctx, sharded<service::storage_service>& ss, sharded<db::snapshot_ctl>&, service::raft_group0_client&);
future<> unset_server_storage_service(http_context& ctx);
future<> set_server_client_routes(http_context& ctx, sharded<service::client_routes_service>& cr);
future<> unset_server_client_routes(http_context& ctx);


@@ -835,9 +835,9 @@ rest_force_keyspace_flush(http_context& ctx, std::unique_ptr<http::request> req)
static
future<json::json_return_type>
rest_decommission(sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {
rest_decommission(sharded<service::storage_service>& ss, sharded<db::snapshot_ctl>& ssc, std::unique_ptr<http::request> req) {
apilog.info("decommission");
return ss.local().decommission().then([] {
return ss.local().decommission(ssc).then([] {
return make_ready_future<json::json_return_type>(json_void());
});
}
@@ -1572,10 +1572,7 @@ rest_upgrade_to_raft_topology(sharded<service::storage_service>& ss, std::unique
static
future<json::json_return_type>
rest_raft_topology_upgrade_status(sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {
const auto ustate = co_await ss.invoke_on(0, [] (auto& ss) {
return ss.get_topology_upgrade_state();
});
co_return sstring(format("{}", ustate));
co_return sstring("done");
}
static
@@ -1785,7 +1782,7 @@ rest_bind(FuncType func, BindArgs&... args) {
return std::bind_front(func, std::ref(args)...);
}
void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_service>& ss, service::raft_group0_client& group0_client) {
void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_service>& ss, sharded<db::snapshot_ctl>& ssc, service::raft_group0_client& group0_client) {
ss::get_token_endpoint.set(r, rest_bind(rest_get_token_endpoint, ctx, ss));
ss::toppartitions_generic.set(r, rest_bind(rest_toppartitions_generic, ctx));
ss::get_release_version.set(r, rest_bind(rest_get_release_version, ss));
@@ -1802,7 +1799,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
ss::reset_cleanup_needed.set(r, rest_bind(rest_reset_cleanup_needed, ctx, ss));
ss::force_flush.set(r, rest_bind(rest_force_flush, ctx));
ss::force_keyspace_flush.set(r, rest_bind(rest_force_keyspace_flush, ctx));
ss::decommission.set(r, rest_bind(rest_decommission, ss));
ss::decommission.set(r, rest_bind(rest_decommission, ss, ssc));
ss::move.set(r, rest_bind(rest_move, ss));
ss::remove_node.set(r, rest_bind(rest_remove_node, ss));
ss::exclude_node.set(r, rest_bind(rest_exclude_node, ss));
@@ -2144,6 +2141,7 @@ void unset_snapshot(http_context& ctx, routes& r) {
ss::start_backup.unset(r);
cf::get_true_snapshots_size.unset(r);
cf::get_all_true_snapshots_size.unset(r);
ss::decommission.unset(r);
}
}


@@ -66,7 +66,7 @@ struct scrub_info {
scrub_info parse_scrub_options(const http_context& ctx, std::unique_ptr<http::request> req);
void set_storage_service(http_context& ctx, httpd::routes& r, sharded<service::storage_service>& ss, service::raft_group0_client&);
void set_storage_service(http_context& ctx, httpd::routes& r, sharded<service::storage_service>& ss, sharded<db::snapshot_ctl>&, service::raft_group0_client&);
void unset_storage_service(http_context& ctx, httpd::routes& r);
void set_sstables_loader(http_context& ctx, httpd::routes& r, sharded<sstables_loader>& sst_loader);
void unset_sstables_loader(http_context& ctx, httpd::routes& r);


@@ -19,12 +19,12 @@ target_sources(scylla_auth
permission.cc
resource.cc
role_or_anonymous.cc
roles-metadata.cc
sasl_challenge.cc
saslauthd_authenticator.cc
service.cc
standard_role_manager.cc
transitional.cc
maintenance_socket_authenticator.cc
maintenance_socket_role_manager.cc)
target_include_directories(scylla_auth
PUBLIC
@@ -48,4 +48,4 @@ if (Scylla_USE_PRECOMPILED_HEADER_USE)
target_precompile_headers(scylla_auth REUSE_FROM scylla-precompiled-header)
endif()
check_headers(check-headers scylla_auth
GLOB_RECURSE ${CMAKE_CURRENT_SOURCE_DIR}/*.hh)


@@ -9,19 +9,9 @@
#include "auth/allow_all_authenticator.hh"
#include "service/migration_manager.hh"
#include "utils/class_registrator.hh"
namespace auth {
constexpr std::string_view allow_all_authenticator_name("org.apache.cassandra.auth.AllowAllAuthenticator");
// To ensure correct initialization order, we unfortunately need to use a string literal.
static const class_registrator<
authenticator,
allow_all_authenticator,
cql3::query_processor&,
::service::raft_group0_client&,
::service::migration_manager&,
cache&> registration("org.apache.cassandra.auth.AllowAllAuthenticator");
}


@@ -9,18 +9,9 @@
#include "auth/allow_all_authorizer.hh"
#include "auth/common.hh"
#include "utils/class_registrator.hh"
namespace auth {
constexpr std::string_view allow_all_authorizer_name("org.apache.cassandra.auth.AllowAllAuthorizer");
// To ensure correct initialization order, we unfortunately need to use a string literal.
static const class_registrator<
authorizer,
allow_all_authorizer,
cql3::query_processor&,
::service::raft_group0_client&,
::service::migration_manager&> registration("org.apache.cassandra.auth.AllowAllAuthorizer");
}


@@ -26,7 +26,7 @@ extern const std::string_view allow_all_authorizer_name;
class allow_all_authorizer final : public authorizer {
public:
allow_all_authorizer(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&) {
allow_all_authorizer(cql3::query_processor&) {
}
virtual future<> start() override {


@@ -209,9 +209,6 @@ future<> cache::prune_all() noexcept {
}
future<> cache::load_all() {
if (legacy_mode(_qp)) {
co_return;
}
SCYLLA_ASSERT(this_shard_id() == 0);
auto units = co_await get_units(_loading_sem, 1, _as);
@@ -263,9 +260,6 @@ future<> cache::gather_inheriting_roles(std::unordered_set<role_name_t>& roles,
}
future<> cache::load_roles(std::unordered_set<role_name_t> roles) {
if (legacy_mode(_qp)) {
co_return;
}
SCYLLA_ASSERT(this_shard_id() == 0);
auto units = co_await get_units(_loading_sem, 1, _as);


@@ -13,14 +13,11 @@
#include <boost/regex.hpp>
#include <fmt/ranges.h>
#include "utils/class_registrator.hh"
#include "utils/to_string.hh"
#include "data_dictionary/data_dictionary.hh"
#include "cql3/query_processor.hh"
#include "db/config.hh"
static const auto CERT_AUTH_NAME = "com.scylladb.auth.CertificateAuthenticator";
const std::string_view auth::certificate_authenticator_name(CERT_AUTH_NAME);
static logging::logger clogger("certificate_authenticator");
@@ -30,13 +27,6 @@ static const std::string cfg_query_attr = "query";
static const std::string cfg_source_subject = "SUBJECT";
static const std::string cfg_source_altname = "ALTNAME";
static const class_registrator<auth::authenticator
, auth::certificate_authenticator
, cql3::query_processor&
, ::service::raft_group0_client&
, ::service::migration_manager&
, auth::cache&> cert_auth_reg(CERT_AUTH_NAME);
enum class auth::certificate_authenticator::query_source {
subject, altname
};
@@ -99,7 +89,7 @@ future<> auth::certificate_authenticator::stop() {
}
std::string_view auth::certificate_authenticator::qualified_java_name() const {
return certificate_authenticator_name;
return "com.scylladb.auth.CertificateAuthenticator";
}
bool auth::certificate_authenticator::require_authentication() const {


@@ -27,8 +27,6 @@ namespace auth {
class cache;
extern const std::string_view certificate_authenticator_name;
class certificate_authenticator : public authenticator {
enum class query_source;
std::vector<std::pair<query_source, boost::regex>> _queries;


@@ -14,18 +14,11 @@
#include <seastar/core/sharded.hh>
#include "mutation/canonical_mutation.hh"
#include "schema/schema_fwd.hh"
#include "mutation/timestamp.hh"
#include "utils/assert.hh"
#include "utils/exponential_backoff_retry.hh"
#include "cql3/query_processor.hh"
#include "cql3/statements/create_table_statement.hh"
#include "schema/schema_builder.hh"
#include "service/migration_manager.hh"
#include "service/raft/group0_state_machine.hh"
#include "timeout_config.hh"
#include "utils/error_injection.hh"
#include "db/system_keyspace.hh"
namespace auth {
@@ -33,22 +26,14 @@ namespace meta {
namespace legacy {
constinit const std::string_view AUTH_KS("system_auth");
constinit const std::string_view USERS_CF("users");
} // namespace legacy
constinit const std::string_view AUTH_PACKAGE_NAME("org.apache.cassandra.auth.");
} // namespace meta
static logging::logger auth_log("auth");
bool legacy_mode(cql3::query_processor& qp) {
return qp.auth_version < db::auth_version_t::v2;
}
std::string_view get_auth_ks_name(cql3::query_processor& qp) {
if (legacy_mode(qp)) {
return meta::legacy::AUTH_KS;
}
return db::system_keyspace::NAME;
std::string default_superuser(cql3::query_processor& qp) {
return qp.db().get_config().auth_superuser_name();
}
// Func must support being invoked more than once.
@@ -65,47 +50,6 @@ future<> do_after_system_ready(seastar::abort_source& as, seastar::noncopyable_f
}).discard_result();
}
static future<> create_legacy_metadata_table_if_missing_impl(
std::string_view table_name,
cql3::query_processor& qp,
std::string_view cql,
::service::migration_manager& mm) {
SCYLLA_ASSERT(this_shard_id() == 0); // once_among_shards makes sure a function is executed on shard 0 only
auto db = qp.db();
auto parsed_statement = cql3::query_processor::parse_statement(cql, cql3::dialect{});
auto& parsed_cf_statement = static_cast<cql3::statements::raw::cf_statement&>(*parsed_statement);
parsed_cf_statement.prepare_keyspace(meta::legacy::AUTH_KS);
auto statement = static_pointer_cast<cql3::statements::create_table_statement>(
parsed_cf_statement.prepare(db, qp.get_cql_stats())->statement);
const auto schema = statement->get_cf_meta_data(qp.db());
const auto uuid = generate_legacy_id(schema->ks_name(), schema->cf_name());
schema_builder b(schema);
b.set_uuid(uuid);
schema_ptr table = b.build();
if (!db.has_schema(table->ks_name(), table->cf_name())) {
auto group0_guard = co_await mm.start_group0_operation();
auto ts = group0_guard.write_timestamp();
try {
co_return co_await mm.announce(co_await ::service::prepare_new_column_family_announcement(qp.proxy(), table, ts),
std::move(group0_guard), format("auth: create {} metadata table", table->cf_name()));
} catch (const exceptions::already_exists_exception&) {}
}
}
future<> create_legacy_metadata_table_if_missing(
std::string_view table_name,
cql3::query_processor& qp,
std::string_view cql,
::service::migration_manager& mm) noexcept {
return futurize_invoke(create_legacy_metadata_table_if_missing_impl, table_name, qp, cql, mm);
}
::service::query_state& internal_distributed_query_state() noexcept {
#ifdef DEBUG
// Give the much slower debug tests more headroom for completing auth queries.
@@ -140,56 +84,6 @@ static future<> announce_mutations_with_guard(
return group0_client.add_entry(std::move(group0_cmd), std::move(group0_guard), as, timeout);
}
future<> announce_mutations_with_batching(
::service::raft_group0_client& group0_client,
start_operation_func_t start_operation_func,
std::function<::service::mutations_generator(api::timestamp_type t)> gen,
seastar::abort_source& as,
std::optional<::service::raft_timeout> timeout) {
// account for command's overhead, it's better to use smaller threshold than constantly bounce off the limit
size_t memory_threshold = group0_client.max_command_size() * 0.75;
utils::get_local_injector().inject("auth_announce_mutations_command_max_size",
[&memory_threshold] {
memory_threshold = 1000;
});
size_t memory_usage = 0;
utils::chunked_vector<canonical_mutation> muts;
// guard has to be taken before we execute code in gen as
// it can do read-before-write and we want announce_mutations
// operation to be linearizable with other such calls,
// for instance if we do select and then delete in gen
// we want both to operate on the same data or fail
// if someone else modified it in the middle
std::optional<::service::group0_guard> group0_guard;
group0_guard = co_await start_operation_func(as);
auto timestamp = group0_guard->write_timestamp();
auto g = gen(timestamp);
while (auto mut = co_await g()) {
muts.push_back(canonical_mutation{*mut});
memory_usage += muts.back().representation().size();
if (memory_usage >= memory_threshold) {
if (!group0_guard) {
group0_guard = co_await start_operation_func(as);
timestamp = group0_guard->write_timestamp();
}
co_await announce_mutations_with_guard(group0_client, std::move(muts), std::move(*group0_guard), as, timeout);
group0_guard = std::nullopt;
memory_usage = 0;
muts = {};
}
}
if (!muts.empty()) {
if (!group0_guard) {
group0_guard = co_await start_operation_func(as);
timestamp = group0_guard->write_timestamp();
}
co_await announce_mutations_with_guard(group0_client, std::move(muts), std::move(*group0_guard), as, timeout);
}
}
future<> announce_mutations(
cql3::query_processor& qp,
::service::raft_group0_client& group0_client,


@@ -21,12 +21,7 @@
using namespace std::chrono_literals;
namespace replica {
class database;
}
namespace service {
class migration_manager;
class query_state;
}
@@ -40,10 +35,8 @@ namespace meta {
namespace legacy {
extern constinit const std::string_view AUTH_KS;
extern constinit const std::string_view USERS_CF;
} // namespace legacy
constexpr std::string_view DEFAULT_SUPERUSER_NAME("cassandra");
extern constinit const std::string_view AUTH_PACKAGE_NAME;
} // namespace meta
@@ -52,12 +45,7 @@ constexpr std::string_view PERMISSIONS_CF = "role_permissions";
constexpr std::string_view ROLE_MEMBERS_CF = "role_members";
constexpr std::string_view ROLE_ATTRIBUTES_CF = "role_attributes";
// This is a helper to check whether auth-v2 is on.
bool legacy_mode(cql3::query_processor& qp);
// We have legacy implementation using different keyspace
// and need to parametrize depending on runtime feature.
std::string_view get_auth_ks_name(cql3::query_processor& qp);
std::string default_superuser(cql3::query_processor& qp);
template <class Task>
future<> once_among_shards(Task&& f) {
@@ -71,12 +59,6 @@ future<> once_among_shards(Task&& f) {
// Func must support being invoked more than once.
future<> do_after_system_ready(seastar::abort_source& as, seastar::noncopyable_function<future<>()> func);
future<> create_legacy_metadata_table_if_missing(
std::string_view table_name,
cql3::query_processor&,
std::string_view cql,
::service::migration_manager&) noexcept;
///
/// Time-outs for internal, non-local CQL queries.
///
@@ -84,20 +66,6 @@ future<> create_legacy_metadata_table_if_missing(
::service::raft_timeout get_raft_timeout() noexcept;
// Executes an update query via the group0 mechanism; mutations will be applied on all nodes.
// Use this function when you need to perform a read before write on a single guard, or if
// you have more than one mutation and may exceed the single command size limit.
using start_operation_func_t = std::function<future<::service::group0_guard>(abort_source&)>;
future<> announce_mutations_with_batching(
::service::raft_group0_client& group0_client,
// Since we can also operate in the topology coordinator context, where we need stronger
// guarantees than start_operation from group0_client provides, we allow injecting a custom
// function here.
start_operation_func_t start_operation_func,
std::function<::service::mutations_generator(api::timestamp_type t)> gen,
seastar::abort_source& as,
std::optional<::service::raft_timeout> timeout);
// Executes an update query via the group0 mechanism; mutations will be applied on all nodes.
future<> announce_mutations(
cql3::query_processor& qp,

View File

@@ -26,7 +26,6 @@ extern "C" {
#include "cql3/untyped_result_set.hh"
#include "exceptions/exceptions.hh"
#include "utils/log.hh"
#include "utils/class_registrator.hh"
namespace auth {
@@ -40,111 +39,14 @@ static constexpr std::string_view PERMISSIONS_NAME = "permissions";
static logging::logger alogger("default_authorizer");
// To ensure correct initialization order, we unfortunately need to use a string literal.
static const class_registrator<
authorizer,
default_authorizer,
cql3::query_processor&,
::service::raft_group0_client&,
::service::migration_manager&> password_auth_reg("org.apache.cassandra.auth.CassandraAuthorizer");
default_authorizer::default_authorizer(cql3::query_processor& qp, ::service::raft_group0_client& g0, ::service::migration_manager& mm)
: _qp(qp)
, _migration_manager(mm) {
default_authorizer::default_authorizer(cql3::query_processor& qp)
: _qp(qp) {
}
default_authorizer::~default_authorizer() {
}
static const sstring legacy_table_name{"permissions"};
bool default_authorizer::legacy_metadata_exists() const {
return _qp.db().has_schema(meta::legacy::AUTH_KS, legacy_table_name);
}
future<bool> default_authorizer::legacy_any_granted() const {
static const sstring query = seastar::format("SELECT * FROM {}.{} LIMIT 1", meta::legacy::AUTH_KS, PERMISSIONS_CF);
return _qp.execute_internal(
query,
db::consistency_level::LOCAL_ONE,
{},
cql3::query_processor::cache_internal::yes).then([](::shared_ptr<cql3::untyped_result_set> results) {
return !results->empty();
});
}
future<> default_authorizer::migrate_legacy_metadata() {
alogger.info("Starting migration of legacy permissions metadata.");
static const sstring query = seastar::format("SELECT * FROM {}.{}", meta::legacy::AUTH_KS, legacy_table_name);
return _qp.execute_internal(
query,
db::consistency_level::LOCAL_ONE,
cql3::query_processor::cache_internal::no).then([this](::shared_ptr<cql3::untyped_result_set> results) {
return do_for_each(*results, [this](const cql3::untyped_result_set_row& row) {
return do_with(
row.get_as<sstring>("username"),
parse_resource(row.get_as<sstring>(RESOURCE_NAME)),
::service::group0_batch::unused(),
[this, &row](const auto& username, const auto& r, auto& mc) {
const permission_set perms = permissions::from_strings(row.get_set<sstring>(PERMISSIONS_NAME));
return grant(username, perms, r, mc);
});
}).finally([results] {});
}).then([] {
alogger.info("Finished migrating legacy permissions metadata.");
}).handle_exception([](std::exception_ptr ep) {
alogger.error("Encountered an error during migration!");
std::rethrow_exception(ep);
});
}
future<> default_authorizer::start_legacy() {
static const sstring create_table = fmt::format(
"CREATE TABLE {}.{} ("
"{} text,"
"{} text,"
"{} set<text>,"
"PRIMARY KEY({}, {})"
") WITH gc_grace_seconds={}",
meta::legacy::AUTH_KS,
PERMISSIONS_CF,
ROLE_NAME,
RESOURCE_NAME,
PERMISSIONS_NAME,
ROLE_NAME,
RESOURCE_NAME,
90 * 24 * 60 * 60); // 3 months.
return once_among_shards([this] {
return create_legacy_metadata_table_if_missing(
PERMISSIONS_CF,
_qp,
create_table,
_migration_manager).then([this] {
_finished = do_after_system_ready(_as, [this] {
return async([this] {
_migration_manager.wait_for_schema_agreement(_qp.db().real_database(), db::timeout_clock::time_point::max(), &_as).get();
if (legacy_metadata_exists()) {
if (!legacy_any_granted().get()) {
migrate_legacy_metadata().get();
return;
}
alogger.warn("Ignoring legacy permissions metadata since role permissions exist.");
}
});
});
});
});
}
future<> default_authorizer::start() {
if (legacy_mode(_qp)) {
return start_legacy();
}
return make_ready_future<>();
}
@@ -161,7 +63,7 @@ default_authorizer::authorize(const role_or_anonymous& maybe_role, const resourc
const sstring query = seastar::format("SELECT {} FROM {}.{} WHERE {} = ? AND {} = ?",
PERMISSIONS_NAME,
get_auth_ks_name(_qp),
db::system_keyspace::NAME,
PERMISSIONS_CF,
ROLE_NAME,
RESOURCE_NAME);
@@ -185,21 +87,13 @@ default_authorizer::modify(
std::string_view op,
::service::group0_batch& mc) {
const sstring query = seastar::format("UPDATE {}.{} SET {} = {} {} ? WHERE {} = ? AND {} = ?",
get_auth_ks_name(_qp),
db::system_keyspace::NAME,
PERMISSIONS_CF,
PERMISSIONS_NAME,
PERMISSIONS_NAME,
op,
ROLE_NAME,
RESOURCE_NAME);
if (legacy_mode(_qp)) {
co_return co_await _qp.execute_internal(
query,
db::consistency_level::ONE,
internal_distributed_query_state(),
{permissions::to_strings(set), sstring(role_name), resource.name()},
cql3::query_processor::cache_internal::no).discard_result();
}
co_await collect_mutations(_qp, mc, query,
{permissions::to_strings(set), sstring(role_name), resource.name()});
}
@@ -218,7 +112,7 @@ future<std::vector<permission_details>> default_authorizer::list_all() const {
ROLE_NAME,
RESOURCE_NAME,
PERMISSIONS_NAME,
get_auth_ks_name(_qp),
db::system_keyspace::NAME,
PERMISSIONS_CF);
const auto results = co_await _qp.execute_internal(
@@ -243,74 +137,16 @@ future<std::vector<permission_details>> default_authorizer::list_all() const {
future<> default_authorizer::revoke_all(std::string_view role_name, ::service::group0_batch& mc) {
try {
const sstring query = seastar::format("DELETE FROM {}.{} WHERE {} = ?",
get_auth_ks_name(_qp),
db::system_keyspace::NAME,
PERMISSIONS_CF,
ROLE_NAME);
if (legacy_mode(_qp)) {
co_await _qp.execute_internal(
query,
db::consistency_level::ONE,
internal_distributed_query_state(),
{sstring(role_name)},
cql3::query_processor::cache_internal::no).discard_result();
} else {
co_await collect_mutations(_qp, mc, query, {sstring(role_name)});
}
co_await collect_mutations(_qp, mc, query, {sstring(role_name)});
} catch (const exceptions::request_execution_exception& e) {
alogger.warn("CassandraAuthorizer failed to revoke all permissions of {}: {}", role_name, e);
}
}
future<> default_authorizer::revoke_all_legacy(const resource& resource) {
static const sstring query = seastar::format("SELECT {} FROM {}.{} WHERE {} = ? ALLOW FILTERING",
ROLE_NAME,
get_auth_ks_name(_qp),
PERMISSIONS_CF,
RESOURCE_NAME);
return _qp.execute_internal(
query,
db::consistency_level::LOCAL_ONE,
{resource.name()},
cql3::query_processor::cache_internal::no).then_wrapped([this, resource](future<::shared_ptr<cql3::untyped_result_set>> f) {
try {
auto res = f.get();
return parallel_for_each(
res->begin(),
res->end(),
[this, res, resource](const cql3::untyped_result_set::row& r) {
static const sstring query = seastar::format("DELETE FROM {}.{} WHERE {} = ? AND {} = ?",
get_auth_ks_name(_qp),
PERMISSIONS_CF,
ROLE_NAME,
RESOURCE_NAME);
return _qp.execute_internal(
query,
db::consistency_level::LOCAL_ONE,
{r.get_as<sstring>(ROLE_NAME), resource.name()},
cql3::query_processor::cache_internal::no).discard_result().handle_exception(
[resource](auto ep) {
try {
std::rethrow_exception(ep);
} catch (const exceptions::request_execution_exception& e) {
alogger.warn("CassandraAuthorizer failed to revoke all permissions on {}: {}", resource, e);
}
});
});
} catch (const exceptions::request_execution_exception& e) {
alogger.warn("CassandraAuthorizer failed to revoke all permissions on {}: {}", resource, e);
return make_ready_future();
}
});
}
future<> default_authorizer::revoke_all(const resource& resource, ::service::group0_batch& mc) {
if (legacy_mode(_qp)) {
co_return co_await revoke_all_legacy(resource);
}
if (resource.kind() == resource_kind::data &&
data_resource_view(resource).is_keyspace()) {
revoke_all_keyspace_resources(resource, mc);
@@ -321,7 +157,7 @@ future<> default_authorizer::revoke_all(const resource& resource, ::service::gro
auto gen = [this, name] (api::timestamp_type t) -> ::service::mutations_generator {
const sstring query = seastar::format("SELECT {} FROM {}.{} WHERE {} = ? ALLOW FILTERING",
ROLE_NAME,
get_auth_ks_name(_qp),
db::system_keyspace::NAME,
PERMISSIONS_CF,
RESOURCE_NAME);
auto res = co_await _qp.execute_internal(
@@ -331,7 +167,7 @@ future<> default_authorizer::revoke_all(const resource& resource, ::service::gro
cql3::query_processor::cache_internal::no);
for (const auto& r : *res) {
const sstring query = seastar::format("DELETE FROM {}.{} WHERE {} = ? AND {} = ?",
get_auth_ks_name(_qp),
db::system_keyspace::NAME,
PERMISSIONS_CF,
ROLE_NAME,
RESOURCE_NAME);
@@ -356,7 +192,7 @@ void default_authorizer::revoke_all_keyspace_resources(const resource& ks_resour
const sstring query = seastar::format("SELECT {}, {} FROM {}.{}",
ROLE_NAME,
RESOURCE_NAME,
get_auth_ks_name(_qp),
db::system_keyspace::NAME,
PERMISSIONS_CF);
auto res = co_await _qp.execute_internal(
query,
@@ -371,7 +207,7 @@ void default_authorizer::revoke_all_keyspace_resources(const resource& ks_resour
continue;
}
const sstring query = seastar::format("DELETE FROM {}.{} WHERE {} = ? AND {} = ?",
get_auth_ks_name(_qp),
db::system_keyspace::NAME,
PERMISSIONS_CF,
ROLE_NAME,
RESOURCE_NAME);

View File

@@ -27,14 +27,12 @@ namespace auth {
class default_authorizer : public authorizer {
cql3::query_processor& _qp;
::service::migration_manager& _migration_manager;
abort_source _as{};
future<> _finished{make_ready_future<>()};
public:
default_authorizer(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&);
default_authorizer(cql3::query_processor&);
~default_authorizer();
@@ -59,16 +57,6 @@ public:
virtual const resource_set& protected_resources() const override;
private:
future<> start_legacy();
bool legacy_metadata_exists() const;
future<> revoke_all_legacy(const resource&);
future<bool> legacy_any_granted() const;
future<> migrate_legacy_metadata();
future<> modify(std::string_view, permission_set, const resource&, std::string_view, ::service::group0_batch&);
void revoke_all_keyspace_resources(const resource& ks_resource, ::service::group0_batch& mc);

View File

@@ -24,7 +24,6 @@
#include "exceptions/exceptions.hh"
#include "seastarx.hh"
#include "service/raft/raft_group0_client.hh"
#include "utils/class_registrator.hh"
#include "db/config.hh"
#include "utils/exponential_backoff_retry.hh"
@@ -72,20 +71,10 @@ std::vector<sstring> get_attr_values(LDAP* ld, LDAPMessage* res, const char* att
return values;
}
const char* ldap_role_manager_full_name = "com.scylladb.auth.LDAPRoleManager";
} // anonymous namespace
namespace auth {
static const class_registrator<
role_manager,
ldap_role_manager,
cql3::query_processor&,
::service::raft_group0_client&,
::service::migration_manager&,
cache&> registration(ldap_role_manager_full_name);
ldap_role_manager::ldap_role_manager(
std::string_view query_template, std::string_view target_attr, std::string_view bind_name, std::string_view bind_password,
uint32_t permissions_update_interval_in_ms,
@@ -115,7 +104,7 @@ ldap_role_manager::ldap_role_manager(cql3::query_processor& qp, ::service::raft_
}
std::string_view ldap_role_manager::qualified_java_name() const noexcept {
return ldap_role_manager_full_name;
return "com.scylladb.auth.LDAPRoleManager";
}
const resource_set& ldap_role_manager::protected_resources() const {

View File

@@ -57,8 +57,7 @@ class ldap_role_manager : public role_manager {
cache& cache ///< Passed to standard_role_manager.
);
/// Retrieves LDAP configuration entries from qp and invokes the other constructor. Required by
/// class_registrator<role_manager>.
/// Retrieves LDAP configuration entries from qp and invokes the other constructor.
ldap_role_manager(cql3::query_processor& qp, ::service::raft_group0_client& rg0c, ::service::migration_manager& mm, cache& cache);
/// Thrown when query-template parsing fails.

View File

@@ -0,0 +1,31 @@
/*
* Copyright (C) 2026-present ScyllaDB
*
* Modified by ScyllaDB
*/
/*
* SPDX-License-Identifier: (LicenseRef-ScyllaDB-Source-Available-1.0 and Apache-2.0)
*/
#include "auth/maintenance_socket_authenticator.hh"
namespace auth {
maintenance_socket_authenticator::~maintenance_socket_authenticator() {
}
future<> maintenance_socket_authenticator::start() {
return make_ready_future<>();
}
future<> maintenance_socket_authenticator::ensure_superuser_is_created() const {
return make_ready_future<>();
}
bool maintenance_socket_authenticator::require_authentication() const {
return false;
}
} // namespace auth

View File

@@ -0,0 +1,36 @@
/*
* Copyright (C) 2026-present ScyllaDB
*
* Modified by ScyllaDB
*/
/*
* SPDX-License-Identifier: (LicenseRef-ScyllaDB-Source-Available-1.0 and Apache-2.0)
*/
#pragma once
#include <seastar/core/shared_future.hh>
#include "password_authenticator.hh"
namespace auth {
// maintenance_socket_authenticator is used for clients connecting to the
// maintenance socket. It does not require authentication,
// while still allowing management of roles and their credentials.
class maintenance_socket_authenticator : public password_authenticator {
public:
using password_authenticator::password_authenticator;
virtual ~maintenance_socket_authenticator();
virtual future<> start() override;
virtual future<> ensure_superuser_is_created() const override;
bool require_authentication() const override;
};
} // namespace auth

View File

@@ -13,23 +13,48 @@
#include <string_view>
#include "auth/cache.hh"
#include "cql3/description.hh"
#include "utils/class_registrator.hh"
#include "utils/log.hh"
#include "utils/on_internal_error.hh"
namespace auth {
constexpr std::string_view maintenance_socket_role_manager_name = "com.scylladb.auth.MaintenanceSocketRoleManager";
static logging::logger log("maintenance_socket_role_manager");
static const class_registrator<
role_manager,
maintenance_socket_role_manager,
cql3::query_processor&,
::service::raft_group0_client&,
::service::migration_manager&,
cache&> registration(sstring{maintenance_socket_role_manager_name});
future<> maintenance_socket_role_manager::ensure_role_operations_are_enabled() {
if (_is_maintenance_mode) {
on_internal_error(log, "enabling role operations not allowed in maintenance mode");
}
if (_std_mgr.has_value()) {
on_internal_error(log, "role operations are already enabled");
}
_std_mgr.emplace(_qp, _group0_client, _migration_manager, _cache);
return _std_mgr->start();
}
void maintenance_socket_role_manager::set_maintenance_mode() {
if (_std_mgr.has_value()) {
on_internal_error(log, "cannot enter maintenance mode after role operations have been enabled");
}
_is_maintenance_mode = true;
}
maintenance_socket_role_manager::maintenance_socket_role_manager(
cql3::query_processor& qp,
::service::raft_group0_client& rg0c,
::service::migration_manager& mm,
cache& c)
: _qp(qp)
, _group0_client(rg0c)
, _migration_manager(mm)
, _cache(c)
, _std_mgr(std::nullopt)
, _is_maintenance_mode(false) {
}
std::string_view maintenance_socket_role_manager::qualified_java_name() const noexcept {
return maintenance_socket_role_manager_name;
return "com.scylladb.auth.MaintenanceSocketRoleManager";
}
const resource_set& maintenance_socket_role_manager::protected_resources() const {
@@ -43,81 +68,161 @@ future<> maintenance_socket_role_manager::start() {
}
future<> maintenance_socket_role_manager::stop() {
return make_ready_future<>();
return _std_mgr ? _std_mgr->stop() : make_ready_future<>();
}
future<> maintenance_socket_role_manager::ensure_superuser_is_created() {
return make_ready_future<>();
return _std_mgr ? _std_mgr->ensure_superuser_is_created() : make_ready_future<>();
}
template<typename T = void>
future<T> operation_not_supported_exception(std::string_view operation) {
future<T> operation_not_available_in_maintenance_mode_exception(std::string_view operation) {
return make_exception_future<T>(
std::runtime_error(fmt::format("role manager: {} operation not supported through maintenance socket", operation)));
std::runtime_error(fmt::format("role manager: {} operation not available through maintenance socket in maintenance mode", operation)));
}
future<> maintenance_socket_role_manager::create(std::string_view role_name, const role_config&, ::service::group0_batch&) {
return operation_not_supported_exception("CREATE");
template<typename T = void>
future<T> manager_not_ready_exception(std::string_view operation) {
return make_exception_future<T>(
std::runtime_error(fmt::format("role manager: {} operation not available because the manager is not ready yet (role operations not enabled)", operation)));
}
future<> maintenance_socket_role_manager::validate_operation(std::string_view name) const {
if (_is_maintenance_mode) {
return operation_not_available_in_maintenance_mode_exception(name);
}
if (!_std_mgr) {
return manager_not_ready_exception(name);
}
return make_ready_future<>();
}
future<> maintenance_socket_role_manager::create(std::string_view role_name, const role_config& c, ::service::group0_batch& mc) {
auto f = validate_operation("CREATE");
if (f.failed()) {
return f;
}
return _std_mgr->create(role_name, c, mc);
}
future<> maintenance_socket_role_manager::drop(std::string_view role_name, ::service::group0_batch& mc) {
return operation_not_supported_exception("DROP");
auto f = validate_operation("DROP");
if (f.failed()) {
return f;
}
return _std_mgr->drop(role_name, mc);
}
future<> maintenance_socket_role_manager::alter(std::string_view role_name, const role_config_update&, ::service::group0_batch&) {
return operation_not_supported_exception("ALTER");
future<> maintenance_socket_role_manager::alter(std::string_view role_name, const role_config_update& u, ::service::group0_batch& mc) {
auto f = validate_operation("ALTER");
if (f.failed()) {
return f;
}
return _std_mgr->alter(role_name, u, mc);
}
future<> maintenance_socket_role_manager::grant(std::string_view grantee_name, std::string_view role_name, ::service::group0_batch& mc) {
return operation_not_supported_exception("GRANT");
auto f = validate_operation("GRANT");
if (f.failed()) {
return f;
}
return _std_mgr->grant(grantee_name, role_name, mc);
}
future<> maintenance_socket_role_manager::revoke(std::string_view revokee_name, std::string_view role_name, ::service::group0_batch& mc) {
return operation_not_supported_exception("REVOKE");
auto f = validate_operation("REVOKE");
if (f.failed()) {
return f;
}
return _std_mgr->revoke(revokee_name, role_name, mc);
}
future<role_set> maintenance_socket_role_manager::query_granted(std::string_view grantee_name, recursive_role_query) {
return operation_not_supported_exception<role_set>("QUERY GRANTED");
future<role_set> maintenance_socket_role_manager::query_granted(std::string_view grantee_name, recursive_role_query m) {
auto f = validate_operation("QUERY GRANTED");
if (f.failed()) {
return make_exception_future<role_set>(f.get_exception());
}
return _std_mgr->query_granted(grantee_name, m);
}
future<role_to_directly_granted_map> maintenance_socket_role_manager::query_all_directly_granted(::service::query_state&) {
return operation_not_supported_exception<role_to_directly_granted_map>("QUERY ALL DIRECTLY GRANTED");
future<role_to_directly_granted_map> maintenance_socket_role_manager::query_all_directly_granted(::service::query_state& qs) {
auto f = validate_operation("QUERY ALL DIRECTLY GRANTED");
if (f.failed()) {
return make_exception_future<role_to_directly_granted_map>(f.get_exception());
}
return _std_mgr->query_all_directly_granted(qs);
}
future<role_set> maintenance_socket_role_manager::query_all(::service::query_state&) {
return operation_not_supported_exception<role_set>("QUERY ALL");
future<role_set> maintenance_socket_role_manager::query_all(::service::query_state& qs) {
auto f = validate_operation("QUERY ALL");
if (f.failed()) {
return make_exception_future<role_set>(f.get_exception());
}
return _std_mgr->query_all(qs);
}
future<bool> maintenance_socket_role_manager::exists(std::string_view role_name) {
return operation_not_supported_exception<bool>("EXISTS");
auto f = validate_operation("EXISTS");
if (f.failed()) {
return make_exception_future<bool>(f.get_exception());
}
return _std_mgr->exists(role_name);
}
future<bool> maintenance_socket_role_manager::is_superuser(std::string_view role_name) {
return make_ready_future<bool>(true);
auto f = validate_operation("IS SUPERUSER");
if (f.failed()) {
return make_exception_future<bool>(f.get_exception());
}
return _std_mgr->is_superuser(role_name);
}
future<bool> maintenance_socket_role_manager::can_login(std::string_view role_name) {
return make_ready_future<bool>(true);
auto f = validate_operation("CAN LOGIN");
if (f.failed()) {
return make_exception_future<bool>(f.get_exception());
}
return _std_mgr->can_login(role_name);
}
future<std::optional<sstring>> maintenance_socket_role_manager::get_attribute(std::string_view role_name, std::string_view attribute_name, ::service::query_state&) {
return operation_not_supported_exception<std::optional<sstring>>("GET ATTRIBUTE");
future<std::optional<sstring>> maintenance_socket_role_manager::get_attribute(std::string_view role_name, std::string_view attribute_name, ::service::query_state& qs) {
auto f = validate_operation("GET ATTRIBUTE");
if (f.failed()) {
return make_exception_future<std::optional<sstring>>(f.get_exception());
}
return _std_mgr->get_attribute(role_name, attribute_name, qs);
}
future<role_manager::attribute_vals> maintenance_socket_role_manager::query_attribute_for_all(std::string_view attribute_name, ::service::query_state&) {
return operation_not_supported_exception<role_manager::attribute_vals>("QUERY ATTRIBUTE");
future<role_manager::attribute_vals> maintenance_socket_role_manager::query_attribute_for_all(std::string_view attribute_name, ::service::query_state& qs) {
auto f = validate_operation("QUERY ATTRIBUTE FOR ALL");
if (f.failed()) {
return make_exception_future<role_manager::attribute_vals>(f.get_exception());
}
return _std_mgr->query_attribute_for_all(attribute_name, qs);
}
future<> maintenance_socket_role_manager::set_attribute(std::string_view role_name, std::string_view attribute_name, std::string_view attribute_value, ::service::group0_batch& mc) {
return operation_not_supported_exception("SET ATTRIBUTE");
auto f = validate_operation("SET ATTRIBUTE");
if (f.failed()) {
return f;
}
return _std_mgr->set_attribute(role_name, attribute_name, attribute_value, mc);
}
future<> maintenance_socket_role_manager::remove_attribute(std::string_view role_name, std::string_view attribute_name, ::service::group0_batch& mc) {
return operation_not_supported_exception("REMOVE ATTRIBUTE");
auto f = validate_operation("REMOVE ATTRIBUTE");
if (f.failed()) {
return f;
}
return _std_mgr->remove_attribute(role_name, attribute_name, mc);
}
future<std::vector<cql3::description>> maintenance_socket_role_manager::describe_role_grants() {
return operation_not_supported_exception<std::vector<cql3::description>>("DESCRIBE SCHEMA WITH INTERNALS");
auto f = validate_operation("DESCRIBE ROLE GRANTS");
if (f.failed()) {
return make_exception_future<std::vector<cql3::description>>(f.get_exception());
}
return _std_mgr->describe_role_grants();
}
} // namespace auth

View File

@@ -11,6 +11,7 @@
#include "auth/cache.hh"
#include "auth/resource.hh"
#include "auth/role_manager.hh"
#include "auth/standard_role_manager.hh"
#include <seastar/core/future.hh>
namespace cql3 {
@@ -24,13 +25,26 @@ class raft_group0_client;
namespace auth {
extern const std::string_view maintenance_socket_role_manager_name;
// This role manager is used by the maintenance socket. It disables all role management operations so that it
// does not depend on the system_auth keyspace, which may not yet be created when the maintenance socket starts listening.
// This role manager is used by the maintenance socket. It disables all role management operations
// in maintenance mode. In normal mode it delegates all operations to a standard_role_manager,
// which is created on demand when the node joins the cluster.
class maintenance_socket_role_manager final : public role_manager {
cql3::query_processor& _qp;
::service::raft_group0_client& _group0_client;
::service::migration_manager& _migration_manager;
cache& _cache;
std::optional<standard_role_manager> _std_mgr;
bool _is_maintenance_mode;
public:
maintenance_socket_role_manager(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&, cache&) {}
void set_maintenance_mode() override;
// Ensures role management operations are enabled.
// It must be called once the node has joined the cluster.
// Until then, all role management operations will fail.
future<> ensure_role_operations_are_enabled() override;
maintenance_socket_role_manager(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&, cache&);
virtual std::string_view qualified_java_name() const noexcept override;
@@ -42,21 +56,21 @@ public:
virtual future<> ensure_superuser_is_created() override;
virtual future<> create(std::string_view role_name, const role_config&, ::service::group0_batch&) override;
virtual future<> create(std::string_view role_name, const role_config& c, ::service::group0_batch& mc) override;
virtual future<> drop(std::string_view role_name, ::service::group0_batch& mc) override;
virtual future<> alter(std::string_view role_name, const role_config_update&, ::service::group0_batch&) override;
virtual future<> alter(std::string_view role_name, const role_config_update& u, ::service::group0_batch& mc) override;
virtual future<> grant(std::string_view grantee_name, std::string_view role_name, ::service::group0_batch& mc) override;
virtual future<> revoke(std::string_view revokee_name, std::string_view role_name, ::service::group0_batch& mc) override;
virtual future<role_set> query_granted(std::string_view grantee_name, recursive_role_query) override;
virtual future<role_set> query_granted(std::string_view grantee_name, recursive_role_query m) override;
virtual future<role_to_directly_granted_map> query_all_directly_granted(::service::query_state&) override;
virtual future<role_to_directly_granted_map> query_all_directly_granted(::service::query_state& qs) override;
virtual future<role_set> query_all(::service::query_state&) override;
virtual future<role_set> query_all(::service::query_state& qs) override;
virtual future<bool> exists(std::string_view role_name) override;
@@ -64,15 +78,19 @@ public:
virtual future<bool> can_login(std::string_view role_name) override;
virtual future<std::optional<sstring>> get_attribute(std::string_view role_name, std::string_view attribute_name, ::service::query_state&) override;
virtual future<std::optional<sstring>> get_attribute(std::string_view role_name, std::string_view attribute_name, ::service::query_state& qs) override;
virtual future<role_manager::attribute_vals> query_attribute_for_all(std::string_view attribute_name, ::service::query_state&) override;
virtual future<role_manager::attribute_vals> query_attribute_for_all(std::string_view attribute_name, ::service::query_state& qs) override;
virtual future<> set_attribute(std::string_view role_name, std::string_view attribute_name, std::string_view attribute_value, ::service::group0_batch& mc) override;
virtual future<> remove_attribute(std::string_view role_name, std::string_view attribute_name, ::service::group0_batch& mc) override;
virtual future<std::vector<cql3::description>> describe_role_grants() override;
private:
future<> validate_operation(std::string_view name) const;
};
}

View File

@@ -26,10 +26,9 @@
#include "cql3/untyped_result_set.hh"
#include "utils/log.hh"
#include "service/migration_manager.hh"
#include "utils/class_registrator.hh"
#include "replica/database.hh"
#include "cql3/query_processor.hh"
#include "db/config.hh"
#include "db/system_keyspace.hh"
namespace auth {
@@ -37,29 +36,10 @@ constexpr std::string_view password_authenticator_name("org.apache.cassandra.aut
// name of the hash column.
static constexpr std::string_view SALTED_HASH = "salted_hash";
static constexpr std::string_view DEFAULT_USER_NAME = meta::DEFAULT_SUPERUSER_NAME;
static const sstring DEFAULT_USER_PASSWORD = sstring(meta::DEFAULT_SUPERUSER_NAME);
static logging::logger plogger("password_authenticator");
// To ensure correct initialization order, we unfortunately need to use a string literal.
static const class_registrator<
authenticator,
password_authenticator,
cql3::query_processor&,
::service::raft_group0_client&,
::service::migration_manager&,
cache&> password_auth_reg("org.apache.cassandra.auth.PasswordAuthenticator");
static thread_local auto rng_for_salt = std::default_random_engine(std::random_device{}());
static std::string_view get_config_value(std::string_view value, std::string_view def) {
return value.empty() ? def : value;
}
std::string password_authenticator::default_superuser(const db::config& cfg) {
return std::string(get_config_value(cfg.auth_superuser_name(), DEFAULT_USER_NAME));
}
password_authenticator::~password_authenticator() {
}
@@ -69,7 +49,6 @@ password_authenticator::password_authenticator(cql3::query_processor& qp, ::serv
, _migration_manager(mm)
, _cache(cache)
, _stopped(make_ready_future<>())
, _superuser(default_superuser(qp.db().get_config()))
{}
static bool has_salted_hash(const cql3::untyped_result_set_row& row) {
@@ -78,76 +57,18 @@ static bool has_salted_hash(const cql3::untyped_result_set_row& row) {
sstring password_authenticator::update_row_query() const {
return seastar::format("UPDATE {}.{} SET {} = ? WHERE {} = ?",
get_auth_ks_name(_qp),
db::system_keyspace::NAME,
meta::roles_table::name,
SALTED_HASH,
meta::roles_table::role_col_name);
}
static const sstring legacy_table_name{"credentials"};
bool password_authenticator::legacy_metadata_exists() const {
return _qp.db().has_schema(meta::legacy::AUTH_KS, legacy_table_name);
}
future<> password_authenticator::migrate_legacy_metadata() const {
plogger.info("Starting migration of legacy authentication metadata.");
static const sstring query = seastar::format("SELECT * FROM {}.{}", meta::legacy::AUTH_KS, legacy_table_name);
return _qp.execute_internal(
query,
db::consistency_level::QUORUM,
internal_distributed_query_state(),
cql3::query_processor::cache_internal::no).then([this](::shared_ptr<cql3::untyped_result_set> results) {
return do_for_each(*results, [this](const cql3::untyped_result_set_row& row) {
auto username = row.get_as<sstring>("username");
auto salted_hash = row.get_as<sstring>(SALTED_HASH);
static const auto query = seastar::format("UPDATE {}.{} SET {} = ? WHERE {} = ?",
meta::legacy::AUTH_KS,
meta::roles_table::name,
SALTED_HASH,
meta::roles_table::role_col_name);
return _qp.execute_internal(
query,
consistency_for_user(username),
internal_distributed_query_state(),
{std::move(salted_hash), username},
cql3::query_processor::cache_internal::no).discard_result();
}).finally([results] {});
}).then([] {
plogger.info("Finished migrating legacy authentication metadata.");
}).handle_exception([](std::exception_ptr ep) {
plogger.error("Encountered an error during migration!");
std::rethrow_exception(ep);
});
}
future<> password_authenticator::legacy_create_default_if_missing() {
const auto exists = co_await legacy::default_role_row_satisfies(_qp, &has_salted_hash, _superuser);
if (exists) {
co_return;
}
std::string salted_pwd(get_config_value(_qp.db().get_config().auth_superuser_salted_password(), ""));
if (salted_pwd.empty()) {
salted_pwd = passwords::hash(DEFAULT_USER_PASSWORD, rng_for_salt, _scheme);
}
const auto query = seastar::format("UPDATE {}.{} SET {} = ? WHERE {} = ?",
meta::legacy::AUTH_KS,
meta::roles_table::name,
SALTED_HASH,
meta::roles_table::role_col_name);
co_await _qp.execute_internal(
query,
db::consistency_level::QUORUM,
internal_distributed_query_state(),
{salted_pwd, _superuser},
cql3::query_processor::cache_internal::no);
plogger.info("Created default superuser authentication record.");
}
future<> password_authenticator::maybe_create_default_password() {
auto needs_password = [this] () -> future<bool> {
if (default_superuser(_qp).empty()) {
co_return false;
}
const sstring query = seastar::format("SELECT * FROM {}.{} WHERE is_superuser = true ALLOW FILTERING", db::system_keyspace::NAME, meta::roles_table::name);
auto results = co_await _qp.execute_internal(query,
db::consistency_level::LOCAL_ONE,
internal_distributed_query_state(), cql3::query_processor::cache_internal::yes);
@@ -157,7 +78,7 @@ future<> password_authenticator::maybe_create_default_password() {
bool has_default = false;
bool has_superuser_with_password = false;
for (auto& result : *results) {
if (result.get_as<sstring>(meta::roles_table::role_col_name) == default_superuser(_qp)) {
has_default = true;
}
if (has_salted_hash(result)) {
@@ -178,12 +99,12 @@ future<> password_authenticator::maybe_create_default_password() {
co_return;
}
// Set default superuser's password.
std::string salted_pwd(_qp.db().get_config().auth_superuser_salted_password());
if (salted_pwd.empty()) {
co_return;
}
const auto update_query = update_row_query();
co_await collect_mutations(_qp, batch, update_query, {salted_pwd, default_superuser(_qp)});
co_await std::move(batch).commit(_group0_client, _as, get_raft_timeout());
plogger.info("Created default superuser authentication record.");
}
@@ -216,58 +137,14 @@ future<> password_authenticator::start() {
_stopped = do_after_system_ready(_as, [this] {
return async([this] {
if (legacy_mode(_qp)) {
if (!_superuser_created_promise.available()) {
// Counterintuitively, we mark promise as ready before any startup work
// because wait_for_schema_agreement() below will block indefinitely
// without cluster majority. In that case, blocking node startup
// would lead to a cluster deadlock.
_superuser_created_promise.set_value();
}
_migration_manager.wait_for_schema_agreement(_qp.db().real_database(), db::timeout_clock::time_point::max(), &_as).get();
if (legacy::any_nondefault_role_row_satisfies(_qp, &has_salted_hash, _superuser).get()) {
if (legacy_metadata_exists()) {
plogger.warn("Ignoring legacy authentication metadata since nondefault data already exist.");
}
return;
}
if (legacy_metadata_exists()) {
migrate_legacy_metadata().get();
return;
}
legacy_create_default_if_missing().get();
}
utils::get_local_injector().inject("password_authenticator_start_pause", utils::wait_for_message(5min)).get();
maybe_create_default_password_with_retries().get();
if (!_superuser_created_promise.available()) {
_superuser_created_promise.set_value();
}
});
});
if (legacy_mode(_qp)) {
static const sstring create_roles_query = fmt::format(
"CREATE TABLE {}.{} ("
" {} text PRIMARY KEY,"
" can_login boolean,"
" is_superuser boolean,"
" member_of set<text>,"
" salted_hash text"
")",
meta::legacy::AUTH_KS,
meta::roles_table::name,
meta::roles_table::role_col_name);
return create_legacy_metadata_table_if_missing(
meta::roles_table::name,
_qp,
create_roles_query,
_migration_manager);
}
return make_ready_future<>();
});
}
@@ -277,15 +154,6 @@ future<> password_authenticator::stop() {
return _stopped.handle_exception_type([] (const sleep_aborted&) { }).handle_exception_type([](const abort_requested_exception&) {});
}
db::consistency_level password_authenticator::consistency_for_user(std::string_view role_name) {
// TODO: this is plain dung. Why treat hardcoded default special, but for example a user-created
// super user uses plain LOCAL_ONE?
if (role_name == DEFAULT_USER_NAME) {
return db::consistency_level::QUORUM;
}
return db::consistency_level::LOCAL_ONE;
}
std::string_view password_authenticator::qualified_java_name() const {
return password_authenticator_name;
}
@@ -315,20 +183,12 @@ future<authenticated_user> password_authenticator::authenticate(
const sstring password = credentials.at(PASSWORD_KEY);
try {
auto role = _cache.get(username);
if (!role || role->salted_hash.empty()) {
throw exceptions::authentication_exception("Username and/or password are incorrect");
}
const auto& salted_hash = role->salted_hash;
const bool password_match = co_await passwords::check(password, salted_hash);
if (!password_match) {
throw exceptions::authentication_exception("Username and/or password are incorrect");
}
@@ -367,16 +227,7 @@ future<> password_authenticator::create(std::string_view role_name, const authen
}
const auto query = update_row_query();
co_await collect_mutations(_qp, mc, query, {std::move(*maybe_hash), sstring(role_name)});
}
future<> password_authenticator::alter(std::string_view role_name, const authentication_options& options, ::service::group0_batch& mc) {
@@ -387,38 +238,21 @@ future<> password_authenticator::alter(std::string_view role_name, const authent
const auto password = std::get<password_option>(*options.credentials).password;
const sstring query = seastar::format("UPDATE {}.{} SET {} = ? WHERE {} = ?",
db::system_keyspace::NAME,
meta::roles_table::name,
SALTED_HASH,
meta::roles_table::role_col_name);
co_await collect_mutations(_qp, mc, query,
{passwords::hash(password, rng_for_salt, _scheme), sstring(role_name)});
}
future<> password_authenticator::drop(std::string_view name, ::service::group0_batch& mc) {
const sstring query = seastar::format("DELETE {} FROM {}.{} WHERE {} = ?",
SALTED_HASH,
db::system_keyspace::NAME,
meta::roles_table::name,
meta::roles_table::role_col_name);
co_await collect_mutations(_qp, mc, query, {sstring(name)});
}
future<custom_options> password_authenticator::query_custom_options(std::string_view role_name) const {
@@ -437,13 +271,13 @@ future<std::optional<sstring>> password_authenticator::get_password_hash(std::st
// that a map lookup string->statement is not gonna kill us much.
const sstring query = seastar::format("SELECT {} FROM {}.{} WHERE {} = ?",
SALTED_HASH,
db::system_keyspace::NAME,
meta::roles_table::name,
meta::roles_table::role_col_name);
const auto res = co_await _qp.execute_internal(
query,
db::consistency_level::LOCAL_ONE,
internal_distributed_query_state(),
{role_name},
cql3::query_processor::cache_internal::yes);


@@ -13,7 +13,6 @@
#include <seastar/core/abort_source.hh>
#include <seastar/core/shared_future.hh>
#include "db/consistency_level_type.hh"
#include "auth/authenticator.hh"
#include "auth/passwords.hh"
#include "auth/cache.hh"
@@ -44,15 +43,11 @@ class password_authenticator : public authenticator {
cache& _cache;
future<> _stopped;
abort_source _as;
std::string _superuser; // default superuser name from the config (may or may not be present in roles table)
shared_promise<> _superuser_created_promise;
// We used to also support bcrypt, SHA-256, and MD5 (ref. scylladb#24524).
constexpr static auth::passwords::scheme _scheme = passwords::scheme::sha_512;
public:
static db::consistency_level consistency_for_user(std::string_view role_name);
static std::string default_superuser(const db::config&);
password_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&, cache&);
~password_authenticator();
@@ -90,12 +85,6 @@ public:
virtual future<> ensure_superuser_is_created() const override;
private:
bool legacy_metadata_exists() const;
future<> migrate_legacy_metadata() const;
future<> legacy_create_default_if_missing();
future<> maybe_create_default_password();
future<> maybe_create_default_password_with_retries();


@@ -112,6 +112,11 @@ public:
virtual future<> stop() = 0;
///
/// Notify that the maintenance mode is starting.
///
virtual void set_maintenance_mode() {}
///
/// Ensure that superuser role exists.
///
@@ -119,6 +124,11 @@ public:
///
virtual future<> ensure_superuser_is_created() = 0;
///
/// Ensure role management operations are enabled. Some role managers may defer initialization.
///
virtual future<> ensure_role_operations_are_enabled() { return make_ready_future<>(); }
///
/// \returns an exceptional future with \ref role_already_exists for a role that has previously been created.
///


@@ -1,68 +0,0 @@
/*
* Copyright (C) 2018-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include "auth/roles-metadata.hh"
#include <seastar/core/format.hh>
#include <seastar/core/shared_ptr.hh>
#include <seastar/core/sstring.hh>
#include "auth/common.hh"
#include "cql3/query_processor.hh"
#include "cql3/untyped_result_set.hh"
namespace auth {
namespace legacy {
future<bool> default_role_row_satisfies(
cql3::query_processor& qp,
std::function<bool(const cql3::untyped_result_set_row&)> p,
std::optional<std::string> rolename) {
const sstring query = seastar::format("SELECT * FROM {}.{} WHERE {} = ?",
auth::meta::legacy::AUTH_KS,
meta::roles_table::name,
meta::roles_table::role_col_name);
for (auto cl : { db::consistency_level::ONE, db::consistency_level::QUORUM }) {
auto results = co_await qp.execute_internal(query, cl
, internal_distributed_query_state()
, {rolename.value_or(std::string(auth::meta::DEFAULT_SUPERUSER_NAME))}
, cql3::query_processor::cache_internal::yes
);
if (!results->empty()) {
co_return p(results->one());
}
}
co_return false;
}
future<bool> any_nondefault_role_row_satisfies(
cql3::query_processor& qp,
std::function<bool(const cql3::untyped_result_set_row&)> p,
std::optional<std::string> rolename) {
const sstring query = seastar::format("SELECT * FROM {}.{}", auth::meta::legacy::AUTH_KS, meta::roles_table::name);
auto results = co_await qp.execute_internal(query, db::consistency_level::QUORUM
, internal_distributed_query_state(), cql3::query_processor::cache_internal::no
);
if (results->empty()) {
co_return false;
}
static const sstring col_name = sstring(meta::roles_table::role_col_name);
co_return std::ranges::any_of(*results, [&](const cql3::untyped_result_set_row& row) {
auto superuser = rolename ? std::string_view(*rolename) : meta::DEFAULT_SUPERUSER_NAME;
const bool is_nondefault = row.get_as<sstring>(col_name) != superuser;
return is_nondefault && p(row);
});
}
} // namespace legacy
} // namespace auth


@@ -8,18 +8,7 @@
#pragma once
#include <optional>
#include <string_view>
#include <functional>
#include <seastar/core/future.hh>
#include "seastarx.hh"
namespace cql3 {
class query_processor;
class untyped_result_set_row;
}
namespace auth {
@@ -35,26 +24,4 @@ constexpr std::string_view role_col_name{"role", 4};
} // namespace meta
namespace legacy {
///
/// Check that the default role satisfies a predicate, or `false` if the default role does not exist.
///
future<bool> default_role_row_satisfies(
cql3::query_processor&,
std::function<bool(const cql3::untyped_result_set_row&)>,
std::optional<std::string> rolename = {}
);
///
/// Check that any nondefault role satisfies a predicate. `false` if no nondefault roles exist.
///
future<bool> any_nondefault_role_row_satisfies(
cql3::query_processor&,
std::function<bool(const cql3::untyped_result_set_row&)>,
std::optional<std::string> rolename = {}
);
} // namespace legacy
} // namespace auth


@@ -22,21 +22,11 @@
#include "db/config.hh"
#include "utils/log.hh"
#include "seastarx.hh"
#include "utils/class_registrator.hh"
namespace auth {
static logging::logger mylog("saslauthd_authenticator");
// To ensure correct initialization order, we unfortunately need to use a string literal.
static const class_registrator<
authenticator,
saslauthd_authenticator,
cql3::query_processor&,
::service::raft_group0_client&,
::service::migration_manager&,
cache&> saslauthd_auth_reg("com.scylladb.auth.SaslauthdAuthenticator");
saslauthd_authenticator::saslauthd_authenticator(cql3::query_processor& qp, ::service::raft_group0_client&, ::service::migration_manager&, cache&)
: _socket_path(qp.db().get_config().saslauthd_socket_path())
{}


@@ -16,6 +16,8 @@
#include <algorithm>
#include <chrono>
#include <boost/algorithm/string.hpp>
#include <seastar/core/future-util.hh>
#include <seastar/core/shard_id.hh>
#include <seastar/core/sharded.hh>
@@ -23,8 +25,17 @@
#include "auth/allow_all_authenticator.hh"
#include "auth/allow_all_authorizer.hh"
#include "auth/certificate_authenticator.hh"
#include "auth/common.hh"
#include "auth/default_authorizer.hh"
#include "auth/ldap_role_manager.hh"
#include "auth/maintenance_socket_authenticator.hh"
#include "auth/maintenance_socket_role_manager.hh"
#include "auth/password_authenticator.hh"
#include "auth/role_or_anonymous.hh"
#include "auth/saslauthd_authenticator.hh"
#include "auth/standard_role_manager.hh"
#include "auth/transitional.hh"
#include "cql3/functions/functions.hh"
#include "cql3/query_processor.hh"
#include "cql3/description.hh"
@@ -43,7 +54,6 @@
#include "service/raft/raft_group0_client.hh"
#include "mutation/timestamp.hh"
#include "utils/assert.hh"
#include "utils/class_registrator.hh"
#include "locator/abstract_replication_strategy.hh"
#include "data_dictionary/keyspace_metadata.hh"
#include "service/storage_service.hh"
@@ -63,88 +73,6 @@ static const sstring superuser_col_name("super");
static logging::logger log("auth_service");
class auth_migration_listener final : public ::service::migration_listener {
service& _service;
cql3::query_processor& _qp;
public:
explicit auth_migration_listener(service& s, cql3::query_processor& qp) : _service(s), _qp(qp) {
}
private:
void on_create_keyspace(const sstring& ks_name) override {}
void on_create_column_family(const sstring& ks_name, const sstring& cf_name) override {}
void on_create_user_type(const sstring& ks_name, const sstring& type_name) override {}
void on_create_function(const sstring& ks_name, const sstring& function_name) override {}
void on_create_aggregate(const sstring& ks_name, const sstring& aggregate_name) override {}
void on_create_view(const sstring& ks_name, const sstring& view_name) override {}
void on_update_keyspace(const sstring& ks_name) override {}
void on_update_column_family(const sstring& ks_name, const sstring& cf_name, bool) override {}
void on_update_user_type(const sstring& ks_name, const sstring& type_name) override {}
void on_update_function(const sstring& ks_name, const sstring& function_name) override {}
void on_update_aggregate(const sstring& ks_name, const sstring& aggregate_name) override {}
void on_update_view(const sstring& ks_name, const sstring& view_name, bool columns_changed) override {}
void on_drop_keyspace(const sstring& ks_name) override {
if (!legacy_mode(_qp)) {
// in non legacy path revoke is part of schema change statement execution
return;
}
// Do it in the background.
(void)do_with(auth::make_data_resource(ks_name), ::service::group0_batch::unused(), [this] (auto& r, auto& mc) mutable {
return _service.revoke_all(r, mc);
}).handle_exception([] (std::exception_ptr e) {
log.error("Unexpected exception while revoking all permissions on dropped keyspace: {}", e);
});
(void)do_with(auth::make_functions_resource(ks_name), ::service::group0_batch::unused(), [this] (auto& r, auto& mc) mutable {
return _service.revoke_all(r, mc);
}).handle_exception([] (std::exception_ptr e) {
log.error("Unexpected exception while revoking all permissions on functions in dropped keyspace: {}", e);
});
}
void on_drop_column_family(const sstring& ks_name, const sstring& cf_name) override {
if (!legacy_mode(_qp)) {
// in non legacy path revoke is part of schema change statement execution
return;
}
// Do it in the background.
(void)do_with(auth::make_data_resource(ks_name, cf_name), ::service::group0_batch::unused(), [this] (auto& r, auto& mc) mutable {
return _service.revoke_all(r, mc);
}).handle_exception([] (std::exception_ptr e) {
log.error("Unexpected exception while revoking all permissions on dropped table: {}", e);
});
}
void on_drop_user_type(const sstring& ks_name, const sstring& type_name) override {}
void on_drop_function(const sstring& ks_name, const sstring& function_name) override {
if (!legacy_mode(_qp)) {
// in non legacy path revoke is part of schema change statement execution
return;
}
// Do it in the background.
(void)do_with(auth::make_functions_resource(ks_name, function_name), ::service::group0_batch::unused(), [this] (auto& r, auto& mc) mutable {
return _service.revoke_all(r, mc);
}).handle_exception([] (std::exception_ptr e) {
log.error("Unexpected exception while revoking all permissions on dropped function: {}", e);
});
}
void on_drop_aggregate(const sstring& ks_name, const sstring& aggregate_name) override {
if (!legacy_mode(_qp)) {
// in non legacy path revoke is part of schema change statement execution
return;
}
(void)do_with(auth::make_functions_resource(ks_name, aggregate_name), ::service::group0_batch::unused(), [this] (auto& r, auto& mc) mutable {
return _service.revoke_all(r, mc);
}).handle_exception([] (std::exception_ptr e) {
log.error("Unexpected exception while revoking all permissions on dropped aggregate: {}", e);
});
}
void on_drop_view(const sstring& ks_name, const sstring& view_name) override {}
};
static future<> validate_role_exists(const service& ser, std::string_view role_name) {
return ser.underlying_role_manager().exists(role_name).then([role_name](bool exists) {
if (!exists) {
@@ -157,7 +85,6 @@ service::service(
cache& cache,
cql3::query_processor& qp,
::service::raft_group0_client& g0,
std::unique_ptr<authorizer> z,
std::unique_ptr<authenticator> a,
std::unique_ptr<role_manager> r,
@@ -165,29 +92,26 @@
: _cache(cache)
, _qp(qp)
, _group0_client(g0)
, _authorizer(std::move(z))
, _authenticator(std::move(a))
, _role_manager(std::move(r))
, _used_by_maintenance_socket(used_by_maintenance_socket) {}
service::service(
cql3::query_processor& qp,
::service::raft_group0_client& g0,
authorizer_factory authorizer_factory,
authenticator_factory authenticator_factory,
role_manager_factory role_manager_factory,
maintenance_socket_enabled used_by_maintenance_socket,
cache& cache)
: service(
cache,
qp,
g0,
authorizer_factory(),
authenticator_factory(),
role_manager_factory(),
used_by_maintenance_socket) {
}
@@ -220,9 +144,6 @@ future<> service::create_legacy_keyspace_if_missing(::service::migration_manager
}
future<> service::start(::service::migration_manager& mm, db::system_keyspace& sys_ks) {
auto auth_version = co_await sys_ks.get_auth_version();
// version is set in query processor to be easily available in various places we call auth::legacy_mode check.
_qp.auth_version = auth_version;
if (this_shard_id() == 0) {
co_await _cache.load_all();
}
@@ -252,22 +173,12 @@ future<> service::start(::service::migration_manager& mm, db::system_keyspace& s
&service::get_uncached_permissions,
this, std::placeholders::_1, std::placeholders::_2));
}
co_await once_among_shards([this] {
_mnotifier.register_listener(_migration_listener.get());
return make_ready_future<>();
});
}
future<> service::stop() {
_as.request_abort();
_cache.set_permission_loader(nullptr);
return when_all_succeed(_role_manager->stop(), _authorizer->stop(), _authenticator->stop()).discard_result();
}
future<> service::ensure_superuser_is_created() {
@@ -301,12 +212,16 @@ service::get_uncached_permissions(const role_or_anonymous& maybe_role, const res
}
future<permission_set> service::get_permissions(const role_or_anonymous& maybe_role, const resource& r) const {
if (_used_by_maintenance_socket) {
return get_uncached_permissions(maybe_role, r);
}
return _cache.get_permissions(maybe_role, r);
}
void service::set_maintenance_mode() {
_role_manager->set_maintenance_mode();
}
future<bool> service::has_superuser(std::string_view role_name, const role_set& roles) const {
for (const auto& role : roles) {
if (co_await _role_manager->is_superuser(role)) {
@@ -342,6 +257,10 @@ static void validate_authentication_options_are_supported(
}
}
future<> service::ensure_role_operations_are_enabled() {
return _role_manager->ensure_role_operations_are_enabled();
}
future<> service::create_role(std::string_view name,
const role_config& config,
const authentication_options& options,
@@ -359,11 +278,6 @@ future<> service::create_role(std::string_view name,
ep = std::current_exception();
}
if (ep) {
// Rollback only in legacy mode as normally mutations won't be
// applied in case exception is raised
if (legacy_mode(_qp)) {
co_await underlying_role_manager().drop(name, mc);
}
std::rethrow_exception(std::move(ep));
}
}
@@ -442,11 +356,11 @@ future<std::vector<cql3::description>> service::describe_roles(bool with_hashed_
const bool authenticator_uses_password_hashes = _authenticator->uses_password_hashes();
const auto default_su = cql3::util::maybe_quote(default_superuser(_qp));
auto produce_create_statement = [&default_su, with_hashed_passwords] (const sstring& formatted_role_name,
const std::optional<sstring>& maybe_hashed_password, bool can_login, bool is_superuser) {
// Even after applying formatting to a role, `formatted_role_name` can only equal the default superuser name
// if the original identifier was equal to it.
const sstring role_part = formatted_role_name == default_su
? seastar::format("IF NOT EXISTS {}", formatted_role_name)
: formatted_role_name;
@@ -659,6 +573,10 @@ future<std::vector<cql3::description>> service::describe_auth(bool with_hashed_p
// Free functions.
//
void set_maintenance_mode(service& ser) {
ser.set_maintenance_mode();
}
future<bool> has_superuser(const service& ser, const authenticated_user& u) {
if (is_anonymous(u)) {
return make_ready_future<bool>(false);
@@ -667,6 +585,10 @@ future<bool> has_superuser(const service& ser, const authenticated_user& u) {
return ser.has_superuser(*u.name);
}
future<> ensure_role_operations_are_enabled(service& ser) {
return ser.underlying_role_manager().ensure_role_operations_are_enabled();
}
future<role_set> get_roles(const service& ser, const authenticated_user& u) {
if (is_anonymous(u)) {
return make_ready_future<role_set>();
@@ -849,83 +771,109 @@ future<> commit_mutations(service& ser, ::service::group0_batch&& mc) {
return ser.commit_mutations(std::move(mc));
}
future<> migrate_to_auth_v2(db::system_keyspace& sys_ks, ::service::raft_group0_client& g0, start_operation_func_t start_operation_func, abort_source& as) {
// FIXME: if this function fails it may leave partial data in the new tables
// that should be cleared
auto gen = [&sys_ks] (api::timestamp_type ts) -> ::service::mutations_generator {
auto& qp = sys_ks.query_processor();
for (const auto& cf_name : std::vector<sstring>{
"roles", "role_members", "role_attributes", "role_permissions"}) {
schema_ptr schema;
try {
schema = qp.db().find_schema(meta::legacy::AUTH_KS, cf_name);
} catch (const data_dictionary::no_such_column_family&) {
continue; // some tables might not have been created if they were not used
}
std::vector<sstring> col_names;
for (const auto& col : schema->all_columns()) {
col_names.push_back(col.name_as_cql_string());
}
sstring val_binders_str = "?";
for (size_t i = 1; i < col_names.size(); ++i) {
val_binders_str += ", ?";
}
std::vector<mutation> collected;
// use longer than usual timeout as we scan the whole table
// but not infinite or very long as we want to fail reasonably fast
const auto t = 5min;
const timeout_config tc{t, t, t, t, t, t, t};
::service::client_state cs(::service::client_state::internal_tag{}, tc);
::service::query_state qs(cs, empty_service_permit());
co_await qp.query_internal(
seastar::format("SELECT * FROM {}.{}", meta::legacy::AUTH_KS, cf_name),
db::consistency_level::ALL,
{},
1000,
[&qp, &cf_name, &col_names, &val_binders_str, &schema, ts, &collected] (const cql3::untyped_result_set::row& row) -> future<stop_iteration> {
std::vector<data_value_or_unset> values;
for (const auto& col : schema->all_columns()) {
if (row.has(col.name_as_text())) {
values.push_back(
col.type->deserialize(row.get_blob_unfragmented(col.name_as_text())));
} else {
values.push_back(unset_value{});
}
}
auto muts = co_await qp.get_mutations_internal(
seastar::format("INSERT INTO {}.{} ({}) VALUES ({})",
db::system_keyspace::NAME,
cf_name,
fmt::join(col_names, ", "),
val_binders_str),
internal_distributed_query_state(),
ts,
std::move(values));
if (muts.size() != 1) {
on_internal_error(log,
format("expecting single insert mutation, got {}", muts.size()));
}
collected.push_back(std::move(muts[0]));
co_return stop_iteration::no;
},
std::move(qs));
for (auto& m : collected) {
co_yield std::move(m);
}
}
co_yield co_await sys_ks.make_auth_version_mutation(ts,
db::system_keyspace::auth_version_t::v2);
};
co_await announce_mutations_with_batching(g0,
start_operation_func,
std::move(gen),
as,
std::nullopt);
}
namespace {
std::string_view get_short_name(std::string_view name) {
auto pos = name.find_last_of('.');
if (pos == std::string_view::npos) {
return name;
}
return name.substr(pos + 1);
}
} // anonymous namespace
authorizer_factory make_authorizer_factory(
std::string_view name,
sharded<cql3::query_processor>& qp) {
std::string_view short_name = get_short_name(name);
if (boost::iequals(short_name, "AllowAllAuthorizer")) {
return [&qp] {
return std::make_unique<allow_all_authorizer>(qp.local());
};
} else if (boost::iequals(short_name, "CassandraAuthorizer")) {
return [&qp] {
return std::make_unique<default_authorizer>(qp.local());
};
} else if (boost::iequals(short_name, "TransitionalAuthorizer")) {
return [&qp] {
return std::make_unique<transitional_authorizer>(qp.local());
};
}
throw std::invalid_argument(fmt::format("Unknown authorizer: {}", name));
}
authenticator_factory make_authenticator_factory(
std::string_view name,
sharded<cql3::query_processor>& qp,
::service::raft_group0_client& g0,
sharded<::service::migration_manager>& mm,
sharded<cache>& auth_cache) {
std::string_view short_name = get_short_name(name);
if (boost::iequals(short_name, "AllowAllAuthenticator")) {
return [&qp, &g0, &mm, &auth_cache] {
return std::make_unique<allow_all_authenticator>(qp.local(), g0, mm.local(), auth_cache.local());
};
} else if (boost::iequals(short_name, "PasswordAuthenticator")) {
return [&qp, &g0, &mm, &auth_cache] {
return std::make_unique<password_authenticator>(qp.local(), g0, mm.local(), auth_cache.local());
};
} else if (boost::iequals(short_name, "CertificateAuthenticator")) {
return [&qp, &g0, &mm, &auth_cache] {
return std::make_unique<certificate_authenticator>(qp.local(), g0, mm.local(), auth_cache.local());
};
} else if (boost::iequals(short_name, "SaslauthdAuthenticator")) {
return [&qp, &g0, &mm, &auth_cache] {
return std::make_unique<saslauthd_authenticator>(qp.local(), g0, mm.local(), auth_cache.local());
};
} else if (boost::iequals(short_name, "TransitionalAuthenticator")) {
return [&qp, &g0, &mm, &auth_cache] {
return std::make_unique<transitional_authenticator>(qp.local(), g0, mm.local(), auth_cache.local());
};
}
throw std::invalid_argument(fmt::format("Unknown authenticator: {}", name));
}
role_manager_factory make_role_manager_factory(
std::string_view name,
sharded<cql3::query_processor>& qp,
::service::raft_group0_client& g0,
sharded<::service::migration_manager>& mm,
sharded<cache>& auth_cache) {
std::string_view short_name = get_short_name(name);
if (boost::iequals(short_name, "CassandraRoleManager")) {
return [&qp, &g0, &mm, &auth_cache] {
return std::make_unique<standard_role_manager>(qp.local(), g0, mm.local(), auth_cache.local());
};
} else if (boost::iequals(short_name, "LDAPRoleManager")) {
return [&qp, &g0, &mm, &auth_cache] {
return std::make_unique<ldap_role_manager>(qp.local(), g0, mm.local(), auth_cache.local());
};
}
throw std::invalid_argument(fmt::format("Unknown role manager: {}", name));
}
authenticator_factory make_maintenance_socket_authenticator_factory(
sharded<cql3::query_processor>& qp,
::service::raft_group0_client& g0,
sharded<::service::migration_manager>& mm,
sharded<cache>& auth_cache) {
return [&qp, &g0, &mm, &auth_cache] {
return std::make_unique<maintenance_socket_authenticator>(qp.local(), g0, mm.local(), auth_cache.local());
};
}
role_manager_factory make_maintenance_socket_role_manager_factory(
sharded<cql3::query_processor>& qp,
::service::raft_group0_client& g0,
sharded<::service::migration_manager>& mm,
sharded<cache>& auth_cache) {
return [&qp, &g0, &mm, &auth_cache] {
return std::make_unique<maintenance_socket_role_manager>(qp.local(), g0, mm.local(), auth_cache.local());
};
}
}


@@ -12,6 +12,7 @@
#include <memory>
#include <optional>
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <seastar/core/sstring.hh>
#include <seastar/util/bool_class.hh>
@@ -36,19 +37,16 @@ class query_processor;
namespace service {
class migration_manager;
class migration_notifier;
class migration_listener;
}
namespace auth {
class role_or_anonymous;
struct service_config final {
sstring authorizer_java_name;
sstring authenticator_java_name;
sstring role_manager_java_name;
};
/// Factory function types for creating auth module instances on each shard.
using authorizer_factory = std::function<std::unique_ptr<authorizer>()>;
using authenticator_factory = std::function<std::unique_ptr<authenticator>()>;
using role_manager_factory = std::function<std::unique_ptr<role_manager>()>;
///
/// Due to poor (in this author's opinion) decisions of Apache Cassandra, certain choices of one role-manager,
@@ -80,17 +78,12 @@ class service final : public seastar::peering_sharded_service<service> {
::service::raft_group0_client& _group0_client;
::service::migration_notifier& _mnotifier;
authorizer::ptr_type _authorizer;
authenticator::ptr_type _authenticator;
role_manager::ptr_type _role_manager;
// Only one of these should be registered, so we end up with some unused instances. Not the end of the world.
std::unique_ptr<::service::migration_listener> _migration_listener;
maintenance_socket_enabled _used_by_maintenance_socket;
abort_source _as;
@@ -100,7 +93,6 @@ public:
cache& cache,
cql3::query_processor&,
::service::raft_group0_client&,
::service::migration_notifier&,
std::unique_ptr<authorizer>,
std::unique_ptr<authenticator>,
std::unique_ptr<role_manager>,
@@ -108,15 +100,15 @@ public:
///
/// This constructor is intended to be used when the class is sharded via \ref seastar::sharded. In that case, the
/// arguments must be copyable, which is why we delay construction with instance-construction instructions instead
/// arguments must be copyable, which is why we delay construction with instance-construction factories instead
/// of the instances themselves.
///
service(
cql3::query_processor&,
::service::raft_group0_client&,
::service::migration_notifier&,
::service::migration_manager&,
const service_config&,
authorizer_factory,
authenticator_factory,
role_manager_factory,
maintenance_socket_enabled,
cache&);
@@ -138,6 +130,11 @@ public:
///
future<permission_set> get_uncached_permissions(const role_or_anonymous&, const resource&) const;
///
/// Notify the service that the node is entering maintenance mode.
///
void set_maintenance_mode();
///
/// Query whether the named role has been granted a role that is a superuser.
///
@@ -147,6 +144,11 @@ public:
///
future<bool> has_superuser(std::string_view role_name) const;
///
/// Ensure that the role operations are enabled. Some role managers defer initialization.
///
future<> ensure_role_operations_are_enabled();
///
/// Create a role with optional authentication information.
///
@@ -192,12 +194,9 @@ public:
return *_role_manager;
}
cql3::query_processor& query_processor() const noexcept {
return _qp;
}
future<> commit_mutations(::service::group0_batch&& mc) {
return std::move(mc).commit(_group0_client, _as, ::service::raft_timeout{});
co_await std::move(mc).commit(_group0_client, _as, ::service::raft_timeout{});
co_await _group0_client.send_group0_read_barrier_to_live_members();
}
private:
@@ -208,8 +207,12 @@ private:
future<std::vector<cql3::description>> describe_permissions() const;
};
void set_maintenance_mode(service&);
future<bool> has_superuser(const service&, const authenticated_user&);
future<> ensure_role_operations_are_enabled(service&);
future<role_set> get_roles(const service&, const authenticated_user&);
future<permission_set> get_permissions(const service&, const authenticated_user&, const resource&);
@@ -393,7 +396,50 @@ future<std::vector<permission_details>> list_filtered_permissions(
// Finalizes write operations performed in auth by committing mutations via raft group0.
future<> commit_mutations(service& ser, ::service::group0_batch&& mc);
// Migrates data from old keyspace to new one which supports linearizable writes via raft.
future<> migrate_to_auth_v2(db::system_keyspace& sys_ks, ::service::raft_group0_client& g0, start_operation_func_t start_operation_func, abort_source& as);
///
/// Factory helper functions for creating auth module instances.
/// These are intended for use with sharded<service>::start() where copyable arguments are required.
/// The returned factories capture the sharded references and call .local() when invoked on each shard.
///
/// Creates an authorizer factory for config-selectable authorizer types.
/// @param name The authorizer class name (e.g., "CassandraAuthorizer", "AllowAllAuthorizer")
authorizer_factory make_authorizer_factory(
std::string_view name,
sharded<cql3::query_processor>& qp);
/// Creates an authenticator factory for config-selectable authenticator types.
/// @param name The authenticator class name (e.g., "PasswordAuthenticator", "AllowAllAuthenticator")
authenticator_factory make_authenticator_factory(
std::string_view name,
sharded<cql3::query_processor>& qp,
::service::raft_group0_client& g0,
sharded<::service::migration_manager>& mm,
sharded<cache>& cache);
/// Creates a role_manager factory for config-selectable role manager types.
/// @param name The role manager class name (e.g., "CassandraRoleManager")
role_manager_factory make_role_manager_factory(
std::string_view name,
sharded<cql3::query_processor>& qp,
::service::raft_group0_client& g0,
sharded<::service::migration_manager>& mm,
sharded<cache>& cache);
/// Creates a factory for the maintenance socket authenticator.
/// This authenticator is not config-selectable and is only used for the maintenance socket.
authenticator_factory make_maintenance_socket_authenticator_factory(
sharded<cql3::query_processor>& qp,
::service::raft_group0_client& g0,
sharded<::service::migration_manager>& mm,
sharded<cache>& cache);
/// Creates a factory for the maintenance socket role manager.
/// This role manager is not config-selectable and is only used for the maintenance socket.
role_manager_factory make_maintenance_socket_role_manager_factory(
sharded<cql3::query_processor>& qp,
::service::raft_group0_client& g0,
sharded<::service::migration_manager>& mm,
sharded<cache>& cache);
}


@@ -28,15 +28,14 @@
#include "cql3/untyped_result_set.hh"
#include "cql3/util.hh"
#include "db/consistency_level_type.hh"
#include "db/system_keyspace.hh"
#include "exceptions/exceptions.hh"
#include "utils/error_injection.hh"
#include "utils/log.hh"
#include <seastar/core/loop.hh>
#include <seastar/coroutine/maybe_yield.hh>
#include "service/raft/raft_group0_client.hh"
#include "utils/class_registrator.hh"
#include "service/migration_manager.hh"
#include "password_authenticator.hh"
#include "utils/managed_string.hh"
namespace auth {
@@ -44,52 +43,7 @@ namespace auth {
static logging::logger log("standard_role_manager");
static const class_registrator<
role_manager,
standard_role_manager,
cql3::query_processor&,
::service::raft_group0_client&,
::service::migration_manager&,
cache&> registration("org.apache.cassandra.auth.CassandraRoleManager");
static db::consistency_level consistency_for_role(std::string_view role_name) noexcept {
if (role_name == meta::DEFAULT_SUPERUSER_NAME) {
return db::consistency_level::QUORUM;
}
return db::consistency_level::LOCAL_ONE;
}
future<std::optional<standard_role_manager::record>> standard_role_manager::legacy_find_record(std::string_view role_name) {
const sstring query = seastar::format("SELECT * FROM {}.{} WHERE {} = ?",
get_auth_ks_name(_qp),
meta::roles_table::name,
meta::roles_table::role_col_name);
const auto results = co_await _qp.execute_internal(
query,
consistency_for_role(role_name),
internal_distributed_query_state(),
{sstring(role_name)},
cql3::query_processor::cache_internal::yes);
if (results->empty()) {
co_return std::optional<record>();
}
const cql3::untyped_result_set_row& row = results->one();
co_return std::make_optional(record{
row.get_as<sstring>(sstring(meta::roles_table::role_col_name)),
row.get_or<bool>("is_superuser", false),
row.get_or<bool>("can_login", false),
(row.has("member_of")
? row.get_set<sstring>("member_of")
: role_set())});
}
future<std::optional<standard_role_manager::record>> standard_role_manager::find_record(std::string_view role_name) {
if (legacy_mode(_qp)) {
return legacy_find_record(role_name);
}
auto name = sstring(role_name);
auto role = _cache.get(name);
if (!role) {
@@ -123,7 +77,6 @@ standard_role_manager::standard_role_manager(cql3::query_processor& qp, ::servic
, _migration_manager(mm)
, _cache(cache)
, _stopped(make_ready_future<>())
, _superuser(password_authenticator::default_superuser(qp.db().get_config()))
{}
std::string_view standard_role_manager::qualified_java_name() const noexcept {
@@ -138,79 +91,12 @@ const resource_set& standard_role_manager::protected_resources() const {
return resources;
}
future<> standard_role_manager::create_legacy_metadata_tables_if_missing() const {
static const sstring create_roles_query = fmt::format(
"CREATE TABLE {}.{} ("
" {} text PRIMARY KEY,"
" can_login boolean,"
" is_superuser boolean,"
" member_of set<text>,"
" salted_hash text"
")",
meta::legacy::AUTH_KS,
meta::roles_table::name,
meta::roles_table::role_col_name);
static const sstring create_role_members_query = fmt::format(
"CREATE TABLE {}.{} ("
" role text,"
" member text,"
" PRIMARY KEY (role, member)"
")",
meta::legacy::AUTH_KS,
ROLE_MEMBERS_CF);
static const sstring create_role_attributes_query = seastar::format(
"CREATE TABLE {}.{} ("
" role text,"
" name text,"
" value text,"
" PRIMARY KEY(role, name)"
")",
meta::legacy::AUTH_KS,
ROLE_ATTRIBUTES_CF);
return when_all_succeed(
create_legacy_metadata_table_if_missing(
meta::roles_table::name,
_qp,
create_roles_query,
_migration_manager),
create_legacy_metadata_table_if_missing(
ROLE_MEMBERS_CF,
_qp,
create_role_members_query,
_migration_manager),
create_legacy_metadata_table_if_missing(
ROLE_ATTRIBUTES_CF,
_qp,
create_role_attributes_query,
_migration_manager)).discard_result();
}
future<> standard_role_manager::legacy_create_default_role_if_missing() {
try {
const auto exists = co_await legacy::default_role_row_satisfies(_qp, &has_can_login, _superuser);
if (exists) {
co_return;
}
const sstring query = seastar::format("INSERT INTO {}.{} ({}, is_superuser, can_login) VALUES (?, true, true)",
meta::legacy::AUTH_KS,
meta::roles_table::name,
meta::roles_table::role_col_name);
co_await _qp.execute_internal(
query,
db::consistency_level::QUORUM,
internal_distributed_query_state(),
{_superuser},
cql3::query_processor::cache_internal::no).discard_result();
log.info("Created default superuser role '{}'.", _superuser);
} catch (const exceptions::unavailable_exception& e) {
log.warn("Skipped default role setup: some nodes were not ready; will retry");
throw e;
}
}
future<> standard_role_manager::maybe_create_default_role() {
if (default_superuser(_qp).empty()) {
co_return;
}
auto has_superuser = [this] () -> future<bool> {
const sstring query = seastar::format("SELECT * FROM {}.{} WHERE is_superuser = true ALLOW FILTERING", get_auth_ks_name(_qp), meta::roles_table::name);
const sstring query = seastar::format("SELECT * FROM {}.{} WHERE is_superuser = true ALLOW FILTERING", db::system_keyspace::NAME, meta::roles_table::name);
auto results = co_await _qp.execute_internal(query, db::consistency_level::LOCAL_ONE,
internal_distributed_query_state(), cql3::query_processor::cache_internal::yes);
for (const auto& result : *results) {
@@ -234,12 +120,12 @@ future<> standard_role_manager::maybe_create_default_role() {
// There is no superuser which has can_login field - create default role.
// Note that we don't check if can_login is set to true.
const sstring insert_query = seastar::format("INSERT INTO {}.{} ({}, is_superuser, can_login) VALUES (?, true, true)",
get_auth_ks_name(_qp),
db::system_keyspace::NAME,
meta::roles_table::name,
meta::roles_table::role_col_name);
co_await collect_mutations(_qp, batch, insert_query, {_superuser});
co_await collect_mutations(_qp, batch, insert_query, {default_superuser(_qp)});
co_await std::move(batch).commit(_group0_client, _as, get_raft_timeout());
log.info("Created default superuser role '{}'.", _superuser);
log.info("Created default superuser role '{}'.", default_superuser(_qp));
}
future<> standard_role_manager::maybe_create_default_role_with_retries() {
@@ -262,78 +148,12 @@ future<> standard_role_manager::maybe_create_default_role_with_retries() {
}
}
static const sstring legacy_table_name{"users"};
bool standard_role_manager::legacy_metadata_exists() {
return _qp.db().has_schema(meta::legacy::AUTH_KS, legacy_table_name);
}
future<> standard_role_manager::migrate_legacy_metadata() {
log.info("Starting migration of legacy user metadata.");
static const sstring query = seastar::format("SELECT * FROM {}.{}", meta::legacy::AUTH_KS, legacy_table_name);
return _qp.execute_internal(
query,
db::consistency_level::QUORUM,
internal_distributed_query_state(),
cql3::query_processor::cache_internal::no).then([this](::shared_ptr<cql3::untyped_result_set> results) {
return do_for_each(*results, [this](const cql3::untyped_result_set_row& row) {
role_config config;
config.is_superuser = row.get_or<bool>("super", false);
config.can_login = true;
return do_with(
row.get_as<sstring>("name"),
std::move(config),
::service::group0_batch::unused(),
[this](const auto& name, const auto& config, auto& mc) {
return create_or_replace(meta::legacy::AUTH_KS, name, config, mc);
});
}).finally([results] {});
}).then([] {
log.info("Finished migrating legacy user metadata.");
}).handle_exception([](std::exception_ptr ep) {
log.error("Encountered an error during migration!");
std::rethrow_exception(ep);
});
}
future<> standard_role_manager::start() {
return once_among_shards([this] () -> future<> {
if (legacy_mode(_qp)) {
co_await create_legacy_metadata_tables_if_missing();
}
auto handler = [this] () -> future<> {
const bool legacy = legacy_mode(_qp);
if (legacy) {
if (!_superuser_created_promise.available()) {
// Counterintuitively, we mark promise as ready before any startup work
// because wait_for_schema_agreement() below will block indefinitely
// without cluster majority. In that case, blocking node startup
// would lead to a cluster deadlock.
_superuser_created_promise.set_value();
}
co_await _migration_manager.wait_for_schema_agreement(_qp.db().real_database(), db::timeout_clock::time_point::max(), &_as);
if (co_await legacy::any_nondefault_role_row_satisfies(_qp, &has_can_login)) {
if (legacy_metadata_exists()) {
log.warn("Ignoring legacy user metadata since nondefault roles already exist.");
}
co_return;
}
if (legacy_metadata_exists()) {
co_await migrate_legacy_metadata();
co_return;
}
co_await legacy_create_default_role_if_missing();
}
if (!legacy) {
co_await maybe_create_default_role_with_retries();
if (!_superuser_created_promise.available()) {
_superuser_created_promise.set_value();
}
co_await maybe_create_default_role_with_retries();
if (!_superuser_created_promise.available()) {
_superuser_created_promise.set_value();
}
};
@@ -352,21 +172,12 @@ future<> standard_role_manager::ensure_superuser_is_created() {
return _superuser_created_promise.get_shared_future();
}
future<> standard_role_manager::create_or_replace(std::string_view auth_ks_name, std::string_view role_name, const role_config& c, ::service::group0_batch& mc) {
future<> standard_role_manager::create_or_replace(std::string_view role_name, const role_config& c, ::service::group0_batch& mc) {
const sstring query = seastar::format("INSERT INTO {}.{} ({}, is_superuser, can_login) VALUES (?, ?, ?)",
auth_ks_name,
db::system_keyspace::NAME,
meta::roles_table::name,
meta::roles_table::role_col_name);
if (auth_ks_name == meta::legacy::AUTH_KS) {
co_await _qp.execute_internal(
query,
consistency_for_role(role_name),
internal_distributed_query_state(),
{sstring(role_name), c.is_superuser, c.can_login},
cql3::query_processor::cache_internal::yes).discard_result();
} else {
co_await collect_mutations(_qp, mc, query, {sstring(role_name), c.is_superuser, c.can_login});
}
co_await collect_mutations(_qp, mc, query, {sstring(role_name), c.is_superuser, c.can_login});
}
future<>
@@ -376,7 +187,7 @@ standard_role_manager::create(std::string_view role_name, const role_config& c,
throw role_already_exists(role_name);
}
return create_or_replace(get_auth_ks_name(_qp), role_name, c, mc);
return create_or_replace(role_name, c, mc);
});
}
@@ -401,20 +212,11 @@ standard_role_manager::alter(std::string_view role_name, const role_config_updat
return make_ready_future<>();
}
const sstring query = seastar::format("UPDATE {}.{} SET {} WHERE {} = ?",
get_auth_ks_name(_qp),
db::system_keyspace::NAME,
meta::roles_table::name,
build_column_assignments(u),
meta::roles_table::role_col_name);
if (legacy_mode(_qp)) {
return _qp.execute_internal(
std::move(query),
consistency_for_role(role_name),
internal_distributed_query_state(),
{sstring(role_name)},
cql3::query_processor::cache_internal::no).discard_result();
} else {
return collect_mutations(_qp, mc, std::move(query), {sstring(role_name)});
}
return collect_mutations(_qp, mc, std::move(query), {sstring(role_name)});
});
}
@@ -425,11 +227,11 @@ future<> standard_role_manager::drop(std::string_view role_name, ::service::grou
// First, revoke this role from all roles that are members of it.
const auto revoke_from_members = [this, role_name, &mc] () -> future<> {
const sstring query = seastar::format("SELECT member FROM {}.{} WHERE role = ?",
get_auth_ks_name(_qp),
db::system_keyspace::NAME,
ROLE_MEMBERS_CF);
const auto members = co_await _qp.execute_internal(
query,
consistency_for_role(role_name),
db::consistency_level::LOCAL_ONE,
internal_distributed_query_state(),
{sstring(role_name)},
cql3::query_processor::cache_internal::no);
@@ -457,102 +259,33 @@ future<> standard_role_manager::drop(std::string_view role_name, ::service::grou
// Delete all attributes for that role
const auto remove_attributes_of = [this, role_name, &mc] () -> future<> {
const sstring query = seastar::format("DELETE FROM {}.{} WHERE role = ?",
get_auth_ks_name(_qp),
db::system_keyspace::NAME,
ROLE_ATTRIBUTES_CF);
if (legacy_mode(_qp)) {
co_await _qp.execute_internal(query, {sstring(role_name)},
cql3::query_processor::cache_internal::yes).discard_result();
} else {
co_await collect_mutations(_qp, mc, query, {sstring(role_name)});
}
co_await collect_mutations(_qp, mc, query, {sstring(role_name)});
};
// Finally, delete the role itself.
const auto delete_role = [this, role_name, &mc] () -> future<> {
const sstring query = seastar::format("DELETE FROM {}.{} WHERE {} = ?",
get_auth_ks_name(_qp),
db::system_keyspace::NAME,
meta::roles_table::name,
meta::roles_table::role_col_name);
if (legacy_mode(_qp)) {
co_await _qp.execute_internal(
query,
consistency_for_role(role_name),
internal_distributed_query_state(),
{sstring(role_name)},
cql3::query_processor::cache_internal::no).discard_result();
} else {
co_await collect_mutations(_qp, mc, query, {sstring(role_name)});
}
co_await collect_mutations(_qp, mc, query, {sstring(role_name)});
};
co_await when_all_succeed(revoke_from_members, revoke_members_of, remove_attributes_of);
co_await delete_role();
}
future<>
standard_role_manager::legacy_modify_membership(
std::string_view grantee_name,
std::string_view role_name,
membership_change ch) {
const auto modify_roles = [this, role_name, grantee_name, ch] () -> future<> {
const auto query = seastar::format(
"UPDATE {}.{} SET member_of = member_of {} ? WHERE {} = ?",
get_auth_ks_name(_qp),
meta::roles_table::name,
(ch == membership_change::add ? '+' : '-'),
meta::roles_table::role_col_name);
co_await _qp.execute_internal(
query,
consistency_for_role(grantee_name),
internal_distributed_query_state(),
{role_set{sstring(role_name)}, sstring(grantee_name)},
cql3::query_processor::cache_internal::no).discard_result();
};
const auto modify_role_members = [this, role_name, grantee_name, ch] () -> future<> {
switch (ch) {
case membership_change::add: {
const sstring insert_query = seastar::format("INSERT INTO {}.{} (role, member) VALUES (?, ?)",
get_auth_ks_name(_qp),
ROLE_MEMBERS_CF);
co_return co_await _qp.execute_internal(
insert_query,
consistency_for_role(role_name),
internal_distributed_query_state(),
{sstring(role_name), sstring(grantee_name)},
cql3::query_processor::cache_internal::no).discard_result();
}
case membership_change::remove: {
const sstring delete_query = seastar::format("DELETE FROM {}.{} WHERE role = ? AND member = ?",
get_auth_ks_name(_qp),
ROLE_MEMBERS_CF);
co_return co_await _qp.execute_internal(
delete_query,
consistency_for_role(role_name),
internal_distributed_query_state(),
{sstring(role_name), sstring(grantee_name)},
cql3::query_processor::cache_internal::no).discard_result();
}
}
};
co_await when_all_succeed(modify_roles, modify_role_members).discard_result();
}
future<>
standard_role_manager::modify_membership(
std::string_view grantee_name,
std::string_view role_name,
membership_change ch,
::service::group0_batch& mc) {
if (legacy_mode(_qp)) {
co_return co_await legacy_modify_membership(grantee_name, role_name, ch);
}
const auto modify_roles = seastar::format(
"UPDATE {}.{} SET member_of = member_of {} ? WHERE {} = ?",
get_auth_ks_name(_qp),
db::system_keyspace::NAME,
meta::roles_table::name,
(ch == membership_change::add ? '+' : '-'),
meta::roles_table::role_col_name);
@@ -563,12 +296,12 @@ standard_role_manager::modify_membership(
switch (ch) {
case membership_change::add:
modify_role_members = seastar::format("INSERT INTO {}.{} (role, member) VALUES (?, ?)",
get_auth_ks_name(_qp),
db::system_keyspace::NAME,
ROLE_MEMBERS_CF);
break;
case membership_change::remove:
modify_role_members = seastar::format("DELETE FROM {}.{} WHERE role = ? AND member = ?",
get_auth_ks_name(_qp),
db::system_keyspace::NAME,
ROLE_MEMBERS_CF);
break;
default:
@@ -661,7 +394,7 @@ future<role_set> standard_role_manager::query_granted(std::string_view grantee_n
future<role_to_directly_granted_map> standard_role_manager::query_all_directly_granted(::service::query_state& qs) {
const sstring query = seastar::format("SELECT * FROM {}.{}",
get_auth_ks_name(_qp),
db::system_keyspace::NAME,
ROLE_MEMBERS_CF);
const auto results = co_await _qp.execute_internal(
@@ -685,21 +418,15 @@ future<role_to_directly_granted_map> standard_role_manager::query_all_directly_g
future<role_set> standard_role_manager::query_all(::service::query_state& qs) {
const sstring query = seastar::format("SELECT {} FROM {}.{}",
meta::roles_table::role_col_name,
get_auth_ks_name(_qp),
db::system_keyspace::NAME,
meta::roles_table::name);
// To avoid many copies of a view.
static const auto role_col_name_string = sstring(meta::roles_table::role_col_name);
if (utils::get_local_injector().enter("standard_role_manager_fail_legacy_query")) {
if (legacy_mode(_qp)) {
throw std::runtime_error("standard_role_manager::query_all: failed due to error injection");
}
}
const auto results = co_await _qp.execute_internal(
query,
db::consistency_level::QUORUM,
db::consistency_level::LOCAL_ONE,
qs,
cql3::query_processor::cache_internal::yes);
@@ -734,7 +461,7 @@ future<bool> standard_role_manager::can_login(std::string_view role_name) {
future<std::optional<sstring>> standard_role_manager::get_attribute(std::string_view role_name, std::string_view attribute_name, ::service::query_state& qs) {
const sstring query = seastar::format("SELECT name, value FROM {}.{} WHERE role = ? AND name = ?",
get_auth_ks_name(_qp),
db::system_keyspace::NAME,
ROLE_ATTRIBUTES_CF);
const auto result_set = co_await _qp.execute_internal(query, db::consistency_level::ONE, qs, {sstring(role_name), sstring(attribute_name)}, cql3::query_processor::cache_internal::yes);
if (!result_set->empty()) {
@@ -765,14 +492,10 @@ future<> standard_role_manager::set_attribute(std::string_view role_name, std::s
throw auth::nonexistant_role(role_name);
}
const sstring query = seastar::format("INSERT INTO {}.{} (role, name, value) VALUES (?, ?, ?)",
get_auth_ks_name(_qp),
db::system_keyspace::NAME,
ROLE_ATTRIBUTES_CF);
if (legacy_mode(_qp)) {
co_await _qp.execute_internal(query, {sstring(role_name), sstring(attribute_name), sstring(attribute_value)}, cql3::query_processor::cache_internal::yes).discard_result();
} else {
co_await collect_mutations(_qp, mc, query,
{sstring(role_name), sstring(attribute_name), sstring(attribute_value)});
}
co_await collect_mutations(_qp, mc, query,
{sstring(role_name), sstring(attribute_name), sstring(attribute_value)});
}
future<> standard_role_manager::remove_attribute(std::string_view role_name, std::string_view attribute_name, ::service::group0_batch& mc) {
@@ -780,14 +503,10 @@ future<> standard_role_manager::remove_attribute(std::string_view role_name, std
throw auth::nonexistant_role(role_name);
}
const sstring query = seastar::format("DELETE FROM {}.{} WHERE role = ? AND name = ?",
get_auth_ks_name(_qp),
db::system_keyspace::NAME,
ROLE_ATTRIBUTES_CF);
if (legacy_mode(_qp)) {
co_await _qp.execute_internal(query, {sstring(role_name), sstring(attribute_name)}, cql3::query_processor::cache_internal::yes).discard_result();
} else {
co_await collect_mutations(_qp, mc, query,
{sstring(role_name), sstring(attribute_name)});
}
co_await collect_mutations(_qp, mc, query,
{sstring(role_name), sstring(attribute_name)});
}
future<std::vector<cql3::description>> standard_role_manager::describe_role_grants() {


@@ -40,7 +40,6 @@ class standard_role_manager final : public role_manager {
cache& _cache;
future<> _stopped;
abort_source _as;
std::string _superuser;
shared_promise<> _superuser_created_promise;
public:
@@ -97,24 +96,13 @@ private:
role_set member_of;
};
future<> create_legacy_metadata_tables_if_missing() const;
bool legacy_metadata_exists();
future<> migrate_legacy_metadata();
future<> legacy_create_default_role_if_missing();
future<> maybe_create_default_role();
future<> maybe_create_default_role_with_retries();
future<> create_or_replace(std::string_view auth_ks_name, std::string_view role_name, const role_config&, ::service::group0_batch&);
future<> legacy_modify_membership(std::string_view role_name, std::string_view grantee_name, membership_change);
future<> create_or_replace(std::string_view role_name, const role_config&, ::service::group0_batch&);
future<> modify_membership(std::string_view role_name, std::string_view grantee_name, membership_change, ::service::group0_batch& mc);
future<std::optional<record>> legacy_find_record(std::string_view role_name);
future<std::optional<record>> find_record(std::string_view role_name);
future<record> require_record(std::string_view role_name);
future<> collect_roles(


@@ -8,244 +8,200 @@
* SPDX-License-Identifier: (LicenseRef-ScyllaDB-Source-Available-1.0 and Apache-2.0)
*/
#include "auth/transitional.hh"
#include "auth/authenticated_user.hh"
#include "auth/authenticator.hh"
#include "auth/authorizer.hh"
#include "auth/default_authorizer.hh"
#include "auth/password_authenticator.hh"
#include "auth/cache.hh"
#include "auth/permission.hh"
#include "service/raft/raft_group0_client.hh"
#include "utils/class_registrator.hh"
namespace auth {
static const sstring PACKAGE_NAME("com.scylladb.auth.");
static const sstring& transitional_authenticator_name() {
static const sstring name = PACKAGE_NAME + "TransitionalAuthenticator";
return name;
transitional_authenticator::transitional_authenticator(cql3::query_processor& qp, ::service::raft_group0_client& g0, ::service::migration_manager& mm, cache& cache)
: transitional_authenticator(std::make_unique<password_authenticator>(qp, g0, mm, cache)) {
}
static const sstring& transitional_authorizer_name() {
static const sstring name = PACKAGE_NAME + "TransitionalAuthorizer";
return name;
transitional_authenticator::transitional_authenticator(std::unique_ptr<authenticator> a)
: _authenticator(std::move(a)) {
}
class transitional_authenticator : public authenticator {
std::unique_ptr<authenticator> _authenticator;
future<> transitional_authenticator::start() {
return _authenticator->start();
}
public:
static const sstring PASSWORD_AUTHENTICATOR_NAME;
future<> transitional_authenticator::stop() {
return _authenticator->stop();
}
transitional_authenticator(cql3::query_processor& qp, ::service::raft_group0_client& g0, ::service::migration_manager& mm, cache& cache)
: transitional_authenticator(std::make_unique<password_authenticator>(qp, g0, mm, cache)) {
std::string_view transitional_authenticator::qualified_java_name() const {
return "com.scylladb.auth.TransitionalAuthenticator";
}
bool transitional_authenticator::require_authentication() const {
return true;
}
authentication_option_set transitional_authenticator::supported_options() const {
return _authenticator->supported_options();
}
authentication_option_set transitional_authenticator::alterable_options() const {
return _authenticator->alterable_options();
}
future<authenticated_user> transitional_authenticator::authenticate(const credentials_map& credentials) const {
auto i = credentials.find(authenticator::USERNAME_KEY);
if ((i == credentials.end() || i->second.empty())
&& (!credentials.contains(PASSWORD_KEY) || credentials.at(PASSWORD_KEY).empty())) {
// return anon user
return make_ready_future<authenticated_user>(anonymous_user());
}
transitional_authenticator(std::unique_ptr<authenticator> a)
: _authenticator(std::move(a)) {
}
virtual future<> start() override {
return _authenticator->start();
}
virtual future<> stop() override {
return _authenticator->stop();
}
virtual std::string_view qualified_java_name() const override {
return transitional_authenticator_name();
}
virtual bool require_authentication() const override {
return true;
}
virtual authentication_option_set supported_options() const override {
return _authenticator->supported_options();
}
virtual authentication_option_set alterable_options() const override {
return _authenticator->alterable_options();
}
virtual future<authenticated_user> authenticate(const credentials_map& credentials) const override {
auto i = credentials.find(authenticator::USERNAME_KEY);
if ((i == credentials.end() || i->second.empty())
&& (!credentials.contains(PASSWORD_KEY) || credentials.at(PASSWORD_KEY).empty())) {
return make_ready_future().then([this, &credentials] {
return _authenticator->authenticate(credentials);
}).handle_exception([](auto ep) {
try {
std::rethrow_exception(ep);
} catch (const exceptions::authentication_exception&) {
// return anon user
return make_ready_future<authenticated_user>(anonymous_user());
}
return make_ready_future().then([this, &credentials] {
return _authenticator->authenticate(credentials);
}).handle_exception([](auto ep) {
try {
std::rethrow_exception(ep);
} catch (const exceptions::authentication_exception&) {
// return anon user
return make_ready_future<authenticated_user>(anonymous_user());
}
});
}
virtual future<> create(std::string_view role_name, const authentication_options& options, ::service::group0_batch& mc) override {
return _authenticator->create(role_name, options, mc);
}
virtual future<> alter(std::string_view role_name, const authentication_options& options, ::service::group0_batch& mc) override {
return _authenticator->alter(role_name, options, mc);
}
virtual future<> drop(std::string_view role_name, ::service::group0_batch& mc) override {
return _authenticator->drop(role_name, mc);
}
virtual future<custom_options> query_custom_options(std::string_view role_name) const override {
return _authenticator->query_custom_options(role_name);
}
virtual bool uses_password_hashes() const override {
return _authenticator->uses_password_hashes();
}
virtual future<std::optional<sstring>> get_password_hash(std::string_view role_name) const override {
return _authenticator->get_password_hash(role_name);
}
virtual const resource_set& protected_resources() const override {
return _authenticator->protected_resources();
}
virtual ::shared_ptr<sasl_challenge> new_sasl_challenge() const override {
class sasl_wrapper : public sasl_challenge {
public:
sasl_wrapper(::shared_ptr<sasl_challenge> sasl)
: _sasl(std::move(sasl)) {
}
virtual bytes evaluate_response(bytes_view client_response) override {
try {
return _sasl->evaluate_response(client_response);
} catch (const exceptions::authentication_exception&) {
_complete = true;
return {};
}
}
virtual bool is_complete() const override {
return _complete || _sasl->is_complete();
}
virtual future<authenticated_user> get_authenticated_user() const override {
return futurize_invoke([this] {
return _sasl->get_authenticated_user().handle_exception([](auto ep) {
try {
std::rethrow_exception(ep);
} catch (const exceptions::authentication_exception&) {
// return anon user
return make_ready_future<authenticated_user>(anonymous_user());
}
});
});
}
const sstring& get_username() const override {
return _sasl->get_username();
}
private:
::shared_ptr<sasl_challenge> _sasl;
bool _complete = false;
};
return ::make_shared<sasl_wrapper>(_authenticator->new_sasl_challenge());
}
virtual future<> ensure_superuser_is_created() const override {
return _authenticator->ensure_superuser_is_created();
}
};
class transitional_authorizer : public authorizer {
std::unique_ptr<authorizer> _authorizer;
public:
transitional_authorizer(cql3::query_processor& qp, ::service::raft_group0_client& g0, ::service::migration_manager& mm)
: transitional_authorizer(std::make_unique<default_authorizer>(qp, g0, mm)) {
}
transitional_authorizer(std::unique_ptr<authorizer> a)
: _authorizer(std::move(a)) {
}
~transitional_authorizer() {
}
virtual future<> start() override {
return _authorizer->start();
}
virtual future<> stop() override {
return _authorizer->stop();
}
virtual std::string_view qualified_java_name() const override {
return transitional_authorizer_name();
}
virtual future<permission_set> authorize(const role_or_anonymous&, const resource&) const override {
static const permission_set transitional_permissions =
permission_set::of<
permission::CREATE,
permission::ALTER,
permission::DROP,
permission::SELECT,
permission::MODIFY>();
return make_ready_future<permission_set>(transitional_permissions);
}
virtual future<> grant(std::string_view s, permission_set ps, const resource& r, ::service::group0_batch& mc) override {
return _authorizer->grant(s, std::move(ps), r, mc);
}
virtual future<> revoke(std::string_view s, permission_set ps, const resource& r, ::service::group0_batch& mc) override {
return _authorizer->revoke(s, std::move(ps), r, mc);
}
virtual future<std::vector<permission_details>> list_all() const override {
return _authorizer->list_all();
}
virtual future<> revoke_all(std::string_view s, ::service::group0_batch& mc) override {
return _authorizer->revoke_all(s, mc);
}
virtual future<> revoke_all(const resource& r, ::service::group0_batch& mc) override {
return _authorizer->revoke_all(r, mc);
}
virtual const resource_set& protected_resources() const override {
return _authorizer->protected_resources();
}
};
});
}
//
// To ensure correct initialization order, we unfortunately need to use string literals.
//
static const class_registrator<
auth::authenticator,
auth::transitional_authenticator,
cql3::query_processor&,
::service::raft_group0_client&,
::service::migration_manager&,
auth::cache&> transitional_authenticator_reg(auth::PACKAGE_NAME + "TransitionalAuthenticator");
static const class_registrator<
auth::authorizer,
auth::transitional_authorizer,
cql3::query_processor&,
::service::raft_group0_client&,
::service::migration_manager&> transitional_authorizer_reg(auth::PACKAGE_NAME + "TransitionalAuthorizer");
future<> transitional_authenticator::create(std::string_view role_name, const authentication_options& options, ::service::group0_batch& mc) {
return _authenticator->create(role_name, options, mc);
}
future<> transitional_authenticator::alter(std::string_view role_name, const authentication_options& options, ::service::group0_batch& mc) {
return _authenticator->alter(role_name, options, mc);
}
future<> transitional_authenticator::drop(std::string_view role_name, ::service::group0_batch& mc) {
return _authenticator->drop(role_name, mc);
}
future<custom_options> transitional_authenticator::query_custom_options(std::string_view role_name) const {
return _authenticator->query_custom_options(role_name);
}
bool transitional_authenticator::uses_password_hashes() const {
return _authenticator->uses_password_hashes();
}
future<std::optional<sstring>> transitional_authenticator::get_password_hash(std::string_view role_name) const {
return _authenticator->get_password_hash(role_name);
}
const resource_set& transitional_authenticator::protected_resources() const {
return _authenticator->protected_resources();
}
::shared_ptr<sasl_challenge> transitional_authenticator::new_sasl_challenge() const {
class sasl_wrapper : public sasl_challenge {
public:
sasl_wrapper(::shared_ptr<sasl_challenge> sasl)
: _sasl(std::move(sasl)) {
}
virtual bytes evaluate_response(bytes_view client_response) override {
try {
return _sasl->evaluate_response(client_response);
} catch (const exceptions::authentication_exception&) {
_complete = true;
return {};
}
}
virtual bool is_complete() const override {
return _complete || _sasl->is_complete();
}
virtual future<authenticated_user> get_authenticated_user() const override {
return futurize_invoke([this] {
return _sasl->get_authenticated_user().handle_exception([](auto ep) {
try {
std::rethrow_exception(ep);
} catch (const exceptions::authentication_exception&) {
// return anon user
return make_ready_future<authenticated_user>(anonymous_user());
}
});
});
}
const sstring& get_username() const override {
return _sasl->get_username();
}
private:
::shared_ptr<sasl_challenge> _sasl;
bool _complete = false;
};
return ::make_shared<sasl_wrapper>(_authenticator->new_sasl_challenge());
}
future<> transitional_authenticator::ensure_superuser_is_created() const {
return _authenticator->ensure_superuser_is_created();
}
transitional_authorizer::transitional_authorizer(cql3::query_processor& qp)
: transitional_authorizer(std::make_unique<default_authorizer>(qp)) {
}
transitional_authorizer::transitional_authorizer(std::unique_ptr<authorizer> a)
: _authorizer(std::move(a)) {
}
transitional_authorizer::~transitional_authorizer() {
}
future<> transitional_authorizer::start() {
return _authorizer->start();
}
future<> transitional_authorizer::stop() {
return _authorizer->stop();
}
std::string_view transitional_authorizer::qualified_java_name() const {
return "com.scylladb.auth.TransitionalAuthorizer";
}
future<permission_set> transitional_authorizer::authorize(const role_or_anonymous&, const resource&) const {
static const permission_set transitional_permissions =
permission_set::of<
permission::CREATE,
permission::ALTER,
permission::DROP,
permission::SELECT,
permission::MODIFY>();
return make_ready_future<permission_set>(transitional_permissions);
}
future<> transitional_authorizer::grant(std::string_view s, permission_set ps, const resource& r, ::service::group0_batch& mc) {
return _authorizer->grant(s, std::move(ps), r, mc);
}
future<> transitional_authorizer::revoke(std::string_view s, permission_set ps, const resource& r, ::service::group0_batch& mc) {
return _authorizer->revoke(s, std::move(ps), r, mc);
}
future<std::vector<permission_details>> transitional_authorizer::list_all() const {
return _authorizer->list_all();
}
future<> transitional_authorizer::revoke_all(std::string_view s, ::service::group0_batch& mc) {
return _authorizer->revoke_all(s, mc);
}
future<> transitional_authorizer::revoke_all(const resource& r, ::service::group0_batch& mc) {
return _authorizer->revoke_all(r, mc);
}
const resource_set& transitional_authorizer::protected_resources() const {
return _authorizer->protected_resources();
}
}

auth/transitional.hh

@@ -0,0 +1,81 @@
/*
* Copyright (C) 2026-present ScyllaDB
*
* Modified by ScyllaDB
*/
/*
* SPDX-License-Identifier: (LicenseRef-ScyllaDB-Source-Available-1.0 and Apache-2.0)
*/
#pragma once
#include "auth/authenticator.hh"
#include "auth/authorizer.hh"
#include "auth/cache.hh"
namespace cql3 {
class query_processor;
}
namespace service {
class raft_group0_client;
class migration_manager;
}
namespace auth {
///
/// Transitional authenticator that allows anonymous access when credentials are not provided
/// or authentication fails. Used for migration scenarios.
///
class transitional_authenticator : public authenticator {
std::unique_ptr<authenticator> _authenticator;
public:
transitional_authenticator(cql3::query_processor& qp, ::service::raft_group0_client& g0, ::service::migration_manager& mm, cache& cache);
transitional_authenticator(std::unique_ptr<authenticator> a);
virtual future<> start() override;
virtual future<> stop() override;
virtual std::string_view qualified_java_name() const override;
virtual bool require_authentication() const override;
virtual authentication_option_set supported_options() const override;
virtual authentication_option_set alterable_options() const override;
virtual future<authenticated_user> authenticate(const credentials_map& credentials) const override;
virtual future<> create(std::string_view role_name, const authentication_options& options, ::service::group0_batch& mc) override;
virtual future<> alter(std::string_view role_name, const authentication_options& options, ::service::group0_batch& mc) override;
virtual future<> drop(std::string_view role_name, ::service::group0_batch& mc) override;
virtual future<custom_options> query_custom_options(std::string_view role_name) const override;
virtual bool uses_password_hashes() const override;
virtual future<std::optional<sstring>> get_password_hash(std::string_view role_name) const override;
virtual const resource_set& protected_resources() const override;
virtual ::shared_ptr<sasl_challenge> new_sasl_challenge() const override;
virtual future<> ensure_superuser_is_created() const override;
};
///
/// Transitional authorizer that grants a fixed set of permissions to all users.
/// Used for migration scenarios.
///
class transitional_authorizer : public authorizer {
std::unique_ptr<authorizer> _authorizer;
public:
transitional_authorizer(cql3::query_processor& qp);
transitional_authorizer(std::unique_ptr<authorizer> a);
~transitional_authorizer();
virtual future<> start() override;
virtual future<> stop() override;
virtual std::string_view qualified_java_name() const override;
virtual future<permission_set> authorize(const role_or_anonymous&, const resource&) const override;
virtual future<> grant(std::string_view s, permission_set ps, const resource& r, ::service::group0_batch& mc) override;
virtual future<> revoke(std::string_view s, permission_set ps, const resource& r, ::service::group0_batch& mc) override;
virtual future<std::vector<permission_details>> list_all() const override;
virtual future<> revoke_all(std::string_view s, ::service::group0_batch& mc) override;
virtual future<> revoke_all(const resource& r, ::service::group0_batch& mc) override;
virtual const resource_set& protected_resources() const override;
};
} // namespace auth



@@ -10,24 +10,15 @@
#include <random>
#include <unordered_set>
#include <algorithm>
#include <seastar/core/sleep.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/coroutine/maybe_yield.hh>
#include <seastar/util/later.hh>
#include "gms/endpoint_state.hh"
#include "gms/versioned_value.hh"
#include "keys/keys.hh"
#include "replica/database.hh"
#include "db/system_keyspace.hh"
#include "db/system_distributed_keyspace.hh"
#include "dht/token-sharding.hh"
#include "locator/token_metadata.hh"
#include "types/set.hh"
#include "gms/application_state.hh"
#include "gms/inet_address.hh"
#include "gms/gossiper.hh"
#include "gms/feature_service.hh"
#include "utils/assert.hh"
#include "utils/error_injection.hh"
#include "utils/UUID_gen.hh"
@@ -41,16 +32,6 @@
extern logging::logger cdc_log;
static int get_shard_count(const locator::host_id& endpoint, const gms::gossiper& g) {
auto ep_state = g.get_application_state_ptr(endpoint, gms::application_state::SHARD_COUNT);
return ep_state ? std::stoi(ep_state->value()) : -1;
}
static unsigned get_sharding_ignore_msb(const locator::host_id& endpoint, const gms::gossiper& g) {
auto ep_state = g.get_application_state_ptr(endpoint, gms::application_state::IGNORE_MSB_BITS);
return ep_state ? std::stoi(ep_state->value()) : 0;
}
namespace db {
extern thread_local data_type cdc_streams_set_type;
}
@@ -225,12 +206,6 @@ static std::vector<stream_id> create_stream_ids(
return result;
}
bool should_propose_first_generation(const locator::host_id& my_host_id, const gms::gossiper& g) {
return g.for_each_endpoint_state_until([&] (const gms::endpoint_state& eps) {
return stop_iteration(my_host_id < eps.get_host_id());
}) == stop_iteration::no;
}
bool is_cdc_generation_optimal(const cdc::topology_description& gen, const locator::token_metadata& tm) {
if (tm.sorted_tokens().size() != gen.entries().size()) {
// We probably have garbage streams from old generations
@@ -330,38 +305,6 @@ future<utils::chunked_vector<mutation>> get_cdc_generation_mutations_v3(
co_return co_await get_common_cdc_generation_mutations(s, pkey, std::move(get_ckey), desc, mutation_size_threshold, ts);
}
// non-static for testing
size_t limit_of_streams_in_topology_description() {
// Each stream takes 16B and we don't want to exceed 4MB so we can have
// at most 262144 streams but not less than 1 per vnode.
return 4 * 1024 * 1024 / 16;
}
// non-static for testing
topology_description limit_number_of_streams_if_needed(topology_description&& desc) {
uint64_t streams_count = 0;
for (auto& tr_desc : desc.entries()) {
streams_count += tr_desc.streams.size();
}
size_t limit = std::max(limit_of_streams_in_topology_description(), desc.entries().size());
if (limit >= streams_count) {
return std::move(desc);
}
size_t streams_per_vnode_limit = limit / desc.entries().size();
auto entries = std::move(desc).entries();
auto start = entries.back().token_range_end;
for (size_t idx = 0; idx < entries.size(); ++idx) {
auto end = entries[idx].token_range_end;
if (entries[idx].streams.size() > streams_per_vnode_limit) {
entries[idx].streams =
create_stream_ids(idx, start, end, streams_per_vnode_limit, entries[idx].sharding_ignore_msb);
}
start = end;
}
return topology_description(std::move(entries));
}
// Compute a set of tokens that split the token ring into vnodes.
static auto get_tokens(const std::unordered_set<dht::token>& bootstrap_tokens, const locator::token_metadata_ptr tmptr) {
auto tokens = tmptr->sorted_tokens();
@@ -419,364 +362,6 @@ db_clock::time_point new_generation_timestamp(bool add_delay, std::chrono::milli
return ts;
}
future<cdc::generation_id> generation_service::legacy_make_new_generation(const std::unordered_set<dht::token>& bootstrap_tokens, bool add_delay) {
const locator::token_metadata_ptr tmptr = _token_metadata.get();
// Fetch sharding parameters for a node that owns vnode ending with this token
// using gossiped application states.
auto get_sharding_info = [&] (dht::token end) -> std::pair<size_t, uint8_t> {
if (bootstrap_tokens.contains(end)) {
return {smp::count, _cfg.ignore_msb_bits};
} else {
auto endpoint = tmptr->get_endpoint(end);
if (!endpoint) {
throw std::runtime_error(
format("Can't find endpoint for token {}", end));
}
auto sc = get_shard_count(*endpoint, _gossiper);
return {sc > 0 ? sc : 1, get_sharding_ignore_msb(*endpoint, _gossiper)};
}
};
auto uuid = utils::make_random_uuid();
auto gen = make_new_generation_description(bootstrap_tokens, get_sharding_info, tmptr);
// Our caller should ensure that there are normal tokens in the token ring.
auto normal_token_owners = tmptr->count_normal_token_owners();
SCYLLA_ASSERT(normal_token_owners);
if (_feature_service.cdc_generations_v2) {
cdc_log.info("Inserting new generation data at UUID {}", uuid);
// This may take a while.
co_await _sys_dist_ks.local().insert_cdc_generation(uuid, gen, { normal_token_owners });
// Begin the race.
cdc::generation_id_v2 gen_id{new_generation_timestamp(add_delay, _cfg.ring_delay), uuid};
cdc_log.info("New CDC generation: {}", gen_id);
co_return gen_id;
}
// The CDC_GENERATIONS_V2 feature is not enabled: some nodes may still not understand the V2 format.
// We must create a generation in the old format.
// If the cluster is large we may end up with a generation that contains
// large number of streams. This is problematic because we store the
// generation in a single row (V1 format). For a generation with large number of rows
// this will lead to a row that can be as big as 32MB. This is much more
// than the limit imposed by commitlog_segment_size_in_mb. If the size of
// the row that describes a new generation grows above
// commitlog_segment_size_in_mb, the write will fail and the new node won't
// be able to join. To avoid such problem we make sure that such row is
// always smaller than 4MB. We do that by removing some CDC streams from
// each vnode if the total number of streams is too large.
gen = limit_number_of_streams_if_needed(std::move(gen));
cdc_log.warn(
"Creating a new CDC generation in the old storage format due to a partially upgraded cluster:"
" the CDC_GENERATIONS_V2 feature is known by this node, but not enabled in the cluster."
" The old storage format forces us to create a suboptimal generation."
" It is recommended to finish the upgrade and then create a new generation either by bootstrapping"
" a new node or running the checkAndRepairCdcStreams nodetool command.");
// Begin the race.
cdc::generation_id_v1 gen_id{new_generation_timestamp(add_delay, _cfg.ring_delay)};
co_await _sys_dist_ks.local().insert_cdc_topology_description(gen_id, std::move(gen), { normal_token_owners });
cdc_log.info("New CDC generation: {}", gen_id);
co_return gen_id;
}
/* Retrieves CDC streams generation timestamp from the given endpoint's application state (broadcasted through gossip).
* We might be during a rolling upgrade, so the timestamp might not be there (if the other node didn't upgrade yet),
* but if the cluster already supports CDC, then every newly joining node will propose a new CDC generation,
* which means it will gossip the generation's timestamp.
*/
static std::optional<cdc::generation_id> get_generation_id_for(const locator::host_id& endpoint, const gms::endpoint_state& eps) {
const auto* gen_id_ptr = eps.get_application_state_ptr(gms::application_state::CDC_GENERATION_ID);
if (!gen_id_ptr) {
return std::nullopt;
}
auto gen_id_string = gen_id_ptr->value();
cdc_log.trace("endpoint={}, gen_id_string={}", endpoint, gen_id_string);
return gms::versioned_value::cdc_generation_id_from_string(gen_id_string);
}
static future<std::optional<cdc::topology_description>> retrieve_generation_data_v2(
cdc::generation_id_v2 id,
db::system_keyspace& sys_ks,
db::system_distributed_keyspace& sys_dist_ks) {
auto cdc_gen = co_await sys_dist_ks.read_cdc_generation(id.id);
if (!cdc_gen && id.id.is_timestamp()) {
// If we entered legacy mode due to recovery, we (or some other node)
// might gossip about a generation that was previously propagated
// through raft. If that's the case, it will sit in
// the system.cdc_generations_v3 table.
//
// If the provided id is not a timeuuid, we don't want to query
// the system.cdc_generations_v3 table. This table stores generation
// ids as timeuuids. If the provided id is not a timeuuid, the
// generation cannot be in system.cdc_generations_v3. Also, the query
// would fail with a marshaling error.
cdc_gen = co_await sys_ks.read_cdc_generation_opt(id.id);
}
co_return cdc_gen;
}
static future<std::optional<cdc::topology_description>> retrieve_generation_data(
cdc::generation_id gen_id,
db::system_keyspace& sys_ks,
db::system_distributed_keyspace& sys_dist_ks,
db::system_distributed_keyspace::context ctx) {
return std::visit(make_visitor(
[&] (const cdc::generation_id_v1& id) {
return sys_dist_ks.read_cdc_topology_description(id, ctx);
},
[&] (const cdc::generation_id_v2& id) {
return retrieve_generation_data_v2(id, sys_ks, sys_dist_ks);
}
), gen_id);
}
static future<> do_update_streams_description(
cdc::generation_id gen_id,
db::system_keyspace& sys_ks,
db::system_distributed_keyspace& sys_dist_ks,
db::system_distributed_keyspace::context ctx) {
if (co_await sys_dist_ks.cdc_desc_exists(get_ts(gen_id), ctx)) {
cdc_log.info("Generation {}: streams description table already updated.", gen_id);
co_return;
}
// We might race with another node also inserting the description, but that's ok. It's an idempotent operation.
auto topo = co_await retrieve_generation_data(gen_id, sys_ks, sys_dist_ks, ctx);
if (!topo) {
throw no_generation_data_exception(gen_id);
}
co_await sys_dist_ks.create_cdc_desc(get_ts(gen_id), *topo, ctx);
cdc_log.info("CDC description table successfully updated with generation {}.", gen_id);
}
/* Inform CDC users about a generation of streams (identified by the given timestamp)
* by inserting it into the cdc_streams table.
*
* Assumes that the cdc_generation_descriptions table contains this generation.
*
* Returning from this function does not mean that the table update was successful: the function
* might run an asynchronous task in the background.
*/
static future<> update_streams_description(
cdc::generation_id gen_id,
db::system_keyspace& sys_ks,
shared_ptr<db::system_distributed_keyspace> sys_dist_ks,
noncopyable_function<unsigned()> get_num_token_owners,
abort_source& abort_src) {
try {
co_await do_update_streams_description(gen_id, sys_ks, *sys_dist_ks, { get_num_token_owners() });
} catch (...) {
cdc_log.warn(
"Could not update CDC description table with generation {}: {}. Will retry in the background.",
gen_id, std::current_exception());
// It is safe to discard this future: we keep system distributed keyspace alive.
(void)(([] (cdc::generation_id gen_id,
db::system_keyspace& sys_ks,
shared_ptr<db::system_distributed_keyspace> sys_dist_ks,
noncopyable_function<unsigned()> get_num_token_owners,
abort_source& abort_src) -> future<> {
while (true) {
try {
co_await sleep_abortable(std::chrono::seconds(60), abort_src);
} catch (seastar::sleep_aborted&) {
cdc_log.warn("Aborted updating CDC description table with generation {}", gen_id);
co_return;
}
try {
co_await do_update_streams_description(gen_id, sys_ks, *sys_dist_ks, { get_num_token_owners() });
co_return;
} catch (...) {
cdc_log.warn(
"Could not update CDC description table with generation {}: {}. Will try again.",
gen_id, std::current_exception());
}
}
})(gen_id, sys_ks, std::move(sys_dist_ks), std::move(get_num_token_owners), abort_src));
}
}
static db_clock::time_point as_timepoint(const utils::UUID& uuid) {
return db_clock::time_point(utils::UUID_gen::unix_timestamp(uuid));
}
static future<std::vector<db_clock::time_point>> get_cdc_desc_v1_timestamps(
db::system_distributed_keyspace& sys_dist_ks,
abort_source& abort_src,
const noncopyable_function<unsigned()>& get_num_token_owners) {
while (true) {
try {
co_return co_await sys_dist_ks.get_cdc_desc_v1_timestamps({ get_num_token_owners() });
} catch (...) {
cdc_log.warn(
"Failed to retrieve generation timestamps for rewriting: {}. Retrying in 60s.",
std::current_exception());
}
co_await sleep_abortable(std::chrono::seconds(60), abort_src);
}
}
// Contains a CDC log table's creation time (extracted from its schema's id)
// and its CDC TTL setting.
struct time_and_ttl {
db_clock::time_point creation_time;
int ttl;
};
/*
* See `maybe_rewrite_streams_descriptions`.
* This is the long-running-in-the-background part of that function.
* It returns the timestamp of the last rewritten generation (if any).
*/
static future<std::optional<cdc::generation_id_v1>> rewrite_streams_descriptions(
std::vector<time_and_ttl> times_and_ttls,
db::system_keyspace& sys_ks,
shared_ptr<db::system_distributed_keyspace> sys_dist_ks,
noncopyable_function<unsigned()> get_num_token_owners,
abort_source& abort_src) {
cdc_log.info("Retrieving generation timestamps for rewriting...");
auto tss = co_await get_cdc_desc_v1_timestamps(*sys_dist_ks, abort_src, get_num_token_owners);
cdc_log.info("Generation timestamps retrieved.");
// Find first generation timestamp such that some CDC log table may contain data before this timestamp.
// This predicate is monotonic w.r.t the timestamps.
auto now = db_clock::now();
std::sort(tss.begin(), tss.end());
auto first = std::partition_point(tss.begin(), tss.end(), [&] (db_clock::time_point ts) {
// partition_point finds first element that does *not* satisfy the predicate.
return std::none_of(times_and_ttls.begin(), times_and_ttls.end(),
[&] (const time_and_ttl& tat) {
// In this CDC log table there are no entries older than the table's creation time
// or (now - the table's ttl). We subtract 10s to account for some possible clock drift.
// If ttl is set to 0 then entries in this table never expire. In that case we look
// only at the table's creation time.
auto no_entries_older_than =
(tat.ttl == 0 ? tat.creation_time : std::max(tat.creation_time, now - std::chrono::seconds(tat.ttl)))
- std::chrono::seconds(10);
return no_entries_older_than < ts;
});
});
// Find first generation timestamp such that some CDC log table may contain data in this generation.
// This and all later generations need to be written to the new streams table.
if (first != tss.begin()) {
--first;
}
if (first == tss.end()) {
cdc_log.info("No generations to rewrite.");
co_return std::nullopt;
}
cdc_log.info("First generation to rewrite: {}", *first);
bool each_success = true;
co_await max_concurrent_for_each(first, tss.end(), 10, [&] (db_clock::time_point ts) -> future<> {
while (true) {
try {
co_return co_await do_update_streams_description(cdc::generation_id_v1{ts}, sys_ks, *sys_dist_ks, { get_num_token_owners() });
} catch (const no_generation_data_exception& e) {
cdc_log.error("Failed to rewrite streams for generation {}: {}. Giving up.", ts, e);
each_success = false;
co_return;
} catch (...) {
cdc_log.warn("Failed to rewrite streams for generation {}: {}. Retrying in 60s.", ts, std::current_exception());
}
co_await sleep_abortable(std::chrono::seconds(60), abort_src);
}
});
if (each_success) {
cdc_log.info("Rewriting stream tables finished successfully.");
} else {
cdc_log.info("Rewriting stream tables finished, but some generations could not be rewritten (check the logs).");
}
if (first != tss.end()) {
co_return cdc::generation_id_v1{*std::prev(tss.end())};
}
co_return std::nullopt;
}
future<> generation_service::maybe_rewrite_streams_descriptions() {
if (!_db.has_schema(_sys_dist_ks.local().NAME, _sys_dist_ks.local().CDC_DESC_V1)) {
// This cluster never went through a Scylla version which used this table
// or the user deleted the table. Nothing to do.
co_return;
}
if (co_await _sys_ks.local().cdc_is_rewritten()) {
co_return;
}
if (_cfg.dont_rewrite_streams) {
cdc_log.warn("Stream rewriting disabled. Manual administrator intervention may be required...");
co_return;
}
// For each CDC log table get the TTL setting (from CDC options) and the table's creation time
std::vector<time_and_ttl> times_and_ttls;
_db.get_tables_metadata().for_each_table([&] (table_id, lw_shared_ptr<replica::table> t) {
auto& s = *t->schema();
auto base = cdc::get_base_table(_db, s.ks_name(), s.cf_name());
if (!base) {
// Not a CDC log table.
return;
}
auto& cdc_opts = base->cdc_options();
if (!cdc_opts.enabled()) {
// This table is named like a CDC log table but it's not one.
return;
}
times_and_ttls.push_back(time_and_ttl{as_timepoint(s.id().uuid()), cdc_opts.ttl()});
});
if (times_and_ttls.empty()) {
// There's no point in rewriting old generations' streams (they don't contain any data).
cdc_log.info("No CDC log tables present, not rewriting stream tables.");
co_return co_await _sys_ks.local().cdc_set_rewritten(std::nullopt);
}
auto get_num_token_owners = [tm = _token_metadata.get()] { return tm->count_normal_token_owners(); };
// This code is racing with node startup. At this point, we're most likely still waiting for gossip to settle
// and some nodes that are UP may still be marked as DOWN by us.
// Let's sleep a bit to increase the chance that the first attempt at rewriting succeeds (it's still ok if
// it doesn't - we'll retry - but it's nice if we succeed without any warnings).
co_await sleep_abortable(std::chrono::seconds(10), _abort_src);
cdc_log.info("Rewriting stream tables in the background...");
auto last_rewritten = co_await rewrite_streams_descriptions(
std::move(times_and_ttls),
_sys_ks.local(),
_sys_dist_ks.local_shared(),
std::move(get_num_token_owners),
_abort_src);
co_await _sys_ks.local().cdc_set_rewritten(last_rewritten);
}
static void assert_shard_zero(const sstring& where) {
if (this_shard_id() != 0) {
on_internal_error(cdc_log, format("`{}`: must be run on shard 0", where));
}
}
class and_reducer {
private:
bool _result = true;
@@ -803,195 +388,26 @@ public:
}
};
class generation_handling_nonfatal_exception : public std::runtime_error {
using std::runtime_error::runtime_error;
};
constexpr char could_not_retrieve_msg_template[]
= "Could not retrieve CDC streams with timestamp {} upon gossip event. Reason: \"{}\". Action: {}.";
generation_service::generation_service(
config cfg, gms::gossiper& g, sharded<db::system_distributed_keyspace>& sys_dist_ks,
sharded<db::system_keyspace>& sys_ks,
abort_source& abort_src, const locator::shared_token_metadata& stm, gms::feature_service& f,
replica::database& db)
: _cfg(std::move(cfg))
, _gossiper(g)
, _sys_dist_ks(sys_dist_ks)
, _sys_ks(sys_ks)
, _abort_src(abort_src)
, _token_metadata(stm)
, _feature_service(f)
, _db(db)
{
}
future<> generation_service::stop() {
try {
co_await std::move(_cdc_streams_rewrite_complete);
} catch (...) {
cdc_log.error("CDC stream rewrite failed: {}", std::current_exception());
}
if (_joined && (this_shard_id() == 0)) {
co_await leave_ring();
}
_stopped = true;
}
generation_service::~generation_service() {
SCYLLA_ASSERT(_stopped);
}
future<> generation_service::after_join(std::optional<cdc::generation_id>&& startup_gen_id) {
assert_shard_zero(__PRETTY_FUNCTION__);
_gen_id = std::move(startup_gen_id);
_gossiper.register_(shared_from_this());
_joined = true;
// Retrieve the latest CDC generation seen in gossip (if any).
co_await legacy_scan_cdc_generations();
// Ensure that the new CDC stream description table has all required streams.
// See the function's comment for details.
//
// Since this depends on the entire cluster (and therefore we cannot guarantee
// timely completion), run it in the background and wait for it in stop().
_cdc_streams_rewrite_complete = maybe_rewrite_streams_descriptions();
}
future<> generation_service::leave_ring() {
assert_shard_zero(__PRETTY_FUNCTION__);
_joined = false;
co_await _gossiper.unregister_(shared_from_this());
}
future<> generation_service::on_join(gms::inet_address ep, locator::host_id id, gms::endpoint_state_ptr ep_state, gms::permit_id pid) {
return on_change(ep, id, ep_state->get_application_state_map(), pid);
}
future<> generation_service::on_change(gms::inet_address ep, locator::host_id id, const gms::application_state_map& states, gms::permit_id pid) {
assert_shard_zero(__PRETTY_FUNCTION__);
return make_ready_future<>();
}
future<> generation_service::check_and_repair_cdc_streams() {
// FIXME: support Raft group 0-based topology changes
if (!_joined) {
throw std::runtime_error("check_and_repair_cdc_streams: node not initialized yet");
}
std::optional<cdc::generation_id> latest = _gen_id;
_gossiper.for_each_endpoint_state([&] (const gms::endpoint_state& state) {
auto addr = state.get_host_id();
if (_gossiper.is_left(addr)) {
cdc_log.info("check_and_repair_cdc_streams ignored node {} because it is in LEFT state", addr);
return;
}
if (!_gossiper.is_normal(addr)) {
throw std::runtime_error(fmt::format("All nodes must be in NORMAL or LEFT state while performing check_and_repair_cdc_streams"
" ({} is in state {})", addr, _gossiper.get_gossip_status(state)));
}
const auto gen_id = get_generation_id_for(addr, state);
if (!latest || (gen_id && get_ts(*gen_id) > get_ts(*latest))) {
latest = gen_id;
}
});
auto tmptr = _token_metadata.get();
auto sys_dist_ks = get_sys_dist_ks();
bool should_regenerate = false;
if (!latest) {
cdc_log.warn("check_and_repair_cdc_streams: no generation observed in gossip");
should_regenerate = true;
} else if (std::holds_alternative<cdc::generation_id_v1>(*latest)
&& _feature_service.cdc_generations_v2) {
cdc_log.info(
"Cluster still using CDC generation storage format V1 (id: {}), even though it already understands the V2 format."
" Creating a new generation using V2.", *latest);
should_regenerate = true;
} else {
cdc_log.info("check_and_repair_cdc_streams: last generation observed in gossip: {}", *latest);
static const auto timeout_msg = "Timeout while fetching CDC topology description";
static const auto topology_read_error_note = "Note: this is likely caused by"
" node(s) being down or unreachable. It is recommended to check the network and"
" restart/remove the failed node(s), then retry checkAndRepairCdcStreams command";
static const auto exception_translating_msg = "Translating the exception to `request_execution_exception`";
std::optional<topology_description> gen;
try {
gen = co_await retrieve_generation_data(*latest, _sys_ks.local(), *sys_dist_ks, { tmptr->count_normal_token_owners() });
} catch (exceptions::request_timeout_exception& e) {
cdc_log.error("{}: \"{}\". {}.", timeout_msg, e.what(), exception_translating_msg);
throw exceptions::request_execution_exception(exceptions::exception_code::READ_TIMEOUT,
format("{}. {}.", timeout_msg, topology_read_error_note));
} catch (exceptions::unavailable_exception& e) {
static const auto unavailable_msg = "Node(s) unavailable while fetching CDC topology description";
cdc_log.error("{}: \"{}\". {}.", unavailable_msg, e.what(), exception_translating_msg);
throw exceptions::request_execution_exception(exceptions::exception_code::UNAVAILABLE,
format("{}. {}.", unavailable_msg, topology_read_error_note));
} catch (...) {
const auto ep = std::current_exception();
if (is_timeout_exception(ep)) {
cdc_log.error("{}: \"{}\". {}.", timeout_msg, ep, exception_translating_msg);
throw exceptions::request_execution_exception(exceptions::exception_code::READ_TIMEOUT,
format("{}. {}.", timeout_msg, topology_read_error_note));
}
// On exotic errors proceed with regeneration
cdc_log.error("Exception while reading CDC topology description: \"{}\". Regenerating streams anyway.", ep);
should_regenerate = true;
}
if (!gen) {
cdc_log.error(
"Could not find CDC generation with timestamp {} in distributed system tables (current time: {}),"
" even though some node gossiped about it.",
latest, db_clock::now());
should_regenerate = true;
} else if (!is_cdc_generation_optimal(*gen, *tmptr)) {
should_regenerate = true;
cdc_log.info("CDC generation {} needs repair, regenerating", latest);
}
}
if (!should_regenerate) {
if (latest != _gen_id) {
co_await legacy_do_handle_cdc_generation(*latest);
}
cdc_log.info("CDC generation {} does not need repair", latest);
co_return;
}
const auto new_gen_id = co_await legacy_make_new_generation({}, true);
// Need to artificially update our STATUS so other nodes handle the generation ID change
// FIXME: after 0e0282cd nodes do not require a STATUS update to react to CDC generation changes.
// The artificial STATUS update here should eventually be removed (in a few releases).
auto status = _gossiper.get_this_endpoint_state_ptr()->get_application_state_ptr(gms::application_state::STATUS);
if (!status) {
cdc_log.error("Our STATUS is missing");
cdc_log.error("Aborting CDC generation repair due to missing STATUS");
co_return;
}
// Update _gen_id first, so that legacy_do_handle_cdc_generation (which will get called due to the status update)
// won't try to update the gossiper, which would result in a deadlock inside add_local_application_state
_gen_id = new_gen_id;
co_await _gossiper.add_local_application_state(
std::pair(gms::application_state::CDC_GENERATION_ID, gms::versioned_value::cdc_generation_id(new_gen_id)),
std::pair(gms::application_state::STATUS, *status)
);
co_await _sys_ks.local().update_cdc_generation_id(new_gen_id);
}
-future<> generation_service::handle_cdc_generation(cdc::generation_id_v2 gen_id) {
+future<> generation_service::handle_cdc_generation(cdc::generation_id gen_id) {
auto ts = get_ts(gen_id);
if (co_await container().map_reduce(and_reducer(), [ts] (generation_service& svc) {
return !svc._cdc_metadata.prepare(ts);
@@ -1013,171 +429,8 @@ future<> generation_service::handle_cdc_generation(cdc::generation_id_v2 gen_id)
}
}
future<> generation_service::legacy_handle_cdc_generation(std::optional<cdc::generation_id> gen_id) {
assert_shard_zero(__PRETTY_FUNCTION__);
if (!gen_id) {
co_return;
}
if (!_sys_dist_ks.local_is_initialized() || !_sys_dist_ks.local().started()) {
on_internal_error(cdc_log, "Legacy handle CDC generation with sys.dist.ks. down");
}
// The service should not be listening for generation changes until after the node
// is bootstrapped and since the node leaves the ring on decommission
if (co_await container().map_reduce(and_reducer(), [ts = get_ts(*gen_id)] (generation_service& svc) {
return !svc._cdc_metadata.prepare(ts);
})) {
co_return;
}
bool using_this_gen = false;
try {
using_this_gen = co_await legacy_do_handle_cdc_generation_intercept_nonfatal_errors(*gen_id);
} catch (generation_handling_nonfatal_exception& e) {
cdc_log.warn(could_not_retrieve_msg_template, gen_id, e.what(), "retrying in the background");
legacy_async_handle_cdc_generation(*gen_id);
co_return;
} catch (...) {
cdc_log.error(could_not_retrieve_msg_template, gen_id, std::current_exception(), "not retrying");
co_return; // Exotic ("fatal") exception => do not retry
}
if (using_this_gen) {
cdc_log.info("Starting to use generation {}", *gen_id);
co_await update_streams_description(*gen_id, _sys_ks.local(), get_sys_dist_ks(),
[&tm = _token_metadata] { return tm.get()->count_normal_token_owners(); },
_abort_src);
}
}
void generation_service::legacy_async_handle_cdc_generation(cdc::generation_id gen_id) {
assert_shard_zero(__PRETTY_FUNCTION__);
(void)(([] (cdc::generation_id gen_id, shared_ptr<generation_service> svc) -> future<> {
while (true) {
co_await sleep_abortable(std::chrono::seconds(5), svc->_abort_src);
try {
bool using_this_gen = co_await svc->legacy_do_handle_cdc_generation_intercept_nonfatal_errors(gen_id);
if (using_this_gen) {
cdc_log.info("Starting to use generation {}", gen_id);
co_await update_streams_description(gen_id, svc->_sys_ks.local(), svc->get_sys_dist_ks(),
[&tm = svc->_token_metadata] { return tm.get()->count_normal_token_owners(); },
svc->_abort_src);
}
co_return;
} catch (generation_handling_nonfatal_exception& e) {
cdc_log.warn(could_not_retrieve_msg_template, gen_id, e.what(), "continuing to retry in the background");
} catch (...) {
cdc_log.error(could_not_retrieve_msg_template, gen_id, std::current_exception(), "not retrying anymore");
co_return; // Exotic ("fatal") exception => do not retry
}
if (co_await svc->container().map_reduce(and_reducer(), [ts = get_ts(gen_id)] (generation_service& svc) {
return svc._cdc_metadata.known_or_obsolete(ts);
})) {
co_return;
}
}
})(gen_id, shared_from_this()));
}
future<> generation_service::legacy_scan_cdc_generations() {
assert_shard_zero(__PRETTY_FUNCTION__);
std::optional<cdc::generation_id> latest;
_gossiper.for_each_endpoint_state([&] (const gms::endpoint_state& eps) {
auto gen_id = get_generation_id_for(eps.get_host_id(), eps);
if (!latest || (gen_id && get_ts(*gen_id) > get_ts(*latest))) {
latest = gen_id;
}
});
if (latest) {
cdc_log.info("Latest generation seen during startup: {}", *latest);
co_await legacy_handle_cdc_generation(latest);
} else {
cdc_log.info("No generation seen during startup.");
}
}
future<bool> generation_service::legacy_do_handle_cdc_generation_intercept_nonfatal_errors(cdc::generation_id gen_id) {
assert_shard_zero(__PRETTY_FUNCTION__);
// Use futurize_invoke to catch all exceptions from legacy_do_handle_cdc_generation.
return futurize_invoke([this, gen_id] {
return legacy_do_handle_cdc_generation(gen_id);
}).handle_exception([] (std::exception_ptr ep) -> future<bool> {
try {
std::rethrow_exception(ep);
} catch (exceptions::request_timeout_exception& e) {
throw generation_handling_nonfatal_exception(e.what());
} catch (exceptions::unavailable_exception& e) {
throw generation_handling_nonfatal_exception(e.what());
} catch (exceptions::read_failure_exception& e) {
throw generation_handling_nonfatal_exception(e.what());
} catch (...) {
const auto ep = std::current_exception();
if (is_timeout_exception(ep)) {
throw generation_handling_nonfatal_exception(format("{}", ep));
}
throw;
}
});
}
future<bool> generation_service::legacy_do_handle_cdc_generation(cdc::generation_id gen_id) {
assert_shard_zero(__PRETTY_FUNCTION__);
auto sys_dist_ks = get_sys_dist_ks();
auto gen = co_await retrieve_generation_data(gen_id, _sys_ks.local(), *sys_dist_ks, { _token_metadata.get()->count_normal_token_owners() });
if (!gen) {
// This may happen during raft upgrade when a node gossips about a generation that
// was propagated through raft and we didn't apply it yet.
throw generation_handling_nonfatal_exception(fmt::format(
"Could not find CDC generation {} in distributed system tables (current time: {}),"
" even though some node gossiped about it.",
gen_id, db_clock::now()));
}
// We always gossip about the generation with the greatest timestamp. Specific nodes may remember older generations,
// but eventually they forget when their clocks move past the latest generation's timestamp.
// The cluster as a whole is only interested in the last generation so restarting nodes may learn what it is.
// We assume that generation changes don't happen ``too often'' so every node can learn about a generation
// before it is superseded by a newer one, which causes nodes to start gossiping about the newer one.
// The assumption follows from the requirement of bootstrapping nodes sequentially.
if (!_gen_id || get_ts(*_gen_id) < get_ts(gen_id)) {
_gen_id = gen_id;
co_await _sys_ks.local().update_cdc_generation_id(gen_id);
co_await _gossiper.add_local_application_state(
gms::application_state::CDC_GENERATION_ID, gms::versioned_value::cdc_generation_id(gen_id));
}
// Return `true` iff the generation was inserted on any of our shards.
co_return co_await container().map_reduce(or_reducer(),
[ts = get_ts(gen_id), &gen] (generation_service& svc) -> future<bool> {
// We need to copy it here before awaiting anything to avoid destruction of the captures.
const auto timestamp = ts;
topology_description gen_copy = co_await gen->clone_async();
co_return svc._cdc_metadata.insert(timestamp, std::move(gen_copy));
});
}
shared_ptr<db::system_distributed_keyspace> generation_service::get_sys_dist_ks() {
assert_shard_zero(__PRETTY_FUNCTION__);
if (!_sys_dist_ks.local_is_initialized()) {
throw std::runtime_error("system distributed keyspace not initialized");
}
return _sys_dist_ks.local_shared();
}
db_clock::time_point get_ts(const generation_id& gen_id) {
-return std::visit([] (auto& id) { return id.ts; }, gen_id);
+return gen_id.ts;
}
future<mutation> create_table_streams_mutation(table_id table, db_clock::time_point stream_ts, const locator::tablet_map& map, api::timestamp_type ts) {

@@ -34,16 +34,6 @@ namespace seastar {
class abort_source;
} // namespace seastar
namespace db {
class config;
class system_distributed_keyspace;
} // namespace db
namespace gms {
class inet_address;
class gossiper;
} // namespace gms
namespace locator {
class tablet_map;
} // namespace locator
@@ -153,23 +143,6 @@ struct cdc_stream_diff {
using table_streams = std::map<api::timestamp_type, committed_stream_set>;
class no_generation_data_exception : public std::runtime_error {
public:
no_generation_data_exception(cdc::generation_id generation_ts)
: std::runtime_error(fmt::format("could not find generation data for timestamp {}", generation_ts))
{}
};
/* Should be called when we're restarting and we noticed that we didn't save any streams timestamp in our local tables,
* which means that we're probably upgrading from a non-CDC/old CDC version (another reason could be
* that there's a bug, or the user messed with our local tables).
*
* It checks whether we should be the node to propose the first generation of CDC streams.
* The chosen condition is arbitrary, it only tries to make sure that no two nodes propose a generation of streams
* when upgrading, and nothing bad happens if they for some reason do (it's mostly an optimization).
*/
bool should_propose_first_generation(const locator::host_id& me, const gms::gossiper&);
/*
* Checks if the CDC generation is optimal, which is true if its `topology_description` is consistent
* with `token_metadata`.

@@ -15,48 +15,22 @@
namespace cdc {
-struct generation_id_v1 {
-db_clock::time_point ts;
-bool operator==(const generation_id_v1&) const = default;
-};
-struct generation_id_v2 {
+struct generation_id {
db_clock::time_point ts;
utils::UUID id;
-bool operator==(const generation_id_v2&) const = default;
+bool operator==(const generation_id&) const = default;
};
-using generation_id = std::variant<generation_id_v1, generation_id_v2>;
db_clock::time_point get_ts(const generation_id&);
} // namespace cdc
-template <>
-struct fmt::formatter<cdc::generation_id_v1> {
-constexpr auto parse(format_parse_context& ctx) { return ctx.begin(); }
-template <typename FormatContext>
-auto format(const cdc::generation_id_v1& gen_id, FormatContext& ctx) const {
-return fmt::format_to(ctx.out(), "{}", gen_id.ts);
-}
-};
-template <>
-struct fmt::formatter<cdc::generation_id_v2> {
-constexpr auto parse(format_parse_context& ctx) { return ctx.begin(); }
-template <typename FormatContext>
-auto format(const cdc::generation_id_v2& gen_id, FormatContext& ctx) const {
-return fmt::format_to(ctx.out(), "({}, {})", gen_id.ts, gen_id.id);
-}
-};
template <>
struct fmt::formatter<cdc::generation_id> {
constexpr auto parse(format_parse_context& ctx) { return ctx.begin(); }
template <typename FormatContext>
auto format(const cdc::generation_id& gen_id, FormatContext& ctx) const {
-return std::visit([&ctx] (auto& id) {
-return fmt::format_to(ctx.out(), "{}", id);
-}, gen_id);
+return fmt::format_to(ctx.out(), "({}, {})", gen_id.ts, gen_id.id);
}
};

@@ -11,135 +11,51 @@
#include <seastar/core/sharded.hh>
#include "cdc/metadata.hh"
#include "cdc/generation_id.hh"
#include "gms/i_endpoint_state_change_subscriber.hh"
namespace db {
class system_distributed_keyspace;
class system_keyspace;
}
namespace gms {
class gossiper;
class feature_service;
}
namespace seastar {
class abort_source;
}
namespace locator {
class shared_token_metadata;
class tablet_map;
}
namespace cdc {
class generation_service : public peering_sharded_service<generation_service>
-, public async_sharded_service<generation_service>
-, public gms::i_endpoint_state_change_subscriber {
+, public async_sharded_service<generation_service> {
public:
struct config {
unsigned ignore_msb_bits;
std::chrono::milliseconds ring_delay;
bool dont_rewrite_streams = false;
};
private:
bool _stopped = false;
// The node has joined the token ring. Set to `true` on `after_join` call.
bool _joined = false;
config _cfg;
gms::gossiper& _gossiper;
sharded<db::system_distributed_keyspace>& _sys_dist_ks;
sharded<db::system_keyspace>& _sys_ks;
abort_source& _abort_src;
const locator::shared_token_metadata& _token_metadata;
gms::feature_service& _feature_service;
replica::database& _db;
-/* Maintains the set of known CDC generations used to pick streams for log writes (i.e., the partition keys of these log writes).
-* Updated in response to certain gossip events (see the handle_cdc_generation function).
-*/
+/* Maintains the set of known CDC generations used to pick streams for log writes (i.e., the partition keys of these log writes). */
cdc::metadata _cdc_metadata;
/* The latest known generation timestamp and the timestamp that we're currently gossiping
* (as CDC_GENERATION_ID application state).
*
* Only shard 0 manages this, hence it will be std::nullopt on all shards other than 0.
* This timestamp is also persisted in the system.cdc_local table.
*
* On shard 0 this may be nullopt only in one special case: rolling upgrade, when we upgrade
* from an old version of Scylla that didn't support CDC. In that case one node in the cluster
* will create the first generation and start gossiping it; it may be us, or it may be some
* different node. In any case, eventually - after one of the nodes gossips the first timestamp
* - we'll catch on and this variable will be updated with that generation.
*/
std::optional<cdc::generation_id> _gen_id;
future<> _cdc_streams_rewrite_complete = make_ready_future<>();
public:
-generation_service(config cfg, gms::gossiper&,
-sharded<db::system_distributed_keyspace>&,
+generation_service(config cfg,
sharded<db::system_keyspace>& sys_ks,
abort_source&, const locator::shared_token_metadata&,
-gms::feature_service&, replica::database& db);
+replica::database& db);
future<> stop();
~generation_service();
/* After the node bootstraps and creates a new CDC generation, or restarts and loads the last
* known generation timestamp from persistent storage, this function should be called with
* that generation timestamp moved in as the `startup_gen_id` parameter.
* This passes the responsibility of managing generations from the node startup code to this service;
* until then, the service remains dormant.
* The startup code is in `storage_service::join_topology`, hence
* `after_join` should be called at the end of that function.
* Precondition: the node has completed bootstrapping and system_distributed_keyspace is initialized.
* Must be called on shard 0 - that's where the generation management happens.
*/
future<> after_join(std::optional<cdc::generation_id>&& startup_gen_id);
future<> leave_ring();
cdc::metadata& get_cdc_metadata() {
return _cdc_metadata;
}
-virtual future<> on_join(gms::inet_address, locator::host_id id, gms::endpoint_state_ptr, gms::permit_id) override;
-virtual future<> on_change(gms::inet_address, locator::host_id id, const gms::application_state_map&, gms::permit_id) override;
future<> check_and_repair_cdc_streams();
/* Generate a new set of CDC streams and insert it into the internal distributed CDC generations table.
* Returns the ID of this new generation.
*
* Should be called when starting the node for the first time (i.e., joining the ring).
*
* Assumes that the system_distributed_keyspace service is initialized.
* `cluster_supports_generations_v2` must be `true` if and only if the `CDC_GENERATIONS_V2` feature is enabled.
*
* If `CDC_GENERATIONS_V2` is enabled, the new generation will be inserted into
* `system_distributed_everywhere.cdc_generation_descriptions_v2` and the returned ID will be in the v2 format.
* Otherwise the new generation will be limited in size, causing suboptimal stream distribution, it will be inserted
* into `system_distributed.cdc_generation_descriptions` and the returned ID will be in the v1 format.
* The second case should happen only when we create new generations in a mixed cluster.
*
* The caller of this function is expected to insert the ID into the gossiper as fast as possible,
* so that other nodes learn about the generation before their clocks cross the generation's timestamp
* (not guaranteed in the current implementation, but expected to be the common case;
* we assume that `ring_delay` is enough for other nodes to learn about the new generation).
*
* Legacy: used for gossiper-based topology changes.
*/
future<cdc::generation_id> legacy_make_new_generation(
const std::unordered_set<dht::token>& bootstrap_tokens, bool add_delay);
/* Retrieve the CDC generation with the given ID from local tables
* and start using it for CDC log writes if it's not obsolete.
* Precondition: the generation was committed using group 0 and locally applied.
*/
-future<> handle_cdc_generation(cdc::generation_id_v2);
+future<> handle_cdc_generation(cdc::generation_id);
future<> load_cdc_tablet_streams(std::optional<std::unordered_set<table_id>> changed_tables);
@@ -151,56 +67,6 @@ public:
future<utils::chunked_vector<mutation>> garbage_collect_cdc_streams_for_table(table_id table, std::optional<std::chrono::seconds> ttl, api::timestamp_type ts);
future<> garbage_collect_cdc_streams(utils::chunked_vector<canonical_mutation>& muts, api::timestamp_type ts);
private:
/* Retrieve the CDC generation which starts at the given timestamp (from a distributed table created for this purpose)
* and start using it for CDC log writes if it's not obsolete.
*
* Legacy: used for gossiper-based topology changes.
*/
future<> legacy_handle_cdc_generation(std::optional<cdc::generation_id>);
/* If `legacy_handle_cdc_generation` fails, it schedules an asynchronous retry in the background
* using `legacy_async_handle_cdc_generation`.
*
* Legacy: used for gossiper-based topology changes.
*/
void legacy_async_handle_cdc_generation(cdc::generation_id);
/* Wrapper around `legacy_do_handle_cdc_generation` which intercepts timeout/unavailability exceptions.
* Returns: legacy_do_handle_cdc_generation(ts).
*
* Legacy: used for gossiper-based topology changes.
*/
future<bool> legacy_do_handle_cdc_generation_intercept_nonfatal_errors(cdc::generation_id);
/* Returns `true` iff we started using the generation (it was not obsolete or already known),
* which means that this node might write some CDC log entries using streams from this generation.
*
* Legacy: used for gossiper-based topology changes.
*/
future<bool> legacy_do_handle_cdc_generation(cdc::generation_id);
/* Scan CDC generation timestamps gossiped by other nodes and retrieve the latest one.
* This function should be called once at the end of the node startup procedure
* (after the node is started and running normally, it will retrieve generations on gossip events instead).
*
* Legacy: used for gossiper-based topology changes.
*/
future<> legacy_scan_cdc_generations();
/* generation_service code might be racing with system_distributed_keyspace deinitialization
* (the deinitialization order is broken).
* Therefore, whenever we want to access sys_dist_ks in a background task,
* we need to check if the instance is still there. Storing the shared pointer will keep it alive.
*/
shared_ptr<db::system_distributed_keyspace> get_sys_dist_ks();
/* Part of the upgrade procedure. Useful in case where the version of Scylla that we're upgrading from
* used the "cdc_streams_descriptions" table. This procedure ensures that the new "cdc_streams_descriptions_v2"
* table contains streams of all generations that were present in the old table and may still contain data
* (i.e. there exist CDC log tables that may contain rows with partition keys being the stream IDs from
* these generations). */
future<> maybe_rewrite_streams_descriptions();
};
} // namespace cdc

@@ -618,7 +618,7 @@ static void set_default_properties_log_table(schema_builder& b, const schema& s,
b.set_caching_options(caching_options::get_disabled_caching_options());
auto rs = generate_replication_strategy(ksm, db.get_token_metadata().get_topology());
-auto tombstone_gc_ext = seastar::make_shared<tombstone_gc_extension>(get_default_tombstone_gc_mode(*rs, db.get_token_metadata(), false));
+auto tombstone_gc_ext = seastar::make_shared<tombstone_gc_extension>(get_default_tombstone_gc_mode(*rs, false));
b.add_extension(tombstone_gc_extension::NAME, std::move(tombstone_gc_ext));
}

@@ -48,6 +48,7 @@
#include "mutation/mutation_fragment_stream_validator.hh"
#include "utils/assert.hh"
#include "utils/error_injection.hh"
#include "utils/chunked_vector.hh"
#include "utils/pretty_printers.hh"
#include "readers/multi_range.hh"
#include "readers/compacting.hh"
@@ -161,6 +162,7 @@ std::string_view to_string(compaction_type type) {
case compaction_type::Reshape: return "Reshape";
case compaction_type::Split: return "Split";
case compaction_type::Major: return "Major";
case compaction_type::RewriteComponent: return "RewriteComponent";
}
on_internal_error_noexcept(clogger, format("Invalid compaction type {}", int(type)));
return "(invalid)";
@@ -598,8 +600,7 @@ protected:
// Garbage collected sstables that were added to SSTable set and should be eventually removed from it.
std::vector<sstables::shared_sstable> _used_garbage_collected_sstables;
utils::observable<> _stop_request_observable;
-// optional tombstone_gc_state that is used when gc has to check only the compacting sstables to collect tombstones.
-std::optional<tombstone_gc_state> _tombstone_gc_state_with_commitlog_check_disabled;
+tombstone_gc_state _tombstone_gc_state;
int64_t _output_repaired_at = 0;
private:
// Keeps track of monitors for input sstable.
@@ -611,23 +612,23 @@ private:
}
// Called in a seastar thread
-dht::partition_range_vector
+utils::chunked_vector<dht::partition_range>
get_ranges_for_invalidation(const std::vector<sstables::shared_sstable>& sstables) {
// If owned ranges is disengaged, it means no cleanup work was done and
// so nothing needs to be invalidated.
if (!_owned_ranges) {
-return dht::partition_range_vector{};
+return {};
}
-auto owned_ranges = dht::to_partition_ranges(*_owned_ranges, utils::can_yield::yes);
+auto owned_ranges = dht::to_partition_ranges_chunked(*_owned_ranges).get();
auto non_owned_ranges = sstables
| std::views::transform([] (const sstables::shared_sstable& sst) {
seastar::thread::maybe_yield();
return dht::partition_range::make({sst->get_first_decorated_key(), true},
{sst->get_last_decorated_key(), true});
-}) | std::ranges::to<dht::partition_range_vector>();
+}) | std::ranges::to<utils::chunked_vector<dht::partition_range>>();
-return dht::subtract_ranges(*_schema, non_owned_ranges, std::move(owned_ranges)).get();
+return dht::subtract_ranges(*_schema, std::move(non_owned_ranges), std::move(owned_ranges)).get();
}
protected:
compaction(compaction_group_view& table_s, compaction_descriptor descriptor, compaction_data& cdata, compaction_progress_monitor& progress_monitor, use_backlog_tracker use_backlog_tracker)
@@ -649,9 +650,12 @@ protected:
, _owned_ranges(std::move(descriptor.owned_ranges))
, _sharder(descriptor.sharder)
, _owned_ranges_checker(_owned_ranges ? std::optional<dht::incremental_owned_ranges_checker>(*_owned_ranges) : std::nullopt)
-, _tombstone_gc_state_with_commitlog_check_disabled(descriptor.gc_check_only_compacting_sstables ? std::make_optional(_table_s.get_tombstone_gc_state().with_commitlog_check_disabled()) : std::nullopt)
+, _tombstone_gc_state(_table_s.get_tombstone_gc_state())
, _progress_monitor(progress_monitor)
{
if (descriptor.gc_check_only_compacting_sstables) {
_tombstone_gc_state = _tombstone_gc_state.with_commitlog_check_disabled();
}
std::unordered_set<sstables::run_id> ssts_run_ids;
_contains_multi_fragment_runs = std::any_of(_sstables.begin(), _sstables.end(), [&ssts_run_ids] (sstables::shared_sstable& sst) {
return !ssts_run_ids.insert(sst->run_identifier()).second;
@@ -718,8 +722,8 @@ protected:
compaction_completion_desc
get_compaction_completion_desc(std::vector<sstables::shared_sstable> input_sstables, std::vector<sstables::shared_sstable> output_sstables) {
-auto ranges_for_for_invalidation = get_ranges_for_invalidation(input_sstables);
-return compaction_completion_desc{std::move(input_sstables), std::move(output_sstables), std::move(ranges_for_for_invalidation)};
+auto ranges = get_ranges_for_invalidation(input_sstables);
+return compaction_completion_desc{std::move(input_sstables), std::move(output_sstables), std::move(ranges)};
}
// Tombstone expiration is enabled based on the presence of sstable set.
@@ -849,8 +853,8 @@ private:
return _table_s.get_compaction_strategy().make_sstable_set(_table_s);
}
-const tombstone_gc_state& get_tombstone_gc_state() const {
-return _tombstone_gc_state_with_commitlog_check_disabled ? _tombstone_gc_state_with_commitlog_check_disabled.value() : _table_s.get_tombstone_gc_state();
+tombstone_gc_state get_tombstone_gc_state() const {
+return _tombstone_gc_state;
}
future<> setup() {
@@ -1050,7 +1054,7 @@ private:
return can_never_purge;
}
return [this] (const dht::decorated_key& dk, is_shadowable is_shadowable) {
-return get_max_purgeable_timestamp(_table_s, *_selector, _compacting_for_max_purgeable_func, dk, _bloom_filter_checks, _compacting_max_timestamp, _tombstone_gc_state_with_commitlog_check_disabled.has_value(), is_shadowable);
+return get_max_purgeable_timestamp(_table_s, *_selector, _compacting_for_max_purgeable_func, dk, _bloom_filter_checks, _compacting_max_timestamp, !_tombstone_gc_state.is_commitlog_check_enabled(), is_shadowable);
};
}
@@ -2048,6 +2052,7 @@ compaction_type compaction_type_options::type() const {
compaction_type::Reshape,
compaction_type::Split,
compaction_type::Major,
compaction_type::RewriteComponent,
};
static_assert(std::variant_size_v<compaction_type_options::options_variant> == std::size(index_to_type));
return index_to_type[_options.index()];
@@ -2084,6 +2089,9 @@ static std::unique_ptr<compaction> make_compaction(compaction_group_view& table_
std::unique_ptr<compaction> operator()(compaction_type_options::split split_options) {
return std::make_unique<split_compaction>(table_s, std::move(descriptor), cdata, std::move(split_options), progress_monitor);
}
std::unique_ptr<compaction> operator()(compaction_type_options::component_rewrite) {
throw std::runtime_error("component_rewrite compaction should be handled separately");
}
} visitor_factory{table_s, std::move(descriptor), cdata, progress_monitor};
return descriptor.options.visit(visitor_factory);
@@ -2101,7 +2109,7 @@ static future<compaction_result> scrub_sstables_validate_mode(compaction_descrip
validation_errors += co_await sst->validate(permit, cdata.abort, [&schema] (sstring what) {
scrub_compaction::report_validation_error(compaction_type::Scrub, *schema, what);
-}, monitor_generator(sst));
+}, monitor_generator(sst), true);
// Did validation actually finish because aborted?
if (cdata.is_stop_requested()) {
// Compaction manager will catch this exception and re-schedule the compaction.
@@ -2138,6 +2146,34 @@ future<compaction_result> scrub_sstables_validate_mode(compaction_descriptor des
co_return res;
}
future<compaction_result> rewrite_sstables_component(compaction_descriptor descriptor, compaction_group_view& table_s) {
return seastar::async([descriptor = std::move(descriptor), &table_s] () mutable {
compaction_result result {
.stats = {
.started_at = db_clock::now(),
},
};
const auto& options = descriptor.options.as<compaction_type_options::component_rewrite>();
bool update_id = static_cast<bool>(options.update_id);
// When rewriting a component, we cannot use the standard descriptor creator
// because we must preserve the sstable version.
auto creator = [&table_s] (sstables::shared_sstable sst) {
return table_s.make_sstable(sst->state(), sst->get_version());
};
result.new_sstables.reserve(descriptor.sstables.size());
for (auto& sst : descriptor.sstables) {
auto rewritten = sst->link_with_rewritten_component(creator, options.component_to_rewrite, options.modifier, update_id).get();
result.new_sstables.push_back(rewritten);
}
descriptor.replacer({std::move(descriptor.sstables), result.new_sstables});
result.stats.ended_at = db_clock::now();
return result;
});
}
future<compaction_result>
compact_sstables(compaction_descriptor descriptor, compaction_data& cdata, compaction_group_view& table_s, compaction_progress_monitor& progress_monitor) {
if (descriptor.sstables.empty()) {
@@ -2149,6 +2185,9 @@ compact_sstables(compaction_descriptor descriptor, compaction_data& cdata, compa
// Bypass the usual compaction machinery for dry-mode scrub
return scrub_sstables_validate_mode(std::move(descriptor), cdata, table_s, progress_monitor);
}
if (descriptor.options.type() == compaction_type::RewriteComponent) {
return rewrite_sstables_component(std::move(descriptor), table_s);
}
return compaction::run(make_compaction(table_s, std::move(descriptor), cdata, progress_monitor));
}

@@ -12,10 +12,12 @@
#include <functional>
#include <optional>
#include <variant>
#include "sstables/component_type.hh"
#include "sstables/types_fwd.hh"
#include "sstables/sstable_set.hh"
#include "compaction_fwd.hh"
#include "mutation_writer/token_group_based_splitting_writer.hh"
#include "utils/chunked_vector.hh"
namespace compaction {
@@ -30,6 +32,7 @@ enum class compaction_type {
Reshape = 7,
Split = 8,
Major = 9,
RewriteComponent = 10,
};
struct compaction_completion_desc {
@@ -38,7 +41,7 @@ struct compaction_completion_desc {
// New, fresh SSTables that should be added to SSTable set, replacing the old ones.
std::vector<sstables::shared_sstable> new_sstables;
// Set of compacted partition ranges that should be invalidated in the cache.
dht::partition_range_vector ranges_for_cache_invalidation;
utils::chunked_vector<dht::partition_range> ranges_for_cache_invalidation;
};
// creates a new SSTable for a given shard
@@ -90,8 +93,15 @@ public:
struct split {
mutation_writer::classify_by_token_group classifier;
};
struct component_rewrite {
sstables::component_type component_to_rewrite;
std::function<void(sstables::sstable&)> modifier;
using update_sstable_id = bool_class<class update_sstable_id_tag>;
update_sstable_id update_id = update_sstable_id::yes;
};
private:
using options_variant = std::variant<regular, cleanup, upgrade, scrub, reshard, reshape, split, major>;
using options_variant = std::variant<regular, cleanup, upgrade, scrub, reshard, reshape, split, major, component_rewrite>;
private:
options_variant _options;
@@ -129,6 +139,10 @@ public:
return compaction_type_options(scrub{.operation_mode = mode, .quarantine_sstables = quarantine_sstables, .drop_unfixable = drop_unfixable_sstables});
}
static compaction_type_options make_component_rewrite(component_type component, std::function<void(sstables::sstable&)> modifier, component_rewrite::update_sstable_id update_id = component_rewrite::update_sstable_id::yes) {
return compaction_type_options(component_rewrite{.component_to_rewrite = component, .modifier = std::move(modifier), .update_id = update_id});
}
static compaction_type_options make_split(mutation_writer::classify_by_token_group classifier) {
return compaction_type_options(split{std::move(classifier)});
}

View File

@@ -46,6 +46,7 @@ public:
virtual reader_permit make_compaction_reader_permit() const = 0;
virtual sstables::sstables_manager& get_sstables_manager() noexcept = 0;
virtual sstables::shared_sstable make_sstable(sstables::sstable_state) const = 0;
virtual sstables::shared_sstable make_sstable(sstables::sstable_state, sstables::sstable_version_types) const = 0;
virtual sstables::sstable_writer_config configure_writer(sstring origin) const = 0;
virtual api::timestamp_type min_memtable_timestamp() const = 0;
virtual api::timestamp_type min_memtable_live_timestamp() const = 0;
@@ -54,7 +55,7 @@ public:
virtual future<> on_compaction_completion(compaction_completion_desc desc, sstables::offstrategy offstrategy) = 0;
virtual bool is_auto_compaction_disabled_by_user() const noexcept = 0;
virtual bool tombstone_gc_enabled() const noexcept = 0;
virtual const tombstone_gc_state& get_tombstone_gc_state() const noexcept = 0;
virtual tombstone_gc_state get_tombstone_gc_state() const noexcept = 0;
virtual compaction_backlog_tracker& get_backlog_tracker() = 0;
virtual const std::string get_group_id() const noexcept = 0;
virtual seastar::condition_variable& get_staging_done_condition() noexcept = 0;

View File

@@ -778,6 +778,7 @@ compaction_manager::get_incremental_repair_read_lock(compaction::compaction_grou
cmlog.debug("Get get_incremental_repair_read_lock for {} started", reason);
}
compaction::compaction_state& cs = get_compaction_state(&t);
auto gh = cs.gate.hold();
auto ret = co_await cs.incremental_repair_lock.hold_read_lock();
if (!reason.empty()) {
cmlog.debug("Get get_incremental_repair_read_lock for {} done", reason);
@@ -791,6 +792,7 @@ compaction_manager::get_incremental_repair_write_lock(compaction::compaction_gro
cmlog.debug("Get get_incremental_repair_write_lock for {} started", reason);
}
compaction::compaction_state& cs = get_compaction_state(&t);
auto gh = cs.gate.hold();
auto ret = co_await cs.incremental_repair_lock.hold_write_lock();
if (!reason.empty()) {
cmlog.debug("Get get_incremental_repair_write_lock for {} done", reason);
@@ -1040,7 +1042,7 @@ compaction_manager::compaction_manager(config cfg, abort_source& as, tasks::task
_compaction_controller.set_max_shares(max_shares);
}))
, _strategy_control(std::make_unique<strategy_control>(*this))
, _tombstone_gc_state(_shared_tombstone_gc_state) {
{
tm.register_module(_task_manager_module->get_name(), _task_manager_module);
register_metrics();
// Bandwidth throttling is node-wide, updater is needed on single shard
@@ -1064,7 +1066,7 @@ compaction_manager::compaction_manager(tasks::task_manager& tm)
, _compaction_static_shares_observer(_cfg.static_shares.observe(_update_compaction_static_shares_action.make_observer()))
, _compaction_max_shares_observer(_cfg.max_shares.observe([] (const float& max_shares) {}))
, _strategy_control(std::make_unique<strategy_control>(*this))
, _tombstone_gc_state(_shared_tombstone_gc_state) {
{
tm.register_module(_task_manager_module->get_name(), _task_manager_module);
// No metric registration because this constructor is supposed to be used only by the testing
// infrastructure.
@@ -1786,6 +1788,41 @@ protected:
}
};
class rewrite_sstables_component_compaction_task_executor final : public rewrite_sstables_compaction_task_executor {
std::unordered_map<sstables::shared_sstable, sstables::shared_sstable>& _rewritten_sstables;
public:
rewrite_sstables_component_compaction_task_executor(compaction_manager& mgr,
throw_if_stopping do_throw_if_stopping,
compaction_group_view* t,
tasks::task_id parent_id,
compaction_type_options options,
std::vector<sstables::shared_sstable> sstables,
compacting_sstable_registration compacting,
std::unordered_map<sstables::shared_sstable, sstables::shared_sstable>& rewritten_sstables)
: rewrite_sstables_compaction_task_executor(mgr, do_throw_if_stopping, t, parent_id, options, {},
std::move(sstables), std::move(compacting), compaction_manager::can_purge_tombstones::no, "component_rewrite"),
_rewritten_sstables(rewritten_sstables)
{}
protected:
virtual future<compaction_manager::compaction_stats_opt> do_run() override {
compaction_stats stats{};
switch_state(state::pending);
auto maintenance_permit = co_await acquire_semaphore(_cm._maintenance_ops_sem);
while (!_sstables.empty()) {
auto sst = consume_sstable();
auto it = _rewritten_sstables.emplace(sst, sstables::shared_sstable{}).first;
auto res = co_await rewrite_sstable(std::move(sst));
_cm._validation_errors += res.stats.validation_errors;
stats += res.stats;
it->second = std::move(res.new_sstables.front());
}
co_return stats;
}
};
class split_compaction_task_executor final : public rewrite_sstables_compaction_task_executor {
compaction_type_options::split _opt;
public:
@@ -1899,6 +1936,28 @@ compaction_manager::rewrite_sstables(compaction_group_view& t, compaction_type_o
return perform_task_on_all_files<rewrite_sstables_compaction_task_executor>("rewrite", info, t, std::move(options), std::move(owned_ranges_ptr), std::move(get_func), throw_if_stopping::no, can_purge, std::move(options_desc));
}
future<compaction_manager::compaction_stats_opt>
compaction_manager::rewrite_sstables_component(compaction_group_view& t,
std::vector<sstables::shared_sstable>& sstables,
compaction_type_options options,
std::unordered_map<sstables::shared_sstable, sstables::shared_sstable>& rewritten_sstables,
tasks::task_info info) {
auto gh = start_compaction(t);
if (!gh) {
co_return std::nullopt;
}
if (sstables.empty()) {
co_return std::nullopt;
}
compacting_sstable_registration compacting(*this, get_compaction_state(&t));
compacting.register_compacting(sstables);
co_return co_await perform_compaction<rewrite_sstables_component_compaction_task_executor>(throw_if_stopping::no, info, &t, info.id,
std::move(options), std::move(sstables), std::move(compacting), rewritten_sstables);
}
class validate_sstables_compaction_task_executor : public sstables_task_executor {
compaction_manager::quarantine_invalid_sstables _quarantine_sstables;
public:
@@ -2323,6 +2382,18 @@ compaction_manager::maybe_split_new_sstable(sstables::shared_sstable sst, compac
co_return ret;
}
future<std::unordered_map<sstables::shared_sstable, sstables::shared_sstable>> compaction_manager::perform_component_rewrite(compaction::compaction_group_view& t,
tasks::task_info info,
std::vector<sstables::shared_sstable> sstables,
sstables::component_type component,
std::function<void(sstables::sstable&)> modifier,
compaction_type_options::component_rewrite::update_sstable_id update_id) {
std::unordered_map<sstables::shared_sstable, sstables::shared_sstable> rewritten_sstables;
rewritten_sstables.reserve(sstables.size());
co_await rewrite_sstables_component(t, sstables, compaction_type_options::make_component_rewrite(component, std::move(modifier), update_id), rewritten_sstables, info);
co_return rewritten_sstables;
}
// Submit a table to be scrubbed and wait for its termination.
future<compaction_manager::compaction_stats_opt> compaction_manager::perform_sstable_scrub(compaction_group_view& t, compaction_type_options::scrub opts, tasks::task_info info) {
auto scrub_mode = opts.operation_mode;
@@ -2387,6 +2458,8 @@ future<> compaction_manager::remove(compaction_group_view& t, sstring reason) no
if (!c_state.gate.is_closed()) {
auto close_gate = c_state.gate.close();
co_await stop_ongoing_compactions(reason, &t);
        // Wait for users of the incremental repair lock (either repair itself or maintenance compactions).
co_await c_state.incremental_repair_lock.write_lock();
co_await std::move(close_gate);
}

View File

@@ -55,6 +55,7 @@ class custom_compaction_task_executor;
class regular_compaction_task_executor;
class offstrategy_compaction_task_executor;
class rewrite_sstables_compaction_task_executor;
class rewrite_sstables_component_compaction_task_executor;
class split_compaction_task_executor;
class cleanup_sstables_compaction_task_executor;
class validate_sstables_compaction_task_executor;
@@ -167,10 +168,6 @@ private:
std::unique_ptr<strategy_control> _strategy_control;
shared_tombstone_gc_state _shared_tombstone_gc_state;
// TODO: tombstone_gc_state should now have value semantics, but the code
// still uses it with reference semantics (inconsistently though).
// Drop this member, once the code is converted into using value semantics.
tombstone_gc_state _tombstone_gc_state;
utils::disk_space_monitor::subscription _out_of_space_subscription;
private:
@@ -256,6 +253,12 @@ private:
future<compaction_stats_opt> rewrite_sstables(compaction::compaction_group_view& t, compaction_type_options options, owned_ranges_ptr, get_candidates_func, tasks::task_info info,
can_purge_tombstones can_purge = can_purge_tombstones::yes, sstring options_desc = "");
future<compaction_stats_opt> rewrite_sstables_component(compaction_group_view& t,
std::vector<sstables::shared_sstable>& sstables,
compaction_type_options options,
std::unordered_map<sstables::shared_sstable, sstables::shared_sstable>& rewritten_sstables,
tasks::task_info info);
// Stop all fibers, without waiting. Safe to be called multiple times.
void do_stop() noexcept;
future<> really_do_stop() noexcept;
@@ -364,6 +367,13 @@ public:
// Submit a table to be scrubbed and wait for its termination.
future<compaction_stats_opt> perform_sstable_scrub(compaction::compaction_group_view& t, compaction_type_options::scrub opts, tasks::task_info info);
future<std::unordered_map<sstables::shared_sstable, sstables::shared_sstable>> perform_component_rewrite(compaction::compaction_group_view& t,
tasks::task_info info,
std::vector<sstables::shared_sstable> sstables,
sstables::component_type component,
std::function<void(sstables::sstable&)> modifier,
compaction_type_options::component_rewrite::update_sstable_id update_id = compaction_type_options::component_rewrite::update_sstable_id::yes);
// Submit a table for major compaction.
future<> perform_major_compaction(compaction::compaction_group_view& t, tasks::task_info info, bool consider_only_existing_data = false);
@@ -456,10 +466,6 @@ public:
compaction::strategy_control& get_strategy_control() const noexcept;
const tombstone_gc_state& get_tombstone_gc_state() const noexcept {
return _tombstone_gc_state;
};
shared_tombstone_gc_state& get_shared_tombstone_gc_state() noexcept {
return _shared_tombstone_gc_state;
};
@@ -489,6 +495,7 @@ public:
friend class compaction::regular_compaction_task_executor;
friend class compaction::offstrategy_compaction_task_executor;
friend class compaction::rewrite_sstables_compaction_task_executor;
friend class compaction::rewrite_sstables_component_compaction_task_executor;
friend class compaction::cleanup_sstables_compaction_task_executor;
friend class compaction::validate_sstables_compaction_task_executor;
friend compaction_reenabler;

View File

@@ -639,7 +639,7 @@ strict_is_not_null_in_views: true
# * workdir: the node will open the maintenance socket on the path <scylla's workdir>/cql.m,
# where <scylla's workdir> is a path defined by the workdir configuration option,
# * <socket path>: the node will open the maintenance socket on the path <socket path>.
maintenance_socket: ignore
maintenance_socket: workdir
# If set to true, configuration parameters defined with LiveUpdate option can be updated in runtime with CQL
# by updating system.config virtual table. If we don't want any configuration parameter to be changed in runtime
@@ -648,10 +648,9 @@ maintenance_socket: ignore
# e.g. for cloud users, for whom scylla's configuration should be changed only by support engineers.
# live_updatable_config_params_changeable_via_cql: true
# ****************
# * GUARDRAILS *
# ****************
#
# Guardrails options
#
# Guardrails to warn or fail when Replication Factor is smaller/greater than the threshold.
# Please note that the value of 0 is always allowed,
# which means that having no replication at all, i.e. RF = 0, is always valid.
@@ -661,6 +660,27 @@ maintenance_socket: ignore
# minimum_replication_factor_warn_threshold: 3
# maximum_replication_factor_warn_threshold: -1
# maximum_replication_factor_fail_threshold: -1
#
# Guardrails to warn about or disallow creating a keyspace with a specific replication strategy.
# Each of these two settings is a list of replication strategies considered harmful.
# The replication strategies to choose from are:
# 1) SimpleStrategy,
# 2) NetworkTopologyStrategy,
# 3) LocalStrategy,
# 4) EverywhereStrategy
#
# replication_strategy_warn_list:
# - SimpleStrategy
# replication_strategy_fail_list:
#
# Guardrail to enable the deprecated feature of CREATE TABLE WITH COMPACT STORAGE.
# enable_create_table_with_compact_storage: false
#
# Guardrails to limit usage of selected consistency levels for writes.
# Adding a warning to a CQL query response can significantly increase network
# traffic and decrease overall throughput.
# write_consistency_levels_warned: []
# write_consistency_levels_disallowed: []
#
# System information encryption settings
@@ -838,21 +858,6 @@ maintenance_socket: ignore
# key_namespace: <kmip key namespace> (optional)
#
# Guardrails to warn about or disallow creating a keyspace with specific replication strategy.
# Each of these 2 settings is a list storing replication strategies considered harmful.
# The replication strategies to choose from are:
# 1) SimpleStrategy,
# 2) NetworkTopologyStrategy,
# 3) LocalStrategy,
# 4) EverywhereStrategy
#
# replication_strategy_warn_list:
# - SimpleStrategy
# replication_strategy_fail_list:
# Guardrail to enable the deprecated feature of CREATE TABLE WITH COMPACT STORAGE.
# enable_create_table_with_compact_storage: false
# Control tablets for new keyspaces.
# Can be set to: disabled|enabled|enforced
#

View File

@@ -544,7 +544,6 @@ scylla_tests = set([
'test/boost/caching_options_test',
'test/boost/canonical_mutation_test',
'test/boost/cartesian_product_test',
'test/boost/cdc_generation_test',
'test/boost/cell_locker_test',
'test/boost/checksum_utils_test',
'test/boost/chunked_managed_vector_test',
@@ -619,6 +618,7 @@ scylla_tests = set([
'test/boost/reservoir_sampling_test',
'test/boost/result_utils_test',
'test/boost/rest_client_test',
'test/boost/rolling_max_tracker_test',
'test/boost/reusable_buffer_test',
'test/boost/rust_test',
'test/boost/s3_test',
@@ -1204,6 +1204,7 @@ scylla_core = (['message/messaging_service.cc',
'gms/application_state.cc',
'gms/inet_address.cc',
'dht/i_partitioner.cc',
'dht/fixed_shard.cc',
'dht/token.cc',
'dht/murmur3_partitioner.cc',
'dht/boot_strapper.cc',
@@ -1239,7 +1240,6 @@ scylla_core = (['message/messaging_service.cc',
'service/pager/query_pagers.cc',
'service/qos/qos_common.cc',
'service/qos/service_level_controller.cc',
'service/qos/standard_service_level_distributed_data_accessor.cc',
'service/qos/raft_service_level_distributed_data_accessor.cc',
'streaming/stream_task.cc',
'streaming/stream_session.cc',
@@ -1273,8 +1273,8 @@ scylla_core = (['message/messaging_service.cc',
'auth/common.cc',
'auth/default_authorizer.cc',
'auth/resource.cc',
'auth/roles-metadata.cc',
'auth/passwords.cc',
'auth/maintenance_socket_authenticator.cc',
'auth/password_authenticator.cc',
'auth/permission.cc',
'auth/service.cc',
@@ -1340,6 +1340,7 @@ scylla_core = (['message/messaging_service.cc',
'service/strong_consistency/groups_manager.cc',
'service/strong_consistency/coordinator.cc',
'service/strong_consistency/state_machine.cc',
'service/strong_consistency/raft_groups_storage.cc',
'service/raft/group0_state_id_handler.cc',
'service/raft/group0_state_machine.cc',
'service/raft/group0_state_machine_merger.cc',
@@ -1473,6 +1474,7 @@ idls = ['idl/gossip_digest.idl.hh',
'idl/messaging_service.idl.hh',
'idl/paxos.idl.hh',
'idl/raft.idl.hh',
'idl/raft_util.idl.hh',
'idl/raft_storage.idl.hh',
'idl/group0.idl.hh',
'idl/hinted_handoff.idl.hh',
@@ -1492,7 +1494,9 @@ idls = ['idl/gossip_digest.idl.hh',
'idl/gossip.idl.hh',
'idl/migration_manager.idl.hh',
"idl/node_ops.idl.hh",
"idl/tasks.idl.hh"
"idl/tasks.idl.hh",
"idl/client_state.idl.hh",
"idl/forward_cql.idl.hh",
]
scylla_tests_generic_dependencies = [
@@ -1585,6 +1589,7 @@ pure_boost_tests = set([
'test/boost/wrapping_interval_test',
'test/boost/range_tombstone_list_test',
'test/boost/reservoir_sampling_test',
'test/boost/rolling_max_tracker_test',
'test/boost/serialization_test',
'test/boost/small_vector_test',
'test/boost/top_k_test',
@@ -1733,6 +1738,7 @@ deps['test/boost/url_parse_test'] = ['utils/http.cc', 'test/boost/url_parse_test
deps['test/boost/murmur_hash_test'] = ['bytes.cc', 'utils/murmur_hash.cc', 'test/boost/murmur_hash_test.cc']
deps['test/boost/allocation_strategy_test'] = ['test/boost/allocation_strategy_test.cc', 'utils/logalloc.cc', 'utils/dynamic_bitset.cc', 'utils/labels.cc']
deps['test/boost/log_heap_test'] = ['test/boost/log_heap_test.cc']
deps['test/boost/rolling_max_tracker_test'] = ['test/boost/rolling_max_tracker_test.cc']
deps['test/boost/estimated_histogram_test'] = ['test/boost/estimated_histogram_test.cc']
deps['test/boost/summary_test'] = ['test/boost/summary_test.cc']
deps['test/boost/anchorless_list_test'] = ['test/boost/anchorless_list_test.cc']

View File

@@ -23,7 +23,7 @@ column_specification::column_specification(std::string_view ks_name_, std::strin
bool column_specification::all_in_same_table(const std::vector<lw_shared_ptr<column_specification>>& names)
{
SCYLLA_ASSERT(!names.empty());
throwing_assert(!names.empty());
auto first = names.front();
return std::all_of(std::next(names.begin()), names.end(), [first] (auto&& spec) {

View File

@@ -49,9 +49,9 @@ static cql3_type::kind get_cql3_kind(const abstract_type& t) {
cql3_type::kind operator()(const uuid_type_impl&) { return cql3_type::kind::UUID; }
cql3_type::kind operator()(const varint_type_impl&) { return cql3_type::kind::VARINT; }
cql3_type::kind operator()(const reversed_type_impl& r) { return get_cql3_kind(*r.underlying_type()); }
cql3_type::kind operator()(const tuple_type_impl&) { SCYLLA_ASSERT(0 && "no kind for this type"); }
cql3_type::kind operator()(const vector_type_impl&) { SCYLLA_ASSERT(0 && "no kind for this type"); }
cql3_type::kind operator()(const collection_type_impl&) { SCYLLA_ASSERT(0 && "no kind for this type"); }
cql3_type::kind operator()(const tuple_type_impl&) { throwing_assert(0 && "no kind for this type"); }
cql3_type::kind operator()(const vector_type_impl&) { throwing_assert(0 && "no kind for this type"); }
cql3_type::kind operator()(const collection_type_impl&) { throwing_assert(0 && "no kind for this type"); }
};
return visit(t, visitor{});
}
@@ -124,7 +124,7 @@ class cql3_type::raw_collection : public raw {
} else if (_kind == abstract_type::kind::map) {
return format("{}map<{}, {}>{}", start, _keys, _values, end);
}
abort();
throwing_assert(0 && "invalid raw_collection kind");
}
public:
raw_collection(const abstract_type::kind kind, shared_ptr<raw> keys, shared_ptr<raw> values)
@@ -150,7 +150,7 @@ public:
}
virtual cql3_type prepare_internal(const sstring& keyspace, const data_dictionary::user_types_metadata& user_types) override {
SCYLLA_ASSERT(_values); // "Got null values type for a collection";
throwing_assert(_values); // "Got null values type for a collection";
if (_values->is_counter()) {
throw exceptions::invalid_request_exception(format("Counters are not allowed inside collections: {}", *this));
@@ -190,7 +190,7 @@ private:
}
return cql3_type(set_type_impl::get_instance(_values->prepare_internal(keyspace, user_types).get_type(), !is_frozen()));
} else if (_kind == abstract_type::kind::map) {
SCYLLA_ASSERT(_keys); // "Got null keys type for a collection";
throwing_assert(_keys); // "Got null keys type for a collection";
if (_keys->is_duration()) {
throw exceptions::invalid_request_exception(format("Durations are not allowed as map keys: {}", *this));
}
@@ -198,7 +198,7 @@ private:
_values->prepare_internal(keyspace, user_types).get_type(),
!is_frozen()));
}
abort();
throwing_assert(0 && "do_prepare invalid kind");
}
};

View File

@@ -1603,7 +1603,7 @@ static cql3::raw_value do_evaluate(const collection_constructor& collection, con
case collection_constructor::style_type::vector:
return evaluate_vector(collection, inputs);
}
std::abort();
throwing_assert(0 && "do_evaluate invalid style");
}
static cql3::raw_value do_evaluate(const usertype_constructor& user_val, const evaluation_inputs& inputs) {

View File

@@ -876,7 +876,7 @@ cast_test_assignment(const cast& c, data_dictionary::database db, const sstring&
return assignment_testable::test_result::NOT_ASSIGNABLE;
}
} catch (exceptions::invalid_request_exception& e) {
abort();
throwing_assert(0 && "cast_test_assignment exception");
}
}

View File

@@ -544,7 +544,7 @@ functions::get_user_aggregates(const sstring& keyspace) const {
std::ranges::subrange<functions::declared_t::const_iterator>
functions::find(const function_name& name) const {
SCYLLA_ASSERT(name.has_keyspace()); // : "function name not fully qualified";
throwing_assert(name.has_keyspace()); // : "function name not fully qualified";
auto pair = _declared.equal_range(name);
return std::ranges::subrange(pair.first, pair.second);
}

View File

@@ -10,8 +10,9 @@
#include "types/types.hh"
#include "types/vector.hh"
#include "exceptions/exceptions.hh"
#include <span>
#include <bit>
#include <span>
#include <seastar/core/byteorder.hh>
namespace cql3 {
namespace functions {
@@ -30,14 +31,10 @@ std::vector<float> extract_float_vector(const bytes_opt& param, vector_dimension
expected_size, dimension, param->size()));
}
std::vector<float> result;
result.reserve(dimension);
bytes_view view(*param);
std::vector<float> result(dimension);
const char* p = reinterpret_cast<const char*>(param->data());
for (size_t i = 0; i < dimension; ++i) {
// read_simple handles network byte order (big-endian) conversion
uint32_t raw = read_simple<uint32_t>(view);
result.push_back(std::bit_cast<float>(raw));
result[i] = std::bit_cast<float>(consume_be<uint32_t>(p));
}
return result;
@@ -55,13 +52,14 @@ namespace {
// You should only use this function if you need to preserve the original vectors and cannot normalize
// them in advance.
float compute_cosine_similarity(std::span<const float> v1, std::span<const float> v2) {
double dot_product = 0.0;
double squared_norm_a = 0.0;
double squared_norm_b = 0.0;
#pragma clang fp contract(fast) reassociate(on) // Allow the compiler to optimize the loop.
float dot_product = 0.0;
float squared_norm_a = 0.0;
float squared_norm_b = 0.0;
for (size_t i = 0; i < v1.size(); ++i) {
double a = v1[i];
double b = v2[i];
float a = v1[i];
float b = v2[i];
dot_product += a * b;
squared_norm_a += a * a;
@@ -79,13 +77,14 @@ float compute_cosine_similarity(std::span<const float> v1, std::span<const float
}
float compute_euclidean_similarity(std::span<const float> v1, std::span<const float> v2) {
double sum = 0.0;
#pragma clang fp contract(fast) reassociate(on) // Allow the compiler to optimize the loop.
float sum = 0.0;
for (size_t i = 0; i < v1.size(); ++i) {
double a = v1[i];
double b = v2[i];
float a = v1[i];
float b = v2[i];
double diff = a - b;
float diff = a - b;
sum += diff * diff;
}
@@ -98,11 +97,12 @@ float compute_euclidean_similarity(std::span<const float> v1, std::span<const fl
// Assumes that both vectors are L2-normalized.
// This similarity is intended as an optimized way to perform cosine similarity calculation.
float compute_dot_product_similarity(std::span<const float> v1, std::span<const float> v2) {
double dot_product = 0.0;
#pragma clang fp contract(fast) reassociate(on) // Allow the compiler to optimize the loop.
float dot_product = 0.0;
for (size_t i = 0; i < v1.size(); ++i) {
double a = v1[i];
double b = v2[i];
float a = v1[i];
float b = v2[i];
dot_product += a * b;
}

View File

@@ -25,7 +25,7 @@ bool keyspace_element_name::has_keyspace() const
const sstring& keyspace_element_name::get_keyspace() const
{
SCYLLA_ASSERT(_ks_name);
throwing_assert(_ks_name);
return *_ks_name;
}

View File

@@ -62,7 +62,7 @@ lists::setter_by_index::fill_prepare_context(prepare_context& ctx) {
void
lists::setter_by_index::execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) {
// we should not get here for frozen lists
SCYLLA_ASSERT(column.type->is_multi_cell()); // "Attempted to set an individual element on a frozen list";
throwing_assert(column.type->is_multi_cell()); // "Attempted to set an individual element on a frozen list";
auto index = expr::evaluate(_idx, params._options);
if (index.is_null()) {
@@ -105,7 +105,7 @@ lists::setter_by_uuid::requires_read() const {
void
lists::setter_by_uuid::execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) {
// we should not get here for frozen lists
SCYLLA_ASSERT(column.type->is_multi_cell()); // "Attempted to set an individual element on a frozen list";
throwing_assert(column.type->is_multi_cell()); // "Attempted to set an individual element on a frozen list";
auto index = expr::evaluate(_idx, params._options);
auto value = expr::evaluate(*_e, params._options);
@@ -133,7 +133,7 @@ lists::setter_by_uuid::execute(mutation& m, const clustering_key_prefix& prefix,
void
lists::appender::execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) {
const cql3::raw_value value = expr::evaluate(*_e, params._options);
SCYLLA_ASSERT(column.type->is_multi_cell()); // "Attempted to append to a frozen list";
throwing_assert(column.type->is_multi_cell()); // "Attempted to append to a frozen list";
do_append(value, m, prefix, column, params);
}
@@ -189,7 +189,7 @@ lists::do_append(const cql3::raw_value& list_value,
void
lists::prepender::execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) {
SCYLLA_ASSERT(column.type->is_multi_cell()); // "Attempted to prepend to a frozen list";
throwing_assert(column.type->is_multi_cell()); // "Attempted to prepend to a frozen list";
cql3::raw_value lvalue = expr::evaluate(*_e, params._options);
if (lvalue.is_null()) {
return;
@@ -244,7 +244,7 @@ lists::discarder::requires_read() const {
void
lists::discarder::execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) {
SCYLLA_ASSERT(column.type->is_multi_cell()); // "Attempted to delete from a frozen list";
throwing_assert(column.type->is_multi_cell()); // "Attempted to delete from a frozen list";
auto&& existing_list = params.get_prefetched_list(m.key(), prefix, column);
// We want to call bind before possibly returning to reject queries where the value provided is not a list.
@@ -300,7 +300,7 @@ lists::discarder_by_index::requires_read() const {
void
lists::discarder_by_index::execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) {
SCYLLA_ASSERT(column.type->is_multi_cell()); // "Attempted to delete an item by index from a frozen list";
throwing_assert(column.type->is_multi_cell()); // "Attempted to delete an item by index from a frozen list";
cql3::raw_value index = expr::evaluate(*_e, params._options);
if (index.is_null()) {
throw exceptions::invalid_request_exception("Invalid null value for list index");

View File

@@ -45,7 +45,7 @@ maps::setter_by_key::fill_prepare_context(prepare_context& ctx) {
void
maps::setter_by_key::execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) {
using exceptions::invalid_request_exception;
SCYLLA_ASSERT(column.type->is_multi_cell()); // "Attempted to set a value for a single key on a frozen map";
throwing_assert(column.type->is_multi_cell()); // "Attempted to set a value for a single key on a frozen map";
auto key = expr::evaluate(_k, params._options);
auto value = expr::evaluate(*_e, params._options);
if (key.is_null()) {
@@ -63,7 +63,7 @@ maps::setter_by_key::execute(mutation& m, const clustering_key_prefix& prefix, c
void
maps::putter::execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) {
SCYLLA_ASSERT(column.type->is_multi_cell()); // "Attempted to add items to a frozen map";
throwing_assert(column.type->is_multi_cell()); // "Attempted to add items to a frozen map";
cql3::raw_value value = expr::evaluate(*_e, params._options);
do_put(m, prefix, params, value, column);
}
@@ -96,7 +96,7 @@ maps::do_put(mutation& m, const clustering_key_prefix& prefix, const update_para
void
maps::discarder_by_key::execute(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params) {
SCYLLA_ASSERT(column.type->is_multi_cell()); // "Attempted to delete a single key in a frozen map";
throwing_assert(column.type->is_multi_cell()); // "Attempted to delete a single key in a frozen map";
cql3::raw_value key = expr::evaluate(*_e, params._options);
if (key.is_null()) {
throw exceptions::invalid_request_exception("Invalid null map key");

View File

@@ -67,7 +67,7 @@ operation::set_element::prepare(data_dictionary::database db, const sstring& key
verify_no_aggregate_functions(mval, "SET clause");
return make_shared<maps::setter_by_key>(receiver, std::move(key), std::move(mval));
}
abort();
throwing_assert(0 && "prepare set_element collection type");
}
bool
@@ -166,7 +166,7 @@ operation::addition::prepare(data_dictionary::database db, const sstring& keyspa
} else if (ctype->get_kind() == abstract_type::kind::map) {
return make_shared<maps::putter>(receiver, std::move(v));
} else {
abort();
throwing_assert(0 && "prepare addition collection type");
}
}
@@ -216,7 +216,7 @@ operation::subtraction::prepare(data_dictionary::database db, const sstring& key
verify_no_aggregate_functions(v, "SET clause");
return ::make_shared<sets::discarder>(receiver, std::move(v));
}
abort();
throwing_assert(0 && "prepare subtraction collection type");
}
bool
@@ -267,7 +267,7 @@ operation::set_value::prepare(data_dictionary::database db, const sstring& keysp
} else if (k == abstract_type::kind::map) {
return make_shared<maps::setter>(receiver, std::move(v));
} else {
abort();
throwing_assert(0 && "prepare set_value collection type");
}
}
@@ -385,7 +385,7 @@ operation::element_deletion::prepare(data_dictionary::database db, const sstring
verify_no_aggregate_functions(key, "SET clause");
return make_shared<maps::discarder_by_key>(receiver, std::move(key));
}
abort();
throwing_assert(0 && "prepare element_deletion collection type");
}
expr::expression

View File

@@ -105,6 +105,7 @@ public:
static const std::chrono::minutes entry_expiry;
using key_type = prepared_cache_key_type;
using pinned_value_type = cache_value_ptr;
using value_type = checked_weak_ptr;
using statement_is_too_big = typename cache_type::entry_is_too_big;
@@ -116,9 +117,14 @@ public:
: _cache(size, entry_expiry, logger)
{}
template <typename LoadFunc>
future<pinned_value_type> get_pinned(const key_type& key, LoadFunc&& load) {
return _cache.get_ptr(key.key(), [load = std::forward<LoadFunc>(load)] (const cache_key_type&) { return load(); });
}
template <typename LoadFunc>
future<value_type> get(const key_type& key, LoadFunc&& load) {
return _cache.get_ptr(key.key(), [load = std::forward<LoadFunc>(load)] (const cache_key_type&) { return load(); }).then([] (cache_value_ptr v_ptr) {
return get_pinned(key, std::forward<LoadFunc>(load)).then([] (cache_value_ptr v_ptr) {
return make_ready_future<value_type>((*v_ptr)->checked_weak_from_this());
});
}

View File
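The cache change above layers `get()` on top of the new `get_pinned()`: the pinned variant hands back the strongly-held cache entry, and `get()` derives a checked weak pointer from it, removing the duplicated `_cache.get_ptr` call. A seastar-free sketch of that layering, using `std::shared_ptr`/`std::weak_ptr` as stand-ins for the pinned and weak value types (the names and types here are illustrative, not Scylla's API):

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>

// Stand-ins: pinned value = shared_ptr (keeps the entry alive while held),
// plain value = weak_ptr (may expire once the cache evicts the entry).
struct statement_cache {
    std::map<std::string, std::shared_ptr<std::string>> entries;

    std::shared_ptr<std::string> get_pinned(const std::string& key,
                                            const std::function<std::string()>& load) {
        auto it = entries.find(key);
        if (it == entries.end()) {
            it = entries.emplace(key, std::make_shared<std::string>(load())).first;
        }
        return it->second;
    }

    // get() is expressed in terms of get_pinned(), as in the diff above.
    std::weak_ptr<std::string> get(const std::string& key,
                                   const std::function<std::string()>& load) {
        return std::weak_ptr<std::string>(get_pinned(key, load));
    }
};
```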

@@ -11,6 +11,7 @@
#include "cql3/query_processor.hh"
#include <seastar/core/metrics.hh>
#include <seastar/core/memory.hh>
#include <seastar/core/shared_ptr.hh>
#include <seastar/coroutine/parallel_for_each.hh>
#include <seastar/coroutine/as_future.hh>
@@ -91,7 +92,11 @@ query_processor::query_processor(service::storage_proxy& proxy, data_dictionary:
, _authorized_prepared_cache_update_interval_in_ms_observer(_db.get_config().permissions_update_interval_in_ms.observe(_auth_prepared_cache_cfg_cb))
, _authorized_prepared_cache_validity_in_ms_observer(_db.get_config().permissions_validity_in_ms.observe(_auth_prepared_cache_cfg_cb))
, _lang_manager(langm)
, _write_consistency_levels_warned_observer(_db.get_config().write_consistency_levels_warned.observe([this](const auto& v) { _write_consistency_levels_warned = to_consistency_level_set(v); }))
, _write_consistency_levels_disallowed_observer(_db.get_config().write_consistency_levels_disallowed.observe([this](const auto& v) { _write_consistency_levels_disallowed = to_consistency_level_set(v); }))
{
_write_consistency_levels_warned = to_consistency_level_set(_db.get_config().write_consistency_levels_warned());
_write_consistency_levels_disallowed = to_consistency_level_set(_db.get_config().write_consistency_levels_disallowed());
namespace sm = seastar::metrics;
namespace stm = statements;
using clevel = db::consistency_level;
@@ -506,8 +511,40 @@ query_processor::query_processor(service::storage_proxy& proxy, data_dictionary:
_cql_stats.replication_strategy_fail_list_violations,
sm::description("Counts the number of replication_strategy_fail_list guardrail violations, "
"i.e. attempts to set a forbidden replication strategy in a keyspace via CREATE/ALTER KEYSPACE.")).set_skip_when_empty(),
sm::make_counter(
"forwarded_requests",
_cql_stats.forwarded_requests,
sm::description("Counts the total number of attempts to forward CQL requests to other nodes. One request may be forwarded multiple times, "
"particularly when a write is handled by a non-replica node.")).set_skip_when_empty(),
});
std::vector<sm::metric_definition> cql_cl_group;
for (auto cl = size_t(clevel::MIN_VALUE); cl <= size_t(clevel::MAX_VALUE); ++cl) {
cql_cl_group.push_back(
sm::make_counter(
"writes_per_consistency_level",
_cql_stats.writes_per_consistency_level[cl],
sm::description("Counts the number of writes for each consistency level."),
{cl_label(clevel(cl)), basic_level}).set_skip_when_empty());
}
_metrics.add_group("cql", cql_cl_group);
_metrics.add_group("cql", {
sm::make_counter(
"write_consistency_levels_disallowed_violations",
_cql_stats.write_consistency_levels_disallowed_violations,
sm::description("Counts the number of write_consistency_levels_disallowed guardrail violations, "
"i.e. attempts to write with a forbidden consistency level."),
{basic_level}),
sm::make_counter(
"write_consistency_levels_warned_violations",
_cql_stats.write_consistency_levels_warned_violations,
sm::description("Counts the number of write_consistency_levels_warned guardrail violations, "
"i.e. attempts to write with a discouraged consistency level."),
{basic_level}),
});
_mnotifier.register_listener(_migration_subscriber.get());
}
@@ -549,8 +586,7 @@ future<::shared_ptr<cql_transport::messages::result_message>> query_processor::e
::shared_ptr<cql_statement> statement, service::query_state& query_state, const query_options& options) {
// execute all statements that need group0 guard on shard0
if (this_shard_id() != 0) {
co_return ::make_shared<cql_transport::messages::result_message::bounce_to_shard>(0,
std::move(const_cast<cql3::query_options&>(options).take_cached_pk_function_calls()));
co_return bounce_to_shard(0, std::move(const_cast<cql3::query_options&>(options).take_cached_pk_function_calls()), false);
}
auto [remote_, holder] = remote();
@@ -697,7 +733,7 @@ future<::shared_ptr<cql_transport::messages::result_message::prepared>>
query_processor::prepare(sstring query_string, const service::client_state& client_state, cql3::dialect d) {
try {
auto key = compute_id(query_string, client_state.get_raw_keyspace(), d);
auto prep_ptr = co_await _prepared_cache.get(key, [this, &query_string, &client_state, d] {
auto prep_entry = co_await _prepared_cache.get_pinned(key, [this, &query_string, &client_state, d] {
auto prepared = get_statement(query_string, client_state, d);
prepared->calculate_metadata_id();
auto bound_terms = prepared->statement->get_bound_terms();
@@ -707,17 +743,17 @@ query_processor::prepare(sstring query_string, const service::client_state& clie
bound_terms,
std::numeric_limits<uint16_t>::max()));
}
SCYLLA_ASSERT(bound_terms == prepared->bound_names.size());
throwing_assert(bound_terms == prepared->bound_names.size());
return make_ready_future<std::unique_ptr<statements::prepared_statement>>(std::move(prepared));
});
const auto& warnings = prep_ptr->warnings;
const auto msg = ::make_shared<result_message::prepared::cql>(prepared_cache_key_type::cql_id(key), std::move(prep_ptr),
co_await utils::get_local_injector().inject(
"query_processor_prepare_wait_after_cache_get",
utils::wait_for_message(std::chrono::seconds(60)));
auto msg = ::make_shared<result_message::prepared::cql>(prepared_cache_key_type::cql_id(key), std::move(prep_entry),
client_state.is_protocol_extension_set(cql_transport::cql_protocol_extension::LWT_ADD_METADATA_MARK));
for (const auto& w : warnings) {
msg->add_warning(w);
}
co_return ::shared_ptr<cql_transport::messages::result_message::prepared>(std::move(msg));
co_return std::move(msg);
} catch(typename prepared_statements_cache::statement_is_too_big&) {
throw prepared_statement_is_too_big(query_string);
}
@@ -738,6 +774,10 @@ prepared_cache_key_type query_processor::compute_id(
std::unique_ptr<prepared_statement>
query_processor::get_statement(const std::string_view& query, const service::client_state& client_state, dialect d) {
// Measuring allocation cost requires that no yield points exist
// between bytes_before and bytes_after. It needs fixing if this
// function is ever futurized.
auto bytes_before = seastar::memory::stats().total_bytes_allocated();
std::unique_ptr<raw::parsed_statement> statement = parse_statement(query, d);
// Set keyspace for statement that require login
@@ -753,6 +793,8 @@ query_processor::get_statement(const std::string_view& query, const service::cli
audit_info->set_query_string(query);
p->statement->sanitize_audit_info();
}
auto bytes_after = seastar::memory::stats().total_bytes_allocated();
_parsing_cost_tracker.add_sample(bytes_after - bytes_before);
return p;
}
@@ -1228,9 +1270,25 @@ future<> query_processor::query_internal(
return query_internal(query_string, db::consistency_level::ONE, {}, 1000, std::move(f));
}
shared_ptr<cql_transport::messages::result_message> query_processor::bounce_to_shard(unsigned shard, cql3::computed_function_values cached_fn_calls) {
_proxy.get_stats().replica_cross_shard_ops++;
return ::make_shared<cql_transport::messages::result_message::bounce_to_shard>(shard, std::move(cached_fn_calls));
shared_ptr<cql_transport::messages::result_message> query_processor::bounce_to_shard(unsigned shard, cql3::computed_function_values cached_fn_calls, bool track) {
if (track) {
_proxy.get_stats().replica_cross_shard_ops++;
}
const auto my_host_id = _proxy.get_token_metadata_ptr()->get_topology().my_host_id();
return ::make_shared<cql_transport::messages::result_message::bounce>(my_host_id, shard, std::move(cached_fn_calls));
}
shared_ptr<cql_transport::messages::result_message> query_processor::bounce_to_node(locator::tablet_replica replica, cql3::computed_function_values cached_fn_calls, seastar::lowres_clock::time_point timeout, bool is_write) {
get_cql_stats().forwarded_requests++;
return ::make_shared<cql_transport::messages::result_message::bounce>(replica.host, replica.shard, std::move(cached_fn_calls), timeout, is_write);
}
query_processor::consistency_level_set query_processor::to_consistency_level_set(const query_processor::cl_option_list& levels) {
query_processor::consistency_level_set result;
for (const auto& opt : levels) {
result.set(static_cast<db::consistency_level>(opt));
}
return result;
}
void query_processor::update_authorized_prepared_cache_config() {

View File
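The new `to_consistency_level_set()` above converts the configured list of consistency-level options into a bitmask set so the per-write guardrail check costs a single bit probe. A minimal sketch of the idea, using `std::bitset` in place of Scylla's `enum_set`/`super_enum` machinery (names here are illustrative):

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <vector>

enum class consistency_level : unsigned {
    ANY, ONE, TWO, THREE, QUORUM, ALL,
    LOCAL_QUORUM, EACH_QUORUM, SERIAL, LOCAL_SERIAL, LOCAL_ONE,
    COUNT
};

// Bitmask-based set of consistency levels: set() and contains() are O(1).
struct consistency_level_set {
    std::bitset<static_cast<size_t>(consistency_level::COUNT)> bits;
    void set(consistency_level cl) { bits.set(static_cast<size_t>(cl)); }
    bool contains(consistency_level cl) const { return bits.test(static_cast<size_t>(cl)); }
};

// Build the set once when the config option changes, then probe it per write.
consistency_level_set to_consistency_level_set(const std::vector<consistency_level>& levels) {
    consistency_level_set result;
    for (auto cl : levels) {
        result.set(cl);
    }
    return result;
}
```

Rebuilding the set inside the config observers (rather than scanning the option list on every write) keeps `check_write_consistency_levels_guardrail` cheap on the hot path.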

@@ -31,9 +31,12 @@
#include "vector_search/vector_store_client.hh"
#include "utils/assert.hh"
#include "utils/observable.hh"
#include "utils/rolling_max_tracker.hh"
#include "service/raft/raft_group0_client.hh"
#include "types/types.hh"
#include "db/auth_version.hh"
#include "db/consistency_level_type.hh"
#include "db/config.hh"
#include "utils/enum_option.hh"
#include "service/storage_proxy_fwd.hh"
@@ -132,6 +135,9 @@ private:
prepared_statements_cache _prepared_cache;
authorized_prepared_statements_cache _authorized_prepared_cache;
// Tracks the rolling maximum of gross bytes allocated during CQL parsing
utils::rolling_max_tracker _parsing_cost_tracker{1000};
std::function<void(uint32_t)> _auth_prepared_cache_cfg_cb;
serialized_action _authorized_prepared_cache_config_action;
utils::observer<uint32_t> _authorized_prepared_cache_update_interval_in_ms_observer;
@@ -142,6 +148,30 @@ private:
std::unordered_map<sstring, std::unique_ptr<statements::prepared_statement>> _internal_statements;
lang::manager& _lang_manager;
using cl_option_list = std::vector<enum_option<db::consistency_level_restriction_t>>;
/// Efficient bitmask-based set of consistency levels.
using consistency_level_set = enum_set<super_enum<db::consistency_level,
db::consistency_level::ANY,
db::consistency_level::ONE,
db::consistency_level::TWO,
db::consistency_level::THREE,
db::consistency_level::QUORUM,
db::consistency_level::ALL,
db::consistency_level::LOCAL_QUORUM,
db::consistency_level::EACH_QUORUM,
db::consistency_level::SERIAL,
db::consistency_level::LOCAL_SERIAL,
db::consistency_level::LOCAL_ONE>>;
consistency_level_set _write_consistency_levels_warned;
consistency_level_set _write_consistency_levels_disallowed;
utils::observer<cl_option_list> _write_consistency_levels_warned_observer;
utils::observer<cl_option_list> _write_consistency_levels_disallowed_observer;
static consistency_level_set to_consistency_level_set(const cl_option_list& levels);
public:
static const sstring CQL_VERSION;
@@ -186,6 +216,11 @@ public:
return _cql_stats;
}
/// Returns the estimated peak memory cost of CQL parsing (rolling maximum over recent samples).
size_t parsing_cost_estimate() const noexcept {
return _parsing_cost_tracker.current_max();
}
lang::manager& lang() { return _lang_manager; }
const vector_search::vector_store_client& vector_store_client() const noexcept {
@@ -196,8 +231,6 @@ public:
return _vector_store_client;
}
db::auth_version_t auth_version;
statements::prepared_statement::checked_weak_ptr get_prepared(const std::optional<auth::authenticated_user>& user, const prepared_cache_key_type& key) {
if (user) {
auto vp = _authorized_prepared_cache.find(*user, key);
@@ -477,7 +510,12 @@ public:
friend class migration_subscriber;
shared_ptr<cql_transport::messages::result_message> bounce_to_shard(unsigned shard, cql3::computed_function_values cached_fn_calls);
shared_ptr<cql_transport::messages::result_message> bounce_to_shard(unsigned shard, cql3::computed_function_values cached_fn_calls, bool track = true);
shared_ptr<cql_transport::messages::result_message> bounce_to_node(
locator::tablet_replica replica,
cql3::computed_function_values cached_fn_calls,
seastar::lowres_clock::time_point timeout,
bool is_write);
void update_authorized_prepared_cache_config();
@@ -493,6 +531,21 @@ public:
int32_t page_size = -1,
service::node_local_only node_local_only = service::node_local_only::no) const;
enum class write_consistency_guardrail_state { NONE, WARN, FAIL };
inline write_consistency_guardrail_state check_write_consistency_levels_guardrail(db::consistency_level cl) {
_cql_stats.writes_per_consistency_level[size_t(cl)]++;
if (_write_consistency_levels_disallowed.contains(cl)) [[unlikely]] {
_cql_stats.write_consistency_levels_disallowed_violations++;
return write_consistency_guardrail_state::FAIL;
}
if (_write_consistency_levels_warned.contains(cl)) [[unlikely]] {
_cql_stats.write_consistency_levels_warned_violations++;
return write_consistency_guardrail_state::WARN;
}
return write_consistency_guardrail_state::NONE;
}
private:
// Keep the holder until you stop using the `remote` services.
std::pair<std::reference_wrapper<remote>, gate::holder> remote();

View File
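The header diff above adds `utils::rolling_max_tracker _parsing_cost_tracker{1000}`, which tracks the rolling maximum of bytes allocated while parsing. That tracker's implementation is not part of this diff; a simplified sketch of one way such a tracker can work (fixed-size ring buffer of samples, maximum recomputed on demand):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified stand-in for utils::rolling_max_tracker: keeps the last
// `capacity` samples and reports the maximum among them.
class rolling_max_tracker {
    std::vector<size_t> _samples;
    size_t _capacity;
    size_t _next = 0; // next slot to overwrite once the buffer is full
public:
    explicit rolling_max_tracker(size_t capacity) : _capacity(capacity) {}

    void add_sample(size_t v) {
        if (_samples.size() < _capacity) {
            _samples.push_back(v);
        } else {
            _samples[_next] = v;
            _next = (_next + 1) % _capacity;
        }
    }

    size_t current_max() const {
        if (_samples.empty()) {
            return 0;
        }
        return *std::max_element(_samples.begin(), _samples.end());
    }
};
```

The real tracker may amortize the maximum differently; the point is that old samples age out, so a one-off pathological query does not inflate `parsing_cost_estimate()` forever.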

@@ -89,10 +89,10 @@ public:
*/
void merge(const bounds_slice& other) {
if (has_bound(statements::bound::START)) {
SCYLLA_ASSERT(!other.has_bound(statements::bound::START));
throwing_assert(!other.has_bound(statements::bound::START));
_bounds[get_idx(statements::bound::END)] = other._bounds[get_idx(statements::bound::END)];
} else {
SCYLLA_ASSERT(!other.has_bound(statements::bound::END));
throwing_assert(!other.has_bound(statements::bound::END));
_bounds[get_idx(statements::bound::START)] = other._bounds[get_idx(statements::bound::START)];
}
}

View File

@@ -61,7 +61,7 @@ void metadata::set_paging_state(lw_shared_ptr<const service::pager::paging_state
}
void metadata::maybe_set_paging_state(lw_shared_ptr<const service::pager::paging_state> paging_state) {
SCYLLA_ASSERT(paging_state);
throwing_assert(paging_state);
if (paging_state->get_remaining() > 0) {
set_paging_state(std::move(paging_state));
} else {
@@ -138,7 +138,7 @@ bool result_set::empty() const {
}
void result_set::add_row(std::vector<managed_bytes_opt> row) {
SCYLLA_ASSERT(row.size() == _metadata->value_count());
throwing_assert(row.size() == _metadata->value_count());
_rows.emplace_back(std::move(row));
}

View File

@@ -212,11 +212,20 @@ public:
}
virtual uint32_t add_column_for_post_processing(const column_definition& c) override {
uint32_t index = selection::add_column_for_post_processing(c);
auto it = std::find_if(_selectors.begin(), _selectors.end(), [&c](const expr::expression& e) {
auto col = expr::as_if<expr::column_value>(&e);
return col && col->col == &c;
});
if (it != _selectors.end()) {
return std::distance(_selectors.begin(), it);
}
add_column(c);
get_result_metadata()->add_non_serialized_column(c.column_specification);
_selectors.push_back(expr::column_value(&c));
if (_inner_loop.empty()) {
// Simple case: no aggregation
return index;
return _selectors.size() - 1;
} else {
// Complex case: aggregation, must pass through temporary
auto first_func = cql3::functions::aggregate_fcts::make_first_function(c.type);
@@ -470,10 +479,21 @@ std::vector<const column_definition*> selection::wildcard_columns(schema_ptr sch
return simple_selection::make(schema, std::move(columns), false);
}
uint32_t selection::add_column_for_post_processing(const column_definition& c) {
selection::add_column_result selection::add_column(const column_definition& c) {
auto index = index_of(c);
if (index != -1) {
return {index, false};
}
_columns.push_back(&c);
_metadata->add_non_serialized_column(c.column_specification);
return _columns.size() - 1;
return {_columns.size() - 1, true};
}
uint32_t selection::add_column_for_post_processing(const column_definition& c) {
auto col = add_column(c);
if (col.added) {
_metadata->add_non_serialized_column(c.column_specification);
}
return col.index;
}
::shared_ptr<selection> selection::from_selectors(data_dictionary::database db, schema_ptr schema, const sstring& ks, const std::vector<prepared_selector>& prepared_selectors) {

View File

@@ -130,6 +130,14 @@ public:
virtual std::vector<shared_ptr<functions::function>> used_functions() const { return {}; }
query::partition_slice::option_set get_query_options();
protected:
// Result of add_column(): the index in _columns and whether the column
// was added by this call (or was already present).
struct add_column_result {
uint32_t index;
bool added;
};
// Adds a column to _columns if it is not already present; returns an add_column_result.
add_column_result add_column(const column_definition& c);
private:
static bool processes_selection(const std::vector<prepared_selector>& prepared_selectors);
@@ -348,7 +356,7 @@ public:
add_value(*def, static_row_iterator);
break;
default:
SCYLLA_ASSERT(0);
throwing_assert(0);
}
}
_builder.complete_row();

View File

@@ -34,7 +34,7 @@ sets::setter::execute(mutation& m, const clustering_key_prefix& row_key, const u
void
sets::adder::execute(mutation& m, const clustering_key_prefix& row_key, const update_parameters& params) {
const cql3::raw_value value = expr::evaluate(*_e, params._options);
SCYLLA_ASSERT(column.type->is_multi_cell()); // "Attempted to add items to a frozen set";
throwing_assert(column.type->is_multi_cell()); // "Attempted to add items to a frozen set";
do_add(m, row_key, params, value, column);
}
@@ -77,7 +77,7 @@ sets::adder::do_add(mutation& m, const clustering_key_prefix& row_key, const upd
void
sets::discarder::execute(mutation& m, const clustering_key_prefix& row_key, const update_parameters& params) {
SCYLLA_ASSERT(column.type->is_multi_cell()); // "Attempted to remove items from a frozen set";
throwing_assert(column.type->is_multi_cell()); // "Attempted to remove items from a frozen set";
cql3::raw_value svalue = expr::evaluate(*_e, params._options);
if (svalue.is_null()) {
@@ -98,7 +98,7 @@ sets::discarder::execute(mutation& m, const clustering_key_prefix& row_key, cons
void sets::element_discarder::execute(mutation& m, const clustering_key_prefix& row_key, const update_parameters& params)
{
SCYLLA_ASSERT(column.type->is_multi_cell() && "Attempted to remove items from a frozen set");
throwing_assert(column.type->is_multi_cell() && "Attempted to remove items from a frozen set");
cql3::raw_value elt = expr::evaluate(*_e, params._options);
if (elt.is_null()) {
throw exceptions::invalid_request_exception("Invalid null set element");

View File

@@ -296,7 +296,7 @@ void alter_table_statement::drop_column(const query_options& options, const sche
std::pair<schema_ptr, std::vector<view_ptr>> alter_table_statement::prepare_schema_update(data_dictionary::database db, const query_options& options) const {
auto s = validation::validate_column_family(db, keyspace(), column_family());
if (s->is_view()) {
throw exceptions::invalid_request_exception("Cannot use ALTER TABLE on Materialized View");
throw exceptions::invalid_request_exception("Cannot use ALTER TABLE on Materialized View. (Did you mean ALTER MATERIALIZED VIEW?)");
}
const bool is_cdc_log_table = cdc::is_log_for_some_table(db.real_database(), s->ks_name(), s->cf_name());
@@ -368,7 +368,7 @@ std::pair<schema_ptr, std::vector<view_ptr>> alter_table_statement::prepare_sche
switch (_type) {
case alter_table_statement::type::add:
SCYLLA_ASSERT(_column_changes.size());
throwing_assert(_column_changes.size());
if (s->is_dense()) {
throw exceptions::invalid_request_exception("Cannot add new column to a COMPACT STORAGE table");
}
@@ -376,12 +376,12 @@ std::pair<schema_ptr, std::vector<view_ptr>> alter_table_statement::prepare_sche
break;
case alter_table_statement::type::alter:
SCYLLA_ASSERT(_column_changes.size() == 1);
throwing_assert(_column_changes.size() == 1);
invoke_column_change_fn(std::mem_fn(&alter_table_statement::alter_column));
break;
case alter_table_statement::type::drop:
SCYLLA_ASSERT(_column_changes.size());
throwing_assert(_column_changes.size());
if (!s->is_cql3_table()) {
throw exceptions::invalid_request_exception("Cannot drop columns from a non-CQL3 table");
}

View File

@@ -46,7 +46,7 @@ future<> alter_view_statement::check_access(query_processor& qp, const service::
view_ptr alter_view_statement::prepare_view(data_dictionary::database db) const {
schema_ptr schema = validation::validate_column_family(db, keyspace(), column_family());
if (!schema->is_view()) {
throw exceptions::invalid_request_exception("Cannot use ALTER MATERIALIZED VIEW on Table");
throw exceptions::invalid_request_exception("Cannot use ALTER MATERIALIZED VIEW on Table. (Did you mean ALTER TABLE?)");
}
if (!_properties) {

View File

@@ -25,7 +25,7 @@ attach_service_level_statement::attach_service_level_statement(sstring service_l
}
bool attach_service_level_statement::needs_guard(query_processor& qp, service::query_state& state) const {
return !auth::legacy_mode(qp) || state.get_service_level_controller().is_v2();
return true;
}
std::unique_ptr<cql3::statements::prepared_statement>

View File

@@ -11,7 +11,6 @@
#include "authentication_statement.hh"
#include "transport/messages/result_message.hh"
#include "cql3/query_processor.hh"
#include "auth/common.hh"
uint32_t cql3::statements::authentication_statement::get_bound_terms() const {
return 0;
@@ -26,7 +25,7 @@ future<> cql3::statements::authentication_statement::check_access(query_processo
}
bool cql3::statements::authentication_altering_statement::needs_guard(query_processor& qp, service::query_state&) const {
return !auth::legacy_mode(qp);
return true;
}
audit::statement_category cql3::statements::authentication_statement::category() const {

View File

@@ -14,7 +14,6 @@
#include "cql3/query_processor.hh"
#include "exceptions/exceptions.hh"
#include "db/cql_type_parser.hh"
#include "auth/common.hh"
uint32_t cql3::statements::authorization_statement::get_bound_terms() const {
return 0;
@@ -74,7 +73,7 @@ void cql3::statements::authorization_statement::maybe_correct_resource(auth::res
bool cql3::statements::authorization_altering_statement::needs_guard(
query_processor& qp, service::query_state&) const {
return !auth::legacy_mode(qp);
return true;
};
audit::statement_category cql3::statements::authorization_statement::category() const {

View File

@@ -259,6 +259,15 @@ future<shared_ptr<cql_transport::messages::result_message>> batch_statement::do_
if (options.getSerialConsistency() == null)
throw new InvalidRequestException("Invalid empty serial consistency level");
#endif
const auto cl = options.get_consistency();
const query_processor::write_consistency_guardrail_state guardrail_state = qp.check_write_consistency_levels_guardrail(cl);
if (guardrail_state == query_processor::write_consistency_guardrail_state::FAIL) {
return make_exception_future<shared_ptr<cql_transport::messages::result_message>>(
exceptions::invalid_request_exception(
format("Consistency level {} is not allowed for write operations", cl)));
}
for (size_t i = 0; i < _statements.size(); ++i) {
_statements[i].statement->restrictions().validate_primary_key(options.for_statement(i));
}
@@ -266,23 +275,31 @@ future<shared_ptr<cql_transport::messages::result_message>> batch_statement::do_
if (_has_conditions) {
++_stats.cas_batches;
_stats.statements_in_cas_batches += _statements.size();
return execute_with_conditions(qp, options, query_state);
return execute_with_conditions(qp, options, query_state).then([guardrail_state, cl] (auto result) {
if (guardrail_state == query_processor::write_consistency_guardrail_state::WARN) {
result->add_warning(format("Write with consistency level {} is discouraged by guardrail configuration", cl));
}
return result;
});
}
++_stats.batches;
_stats.statements_in_batches += _statements.size();
auto timeout = db::timeout_clock::now() + get_timeout(query_state.get_client_state(), options);
return get_mutations(qp, options, timeout, local, now, query_state).then([this, &qp, &options, timeout, tr_state = query_state.get_trace_state(),
return get_mutations(qp, options, timeout, local, now, query_state).then([this, &qp, cl, timeout, tr_state = query_state.get_trace_state(),
permit = query_state.get_permit()] (utils::chunked_vector<mutation> ms) mutable {
return execute_without_conditions(qp, std::move(ms), options.get_consistency(), timeout, std::move(tr_state), std::move(permit));
}).then([] (coordinator_result<> res) {
return execute_without_conditions(qp, std::move(ms), cl, timeout, std::move(tr_state), std::move(permit));
}).then([guardrail_state, cl] (coordinator_result<> res) {
if (!res) {
return make_ready_future<shared_ptr<cql_transport::messages::result_message>>(
seastar::make_shared<cql_transport::messages::result_message::exception>(std::move(res).assume_error()));
}
return make_ready_future<shared_ptr<cql_transport::messages::result_message>>(
make_shared<cql_transport::messages::result_message::void_message>());
auto result = make_shared<cql_transport::messages::result_message::void_message>();
if (guardrail_state == query_processor::write_consistency_guardrail_state::WARN) {
result->add_warning(format("Write with consistency level {} is discouraged by guardrail configuration", cl));
}
return make_ready_future<shared_ptr<cql_transport::messages::result_message>>(std::move(result));
});
}

View File
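The batch path above consults the write-consistency guardrail once up front, fails the request early on FAIL, and remembers WARN so a warning can be attached to whichever result message is eventually produced. A condensed, self-contained sketch of that control flow (types and messages simplified; not the actual Scylla signatures):

```cpp
#include <cassert>
#include <optional>
#include <stdexcept>
#include <string>

enum class guardrail_state { NONE, WARN, FAIL };

// Check once before executing: throw on FAIL, return the warning text to
// attach to the final result on WARN, or nothing when the level is allowed.
std::optional<std::string> apply_write_guardrail(guardrail_state state,
                                                 const std::string& cl_name) {
    if (state == guardrail_state::FAIL) {
        throw std::invalid_argument(
            "Consistency level " + cl_name + " is not allowed for write operations");
    }
    if (state == guardrail_state::WARN) {
        return "Write with consistency level " + cl_name +
               " is discouraged by guardrail configuration";
    }
    return std::nullopt;
}
```

Capturing `guardrail_state` in the continuation, as the diff does, means the check itself runs only once even though the warning is attached on two different result paths (conditional and unconditional batches).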

@@ -20,6 +20,7 @@
#include "cql3/attributes.hh"
#include "cql3/expr/expression.hh"
#include "cql3/expr/evaluate.hh"
#include "cql3/query_options.hh"
#include "cql3/query_processor.hh"
#include "cql3/values.hh"
#include "timeout_config.hh"
@@ -65,7 +66,7 @@ evaluate_prepared(
future<::shared_ptr<cql_transport::messages::result_message>>
broadcast_modification_statement::execute_without_checking_exception_message(query_processor& qp, service::query_state& qs, const query_options& options, std::optional<service::group0_guard> guard) const {
if (this_shard_id() != 0) {
co_return ::make_shared<cql_transport::messages::result_message::bounce_to_shard>(0, cql3::computed_function_values{});
co_return qp.bounce_to_shard(0, cql3::computed_function_values{}, false);
}
auto result = co_await qp.execute_broadcast_table_query(

View File

@@ -96,7 +96,7 @@ evaluate_prepared(
future<::shared_ptr<cql_transport::messages::result_message>>
broadcast_select_statement::execute_without_checking_exception_message(query_processor& qp, service::query_state& qs, const query_options& options, std::optional<service::group0_guard> guard) const {
if (this_shard_id() != 0) {
co_return ::make_shared<cql_transport::messages::result_message::bounce_to_shard>(0, cql3::computed_function_values{});
co_return qp.bounce_to_shard(0, cql3::computed_function_values{}, false);
}
auto result = co_await qp.execute_broadcast_table_query(

View File

@@ -51,7 +51,7 @@ public:
, _key(std::move(key_arg))
, _rows(schema_arg)
{
SCYLLA_ASSERT(_key.size() == 1 && query::is_single_partition(_key.front()));
throwing_assert(_key.size() == 1 && query::is_single_partition(_key.front()));
}
dht::partition_range_vector key() const {

View File

@@ -41,13 +41,13 @@ void cf_statement::prepare_keyspace(std::string_view keyspace)
}
bool cf_statement::has_keyspace() const {
SCYLLA_ASSERT(_cf_name.has_value());
throwing_assert(_cf_name.has_value());
return _cf_name->has_keyspace();
}
const sstring& cf_statement::keyspace() const
{
SCYLLA_ASSERT(_cf_name->has_keyspace()); // "The statement hasn't be prepared correctly";
throwing_assert(_cf_name->has_keyspace()); // "The statement hasn't been prepared correctly";
return _cf_name->get_keyspace();
}

View File

@@ -139,7 +139,7 @@ void create_table_statement::apply_properties_to(schema_builder& builder, const
void create_table_statement::add_column_metadata_from_aliases(schema_builder& builder, std::vector<bytes> aliases, const std::vector<data_type>& types, column_kind kind) const
{
SCYLLA_ASSERT(aliases.size() == types.size());
throwing_assert(aliases.size() == types.size());
for (size_t i = 0; i < aliases.size(); i++) {
if (!aliases[i].empty()) {
builder.with_column(aliases[i], types[i], kind);
@@ -151,7 +151,7 @@ std::unique_ptr<prepared_statement>
create_table_statement::prepare(data_dictionary::database db, cql_stats& stats) {
// Cannot happen; create_table_statement is never instantiated as a raw statement
// (instead we instantiate create_table_statement::raw_statement)
abort();
throwing_assert(0 && "create_table_statement::prepare");
}
future<> create_table_statement::grant_permissions_to_creator(const service::client_state& cs, service::group0_batch& mc) const {
@@ -239,7 +239,7 @@ std::unique_ptr<prepared_statement> create_table_statement::raw_statement::prepa
for (auto&& inner: type->all_types()) {
if (inner->is_multi_cell()) {
// a nested non-frozen UDT should have already been rejected when defining the type
SCYLLA_ASSERT(inner->is_collection());
throwing_assert(inner->is_collection());
throw exceptions::invalid_request_exception("Non-frozen UDTs with nested non-frozen collections are not supported");
}
}

View File

@@ -63,7 +63,7 @@ future<> create_view_statement::check_access(query_processor& qp, const service:
static const column_definition* get_column_definition(const schema& schema, column_identifier::raw& identifier) {
auto prepared = identifier.prepare(schema);
SCYLLA_ASSERT(dynamic_pointer_cast<column_identifier>(prepared));
throwing_assert(dynamic_pointer_cast<column_identifier>(prepared));
auto id = static_pointer_cast<column_identifier>(prepared);
return schema.get_column_definition(id->name());
}

View File

@@ -103,7 +103,7 @@ delete_statement::delete_statement(cf_name name,
, _deletions(std::move(deletions))
, _where_clause(std::move(where_clause))
{
SCYLLA_ASSERT(!_attrs->time_to_live.has_value());
throwing_assert(!_attrs->time_to_live.has_value());
}
}

View File

@@ -307,7 +307,7 @@ future<std::vector<description>> table(const data_dictionary::database& db, cons
auto s = validation::validate_column_family(db, ks, name);
if (s->is_view()) {
throw exceptions::invalid_request_exception("Cannot use DESC TABLE on materialized View");
throw exceptions::invalid_request_exception("Cannot use DESC TABLE on materialized view. (Did you mean DESC MATERIALIZED VIEW?)");
}
auto schema = table->schema();
@@ -659,8 +659,7 @@ future<std::vector<std::vector<managed_bytes_opt>>> schema_describe_statement::d
auto& auth_service = *client_state.get_auth_service();
if (config.with_hashed_passwords) {
const auto maybe_user = client_state.user();
if (!maybe_user || !co_await auth::has_superuser(auth_service, *maybe_user)) {
if (!co_await client_state.has_superuser()) {
co_await coroutine::return_exception(exceptions::unauthorized_exception(
"DESCRIBE SCHEMA WITH INTERNALS AND PASSWORDS can only be issued by a superuser"));
}

View File

@@ -23,7 +23,7 @@ detach_service_level_statement::detach_service_level_statement(sstring role_name
}
bool detach_service_level_statement::needs_guard(query_processor& qp, service::query_state&) const {
return !auth::legacy_mode(qp);
return true;
}
std::unique_ptr<cql3::statements::prepared_statement>

View File

@@ -63,7 +63,7 @@ future<std::tuple<::shared_ptr<cql_transport::event::schema_change>, cql3::cql_w
}
}
if (!auth::legacy_mode(qp)) {
{
const auto& as = *state.get_client_state().get_auth_service();
co_await auth::revoke_all(as, auth::make_data_resource(_keyspace), mc);
co_await auth::revoke_all(as, auth::make_functions_resource(_keyspace), mc);

View File

@@ -58,7 +58,7 @@ future<std::tuple<::shared_ptr<cql_transport::event::schema_change>, cql3::cql_w
         }
     }
-    if (!auth::legacy_mode(qp)) {
+    {
         const auto& as = *state.get_client_state().get_auth_service();
         co_await auth::revoke_all(as, auth::make_data_resource(keyspace(), column_family()), mc);
     }


@@ -81,7 +81,7 @@ list_effective_service_level_statement::execute(query_processor& qp, service::qu
     }
     auto& sl_controller = state.get_service_level_controller();
-    auto slo = co_await sl_controller.find_effective_service_level(_role_name);
+    auto slo = sl_controller.find_cached_effective_service_level(_role_name);
     if (!slo) {
         throw exceptions::invalid_request_exception(format("Role {} doesn't have assigned any service level", _role_name));


@@ -16,6 +16,7 @@
 #include "auth/common.hh"
 #include "cql3/result_set.hh"
 #include "cql3/column_identifier.hh"
+#include "db/system_keyspace.hh"
 #include "transport/messages/result_message.hh"
 cql3::statements::list_permissions_statement::list_permissions_statement(
@@ -49,7 +50,7 @@ future<> cql3::statements::list_permissions_statement::check_access(query_proces
     const auto& as = *state.get_auth_service();
     const auto user = state.user();
-    return auth::has_superuser(as, *user).then([this, &as, user](bool has_super) {
+    return state.has_superuser().then([this, &as, user](bool has_super) {
         if (has_super) {
             return make_ready_future<>();
         }
@@ -80,9 +81,9 @@ cql3::statements::list_permissions_statement::execute(
         service::query_state& state,
         const query_options& options,
         std::optional<service::group0_guard> guard) const {
-    auto make_column = [auth_ks = auth::get_auth_ks_name(qp)](sstring name) {
+    auto make_column = [](sstring name) {
         return make_lw_shared<column_specification>(
-                auth_ks,
+                db::system_keyspace::NAME,
                 "permissions",
                 ::make_shared<column_identifier>(std::move(name), true),
                 utf8_type);

@@ -14,6 +14,7 @@
 #include "cql3/query_options.hh"
 #include "cql3/column_identifier.hh"
 #include "auth/common.hh"
+#include "db/system_keyspace.hh"
 #include "transport/messages/result_message.hh"
 std::unique_ptr<cql3::statements::prepared_statement> cql3::statements::list_users_statement::prepare(
@@ -30,9 +31,9 @@ future<::shared_ptr<cql_transport::messages::result_message>>
 cql3::statements::list_users_statement::execute(query_processor& qp, service::query_state& state, const query_options& options, std::optional<service::group0_guard> guard) const {
     static const sstring virtual_table_name("users");
-    const auto make_column_spec = [auth_ks = auth::get_auth_ks_name(qp)](const sstring& name, const ::shared_ptr<const abstract_type>& ty) {
+    const auto make_column_spec = [](const sstring& name, const ::shared_ptr<const abstract_type>& ty) {
         return make_lw_shared<column_specification>(
-                auth_ks,
+                db::system_keyspace::NAME,
                 virtual_table_name,
                 ::make_shared<column_identifier>(name, true),
                 ty);
@@ -74,7 +75,7 @@ cql3::statements::list_users_statement::execute(query_processor& qp, service::qu
     const auto& cs = state.get_client_state();
     const auto& as = *cs.get_auth_service();
-    return auth::has_superuser(as, *cs.user()).then([&cs, &as, make_results = std::move(make_results)](bool has_superuser) mutable {
+    return cs.has_superuser().then([&cs, &as, make_results = std::move(make_results)](bool has_superuser) mutable {
         if (has_superuser) {
             return as.underlying_role_manager().query_all().then([&as, make_results = std::move(make_results)](std::unordered_set<sstring> roles) mutable {
                 return make_results(as, std::move(roles));


@@ -268,10 +268,22 @@ modification_statement::do_execute(query_processor& qp, service::query_state& qs
     inc_cql_stats(qs.get_client_state().is_internal());
     const auto cl = options.get_consistency();
+    const query_processor::write_consistency_guardrail_state guardrail_state = qp.check_write_consistency_levels_guardrail(cl);
+    if (guardrail_state == query_processor::write_consistency_guardrail_state::FAIL) {
+        co_return coroutine::exception(
+                std::make_exception_ptr(exceptions::invalid_request_exception(
+                        format("Consistency level {} is not allowed for write operations", cl))));
+    }
     _restrictions->validate_primary_key(options);
     if (has_conditions()) {
-        co_return co_await execute_with_condition(qp, qs, options);
+        auto result = co_await execute_with_condition(qp, qs, options);
+        if (guardrail_state == query_processor::write_consistency_guardrail_state::WARN) {
+            result->add_warning(format("Write with consistency level {} is warned by guardrail configuration", cl));
+        }
+        co_return result;
     }
     json_cache_opt json_cache = maybe_prepare_json_cache(options);
@@ -290,6 +302,9 @@ modification_statement::do_execute(query_processor& qp, service::query_state& qs
     }
     auto result = seastar::make_shared<cql_transport::messages::result_message::void_message>();
+    if (guardrail_state == query_processor::write_consistency_guardrail_state::WARN) {
+        result->add_warning(format("Write with consistency level {} is warned by guardrail configuration", cl));
+    }
     if (keys_size_one) {
         auto&& table = s->table();
         if (_may_use_token_aware_routing && table.uses_tablets() && qs.get_client_state().is_protocol_extension_set(cql_transport::cql_protocol_extension::TABLETS_ROUTING_V1)) {
@@ -356,7 +371,6 @@ process_forced_rebounce(unsigned shard, query_processor& qp, const query_options
     // On the last iteration, re-bounce to the correct shard.
     if (counter != 0) {
         const auto shard_num = smp::count;
-        assert(shard_num > 0);
         const auto local_shard = this_shard_id();
         auto target_shard = local_shard + 1;
         if (target_shard == shard) {
@@ -490,7 +504,7 @@ modification_statement::process_where_clause(data_dictionary::database db, expr:
      * partition to check conditions.
      */
     if (_if_exists || _if_not_exists) {
-        SCYLLA_ASSERT(!_has_static_column_conditions && !_has_regular_column_conditions);
+        throwing_assert(!_has_static_column_conditions && !_has_regular_column_conditions);
         if (s->has_static_columns() && !_restrictions->has_clustering_columns_restriction()) {
            _has_static_column_conditions = true;
         } else {
@@ -676,13 +690,13 @@ modification_statement::prepare_conditions(data_dictionary::database db, const s
     if (_if_not_exists) {
         // To have both 'IF NOT EXISTS' and some other conditions doesn't make sense.
-        // So far this is enforced by the parser, but let's SCYLLA_ASSERT it for sanity if ever the parse changes.
-        SCYLLA_ASSERT(!_conditions);
-        SCYLLA_ASSERT(!_if_exists);
+        // So far this is enforced by the parser, but let's throwing_assert it for sanity if ever the parse changes.
+        throwing_assert(!_conditions);
+        throwing_assert(!_if_exists);
         stmt.set_if_not_exist_condition();
     } else if (_if_exists) {
-        SCYLLA_ASSERT(!_conditions);
-        SCYLLA_ASSERT(!_if_not_exists);
+        throwing_assert(!_conditions);
+        throwing_assert(!_if_not_exists);
         stmt.set_if_exist_condition();
     } else {
         stmt._condition = column_condition_prepare(*_conditions, db, keyspace(), schema);

@@ -20,6 +20,7 @@
 #include "cql3/column_specification.hh"
 #include "cql3/column_identifier.hh"
 #include "cql3/query_processor.hh"
+#include "db/system_keyspace.hh"
 #include "cql3/statements/alter_role_statement.hh"
 #include "cql3/statements/create_role_statement.hh"
 #include "cql3/statements/drop_role_statement.hh"
@@ -94,7 +95,7 @@ future<> create_role_statement::check_access(query_processor& qp, const service:
         return;
     }
-    const bool has_superuser = auth::has_superuser(*state.get_auth_service(), *state.user()).get();
+    const bool has_superuser = state.has_superuser().get();
     if (_options.hashed_password && !has_superuser) {
         throw exceptions::unauthorized_exception("Only superusers can create a role with a hashed password.");
@@ -213,7 +214,7 @@ future<> alter_role_statement::check_access(query_processor& qp, const service::
     auto& as = *state.get_auth_service();
     const auto& user = *state.user();
-    const bool user_is_superuser = auth::has_superuser(as, user).get();
+    const bool user_is_superuser = state.has_superuser().get();
     if (_options.is_superuser) {
         if (!user_is_superuser) {
@@ -306,7 +307,7 @@ future<> drop_role_statement::check_access(query_processor& qp, const service::c
     auto& as = *state.get_auth_service();
-    const bool user_is_superuser = auth::has_superuser(as, *state.user()).get();
+    const bool user_is_superuser = state.has_superuser().get();
     const bool role_has_superuser = [this, &as] {
         try {
@@ -378,9 +379,9 @@ future<result_message_ptr>
 list_roles_statement::execute(query_processor& qp, service::query_state& state, const query_options&, std::optional<service::group0_guard> guard) const {
     static const sstring virtual_table_name("roles");
-    const auto make_column_spec = [auth_ks = auth::get_auth_ks_name(qp)](const sstring& name, const ::shared_ptr<const abstract_type>& ty) {
+    const auto make_column_spec = [](const sstring& name, const ::shared_ptr<const abstract_type>& ty) {
         return make_lw_shared<column_specification>(
-                auth_ks,
+                db::system_keyspace::NAME,
                 virtual_table_name,
                 ::make_shared<column_identifier>(name, true),
                 ty);
@@ -442,7 +443,7 @@ list_roles_statement::execute(query_processor& qp, service::query_state& state,
     const auto& cs = state.get_client_state();
     const auto& as = *cs.get_auth_service();
-    return auth::has_superuser(as, *cs.user()).then([this, &cs, &as, make_results = std::move(make_results)](bool super) mutable {
+    return cs.has_superuser().then([this, &cs, &as, make_results = std::move(make_results)](bool super) mutable {
         auto& rm = as.underlying_role_manager();
         const auto& a = as.underlying_authenticator();
         const auto query_mode = _recursive ? auth::recursive_role_query::yes : auth::recursive_role_query::no;


@@ -889,7 +889,7 @@ select_statement::execute_without_checking_exception_message_non_aggregate_unpag
     auto timeout = db::timeout_clock::now() + get_timeout(state.get_client_state(), options);
     if (needs_post_query_ordering() && _limit) {
         return do_with(std::forward<dht::partition_range_vector>(partition_ranges), [this, &qp, &state, &options, cmd, timeout, cas_shard = std::move(cas_shard)](auto& prs) {
-            SCYLLA_ASSERT(cmd->partition_limit == query::max_partitions);
+            throwing_assert(cmd->partition_limit == query::max_partitions);
             query::result_merger merger(cmd->get_row_limit() * prs.size(), query::max_partitions);
             return utils::result_map_reduce(prs.begin(), prs.end(), [this, &qp, &state, &options, cmd, timeout, cas_shard = std::move(cas_shard)] (auto& pr) {
                 dht::partition_range_vector prange { pr };
@@ -1085,7 +1085,7 @@ view_indexed_table_select_statement::view_indexed_table_select_statement(schema_
     , _used_index_restrictions(std::move(used_index_restrictions))
     , _view_schema(view_schema)
 {
-    SCYLLA_ASSERT(_view_schema);
+    throwing_assert(_view_schema);
     if (_index.metadata().local()) {
         _get_partition_ranges_for_posting_list = [this] (const query_options& options) { return get_partition_ranges_for_local_index_posting_list(options); };
         _get_partition_slice_for_posting_list = [this] (const query_options& options) { return get_partition_slice_for_local_index_posting_list(options); };
@@ -1203,7 +1203,7 @@ view_indexed_table_select_statement::actually_do_execute(query_processor& qp,
             ? source_selector::INTERNAL : source_selector::USER;
     ++_stats.query_cnt(src_sel, _ks_sel, cond_selector::NO_CONDITIONS, statement_type::SELECT);
-    SCYLLA_ASSERT(_restrictions->uses_secondary_indexing());
+    throwing_assert(_restrictions->uses_secondary_indexing());
     _stats.unpaged_select_queries(_ks_sel) += options.get_page_size() <= 0;
@@ -2240,8 +2240,8 @@ future<::shared_ptr<cql_transport::messages::result_message>> vector_indexed_tab
 namespace raw {
 static void validate_attrs(const cql3::attributes::raw& attrs) {
-    SCYLLA_ASSERT(!attrs.timestamp.has_value());
-    SCYLLA_ASSERT(!attrs.time_to_live.has_value());
+    throwing_assert(!attrs.timestamp.has_value());
+    throwing_assert(!attrs.time_to_live.has_value());
 }
 audit::statement_category select_statement::category() const {
@@ -2406,7 +2406,7 @@ std::unique_ptr<prepared_statement> select_statement::prepare(data_dictionary::d
     std::visit([&](auto&& ordering) {
         using T = std::decay_t<decltype(ordering)>;
         if constexpr (!std::is_same_v<T, select_statement::ann_vector>) {
-            SCYLLA_ASSERT(!for_view);
+            throwing_assert(!for_view);
             verify_ordering_is_allowed(*_parameters, *restrictions);
             prepared_orderings_type prepared_orderings = prepare_orderings(*schema);
             verify_ordering_is_valid(prepared_orderings, *schema, *restrictions);
@@ -2676,7 +2676,7 @@ select_statement::prepared_orderings_type select_statement::prepare_orderings(co
 // Then specifying ascending order would cause the results of this column to be reverse in comparison to a standard select.
 static bool are_column_select_results_reversed(const column_definition& column, select_statement::ordering_type column_ordering) {
     auto ordering = std::get_if<select_statement::ordering>(&column_ordering);
-    SCYLLA_ASSERT(ordering);
+    throwing_assert(ordering);
     if (*ordering == select_statement::ordering::ascending) {
         return column.type->is_reversed();
@@ -2757,11 +2757,7 @@ select_statement::ordering_comparator_type select_statement::get_ordering_compar
     // even if we don't
     // ultimately ship them to the client (CASSANDRA-4911).
     for (auto&& [column_def, is_descending] : orderings) {
-        auto index = selection.index_of(*column_def);
-        if (index < 0) {
-            index = selection.add_column_for_post_processing(*column_def);
-        }
+        auto index = selection.add_column_for_post_processing(*column_def);
         sorters.emplace_back(index, column_def->type);
     }
@@ -2864,9 +2860,7 @@ void select_statement::ensure_filtering_columns_retrieval(data_dictionary::datab
         selection::selection& selection,
         const restrictions::statement_restrictions& restrictions) {
     for (auto&& cdef : restrictions.get_column_defs_for_filtering(db)) {
-        if (!selection.has_column(*cdef)) {
-            selection.add_column_for_post_processing(*cdef);
-        }
+        selection.add_column_for_post_processing(*cdef);
     }
 }
