Commit Graph

274 Commits

Author SHA1 Message Date
Yaniv Michael Kaul
0fe552b074 cql3/query_processor: reformat code style
Reformat indentation, brace placement, lambda formatting, and
line wrapping for consistency.

The seastar logger already checks is_enabled() before formatting
arguments, so explicit guards around debug calls with simple
variable arguments are unnecessary.
AI-assisted: OpenCode / Claude Opus 4.6
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-03-24 18:30:39 +02:00
Piotr Dulikowski
d8b283e1fb Merge 'Add CQL forwarding for strongly consistent tables' from Wojciech Mitros
In this series we add support for forwarding strongly consistent CQL requests to suitable replicas, so that clients can issue reads/writes to any node and have the request executed on an appropriate tablet replica (and, for writes, on the Raft leader). We return the same CQL response as what the user would get while sending the request to the correct replica and we perform the same logging/stats updates on the request coordinator as if the coordinator was the appropriate replica.

The core mechanism of forwarding a strongly consistent request is sending an RPC containing the user's cql request frame to the appropriate replica and returning back a ready, serialized `cql_transport::response`. We do this in the CQL server - it is most prepared for handling these types and forwarding a request containing a CQL frame allows us to reuse near-top-level methods for CQL request handling in the new RPC handler (such as the general `process`)

For sending the RPC, the CQL server needs to obtain the information about who should it forward the request to. This requires knowledge about the tablet raft group members and leader. We obtain this information during the execution of a `cql3/strong_consistency` statement, and we return this information back to the CQL server using the generalized `bounce_to_shard` `response_message`, where we now store the information about either a shard, or a specific replica to which we should forward to. Similarly to `bounce_to_shard`, we need to handle this `result_message` in a loop - a replica may move during statement execution, or the Raft leader can change. We also use it for forwarding strongly consistent writes when we're not a member of the affected tablet raft group - in that case we need to forward the statement twice - once to any replica of the affected tablet, then that replica can find the leader and return this information to the coordinator, which allows the second request to be directed to the leader.

This feature also allows passing through exception messages which happened on the target replica while executing the statement. For that, many methods of the `cql_transport::cql_server::connection` for creating error responses needed to be moved to `cql_transport::cql_server`. And for final exception handling on the coordinator, we added additional error info to the RPC response, so that the handling can be performed without having the `result_message::exception` or `exception_ptr` itself.

Fixes [SCYLLADB-71](https://scylladb.atlassian.net/browse/SCYLLADB-71)

[SCYLLADB-71]: https://scylladb.atlassian.net/browse/SCYLLADB-71?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#27517

* github.com:scylladb/scylladb:
  test: add tests for CQL forwarding
  transport: enable CQL forwarding for strong consistency statements
  transport: add remote statement preparation for CQL forwarding
  transport: handle redirect responses in CQL forwarding
  transport: add exception handling for forwarded CQL requests
  transport: add basic CQL request forwarding
  idl: add a representation of client_state for forwarding
  cql_server: handle query, execute, batch in one case
  transport: inline process_on_shard in cql_server::process
  transport: extract process() to cql_server
  transport: add messaging_service to cql_server
  transport: add response reconstruction helpers for forwarding
  transport: generalize the bounce result message for bouncing to other nodes
  strong consistency: redirect requests to live replicas from the same rack
  transport: pass foreign_ptr into sleep_until_timeout_passes and move it to cql_server
  transport: extract the error handling from process_request_one
  transport: move error response helpers from connection to cql_server
2026-03-13 15:03:10 +01:00
Avi Kivity
76b6784c1a Merge 'cql3: track CQL parsing memory cost and use it for admission control' from Marcin Maliszkiewicz
Use rolling_max_tracker to record gross bytes allocated during each
CQL parse.  The rolling maximum is then added to the memory estimate
for incoming QUERY and PREPARE requests so that the admission control
in the CQL transport layer accounts for parsing overhead.

The measured memory footprint serves as upper bound rather than
exact number but it's purpose is to prevent OOMs under unprepared
statements heavy load.

In benchmark 1G memory node shows decrease of non-LSA memory usage
from peak 320MB (our coordinator budget is 10% of 1G) to 96MB. While
tps drops from 1.2 kops to 0.8 kops. Drop in tps is expected as
memory admission kicks in trying to prevent OOM.

This is phase 1 of OOM prevention, potential next steps:
- add second admission in query_processor::get_statement trying to prevent potential thundering herd problem
- decrease cql_server memory pool size
- count reads in the memory pool
- add per service level memory pool and a shared one

Related https://scylladb.atlassian.net/browse/SCYLLADB-740
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-938

Backport: no, new feature, but we may reconsider if some customer needs it

Closes scylladb/scylladb#28919

* github.com:scylladb/scylladb:
  cql3: track CQL parsing memory cost and use it for admission control
  utils: add rolling max tracker
2026-03-12 19:59:52 +02:00
Wojciech Mitros
e44820ba1f transport: generalize the bounce result message for bouncing to other nodes
In the following patches, we'll start allowing forwarding requests to strongly
consistent tables so that they'll get executed on the suitable tablet Raft group
members. For that we'll reuse the approach that we already have for bouncing
requests to other shards - we'll try to execute a request locally, and the
result of that will be a bounce message with another replica as the target.
In this patch we generalize the former bounce_to_shard result message so that
it will be able to specify the target of the bounce as another shard or specific
replica.
We also rename it to result_message::bounce so that it stops implying that only
another shard may be its target.
Aside from the host_id and the shard, the new message also includes the timeout,
because in the service handling the forwarding we won't have the access to it,
and it's needed for specifying how long we should wait for the forwarded
requests. It also includes an information whether this is a write request
to return correct timeout response in case the deadline is exceeded.
We will return other hosts in the new bounce message when executing requests to
strongly consistent tables when we can't handle the request because we aren't
a suitable replica. We can't handle this message yet, so we don't return it
anywhere and we still assume that every bounce message is a bounce to the same
host.
2026-03-12 17:48:57 +01:00
Marcin Maliszkiewicz
b277d9d9aa cql3: track CQL parsing memory cost and use it for admission control
Use rolling_max_tracker to record gross bytes allocated during each
CQL parse.  The rolling maximum is then added to the memory estimate
for incoming QUERY and PREPARE requests so that the admission control
in the CQL transport layer accounts for parsing overhead.

The measured memory footprint serves as upper bound rather than
exact number but it's purpose is to prevent OOMs under unprepared
statements heavy load.

In benchmark 1G memory node shows decrease of non-LSA memory usage
from peak 320MB (our coordinator budget is 10% of 1G) to 96MB. While
tps drops from 1.2 kops to 0.8 kops. Drop in tps is expected as
memory admission kicks in trying to prevent OOM.
2026-03-12 10:16:10 +01:00
Marcin Maliszkiewicz
2d22eea2f9 Merge 'cql3: Replace SCYLLA_ASSERT and abort by throwing_assert' from Nadav Har'El
In this patch we replace every single use of SCYLLA_ASSERT(), abort() and assert() in the cql3/ directory by throwing_assert().

The problem with SCYLLA_ASSERT()/abort()/assert() is that when it fails, it crashes Scylla. This is almost always a bad idea (see #7871 discussing why), but it's even riskier in front-end code like cql3/: In front-end code, there is a risk that due to a bug in our code, a specific user request can cause Scylla to crash. A malicious user can send this query to all nodes and crash the entire cluster. When the user is not malicious, it causes a small problem (a failing request) to become a much worse crash - and worse, the user has no idea which request is causing this crash and the crash will repeat if the same request is tried again.

All of this is solved by using the new throwing_assert(), which is the same as SCYLLA_ASSERT() but throws an exception (using on_internal_error()) instead of crashing. The exception will prevent the code path with the invalid assumption from continuing, but will result in only the current user request being aborted, with a clear error message reporting the internal server error due to an assertion failure.

I reviewed all the changes that I did in these patches to check that (to the best of my understanding) none of the assertions in cql3/ involve the sort of serious corruption that might require crashing the Scylla node entirely.

throwing_assert() also improves logging of assertion failures compared to the original SCYLLA_ASSERT()/abort() - SCYLLA_ASSERT() printed a message to stderr which in many installations is lost, and abort() often prints no message at all. But throwing_assert() uses Scylla's standard logger, and also includes a backtrace in the log message.

Fixes #13970 (Exorcise assertions from CQL code paths)
Refs #7871 (Exorcise assertions from Scylla)

Closes scylladb/scylladb#28847

* github.com:scylladb/scylladb:
  cql3: remove unnecessary assert()
  cql3: replace abort() by throwing_assert()
  cql3: Replace SCYLLA_ASSERT by throwing_assert
2026-03-12 09:09:24 +01:00
Nadav Har'El
c87d6407ed cql3: Replace SCYLLA_ASSERT by throwing_assert
In this patch we replace every single use of SCYLLA_ASSERT() in the cql3/
directory by throwing_assert().

The problem with SCYLLA_ASSERT() is that when it fails, it crashes Scylla.
This is almost always a bad idea (see #7871 discussing why), but it's even
riskier in front-end code like cql3/: In front-end code, there is a risk
that due to a bug in our code, a specific user request can cause Scylla
to crash. A malicious user can send this query to all nodes and crash
the entire cluster. When the user is not malicious, it causes a small
problem (a failing request) to become a much worse crash - and worse,
the user has no idea which request is causing this crash and the crash
will repeat if the same request is tried again.

All of this is solved by using the new throwing_assert(), which is the
same as SCYLLA_ASSERT() but throws an exception (using on_internal_error())
instead of crashing. The exception will prevent the code path with the
invalid assumption from continuing, but will result in only the current
user request being aborted, with a clear error message reporting the
internal server error due to an assertion failure.

I reviewed all the changes that I did in this patch to check that (to the
best of my understanding) none of the assertions in cql3/ involve the
sort of serious corruption that might require crashing the Scylla node
entirely.

throwing_assert() also improves logging of assertion failures compared
to the original SCYLLA_ASSERT() - SCYLLA_ASSERT() printed a message to
stderr which in many installations is lost, whereas throwing_assert()
uses Scylla's standard logger, and also includes a backtrace in the
log message.

Fixes #13970 (Exorcise assertions from CQL code paths)
Refs #7871 (Exorcise assertions from Scylla)

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-03-11 09:41:20 +02:00
Alex
3ac4e258e8 transport/messages: hold pinned prepared entry in PREPARE result
result_message::prepared now owns a strong pinned prepared-cache entry instead of relying only on a weak pointer view. This closes the remaining lifetime gap after query_processor::prepare() returns, so users of the returned PREPARE message cannot observe an invalidated weak handle during subsequent
  processing.

  - update result_message::prepared::cql constructor to accept pinned entry
  - construct weak view from owned pinned entry inside the message
  - pass pinned cache entry from query_processor::prepare() into the message constructor
2026-03-10 14:17:57 +02:00
Alex
27051d9a7c cql3: pin prepared cache entry in prepare() to avoid invalid weak handle race
query_processor::prepare() could race with prepared statement invalidation: after loading from the prepared cache, we converted the cached object to a checked weak pointer and then continued asynchronous work (including error-injection waitpoints). If invalidation happened in that window, the weak
  handle could no longer be promoted and the prepare path could fail nondeterministically.

  This change keeps a strong cache entry reference alive across the whole critical section in prepare() by using a pinned cache accessor (get_pinned()), and only deriving the weak handle while the entry is pinned. This removes the lifetime gap without adding retry loops.

  Test coverage was extended in test/cluster/test_prepare_race.py:

  - reproduces the invalidation-during-prepare window with injection,
  - verifies prepare completes successfully,
  - then invalidates again and executes the same stale client prepared object,
  - confirms the driver transparently re-requests/re-prepares and execution succeeds.

  This change introduces:

  - no behavior change for normal prepare flow besides stronger lifetime guarantees,
  - no new protocol semantics,
  - preserves existing cache invalidation logic,
  - adds explicit cluster-level regression coverage for both the race and driver reprepare path.
  - pushes the re prepare operation twards the driver, the server will return unprepared error for the first time and the driver will have to re prepare during execution stage
2026-03-10 14:17:57 +02:00
Andrzej Jackowski
371cdb3c81 cql3/query_processor: implement metrics to track CL of writes
Add `write_consistency_levels_disallowed_violations` and
`write_consistency_levels_warned_violations` metrics to track
violations of write_consistency_levels guardrails.

Add `writes_per_consistency_level` to track what CL is used by
writes, regardless of the guardrails configuration.
Data gathered by this metric can be used to decide whether enabling
a particular write consistency level guardrail in a particular
existing cluster is safe.

Refs: SCYLLADB-259
2026-03-03 21:18:11 +01:00
Andrzej Jackowski
3606934458 db: cql3/query_processor: add write_consistency_levels enum_sets
Add enum_sets to query_processor that track the configuration
values of `write_consistency_levels_warned` and
`write_consistency_levels_disallowed`.

Refs: SCYLLADB-259
2026-03-03 20:28:57 +01:00
Patryk Jędrzejczak
4e984139b2 Merge 'strongly consistent tables: basic implementation' from Petr Gusev
In this PR we add a basic implementation of the strongly-consistent tables:
* generate raft group id when a strongly-consistent table is created
* persist it into system.tables table
* start raft groups on replicas when a strongly-consistent tablet_map reaches them
* add strongly-consistent version of the storage_proxy, with the `query` and `mutate` methods
* the `mutate` method submits a command to the tablets raft group, the query method reads the data with `raft.read_barrier()`
* strongly-consistent versions of the `select_statement` and `modification_statement` are added
* a basic `test_strong_consistency.py/test_basic_write_read` is added which to check that we can write and read data in a strongly consistent fashion.

Limitations:
* for now the strongly consistent tables can have tablets only on shard zero. This is because we (ab/re) use the existing raft system tables which live only on shard0. In the next PRs we'll create separate tables for the new tablets raft groups.
* No Scylla-side proxying - the test has to figure out who is the leader and submit the command to the right node. This will be fixed separately.
* No tablet balancing -- migration/split/merges require separate complicated code.

The new behavior is hidden behind `STRONGLY_CONSISTENT_TABLES` feature, which is enabled when the `STRONGLY_CONSISTENT_TABLES` experimental feature flag is set.

Requirements, specs and general overview of the feature can be found [here](https://scylladb.atlassian.net/wiki/spaces/RND/pages/91422722/Strong+Consistency). Short term implementation plan is [here](https://docs.google.com/document/d/1afKeeHaCkKxER7IThHkaAQlh2JWpbqhFLIQ3CzmiXhI/edit?tab=t.0#heading=h.thkorgfek290)

One can check the strongly consistent writes and reads locally via cqlsh:
scylla.yaml:
```
experimental_features:
  - strongly-consistent-tables
```

cqlsh:
```
CREATE KEYSPACE IF NOT EXISTS my_ks WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 1} AND tablets = {'initial': 1} AND consistency = 'local';
CREATE TABLE my_ks.test (pk int PRIMARY KEY, c int);
INSERT INTO my_ks.test (pk, c) VALUES (10, 20);
SELECT * FROM my_ks.test WHERE pk = 10;
```

Fixes SCYLLADB-34
Fixes SCYLLADB-32
Fixes SCYLLADB-31
Fixes SCYLLADB-33
Fixes SCYLLADB-56

backport: no need

Closes scylladb/scylladb#27614

* https://github.com/scylladb/scylladb:
  test_encryption: capture stderr
  test/cluster: add test_strong_consistency.py
  raft_group_registry: disable metrics for non-0 groups
  strong consistency: implement select_statement::do_execute()
  cql: add select_statement.cc
  strong consistency: implement coordinator::query()
  cql: add modification_statement
  cql: add statement_helpers
  strong consistency: implement coordinator::mutate()
  raft.hh: make server::wait_for_leader() public
  strong_consistency: add coordinator
  modification_statement: make get_timeout public
  strong_consistency: add groups_manager
  strong_consistency: add state_machine and raft_command
  table: add get_max_timestamp_for_tablet
  tablets: generate raft group_id-s for new table
  tablet_replication_strategy: add consistency field
  tablets: add raft_group_id
  modification_statement: remove virtual where it's not needed
  modification_statement: inline prepare_statement()
  system_keyspace: disable tablet_balancing for strongly_consistent_tables
  cql: rename strongly_consistent statements to broadcast statements
2026-01-23 09:52:33 +01:00
Petr Gusev
7d111f2396 strong_consistency: add coordinator
Add the `coordinator` class, which will be responsible for coordinating
reads and writes to strongly consistent tables. This commit includes
only the boilerplate; the methods will be implemented in separate
commits.
2026-01-21 14:56:01 +01:00
Piotr Smaron
3ca6b59f80 cql: extend query_internal with query_state param
This later is going to be used to pass a query timeout via `qs` to
`query_internal`.
2026-01-20 09:15:48 +01:00
Botond Dénes
404f5b0808 cql3/query_processor: don't throw exceptions on the read-path
Use coroutine::try_future() to avoid exceptions taking flight and
triggering expensive stack-unwinding.
Especially bad for common exceptions like timeouts.
2026-01-13 10:47:57 +02:00
Aleksandra Martyniuk
2e7ba1f8ce cql3: reject concurrent alter of the same keyspace
Reject ALTER KEYSPACE request if there is unfinished (queued, pending,
or paused) alter request of the same keyspace.

This is required as in the following changes, global request queue
will contain rf change requests meant to be resumed.
2025-12-16 13:27:48 +01:00
Karol Nowacki
eae71d3e91 vector_store_client: Move to vector_search module
Vector search related implementation moved to a new module vector_search.
As the vector search functionality is going to be extended, it is
better to keep it in a separate module.
2025-09-22 08:01:47 +02:00
Gleb Natapov
041011b2ee qp: fold prepare_one function into its only caller 2025-07-31 14:12:34 +03:00
Petr Gusev
ff1caa9798 query_options: add node_local_only mode
We want to access the paxos state table only on the local node and
shard (or shards in case of intranode_migration). In this commit we
add a node_local_only flag to query_options, which allows to do that.
This flag can be set for a query via make_internal_options.

We handle this flag on the statements layer by forwarding it to
either coordinator_query_options or coordinator_mutate_options.
2025-07-24 19:48:08 +02:00
Avi Kivity
6fce817aa8 Merge 'Atomic in-memory schema changes application' from Marcin Maliszkiewicz
This change is preparing ground for state update unification for raft bound subsystems. It introduces schema_applier which in the future will become generic interface for applying mutations in raft.

Pulling database::apply() out of schema merging code will allow to batch changes to subsystems. Future generic code will first call prepare() on all implementations, then single database::apply() and then update() on all implementations, then on each shard it will call commit() for all implementations, without preemption so that the change is observed as atomic across all subsystems, and then post_commit().

Backport: no, it's a new feature

Fixes: https://github.com/scylladb/scylladb/issues/19649
Fixes https://github.com/scylladb/scylladb/issues/24531

Closes scylladb/scylladb#24886

[avi: adjust for std::vector<mutations> -> utils::chunked_vector<mutations>]

* github.com:scylladb/scylladb:
  test: add type creation to test_snapshot
  storage_service: always wake up load balancer on update tablet metadata
  db: schema_applier: call destroy also when exception occurs
  db: replica: simplify seeding ERM during shema change
  db: remove cleanup from add_column_family
  db: abort on exception during schema commit phase
  db: make user defined types changes atomic
  replica: db: make keyspace schema changes atomic
  db: atomically apply changes to tables and views
  replica: make truncate_table_on_all_shards get whole schema from table_shards
  service: split update_tablet_metadata into two phases
  service: pull out update_tablet_metadata from migration_listener
  db: service: add store_service dependency to schema_applier
  service: simplify load_tablet_metadata and update_tablet_metadata
  db: don't perform move on tablet_hint reference
  replica: split add_column_family_and_make_directory into steps
  replica: db: split drop_table into steps
  db: don't move map references in merge_tables_and_views()
  db: introduce commit_on_shard function
  db: access types during schema merge via special storage
  replica: make non-preemptive keyspace create/update/delete functions public
  replica: split update keyspace into two phases
  replica: split creating keyspace into two functions
  db: rename create_keyspace_from_schema_partition
  db: decouple functions and aggregates schema change notification from merging code
  db: store functions and aggregates change batch in schema_applier
  db: decouple tables and views schema change notifications from merging code
  db: store tables and views schema diff in schema_applier
  db: decouple user type schema change notifications from types merging code
  service: unify keyspace notification functions arguments
  db: replica: decouple keyspace schema change notifications to a separate function
  db: add class encapsulating schema merging
2025-07-13 20:47:55 +03:00
Benny Halevy
3feb759943 everywhere: use utils::chunked_vector for list of mutations
Currently, we use std::vector<*mutation> to keep
a list of mutations for processing.
This can lead to large allocation, e.g. when the vector
size is a function of the number of tables.

Use a chunked vector instead to prevent oversized allocations.

`perf-simple-query --smp 1` results obtained for fixed 400MHz frequency
and PGO disabled:

Before (read path):
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...

89055.97 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39417 insns/op,   18003 cycles/op,        0 errors)
103372.72 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39380 insns/op,   17300 cycles/op,        0 errors)
98942.27 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39413 insns/op,   17336 cycles/op,        0 errors)
103752.93 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39407 insns/op,   17252 cycles/op,        0 errors)
102516.77 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39403 insns/op,   17288 cycles/op,        0 errors)
throughput:
	mean=   99528.13 standard-deviation=6155.71
	median= 102516.77 median-absolute-deviation=3844.59
	maximum=103752.93 minimum=89055.97
instructions_per_op:
	mean=   39403.99 standard-deviation=14.25
	median= 39406.75 median-absolute-deviation=9.30
	maximum=39416.63 minimum=39380.39
cpu_cycles_per_op:
	mean=   17435.81 standard-deviation=318.24
	median= 17300.40 median-absolute-deviation=147.59
	maximum=18002.53 minimum=17251.75
```

After (read path)
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...
59755.04 tps ( 66.2 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39466 insns/op,   22834 cycles/op,        0 errors)
71854.16 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39417 insns/op,   17883 cycles/op,        0 errors)
82149.45 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.2 tasks/op,   39411 insns/op,   17409 cycles/op,        0 errors)
49640.04 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.3 tasks/op,   39474 insns/op,   19975 cycles/op,        0 errors)
54963.22 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.3 tasks/op,   39474 insns/op,   18235 cycles/op,        0 errors)
throughput:
	mean=   63672.38 standard-deviation=13195.12
	median= 59755.04 median-absolute-deviation=8709.16
	maximum=82149.45 minimum=49640.04
instructions_per_op:
	mean=   39448.38 standard-deviation=31.60
	median= 39466.17 median-absolute-deviation=25.75
	maximum=39474.12 minimum=39411.42
cpu_cycles_per_op:
	mean=   19267.01 standard-deviation=2217.03
	median= 18234.80 median-absolute-deviation=1384.25
	maximum=22834.26 minimum=17408.67
```

`perf-simple-query --smp 1 --write` results obtained for fixed 400MHz frequency
and PGO disabled:

Before (write path):
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=write, query_single_key=no, counters=no}
Disabling auto compaction
63736.96 tps ( 59.4 allocs/op,  16.4 logallocs/op,  14.3 tasks/op,   49667 insns/op,   19924 cycles/op,        0 errors)
64109.41 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   49992 insns/op,   20084 cycles/op,        0 errors)
56950.47 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50005 insns/op,   20501 cycles/op,        0 errors)
44858.42 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50014 insns/op,   21947 cycles/op,        0 errors)
28592.87 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50027 insns/op,   27659 cycles/op,        0 errors)
throughput:
	mean=   51649.63 standard-deviation=15059.74
	median= 56950.47 median-absolute-deviation=12087.33
	maximum=64109.41 minimum=28592.87
instructions_per_op:
	mean=   49941.18 standard-deviation=153.76
	median= 50005.24 median-absolute-deviation=73.01
	maximum=50027.07 minimum=49667.05
cpu_cycles_per_op:
	mean=   22023.01 standard-deviation=3249.92
	median= 20500.74 median-absolute-deviation=1938.76
	maximum=27658.75 minimum=19924.32
```

After (write path)
```
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=write, query_single_key=no, counters=no}
Disabling auto compaction
53395.93 tps ( 59.4 allocs/op,  16.5 logallocs/op,  14.3 tasks/op,   50326 insns/op,   21252 cycles/op,        0 errors)
46527.83 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50704 insns/op,   21555 cycles/op,        0 errors)
55846.30 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50731 insns/op,   21060 cycles/op,        0 errors)
55669.30 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50735 insns/op,   21521 cycles/op,        0 errors)
52130.17 tps ( 59.3 allocs/op,  16.0 logallocs/op,  14.3 tasks/op,   50757 insns/op,   21334 cycles/op,        0 errors)
throughput:
	mean=   52713.91 standard-deviation=3795.38
	median= 53395.93 median-absolute-deviation=2955.40
	maximum=55846.30 minimum=46527.83
instructions_per_op:
	mean=   50650.57 standard-deviation=182.46
	median= 50731.38 median-absolute-deviation=84.09
	maximum=50756.62 minimum=50325.87
cpu_cycles_per_op:
	mean=   21344.42 standard-deviation=202.86
	median= 21334.00 median-absolute-deviation=176.37
	maximum=21554.61 minimum=21060.24
```

Fixes #24815

Improvement for rare corner cases. No backport required

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#24919
2025-07-13 19:13:11 +03:00
Marcin Maliszkiewicz
2f840e51d1 service: pull out update_tablet_metadata from migration_listener
It's not a good usage as there is only one non-empty implementation.
Also we need to change it further in the following commit which
makes it incompatible with listener code.
2025-07-10 10:40:43 +02:00
Pawel Pery
7bf53fc908 vector_store_client: implement initial vector_store_client service
This patch is a part of vector_store_client sharded service
implementation for a communication with vector-store service.

It adds a `services/vector_store_client.{cc|hh}` sharded service and a
configuration parameter `vector_store_uri` with a
`http://vector-store.dns.name:port` format. If there will be an error
during parsing that parameter there will be an exception during
construction.

For the future unit testing purposes the patch adds
`vector_store_client_tester` as a way to inject mockup functionality.

This service will be used by the select statements for the Vector search
indexes (see VS-46). For this reason I've added vector_store_client
service in the query processor.

Reference: VS-47 VS-45
2025-07-08 16:29:55 +02:00
Avi Kivity
cd79a8fc25 Revert "Merge 'Atomic in-memory schema changes application' from Marcin Maliszkiewicz"
This reverts commit 0b516da95b, reversing
changes made to 30199552ac. It breaks
cluster.random_failures.test_random_failures.test_random_failures
in debug mode (at least).

Fixes #24513
2025-06-16 22:38:12 +03:00
Marcin Maliszkiewicz
21a5a3c01f service: pull out update_tablet_metadata from migration_listener
It's not a good usage as there is only one non-empty implementation.
Also we need to change it further in the following commit which
makes it incompatible with listener code.
2025-06-06 08:50:33 +02:00
Benny Halevy
f8d5835cab query_processor: remote: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:48 +03:00
Amnon Heiman
0dde54d053 query_processor.cc: label metrics with basic_level
The following metrics will be marked with basic_level label:
scylla_cql_authorized_prepared_statements_cache_evictions
scylla_cql_batches
scylla_cql_deletes
scylla_cql_deletes_per_ks
scylla_cql_filtered_read_requests
scylla_cql_filtered_rows_dropped_total
scylla_cql_filtered_rows_matched_total
scylla_cql_filtered_rows_read_total
scylla_cql_inserts
scylla_cql_inserts_per_ks
scylla_cql_prepared_cache_evictions
scylla_cql_reads
scylla_cql_reads_per_ks
scylla_cql_reverse_queries
scylla_cql_rows_read
scylla_cql_secondary_index_reads
scylla_cql_select_bypass_caches
scylla_cql_select_partition_range_scan_no_bypass_cache
scylla_cql_statements_in_batches
scylla_cql_unpaged_select_queries
scylla_cql_unpaged_select_queries_per_ks
scylla_cql_updates
scylla_cql_updates_per_ks
2025-03-03 16:58:38 +02:00
Paweł Zakrzewski
98f5e49ea8 audit: Add support to CQL statements
Integrates audit functionality into CQL statement processing to enable tracking of database operations. Key changes:

- Add audit_info and statement_category to all CQL statements
- Implement audit categories for different statement types:
  - DDL: Schema altering statements (CREATE/ALTER/DROP)
  - DML: Data manipulation (INSERT/UPDATE/DELETE/TRUNCATE/USE)
  - DCL: Access control (GRANT/REVOKE/CREATE ROLE)
  - QUERY: SELECT statements
  - ADMIN: Service level operations

- Add audit inspection points in query processing:
  - Before statement execution
  - After access checks
  - After statement completion
  - On execution failures

- Add password sanitization for role management statements
  - Mask plaintext passwords in audit logs
  - Handle both direct password parameters and options maps
  - Preserve query structure while hiding sensitive data

- Modify prepared statement lifecycle to carry audit context
  - Pass audit info during statement preparation
  - Track audit info through statement execution
  - Support batch statement auditing

This change enables comprehensive auditing of CQL operations while ensuring sensitive data is properly masked in audit logs.
2025-01-15 11:10:36 +01:00
Avi Kivity
f3eade2f62 treewide: relicense to ScyllaDB-Source-Available-1.0
Drop the AGPL license in favor of a source-available license.
See the blog post [1] for details.

[1] https://www.scylladb.com/2024/12/18/why-were-moving-to-a-source-available-license/
2024-12-18 17:45:13 +02:00
Amnon Heiman
1f688bc670 cql3/query_processor.cc: Add skip_when_empty to metrics
This patch introduces the skip_when_empty flag to all CQL counters that
previously lacked this setting.

The skip_when_empty flag is a metric optimization that prevents
reporting on counters that have never been used. Once a counter has been
used (i.e., it holds a positive value), it will continue to be reported
consistently from that point onward.

Fixes #21046

Signed-off-by: Amnon Heiman <amnon@scylladb.com>

Closes scylladb/scylladb#21565
2024-11-24 17:30:46 +02:00
muthu90tech
0ea0234a7a Avoid unnecessary copy in query_processor::execute_direct_without_checking_exception_message
instead of making a copy of the warnings vector, make the warnings a non const in prepared_statement
and move the warnings vector to execute_maybe_with_guard

Closes scylladb/scylladb#20361

Closes scylladb/scylladb#21083
2024-11-22 13:34:31 +02:00
Nadav Har'El
b778ce08a9 cql3: change sstring_view to std::string_view
Our "sstring_view" is an historic alias for the standard std::string_view.
The cql3/ directory used this old alias in a few of random places, let's
change them to use the standard type name.

Refs #4062.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-11-18 15:57:20 +02:00
Kefu Chai
3e84d43f93 treewide: use seastar::format() or fmt::format() explicitly
before this change, we rely on `using namespace seastar` to use
`seastar::format()` without qualifying the `format()` with its
namespace. this works fine until we changed the parameter type
of format string `seastar::format()` from `const char*` to
`fmt::format_string<...>`. this change practically invited
`seastar::format()` to the club of `std::format()` and `fmt::format()`,
where all members accept a templated parameter as its `fmt`
parameter. and `seastar::format()` is not the best candidate anymore.
despite that argument-dependent lookup (ADT for short) favors the
function which is in the same namespace as its parameter, but
`using namespace` makes `seastar::format()` more competitive,
so both `std::format()` and `seastar::format()` are considered
as the condidates.

that is what is happening scylladb in quite a few caller sites of
`format()`, hence ADT is not able to tell which function the winner
in the name lookup:

```
/__w/scylladb/scylladb/mutation/mutation_fragment_stream_validator.cc:265:12: error: call to 'format' is ambiguous
  265 |     return format("{} ({}.{} {})", _name_view, s.ks_name(), s.cf_name(), s.id());
      |            ^~~~~~
/usr/bin/../lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/format:4290:5: note: candidate function [with _Args = <const std::basic_string_view<char> &, const seastar::basic_sstring<char, unsigned int, 15> &, const seastar::basic_sstring<char, unsigned int, 15> &, const utils::tagged_uuid<table_id_tag> &>]
 4290 |     format(format_string<_Args...> __fmt, _Args&&... __args)
      |     ^
/__w/scylladb/scylladb/seastar/include/seastar/core/print.hh:143:1: note: candidate function [with A = <const std::basic_string_view<char> &, const seastar::basic_sstring<char, unsigned int, 15> &, const seastar::basic_sstring<char, unsigned int, 15> &, const utils::tagged_uuid<table_id_tag> &>]
  143 | format(fmt::format_string<A...> fmt, A&&... a) {
      | ^
```

in this change, we

change all `format()` to either `fmt::format()` or `seastar::format()`
with following rules:
- if the caller expects an `sstring` or `std::string_view`, change to
  `seastar::format()`
- if the caller expects an `std::string`, change to `fmt::format()`.
  because, `sstring::operator std::basic_string` would incur a deep
  copy.

we will need another change to enable scylladb to compile with the
latest seastar. namely, to pass the format string as a templated
parameter down to helper functions which format their parameters.
to miminize the scope of this change, let's include that change when
bumping up the seastar submodule. as that change will depend on
the seastar change.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-09-11 23:21:40 +03:00
Avi Kivity
d69bf4f010 cql3: introduce dialect infrastructure
A dialect is a different way to interpret the same CQL statement.

Examples:
 - how duplicate bind variable names are handled (later in this series)
 - whether `column = NULL` in LWT can return true (as is now) or
   whether it always returns NULL (as in SQL)

Currently, dialect is an empty structure and will be filled in later.
It is passed to query_processor methods that also accept a CQL string,
and from there to the parser. It is part of the prepared statement cache
key, so that if the dialect is changed online, previous parses of the
statement are ignored and the statement is prepared again.

The patch is careful to pick up the dialect at the entry point (e.g.
CQL protocol server) so that the dialect doesn't change while a statement
is parsed, prepared, and cached.
2024-08-29 21:19:23 +03:00
Botond Dénes
2cec0d8dd1 service/migration_listener: update_tablet_metadata(): add hint parameter
The hint contains information related to what exactly changed, allowing
listeners to do partial updates, instead of reloading all metadata on
each notification.
2024-08-11 09:53:19 -04:00
Kefu Chai
ee80742c39 cql3: do not include unused headers
these unused includes were identified by clangd. see
https://clangd.llvm.org/guides/include-cleaner#unused-include-warning
for more details on the "Unused include" warning.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#19906
2024-07-28 17:29:07 +03:00
Avi Kivity
3fc4e23a36 forward_service: rename to mapreduce_service
forward_service is nondescriptive and misnamed, as it does more than
forward requests. It's a classic map/reduce algorithm (and in fact one
of its parameters is "reducer"), so name it accordingly.

The name "forward" leaked into the wire protocol for the messaging
service RPC isolation cookie, so it's kept there. It's also maintained
in the name of the logger (for "nodetool setlogginglevel") for
compatibility with tests.

Closes scylladb/scylladb#19444
2024-07-03 19:29:47 +03:00
Botond Dénes
4e96e320b4 cql3/query_processor: for_each_cql_result(): move func to the coro frame
Said method has a func parameter (called just f), which it receives as
rvalue ref and just uses as a reference. This means that if caller
doesn't keep the func alive, for_each_cql_result() will run into
use-after-free after the first suspention point. This is unexpected for
callers, who don't expect to have to keep something alive, which they
passed in with std::move().
Adjust the signature to take a value instead, value parameters are moved
to the coro frame and survive suspention points.
Adjust internal callers (query_internal()) the same way.

There are no known vulnerable external callers.
2024-06-25 06:15:25 -04:00
Kefu Chai
7b10cc8079 treewide: include seastar headers with brackets
this change was created in the same spirit of ebff5f5d.

despite that we include Seastar as a submodule, Seastar is not a
part of scylla project. so we'd better include its headers using
brackets.

ebff5f5d addressed this cosmetic issue a while back. but probably
clangd's header-insertion helped some of contributor to insert
the missing headers with `"`. so this style of `include` returned
to the tree with these new changes.

unfortunately, clangd does not allow us to configure the style
of `include` at the time of writing.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#19406
2024-06-21 19:20:27 +03:00
Marcin Maliszkiewicz
56707e2965 cql: remove global_req_id from schema_altering_statement
Such field is no longer needed as the information comes
directly from group0_batch.

Fixes scylladb/scylladb#19365
2024-06-18 20:26:09 +02:00
Nadav Har'El
4faceeaa33 Merge 'treewide: drop thrift support' from Kefu Chai
thrift support was deprecated since ScyllaDB 5.2

> Thrift API - legacy ScyllaDB (and Apache Cassandra) API is
> deprecated and will be removed in followup release. Thrift has
> been disabled by default.

so let's drop it. in this change,

* thrift protocol support is dropped
* all references to thrift support in document are dropped
* the "thrift_version" column in system.local table is preserved for backward compatibility, as we could load from an existing system.local table which still contains this clolumn, so we need to write this column as well.
* "/storage_service/rpc_server" is only preserved for backward compatibility with java-based nodetool.

Fixes #3811
Fixes #18416
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

- [x] not a fix, no need to backport

Closes scylladb/scylladb#18453

* github.com:scylladb/scylladb:
  config: expand on rpc_keepalive's description
  api: s/rpc/thrift/
  db/system_keyspace: drop thrift_version from system.local table
  transport: do not return client_type from cql_server::connection::make_client_key()
  treewide: drop thrift support
2024-06-17 22:36:49 +03:00
Piotr Dulikowski
85128c5b10 Merge 'cql3: always return created event in create keyspace statement' from Marcin Maliszkiewicz
cql3: always return created event in create ks/table/type/view statement

In case multiple clients issue concurrently CREATE KEYSPACE IF NOT EXISTS
and later USE KEYSPACE it can happen that schema in driver's session is
out of sync because it synces when it receives special message from
CREATE KEYSPACE response.

Similar situation occurs with other schema change statements.

In this patch we fix only create keyspace/table/type/view statements
by always sending created event. Behavior of any other schema altering
statements remains unchanged.

Fixes https://github.com/scylladb/scylladb/issues/16909

**backport: no, it's not a regression**

Closes scylladb/scylladb#18819

* github.com:scylladb/scylladb:
  cql3: always return created event in create ks/table/type/view statement
  cql3: auth: move auto-grant closer to resource creation code
  cql3: extract create ks/table/type/view event code
2024-06-17 19:58:38 +02:00
Pavel Emelyanov
f950469af5 lang: Move manager to lang namespace
And, while at it, rename local variable to refer to it to as "manager"
not "wasm". Query processor and database also have getters named
"wasm()", these are not renamed yet to keep patch smaller (and those
getters are going to be reworked further anyway).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-06-07 12:35:57 +03:00
Marcin Maliszkiewicz
c13fea371c cql3: always return created event in create ks/table/type/view statement
In case multiple clients issue concurrently CREATE KEYSPACE IF NOT EXISTS
and later USE KEYSPACE it can happen that schema in driver's session is
out of sync because it synces when it receives special message from
CREATE KEYSPACE response.

Similar situation occurs with other schema change statements.

In this patch we fix only create keyspace/table/type/view statements
by always sending created event. Behavior of any other schema altering
statements remains unchanged.
2024-06-07 10:36:40 +02:00
Marcin Maliszkiewicz
f6108a72d3 cql3: auth: move auto-grant closer to resource creation code
This should reduce the risk of re-introducing issue similar to
the one fixed in ab6988c52f

When grant code is closer to actual creation code (announcing mutations)
there is lower chance of those two effects being triggered differently,
if we ever call grant_permissions_to_creator and not announce mutations
that's very likely a security vulnerability.

Additionally comment was rewritten to be more accurate.
2024-06-07 10:26:32 +02:00
Kefu Chai
ad649be1bf treewide: drop thrift support
thrift support was deprecated since ScyllaDB 5.2

> Thrift API - legacy ScyllaDB (and Apache Cassandra) API is
> deprecated and will be removed in followup release. Thrift has
> been disabled by default.

so let's drop it. in this change,

* thrift protocol support is dropped
* all references to thrift support in document are dropped
* the "thrift_version" column in system.local table is
  preserved for backward compatibility, as we could load
  from an existing system.local table which still contains
  this clolumn, so we need to write this column as well.
* "/storage_service/rpc_server" is only preserved for
  backward compatibility with java-based nodetool.
* `rpc_port` and `start_rpc` options are preserved, but
  they are marked as "Unused". so that the new release
  of scylladb can consume existing scylla.yaml configurations
  which might contain these settings. by making them
  deprecated, user will be able get warned, and update
  their configurations before we actually remove them
  in the next major release.

Fixes #3811
Fixes #18416
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-06-07 06:44:59 +08:00
Marcin Maliszkiewicz
63e6334a64 raft: rename mutations_collector to group0_batch 2024-06-06 13:26:34 +02:00
Marcin Maliszkiewicz
5ba7d1b116 cql3: auth: use mutation collector for drop role and auto-revoke
The main theme of this commit is executing drop
keyspace/table/aggregate/function statements in a single
transaction together with auth auto-revoke logic.
This is the logic which cleans related permissions after
resource is deleted.

It contains serveral parts which couldn't easily be split
into separate commits mainly because mutation collector related
paths can't be mixed together. It would require holding multiple
guards which we don't support. Another reason is that with mutation
collector the changes are announced in a single place, at the end
of statement execution, if we'd announce something in the middle
then it'd lead to raft concurrent modification infinite loop as it'd
invalidate our guard taken at the begining of statement execution.

So this commit contains:

- moving auto-revoke code to statement execution from migration_listener
 * only for auth-v2 flow, to not break the old one
 * it's now executed during statement execution and not merging schemas,
   which means it produces mutations once as it should and not on each
   node separately
 * on_before callback family wasn't used because I consider it much
   less readable code. Long term we want to remove
   auth_migration_listener.

- adding mutation collector to revoke_all
 * auto-revoke uses this function so it had to be changed,
   auth::revoke_all free function wrapper was added as cql3
   layer should not use underlying_authorizer() directly.

- adding mutation collector to drop_role
 * because it depends on revoke_all and we can't mix old and new flows
 * we need to switch all functions auth::drop_role call uses
 * gradual use of previously introduced modify_membership, otherwise
   we would need to switch even more code in this commit
2024-06-04 15:43:04 +02:00
Marcin Maliszkiewicz
0573fee2a9 cql3: auth: use mutation collector for grant and revoke permissions
This is done to achieve single transaction semantics.

The change includes auto-grant feature. In particular
for schema related auto-grant we don't use normal
mutation collector announce path but follow migration manager,
this may be unified in the future.
2024-06-04 15:43:04 +02:00
Pavel Emelyanov
273dca6f27 query_processor: Coroutinize stop()
This effectively removes "finally" block so if
authorized_prepared_cache.stop() resolves with exception, the
prepared_cache.stop() is skipped. But that's not a problem -- even if
.stop() throws the shole scylla stop aborts so we don't really care if
it was clean or not.

Also, authorized_prepared_cache.stop() closes the gate and cancels the
timer. None of those can resolve with exception.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#19001
2024-05-31 10:22:08 +03:00