utils/loading_cache.hh is an expensive template header that costs
~2,494 seconds of aggregate CPU time across 133 files that include it.
88 of those files include it only transitively via query_processor.hh
through the chain: query_processor.hh -> prepared_statements_cache.hh
-> loading_cache.hh, costing ~1,690s of template instantiation.
Break the chain by:
- Replacing #include of prepared_statements_cache.hh and
authorized_prepared_statements_cache.hh in query_processor.hh with
forward declarations and the lightweight prepared_cache_key_type.hh
- Replacing #include of result_message.hh with result_message_base.hh
(which doesn't pull in prepared_statements_cache.hh)
- Changing prepared_statements_cache and authorized_prepared_statements_cache
members to std::unique_ptr (PImpl) since forward-declared types
cannot be held by value
- Moving get_prepared(), execute_prepared(), execute_direct(), and
execute_batch() method bodies from the header to query_processor.cc
- Updating transport/server.cc to use the concrete type instead of the
no-longer-visible authorized_prepared_statements_cache::value_type
Per-file measurement: files including query_processor.hh now show zero
loading_cache template instantiation events (previously 20-32s each).
Wall-clock measurement (clean build, -j16, 16 cores, Seastar cached):
Baseline (origin/master): avg 24m01s (24m03s, 23m59s)
With loading_cache chain break: avg 23m29s (23m32s, 23m29s, 23m27s)
Improvement: ~32s, ~2.2%
Move prepared_cache_key_type class and its std::hash / fmt::formatter
specializations from prepared_statements_cache.hh into a new header
cql3/prepared_cache_key_type.hh.
The new header only depends on bytes.hh, utils/hash.hh, and
cql3/dialect.hh -- it does NOT include utils/loading_cache.hh.
This allows code that needs the cache key type (e.g. for function
signatures) without pulling in the expensive loading_cache template
machinery.
prepared_statements_cache.hh now includes prepared_cache_key_type.hh,
so existing includers are unaffected.
No functional change. Prepares for breaking the loading_cache include
chain from query_processor.hh.
Add explicit #include directives for headers that are currently
available transitively through cql3/query_processor.hh but will stop
being available after a subsequent refactoring that removes the
loading_cache include chain.
Files changed:
- cql3/statements/drop_keyspace_statement.cc: add unimplemented.hh
- cql3/statements/truncate_statement.cc: add unimplemented.hh
- cql3/statements/batch_statement.cc: add result_message.hh
- cql3/statements/broadcast_modification_statement.cc: add result_message.hh
- service/paxos/paxos_state.cc: add result_message.hh
- test/lib/cql_test_env.cc: add result_message.hh
- table_helper.cc: add result_message.hh
No functional change. Prepares for subsequent query_processor.hh cleanup.
Include non-primary key restrictions (e.g. regular column filters) in
the filter JSON sent to the Vector Store service. Previously only
partition key and clustering column restrictions were forwarded, so
filtering on regular columns was silently ignored.
Add get_nonprimary_key_restrictions() getter to statement_restrictions.
Add unit tests for non-primary key equality, range, and bind marker
restrictions in filter_test.
Fixes: SCYLLADB-970
Closesscylladb/scylladb#29019
Motivation
----------
Since strongly consistent tables are based on the concept of Raft
groups, operations on them can get stuck for indefinite amounts of
time. That may be problematic, and so we'd like to implement a way
to cancel those operations at suitable times.
Description of solution
-----------------------
The situations we focus on are the following:
* Timed-out queries
* Leader changes
* Tablet migrations
* Table drops
* Node shutdowns
We handle each of them and provide validation tests.
Implementation strategy
-----------------------
1. Auxiliary commits.
2. Abort operations on timeout.
3. Abort operations on tablet removal.
4. Extend `client_state`.
5. Abort operation on shutdown.
6. Help `state_machine` be aborted as soon as possible.
Tests
-----
We provide tests that validate the correctness of the solution.
The total time spent on `test_strong_consistency.py`
(measured on my local machine, dev mode):
Before:
```
real 0m31.809s
user 1m3.048s
sys 0m21.812s
```
After:
```
real 0m34.523s
user 1m10.307s
sys 0m27.223s
```
The incremental differences in time can be found in the commit messages.
Fixes SCYLLADB-429
Backport: not needed. This is an enhancement to an experimental feature.
Closesscylladb/scylladb#28526
* github.com:scylladb/scylladb:
service: strong_consistency: Abort state_machine::apply when aborting server
service: strong_consistency: Abort ongoing operations when shutting down
service: client_state: Extend with abort_source
service: strong_consistency: Handle abort when removing Raft group
service: strong_consistency: Abort Raft operations on timeout
service: strong_consistency: Use timeout when mutating
service: strong_consistency: Fix indentation
service: strong_consistency: Enclose coordinator methods with try-catch
service: strong_consistency: Crash at unexpected exception
test: cluster: Extract default config & cmdline in test_strong_consistency.py
Accept the Cassandra SAI 'source_model' option for vector indexes.
This option is used by Cassandra libraries (e.g., CassIO, LangChain)
to tag vector indexes with the name of the embedding model that
produced the vectors.
ScyllaDB does not use the source_model value but stores it and
includes it in the DESCRIBE INDEX output for Cassandra compatibility.
Additionally, extend vector_index::describe() to emit a
WITH OPTIONS = {...} clause containing all user-provided index options
(filtering out system keys: target, class_name, index_version).
This makes options like similarity_function, source_model, etc.
visible in DESCRIBE output.
Libraries such as CassIO, LangChain, and LlamaIndex create vector
indexes using Cassandra's StorageAttachedIndex (SAI) class name.
This commit lets ScyllaDB accept these statements without modification.
When a CREATE CUSTOM INDEX statement specifies an SAI class name on a
vector column, ScyllaDB automatically rewrites it to the native
vector_index implementation. Accepted class names (case-insensitive):
- org.apache.cassandra.index.sai.StorageAttachedIndex
- StorageAttachedIndex
- sai
SAI on non-vector columns is rejected with a clear error directing
users to a secondary index instead.
The SAI detection and rewriting logic is extracted into a dedicated
static function (maybe_rewrite_sai_to_vector_index) to keep the
already-long validate_while_executing method manageable.
Multi-column (local index) targets and nonexistent columns are
skipped with continue — the former are treated as filtering columns
by vector_index::check_target(), and the latter are caught later by
vector_index::validate().
Tests that exercise features common to both backends (basic creation,
similarity_function, IF NOT EXISTS, bad options, etc.) now use the
SAI class name with the skip_on_scylla_vnodes fixture so they run
against both ScyllaDB and Cassandra. ScyllaDB-specific tests continue
to use USING 'vector_index' with scylla_only.
These changes are complementary to those from a recent commit where we
handled aborting ongoing operations during tablet events, such as
tablet migration. In this commit, we consider the case of shutting down
a node.
When a node is shutting down, we eventually close the connections. When
the client can no longer get a response from the server, it makes no
sense to continue with the queries. We'd like to cancel them at that
point.
We leverage the abort source passed down via `client_state` down to
the strongly consistent coordinator. This way, the transport layer can
communicate with it and signal that the queries should be canceled.
The abort source is triggered by the CQL server (cf.
`generic_server::server::{stop,shutdown}`).
---
Note that this is not an optional change. In fact, if we don't abort
those requests, we might hang for an indefinite amount of time when
executing the following code in `main.cc`:
```
// Register at_exit last, so that storage_service::drain_on_shutdown will be called first
auto do_drain = defer_verbose_shutdown("local storage", [&ss] {
ss.local().drain_on_shutdown().get();
});
```
The problem boils down to the fact that `generic_server::server::stop`
will wait for all connections to be closed, but that won't happen until
all ongoing operations (at least those to strongly consistent tables)
are finished.
It's important to highlight that even though we hang on this, the
client can no longer get any response. Thus, it's crucial that at that
point we simply abort ongoing operations to proceed with the rest of
shutdown.
---
Two tests are added to verify that the implementation is correct:
one focusing on local operations, the other -- on a forwarded write.
Difference in time spent on the whole test file
`test_strong_consistency.py` on my local machine, in dev mode:
Before:
```
real 0m31.775s
user 1m4.475s
sys 0m22.615s
```
After:
```
real 0m32.024s
user 1m10.751s
sys 0m23.871s
```
Individual runs of the added tests:
test_queries_when_shutting_down:
```
real 0m12.818s
user 0m36.726s
sys 0m4.577s
```
test_abort_forwarded_write_upon_shutdown:
```
real 0m12.930s
user 0m36.622s
sys 0m4.752s
```
We remove the inconsistency between reads and writes to strongly
consistent tables. Before the commit, only reads used a timeout.
Now, writes do as well.
Although the parameter isn't used yet, that will change in the following
commit. This is a prerequisite for it.
Queries against local vector indexes were failing with the error:
"ANN ordering by vector requires the column to be indexed using 'vector_index'"
This was a regression introduced by 15788c3734, which incorrectly
assumed the first column in the targets list is always the vector column.
For local vector indexes, the first column is the partition key, causing
the failure.
Previously, serialization logic for the target index option was shared
between vector and secondary indexes. This is no longer viable due to
the introduction of local vector indexes and vector indexes with filtering
columns, which have different target format.
This commit introduces a dedicated JSON-based serialization format for
vector index targets, identifying the target column (tc), filtering
columns (fc), and partition key columns (pk). This ensures unambiguous
serialization and deserialization for all vector index types.
This change is backward compatible for regular vector indexes. However,
it breaks compatibility for local vector indexes and vector indexes with
filtering columns created in version 2026.1.0. To mitigate this, usage
of these specific index types will be blocked in the 2026.1.0 release
by failing ANN queries against them in vector-store service.
Fixes: SCYLLADB-895
This patch series introduces a new documentation for exiting guardrails.
Moreover:
- Warning / failure messages of recently added write CL guardrails (SCYLLADB-259) are rephrased, so all guardrails have similar messages.
- Some new tests are added, to help verify the correctness of the documentation and avoid situations where the documentation and implementation diverge.
Fixes: [SCYLLADB-257](https://scylladb.atlassian.net/browse/SCYLLADB-257)
No backport, just new docs and tests.
[SCYLLADB-257]: https://scylladb.atlassian.net/browse/SCYLLADB-257?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQClosesscylladb/scylladb#29011
* github.com:scylladb/scylladb:
test: add new guardrail tests matching documentation scenarios
test: add metric assertions to guardrail replication strategy tests
test: use regex matching in guardrail replication strategy tests
test: extract ks_opts helper in test_guardrail_replication_strategy
docs: document CQL guardrails
cql: improve write consistency level guardrail messages
In this series we add support for forwarding strongly consistent CQL requests to suitable replicas, so that clients can issue reads/writes to any node and have the request executed on an appropriate tablet replica (and, for writes, on the Raft leader). We return the same CQL response as what the user would get while sending the request to the correct replica and we perform the same logging/stats updates on the request coordinator as if the coordinator was the appropriate replica.
The core mechanism of forwarding a strongly consistent request is sending an RPC containing the user's cql request frame to the appropriate replica and returning back a ready, serialized `cql_transport::response`. We do this in the CQL server - it is most prepared for handling these types and forwarding a request containing a CQL frame allows us to reuse near-top-level methods for CQL request handling in the new RPC handler (such as the general `process`)
For sending the RPC, the CQL server needs to obtain the information about who should it forward the request to. This requires knowledge about the tablet raft group members and leader. We obtain this information during the execution of a `cql3/strong_consistency` statement, and we return this information back to the CQL server using the generalized `bounce_to_shard` `response_message`, where we now store the information about either a shard, or a specific replica to which we should forward to. Similarly to `bounce_to_shard`, we need to handle this `result_message` in a loop - a replica may move during statement execution, or the Raft leader can change. We also use it for forwarding strongly consistent writes when we're not a member of the affected tablet raft group - in that case we need to forward the statement twice - once to any replica of the affected tablet, then that replica can find the leader and return this information to the coordinator, which allows the second request to be directed to the leader.
This feature also allows passing through exception messages which happened on the target replica while executing the statement. For that, many methods of the `cql_transport::cql_server::connection` for creating error responses needed to be moved to `cql_transport::cql_server`. And for final exception handling on the coordinator, we added additional error info to the RPC response, so that the handling can be performed without having the `result_message::exception` or `exception_ptr` itself.
Fixes [SCYLLADB-71](https://scylladb.atlassian.net/browse/SCYLLADB-71)
[SCYLLADB-71]: https://scylladb.atlassian.net/browse/SCYLLADB-71?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQClosesscylladb/scylladb#27517
* github.com:scylladb/scylladb:
test: add tests for CQL forwarding
transport: enable CQL forwarding for strong consistency statements
transport: add remote statement preparation for CQL forwarding
transport: handle redirect responses in CQL forwarding
transport: add exception handling for forwarded CQL requests
transport: add basic CQL request forwarding
idl: add a representation of client_state for forwarding
cql_server: handle query, execute, batch in one case
transport: inline process_on_shard in cql_server::process
transport: extract process() to cql_server
transport: add messaging_service to cql_server
transport: add response reconstruction helpers for forwarding
transport: generalize the bounce result message for bouncing to other nodes
strong consistency: redirect requests to live replicas from the same rack
transport: pass foreign_ptr into sleep_until_timeout_passes and move it to cql_server
transport: extract the error handling from process_request_one
transport: move error response helpers from connection to cql_server
Update warn and fail messages for the write_consistency_levels_warned
and write_consistency_levels_disallowed guardrails to include the
configuration option name and actionable guidance. The main motivation
is to make the messages follow the conventions of other guardrails.
Refs: SCYLLADB-257
This is short cleanup after recent removal of creating default cassandra superuser and auth-v1 code removal.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1036
Backport: no, just code cleanup
Closesscylladb/scylladb#29004
* github.com:scylladb/scylladb:
auth: remove DEFAULT_SUPERUSER_NAME constant and dead DEFAULT_USER_PASSWORD
auth: use configurable default_superuser in describe_roles
auth: move default_superuser to common, remove _superuser member
auth: use LOCAL_ONE for all auth queries
auth: remove get_auth_ks_name indirection
can_use_effective_service_level_cache() always returns true now, so the function can be dropped entirely and all the code that assumes it may return false can be dropped as well. Also drop async versions of find_effective_service_level and get_user_scheduling_group since they are unused.
No need to backport, code removal,
Closesscylladb/scylladb#29002
* github.com:scylladb/scylladb:
service level: make maybe_update_per_service_level_params synchronous
service level: remove unused get_user_scheduling_group function
service level: drop async find_effective_service_level
service level: remove remnants of version 1 service level
Add basic cluster tests for CQL forwarding.
The test cases include:
- basic reads and writes
- prepared statements with binds
- forwarding from a non-replica
- exception passthrough during forwarding (using an injection)
- re-preparing a statement on the target node, even if the user
query is also an EXECUTE request on a prepared statement
- verification metric updates
The existing test_basic_write_read was modified so that a few extra
cases could be validated on the same cluster.
We enable CQL forwarding by starting to return the bounce_to_node
result message in redirect_statement() instead of throwing. The
forwarding code introduced in the preceding patches reacts to these
messages, allowing the requests to be forwarded.
With the update, some tests assuming that requests can't be forwarded
need to be adjusted, so we do that as well.
Use rolling_max_tracker to record gross bytes allocated during each
CQL parse. The rolling maximum is then added to the memory estimate
for incoming QUERY and PREPARE requests so that the admission control
in the CQL transport layer accounts for parsing overhead.
The measured memory footprint serves as upper bound rather than
exact number but it's purpose is to prevent OOMs under unprepared
statements heavy load.
In benchmark 1G memory node shows decrease of non-LSA memory usage
from peak 320MB (our coordinator budget is 10% of 1G) to 96MB. While
tps drops from 1.2 kops to 0.8 kops. Drop in tps is expected as
memory admission kicks in trying to prevent OOM.
This is phase 1 of OOM prevention, potential next steps:
- add second admission in query_processor::get_statement trying to prevent potential thundering herd problem
- decrease cql_server memory pool size
- count reads in the memory pool
- add per service level memory pool and a shared one
Related https://scylladb.atlassian.net/browse/SCYLLADB-740
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-938
Backport: no, new feature, but we may reconsider if some customer needs it
Closesscylladb/scylladb#28919
* github.com:scylladb/scylladb:
cql3: track CQL parsing memory cost and use it for admission control
utils: add rolling max tracker
In the following patches, we'll start allowing forwarding requests to strongly
consistent tables so that they'll get executed on the suitable tablet Raft group
members. For that we'll reuse the approach that we already have for bouncing
requests to other shards - we'll try to execute a request locally, and the
result of that will be a bounce message with another replica as the target.
In this patch we generalize the former bounce_to_shard result message so that
it will be able to specify the target of the bounce as another shard or specific
replica.
We also rename it to result_message::bounce so that it stops implying that only
another shard may be its target.
Aside from the host_id and the shard, the new message also includes the timeout,
because in the service handling the forwarding we won't have the access to it,
and it's needed for specifying how long we should wait for the forwarded
requests. It also includes an information whether this is a write request
to return correct timeout response in case the deadline is exceeded.
We will return other hosts in the new bounce message when executing requests to
strongly consistent tables when we can't handle the request because we aren't
a suitable replica. We can't handle this message yet, so we don't return it
anywhere and we still assume that every bounce message is a bounce to the same
host.
Use rolling_max_tracker to record gross bytes allocated during each
CQL parse. The rolling maximum is then added to the memory estimate
for incoming QUERY and PREPARE requests so that the admission control
in the CQL transport layer accounts for parsing overhead.
The measured memory footprint serves as upper bound rather than
exact number but it's purpose is to prevent OOMs under unprepared
statements heavy load.
In benchmark 1G memory node shows decrease of non-LSA memory usage
from peak 320MB (our coordinator budget is 10% of 1G) to 96MB. While
tps drops from 1.2 kops to 0.8 kops. Drop in tps is expected as
memory admission kicks in trying to prevent OOM.
In this patch we replace every single use of SCYLLA_ASSERT(), abort() and assert() in the cql3/ directory by throwing_assert().
The problem with SCYLLA_ASSERT()/abort()/assert() is that when it fails, it crashes Scylla. This is almost always a bad idea (see #7871 discussing why), but it's even riskier in front-end code like cql3/: In front-end code, there is a risk that due to a bug in our code, a specific user request can cause Scylla to crash. A malicious user can send this query to all nodes and crash the entire cluster. When the user is not malicious, it causes a small problem (a failing request) to become a much worse crash - and worse, the user has no idea which request is causing this crash and the crash will repeat if the same request is tried again.
All of this is solved by using the new throwing_assert(), which is the same as SCYLLA_ASSERT() but throws an exception (using on_internal_error()) instead of crashing. The exception will prevent the code path with the invalid assumption from continuing, but will result in only the current user request being aborted, with a clear error message reporting the internal server error due to an assertion failure.
I reviewed all the changes that I did in these patches to check that (to the best of my understanding) none of the assertions in cql3/ involve the sort of serious corruption that might require crashing the Scylla node entirely.
throwing_assert() also improves logging of assertion failures compared to the original SCYLLA_ASSERT()/abort() - SCYLLA_ASSERT() printed a message to stderr which in many installations is lost, and abort() often prints no message at all. But throwing_assert() uses Scylla's standard logger, and also includes a backtrace in the log message.
Fixes#13970 (Exorcise assertions from CQL code paths)
Refs #7871 (Exorcise assertions from Scylla)
Closesscylladb/scylladb#28847
* github.com:scylladb/scylladb:
cql3: remove unnecessary assert()
cql3: replace abort() by throwing_assert()
cql3: Replace SCYLLA_ASSERT by throwing_assert
Replace get_auth_ks_name(qp) with db::system_keyspace::NAME directly.
The function always returned the constant "system" and its qp
parameter was unused.
query_processor::prepare() could race with prepared statement invalidation: after loading from the prepared cache, we converted the cached object to a checked weak pointer and then continued asynchronous work (including error-injection waitpoints). If invalidation happened in that window, the weak handle could no longer be promoted and the prepare path could fail nondeterministically.
This change keeps a strong cache entry reference alive across the whole critical section in prepare() by using a pinned cache accessor (get_pinned()), and only deriving the weak handle while the entry is pinned. This removes the lifetime gap without adding retry loops.
Test coverage was extended in test/cluster/test_prepare_race.py:
- reproduces the invalidation-during-prepare window with injection,
- verifies prepare completes successfully,
- then invalidates again and executes the same stale client prepared object,
- confirms the driver transparently re-requests/re-prepares and execution succeeds.
This change introduces:
- no behavior change for normal prepare flow besides stronger lifetime guarantees,
- no new protocol semantics,
- preserves existing cache invalidation logic,
- adds explicit cluster-level regression coverage for both the race and driver reprepare path.
- pushes the re prepare operation twards the driver, the server will return unprepared error for the first time and the driver will have to re prepare during execution stage
Fixes: https://github.com/scylladb/scylladb/issues/27657
Backport to active branches recommended: No node crash, but user-visible PREPARE failures under rare schema-invalidation race; low-risk timeout-bounded retry improves robustness.
Closesscylladb/scylladb#28952
* github.com:scylladb/scylladb:
transport/messages: hold pinned prepared entry in PREPARE result
cql3: pin prepared cache entry in prepare() to avoid invalid weak handle race
Remove the rest of the code that assumes that either group0 does not exist yet or a cluster is till not upgraded to raft topology. Both of those are not supported any more.
No need to backport since we remove functionality here.
Closesscylladb/scylladb#28841
* github.com:scylladb/scylladb:
service level: remove version 1 service level code
features: move GROUP0_SCHEMA_VERSIONING to deprecated features list
migration_manager: remove unused forward definitions
test: remove unused code
auth: drop auth_migration_listener since it does nothing now
schema: drop schema_registry_entry::maybe_sync() function
schema: drop make_table_deleting_mutations since it should not be needed with raft
schema: remove calculate_schema_digest function
schema: drop recalculate_schema_version function and its uses
migration_manager: drop check for group0_schema_versioning feature
cdc: drop usage of cdc_local table and v1 generation definition
storage_service: no need to add yourself to the topology during reboot since raft state loading already did it
storage_service: remove unused functions
group0: drop with_raft() function from group0_guard since it always returns true now
gossiper: do not gossip TOKENS and CDC_GENERATION_ID any more
gossiper: drop tokens from loaded_endpoint_state
gossiper: remove unused functions
storage_service: do not pass loaded_peer_features to join_topology()
storage_service: remove unused fields from replacement_info
gossiper: drop is_safe_for_restart() function and its use
storage_service: remove unused variables from join_topology
gossiper: remove the code that was only used in gossiper topology
storage_service: drop the check for raft mode from recovery code
cdc: remove legacy code
test: remove unused injection points
auth: remove legacy auth mode and upgrade code
treewide: remove schema pull code since we never pull schema any more
raft topology: drop upgrade_state and its type from the topology state machine since it is not used any longer
group0: hoist the checks for an illegal upgrade into main.cc
api: drop get_topology_upgrade_state and always report upgrade status as done
service_level_controller: drop service level upgrade code
test: drop run_with_raft_recovery parameter to cql_test_env
group0: get rid of group0_upgrade_state
storage_service: drop topology_change_kind as it is no longer needed
storage_service: drop check_ability_to_perform_topology_operation since no upgrades can happen any more
service_storage: remove unused functions
storage_service: remove non raft rebuild code
storage_service: set topology change kind only once
group0: drop in_recovery function and its uses
group0: rename use_raft to maintenance_mode and make it sync
In cql3/, there was one call to assert() (not SCYLLA_ASSERT or
throwing_assert), and it was:
const auto shard_num = smp::count;
assert(shard_num > 0)
Rather than converting this assert() to throwing_assert() as I did in
previous patches, I decided to outright remove it: Seastar guarantees
that smp::count is not zero. Many other places in the code use
smp::count assuming that it is correct, no other place bothers to assert
it isn't zero.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
After the previous patch replaced all SCYLLA_ASSERT() calls by
throwing_assert(), this patch also replaces all calls to abort().
All these abort() calls are supposedly cases that can never happen,
but if they ever do happen because of a bug, in none of these places
we absolutely need to crash - and exception that aborts the current
operation should be enough.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In this patch we replace every single use of SCYLLA_ASSERT() in the cql3/
directory by throwing_assert().
The problem with SCYLLA_ASSERT() is that when it fails, it crashes Scylla.
This is almost always a bad idea (see #7871 discussing why), but it's even
riskier in front-end code like cql3/: In front-end code, there is a risk
that due to a bug in our code, a specific user request can cause Scylla
to crash. A malicious user can send this query to all nodes and crash
the entire cluster. When the user is not malicious, it causes a small
problem (a failing request) to become a much worse crash - and worse,
the user has no idea which request is causing this crash and the crash
will repeat if the same request is tried again.
All of this is solved by using the new throwing_assert(), which is the
same as SCYLLA_ASSERT() but throws an exception (using on_internal_error())
instead of crashing. The exception will prevent the code path with the
invalid assumption from continuing, but will result in only the current
user request being aborted, with a clear error message reporting the
internal server error due to an assertion failure.
I reviewed all the changes that I did in this patch to check that (to the
best of my understanding) none of the assertions in cql3/ involve the
sort of serious corruption that might require crashing the Scylla node
entirely.
throwing_assert() also improves logging of assertion failures compared
to the original SCYLLA_ASSERT() - SCYLLA_ASSERT() printed a message to
stderr which in many installations is lost, whereas throwing_assert()
uses Scylla's standard logger, and also includes a backtrace in the
log message.
Fixes#13970 (Exorcise assertions from CQL code paths)
Refs #7871 (Exorcise assertions from Scylla)
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Similarly to LWTs, we reject queries with user-provided timestamps
when they target strongly consistent tables.
Such statements could force us to rewrite history, and that contradicts
the philosophy of linearizability we aim for.
Fixes SCYLLADB-879
Closesscylladb/scylladb#28867
result_message::prepared now owns a strong pinned prepared-cache entry instead of relying only on a weak pointer view. This closes the remaining lifetime gap after query_processor::prepare() returns, so users of the returned PREPARE message cannot observe an invalidated weak handle during subsequent
processing.
- update result_message::prepared::cql constructor to accept pinned entry
- construct weak view from owned pinned entry inside the message
- pass pinned cache entry from query_processor::prepare() into the message constructor
query_processor::prepare() could race with prepared statement invalidation: after loading from the prepared cache, we converted the cached object to a checked weak pointer and then continued asynchronous work (including error-injection waitpoints). If invalidation happened in that window, the weak
handle could no longer be promoted and the prepare path could fail nondeterministically.
This change keeps a strong cache entry reference alive across the whole critical section in prepare() by using a pinned cache accessor (get_pinned()), and only deriving the weak handle while the entry is pinned. This removes the lifetime gap without adding retry loops.
Test coverage was extended in test/cluster/test_prepare_race.py:
- reproduces the invalidation-during-prepare window with injection,
- verifies prepare completes successfully,
- then invalidates again and executes the same stale client prepared object,
- confirms the driver transparently re-requests/re-prepares and execution succeeds.
This change introduces:
- no behavior change for normal prepare flow besides stronger lifetime guarantees,
- no new protocol semantics,
- preserves existing cache invalidation logic,
- adds explicit cluster-level regression coverage for both the race and driver reprepare path.
- pushes the re prepare operation twards the driver, the server will return unprepared error for the first time and the driver will have to re prepare during execution stage
When a user tries to use ALTER TABLE on a materialized view,
the resulting error message is `Cannot use ALTER TABLE on Materialized View`.
The intention behind this error is that ALTER MATERIALIZED VIEW should
be used instead.
But we observed that some users interpret this error message as a general
"You cannot do any ALTER on this thing".
This patch enhances the error message (and others similar to it)
to prevent the confusion.
Closesscylladb/scylladb#28831
Previous code performed endian conversion by bulk-copying raw bytes
into a std::vector<float> and then iterating over it via a
reinterpret_cast<uint32_t*> pointer. Accessing float storage through a
uint32_t* violates C++ strict aliasing rules, giving the compiler
freedom to reorder or elide the stores, causing undefined behavior.
Replace the two-pass approach with a single-pass loop using
seastar::consume_be<uint32_t>() and std::bit_cast<float>(), which is
both well-defined and auto-vectorizable.
Follow-up #28754Closesscylladb/scylladb#28912
The purpose of `add_column_for_post_processing` is to add columns that are required for processing of a query,
but are not part of SELECT clause and shouldn't be returned. They are added to the final result set, but later are not serialized.
Mainly it is used for filtering and grouping columns, with a special case of `WHERE primary_key IN ... ORDER BY ...` when the whole result set needs additional final sorting,
and ordering columns must be added as well.
There was a bug that manifested in #9435, #8100 and was actually identified in #22061.
In case of selection with processing (e.g functions involved), result set row is formed in two stages.
Initially it is a list of columns fetched from replicas - on which filtering and grouping is performed.
After that the actual selection is resolved and the final number of columns can change.
Ordering is performed on this final shape, but the ordering column index returned by `add_column_for_post_processing` refereed to initial shape.
If selection refereed to the same column twice (e.g. `v, TTL(v)` as in #9435) final row was longer than initial and ordering refereed to incorrect column.
If a function in selection refereed to multiple columns (e.g. as_json(.., ..) which #8100 effectively uses) the final row was shorter
and ordering tried to use a non-existing column.
This patch fixes the problem by making sure that column index of the final result set is used for ordering.
The previously crashing test `cassandra_tests/validation/entities/json_test.py::testJsonOrdering` doesn't have to be skipped, but now it is failing on issue #28467.
Fixes#9435Fixes#8100Fixes#22061Closesscylladb/scylladb#28472
This patch series implements `write_consistency_levels_warned` and `write_consistency_levels_disallowed` guardrails, allowing the configuration of which consistency levels are unwanted for writes. The motivation for these guardrails is to forbid writing with consistency levels that don't provide high durability guarantees (like CL=ANY, ONE, or LOCAL_ONE).
Neither guardrail is enabled by default, so as not to disrupt clusters that are currently using any of the CLs for writes. The warning guardrail may seem harmless, as it only adds a warning to the CQL response; however, enabling it can significantly increase network traffic (as a warning message is added to each response) and also decrease throughput due to additional allocations required to prepare the warning. Therefore, both guardrails should be enabled with care. The newly added `writes_per_consistency_level` metric, which is incremented unconditionally, can help decide whether a guardrail can be safely enabled in an existing cluster.
This commit adds additional `if` instructions on the critical path. However, based on the `perf_simple_query` benchmark for writes, the difference is marginal (~40 additional instructions, which is a relative difference smaller than 0.001).
BEFORE:
```
291443.35 tps ( 53.3 allocs/op, 16.0 logallocs/op, 14.2 tasks/op, 48067 insns/op, 18885 cycles/op, 0 errors)
throughput:
mean= 289743.07 standard-deviation=6075.60
median= 291424.69 median-absolute-deviation=1702.56
maximum=292498.27 minimum=261920.06
instructions_per_op:
mean= 48072.30 standard-deviation=21.15
median= 48074.49 median-absolute-deviation=12.07
maximum=48119.87 minimum=48019.89
cpu_cycles_per_op:
mean= 18884.09 standard-deviation=56.43
median= 18877.33 median-absolute-deviation=14.71
maximum=19155.48 minimum=18821.57
```
AFTER:
```
290108.83 tps ( 53.3 allocs/op, 16.0 logallocs/op, 14.2 tasks/op, 48121 insns/op, 18988 cycles/op, 0 errors)
throughput:
mean= 289105.08 standard-deviation=3626.58
median= 290018.90 median-absolute-deviation=1072.25
maximum=291110.44 minimum=274669.98
instructions_per_op:
mean= 48117.57 standard-deviation=18.58
median= 48114.51 median-absolute-deviation=12.08
maximum=48162.18 minimum=48087.18
cpu_cycles_per_op:
mean= 18953.43 standard-deviation=28.76
median= 18945.82 median-absolute-deviation=20.84
maximum=19023.93 minimum=18916.46
```
Fixes: SCYLLADB-259
Refs: SCYLLADB-739
No backport, it's a new feature
Closesscylladb/scylladb#28570
* github.com:scylladb/scylladb:
scylla.yaml: add write CL guardrails to scylla.yaml
scylla.yaml: reorganize guardrails config to be in one place
test: add cluster tests for write CL guardrails
test: implement test_guardrail_write_consistency_level
cql3: start using write CL guardrails
cql3/query_processor: implement metrics to track CL of writes
db: cql3/query_processor: add write_consistency_levels enum_sets
config: add write_consistency_levels_* guardrails configuration
The main loops iterating over vector components were not vectorized due to:
- "cannot prove it is safe to reorder floating-point operations"
- "Cannot vectorize early exit loop with more than one early exit"
The first issue is fixed with adding `#pragma clang fp contract(fast) reassociate(on)`, which allows compiler to optimize floating point operations.
The second issue is solved by refactoring the operations in the affected loop.
Additionally using float operations instead of double increases throughput and numerical accuracy is not the main consideration in vector search scenarios.
Performance measured:
- scylla built using dbuild
- using https://github.com/zilliztech/VectorDBBench (modified to call `SELECT id, similarity_cosine({vector<float, 1536>}, {vector<float, 1536>}) ...` without ANN search):
- client concurrency 20
before: ~2250 QPS
`float` operations: ~2350 QPS
`compute_cosine_similarity` vectorization: ~2500QPS
`extract_float_vector` vectorization: ~3000QPS
Follow-up https://github.com/scylladb/scylladb/pull/28615
Ref https://scylladb.atlassian.net/browse/SCYLLADB-764Closesscylladb/scylladb#28754
Enable verification of write consistency level guardrails in
`modification_statement` and `batch_statement`.
Neither guardrail is enabled by default, so as not to disrupt clusters
that are currently using any of the CLs for writes. The warning
guardrail may seem harmless, as it only adds a warning to the CQL
response; however, enabling it can significantly increase network
traffic (as a warning message is added to each response) and also
decrease throughput due to additional allocations required to prepare
the warning. Therefore, both guardrails should be enabled with care.
The newly added `writes_per_consistency_level` metric, which is
incremented unconditionally, can help decide whether a guardrail can
be safely enabled in an existing cluster.
This commit adds additional `if` instructions on the critical path.
However, based on the `perf_simple_query` benchmark for writes,
the difference is marginal (~40 additional instructions, which is
a relative difference smaller than 0.001).
BEFORE:
```
291443.35 tps ( 53.3 allocs/op, 16.0 logallocs/op, 14.2 tasks/op, 48067 insns/op, 18885 cycles/op, 0 errors)
throughput:
mean= 289743.07 standard-deviation=6075.60
median= 291424.69 median-absolute-deviation=1702.56
maximum=292498.27 minimum=261920.06
instructions_per_op:
mean= 48072.30 standard-deviation=21.15
median= 48074.49 median-absolute-deviation=12.07
maximum=48119.87 minimum=48019.89
cpu_cycles_per_op:
mean= 18884.09 standard-deviation=56.43
median= 18877.33 median-absolute-deviation=14.71
maximum=19155.48 minimum=18821.57
```
AFTER:
```
290108.83 tps ( 53.3 allocs/op, 16.0 logallocs/op, 14.2 tasks/op, 48121 insns/op, 18988 cycles/op, 0 errors)
throughput:
mean= 289105.08 standard-deviation=3626.58
median= 290018.90 median-absolute-deviation=1072.25
maximum=291110.44 minimum=274669.98
instructions_per_op:
mean= 48117.57 standard-deviation=18.58
median= 48114.51 median-absolute-deviation=12.08
maximum=48162.18 minimum=48087.18
cpu_cycles_per_op:
mean= 18953.43 standard-deviation=28.76
median= 18945.82 median-absolute-deviation=20.84
maximum=19023.93 minimum=18916.46
```
Fixes: SCYLLADB-259
Encapsulate the superuser check in client_state so that it
respects _bypass_auth_checks. Connections that bypass auth
(internal callers and the maintenance socket) are always
considered superusers.
Migrate existing call sites from auth::has_superuser(service, user)
to client_state.has_superuser(). Also add _bypass_auth_checks
handling to ensure_not_anonymous().
Refs SCYLLADB-409
Add `write_consistency_levels_disallowed_violations` and
`write_consistency_levels_warned_violations` metrics to track
violations of write_consistency_levels guardrails.
Add `writes_per_consistency_level` to track what CL is used by
writes, regardless of the guardrails configuration.
Data gathered by this metric can be used to decide whether enabling
a particular write consistency level guardrail in a particular
existing cluster is safe.
Refs: SCYLLADB-259
Add enum_sets to query_processor that track the configuration
values of `write_consistency_levels_warned` and
`write_consistency_levels_disallowed`.
Refs: SCYLLADB-259
The secondary index mechanism is currently used to determine the target column.
This mechanism works incorrectly for vector indexes with filtering because
it returns the last specified column as the target (vectors) column.
However, the syntax for a vector index requires the first column to be the target:
```
CREATE CUSTOM INDEX ON t(vectors, users) USING 'vector_index';
```
This discrepancy eventually leads to the following exception when performing an
ANN search on a vector index with filtering columns:
````
ANN ordering by vector requires the column to be indexed using 'vector_index'
````
This commit fixes the issue by introducing dedicated logic for vector indexes
to correctly identify the target(vectors) column.
Fixes: SCYLLADB-635
Closesscylladb/scylladb#28740
3f7ee3ce5d introduced system.batchlog_v2, with a schema designed to speed up batchlog replays and make post-replay cleanups much more effective.
It did not introduce a cluster feature for the new table, because it is node local table, so the cluster can switch to the new table gradually, one node at a time.
However, https://github.com/scylladb/scylladb/issues/27886 showed that the switching causes timeouts during upgrades, in mixed clusters. Furthermore, switching to the new table unconditionally on upgrades nodes, means that on rollback, the batches saved into the v2 table are lost.
This PR introduces re-introduces v1 (`system.batchlog`) support and guards the use of the v2 table with a cluster feature, so mixed clusters keep using v1 and thus be rollback-compatible.
The re-introduced v1 support doesn't support post-replay cleanups for simplicity. The cleanup in v1 was never particularly effective anyway and we ended up disabling it for heavy batchlog users, so I don't think the lack of support for cleanup is a problem.
Fixes: https://github.com/scylladb/scylladb/issues/27886
Needs backport to 2026.1, to fix upgrades for clusters using batches
Closesscylladb/scylladb#28736
* github.com:scylladb/scylladb:
test/boost/batchlog_manager_test: add tests for v1 batchlog
test/boost/batchlog_manager_test: make prepare_batches() work with both v1 and v2
test/boost/batchlog_manager_test: fix indentation
test/boost/batchlog_manager_test: extract prepare_batches() method
test/lib/cql_assertions: is_rows(): add dump parameter
tools/scylla-sstable: extract query result printers
tools/scylla-sstable: add std::ostream& arg to query result printers
repair/row_level: repair_flush_hints_batchlog_handler(): add all_replayed to finish log
db/batchlog_manager: re-add v1 support
db/batchlog_manager: return all_replayed from process_batch()
db/batchlog_manager: process_bath() fix indentation
db/batchlog_manager: make batch() a standalone function
db/batchlog_manager: make structs stats public
db/batchlog_manager: allocate limiter on the stack
db/batchlog_manager: add feature_service dependency
gms/feature_service: add batchlog_v2 feature
This series implements a new per-row TTL feature for CQL. The per-row TTL feature was requested in issue #13000. It is a feature that does not exist in Cassandra, and was inspired by DynamoDB's TTL feature - and under the hood uses the same implementation that we used in Alternator to implement this DynamoDB feature.
The new per-row TTL feature is completely separate from CQL's existing per-write (and per-cell) TTL, and both will be available to users.
In the per-row TTL feature, one column in the table is designated as the "TTL" column, and its value for a row is the expiration time for that row. The TTL column can be designated at table creation time, e.g.:
```cql
CREATE TABLE tab (
id int PRIMARY KEY,
t text,
expiration timestamp TTL
);
```
Or after the table already exists with:
```cql
ALTER TABLE tab TTL expiration
```
Expiration can also be disabled, with:
```cql
ALTER TABLE tab TTL NULL
```
The new per-row TTL feature has two features that users have been asking for:
1. A user can change the value of just the TTL column - without rewriting the entire row - to change the expiration time of the entire row.
2. When an expired row is finally deleted, a CDC event about this deletion appears in the CDC log (if CDC is enabled), including - if a preimage is enabled - the content of the deleted row.
To achieve the second goal (CDC events), a row is not guaranteed to disappear at exactly its expiration time (as CQL's original TTL feature guarantees). Rather, the row is deleted some time later, depending on `alternator_ttl_period_in_seconds`; Until the actual deletion, the row is still readable (and even writable). But we are guaranteed that when the row is finally deleted, the CDC event will come too.
The implementation uses the same background thread used by Alternator to periodically scan for expired items and delete them.
The expiration thread keeps the same metrics as it did for Alternator:
* `scylla_expiration_scan_passes`
* `scylla_expiration_scan_table`
* `scylla_expiration_items_deleted`
* `scylla_expiration_secondary_ranges_scanned`
The series begins with a few small preparation patches, followed by the main part of the feature (which isn't big, since we are just enabling the pre-existing Alternator expiration machinary for CQL) and finally 30 tests (single-node and multi-node tests) and documentation.
This series is a new feature, so traditionally would not be backported. However, I wouldn't be surprised if we will be requested to backport it so that customers will not need to wait for a new major release.
Fixes#13000Closesscylladb/scylladb#28320
* github.com:scylladb/scylladb:
test/cqlpy: verify that a column can't be both STATIC and PRIMARY KEY
docs/cql: document the new CQL per-row TTL feature
test/cluster: tests for the new CQL per-row TTL feature
test/cqlpy: tests for the new CQL per-row TTL feature
test: set low alternator_ttl_period_in_seconds in CQL tests
cql ttl: fix ALTER TABLE to disable TTL if column is dropped
cql ttl: add setting/unsetting of TTL column to ALTER TABLE
cql ttl: add TTL column support to CREATE TABLE and DESC TABLE
ttl: add CQL support to Alternator's TTL expiration service
alternator ttl: move TTL_TAG_KEY to a header file
alternator ttl: remove unnecessary check of feature flag
cql: add "cql_row_ttl" cluster feature
alternator: fix error message if UpdateTimeToLive is not supported
Switch vector dimension handling to fixed-width `uint32_t` type,
update parsing/validation, and add boundary tests.
The dimension is parsed as `unsigned long` at first which is guaranteed
to be **at least** 32-bit long, which is safe to downcast to `uint32_t`.
Move `MAX_VECTOR_DIMENSION` from `cql3_type::raw_vector` to `cql3_type`
to ensure public visibility for checks outside the class.
Add tests to verify the type boundaries.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-223
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Co-authored-by: Dawid Pawlik <dawid.pawlik@scylladb.com>
Closesscylladb/scylladb#28762
If "ALTER TABLE tab DROP x" is done to delete column x, and column x was
the designated TTL column, then the per-row TTL feature should be disabled
on this table.
If we don't do this, the expiration scanner will continue to scan the
table trying to read the dropped column - which will be wasteful or
worse.
A test for this case is also included in test/cqlpy/test_ttl_row.py
in a later patch.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The previous patch added the ability in CREATE TABLE to designate one
of the regular columns as a "TTL column", to be used by the per-row TTL
feature (Refs #13000). In this patch we add to ALTER TABLE the ability
to enable per-row TTL on an existing table with a given column as the
TTL column:
ALTER TABLE tab TTL colname
and also the ability to disable per-row TTL with
ALTER TABLE tab TTL NULL
as in CREATE TABLE, the designated TTL column must be a regular column
(it can't be a primary key column or a static column), and must have
the types timestamp, bigint or int.
You can't enable per-row TTL if already enabled, or disable it if
already disabled. To change the TTL column on an existing table,
you must first disable TTL, and then re-enable it with the new column.
A large collection of functional tests (in test/cqlpy), for every detail
of this patch, will come in a later patch in this series.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>