Change add_tablet_info() to accept locator::tablet_routing_info instead
of destructured (tablet_replica_set, token_range) pair. This simplifies
all three call sites.
Remove the empty-replicas guard inside add_tablet_info(): the only
producer of tablet_routing_info is tablet ERM's check_locality(), which
returns either nullopt (correctly routed) or info with replicas copied
from tablet_info — a tablet always has replicas. All callers already
check for nullopt before calling add_tablet_info(), so by the time we
enter the function replicas are guaranteed non-empty.
The timeout_config (more exactly -- updatable_timeout_config) is used by alternator/controller and transport/controller. Both create a local copy of that opbject by constructing one out of db::config. Also some options from this config are needed by storage_proxy, but since it doesn't have access to any timeout_config-s, it just uses db::config by getting it from the database.
This PR introduces top-level sharded<updateable_timeout_config>, initializes it from db::config values and makes existing users plus storage_proxy us it where required. Motivation -- remove more replica::database::get_config() users. A side effect -- timeout_config is not duplicated by transport and alternator controllers.
Components' dependencies cleanup, not backporting.
Closesscylladb/scylladb#29636
* github.com:scylladb/scylladb:
storage_proxy: Use shared updateable_timeout_config for CAS contention timeout
alternator: Use shared updateable_timeout_config by reference
cql_transport: Use shared updateable_timeout_config by reference
storage_proxy: Use shared updateable_timeout_config by reference
main: Introduce sharded<updateable_timeout_config>
storage_proxy: Keep own updateable_timeout_config
Four logger calls used %s (printf-style) instead of {} (fmtlib-style),
causing __func__ to be silently ignored and the literal text "%s" to
appear in the log output. The same file already uses {} correctly in
the on_create_function and on_create_aggregate handlers.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Pass sharded<updateable_timeout_config>& into cql_transport::controller,
which feeds the shard-local instance as a reference into
cql_server_config::timeout_config. This drops the per-shard local
updateable_timeout_config constructed from db::config inside the
controller's sharded_parameter lambda, replacing it with a reference
into the shared sharded instance.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Commit 16b56c2451 ("Audit: avoid dynamic_cast on a hot path") moved
audit info into batch_statement via set_audit_info(), but only wired it
for the CQL-text BATCH path (raw::batch_statement::prepare()).
Native-protocol BATCH messages (opcode 0x0D), handled by
process_batch_internal in transport/server.cc, construct a
batch_statement without setting audit_info. This causes audit to
silently skip the entire batch.
Set audit_info on the batch_statement so these batches are audited.
Fixes SCYLLADB-1652
No backport - bug introduced recently.
Closesscylladb/scylladb#29570
* github.com:scylladb/scylladb:
test/audit: add reproducer for native-protocol batch not being audited
audit: set audit_info for native-protocol BATCH messages
test/audit: rename internal test methods to avoid CI misdetection
Commit 16b56c2451 ("Audit: avoid dynamic_cast on a hot path") moved
audit info into batch_statement via set_audit_info(), but only wired it
for the CQL-text BATCH path (raw::batch_statement::prepare()).
Native-protocol BATCH messages (opcode 0x0D), handled by
process_batch_internal in transport/server.cc, construct a
batch_statement without setting audit_info. This causes audit to
silently skip the entire batch.
Set audit_info on the batch_statement so these batches are audited.
Fixes SCYLLADB-1652
The existing scylla_transport_requests_serving metric is a single global per-shard gauge counting outstanding CQL requests. When debugging latency spikes, it's useful to know which service level is contributing the most in-flight requests.
This PR adds a new per-scheduling-group gauge scylla_transport_cql_requests_serving (with the scheduling_group_name label), using the existing cql_sg_stats per-SG infrastructure. The cql_ prefix is intentional — it follows the convention of all other per-SG transport metrics (cql_requests_count, cql_request_bytes, etc.) and avoids Prometheus confusion with the global requests_serving metric (which lacks the scheduling_group_name label).
Fixes: SCYLLADB-1340
New feature, no backport.
Closesscylladb/scylladb#29493
* github.com:scylladb/scylladb:
transport: add per-service-level cql_requests_serving metric
transport: move requests_serving decrement to after response is sent
Add a per-scheduling-group gauge that tracks the number of in-flight CQL
requests for each service level. The existing scylla_transport_requests_serving
metric is a single global per-shard counter; the new metric breaks it down
by scheduling group so operators can see which service level contributes
the most in-flight requests when debugging latency.
The metric is named cql_requests_serving (exposed as
scylla_transport_cql_requests_serving) following the cql_ prefix convention
used by all other per-scheduling-group transport metrics (cql_requests_count,
cql_request_bytes, cql_response_bytes, cql_pending_response_memory). Using
a cql_ prefix avoids Prometheus confusion with the global requests_serving
metric, which lacks the scheduling_group_name label.
The counter is incremented when a request enters process_request() and
decremented in the same 'leave' defer block as the global requests_serving,
ensuring the request is counted as in-flight until the response is sent.
The requests_serving metric was decremented right after query processing
completed, but before the response was written to the client. This means
requests whose responses were queued in the write pipeline were no longer
counted as in-flight, understating the actual load.
Move the decrement into the 'leave' defer block, which fires after the
response is fully sent via _ready_to_respond. This makes the shedding
check (max_concurrent_requests_per_shard) more accurate: requests that
have finished processing but are still waiting in the response queue now
correctly count toward the in-flight limit.
Prepared LIST statements were not calculating metadata in PREPARE path, and sent empty string hash to client causing problematic behaviour where metadat_id was not recalculated correctly.
This patch moves metadata construction into get_result_metadata() for the affected LIST statements and reuse that metadata when building the result set.
This gives PREPARE a stable metadata id for LIST ROLES, LIST USERS, LIST PERMISSIONS and the service-level variants.
This patch also adds a new boost test that verifies that when an EXECUTE request carries an empty result metadata id while the server has a real metadata id for the result set, the response is marked METADATA_CHANGED and includes the full result metadata plus the server metadata id.
This covers the recovery path for clients that send an empty or otherwise unusable metadata id instead of a matching cached one.
Replace move_to_shard()/move_to_host() with as_bounce()/target_shard()/
target_host() to clarify the interface after bounce was extended to
support cross-node bouncing.
- Add virtual as_bounce() returning const bounce* to the base class
(nullptr by default, overridden in bounce to return this), replacing
the virtual move_to_shard() which conflated bounce detection with
shard access
- Rename move_to_shard() -> target_shard() (now non-virtual, returns
unsigned directly) and move_to_host() -> target_host() on bounce
- Replace dynamic_pointer_cast with static_pointer_cast at call sites
that already checked as_bounce()
- Move forward declarations of message types before the virtual
methods so as_bounce() can reference bounce
Fixes: SCYLLADB-1066
Closesscylladb/scylladb#29367
Motivation
----------
Since strongly consistent tables are based on the concept of Raft
groups, operations on them can get stuck for indefinite amounts of
time. That may be problematic, and so we'd like to implement a way
to cancel those operations at suitable times.
Description of solution
-----------------------
The situations we focus on are the following:
* Timed-out queries
* Leader changes
* Tablet migrations
* Table drops
* Node shutdowns
We handle each of them and provide validation tests.
Implementation strategy
-----------------------
1. Auxiliary commits.
2. Abort operations on timeout.
3. Abort operations on tablet removal.
4. Extend `client_state`.
5. Abort operation on shutdown.
6. Help `state_machine` be aborted as soon as possible.
Tests
-----
We provide tests that validate the correctness of the solution.
The total time spent on `test_strong_consistency.py`
(measured on my local machine, dev mode):
Before:
```
real 0m31.809s
user 1m3.048s
sys 0m21.812s
```
After:
```
real 0m34.523s
user 1m10.307s
sys 0m27.223s
```
The incremental differences in time can be found in the commit messages.
Fixes SCYLLADB-429
Backport: not needed. This is an enhancement to an experimental feature.
Closesscylladb/scylladb#28526
* github.com:scylladb/scylladb:
service: strong_consistency: Abort state_machine::apply when aborting server
service: strong_consistency: Abort ongoing operations when shutting down
service: client_state: Extend with abort_source
service: strong_consistency: Handle abort when removing Raft group
service: strong_consistency: Abort Raft operations on timeout
service: strong_consistency: Use timeout when mutating
service: strong_consistency: Fix indentation
service: strong_consistency: Enclose coordinator methods with try-catch
service: strong_consistency: Crash at unexpected exception
test: cluster: Extract default config & cmdline in test_strong_consistency.py
After obtaining the CQL response, check if its actual size exceeds the initially acquired memory permit. If so, acquire additional semaphore units and adopt them into the permit, ensuring accurate memory accounting for large responses.
Additionally, move the permit into a .then() continuation so that the semaphore units are kept alive until write_message finishes, preventing premature release of memory permit. This is especially important with slow networks and big responses when buffers can accumulate and deplete a node's memory.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1306
Related https://scylladb.atlassian.net/browse/SCYLLADB-740
Backport: all supported versions
Closesscylladb/scylladb#29288
* github.com:scylladb/scylladb:
transport: add per-service-level pending response memory metric
transport: hold memory permit until response write completes
transport: account for response size exceeding initial memory estimate
These changes are complementary to those from a recent commit where we
handled aborting ongoing operations during tablet events, such as
tablet migration. In this commit, we consider the case of shutting down
a node.
When a node is shutting down, we eventually close the connections. When
the client can no longer get a response from the server, it makes no
sense to continue with the queries. We'd like to cancel them at that
point.
We leverage the abort source passed down via `client_state` down to
the strongly consistent coordinator. This way, the transport layer can
communicate with it and signal that the queries should be canceled.
The abort source is triggered by the CQL server (cf.
`generic_server::server::{stop,shutdown}`).
---
Note that this is not an optional change. In fact, if we don't abort
those requests, we might hang for an indefinite amount of time when
executing the following code in `main.cc`:
```
// Register at_exit last, so that storage_service::drain_on_shutdown will be called first
auto do_drain = defer_verbose_shutdown("local storage", [&ss] {
ss.local().drain_on_shutdown().get();
});
```
The problem boils down to the fact that `generic_server::server::stop`
will wait for all connections to be closed, but that won't happen until
all ongoing operations (at least those to strongly consistent tables)
are finished.
It's important to highlight that even though we hang on this, the
client can no longer get any response. Thus, it's crucial that at that
point we simply abort ongoing operations to proceed with the rest of
shutdown.
---
Two tests are added to verify that the implementation is correct:
one focusing on local operations, the other -- on a forwarded write.
Difference in time spent on the whole test file
`test_strong_consistency.py` on my local machine, in dev mode:
Before:
```
real 0m31.775s
user 1m4.475s
sys 0m22.615s
```
After:
```
real 0m32.024s
user 1m10.751s
sys 0m23.871s
```
Individual runs of the added tests:
test_queries_when_shutting_down:
```
real 0m12.818s
user 0m36.726s
sys 0m4.577s
```
test_abort_forwarded_write_upon_shutdown:
```
real 0m12.930s
user 0m36.622s
sys 0m4.752s
```
We make `client_state` store a pointer to an `abort_source`. This will
be useful in the following commit that will implement aborting ongoing
requests to strongly consistent tables upon connection shutdowns.
It might also be useful in some other places in the code in the future.
We set the abort source for client states in relevant places.
Track the total memory consumed by responses waiting to be
written to the socket, exposed as a per-scheduling-group gauge
(cql_pending_response_memory). This complements the response
memory accounting added in the previous commits by giving
visibility into how much memory each service level is holding
in unsent response buffers.
Capture the memory permit in the leave lambda's .finally()
continuation so that the semaphore units are kept alive until
write_response finishes, preventing premature release of
memory accounting.
This is especially important with slow network and big responses
when buffers can accumulate and deplete node's memory.
After obtaining the CQL response, check if its actual size exceeds
the initially acquired memory permit. If so, take semaphore units
and adopt them into the permit (non blocking).
This doesn't fully prevent from allocating too much memory as
size is known when buffer is already allocated but improves
memory accounting for big responses.
When authenticate() returns a user directly (certificate-based auth,
introduced in 20e9619bb1), process_startup was missing the same
post-authentication bookkeeping that the no-auth and SASL paths perform:
- update_scheduling_group(): without it, the connection runs under the
default scheduling group instead of the one mapped to the user's
service level.
- _authenticating = false / _ready = true: without them,
system.clients reports connection_stage = AUTHENTICATING forever
instead of READY.
- on_connection_ready(): without it, the connection never releases its
slot in the uninitialized-connections concurrency semaphore (acquired
at connection creation), leaking one unit per cert-authenticated
connection for the lifetime of the connection.
The omission was introduced when on_connection_ready() was added to the
else and SASL branches in 474e84199c but the cert-auth branch was missed.
Fixes: 20e9619bb1 ("auth: support certificate-based authentication")
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
In this series we add support for forwarding strongly consistent CQL requests to suitable replicas, so that clients can issue reads/writes to any node and have the request executed on an appropriate tablet replica (and, for writes, on the Raft leader). We return the same CQL response as what the user would get while sending the request to the correct replica and we perform the same logging/stats updates on the request coordinator as if the coordinator was the appropriate replica.
The core mechanism of forwarding a strongly consistent request is sending an RPC containing the user's cql request frame to the appropriate replica and returning back a ready, serialized `cql_transport::response`. We do this in the CQL server - it is most prepared for handling these types and forwarding a request containing a CQL frame allows us to reuse near-top-level methods for CQL request handling in the new RPC handler (such as the general `process`)
For sending the RPC, the CQL server needs to obtain the information about who should it forward the request to. This requires knowledge about the tablet raft group members and leader. We obtain this information during the execution of a `cql3/strong_consistency` statement, and we return this information back to the CQL server using the generalized `bounce_to_shard` `response_message`, where we now store the information about either a shard, or a specific replica to which we should forward to. Similarly to `bounce_to_shard`, we need to handle this `result_message` in a loop - a replica may move during statement execution, or the Raft leader can change. We also use it for forwarding strongly consistent writes when we're not a member of the affected tablet raft group - in that case we need to forward the statement twice - once to any replica of the affected tablet, then that replica can find the leader and return this information to the coordinator, which allows the second request to be directed to the leader.
This feature also allows passing through exception messages which happened on the target replica while executing the statement. For that, many methods of the `cql_transport::cql_server::connection` for creating error responses needed to be moved to `cql_transport::cql_server`. And for final exception handling on the coordinator, we added additional error info to the RPC response, so that the handling can be performed without having the `result_message::exception` or `exception_ptr` itself.
Fixes [SCYLLADB-71](https://scylladb.atlassian.net/browse/SCYLLADB-71)
[SCYLLADB-71]: https://scylladb.atlassian.net/browse/SCYLLADB-71?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQClosesscylladb/scylladb#27517
* github.com:scylladb/scylladb:
test: add tests for CQL forwarding
transport: enable CQL forwarding for strong consistency statements
transport: add remote statement preparation for CQL forwarding
transport: handle redirect responses in CQL forwarding
transport: add exception handling for forwarded CQL requests
transport: add basic CQL request forwarding
idl: add a representation of client_state for forwarding
cql_server: handle query, execute, batch in one case
transport: inline process_on_shard in cql_server::process
transport: extract process() to cql_server
transport: add messaging_service to cql_server
transport: add response reconstruction helpers for forwarding
transport: generalize the bounce result message for bouncing to other nodes
strong consistency: redirect requests to live replicas from the same rack
transport: pass foreign_ptr into sleep_until_timeout_passes and move it to cql_server
transport: extract the error handling from process_request_one
transport: move error response helpers from connection to cql_server
can_use_effective_service_level_cache() always returns true now, so the function can be dropped entirely and all the code that assumes it may return false can be dropped as well. Also drop async versions of find_effective_service_level and get_user_scheduling_group since they are unused.
No need to backport, code removal,
Closesscylladb/scylladb#29002
* github.com:scylladb/scylladb:
service level: make maybe_update_per_service_level_params synchronous
service level: remove unused get_user_scheduling_group function
service level: drop async find_effective_service_level
service level: remove remnants of version 1 service level
Add basic cluster tests for CQL forwarding.
The test cases include:
- basic reads and writes
- prepared statements with binds
- forwarding from a non-replica
- exception passthrough during forwarding (using an injection)
- re-preparing a statement on the target node, even if the user
query is also an EXECUTE request on a prepared statement
- verification metric updates
The existing test_basic_write_read was modified so that a few extra
cases could be validated on the same cluster.
During forwarding of CQL EXECUTE requests, the target node may
not have the prepared statement in its cache. If we do have this
statement as a coordinator, instead of returning PREPARED NOT FOUND
to the client, we want to prepare the statement ourselves on target
node.
For that, we add a new FORWARD_CQL_PREPARE RPC. We use the new RPC
after gettting the prepared_not_found status during forwarding. When
we try to forward a request, we always have the query string (we
decide whether to forward based on this query), so we can always use
the new RPC when getting the prepared_not_found status.
After receiving the response, we try forwarding the EXECUTE request
again.
During CQL forwarding, when the target node can't handle the request,
it will find another node which can execute the request or which knows
where the request can be executed. We return this information in
responses to CQL forwarding, and in this patch, we add handling of
this kind of a response.
After getting a redirect response, we retry forwarding to the returned
host/shard until success or timeout. This can happen many times during
a single request, when we first forward to a replica and later to the
coordinator, or when a replica/coordinator migrated while we were
performing the forwarding
When a forwarded request fails on the remote node, we can't use the
exception handling that happens in process_request_one because we
don't go through this code path. Instead, we use the previously
extracted cql_server::handle_exception handler, which performs
all accounting on the forwarded-to node, and which prepares the
response. For the read_failure_exception_with_timeout exception,
we need to perform the sleep on the source node, so we return the
timeout in the forwarding response and use it on the source node
to know how long to sleep without any extra calculations.
The handle_forward_execute() method is extracted from the inline handler
lambda to make the error catching wrapper cleaner.
Add the infrastructure for forwarding CQL requests to other nodes.
When a process() call results in a node bounce (as opposed to a shard
bounce), the coordinator serializes the request and sends it via the
FORWARD_CQL_EXECUTE RPC verb to the target node.
In this patch we omit several features that allow handling more
scenarios that can happen when trying to forward a CQL request,
but the RPC request and response are already prepared for them.
They will be handled in the following commits.
Use rolling_max_tracker to record gross bytes allocated during each
CQL parse. The rolling maximum is then added to the memory estimate
for incoming QUERY and PREPARE requests so that the admission control
in the CQL transport layer accounts for parsing overhead.
The measured memory footprint serves as upper bound rather than
exact number but it's purpose is to prevent OOMs under unprepared
statements heavy load.
In benchmark 1G memory node shows decrease of non-LSA memory usage
from peak 320MB (our coordinator budget is 10% of 1G) to 96MB. While
tps drops from 1.2 kops to 0.8 kops. Drop in tps is expected as
memory admission kicks in trying to prevent OOM.
This is phase 1 of OOM prevention, potential next steps:
- add second admission in query_processor::get_statement trying to prevent potential thundering herd problem
- decrease cql_server memory pool size
- count reads in the memory pool
- add per service level memory pool and a shared one
Related https://scylladb.atlassian.net/browse/SCYLLADB-740
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-938
Backport: no, new feature, but we may reconsider if some customer needs it
Closesscylladb/scylladb#28919
* github.com:scylladb/scylladb:
cql3: track CQL parsing memory cost and use it for admission control
utils: add rolling max tracker
Currently we perform the same steps when handling query, execute
and batch CQL requests. So instead of creating multiple functions
performing these steps, we can handle them all in one fallthrough
case in cql_server::connection::process_request_one.
The process_on_shard method is relatively short, it's only used
in the process() method and the Process concept that is uses
is as long as the function itself. This area will be made more
complex by the following patches for cql forwarding, so we simplify
it by inlining process_on_shard in cql_server::process.
Move process() and process_on_shard() from cql_server::connection to
cql_server. The process() method is no longer a template - instead, it
takes an opcode parameter and uses get_process_fn_for_opcode() to select
the appropriate internal processing function.
The process_query, process_execute, and process_batch wrappers on
connection now delegate to _server.process() with the appropriate opcode.
This refactoring is preparation for CQL request forwarding, where
process() will need to be called from a context other than connection
- the forwarding RPC handler).
The messaging service will be used by cql_server to register RPC
handlers for forwarding CQL requests between nodes.
We pass it through the controller to cql_server.
Expose response::flags() and response::extract_body(), and a new constructor.
It will be needed for creating a cql_transport::response from the response body returned
during CQL forwarding.
In the following patches, we'll start allowing forwarding requests to strongly
consistent tables so that they'll get executed on the suitable tablet Raft group
members. For that we'll reuse the approach that we already have for bouncing
requests to other shards - we'll try to execute a request locally, and the
result of that will be a bounce message with another replica as the target.
In this patch we generalize the former bounce_to_shard result message so that
it will be able to specify the target of the bounce as another shard or specific
replica.
We also rename it to result_message::bounce so that it stops implying that only
another shard may be its target.
Aside from the host_id and the shard, the new message also includes the timeout,
because in the service handling the forwarding we won't have the access to it,
and it's needed for specifying how long we should wait for the forwarded
requests. It also includes an information whether this is a write request
to return correct timeout response in case the deadline is exceeded.
We will return other hosts in the new bounce message when executing requests to
strongly consistent tables when we can't handle the request because we aren't
a suitable replica. We can't handle this message yet, so we don't return it
anywhere and we still assume that every bounce message is a bounce to the same
host.
Change sleep_until_timeout_passes() to accept a foreign_ptr<std::unique_ptr<response>>.
We can easily create the foreign_ptr for the responses created in the CQL server,
but we'll need this when we get responses when forwarding CQL statements - the responses
may come from other shards.
We also move it from cql_server::connection to cql_server, because for forwarded CQL
requests, we'll need to handle it at the cql_server level.
The method also loses its const qualifier - the abort_source that we pass into
sleep_abortable needs to be non-const. Apparently, we could still use it in a const
method of cql_server::connection because we passed it as _server._abort_source which
caused the const qualifier to be lost.
can_use_effective_service_level_cache() always returns true now, so the
function can be dropped entirely and all the code that assumes it may
return false can be dropped as well.
Use rolling_max_tracker to record gross bytes allocated during each
CQL parse. The rolling maximum is then added to the memory estimate
for incoming QUERY and PREPARE requests so that the admission control
in the CQL transport layer accounts for parsing overhead.
The measured memory footprint serves as upper bound rather than
exact number but it's purpose is to prevent OOMs under unprepared
statements heavy load.
In benchmark 1G memory node shows decrease of non-LSA memory usage
from peak 320MB (our coordinator budget is 10% of 1G) to 96MB. While
tps drops from 1.2 kops to 0.8 kops. Drop in tps is expected as
memory admission kicks in trying to prevent OOM.
When we forward CQL statements, we'll need to handle the errors
on the destination node. Only for read_failure_exception_with_timeout
exception, we'll still need to wait until timeout passes on the
source node.
For that we extract the exception handling to a separate method.
Additionally, we separate the waiting and all other handling,
so that all handling aside from waiting will be reusable after
forwarding, and we'll also be able to sleep on the source node
if necessary.
These methods are used only in the error handler in the cql server,
and outside of 3 cases, they don't need any information from the
cql_server::connection. We move them from cql_server::connection
to cql_server, so that they can be used in the following patches
for methods for CQL request forwarding where we'll have no instance
of cql_server::connection on the node forwarded to.
After the change the methods require no access to the server's
or connection's fields, so we also make them static methods.
query_processor::prepare() could race with prepared statement invalidation: after loading from the prepared cache, we converted the cached object to a checked weak pointer and then continued asynchronous work (including error-injection waitpoints). If invalidation happened in that window, the weak handle could no longer be promoted and the prepare path could fail nondeterministically.
This change keeps a strong cache entry reference alive across the whole critical section in prepare() by using a pinned cache accessor (get_pinned()), and only deriving the weak handle while the entry is pinned. This removes the lifetime gap without adding retry loops.
Test coverage was extended in test/cluster/test_prepare_race.py:
- reproduces the invalidation-during-prepare window with injection,
- verifies prepare completes successfully,
- then invalidates again and executes the same stale client prepared object,
- confirms the driver transparently re-requests/re-prepares and execution succeeds.
This change introduces:
- no behavior change for normal prepare flow besides stronger lifetime guarantees,
- no new protocol semantics,
- preserves existing cache invalidation logic,
- adds explicit cluster-level regression coverage for both the race and driver reprepare path.
- pushes the re prepare operation twards the driver, the server will return unprepared error for the first time and the driver will have to re prepare during execution stage
Fixes: https://github.com/scylladb/scylladb/issues/27657
Backport to active branches recommended: No node crash, but user-visible PREPARE failures under rare schema-invalidation race; low-risk timeout-bounded retry improves robustness.
Closesscylladb/scylladb#28952
* github.com:scylladb/scylladb:
transport/messages: hold pinned prepared entry in PREPARE result
cql3: pin prepared cache entry in prepare() to avoid invalid weak handle race
result_message::prepared now owns a strong pinned prepared-cache entry instead of relying only on a weak pointer view. This closes the remaining lifetime gap after query_processor::prepare() returns, so users of the returned PREPARE message cannot observe an invalidated weak handle during subsequent
processing.
- update result_message::prepared::cql constructor to accept pinned entry
- construct weak view from owned pinned entry inside the message
- pass pinned cache entry from query_processor::prepare() into the message constructor
Authorization checks were previously skipped based on the
_is_internal flag. This couples two concerns: marking client
state as internal and bypassing authorization.
Introduce _bypass_auth_checks to handle only the authorization
bypass. Internal client state sets it to true, preserving current
behavior. External client state accepts it as a constructor
parameter, defaulting to false.
This will allow maintenance socket connections to skip
authorization without being marked as internal.
Refs SCYLLADB-409
The concurrency semaphore gates uninitialized connections across all
do_accepts loops, but was initialized to a fixed value regardless of
how many listeners exist. With multiple listeners competing for the
same units, each effectively gets less than the configured concurrency.
Initialize the semaphore to concurrency - 1 and signal 1 per listen()
call, so total capacity is concurrency - 1 + nr_listeners. This
guarantees each listener's accept loop can have at least one unit
available.
It mainly fixes problem when setting uninitialized_connections_semaphore_cpu_concurrency
config value to 1 would result in not being able to process connections, as only 1 out of 2
listeners got the semaphore.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-762
Backport: no, it's a minor problem
Closesscylladb/scylladb#28747
* github.com:scylladb/scylladb:
test: add test_uninitialized_conns_semaphore
generic_server: fix waiters count in shed log
generic_server: scale connection concurrency semaphore by listener count
This series closes a gap in how CQL request and response sizes are reported.
Previously, request_size and response_size were tracked as simple counters,
providing only cumulative totals per shard. This made it difficult to understand
the distribution of message sizes and identify potential issues with very large
or very small requests.
After this series, the CQL transport reports detailed histogram metrics showing
the distribution of request and response sizes. These histograms are tracked
per-instance, per-type (per ops), and per-scheduling-group, providing
much better visibility into CQL traffic patterns.
The histograms are collected for QUERY, EXECUTE, and BATCH operations, which are
the primary data path operations where message size distribution is most relevant.
This data can help identify:
- Clients sending unexpectedly large requests
- Operations with oversized result sets
- Scheduling group differences in traffic patterns
To support this, the series extends the approx_exponential_histogram template to
handle accurate sum, adds a bytes_histogram type alias optimized for byte-range measurements (1KB to 1GB).
The existing per-shard counter metrics are maintained for backward compatibility.
Metrics example:
```
scylla_transport_cql_request_bytes{kind="BATCH",scheduling_group_name="sl:default",shard="0"} 129808
scylla_transport_cql_request_bytes{kind="EXECUTE",scheduling_group_name="sl:default",shard="0"} 227409
scylla_transport_cql_request_bytes{kind="PREPARE",scheduling_group_name="sl:default",shard="0"} 631
scylla_transport_cql_request_bytes{kind="QUERY",scheduling_group_name="sl:default",shard="0"} 2809
scylla_transport_cql_request_bytes{kind="QUERY",scheduling_group_name="sl:driver",shard="0"} 4079
scylla_transport_cql_request_bytes{kind="REGISTER",scheduling_group_name="sl:default",shard="0"} 98
scylla_transport_cql_request_bytes{kind="STARTUP",scheduling_group_name="sl:driver",shard="0"} 432
scylla_transport_cql_request_histogram_bytes_sum{kind="QUERY",scheduling_group_name="sl:driver"} 4079
scylla_transport_cql_request_histogram_bytes_count{kind="QUERY",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="1024.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="2048.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="4096.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="8192.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="16384.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="32768.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="65536.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="131072.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="262144.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="524288.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="1048576.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="2097152.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="4194304.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="8388608.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="16777216.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="33554432.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="67108864.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="134217728.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="268435456.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="536870912.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="1073741824.000000",scheduling_group_name="sl:driver"} 57
```
**The field sees it as an important issue**
Fixes#14850Closesscylladb/scylladb#28419
* github.com:scylladb/scylladb:
test/boost/estimated_histogram_test.cc: Switch to real Sum
transport/server: to bytes_histogram
approx_exponential_histogram: Add sum() method for accurate value tracking
utils/estimated_histogram.hh: Add bytes_histogram
The concurrency semaphore gates uninitialized connections across all
do_accepts loops, but was initialized to a fixed value regardless of
how many listeners exist. With multiple listeners competing for the
same units, each effectively gets less than the configured concurrency.
Initialize the semaphore to concurrency - 1 and signal 1 per listen()
call, so total capacity is concurrency - 1 + nr_listeners. This
guarantees each listener's accept loop can have at least one unit
available.
The connection's cpu_concurrency_t struct tracks the state of a connection
to manage the admission of new requests and prevent CPU overload during
connection storms. When a connection holds units (allowed only 0 or 1), it is
considered to be in the "CPU state" and contributes to the concurrency limits
used when accepting new connections.
The bug stems from the fact that `counted_data_source_impl::get` and
`counted_data_sink_impl::put` calls can interleave during execution. This
occurs because of `should_parallelize` and `_ready_to_respond`, the latter being
a future chain that can run in the background while requests are being read.
Consequently, while reading request (N), the system may concurrently be
writing the response for request (N-1) on the same connection.
This interleaving allows `return_all()` to be called twice before the
subsequent `consume_units()` is invoked. While the second `return_all()` call
correctly returns 0 units, the matching `consume_units()` call would
mistakenly take an extra unit from the semaphore. Over time, a connection
blocked on a read operation could end up holding an unreturned semaphore
unit. If this pattern repeats across multiple connections, the semaphore
units are eventually depleted, preventing the server from accepting any
new connections.
The fix ensures that we always consume the exact number of units that were
previously returned. With this change, interleaved operations behave as
follows:
get() return_all — returns 1 unit
put() return_all — returns 0 units
get() consume_units — takes back 1 unit
put() consume_units — takes back 0 units
Logically, the networking phase ends when the first network operation
concludes. But more importantly, when a network operation
starts, we no longer hold any units.
Other solutions are possible but the chosen one seems to be the
simplest and safest to backport.
Fixes SCYLLADB-485
This patch replaces simple counters with bytes_histogram for tracking
CQL request and response sizes, enabling better visibility into message
size distribution.
Changes:
- Replace request_size and response_size metrics with bytes_histogram in
cql_sg_stats::request_kind_stats
- Per-shard metrics continue to be reported as before
- QUERY, EXECUTE, and BATCH operations now report per-node, per-scheduling-group
histograms of bytes sent and received, providing detailed insight into these
operations
Other CQL operations (e.g., PREPARE, OPTIONS) are not included in per-node
histogram reporting as they are less performance-critical, but can be added
in the future if proven useful.
Metrics example:
```
# HELP scylla_transport_cql_request_bytes Counts the total number of received bytes in CQL messages of a specific kind.
# TYPE scylla_transport_cql_request_bytes counter
scylla_transport_cql_request_bytes{kind="BATCH",scheduling_group_name="sl:default",shard="0"} 129808
scylla_transport_cql_request_bytes{kind="EXECUTE",scheduling_group_name="sl:default",shard="0"} 227409
scylla_transport_cql_request_bytes{kind="PREPARE",scheduling_group_name="sl:default",shard="0"} 631
scylla_transport_cql_request_bytes{kind="QUERY",scheduling_group_name="sl:default",shard="0"} 2809
scylla_transport_cql_request_bytes{kind="QUERY",scheduling_group_name="sl:driver",shard="0"} 4079
scylla_transport_cql_request_bytes{kind="REGISTER",scheduling_group_name="sl:default",shard="0"} 98
scylla_transport_cql_request_bytes{kind="STARTUP",scheduling_group_name="sl:driver",shard="0"} 432
# HELP scylla_transport_cql_request_histogram_bytes A histogram of received bytes in CQL messages of a specific kind and specific scheduling group.
# TYPE scylla_transport_cql_request_histogram_bytes histogram
scylla_transport_cql_request_histogram_bytes_sum{kind="QUERY",scheduling_group_name="sl:driver"} 4079
scylla_transport_cql_request_histogram_bytes_count{kind="QUERY",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="1024.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="2048.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="4096.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="8192.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="16384.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="32768.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="65536.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="131072.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="262144.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="524288.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="1048576.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="2097152.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="4194304.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="8388608.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="16777216.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="33554432.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="67108864.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="134217728.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="268435456.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="536870912.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="1073741824.000000",scheduling_group_name="sl:driver"} 57
```