CDC and $paxos tables are managed internally by Scylla. Users are
already prohibited from running ALTER and DROP commands on CDC tables.
In this commit, we extend the same restrictions to $paxos tables to
prevent users from shooting themselves in the foot.
Other commands are generally allowed for CDC and $paxos tables. An
important distinction is that CDC tables are meant to be accessed
directly by users, so appropriate permissions must be set for
non-superusers. In contrast, $paxos tables are not intended for direct
access by users. Therefore, this commit explicitly disallows
non-superusers from accessing them. Superusers are still allowed
access for debugging and troubleshooting purposes.
Note that these restrictions apply even if explicit permissions have
been granted. For example, a non-superuser may be granted SELECT
permissions on a $paxos table, but the restriction above will
still take precedence. We don't try to restrict users
from giving permissions to $paxos tables for simplicity.
This is a refactoring commit — it extracts the CDC permissions handling
logic into a separate function: check_internal_table_permissions.
This is a preparatory step for the next commit, where we'll handle
paxos state tables similarly to CDC tables.
We want to access the paxos state table only on the local node and
shard (or shards in case of intranode_migration). In this commit we
add a node_local_only flag to query_options, which allows to do that.
This flag can be set for a query via make_internal_options.
We handle this flag on the statements layer by forwarding it to
either coordinator_query_options or coordinator_mutate_options.
We add the remove_non_local_host_ids() helper, which
will be used in the next commit to support the read
path. HostIdVector concept is introduced to be able
to handle both host_id_vector_replica_set and
host_id_vector_topology_change uniformly.
The storage_proxy_coordinator_mutate_options class
is declared outside of storage_proxy to avoid C++
compiler complaints about default field initializers.
In particular, some storage_proxy methods use this
class for optional parameters with default values,
which is not allowed when the class is defined inside
storage_proxy.
Add a per-request flag that restricts query execution
to the local node by filtering out all non-local replicas.
Standard consistency level (CL) rules still apply:
if the local node alone cannot satisfy the
requested CL, an exception is thrown.
This flag is required for Paxos state access, where
reads and writes must target only the local node.
As a side effect, this also enables the implementation
of scylladb/scylladb#16478, which proposes a CQL
extension to expose 'local mode' query execution to users.
Support for this flag in storage_proxy's read and write
code paths will be added in follow-up commits.
In upcoming commits, we want to add a node_local_only flag to both read
and write paths in storage_proxy. This requires passing the flag from
query_processor to the part of storage_proxy where replica selection
decisions are made.
For reads, it's sufficient to add the flag to the existing
coordinator_query_options class. For writes, there is no such options
container, so we introduce coordinator_mutate_options in this commit.
In the future, we may move some of the many mutate() method arguments
into this container to simplify the code.
Most of the create_write_response_handler overloads follow the same
signature pattern to satisfy the sp::mutate_prepare call. The one which
doesn't follow it is invoked by others and is responsible for creating
a concrete handler instance. In this refactoring commit we rename
it to make_write_response_handler to reduce confusion.
This is a refactoring commit. We remove extra lambda parameters from
mutate_prepare since the CreateWriteHandler lambda can simply
capture them.
We can't std::move(permit) in another mutate_prepare overload,
because each handler wants its own copy of this pemit.
We call paxos_store::ensure_initialized in the beginning of
storage_proxy::cas to create a paxos state table for a user table if
it doesn't exist. When the LWT coordinator sends RPCs to replicas,
some of them may not yet have the paxos schema. In
paxos_store::get_paxos_state_schema we just wait for them to appear,
or throw 'no_such_column_family' if the base table was dropped.
Pass a timeout parameter through to start_operation()
and add_entry(), respectively.
This is a preparatory change for the next commit, which
will use the timeout to properly handle timeouts during
lazy creation of Paxos state tables.
Switch paxos_store from using internal queries to regular prepared
queries, so that prepared statements are correctly updated when
the base table is recreated.
The do_execute_cql_with_timeout function is extracted to reduce
code bloat when execute_cql_with_timeout template function
is instantiated.
We change return type of execute_cql_with_timeout to untyped_result_set
since shared_ptr is not really needed here.
In upcoming commits, we will switch paxos_store from using internal
queries to regular prepared queries, so that prepared statements are
correctly updated when the base table is recreated. To support this,
we want to reuse the logic for converting parameters from
vector<data_value_or_unset> to raw_value_vector_with_unset.
This commit makes make_internal_options public to enable that reuse.
We want to reuse the same queries to access system.paxos and the the
co-located table. A separate co-located table will be created for each
user table, so we won't need cf_id filter for them. In this commit
we make cf_if filter optional and apply it only if the stable table
is actually system.paxos.
This is another preparational step. We want to add more logic to
paxos_store state access functions in the next commits, it's easier
to do with coroutines.
Pass ballot by value to delete_paxos_decision because
paxos_state::prune is not a coroutine and the ballot parameter
is destroyed when we return from it. The alternative
solution -- pass by const reference to paxos_state::prune -- doesn't
work because paxos_state::prune is called
from a lambda in paxos_response_handler::prune, this lambda is
not a coroutine and the 'ballot' field could be destroyed along
with the body of this lambda as soon as we return from
paxos_state::prune.
Introduce paxos_store abstraction to isolate Paxos state access.
Prepares for supporting either system.paxos or a co-located
table as the storage backend.
Currently we do token metadata barrier before accepting a replacing
node. It was needed for the "replace with the same IP" case to make sure
old request will not contact new node by mistake. But now since we
address nodes by id this is no longer possible since old requests will
use old id and will be rejected.
Closesscylladb/scylladb#25047
Skip removing any artifacts when -s provided between test.py invocation.
Logs from the previous run will be overridden if tests were executed one
more time. Fox example:
1. Execute tests A, B, C with parameter -s
2. All logs are present even if tests are passed
3. Execute test B with parameter -s
4. Logs for A and C are from the first run
5. Logs for B are from the most recent run
Backport is not needed, since it framework enhancement.
Closesscylladb/scylladb#24838
* github.com:scylladb/scylladb:
test.py: skip cleaning artifacts when -s provided
test.py: move deleting directory to prepare_dir
Currently it grows dynamically and triggers oversized allocation
warning. Also it may be hard to find sufficient contiguous memory chunk
after the system runs for a while. This patch pre-allocates enough
memory for ~1M outstanding writes per shard.
Fixes#24660Fixes#24217Closesscylladb/scylladb#25098
When a node shuts down, in storage service, after storage_proxy RPCs are stopped, some write handlers within storage_proxy may still be waiting for background writes to complete. These handlers hold appropriate ERMs to block schema changes before the write finishes. After the RPCs are stopped, these writes cannot receive the replies anymore.
If, at the same time, there are RPC commands executing `barrier_and_drain`, they may get stuck waiting for these ERM holders to finish, potentially blocking node shutdown until the writes time out.
This change introduces cancellation of all outstanding write handlers from storage_service after the storage proxy RPCs were stopped.
Fixesscylladb/scylladb#23665
Backport: since this fixes an issue that frequently causes issues in CI, backport to 2025.1, 2025.2, and 2025.3.
Closesscylladb/scylladb#24714
* https://github.com/scylladb/scylladb:
storage_service: Cancel all write requests on storage_proxy shutdown
test: Add test for unfinished writes during shutdown and topology change
If schema pull are disabled group0 is used to bring up to date schema
by calling start_group0_operation() which executes raft read barrier
internally, but if the group0 is still in use_pre_raft_procedures
start_group0_operation() silently does nothing. Later the code that
assumes that schema is already up-to-date will fail and print warnings
into the log. But since getting queries in the state when a node is in
raft enabled mode but group0 is still not configured is illegal it is
better to make those errors more visible buy asserting them during
testing.
Closesscylladb/scylladb#25112
* seastar 26badcb1...60b2e7da (42):
> Revert "Fix incorrect defaults for io queue iops/bandwidth"
> fair_queue: Ditch queue-wide accumulator reset on overflow
> addr2line, scripts/stall-analyser: change the default tool to llvm-addr2line
> Fix incorrect defaults for io queue iops/bandwidth
> core/reactor: add cxx_exceptions() getter
> gate: make destructor virtual
> scripts/seastar-addr2line: change the default addr2line utility to llvm-addr2line
> coding-style: Align example return types
> reactor: Remove min_vruntime() declaration
> reactor: Move enable_timer() method to private section
> smp: fix missing span include
> core: Don't keep internal errors counter on reactor
> pollable_fd: Untangle shutdown()
> io_queue: Remove deprecated statistics getters
> fair_queue: Remove queued/executing resource counters
> reactor: Move set_current_task() from public reactor API
> util: make SEASTAR_ASSERT() failure generate SIGABRT
> core: fix high CPU use at idle on high core count machines
> Merge 'Move output IO throttler to IO queue level' from Pavel Emelyanov
fair_queue: Move io_throttler to io_queue.hh
fair_queue: Move metrics from to io_queue::stream
fair_queue: Remove io_throttler from tests
fair_queue_test: Remove io-throttler from fair-queue
fair_queue: Remove capacity getters
fair_queue: Move grab_result into io_queue::stream too
fair_queue: Move throtting code to io_queue.cc
fair_queue: Move throttling code to io_queue::stream class
fair_queue: Open-code dispatch_requests() into users
fair_queue: Split dispatch_requests() into top() and pop_front()
fair_queue: Swap class push back and dispatch
fair_queue: Configure forgiving factor externally
fair_queue: Move replenisher kick to dispatch caller
io_queue: Introduce io_queue::stream
fair_queue: Merge two grab_capacity overloads
fair_queue: Detatch outcoming capacity grabbing from main dispatch loop
fair_queue: Move available tokens update into if branch
io_queue: Rename make_fair_group_config into configure_throttler
io_queue: Rename get_fair_group into get_throttler
fair_queue: Rename fair_group -> io_throttler
> http::reply: Add 308 (permanent redirect) and make pretty-print handle unknown values
> Merge 'Relax reactor coupling with file_data_source_impl' from Pavel Emelyanov
reactor: Relax friendship with file_data_source_impl
fstream: Use direct io_stats reference
> thread_pool: Relax coupling with reactor
> reactor: Mark some IO classes management methods private
> http: Deprecate json_exception
> io_tester: Collect and report disk queue length samples
> test/perf: Add context-switch measurer
> http/client: Zero-copy forward content-length body into the underlying stream
> json2code: Genrate move constructor and move-assignment operator
> Merge 'Semi-mixed mode for output_stream' from Pavel Emelyanov
output_stream: Support semi-mixed mode writing
output_stream: Complete write(temporary_buffer) piggy-back-ing write(packet)
iostream: Add friends for iostream tests
packet: Mark bool cast operator const
iostream: Document output_stream::write() methods
> io_tester: Show metrics about requests split
> reactor: add counter for internal errors
> iotune: Print correct throughput units
> core: add label to io_threaded_fallbacks to categorize operations
> slab: correct allocation logic and enforce memory limits
> Merge 'Fix for non-json http function_handlers' from Travis Downs
httpd_test: add test for non-JSON function handler
function_handlers: avoid implicit conversions
http: do not always treat plain text reply as json
> Merge 'tls: add ALPN support' from Łukasz Kurowski
tls: add server-side ALPN support
tls: add client-side ALPN support
> Merge 'coroutine: experimental: generator: implement move and swap' from Benny Halevy
coroutine: experimental: generator: implement move and swap
coroutine: experimental: generator: unconstify buffer capacity
> future: downgrade asserts
> output_stream: Remove unused bits
> Merge 'Upstream a couple of minor reactor optimizations' from Travis Downs
Match type for pure_check_for_work
Do not use std::function for check_for_work()
> Handle ENOENT in getgrnam
Includes scylla-gdb.py update by Pavel Emelyanov.
Closesscylladb/scylladb#25094
During a graceful node shutdown, RPC listeners are stopped in `storage_service::drain_on_shutdown`
as one of the first steps. However, even after RPCs are shut down, some write handlers in
`storage_proxy` may still be waiting for background writes to complete. These handlers retain the ERM.
Since the RPC subsystem is no longer active, replies cannot be received, and if any RPC commands are
concurrently executing `barrier_and_drain`, they may get stuck waiting for those writes. This can block
the messaging server shutdown and delay the entire shutdown process until the write timeout occurs.
This change introduces the cancellation of all outstanding write handlers in `storage_proxy`
during shutdown to prevent unnecessary delays.
Fixesscylladb/scylladb#23665
This test reproduces an issue where a topology change and an ongoing write query
during query coordinator shutdown can cause the node to get stuck.
When a node receives a write request, it creates a write handler that holds
a copy of the current table's ERM (Effective Replication Map). The ERM ensures
that no topology or schema changes occur while the request is being processed.
After the query coordinator receives the required number of replica write ACKs
to satisfy the consistency level (CL), it sends a reply to the client. However,
the write response handler remains alive until all replicas respond — the remaining
writes are handled in the background.
During shutdown, when all network connections are closed, these responses can no longer
be received. As a result, the write response handler is only destroyed once the write
timeout is reached.
This becomes problematic because the ERM held by the handler blocks topology or schema
change commands from executing. Since shutdown waits for these commands to complete,
this can lead to unnecessary delays in node shutdown and restarts, and occasional
test case failures.
Test for: scylladb/scylladb#23665
- Change schedule from twice weekly (Mon/Thu) to once weekly (Mon only)
- Extend notification cooldown period from 3 days to 1 week
- Prevent notification spam while maintaining immediate conflict detection on pushes
Fixes: https://github.com/scylladb/scylladb/issues/25130Closesscylladb/scylladb#25131
If a test fails very early (still have to find why), test.py
crashes while flushing a non-existent log_file, as shown below.
To fix, initialize the property to None and check it during
cleanup.
```
================================================================================
[N/TOTAL] SUITE MODE RESULT TEST
------------------------------------------------------------------------------
'ScyllaServer' object has no attribute 'log_file'
test_cluster_features Traceback (most recent call last):
File "/home/avi/scylla-maint/./test.py", line 816, in <module>
sys.exit(asyncio.run(main()))
~~~~~~~~~~~^^^^^^^^
File "/usr/lib64/python3.13/asyncio/runners.py", line 195, in run
return runner.run(main)
~~~~~~~~~~^^^^^^
File "/usr/lib64/python3.13/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
File "/usr/lib64/python3.13/asyncio/base_events.py", line 725, in run_until_complete
return future.result()
~~~~~~~~~~~~~^^
File "/home/avi/scylla-maint/./test.py", line 523, in main
total_tests_pytest, failed_pytest_tests = await run_all_tests(signaled, options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/avi/scylla-maint/./test.py", line 452, in run_all_tests
failed += await reap(done, pending, signaled)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/avi/scylla-maint/./test.py", line 418, in reap
result = coro.result()
File "/home/avi/scylla-maint/test/pylib/suite/python.py", line 143, in run
return await super().run(test, options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/avi/scylla-maint/test/pylib/suite/base.py", line 216, in run
await test.run(options)
File "/home/avi/scylla-maint/test/pylib/suite/topology.py", line 48, in run
async with get_cluster_manager(self.uname, self.suite.clusters, str(self.suite.log_dir)) as manager:
~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib64/python3.13/contextlib.py", line 221, in __aexit__
await anext(self.gen)
File "/home/avi/scylla-maint/test/pylib/scylla_cluster.py", line 2006, in get_cluster_manager
await manager.stop()
File "/home/avi/scylla-maint/test/pylib/scylla_cluster.py", line 1539, in stop
await self.clusters.put(self.cluster, is_dirty=True)
File "/home/avi/scylla-maint/test/pylib/pool.py", line 104, in put
await self.destroy(obj)
File "/home/avi/scylla-maint/test/pylib/suite/python.py", line 65, in recycle_cluster
srv.log_file.close()
^^^^^^^^^^^^
AttributeError: 'ScyllaServer' object has no attribute 'log_file'
```
Closesscylladb/scylladb#24885
CPU scheduling has been with us since 641aaba12c
(2017), and no one ever disables it. Likely nothing really works without
it.
Make it mandatory and mark the option unused.
Closesscylladb/scylladb#24894
Rather than calling std::views::transform with a lambda that extracts
a member from a class, call std::views::transform with a pointer-to-member
to do the same thing. This results in more concise code.
Closesscylladb/scylladb#25012
test/cqlpy/README.md explains how to run the cqlpy tests against
Cassandra, and mentions that if you don't have "nodetool" in your path
you need to set the NODETOOL variable. However, when giving a simple
example how to use the run-cassandra script, we forgot to remind the
user to set NODETOOL in addition to CASSANDRA, causing confusion for
users who didn't know why tests were failing.
So this patch fixes the section in test/cqlpy/README.md with the
run-cassandra example to also set the NODETOOL environment variable,
not just CASSANDRA.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#25051
Parent task keeps a vector of statuses (task_essentials) of its finished
children. When the children number is large - for example because we
have many tables and a child task is created for each table - we may hit
oversize allocation while adding a new child essentials to the vector.
Keep task_essentails of children in chunked_vector.
Fixes: #25040.
Closesscylladb/scylladb#25064
Audit tests are vulnerable to noise from LOGIN queries (because AUTH
audit logs can appear at any time). Most tests already use the
`filter_out_noise` mechanism to remove this noise, but tests
focused on AUTH verification did not, leading to sporadic failures.
This change adds a filter to ignore AUTH logs generated by the default
"cassandra" user, so tests only verify logs from the user created
specifically for each test.
Additionally, this PR:
- Adds missing `nonlocal new_rows` statement that prevented some checks from being called
- Adds a testcase for audit logs of `cassandra` user
Fixes: https://github.com/scylladb/scylladb/issues/25069
Better backport those test changes to 2025.3. 2025.2 and earlier don't have `./cluster/dtest/audit_test.py`.
Closesscylladb/scylladb#25111
* github.com:scylladb/scylladb:
test: audit: add cassandra user test case
test: audit: ignore cassandra user audit logs in AUTH tests
test: audit: change names of `filter_out_noise` parameters
test: audit: add missing `nonlocal new_rows` statement
Enhance and fix error handling in the `chunked_download_source` to prevent errors seeping from the request callback. Also stop retrying on seastar's side since it is going to break the integrity of data which maybe downloaded more than once for the same range.
Fixes: https://github.com/scylladb/scylladb/issues/25043
Should be backported to 2025.3 since we have an intention to release native backup/restore feature
Closesscylladb/scylladb#24883
* github.com:scylladb/scylladb:
s3_client: Disable Seastar-level retries in HTTP client creation
s3_test: Validate handling of non-`aws_error` exceptions
s3_client: Improve error handling in chunked_download_source
aws_error: Add factory method for `aws_error` from exception
Fixes: #25045
added the ability to supply the list of files to
restore from the a given file.
mainly required for local testing.
Signed-off-by: Ran Regev <ran.regev@scylladb.com>
Closesscylladb/scylladb#25077
Prevent Seastar from retrying HTTP requests to avoid buffer double-feed
issues when an entire request is retried. This could cause data
corruption in `chunked_download_source`. The change is global for every
instance of `s3_client`, but it is still safe because:
* Seastar's `http_client` resets connections regardless of retry behavior
* `s3_client` retry logic handles all error types—exceptions, HTTP errors,
and AWS-specific errors—via `http_retryable_client`
Create aws_error from raised exceptions when possible and respond
appropriately. Previously, non-aws_exception types leaked from the
request handler and were treated as non-retryable, causing potential
data corruption during download.
Audit tests use the `filter_out_noise` function to remove noise from
audit logs generated by user authentication. As a result, none of the
existing tests covered audit logs for the default `cassandra` user.
This change adds a test case for that user.
Refs: scylladb/scylladb#25069
Audit tests are vulnerable to noise from LOGIN queries (because AUTH
audit logs can appear at any time). Most tests already use the
`filter_out_noise` mechanism to remove this noise, but tests
focused on AUTH verification did not, leading to sporadic failures.
This change adds a filter to ignore AUTH logs generated by the default
"cassandra" user, so tests only verify logs from the user created
specifically for each test.
Fixes: scylladb/scylladb#25069
This is a refactoring commit that changes the names of the parameters
of the `filter_out_noise` function, as well as names of related
variables. The motiviation for the change is introduction of more
complex filtering logic in next commit of this patch series.
Refs: scylladb/scylladb#25069
The variable `new_rows` was not updated by the inner function
`is_number_of_new_rows_correct` because the `nonlocal new_rows`
statement was missing. As a result, `sorted_new_rows` was empty and
certain checks were skipped.
This change:
- Introduces the missing `nonlocal new_rows` declaration
- Adds an assertion verifying that the number of new rows matches
the expected count
- Fixes the incorrect variable name in the lambda used for row sorting