48597 Commits

Author SHA1 Message Date
Petr Gusev
78aa36b257 check_internal_table_permissions: handle Paxos state tables
CDC and $paxos tables are managed internally by Scylla. Users are
already prohibited from running ALTER and DROP commands on CDC tables.
In this commit, we extend the same restrictions to $paxos tables to
prevent users from shooting themselves in the foot.

Other commands are generally allowed for CDC and $paxos tables. An
important distinction is that CDC tables are meant to be accessed
directly by users, so appropriate permissions must be set for
non-superusers. In contrast, $paxos tables are not intended for direct
access by users. Therefore, this commit explicitly disallows
non-superusers from accessing them. Superusers are still allowed
access for debugging and troubleshooting purposes.

Note that these restrictions apply even if explicit permissions have
been granted. For example, a non-superuser may be granted SELECT
permissions on a $paxos table, but the restriction above will
still take precedence. We don't try to restrict users
from giving permissions to $paxos tables for simplicity.
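
The rules described in this message can be summarised in a small, language-agnostic sketch. It is not the actual C++ implementation: the `User` type, the `Unauthorized` exception and the CDC suffix are assumptions made for illustration, and the sketch assumes the ALTER/DROP prohibition applies to every user.

```python
from dataclasses import dataclass

# Suffixes used only for this illustration; "$paxos" comes from the commit
# message, the CDC suffix is an assumption.
CDC_SUFFIX = "_scylla_cdc_log"
PAXOS_SUFFIX = "$paxos"


class Unauthorized(Exception):
    """Hypothetical error type raised when a command is rejected."""


@dataclass
class User:
    name: str
    is_superuser: bool = False


def check_internal_table_permissions(user: User, table: str, command: str) -> None:
    """Raise Unauthorized if `command` is not allowed on an internal table."""
    is_cdc = table.endswith(CDC_SUFFIX)
    is_paxos = table.endswith(PAXOS_SUFFIX)
    if not (is_cdc or is_paxos):
        return  # Regular table: ordinary permission checks happen elsewhere.

    # ALTER and DROP are rejected on both kinds of internal table.
    if command in ("ALTER", "DROP"):
        raise Unauthorized(f"{command} is not allowed on internal table {table}")

    # Paxos state tables are closed to non-superusers, even if a permission
    # such as SELECT was explicitly granted.
    if is_paxos and not user.is_superuser:
        raise Unauthorized(f"only superusers may access {table}")
```

Because the check runs independently of granted permissions, a GRANT on a $paxos table can succeed while access is still refused, which matches the precedence described above.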
2025-07-24 19:48:08 +02:00
Petr Gusev
ec3c5f4cbc client_state: extract check_internal_table_permissions
This is a refactoring commit — it extracts the CDC permissions handling
logic into a separate function: check_internal_table_permissions.

This is a preparatory step for the next commit, where we'll handle
paxos state tables similarly to CDC tables.
2025-07-24 19:48:08 +02:00
Petr Gusev
bb4e7a669f paxos_store: handle base table removal
Subscribe to on_before_drop_column_family to drop the associated
Paxos state table when the corresponding user table is dropped.
2025-07-24 19:48:08 +02:00
Petr Gusev
1b70623908 database: get_base_table_for_tablet_colocation: handle paxos state table
We need to mark paxos state table as colocated with the user table, so
that the corresponding tablets are migrated/repaired together.
2025-07-24 19:48:08 +02:00
Petr Gusev
03aa2e4823 paxos_state: use node_local_only mode to access paxos state 2025-07-24 19:48:08 +02:00
Petr Gusev
ff1caa9798 query_options: add node_local_only mode
We want to access the paxos state table only on the local node and
shard (or shards in case of intranode_migration). In this commit we
add a node_local_only flag to query_options to make that possible.
This flag can be set for a query via make_internal_options.

We handle this flag on the statements layer by forwarding it to
either coordinator_query_options or coordinator_mutate_options.
2025-07-24 19:48:08 +02:00
Petr Gusev
65c7e36b7c storage_proxy: handle node_local_only in query
In this commit we support node_local_only flag in read code path in
storage_proxy.
2025-07-24 19:48:08 +02:00
Petr Gusev
2d747d97b8 storage_proxy: handle node_local_only in mutate
We add the remove_non_local_host_ids() helper, which
will be used in the next commit to support the read
path. HostIdVector concept is introduced to be able
to handle both host_id_vector_replica_set and
host_id_vector_topology_change uniformly.

The storage_proxy_coordinator_mutate_options class
is declared outside of storage_proxy to avoid C++
compiler complaints about default field initializers.
In particular, some storage_proxy methods use this
class for optional parameters with default values,
which is not allowed when the class is defined inside
storage_proxy.
2025-07-24 19:48:08 +02:00
Petr Gusev
7eb198f2cc storage_proxy: introduce node_local_only flag
Add a per-request flag that restricts query execution
to the local node by filtering out all non-local replicas.
Standard consistency level (CL) rules still apply:
if the local node alone cannot satisfy the
requested CL, an exception is thrown.
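
A simplified model of this behaviour, written as a sketch and not taken from storage_proxy: the function names, the `Unavailable` exception and the reduced set of consistency levels are all assumptions for illustration.

```python
from enum import Enum


class ConsistencyLevel(Enum):
    ONE = "ONE"
    QUORUM = "QUORUM"
    ALL = "ALL"


class Unavailable(Exception):
    """Hypothetical error raised when the CL cannot be satisfied."""


def required_replicas(cl: ConsistencyLevel, replication_factor: int) -> int:
    """Number of replicas that must respond for the given CL."""
    if cl is ConsistencyLevel.ONE:
        return 1
    if cl is ConsistencyLevel.QUORUM:
        return replication_factor // 2 + 1
    return replication_factor


def select_replicas(replicas, local_host, cl, node_local_only=False):
    """Pick target replicas, optionally restricted to the local node."""
    # The requirement is computed from the full replica set, so the flag
    # does not relax the consistency level.
    needed = required_replicas(cl, len(replicas))
    if node_local_only:
        # Drop every replica that is not the local node.
        replicas = [r for r in replicas if r == local_host]
    if len(replicas) < needed:
        raise Unavailable(f"need {needed} replicas, have {len(replicas)}")
    return replicas
```

In this model a request with the flag set succeeds at ONE when the local node is a replica, and fails at QUORUM for any replication factor above one.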

This flag is required for Paxos state access, where
reads and writes must target only the local node.

As a side effect, this also enables the implementation
of scylladb/scylladb#16478, which proposes a CQL
extension to expose 'local mode' query execution to users.

Support for this flag in storage_proxy's read and write
code paths will be added in follow-up commits.
2025-07-24 19:48:08 +02:00
Petr Gusev
8e745137de abstract_replication_strategy: remove unused using 2025-07-24 19:48:08 +02:00
Petr Gusev
4c1aca3927 storage_proxy: add coordinator_mutate_options
In upcoming commits, we want to add a node_local_only flag to both read
and write paths in storage_proxy. This requires passing the flag from
query_processor to the part of storage_proxy where replica selection
decisions are made.

For reads, it's sufficient to add the flag to the existing
coordinator_query_options class. For writes, there is no such options
container, so we introduce coordinator_mutate_options in this commit.

In the future, we may move some of the many mutate() method arguments
into this container to simplify the code.
2025-07-24 19:48:08 +02:00
Petr Gusev
b6ccaffd45 storage_proxy: rename create_write_response_handler -> make_write_response_handler
Most of the create_write_response_handler overloads follow the same
signature pattern to satisfy the sp::mutate_prepare call. The one which
doesn't follow it is invoked by others and is responsible for creating
a concrete handler instance. In this refactoring commit we rename
it to make_write_response_handler to reduce confusion.
2025-07-24 19:48:08 +02:00
Petr Gusev
db946edd1d storage_proxy: simplify mutate_prepare
This is a refactoring commit. We remove extra lambda parameters from
mutate_prepare since the CreateWriteHandler lambda can simply
capture them.

We can't std::move(permit) in another mutate_prepare overload,
because each handler wants its own copy of this permit.
2025-07-24 19:48:08 +02:00
Petr Gusev
ac4bc3f816 paxos_state: lazily create paxos state table
We call paxos_store::ensure_initialized in the beginning of
storage_proxy::cas to create a paxos state table for a user table if
it doesn't exist. When the LWT coordinator sends RPCs to replicas,
some of them may not yet have the paxos schema. In
paxos_store::get_paxos_state_schema we just wait for them to appear,
or throw 'no_such_column_family' if the base table was dropped.
2025-07-24 19:48:08 +02:00
Petr Gusev
3e0347c614 migration_manager: add timeout to start_group0_operation and announce
Pass a timeout parameter through to start_operation()
and add_entry(), respectively.

This is a preparatory change for the next commit, which
will use the timeout to properly handle timeouts during
lazy creation of Paxos state tables.
2025-07-24 16:39:50 +02:00
Petr Gusev
519f40a95e paxos_store: use non-internal queries
Switch paxos_store from using internal queries to regular prepared
queries, so that prepared statements are correctly updated when
the base table is recreated.

The do_execute_cql_with_timeout function is extracted to reduce
code bloat when execute_cql_with_timeout template function
is instantiated.

We change return type of execute_cql_with_timeout to untyped_result_set
since shared_ptr is not really needed here.
2025-07-24 16:39:50 +02:00
Petr Gusev
6caa1ae649 qp: make make_internal_options public
In upcoming commits, we will switch paxos_store from using internal
queries to regular prepared queries, so that prepared statements are
correctly updated when the base table is recreated. To support this,
we want to reuse the logic for converting parameters from
vector<data_value_or_unset> to raw_value_vector_with_unset.
This commit makes make_internal_options public to enable that reuse.
2025-07-24 16:39:50 +02:00
Petr Gusev
13f7266052 paxos_store: conditional cf_id filter
We want to reuse the same queries to access system.paxos and the
co-located table. A separate co-located table will be created for each
user table, so we won't need the cf_id filter for them. In this commit
we make the cf_id filter optional and apply it only if the state table
is actually system.paxos.
2025-07-24 16:39:50 +02:00
Petr Gusev
370f91adb7 paxos_store: coroutinize
This is another preparatory step. We want to add more logic to
paxos_store state access functions in the next commits, which is easier
to do with coroutines.

Pass ballot by value to delete_paxos_decision because
paxos_state::prune is not a coroutine and the ballot parameter
is destroyed when we return from it. The alternative
solution -- pass by const reference to paxos_state::prune -- doesn't
work because paxos_state::prune is called
from a lambda in paxos_response_handler::prune, this lambda is
not a coroutine and the 'ballot' field could be destroyed along
with the body of this lambda as soon as we return from
paxos_state::prune.
2025-07-24 16:39:50 +02:00
Petr Gusev
ab03badc15 feature_service: add LWT_WITH_TABLETS feature
We will need this feature to determine if it's safe to enable
LWTs for a tablet-based table.
2025-07-24 16:39:50 +02:00
Petr Gusev
8292ecf2e1 paxos_state: inline system_keyspace functions into paxos_store
Prepares for reusing the same functions to access either
system.paxos or a co-located table.
2025-07-24 16:39:50 +02:00
Petr Gusev
6e87a6cdb0 paxos_state: extract state access functions into paxos_store
Introduce paxos_store abstraction to isolate Paxos state access.
Prepares for supporting either system.paxos or a co-located
table as the storage backend.
2025-07-24 16:39:50 +02:00
Gleb Natapov
d5e023bbad topology coordinator: drop no longer needed token metadata barrier
Currently we execute a token metadata barrier before accepting a replacing
node. It was needed for the "replace with the same IP" case to make sure
an old request does not contact the new node by mistake. But now that we
address nodes by id, this is no longer possible: old requests will
use the old id and will be rejected.

Closes scylladb/scylladb#25047
2025-07-24 11:15:42 +02:00
Tomasz Grabiec
c9bf010d6d Merge 'test.py: skip cleaning testlog' from Andrei Chekun
Skip removing any artifacts between test.py invocations when -s is provided.
Logs from a previous run are overwritten only for tests that are executed
again. For example:
1. Execute tests A, B, C with parameter -s
2. All logs are present even if the tests pass
3. Execute test B with parameter -s
4. Logs for A and C are from the first run
5. Logs for B are from the most recent run

Backport is not needed, since it is a framework enhancement.

Closes scylladb/scylladb#24838

* github.com:scylladb/scylladb:
  test.py: skip cleaning artifacts when -s provided
  test.py: move deleting directory to prepare_dir
2025-07-24 09:46:42 +03:00
Gleb Natapov
ab6e328226 storage_proxy: preallocate write response handler hash table
Currently it grows dynamically and triggers oversized allocation
warning. Also it may be hard to find sufficient contiguous memory chunk
after the system runs for a while. This patch pre-allocates enough
memory for ~1M outstanding writes per shard.

Fixes #24660
Fixes #24217

Closes scylladb/scylladb#25098
2025-07-24 09:46:42 +03:00
Patryk Jędrzejczak
f89ffe491a Merge 'storage_service: cancel all write requests after stopping transports' from Sergey Zolotukhin
When a node shuts down, in storage service, after storage_proxy RPCs are stopped, some write handlers within storage_proxy may still be waiting for background writes to complete. These handlers hold appropriate ERMs to block schema changes before the write finishes. After the RPCs are stopped, these writes cannot receive the replies anymore.

If, at the same time, there are RPC commands executing `barrier_and_drain`, they may get stuck waiting for these ERM holders to finish, potentially blocking node shutdown until the writes time out.

This change introduces cancellation of all outstanding write handlers from storage_service after the storage proxy RPCs were stopped.

Fixes scylladb/scylladb#23665

Backport: since this fixes an issue that frequently causes issues in CI, backport to 2025.1, 2025.2, and 2025.3.

Closes scylladb/scylladb#24714

* https://github.com/scylladb/scylladb:
  storage_service: Cancel all write requests on storage_proxy shutdown
  test: Add test for unfinished writes during shutdown and topology change
2025-07-24 09:46:42 +03:00
Gleb Natapov
ddc3b6dcf5 migration manager: assert that if schema pull is disabled the group0 is not in use_pre_raft_procedures state
If schema pulls are disabled, group0 is used to bring the schema up to date
by calling start_group0_operation(), which executes a raft read barrier
internally. But if group0 is still in the use_pre_raft_procedures state,
start_group0_operation() silently does nothing. Later, the code that
assumes the schema is already up to date will fail and print warnings
to the log. Since receiving queries while a node is in raft-enabled mode
but group0 is not yet configured is illegal, it is better to make those
errors more visible by asserting on them during testing.

Closes scylladb/scylladb#25112
2025-07-23 14:10:17 +02:00
Botond Dénes
b65a2e2303 Update seastar submodule
* seastar 26badcb1...60b2e7da (42):
  > Revert "Fix incorrect defaults for io queue iops/bandwidth"
  > fair_queue: Ditch queue-wide accumulator reset on overflow
  > addr2line, scripts/stall-analyser: change the default tool to llvm-addr2line
  > Fix incorrect defaults for io queue iops/bandwidth
  > core/reactor: add cxx_exceptions() getter
  > gate: make destructor virtual
  > scripts/seastar-addr2line: change the default addr2line utility to llvm-addr2line
  > coding-style: Align example return types
  > reactor: Remove min_vruntime() declaration
  > reactor: Move enable_timer() method to private section
  > smp: fix missing span include
  > core: Don't keep internal errors counter on reactor
  > pollable_fd: Untangle shutdown()
  > io_queue: Remove deprecated statistics getters
  > fair_queue: Remove queued/executing resource counters
  > reactor: Move set_current_task() from public reactor API
  > util: make SEASTAR_ASSERT() failure generate SIGABRT
  > core: fix high CPU use at idle on high core count machines
  > Merge 'Move output IO throttler to IO queue level' from Pavel Emelyanov
    fair_queue: Move io_throttler to io_queue.hh
    fair_queue: Move metrics from to io_queue::stream
    fair_queue: Remove io_throttler from tests
    fair_queue_test: Remove io-throttler from fair-queue
    fair_queue: Remove capacity getters
    fair_queue: Move grab_result into io_queue::stream too
    fair_queue: Move throtting code to io_queue.cc
    fair_queue: Move throttling code to io_queue::stream class
    fair_queue: Open-code dispatch_requests() into users
    fair_queue: Split dispatch_requests() into top() and pop_front()
    fair_queue: Swap class push back and dispatch
    fair_queue: Configure forgiving factor externally
    fair_queue: Move replenisher kick to dispatch caller
    io_queue: Introduce io_queue::stream
    fair_queue: Merge two grab_capacity overloads
    fair_queue: Detatch outcoming capacity grabbing from main dispatch loop
    fair_queue: Move available tokens update into if branch
    io_queue: Rename make_fair_group_config into configure_throttler
    io_queue: Rename get_fair_group into get_throttler
    fair_queue: Rename fair_group -> io_throttler
  > http::reply: Add 308 (permanent redirect) and make pretty-print handle unknown values
  > Merge 'Relax reactor coupling with file_data_source_impl' from Pavel Emelyanov
    reactor: Relax friendship with file_data_source_impl
    fstream: Use direct io_stats reference
  > thread_pool: Relax coupling with reactor
  > reactor: Mark some IO classes management methods private
  > http: Deprecate json_exception
  > io_tester: Collect and report disk queue length samples
  > test/perf: Add context-switch measurer
  > http/client: Zero-copy forward content-length body into the underlying stream
  > json2code: Genrate move constructor and move-assignment operator
  > Merge 'Semi-mixed mode for output_stream' from Pavel Emelyanov
    output_stream: Support semi-mixed mode writing
    output_stream: Complete write(temporary_buffer) piggy-back-ing write(packet)
    iostream: Add friends for iostream tests
    packet: Mark bool cast operator const
    iostream: Document output_stream::write() methods
  > io_tester: Show metrics about requests split
  > reactor: add counter for internal errors
  > iotune: Print correct throughput units
  > core: add label to io_threaded_fallbacks to categorize operations
  > slab: correct allocation logic and enforce memory limits
  > Merge 'Fix for non-json http function_handlers' from Travis Downs
    httpd_test: add test for non-JSON function handler
    function_handlers: avoid implicit conversions
    http: do not always treat plain text reply as json
  > Merge 'tls: add ALPN support' from Łukasz Kurowski
    tls: add server-side ALPN support
    tls: add client-side ALPN support
  > Merge 'coroutine: experimental: generator: implement move and swap' from Benny Halevy
    coroutine: experimental: generator: implement move and swap
    coroutine: experimental: generator: unconstify buffer capacity
  > future: downgrade asserts
  > output_stream: Remove unused bits
  > Merge 'Upstream a couple of minor reactor optimizations' from Travis Downs
    Match type for pure_check_for_work
    Do not use std::function for check_for_work()
  > Handle ENOENT in getgrnam

Includes scylla-gdb.py update by Pavel Emelyanov.

Closes scylladb/scylladb#25094
2025-07-22 18:19:58 +02:00
Sergey Zolotukhin
e0dc73f52a storage_service: Cancel all write requests on storage_proxy shutdown
During a graceful node shutdown, RPC listeners are stopped in `storage_service::drain_on_shutdown`
as one of the first steps. However, even after RPCs are shut down, some write handlers in
`storage_proxy` may still be waiting for background writes to complete. These handlers retain the ERM.
Since the RPC subsystem is no longer active, replies cannot be received, and if any RPC commands are
concurrently executing `barrier_and_drain`, they may get stuck waiting for those writes. This can block
the messaging server shutdown and delay the entire shutdown process until the write timeout occurs.

This change introduces the cancellation of all outstanding write handlers in `storage_proxy`
during shutdown to prevent unnecessary delays.

Fixes scylladb/scylladb#23665
2025-07-22 15:03:30 +02:00
Sergey Zolotukhin
bc934827bc test: Add test for unfinished writes during shutdown and topology change
This test reproduces an issue where a topology change and an ongoing write query
during query coordinator shutdown can cause the node to get stuck.

When a node receives a write request, it creates a write handler that holds
a copy of the current table's ERM (Effective Replication Map). The ERM ensures
that no topology or schema changes occur while the request is being processed.

After the query coordinator receives the required number of replica write ACKs
to satisfy the consistency level (CL), it sends a reply to the client. However,
the write response handler remains alive until all replicas respond — the remaining
writes are handled in the background.

During shutdown, when all network connections are closed, these responses can no longer
be received. As a result, the write response handler is only destroyed once the write
timeout is reached.

This becomes problematic because the ERM held by the handler blocks topology or schema
change commands from executing. Since shutdown waits for these commands to complete,
this can lead to unnecessary delays in node shutdown and restarts, and occasional
test case failures.

Test for: scylladb/scylladb#23665
2025-07-22 15:03:13 +02:00
Ran Regev
3d82b9485e docs: update nodetool restore documentation for --sstables-file-list
Fixes: #25128
A leftover from #25077

Closes scylladb/scylladb#25129
2025-07-22 14:43:35 +02:00
Yaron Kaikov
4445c11c69 ./github/workflows/conflict_reminder: improve workflow with weekly notifications
- Change schedule from twice weekly (Mon/Thu) to once weekly (Mon only)
- Extend notification cooldown period from 3 days to 1 week
- Prevent notification spam while maintaining immediate conflict detection on pushes

Fixes: https://github.com/scylladb/scylladb/issues/25130

Closes scylladb/scylladb#25131
2025-07-22 15:21:12 +03:00
Avi Kivity
e4c4141d97 test.py: don't crash on early cleanup of ScyllaServer
If a test fails very early (still have to find why), test.py
crashes while flushing a non-existent log_file, as shown below.

To fix, initialize the property to None and check it during
cleanup.

```
================================================================================
[N/TOTAL]   SUITE    MODE   RESULT   TEST
------------------------------------------------------------------------------

'ScyllaServer' object has no attribute 'log_file'
test_cluster_features Traceback (most recent call last):
  File "/home/avi/scylla-maint/./test.py", line 816, in <module>
    sys.exit(asyncio.run(main()))
             ~~~~~~~~~~~^^^^^^^^
  File "/usr/lib64/python3.13/asyncio/runners.py", line 195, in run
    return runner.run(main)
           ~~~~~~~~~~^^^^^^
  File "/usr/lib64/python3.13/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
  File "/usr/lib64/python3.13/asyncio/base_events.py", line 725, in run_until_complete
    return future.result()
           ~~~~~~~~~~~~~^^
  File "/home/avi/scylla-maint/./test.py", line 523, in main
    total_tests_pytest, failed_pytest_tests = await run_all_tests(signaled, options)
                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/avi/scylla-maint/./test.py", line 452, in run_all_tests
    failed += await reap(done, pending, signaled)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/avi/scylla-maint/./test.py", line 418, in reap
    result = coro.result()
  File "/home/avi/scylla-maint/test/pylib/suite/python.py", line 143, in run
    return await super().run(test, options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/avi/scylla-maint/test/pylib/suite/base.py", line 216, in run
    await test.run(options)
  File "/home/avi/scylla-maint/test/pylib/suite/topology.py", line 48, in run
    async with get_cluster_manager(self.uname, self.suite.clusters, str(self.suite.log_dir)) as manager:
               ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.13/contextlib.py", line 221, in __aexit__
    await anext(self.gen)
  File "/home/avi/scylla-maint/test/pylib/scylla_cluster.py", line 2006, in get_cluster_manager
    await manager.stop()
  File "/home/avi/scylla-maint/test/pylib/scylla_cluster.py", line 1539, in stop
    await self.clusters.put(self.cluster, is_dirty=True)
  File "/home/avi/scylla-maint/test/pylib/pool.py", line 104, in put
    await self.destroy(obj)
  File "/home/avi/scylla-maint/test/pylib/suite/python.py", line 65, in recycle_cluster
    srv.log_file.close()
    ^^^^^^^^^^^^
AttributeError: 'ScyllaServer' object has no attribute 'log_file'
```
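
A minimal sketch of the fix described above, using a hypothetical `Server` class as a stand-in (it is not the real `ScyllaServer`, and the method names are assumptions):

```python
import io


class Server:
    """Hypothetical stand-in for a test server object."""

    def __init__(self):
        # Defined up front so cleanup works even if start() never ran.
        self.log_file = None

    def start(self, log_file):
        self.log_file = log_file

    def cleanup(self):
        # Guard against a test that failed before the log was opened.
        if self.log_file is not None:
            self.log_file.close()
            self.log_file = None


early_failure = Server()
early_failure.cleanup()  # no AttributeError: the attribute exists and is None

normal = Server()
normal.start(io.StringIO())
normal.cleanup()  # closes the log as before
```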

Closes scylladb/scylladb#24885
2025-07-22 12:39:01 +02:00
Avi Kivity
2db2b42556 sstables: version: drop custom operator<=>
The default comparison for enums is equivalent and
sufficient.

Closes scylladb/scylladb#24888
2025-07-22 12:39:01 +02:00
Avi Kivity
e89f6c5586 config, main: make cpu scheduling mandatory
CPU scheduling has been with us since 641aaba12c
(2017), and no one ever disables it. Likely nothing really works without
it.

Make it mandatory and mark the option unused.

Closes scylladb/scylladb#24894
2025-07-22 12:39:01 +02:00
Avi Kivity
ee138217ba alternator: simplify std::views::transform calls that extract a member from a class
Rather than calling std::views::transform with a lambda that extracts
a member from a class, call std::views::transform with a pointer-to-member
to do the same thing. This results in more concise code.

Closes scylladb/scylladb#25012
2025-07-22 12:39:01 +02:00
Jakub Smolar
6e0a063ce3 gdb: handle zero-size reads in managed_bytes
Fixes: https://github.com/scylladb/scylladb/issues/25048

Closes scylladb/scylladb#25050
2025-07-22 12:39:01 +02:00
Nadav Har'El
298a0ec4de test/cqlpy: in README.md, remind users of run-cassandra to set NODETOOL
test/cqlpy/README.md explains how to run the cqlpy tests against
Cassandra, and mentions that if you don't have "nodetool" in your path
you need to set the NODETOOL variable. However, when giving a simple
example how to use the run-cassandra script, we forgot to remind the
user to set NODETOOL in addition to CASSANDRA, causing confusion for
users who didn't know why tests were failing.

So this patch fixes the section in test/cqlpy/README.md with the
run-cassandra example to also set the NODETOOL environment variable,
not just CASSANDRA.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#25051
2025-07-22 12:39:00 +02:00
Aleksandra Martyniuk
b5026edf49 tasks: change _finished_children type
Parent task keeps a vector of statuses (task_essentials) of its finished
children. When the children number is large - for example because we
have many tables and a child task is created for each table - we may hit
oversize allocation while adding a new child essentials to the vector.

Keep task_essentials of children in chunked_vector.

Fixes: #25040.

Closes scylladb/scylladb#25064
2025-07-22 12:39:00 +02:00
Pavel Emelyanov
d94be313c1 Merge 'test: audit: ignore cassandra user audit logs in AUTH tests' from Andrzej Jackowski
Audit tests are vulnerable to noise from LOGIN queries (because AUTH
audit logs can appear at any time). Most tests already use the
`filter_out_noise` mechanism to remove this noise, but tests
focused on AUTH verification did not, leading to sporadic failures.

This change adds a filter to ignore AUTH logs generated by the default
"cassandra" user, so tests only verify logs from the user created
specifically for each test.

Additionally, this PR:
 - Adds missing `nonlocal new_rows` statement that prevented some checks from being called
 - Adds a testcase for audit logs of `cassandra` user

Fixes: https://github.com/scylladb/scylladb/issues/25069

Better backport those test changes to 2025.3. 2025.2 and earlier don't have `./cluster/dtest/audit_test.py`.

Closes scylladb/scylladb#25111

* github.com:scylladb/scylladb:
  test: audit: add cassandra user test case
  test: audit: ignore cassandra user audit logs in AUTH tests
  test: audit: change names of `filter_out_noise` parameters
  test: audit: add missing `nonlocal new_rows` statement
2025-07-22 10:42:16 +03:00
Pavel Emelyanov
295165d8ea Merge 's3_client: Enhance s3_client error handling' from Ernest Zaslavsky
Enhance and fix error handling in the `chunked_download_source` to prevent errors from leaking out of the request callback. Also stop retrying on Seastar's side, since that would break the integrity of the data, which may be downloaded more than once for the same range.

Fixes: https://github.com/scylladb/scylladb/issues/25043

Should be backported to 2025.3 since we have an intention to release native backup/restore feature

Closes scylladb/scylladb#24883

* github.com:scylladb/scylladb:
  s3_client: Disable Seastar-level retries in HTTP client creation
  s3_test: Validate handling of non-`aws_error` exceptions
  s3_client: Improve error handling in chunked_download_source
  aws_error: Add factory method for `aws_error` from exception
2025-07-22 10:40:39 +03:00
Ran Regev
dd67d22825 nodetool restore: sstable list from a file
Fixes: #25045

Added the ability to supply the list of files to
restore from a given file.
Mainly required for local testing.

Signed-off-by: Ran Regev <ran.regev@scylladb.com>

Closes scylladb/scylladb#25077
2025-07-22 09:11:02 +03:00
Ernest Zaslavsky
fc2c9dd290 s3_client: Disable Seastar-level retries in HTTP client creation
Prevent Seastar from retrying HTTP requests to avoid buffer double-feed
issues when an entire request is retried. This could cause data
corruption in `chunked_download_source`. The change is global for every
instance of `s3_client`, but it is still safe because:
* Seastar's `http_client` resets connections regardless of retry behavior
* `s3_client` retry logic handles all error types—exceptions, HTTP errors,
  and AWS-specific errors—via `http_retryable_client`
2025-07-21 17:03:23 +03:00
Ernest Zaslavsky
ba910b29ce s3_test: Validate handling of non-aws_error exceptions
Inject exceptions not wrapped in `aws_error` from request callback
lambda to verify they are properly caught and handled.
2025-07-21 16:52:43 +03:00
Ernest Zaslavsky
b7ae6507cd s3_client: Improve error handling in chunked_download_source
Create aws_error from raised exceptions when possible and respond
appropriately. Previously, non-aws_exception types leaked from the
request handler and were treated as non-retryable, causing potential
data corruption during download.
2025-07-21 16:49:47 +03:00
Ernest Zaslavsky
d53095d72f aws_error: Add factory method for aws_error from exception
Move `aws_error` creation logic out of `retryable_http_client` and
into the `aws_error` class to support reuse across components.
2025-07-21 16:42:44 +03:00
Andrzej Jackowski
21aedeeafb test: audit: add cassandra user test case
Audit tests use the `filter_out_noise` function to remove noise from
audit logs generated by user authentication. As a result, none of the
existing tests covered audit logs for the default `cassandra` user.
This change adds a test case for that user.

Refs: scylladb/scylladb#25069
2025-07-21 14:54:20 +02:00
Andrzej Jackowski
aef6474537 test: audit: ignore cassandra user audit logs in AUTH tests
Audit tests are vulnerable to noise from LOGIN queries (because AUTH
audit logs can appear at any time). Most tests already use the
`filter_out_noise` mechanism to remove this noise, but tests
focused on AUTH verification did not, leading to sporadic failures.

This change adds a filter to ignore AUTH logs generated by the default
"cassandra" user, so tests only verify logs from the user created
specifically for each test.
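
The kind of filter described here might look like the sketch below. The `AuditRow` shape and the function name are assumptions for illustration and do not reflect the actual signature of `filter_out_noise`.

```python
from typing import NamedTuple


class AuditRow(NamedTuple):
    """Hypothetical audit log row with only the fields this sketch needs."""
    category: str
    username: str
    operation: str


def drop_default_user_auth(rows, default_user="cassandra"):
    """Remove AUTH rows produced by the default user; keep everything else."""
    return [
        row for row in rows
        if not (row.category == "AUTH" and row.username == default_user)
    ]


rows = [
    AuditRow("AUTH", "cassandra", "LOGIN"),
    AuditRow("AUTH", "test_user", "LOGIN"),
    AuditRow("QUERY", "cassandra", "SELECT * FROM ks.t"),
]
filtered = drop_default_user_auth(rows)
```

Only the first row is dropped: AUTH rows for the per-test user and non-AUTH rows for the default user are still verified.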

Fixes: scylladb/scylladb#25069
2025-07-21 14:54:20 +02:00
Andrzej Jackowski
daf1c58e21 test: audit: change names of filter_out_noise parameters
This is a refactoring commit that changes the names of the parameters
of the `filter_out_noise` function, as well as names of related
variables. The motivation for the change is the introduction of more
complex filtering logic in the next commit of this patch series.

Refs: scylladb/scylladb#25069
2025-07-21 14:54:01 +02:00
Andrzej Jackowski
e634a2cb4f test: audit: add missing nonlocal new_rows statement
The variable `new_rows` was not updated by the inner function
`is_number_of_new_rows_correct` because the `nonlocal new_rows`
statement was missing. As a result, `sorted_new_rows` was empty and
certain checks were skipped.

This change:
 - Introduces the missing `nonlocal new_rows` declaration
 - Adds an assertion verifying that the number of new rows matches
   the expected count
 - Fixes the incorrect variable name in the lambda used for row sorting
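
The underlying Python behaviour can be reproduced in isolation. The helper names below are hypothetical; only `new_rows` and `is_number_of_new_rows_correct` come from the commit message.

```python
def without_nonlocal(fetch_rows):
    new_rows = []

    def is_number_of_new_rows_correct():
        # Assignment creates a new local variable; the outer list is untouched.
        new_rows = fetch_rows()
        return len(new_rows) > 0

    is_number_of_new_rows_correct()
    return sorted(new_rows)  # always empty, so later checks see nothing


def with_nonlocal(fetch_rows):
    new_rows = []

    def is_number_of_new_rows_correct():
        nonlocal new_rows  # rebind the variable of the enclosing function
        new_rows = fetch_rows()
        return len(new_rows) > 0

    is_number_of_new_rows_correct()
    return sorted(new_rows)
```

With `fetch_rows = lambda: [2, 1]`, the first variant returns an empty list while the second returns the sorted rows, which is why the skipped checks went unnoticed until an assertion on the row count was added.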
2025-07-21 14:53:48 +02:00