Commit Graph

48138 Commits

Author SHA1 Message Date
Dario Mirovic
4a6f71df68 test/cqlpy: add cpp exception metric test conditions
Tested code paths should not throw exceptions. `scylla_reactor_cpp_exceptions`
metric is used. This is a global metric. To address potential test flakiness,
each test runs multiple times:
- `run_count = 100`
- `cpp_exception_threshold = 10`

If a change in the code introduced an exception, expectation is that the number
of registered exceptions will be > `cpp_exception_threshold` in `run_count` runs.
In which case the test fails.
2025-07-17 17:02:48 +02:00
Dario Mirovic
5390f92afc transport/server: replace protocol_exception throws with returns
Replace throwing protocol_exception with returning it as a result
or an exceptional future in the transport server module. This
improves performance, for example during connection storms and
server restarts, where protocol exceptions are more frequent.

In functions already returning a future, protocol exceptions are
propagated using an exceptional future. In functions not already
returning a future, result_with_exception is used.

Notable change is checking v.failed() before calling v.get() in
process_request function, to avoid throwing in case of an
exceptional future.

Refs: #24567
2025-07-17 16:54:05 +02:00
Dario Mirovic
9f4344a435 utils/reusable_buffer: accept non-throwing writer callbacks via result_with_exception
Make make_bytes_ostream and make_fragmented_temporary_buffer accept
writer callbacks that return utils::result_with_exception instead of
forcing them to throw on error. This lets callers propagate failures
by returning an error result rather than throwing an exception.

Introduce buffer_writer_for, bytes_ostream_writer, and fragmented_buffer_writer
concepts to simplify and document the template requirements on writer callbacks.

This patch does not modify the actual callbacks passed, except for the syntax
changes needed for successful compilation, without changing the logic.

Refs: #24567
2025-07-17 16:40:02 +02:00
Dario Mirovic
30d424e0d3 transport/server: avoid exception-throw overhead in handle_error
Previously, connection::handle_error always called f.get() inside a try/catch,
forcing every failed future to throw and immediately catch an exception just to
classify it. This change eliminates that extra throw/catch cycle by first checking
f.failed(), getting the stored std::exception_ptr via f.get_exception(), and
then dispatching on its type via utils::try_catch<T>(eptr).

The error-response logic is not changed - cassandra_exception, std::exception,
and unknown exceptions are caught and processed, and any exceptions thrown by
write_response while handling those exceptions continues to escape handle_error.

Refs: #24567
2025-07-17 16:40:02 +02:00
Dario Mirovic
7aaeed012e test/cqlpy: add protocol_exception tests
Add a helper to fetch scylla_transport_cql_errors_total{type="protocol_error"} counter
from Scylla's metrics endpoint. These metrics are used to track protocol error
count before and after each test.

Add cql_with_protocol context manager utility for session creation with parameterized
protocol_version value. This is used for testing connection establishment with
different protocol versions, and proper disposal of successfully established sessions.

The tests cover two failure scenarios:
- Protocol version mismatch in test_protocol_version_mismatch which tests both supported
and unsupported protocol version
- Malformed frames via raw socket in _protocol_error_impl, used by several test functions,
and also test_no_protocol_exceptions test to assert that the error counters never decrease
during test execution, catching unintended metric resets

Refs: #24567
2025-07-17 16:39:54 +02:00
Nadav Har'El
85c19d21bb Merge 'cql, schema: Extend keyspace, table, views, indexes name length limit from 48 to 192 bytes' from Karol Nowacki
cql, schema: Extend name length limit from 48 to 192 bytes

    This commit increases the maximum length of names for keyspaces, tables, materialized views, and indexes from 48 to 192 bytes.
    The previous 48-bytes limit was inherited from Cassandra 3 for compatibility. However, this validation was removed in Cassandra 4 and 5 (see CASSANDRA-20389)
    and some usage scenarios (such as some feature store workflows generating long table names) now depend on this relaxed constraint.
    This change brings ScyllaDB's behavior in line with modern Cassandra versions and better supports these use cases.

    The new limit of 192 bytes is derived from underlying filesystem limitations to prevent runtime errors when creating directories for table data.
    When a new table is created, ScyllaDB generates a directory for its SSTables. The directory name is constructed from the table name, a dash, and a 32-character UUID.
    For a CDC-enabled table, an associated log table is also created, which has the suffix `_scylla_cdc_log` appended to its name.
    The directory name for this log table becomes the longest possible representation.
    Additionally we reserve 15 bytes for future use, allowing for potential future extensions without breaking existing schemas.
    To guarantee that directory creation never fails due to exceeding filesystem name limits, the maximum name length is calculated as follows:
      255 bytes (common filesystem limit for a path component)
    -  32 bytes (for the 32-character UUID string)
    -   1 byte  (for the '-' separator)
    -  15 bytes (for the '_scylla_cdc_log' suffix)
    -  15 bytes (reserved for future use)
    ----------
    = 192 bytes (Maximum allowed name length)
    This calculation is similar in principle to the one proposed for Cassandra to fix related directory creation failures (see apache/cassandra/pull/4038).

    This patch also updates/adds all associated tests to validate the new 192-byte limit.
    The documentation has been updated accordingly.

Fixes #4480

Backport 2025.2: The significantly shorter maximum table name length in Scylla compared to Cassandra is becoming a more common issue for users in the latest release.

Closes scylladb/scylladb#24500

* github.com:scylladb/scylladb:
  cql, schema: Extend name length limit from 48 to 192 bytes
  replica: Remove unused keyspace::init_storage()
2025-06-22 17:41:10 +03:00
Avi Kivity
770b91447b Merge 'memtable: ensure _flushed_memory doesn't grow above total_memory' from Michał Chojnowski
`dirty_memory_manager` tracks two quantities about memtable memory usage:
"real" and "unspooled" memory usage.

"real" is the total memory usage (sum of `occupancy().total_space()`)
by all memtable LSA regions, plus a upper-bound estimate of the size of
memtable data which has already moved to the cache region but isn't
evictable (merged into the cache) yet.

"unspooled" is the difference between total memory usage by all memtable
LSA regions, and the total flushed memory (sum of `_flushed_memory`)
of memtables.

`dirty_memory_manager` controls the shares of compaction and/or blocks
writes when these quantities cross various thresholds.

"Total flushed memory" isn't a well defined notion,
since the actual consumption of memory by the same data can vary over
time due to LSA compactions, and even the data present in memtable can
change over the course of the flush due to removals of outdated MVCC versions.
So `_flushed_memory` is merely an approximation computed by `flush_reader`
based on the data passing through it.

This approximation is supposed to be a conservative lower bound.
In particular, `_flushed_memory` should be not greater than
`occupancy().total_space()`. Otherwise, for example, "unspooled" memory
could become negative (and/or wrap around) and weird things could happen.
There is an assertion in `~flush_memory_accounter` which checks that
`_flushed_memory < occupancy().total_space()` at the end of flush.

But it can fail. Without additional treatment, the memtable reader sometimes emits
data which is already deleted. (In particular, it emites rows covered by
a partition tombstone in a newer MVCC version.)
This data is seen by `flush_reader` and accounted in `_flushed_memory`.
But this data can be garbage-collected by the `mutation_cleaner` later during the
flush and decrease `total_memory` below `_flushed_memory`.

There is a piece of code in `mutation_cleaner` intended to prevent that.
If `total_memory` decreases during a `mutation_cleaner` run,
`_flushed_memory` is lowered by the same amount, just to preserve the
asserted property. (This could also make `_flushed_memory` quite inaccurate,
but that's considered acceptable).

But that only works if `total_memory` is decreased during that run. It doesn't
work if the `total_memory` decrease (enabled by the new allocator holes made
by `mutation_cleaner`'s garbage collection work) happens asynchronously
(due to memory reclaim for whatever reason) after the run.

This patch fixes that by tracking the decreases of `total_memory` closer to the
source. Instead of relying on `mutation_cleaner` to notify the memtable if it
lowers `total_memory`, the memtable itself listens for notifications about
LSA segment deallocations. It keeps `_flushed_memory` equal to the reader's
estimate of flushed memory decreased by the change in `total_memory` since the
beginning of flush (if it was positive), and it keeps the amount of "spooled"
memory reported to the `dirty_memory_manager` at `max(0, _flushed_memory)`.

Fixes scylladb/scylladb#21413

Backport candidate because it fixes a crash that can happen in existing stable branches.

Closes scylladb/scylladb#21638

* github.com:scylladb/scylladb:
  memtable: ensure _flushed_memory doesn't grow above total memory usage
  replica/memtable: move region_listener handlers from dirty_memory_manager to memtable
2025-06-22 11:19:25 +03:00
Michał Chojnowski
975e7e405a memtable: ensure _flushed_memory doesn't grow above total memory usage
dirty_memory_manager tracks two quantities about memtable memory usage:
"real" and "unspooled" memory usage.

"real" is the total memory usage (sum of `occupancy().total_space()`)
by all memtable LSA regions, plus a upper-bound estimate of the size of
memtable data which has already moved to the cache region but isn't
evictable (merged into the cache) yet.

"unspooled" is the difference between total memory usage by all memtable
LSA regions, and the total flushed memory (sum of `_flushed_memory`)
of memtables.

dirty_memory_manager controls the shares of compaction and/or blocks
writes when these quantities cross various thresholds.

"Total flushed memory" isn't a well defined notion,
since the actual consumption of memory by the same data can vary over
time due to LSA compactions, and even the data present in memtable can
change over the course of the flush due to removals of outdated MVCC versions.
So `_flushed_memory` is merely an approximation computed by `flush_reader`
based on the data passing through it.

This approximation is supposed to be a conservative lower bound.
In particular, `_flushed_memory` should be not greater than
`occupancy().total_space()`. Otherwise, for example, "unspooled" memory
could become negative (and/or wrap around) and weird things could happen.
There is an assertion in ~flush_memory_accounter which checks that
`_flushed_memory < occupancy().total_space()` at the end of flush.

But it can fail. Without additional treatment, the memtable reader sometimes emits
data which is already deleted. (In particular, it emites rows covered by
a partition tombstone in a newer MVCC version.)
This data is seen `flush_reader` and accounted in `_flushed_memory`.
But this data can be garbage-collected by the mutation_cleaner later during the
flush and decrease `total_memory` below `_flushed_memory`.

There is a piece of code in mutation_cleaner intended to prevent that.
If `total_memory` decreases during a `mutation_cleaner` run,
`_flushed_memory` is lowered by the same amount, just to preserve the
asserted property. (This could also make `_flushed_memory` quite inaccurate,
but that's considered acceptable).

But that only works if `total_memory` is decreased during that run. It doesn't
work if the `total_memory` decrease (enabled by the new allocator holes made
by `mutation_cleaner`'s garbage collection work) happens asynchronously
(due to memory reclaim for whatever reason) after the run.

This patch fixes that by tracking the decreases of `total_memory` closer to the
source. Instead of relying on `mutation_cleaner` to notify the memtable if it
lowers `total_memory`, the memtable itself listens for notifications about
LSA segment deallocations. It keeps `_flushed_memory` equal to the reader's
estimate of flushed memory decreased by the change in `total_memory` since the
beginning of flush (if it was positive), and it keeps the amount of "spooled"
memory reported to the `dirty_memory_manager` at `max(0, _flushed_memory)`.
2025-06-20 11:42:30 +02:00
Michał Chojnowski
7d551f99be replica/memtable: move region_listener handlers from dirty_memory_manager to memtable
The memtable wants to listen for changes in its `total_memory` in order
to decrease its `_flushed_memory` in case some of the freed memory has already
been accounted as flushed. (This can happen because the flush reader sees
and accounts even outdated MVCC versions, which can be deleted and freed
during the flush).

Today, the memtable doesn't listen to those changes directly. Instead,
some calls which can affect `total_memory` (in particular, the mutation cleaner)
manually check the value of `total_memory` before and after they run, and they
pass the difference to the memtable.

But that's not good enough, because `total_memory` can also change outside
of those manually-checked calls -- for example, during LSA compaction, which
can occur anytime. This makes memtable's accounting inaccurate and can lead
to unexpected states.

But we already have an interface for listening to `total_memory` changes
actively, and `dirty_memory_manager`, which also needs to know it,
does just that. So what happens e.g. when `mutation_cleaner` runs
is that `mutation_cleaner` checks the value of `total_memory` before it runs,
then it runs, causing several changes to `total_memory` which are picked up
by `dirty_memory_manager`, then `mutation_cleaner` checks the end value of
`total_memory` and passes the difference to `memtable`, which corrects
whatever was observed by `dirty_memory_manager`.

To allow memtable to modify its `_flushed_memory` correctly, we need
to make `memtable` itself a `region_listener`. Also, instead of
the situation where `dirty_memory_manager` receives `total_memory`
change notifications from `logalloc` directly, and `memtable` fixes
the manager's state later, we want to only the memtable listen
for the notifications, and pass them already modified accordingl
to the manager, so there is no intermediate wrong states.

This patch moves the `region_listener` callbacks from the
`dirty_memory_manager` to the `memtable`. It's not intended to be
a functional change, just a source code refactoring.
The next patch will be a functional change enabled by this.
2025-06-20 11:42:30 +02:00
Łukasz Paszkowski
a9a53d9178 compaction_manager: cancel submission timer on drain
The `drain` method, cancels all running compactions and moves the
compaction manager into the disabled state. To move it back to
the enabled state, the `enable` method shall be called.

This, however, throws an assertion error as the submission time is
not cancelled and re-enabling the manager tries to arm the armed timer.

Thus, cancel the timer, when calling the drain method to disable
the compaction manager.

Fixes https://github.com/scylladb/scylladb/issues/24504

All versions are affected. So it's a good candidate for a backport.

Closes scylladb/scylladb#24505
2025-06-20 11:33:49 +03:00
Nadav Har'El
70f5a6a4d6 test/cqlpy: fix run-cassandra script to ignore CASSANDRA_HOME
As test/cqlpy/README.md explains, the way to tell the run-cassandra
script which version of Cassandra should be run is through the
"CASSANDRA" variable, for example:

    CASSANDRA=$HOME/apache-cassandra-4.1.6/bin/cassandra \
    test/cqlpy/run-cassandra test_file.py::test_function

But all the Cassandra scripts, of all versions, have one strange
feature: If you set CASSANDRA_HOME, then instead of running the
actual Cassandra script you tried to run (in this case, 4.1.6), the
Cassandra script goes to run the other Cassandra from CASSANDRA_HOME!
This means that if a user happens to have, for some reason, set
CASSANDRA_HOME, then the documented "CASSANDRA" variable doesn't work.

The simple fix is to clear CASSANDRA_HOME in the environment that
run-cassandra passes to Cassandra.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#24546
2025-06-20 11:31:02 +03:00
Anna Stuchlik
17eabbe712 doc: improve the tablets limitations section
This PR improves the Limitations and Unsupported Features section
for tablets, as it has been confusing to the customers.

Refs https://github.com/scylladb/scylla-enterprise/issues/5465

Fixes https://github.com/scylladb/scylladb/issues/24562

Closes scylladb/scylladb#24563
2025-06-20 11:28:38 +03:00
Gleb Natapov
e364995e28 api: return error from get_host_id_map if gossiper is not enabled yet.
Token metadata api is initialized before gossiper is started.
get_host_id_map REST endpoint cannot function without the fully
initialized gossiper though. The gossiper is started deep in
the join_cluster call chain, but if we move token_metadata api
initialization after the call it means that no api will be available
during bootstrap. This is not what we want.

Make a simple fix by returning an error from the api if the gossiper is
not initialized yet.

Fixes: #24479

Closes scylladb/scylladb#24575
2025-06-20 11:27:28 +03:00
Andrei Chekun
392a7fc171 test.py: Fix the boost output file name
File name for the boost test do not use run_id, so each consequent run will
overwrite the logs from the previous one. If the first repeat fails, and the
second will pass, it overwrites the failed log. This PR allows saving the
failed one.

Closes scylladb/scylladb#24580
2025-06-20 11:26:16 +03:00
Asias He
c5a136c3b5 storage_service: Use utils::chunked_vector to avoid big allocation
The following was seen:

```
!WARNING | scylla[6057]:  [shard 12:strm] seastar_memory - oversized allocation: 212992 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at
[Backtrace #0]
void seastar::backtrace<seastar::current_backtrace_tasklocal()::$_0>(seastar::current_backtrace_tasklocal()::$_0&&, bool) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:89
 (inlined by) seastar::current_backtrace_tasklocal() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:99
seastar::current_tasktrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:136
seastar::current_backtrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:169
seastar::memory::cpu_pages::warn_large_allocation(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:848
seastar::memory::allocate_slowpath(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:911
operator new(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:1706
std::allocator<dht::token_range_endpoints>::allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/allocator.h:196
 (inlined by) std::allocator_traits<std::allocator<dht::token_range_endpoints> >::allocate(std::allocator<dht::token_range_endpoints>&, unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/alloc_traits.h:515
 (inlined by) std::_Vector_base<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> >::_M_allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/stl_vector.h:380
 (inlined by) void std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> >::_M_realloc_append<dht::token_range_endpoints const&>(dht::token_range_endpoints const&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/vector.tcc:596
locator::describe_ring(replica::database const&, gms::gossiper const&, seastar::basic_sstring<char, unsigned int, 15u, true> const&, bool) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/stl_vector.h:1294
std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> > >::promise_type>::resume() const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/coroutine:242
 (inlined by) seastar::internal::coroutine_traits_base<std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> > >::promise_type::run_and_dispose() at ././seastar/include/seastar/core/coroutine.hh:80
seastar::reactor::do_run() at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:2635
std::_Function_handler<void (), seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0>::_M_invoke(std::_Any_data const&) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:4684
```

Fix by using chunked_vector.

Fixes #24158

Closes scylladb/scylladb#24561
2025-06-19 16:51:01 +03:00
Andrei Chekun
fcc2ad8ff5 test.py: Fix test result are overwritten
Currently, CI uses several nodes to execute the different modes to
reduce overall time for execution. During copying the results from nodes
to the main job test reports will be overwritten, since they are using
the same directory and the same name. This patch allows to
distinguishing these results and not overwrite them.

Closes scylladb/scylladb#24559
2025-06-19 16:51:01 +03:00
Pavel Emelyanov
dc166be663 s3: Mark claimed_buffer constructor noexcept
It just std::move-s a buffer and a semaphore_units objects, both moves
are noexcept, so is the constructor itself.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#24552
2025-06-18 20:36:45 +03:00
Avi Kivity
c89ab90554 Merge 'main: don't start maintenance auth service if not enabled' from Marcin Maliszkiewicz
In f96d30c2b5
we introduced the maintenance service, which is an additional
instance of auth::service. But this service has a somewhat
confusing 2-level startup mechanism: it's initialized with
sharded<Service>::start and then auth::service::start
(different method with the same name to confuse even more).

When maintenance_socket was disabled (default setting), the code
did only the first part of the startup. This registered a config
observer but didn't create a permission_cache instance.
As a result, a crash on SIGHUP when config is reloaded can occur.

Fixes: https://github.com/scylladb/scylladb/issues/24528
Backport: all not eol versions since 6.0 and 2025.1

Closes scylladb/scylladb#24527

* github.com:scylladb/scylladb:
  test: add test for live updates of permissions cache config
  main: don't start maintenance auth service if not enabled
2025-06-18 20:28:53 +03:00
Karol Nowacki
4577c66a04 cql, schema: Extend name length limit from 48 to 192 bytes
This commit increases the maximum length of names for keyspaces, tables, materialized views, and indexes from 48 to 192 bytes.
The previous 48-bytes limit was inherited from Cassandra 3 for compatibility. However, this validation was removed in Cassandra 4 and 5 (see CASSANDRA-20389)
and some usage scenarios (such as some feature store workflows generating long table names) now depend on this relaxed constraint.
This change brings ScyllaDB's behavior in line with modern Cassandra versions and better supports these use cases.

The new limit of 192 bytes is derived from underlying filesystem limitations to prevent runtime errors when creating directories for table data.
When a new table is created, ScyllaDB generates a directory for its SSTables. The directory name is constructed from the table name, a dash, and a 32-character UUID.
For a CDC-enabled table, an associated log table is also created, which has the suffix `_scylla_cdc_log` appended to its name.
The directory name for this log table becomes the longest possible representation.
Additionally we reserve 15 bytes for future use, allowing for potential future extensions without breaking existing schemas.
To guarantee that directory creation never fails due to exceeding filesystem name limits, the maximum name length is calculated as follows:
  255 bytes (common filesystem limit for a path component)
-  32 bytes (for the 32-character UUID string)
-   1 byte  (for the '-' separator)
-  15 bytes (for the '_scylla_cdc_log' suffix)
-  15 bytes (reserved for future use)
----------
= 192 bytes (Maximum allowed name length)
This calculation is similar in principle to the one proposed for Cassandra to fix related directory creation failures (see apache/cassandra/pull/4038).

This patch also updates/adds all associated tests to validate the new 192-byte limit.
The documentation has been updated accordingly.
2025-06-18 14:08:38 +02:00
Karol Nowacki
a41c12cd85 replica: Remove unused keyspace::init_storage()
This function was declared but had no implementation or callers. It is being removed as minor code cleanup.
2025-06-18 14:08:38 +02:00
Marcin Maliszkiewicz
dd01852341 test: add test for live updates of permissions cache config 2025-06-18 11:27:08 +02:00
Marcin Maliszkiewicz
97c60b8153 main: don't start maintenance auth service if not enabled
In f96d30c2b5
we introduced the maintenance service, which is an additional
instance of auth::service. But this service has a somewhat
confusing 2-level startup mechanism: it's initialized with
sharded<Service>::start and then auth::service::start
(different method with the same name to confuse even more).

When maintenance_socket was disabled (default setting), the code
did only the first part of the startup. This registered a config
observer but didn't create a permission_cache instance.
As a result, a crash on SIGHUP when config is reloaded can occur.
2025-06-18 11:27:08 +02:00
Botond Dénes
da1a3dd640 Merge 'test: introduce upgrade tests to test.py, add a SSTable dict compression upgrade test' from Michał Chojnowski
This PR adds an upgrade test for SSTable compression with shared dictionaries, and adds some bits to pylib and test.py to support that.

In the series, we:
1. Mount `$XDG_CACHE_DIR` into dbuild.
2. Add a pylib function which downloads and installs a released ScyllaDB package into a subdirectory of `$XDG_CACHE_DIR/scylladb/test.py`, and returns the path to `bin/scylla`.
3. Add new methods and params to the cluster manager, which let the test start nodes with historical Scylla executables, and switch executables during the test.
4. Add a test which uses the above to run an upgrade test between the released package and the current build.
5. Add `--run-internet-dependent-tests` to `test.py` which lets the user of `test.py` skip this test (and potentially other internet-dependent tests in the future).

(The patch modifying `wait_for_cql_and_get_hosts` is a part of the new test — the new test needs it to test how particular nodes in a mixed-version cluster react to some CQL queries.)

This is a follow-up to #23025, split into a separate PR because the potential addition of upgrade tests to `test.py` deserved a separate thread.

Needs backport to 2025.2, because that's where the tested feature is introduced.

Fixes #24110

Closes scylladb/scylladb#23538

* github.com:scylladb/scylladb:
  test: add test_sstable_compression_dictionaries_upgrade.py
  test.py: add --run-internet-dependent-tests
  pylib/manager_client: add server_switch_executable
  test/pylib: in add_server, give a way to specify the executable and version-specific config
  pylib: pass scylla_env environment variables to the topology suite
  test/pylib: add get_scylla_2025_1_executable()
  pylib/scylla_cluster: give a way to pass executable-specific options to nodes
  dbuild: mount "$XDG_CACHE_HOME/scylladb"
2025-06-18 12:21:21 +03:00
Avi Kivity
2177ec8dc1 gdb: adjust unordered container accessors for libstdc++15
In libstdc++15, the internal structure of an unordered container
hashtable node changed from _M_storage._M_storage.__data to just
_M_storage._M_storage (though the layout is the same). Adjust
the code to work with both variants.

Closes scylladb/scylladb#24549
2025-06-18 09:15:03 +03:00
Michał Chojnowski
27f66fb110 test/boost/mutation_reader_test: fix a use-after-free in test_fast_forwarding_combined_reader_is_consistent_with_slicing
The contract in mutation_reader.hh says:

```
// pr needs to be valid until the reader is destroyed or fast_forward_to()
// is called again.
    future<> fast_forward_to(const dht::partition_range& pr) {
```

`test_fast_forwarding_combined_reader_is_consistent_with_slicing` violates
this by passing a temporary to `fast_forward_to`.

Fix that.

Fixes scylladb/scylladb#24542

Closes scylladb/scylladb#24543
2025-06-17 19:30:50 +03:00
Anna Stuchlik
648d8caf27 doc: add support for z3 GCP
This commit adds support for z3-highmem-highlssd instance types to
Cloud Instance Recommendations for GCP.

Fixes https://github.com/scylladb/scylladb/issues/24511

Closes scylladb/scylladb#24533
2025-06-17 13:50:46 +03:00
Robert Bindar
1dd37ba47a Add dev documentation for manipulating s3 data manually
This patch intends to give an overview of where, when and how we store
data in S3 and provide a quick set of commands
which help gain local access to the data in case there is a need for
manual intervention.

The patch also collects in the same place links/descriptions for all
formats we use in S3.

Fixes #22438

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>

Closes scylladb/scylladb#24323
2025-06-17 13:21:30 +03:00
Pavel Emelyanov
b0766d1e73 Merge 's3_client: Refactor range class for state validation' from Ernest Zaslavsky
Revamped the `range` class to actively manage its state by enforcing validation on all modifications. This prevents overflow, invalid states, and ensures the object size does not exceed the 5TiB limit in S3. This should address and prevent future problems related to this issue https://github.com/minio/minio/issues/21333

No backport needed since this problem related only to this change https://github.com/scylladb/scylladb/pull/23880

Closes scylladb/scylladb#24312

* github.com:scylladb/scylladb:
  s3_client: headers cleanup
  s3_client: Refactor `range` class for state validation
2025-06-17 10:34:55 +03:00
Ernest Zaslavsky
e398576795 s3_client: Fix hang in get() on EOF by signaling condition variable
* Ensure _get_cv.signal() is called when an empty buffer received
* Prevents `get()` from stalling indefinitely while waiting on EOF
* Found when testing https://github.com/scylladb/scylladb/pull/23695

Closes scylladb/scylladb#24490
2025-06-17 10:33:19 +03:00
Calle Wilund
4a98c258f6 http: Add missing thread_local specifier for static
Refs #24447

Patch adding this somehow managed to leave out the thread_local
specifier. While gnutls cert object can be shared across shards
just fine, the actual shared_ptr here cannot, thus we could
cause memory errors.

Closes scylladb/scylladb#24514
2025-06-17 10:23:52 +03:00
Avi Kivity
cd79a8fc25 Revert "Merge 'Atomic in-memory schema changes application' from Marcin Maliszkiewicz"
This reverts commit 0b516da95b, reversing
changes made to 30199552ac. It breaks
cluster.random_failures.test_random_failures.test_random_failures
in debug mode (at least).

Fixes #24513
2025-06-16 22:38:12 +03:00
Ernest Zaslavsky
1b20e0be4a s3_client: headers cleanup 2025-06-16 16:02:30 +03:00
Ernest Zaslavsky
9ad7a456fe s3_client: Refactor range class for state validation
Revamped the `range` class to actively manage its state by enforcing validation on all modifications. This prevents overflow, invalid states, and ensures the object size does not exceed the 5TiB limit in S3.
2025-06-16 16:02:24 +03:00
Pavel Emelyanov
5c2e5890a6 Merge 'test.py: Integrate pytest c++ test execution to test.py' from Andrei Chekun
With current changes, pytest executes boost tests. Gathering metrics added to the pytest BoostFacade and UnitFacade to have the possibility to get them for C++ test as previously.
Since boost, raft, unit, and ldap directories aren't executed by test.py, suite.yaml files are renamed to test_config.yaml to preserve the old way of test configuration and removing them from execution by test.py
Pytest executes all modes by itself, JUnit report for the C++ test will be one for the run. That means that there is no possibility to output them in testlog in different folders. So testlog/report directory is used to store all kinds of reports generated during tests. JUnit reports should be testlog/report/junit, Allure reports should be in testlog/report/allure.

**Breaking changes:**
1. Terminal output changed. test.py will run pytest for the next directories: `test/boost`, `test/ldap`, `test/raft`, `test/unit`. `test.py` will blindly translate the output of the pytest to the terminal. Then when all these tests are finished, `test.py` will continue to show previous output for the rest of the test.
2. The format of execution of C++ test directories mentioned above has been changed. Now it will be a simple path to the file with extension. For example, instead of `boost/aggregate_fcts_test` now you need to use `test/boost/aggregate_fcts_test.cc`
3. This PR creates a spike in test amount. The previous logic was to consolidate the boost results from different runs and different modes to one report. So for the three repeats and three modes (nine test results) in CI was shown one result. Now it shows nine results, with differentiating them by mode and run.

**Note:**
Pytest uses pytest-xdist module to run tests in parallel. The Frozen toolchain has this dependency installed, for the local use, please install it manually.

Changes for CI https://github.com/scylladb/scylla-pkg/pull/4949. It will be merged after the current PR will be in master. Short disruption is expected, while PR in scylla-pkg will not be merged.

Fixes: https://github.com/scylladb/qa-tasks/issues/1777

Closes scylladb/scylladb#22894

* github.com:scylladb/scylladb:
  test.py: clean code that isn't used anymore
  test.py: switch off C++ tests from test.py discovery
  test.py: Integrate pytest c++ test execution to test.py
2025-06-16 16:01:37 +03:00
Pavel Emelyanov
0b6532a895 api: Shorten get_simple_states() handler
The one collects map<ip, state> then converts it to a jsonable vector of
helper objects with key and value members. This patch removes the
intermediate map and creates the vector instantly. With that change the
handler makes less data manipulations and behaves like the
get_all_endpoint_states one.

Very similar change was done in 12420dc644 with get_host_to_id_map
handler.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#24456
2025-06-16 15:21:27 +03:00
Tomasz Grabiec
cdb1499898 Merge 'interval: reduce memory footprint' from Avi Kivity
The interval class's memory footprint isn't important for single objects,
but intervals are frequently held in moderately sized collections. In #3335 this
caused a stall. Therefore reducing interval's memory footprint and reduce
allocation pressure.

This series does this by consolidating badly-padded booleans in the object tree
spanned by interval into 5 booleans that are consecutive in memory. This
reduces the space required by these booleans from 40 bytes to 8 bytes.

perf-simple-query report (with refresh-pgo-profiles.sh for each measurement):

before:

252127.60 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   37128 insns/op,   18147 cycles/op,        0 errors)
INFO  2025-06-07 21:00:34,010 [shard 0:main] group0_tombstone_gc_handler - Setting reconcile time to   1749319231 (min id=4dbed2f4-43c9-11f0-cbc6-87d1a08b4ca4)
246492.37 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   37153 insns/op,   18411 cycles/op,        0 errors)
253633.11 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   37127 insns/op,   17941 cycles/op,        0 errors)
254029.93 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   37155 insns/op,   17951 cycles/op,        0 errors)
254465.76 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   37123 insns/op,   17906 cycles/op,        0 errors)
throughput:
	mean=   252149.75 standard-deviation=3282.75
	median= 253633.11 median-absolute-deviation=1880.17
	maximum=254465.76 minimum=246492.37
instructions_per_op:
	mean=   37137.24 standard-deviation=15.71
	median= 37127.54 median-absolute-deviation=14.45
	maximum=37155.24 minimum=37122.79
cpu_cycles_per_op:
	mean=   18071.19 standard-deviation=212.25
	median= 17950.62 median-absolute-deviation=130.10
	maximum=18411.50 minimum=17906.13

after:

252561.26 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   37039 insns/op,   18075 cycles/op,        0 errors)
256876.44 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   37022 insns/op,   17785 cycles/op,        0 errors)
257084.38 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   37030 insns/op,   17840 cycles/op,        0 errors)
257305.35 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   37042 insns/op,   17804 cycles/op,        0 errors)
258088.53 tps ( 66.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   37028 insns/op,   17778 cycles/op,        0 errors)
throughput:
	mean=   256383.19 standard-deviation=2185.22
	median= 257084.38 median-absolute-deviation=922.16
	maximum=258088.53 minimum=252561.26
instructions_per_op:
	mean=   37032.17 standard-deviation=8.06
	median= 37030.46 median-absolute-deviation=6.44
	maximum=37041.83 minimum=37021.93
cpu_cycles_per_op:
	mean=   17856.60 standard-deviation=124.70
	median= 17804.16 median-absolute-deviation=71.24
	maximum=18075.50 minimum=17777.95

A small improvement is observed in instructions_per_op. It could be random fluctuations in the compiler performance, or maybe the default constructor/destructor of interval are meaningful even in this simple test.

Small performance improvement, so not a backport candidate.

Closes scylladb/scylladb#24232

* github.com:scylladb/scylladb:
  interval: reduce sizeof
  interval: change start()/end() not to return references to data members
  interval: rename start_ref() back to start() (and end_ref() etc).
  interval: rename start() to start_ref() (and end() etc).
  test: wrapping_interval_test: add more tests for intervals
2025-06-16 09:23:56 +02:00
Botond Dénes
898ce98500 db/batchlog_manager: remove unused member _total_batches_replayed
And its getter. There are no users for either.

Closes scylladb/scylladb#24416
2025-06-16 09:37:00 +03:00
Nadav Har'El
847d9c0911 alternator: update documentation that ttl with tablets does work
Our documentation docs/alternator/new-apis.md claims that Alternator TTL
does not work with tablets, due to issue #16567. However, we fixed that
issue in commit de96c28625. So let's drop
the outdated statement that it doesn't work.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#24427
2025-06-16 09:36:11 +03:00
Ernest Zaslavsky
2b300c8eb9 s3_client: Improve reporting of S3 client statistics
Revise how we report statistics for `chunked_download_source`. Ensure
metrics for downloaded but unconsumed data are visible, as they do not
contribute to read amplification, which is tracked separately.

Closes scylladb/scylladb#24491
2025-06-16 09:33:57 +03:00
Pavel Emelyanov
9aaa33c15a Merge 'main.cc: fix group0 shutdown order' from Petr Gusev
Applier fiber needs local storage, so before shutting down local storage we need to make sure that group0 is stopped.

We also improve the logs for the case when `gate_closed_exception` is thrown while a mutation is being written.

Fixes [scylladb/scylladb#24401](https://github.com/scylladb/scylladb/issues/24401)

Backport: no backport -- not safe and the problem is minor.

Closes scylladb/scylladb#24418

* github.com:scylladb/scylladb:
  storage_service: test_group0_apply_while_node_is_being_shutdown
  main.cc: fix group0 shutdown order
  storage_proxy: log gate_closed_exception
2025-06-16 09:32:34 +03:00
Amnon Heiman
55b21b01ee alternator/stats.cc, metrics-config.yml: docs fix per-table metrics
This patch updates alternator/stats.cc and the get_description.py
configuration (metrics-config.yml) to restore compatibility with
per-table alternator metrics in the documentation generation process.

Previously, the group name for metrics was selected using an inline
expression like (has_table)? "alternator_table" : "alternator", which
made it difficult to maintain a straightforward mapping in the
configuration file.  With this change, the group name is now assigned to
a variable in alternator/stats.cc, allowing metrics-config.yml to map
group names directly. This makes the configuration easier to maintain
and enables get_description.py to document both global and per-table
metrics correctly.

This is a minimal, targeted fix to get the documentation working again
with the new per-table metrics format.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>

Closes scylladb/scylladb#24509
2025-06-15 18:06:36 +03:00
Jenkins Promoter
1b5eee6a12 Update pgo profiles - aarch64 2025-06-15 04:57:59 +03:00
Jenkins Promoter
e0c2d591c7 Update pgo profiles - x86_64 2025-06-15 04:44:13 +03:00
Avi Kivity
42d7ae1082 interval: reduce sizeof
An interval object stores five booleans: start()->is_inclusive(),
a boolean since start() itself is an std::optional, two more for
end(), and is_singular(). Due to bad packing, these five booleans
occupy 8 bytes each, for a total of 40 bytes.

Re-pack the interval class by storing those booleans explicitly
close by. Since we lose std::optional's ability to store
a maybe-constructed object, we re-implement it using anonymous
unions and therefore have to implement the 5 special methods.

This helps saves space when vectors of intervals are used, as
seen in #3335 for example.
2025-06-14 21:29:43 +03:00
Avi Kivity
f3dccc2215 interval: change start()/end() not to return references to data members
We'd like to change the data layout of `interval` to save space.
As a result, start() and end() which return references to data
members must return objects (not references). Since we'd like
to maintain zero-copy for these functions, we change them to
return objects containing references (rather than references
to objects), avoiding copying of potentially expensive objects.

We repurpose the interval_bound class to hold references (by
instantiating it with `const T&` instead of `T`) and provide
converting constructors. To make transform_bounds() retain
zero-copy, we add start() and end() that take *this by
rvalue reference.
2025-06-14 21:26:17 +03:00
Avi Kivity
16fb68bb5e interval: rename start_ref() back to start() (and end_ref() etc).
To reduce noise, rename start_ref() back to its original name start(),
after it was changed in the previous patch to force an audit of all calls.
2025-06-14 21:26:16 +03:00
Avi Kivity
3363bc41e2 interval: rename start() to start_ref() (and end() etc).
We are about to change start() to return a proxy object rather
than a `const interval_bound<T>&`. This is generally transparent,
except in one case: `auto x = i.start()`. With the current implementation,
we'll copy object referred to and assign it to x. With the planned
implementation, the proxy object will be assigned to `x`, but it
will keep referring to `i`.

To prevent such problems, rename start() to start_ref() and end()
to end_ref(). This forces us to audit all calls, and redirect calls
that will break to new start_copy() and end_copy() methods.
2025-06-14 21:26:16 +03:00
Avi Kivity
674118fd2e test: wrapping_interval_test: add more tests for intervals
In this series, we will make interval manage its memory directly,
specifically it will directly construct and destroy T values that
it contains rather than let std::optional<T> manage those values
itself.

Add tests that expose bugs encountered during development (actually,
review) of this series. The tests pass before the series, fail
with series as it was before fixing, and pass with the series as
it is now.

The tests use a class maybe_throwing_interval_payload that can
be set to throw at strategic locations and exercise all the interesting
interval shapes.
2025-06-14 21:26:14 +03:00
Patryk Jędrzejczak
c4cf95aeb3 Merge 'raft: simplify voter handler code to not pass node references around' from Emil Maskovsky
Refactor the voter handler logic to only pass around node IDs (`raft::server_id`), instead of pairs of IDs and node descriptor references. Node descriptors can always be efficiently retrieved from the original nodes map, which remains valid throughout the calculation.

This change reduces unnecessary reference passing and simplifies the code. All node detail lookups are now performed via the central nodes map as needed.

Additional cleanup has been done:
* removing redundant comments (that just repeat what the code does)
* use explicit comparators for the datacenter and rack information priorities (instead of the comparison operator) to be more explicit about the prioritization

Fixes: scylladb/scylladb#24035

No backport: This change does not fix any bug and doesn't change the behavior, just cleans up the code in master, therefore no backport is needed.

Closes scylladb/scylladb#24452

* https://github.com/scylladb/scylladb:
  raft: simplify voter handler code to not pass node references around
  raft: reformat voter handler for consistent indentation
  raft: use explicit priority comparators for datacenters and racks
  raft: clean up voter handler by removing redundant comments
2025-06-13 19:02:07 +02:00
Anna Stuchlik
e2b7302183 doc: extend 2025.2 upgrade with a note about consistent topology updates
This commit adds a note that the user should enable consistent topology updates before upgrading
to 2025.2 if they didn't do it (for some reason) when previously upgrading to version 2025.1.

Fixes https://github.com/scylladb/scylladb/issues/24467

Closes scylladb/scylladb#24468
2025-06-13 13:54:59 +03:00