Commit Graph

4014 Commits

Author SHA1 Message Date
Ernest Zaslavsky
fa1ca8096b s3_client: add tests for calc_part_size
Introduce tests that validate the corrected multipart part-size
calculation, including boundary conditions and error cases.

(cherry picked from commit 6280cb91ca)
2026-02-18 09:40:10 +00:00
Ernest Zaslavsky
af3f266496 s3_test: remove client double-close
`test_chunked_download_data_source_with_delays` was calling `close()` on a client twice, remove the unnecessary call

(cherry picked from commit bd9d5ad75b)
2026-02-04 09:31:45 +02:00
Botond Dénes
3d9f38ccbb db/row_cache: make_nonpopulating_reader(): pass cache tracker to snapshot
The API contract in partition_version.hh states that when dealing with
evictable entries, a real cache tracker pointer has to be passed to all
methods that ask for it. The nonpopulating reader violates this, passing
a nullptr to the snapshot. This was observed to cause a crash when a
concurrent cache read accessed the snapshot with the null tracker.

A reproducer is included which fails before and passes after the fix.

Fixes: #26847

Closes scylladb/scylladb#28163

(cherry picked from commit a53f989d2f)

Closes scylladb/scylladb#28278
2026-01-30 16:15:26 +02:00
Botond Dénes
cb6b20e197 reader_concurrency_semaphore: improve handling of base resources
reader_permit::release_base_resources() is a soft evict for the permit:
it releases the resources aquired during admission. This is used in
cases where a single process owns multiple permits, creating a risk for
deadlock, like it is the case for repair. In this case,
release_base_resources() acts as a manual eviction mechanism to prevent
permits blockings each other from admission.

Recently we found a bad interaction between release_base_resources() and
permit eviction. Repair uses both mechanism: it marks its permits as
inactive and later it also uses release_base_resources(). This partice
might be worth reconsidering, but the fact remains that there is a bug
in the reader permit which causes the base resources to be released
twice when release_base_resources() is called on an already evicted
permit. This is incorrect and is fixed in this patch.

Improve release_base_resources():
* make _base_resources const
* move signal call into the if (_base_resources_consumed()) { }
* use reader_permit::impl::signal() instead of
  reader_concurrency_semaphore::signal()
* all places where base resources are released now call
  release_base_resources()

A reproducer unit test is added, which fails before and passes after the
fix.

Fixes: #28083

Closes scylladb/scylladb#28155

(cherry picked from commit b7bc48e7b7)

Closes scylladb/scylladb#28242
2026-01-21 06:45:13 +02:00
Ernest Zaslavsky
161b66759c aws_error: fix nested exception handling
The loop that unwraps nested exception, rethrows nested exception and saves pointer to the temporary std::exception& inner on stack, then continues. This pointer is, thus, pointing to a released temporary

Closes scylladb/scylladb#28143

(cherry picked from commit 829bd9b598)

Closes scylladb/scylladb#28239
2026-01-20 11:19:06 +01:00
Nikos Dragazis
f89af142da test: database_test: Fix serialization of partition key
The `make_key` lambda erroneously allocates a fixed 8-byte buffer
(`sizeof(s.size())`) for variable-length strings, potentially causing
uninitialized bytes to be included. If such bytes exist and they are
not valid UTF-8 characters, deserialization fails:

```
ERROR 2026-01-16 08:18:26,062 [shard 0:main] testlog - snapshot_list_contains_dropped_tables: cql env callback failed, error: exceptions::invalid_request_exception (Exception while binding column p1: marshaling error: Validation failed - non-UTF8 character in a UTF8 string, at byte offset 7)
```

Fixes #28195.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>

Closes scylladb/scylladb#28197

(cherry picked from commit 8aca7b0eb9)

Closes scylladb/scylladb#28208
2026-01-19 09:40:25 +02:00
Calle Wilund
6a7c9cd750 db::commitlog: Fix sanity check error on race between segment flushing and oversized alloc
Fixes #27992

When doing a commit log oversized allocation, we lock out all other writers by grabbing
the _request_controller semaphore fully (max capacity).
We thereafter assert that the semaphore is in fact zero. However, due to how things work
with the bookkeep here, the semaphore can in fact become negative (some paths will not
actually wait for the semaphore, because this could deadlock).

Thus, if, after we grab the semaphore and execution actually returns to us (task schedule),
new_buffer via segment::allocate is called (due to a non-fully-full segment), we might
in fact grab the segment overhead from zero, resulting in a negative semaphore.

The same problem applies later when we try to sanity check the return of our permits.

Fix is trivial, just accept less-than-zero values, and take same possible ltz-value
into account in exit check (returning units)

Added whitebox (special callback interface for sync) unit test that provokes/creates
the race condition explicitly (and reliably).

Closes scylladb/scylladb#27998

(cherry picked from commit a7cdb602e1)

Closes scylladb/scylladb#28097
2026-01-16 16:21:22 +02:00
Łukasz Paszkowski
1284c18647 load_sketch: Allow populating load_sketch with normalized current load
Currently, tablet allocation intentionally ignores current load (
introduced by the commit #1e407ab) which could cause identical shard
selection when allocating a small number of tablets in the same topology.
When a tablet allocator is asked to allocate N tablets (where N is smaller
than the number of shards on a node), it selects the first N lowest shards.
If multiple such tables are created, each allocator run picks the same
shards, leading to tablet imbalance across shards.

This change initializes the load sketch with the current shard load,
scaled into the [0,1] range, ensuring allocation still remains even
while starting from globally least-loaded shards.

Fixes https://github.com/scylladb/scylladb/issues/27620

Closes https://github.com/scylladb/scylladb/pull/27802

Closes scylladb/scylladb#28106
2026-01-16 13:50:16 +01:00
Patryk Jędrzejczak
cefe7b270f Merge '[Backport 2025.3] database: truncate_table_on_all_shards: consider can_flush on all shards' from Scylladb[bot]
Currently, database::truncate_table_on_all_shards calls the table::can_flush only on the coordinator shard
and therefore it may miss shards with dirty data if the coordinator shard happens to have empty memtables, leading to clearing the memtables with dirty data rather than flushing them.

This change fixes that by making flush safe to be called, even if the memtable list is empty, and calling it on every shard that can flush (i.e. seal_immediate_fn is engaged).

Also, change database_test::do_with_some_data is use random keys instead of hard-coded key names, to reproduce this issue with `snapshot_list_contains_dropped_tables`.

Fixes #27639

* The issue exists since forever and might cause data loss due to wrongly clearing the memtable, so it needs backport to all live versions

- (cherry picked from commit ec4069246d)

- (cherry picked from commit 5be6b80936)

- (cherry picked from commit 0342a24ee0)

- (cherry picked from commit 02ee341a03)

- (cherry picked from commit 2a803d2261)

- (cherry picked from commit 93b827c185)

- (cherry picked from commit ebd667a8e0)

Parent PR: #27643

Closes scylladb/scylladb#28071

* https://github.com/scylladb/scylladb:
  test: database_test: do_with_some_data: randomize keys
  database: truncate_table_on_all_shards: drop outdated TODO comment
  database: truncate_table_on_all_shards: consider can_flush on all shards
  memtable_list: unify can_flush and may_flush
  test: database_test: add test_flush_empty_table_waits_on_outstanding_flush
  replica: table, storage_group, compaction_group: add needs_flush
  test: database_test: do_with_some_data_in_thread: accept void callback function
2026-01-12 11:21:42 +01:00
Botond Dénes
16576da935 reader_concurrency_semaphore: add protection against negative count resource leaks
The semaphore has detection and protection against regular resource
leaks, where some resources go unaccounted for and are not released by
the time the semaphore is destroyed. There is no detection or protection
against negative leaks: where resources are "made up" of thin air. This
kind of leaks looks benign at first sight, a few extra resources won't
hurt anyone so long as this is a small amount. But turns out that even a
single extra count resource can defeat a very important anti-deadlock
protection in can_admit_read(): the special case which admits a new
permit regardless of memory resources, when all original count resources
all available. This check uses ==, so if resource > original, the
protection is defeated indefinitely. Instead of just changing == to >=,
we add detection of such negative leaks to signal(), via
on_internal_error_noexcept().
At this time I still don't now how this negative leak happens (the code
doesn't confess), with this detection, hopefully we'll get a clue from
tests or the field. Note that on_internal_error_noexcept() will not
generate a coredump, unless ScyllaDB is explicitely configured to do so.
In production, it will just generate an error log with a backtrace.
The detection also clams the _resources to _initial_resources, to
prevent any damage from the negativae leak.

I just noticed that there is no unit test for the deadlock protection
described above, so one is added in this PR, even if only loosely
related to the rest of the patch.

Fixes: SCYLLADB-163

Closes scylladb/scylladb#27764

(cherry picked from commit e4da0afb8d)

Closes scylladb/scylladb#28002
2026-01-09 13:28:14 +02:00
Benny Halevy
c495252be6 test: database_test: do_with_some_data: randomize keys
With randomized keys, and since we're inserting only 2 keys,
it is possible that they would end up owned only by a single shard,
reproducing #27639 in snapshot_list_contains_dropped_tables.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit ebd667a8e0)
2026-01-09 08:35:29 +02:00
Benny Halevy
d5dcfd9133 test: database_test: add test_flush_empty_table_waits_on_outstanding_flush
Test that table::flush waits on outstanding flushes, even if the active memtable is empty

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 0342a24ee0)
2026-01-09 08:31:33 +02:00
Benny Halevy
0e8b738dff test: database_test: do_with_some_data_in_thread: accept void callback function
Many test cases already assume `func` is being called a seastar
thread and although the function they pass returns a (ready) future,
it serves no purpose other than to conform to the interface.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit ec4069246d)
2026-01-09 08:16:41 +02:00
Aleksandra Martyniuk
af5a3281a2 db: repair: throw if replay fails
Return a flag determining whether all the batches were sent successfully in
batchlog_manager::replay_all_failed_batches (batches skipped due to being
too fresh are not counted). Throw in repair_flush_hints_batchlog_handler
if not all batches were replayed, to ensure that repair_time isn't updated.

(cherry picked from commit 7f20b66eff)
2025-12-16 15:38:32 +01:00
Tomasz Grabiec
fed0f95626 address_map: Use barrier() to wait for replication
More efficient than 100 pings.

There was one ping in test which was done "so this shard notices the
clock advance". It's not necessary, since obsering completed SMP
call implies that local shard sees the clock advancement done within in.

(cherry picked from commit f83c4ffc68)
2025-12-04 14:50:31 +01:00
Tomasz Grabiec
eba97f80e0 address_map: Use more efficient and reliable replication method
Primary issue with the old method is that each update is a separate
cross-shard call, and all later updated queue behind it. If one of the
shards has high latency for such calls, the queue may accumulate and
system will appear unresponsive for mapping changes on non-zero shards.

This happened in the field when one of the shards was overloaded with
sstables and compaction work, which caused frequent stalls which
delayed polling for ~100ms. A queue of 3k address updates
accumulated. This made bootstrap impossible, since nodes couldn't
learn about the IP mapping for the bootstrapping node and streaming
failed.

To protect against that, use a more efficient method of replication
which requires a single cross-shard call to replicate all prior
updates.

It is also more reliable, if replication fails transiently for some
reason, we don't give up and fail all later updates.

Fixes #26865
Fixes #26835

(cherry picked from commit 4a85ea8eb2)
2025-12-04 14:50:31 +01:00
Tomasz Grabiec
777b54f072 utils: Introduce helper for replicated data structures
Key goals:
  - efficient (batching updates)
  - reliable (no lost updates)

Will be used in data structures maintained on one designed owning
shard and replicated to other shards.

(cherry picked from commit ed8d127457)
2025-12-04 14:50:31 +01:00
Calle Wilund
7eb51568e9 commitlog::read_log_file: Check for eof position on all data reads
Fixes #24346

When reading, we check for each entry and each chunk, if advancing there
will hit EOF of the segment. However, IFF the last chunk being read has
the last entry _exactly_ matching the chunk size, and the chunk ending
at _exactly_ segment size (preset size, typically 32Mb), we did not check
the position, and instead complained about not being able to read.

This has literally _never_ happened in actual commitlog (that was replayed
at least), but has apparently happened more and more in hints replay.

Fix is simple, just check the file position against size when advancing
said position, i.e. when reading (skipping already does).

v2:

* Added unit test

Closes scylladb/scylladb#27236

(cherry picked from commit 59c87025d1)

Closes scylladb/scylladb#27343
2025-12-03 12:25:02 +03:00
Michael Litvak
4ac402c5c5 tablet: scheduler: Do not emit conflicting migration in merge colocation
The tablet scheduler should not emit conflicting migrations for the same
tablet. This was addressed initially in scylladb/scylladb#26038 but the
check is missing in the merge colocation plan, so add it there as well.

Without this check, the merge colocation plan could generate a
conflicting migration for a tablet that is already scheduled for
migration, as the test demonstrates.

This can cause correctness problems, because if the load balancer
generates two migrations for a single tablet, both will be written as
mutations, and the resulting mutation could contain mixed cells from
both migrations.

Fixes scylladb/scylladb#27304

Closes scylladb/scylladb#27312

(cherry picked from commit 97b7c03709)
2025-11-30 10:13:42 +01:00
Tomasz Grabiec
951d3f50ea tablet: scheduler: Do not emit conflicting migrations in the plan
Plan-making is invoked independently for different DCs (and in the
future, racks) and then plans are merged. It could be that the same
tablets are selected for migration in different DCs. Only one
migration will prevail and be committed to group0, so it's not a
correctness problem. Next cycle will recognize that the tablet is in
transition and will not be selected by plan-maker. But it makes
plan-making less efficient.

It may also surprise consumers of the plan, like we saw in #25912.

So we should make plan-maker be aware of already scheduled transitions
and not consider those tablets as candidates.

Fixes #26038

Closes scylladb/scylladb#26048

(cherry picked from commit 981592bca5)
2025-11-30 10:00:22 +01:00
Gleb Natapov
aa75444438 test: test that expired erm that held for too long triggers notification
(cherry picked from commit 5dcdaa6f66)
2025-11-26 15:08:41 +00:00
Pavel Emelyanov
5eb6da551f Merge '[Backport 2025.3] db/config: Add SSTable compression options for user tables' from Scylladb[bot]
ScyllaDB offers the `compression` DDL property for configuring compression per user table (compression algorithm and chunk size). If not specified, the default compression algorithm is the LZ4Compressor with a 4KiB chunk size. The same default applies to system tables as well.

This series introduces a new configuration option to allow customizing the default for user tables. It also adds some tests for the new functionality.

Fixes #25195.

- (cherry picked from commit 1106157756)

- (cherry picked from commit ea41f652c4)

- (cherry picked from commit a7e46974d4)

- (cherry picked from commit e1d9c83406)

- (cherry picked from commit 8d5bd212ca)

- (cherry picked from commit 6ba0fa20ee)

- (cherry picked from commit 8410532fa0)

Parent PR: #26003

Closes scylladb/scylladb#26301

* github.com:scylladb/scylladb:
  test/cluster: Add tests for invalid SSTable compression options
  test/boost: Add tests for SSTable compression config options
  main: Validate SSTable compression options from config
  db/config: Add SSTable compression options for user tables
  db/config: Prepare compression_parameters for config system
  compressor: Validate presence of sstable_compression in parameters
  compressor: Add missing space in exception message
2025-10-30 10:31:16 +03:00
Pavel Emelyanov
0e6381f14d lister: Fix race between readdir and stat
Sometimes file::list_directory() returns entries without type set. In
thase case lister calls file_type() on the entry name to get it. In case
the call returns disengated type, the code assumes that some error
occurred and resolves into exception.

That's not correct. The file_type() method returns disengated type only
if the file being inspected is missing (i.e. on ENOENT errno). But this
can validly happen if a file is removed bettween readdir and stat. In
that case it's not "some error happened", but a enry should be just
skipped. In "some error happened", then file_type() would resolve into
exceptional future on its own.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#26595

(cherry picked from commit d9bfbeda9a)

Closes scylladb/scylladb#26764
2025-10-29 11:34:47 +02:00
Ernest Zaslavsky
aca20f5ca5 s3_client: improve exception handling for chunked downloads
Refactor the wrapping exception used in `chunked_download_source` to
prevent the retry strategy from reattempting failed requests. The new
implementation preserves the original `exception_ptr`, making the root
cause clearer and easier to diagnose.

(cherry picked from commit 1d34657b14)
2025-10-22 15:24:06 +03:00
Nikos Dragazis
5cfbddfa43 test/boost: Add tests for SSTable compression config options
Since patch 03461d6a54, all boost unit tests depending on `cql_test_env`
are compiled into a single executable (`combined_tests`). Add the new
test in there.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
(cherry picked from commit 6ba0fa20ee)
2025-10-20 09:28:13 +03:00
Michał Chojnowski
75f671ff18 test/boost/sstable_compressor_factory_test: fix thread-unsafe usage of Boost.Test
It turns out that Boost assertions are thread-unsafe,
(and can't be used from multiple threads concurrently).
This causes the test to fail with cryptic log corruptions sometimes.
Fix that by switching to thread-safe checks.

Fixes scylladb/scylladb#24982

Closes scylladb/scylladb#26472

(cherry picked from commit 7c6e84e2ec)

Closes scylladb/scylladb#26552
2025-10-15 12:13:54 +03:00
Botond Dénes
6e94299ccf Merge '[Backport 2025.3] compaction: ensure that all compaction executors are stopped' from Scylladb[bot]
Currently, while stopping the compaction_manager, we stop task_manager
compaction module and concurrently run compaction_manager::really_do_stop.
really_do_stop stops and waits for all task_executors that are kept
in compaction_manager::_tasks, but nothing ensures that no more tasks will
be added there. Due to leftover tasks, we trigger  on_fatal_internal_error.

Modify the order of compaction_manager::stop. After the change, we stop
compaction tasks in the following order:
- abort module abort source;
- close module gate in the background;
- stop_ongoing_compactions (kept in compaction_manager::_tasks);
- wait until module gate is closed.

Check module abort source before creating compaction executor and
adding it to _tasks.

Thanks to the above, we can be sure that:
- after module::stop there will be no tasks in _tasks;
- compaction_manager::stop aborts all tasks; we don't wait for any whole
  compaction to finish.

Fixes: https://github.com/scylladb/scylladb/issues/25806.

Fixes shutdown bug; Needs backports to all version

- (cherry picked from commit 17707d0e6b)

- (cherry picked from commit 97c77d7cd5)

Parent PR: #25885

Closes scylladb/scylladb#26224

* github.com:scylladb/scylladb:
  compaction: move _tasks check
  compaction: stop compaction module in really_do_stop
2025-09-26 13:20:52 +03:00
Aleksandra Martyniuk
9fc38c01a0 compaction: stop compaction module in really_do_stop
Currently, compaction::task_manager_module is stopped in compaction_manager::stop,
concurrently to really_do_stop. We can't predict the order of the two.

Do not set _task_manager_module to nullptr at stop, because
compaction_manager::really_do_stop() may be called before the actual
shutdown, while other components still try to use it.
compaction::task_manager_module does not keep a pointer to compaction_manager,
so we won't end up with memory leak.

Stop compaction module in really_do_stop, after ongoing compactions
are stopped.

It's a preparation for further patches.

(cherry picked from commit 17707d0e6b)
2025-09-25 15:55:55 +02:00
Ferenc Szili
99b69092ef load_balancer: fix std::out_of_bounds when decommissioning with empty nodes
Consider the following:

The tablet load balancer is working on:

- node1: an empty node (no tablets) with a large disk capacity
- node2: an empty node (no tablets) with a lower disk capacity then node1
- node3: is being decommissioned and contains tablet replicas

In load_balancer::make_internode_plan() the initial destination
node/shard is selected like this:

// Pick best target shard.
auto dst = global_shard_id {target, _load_sketch->get_least_loaded_shard(target)};

load_sketch::get_least_loaded_shard(host_id) calls ensure_node() which
adds the host to load_sketch's internal hash maps in case the node was
not yet seen by load_sketch.

Let's assume dst is a shard on node1.

Later in load_balancer::make_internode_plan() we will call
pick_candidate() to try to find a better destination node than the
initial one:

// May choose a different source shard than src.shard or different destination host/shard than dst.
auto candidate = co_await pick_candidate(nodes, src_node_info, target_info, src, dst, nodes_by_load_dst,
                                            drain_skipped);
auto source_tablets = candidate.tablets;
src = candidate.src;
dst = candidate.dst;

If pick_candidate() selects some other empty destination (due to larger
capacity: node1) node, and that node has not yet been seen by
load_sketch (because it was empty), a subsequent call to
load_sketch::pick() will search for the node using
std::unordered_map::at(), and because the node is not found it will
throw a std::out_of_bounds() exception crashing the load balancer.

This problem is fixed by changing load_sketch::populate() to initialize
its internal maps with all the nodes which populate()'s arguments
filter for.

Fixes: #26203

Closes scylladb/scylladb#26207

(cherry picked from commit c6c9c316a7)

Closes scylladb/scylladb#26240
2025-09-25 09:42:47 +03:00
Lakshmi Narayanan Sreethar
6e94a73fd4 compaction/scrub: register sstables for compaction before validation
When `scrub --validate` runs, it collects all candidate sstables at the
start and validates them one by one in separate compaction tasks.
However, scrub in validate mode does not register these sstables for
compaction, which allows regular compaction to pick them up and
potentially compact them away before validation begins. This leads to
scrub failures because the sstables can no longer be found.

This patch fixes the issue by first disabling compaction, collecting the
sstables, and then registering them for compaction before starting
validation. This ensures that the enqueued sstables remain available for
the entire duration of the scrub validation task.

Fixes #23363

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 7cdda510ee)
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-09-19 18:38:54 +05:30
Szymon Malewski
cca78c6568 alternator/expressions.g: Fix antlr3 missing token leak
This patch overrides the antlr3 function that allocates the missing
tokens that would eventually leak. The override stores these tokens in
a vector, ensuring memory is freed whenever the parser is destroyed.
Solution is copied from CQL implementation.

A unit test to reproduce the issue is added - leak would be reported
by ASAN, when running this test in debug mode - the test passed but
the leak is discovered when the test file exits.

Fixes #25878

Closes scylladb/scylladb#25930

(cherry picked from commit 776f90e2f8)

Closes scylladb/scylladb#26085
2025-09-18 07:50:31 +03:00
Wojciech Mitros
246fcb8b6a mv: delete previously undetected ghost rows in PRUNE MATERIALIZED VIEW statement
The PRUNE MATERIALIZED VIEW statement is supposed to remove ghost rows from the
view. Ghost rows are rows in the view with no corresponding row in the base table.
Before this patch, only rows whose primary key columns of the base table had
different values than any of the base rows were treated as ghost rows by the PRUNE
statement. However, view rows which have a column in their primary key that's not
in the base primary can also be ghost rows if this column has a different value
than the base row with the same values of remaining primary key columns. That's
because these rows won't be deleted unless we change value of this column in the
base table to this specific value.
In this patch we add a check for this column in the PRUNE MATERIALIZED VIEW logic.
If this column isn't the same in the base table and the view, these rows are also
deleted.

Fixes https://github.com/scylladb/scylladb/issues/25655

Closes scylladb/scylladb#25720

(cherry picked from commit 1f9be235b8)

Closes scylladb/scylladb#25956
2025-09-15 12:26:02 +02:00
Avi Kivity
0900a88884 Merge 'auth: move passwords::check call to alien thread' from Andrzej Jackowski
Analysis of customer stalls revealed that the function `detail::hash_with_salt` (invoked by `passwords::check`) often blocks the reactor. Internally, this function uses the external `crypt_r` function to compute password hashes, which is CPU-intensive.

This PR addresses the issue in two ways:
1) `sha-512` is now the only password hashing scheme for new passwords (it was already the common-case).
2) `passwords::check` is moved to a dedicated alien thread.

Regarding point 1: before this change, the following hashing schemes were supported by     `identify_best_supported_scheme()`: bcrypt_y, bcrypt_a, SHA-512, SHA-256, and MD5. The reason for this was that the `crypt_r` function used for password hashing comes from an external library (currently `libxcrypt`), and the supported hashing algorithms vary depending on the library in use. However:
- The bcrypt schemes never worked properly because their prefixes lack the required round count (e.g. `$2y$` instead of `$2y$05$`). Moreover, bcrypt is slower than SHA-512, so it  not good idea to fix or use it.
- SHA-256 and SHA-512 both belong to the SHA-2 family. Libraries that support one almost always support the other, so it’s very unlikely to find SHA-256 without SHA-512.
- MD5 is no longer considered secure for password hashing.

Regarding point 2: the `passwords::check` call now runs on a shared alien thread created at database startup. An `std::mutex` synchronizes that thread with the shards. In theory this could introduce a frequent lock contention, but in practice each shard handles only a few hundred new connections per second—even during storms. There is already `_conns_cpu_concurrency_semaphore` in `generic_server` limits the number of concurrent connection handlers.

Fixes https://github.com/scylladb/scylladb/issues/24524

Backport not needed, as it is a new feature.

Closes scylladb/scylladb#24924

* github.com:scylladb/scylladb:
  main: utils: add thread names to alien workers
  auth: move passwords::check call to alien thread
  test: wait for 3 clients with given username in test_service_level_api
  auth: refactor password checking in password_authenticator
  auth: make SHA-512 the only password hashing scheme for new passwords
  auth: whitespace change in identify_best_supported_scheme()
  auth: require scheme as parameter for `generate_salt`
  auth: check password hashing scheme support on authenticator start

(cherry picked from commit c762425ea7)
2025-09-07 13:38:33 +03:00
Taras Veretilnyk
ddb7c8ea12 keys: from_nodetool_style_string don't split single partition keys
Users with single-column partition keys that contain colon characters
were unable to use certain REST APIs and 'nodetool' commands, because the
API split key by colon regardless of the partition key schema.

Affected commands:
- 'nodetool getendpoints'
- 'nodetool getsstables'
Affected endpoints:
- '/column_family/sstables/by_key'
- '/storage_service/natural_endpoints'

Refs: #16596 - This does not fully fix the issue, as users with compound
keys will face the issue if any column of the partition key contains
a colon character.

Closes scylladb/scylladb#24829

Closes scylladb/scylladb#25565
2025-09-01 15:36:56 +03:00
Calle Wilund
fe87af4674 commitlog: Ensure segment deletion is re-entrant
Fixes #25709

If we have large allocations, spanning more than one segment, and
the internal segment references from lead to secondary are the
only thing keeping a segment alive, the implicit drop in
discard_unused_segments and orphan_all can cause a recursive call
to discard_unused_segments, which in turn can lead to vector
corruption/crash, or even double free of segment (iterator confusion).

Need to separate the modification of the vector (_segments) from
actual releasing of objects. Using temporaries is the easiest
solution.

To further reduce recursion, we can also do an early clear of
segment dependencies in callbacks from segment release (cf release).

Closes scylladb/scylladb#25719

(cherry picked from commit cc9eb321a1)

Closes scylladb/scylladb#25756
2025-08-30 18:50:47 +03:00
Ernest Zaslavsky
8a017834a0 s3_client: add memory fallback in chunked_download_source
Introduce fallback logic in `chunked_download_source` to handle
memory exhaustion. When memory is low, feed the `deque` with only
one uncounted buffer at a time. This allows slow but steady progress
without getting stuck on the memory semaphore.

Fixes: https://github.com/scylladb/scylladb/issues/25453
Fixes: https://github.com/scylladb/scylladb/issues/25262

Closes scylladb/scylladb#25452

(cherry picked from commit dd51e50f60)

Closes scylladb/scylladb#25511
2025-08-15 13:30:38 +03:00
Ernest Zaslavsky
c70ba8384e s3_client: make memory semaphore acquisition abortable
Add `abort_source` to the `get_units` call for the memory semaphore
in the S3 client, allowing the acquisition process to be aborted.

Fixes: https://github.com/scylladb/scylladb/issues/25454

Closes scylladb/scylladb#25469

(cherry picked from commit 380c73ca03)

Closes scylladb/scylladb#25499
2025-08-14 10:34:28 +02:00
Nikos Dragazis
26174a9c67 test: kmip: Fix segfault from premature destruction of port_promise
`kmip_test_helper()` is a utility function to spawn a dedicated PyKMIP
server for a particular Boost test case. The function runs the server as
an external process and uses a thread to parse the port from the
server's logs. The thread communicates the port to the main thread via
a promise.

The current implementation has a bug where the thread may set a value
to the promise after its destruction, causing a segfault. This happens
when the server does not start within 20 seconds, in which case the port
future throws and the stack unwinding machinery destroys the port
promise before the thread that writes to it.

Fix the bug by declaring the promise before the cleanup action.

The bug has been encountered in CI runs on slow machines, where the
PyKMIP server takes too long to create its internal tables (due to slow
fdatasync calls from SQLite). This patch does not improve CI stability -
it only ensures that the error condition is properly reflected in the
test output.

This patch is not a backport. The same bug has been fixed in master as
part of a larger rewrite of the `kmip_test_helper()` (see 722e2bce96).

Refs #24747, #24842.
Fixes #24574.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>

Closes scylladb/scylladb#25030
2025-08-06 11:59:40 +03:00
Ernest Zaslavsky
ccdc98c8f0 s3_creds: Make reload unconditional
Assume that any caller invoking `reload` intends to refresh credentials.
Remove conditional logic that checks for expiration before reloading.

(cherry picked from commit e4ebe6a309)
2025-08-06 00:50:36 +00:00
Ernest Zaslavsky
bd79ae3826 s3_creds: Add test exposing credentials renewal issue
Add a test demonstrating that renewing credentials does not update
their expiration. After requesting credentials again, the expiration
remains unchanged, indicating no actual update occurred.

(cherry picked from commit 68855c90ca)
2025-08-06 00:50:36 +00:00
Avi Kivity
a8193bd503 Merge '[Backport 2025.3] transport: remove throwing protocol_exception on connection start' from Dario Mirovic
`protocol_exception` is thrown in several places. This has become a performance issue, especially when starting/restarting a server. To alleviate this issue, throwing the exception has to be replaced with returning it as a result or an exceptional future.

This PR replaces throws in the `transport/server` module. This is achieved by using result_with_exception, and in some places, where suitable, just by creating and returning an exceptional future.

There are four commits in this PR. The first commit introduces tests in `test/cqlpy`. The second commit refactors transport server `handle_error` to not rethrow exceptions. The third commit refactors reusable buffer writer callbacks. The fourth commit replaces throwing `protocol_exception` to returning it.

Based on the comments on an issue linked in https://github.com/scylladb/scylladb/issues/24567, the main culprit from the side of protocol exceptions is the invalid protocol version one, so I tested that exception for performance.

In order to see if there is a measurable difference, a modified version of `test_protocol_version_mismatch` Python is used, with 100'000 runs across 10 processes (not threads, to avoid Python GIL). One test run consisted of 1 warm-up run and 5 measured runs. First test run has been executed on the current code, with throwing protocol exceptions. Second test urn has been executed on the new code, with returning protocol exceptions. The performance report is in https://github.com/scylladb/scylladb/pull/24738#issuecomment-3051611069. It shows ~10% gains in real, user, and sys time for this test.

Testing

Build: `release`

Test file: `test/cqlpy/test_protocol_exceptions.py`
Test name: `test_protocol_version_mismatch` (modified for mass connection requests)

Test arguments:
```
max_attempts=100'000
num_parallel=10
```

Throwing `protocol_exception` results:
```
real=1:26.97  user=10:00.27  sys=2:34.55  cpu=867%
real=1:26.95  user=9:57.10  sys=2:32.50  cpu=862%
real=1:26.93  user=9:56.54  sys=2:35.59  cpu=865%
real=1:26.96  user=9:54.95  sys=2:32.33  cpu=859%
real=1:26.96  user=9:53.39  sys=2:33.58  cpu=859%

real=1:26.95 user=9:56.85 sys=2:34.11 cpu=862%   # average
```

Returning `protocol_exception` as `result_with_exception` or an exceptional future:
```
real=1:18.46  user=9:12.21  sys=2:19.08  cpu=881%
real=1:18.44  user=9:04.03  sys=2:17.91  cpu=869%
real=1:18.47  user=9:12.94  sys=2:19.68  cpu=882%
real=1:18.49  user=9:13.60  sys=2:19.88  cpu=883%
real=1:18.48  user=9:11.76  sys=2:17.32  cpu=878%

real=1:18.47 user=9:10.91 sys=2:18.77 cpu=879%   # average
```

This PR replaced `transport/server` throws of `protocol_exception` with returns. There are a few other places where protocol exceptions are thrown, and there are many places where `invalid_request_exception` is thrown. That is out of scope of this single PR, so the PR just refs, and does not resolve issue #24567.

Refs: #24567
Fixes: #25271

This PR improves performance in cases when protocol exceptions happen, for example during connection storms. It will require backporting.

* (cherry picked from commit 7aaeed012e)

* (cherry picked from commit 30d424e0d3)

* (cherry picked from commit 9f4344a435)

* (cherry picked from commit 5390f92afc)

* (cherry picked from commit 4a6f71df68)

Parent PR: #24738

Closes scylladb/scylladb#25117

* github.com:scylladb/scylladb:
  test/cqlpy: add cpp exception metric test conditions
  transport/server: replace protocol_exception throws with returns
  utils/reusable_buffer: accept non-throwing writer callbacks via result_with_exception
  transport/server: avoid exception-throw overhead in handle_error
  test/cqlpy: add protocol_exception tests
2025-08-05 14:16:14 +03:00
Nikos Dragazis
257ebbeca9 test: Use in-memory SQLite for PyKMIP server
The PyKMIP server uses an SQLite database to store artifacts such as
encryption keys. By default, SQLite performs a full journal and data
flush to disk on every CREATE TABLE operation. Each operation triggers
three fdatasync(2) calls. If we multiply this by 16, that is the number
of tables created by the server, we get a significant number of file
syncs, which can last for several seconds on slow machines.

This behavior has led to CI stability issues from KMIP unit tests where
the server failed to complete its schema creation within the 20-second
timeout (observed on spider9 and spider11).

Fix this by configuring the server to use an in-memory SQLite.

Fixes #24842.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>

Closes scylladb/scylladb#24995

(cherry picked from commit 2656fca504)

Closes scylladb/scylladb#25300
2025-08-02 17:12:05 +03:00
Piotr Dulikowski
ba70b39486 qos: don't populate effective service level cache until auth is migrated to raft
Right now, service levels are migrated in one group0 command and auth
is migrated in the next one. This has a bad effect on the group0 state
reload logic - modifying service levels in group0 causes the effective
service levels cache to be recalculated, and to do so we need to fetch
information about all roles. If the reload happens after SL upgrade and
before auth upgrade, the query for roles will be directed to the legacy
auth tables in system_auth - and the query, being a potentially remote
query, has a timeout. If the query times out, it will throw
an exception which will break the group0 apply fiber and the node will
need to be restarted to bring it back to work.

In order to solve this issue, make sure that the service level module
does not start populating and using the service level cache until both
service levels and auth are migrated to raft. This is achieved by adding
the check both to the cache population logic and the effective service
level getter - they now look at service level's accessor new method,
`can_use_effective_service_level_cache` which takes a look at the auth
version.

Fixes: scylladb/scylladb#24963
(cherry picked from commit 2bb800c004)
2025-07-31 15:13:57 +00:00
Dario Mirovic
1078a1f03a utils/reusable_buffer: accept non-throwing writer callbacks via result_with_exception
Make make_bytes_ostream and make_fragmented_temporary_buffer accept
writer callbacks that return utils::result_with_exception instead of
forcing them to throw on error. This lets callers propagate failures
by returning an error result rather than throwing an exception.

Introduce buffer_writer_for, bytes_ostream_writer, and fragmented_buffer_writer
concepts to simplify and document the template requirements on writer callbacks.

This patch does not modify the actual callbacks passed, except for the syntax
changes needed for successful compilation, without changing the logic.

Refs: #24567
Fixes: #25271
(cherry picked from commit 9f4344a435)
2025-07-30 21:35:15 +02:00
Pavel Emelyanov
7c04619ecf Merge '[Backport 2025.3] encryption_at_rest_test: Fix some spurious errors' from Scylladb[bot]
Fixes #24574

* Ensure we close the embedded load_cache objects on encryption shutdown, otherwise we can, in unit testing, get destruction of these while a timer is still active -> assert
* Add extra exception handling to `network_error_test_helper`, so even if test framework might exception-escape, we properly stop the network proxy to avoid use after free.

- (cherry picked from commit ee98f5d361)

- (cherry picked from commit 8d37e5e24b)

Parent PR: #24633

Closes scylladb/scylladb#24772

* github.com:scylladb/scylladb:
  encryption_at_rest_test: Add exception handler to ensure proxy stop
  encryption: Ensure stopping timers in provider cache objects
2025-07-24 16:35:53 +03:00
Pavel Emelyanov
53637fdf61 Merge '[Backport 2025.3] storage: add make_data_or_index_source to the storages' from Scylladb[bot]
Add `make_data_or_index_source` to the storages to utilize new S3 based data source which should improve restore performance

* Introduce the `encrypted_data_source` class that wraps an existing data source to read and decrypt data on the fly using block encryption. Also add unit tests to verify correct decryption behavior.
* Add `make_data_or_index_source` to the `storage` interface, implement it  for `filesystem_storage` storage which just creates `data_source` from a file and for the `s3_storage` create a (maybe) decrypting source from s3 make_download_source. This change should solve performance improvement for reading large objects from S3 and should not affect anything for the `filesystem_storage`

Fixes: https://github.com/scylladb/scylladb/issues/22458

- (cherry picked from commit 211daeaa40)

- (cherry picked from commit 7e5e3c5569)

- (cherry picked from commit 0de61f56a2)

- (cherry picked from commit 8ac2978239)

- (cherry picked from commit dff9a229a7)

- (cherry picked from commit 8d49bb8af2)

Parent PR: #23695

Closes scylladb/scylladb#25016

* github.com:scylladb/scylladb:
  sstables: Start using `make_data_or_index_source` in `sstable`
  sstables: refactor readers and sources to use coroutines
  sstables: coroutinize futurized readers
  sstables: add `make_data_or_index_source` to the `storage`
  encryption: refactor key retrieval
  encryption: add `encrypted_data_source` class
2025-07-21 18:05:53 +03:00
Dawid Mędrek
10a9ced4d1 cdc: Forbid altering columns of CDC log tables directly
The set of columns of a CDC log table should be managed automatically
by Scylla, and the user should not have the ability to manipulate them
directly. That could lead to disastrous consequences such as a
segmentation fault.

In this commit, we're restricting those operations. We also provide two
validation tests.

One of the existing tests had to be adjusted as it modified the type
of a column in a CDC log table. Since the test simply verifies that
the user has sufficient permissions to perform `ALTER TABLE` on the log
table, the test is still valid.

Fixes scylladb/scylladb#24643

(cherry picked from commit 20d0050f4e)
2025-07-21 11:43:49 +00:00
Ernest Zaslavsky
549d139e84 sstables: Start using make_data_or_index_source in sstable
Convert all necessary methods to be awaitable. Start using `make_data_or_index_source`
when creating data_source for data and index components.

For proper working of compressed/checksummed input streams, start passing
stream creator functors to `make_(checksummed/compressed)_file_(k_l/m)_format_input_stream`.

(cherry picked from commit 8d49bb8af2)
2025-07-16 12:45:58 +00:00
Ernest Zaslavsky
243ba1fb66 encryption: add encrypted_data_source class
Introduce the `encrypted_data_source` class that wraps an existing data
source to read and decrypt data on the fly using block encryption. Also add
unit tests to verify correct decryption behavior.
NOTE: The wrapped source MUST read from offset 0, `encrypted_data_source` assumes it is

Co-authored-by: Calle Wilund <calle@scylladb.com>
(cherry picked from commit 211daeaa40)
2025-07-16 12:45:58 +00:00
Ernest Zaslavsky
873c8503cd s3_test: Add s3_client test for non-retryable error handling
Introduce a test that injects a non-retryable error and verifies
that the chunked download source throws an exception as expected.

(cherry picked from commit acf15eba8e)
2025-07-13 13:17:14 +00:00