scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-02 14:15:46 +00:00

Author	SHA1	Message	Date
Emil Maskovsky	c04cbd5aa2	encryption: cover system.raft table in system_info_encryption Extend system_info_encryption to encrypt system.raft SSTables. system.raft contains the Raft log, which may hold sensitive user data (e.g. batched mutations), so it warrants the same treatment as system.batchlog and system.paxos. During upgrade, existing unencrypted system.raft SSTables remain readable. Existing data is rewritten encrypted via compaction, or immediately via nodetool upgradesstables -a. Update the operator-facing system_info_encryption description to mention system.raft and add a focused test that verifies the schema extension is present on system.raft. Fixes: CUSTOMER-317 Backport: 2026.1 - closes an encryption-at-rest coverage gap: system.raft may persist sensitive user-originated data unencrypted; backport to the current LTS. Closes scylladb/scylladb#29242 (cherry picked from commit `91df3795fc`) Closes scylladb/scylladb#29526 Closes scylladb/scylladb#29582 Closes scylladb/scylladb#29615	2026-04-24 18:08:41 +03:00
Botond Dénes	5f86c008e2	Merge 'cql3: fix null handling in data_value formatting' from Dario Mirovic `data_value::to_parsable_string()` crashes with a null pointer dereference when called on a `null` data_value. Return `"null"` instead. Added tests after the fix. Manually checked that tests fail without the fix. Fixes SCYLLADB-1350 This is a fix that prevents format crash. No known occurrence in production, but backport is desirable. Closes scylladb/scylladb#29262 * github.com:scylladb/scylladb: test: boost: test null data value to_parsable_string cql3: fix null handling in data_value formatting (cherry picked from commit `816f2bf163`) Closes scylladb/scylladb#29468	2026-04-16 11:00:44 +03:00
Botond Dénes	ac36cafa98	db/row_cache: make_nonpopulating_reader(): pass cache tracker to snapshot The API contract in partition_version.hh states that when dealing with evictable entries, a real cache tracker pointer has to be passed to all methods that ask for it. The nonpopulating reader violates this, passing a nullptr to the snapshot. This was observed to cause a crash when a concurrent cache read accessed the snapshot with the null tracker. A reproducer is included which fails before and passes after the fix. Fixes: #26847 Closes scylladb/scylladb#28163 (cherry picked from commit `a53f989d2f`) Closes scylladb/scylladb#28277	2026-01-30 16:22:22 +02:00
Botond Dénes	abf7c34b40	reader_concurrency_semaphore: improve handling of base resources reader_permit::release_base_resources() is a soft evict for the permit: it releases the resources aquired during admission. This is used in cases where a single process owns multiple permits, creating a risk for deadlock, like it is the case for repair. In this case, release_base_resources() acts as a manual eviction mechanism to prevent permits blockings each other from admission. Recently we found a bad interaction between release_base_resources() and permit eviction. Repair uses both mechanism: it marks its permits as inactive and later it also uses release_base_resources(). This partice might be worth reconsidering, but the fact remains that there is a bug in the reader permit which causes the base resources to be released twice when release_base_resources() is called on an already evicted permit. This is incorrect and is fixed in this patch. Improve release_base_resources(): * make _base_resources const * move signal call into the if (_base_resources_consumed()) { } * use reader_permit::impl::signal() instead of reader_concurrency_semaphore::signal() * all places where base resources are released now call release_base_resources() A reproducer unit test is added, which fails before and passes after the fix. Fixes: #28083 Closes scylladb/scylladb#28155 (cherry picked from commit `b7bc48e7b7`) Closes scylladb/scylladb#28238	2026-01-21 06:47:21 +02:00
Nikos Dragazis	d21b37c8eb	test: database_test: Fix serialization of partition key The `make_key` lambda erroneously allocates a fixed 8-byte buffer (`sizeof(s.size())`) for variable-length strings, potentially causing uninitialized bytes to be included. If such bytes exist and they are not valid UTF-8 characters, deserialization fails: ``` ERROR 2026-01-16 08:18:26,062 [shard 0:main] testlog - snapshot_list_contains_dropped_tables: cql env callback failed, error: exceptions::invalid_request_exception (Exception while binding column p1: marshaling error: Validation failed - non-UTF8 character in a UTF8 string, at byte offset 7) ``` Fixes #28195. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com> Closes scylladb/scylladb#28197 (cherry picked from commit `8aca7b0eb9`) Closes scylladb/scylladb#28207	2026-01-19 09:44:21 +02:00
Calle Wilund	74a721b9db	db::commitlog: Fix sanity check error on race between segment flushing and oversized alloc Fixes #27992 When doing a commit log oversized allocation, we lock out all other writers by grabbing the _request_controller semaphore fully (max capacity). We thereafter assert that the semaphore is in fact zero. However, due to how things work with the bookkeep here, the semaphore can in fact become negative (some paths will not actually wait for the semaphore, because this could deadlock). Thus, if, after we grab the semaphore and execution actually returns to us (task schedule), new_buffer via segment::allocate is called (due to a non-fully-full segment), we might in fact grab the segment overhead from zero, resulting in a negative semaphore. The same problem applies later when we try to sanity check the return of our permits. Fix is trivial, just accept less-than-zero values, and take same possible ltz-value into account in exit check (returning units) Added whitebox (special callback interface for sync) unit test that provokes/creates the race condition explicitly (and reliably). Closes scylladb/scylladb#27998 (cherry picked from commit `a7cdb602e1`) Closes scylladb/scylladb#28095	2026-01-16 16:24:43 +02:00
Patryk Jędrzejczak	52f20e66ef	Merge '[Backport 2025.1] database: truncate_table_on_all_shards: consider can_flush on all shards' from Scylladb[bot] Currently, database::truncate_table_on_all_shards calls the table::can_flush only on the coordinator shard and therefore it may miss shards with dirty data if the coordinator shard happens to have empty memtables, leading to clearing the memtables with dirty data rather than flushing them. This change fixes that by making flush safe to be called, even if the memtable list is empty, and calling it on every shard that can flush (i.e. seal_immediate_fn is engaged). Also, change database_test::do_with_some_data is use random keys instead of hard-coded key names, to reproduce this issue with `snapshot_list_contains_dropped_tables`. Fixes #27639 * The issue exists since forever and might cause data loss due to wrongly clearing the memtable, so it needs backport to all live versions - (cherry picked from commit `ec4069246d`) - (cherry picked from commit `5be6b80936`) - (cherry picked from commit `0342a24ee0`) - (cherry picked from commit `02ee341a03`) - (cherry picked from commit `2a803d2261`) - (cherry picked from commit `93b827c185`) - (cherry picked from commit `ebd667a8e0`) Parent PR: #27643 Closes scylladb/scylladb#28067 * https://github.com/scylladb/scylladb: test: database_test: do_with_some_data: randomize keys database: truncate_table_on_all_shards: drop outdated TODO comment database: truncate_table_on_all_shards: consider can_flush on all shards memtable_list: unify can_flush and may_flush test: database_test: add test_flush_empty_table_waits_on_outstanding_flush replica: table, storage_group, compaction_group: add needs_flush test: database_test: do_with_some_data_in_thread: accept void callback function	2026-01-12 11:27:01 +01:00
Botond Dénes	f9be7b4672	reader_concurrency_semaphore: add protection against negative count resource leaks The semaphore has detection and protection against regular resource leaks, where some resources go unaccounted for and are not released by the time the semaphore is destroyed. There is no detection or protection against negative leaks: where resources are "made up" of thin air. This kind of leaks looks benign at first sight, a few extra resources won't hurt anyone so long as this is a small amount. But turns out that even a single extra count resource can defeat a very important anti-deadlock protection in can_admit_read(): the special case which admits a new permit regardless of memory resources, when all original count resources all available. This check uses ==, so if resource > original, the protection is defeated indefinitely. Instead of just changing == to >=, we add detection of such negative leaks to signal(), via on_internal_error_noexcept(). At this time I still don't now how this negative leak happens (the code doesn't confess), with this detection, hopefully we'll get a clue from tests or the field. Note that on_internal_error_noexcept() will not generate a coredump, unless ScyllaDB is explicitely configured to do so. In production, it will just generate an error log with a backtrace. The detection also clams the _resources to _initial_resources, to prevent any damage from the negativae leak. I just noticed that there is no unit test for the deadlock protection described above, so one is added in this PR, even if only loosely related to the rest of the patch. Fixes: SCYLLADB-163 Closes scylladb/scylladb#27764 (cherry picked from commit `e4da0afb8d`) Closes scylladb/scylladb#28000	2026-01-09 13:28:46 +02:00
Benny Halevy	7389784151	test: database_test: do_with_some_data: randomize keys With randomized keys, and since we're inserting only 2 keys, it is possible that they would end up owned only by a single shard, reproducing #27639 in snapshot_list_contains_dropped_tables. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `ebd667a8e0`)	2026-01-09 09:02:29 +02:00
Benny Halevy	3920cacadc	test: database_test: add test_flush_empty_table_waits_on_outstanding_flush Test that table::flush waits on outstanding flushes, even if the active memtable is empty Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `0342a24ee0`)	2026-01-09 08:57:39 +02:00
Benny Halevy	faecb2aabc	test: database_test: do_with_some_data_in_thread: accept void callback function Many test cases already assume `func` is being called a seastar thread and although the function they pass returns a (ready) future, it serves no purpose other than to conform to the interface. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `ec4069246d`)	2026-01-09 08:53:16 +02:00
Aleksandra Martyniuk	9422baf49f	db: repair: throw if replay fails Return a flag determining whether all the batches were sent successfully in batchlog_manager::replay_all_failed_batches (batches skipped due to being too fresh are not counted). Throw in repair_flush_hints_batchlog_handler if not all batches were replayed, to ensure that repair_time isn't updated. (cherry picked from commit `7f20b66eff`)	2025-12-16 15:55:25 +01:00
Calle Wilund	7defa0b4cd	commitlog::read_log_file: Check for eof position on all data reads Fixes #24346 When reading, we check for each entry and each chunk, if advancing there will hit EOF of the segment. However, IFF the last chunk being read has the last entry _exactly_ matching the chunk size, and the chunk ending at _exactly_ segment size (preset size, typically 32Mb), we did not check the position, and instead complained about not being able to read. This has literally _never_ happened in actual commitlog (that was replayed at least), but has apparently happened more and more in hints replay. Fix is simple, just check the file position against size when advancing said position, i.e. when reading (skipping already does). v2: * Added unit test Closes scylladb/scylladb#27236 (cherry picked from commit `59c87025d1`) Closes scylladb/scylladb#27336	2025-12-03 12:21:13 +03:00
Michael Litvak	e3f5924f71	tablet: scheduler: Do not emit conflicting migration in merge colocation The tablet scheduler should not emit conflicting migrations for the same tablet. This was addressed initially in scylladb/scylladb#26038 but the check is missing in the merge colocation plan, so add it there as well. Without this check, the merge colocation plan could generate a conflicting migration for a tablet that is already scheduled for migration, as the test demonstrates. This can cause correctness problems, because if the load balancer generates two migrations for a single tablet, both will be written as mutations, and the resulting mutation could contain mixed cells from both migrations. Fixes scylladb/scylladb#27304 Closes scylladb/scylladb#27312 (cherry picked from commit `97b7c03709`)	2025-11-30 10:37:58 +01:00
Tomasz Grabiec	dc1a318971	tablet: scheduler: Do not emit conflicting migrations in the plan Plan-making is invoked independently for different DCs (and in the future, racks) and then plans are merged. It could be that the same tablets are selected for migration in different DCs. Only one migration will prevail and be committed to group0, so it's not a correctness problem. Next cycle will recognize that the tablet is in transition and will not be selected by plan-maker. But it makes plan-making less efficient. It may also surprise consumers of the plan, like we saw in #25912. So we should make plan-maker be aware of already scheduled transitions and not consider those tablets as candidates. Fixes #26038 Closes scylladb/scylladb#26048 (cherry picked from commit `981592bca5`)	2025-11-30 10:37:58 +01:00
Gleb Natapov	01c71681a8	test: test that expired erm that held for too long triggers notification (cherry picked from commit `5dcdaa6f66`)	2025-11-26 15:07:28 +00:00
Pavel Emelyanov	14fd0d9c21	lister: Fix race between readdir and stat Sometimes file::list_directory() returns entries without type set. In thase case lister calls file_type() on the entry name to get it. In case the call returns disengated type, the code assumes that some error occurred and resolves into exception. That's not correct. The file_type() method returns disengated type only if the file being inspected is missing (i.e. on ENOENT errno). But this can validly happen if a file is removed bettween readdir and stat. In that case it's not "some error happened", but a enry should be just skipped. In "some error happened", then file_type() would resolve into exceptional future on its own. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#26595 (cherry picked from commit `d9bfbeda9a`) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#26754	2025-11-24 17:19:42 +03:00
Avi Kivity	67c4d980dd	Revert "Merge 'auth: move passwords::check call to alien thread' from Andrzej Jackowski" This reverts commit `1fd82d32e0`. It causes connection storms to snowball into a node crash via this mechanism: 1. large node suffers mild connection storm 2. password hash requests queue up on alien hash thread 3. incoming hash requests queue faster than the alien thread can retire them. 4. auth latency grows without bounds 5. this encourages the clients to create new connections 6. problem grows Reverting the patch restores the hash stall, but at least prevents node crashes. Fixes #26461 (2025.1) Closes scylladb/scylladb#26462	2025-10-09 11:04:34 +03:00
Botond Dénes	b37ddaee90	Merge '[Backport 2025.1] compaction: ensure that all compaction executors are stopped' from Scylladb[bot] Currently, while stopping the compaction_manager, we stop task_manager compaction module and concurrently run compaction_manager::really_do_stop. really_do_stop stops and waits for all task_executors that are kept in compaction_manager::_tasks, but nothing ensures that no more tasks will be added there. Due to leftover tasks, we trigger on_fatal_internal_error. Modify the order of compaction_manager::stop. After the change, we stop compaction tasks in the following order: - abort module abort source; - close module gate in the background; - stop_ongoing_compactions (kept in compaction_manager::_tasks); - wait until module gate is closed. Check module abort source before creating compaction executor and adding it to _tasks. Thanks to the above, we can be sure that: - after module::stop there will be no tasks in _tasks; - compaction_manager::stop aborts all tasks; we don't wait for any whole compaction to finish. Fixes: https://github.com/scylladb/scylladb/issues/25806. Fixes shutdown bug; Needs backports to all version - (cherry picked from commit `17707d0e6b`) - (cherry picked from commit `97c77d7cd5`) Parent PR: #25885 Closes scylladb/scylladb#26222 * github.com:scylladb/scylladb: compaction: move _tasks check compaction: stop compaction module in really_do_stop	2025-10-06 07:04:20 +03:00
Aleksandra Martyniuk	9c5ed9586d	compaction: stop compaction module in really_do_stop Currently, compaction::task_manager_module is stopped in compaction_manager::stop, concurrently to really_do_stop. We can't predict the order of the two. Do not set _task_manager_module to nullptr at stop, because compaction_manager::really_do_stop() may be called before the actual shutdown, while other components still try to use it. compaction::task_manager_module does not keep a pointer to compaction_manager, so we won't end up with memory leak. Stop compaction module in really_do_stop, after ongoing compactions are stopped. It's a preparation for further patches. (cherry picked from commit `17707d0e6b`)	2025-09-25 16:18:45 +02:00
Ferenc Szili	38aa7be6a1	load_balancer: fix std::out_of_bounds when decommissioning with empty nodes Consider the following: The tablet load balancer is working on: - node1: an empty node (no tablets) with a large disk capacity - node2: an empty node (no tablets) with a lower disk capacity then node1 - node3: is being decommissioned and contains tablet replicas In load_balancer::make_internode_plan() the initial destination node/shard is selected like this: // Pick best target shard. auto dst = global_shard_id {target, _load_sketch->get_least_loaded_shard(target)}; load_sketch::get_least_loaded_shard(host_id) calls ensure_node() which adds the host to load_sketch's internal hash maps in case the node was not yet seen by load_sketch. Let's assume dst is a shard on node1. Later in load_balancer::make_internode_plan() we will call pick_candidate() to try to find a better destination node than the initial one: // May choose a different source shard than src.shard or different destination host/shard than dst. auto candidate = co_await pick_candidate(nodes, src_node_info, target_info, src, dst, nodes_by_load_dst, drain_skipped); auto source_tablets = candidate.tablets; src = candidate.src; dst = candidate.dst; If pick_candidate() selects some other empty destination (due to larger capacity: node1) node, and that node has not yet been seen by load_sketch (because it was empty), a subsequent call to load_sketch::pick() will search for the node using std::unordered_map::at(), and because the node is not found it will throw a std::out_of_bounds() exception crashing the load balancer. This problem is fixed by changing load_sketch::populate() to initialize its internal maps with all the nodes which populate()'s arguments filter for. Fixes: #26203 Closes scylladb/scylladb#26207 (cherry picked from commit `c6c9c316a7`) Closes scylladb/scylladb#26238	2025-09-25 09:51:23 +03:00
Pavel Emelyanov	4431dd158f	Merge '[Backport 2025.1] compaction/scrub: register sstables for compaction before validation' from Scylladb[bot] compaction/scrub: register sstables for compaction before validation When `scrub --validate` runs, it collects all candidate sstables at the start and validates them one by one in separate compaction tasks. However, scrub in validate mode does not register these sstables for compaction, which allows regular compaction to pick them up and potentially compact them away before validation begins. This leads to scrub failures because the sstables can no longer be found. This patch fixes the issue by first disabling compaction, collecting the sstables, and then registering them for compaction before starting validation. This ensures that the enqueued sstables remain available for the entire duration of the scrub validation task. Fixes #23363 This reported scrub failure occurs on all versions that have the checksum/digest validation feature for uncompressed sstables. So, backport it to older versions. - (cherry picked from commit `84f2e99c05`) - (cherry picked from commit `7cdda510ee`) Parent PR: #26034 Closes scylladb/scylladb#26097 * github.com:scylladb/scylladb: compaction/scrub: register sstables for compaction before validation compaction/scrub: handle exceptions when moving invalid sstables to quarantine	2025-09-24 09:50:05 +03:00
Łukasz Paszkowski	51b13edd2c	compaction_manager: cancel submission timer on drain The `drain` method, cancels all running compactions and moves the compaction manager into the disabled state. To move it back to the enabled state, the `enable` method shall be called. This, however, throws an assertion error as the submission time is not cancelled and re-enabling the manager tries to arm the armed timer. Thus, cancel the timer, when calling the drain method to disable the compaction manager. Fixes https://github.com/scylladb/scylladb/issues/24504 All versions are affected. So it's a good candidate for a backport. Closes scylladb/scylladb#24505 (cherry picked from commit `a9a53d9178`) Closes scylladb/scylladb#24589	2025-09-23 10:30:34 +03:00
Szymon Malewski	21610258df	alternator/expressions.g: Fix antlr3 missing token leak This patch overrides the antlr3 function that allocates the missing tokens that would eventually leak. The override stores these tokens in a vector, ensuring memory is freed whenever the parser is destroyed. Solution is copied from CQL implementation. A unit test to reproduce the issue is added - leak would be reported by ASAN, when running this test in debug mode - the test passed but the leak is discovered when the test file exits. Fixes #25878 Closes scylladb/scylladb#25930 (cherry picked from commit `776f90e2f8`) Closes scylladb/scylladb#26083	2025-09-19 19:13:24 +03:00
Lakshmi Narayanan Sreethar	f26bc7e2e1	compaction/scrub: register sstables for compaction before validation When `scrub --validate` runs, it collects all candidate sstables at the start and validates them one by one in separate compaction tasks. However, scrub in validate mode does not register these sstables for compaction, which allows regular compaction to pick them up and potentially compact them away before validation begins. This leads to scrub failures because the sstables can no longer be found. This patch fixes the issue by first disabling compaction, collecting the sstables, and then registering them for compaction before starting validation. This ensures that the enqueued sstables remain available for the entire duration of the scrub validation task. Fixes #23363 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `7cdda510ee`) Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-09-19 18:30:13 +05:30
Wojciech Mitros	90536040ad	mv: delete previously undetected ghost rows in PRUNE MATERIALIZED VIEW statement The PRUNE MATERIALIZED VIEW statement is supposed to remove ghost rows from the view. Ghost rows are rows in the view with no corresponding row in the base table. Before this patch, only rows whose primary key columns of the base table had different values than any of the base rows were treated as ghost rows by the PRUNE statement. However, view rows which have a column in their primary key that's not in the base primary can also be ghost rows if this column has a different value than the base row with the same values of remaining primary key columns. That's because these rows won't be deleted unless we change value of this column in the base table to this specific value. In this patch we add a check for this column in the PRUNE MATERIALIZED VIEW logic. If this column isn't the same in the base table and the view, these rows are also deleted. Fixes https://github.com/scylladb/scylladb/issues/25655 Closes scylladb/scylladb#25720 (cherry picked from commit `1f9be235b8`) Closes scylladb/scylladb#25954	2025-09-16 16:00:02 +02:00
Avi Kivity	1fd82d32e0	Merge 'auth: move passwords::check call to alien thread' from Andrzej Jackowski Analysis of customer stalls revealed that the function `detail::hash_with_salt` (invoked by `passwords::check`) often blocks the reactor. Internally, this function uses the external `crypt_r` function to compute password hashes, which is CPU-intensive. This PR addresses the issue in two ways: 1) `sha-512` is now the only password hashing scheme for new passwords (it was already the common-case). 2) `passwords::check` is moved to a dedicated alien thread. Regarding point 1: before this change, the following hashing schemes were supported by `identify_best_supported_scheme()`: bcrypt_y, bcrypt_a, SHA-512, SHA-256, and MD5. The reason for this was that the `crypt_r` function used for password hashing comes from an external library (currently `libxcrypt`), and the supported hashing algorithms vary depending on the library in use. However: - The bcrypt schemes never worked properly because their prefixes lack the required round count (e.g. `$2y$` instead of `$2y$05$`). Moreover, bcrypt is slower than SHA-512, so it not good idea to fix or use it. - SHA-256 and SHA-512 both belong to the SHA-2 family. Libraries that support one almost always support the other, so it’s very unlikely to find SHA-256 without SHA-512. - MD5 is no longer considered secure for password hashing. Regarding point 2: the `passwords::check` call now runs on a shared alien thread created at database startup. An `std::mutex` synchronizes that thread with the shards. In theory this could introduce a frequent lock contention, but in practice each shard handles only a few hundred new connections per second—even during storms. There is already `_conns_cpu_concurrency_semaphore` in `generic_server` limits the number of concurrent connection handlers. Fixes https://github.com/scylladb/scylladb/issues/24524 Backport not needed, as it is a new feature. Closes scylladb/scylladb#24924 * github.com:scylladb/scylladb: main: utils: add thread names to alien workers auth: move passwords::check call to alien thread test: wait for 3 clients with given username in test_service_level_api auth: refactor password checking in password_authenticator auth: make SHA-512 the only password hashing scheme for new passwords auth: whitespace change in identify_best_supported_scheme() auth: require scheme as parameter for `generate_salt` auth: check password hashing scheme support on authenticator start (cherry picked from commit `c762425ea7`)	2025-09-07 15:47:34 +03:00
Pavel Emelyanov	bf0d1cf272	Merge '[Backport 2025.1] encryption_at_rest_test: Improve exception handling and increase verbosity in fake proxy' from Nikos Dragazis PRs included: * #22810 (encryption_at_rest_test/encryption: Add some verbosity etc to help diagnose test run issues) * #23778 (encryption_at_rest_test: Make fake_proxy read/write loop noexcept) Backporting them together because the latter needs the former to avoid conflicts, but the former cannot be backported individually due to not fixing an issue. * (cherry picked from commit `83aa66da1a`) * (cherry picked from commit `5905c19ab4`) * (cherry picked from commit `00263aa57a`) * (cherry picked from commit `4a44651fce`) Refs #22628. Fixes #23774. Closes scylladb/scylladb#25774 * github.com:scylladb/scylladb: encryption_at_rest_test: Make fake_proxy read/write loop noexcept gcp/aws kms: Promote service_error to recoverable + use malformed_response_error encryption_at_rest_test: Add verbosity + earlier stream close to proxy encryption: Add exception handler to context init (for tests)	2025-09-03 20:51:45 +03:00
Calle Wilund	39242c3d5a	commitlog: Ensure segment deletion is re-entrant Fixes #25709 If we have large allocations, spanning more than one segment, and the internal segment references from lead to secondary are the only thing keeping a segment alive, the implicit drop in discard_unused_segments and orphan_all can cause a recursive call to discard_unused_segments, which in turn can lead to vector corruption/crash, or even double free of segment (iterator confusion). Need to separate the modification of the vector (_segments) from actual releasing of objects. Using temporaries is the easiest solution. To further reduce recursion, we can also do an early clear of segment dependencies in callbacks from segment release (cf release). Closes scylladb/scylladb#25719 (cherry picked from commit `cc9eb321a1`) Closes scylladb/scylladb#25754	2025-09-03 06:55:59 +03:00
Nadav Har'El	fc738ee9ce	Merge '[backport 2025.1] token_range_vector: fragment' from Avi Kivity token_range_vector is a sequence of intervals of tokens. It is used to describe vnodes or token ranges owned by shards. Since tokens are bloated (16 bytes instead of 8), and intervals are bloated (40 byte of overhead instead of 8), and since we have plenty of token ranges, such vectors can exceed our allocation unit of 128 kB and cause allocation stalls. This series fixes that by first generalizing some helpers and then changing token_range_vector to use chunked_vector. Although this touches IDL, there is no compatibility problem since the encoding for vector and chunked_vector are identical. There is no performance concern since token_range_vector is never used on any hot path (hot paths always contain a partition key). Fixes #3335. Fixes #24115. Fixes #24156 Backport notes: Due to compiler limitations in this toolchain, the template template parameters were replaced by elaborate template metaprogramming, see patch 'partition_range_compat: generalize wrap/unwrap helpers'. Closes scylladb/scylladb#25704 * github.com:scylladb/scylladb: dht: fragment token_range_vector partition_range_compat: generalize wrap/unwrap helpers utils: chunked_vector: add swap() method utils: chunked_vector: add range insert() overloads	2025-09-01 19:07:54 +03:00
Calle Wilund	57f3f1465e	encryption_at_rest_test: Make fake_proxy read/write loop noexcept Fixes #23774 Test code falls into same when_all issue as http client did. Avoid passing exceptions through this, and instead catch and report in worker lambda. Closes scylladb/scylladb#23778 (cherry picked from commit `4a44651fce`)	2025-09-01 16:08:33 +03:00
Calle Wilund	e52675415f	encryption_at_rest_test: Add verbosity + earlier stream close to proxy Refs #22628 Adds some verbosity to track issues with the network proxy used to test EAR connector difficulties. Also adds an earlier close in input stream to help network usage. Note: This is a diagnostic helper. Still cannot repro the issue above. (cherry picked from commit `5905c19ab4`)	2025-09-01 16:06:57 +03:00
Taras Veretilnyk	bd4a80301c	keys: from_nodetool_style_string don't split single partition keys Users with single-column partition keys that contain colon characters were unable to use certain REST APIs and 'nodetool' commands, because the API split key by colon regardless of the partition key schema. Affected commands: - 'nodetool getendpoints' - 'nodetool getsstables' Affected endpoints: - '/column_family/sstables/by_key' - '/storage_service/natural_endpoints' Refs: #16596 - This does not fully fix the issue, as users with compound keys will face the issue if any column of the partition key contains a colon character. Closes scylladb/scylladb#24829 Closes scylladb/scylladb#25554	2025-09-01 09:03:12 +03:00
Avi Kivity	951841f71f	utils: chunked_vector: add swap() method Following std::vector(), we implement swap(). It's a simple matter of swapping all the contents. A unit test is added. (cherry picked from commit `13a75ff835`)	2025-08-27 14:24:35 +03:00
Avi Kivity	81891a6c31	utils: chunked_vector: add range insert() overloads Inserts an iterator range at some position. Again we insert the range at the end and use std::rotate() to move the newly inserted elements into place, forgoing possible optimizations. Unit tests are added. (cherry picked from commit `24e0d17def`)	2025-08-27 14:24:35 +03:00
Sergey Zolotukhin	0cb64577fd	generic_server: Two-step connection shutdown. When shutting down in `generic_server`, connections are now closed in two steps. First, only the RX (receive) side is shut down. Then, after all ongoing requests are completed, or a timeout happened the connections are fully closed. Fixes scylladb/scylladb#24481 (cherry picked from commit `ea311be12b`)	2025-08-20 10:30:09 +02:00
Nikos Dragazis	22dbdafd64	test: kmip: Fix segfault from premature destruction of port_promise `kmip_test_helper()` is a utility function to spawn a dedicated PyKMIP server for a particular Boost test case. The function runs the server as an external process and uses a thread to parse the port from the server's logs. The thread communicates the port to the main thread via a promise. The current implementation has a bug where the thread may set a value to the promise after its destruction, causing a segfault. This happens when the server does not start within 20 seconds, in which case the port future throws and the stack unwinding machinery destroys the port promise before the thread that writes to it. Fix the bug by declaring the promise before the cleanup action. The bug has been encountered in CI runs on slow machines, where the PyKMIP server takes too long to create its internal tables (due to slow fdatasync calls from SQLite). This patch does not improve CI stability - it only ensures that the error condition is properly reflected in the test output. This patch is not a backport. The same bug has been fixed in master as part of a larger rewrite of the `kmip_test_helper()` (see `722e2bce96`). Refs #24747, #24842. Fixes #24574. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com> Closes scylladb/scylladb#25028	2025-08-06 11:56:59 +03:00
Botond Dénes	f453b5bfa3	Merge '[Backport 2025.1] sstables: Fix quadratic space complexity in partitioned_sstable_set' from Scylladb[bot] Interval map is very susceptible to quadratic space behavior when it's flooded with many entries overlapping all (or most of) intervals, since each such entry will have presence on all intervals it overlaps with. A trigger we observed was memtable flush storm, which creates many small "L0" sstables that spans roughly the entire token range. Since we cannot rely on insertion order, solution will be about storing sstables with such wide ranges in a vector (unleveled). There should be no consequence for single-key reads, since upper layer applies an additional filtering based on token of key being queried. And for range scans, there can be an increase in memory usage, but not significant because the sstables span an wide range and would have been selected in the combined reader if the range of scan overlaps with them. Anyway, this is a protection against storm of memtable flushes and shouldn't be the common scenario. It works both with tablets and vnodes, by adjusting the token range spanned by compaction group accordingly. Fixes #23634. We can backport this into 2024.2, 2025.1, but we should let this cook in master for 1 month or so. - (cherry picked from commit `494ed6b887`) - (cherry picked from commit `59dad2121f`) - (cherry picked from commit `21d1e78457`) - (cherry picked from commit `c77f710a0c`) - (cherry picked from commit `d5bee4c814`) Parent PR: #23806 Closes scylladb/scylladb#24012 * github.com:scylladb/scylladb: test: Verify partitioned set store split and unsplit correctly sstables: Fix quadratic space complexity in partitioned_sstable_set compaction: Wire table_state into make_sstable_set() compaction: Introduce token_range() to table_state dht: Add overlap_ratio() for token range	2025-08-06 09:56:43 +03:00
Nikos Dragazis	3e96b9a13d	test: Use in-memory SQLite for PyKMIP server The PyKMIP server uses an SQLite database to store artifacts such as encryption keys. By default, SQLite performs a full journal and data flush to disk on every CREATE TABLE operation. Each operation triggers three fdatasync(2) calls. If we multiply this by 16, that is the number of tables created by the server, we get a significant number of file syncs, which can last for several seconds on slow machines. This behavior has led to CI stability issues from KMIP unit tests where the server failed to complete its schema creation within the 20-second timeout (observed on spider9 and spider11). Fix this by configuring the server to use an in-memory SQLite. Fixes #24842. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com> Closes scylladb/scylladb#24995 (cherry picked from commit `2656fca504`) Closes scylladb/scylladb#25297	2025-08-06 09:53:18 +03:00
Avi Kivity	6f44cd672e	Merge '[Backport 2025.1] qos: don't populate effective service level cache until auth is migrated to raft' from Scylladb[bot] Right now, service levels are migrated in one group0 command and auth is migrated in the next one. This has a bad effect on the group0 state reload logic - modifying service levels in group0 causes the effective service levels cache to be recalculated, and to do so we need to fetch information about all roles. If the reload happens after SL upgrade and before auth upgrade, the query for roles will be directed to the legacy auth tables in system_auth - and the query, being a potentially remote query, has a timeout. If the query times out, it will throw an exception which will break the group0 apply fiber and the node will need to be restarted to bring it back to work. In order to solve this issue, make sure that the service level module does not start populating and using the service level cache until both service levels and auth are migrated to raft. This is achieved by adding the check both to the cache population logic and the effective service level getter - they now look at service level's accessor new method, `can_use_effective_service_level_cache` which takes a look at the auth version. Fixes: scylladb/scylladb#24963 Should be backported to all versions which support upgrade to topology over raft - the issue described here may put the cluster into a state which is difficult to get out of (group0 apply fiber can break on multiple nodes, which necessitates their restart). - (cherry picked from commit `2bb800c004`) - (cherry picked from commit `3a082d314c`) Parent PR: #25188 Closes scylladb/scylladb#25283 * github.com:scylladb/scylladb: test: sl: verify that legacy auth is not queried in sl to raft upgrade qos: don't populate effective service level cache until auth is migrated to raft	2025-08-03 15:35:32 +03:00
Piotr Dulikowski	afe786843f	qos: don't populate effective service level cache until auth is migrated to raft Right now, service levels are migrated in one group0 command and auth is migrated in the next one. This has a bad effect on the group0 state reload logic - modifying service levels in group0 causes the effective service levels cache to be recalculated, and to do so we need to fetch information about all roles. If the reload happens after SL upgrade and before auth upgrade, the query for roles will be directed to the legacy auth tables in system_auth - and the query, being a potentially remote query, has a timeout. If the query times out, it will throw an exception which will break the group0 apply fiber and the node will need to be restarted to bring it back to work. In order to solve this issue, make sure that the service level module does not start populating and using the service level cache until both service levels and auth are migrated to raft. This is achieved by adding the check both to the cache population logic and the effective service level getter - they now look at service level's accessor new method, `can_use_effective_service_level_cache` which takes a look at the auth version. Fixes: scylladb/scylladb#24963 (cherry picked from commit `2bb800c004`)	2025-07-31 15:12:50 +00:00
Dario Mirovic	028de964c8	utils/reusable_buffer: accept non-throwing writer callbacks via result_with_exception Make make_bytes_ostream and make_fragmented_temporary_buffer accept writer callbacks that return utils::result_with_exception instead of forcing them to throw on error. This lets callers propagate failures by returning an error result rather than throwing an exception. Introduce buffer_writer_for, bytes_ostream_writer, and fragmented_buffer_writer concepts to simplify and document the template requirements on writer callbacks. This patch does not modify the actual callbacks passed, except for the syntax changes needed for successful compilation, without changing the logic. Refs: #24567 Fixes: #25273 (cherry picked from commit `9f4344a435`)	2025-07-30 21:57:15 +02:00
Pavel Emelyanov	57b24383ed	Merge '[Backport 2025.1] Improve background disposal of tablet_metadata' from Scylladb[bot] As seen in #23284, when the tablet_metadata contains many tables, even empty ones, we're seeing a long queue of seastar tasks coming from the individual destruction of `tablet_map_ptr = foreign_ptr<lw_shared_ptr<const tablet_map>>`. This change improves `tablet_metadata::clear_gently` to destroy the `tablet_map_ptr` objects on their owner shard by sorting them into vectors, per- owner shard. Also, background call to clear_gently was added to `~token_metadata`, as it is destroyed arbitrarily when automatic token_metadata_ptr variables go out of scope, so that the contained tablet_metadata would be cleared gently. Finally, a unit test was added to reproduce the `Too long queue accumulated for gossip` symptom and verify that it is gone with this change. Fixes #24814 Refs #23284 This change is not marked as fixing the issue since we still need to verify that there is no impact on query performance, reactor stalls, or large allocations, with a large number of tablet-based tables. * Since the issue exists in 2025.1, requesting backport to 2025.1 and upwards - (cherry picked from commit `3acca0aa63`) - (cherry picked from commit `493a2303da`) - (cherry picked from commit `e0a19b981a`) - (cherry picked from commit `2b2cfaba6e`) - (cherry picked from commit `2c0bafb934`) - (cherry picked from commit `4a3d14a031`) - (cherry picked from commit `6e4803a750`) Parent PR: #24618 Closes scylladb/scylladb#24862 * github.com:scylladb/scylladb: token_metadata_impl: clear_gently: release version tracker early test: topology_custom: test_tablets_merge: add test_tablet_split_merge_with_many_tables token_metadata: clear_and_destroy_impl when destroyed token_metadata: keep a reference to shared_token_metadata token_metadata: move make_token_metadata_ptr into shared_token_metadata class replica: database: get and expose a mutable locator::shared_token_metadata locator: tablets: tablet_metadata: clear_gently: optimize foreign ptr destruction	2025-07-22 11:02:00 +03:00
Dawid Mędrek	02108573b6	cdc: Forbid altering columns of CDC log tables directly The set of columns of a CDC log table should be managed automatically by Scylla, and the user should not have the ability to manipulate them directly. That could lead to disastrous consequences such as a segmentation fault. In this commit, we're restricting those operations. We also provide two validation tests. One of the existing tests had to be adjusted as it modified the type of a column in a CDC log table. Since the test simply verifies that the user has sufficient permissions to perform `ALTER TABLE` on the log table, the test is still valid. Fixes scylladb/scylladb#24643 (cherry picked from commit `20d0050f4e`)	2025-07-21 11:42:38 +00:00
Benny Halevy	d70ed984cb	token_metadata: clear_and_destroy_impl when destroyed We have a lot of places in the code where a token_metadata_ptr is kept in an automatic variable and destroyed when it leaves the scope. since it's a referenced counted lw_shared_ptr, the token_metadata object is rarely destroyed in those cases, but when it is, it doesn't go through clear_gently, and in particular its tablet_metadata is not cleared gently, leading to inefficient destruction of potentially many foreign_ptr:s. This patch calls clear_and_destroy_impl that gently clears and destroys the impl object in the background using the shared_token_metadata. Fixes #13381 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `2c0bafb934`)	2025-07-21 10:02:19 +03:00
Benny Halevy	e24bfb0bf3	token_metadata: keep a reference to shared_token_metadata To be used by a following patch to gently clean and destroy the token_data_impl in the background. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `2b2cfaba6e`)	2025-07-21 10:00:49 +03:00
Benny Halevy	aa8e65616e	token_metadata: move make_token_metadata_ptr into shared_token_metadata class So we can use the local shared_token_metadata instance for safe background destroy of token_metadata_impl:s. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `e0a19b981a`)	2025-07-21 10:00:49 +03:00
Avi Kivity	60135c8d6c	utils: chunked_vector: implement erase() for single elements and ranges Implement using std::rotate() and resize(). The elements to be erased are rotated to the end, then resized out of existence. Again we defer optimization for trivially copyable types. Unit tests are added. Needed for range_streamer with token_ranges using chunked_vector. (cherry picked from commit `d6eefce145`)	2025-07-16 07:38:55 +08:00
Avi Kivity	efad391fac	utils: chunked_vector: implement insert() for single-element inserts partition_range_compat's unwrap() needs insert if we are to use it for chunked_vector (which we do). Implement using push_back() and std::rotate(). emplace(iterator, args) is also implemented, though the benefit is diluted (it will be moved after construction). The implementation isn't optimal - if T is trivially copyable then using std::memmove() will be much faster that std::rotate(), but this complex optimization is left for later. Unit tests are added. (cherry picked from commit `5301f3d0b5`)	2025-07-16 07:38:55 +08:00
Raphael S. Carvalho	63bdbebdef	sstables: Fix quadratic space complexity in partitioned_sstable_set Interval map is very susceptible to quadratic space behavior when it's flooded with many entries overlapping all (or most of) intervals, since each such entry will have presence on all intervals it overlaps with. A trigger we observed was memtable flush storm, which creates many small "L0" sstables that spans roughly the entire token range. Since we cannot rely on insertion order, solution will be about storing sstables with such wide ranges in a vector (unleveled). There should be no consequence for single-key reads, since upper layer applies an additional filtering based on token of key being queried. And for range scans, there can be an increase in memory usage, but not significant because the sstables span an wide range and would have been selected in the combined reader if the range of scan overlaps with them. Anyway, this is a protection against storm of memtable flushes and shouldn't be the common scenario. It works both with tablets and vnodes, by adjusting the token range spanned by compaction group accordingly. Fixes #23634. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> (cherry picked from commit `c77f710a0c`)	2025-07-11 10:05:30 -03:00

1 2 3 4 5 ...

3809 Commits