scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-30 11:36:54 +00:00

Author	SHA1	Message	Date
Piotr Smaron	e441ce2fd9	test/nodetool: bind JMX to per-module loopback IP The Cassandra nodetool fixture picked a random JMX port on 127.0.0.1, which can collide with unrelated listeners and has a TOCTOU race between port selection and bind. Bind JMX to the per-module loopback IP with the standard port 7199 instead. Set java.rmi.server.hostname so the RMI endpoint stays on the same leased address.	2026-05-25 15:35:51 +02:00
Piotr Smaron	8fd946c649	test/pylib: default KMIP wrapper to loopback The standalone KMIP CLI wrapper could inherit a non-loopback hostname from defaults when the config did not specify one. Default to 127.0.0.1 so the dynamically assigned port remains local unless a test explicitly overrides the address.	2026-05-25 15:35:51 +02:00
Piotr Dulikowski	3a5dd2e5be	Merge 'strong_consistency: forward reads to the raft leader' from Wojciech Mitros Strongly consistent reads currently call read_barrier() on whichever replica happens to process the request. When a follower runs read_barrier(), it sends an RPC to the leader to get the current read index, then waits for its local apply index to catch up. If the follower is behind, this wait can be significant. By forwarding linearizable reads to the leader, we don't need an RPC from replica to leader to get the index to wait for apply -- it's available locally. Note that read_barrier() is still required on the leader to confirm it is still the leader and guarantee linearizability. A future optimization would be to implement leases in the raft library, which could eliminate read_barrier() on the leader entirely. The CL-to-behavior mapping is isolated in a single parse_consistency_level() function: - CL=(LOCAL_)QUORUM -> linearizable: forwarded to the raft leader - CL=(LOCAL_)ONE -> non-linearizable: existing behavior (no read_barrier()/forwarding, may return stale results) - All other CLs -> invalid request Read forwarding reuses the same CQL-layer bounce_to_node() mechanism that write forwarding already uses. The transport layer's existing requests_forwarded_* metrics automatically count forwarded reads. Coordinator-level metrics (linearizable_reads, non_linearizable_reads, writes) are added for visibility into the strong consistency workload. Fixes: SCYLLADB-1157 Closes scylladb/scylladb#29575 * github.com:scylladb/scylladb: strong_consistency: test read forwarding to leader strong_consistency: skip read_barrier() for non-linearizable reads strong_consistency: split coordinator-level read latency metrics strong_consistency: forward linearizable reads to raft leader strong_consistency: classify reads by consistency level strong_consistency: add begin_read() to raft_server	2026-05-25 10:55:00 +02:00
Avi Kivity	69a5b417d1	Merge 'pgo: enable tablets for SI and LWT' from Michael Litvak PGO training for secondary indexes and LWT was configured with tablets disabled because it wasn't supported at the time. This is no longer the case, so we should remove the restrictions and enable the training with the default mode. To make this work we also need to fix the training cluster to be RF-rack-valid, because some workloads have RF=3 but the cluster has 3 nodes in a single rack. We change the script to create a 3-rack cluster by writing a separate rackdc file for each node. no backport needed - small build improvement Closes scylladb/scylladb#30002 * github.com:scylladb/scylladb: pgo: enable train with tablets for SI and LWT pgo: make training cluster RF-rack-valid	2026-05-24 22:15:23 +03:00
Gleb Natapov	0bf050d175	storage_proxy: hold shared pointer to a table object during entire query_partition_key_range_concurrent execution Otherwise if a table is dropped in the middle of a scan the object may disappear. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-2137 Closes scylladb/scylladb#29988	2026-05-24 21:54:08 +03:00
Yaron Kaikov	648fa8f5b1	dist: remove bundled node_exporter, add dependency on scylla-node-exporter The node_exporter binary has moved to its own dedicated repository (scylladb/scylla-node-exporter). Remove the bundled copy from the core repo to eliminate the toolchain dependency required to build/package it here and to resolve associated CVEs inherited from the vendored binary. This removes the download logic, build rules, packaging subpackage, systemd/sysconfig/supervisor files, and install/uninstall references. Instead, a hard dependency on the separate scylla-node-exporter package is declared in both the RPM spec and Debian control file. [Yaron: - regenerate frozen toolchain with optimized clang from https://devpkg.scylladb.com/clang/clang-21.1.8-Fedora-43-aarch64.tar.gz https://devpkg.scylladb.com/clang/clang-21.1.8-Fedora-43-x86_64.tar.gz ] Fixes: RELENG-502 Fixes: RELENG-503 Closes scylladb/scylladb#29716	2026-05-24 16:30:24 +03:00
Avi Kivity	8f5c67b458	Merge 'logstor: disable logstor compaction in table truncate' from Michael Litvak in database::truncate_table_on_all_shards disable logstor compaction before the table data is truncated, similarly to how non-logstor compaction is disabled, to avoid race conditions between logstor compaction and segments discarding. Fixes [SCYLLADB-2186](https://scylladb.atlassian.net/browse/SCYLLADB-2186) backport to 2026.2 for CI stability [SCYLLADB-2186]: https://scylladb.atlassian.net/browse/SCYLLADB-2186?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ Closes scylladb/scylladb#30055 * github.com:scylladb/scylladb: logstor: compaction state cleanup logstor: disable logstor compaction in table truncate	2026-05-24 16:10:17 +03:00
Michael Litvak	bde18c4e51	logstor: compaction state cleanup add a simple cleanup for the logstor compaction state map to remove entries of stale compaction groups. remove the state of compaction group from the map if it doesn't have anything in progress.	2026-05-24 10:25:37 +02:00
Michael Litvak	73470150a0	logstor: disable logstor compaction in table truncate in database::truncate_table_on_all_shards disable logstor compaction before the table data is truncated, similarly to how non-logstor compaction is disabled, to avoid race conditions between logstor compaction and segments discarding. Fixes SCYLLADB-2186	2026-05-24 10:25:08 +02:00
Petr Gusev	954426407e	storage_proxy: only cancel write handlers with pending remote targets during drain The previous fix (cancel_all_write_response_handlers in do_drain) was too aggressive — it killed all handlers including ones used by group0 for raft commits. Since group0 is still running at that point (before wait_for_group0_stop), this caused group0 operations to fail (SCYLLADB-2168). The actual problem is only with handlers that have pending remote targets: after stop_transport() their MUTATION_DONE responses can never arrive via messaging. Handlers whose only pending targets are local can still complete via apply_locally and should be left alone. Add cancel_nonlocal_write_response_handlers() which checks each handler's remaining targets against the local host ID. Only handlers with at least one remote pending target are cancelled. Use it in do_drain instead of cancel_all_write_response_handlers. The latter remains unchanged for drain_on_shutdown (final proxy shutdown where all handlers must be killed). Fixes: SCYLLADB-2168 Closes scylladb/scylladb#30020	2026-05-23 13:37:34 +02:00
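The selection rule described in the commit above (cancel only handlers that still wait on at least one remote target) can be illustrated with a minimal Python sketch. This is not the ScyllaDB code, which is C++. The `Handler` class, its `pending_targets` field and the host IDs are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Handler:
    """Hypothetical stand-in for a write response handler."""
    name: str
    pending_targets: set = field(default_factory=set)
    cancelled: bool = False


def cancel_nonlocal(handlers, local_host_id):
    """Cancel handlers with at least one pending target that is not the local host.

    Handlers whose only pending target is the local host are left alone,
    since they can still complete locally.
    """
    for h in handlers:
        if any(target != local_host_id for target in h.pending_targets):
            h.cancelled = True
    return [h.name for h in handlers if h.cancelled]


handlers = [
    Handler("group0_commit", {"host-a"}),         # local-only: must survive
    Handler("user_write", {"host-a", "host-b"}),  # one remote target pending
    Handler("remote_only", {"host-c"}),           # only remote targets pending
]
cancelled = cancel_nonlocal(handlers, local_host_id="host-a")
```

In this sketch the local-only handler stays active, which mirrors the reason the commit gives for not cancelling every handler during drain.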
Wojciech Mitros	45f5df14e5	strong_consistency: test read forwarding to leader Test the linearizable read forwarding behavior in a single test that exercises all scenarios on one cluster: - CL=QUORUM reads on leader, follower, and non-replica nodes - CL=ONE reads (non-linearizable, no forwarding) - Linearizability: write + CL=QUORUM read from follower (10 iterations) - Coordinator latency histogram metrics for both read types Refs: SCYLLADB-1157	2026-05-23 11:35:37 +02:00
Wojciech Mitros	afa2ef6816	strong_consistency: skip read_barrier() for non-linearizable reads Non-linearizable reads (CL=ONE) no longer call read_barrier() before querying the local replica. This is safe because state_machine::apply() only writes to the table after raft commit, so a local read without read_barrier cannot see uncommitted data — just potentially stale data which is acceptable for CL=ONE semantics.	2026-05-23 11:35:37 +02:00
Wojciech Mitros	d07692a7ff	strong_consistency: split coordinator-level read latency metrics Split the latency metrics for strongly consistent reads into two categories: linearizable and non-linearizable. They replace the existing metrics for both types combined - this shouldn't cause issues because the feature is still experimental and both the initial introduction of latency metrics and the split will be a part of the same release. Also fix a test that was using the old metric.	2026-05-23 11:35:37 +02:00
Wojciech Mitros	297094c08f	strong_consistency: forward linearizable reads to raft leader For linearizable reads (CL=QUORUM), check leadership via begin_read() before proceeding. If this node is not the leader, redirect the request to the leader via need_redirect (handled by bounce_to_node() in the CQL layer). If the leader is unknown, wait and retry. When this node is the leader, perform read_barrier() locally. This avoids sending an RPC from the replica to the leader to get the index to wait for apply - it's available locally. Also, linearizable reads can use and fill the cache of leaders that we store for strongly consistent tablet groups. Non-linearizable reads (CL=ONE) retain the existing behavior: create_operation_ctx() redirects if not a replica, then read_barrier() is performed on the local replica. This will be changed in the following commit. Also fix a copy-paste typo in the unknown exception log message that said "mutate()" instead of "query()" Fixes: SCYLLADB-1157	2026-05-23 11:35:37 +02:00
Wojciech Mitros	c0ea98f922	strong_consistency: classify reads by consistency level Introduce a read_type enum (linearizable vs non_linearizable) and transform the existing "validate" function into a "parse" method - instead of checking if the consistency level is one of the accepted ones, we now also return the corresponding read type for strong consistency. The "parse" function maps CQL consistency levels to the following read types: - CL=(LOCAL_)QUORUM -> linearizable (this is the default CL) - CL=(LOCAL_)ONE -> non_linearizable - all others -> throw The classification is performed in the CQL layer (select_statement) to keep the coordinator free of CL concepts.	2026-05-23 11:35:37 +02:00
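The consistency-level mapping listed in the commit above can be sketched as a small lookup. This is a Python illustration under stated assumptions, not the C++ function from the commit. The `ReadType` enum, the table name and the error text are hypothetical, and only the mapping itself comes from the commit message.

```python
from enum import Enum


class ReadType(Enum):
    LINEARIZABLE = "linearizable"
    NON_LINEARIZABLE = "non_linearizable"


# Mapping taken from the commit message; every other CL is rejected.
_CL_TO_READ_TYPE = {
    "QUORUM": ReadType.LINEARIZABLE,
    "LOCAL_QUORUM": ReadType.LINEARIZABLE,
    "ONE": ReadType.NON_LINEARIZABLE,
    "LOCAL_ONE": ReadType.NON_LINEARIZABLE,
}


def parse_consistency_level(cl: str) -> ReadType:
    """Return the read type for a CL name, or raise for unsupported levels."""
    try:
        return _CL_TO_READ_TYPE[cl.upper()]
    except KeyError:
        raise ValueError(
            f"consistency level {cl} is not supported for strongly consistent reads"
        ) from None
```

Keeping the mapping in one table means a caller only ever sees a read type or an error, which is the separation the commit describes between the CQL layer and the coordinator.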
Wojciech Mitros	1f91524547	strong_consistency: add begin_read() to raft_server Add begin_read() method to raft_server that checks leadership for read operations. Unlike begin_mutate(), it does not need to compute a timestamp or interact with leader_info. It simply checks current_leader() and returns one of three dispositions: - ok: this node is the leader, proceed with read_barrier() locally - raft::not_a_leader: redirect to the indicated leader - need_wait_for_leader: leader unknown, caller must wait and retry This will be used by the read forwarding logic in subsequent commits.	2026-05-23 11:35:36 +02:00
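The three dispositions of begin_read() described above reduce to a comparison against the currently known leader. The Python sketch below only illustrates that decision. The function signature, the tuple return shape and the node IDs are hypothetical and do not reflect the C++ interface of raft_server.

```python
def begin_read(my_id, current_leader):
    """Decide how a linearizable read should proceed on this node.

    Returns a (disposition, leader) pair:
      ("ok", None)                    this node is the leader
      ("not_a_leader", leader)        redirect to the indicated leader
      ("need_wait_for_leader", None)  leader unknown, wait and retry
    """
    if current_leader is None:
        return ("need_wait_for_leader", None)
    if current_leader == my_id:
        return ("ok", None)
    return ("not_a_leader", current_leader)
```

For example, `begin_read("n1", "n2")` yields a redirect to `n2`, while `begin_read("n1", "n1")` tells the caller to continue with read_barrier() locally.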
Tomasz Grabiec	6bffc0d2e0	Merge 'utils/serialized_action: harden shutdown synchronization' from Piotr Szymaniak `serialized_action::join()` is used as a shutdown barrier. After it returns, callers commonly destroy the owning object, and action lambdas often capture that owner by `this`. The previous implementation waited for the internal semaphore once. This handles actions that are already running or triggers already queued before `join()`, because Seastar semaphores serve waiters FIFO. The problematic case is a late `trigger()` after `join()` has started while an older action is still running. Such a trigger can queue behind `join()`, allowing `join()` to return before that late trigger runs. Review also found a separate semaphore bookkeeping bug in `trigger()`. The code manually waited on the semaphore and later signaled it through the caller-visible pending future. If the wait itself completed exceptionally, the signal path could still run and give back a semaphore unit that had never been acquired. Make `join()` a terminal operation for `serialized_action`. Once `join()` starts, new `trigger()` calls fail with `broken_semaphore`. `join()` still waits for work that was accepted before it started, and only then breaks the semaphore so later waiters are rejected. I audited the existing `serialized_action` users. Some callers explicitly remove trigger sources before `join()`, such as audit and topology_coordinator. Others rely on observer destruction or broader shutdown ordering, such as database, compaction_manager, io_throughput_updater, and schema_push. The least locally fenced case is `migration_manager::_group0_barrier`, which is reachable through several external paths, including task status lookup and other services. That makes this better enforced in `serialized_action` itself rather than relying on each caller to prove all trigger entrances are closed. 
This is generic hardening of the shutdown contract, not a fix for a confirmed topology_coordinator-specific reproducer. Also restore acquire/release ownership in `trigger()` by using `with_semaphore()`. This keeps semaphore release tied to successful acquisition while preserving the existing behavior where action completion and action errors are reported through the shared pending future. Refs SCYLLADB-1904 No backport: this is generic shutdown hardening without a confirmed user-visible reproducer. The semaphore bookkeeping fix closes a latent exceptional wait path noticed during review, not a known production failure. Closes scylladb/scylladb#29991 * github.com:scylladb/scylladb: utils/serialized_action: pair semaphore release with acquisition utils/serialized_action: harden join() against late triggers	2026-05-23 00:45:24 +02:00
Łukasz Paszkowski	cf0ad2bde9	tablet_allocator: use chunked_vector in cluster_resize_load to avoid oversized allocations In make_resize_plan(), the tables_need_resize vector in cluster_resize_load accumulates all tables that require a resize decision before the downstream heap-based logic selects the top-N most urgent ones to emit. In clusters with thousands of tables and aggressive tablets-per-shard scaling (e.g., 5000 empty tables with scaling factors of 0.04-0.12), nearly all tables satisfy the merge condition (scaled target < current tablet count), causing the vector to grow to thousands of entries. With ~100 bytes per element, std::vector's doubling strategy triggers contiguous allocations exceeding 256KB, producing seastar oversized allocation warnings. Replace std::vector with utils::chunked_vector in cluster_resize_load for both tables_need_resize and tables_being_resized. chunked_vector caps individual allocations at 128KB, splitting into multiple chunks when needed. For normal workloads (fewer than ~1300 resize candidates), behavior is identical to std::vector: a single contiguous chunk, same performance. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1955 Closes scylladb/scylladb#29946	2026-05-22 16:52:12 +03:00
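The "~1300 resize candidates" figure in the commit above follows from the two numbers it states: a 128KB cap per allocation and roughly 100 bytes per element. The Python sketch below reproduces that arithmetic. The 100-byte element size is the commit's approximation, the helper name is hypothetical, and the real chunked_vector layout may differ in detail.

```python
CHUNK_LIMIT_BYTES = 128 * 1024  # per-allocation cap stated in the commit message
ELEMENT_SIZE_BYTES = 100        # "~100 bytes per element" (approximate)

# Number of elements that fit in one chunk under these assumptions.
elements_per_chunk = CHUNK_LIMIT_BYTES // ELEMENT_SIZE_BYTES


def chunks_needed(n_elements: int) -> int:
    """Chunks required to hold n_elements (ceiling division)."""
    return -(-n_elements // elements_per_chunk)
```

Under these assumptions one chunk holds 1310 elements, so a workload below that size stays in a single contiguous chunk, while the 5000-table scenario from the commit message would need four.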
Yaniv Michael Kaul	acd3115645	sstables: include SSTable filename in Stats metadata error messages When Stats metadata is not available or malformed, include the SSTable filename in the error message to help operators identify which SSTable files need attention during startup failures. Fixes: https://github.com/scylladb/scylla-enterprise/issues/5439 Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com> AI-assisted: yes Backport: no, benign improvement Closes scylladb/scylladb#29950	2026-05-22 16:49:37 +03:00
Łukasz Paszkowski	96a992002c	tasks: fix busy-spin and shutdown hang in tablet_virtual_task::wait() for repair tasks The condition variable predicate for repair tasks unconditionally returned true (introduced in `e5928497ce`), which meant event.wait(pred) never actually suspended: do_until checks the predicate first, and if it's already satisfied, returns immediately without calling the inner wait(). This caused two problems: 1. The while(true) loop busy-spun, polling without blocking between topology changes. 2. During shutdown, event.broken() had no effect because no waiter was registered on the CV. The loop kept spinning, holding the HTTP server's task gate open and preventing http_server::stop() from completing. After ~15 minutes, systemd killed the process with SIGABRT. The fix replaces the synchronous predicate with an async task_finished() helper that dispatches on the task type. Since the repair check is async (for_each_tablet scans every tablet), we cannot use event.wait(Pred). Instead, we register a waiter via event.wait() before running the async check, ensuring no broadcast is missed during the check. event.broken() during shutdown propagates broken_condition_variable to the registered waiter and unblocks the loop promptly. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1532 Closes scylladb/scylladb#29485	2026-05-22 16:47:48 +03:00
Piotr Szymaniak	b9a7a6c25d	utils/serialized_action: pair semaphore release with acquisition The previous manual wait/signal split could signal the semaphore even if wait() completed exceptionally, giving back units that were never acquired. Use with_semaphore() so failed acquisition does not release anything. Bug found by tgrabiec.	2026-05-22 14:19:36 +02:00
Pavel Emelyanov	8cb32a9958	replica: Fix use-after-free in get_sstables_from_object_store The lambdas inside get_sstables_from_object_store captured get_abort_src by reference, but get_abort_src is a by-value function parameter living on the stack frame of get_sstables_from_object_store. Since the outer lambda is moved into seastar::async via get_sstables_from and executed after get_sstables_from_object_store returns, the reference becomes dangling. Fix by capturing get_abort_src by value (copying the std::function) in both lambdas. Found by AddressSanitizer: stack-use-after-return at distributed_loader.cc:243. Fixes SCYLLADB-2172 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#29954	2026-05-22 15:05:21 +03:00
Raphael S. Carvalho	3ba6184462	repair, test: fix split-repair synchronization test timeout in debug mode The test_split_and_incremental_repair_synchronization[True] test was timing out waiting for 'Finalizing resize decision for table' in debug mode. The root cause is a timing race: the incremental_repair_prepare_wait error injection has a hardcoded 60s auto-expiry timeout (wait_for_message(60s)), but split compactions in debug mode take ~58s per SSTable due to -O0 compilation and scheduler starvation (the maintenance_compaction group gets ~10% of wall-clock time). When the injection auto-expires before split finalization, the repair fails, leaving tablets stuck in transition=repair state. This prevents the topology coordinator from finalizing the split, causing the 600s test timeout. Fix both contributing factors: - Increase the injection timeout from 60s to 10min, giving split compactions ample time to complete before the injection auto-expires. The test explicitly messages the injection to release it (line 2200), so the longer timeout is just a safety net. - Reduce data volume from 256 to 64 rows (and repair data from 256 to 64 rows), producing smaller SSTables that split much faster in debug mode. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-2123. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#30004	2026-05-22 15:03:47 +03:00
Patryk Jędrzejczak	082936ce43	Merge 'test: pylib: Convict the node on server_stop()' from Tomasz Grabiec This is about an ungraceful stop, where the node is killed. Test cases typically need to wait for other nodes to notice that the node is down before proceeding. By default, that takes about 20s. This can be reduced via config by lowering the failure detector threshold, but it's not the best solution: - cannot set the threshold too low, or we'll introduce flakiness due to false positives - so it's still slow (a couple of seconds) - developers forget about it and the test still works This patch speeds this up by adding a way to convict the node immediately after stopping the node, controlled by the "convict" parameter. At the end of the series the "convict" parameter is required, and each test decides what it wants. Commits are split into steps: - the series starts with defaulting to convict=False - each test case sets "convict" explicitly, and changes are split into 3 commits depending on whether convict=True is: useless, beneficial, undesirable - finally, the "convict" parameter is made mandatory There is also a dedicated test for natural failure detection (test_natural_failure_detection in test_gossiper.py) to ensure FD coverage is not lost. 
Tested on dev-mode cluster/test_tablets_parallel_decommission.py::test_node_lost_during_decommission_drain: Wall clock time reduced from 41s to 16s No backport: enhancement Closes scylladb/scylladb#28495 * https://github.com/scylladb/scylladb: test: gossiper: Add test for natural failure detection test: pylib: Make convict a required parameter in server_stop() test: Annotate server_stop() calls where conviction is harmful test: Annotate server_stop() calls where conviction is beneficial test: Annotate server_stop() calls where conviction is useless test: pylib: Add convict option to server_stop() api: failure_detector: Introduce convict-node API gms: gossiper: Make convict() public and safe to call from any scheduling group api: Extract validate functions to common header	2026-05-22 13:39:50 +02:00
Piotr Szymaniak	13562110d6	utils/serialized_action: harden join() against late triggers serialized_action::join() is used as a shutdown barrier: callers expect that, once it returns, no action can still be running against the owning object. The previous implementation only waited by acquiring the internal semaphore once. That is sufficient for actions already running or already queued before join(), because semaphore waiters are served FIFO. It did not make join() a terminal operation: if trigger() is called after join() starts while an older action still holds the semaphore, that trigger can queue behind join(). The old join() may then return first, leaving the late trigger pending while the owner is destroyed. Make join() close the serialized_action before waiting. Subsequent trigger() calls fail immediately with broken_semaphore, while join() still waits for in-flight work and drains waiters that were already present. I audited the current users. Some remove trigger sources before join(), and others rely on shutdown or destruction ordering. migration_manager::_group0_barrier has external trigger paths, so the primitive should enforce this shutdown guarantee instead of requiring every caller to prove all entrances are closed. This is generic hardening of the shutdown contract, not a confirmed topology_coordinator-specific reproducer. Refs SCYLLADB-1904	2026-05-22 12:32:18 +02:00
Yaniv Michael Kaul	bb69ae5a02	test: assert ALTER TYPE RENAME rejected on frozen PK UDTs Add assertion that ALTER TYPE RENAME is rejected when the UDT is used as a frozen partition key column. The existing test only covered ALTER TYPE ADD. This closes the coverage gap from dtest udtencoding_test.py::test_udt_change_in_partition_key, enabling its removal. Refs: SCYLLADB-1929 Closes scylladb/scylladb#29840	2026-05-22 12:29:43 +02:00
Marcin Maliszkiewicz	dcff319221	Merge 'cql: request-side custom payload parsing' from Dario Mirovic When a CQL client sends a request with the `CUSTOM_PAYLOAD` flag (`0x04`) set, the frame body starts with a [bytes map] before the message. Scylla never implemented parsing of this map on the request side. This caused it to fail parsing with protocol errors such as `"truncated frame: expected 65546 bytes"`. This was discovered through DataStax Java Driver 4.19.x tests that attach a `request-id` to queries via custom payload. The same issue affects any CQL client that sets the `CUSTOM_PAYLOAD` flag. Fix this by skipping over the custom payload [bytes map] from the frame body before dispatching to opcode-specific handlers. The payload contents are discarded since Scylla has no pluggable `QueryHandler`. Cassandra's default `QueryHandler` also discards them. Fixes SCYLLADB-745 Reported on 2026.2, backport. Closes scylladb/scylladb#30005 * github.com:scylladb/scylladb: cql: fix request-side custom payload parsing test/cqlpy: add tests for request-side custom payload handling	2026-05-22 12:18:26 +02:00
Marcin Maliszkiewicz	18dd281e72	Merge 'test: audit: pin empty-keyspace DDL audit behavior' from Andrzej Jackowski `9646ee05bd` changed behavior of empty keyspace handling and this code path was never tested for CQL audit. Test CREATE/DROP FUNCTION and CREATE/DROP AGGREGATE targeting both an existing keyspace and a nonexistent one to verify both are audited with empty keyspace. No backport, just a missing test case. Closes scylladb/scylladb#29542 * github.com:scylladb/scylladb: test: audit: pin empty-keyspace DDL audit behavior test: audit: restart server when any non-live config key changes test: audit: rename 'needed' to 'target_config' for clarity	2026-05-22 09:42:34 +02:00
Ernest Zaslavsky	b992ead4bb	sstables_loader: hold token_metadata_ptr to prevent use-after-free in tablet_restore_task_impl::run() `tablet_restore_task_impl::run()` iterates `topo.get_datacenter_racks().at(dc) | std::views::keys` in a range-based for loop that contains a `co_await` in its body. The original code obtained `topo` as a raw `const auto&` by dereferencing the temporary `lw_shared_ptr` returned from `get_token_metadata_ptr()`. Because no copy of the `lw_shared_ptr` was kept, the ref count did not stay elevated: const auto& topo = db.get_token_metadata().get_topology(); // ↑ temporary lw_shared_ptr, destroyed immediately When the coroutine suspends at the inner `co_await`, a Raft topology update can run on the reactor, calling `shared_token_metadata::set()` with a new token_metadata. The old token_metadata's refcount then drops to 0, its destructor fires `clear_and_dispose_impl()`, and `topology::clear_gently()` is scheduled as a background fire-and-forget future. That future erases nodes from `_dc_racks` one-by-one with yields between batches. When the coroutine resumes and the range-based for loop advances its iterator, the iterator's hash-node pointer references freed memory: AddressSanitizer: heap-use-after-free sstables_loader.cc:1204:31 in sstables_loader::tablet_restore_task_impl::run() (.resume) Fix: store the `token_metadata_ptr` (`lw_shared_ptr<token_metadata>`) in a local variable on the coroutine frame. Because coroutine locals survive across suspension points, the elevated ref count keeps the token_metadata (and its topology) alive for the duration of the loop, making the existing range iteration safe without any data copying. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-2149 Closes scylladb/scylladb#29995	2026-05-22 01:10:25 +02:00
Tomasz Grabiec	445a8b9a3e	test: gossiper: Add test for natural failure detection Add test_natural_failure_detection which verifies that the failure detector detects a killed node as DOWN without using the convict mechanism. Uses the failure_detector_timeout fixture to keep the FD timeout short (2s in release mode). This ensures that natural failure detection continues to work correctly even as other tests adopt the convict mechanism for speed.	2026-05-21 21:33:24 +02:00
Tomasz Grabiec	fa7e24f5f7	test: pylib: Make convict a required parameter in server_stop() Remove the default value from the convict parameter in ManagerClient.server_stop(), making it required. All call sites have been annotated with explicit convict=True or convict=False in the preceding commits, so this change enforces that every future caller must make a conscious choice about whether to convict the stopped node.	2026-05-21 21:33:24 +02:00
Tomasz Grabiec	9b40cf89fe	test: Annotate server_stop() calls where conviction is harmful Add explicit convict=False to server_stop() calls where convicting the node would break or weaken the test. In test_backoff_when_node_fails_task_rpc, the desired behavior is for the node to not be marked as down immediately: # The purpose of this is to simulate a situation when the gossiper # doesn't mark a dead node as such immediately. In raft tests, conviction could trigger voter reassignment while the test wants to test the scenario with voters being still down. In test_tablet_mv_replica_pairing_during_replace, conviction triggers SCYLLADB-1996 (replace fails with "Failed to add server").	2026-05-21 21:33:19 +02:00
Tomasz Grabiec	92416d850a	test: Annotate server_stop() calls where conviction is beneficial Add explicit convict=True to server_stop() calls where the test needs other nodes to detect the stopped node as DOWN in order to proceed. These are cases before remove_node, replace, or explicit waits for failure detection (server_not_sees_other_server, wait_new_coordinator_elected). Convicting immediately speeds up the test.	2026-05-21 21:31:22 +02:00
Tomasz Grabiec	624fe11178	test: Annotate server_stop() calls where conviction is useless Pass convict=False explicitly to server_stop() calls where conviction provides no benefit because there is no consumer of the failure detection: - single-node clusters (no other node to call the API on) - all nodes being stopped concurrently (no live node remains) - immediate restart (no test logic between stop and start depends on other nodes detecting the stopped node as dead) - node stopped for file manipulation or bootstrap abort - majority killed with no quorum on surviving nodes to react - no test logic depends on other nodes detecting the failure This is a no-op change since the default is already convict=False, but makes the intent explicit for each call site.	2026-05-21 21:13:55 +02:00
Tomasz Grabiec	a19e3f6f64	test: pylib: Add convict option to server_stop() Add support for convicting a node after stopping it non-gracefully. Convicting a node means that all live nodes will immediately mark it as DOWN, bypassing the natural failure detection delay (~20s by default). The convict parameter defaults to False (no conviction). Tests that want fast failure detection after a kill should pass convict=True explicitly. Tested on dev-mode cluster/test_tablets_parallel_decommission.py::test_node_lost_during_decommission_drain: Wall clock time reduced from 41s to 16s (when using convict=True)	2026-05-21 21:13:54 +02:00
Tomasz Grabiec	2ec8f6c9da	api: failure_detector: Introduce convict-node API Add POST /failure_detector/convict/{host} endpoint that forces local failure detection of a given host, marking it as DOWN and dropping connections. Will be used in tests to speed up failure detection after killing a node, avoiding the ~20s natural failure detection delay.	2026-05-21 21:13:54 +02:00
Tomasz Grabiec	35808b4f4e	gms: gossiper: Make convict() public and safe to call from any scheduling group Make gossiper::convict() public so that it can be called from external contexts (e.g. REST API handlers). Add co_await coroutine::switch_to() at entry to ensure it always runs on the gossip scheduling group, regardless of which scheduling group the caller is on. This is needed because convict() accesses gossiper state that must be manipulated on the gossip scheduling group.	2026-05-21 21:13:54 +02:00
Tomasz Grabiec	e57027bb9d	api: Extract validate functions to common header	2026-05-21 21:13:54 +02:00
Dario Mirovic	f9e8518776	cql: fix request-side custom payload parsing When a CQL client sends a request with the CUSTOM_PAYLOAD flag (0x04) set, the frame body starts with a [bytes map] before the message. Scylla never implemented parsing of this map on the request side. This caused it to fail parsing with protocol errors such as "truncated frame: expected 65546 bytes". Fix this by skipping over the custom payload [bytes map] from the frame body before dispatching to opcode-specific handlers. The payload contents are discarded since Scylla has no pluggable QueryHandler. Cassandra's default QueryHandler also discards them. Fixes SCYLLADB-745	2026-05-21 18:36:37 +02:00
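The skip step described in the commit above can be sketched in Python. In the CQL native protocol, a [bytes map] is a [short] entry count followed by that many pairs of a [string] key (a [short] length plus UTF-8 bytes) and a [bytes] value (an [int] length plus data, where a negative length means null). The function name and error handling here are hypothetical, and the actual fix lives in Scylla's C++ transport layer.

```python
import struct


def skip_bytes_map(body: bytes, offset: int = 0) -> int:
    """Return the offset just past a [bytes map] that starts at `offset`.

    The payload contents are not kept, matching the "skip and discard"
    behaviour described in the commit message.
    """
    (count,) = struct.unpack_from(">H", body, offset)   # [short] entry count
    offset += 2
    for _ in range(count):
        (key_len,) = struct.unpack_from(">H", body, offset)  # [string] key
        offset += 2 + key_len
        (value_len,) = struct.unpack_from(">i", body, offset)  # [bytes] value
        offset += 4
        if value_len > 0:  # negative length means null: no data follows
            offset += value_len
    if offset > len(body):
        raise ValueError("truncated custom payload")
    return offset


# Example frame body: a one-entry custom payload followed by the message.
key, value = b"request-id", b"abcd"
payload = (struct.pack(">H", 1)
           + struct.pack(">H", len(key)) + key
           + struct.pack(">i", len(value)) + value)
message = b"QUERY-BODY"
frame_body = payload + message
message_start = skip_bytes_map(frame_body)
```

After the skip, `frame_body[message_start:]` is the message that an opcode-specific handler would parse.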
Dario Mirovic	8e6d2d0631	test/cqlpy: add tests for request-side custom payload handling Add tests that verify Scylla's handling of CQL native protocol requests with the CUSTOM_PAYLOAD flag (0x04) set. Each test asserts the specific parse error that the unfixed server produces. A separate CQL session is used for each test. The protocol error kills the driver connection, and we need to catch it properly. Refs SCYLLADB-745	2026-05-21 18:34:43 +02:00
Avi Kivity	305346a3ec	Merge 'Don't materialize collections into intermediate representations' from Botond Dénes Collections have an age-old problem in ScyllaDB: they had to be unserialized into an intermediate representation for any access or manipulation. The intermediate representation needs effort to produce and also requires additional memory to store. Both can be significant for large collections. This intermediate representation is then either discarded immediately after use, or re-serialized again. This problem was significant enough for us to consider the use of collections as somewhat of an anti-pattern. But our customers keep using it. Alternator is also a heavy user of collections. This PR aims to solve this problem once and for all. The plan is as follows: * Promote direct use of the serialized collection format: - Add accessor methods to `collection_mutation_view` which read from the serialized format directly: `tomb()`, `size()` and `begin()`/`end()`. - Add a `collection_mutation_writer` which provides container semantics for generating a serialized `collection_mutation` directly on the go (`push_back()`). * Replace all usage of `collection_mutation_description`, `collection_mutation_view_description` and friends with use of the new infrastructure. * Drop the old infrastructure, to avoid accidental regressions. Continues the work started by https://github.com/scylladb/scylladb/pull/29033 and takes it to its conclusion. 
To help focus review, here is a summary of the patches:
* [1, 2] preparatory refactoring: drop some unused abstract_type params
* [3, 6] introduce new infrastructure to write and read serialized collections directly; this is the meat of the PR
* [6, -1) replace all usage of old materializing infrastructure with usage of the new one
* [-1] drop old infrastructure

Command:
```
dbuild -it -- build/release/scylla perf-simple-query --collection=16 -c1 -m2G --default-log-level=error
```

| Metric                   | Before  | After   | Change |
|--------------------------|--------:|--------:|--------|
| Throughput (median tps)  | 315,760 | 332,021 | +5.1%  |
| Instructions/op (median) | 53,776  | 48,681  | -9.5%  |
| CPU cycles/op (median)   | 17,365  | 16,471  | -5.1%  |
| Allocations/op           | 85.1    | 82.1    | -3.5%  |

Significant improvement. Throughput is up ~5%, and both instruction count and cycle count are meaningfully reduced.

---

Command:
```
dbuild -it -- build/release/scylla perf-simple-query --collection=16 -c1 -m2G --default-log-level=error --write
```

| Metric                   | Before   | After    | Change |
|--------------------------|---------:|---------:|--------|
| Throughput (median tps)  | 150,823  | 149,678  | -0.8%  |
| Instructions/op (median) | 108,388  | 103,858  | -4.2%  |
| CPU cycles/op (median)   | 34,860   | 35,371   | +1.5%  |
| Allocations/op           | ~105–108 | ~102–103 | -3.0%  |

Mixed, mostly neutral. Throughput is essentially flat (within noise). Instructions/op improved by ~4%, allocations dropped slightly, but cycles/op edged up marginally.

---

Command:
```
dbuild -it -- build/release/scylla perf-alternator --workload write --developer-mode=1 --alternator-port=8000 --alternator-write-isolation=unsafe -c1 -m2G --default-log-level=error
```

| Metric                   | Before  | After   | Change |
|--------------------------|--------:|--------:|--------|
| Throughput (median tps)  | 55,777  | 56,051  | +0.5%  |
| Instructions/op (median) | 246,215 | 246,610 | +0.2%  |
| CPU cycles/op (median)   | 77,641  | 77,020  | -0.8%  |
| Allocations/op           | 340.4   | 335.4   | -1.5%  |

Essentially neutral. All metrics are within noise margins. Slight reduction in allocations and cycles, negligible otherwise.

---

The change has a clear, substantial positive effect on reads (~5% throughput gain, ~9.5% fewer instructions per op). The write and alternator paths are unaffected in practice; changes there are within measurement noise. No regressions are apparent. This is expected: https://github.com/scylladb/scylladb/pull/29033 did the heavy lifting when it comes to the write path, this PR finishes the job, mostly improving reads.

Fixes: #3602

Improvement, no backport.

Closes scylladb/scylladb#29127 * github.com:scylladb/scylladb: mutation/collection_mutation: make collection_mutation::_data private mutation_collection: drop collection_mutation_description and friends test: move away from collection_mutation_description tree: move away from collection_mutation_description test: move away from collection_mutation_view::with_deserialized() tree: move away from collection_mutation_view::with_deserialized() types: fix indendation, left broken by previous commit types: move away from collection_mutation_view::with_deserialized() types: serialize_for_cql(): use throwing_assert() instead of SCYLLA_ASSERT() schema: column_computation: move away from collection_mutation_view::with_deserialized() mutation: move away from collection_mutation_view::with_deserialized() alternator: move away from collection_mutation_view::with_deserialized() cdc: move away from collection_mutation_view::with_deserialized() mutation/collection_mutation: printer: don't deserialize collections mutation/collection_mutation: difference(): don't deserialize collections mutation/collection_mutation: merge(): don't deserialize collections mutation/collection_mutation: extract compact_and_expire() to free function mutation/collection_mutation: refactor empty(), is_any_live() and last_update() compaction_garbage_collector: pass collection_mutation to collect() test/boost/mutation_test: add tests for collection_mutation_{view,writer} mutation/collaction_mutation: collection_mutation_view: add methods to inspect content mutation/collection_mutation: add collection_mutation_writer mutation/collection_mutation: collection_mutation(): generate valid collection mutation/collection_mutation: collection_mutation(): remove unused abstract_type param mutation/atomic_cell: drop unused type param from from_bytes()	2026-05-21 17:10:40 +03:00
Patryk Jędrzejczak	1ed3f5c4af	Merge 'storage_service: cancel write handlers during drain to prevent shutdown deadlock' from Petr Gusev Fixes a shutdown deadlock where a node hangs because `stale_versions_in_use()` blocks on stale `token_metadata` versions held by write handlers whose `MUTATION_DONE` responses can never arrive (transport is already stopped). Two manifestations depending on whether the shutting-down node is the topology coordinator: - Coordinator: do_drain → wait_for_group0_stop deadlocks because the topology coordinator fiber is stuck in barrier_and_drain → stale_versions_in_use(). - Non-coordinator: ss::stop → uninit_messaging_service deadlocks because the barrier_and_drain RPC handler holds the gate open. The non-coordinator case was fixed in PR #24714 (cancel all write requests on storage_proxy shutdown), but its test never actually failed — the write handler always captured the current token_metadata version because `pause_before_barrier_and_drain` used `one_shot=True,` so only the first `barrier_and_drain` was paused. The topology state hadn't advanced by that point, meaning the write handler's ERM version matched the current version and `stale_versions_in_use()` returned immediately. The coordinator case was not covered at all. Cancel all write response handlers on all shards right after `stop_transport()` in `do_drain()`. This releases their ERMs and the associated stale token_metadata versions, unblocking `stale_versions_in_use()`. Fixed the test to ensure the write handler holds a stale version: use one_shot=False, let the first barrier_and_drain through (version still current), then wait for the second one (version now stale). Extended to cover both coordinator and non-coordinator shutdown on the same 2-node cluster. 
Also includes supporting changes: - error_injection: release wait_for_message waiters on disable() so the test can atomically unblock paused handlers - error_injection: add non-shared mode to wait_for_message for per-invocation message semantics - scylla_cluster.py: allow stop() to bypass start_stop_lock so SIGKILL works while stop_gracefully is blocked Fixes: SCYLLADB-1842 Refs: scylladb/scylladb#23665 backports: SCYLLADB-1842 reported a failure in 2025.1, so we need to backport to all versions starting from 2025.1 Closes scylladb/scylladb#29882 * https://github.com/scylladb/scylladb: storage_service: cancel write handlers during drain to prevent shutdown deadlock test_unfinished_writes_during_shutdown: extend to cover coordinator shutdown test_unfinished_writes_during_shutdown: fix to reproduce the shutdown deadlock test_unfinished_writes_during_shutdown: await add_last_node_task instead of cancelling it test_unfinished_writes_during_shutdown: add timeout and deadlock detection for shutdown_task test: scylla_cluster: allow stop() to bypass start_stop_lock error_injection: add non-shared mode to wait_for_message error_injection: release waiters when injection is disabled	2026-05-21 15:43:36 +02:00
Piotr Dulikowski	6148316f66	Merge 'db/view/view_building_coordinator: add flag to mark if any remote work was finished' from Michał Jadwiszczak There is a small window, just after the view building coordinator releases the group0 guard and before it waits on view_building_state_machine's CV, in which the coordinator may miss a CV broadcast triggered by finished remote work. To fix it, this patch adds a boolean flag, which is set to true before broadcasting the CV and is checked before awaiting on the CV. Fixes SCYLLADB-2029 The problem is not critical, but it should be backported to 2025.4 and newer versions, all of which contain the view building coordinator. Closes scylladb/scylladb#27313 * github.com:scylladb/scylladb: test/cluster/test_view_building_coordinator: add reproducer db/view/view_building_coordinator: add flag to mark if any remote work was finished	2026-05-21 15:11:58 +02:00
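The missed-wakeup fix in the commit above follows a common pattern: pair the condition variable with a flag, so a notification sent before the waiter starts waiting is not lost. A minimal sketch in Python's asyncio, standing in for the Seastar primitives the actual patch uses; the class and method names are hypothetical.

```python
import asyncio


class WorkNotifier:
    """Condition variable paired with a flag, so early notifications are not lost."""

    def __init__(self) -> None:
        self._cond = asyncio.Condition()
        self._work_finished = False

    async def notify_finished(self) -> None:
        async with self._cond:
            # Set the flag before broadcasting, as the commit describes.
            self._work_finished = True
            self._cond.notify_all()

    async def wait_for_work(self) -> None:
        async with self._cond:
            # Check the flag before awaiting: if work already finished while
            # nobody was waiting, return at once instead of blocking.
            while not self._work_finished:
                await self._cond.wait()
            self._work_finished = False
```

Without the flag, a `notify_all()` issued in the gap before `wait()` would wake nobody, and the waiter would block until the next unrelated broadcast.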
Michael Litvak	47d90da867	pgo: enable train with tablets for SI and LWT PGO training for secondary indexes and LWT was configured with tablets disabled because that combination wasn't supported at the time. This is no longer the case, so we should remove the restrictions and enable the training with the default mode.	2026-05-21 15:01:30 +02:00
Michael Litvak	31d3e20cd6	pgo: make training cluster RF-rack-valid The pgo training script creates a 3-node cluster, all in a single rack. However, some of the workloads create a keyspace with RF=3. This is not allowed in some cases, for example materialized views with tablets require the cluster to be RF-rack-valid, so it must have at least 3 different racks. Change the cluster to be RF-rack-valid by configuring each node in a different rack using the rackdc properties file. Instead of using a shared config directory, we define a separate home directory for each node, copy the config files into it, and write the separate rackdc file for each node.	2026-05-21 15:01:30 +02:00
Pavel Emelyanov	4b13b24695	Merge 's3: make S3 connection pool size configurable per scheduling group' from Ernest Zaslavsky The S3 client creates a separate HTTP connection pool per scheduling group. Previously, the pool size was hardcoded as shares/100, yielding 1-10 connections. This was not tunable and could under-provision connections for groups with low share counts. Changes - A missing include (short_streams.hh) in sstables_loader.cc is added first to fix CMake builds where the header is not transitively included. - The hardcoded per-share divisor is replaced with a per-shard connection budget. The new `object_storage_connections_per_shard` config option (default 128) specifies the total number of connections available on each shard. Connections are distributed proportionally across scheduling groups based on their shares: `max_connections = budget * group_shares / total_shares`. Remainder connections are assigned to the group with the most shares. When a new scheduling group client is created, all existing groups are rebalanced via `set_maximum_connections`. Creation and rebalance are serialized with a semaphore to prevent concurrent rebalances from racing. - The config option is made live-updateable: a `storage_manager` observer propagates changes to all existing S3 clients, triggering rebalance under the same semaphore. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1704 No backport needed since this change affects KS on object storage which is not operational yet. Closes scylladb/scylladb#29719 * github.com:scylladb/scylladb: s3: make connections_per_shard live-updateable s3: distribute connection pool proportionally across scheduling groups	2026-05-21 12:12:36 +03:00
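The proportional split quoted in the commit above (`max_connections = budget * group_shares / total_shares`, with the remainder going to the group with the most shares) can be worked through in a short sketch. It assumes integer (floor) division and does not model any minimum per-group allocation; the function name and the group names in the usage note are hypothetical.

```python
from typing import Dict


def distribute_connections(budget: int, shares: Dict[str, int]) -> Dict[str, int]:
    """Split a per-shard connection budget across groups in proportion to shares."""
    total_shares = sum(shares.values())
    if total_shares <= 0:
        raise ValueError("total shares must be positive")
    # Floor of budget * group_shares / total_shares for every group.
    allocation = {group: budget * s // total_shares for group, s in shares.items()}
    # Flooring leaves some connections unassigned; give them to the group
    # holding the most shares.
    remainder = budget - sum(allocation.values())
    largest = max(shares, key=lambda group: shares[group])
    allocation[largest] += remainder
    return allocation
```

With the default budget of 128 and shares of 1000, 200 and 100, the floored split is 98, 19 and 9 connections (126 in total), and the remaining 2 go to the 1000-share group, for a final split of 100, 19 and 9.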
Andrzej Jackowski	f8156702de	tree: add missing -present to copyright headers ~2076 files used "Copyright (C) YYYY-present ScyllaDB" while ~88 files used "Copyright (C) YYYY ScyllaDB". This inconsistency leads to unnecessary code review discussions and gradual spread of the less common format. Standardize all ScyllaDB copyright headers to use -present. Fixes SCYLLADB-1984 Closes scylladb/scylladb#29876	2026-05-21 10:57:42 +02:00
Wojciech Mitros	13c043903d	strong_consistency: cache leader location for non-replica nodes When a non-replica node handles a strongly consistent write, it must forward the request to a replica. If the closest replica is not the leader, the request gets redirected again, causing an extra roundtrip. Add a leader location cache in groups_manager, keyed by raft group_id. After a write request is forwarded, the CQL transport layer records the final node as the leader in the cache. Subsequent write requests from the same node for the same group are forwarded directly to the cached leader, eliminating the extra roundtrip. The cache is only used for writes. Reads can be served by any replica, so they skip the cache and use proximity-based routing instead. Cache entries are validated at use time: if the cached leader is no longer a replica (e.g. after tablet migration), the entry is evicted and the normal closest-replica path is taken. This prevents a scenario where two nodes keep redirecting to each other because both think that the other is the leader but actually both are non-replicas - such loop is broken as soon as the tablet maps are updated. On token_metadata updates, entries for groups that no longer exist (e.g. table dropped, tablet merged) are evicted. Entries for groups that still exist are kept — use-time validation handles staleness. An on_node_resolved callback is propagated through the redirect/bounce path so the transport layer can update the cache generically without coupling to the strong-consistency coordinator. The coordinator creates the callback only for writes (capturing the groups_manager and group_id) and attaches it to the bounce message; the transport layer invokes it once the final node is known, keeping the forwarding infrastructure subsystem-agnostic. We also add a test which verifies that after the initial redirect, following requests to the same node avoid the extra redirect and forward directly to the leader. 
Fixes: SCYLLADB-1064 Closes scylladb/scylladb#29392	2026-05-21 10:32:56 +02:00
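The cache behaviour described in the commit above (record the leader after a forward, validate at use time, evict entries for groups that no longer exist) can be sketched as follows. This is an illustrative model only, with hypothetical names, not the `groups_manager` code.

```python
from typing import Dict, Iterable, Optional, Set


class LeaderCache:
    """Leader location cache keyed by raft group id (illustrative only)."""

    def __init__(self) -> None:
        self._leaders: Dict[str, str] = {}

    def record(self, group_id: str, node: str) -> None:
        # Called once the final node of a forwarded write is known.
        self._leaders[group_id] = node

    def lookup(self, group_id: str, replicas: Set[str]) -> Optional[str]:
        # Use-time validation: a cached leader that is no longer a replica
        # is evicted, and the caller falls back to the closest replica.
        leader = self._leaders.get(group_id)
        if leader is None:
            return None
        if leader not in replicas:
            del self._leaders[group_id]
            return None
        return leader

    def retain_groups(self, live_groups: Iterable[str]) -> None:
        # On a topology update, drop entries for groups that no longer exist.
        live = set(live_groups)
        for group_id in list(self._leaders):
            if group_id not in live:
                del self._leaders[group_id]
```

Validating on lookup rather than on every topology change keeps the update path cheap, at the cost of one extra redirect for a stale entry.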
Gleb Natapov	cc034f84c5	schema: ensure committed_by_group0 is set for all non-system tables on boot Tables created before the GROUP0_SCHEMA_VERSIONING feature was enabled have committed_by_group0 = null in system_schema.scylla_tables. This causes maybe_delete_schema_version() to delete their version cell, forcing the legacy hash-based schema version computation path. Add ensure_committed_by_group0() which runs on boot and fixes up any non-system tables where committed_by_group0 is not true (null or false): 1. Queries system_schema.scylla_tables for rows where committed_by_group0 is null or false, skipping system keyspaces (system, system_schema). 2. Takes a group0 guard 3. Re-checks after the raft barrier in case another node already fixed it. 4. For each table needing fixup, creates a mutation writing the version cell (from the in-memory schema). The committed_by_group0 = true flag is stamped by add_committed_by_group0_flag() inside announce(). 5. Announces via raft group0. 6. Retries with a small random delay on group0_concurrent_modification. On other nodes, schema_applier will detect these as "altered" tables (scylla_tables mutation changed), but since the actual table definition is unchanged, update_column_family is effectively a no-op. This is a prerequisite for eventually removing the legacy hash-based schema versioning code path. Closes scylladb/scylladb#29911	2026-05-21 10:22:07 +02:00
Patryk Jędrzejczak	cbadc3d675	test: fix flaky test_raft_snapshot_truncation by waiting for async log truncation Snapshot creation and raft log truncation happen asynchronously in the IO fiber after a schema change completes. The test was querying system.raft immediately after the schema change returned, racing with the IO fiber's store_snapshot_descriptor call. Replace immediate assertions with wait_for polling loops: - log_size == 0: wait for log truncation after drop keyspace - new_snap_id != original_snap_id: wait for new snapshot to be persisted Fixes: SCYLLADB-2120 Closes scylladb/scylladb#29967	2026-05-21 10:50:00 +03:00
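The change in the commit above replaces an immediate assertion with a polling loop. A minimal, self-contained sketch of such a loop follows; `poll_until` and its parameters are hypothetical and are not the helper the test suite uses.

```python
import asyncio
import time
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


async def poll_until(
    pred: Callable[[], Awaitable[Optional[T]]],
    timeout: float = 10.0,
    period: float = 0.1,
) -> T:
    """Call `pred` repeatedly until it returns a non-None value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = await pred()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met before the deadline")
        await asyncio.sleep(period)
```

A test would pass a predicate that queries the raft log size and returns a value only once the size is 0, so the assertion no longer races with the asynchronous truncation.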


54082 Commits