scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-09 08:23:29 +00:00

Author	SHA1	Message	Date
Benny Halevy	d9f721baf3	cql3/statements/describe_statement: use chunked_vector to prevent oversized allocations Running the 5000 tables scenario using tablets, the following Scylla warnings appeared: ``` 2026-02-23T23:18:31.903 schema-scale-tablets-5000t-2026-1-db-node-77930459-4 !WARNING | scylla[5208] [shard 1:sl:d] seastar_memory - oversized allocation: 655360 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at 0x320cf9f 0x320cba0 0x1826a28 0x2fb8f97 0x180340e 0x447855e 0x4461c5a 0x161c3c6 0x161c4b3 0x161e9b7 0x551f43c 0x54df6ca /opt/scylladb/libreloc/libc.so.6+0x72463 /opt/scylladb/libreloc/libc.so.6+0xf55ab seastar::current_backtrace_tasklocal() at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:85 seastar::current_tasktrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:136 seastar::current_backtrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:169 seastar::memory::cpu_pages::warn_large_allocation(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:865 seastar::memory::allocate_slowpath(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:928 cql3::statements::(anonymous namespace)::tables(data_dictionary::database const&, seastar::lw_shared_ptr<data_dictionary::keyspace_metadata> const&, std::optional<bool>) [clone .resume] at ././seastar/src/core/memory.cc:1727 std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<std::vector<cql3::description, std::allocator<cql3::description> > >::promise_type>::resume() const at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/coroutine:247 ``` This patch replaces the use of `std::vector<description>` with `utils::chunked_vector` to prevent the large allocation.
Fixes: SCYLLADB-2388 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#30146 (cherry picked from commit `d4d43213f6`) Closes scylladb/scylladb#30218	2026-06-04 15:03:02 +03:00
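The idea behind a chunked container can be sketched in a few lines of Python. This is only an illustration of the technique, not the `utils::chunked_vector` implementation; `ChunkedList`, `chunk_size` and `largest_chunk` are hypothetical names chosen for the example.

```python
class ChunkedList:
    """Append-only sequence stored as fixed-size chunks (illustrative only)."""

    def __init__(self, chunk_size=4):
        if chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        self.chunk_size = chunk_size
        self.chunks = []  # list of lists, each at most chunk_size long
        self.size = 0

    def append(self, item):
        # Open a new chunk when there is none yet or the last one is full,
        # so no single backing list ever grows past chunk_size elements.
        if not self.chunks or len(self.chunks[-1]) == self.chunk_size:
            self.chunks.append([])
        self.chunks[-1].append(item)
        self.size += 1

    def __len__(self):
        return self.size

    def __getitem__(self, index):
        if not 0 <= index < self.size:
            raise IndexError(index)
        chunk, offset = divmod(index, self.chunk_size)
        return self.chunks[chunk][offset]

    def largest_chunk(self):
        # Size of the biggest single backing list.
        return max((len(c) for c in self.chunks), default=0)
```

In this sketch the element count can grow freely while the largest single backing list stays bounded by `chunk_size`.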
Nadav Har'El	452c246476	Merge 'cql3: fix stack overflow and quadratic behavior' from Avi Kivity This series fixes two vulnerabilities: (1) unbounded recursion during expression evaluation with deeply nested expressions; (2) quadratic computation with large WHERE clauses. The fixes simply bound the depth of recursion and the length of the WHERE clause. The WHERE clause limits are configurable. Nesting limits are less likely to be exceeded, so they are not configurable. Limits are inspired by Common Expression Language: https://github.com/google/cel-spec/blob/master/doc/langdef.md#syntax Implementations are required to support at least: 24-32 repetitions of repeating rules; 12 repetitions of recursive rules. CVE-2026-31948 CVE-2026-31947 Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1003 Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1002 Fixes https://github.com/scylladb/scylladb/issues/14472 Closes scylladb/scylladb-ghsa-m4h7-g37h-mgxf#3 * github.com:scylladb/scylladb-ghsa-m4h7-g37h-mgxf: cql3: limit number of relations in WHERE clause cql3: add max_relations_in_where_clause to dialect test/cqlpy: add tests for WHERE clause relation count limit cql3: limit nesting depth of function calls and CASTs in CQL parser test/cqlpy: add tests for deeply nested function calls and CASTs (cherry picked from commit `75a05fc2b3`)	2026-06-02 11:52:17 +03:00
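A minimal sketch of the two kinds of bounds described in the series above, written in Python rather than the C++ of the actual parser. The expression representation (nested lists), the limit values and the names `check_nesting`, `check_relations` and `LimitExceeded` are all assumptions made for illustration; the real limits are not stated in the commit message.

```python
MAX_NESTING_DEPTH = 12  # hypothetical value, echoing the CEL minimum for recursive rules
MAX_RELATIONS = 32      # hypothetical value; the message says this limit is configurable


class LimitExceeded(ValueError):
    pass


def check_nesting(expr, max_depth=MAX_NESTING_DEPTH):
    """Reject expressions nested deeper than max_depth.

    An expression is either a leaf (any non-list value) or a list of the
    form [name, arg1, arg2, ...]. The walk uses an explicit stack, so the
    check itself cannot overflow the call stack on hostile input.
    """
    stack = [(expr, 1)]
    while stack:
        node, depth = stack.pop()
        if not isinstance(node, list):
            continue  # leaves do not add nesting
        if depth > max_depth:
            raise LimitExceeded(f"nesting depth exceeds {max_depth}")
        for arg in node[1:]:
            stack.append((arg, depth + 1))


def check_relations(relations, max_relations=MAX_RELATIONS):
    """Reject WHERE clauses with more relations than max_relations."""
    if len(relations) > max_relations:
        raise LimitExceeded(f"more than {max_relations} relations in WHERE clause")
```

Both checks run before any evaluation, so an oversized input is rejected with a bounded amount of work.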
Raphael S. Carvalho	908d3fb53a	compaction: fail resharding when out of space prevention is activated When out-of-space prevention is activated, the compaction manager is drained and disabled. This caused resharding to silently succeed without actually processing any SSTables, because: 1. run_custom_job() calls start_compaction() which returns nullopt when is_disabled() is true, and run_custom_job() would just return immediately — appearing as a successful no-op. 2. reshard() used throw_if_stopping::no, so even within the compaction task executor, stopping would be silently swallowed rather than propagated as an exception. The SSTable loader interprets a successful return from resharding as "all SSTables processed", so it proceeds without error, leaving the unprocessed SSTables orphaned and their data missing from the table. Fix this with two changes: - run_custom_job(): when start_compaction() returns nullopt, check is_disabled() and throw via make_disabled_exception() rather than returning silently. This ensures callers are always informed when a job was skipped because compaction is disabled (e.g. due to disk space pressure), as opposed to a benign skip (e.g. table removed). - reshard(): change throw_if_stopping::no to throw_if_stopping::yes. Resharding is mandatory for correct SSTable loading — unlike reshape which is optional and can be safely skipped, resharding failure must be propagated to the caller so the loader does not proceed with incomplete data. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-2085. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#30041 (cherry picked from commit `ea3615de1e`) Closes scylladb/scylladb#30155	2026-06-01 14:23:39 +03:00
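The first change described above (distinguishing a disabled manager from a benign skip) can be sketched as follows. This is a Python illustration under stated assumptions, not the C++ code: `CompactionManager`, `CompactionDisabledError` and the use of `None` for a removed table are hypothetical stand-ins for `run_custom_job()`, `make_disabled_exception()` and the real skip conditions.

```python
class CompactionDisabledError(RuntimeError):
    """Raised when a job is skipped because compaction is disabled."""


class CompactionManager:
    def __init__(self):
        self.disabled = False

    def start_compaction(self, table):
        # Returns None when the job cannot start: either the manager is
        # disabled or the table no longer exists (modelled here as None).
        if self.disabled or table is None:
            return None
        return object()  # opaque handle for a running job

    def run_custom_job(self, table, job):
        handle = self.start_compaction(table)
        if handle is None:
            if self.disabled:
                # Do not pretend the job succeeded: tell the caller.
                raise CompactionDisabledError("compaction is disabled")
            return  # benign skip, e.g. the table was removed
        job(table)
```

A caller such as a loader can then treat a normal return as "work done or nothing to do" and an exception as "work was not done".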
Avi Kivity	36d3c3dad9	Merge '[Backport 2026.2] tablets: fix update_tablet_metadata failures during bootstrap' from Scylladb[bot] When partition_split_builder splits a tablet metadata partition into multiple mutations, the first mutation gets the partition tombstone and/or static row while subsequent mutations contain only clustered rows. The hint logic would correctly clear tokens (marking a full partition read) upon seeing the tombstone in the first mutation, but then re-add tokens when processing the subsequent row-only mutations. This caused update_tablet_metadata to attempt a point update via mutate_tablet_map_async on a tablet map that doesn't exist yet during bootstrap, throwing no_such_tablet_map and failing the snapshot transfer. Fix by adding a full_read flag to table_hint. Once a full partition read is decided (due to partition tombstone, range tombstone, static row, or row deletion), the flag prevents subsequent mutations for the same table from re-adding tokens. Additionally, fall back to a full partition read when the tablet map is missing locally, which happens when the joining node receives tablet metadata for a table it has never seen before. Fixes: SCYLLADB-2333. Needs backports to 2026.1+. 2026.1 introduces the regression with `b17a36c071` - (cherry picked from commit `d6c1707a04`) - (cherry picked from commit `491db28fbf`) Parent PR: #30115 Closes scylladb/scylladb#30154 * github.com:scylladb/scylladb: tablets: fall back to full partition read when tablet map is missing tablets: fix hint re-adding tokens after full partition read decision	2026-06-01 14:12:00 +03:00
Botond Dénes	93a9fb2347	mutation_fragment_stream_validator: use legacy byte order for same-token partition key comparison When two partition keys share the same token, their relative order is determined by their raw serialized bytes (legacy_tri_compare), which matches the physical on-disk order in SSTables. The validator was using partition_key::tri_compare instead — a type-aware comparator that can disagree with byte order for types like timeuuid. The result was a false-positive "out-of-order partition key" error for any two same-token partitions whose timeuuid (or other type-aware) order is the reverse of their byte order. In scrub mode this caused the second partition to be silently dropped. Fixes: SCYLLADB-2338 Closes scylladb/scylladb#30120 (cherry picked from commit `46631692cd`) Closes scylladb/scylladb#30156	2026-06-01 14:11:26 +03:00
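The disagreement between a time-aware order and raw byte order for version-1 UUIDs can be reproduced with Python's standard `uuid` module. This is only an illustration of why the two comparators can differ; it does not reproduce Scylla's `tri_compare` or `legacy_tri_compare`, and `make_timeuuid` is a hypothetical helper.

```python
import uuid


def make_timeuuid(timestamp, node=1):
    """Build a version-1 UUID from a 60-bit timestamp (illustrative helper)."""
    time_low = timestamp & 0xFFFFFFFF
    time_mid = (timestamp >> 32) & 0xFFFF
    time_hi_version = ((timestamp >> 48) & 0x0FFF) | 0x1000  # version 1
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version, 0x80, 0x00, node))


def tri_compare(a, b):
    """Three-way comparison returning -1, 0 or 1."""
    return (a > b) - (a < b)


earlier = make_timeuuid(0x0FFFFFFFF)
later = make_timeuuid(0x100000000)

# The low 32 bits of the timestamp are serialized first, so the UUID that is
# earlier in time starts with ff ff ff ff and sorts after the later one.
time_order = tri_compare(earlier.time, later.time)
byte_order = tri_compare(earlier.bytes, later.bytes)
```

Here `time_order` is -1 while `byte_order` is 1, which is the kind of reversal the commit message describes for same-token partitions.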
Jenkins Promoter	00a066028a	Update pgo profiles - aarch64	2026-06-01 04:38:44 +03:00
Aleksandra Martyniuk	2f73e808ed	tablets: fall back to full partition read when tablet map is missing When update_tablet_metadata receives a hint with non-empty tokens for a table whose tablet map doesn't exist locally yet, it would call mutate_tablet_map_async which throws no_such_tablet_map. This happens during bootstrap when the joining node receives tablet metadata for a table it has never seen before. Fix by checking has_tablet_map() before attempting the point update. If the map is missing, fall back to do_update_tablet_metadata_partition which reads the full partition from system.tablets and creates the map. (cherry picked from commit `491db28fbf`)	2026-05-29 15:32:00 +00:00
Aleksandra Martyniuk	101237d7cf	tablets: fix hint re-adding tokens after full partition read decision When partition_split_builder splits a tablet metadata partition into multiple mutations, the first mutation gets the partition tombstone and/or static row while subsequent mutations contain only clustered rows. The tablet metadata change hint logic would correctly clear tokens (marking a full partition read) upon seeing the tombstone in the first mutation, but then re-add tokens when processing the subsequent row-only mutations. This caused update_tablet_metadata to attempt a point update via mutate_tablet_map_async on a tablet map that doesn't exist yet during bootstrap, throwing no_such_tablet_map and failing the snapshot transfer. Fix by adding a full_read flag to table_hint. Once a full partition read is decided (due to partition tombstone, range tombstone, static row, or row deletion), the flag prevents subsequent mutations for the same table from re-adding tokens. (cherry picked from commit `d6c1707a04`)	2026-05-29 15:32:00 +00:00
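The sticky `full_read` decision can be sketched like this. The Python below is a simplified model, not the actual hint code: `TableHint` mirrors the `table_hint` name from the message, while `update_hint` and the dictionary shape of a mutation are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class TableHint:
    tokens: set = field(default_factory=set)
    full_read: bool = False


def update_hint(hint, mutation):
    """Fold one mutation into the hint.

    `mutation` is a dict with two keys (hypothetical shape):
      needs_full_read: True for a partition tombstone, range tombstone,
                       static row or row deletion
      tokens:          tokens touched by clustered rows
    """
    if hint.full_read:
        return  # decision already made; never re-add tokens
    if mutation["needs_full_read"]:
        hint.tokens.clear()
        hint.full_read = True
        return
    hint.tokens.update(mutation["tokens"])
```

Without the early return, the second (row-only) mutation of a split partition would repopulate `tokens` and trigger a point update again.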
Botond Dénes	e94d305856	Merge '[Backport 2026.2] index: fix local vector index locality detection after schema reload' from Scylladb[bot] After schema reload, `target_parser::is_local()` did not recognize the vector-index local target format `{"pk": [...], "tc": "..."}`, causing local vector indexes to be treated as global. This broke duplicate detection when both a global and a local vector index existed on the same column. Fix by introducing `vector_index::is_local()` and dispatching to it from `create_index_from_index_row()` based on the index class. Also adds tests for local/global vector index coexistence. Fixes: SCYLLADB-2315 backport reasoning: we added local vector index support in 2026.1 - (cherry picked from commit `cf372ba87b`) - (cherry picked from commit `119ef942f8`) Parent PR: #29492 Closes scylladb/scylladb#30123 * github.com:scylladb/scylladb: test/cqlpy: add tests for global and local vector index coexistence index: fix local vector index locality detection after schema reload	2026-05-29 14:05:47 +03:00
Nikos Dragazis	ef2d2ef5c3	test: Order task-wait before finalization in test_migration_wait_task The purpose of this test is to verify that the task manager's "wait" API works correctly for vnodes-to-tablets migration virtual tasks. It starts a `wait_task` HTTP request concurrently with a finalize (or rollback) operation, and asserts that the wait returns the correct final state ("done" or "suspended"). The test uses `asyncio.create_task()` to wrap the wait request into a task, and then immediately calls finalize. With asyncio's lazy task scheduling, the wait coroutine does not start until the event loop yields, so the finalization request reaches the server before wait, and therefore may also complete before it. Once finalization completes, the virtual migration task is no longer discoverable, causing a "task not found" error. Add a log message in Scylla's wait handler and a synchronization point in the test to ensure that the wait request lands on the server before finalization. This follows the same pattern used in `test_tablet_tasks.py::check_and_abort_repair_task`. Fixes SCYLLADB-2077 Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com> Closes scylladb/scylladb#29973 (cherry picked from commit `54cb6d4608`) Closes scylladb/scylladb#30095	2026-05-29 14:05:06 +03:00
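The lazy scheduling behaviour behind this race can be shown with plain asyncio. The sketch below does not use the test suite's helpers; `waiter`, `events` and the `started` event are hypothetical stand-ins for the wait request, the observed ordering and the synchronization point.

```python
import asyncio


async def main():
    events = []
    started = asyncio.Event()  # stand-in for "the wait request reached the server"

    async def waiter():
        events.append("wait started")
        started.set()

    task = asyncio.create_task(waiter())
    # The task is only scheduled here; its body has not run yet because
    # main() has not yielded to the event loop.
    events.append("after create_task")

    # Synchronization point: do not proceed until the waiter really started.
    await started.wait()
    events.append("finalize")

    await task
    return events
```

Running `asyncio.run(main())` yields the order "after create_task", "wait started", "finalize"; without the `await started.wait()` line, the finalize step would be recorded before the waiter had run at all.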
Anna Stuchlik	d7802fd1de	doc: remove broken References section from sstables-3-index Remove the References section containing broken links to Cassandra source files that no longer exist. Fixes https://github.com/scylladb/scylladb/issues/30080 Closes scylladb/scylladb#30081 (cherry picked from commit `bd089ebcaa`) Closes scylladb/scylladb#30089	2026-05-29 14:04:10 +03:00
Michael Litvak	481f9070a7	logstor: disable logstor compaction in table truncate In database::truncate_table_on_all_shards, disable logstor compaction before the table data is truncated, similarly to how non-logstor compaction is disabled, to avoid race conditions between logstor compaction and segment discarding. Fixes SCYLLADB-2186 (cherry picked from commit `73470150a0`) Closes scylladb/scylladb#30059	2026-05-29 14:03:26 +03:00
Raphael S. Carvalho	65c31c2791	repair, test: fix split-repair synchronization test timeout in debug mode The test_split_and_incremental_repair_synchronization[True] test was timing out waiting for 'Finalizing resize decision for table' in debug mode. The root cause is a timing race: the incremental_repair_prepare_wait error injection has a hardcoded 60s auto-expiry timeout (wait_for_message(60s)), but split compactions in debug mode take ~58s per SSTable due to -O0 compilation and scheduler starvation (the maintenance_compaction group gets ~10% of wall-clock time). When the injection auto-expires before split finalization, the repair fails, leaving tablets stuck in transition=repair state. This prevents the topology coordinator from finalizing the split, causing the 600s test timeout. Fix both contributing factors: - Increase the injection timeout from 60s to 10min, giving split compactions ample time to complete before the injection auto-expires. The test explicitly messages the injection to release it (line 2200), so the longer timeout is just a safety net. - Reduce data volume from 256 to 64 rows (and repair data from 256 to 64 rows), producing smaller SSTables that split much faster in debug mode. Fixes: SCYLLADB-2178. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#30004 (cherry picked from commit `3ba6184462`) Closes scylladb/scylladb#30035	2026-05-29 14:02:37 +03:00
Yehuda Lebi	d4bbc144a4	fix: raise scylla-helper.slice CPUWeight from 10 to 100 to prevent node_exporter CPU starvation Closes scylladb/scylladb#29839 (cherry picked from commit `6307e17795`) Closes scylladb/scylladb#29949	2026-05-29 13:55:49 +03:00
Botond Dénes	b617cd8c02	Merge '[Backport 2026.2] alternator: Graduate Alternator Streams from experimental' from Scylladb[bot] As a final step for https://scylladb.atlassian.net/browse/SCYLLADB-461 we need to graduate Alternator Streams from experimental. So let's remove `--experimental-features=alternator-streams` and map the obsolete config string to `UNUSED` for backward compatibility. Also, remove the related gating of the feature. Finally, stop providing the config flag in test configs. Fixes SCYLLADB-1680 Fixes #16367 The documentation work tracked by https://scylladb.atlassian.net/browse/SCYLLADB-462 still remains. This PR needs to hit 2026.2, so (only) if it branches before the PR is merged to `master`, we'd need to backport. - (cherry picked from commit `870013b437`) - (cherry picked from commit `9a86044c63`) Parent PR: #29604 Closes scylladb/scylladb#29817 * github.com:scylladb/scylladb: test: Stop providing alternator-streams experimental flag alternator: Graduate Alternator Streams from experimental	2026-05-29 13:54:54 +03:00
Patryk Jędrzejczak	6fa12c6419	Merge '[Backport 2026.2] storage_service: cancel write handlers during drain to prevent shutdown deadlock' from Scylladb[bot] Fixes a shutdown deadlock where a node hangs because `stale_versions_in_use()` blocks on stale `token_metadata` versions held by write handlers whose `MUTATION_DONE` responses can never arrive (transport is already stopped). Two manifestations depending on whether the shutting-down node is the topology coordinator: - Coordinator: do_drain → wait_for_group0_stop deadlocks because the topology coordinator fiber is stuck in barrier_and_drain → stale_versions_in_use(). - Non-coordinator: ss::stop → uninit_messaging_service deadlocks because the barrier_and_drain RPC handler holds the gate open. The non-coordinator case was fixed in PR #24714 (cancel all write requests on storage_proxy shutdown), but its test never actually failed — the write handler always captured the current token_metadata version because `pause_before_barrier_and_drain` used `one_shot=True,` so only the first `barrier_and_drain` was paused. The topology state hadn't advanced by that point, meaning the write handler's ERM version matched the current version and `stale_versions_in_use()` returned immediately. The coordinator case was not covered at all. Cancel all write response handlers on all shards right after `stop_transport()` in `do_drain()`. This releases their ERMs and the associated stale token_metadata versions, unblocking `stale_versions_in_use()`. Fixed the test to ensure the write handler holds a stale version: use one_shot=False, let the first barrier_and_drain through (version still current), then wait for the second one (version now stale). Extended to cover both coordinator and non-coordinator shutdown on the same 2-node cluster. 
Also includes supporting changes: - error_injection: release wait_for_message waiters on disable() so the test can atomically unblock paused handlers - error_injection: add non-shared mode to wait_for_message for per-invocation message semantics - scylla_cluster.py: allow stop() to bypass start_stop_lock so SIGKILL works while stop_gracefully is blocked Fixes: SCYLLADB-2163 Refs: scylladb/scylladb#23665 backports: SCYLLADB-1842 reported a failure in 2025.1, so we need to backport to all versions starting from 2025.1 - (cherry picked from commit `bc4dc13e94`) - (cherry picked from commit `324a08295d`) - (cherry picked from commit `c88120abca`) - (cherry picked from commit `fa01f74ae6`) - (cherry picked from commit `32002f6443`) - (cherry picked from commit `a093be9ca9`) - (cherry picked from commit `5bc3e84d1e`) - (cherry picked from commit `2927f0dd21`) Parent PR: #29882 Closes scylladb/scylladb#30008 * https://github.com/scylladb/scylladb: storage_proxy: only cancel write handlers with pending remote targets during drain storage_service: cancel write handlers during drain to prevent shutdown deadlock test_unfinished_writes_during_shutdown: extend to cover coordinator shutdown test_unfinished_writes_during_shutdown: fix to reproduce the shutdown deadlock test_unfinished_writes_during_shutdown: await add_last_node_task instead of cancelling it test_unfinished_writes_during_shutdown: add timeout and deadlock detection for shutdown_task test: scylla_cluster: allow stop() to bypass start_stop_lock error_injection: add non-shared mode to wait_for_message error_injection: release waiters when injection is disabled	2026-05-28 12:49:15 +02:00
Michał Hudobski	8db6ab0434	test/cqlpy: add tests for global and local vector index coexistence Add integration tests verifying that both a global and a local vector index can be created on the same column without triggering a spurious "duplicate custom index" error. This was fixed by #29407. Tests cover: - Creating global+local and local+global index pairs on the same column. - Duplicate detection still rejects a second index of the same locality. - IF NOT EXISTS is a no-op for a duplicate same-locality index (and verifies no extra index is created). - IF NOT EXISTS with a different locality creates both indexes. - Two indexes with the same name on different tables are rejected (partially validates VECTOR-643). Fixes: SCYLLADB-2315 (cherry picked from commit `119ef942f8`)	2026-05-28 12:39:54 +02:00
Michał Hudobski	018b3a72e4	index: fix local vector index locality detection after schema reload When index metadata was deserialized from system tables during schema reload, target_parser::is_local() failed to recognize local vector indexes. It only handled the non-vector JSON format {"pk": [...], "ck": [...]}, but vector indexes serialize their targets as {"pk": [...], "tc": "..."}. As a result, every local vector index was incorrectly marked as global after a schema reload. Fix this by introducing vector_index::is_local() that recognizes the vector-specific target format, and dispatching to it from the schema deserialization code based on the index class name. This keeps target_parser as secondary-index-specific and follows the same dispatch pattern already used for target serialization. Also remove the now-unused has_vector_index_on_column() helper (its callers were removed by #29407). (cherry picked from commit `cf372ba87b`)	2026-05-28 12:39:54 +02:00
Nadav Har'El	79492c9662	Merge '[Backport 2026.2] cql: rewrite CassIO SAI metadata index to regular secondary index' from Scylladb[bot] CassIO (the library backing LangChain's `langchain_community.vectorstores.Cassandra` integration) issues the following DDL during schema setup to create a metadata index: ```sql CREATE CUSTOM INDEX IF NOT EXISTS eidx_metadata_s_<table> ON <keyspace>.<table> (ENTRIES(metadata_s)) USING 'org.apache.cassandra.index.sai.StorageAttachedIndex'; ``` ScyllaDB does not support Cassandra's StorageAttachedIndex (SAI) for non-vector columns and previously rejected this statement with: ``` StorageAttachedIndex (SAI) is only supported on vector columns; use a secondary index for non-vector columns ``` This blocks seamless migration of existing LangChain/CassIO applications from Cassandra to ScyllaDB — applications fail during initialization before any application-level workaround can run, even when metadata filtering is not used (`metadata_indexing="none"`). CassIO is no longer actively maintained but remains the only official LangChain integration path for Apache Cassandra over CQL, meaning existing applications will continue using this setup pattern. Instead of rejecting the CassIO metadata-map SAI DDL, detect the pattern and rewrite it to a standard ScyllaDB secondary index on collection entries: - Detection: SAI class name + single `ENTRIES` target on a non-frozen `map` column - Rewrite: Clear the custom class so the index is created through the standard secondary index path (which already fully supports indexing map entries) - Warning: Emit a CQL warning informing the user that SAI is not supported by ScyllaDB, a regular secondary index was created instead, and metadata filtering behavior may differ from Cassandra SAI The rewrite is placed early in `validate_while_executing()`, before the rf-rack-validity check, so the standard secondary index code path handles all subsequent validation naturally — no code duplication. 
After this change, the CassIO schema setup succeeds on ScyllaDB: - `CREATE CUSTOM INDEX ... USING 'sai'` on `ENTRIES(metadata_s)` creates a real secondary index - The index is functional and can accelerate metadata filtering queries - A CQL warning makes the rewrite transparent to operators - SAI on non-vector, non-map-entries columns is still rejected as before - Vector SAI indexes continue to be rewritten to `vector_index` as before - `test_sai_entries_on_map_creates_regular_index` — verifies the index is created and the warning is emitted (fully-qualified SAI class name) - `test_sai_entries_on_map_short_name` — same with the `'sai'` short alias - `test_sai_on_regular_column_rejected` — confirms SAI on regular scalar columns is still rejected All 148 tests in `test_vector_index.py` and `test_secondary_index.py` pass with no regressions (125 passed, 22 xfailed, 1 skipped). Fixes: SCYLLADB-2234 Backport: 2026.2 as this is the version where the support for SAI class needed by LangChain was added. - (cherry picked from commit `242eb96b16`) - (cherry picked from commit `5ee339b11d`) Parent PR: #29981 Closes scylladb/scylladb#30084 * github.com:scylladb/scylladb: cql: rewrite CassIO SAI metadata index to regular secondary index db/config: add enable_cassio_compatibility flag	2026-05-27 14:59:49 +03:00
Petr Gusev	e7d94d9927	storage_proxy: only cancel write handlers with pending remote targets during drain The previous fix (cancel_all_write_response_handlers in do_drain) was too aggressive — it killed all handlers including ones used by group0 for raft commits. Since group0 is still running at that point (before wait_for_group0_stop), this caused group0 operations to fail (SCYLLADB-2168). The actual problem is only with handlers that have pending remote targets: after stop_transport() their MUTATION_DONE responses can never arrive via messaging. Handlers whose only pending targets are local can still complete via apply_locally and should be left alone. Add cancel_nonlocal_write_response_handlers() which checks each handler's remaining targets against the local host ID. Only handlers with at least one remote pending target are cancelled. Use it in do_drain instead of cancel_all_write_response_handlers. The latter remains unchanged for drain_on_shutdown (final proxy shutdown where all handlers must be killed). Fixes: SCYLLADB-2168 (cherry picked from commit `2ff30ee6f0`)	2026-05-27 13:35:30 +02:00
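The selection rule described above (cancel only handlers that still wait on a remote target) can be sketched as a small filter. This Python is an illustration, not the storage_proxy code: the dictionary of handlers and the function name `cancel_nonlocal_handlers` are assumptions standing in for `cancel_nonlocal_write_response_handlers()`.

```python
def cancel_nonlocal_handlers(handlers, local_host_id):
    """Cancel handlers that have at least one pending remote target.

    `handlers` maps a handler id to the set of host ids it is still
    waiting on (hypothetical shape). Cancelled handlers are removed from
    the mapping and their ids are returned.
    """
    cancelled = [
        handler_id
        for handler_id, pending in handlers.items()
        if any(target != local_host_id for target in pending)
    ]
    for handler_id in cancelled:
        del handlers[handler_id]
    return cancelled
```

Handlers whose only pending target is the local host stay in place, matching the message's point that they can still complete locally.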
Patryk Jędrzejczak	ae88f7209d	Merge '[Backport 2026.2] compaction_manager: unregister compaction module on early shutdown' from Scylladb[bot] The compaction module is registered with task_manager in the compaction_manager constructor, and unregistered in compaction_manager::really_do_stop(), which was gated behind `_state != state::none` in compaction_manager::do_stop(). Since enable() -- which transitions _state from none to running -- is called later during startup (from database::start() or the disk space monitor callback) than the compaction_manager constructor, an early shutdown could leave the compaction module registered after compaction_manager::do_stop() returned. task_manager::stop() then aborted with 'Tried to stop task manager while some modules were not unregistered'. Fix compaction_manager::do_stop() to call _task_manager_module->stop() even when `_state == state::none`, so that the compaction module is always properly unregistered. Fixes: SCYLLADB-2226 Backport to all supported branches, as the bug is there and it has already caused a failure in 2026.1 CI. - (cherry picked from commit `6cde390e21`) - (cherry picked from commit `b7400d20dd`) Parent PR: #30015 Closes scylladb/scylladb#30082 * https://github.com/scylladb/scylladb: test: add test_stop_before_starting_compaction_manager compaction_manager: unregister compaction module on early shutdown	2026-05-26 10:06:51 +02:00
Szymon Wasik	c30c9f3a82	cql: rewrite CassIO SAI metadata index to regular secondary index When CassIO creates a SAI ENTRIES index on a map column, ScyllaDB now rewrites it to a regular secondary index and emits a CQL warning. This allows LangChain/CassIO applications to work without DDL errors. The rewrite is gated behind the enable_cassio_compatibility flag (disabled by default). Refs: SCYLLADB-2113 (cherry picked from commit `5ee339b11d`)	2026-05-25 23:37:32 +00:00
Szymon Wasik	4251357118	db/config: add enable_cassio_compatibility flag Add a new live-updatable boolean configuration option 'enable_cassio_compatibility' (default: false). When enabled, it allows ScyllaDB to rewrite CassIO's SAI index DDL on map entries to a regular secondary index, so that LangChain/CassIO applications can run without DDL errors. The flag is disabled by default to avoid affecting users who don't need CassIO compatibility. (cherry picked from commit `242eb96b16`)	2026-05-25 23:37:31 +00:00
Patryk Jędrzejczak	d1e7eaad13	test: add test_stop_before_starting_compaction_manager (cherry picked from commit `b7400d20dd`)	2026-05-25 17:59:17 +00:00
Patryk Jędrzejczak	bbde5a64b2	compaction_manager: unregister compaction module on early shutdown The compaction module is registered with task_manager in the compaction_manager constructor, and unregistered in compaction_manager::really_do_stop(), which was gated behind `_state != state::none` in compaction_manager::do_stop(). Since enable() -- which transitions _state from none to running -- is called later during startup (from database::start() or the disk space monitor callback) than the compaction_manager constructor, an early shutdown could leave the compaction module registered after compaction_manager::do_stop() returned. task_manager::stop() then aborted with 'Tried to stop task manager while some modules were not unregistered'. Fix compaction_manager::do_stop() to call _task_manager_module->stop() even when `_state == state::none`, so that the compaction module is always properly unregistered. Fixes: SCYLLADB-2226 (cherry picked from commit `6cde390e21`)	2026-05-25 17:59:16 +00:00
Gleb Natapov	81dc11557c	storage_proxy: hold shared pointer to a table object during entire query_partition_key_range_concurrent execution Otherwise if a table is dropped in the middle of a scan the object may disappear. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-2137 Closes scylladb/scylladb#29988 (cherry picked from commit `0bf050d175`) Closes scylladb/scylladb#30067	2026-05-25 12:48:42 +02:00
Avi Kivity	0606b71893	Merge '[Backport 2026.2] cql: request-side custom payload parsing' from Scylladb[bot] When a CQL client sends a request with the `CUSTOM_PAYLOAD` flag (`0x04`) set, the frame body starts with a [bytes map] before the message. Scylla never implemented parsing of this map on the request side. This caused it to fail parsing with protocol errors such as `"truncated frame: expected 65546 bytes"`. This was discovered through DataStax Java Driver 4.19.x tests that attach a `request-id` to queries via custom payload. The same issue affects any CQL client that sets the `CUSTOM_PAYLOAD` flag. Fix this by skipping over the custom payload [bytes map] from the frame body before dispatching to opcode-specific handlers. The payload contents are discarded since Scylla has no pluggable `QueryHandler`. Cassandra's default `QueryHandler` also discards them. Fixes SCYLLADB-745 Reported on 2026.2, backport. - (cherry picked from commit `8e6d2d0631`) - (cherry picked from commit `f9e8518776`) Parent PR: #30005 Closes scylladb/scylladb#30026 * github.com:scylladb/scylladb: cql: fix request-side custom payload parsing test/cqlpy: add tests for request-side custom payload handling	2026-05-24 22:11:41 +03:00
Łukasz Paszkowski	d66b65d6d4	tasks: fix busy-spin and shutdown hang in tablet_virtual_task::wait() for repair tasks The condition variable predicate for repair tasks unconditionally returned true (introduced in `e5928497ce`), which meant event.wait(pred) never actually suspended: do_until checks the predicate first, and if it's already satisfied, returns immediately without calling the inner wait(). This caused two problems: 1. The while(true) loop busy-spun, polling without blocking between topology changes. 2. During shutdown, event.broken() had no effect because no waiter was registered on the CV. The loop kept spinning, holding the HTTP server's task gate open and preventing http_server::stop() from completing. After ~15 minutes, systemd killed the process with SIGABRT. The fix replaces the synchronous predicate with an async task_finished() helper that dispatches on the task type. Since the repair check is async (for_each_tablet scans every tablet), we cannot use event.wait(Pred). Instead, we register a waiter via event.wait() before running the async check, ensuring no broadcast is missed during the check. event.broken() during shutdown propagates broken_condition_variable to the registered waiter and unblocks the loop promptly. Fixes: SCYLLADB-2181 Closes scylladb/scylladb#29485 (cherry picked from commit `96a992002c`) Closes scylladb/scylladb#30042	2026-05-24 22:10:47 +03:00
Gleb Natapov	74a057bee0	schema: ensure committed_by_group0 is set for all non-system tables on boot Tables created before the GROUP0_SCHEMA_VERSIONING feature was enabled have committed_by_group0 = null in system_schema.scylla_tables. This causes maybe_delete_schema_version() to delete their version cell, forcing the legacy hash-based schema version computation path. Add ensure_committed_by_group0() which runs on boot and fixes up any non-system tables where committed_by_group0 is not true (null or false): 1. Queries system_schema.scylla_tables for rows where committed_by_group0 is null or false, skipping system keyspaces (system, system_schema). 2. Takes a group0 guard 3. Re-checks after the raft barrier in case another node already fixed it. 4. For each table needing fixup, creates a mutation writing the version cell (from the in-memory schema). The committed_by_group0 = true flag is stamped by add_committed_by_group0_flag() inside announce(). 5. Announces via raft group0. 6. Retries with a small random delay on group0_concurrent_modification. On other nodes, schema_applier will detect these as "altered" tables (scylla_tables mutation changed), but since the actual table definition is unchanged, update_column_family is effectively a no-op. This is a prerequisite for eventually removing the legacy hash-based schema versioning code path. Closes scylladb/scylladb#29911 (cherry picked from commit `cc034f84c5`) Closes scylladb/scylladb#30000	2026-05-24 22:10:19 +03:00
Piotr Dulikowski	2d33cbd644	Merge '[Backport 2026.2] db/view/view_building_coordinator: add flag to mark if any remote work was finished' from Michał Jadwiszczak There is a small window, just after the view building coordinator releases the group0 guard and before it waits on view_building_state_machine's CV, in which the coordinator may miss a CV broadcast triggered by finished remote work. To fix it, this patch adds a boolean flag, which is set to true before broadcasting the CV and is checked before awaiting on the CV. Fixes SCYLLADB-2029 The problem is not critical, but it should be backported to 2025.4 and newer versions, all of which contain the view building coordinator. (cherry picked from commit `c7f65131bf`) (cherry picked from commit `c767ac7ef3`) Parent PR: https://github.com/scylladb/scylladb/pull/27313 Closes scylladb/scylladb#30029 * github.com:scylladb/scylladb: test/cluster/test_view_building_coordinator: add reproducer db/view/view_building_coordinator: add flag to mark if any remote work was finished	2026-05-23 10:22:37 +02:00
Dario Mirovic	1af677dff5	cql: fix request-side custom payload parsing When a CQL client sends a request with the CUSTOM_PAYLOAD flag (0x04) set, the frame body starts with a [bytes map] before the message. Scylla never implemented parsing of this map on the request side. This caused it to fail parsing with protocol errors such as "truncated frame: expected 65546 bytes". Fix this by skipping over the custom payload [bytes map] from the frame body before dispatching to opcode-specific handlers. The payload contents are discarded since Scylla has no pluggable QueryHandler. Cassandra's default QueryHandler also discards them. Fixes SCYLLADB-745 (cherry picked from commit `f9e8518776`)	2026-05-22 18:17:34 +02:00
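As a rough illustration of the skipping step described in the commit above, the following sketch walks over a `[bytes map]` as laid out in the CQL native protocol specification: a `[short]` pair count, then for each pair a `[string]` key (`[short]` length plus bytes) and a `[bytes]` value (`[int]` length plus bytes, negative length meaning null). The function name `skip_bytes_map` and its error messages are hypothetical and are not taken from the Scylla source.

```python
import struct


def skip_bytes_map(body: bytes, offset: int = 0) -> int:
    """Return the offset of the first byte after a [bytes map].

    The map contents are not decoded, only stepped over.
    """

    def need(count: int, at: int) -> None:
        # Refuse to read past the end of the frame body.
        if at + count > len(body):
            raise ValueError("truncated frame while skipping custom payload")

    need(2, offset)
    (pairs,) = struct.unpack_from(">H", body, offset)  # [short] pair count
    offset += 2
    for _ in range(pairs):
        need(2, offset)
        (key_len,) = struct.unpack_from(">H", body, offset)  # [string] length
        offset += 2
        need(key_len, offset)
        offset += key_len
        need(4, offset)
        (val_len,) = struct.unpack_from(">i", body, offset)  # [bytes] length
        offset += 4
        if val_len > 0:  # a negative length encodes a null value
            need(val_len, offset)
            offset += val_len
    return offset
```

A caller would then hand `body[skip_bytes_map(body):]` to the opcode-specific parser, which matches the "skip, then dispatch" order the commit message describes.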
Avi Kivity	b259295366	Update seastar submodule (coroutine generator fix) * seastar 74f19b81ca...5462d881a6 (1): > coroutine/generator: fix setup of generator's waiting task Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1798 Ref `6df04c9e5b`	2026-05-22 18:01:44 +03:00
Michał Jadwiszczak	2a3904ec43	test/cluster/test_view_building_coordinator: add reproducer Add test which reproduces scylladb/scylladb#27298 (cherry picked from commit `c767ac7ef3`)	2026-05-22 15:59:27 +02:00
Michał Jadwiszczak	b3e345d680	db/view/view_building_coordinator: add flag to mark if any remote work was finished In the main coordinator loop (`view_building_coordinator::run()`), there is a small window, just after the view building coordinator releases the group0 guard and before it waits on view_building_state_machine's CV, in which the coordinator may miss a CV broadcast triggered by finished remote work (`view_building_coordinator::work_on_tasks()`). To fix it, this patch adds a boolean flag, which is set to true by a finished/failed RPC call before it broadcasts the CV, and is checked before awaiting on the CV. Fixes scylladb/scylladb#27298 (cherry picked from commit `c7f65131bf`)	2026-05-22 15:58:45 +02:00
Dario Mirovic	829b04fb21	test/cqlpy: add tests for request-side custom payload handling Add tests that verify Scylla's handling of CQL native protocol requests with the CUSTOM_PAYLOAD flag (0x04) set. Each test asserts the specific parse error that the unfixed server produces. A separate CQL session is used for each test. The protocol error kills the driver connection, and we need to catch it properly. Refs SCYLLADB-745 (cherry picked from commit `8e6d2d0631`)	2026-05-22 13:46:02 +00:00
Michael Litvak	846ff3ce7f	test: wait for others_not_see_server before exclude Between stopping a server and excluding it, wait for other nodes to see the server as down, otherwise exclude may see the server as alive and fail. Fixes SCYLLADB-2110 Closes scylladb/scylladb#29966 (cherry picked from commit `eecbead541`) Closes scylladb/scylladb#29975	2026-05-22 15:11:25 +03:00
Szymon Malewski	17a61e0015	vector_search: fix decimal/varint precision loss in filter value_to_json() value_to_json() converts CQL values to JSON for vector search filters. For decimal and varint types, it used rjson::parse() on the JSON string, which parses through a double and silently loses precision for values exceeding ~15 significant digits — producing wrong filter results. Additionally, for decimal type we need an exact string representation that preserves the original (unscaled, scale) pair, because partition keys use byte-level identity: different serialized representations of the same numeric value are distinct rows, so the filter must reproduce the exact representation stored in the key. Add big_decimal::to_string_canonical() which follows the Java BigDecimal toString() spec (JDK 8+), producing a bijective string representation that uses exponential notation for extreme scales instead of expanding trailing zeros (which could cause OOM). This could replace to_string(), but doing so has wider consequences (e.g. hash/equality contract for decimal_type) described in SCYLLADB-1574. Use it in value_to_json() for decimal_type, and use rjson::from_string() for varint_type, both bypassing the lossy double parse path. Tests cover the new to_string_canonical() and the filter fix, as well as existing decimal type behavior (key representation, clustering order, toJson) that we rely on and must not break. The CQL decimal type tests (test_type_decimal.py) also pass against Cassandra. Fixes: SCYLLADB-2107 Refs: https://scylladb.atlassian.net/browse/SCYLLADB-1574 Closes scylladb/scylladb#29505 (cherry picked from commit `15493872b2`) Closes scylladb/scylladb#29957	2026-05-22 15:09:54 +03:00
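The two decimal problems named in the commit above (precision loss when a literal is parsed through a double, and distinct `(unscaled, scale)` pairs for numerically equal values) can be reproduced in a few lines of Python. This only illustrates the effect and is not the `value_to_json()` or `to_string_canonical()` code; the helper `unscaled_and_scale` is hypothetical and assumes a finite value.

```python
import json
from decimal import Decimal

literal = "12345678901234567890.123456789"

# Parsing the literal as a JSON number goes through a double,
# which cannot hold this many significant digits.
via_double = json.loads(literal)

# Keeping the value as an exact decimal preserves every digit.
exact = Decimal(literal)


def unscaled_and_scale(value: Decimal):
    """Return the (unscaled, scale) pair of a finite decimal."""
    sign, digits, exponent = value.as_tuple()
    unscaled = int("".join(str(d) for d in digits))
    if sign:
        unscaled = -unscaled
    return unscaled, -exponent


# 1.0 and 1.00 compare equal as numbers but have different pairs,
# so a byte-level key comparison would treat them as different rows.
same_number = Decimal("1.0") == Decimal("1.00")
pair_a = unscaled_and_scale(Decimal("1.0"))
pair_b = unscaled_and_scale(Decimal("1.00"))
```

Here `same_number` is true while `pair_a` and `pair_b` differ, which is why a filter built from a key value has to reproduce the stored representation and not just the numeric value.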
Botond Dénes	f3245c933d	Merge '[Backport 2026.2] load_balancer: apply balance threshold to intranode shard balancing' from Scylladb[bot] - Fix intranode shard balancing to respect the size-based balance threshold, preventing unnecessary migrations when load difference between shards is negligible - Add a regression test that verifies the threshold is respected for intranode balancing The intranode shard balancing loop only stopped when the algorithm exhausted the migration candidates or when a migration would go against convergence (it would increase imbalance instead of decrease it). This caused unnecessary tablet migrations for negligible imbalances (e.g., 0.78% difference between shards). The inter-node balancer already uses `is_balanced()` to stop when the relative load difference is within the configured `size_based_balance_threshold`, but this check was missing from the intranode path. Apply the same `is_balanced()` threshold check that is already used for inter-node balancing to the intranode convergence loop. When the relative load difference between the most-loaded and least-loaded shards on a node is within the threshold, the balancer now stops without issuing further migrations. The test creates a single node with 2 shards and 512 tablets: 1. Balanced scenario (257 vs 255 tablets, same size): relative diff = 0.78% < 1% threshold → verifies no intranode migration is emitted 2. 
Unbalanced scenario (307 vs 205 tablets, same size): relative diff = 33% >> 1% threshold → verifies intranode migration IS emitted Fixes: SCYLLADB-2006 This is a performance improvement which reduces the number of intranode migrations issued, and needs to be backported to versions with size-based load balancing: 2026.1 and 2026.2 - (cherry picked from commit `aaead10e5d`) - (cherry picked from commit `6856f51097`) Parent PR: #29756 Closes scylladb/scylladb#29895 * github.com:scylladb/scylladb: test: add test for intranode balance threshold in size-based mode tablet_allocator: apply balance threshold to intranode shard balancing	2026-05-22 15:07:34 +03:00
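The percentages quoted in the test description above can be checked with a short calculation. The formula below, (max - min) / max, is an assumption inferred from the quoted figures (it reproduces both 0.78% and 33%); the real `is_balanced()` in the load balancer may be defined differently, and the function names here are hypothetical.

```python
def relative_load_diff(max_load: float, min_load: float) -> float:
    """Relative difference between the most and least loaded shard.

    Assumed definition: (max - min) / max.
    """
    if max_load <= 0:
        return 0.0
    return (max_load - min_load) / max_load


def is_balanced(max_load: float, min_load: float, threshold: float = 0.01) -> bool:
    """True when the relative difference is within the threshold (1% here)."""
    return relative_load_diff(max_load, min_load) <= threshold


# The two scenarios from the commit message, with equally sized tablets.
balanced_case = is_balanced(257, 255)    # 2 / 257, about 0.78%
unbalanced_case = is_balanced(307, 205)  # 102 / 307, about 33%
```

Under this assumption the first scenario stays below the 1% threshold and triggers no migration, while the second is far above it.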
Petr Gusev	a2b2a42936	storage_service: cancel write handlers during drain to prevent shutdown deadlock When a node shuts down, do_drain() calls stop_transport() which tears down the messaging service. After this point, MUTATION_DONE responses from replicas can no longer reach the coordinator, so any in-flight write_response_handlers will never complete naturally. These handlers hold ERMs referencing stale token_metadata versions. If the topology coordinator calls barrier_and_drain (either on itself or via RPC), it blocks in stale_versions_in_use() waiting for these stale versions to be released. This causes: - On the coordinator node: do_drain -> wait_for_group0_stop deadlock (the topology coordinator fiber is stuck in barrier_and_drain). - On non-coordinator nodes: ss::stop -> uninit_messaging_service deadlock (the barrier_and_drain RPC handler holds the gate open). Fix: cancel all write response handlers on all shards right after stop_transport() in do_drain(). This releases their ERMs and the associated stale token_metadata versions, unblocking stale_versions_in_use(). Heap-allocate _write_handlers_gate and add an allow_new parameter to cancel_all_write_response_handlers(). When allow_new=true (used by do_drain), the gate is closed and swapped with a fresh one — existing handlers are waited on while new handlers can still be created. This avoids blocking internal writes (paxos learn, compaction history updates) that still need to create handlers during the remainder of the drain sequence. When allow_new=false (used by drain_on_shutdown), the gate is closed permanently — no new handlers can be created after final shutdown. Update test_lwt_shutdown to wait for 'Stop transport: done' instead of 'Shutting down storage proxy RPC verbs'. 
The latter message is now only logged after do_drain() completes, but do_drain() blocks in cancel_all_write_response_handlers() waiting for the background paxos learn handler — which is exactly what the test needs to release before shutdown can proceed. Fixes: SCYLLADB-2163 Refs: scylladb/scylladb#23665 (cherry picked from commit `2927f0dd21`)	2026-05-21 18:58:06 +00:00
Petr Gusev	1268ab6f92	test_unfinished_writes_during_shutdown: extend to cover coordinator shutdown The existing test only covers the case where the shutting-down node is NOT the topology coordinator (deadlocks in uninit_messaging_service). When the node IS the coordinator, the deadlock manifests differently: the topology coordinator fiber calls barrier_and_drain on itself (without messaging), and do_drain -> wait_for_group0_stop blocks because the coordinator can't stop while stale_versions_in_use is waiting on the uncancelled write handler. Run the test twice on the same 2-node cluster (RF=2): - Run 1: target is a non-coordinator - Restore cluster state (restart target, decommission added node) - Run 2: target is the topology coordinator Use CL=ONE so the write completes from the local replica even with the other server's response paused. Mark as xfail since this reproduces bugs not yet fixed on this branch. Refs: SCYLLADB-1842 (cherry picked from commit `5bc3e84d1e`)	2026-05-21 18:58:06 +00:00
Petr Gusev	b147ab4418	test_unfinished_writes_during_shutdown: fix to reproduce the shutdown deadlock The test was written for another case, and was not supposed to reproduce the issue that was fixed in this PR. Fix the test to reproduce the real scenario: 1. Use one_shot=False for pause_before_barrier_and_drain so the injection fires on every barrier_and_drain RPC, not just the first. 2. Let the first barrier_and_drain through (at this point the write handler's ERM version matches the current token_metadata version). 3. Wait for the second barrier_and_drain. Between the two calls, topology_state_load installs a new token_metadata version. The write handler still holds the old version's ERM — now stale. 4. After stop_transport completes, disable the injection (rather than sending a single message) to release the paused handler and any subsequent ones that arrived during stop_transport. The 'disabled' flag in injection_shared_data ensures all waiters wake up. With these changes the test reliably fails (shutdown deadlock within 15s) on the unfixed code and passes on the fixed version from `e0dc73f52a` ('Cancel all write requests on storage_proxy shutdown'). Refs: scylladb/scylladb#23665 (cherry picked from commit `a093be9ca9`)	2026-05-21 18:58:05 +00:00
Petr Gusev	8489493323	test_unfinished_writes_during_shutdown: await add_last_node_task instead of cancelling it asyncio cancel() only affects the client-side coroutine. The server-side addserver handler in the cluster manager continues running. If it can't complete (e.g. no raft quorum because the target node is shut down), the orphaned handler blocks _after_test cleanup for 120s. Await the task instead so it completes cleanly (we restart the target node first to restore quorum). (cherry picked from commit `32002f6443`)	2026-05-21 18:58:05 +00:00
Petr Gusev	8addbed0dc	test_unfinished_writes_during_shutdown: add timeout and deadlock detection for shutdown_task Add a 15s timeout around the shutdown_task await. If the timeout fires, the deadlock is reproduced (shutdown hung because stale_versions_in_use blocks on a write handler holding a stale token_metadata version). When the timeout fires, explicitly kill the node via server_stop() so that the manager's _after_test handler does not wait 120s for the stuck stop_gracefully request. Then fail the test with a clear message. (cherry picked from commit `fa01f74ae6`)	2026-05-21 18:58:04 +00:00
Petr Gusev	de55e0472a	test: scylla_cluster: allow stop() to bypass start_stop_lock Remove the @stop_event and @start_stop_lock decorators from ScyllaServer.stop() so it can SIGKILL a server even while stop_gracefully() holds the lock (e.g. the node is deadlocked during shutdown and stop_gracefully is blocked on cmd.wait()). A local copy of self.cmd is used because there are await points after which another coroutine (stop_gracefully) may set self.cmd to None. The concurrent stop_gracefully() unblocks once the process dies from SIGKILL since its cmd.wait() returns. Also make shutdown_control_connection a plain (non-async) function since it contains no await points — this makes it obvious that no coroutine interleaving is possible inside it. (cherry picked from commit `c88120abca`)	2026-05-21 18:58:04 +00:00
Petr Gusev	09aca68f71	error_injection: add non-shared mode to wait_for_message Add a 'share' parameter to wait_for_message (default true, preserving existing behavior). When share=false, each handler invocation requires its own dedicated message to proceed — a message consumed by one handler is not visible to others. Use share=false for the pause_before_barrier_and_drain injection in raft_topology_cmd_handler. The topology coordinator sends multiple barrier_and_drain RPCs during a single topology transition (one per state change). With share=true a single message_injection call releases all handlers. With share=false the test can release them one at a time, controlling exactly which topology state the write handler's ERM captures. (cherry picked from commit `324a08295d`)	2026-05-21 18:58:04 +00:00
Petr Gusev	3655879f48	error_injection: release waiters when injection is disabled When an error injection is disabled (via disable() or disable_all()), any handlers currently suspended in wait_for_message() must be woken up so they can proceed instead of hanging until timeout. Add a 'disabled' flag to injection_shared_data. When disable() or disable_all() is called, set the flag and broadcast the condition variable. The wait_for_message() predicate checks the flag and returns true immediately, letting the handler continue. This makes disable() atomic with respect to releasing waiters: it both wakes up blocked handlers and removes the injection from the enabled map in one call. This avoids races that would occur with separate message_injection() + disable() calls — message_injection() after disable() fails because the injection is already gone, and disable() after message_injection() risks a new handler hitting the injection between the two calls. Concrete example: test_unfinished_writes_during_shutdown pauses barrier_and_drain RPC handlers via wait_for_message. During shutdown, the test calls disable_injection() to simultaneously release the paused handler and prevent any new barrier_and_drain RPCs from getting stuck. (cherry picked from commit `bc4dc13e94`)	2026-05-21 18:58:03 +00:00
Patryk Jędrzejczak	d5a27e1cf1	test: fix flaky test_raft_snapshot_truncation by waiting for async log truncation Snapshot creation and raft log truncation happen asynchronously in the IO fiber after a schema change completes. The test was querying system.raft immediately after the schema change returned, racing with the IO fiber's store_snapshot_descriptor call. Replace immediate assertions with wait_for polling loops: - log_size == 0: wait for log truncation after drop keyspace - new_snap_id != original_snap_id: wait for new snapshot to be persisted Fixes: SCYLLADB-2157 Closes scylladb/scylladb#29967 (cherry picked from commit `cbadc3d675`) Closes scylladb/scylladb#29999	2026-05-21 16:06:24 +02:00
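The change from immediate assertions to polling can be sketched as follows. `poll_until` is a hypothetical helper written for this illustration, not the `wait_for` utility used by the test suite, and its parameters are assumptions.

```python
import asyncio
import time


async def poll_until(pred, timeout: float = 5.0, period: float = 0.05):
    """Call the async predicate until it returns a non-None value.

    Raises TimeoutError if the deadline passes first.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = await pred()
        if value is not None:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met before the deadline")
        await asyncio.sleep(period)
```

A test would pass a predicate that queries the current state (for example the raft log size) and returns `None` until the asynchronous truncation has become visible, so the assertion no longer races with the background fiber.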
Avi Kivity	0afe3dcfd5	Update seastar submodule (default DMA alignment) * seastar 4d268e0ef5...74f19b81ca (1): > file: fix default DMA alignment Fixes https://scylladb.atlassian.net/browse/SCYLLADB-2043 Ref `6df04c9e5b`	2026-05-20 18:57:19 +03:00
Jenkins Promoter	c4c38aeeda	Update pgo profiles - aarch64	2026-05-20 15:41:19 +03:00
Jenkins Promoter	11c7df5510	Update pgo profiles - x86_64	2026-05-20 14:35:23 +03:00

53665 Commits