Commit Graph

53948 Commits

Author SHA1 Message Date
Anna Stuchlik
1f7d20f701 doc: label Migration from Vnodes to Tablets as experimental
The procedure to migrate a vnodes-based keyspace to tablets-based keyspace
has been labeled as experimental.

Fixes SCYLLADB-1932

Closes scylladb/scylladb#29834
2026-05-11 17:07:39 +03:00
Yaniv Michael Kaul
377bbeb076 docs: fix invalid UUID characters in examples
Replace UUIDs containing non-hexadecimal characters (like 'g', 'n', 'y')
with valid UUIDs in documentation examples.

Fixes #26797

Closes scylladb/scylladb#29674
2026-05-11 17:05:30 +03:00
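A check like the one below could catch invalid UUIDs in documentation examples before they are published. This is only a sketch using Python's standard `uuid` module; the helper name `is_valid_uuid` is hypothetical and not part of the ScyllaDB docs tooling.

```python
import uuid


def is_valid_uuid(text: str) -> bool:
    """Return True if text is a canonical 8-4-4-4-12 hexadecimal UUID."""
    try:
        parsed = uuid.UUID(text)
    except ValueError:
        # Raised for non-hexadecimal characters (such as 'g') or a wrong length.
        return False
    # uuid.UUID also accepts braces, "urn:uuid:" prefixes and undashed hex,
    # so compare against the canonical rendering to require the dashed form.
    return str(parsed) == text.lower()
```

For example, `is_valid_uuid("123e4567-e89b-12d3-a456-42661417400g")` is rejected because of the trailing `g`.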
Calle Wilund
2cc1a2c406 storage_service: Disable snapshots after raft decommission
Fixes: SCYLLADB-1693

In case we abort a decommission operation, the snapshot/backup
mechanism needs to remain open.

This change moves it to after raft_decommission.

In the case of a cluster snapshot, our nodes' ownership
(or not) of tables will be serialized by raft anyway, so it
should remain consistent. In that case we at worst coordinate
from a node in "leave" status.

In the case of a local snapshot, ownership matters less,
only sstables on disk, which should not change.

In the case of backup, this operates on a snapshot, state of which
is not affected.

Adds an injection point for testing.

v2:
- Added injection point to ensure test can abort decommission

Closes scylladb/scylladb#29667
2026-05-11 17:04:09 +03:00
Anna Stuchlik
4c01556f79 doc: mark Vector Search in Alternator as Cloud-only
This commit adds the information missing from the Alternator docs
that Vector Search is only available in ScyllaDB Cloud.

Fixes https://github.com/scylladb/scylladb/issues/29661

Closes scylladb/scylladb#29664
2026-05-11 17:03:20 +03:00
Avi Kivity
f5ffbd3c3e cql3: restrictions: reindent statement_restrictions.cc
6165124fcc has left statement_restrictions.cc scarred and
deformed. Restore it to standard 4-space indentation. This patch
contains only whitespace changes.

Closes scylladb/scylladb#29598
2026-05-11 17:02:14 +03:00
Yaniv Michael Kaul
3cba27d25f topology: propagate error messages through raft_topology_cmd_result
When a topology command (e.g., rebuild) fails on a target node, the
exception message was being swallowed at multiple levels:

1. raft_topology_cmd_handler caught exceptions and returned a bare
   fail status with no error details.
2. exec_direct_command_helper saw the fail status and threw a generic
   "failed status returned from {id}" message.
3. The rebuilding handler caught that and stored a hardcoded
   "streaming failed" message.

This meant users only saw "rebuild failed: streaming failed" instead
of the actionable error from the safety check (e.g., "it is unsafe
to use source_dc=dc2 to rebuild keyspace=...").

Fix by:
- Adding an error_message field to raft_topology_cmd_result (with
  [[version 2026.2]] for wire compatibility).
- Populating error_message with the exception text in the handler's
  catch blocks.
- Including error_message in the exception thrown by
  exec_direct_command_helper.
- Passing the actual error through to rtbuilder.done() instead of
  the hardcoded "streaming failed".

A follow-up test is in https://github.com/scylladb/scylladb/pull/29363

Fixes: SCYLLADB-1404

Closes scylladb/scylladb#29362
2026-05-11 17:01:15 +03:00
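The propagation pattern described in this commit can be illustrated outside of C++. The following is a minimal Python sketch, not the ScyllaDB implementation: `CmdResult`, `handle_command` and `exec_direct_command` are hypothetical stand-ins for `raft_topology_cmd_result`, the command handler and `exec_direct_command_helper`, and the exact wording of the combined message is an assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CmdResult:
    """Hypothetical stand-in for a command result carrying an error message."""
    ok: bool
    error_message: str = ""


def handle_command(command: Callable[[], None]) -> CmdResult:
    """Run a command on the target node; keep the exception text on failure."""
    try:
        command()
        return CmdResult(ok=True)
    except Exception as exc:
        # Previously only a bare "fail" status would be returned here.
        return CmdResult(ok=False, error_message=str(exc))


def exec_direct_command(node_id: str, command: Callable[[], None]) -> CmdResult:
    """Coordinator side: surface the remote error text instead of a generic one."""
    result = handle_command(command)
    if not result.ok:
        raise RuntimeError(
            f"failed status returned from {node_id}: {result.error_message}")
    return result
```

The point of the sketch is that the error text travels with the result, so the caller never has to substitute a hardcoded message.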
Yaniv Michael Kaul
cf9cde664c .github/workflows/call_sync_milestone_to_jira.yml: add missing workflow permissions
Add explicit empty permissions block (permissions: {}) since this
workflow only syncs milestones to Jira using its own secrets and needs
no GITHUB_TOKEN permissions. Fixes code scanning alert #171.

Closes scylladb/scylladb#29184
2026-05-11 17:00:10 +03:00
Raphael S. Carvalho
20fe1e6f68 replica: Improve diagnostics when tablet split fails due to non-empty split-unready groups
When finalizing a tablet split, all data must have been moved into
split-ready compaction groups before the storage groups can be remapped
to the new tablet count. If split-unready groups still hold data at that
point, handle_tablet_split_completion() calls on_internal_error(), which
previously only reported the tablet and table IDs — giving no insight
into why the split-unready groups were not empty.

Add fmt::formatter specializations for compaction_group and storage_group
so the full state of the offending storage_group is included in the error
message. The storage_group formatter emits:

  main=<cg>, merging=[<cg>...], split_ready=[<cg>...]

Each compaction_group formatter emits:

  [sstables=[<sstable_desc>...], memtable_empty=<bool>, sstable_add_gate=<count>]

where sstable_desc includes filename, origin, identifier and originating
host, memtable_empty reflects whether all memtables have been flushed,
and sstable_add_gate count reveals whether an in-flight sstable add is
holding data in the group.

Supporting changes:

- compaction_group: add memtable_empty() const noexcept (delegates to
  memtable_list::empty()) and a const overload of sstable_add_gate()
  so both are accessible from a const compaction_group reference inside
  the formatter.
- Promote sstable_desc from a local lambda in compaction_group_for_sstable
  to a static free function so it is reusable by the formatter.

Refs https://scylladb.atlassian.net/browse/SCYLLADB-1019.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Closes scylladb/scylladb#29178
2026-05-11 16:59:05 +03:00
Yaniv Michael Kaul
3674deea54 scylla-gdb: display ms-format sstable summary from partitions db footer
For ms-format (trie-based) sstables, the traditional summary structure
is not populated. Instead, read equivalent metadata from the
_partitions_db_footer field: first_key, last_key, partition_count,
and trie_root_position.

This is a follow-up to the crash fix for SCYLLADB-1180, replacing the
informational-only message with actual useful output.

Refs: SCYLLADB-1180

Closes scylladb/scylladb#29164
2026-05-11 16:58:22 +03:00
Calle Wilund
db1b92c185 service::load_balancer: Add metrics for repair and rebuild count
Fixes #21115

Adds a cluster counter for repairs and a DC counter for rebuilds.

Closes scylladb/scylladb#28985
2026-05-11 16:57:46 +03:00
Piotr Smaron
71542206bc cql: return InvalidRequest for oversized partition/clustering keys
When a partition key or clustering key value exceeds the 64 KiB limit
(65535 bytes serialized), Scylla used to raise a generic
std::runtime_error "Key size too large: N > M" from the low-level
compound-key serializer. That error surfaced to clients as a CQL
server error (code 0x0000, "NoHostAvailable"-looking), which is both
ugly and incompatible with Cassandra - Cassandra returns a clean
InvalidRequest with the message "Key length of N is longer than
maximum of M".

Fix this at the single chokepoint: compound_type::serialize_value in
keys/compound.hh. The serializer is on every path that materializes a
key - INSERT/UPDATE/DELETE/BATCH build mutations through it, and
SELECT builds partition and clustering ranges through it - so a single
throw replacement produces a clean InvalidRequest consistently across
all paths and all key shapes (single, compound PK, composite CK).

The previous approach on this PR branch patched three call sites in
cql3/restrictions/statement_restrictions.cc, which only covered
SELECT, duplicated the check, and placed it mid-restrictions code
(flagged in review). Dropping those changes in favour of the
root-cause fix here.

Un-xfail the tests this fixes:
- test/cqlpy/test_key_length.py: test_insert_65k_pk, test_insert_65k_ck,
  test_where_65k_pk, test_where_65k_ck, test_insert_65k_ck_composite,
  test_insert_total_compound_pk_err, test_insert_total_composite_ck_err.
- test/cqlpy/cassandra_tests/.../insert_test.py: testPKInsertWithValueOver64K,
  testCKInsertWithValueOver64K.
- test/cqlpy/cassandra_tests/.../select_test.py: testPKQueryWithValueOver64K.

test_insert_65k_pk_compound stays xfail: its oversized value gets
rejected by the Python driver's CQL wire-protocol encoder (see
CASSANDRA-19270) before reaching the server, so the fix can't apply.
Updated its reason. testCKQueryWithValueOver64K stays xfail with an
updated reason: Cassandra silently returns empty for an oversized
clustering key in WHERE, while Scylla now throws InvalidRequest - a
deliberate choice mirroring the partition-key case, documented in
the discussion on #10366.

Add three tight-boundary tests (addressing review feedback on the
previous revision) that pin MAX+1 behaviour for SELECT and INSERT of
both partition and clustering keys.

Update test/cluster/dtest/limits_test.py to match the new message
("Key length of \\d+ is longer than maximum of 65535").

fixes #10366
fixes #12247

Co-authored-by: Alexander Turetskiy <someone.tur@gmail.com>

Closes scylladb/scylladb#23433
2026-05-11 16:56:35 +03:00
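The single-chokepoint idea from this commit can be sketched in a few lines. The Python below is an illustration only, not the code in keys/compound.hh; `InvalidRequest` and `check_key_length` are hypothetical names, while the limit and the message text are taken from the commit message above.

```python
MAX_KEY_LENGTH = 65535  # largest serialized key size, in bytes


class InvalidRequest(Exception):
    """Hypothetical stand-in for the CQL InvalidRequest error."""


def check_key_length(serialized: bytes) -> None:
    """Reject an oversized key with a Cassandra-compatible message."""
    if len(serialized) > MAX_KEY_LENGTH:
        raise InvalidRequest(
            f"Key length of {len(serialized)} is longer than "
            f"maximum of {MAX_KEY_LENGTH}")
```

Because every path that materializes a key would go through one such check, INSERT, UPDATE, DELETE, BATCH and SELECT all report the same error.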
Piotr Smaron
959f67b345 cql: verify tuples length in multi-column IN restriction
When a multi-column IN restriction contains tuples with a different
number of elements than the number of restricted columns (e.g.
`(b, c, d) IN ((1, 2), (2, 1, 4))`), Scylla would either produce an
inconsistent error message or, for over-sized tuples, an internal
type-mismatch error referencing the list literal representation.

Validate each tuple's arity against the number of restricted columns
while building the IN restriction and raise a clear
"Expected N elements in value tuple, but got M" error in both the
under- and over-sized cases.

Fixes #13241

Co-authored-by: Alexander Turetskiy <someone.tur@gmail.com>

Closes scylladb/scylladb#18407
2026-05-11 16:55:09 +03:00
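The arity validation can be pictured with a short Python sketch. It is not the cql3 implementation; `check_tuple_arity` is a hypothetical helper, and only the error text comes from the commit message above.

```python
from typing import Sequence


def check_tuple_arity(columns: Sequence[str],
                      tuples: Sequence[Sequence[object]]) -> None:
    """Ensure every IN tuple has one element per restricted column."""
    expected = len(columns)
    for value in tuples:
        if len(value) != expected:
            raise ValueError(
                f"Expected {expected} elements in value tuple, "
                f"but got {len(value)}")
```

With the example from the commit message, `check_tuple_arity(("b", "c", "d"), [(1, 2), (2, 1, 4)])` fails on the first tuple, which has two elements instead of three.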
Anna Stuchlik
a7b7019f90 doc: update the node size limit
This commit increases the node size limit from 256 to 4096 CPUs
based on be1f566488

Fixes SCYLLADB-1676

Closes scylladb/scylladb#29602
2026-05-11 16:38:53 +03:00
Nadav Har'El
f1b2b9bd52 Merge 'Register fulltext_index custom index type' from Dawid Pawlik
This PR adds the `fulltext_index` custom index class, laying the groundwork for full-text search in ScyllaDB. It focuses on the CQL-facing layer - schema validation, option parsing, and metadata - without implementing the search backend itself.

Users can now write:

```cql
CREATE CUSTOM INDEX ON t(content) USING 'fulltext_index'
WITH OPTIONS = {'analyzer': 'english', 'positions': 'false'};
```

The implementation follows the same custom index pattern established by vector search: a `custom_index` subclass registered in the factory map, with no backing materialized view. This keeps the door open for a CDC-based indexing pipeline similar to the one vector search uses.

As part of this work, the option validation helpers (`validate_enumerated_option`, `validate_positive_option`, `validate_factor_option`) were extracted from `vector_index.cc` into a shared header so both index types can reuse them. The `custom_index` base class also gained a virtual `index_type_name()` method, giving each subclass a self-describing name for error messages without hardcoding strings in shared code.

The PR is split into three commits:

1. Extract shared validation utilities and add `index_type_name()` to `custom_index`
2. Implement `fulltext_index` with column type and option validation
3. Integration tests covering creation, validation, describe, and metadata

Fixes: SCYLLADB-1517
Fixes: SCYLLADB-1510
References: SCYLLADB-1516

Closes scylladb/scylladb#29658

* github.com:scylladb/scylladb:
  test/cqlpy: add integration tests for `fulltext_index`
  index: unify custom index description
  index: add `fulltext_index` custom index implementation
  index: extract option validation helpers
2026-05-11 16:16:58 +03:00
Nadav Har'El
fcfad51284 Merge 'cql3/selection: require EXECUTE on UDA REDUCEFUNC at SELECT time' from Marcin Maliszkiewicz
selection::used_functions() pushed the UDA, its SFUNC and its FINALFUNC,
but never the REDUCEFUNC. The reducefunc is invoked by the distributed
aggregation path in service::mapreduce_service, so a user could cause it
to run server-side without holding EXECUTE on it as long as the query
took the mapreduce path.

Also push agg.state_reduction_function so select_statement::check_access
requires EXECUTE on it too.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1756
Backport: no, it's a minor fix and UDFs are an experimental feature in Scylla

Closes scylladb/scylladb#29717

* github.com:scylladb/scylladb:
  test/cqlpy: add test for EXECUTE permission on UDA sub-functions
  cql3/selection: require EXECUTE on UDA REDUCEFUNC at SELECT time
2026-05-11 16:14:38 +03:00
Gleb Natapov
c3d2f0bde9 raft_group0: remove finish_setup_after_join function
The only thing it does is change a bootstrapping node to become a voter
in case the cluster does not support the limited voters feature. But the
feature was introduced in 2025.2 and a direct upgrade from 2025.1 to a
version newer than 2026.1 is not supported. But even if such an upgrade is
done, the removed code has an effect only during bootstrap, not during
regular boot.

Also remove the upgrade test since, after the patch, suppressing the
feature on the first boot will no longer behave correctly.
2026-05-11 15:38:36 +03:00
Botond Dénes
cf37f541a0 Merge 'sstables_loader: ensure upload directory is empty when load_and_stream returns' from Taras Veretilnyk
After `load_and_stream` (e.g. via `nodetool refresh --load-and-stream`)
returns success, source sstable files in the `upload/` directory may
still be on disk. `mark_for_deletion()` only sets an in-memory flag; the
actual file deletion runs lazily when the last `shared_sstable`
reference drops.

This leaves a window between API success and physical deletion in which a
follow-up scan of the upload directory can detect sstables that are about to be deleted.
Processing them might fail because the sstable files may already be gone by then.

The fix:
Force unlink to complete before `stream()` returns, so the upload
directory is in a consistent state by the time the API reports success.
For tablet streaming, partially-contained sstables participate in
multiple per-tablet batches; eagerly unlinking after each batch would
break the next batch that still needs to read the file. A
`defer_unlinking` flag on the streamer postpones the explicit unlink
until after all batches complete (called once at the end of
`tablet_sstable_streamer::stream()`). Vnode streaming unlinks eagerly at the end of
`stream_sstable_mutations`.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1647

Backport is required, as it is a bug fix that was introduced in 517a4dc4df.

Closes scylladb/scylladb#29599

* github.com:scylladb/scylladb:
  sstables_loader: synchronously unlink streamed sstables before returning
  sstables: make sstable::unlink() idempotent
2026-05-11 14:43:46 +03:00
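The difference between eager and deferred unlinking can be sketched as follows. This Python model is purely illustrative: `UploadStreamer` and its methods are hypothetical, the "streaming" is reduced to reading the file, and only the `defer_unlinking` flag name comes from the description above.

```python
import os
from typing import Iterable, Set


class UploadStreamer:
    """Toy model of a streamer that unlinks source files after streaming."""

    def __init__(self, defer_unlinking: bool) -> None:
        self.defer_unlinking = defer_unlinking
        self._pending: Set[str] = set()

    def stream_batch(self, paths: Iterable[str]) -> None:
        paths = list(paths)
        for path in paths:
            with open(path, "rb") as f:  # fails if an earlier batch unlinked it
                f.read()
        if self.defer_unlinking:
            # A file may be needed by a later batch, so only remember it.
            self._pending.update(paths)
        else:
            self._unlink(paths)

    def finish(self) -> None:
        """Unlink everything that was deferred; safe to call more than once."""
        self._unlink(self._pending)
        self._pending.clear()

    @staticmethod
    def _unlink(paths: Iterable[str]) -> None:
        for path in paths:
            try:
                os.unlink(path)
            except FileNotFoundError:
                pass  # already removed: unlinking is idempotent
```

In this model, a file shared by two batches survives the first batch only when `defer_unlinking` is set, and it is gone from disk once `finish()` returns.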
Asias He
0204372156 repair: Reject repair requests where start and end tokens are equal
When a user calls the repair API with identical startToken and endToken
values, the code creates a wrapping interval (T, T]. This causes
unwrap() to split it into (-inf, T] and (T, +inf), covering the entire
token ring and triggering a full repair.

Reject such requests early with an error message matching
Cassandra's behavior: "Start and end tokens must be different."

Fixes: https://scylladb.atlassian.net/browse/CUSTOMER-358

Closes scylladb/scylladb#29821
2026-05-11 14:08:20 +03:00
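To see why equal tokens end up covering the whole ring, here is a minimal Python sketch of the unwrapping step and of the early rejection. It is a simplified model rather than the ScyllaDB interval code: intervals are `(start, end]` pairs of integers, `None` stands for an unbounded side, and both function names are hypothetical.

```python
from typing import List, Optional, Tuple

# (start, end]: None means -inf as a start and +inf as an end.
Interval = Tuple[Optional[int], Optional[int]]


def unwrap(start: int, end: int) -> List[Interval]:
    """Split a possibly wrapping (start, end] interval into non-wrapping parts."""
    if start < end:
        return [(start, end)]
    # start >= end: the interval wraps around the ring.
    return [(None, end), (start, None)]


def validate_repair_range(start: int, end: int) -> None:
    """Reject the degenerate case before it is unwrapped."""
    if start == end:
        raise ValueError("Start and end tokens must be different.")
```

In this model, `unwrap(5, 5)` yields `(-inf, 5]` and `(5, +inf)`, which together cover every token, so the request has to be rejected before unwrapping.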
Botond Dénes
ad7ac62835 Merge 'Add a node_owner column (locator::host_id) to system.sstables and make it part of the partition key' from Dimitrios Symonidis
Add a node_owner column (locator::host_id) to system.sstables and make it part of the partition key, so the primary key becomes PRIMARY KEY ((table_id, node_owner), generation).

This is the first step toward moving the sstables registry into system_distributed: once distributed, each node's startup scan must read only the rows it owns, which requires the owning node to be part of the partition key. Partitioning by (table_id, node_owner) turns that scan into a single-partition read of exactly the local node's rows.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1562
No need to backport this, keyspace over object storage is an experimental feature

Closes scylladb/scylladb#29659

* github.com:scylladb/scylladb:
  db, sstables: add node_owner to sstables registry primary key
  db, sstables: rename sstables registry column owner to table_id
2026-05-11 14:08:19 +03:00
Botond Dénes
2edfb91070 sstables: migrate all bufsize_mismatch_exception throw sites to throw_bufsize_mismatch_exception()
Replace the two remaining direct 'throw bufsize_mismatch_exception(...)'
call sites with the new throw_bufsize_mismatch_exception() helper, which
routes through throw_malformed_sstable_exception() and thus also respects
the --abort-on-malformed-sstable-error flag.

Affected files:
- sstables/sstables.cc (1 site, in check_buf_size())
- sstables/m_format_read_helpers.cc (1 site, in check_buf_size())
2026-05-11 11:58:14 +03:00
Botond Dénes
d65c1523c2 sstables: migrate all malformed_sstable_exception throw sites to throw_malformed_sstable_exception()
Replace all direct 'throw malformed_sstable_exception(...)' call sites
with the new throw_malformed_sstable_exception() helper, which respects
the --abort-on-malformed-sstable-error flag.
2026-05-11 11:58:14 +03:00
Botond Dénes
84c27658d9 sstables: make on_parse_error() and on_bti_parse_error() respect --abort-on-malformed-sstable-error
Both functions now check abort_on_malformed_sstable_error() first. If
set, they log the error and call std::abort() directly, generating a
coredump. Otherwise they fall through to the existing on_internal_error()
path, which is in turn controlled by --abort-on-internal-error.
2026-05-11 11:58:14 +03:00
Botond Dénes
4ebcc002d6 sstables: disable abort-on-malformed-sstable-error in tests that corrupt sstables on purpose
Add scoped_no_abort_on_malformed_sstable_error RAII guard (modeled after
seastar::testing::scoped_no_abort_on_internal_error) and use it in all
tests that intentionally corrupt sstables and expect
malformed_sstable_exception to be thrown rather than the process aborting.
2026-05-11 11:58:14 +03:00
Botond Dénes
f6dc2cb5f8 sstables: introduce --abort-on-malformed-sstable-error infrastructure
Add the --abort-on-malformed-sstable-error command-line option and the
supporting infrastructure. When set, any malformed sstable error will
abort the process and generate a coredump instead of throwing an
exception. This is useful for debugging memory corruption that may
manifest as apparent sstable corruption.

The implementation introduces:
- throw_malformed_sstable_exception() and throw_bufsize_mismatch_exception()
  helper functions in sstables/sstables.cc, which check the new flag and
  either abort (with logging) or throw the appropriate exception.
- set_abort_on_malformed_sstable_error() / abort_on_malformed_sstable_error()
  to control the per-process atomic flag.
- abort_on_malformed_sstable_error config option (LiveUpdate, default false)
  wired up in main.cc alongside abort_on_internal_error.

Call-site migration will follow in subsequent commits.
2026-05-11 11:58:14 +03:00
Botond Dénes
c3daa6379c sstables: refactor parse_path() to return std::expected<> instead of throwing
make_entry_descriptor() and the two overloads of parse_path() used to signal
parse failures by throwing malformed_sstable_exception, which made parse_path()
expensive to use as a probe (e.g. to classify directory entries).

Change make_entry_descriptor() and both parse_path() overloads to return
std::expected<T, sstring>, where the sstring carries the error message on
failure, eliminating the exception overhead at probe call sites.

Call sites that previously caught malformed_sstable_exception to treat the
path as a non-SSTable file (utils/directories.cc, db/snapshot/backup_task.cc,
tools/scylla-sstable.cc) now check the expected result directly.

Call sites where a parse failure is a genuine error (sstable_directory.cc,
sstables.cc, tools/schema_loader.cc, tools/scylla-sstable.cc) re-throw
explicitly as malformed_sstable_exception using the error string, preserving
the existing error propagation behaviour.
2026-05-11 11:58:14 +03:00
Gleb Natapov
5213aee99f raft_group0: fix indentation after the last change
2026-05-11 11:56:26 +03:00
Gleb Natapov
5f7f72fa50 raft_group: drop unneeded checks
2026-05-11 11:55:39 +03:00
Marcin Maliszkiewicz
fa9d15d31a test/cqlpy: add test for EXECUTE permission on UDA sub-functions
Verify that SELECT of a UDA requires EXECUTE on its SFUNC, FINALFUNC,
and REDUCEFUNC individually.  If any one permission is missing, the
query must be rejected at planning time (even on an empty table).

The test is parameterized over the three sub-functions and uses
Lua on Scylla or Java on Cassandra, so it runs on both backends.
The REDUCEFUNC case is skipped on Cassandra since REDUCEFUNC is a
Scylla extension.

Refs SCYLLADB-1756
2026-05-11 10:23:39 +02:00
copilot-swe-agent[bot]
9e7d67612c docs: fix typo in materialized views docs - "columns are" instead of "is"
The MV Select Statement description was missing the word "columns" and
used incorrect verb agreement, making the sentence grammatically broken
and ambiguous.

docs/cql/mv.rst: "which of the base table is included" →
"which of the base table columns are included"

Fixes #29662
Closes #29663

Co-authored-by: annastuchlik <37244380+annastuchlik@users.noreply.github.com>
2026-05-11 11:15:25 +03:00
Botond Dénes
eae15f4fdd Merge 'Share timeout_config between services' from Pavel Emelyanov
The timeout_config (more exactly -- updateable_timeout_config) is used by alternator/controller and transport/controller. Both create a local copy of that object by constructing one out of db::config. Also some options from this config are needed by storage_proxy, but since it doesn't have access to any timeout_config-s, it just uses db::config by getting it from the database.

This PR introduces a top-level sharded<updateable_timeout_config>, initializes it from db::config values and makes existing users plus storage_proxy use it where required. Motivation -- remove more replica::database::get_config() users. A side effect -- timeout_config is no longer duplicated by transport and alternator controllers.

Components' dependencies cleanup, not backporting.

Closes scylladb/scylladb#29636

* github.com:scylladb/scylladb:
  storage_proxy: Use shared updateable_timeout_config for CAS contention timeout
  alternator: Use shared updateable_timeout_config by reference
  cql_transport: Use shared updateable_timeout_config by reference
  storage_proxy: Use shared updateable_timeout_config by reference
  main: Introduce sharded<updateable_timeout_config>
  storage_proxy: Keep own updateable_timeout_config
2026-05-11 11:12:01 +03:00
Botond Dénes
9b2dfab2e5 Merge 'Don't use database.get_config() to fetch calculate_view_update_throttling_delay option' from Pavel Emelyanov
This option is used in two places -- proxy and view-update-generator both need it to calculate the calculate_view_update_throttling_delay() value. This PR moves the option onto view_update_backlog top-level service, makes the calculating helper a method of that class and patches the callers to use it. This eliminates more places that abuse database as db::config accessor.

Code dependencies refactoring, not backporting

Closes scylladb/scylladb#29635

* github.com:scylladb/scylladb:
  view: Turn calculate_view_update_throttling_delay into node_update_backlog member
  view: Place view_flow_control_delay_limit_in_ms on node_update_backlog
  view: Add node_update_backlog reference to view_update_generator
2026-05-11 10:30:24 +03:00
Pavel Emelyanov
f39cbb1ec6 storage_proxy: Move maintenance_mode onto storage_proxy::config
Stop reading maintenance_mode through replica::database's db::config.
Add a properly typed maintenance_mode_enabled field to
storage_proxy::config, populate it in main.cc from cfg->maintenance_mode()
(same as messaging_service::config), and use a cached member in
storage_proxy instead of db.local().get_config().maintenance_mode().

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Closes scylladb/scylladb#29637
2026-05-11 10:11:20 +03:00
Yaniv Michael Kaul
631f1e1654 compaction: set_skip_when_empty() for validation_errors metric
Add .set_skip_when_empty() to compaction_manager::validation_errors.
This metric only increments when scrubbing encounters out-of-order or
invalid mutation fragments in SSTables, indicating data corruption.
It is almost always zero and creates unnecessary reporting overhead.

AI-Assisted: yes
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#29349
2026-05-11 09:12:40 +03:00
Yaniv Michael Kaul
b8a150e22c build: add -ftime-trace support for compilation profiling
Add a --time-trace flag to configure.py and a Scylla_TIME_TRACE CMake
option that enable Clang's -ftime-trace on all C++ compilations. When
enabled, each .o file produces a companion .json trace that can be
analyzed with ClangBuildAnalyzer or loaded in chrome://tracing to
identify slow headers and costly template instantiations.

This is the first step toward data-driven build speed improvements.

Refs #1

Usage:
  configure.py:  ./configure.py --time-trace --mode dev
  CMake:         cmake -DScylla_TIME_TRACE=ON -DCMAKE_BUILD_TYPE=Dev ..

Closes scylladb/scylladb#29462
2026-05-11 08:55:33 +03:00
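Once the traces exist, a small script can rank the most expensive headers. The sketch below assumes the trace files use the Chrome trace-event layout (a top-level `traceEvents` list of events with `ph`, `name`, `dur` and `args.detail` fields) and that header parsing is reported under the event name `Source`; both are assumptions to verify against real output, and the function names are hypothetical.

```python
import json
from collections import defaultdict
from typing import Dict, List, Tuple


def load_trace(path: str) -> dict:
    """Read one .json trace file produced next to an object file."""
    with open(path) as f:
        return json.load(f)


def slowest_events(trace: dict, event_name: str = "Source",
                   top: int = 5) -> List[Tuple[str, int]]:
    """Sum durations (microseconds) per args.detail for complete events."""
    totals: Dict[str, int] = defaultdict(int)
    for event in trace.get("traceEvents", []):
        # "X" marks a complete event that carries its own duration.
        if event.get("ph") == "X" and event.get("name") == event_name:
            detail = event.get("args", {}).get("detail", "<unknown>")
            totals[detail] += event.get("dur", 0)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top]
```

Calling `slowest_events(load_trace(path))` on a single trace gives a rough per-file ranking; ClangBuildAnalyzer remains the better tool for whole-build aggregation.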
Dmitry Kropachev
85d0011b3c gitignore: add missing rust build artifacts
rust/**/target and Cargo.lock files under rust/inc/ and
rust/wasmtime_bindings/ were not ignored, nor was
test/resource/wasm/rust/target/.

Closes scylladb/scylladb#28943
2026-05-11 07:06:26 +03:00
Botond Dénes
3f72852d8c Merge 'Fix missing format string placeholders across the codebase (33 bugs across 14 modules)' from Yaniv Kaul
Fix 28 format string bugs plus 5 related format argument bugs across 14 modules
where `{}` placeholders were missing or arguments were wrong, causing arguments to
be silently dropped or misleading output from the `{fmt}` library.

Inspired by https://github.com/scylladb/scylladb/pull/29143 (which fixed a single
instance in `replica/table.cc`), a comprehensive audit of the entire codebase was
performed to find all similar issues.

- **Missing `{}` placeholder** (21 instances): format string simply lacks `{}` for a
  passed argument, e.g. `format("msg for table {}", group_id, table_id)` -- `group_id`
  is silently dropped
- **Spurious comma breaking C++ string literal concatenation** (2 instances): a comma
  after a string literal prevents adjacent-literal concatenation, turning the
  continuation into a format argument instead of part of the format string
- **Printf-style `%s` in fmtlib context** (4 instances): `%s` has no meaning in fmtlib
  and appears as literal text while the argument is silently ignored
- **Extra spurious argument** (1 instance): an extraneous `t.tomb()` argument inserted
  between correct arguments, causing wrong values in the wrong slots

- **Wrong variable in error message** (4 instances in `types/map.hh`): error messages
  for oversized map keys/values reported `map_size` (total entry count) instead of the
  actual `elem.first.size()` or `elem.second.size()` that exceeded the limit
- **Swapped argument order** (1 instance in `data_dictionary/data_dictionary.cc`):
  format string says `"Extraneous options for {type}: {values}"` but the values and
  type arguments were passed in reverse order

| Module | Bugs Fixed | Files |
|--------|:---------:|-------|
| `replica/` | 1 | `table.cc` |
| `service/` | 4 | `raft_group0.cc`, `storage_service.cc` |
| `db/` | 6 | `heat_load_balance.cc`, `commitlog_replayer.cc`, `view_update_generator.cc`, `view_building_worker.cc`, `row_locking.cc` |
| `cql3/` | 2 | `prepare_expr.cc`, `statement_restrictions.cc` |
| `transport/` | 4 | `event_notifier.cc` |
| `sstables/` | 3 | `partition_reversing_data_source.cc`, `reader.cc` |
| `alternator/` | 1 | `conditions.cc` |
| `cdc/` | 1 | `split.cc` |
| `raft/` | 1 | `server.cc` |
| `utils/` | 2 | `gcp/object_storage.cc`, `s3/client.cc` |
| `mutation/` | 1 | `mutation_partition.hh` |
| `ent/` | 2 | `kmip_host.cc`, `kms_host.cc` |
| `types/` | 4 | `map.hh` |
| `data_dictionary/` | 1 | `data_dictionary.cc` |

The `{fmt}` library's compile-time checker validates that each `{}` placeholder
references a valid argument, but does **not** verify the reverse -- that every
argument has a corresponding placeholder. Extra arguments are silently ignored
at both compile time and runtime.

Build verified with `dbuild ninja build/dev/scylla` -- compiles cleanly.

---

**Note:** Commits were amended to fix the author name from "Yaniv Michael Kaul" to "Yaniv Kaul".

Closes scylladb/scylladb#29448

* github.com:scylladb/scylladb:
  data_dictionary: fix swapped arguments in extraneous options error
  types: fix wrong variable in map key/value size error messages
  ent: fix missing format placeholders in encryption error/log messages
  mutation: fix spurious argument in shadowable_tombstone formatter
  utils: fix missing format placeholders in object storage log messages
  raft: fix missing format placeholder in server ostream operator
  cdc: fix missing format placeholder in error message
  alternator: fix missing format placeholder in error message
  sstables: fix missing format placeholders in error messages
  transport: fix printf-style format specifiers in fmtlib log calls
  cql3: fix missing format placeholders in error messages
  db: fix missing format placeholders in log and error messages
  service: fix missing format placeholders in log messages
  replica: fix missing format placeholder in cleanup log message
2026-05-11 07:04:42 +03:00
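Since the compile-time checker does not flag surplus arguments, an audit has to compare placeholder and argument counts itself. The following Python sketch shows one way such a check could work; it is not the tool used for this PR, the function names are hypothetical, and it handles only automatically indexed `{}`-style fields, not explicit indices such as `{0}` or named fields.

```python
import re
from typing import List, Sequence

# Match escaped braces first so "{{" and "}}" are not counted as fields.
_FIELD = re.compile(r"\{\{|\}\}|\{[^{}]*\}")


def count_placeholders(fmt: str) -> int:
    """Count replacement fields in an fmt-style format string."""
    return sum(1 for m in _FIELD.finditer(fmt)
               if m.group() not in ("{{", "}}"))


def unused_arguments(fmt: str, args: Sequence[str]) -> List[str]:
    """Return the trailing arguments that have no placeholder to fill."""
    return list(args[count_placeholders(fmt):])
```

A printf-style string such as `"value %s"` contains no replacement fields at all, so every argument passed with it is reported as unused.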
Yaron Kaikov
5694c93c12 build: add collect-dist target to organize build artifacts
Build artifacts are currently scattered across
build/dist/$mode/redhat/, tools/python3/build/, tools/cqlsh/build/, etc.
with unpredictable names. Add a new 'collect-dist' ninja target that
gathers all distributable artifacts into a well-known structure:

  build/$mode/dist/rpm/       -- all binary RPMs (no SRPMs)
  build/$mode/dist/deb/       -- all .deb packages
  build/$mode/dist/tar/       -- relocatable tarballs (already here)

The collection is done via a reusable 'collect_pkgs' ninja rule defined
directly in configure.py, which knows all the source paths. No external
script is needed.

Fixes: SCYLLADB-75

Closes scylladb/scylladb#29475
2026-05-11 06:54:29 +03:00
Michael Litvak
274024a76b configure.py: update compile_commands.json if stale
configure.py creates compile_commands.json in the root directory as a
symbolic link to the file in one of the build directories. If the file
already exists it does nothing.

However, it may happen that the file exists but its target does not.
For example, if the build directory is removed and the project is then
built with a different mode, the file remains as a stale symbolic
link.

To address this, when the file exists check also if it's a valid
symbolic link. If not, then recreate it with a valid target.

Closes scylladb/scylladb#29680
2026-05-10 22:17:16 +03:00
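The stale-link check can be expressed with the standard `os` module. This is a sketch of the idea rather than the code added to configure.py; `refresh_symlink` is a hypothetical name, and leaving a regular (non-symlink) file untouched is an assumption of this example.

```python
import os


def refresh_symlink(link_path: str, target: str) -> bool:
    """Point link_path at target unless a valid entry already exists.

    Returns True if the link was created or recreated.
    """
    if os.path.islink(link_path):
        # os.path.exists() follows the link, so it is False when dangling.
        if os.path.exists(link_path):
            return False
        os.unlink(link_path)  # stale link: remove it before recreating
    elif os.path.exists(link_path):
        return False  # a regular file is left alone (assumption)
    os.symlink(target, link_path)
    return True
```

The key detail is that `os.path.islink()` inspects the link itself while `os.path.exists()` follows it, which is what distinguishes a valid link from a dangling one.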
Piotr Szymaniak
459c1dc32f test/alternator: stop avoiding tablets in Streams tests
Alternator Streams now supports tablets, so stop skipping the TTL Streams test in tablet mode and stop forcing vnodes in the Streams audit test.

Refs SCYLLADB-463

Closes scylladb/scylladb#29697
2026-05-10 22:13:15 +03:00
Nadav Har'El
df8c9b17b8 Merge 'alternator: Graduate Alternator Streams from experimental' from Piotr Szymaniak
As a final step for https://scylladb.atlassian.net/browse/SCYLLADB-461 we need to graduate Alternator Streams from experimental.
So let's remove `--experimental-features=alternator-streams` and map the obsolete config string to `UNUSED` for backward compatibility. Also, remove the related gating of the feature.
Finally, stop providing the config flag in test configs.

Fixes SCYLLADB-1680
Fixes #16367

The documentation, tracked by https://scylladb.atlassian.net/browse/SCYLLADB-462, still remains.

This PR needs to hit 2026.2, so (only) if it branches before the PR is merged to `master`, we'd need to backport.

Closes scylladb/scylladb#29604

* github.com:scylladb/scylladb:
  test: Stop providing alternator-streams experimental flag
  alternator: Graduate Alternator Streams from experimental
2026-05-10 22:10:03 +03:00
Nadav Har'El
34136d3bc2 Merge 'vector_search: test: migrate CQL tests for vector search from C++/Boost to pytest' from Karol Nowacki
Migrate vector search (ANN ordered select query) CQL tests from C++/Boost suite to pytest.

This migration includes:
- New pytest tests in `test/cqlpy/test_vector_search_with_vector_store_mock.py`
- VectorStoreMock server as pytest fixture to simulate vector store responses

The benefits of this migration are:
- Extended test coverage to verify CQL protocol serialization and driver
- Reduced overall test time (no compilation required for pytest)

Fixes SCYLLADB-695

No backport needed as this is a refactoring.

Closes scylladb/scylladb#29593

* github.com:scylladb/scylladb:
  vector_search: test: migrate paging warnings tests to Python
  vector_search: test: migrate local_vector_index to Python
  vector_search: test: migrate vector_index_with_additional_filtering_column to Python
  vector_search: test: migrate cql_error_contains_http_error_description to Python
  vector_search: test: migrate pk in restriction test to Python
2026-05-10 22:09:17 +03:00
Nadav Har'El
d4aa528834 Merge 'load_balancer: fix tablet allocator dropped table' from Ferenc Szili
- Handle dropped tables gracefully in the tablet load balancer's `get_schema_and_rs()` instead of aborting with `on_internal_error`
- The load balancer operates on a token metadata snapshot but accesses the live schema for table lookups. A DROP TABLE applied by another fiber between coroutine yield points can remove a table from the live schema while it still exists in the snapshot, causing an abort.

`get_schema_and_rs()` now returns `std::optional` and logs a warning in debug log level instead of aborting when a table is missing. All callers skip dropped tables:
- `make_sizing_plan`: skips to next table
- `make_resize_plan`: skips to next table (merge suppression is moot)
- `check_constraints`: returns `skip_info{}` with empty viable targets
- `get_rs`: returns `nullptr`, checked by `check_constraints`

The call chain is: `make_plan` → `make_internode_plan` → `check_constraints` → `get_rs` → `get_schema_and_rs`. The `make_internode_plan` coroutine has multiple `co_await` yield points (`maybe_yield`, `pick_candidate`) between building the candidate tablet list and checking replication constraints. A DROP TABLE schema mutation applied during any of these yields removes the table from `_db.get_tables_metadata()` while the candidate list still references it.

Added `test_load_balancing_with_dropped_table` which simulates the race by capturing a token metadata snapshot, dropping the table, then calling `balance_tablets` with the stale snapshot.
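
The skip-on-missing pattern described above can be illustrated with a small Python sketch. It is only an analogy for the C++ change: the dictionary standing in for the live schema, the string schema values, and the simplified signatures are all assumptions made for illustration.

```python
from typing import Optional


def get_schema_and_rs(live_tables: dict[str, str], table_id: str) -> Optional[str]:
    """Return the table's entry from the live schema, or None if the table
    was dropped after the snapshot was taken (instead of aborting)."""
    return live_tables.get(table_id)


def make_sizing_plan(snapshot_tables: list[str], live_tables: dict[str, str]) -> list[str]:
    """Plan only for tables that exist in both the snapshot and the live schema."""
    planned = []
    for table_id in snapshot_tables:
        schema = get_schema_and_rs(live_tables, table_id)
        if schema is None:
            # The table is in the stale snapshot but gone from the live
            # schema: skip to the next table.
            continue
        planned.append(table_id)
    return planned
```

Returning an optional value pushes the "table may be gone" case into the type, so each caller has to decide explicitly how to skip it.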

Fixes: SCYLLADB-1664

This fix needs to be backported to versions: 2025.4, 2026.1

Closes scylladb/scylladb#29585

* github.com:scylladb/scylladb:
  test: verify load balancer handles dropped tables gracefully
  tablet_allocator: handle dropped tables gracefully in get_schema_and_rs
2026-05-10 22:07:51 +03:00
Nadav Har'El
63927e07ea Merge 'alternator/streams: keep disabled streams usable and purge on re-enable' from Piotr Szymaniak
When an Alternator stream is disabled, the data should continue to be accessible so that consumers can finish reading. When the stream is later re-enabled, a new StreamArn is produced and only then is the old data purged.

On disable, the existing CDC options (including preimage and postimage) are preserved so that DescribeStream can still report StreamViewType. All stream APIs continue to work on the disabled stream, with all shards reported as closed (EndingSequenceNumber set). No new CDC records are written; existing data expires via TTL after 24 hours.

On re-enable, the old CDC log table is dropped as a separate Raft group0 schema change and a fresh one is created with a new UUID, giving a new StreamArn. This is Alternator-specific — CQL CDC keeps reusing the log table. Re-enabling is the only way to immediately purge old stream data.

Old stream data is removed immediately upon re-enable (a discrepancy with DynamoDB, which keeps it readable for 24 hours through the old StreamArn).

Tests updated to cover the new disable and re-enable behavior.

Fixes #7239
Fixes SCYLLADB-523

Closes scylladb/scylladb#29413

* github.com:scylladb/scylladb:
  alternator/streams: remove dead next_iter in get_records
  test/alternator: fix stream wait timeouts to use wall-clock time
  docs/alternator: document stream disable/re-enable behavior
  alternator/streams: keep disabled streams usable and purge on re-enable
2026-05-10 22:04:35 +03:00
Nadav Har'El
e277f747bd Merge 'Make collection unfreezing more efficient' from Botond Dénes
Introduce `read_from_collection_cell_view()` which reads a `collection_mutation` directly from the IDL representation of a collection (`ser::collection_cell_view`). This cuts down the number of allocations required drastically compared to the current method of:

    IDL -> collection_mutation_description -> collection_mutation

Reduces the number of allocations to unfreeze a collection from O(collection_cell_count) -> O(1) (actually, due to buffer fragmentation, it is O(collection_size)).
The new method is used when unfreezing frozen mutations and frozen mutation fragments. This is on the hot path: all writes with collections benefit.

Add a `--collection` flag to `perf-simple-query` to allow measuring the performance improvement of this PR.
With `dbuild -it -- build/release/scylla perf-simple-query --collection=16 -c1 -m2G --default-log-level=error --write`, the number of allocations drops from ~123 to 102, a significant reduction.

Refs: https://github.com/scylladb/scylladb/issues/3602 (solves one use-case out of the many listed therein)
Fixes: SCYLLADB-1046
Fixes: SCYLLADB-1077

Backport: this is an optimization so normally not a backport candidate, but we may have to backport to relieve certain customers

Closes scylladb/scylladb#29033

* github.com:scylladb/scylladb:
  test/perf/perf_simple_query: add --collection=N
  test/boost/frozen_mutation_test: add freeze/unfreeze test for large collections
  mutation/mutation_partition_view: use read_from_collection_cell_view() to read collections
  mutation/collection_mutation: introduce read_from_collection_cell_view()
  mutation/atomic_cell: atomic_cell_type: add write*() and *serialized_size()
  mutation/collection_mutation: generalize serialize_collection_mutation
  mutation/mutation_partition_view: avoid copying collection
  mutation/mutation_partition_view: accept collection_mutation in the consume API
  partition_builder: add move variant of accept_*_cell() collection overloads
2026-05-10 20:39:08 +03:00
Nadav Har'El
2501a22b10 alternator: remove unneeded call to format()
Removed a silly call to format() on a constant string without parameters.
2026-05-10 20:34:36 +03:00
Nadav Har'El
b3a62dc9d2 alternator: improve CONTAINS operator's validity checking
Copilot, which reviewed the implementation of the CONTAINS operator,
complained that in some places we assume without checking that the
user-provided parameter to CONTAINS has the expected structure.

Not doing all the checks explicitly is actually not terrible in
RapidJSON, because its methods like BeginMembers() always validate the
type before trying to follow a pointer, throwing an exception if
the JSON value doesn't have the right type. But it's still cleaner
to do these checks explicitly, and throw a clean SerializationError
instead of some internal server error. So this is what this patch does.

If the malformed object doesn't come from the query but rather comes
from the data, we just silently return false. This is our usual
convention - we don't expect malformed data in our database, but if
we do have some (see issue #8070) we shouldn't tell the user that
there was an error in his completely valid query.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-05-10 20:34:36 +03:00
Yaniv Kaul
a6cf45f9e2 data_dictionary: fix swapped arguments in extraneous options error
The format string says "Extraneous options for {type}: {values}"
but the arguments were passed in the wrong order (values first, type
second), producing misleading error messages like
"Extraneous options for bucket,endpoint: S3" instead of
"Extraneous options for S3: bucket,endpoint".

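The effect of the swapped arguments can be reproduced with a short Python sketch. The function names are hypothetical and Python's `str.format` stands in for the C++ formatting call; only the two message strings come from the description above.

```python
def extraneous_options_buggy(storage_type: str, options: list[str]) -> str:
    # Arguments passed in the wrong order: values first, type second.
    return "Extraneous options for {}: {}".format(",".join(options), storage_type)


def extraneous_options_fixed(storage_type: str, options: list[str]) -> str:
    # Arguments match the order of the placeholders: type, then values.
    return "Extraneous options for {}: {}".format(storage_type, ",".join(options))
```

Calling both with `"S3"` and `["bucket", "endpoint"]` yields the misleading and the corrected message respectively. Positional placeholders give no protection against this kind of mistake, since both arguments are printable in either slot.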
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-05-10 17:51:20 +03:00
Yaniv Kaul
a13da94308 types: fix wrong variable in map key/value size error messages
Four error messages for oversized map keys/values reported map_size
(the total number of entries) instead of the actual key or value size
that exceeded the limit. The condition checks elem.first.size() or
elem.second.size(), but the error message printed map_size. This
affects both the bytes and managed_bytes serialization overloads.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-05-10 17:51:20 +03:00
Yaniv Kaul
bf1d59ad95 ent: fix missing format placeholders in encryption error/log messages
Fix two format string bugs:

- kmip_host.cc: cmd_in was passed as an argument to a trace log but
  had no {} placeholder, so the command was silently dropped.
- kms_host.cc: the XML node name (what) was passed to the error
  message but had no {} placeholder, so the error never showed which
  XML node was missing.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-05-10 17:51:20 +03:00
Yaniv Kaul
a76774f8f9 mutation: fix spurious argument in shadowable_tombstone formatter
The formatter for shadowable_tombstone had a spurious t.tomb()
argument between the timestamp and deletion_time arguments. This
caused t.tomb() (the whole tombstone) to be formatted into the
deletion_time={} slot, while the actual deletion_time count was
silently dropped. Remove the extra argument.
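
A rough Python analogy of this bug, assuming a two-placeholder format string similar to the one described (the exact string and the function names here are illustrative, not the formatter's real code):

```python
def format_buggy(timestamp: int, tomb: str, deletion_time: int) -> str:
    # Three arguments for two placeholders: `tomb` lands in the
    # deletion_time slot and `deletion_time` is silently ignored.
    return "timestamp={}, deletion_time={}".format(timestamp, tomb, deletion_time)


def format_fixed(timestamp: int, deletion_time: int) -> str:
    # One argument per placeholder, in placeholder order.
    return "timestamp={}, deletion_time={}".format(timestamp, deletion_time)
```

As in the case described above, the surplus trailing argument raises no error in `str.format`, which is why the mistake produces wrong output rather than a failure.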

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-05-10 17:51:19 +03:00