Commit Graph

52506 Commits

Author SHA1 Message Date
Gleb Natapov
f888f2dced service level: remove remnants of version 1 service level
can_use_effective_service_level_cache() always returns true now, so the
function can be dropped entirely and all the code that assumes it may
return false can be dropped as well.
2026-03-12 12:27:52 +02:00
Nadav Har'El
09a399ae3c Merge 'Replace estimated_histogram with approx_exponential_histogram - alternator' from Amnon Heiman
_"A journey of a thousand miles begins with a single step" Lao Tzu_

ScyllaDB uses estimated_histogram in many places.
We already have a more efficient alternative: approx_exponential_histogram. It is both CPU and
memory-efficient and can be exported as Prometheus native histograms.

Its main limitation (which has its benefits) is that the bucket layout is fixed at compile time, so
histograms with different configurations cannot be mixed.

The end goal is to replace all uses of estimated_histogram in the codebase.
That migration needs a few small API adjustments, so I am splitting the work
into steps for easier review.

This series is the first step. It introduces a base template for fixed-size
estimated histograms, and switches the Alternator's estimated_histogram with the template.
This change is self-contained and valuable on its own, while keeping the scope limited.

Minor adjustments were made to the code and tests so that the tests would pass.

Follow-up PRs will apply the same pattern to the rest of the code.

**New feature no need to backport**

Closes scylladb/scylladb#28987

* github.com:scylladb/scylladb:
  alternator: migrate to operation_size_kb histograms
  test/alternator/test_metrics.py: Update the bucket in the histogram search
  alternator: Use batch_histogram for batch size histograms
  estimated_histogram.hh: adds estimated_histogram_with_max
2026-03-12 00:06:16 +02:00
Amnon Heiman
1339a44163 alternator: migrate to operation_size_kb histograms
Switch Alternator operation-size metrics from the legacy estimated
histogram implementation to estimated_histogram_with_max<512> and export
them through the native approx-exponential histogram path.

Add a dedicated operation-size histogram type alias based on
estimated_histogram_with_max<512>.
Replace all per-operation size histograms (GetItem/PutItem/DeleteItem/
UpdateItem/BatchGetItem/BatchWriteItem) with the new type.

Remove the custom legacy histogram-to-metrics adapter and use
to_metrics_histogram() for operation size metrics, aligning export
behavior with other approx-exponential histograms.
Update Alternator metrics tests to compute expected le bucket boundaries using
approx-exponential bucket math (including deduplication of equal
bounds), so assertions match the new exported histogram schema.
Update bucket helper signatures to use (max, precision) parameters and keep
+Inf handling unchanged.

Replace byte-to-KB ceiling conversion with plain integer division (bytes
/ 1024): histogram export already reports each bucket by its upper bound
(le), so rounding input values up before bucketing is unnecessary and
would over-shift borderline samples into higher buckets.
2026-03-11 17:29:14 +02:00
David
79f9967eaa docs: update theme 1.9
Motivation

Upgrades Sphinx to 9.x, MyST Parser to 5.x, Python to 3.11+–3.14, Node.js to 22, and replaces Poetry with uv for dependency management.

Changelog: https://github.com/scylladb/sphinx-scylladb-theme/blob/master/docs/source/upgrade/CHANGELOG.md#190---26-february-2026
How to test

* Make sure you are using Python 3.11-3.14:
* python --version
* Install uv:
* make setupenv
* Build the docs:
* make preview
* Docs should render without errors at http://127.0.0.1:5500

Closes scylladb/scylladb#28971
2026-03-11 16:56:51 +02:00
Aleksandra Martyniuk
2e68f48068 nodetool: cluster repair: do not fail if a table was dropped
nodetool cluster repair without additional params repairs all tablet
keyspaces in a cluster. Currently, if a table is dropped while
the command is running, all tables are repaired but the command finishes
with a failure.

Modify nodetool cluster repair. If a table wasn't specified
(i.e. all tables are repaired), the command finishes successfully
even if a table was dropped.

If a table was specified and it does not exist (e.g. because it was
dropped before the repair was requested), then the behavior remains
unchanged.

Fixes: SCYLLADB-568.

Closes scylladb/scylladb#28739
2026-03-11 16:35:04 +02:00
Dani Tweig
45d7d9a96c .github/workflow: also call call_sync_milestone_to_jira.yml for close milestone event
What changed

* Added closed to milestone event types in
  call_sync_milestone_to_jira.yml (types: [created] -> types: [created, closed])
* Added VECTOR to the list of Jira project keys being synced
  (jira_project_keys: SCYLLADB,CUSTOMER,SMI,RELENG -> jira_project_keys: SCYLLADB,CUSTOMER,SMI,RELENG,VECTOR)

Why (Requirements Summary)

* The call_sync_milestone_to_jira.yml workflow only triggered on
  milestone creation. When a GitHub milestone is closed, the
  corresponding Jira versions (in SCYLLADB, CUSTOMER, SMI, RELENG
  projects) should be marked as released. Adding the closed trigger
  enables the called workflow (main_sync_milestone_to_jira_release.yml
  in github-automation) to handle both creating and releasing Jira
  versions from GitHub milestone events.
* Added the VECTOR project so its Jira versions are also created/released
  when milestones are created or closed in scylladb.git.
* This is consistent with the same change already applied to the staging
  and scylla-machine-image repos.

Fixes:PM-216

Update call_sync_milestone_to_jira.yml in scylladb.git - add close trigger and VECTOR project sync

Closes scylladb/scylladb#28981
2026-03-11 15:56:55 +02:00
Amnon Heiman
69fbcd32bd test/alternator/test_metrics.py: Update the bucket in the histogram search 2026-03-11 15:24:05 +02:00
Amnon Heiman
50af1f3671 alternator: Use batch_histogram for batch size histograms
Switch batch-related histograms to estimated_histogram_with_max.
Results with better memory consumption and improve efficiency.
2026-03-11 15:21:25 +02:00
Amnon Heiman
b22162c719 estimated_histogram.hh: adds estimated_histogram_with_max
This patch adds estimated_histogram_with_max template that will be a
based for specific estimated_histograms, eventually replacing the current
struct implementation.

Introduce estimated_histogram_with_max<Max> as a reusable wrapper around
approx_exponential_histogram<1, Max, 4>, providing merge support and the
same add helpers used by existing estimated_histogra type.

Add estimated_histogram_with_max_merge()

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2026-03-11 15:02:37 +02:00
Radosław Cybulski
fe8117feee alternator: fix shard's parent calculation for vnodes
Fix an invalid condition, when searching for a parent shard, when table
is based on vnodes. Shards have associated with them `last token` -
token, than marks the end of the range of tokens they consume (inclusive).
An additional assumptions are whole token space is used and
(for vnodes) token space wraps around.

Previously code looked like this:
    auto pid = std::upper_bound(..., [](const dht::token& t, const cdc::stream_id& id) {
                    return t < id.token();
    });
    if (pid != pids.begin()) {
        pid = std::prev(pid);
    }

An `upper_bound` call with `t < id.token()` means it is looking for
an iterator, for which value `t < id.token()` changed to true,
which effectively means a position, where iterator is bigger
then searched value. Then we move iterator backward once if possible.
Assuming token space <-2, 2> and parents [0, 2], when we search for:
- -1 -> we will get 0, it's first, so we can't move backward, so 0 (ok)
- 0 -> we will get 2, it's not first, so we go back and we return 0 (ok)
- 1 -> we will get 2, it's not first, so we go back and we return 0
      (not ok - should be 2)

The fix is to replace it with `std::lower_bound` and remove conditional
backward motion. Since we've a guarantees that whole token space is used
if `std::lower_bound` ends with `end()` value, then we have a wrap
around case and we need to pick `begin()` as result.

Fixes #28354
Fixes: SCYLLADB-537

Closes scylladb/scylladb#28382
2026-03-11 14:51:42 +02:00
Piotr Dulikowski
d9a277453e Merge 'cql3: pin prepared cache entry in prepare() to avoid invalid weak handle race' from Alex Dathskovsky
query_processor::prepare() could race with prepared statement invalidation: after loading from the prepared cache, we converted the cached object to a checked weak pointer and then continued asynchronous work (including error-injection waitpoints). If invalidation happened in that window, the weak handle could no longer be promoted and the prepare path could fail nondeterministically.

This change keeps a strong cache entry reference alive across the whole critical section in prepare() by using a pinned cache accessor (get_pinned()), and only deriving the weak handle while the entry is pinned. This removes the lifetime gap without adding retry loops.

  Test coverage was extended in test/cluster/test_prepare_race.py:

  - reproduces the invalidation-during-prepare window with injection,
  - verifies prepare completes successfully,
  - then invalidates again and executes the same stale client prepared object,
  - confirms the driver transparently re-requests/re-prepares and execution succeeds.

  This change introduces:

  - no behavior change for normal prepare flow besides stronger lifetime guarantees,
  - no new protocol semantics,
  - preserves existing cache invalidation logic,
  - adds explicit cluster-level regression coverage for both the race and driver reprepare path.
  - pushes the re prepare operation twards the driver, the server will return unprepared error for the first time and the driver will have to re prepare during execution stage

Fixes: https://github.com/scylladb/scylladb/issues/27657

Backport to active branches recommended: No node crash, but user-visible PREPARE failures under rare schema-invalidation race; low-risk timeout-bounded retry improves robustness.

Closes scylladb/scylladb#28952

* github.com:scylladb/scylladb:
  transport/messages: hold pinned prepared entry in PREPARE result
  cql3: pin prepared cache entry in prepare() to avoid invalid weak handle race
2026-03-11 12:09:23 +01:00
Patryk Jędrzejczak
37aeba9c8c Merge 'raft: add global read barrier to group0_batch::commit and switch auth and service levels' from Marcin Maliszkiewicz
This series adds a global read barrier to raft_group0_client, ensuring that Raft group0 mutations are applied on all live nodes before returning to the caller.

Currently, after a group0_batch::commit, the mutations are only guaranteed to be applied on the leader. Other nodes may still be catching up, leading to stale reads. This patch introduces a broadcast read barrier mechanism. Calling  send_group0_read_barrier_to_live_members after committing will cause the coordinator to send a read barrier RPC to all live nodes (discovered via gossiper) and waits for them to complete. This is best effort attempt to get cluster-wide visibility of the committed state before the response is returned to the user.

Auth and service levels write paths are switched to use this new mechanism.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-650

Backport: no, new feature

Closes scylladb/scylladb#28731

* https://github.com/scylladb/scylladb:
  test: add tests for global group0_batch barrier feature
  qos: switch service levels write paths to use global group0_batch barrier
  auth: switch write paths to use global group0_batch barrier
  raft: add function to broadcast read barrier request
  raft: add gossiper dependency to raft_group0_client
  raft: add read barrier RPC
2026-03-11 10:37:19 +01:00
Botond Dénes
54bddeb3b5 sstables/mx/writer: yield after writing a cell
With the goal of avoiding stalls on writing large collections, like
below:
    ++[0#1/1 100%] addr=0x5422d1e total=32 count=1 avg=32:
    |              seastar::backtrace<seastar::backtrace_buffer::append_backtrace_oneline()::{lambda(seastar::frame)#1}> at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:85
    ++           - addr=0x541b6d4:
    |              seastar::backtrace_buffer::append_backtrace_oneline at ./build/release/seastar/./seastar/src/core/reactor.cc:811
    |              (inlined by) seastar::print_with_backtrace at ./build/release/seastar/./seastar/src/core/reactor.cc:838
    ++           - addr=0x541afb7:
    |              seastar::internal::cpu_stall_detector::generate_trace at ./build/release/seastar/./seastar/src/core/reactor.cc:1479
    ++           - addr=0x541b86c:
    |              seastar::internal::cpu_stall_detector::maybe_report at ./build/release/seastar/./seastar/src/core/reactor.cc:1214
    |              (inlined by) seastar::internal::cpu_stall_detector::on_signal at ./build/release/seastar/./seastar/src/core/reactor.cc:1234
    |              (inlined by) seastar::reactor::block_notifier at ./build/release/seastar/./seastar/src/core/reactor.cc:1548
    /opt/scylladb/libreloc/libc.so.6: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=f83d43b9b4b0ed5c2bd0a1613bf33e08ee054c93, for GNU/Linux 3.2.0, not stripped

    ++           - addr=/opt/scylladb/libreloc/libc.so.6+0x1a28f:
    |              sigpending at ??:0
    ++           - addr=0x1760bf6:
    |              std::basic_string_view<signed char, std::char_traits<signed char> >::remove_prefix at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/string_view:302
    |              (inlined by) managed_bytes_basic_view<(mutable_view)0>::remove_prefix at ././utils/managed_bytes.hh:421
    |              (inlined by) _Z11read_simpleIlTk14FragmentedView24managed_bytes_basic_viewIL12mutable_view0EEET_RT0_ at ././utils/fragment_range.hh:365
    |              (inlined by) _ZL9get_fieldIlTk14FragmentedView24managed_bytes_basic_viewIL12mutable_view0EEQsr3stdE12is_trivial_vIT_EES3_T0_j at ././mutation/atomic_cell.hh:62
    |              (inlined by) atomic_cell_type::timestamp at ././mutation/atomic_cell.hh:103
    |              (inlined by) basic_atomic_cell_view<(mutable_view)0>::timestamp at ././mutation/atomic_cell.hh:232
    |              (inlined by) sstables::mc::writer::write_cell at ./sstables/mx/writer.cc:1101
    |              (inlined by) sstables::mc::writer::write_collection(bytes_ostream&, clustering_key_prefix const*, column_definition const&, collection_mutation_view, sstables::mc::writer::row_time_properties const&, bool)::$_0::operator() at ./sstables/mx/writer.cc:1233
    |              (inlined by) collection_mutation_view::with_deserialized<sstables::mc::writer::write_collection(bytes_ostream&, clustering_key_prefix const*, column_definition const&, collection_mutation_view, sstables::mc::writer::row_time_properties const&, bool)::$_0> at ././mutation/collection_mutation.hh:97
    |              (inlined by) sstables::mc::writer::write_collection at ./sstables/mx/writer.cc:1221
    ++           - addr=0x1677af3:
    |              sstables::mc::writer::write_cells at ./sstables/mx/writer.cc:1261
    |              (inlined by) sstables::mc::writer::write_row_body at ./sstables/mx/writer.cc:1287
    |              (inlined by) sstables::mc::writer::write_clustered at ./sstables/mx/writer.cc:1377
    |              (inlined by) _ZN8sstables2mc6writer15write_clusteredI14clustering_rowQ9ClusteredIT_EEEvRKS4_9tombstone at ./sstables/mx/writer.cc:766
    |              (inlined by) sstables::mc::writer::consume at ./sstables/mx/writer.cc:1425

Putting the yield in write_cell() instead of in write_collection() means
that writing any row benefits from the added yield point in the middle.

Refs: SCYLLADB-964

Closes scylladb/scylladb#28948
2026-03-11 10:34:55 +01:00
Botond Dénes
475220b9c9 Merge 'Remove the rest of pre raft topology code' from Gleb Natapov
Remove the rest of the code that assumes that either group0 does not exist yet or a cluster is till not upgraded to raft topology. Both of those are not supported any more.

No need to backport since we remove functionality here.

Closes scylladb/scylladb#28841

* github.com:scylladb/scylladb:
  service level: remove version 1 service level code
  features: move GROUP0_SCHEMA_VERSIONING to deprecated features list
  migration_manager: remove unused forward definitions
  test: remove unused code
  auth: drop auth_migration_listener since it does nothing now
  schema: drop schema_registry_entry::maybe_sync() function
  schema: drop make_table_deleting_mutations since it should not be needed with raft
  schema: remove calculate_schema_digest function
  schema: drop recalculate_schema_version function and its uses
  migration_manager: drop check for group0_schema_versioning feature
  cdc: drop usage of cdc_local table and v1 generation definition
  storage_service: no need to add yourself to the topology during reboot since raft state loading already did it
  storage_service: remove unused functions
  group0: drop with_raft() function from group0_guard since it always returns true now
  gossiper: do not gossip TOKENS and CDC_GENERATION_ID any more
  gossiper: drop tokens from loaded_endpoint_state
  gossiper: remove unused functions
  storage_service: do not pass loaded_peer_features to join_topology()
  storage_service: remove unused fields from replacement_info
  gossiper: drop is_safe_for_restart() function and its use
  storage_service: remove unused variables from join_topology
  gossiper: remove the code that was only used in gossiper topology
  storage_service: drop the check for raft mode from recovery code
  cdc: remove legacy code
  test: remove unused injection points
  auth: remove legacy auth mode and upgrade code
  treewide: remove schema pull code since we never pull schema any more
  raft topology: drop upgrade_state and its type from the topology state machine since it is not used any longer
  group0: hoist the checks for an illegal upgrade into main.cc
  api: drop get_topology_upgrade_state and always report upgrade status as done
  service_level_controller: drop service level upgrade code
  test: drop run_with_raft_recovery parameter to cql_test_env
  group0: get rid of group0_upgrade_state
  storage_service: drop topology_change_kind as it is no longer needed
  storage_service: drop check_ability_to_perform_topology_operation since no upgrades can happen any more
  service_storage: remove unused functions
  storage_service: remove non raft rebuild code
  storage_service: set topology change kind only once
  group0: drop in_recovery function and its uses
  group0: rename use_raft to maintenance_mode and make it sync
2026-03-11 10:24:20 +02:00
Piotr Dulikowski
38a2829f69 Merge 'Return HTTP error description in Vector Store client' from Szymon Wasik
The `service_error` struct: 6dc2c42f8b/service/vector_store_client.hh (L64)
currently stores just the error status code. For this reason whenever the HTTP error occurs, only the error code can be forwarded to the client. For example see here: 6dc2c42f8b/service/vector_store_client.cc (L580)
For this reason in the output of the drivers full description of the error is missing which forces user to take a look into Scylla server logs.

The objective of this PR is to extend the support for HTTP errors in Vector Store client to handle messages as well.

Moreover, it removes the quadratic reallocation in response_content_to_sstring() helper function that is used for getting the response in case of error.

Fixes: VECTOR-189

Closes scylladb/scylladb#26139

* github.com:scylladb/scylladb:
  vector_search: Avoid quadratic reallocation in response_content_to_sstring
  vector_store_client: Return HTTP error description, not just code
2026-03-11 09:19:27 +01:00
Calle Wilund
6d8ac23731 test_encryption: Use maximum replication in _smoke_test
Refs: SCYLLADB-557

We should use full replication in KS/CF creation and population,
for at least two reasons:
1.) Ensure we wait fully for and write to all nodes
2.) Make test more "real", behaving like a proper cluster

Closes scylladb/scylladb#28959
2026-03-11 09:54:57 +02:00
Botond Dénes
99fa912f1b Merge 'Generalize streaming scopes tests' from Pavel Emelyanov
To restore how streaming scopes work there are two tests that greatly duplicate each other -- test_restore_with_streaming_scopes from cluster/object_store suite and test_refresh_with_streaming_scopes from cluster suite.

This patch generalizes both into a do_test_streaming_scopes() non-test function

Closes scylladb/scylladb#28874

* github.com:scylladb/scylladb:
  test: Re-sort comments around do_test_streaming_scopes()
  test: Split do_load_sstables()
  test: Drop load_fn argument from do_load_sstables()
  test: Re-use do_test_streaming_scopes() in refresh test
  test: Introduce SSTablesOnLocalStorage
  test: Introduce SSTablesOnObjectStorage
  test: Move test_restore_with_streaming_scopes() into do_test_streaming_scopes()
2026-03-11 09:35:21 +02:00
Dmitriy Kruglov
cee44716db docs: add cluster platform migration procedure
Document how to migrate a ScyllaDB cluster to different instance
types using the add-and-replace node cycling approach.

Closes: QAINFRA-42

Closes scylladb/scylladb#28458
2026-03-11 09:31:35 +02:00
Nadav Har'El
401dc1894c test/alternator,cqlpy: avoid xfail_strict against DynamoDB/Cassandra
Recently, in commit 7b30a39, we added to pytest.ini the option xfail_strict.
This option causes every time a test XPASSes, i.e., an XFAIL test actually
passes, to be considered an error and fail the test.

While this has some benefits, it's a big problem when running tests
against a reference implementation like DynamoDB or Cassandra: We
typically mark a test "xfail" if the test shows a known bug - i.e., if
the test fails on Scylla but passes on the reference system (DynamoDB
or Cassandra). This means that when running "test/cqlpy/run-cassandra"
or "test/alternator/run --aws", we expect to see many tests XPASS,
and now this will cause these runs to "fail".

So in this patch we add the xfail_strict=false to cqlpy/run-cassandra
and alternator/run --aws. This option is not added to cqlpy/run or
to alternator/run without --aws, and also doesn't affect test.py or
Jenkins.

P.S. This is another nail in the coffin of doing "cd test/alternator;
pytest --aws". You should get used to running Alternator tests through
test/alternator/run, even if you don't need to run Scylla (the "--aws"
option doesn't run Scylla).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#28973
2026-03-11 09:29:30 +02:00
Robert Bindar
29619e48d7 replica/table: calculate manifest tablet_count from tablet map
During tests I noticed that if the number of tablets is very small,
say 2, and the number of nodes is 3 (2 shards per node), using the
number of storage groups on each shard, a shard may end up holding 0 groups,
whilst the other holds 1 group. And in some nodes even both shards have
0 groups.
Taking the minimum among shards here was showing in manifests a tablet
count of 0 for all 3 nodes, which is incorrect.

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>

Closes scylladb/scylladb#28978
2026-03-11 09:27:04 +02:00
Botond Dénes
3fed6f9eff Merge 'service: tasks: scan all tablets in tablet_virtual_task::wait' from Aleksandra Martyniuk
Currently, for repair tasks tablet_virtual_task::wait gathers the
ids of tablets that are to be repaired. The gathered set is later
used to check if the repair is still ongoing.

However, if the tablets are resized (split or merged), the gathered
set becomes irrelevant. Those, we may end up with invalid tablet id
error being thrown.

Wait until repair is done for all tablets in the table.

Fixes: https://github.com/scylladb/scylladb/issues/28202

Backport to 2026.1 needed as it contains the change introducing the issue d51b1fea94

Closes scylladb/scylladb#28323

* github.com:scylladb/scylladb:
  service: fix indentation
  test: add test_tablet_repair_wait
  service: remove status_helper::tablets
  service: tasks: scan all tablets in tablet_virtual_task::wait
2026-03-11 09:24:07 +02:00
Raphael S. Carvalho
cc5b1acadf Improve log when sstable load fails due to missing tablet replica
A bug or some bad operator intervention can lead to a sstable existing in a
node after the tablet replica was moved to a different node.

This will result sstable loading during boot failing, requiring operator
intervention.
The log today just dumps the name of the "orphaned" sstable, but one
investigating it might want to know which process (repair, memtable, whatever)
generated that sstable, if the sstable was created locally or remotely,
and the current replica set of the underlying tablet.
From the original identifier, we can know the exact time the sstable was
created on its original node. From the current id, we know the time it
was created on the current node.
All this info can help the investigator to correlate with events in other nodes
(includes actions from the coordinator) to get closer to the root cause.

The new log will look like this:

"Unable to load SSTable .../me-3gyg_1fsw_2u0u826b00b71vc46o-big-Data.db
(originated from compaction with id 913f41c0-18c2-11f1-8f08-cb8521b3f330
on host e483238c-2287-4022-8bc4-b4f1c4cb2b0d)
of tablet 6 (replica set: [e483238c-2287-4022-8bc4-b4f1c4cb2b0d:0])"

Refs https://scylladb.atlassian.net/browse/SCYLLADB-788.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#28921
2026-03-11 06:20:34 +02:00
Avi Kivity
b17e1259e3 Merge 'types: optimize vector deserialization for high-dimensional vectors' from Szymon Wasik
Vector deserialization is an operation which performance is critical for
vector similarity search feature because it is frequently executed during
rescoring operation. Some of the identified performance bottlenecks
for it include:

1. Per-element virtual dispatch in deserialize(): each of the N elements
   went through visit() which switches on ~28 type variants. For a
   1024-dimension float vector, that's 1024 redundant type switches when
   the element type is the same for all of them.

2. Redundant work in split_fragmented(): value_length_if_fixed() was
   called inside the loop (N virtual calls), and no reserve() was done
   on the output vector causing repeated reallocations.

This series fixes both:

- Introduce deserialize_vector_visitor that dispatches on the element
  type once for the entire vector, then loops inside the resolved
  handler. Simple numeric types (float, int, etc.) call
  deserialize_value() directly with no virtual dispatch per element.
  String types (ascii, utf8) get a dedicated handler that skips
  make_empty() (sstring has no empty_t constructor). Complex types
  (list, map, tuple, etc.) fall back to per-element dispatch.

- In split_fragmented(), reserve the output vector to _dimension and
  cache value_length_if_fixed() before the loop.

Benchmark results (1024-dim float vector, release build, -O3 -flto):

  deserialize:      15.73 us -> 11.70 us  (1.34x, 26% faster)
  split_fragmented: 10.34 us ->  7.45 us  (1.39x, 28% faster)

References: SCYLLADB-471

Backport: none, unless we observe some critical performance improvement for quantization.

Closes scylladb/scylladb#28618

* github.com:scylladb/scylladb:
  types: optimize reading vector fragments
  types: optimize vector deserialization for high-dimensional vectors
2026-03-11 00:39:46 +02:00
Dawid Mędrek
167feabe1a cql3: Reject user-provided timestamps for strongly consistent tables
Similarly to LWTs, we reject queries with user-provided timestamps
when they target strongly consistent tables.

Such statements could force us to rewrite history, and that contradicts
the philosophy of linearizability we aim for.

Fixes SCYLLADB-879

Closes scylladb/scylladb#28867
2026-03-10 22:11:39 +02:00
Marcin Maliszkiewicz
8ae80a32c0 Update seastar submodule
* seastar d2953d2a...4d268e0e (32):
  > Merge 'prometheus: support multiple __name__ filters and prefixed names' from Travis Downs
    doc: update prometheus.md with __name__ filter enhancements
    prometheus: support prefixed names in __name__ filter
    prometheus: add benchmarks for name filter performance
    prometheus: support multiple __name__ query parameters
    prometheus: move write_body_args to header
  > fair_queue: Subtract from _queued_capacity on pop_front()
  > memory: expose cumulative allocated bytes statistic
  > Merge 'Add ability to configure IO bandwidth limit for supergroup' from Pavel Emelyanov
    test: IO bandiwdth throttler unit tests
    code: Add ability to configure IO bandwidth limit for supergroup
    io_queue: Have more than one throttler par class
    io_queue: Introduce bandwidth_throttler helper class
    io_queue: Nest io_group::priotiy_class_data-s
    io_queue: Update class bandwidth on group's class data
    io_queue: Make io_group::priority_class_data::tokens() static
    fair_queue: Introduce group (un)plugging
  > Fix _shard_to_numa_node_mapping double population
  > Use exception parameter in log_timer_callback_exception()
  > Fix wakeup_granularity() fallback debug-fs reading
  > test_fixture: Fix SEASTAR_FIXTURE_THREAD_TEST_CASE thread not propagated
  > build: support tuning -ffile-prefix-map
  > test: Remove unused C::dup() method of testing class
  > src/core/reactor: introduce reactor::get_backend_name()
  > util/process: add pid() accessor
  > Merge 'Add source location to task and tasktrace object' from Radosław Cybulski
    coroutine.hh: disable source_location for GCC to avoid ICE
    reactor: improve do_dump_task_queue reporting
    Use source_location in `do_dump_task_queue`
    Update backtrace with source locations of resume points
    Add calls to update resume_point
    Add a std::source_location (resume_point) to task object.
  > Merge 'Refine posix file .dup() implementation' from Pavel Emelyanov
    file: Templatize posix_file_handle_impl
    file: Don't dup() non-read-only files
    file: Split ..._impl::dup() implementations
    test: Add a simple test for dup()
  > Merge 'Deprecate reactor::make_pollable_fd(socket_address, int)' from Pavel Emelyanov
    reactor: Deprecate make_pollable_fd()
    net/posix: Create file_desc for sockets in-place
    reactor,net: Keep sock_need_nonblock boolean on posix_network_stack
    net/posix: Re-format constructor initializer lists
  > Merge 'test: add fuzz testing infrastructure and sstring fuzzer' from Travis Downs
    test: add fuzz tests to CI workflow
    test: add sstring differential fuzzer
    test: add fuzz testing infrastructure
  > Introduce "integrated queue length" metrics and use it for IO classes (#3210)
  > reactor: Remove get_sg_data(unsigned) overload
  > memcached: Stop using scattered_message
  > reactor: Mark uptime() method const
  > alien: Remove deprecated run_on and submit_to calls
  > file: make open_flags and access_flags constexpr
  > scheduling: Unfriend some methods from scheduling_group
  > reactor: Move _dying bit to epoll backend
  > file: coroutinize the with_file templates
  > configure: validate --cook ingredient names
  > fix trailing whitespace
  > Merge 'Estimate timing overhead, allow failing if it is too high' from Travis Downs
    perf_tests: document overhead column and threshold options
    perf_tests: add measurement overhead tracking and warnings
    perf_tests: remove inline/hot attributes from time_measurement methods
    perf_tests: move time_measurement class to implementation file
    perf_tests: move perf counters into time_measurement singleton
  > rpc: log handler type
  > Merge 'Add pre-commit with trailing whitespace hook' from Travis Downs
    Add GitHub Actions workflow for pre-commit enforcement
    Add pre-commit setup documentation to HACKING.md
    Add pre-commit configuration with trailing-whitespace hook
    Remove trailing whitespace from source files
  > posix-stack: Make internal::posix_connect() resolve exceptions into futures
  > sstring: fix npos to be size_t for consistency with std::string

Closes scylladb/scylladb#28954
2026-03-10 22:06:58 +02:00
Szymon Wasik
7fae78d2b0 types: optimize reading vector fragments
There was a redundant work in split_fragmented(): value_length_if_fixed() was
called inside the loop (N virtual calls), and no reserve() was done
on the output vector causing repeated reallocations.

This patch reserves the output vector to _dimension and
caches value_length_if_fixed() before the loop.

Additionally, split read_vector_element() into two specialized functions:
read_vector_element_fixed() and read_vector_element_variable(), and hoist
the branch on fixed_len outside the loop in split_fragmented() and
deserialize_loop(). This avoids a conditional branch per element in the
hot path.

Benchmark results (1024-dim float vector, release build, -O3 -flto):
10.34 us ->  7.45 us  (1.39x, 28% faster)
2026-03-10 20:17:31 +01:00
Szymon Wasik
6c0ef8eb92 types: optimize vector deserialization for high-dimensional vectors
One of the performance bottlenecks while deserializing vectors was
per-element virtual dispatch in deserialize(): each of the N elements
went through visit() which switches on ~28 type variants. For a
1024-dimension float vector, that's 1024 redundant type switches when
the element type is the same for all of them.

This patch introduces deserialize_vector_visitor that dispatches on the element
type once for the entire vector, then loops inside the resolved
handler. Simple numeric types (float, int, etc.) call
deserialize_value() directly with no virtual dispatch per element.
String types (ascii, utf8) get a dedicated handler that skips
make_empty() (sstring has no empty_t constructor). Complex types
(list, map, tuple, etc.) fall back to per-element dispatch.

Benchmark results (1024-dim float vector, release build, -O3 -flto):
15.73 us -> 11.70 us  (1.34x, 26% faster)
2026-03-10 18:21:34 +01:00
Andrzej Jackowski
9247dff8c2 reader_concurrency_semaphore: fix leak workaround
`e4da0afb8d5491bf995cbd1d7a7efb966c79ac34` introduces a protection
against resources that are "made up" of thin air to
`reader_concurrency_semaphore`. If there are more `_resources` than
the `_initial_resources`, it means there is a negative leak, and
`on_internal_error_noexcept` is called. In addition to it,
`_resources` is set to `std::max(_resources, _initial_resources)`.

However, the commit message of `e4da0afb8d5491bf995cbd1d7a7efb966c79ac34`
states the opposite: "The detection also clamps the
_resources to _initial_resources, to prevent any damage".

Before this commit, the protection mechanism doesn't clamp
`_resources` to `_initial_resources` but instead keeps `_resources` high,
possibly even indefinitely growing. This commit changes `std::max` to
`std::min` to make the code behave as intended.

Refs: SCYLLADB-163

Closes scylladb/scylladb#28982
2026-03-10 18:57:31 +02:00
Szymon Wasik
74d86d3fe9 vector_search: Avoid quadratic reallocation in response_content_to_sstring
Pre-compute the total size and allocate a single uninitialized sstring
before copying the buffers, following the pattern from Seastar's
read_entire_stream_contiguous(). This avoids iterative reallocation
which is O(n^2) for large responses.
2026-03-10 17:45:55 +01:00
Szymon Wasik
d27610f138 vector_store_client: Return HTTP error description, not just code
This simple patch adds support for storing the HTTP error description
that Vector Store client receives from vector store. Until now it was
just printed to the log but it was not returned. For this reason it
was not forwarded to the drivers which forced users to access ScyllaDB
server logs to understand what is wrong with Vector Store.

This patch also updates formatter to print the message next to the
error code.

Fixes: VECTOR-189
2026-03-10 17:22:30 +01:00
Botond Dénes
81e214237f Merge 'Add digests for all sstable components in scylla metadata' from Taras Veretilnyk
This pull request adds support for calculation and storing CRC32 digests for all SSTable components.
This change replaces plain file_writer with crc32_digest_file_writer for all SSTable components that should be checksummed. The resulting component digests are stored in the sstable structure
and later persisted to disk as part of the Scylla metadata component during writer::consume_end_of_stream.
Several test cases where introduced to verify expected behaviour.

Additionally, this PR adds new rewrite component mechanism for safe sstable component rewriting.
Previously, rewriting an sstable component (e.g., via rewrite_statistics) created a temporary file that was renamed to the final name after sealing. This allowed crash recovery by simply removing the temporary file on startup.

However, with component digests stored in scylla_metadata (#20100),
replacing a component like Statistics requires atomically updating both the component
and scylla_metadata with the new digest - impossible with POSIX rename.

The new mechanism creates a clone sstable with a fresh generation:

- Hard-links all components from the source except the component being rewritten and scylla_metadata
- Copies original sstable components pointer and recognized components from the source
- Invokes a modifier callback to adjust the new sstable before rewriting
- Writes the modified component along with updated scylla_metadata containing the new digest
- Seals the new sstable with a temporary TOC
- Replaces the old sstable atomically, the same way as it is done in compaction

This is built on the rewrite_sstables compaction framework to support batch operations (e.g., following incremental repair).
In case of any failure durning the whole process, sstable will be automatically deleted on the node startup due to
temporary toc persistence.

Backport is not required, it is a new feature

Fixes https://github.com/scylladb/scylladb/issues/20100, https://github.com/scylladb/scylladb/issues/27453

Closes scylladb/scylladb#28338

* github.com:scylladb/scylladb:
  docs: document components_digests subcomponent and trailing digest in Scylla.db
  sstable_compaction_test: Add tests for perform_component_rewrite
  sstable_test: add verification testcases of SSTable components digests persistance
  sstables: store digest of all sstable components in scylla metadata
  sstables: replace rewrite_statistics with new rewrite component mechanism
  sstables: add new rewrite component mechanism for safe sstable component rewriting
  compaction: add compaction_group_view method to specify sstable version
  sstables: add null_data_sink and serialized_checksum for checksum-only calculation
  sstables: extract default write open flags into a constant
  sstables: Add write_simple_with_digest for component checksumming
  sstables: Extract file writer closing logic into separate methods
  sstables: Implement CRC32 digest-only writer
2026-03-10 16:02:53 +02:00
Aleksandra Martyniuk
e02b0d763c service: fix indentation 2026-03-10 14:44:52 +01:00
Aleksandra Martyniuk
02257d1429 test: add test_tablet_repair_wait
Add a test that checks if tablet_virtual_task::wait won't fail if
tablets are merged.
2026-03-10 14:42:27 +01:00
Aleksandra Martyniuk
0e0070f118 service: remove status_helper::tablets
Currently, status_helper::tablets, which keeps a vector of processed
tablet ids, is used only in tablet_virtual_task::get_status_helper,
so there is no point in returning it. Also, in get_status_helper,
it is used only to determine if any tablets are processed.

Remove status_helper::tablets. Use a flag instead of the vector
in get_status_helper.
2026-03-10 14:42:27 +01:00
Aleksandra Martyniuk
e5928497ce service: tasks: scan all tablets in tablet_virtual_task::wait
Currently, for repair tasks tablet_virtual_task::wait gathers the
ids of tablets that are to be repaired. The gathered set is later
used to check if the repair is still ongoing.

However, if the tablets are resized (split or merged), the gathered
set becomes irrelevant. Those, we may end up with invalid tablet id
error being thrown.

Wait until repair is done for all tablets in the table.
2026-03-10 14:42:21 +01:00
Andrei Chekun
c36df5ecf4 test.py: eliminite drivers exception
There is a race condition in driver that raises the RuntimeException.
This pollutes the output, so this PR is just silencing this exception.

Fixes: SCYLLADB-900

Closes scylladb/scylladb#28957
2026-03-10 14:31:36 +02:00
Alex
3ac4e258e8 transport/messages: hold pinned prepared entry in PREPARE result
result_message::prepared now owns a strong pinned prepared-cache entry instead of relying only on a weak pointer view. This closes the remaining lifetime gap after query_processor::prepare() returns, so users of the returned PREPARE message cannot observe an invalidated weak handle during subsequent
  processing.

  - update result_message::prepared::cql constructor to accept pinned entry
  - construct weak view from owned pinned entry inside the message
  - pass pinned cache entry from query_processor::prepare() into the message constructor
2026-03-10 14:17:57 +02:00
Alex
27051d9a7c cql3: pin prepared cache entry in prepare() to avoid invalid weak handle race
query_processor::prepare() could race with prepared statement invalidation: after loading from the prepared cache, we converted the cached object to a checked weak pointer and then continued asynchronous work (including error-injection waitpoints). If invalidation happened in that window, the weak
  handle could no longer be promoted and the prepare path could fail nondeterministically.

  This change keeps a strong cache entry reference alive across the whole critical section in prepare() by using a pinned cache accessor (get_pinned()), and only deriving the weak handle while the entry is pinned. This removes the lifetime gap without adding retry loops.

  Test coverage was extended in test/cluster/test_prepare_race.py:

  - reproduces the invalidation-during-prepare window with injection,
  - verifies prepare completes successfully,
  - then invalidates again and executes the same stale client prepared object,
  - confirms the driver transparently re-requests/re-prepares and execution succeeds.

  This change introduces:

  - no behavior change for normal prepare flow besides stronger lifetime guarantees,
  - no new protocol semantics,
  - preserves existing cache invalidation logic,
  - adds explicit cluster-level regression coverage for both the race and driver reprepare path.
  - pushes the re prepare operation twards the driver, the server will return unprepared error for the first time and the driver will have to re prepare during execution stage
2026-03-10 14:17:57 +02:00
Piotr Dulikowski
37f8cdf485 Merge 'test.py: fix unawaited ScyllaLogFile.grep() coroutines' from Andrei Chekun
Fixed several places where ScyllaLogFile.grep() was called without await, resulting in checking coroutine objects for truthiness instead of actual log matches.

Fixes: SCYLLADB-903

No backport, framework fix and one test fix.

Closes scylladb/scylladb#28909

* github.com:scylladb/scylladb:
  test.py: fix unawaited ScyllaLogFile.grep() coroutines
  tests: fix test_group0_recovers_after_partial_command_application
2026-03-10 12:29:23 +01:00
Dario Mirovic
f72081194c db: use prefix tombstones in DROP TABLE schema mutations
When dropping a table, make_drop_table_or_view_mutations() creates
a point tombstone in system_schema.columns for every column in the table.

The clustering key of system_schema.columns is (table_name, column_name).
A clustering key with only the table_name component acts as a prefix
tombstone. That tombstone covers all columns belonging to that table.
This approach is already used by make_table_deleting_mutations() during
CREATE TABLE.

Apply the same prefix tombstone approach to DROP TABLE for the columns,
view_virtual_columns, computed_columns, and dropped_columns schema tables.
This reduces tombstone accumulation in schema table sstables.

In test_max_cells test case, which repeatedly creates and drops a table
with 32768 columns, overall test time improved from ~180s to ~157s, which
is ~12.7% improvement.

Refs SCYLLADB-815

Closes scylladb/scylladb#28976
2026-03-10 11:59:00 +01:00
Gleb Natapov
b59b3d4f8a service level: remove version 1 service level code 2026-03-10 10:46:48 +02:00
Gleb Natapov
b633ec1779 features: move GROUP0_SCHEMA_VERSIONING to deprecated features list 2026-03-10 10:46:48 +02:00
Gleb Natapov
40ec0d4942 migration_manager: remove unused forward definitions 2026-03-10 10:46:48 +02:00
Gleb Natapov
aa9eb0ef8c test: remove unused code 2026-03-10 10:46:48 +02:00
Gleb Natapov
4660f908f9 auth: drop auth_migration_listener since it does nothing now 2026-03-10 10:46:48 +02:00
Gleb Natapov
74b5a8d43d schema: drop schema_registry_entry::maybe_sync() function
Schema is synced through group0 now. Drop all the test of the function
as well.
2026-03-10 10:46:47 +02:00
Gleb Natapov
b9f3281af6 schema: drop make_table_deleting_mutations since it should not be needed with raft
Also remove the test since it is no longer relevant
2026-03-10 10:46:47 +02:00
Gleb Natapov
f76199e5c2 schema: remove calculate_schema_digest function
It is used by the test only, so remove the test and its data as well.
2026-03-10 10:46:47 +02:00
Gleb Natapov
08e33ad7f7 schema: drop recalculate_schema_version function and its uses
There is no need to recalculate schema version any more since it is set
by group0.
2026-03-10 10:46:39 +02:00
Gleb Natapov
7bb334a5dd migration_manager: drop check for group0_schema_versioning feature
We do not allow upgrading from a version that does not have it any
longer.
2026-03-10 10:39:59 +02:00