Commit Graph

49613 Commits

Author SHA1 Message Date
Karol Nowacki
6da598fa4a vector_store_client_test: Ensure cleanup on exception
Move the mock/test server shutdown into a `finally()` block to
guarantee cleanup even if the test case throws an exception.
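
A minimal sketch of the idea in standard C++, which has no `finally`: a hypothetical RAII guard stands in for the `finally()` block mentioned above. `cleanup_guard`, `mock_server` and `run_test_case` are illustrative names only, not the test's real code.

```c++
#include <cassert>
#include <functional>
#include <stdexcept>
#include <utility>

// Hypothetical guard: runs the stored callback when the scope is left,
// whether normally or through an exception.
class cleanup_guard {
    std::function<void()> _f;
public:
    explicit cleanup_guard(std::function<void()> f) : _f(std::move(f)) {}
    cleanup_guard(const cleanup_guard&) = delete;
    cleanup_guard& operator=(const cleanup_guard&) = delete;
    ~cleanup_guard() { _f(); }
};

// Hypothetical stand-in for the mock/test server.
struct mock_server {
    bool running = true;
    void stop() { running = false; }
};

// The server is stopped even when the test body throws.
inline void run_test_case(mock_server& server, bool fail) {
    cleanup_guard guard([&server] { server.stop(); });
    if (fail) {
        throw std::runtime_error("test case failed");
    }
}
```

With this shape the shutdown cannot be skipped by an early exit from the test body.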
2025-09-24 10:41:37 +02:00
Karol Nowacki
381586f1b8 vector_store_client_test: Fix unreliable unavailable port tests
The `generate_unavailable_localhost_port` function is not robust because it
can suffer from a race condition. It finds an available port but does not
keep it occupied, meaning another process could bind to it before the test
can use it.

The `unavailable_server` helper is a more robust solution. It creates a
server that listens on a port for its entire lifetime and immediately
closes any incoming connections. This guarantees the port remains
unavailable, making the test more reliable.
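
The difference between "find a free port" and "hold the port" can be sketched with plain POSIX sockets. This is an illustration under assumptions, not the test helper itself: `held_port` and `can_bind` are hypothetical names, and the accept-and-close loop of `unavailable_server` is left out.

```c++
#include <arpa/inet.h>
#include <cassert>
#include <cstdint>
#include <netinet/in.h>
#include <stdexcept>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical POSIX-only helper: binds an ephemeral localhost port and
// keeps it bound for the lifetime of the object.
class held_port {
    int _fd = -1;
    uint16_t _port = 0;
public:
    held_port() {
        _fd = ::socket(AF_INET, SOCK_STREAM, 0);
        if (_fd < 0) {
            throw std::runtime_error("socket() failed");
        }
        sockaddr_in addr{};
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        addr.sin_port = 0; // let the kernel pick a free port
        socklen_t len = sizeof(addr);
        if (::bind(_fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0 ||
            ::listen(_fd, 1) < 0 ||
            ::getsockname(_fd, reinterpret_cast<sockaddr*>(&addr), &len) < 0) {
            ::close(_fd);
            throw std::runtime_error("bind/listen/getsockname failed");
        }
        _port = ntohs(addr.sin_port);
    }
    held_port(const held_port&) = delete;
    held_port& operator=(const held_port&) = delete;
    ~held_port() { ::close(_fd); }
    uint16_t port() const { return _port; }
};

// Returns true if a fresh socket can bind the given localhost port right now.
inline bool can_bind(uint16_t port) {
    int fd = ::socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        return false;
    }
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);
    bool ok = ::bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) == 0;
    ::close(fd);
    return ok;
}
```

While a `held_port` object is alive, `can_bind(h.port())` is expected to fail, so no other process can take the port between the lookup and the test's use of it.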
2025-09-24 10:23:24 +02:00
Aleksandra Martyniuk
48bbe09c8b test: fix test_two_tablets_concurrent_repair_and_migration_repair_writer_level
test_two_tablets_concurrent_repair_and_migration_repair_writer_level waits
for the first node that logs info about repair_writer using asyncio.wait.
The done group is never awaited, so we never learn about the error.

The test itself is incorrect and the log about repair_writer is never
printed. We never learn about that, and the test finishes successfully
after the 10-minute timeout.

Fix the test:
- disable hinted handoff;
- repair tablets of the whole table:
  - new table is added so that concurrent migration is possible;
- use wait_for_first_completed that awaits done group;
- do some cleanups.

Remove nightly mark.

Fixes: #26148.

Closes scylladb/scylladb#26209
2025-09-24 06:40:45 +03:00
Avi Kivity
2239474a87 Merge 'tablets: scheduler: Balance racks separately when rf_rack_valid_keyspaces is true' from Tomasz Grabiec
Greatly improves performance of plan making, because we don't consider
candidates in other racks, most of which will fail to be selected due
to replication constraints (no rack overload). Also (but minor)
reduces the overhead of candidate evaluation, as we don't have to
evaluate rack load.

Enabled only for rf_rack_valid_keyspaces because such setups guarantee
that we will not need (because we must not) move tablets across racks,
and we don't need to execute the general algorithm for the whole DC.

Tested with perf-load-balancing, which performs a single scale-out
operation on a cluster which initially has 10 nodes 88 shards each, 2
racks, RF=2, 70 tables, 256 tablets per table. Scale out adds 6 new
nodes (same shard count). Time to rebalance the cluster (plan making
only, sum of all iterations, no streaming):

Before:  16 min 25 s
After:    0 min 25 s

Before, plan making cost (single incremental iteration) alternated
between fast (0.1 [s]) and slow (14.1 [s]):

  testlog - Rebalance iteration 7 took 14.156 [s]: mig=88, bad=88, first_bad=17741, eval=93874484, skiplist=0, skip: (load=0, rack=17653, node=0)
  testlog - Rebalance iteration 8 took 0.143 [s]: mig=88, bad=88, first_bad=88, eval=865407, skiplist=0, skip: (load=0, rack=0, node=0)

The slow run chose min and max nodes in different racks, hence the
fast path failed to find any candidates and we switched to exhaustive
search of candidates in other nodes.

After, all iterations are fast (0.1 [s] per rack, 0.2 [s] per plan-making). The plan is twice as large because it combines the output of two subsequent (pre-patch) plan-making calls.

Fixes #26016

Closes scylladb/scylladb#26017

* github.com:scylladb/scylladb:
  test: perf: perf-load-balancing: Add parallel-scaleout scenario
  test: perf: perf-load-balancing: Convert to tool_app_template
  tablets: scheduler: Balance racks separately when rf_rack_valid_keyspaces is true
2025-09-23 22:45:35 +03:00
Tomasz Grabiec
981592bca5 tablet: scheduler: Do not emit conflicting migrations in the plan
Plan-making is invoked independently for different DCs (and in the
future, racks) and then plans are merged. It could be that the same
tablets are selected for migration in different DCs. Only one
migration will prevail and be committed to group0, so it's not a
correctness problem. The next cycle will recognize that the tablet is in
transition, and the plan-maker will not select it. But it makes
plan-making less efficient.

It may also surprise consumers of the plan, like we saw in #25912.

So we should make the plan-maker aware of already scheduled transitions
and not consider those tablets as candidates.
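
A minimal sketch of the idea, assuming simplified stand-in types: `tablet_id`, `migration` and `make_plan` are hypothetical and far simpler than the real scheduler. Every plan-making call shares one set of already scheduled tablets and skips candidates found in it.

```c++
#include <cassert>
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-ins for the scheduler types.
using tablet_id = std::pair<uint64_t, uint64_t>; // (table, tablet index)

struct migration {
    tablet_id tablet;
    int src_node;
    int dst_node;
};

// Builds a plan from the candidates, skipping tablets that already have a
// scheduled transition. Chosen tablets are recorded in `scheduled`, so a
// later call (e.g. for another DC) will not pick them again.
inline std::vector<migration> make_plan(const std::vector<migration>& candidates,
                                        std::set<tablet_id>& scheduled) {
    std::vector<migration> plan;
    for (const auto& m : candidates) {
        // insert() reports whether the tablet was new to the set.
        if (scheduled.insert(m.tablet).second) {
            plan.push_back(m);
        }
    }
    return plan;
}
```

Calling `make_plan` once per DC with the same `scheduled` set yields plans that can be merged without conflicting migrations.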

Fixes #26038

Closes scylladb/scylladb#26048
2025-09-23 22:40:08 +03:00
Łukasz Paszkowski
5089ffe06f tools: toolchain: add e2fsprogs, fuse3 to the dependencies
The packages contain filesystem utilities to create volumes such
that sudo/unshare are not required.

Closes #26135

[avi: regenerate frozen toolchain with optimized clang from
  https://devpkg.scylladb.com/clang/clang-20.1.8-Fedora-42-aarch64.tar.gz
  https://devpkg.scylladb.com/clang/clang-20.1.8-Fedora-42-x86_64.tar.gz
]

Closes scylladb/scylladb#26165
2025-09-23 18:49:37 +03:00
Botond Dénes
f9172c934a Merge 'comparable_bytes: handle counter type' from Lakshmi Narayanan Sreethar
Byte comparable format is not supported for counter types. This patch
adds explicit handling for them for completeness, allowing the default
abstract type handler to be removed.

Refs #19407

New features. No need to backport.

Closes scylladb/scylladb#26206

* github.com:scylladb/scylladb:
  types/comparable_bytes: remove default abstract type handler
  types/comparable_bytes: handle counter type
2025-09-23 18:23:49 +03:00
Pavel Emelyanov
f6e8a14fb0 Update seastar submodule
Includes a fix for scylla-gdb -- the fair_queue is now hierarchical and
priority queues are no longer accessible via the _handles member. However,
the fix is incomplete -- it silently assumes that the hierarchy is flat
(and it _is_ flat now, scylla doesn't yet create nested groups), but this
should be handled eventually.

* seastar b6be384e...c8a3515f (8):
  > Merge 'Nested scheduling groups (IO classes)' from Pavel Emelyanov
    test: Add test case for wake-from-idle accumulator fixups
    test: Add fair_queue test for queues activations
    test: Expand fair queue random run test with groups
    test: Add test for basic fair-queue nested linkage
    test: Cleanup scheduling groups after io_queue_test cases
    code: Update IO priority group shares from supergroup shares change
    io_queue: Register class groups in fair-queue
    fair_queue: Add test class friendship
    fair_queue: Move nr_classes into group_data
    fair_queue: Fix indentation after previous patch
    fair_queue: Implement hierarchical queue activation (wakeup)
    fair_queue: Remove now unused push/pop helpers
    fair_queue: Implement hierarchical priority_entry::pop_front()
    fair_queue: Implement hierarchical priority_entry::top()
    fair_queue: Link priority_entries into tree
    fair_queue: Add priority_class_group_data::reserve()
    fair_queue: Move more bits onto priority_entry
    fair_queue: Move shares on priority_entry
    fair_queue: Move last_accumulated on priority_class_group_data
    fair_queue: Introduce priority_class_group_data
    fair_queue: Inherit priority_class_data from priority_entry
    fair_queue: Rename priority_class_ptr
    ioqueue: Opencode get_class_info() helper
    ioq: Move fair_queue class registration down
  > tls: Rework session termination
  > http/request: get_query_param(): add default_value argument
  > http/request: add has_query_param()
  > sharded: Deprecate distributed alias
  > io_tester: Allow configuring sloppy_size_hint for files
  > file: Remove duplicating static_assert-ions

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#26143
2025-09-23 18:20:45 +03:00
Patryk Jędrzejczak
da44d6af09 Merge 'Move some compaction manager API handlers from storage_service.cc to tasks.cc' from Pavel Emelyanov
There are a number of /storage_service/... endpoints that start compaction manager tasks and wait for them. Most of them have an async peer in /tasks/... that starts the very same task but returns to the caller with the task ID.

This patch moves those handlers' code from storage_service.cc to tasks.cc, next to the corresponding async peers, to keep handlers that need compaction_manager in one place.

That's preparation for more future changes. Later all those endpoints will stop using database from http_context and will capture the compaction_manager they need from main, like it was done in #20962 for /compaction_manager/... endpoints. Even "more later", the former and the latter blocks of endpoints will be registered and unregistered together, e.g. like database endpoints were collected in one reg/unreg sequence by #25674.

Part of http_context dependencies cleanup effort, no need to backport.

Closes scylladb/scylladb#26140

* https://github.com/scylladb/scylladb:
  api: Move /storage_service/compact to tasks.cc
  api: Move /storage_service/keyspace_upgrade_sstables to tasks.cc
  api: Move /storage_service/keyspace_offstrategy_compaction to tasks.cc
  api: Move /storage_service/keyspace_cleanup to tasks.cc
  api: Move /storage_service/keyspace_compaction to tasks.cc
2025-09-23 15:08:48 +02:00
Lakshmi Narayanan Sreethar
0914978605 types/comparable_bytes: remove default abstract type handler
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-09-23 13:44:59 +05:30
Lakshmi Narayanan Sreethar
ee1e648a7f types/comparable_bytes: handle counter type
Byte comparable format is not supported for counter types. This patch
adds explicit handling for them for completeness, allowing the default
abstract type handler to be removed in the next patch.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-09-23 13:44:59 +05:30
Andrzej Jackowski
c8f45dbbb2 test: speed up test_long_query_timeout_erm
`test_long_query_timeout_erm` is slow because it has many parameterized
variants, and it verifies timeout behavior during ERM operations, which
are slow by nature.

This change speeds the test up by roughly 3× (319s -> 114s) by:
 - Removing two of the five scenarios that were near duplicates.
 - Shortening timeout values to reduce waiting time.
 - Parallelizing waiting on server_log with asyncio.TaskGroup().

The two removed scenarios (`("SELECT", True, False)`,
`("SELECT_WHERE", True, False)`) were near duplicates of the
`("SELECT_COUNT_WHERE", True, False)` scenario, because all three
scenarios use a non-mapreduce query and trigger basically the same
system behavior. It is sufficient to keep only one of them, so the test
verifies three cases:
 - One with nodes shutdown
 - One with mapreduce query
 - One with non-mapreduce query

Fixes: scylladb/scylla#24127

Closes scylladb/scylladb#25987
2025-09-23 10:28:07 +03:00
Piotr Dulikowski
482ddfb3b4 Merge 'mv: handle mismatched base/view replica count caused by RF change' from Wojciech Mitros
During an ALTER KEYSPACE statement execution where a table with a view
is present, we need to perform tablet migrations for both tables.
These migrations are not synchronized, so at some point the base may
have a different number of non-pending replicas than the view. Because
of that, we can't pair them correctly. If there is more non-pending
base replicas than view replicas, we don't need to do anything because
the view replica that didn't finish migrating is a pending replica
and will get view updates from all base replicas. But if there is more
non-pending view replicas than base replicas, we may currently lose
view updates to the new view replica.

This patch adds a workaround for this scenario. If after one migration
we have too more non-pending view replicas than base replicas, we add
it to the pending replica list so that it gets an update anyway.
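
The workaround can be illustrated with a deliberately simplified sketch. It assumes index-based pairing (base replica i pairs with view replica i), which is an assumption made for illustration and not the real pairing logic; `host_id`, `view_update_targets` and `pick_targets` are hypothetical names.

```c++
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using host_id = int; // hypothetical stand-in for the real host id type

struct view_update_targets {
    std::vector<host_id> paired;  // at most one paired view replica
    std::vector<host_id> pending; // replicas updated by every base replica
};

// Assumed model: base replica `base_index` pairs with the view replica at
// the same index. Surplus non-pending view replicas have no base pair, so
// they are treated like pending replicas and still receive the update.
inline view_update_targets pick_targets(std::size_t base_index,
                                        std::size_t base_count,
                                        const std::vector<host_id>& view_replicas,
                                        std::vector<host_id> pending) {
    assert(base_index < base_count);
    view_update_targets t;
    t.pending = std::move(pending);
    if (base_index < view_replicas.size()) {
        t.paired.push_back(view_replicas[base_index]);
    }
    for (std::size_t i = base_count; i < view_replicas.size(); ++i) {
        t.pending.push_back(view_replicas[i]);
    }
    return t;
}
```

In this model, two base replicas and three view replicas leave the third view replica unpaired, and it ends up in `pending` for every base replica.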

This patch will also take effect if the base and view replica counts
differ due to some other bug. To track that, a new metric is added
to count such occurrences.

This patch also includes a test for this exact scenario, which is enforced by an injection.

Fixes https://github.com/scylladb/scylladb/issues/21492

Closes scylladb/scylladb#24396

* github.com:scylladb/scylladb:
  mv: handle mismatched base/view replica count caused by RF change
  mv: save the nodes used for pairing calculations for later reuse
  mv: move the decision about simple rack-aware pairing later
2025-09-23 08:10:08 +02:00
Dawid Mędrek
35f7d2aec6 db/batchlog: Drop batch if table has been dropped
If there are pending mutations in the batchlog for a table that
has been dropped, we'll keep attempting to replay them but with
no success -- `db::no_such_column_family` exceptions will be thrown,
and we'll keep trying again and again.

To prevent that, we drop the batch in that case just like we do
in the case of a non-existing keyspace.

A reproducer test has been included in the commit. It fails without
the changes in `db/batchlog_manager.cc`, and it succeeds with them.

Fixes scylladb/scylladb#24806

Closes scylladb/scylladb#26057
2025-09-23 07:48:59 +02:00
Tomasz Grabiec
2b03a69065 test: perf: perf-load-balancing: Add parallel-scaleout scenario
Simulates rebalancing on a single scale-out involving simultaneous
addition of multiple nodes per rack.

Default parameters create a cluster with 2 racks, 70 tables, 256
tablets/table, 10 nodes, 88 shards/node.
Adds 6 nodes in parallel (3 per rack).

Current result on my laptop:

  testlog - Rebalance took 21.874 [s] after 82 iteration(s)
2025-09-23 00:31:31 +02:00
Tomasz Grabiec
0dcaaa061e test: perf: perf-load-balancing: Convert to tool_app_template
To support sub-commands for testing different scenarios.

The current scenario is given the name "rolling-add-dec".
2025-09-23 00:30:38 +02:00
Tomasz Grabiec
c9f0a9d0eb tablets: scheduler: Balance racks separately when rf_rack_valid_keyspaces is true
Greatly improves performance of plan making, because we don't consider
candidates in other racks, most of which will fail to be selected due
to replication constraints (no rack overload). Also (but minor)
reduces the overhead of candidate evaluation, as we don't have to
evaluate rack load.

Enabled only for rf_rack_valid_keyspaces because such setups guarantee
that we will not need (because we must not) move tablets across racks,
and we don't need to execute the general algorithm for the whole DC.

Tested with perf-load-balancing, which performs a single scale-out
operation on a cluster which initially has 10 nodes 88 shards each, 2
racks, RF=2, 70 tables, 256 tablets per table. Scale out adds 6 new
nodes (same shard count). Time to rebalance the cluster (plan making
only, sum of all iterations, no streaming):

Before: 16 min 25 s
After: 0 min 25 s

Before, plan making cost (single incremental iteration) alternated
between fast (0.1 [s]) and slow (14.1 [s]):

  Rebalance iteration 7 took 14.156 [s]: mig=88, bad=88, first_bad=17741, eval=93874484, skiplist=0, skip: (load=0, rack=17653, node=0)
  Rebalance iteration 8 took 0.143 [s]: mig=88, bad=88, first_bad=88, eval=865407, skiplist=0, skip: (load=0, rack=0, node=0)

The slow run chose min and max nodes in different racks, hence the
fast path failed to find any candidates and we switched to exhaustive
search of candidates in other nodes.

After, all iterations are fast (0.1 [s] per rack, 0.2 [s] per plan-making).
The plan is twice as large because it combines the output of two subsequent (pre-patch)
plan-making calls.

Fixes #26016
2025-09-23 00:30:37 +02:00
Patryk Jędrzejczak
a56115f77b test: deflake driver reconnections in the recovery procedure tests
All three tests could hit
https://github.com/scylladb/python-driver/issues/295. We use the
standard workaround for this issue: reconnecting the driver after
the rolling restart, and before sending any requests to local tables
(that can fail if the driver closes a connection to the node that
restarted last).

All three tests perform two rolling restarts, but the latter ones
already have the workaround.

Fixes #26005

Closes scylladb/scylladb#26056
2025-09-22 17:21:06 +02:00
Andrzej Jackowski
15e71ee083 test: audit: stop using datetime.datetime.now() in syslog converter
`line_to_row` is a test function that converts `syslog` audit log to
the format of `table` audit log so tests can use the same checks
for both types of audit. Because `syslog` audit doesn't have `date`
information, the field was filled with the current date. This behavior
broke the tests running at 23:59:59 because `line_to_row` returned
different results on different days.

Fixes: scylladb/scylladb#25509

Closes scylladb/scylladb#26101
2025-09-22 15:31:33 +03:00
Pavel Emelyanov
b23aab882a Merge 'test/alternator: multiple fixes for tests so they would pass on DynamoDB' from Nadav Har'El
Issue #26079 noted that multiple Alternator tests are failing when run against DynamoDB. This pull request fixes many of them, in several small patches. In one case we need to avoid a DynamoDB bug that wasn't even the point of the original test (and we create a new test specifically for that DynamoDB bug). Another test exposed a real incompatibility with Alternator (#26103) but didn't need to be exposed in this specific test, so again we split the test into one that passes and another one that xfails on Alternator (not on DynamoDB). A bigger change had to be made to the tags feature test: since August 2024, the TagResource operation has been asynchronous, which broke our tests, so we fix this.

Each of these changes is described in more detail in the individual patches.

Refs #26079. It doesn't fix it completely because there are some tests which remain flaky, and some tests which, surprisingly, pass on us-east-1 but fail on eu-north-1. We'll need to address the rest later.

No backports needed, we only run tests against DynamoDB from master (when we rarely do...), not on old branches.

Closes scylladb/scylladb#26114

* github.com:scylladb/scylladb:
  test/alternator: fix test_list_tables_paginated on DynamoDB
  test/alternator: fix tests in test_tag.py on DynamoDB
  test/alternator: fix test_health_only_works_for_root_path on DynamoDB
  test/alternator: reproducer tests for faux GSI range key problem
  test/alternator: fix test "test_17119a" to pass on DynamoDB
  test/alternator: fix test to pass on DynamoDB
2025-09-22 15:30:40 +03:00
Pavel Emelyanov
f6860d1de0 Merge 'mv: run view building worker fibers in streaming group' from Piotr Dulikowski
The background fibers of the view building worker are indirectly spawned by the main function, thus the fibers inherit the "main" scheduling group. The main scheduling group is not supposed to be used for regular work, only for initialization and deinitialization, so this is wrong.

Wrap the call to `start_backgroud_fibers()` with `with_scheduling_group` and use the streaming scheduling group. The view building worker already handles RPCs in the streaming scheduling group (which do most of the work; background fibers only do some maintenance), so this seems like a good fit.

No need to backport, view build coordinator is not a part of any release yet.

Closes scylladb/scylladb#26122

* github.com:scylladb/scylladb:
  mv: fix typo in start_backgroud_fibers
  mv: run view building worker fibers in streaming group
2025-09-22 15:28:38 +03:00
Pavel Emelyanov
ce8dd798a2 Merge 'tools/scylla-sstable-scripts: introduce purgeable.lua and writetime-histogram.lua' from Botond Dénes
`purgeable.lua` was written for a specific investigation a few years ago.
`writetime-histogram.lua` is an sstable script transcription of the former scylla-sstable writetime-histogram command. This was also written for an investigation (before script command existed) and is too specific to be a native command, so was removed by edaf67edcb.

Add both scripts to the sample script library; they can be useful either for a future investigation or as samples to copy+edit to write new scripts (and train AI).

New sstable scripts, no backport

Closes scylladb/scylladb#26137

* github.com:scylladb/scylladb:
  tools/scylla-sstable-scripts: introduce writetime-histogram.lua
  tools/scylla-sstable-scripts: introduce purgable.lua
2025-09-22 15:27:49 +03:00
Avi Kivity
29032213c8 test: avoid #include <boost/test/included/...>
The boost/test/included/... directory is apparently internal and not
intended for user consumption.

Including it caused a One-Definition-Rule violation, due to
boost/test/impl/unit_test_parameters.ipp containing code like this:

```c++
namespace runtime_config {

// UTF parameters
std::string btrt_auto_start_dbg    = "auto_start_dbg";
std::string btrt_break_exec_path   = "break_exec_path";
std::string btrt_build_info        = "build_info";
std::string btrt_catch_sys_errors  = "catch_system_errors";
std::string btrt_color_output      = "color_output";
std::string btrt_detect_fp_except  = "detect_fp_exceptions";
std::string btrt_detect_mem_leaks  = "detect_memory_leaks";
std::string btrt_list_content      = "list_content";
```

This is defining variables in a header, and so can (and in fact does)
create duplicate variable definitions, which later cause trouble.

So far, we were protected from this trouble by -fvisibility=hidden, which
caused the duplicate definitions to be in fact not duplicate.

Fix this by correcting the include path away from <boost/test/included/>.

Closes scylladb/scylladb#26161
2025-09-22 15:26:06 +03:00
Wojciech Mitros
d9b8278178 mv: handle mismatched base/view replica count caused by RF change
During an ALTER KEYSPACE statement execution where a table with a view
is present, we need to perform tablet migrations for both tables.
These migrations are not synchronized, so at some point the base may
have a different number of non-pending replicas than the view. Because
of that, we can't pair them correctly. If there are more non-pending
base replicas than view replicas, we don't need to do anything because
the view replica that didn't finish migrating is a pending replica
and will get view updates from all base replicas. But if there are more
non-pending view replicas than base replicas, we may currently lose
view updates to the new view replica.

This patch adds a workaround for this scenario. If after one migration
we have more non-pending view replicas than base replicas, we add
the extra view replica to the pending replica list so that it gets an update anyway.

This patch will also take effect if the base and view replica counts
differ due to some other bug. To track that, a new metric is added
to count such occurrences.

This patch also includes a test for this exact scenario, which is enforced by an injection.

Fixes https://github.com/scylladb/scylladb/issues/21492
2025-09-22 12:50:16 +02:00
Wojciech Mitros
59c40a2edd mv: save the nodes used for pairing calculations for later reuse
In get_view_natural_endpoint() we start with the list of host_ids
from the effective replication maps, which we later translate to
locator::node to get the information about racks and datacenters.
We check all replicas, but we only store the ones relevant for
pairing, so for tablets, the ones in the same DC as the replica
sending the update.
In the next patch, we'll occasionally need to send cross-dc view
updates, so to avoid computing the nodes again, in this patch
we adjust the logic to prepare them in advance and save them so
that they can be later reused.
2025-09-22 12:45:24 +02:00
Wojciech Mitros
9d4449a492 mv: move the decision about simple rack-aware pairing later
We'll need to get the lists for the whole dc when fixing replica
count mismatches caused by RF changes, so let's first get these lists,
and only filter them later if we decide to use simple rack-aware pairing.
2025-09-22 12:45:24 +02:00
Nadav Har'El
b205e1a3da Merge 'vector_store_client: Extract DNS logic into a dedicated class' from Karol Nowacki
Vector search related implementation moved to a new module vector_search.
As the vector search functionality is going to be extended, it is better to keep it in a separate module.

The DNS resolution logic and its background task are moved out of the `vector_store_client` and into a new, dedicated class `vector_search::dns`.

This refactoring is the first step towards supporting DNS hostnames that resolve to multiple IP addresses.

References: VECTOR-187

No backport needed as this is refactoring.

Closes scylladb/scylladb#26052

* github.com:scylladb/scylladb:
  vector_store_client_test: Verify DNS is not refreshed when disabled
  vector_store_client: Extract DNS logic into a dedicated class
  vector_search: Apply clang-format
  vector_store_client: Move to vector_search module
2025-09-22 13:24:34 +03:00
Avi Kivity
1258e7c165 Revert "Merge 'transport: service_level_controller: create and use driver service level' from Andrzej Jackowski"
This reverts commit fe7e63f109, reversing
changes made to b5f3f2f4c5. It is causing
test.py failures around cqlpy.

Fixes #26163

Closes scylladb/scylladb#26174
2025-09-22 09:32:46 +03:00
Piotr Dulikowski
b382531d99 Merge 'cdc: fix create table with cdc if not exists' from Michael Litvak
Fix an issue where executing a CREATE TABLE IF NOT EXISTS statement with
CDC enabled fails with an error if the table already exists. Instead,
the query should succeed and be a no-op.

This regression was introduced by commit fed1048059. Previously, when
executing the query, we would first check if the table exists in
do_prepare_new_column_families_announcement. If it did, we would throw
an already_exists_exception, which was handled correctly; otherwise, we
would continue and create the CDC table in the
before_create_column_families notification.

The order of operations was changed in fed1048059, causing the
regression. Now, we first create the CDC schema and add it to the schema
list for creation, and then check for each of them if they already
exist. The problem is that when we create the CDC schema in
on_pre_create_column_families, it also checks if the CDC table already
exists. If it does, it throws an invalid_request_exception, which is not
caught and handled as expected.

This patch restores the previous order of operations: we first check if
the tables exist, and only then add the CDC schema in pre_create.

Fixes https://github.com/scylladb/scylladb/issues/26142

no backport - recent regression, not released yet

Closes scylladb/scylladb#26155

* github.com:scylladb/scylladb:
  test: add test for creating table with CDC enabled if not exists
  cdc: fix create table with cdc if not exists
2025-09-22 08:18:26 +02:00
Piotr Dulikowski
591a67c7e7 Merge 'view_builder: register view on all shards atomically' from Michael Litvak
When the view builder starts to build a new view, each shard registers
itself by writing the shard id and current token to the
scylla_views_builds_in_progress table.

Previously, this happened independently by each shard. We change it now
to register all shards "atomically" - when a shard registers itself, it
also registers all other shards with an empty status, if they aren't
registered yet. This ensures that we don't have a partial state in the
table where only some of the shards are registered, but we always have a
status for all shards.

The reason we want to register all shards atomically is that if only
some of the shards were registered, and we then restart and load the
status from the table, things don't work well for multiple
reasons.

One example is that to know how many shards we had previously, we take
the maximum shard id we see in the table. If it's different than the
current shard count, we will execute the reshard code. But of course, if
the last shard is missing from the table because it didn't register
itself, this calculation will be wrong, and we can't know the previous
number of shards.

This is a problem because suppose we have two shards, and shard 0
finished building the view but shard 1 didn't start. When we come up, we
will think that previously we had only a single shard and it completed
building everything, when in fact we built only half the view
approximately. The problem is that we don't have enough information in
the tables to know that.

There are additional problems related to reshard. In the reshard
function, whether it is executed because we actually do node reshard or
because we calculated the wrong number of previous shards, if the status
of some shard is missing then the calculation of new ranges will be
wrong. When some shard didn't make progress we should start building the
view from scratch. However, this doesn't happen if we don't have a
status for the shard, because the code looks only for shards that have a
status. In effect, this shard is considered complete even though it
didn't start. This could cause the view building to get stuck or
complete without building all token ranges.

By registering all shards atomically, this should solve the above
problems because we will always have statuses for all shards.
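
The shard-count inference and the effect of atomic registration can be sketched as follows. This is a simplified model under assumptions: `build_status`, `infer_previous_shard_count` and `register_all_shards` are hypothetical names, and the real table rows carry more than a token.

```c++
#include <cassert>
#include <map>
#include <optional>

// Hypothetical model of the rows for one view: shard id -> last processed
// token, or nullopt for an "empty" status (registered, no progress yet).
using build_status = std::map<unsigned, std::optional<long>>;

// Previous shard count inferred as described above: maximum shard id + 1.
inline unsigned infer_previous_shard_count(const build_status& rows) {
    return rows.empty() ? 0 : rows.rbegin()->first + 1;
}

// Registers one shard's progress and, in the same step, an empty status
// for every other shard that is not registered yet.
inline void register_all_shards(build_status& rows, unsigned shard,
                                long token, unsigned shard_count) {
    for (unsigned s = 0; s < shard_count; ++s) {
        rows.try_emplace(s, std::nullopt);
    }
    rows[shard] = token;
}
```

With only shard 0 of 2 registered, the inference yields 1 shard; after `register_all_shards` it yields 2, and shard 1 is visibly "not started" rather than missing.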

Fixes https://github.com/scylladb/scylladb/issues/22989

backport not needed - the issue is probably not common and there's a workaround

Closes scylladb/scylladb#25790

* github.com:scylladb/scylladb:
  test: mv: add a test for view build interrupt during registration
  view_builder: register view on all shards atomically
2025-09-22 08:03:44 +02:00
Karol Nowacki
6bd1d7db49 vector_store_client_test: Verify DNS is not refreshed when disabled
Extend the `vector_store_client_uri_update_to_empty` test case to
verify that the DNS resolver stops refreshing when the vector store is
disabled.
2025-09-22 08:02:59 +02:00
Karol Nowacki
27219b8b7c vector_store_client: Extract DNS logic into a dedicated class
The DNS resolution logic and its background task are moved out of the
`vector_store_client` and into a new, dedicated class `vector_search::dns`.

This refactoring is the first step towards supporting DNS hostnames
that resolve to multiple IP addresses.

Signed-off-by: Karol Nowacki <karol.nowacki@scylladb.com>
2025-09-22 08:01:53 +02:00
Karol Nowacki
7cc7b95681 vector_search: Apply clang-format
Run clang-format on the vector_search module to fix minor formatting
inconsistencies.
2025-09-22 08:01:50 +02:00
Karol Nowacki
eae71d3e91 vector_store_client: Move to vector_search module
Vector search related implementation moved to a new module vector_search.
As the vector search functionality is going to be extended, it is
better to keep it in a separate module.
2025-09-22 08:01:47 +02:00
Ferenc Szili
d9f272dbdd load_balancer: fix badness object creation
The load balancer introduced the idea of badness, which is a measure of
how a tablet migration affects table balance on the source and
destination. This is an abbreviated definition of the badness struct:

struct migration_badness {
    double src_shard_badness = 0;
    double src_node_badness = 0;
    double dst_shard_badness = 0;
    double dst_node_badness = 0;

    ...

    double node_badness() const {
        return std::max(src_node_badness, dst_node_badness);
    }

    double shard_badness() const {
        return std::max(src_shard_badness, dst_shard_badness);
    }
};

A negative value for any of these 4 members signifies a good
migration (improves table balance), and a positive value signifies a bad
migration.

In two places in the balancer, badness for source and destination is
computed independently in two objects of type migration_badness
(src_badness and dst_badness), and later combined into a single object
similar to this:

return migration_badness{
    src_badness.shard_badness(),
    src_badness.node_badness(),
    dst_badness.shard_badness(),
    dst_badness.node_badness()
};

This is a problem: when, for instance, the source shard badness is good
(less than 0), shard_badness() will return 0 because std::max() compares
it with the other member, which is still 0.
This way the actual computed badness is not set in the final object.
This can lead to incorrect decisions made later by the balancer, when it
searches for the best migration among a set of candidates.
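
The effect can be reproduced with a minimal sketch. The struct is the abbreviated definition from above; `combine_buggy` and `combine_direct` are hypothetical names, and `combine_direct` is only one plausible way to keep the computed values, not necessarily what this patch does.

```c++
#include <algorithm>
#include <cassert>

// Abbreviated definition from the commit message.
struct migration_badness {
    double src_shard_badness = 0;
    double src_node_badness = 0;
    double dst_shard_badness = 0;
    double dst_node_badness = 0;

    double node_badness() const {
        return std::max(src_node_badness, dst_node_badness);
    }

    double shard_badness() const {
        return std::max(src_shard_badness, dst_shard_badness);
    }
};

// The pattern described above. `src` only has its src_* members set, so its
// dst_* members are still 0 and std::max() turns any negative value into 0.
inline migration_badness combine_buggy(const migration_badness& src,
                                       const migration_badness& dst) {
    return migration_badness{
        src.shard_badness(),
        src.node_badness(),
        dst.shard_badness(),
        dst.node_badness()
    };
}

// Hypothetical alternative: copy the computed members directly so that
// negative (good) values survive.
inline migration_badness combine_direct(const migration_badness& src,
                                        const migration_badness& dst) {
    return migration_badness{
        src.src_shard_badness,
        src.src_node_badness,
        dst.dst_shard_badness,
        dst.dst_node_badness
    };
}
```

For a source with shard badness -0.5, `combine_buggy` stores 0 while `combine_direct` stores -0.5, which is the difference that matters when candidates are ranked.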

Closes scylladb/scylladb#26091
2025-09-21 21:37:23 +02:00
Dawid Mędrek
0d2560c07f test/perf/tablet_load_balancing.cc: Create nodes within one DC
In 789a4a1ce7, we adjusted the test file
to work with the configuration option `rf_rack_valid_keyspaces`. Part of
the commit was making the two tables used in the test replicate in
separate data centers.

Unfortunately, that destroyed the point of the test because the tables
no longer competed for resources. We fix that by enforcing the same
replication factor for both tables.

We still accept different values of replication factor when provided
manually by the user (via the `--rf1` and `--rf2` command-line options). Scylla
won't allow creating RF-rack-invalid keyspaces, but there's no reason
to take away the flexibility the user of the test already has.

Fixes scylladb/scylladb#26026

Closes scylladb/scylladb#26115
2025-09-21 21:36:43 +02:00
Tomasz Grabiec
ddbcea3e2a tablets: scheduler: Run plan-maker in maintenance scheduling group
Currently, it runs in the gossiper scheduling group, because it's
invoked by the topology coordinator. That scheduling group has the
same number of shares as the user workload. Plan-making can take a
significant amount of time during rebalancing, and we don't want that
to impact the user workload that happens to run on the same shard.

Reduce impact by running in the maintenance scheduling group.

Fixes #26037

Closes scylladb/scylladb#26046
2025-09-21 18:44:57 +03:00
Tomasz Grabiec
4a83b4eef3 Merge 'topology_coordinator: abort view building a bit later in case of tablet migration' from Piotr Dulikowski
In a multi-DC setup, the tablet load balancer might generate multiple migrations of the same tablet_id, but only one is actually committed to the `system.tablets` table.

This PR moves the abort of view building tasks from the very start of the migration (`<no tablet transition> -> allow_write_both_read_old`) to the next step (`allow_write_both_read_old -> write_both_read_old`). This way, we'll abort only tasks for which the tablet migration was actually started.

The PR also includes a reproducer test.

Fixes scylladb/scylladb#25912

View building coordinator hasn't been released yet, so no backport is needed.

Closes scylladb/scylladb#26144

* github.com:scylladb/scylladb:
  test/test_view_building_coordinator: add reproducer
  topology_coordinator: abort view building a bit later in case of tablet migration
2025-09-21 15:41:53 +02:00
Karol Nowacki
eedf506be5 vector_store_client: Rename vector_store_uri to vector_store_primary_uri
The configuration setting vector_store_uri is renamed to
vector_store_primary_uri according to the final design.
In the future, the vector_store_secondary_uri setting will
be introduced.

This setting now also accepts a comma-separated list of URIs to prepare
for future support for redundancy and load balancing. Currently, only the
first URI in the list is used.
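
As an illustration only, the "accept a list, use the first entry"
behaviour could look like the sketch below. split_uris and first_uri are
hypothetical helpers, not the functions in the patch, and trimming
whitespace around entries is an assumption.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>
#include <vector>

// Strip leading and trailing spaces/tabs (assumed behaviour).
inline std::string_view trim(std::string_view s) {
    const auto first = s.find_first_not_of(" \t");
    if (first == std::string_view::npos) {
        return {};
    }
    const auto last = s.find_last_not_of(" \t");
    return s.substr(first, last - first + 1);
}

// Split a comma-separated setting value into non-empty URI strings.
inline std::vector<std::string> split_uris(std::string_view value) {
    std::vector<std::string> uris;
    std::size_t pos = 0;
    while (pos <= value.size()) {
        std::size_t comma = value.find(',', pos);
        if (comma == std::string_view::npos) {
            comma = value.size();
        }
        const auto item = trim(value.substr(pos, comma - pos));
        if (!item.empty()) {
            uris.emplace_back(item);
        }
        pos = comma + 1;
    }
    return uris;
}

// Only the first URI is used for now; the rest are parsed but ignored.
inline std::optional<std::string> first_uri(std::string_view value) {
    auto uris = split_uris(value);
    if (uris.empty()) {
        return std::nullopt;
    }
    return uris.front();
}
```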

This change must be included before the next release.
Otherwise, users will be affected by a breaking change.

References: VECTOR-187

Closes scylladb/scylladb#26033
2025-09-21 16:33:10 +03:00
Michael Litvak
3dffb8e0dc test: mv: add a test for view build interrupt during registration
Add a new test that reproduces issue #22989. The test starts view
building and interrupts it by restarting the node while some shards
registered their status and some didn't.
2025-09-21 10:39:30 +02:00
Michael Litvak
6043409c31 view_builder: register view on all shards atomically
When the view builder starts to build a new view, each shard registers
itself by writing the shard id and current token to the
scylla_views_builds_in_progress table.

Previously, each shard did this independently. We change it now
to register all shards "atomically" - when a shard registers itself, it
also registers all other shards with an empty status, if they aren't
registered yet. This ensures that we don't have a partial state in the
table where only some of the shards are registered, but we always have a
status for all shards.

We want to register all shards atomically because, if only some of the
shards were registered and we then restart and load the status from the
table, this doesn't work well for multiple reasons.

One example is that to know how many shards we had previously, we take
the maximum shard id we see in the table. If it's different than the
current shard count, we will execute the reshard code. But of course, if
the last shard is missing from the table because it didn't register
itself, this calculation will be wrong, and we can't know the previous
number of shards.

This is a problem because suppose we have two shards, and shard 0
finished building the view but shard 1 didn't start. When we come up, we
will think that previously we had only a single shard and it completed
building everything, when in fact we built only half the view
approximately. The problem is that we don't have enough information in
the tables to know that.

There are additional problems related to reshard. In the reshard
function, whether it is executed because we actually do node reshard or
because we calculated the wrong number of previous shards, if the status
of some shard is missing then the calculation of new ranges will be
wrong. When some shard didn't make progress we should start building the
view from scratch. However, this doesn't happen if we don't have a
status for the shard, because the code looks only for shards that have a
status. In effect, this shard is considered complete even though it
didn't start. This could cause the view building to get stuck or
complete without building all token ranges.

Registering all shards atomically should solve the above problems,
because we will always have statuses for all shards.
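
The shard-count problem can be shown with a toy model. This is a sketch
under assumptions: status_table stands in for the
scylla_views_builds_in_progress table, tokens are plain integers, and
the function names are hypothetical rather than taken from the view
builder.

```cpp
#include <map>
#include <optional>

using shard_id = unsigned;
// shard -> current token; std::nullopt models "registered, empty status".
using status_table = std::map<shard_id, std::optional<long>>;

// The previous shard count is inferred from the maximum shard id seen.
inline unsigned inferred_shard_count(const status_table& table) {
    return table.empty() ? 0 : table.rbegin()->first + 1;
}

// Old behaviour: each shard writes only its own row.
inline void register_independently(status_table& table, shard_id self, long token) {
    table[self] = token;
}

// New behaviour: the registering shard also adds an empty status for
// every other shard that is not registered yet.
inline void register_atomically(status_table& table, shard_id self, long token,
                                unsigned shard_count) {
    for (shard_id s = 0; s < shard_count; ++s) {
        table.try_emplace(s, std::nullopt);  // no-op if the row exists
    }
    table[self] = token;
}
```

On a two-shard node where only shard 0 has registered, the old scheme
makes inferred_shard_count return 1, while the new scheme returns 2 and
leaves shard 1 with an empty status.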

Fixes scylladb/scylladb#22989
2025-09-21 10:39:05 +02:00
Evgeniy Naydanov
85cbe7a8d4 test: add test for creating table with CDC enabled if not exists
Check that there are no errors on the second attempt to execute a
"create table if not exists" query if CDC is enabled.
2025-09-21 09:38:36 +02:00
Michael Litvak
5a7e6e53ff cdc: fix create table with cdc if not exists
Fix an issue where executing a CREATE TABLE IF NOT EXISTS statement with
CDC enabled fails with an error if the table already exists. Instead,
the query should succeed and be a no-op.

This regression was introduced by commit fed1048059. Previously, when
executing the query, we would first check if the table exists in
do_prepare_new_column_families_announcement. If it did, we would throw
an already_exists_exception, which was handled correctly; otherwise, we
would continue and create the CDC table in the
before_create_column_families notification.

The order of operations was changed in fed1048059, causing the
regression. Now, we first create the CDC schema and add it to the schema
list for creation, and then check for each of them if they already
exist. The problem is that when we create the CDC schema in
on_pre_create_column_families, it also checks if the CDC table already
exists. If it does, it throws an invalid_request_exception, which is not
caught and handled as expected.

This patch restores the previous order of operations: we first check if
the tables exist, and only then add the CDC schema in pre_create.

Fixes scylladb/scylladb#26142
2025-09-21 09:38:36 +02:00
Michał Hudobski
1690e5265a vector search: correct column name formatting
This patch corrects the column name formatting whenever
an "Undefined column name" exception is thrown.
Until now we used the `name()` function which
returns a bytes object. This resulted in a message
with a garbled ascii bytes column name instead of
a proper string. We switch to the `text()` function
that returns a sstring instead, making the message
readable.
Tests are adjusted to confirm this behavior.

Fixes: VECTOR-228

Closes scylladb/scylladb#26120
2025-09-20 07:02:53 +02:00
Michał Jadwiszczak
2aabf8ee3f test/test_view_building_coordinator: add reproducer
Adds a test which reproduces the issue described
in scylladb/scylladb#25912.

The test creates a situation where a single tablet is replicated across
multiple DCs / racks, and all those tablet replicas are eligible for
migration. The tablet load balancer is unpaused at that moment which
currently causes it to attempt to generate multiple migrations for
different tablet replicas of the same tablet. Before the fix for #25912,
this used to confuse the view build coordinator which would react to
each migration attempt, pausing view building work for each tablet
replica for which there was an attempt to migrate but only unpausing it
for the tablet replica that was actually migrated. After the fix, the
view build coordinator only reacts to the migration that has "won" so
the test successfully passes.
2025-09-19 19:08:34 +02:00
Michał Jadwiszczak
50c5354d0b topology_coordinator: abort view building a bit later in case of tablet migration
In a multi-DC setup, the tablet load balancer might generate multiple
migrations of the same tablet_id, but only one is actually committed to
the `system.tablets` table.

This patch moves the abort of view building tasks from the very start of
the migration (`<no tablet transition> -> allow_write_both_read_old`) to
the next step (`allow_write_both_read_old -> write_both_read_old`).
This way, we'll abort only tasks for which the tablet migration was
actually started.

Fixes scylladb/scylladb#25912
2025-09-19 18:02:41 +02:00
Michał Chojnowski
9e70df83ab db: get rid of sstables-format-selector
Our sstable format selection logic is weird, and hard to follow.

If I'm not misunderstanding, the pieces are:
1. There's the `sstable_format` config entry, which currently
   doesn't do anything, but in the past it used to disable
   cluster features for versions newer than the specified one.
2. There are deprecated and unused config entries for individual
   versions (`enable_sstables_mc_format`, `enable_sstables_md_format`,
   etc).
3. There is a cluster feature for each version:
   ME_SSTABLE_FORMAT, MD_SSTABLE_FORMAT, etc.
   (Currently all sstable version features have been grandfathered,
   and aren't checked by the code anymore).
4. There's an entry in `system.scylla_local` which contains the
   latest enabled sstable version. (Why? Isn't this directly derived
   from cluster features anyway)?
5. There's `sstable_manager::_format` which contains the
   sstable version to be used for new writes.
   This field is updated by `sstables_format_selector`
   based on cluster features and the `system.scylla_local` entry.

I don't see why those pieces are needed. Version selection has the
following constraints:
1. New sstables must be written with a format that supports existing
   data. For example, range tombstones with an infinite bound are only
   supported by sstables since version "mc". So if a range tombstone
   with an infinite bound exists somewhere in the dataset,
   the format chosen for new sstables has to be at least as new as "mc".
2. A new format might only be used after a corresponding cluster feature
   is enabled. (Otherwise new sstables might become unreadable if they
   are sent to another node, or if a node is downgraded).
3. The user should have a way to inhibit format upgrades if they wish.

So far, constraint (1) has been fulfilled by never using formats older
than the newest format ever enabled on the node. (With an exception
for resharding and reshaping system tables).
Constraint (2) has been fulfilled by calling `sstable_manager::set_format`
only after the corresponding cluster feature is enabled.
Constraint (3) has been fulfilled by the ability to inhibit cluster
features by setting `sstable_format` to some fixed value.

The main thing I don't like about this whole setup is that it doesn't
let me downgrade the preferred sstable format. After a format is
enabled, there is no way to go back to writing the old format again.
That is no good -- after I make some performance-sensitive changes
in a new format, it might turn out to be a pessimization for the
particular workload, and I want to be able to go back.

This patch aims to give a way to downgrade formats without violating
the constraints. What it does is:
1. The entry in `system.scylla_local` becomes obsolete.
   After the patch we no longer update or read it.
   As far as I understand, the purpose of this entry is to prevent
   unwanted format downgrades (which is something cluster features
   are designed for) and it's updated if and only if relevant
   cluster features are updated. So there's no reason to have it,
   we can just directly use cluster features.
2. `sstables_format_selector` gets deleted.
   Without the `system.scylla_local` entry around, it's just a glorified
   feature listener.
3. The format selection logic is moved into `sstable_manager`.
   It already sees the `db::config` and the `gms::feature_service`.
   For the foreseeable future, the knowledge of enabled cluster features
   and current config should be enough information to pick the right formats.
4. The `sstable_format` entry in `db::config` is no longer intended to
   inhibit cluster features. Instead, it is intended to select the
   format for new sstables, and it becomes live-updatable.
5. Instead of writing new sstables with the "highest supported" format
   (which used to be set by `sstables_format_selector`) we write
   them with the "preferred" format, which is determined by
   `sstable_manager` based on the combination of enabled features
   and the current value of `sstable_format`.
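
One way to picture the "preferred" format is as the configured format
clamped by what the enabled cluster features allow. The message does not
spell out the exact rule, so the sketch below is an assumption, with a
hypothetical enum and function name; constraint (1) about existing data
is not modelled.

```cpp
#include <algorithm>

// Hypothetical ordering of formats, oldest first.
enum class sstable_version { mc, md, me };

// Preferred format for new sstables: the configured `sstable_format`,
// but never newer than the newest format whose cluster feature is
// enabled (constraint 2). Lowering the configured value downgrades the
// format used for new writes.
inline sstable_version preferred_format(sstable_version configured,
                                        sstable_version newest_enabled_by_features) {
    return std::min(configured, newest_enabled_by_features);
}
```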

Closes scylladb/scylladb#26092

[avi: Pavel found the reason for the scylla_local entry -
      it predates stable storage for cluster features]
2025-09-19 16:17:56 +03:00
Pavel Emelyanov
d1626dfa86 api: Move /storage_service/compact to tasks.cc
This one doesn't have an async peer there, but it's still a pure compaction
manager endpoint handler.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-09-19 13:23:59 +03:00
Pavel Emelyanov
6eaa2138ad api: Move /storage_service/keyspace_upgrade_sstables to tasks.cc
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-09-19 13:23:54 +03:00
Pavel Emelyanov
fe2a184713 api: Move /storage_service/keyspace_offstrategy_compaction to tasks.cc
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-09-19 13:23:49 +03:00