Commit Graph

4636 Commits

Author SHA1 Message Date
copilot-swe-agent[bot]
77ee7f3417 Revert "Merge 'Add option to use sstable identifier in snapshot' from Benny Halevy"
This reverts commit 8192f45e84.

The merge exposed a bug where truncate (via drop) fails and causes Raft
errors, leading to schema inconsistencies across nodes. This results in
test_table_drop_with_auto_snapshot failures with 'Keyspace test does not exist'
errors.

The specific problematic change was in commit 19b6207f which modified
truncate_table_on_all_shards to set use_sstable_identifier = true. This
causes exceptions during truncate that are not properly handled, leading
to Raft applier fiber stopping and nodes losing schema synchronization.
2025-12-12 03:55:13 +00:00
Avi Kivity
24264e24bb Revert "repair: Add tablet repair progress report support"
This reverts commit faad0167d7. It causes
a regression in

test_two_tablets_concurrent_repair_and_migration_repair_writer_level

in debug mode (with ~5%-10% probability).

Fixes #27510.

Closes scylladb/scylladb#27560
2025-12-11 12:18:11 +02:00
Nadav Har'El
95e303faf3 Merge 'Refactor get_view_natural_endpoint' from Wojciech Mitros
With the introduction of rack-lists and the reliance of materialized views on them, the `get_view_natural_endpoint` function can be greatly simplified. When using tablets, instead of doing any index-matching, we can now pair base tables with views only in the same rack.
In this series we remove no longer needed code and reorganize the needed code for better clarity.
After the changes, the `get_view_natural_endpoint` function goes down from 245 lines to 85 lines, while the whole pairing-related text goes down from 346 lines to 239 lines.

Fixes https://github.com/scylladb/scylladb/issues/26313

Closes scylladb/scylladb#27383

* github.com:scylladb/scylladb:
  mv: replace the simple/complex rack-aware pairing with exact rack matching
  mv: split out vnode pairing code from get_view_natural_endpoint
  mv: unify self-pairing and rack-aware pairing into one bool
  mv: remove the workaround for left nodes when sending view updates
2025-12-09 13:19:13 +02:00
Asias He
faad0167d7 repair: Add tablet repair progress report support
This patch adds tablet repair progress report support so that the user
could use the /task_manager/task_status API to query the progress.

In order to support this, a new system table is introduced to record the
user request related info, i.e, start of the request and end of the
request.

The progress is accurate when tablet split or merge happens in the
middle of the request, since the tokens of the tablet are recorded when
the request is started and when repair of each tablet is finished. The
original tablet repair is considered as finished when the finished
ranges cover the original tablet token ranges.

After this patch, the /task_manager/task_status API will report correct
progress_total and progress_completed.

Fixes #22564
Fixes #26896

Closes scylladb/scylladb#26924
2025-12-08 13:35:19 +02:00
Pavel Emelyanov
8192f45e84 Merge 'Add option to use sstable identifier in snapshot' from Benny Halevy
This change adds a new option to the REST api and correspondingly, to scylla nodetool: use_sstable_identifier.
When set, we use the sstable identifier, if available, to name each sstable in the snapshots directory
and the manifest.json file, rather than using the sstable generation.

This can be used by the user (e.g. Scylla Manager) for global deduplication with tablets, where an sstable
may be migrated across shards or across nodes, and in this case, its generation may change, but its
sstable identifier remains sstable.

Currently, Scylla manager uses the sstable generation to detect sstables that are already backed up to
object storage and exist in previous backed up snapshots.
Historically, the sstable generation was guaranteed to be unique only per table per node,
so the dedup code currently checks for deduplication in the node scope.

However, with tablet migration, sstables are renamed when migrated to a different shard,
i.e. their generation changes, and they may be renamed when migrated to another node,
but even if they are not, the dedup logic still assumes uniqueness only within a node.

To address both cases, we keep the sstable_id stable throughout the sstable life cycle (since 3a12ad96c7).
Given the globally unique sstable identifier, scylla manager can now detect duplicate sstables
in a wider scope.  This can be cluster-wide, but we practically need only rack-wide deduplication
or dc-wide, as tablets are migrated across racks only in rare occasions (like when converting from a
numerical replication factor to a rack list containing a subset of the available racks in a datacenter).

Fixes #27181

* New feature, no backport required

Closes scylladb/scylladb#27184

* github.com:scylladb/scylladb:
  database: truncate_table_on_all_shards: set use_sstable_identifier to true
  nodetool: snapshot: add --use-sstable-identifier option
  api: storage_service: take_snapshot: add use_sstable_identifier option
  test: database_test: add snapshot_use_sstable_identifier_works
  test: database_test: snapshot_works: add validate_manifest
  sstable: write_scylla_metadata: add random_sstable_identifier error injection
  table: snapshot_on_all_shards: take snapshot_options
  sstable: add get_format getter
  sstable: snapshot: add use_sstable_identifier option
  db: snapshot_ctl: snapshot_options: add use_sstable_identifier options
  db: snapshot_ctl: move skip_flush to struct snapshot_options
2025-12-08 12:56:12 +03:00
Tomasz Grabiec
082342ecad Attach names to allocating sections for better debuggability
Large reserves in allocating_section can cause stalls. We already log
reserve increase, but we don't know which table it belongs to:

  lsa - LSA allocation failure, increasing reserve in section 0x600009f94590 to 128 segments;

Allocating sections used for updating row cache on memtable flush are
notoriously problematic. Each table has its own row_cache, so its own
allocating_section(s). If we attached table name to those sections, we
could identify which table is causing problems. In some issues we
suspected system.raft, but we can't be sure.

This patch allows naming allocating_sections for the purpose of
identifying them in such log messages. I use abstract_formatter for
this purpose to avoid the cost of formatting strings on the hot path
(e.g. index_reader). And also to avoid duplicating strings which are
already stored elsewhere.

Fixes #25799

Closes scylladb/scylladb#27470
2025-12-07 14:14:25 +02:00
Avi Kivity
47efbdffbc Merge 'cache, mvcc: Preempt cache update when applying range tombstone from memtable' from Tomasz Grabiec
Range tombstones are represented as entry attributes, which applies to
the interval between entries. So if a range tombstone covers many
rows, to apply it we have to update all covered entries.  In some
workloads that could be many entries, even the whole cache.  Before
the patch, we did this update without preemption, which can cause
reactor stalls in such workloads.

This scenario is already covered by mvcc_tests,
e.g. test_apply_to_incomplete_respects_continuity. And I verified that
the new preemption point is hit in the test.

perf-row-cache-update results show no significant stalls anymore (max
2ms scheduling delay, instead of previous 1.5 s):

    Generated 1124195 rows
    Memtable fill took 4179.457520 [ms], {count: 8295, 99%: 0.654949 [ms], max: 32.817176 [ms]}
    Draining...
    took 0.000616 [ms]
    cache: 2506/2948 [MB], memtable: 781/1024 [MB], alloc/comp: 1051/662 [MB] (amp: 0.630)
    update: 2874.157471 [ms], preemption: {count: 26650, 99%: 1.131752 [ms], max: 2.068762 [ms]}, cache: 3027/3973 [MB], alloc/comp: 3951/2424 [MB] (amp: 0.614), pr/me/dr 1124195/0/0

Fixes #23479
Fixes #2578

Closes scylladb/scylladb#27469

* github.com:scylladb/scylladb:
  cache, mvcc: Preempt cache update when applying range tombstone from memtable
  partition_snapshot_row_cursor: Clarify non-obvious semantic difference of range_tombstone()
  perf-row-cache-update: Add scenario with large tombstone covering many rows
2025-12-07 11:54:15 +02:00
Tomasz Grabiec
d4014b7970 Drop legacy schema support
We switched to using v3 schema tables (in system_schema keyspace) in
2017, in 9eb91bc30b.

So no system should have the old schema any more.

No need to run legacy_schema_migrator on boot.

Closes scylladb/scylladb#27420
2025-12-07 00:09:13 +02:00
Tomasz Grabiec
e546143fd9 partition_snapshot_row_cursor: Clarify non-obvious semantic difference of range_tombstone() 2025-12-06 01:03:10 +01:00
Botond Dénes
9d2f7c3f52 Merge 'mv: allow setting concurrency in PRUNE MATERIALIZED VIEW' from Wojciech Mitros
The PRUNE MATERALIZED VIEW statement is performed as follows:
1. Perform a range scan of the view table from the view replicas based
on the ranges specified in the statement.
2. While reading the paged scan above, for each view row perform a read
from all base replicas at the corresponding primary key. If a discrepancy
is detected, delete the row in the view table.

When reading multiple rows, this is very slow because for each view row
we need to performe a single row query on multiple replicas.
In this patch we add an option to speed this up by performing many of the
single base row reads concurrently, at the concurrency specified in the
USING CONCURRENCY clause.

Aside from the unit test, I checked manually on a 3-node cluster with 10M rows, using vnodes. There were actually no ghost rows in the test, but we still had to iterate over all view rows and read the corresponding base rows. And actual ghost rows, if there are any, should be a tiny fraction of all rows. I compared concurrencies 1,2,10,100 and the results were:
* Pruning with concurrency 1 took total 1416 seconds
* Pruning with concurrency 2 took total 731 seconds
* Pruning with concurrency 10 took total 234 seconds
* Pruning with concurrency 100 took total 171 seconds
So after a concurrency of 10 or so we're hitting diminishing returns (at least in this setup). At that point we may be no longer bottlenecked by the reads, but by CPU on the shard that's handling the PRUNE

Fixes https://github.com/scylladb/scylladb/issues/27070

Closes scylladb/scylladb#27097

* github.com:scylladb/scylladb:
  mv: allow setting concurrency in PRUNE MATERIALIZED VIEW
  cql: add CONCURRENCY to the USING clause
2025-12-04 11:47:41 +02:00
Benny Halevy
1c45ad7cee db: snapshot_ctl: snapshot_options: add use_sstable_identifier options
To be used for naming sstables in the snapshot by their
sstable identifiers rather than their generation, to
facilitate global deduplication of sstables in backup.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-04 09:46:35 +02:00
Benny Halevy
c18133b6cb db: snapshot_ctl: move skip_flush to struct snapshot_options
Prepare for adding another option: use_sstable_identifer.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-04 09:46:35 +02:00
Szymon Wasik
4f803aad22 Improve documentation of vector search configuration parameters.
This patch adds separate group for vector search parameters in the
documentation and fixes small typos and formatting.

Fixes: SCYLLADB-77.

Closes scylladb/scylladb#27385
2025-12-03 21:02:59 +02:00
Piotr Dulikowski
654ac9099b db/view/view_building_coordinator: skip work if no view is built
Even though that `view_building_coordinator::work_on_view_building` has
an `if` at the very beginning which checks whether the currently
processed base table is set, it only prints a message and continues
executing the rest of the function regardless of the result of the
check. However, some of the logic in the function assumes that the
currently processed base table field is set and tries to access the
value of the field. This can lead to the view building coordinator
accessing a disengaged optional, which is undefined behavior.

Fix the function by adding the clearly missing `co_await` to the check.
A regression test is added which checks that the view building state
observer - a different fiber which used to print a weird message due to
erroneus view building coordinator behavior - does not print a warning.

Fixes: scylladb/scylladb#27363

Closes scylladb/scylladb#27373
2025-12-03 09:44:28 +02:00
Wojciech Mitros
6221c58325 mv: replace the simple/complex rack-aware pairing with exact rack matching
When the initial version of rack-aware pairing was introduced, materialized
views with tablets were still experimental. Since then, we decided that
we'll only allow materialized views in clusters where the base table and
the view are replicated on the same racks, with one replica of each tablet
on each rack.
This allows us to remove almost all logic from our base-view pairing. The
only check for the paired view replica is now whether it's in the same
rack as the base replica sending the update.
In this patch we replace the simple and complex rack-aware pairing with
the simple check above.
Because of this, we have to remove a test case from network_topology_strategy_test
which was testing complex pairing. The tested topology is not supported
for views with tablets (or is unlikely to be supported, as it's a random test),
so there's no use keeping the test.
The test case for simple rack aware pairing was kept, but now we only test
the case where each rack has one replica, not multiple.
Additionally, we split finding of an unpaired replica to a separate function
and partially rewrite it without reusing the helper stuctures that were
present when calculating the simple and complex rack-aware pairing.
We only look for an unpaired replica if we couldn't find a paired replica
ourselves or if the number of view replicas didn't match the base replicas.
If an unpaired replica appears while these conditions pass, we won't send
an extra update, but that would be a new bug altogether, because we only
expect the unpaired replica to appear during RF changes, so when these
conditions aren't fulfilled.

Fixes https://github.com/scylladb/scylladb/issues/26313
2025-12-02 10:52:36 +01:00
Wojciech Mitros
4ec0fa6eb5 mv: split out vnode pairing code from get_view_natural_endpoint
To avoid repeatedly checking whether we're using tablets and having
to use unnecesarily flexible code fitting both cases, we split out
the base-view pairing code for the case of vnodes to another function.
The get_view_natural_endpoint will now have only common steps,
a call to that function, and steps specific to tablets.
2025-12-02 03:32:36 +01:00
Wojciech Mitros
c313b215e4 mv: unify self-pairing and rack-aware pairing into one bool
We always use "legacy self pairing" when not using tablets, and
the "rack aware pairing" has been enabled in every version where
views with tablets isn't experimental. So in practice, instead
of checking these variables we can just look at whether the
table uses tablets.
2025-12-02 03:32:32 +01:00
Wojciech Mitros
7c612e1789 mv: remove the workaround for left nodes when sending view updates
At one point, the get_view_natural_endpoint was using IP for the
view update (and hint) destinations, but the hint code was using
host_id for the destinations. When a node left, we could no longer
have a mapping for a IP to host_id and when trying to store a hint
for this IP, we'd crash.
We worked around this issue by dropping the view update completely
if the target is in the "left" state.
Since then, we also moved to host_id's in the view update code, so
there's no longer any translation needed when storing the hints.
Additionally, we now drain hints not when entering the "left" state,
but when the node actually stops owning tokens.
Because of that, the workaround is not needed anymore, so we remove
it in this commit.
The existing test_mv_tablets_empty_ip case verifies that indeed, we
do not crash in the original problematic scenario.
2025-12-01 12:27:28 +01:00
Piotr Dulikowski
44c605e59c Merge 'Fix the types of change events in Alternator Streams' from Piotr Wieczorek
This patch increases the compatibility with DynamoDB Streams by integrating the DynamoDB's event type rules (described in https://github.com/scylladb/scylladb/issues/6918) into Alternator. The main changes are:
- introduce a new flag `alternator_streams_strict_compatibility`, meant as a guard of performance-intensive operations that increase the compatibility with DynamoDB Streams. If enabled, Alternator always performs a RBW before a data-modifying operation, and propagates its result to CDC. Then, the old item is compared to the new one, to determine the mutation type (INSERT vs MODIFY). This option is a no-op for tables with disabled Alternator Streams,
- reduce splitting of simple Alternator mutations,
- correctly distinguish event types described in #6918, except for item deletes. Deleting a missing item with DeleteItem, BatchWriteItem, or a missing field with UpdateItem still emit REMOVEs.

To summarize, the emitted events of the data manipulation operations should be as follows:
- DeleteItem/BatchWriteItem.DeleteItem of existing item: REMOVE (OK)
- DeleteItem of nonexistent item: nothing (OK)
- BatchWriteItem.DeleteItem of nonexistent item: nothing (OK)
- PutItem/UpdateItem/BatchWriteItem.PutItem of existing and not equal item: MODIFY (OK)
- PutItem/UpdateItem/BatchWriteItem.PutItem of existing and equal item: nothing (OK)
- PutItem/UpdateItem/BatchWriteItem.PutItem of nonexistent item: INSERT (OK)

No backport is necessary.

Refs https://github.com/scylladb/scylladb/pull/26149
Refs https://github.com/scylladb/scylladb/pull/26396
Refs https://github.com/scylladb/scylladb/issues/26382
Fixes https://github.com/scylladb/scylladb/issues/6918

Closes scylladb/scylladb#26121

* github.com:scylladb/scylladb:
  test/alternator: Enable the tests failing because of #6918
  alternator, cdc: Don't emit events for no-op removes
  alternator, cdc: Don't emit an event for equal items
  alternator/streams, cdc: Differentiate item replace and item update in CDC
  alternator: Change the return type of rmw_operation_return
  config: Add alternator_streams_strict_compatibility flag
  cdc: Don't split a row marker away from row cells
2025-11-30 07:20:22 +01:00
Dawid Mędrek
48a28c24c5 db/commitlog: Include position and alignment information in errors
When we come across a segment truncation, this information may
be helpful to determine when the error occurred exactly and
hint at what code path might've led to it.

Closes scylladb/scylladb#27207
2025-11-28 15:28:08 +03:00
Calle Wilund
59c87025d1 commitlog::read_log_file: Check for eof position on all data reads
Fixes #24346

When reading, we check for each entry and each chunk, if advancing there
will hit EOF of the segment. However, IFF the last chunk being read has
the last entry _exactly_ matching the chunk size, and the chunk ending
at _exactly_ segment size (preset size, typically 32Mb), we did not check
the position, and instead complained about not being able to read.

This has literally _never_ happened in actual commitlog (that was replayed
at least), but has apparently happened more and more in hints replay.

Fix is simple, just check the file position against size when advancing
said position, i.e. when reading (skipping already does).

v2:

* Added unit test

Closes scylladb/scylladb#27236
2025-11-28 15:26:46 +03:00
Wojciech Mitros
323e5cd171 mv: allow setting concurrency in PRUNE MATERIALIZED VIEW
The PRUNE MATERALIZED VIEW statement is performed as follows:
1. Perform a range scan of the view table from the view replicas based
on the ranges specified in the statement.
2. While reading the paged scan above, for each view row perform a read
from all base replicas at the corresponding primary key. If a discrepancy
is detected, delete the row in the view table.

When reading multiple rows, this is very slow because for each view row
we need to performe a single row query on multiple replicas.
In this patch we add an option to speed this up by performing many of the
single base row reads concurrently, at the concurrency specified in the
USING CONCURRENCY clause.

Fixes https://github.com/scylladb/scylladb/issues/27070
2025-11-27 00:02:28 +01:00
Nadav Har'El
9cde93e3da Merge 'db/view/view_building_coordinator: get rid of task's state in group0' from Michał Jadwiszczak
Previously, the view building coordinator relied on setting each task's state to STARTED and then explicitly removing these state entries once tasks finished, before scheduling new ones. This approach induced a significant number of group0 commits, particularly in large clusters with many nodes and tablets, negatively impacting performance and scalability.

With the update, the coordinator and worker logic has been restructured to operate without maintaining per-task states. Instead, tasks are simply tracked with an aborted boolean flag, which is still essential for certain tablet operations. This change removes much of the coordination complexity, simplifies the view building code, and reduces operational overhead.

In addition, the coordinator now batches reports of finished tasks before making commits. Rather than committing task completions individually, it aggregates them and reports in groups, significantly minimizing the frequency of group0 commits. This new approach is expected to improve efficiency and scalability during materialized view construction, especially in large deployments.

Fixes https://github.com/scylladb/scylladb/issues/26311

This patch needs to be backported to 2025.4.

Closes scylladb/scylladb#26897

* github.com:scylladb/scylladb:
  docs/dev/view-building-coordinator: update the docs after recent changes
  db/view/view_building: send coordinator's term in the RPC
  db/view/view_building_state: replace task's state with `aborted` flag
  db/view/view_building_coordinator: batch finished tasks reporting
  db/view/view_building_worker: change internal implementation
  db/view/view_building_coordinator: change `work_on_tasks` RPC return type
2025-11-26 11:35:44 +02:00
Botond Dénes
384bffb8da Merge 'compaction: limit the maximum shares allocated to a compaction scheduling class' from Raphael Raph Carvalho
This PR adds support for limiting the maximum shares allocated to a
compaction scheduling class by the compaction controller. It introduces
a new configuration parameter, compaction_max_shares, which, when set
to a non zero value, will cap the shares allocated to compaction jobs.
This PR also exposes the shares computed by the compaction controller
via metrics, for observability purposes.

Fixes https://github.com/scylladb/scylladb/issues/9431

Enhancement. No need to backport.

NOTE: Replaces PR https://github.com/scylladb/scylladb/pull/26696

Ran a test in which the backlog raised the need for max shares (normalized backlog above normalization_factor), and played with different values for new option compaction_max_shares to see it works (500, 1000, 2000, 250, 50)

Closes scylladb/scylladb#27024

* github.com:scylladb/scylladb:
  db/config: introduce new config parameter `compaction_max_shares`
  compaction_manager:config: introduce max_shares
  compaction_controller: add configurable maximum shares
  compaction_controller: introduce `set_max_shares()`
2025-11-26 06:51:30 +02:00
Michał Jadwiszczak
fb8cbf1615 db/view/view_building: send coordinator's term in the RPC
To avoid case when an old coordinator (which hasn't been stopped yet)
dictates what should be done, add raft term to the `work_on_view_building_tasks`
RPC.
The worker needs to check if the term matches the current term from raft
server, and deny the request when the term is bad.
2025-11-25 12:14:05 +01:00
Michał Jadwiszczak
24d69b4005 db/view/view_building_state: replace task's state with aborted flag
After previous commits, we can drop entire task's state and replace it
with single boolean flag, which determines if a task was aborted.

Once a task was aborted, it cannot get resurrected to a normal state.
2025-11-25 12:14:04 +01:00
Michał Jadwiszczak
eb04af5020 db/view/view_building_coordinator: batch finished tasks reporting
In previous implementation to execute view building tasks, the
coordinator needed to firstly set their states to `STARTED`
and then it needed to remove them before it could start the next ones.
This logic required a lot of group0 commits, especially in large
clusters with higher number of nodes and big tablet count.

After previous commit to the view building worker, the coordinator
can start view building tasks without setting the `STARTED` state
and deleting finished tasks.

This patch adjusts the coordinator to save finished tasks locally,
so it can continue to execute next ones and the finished tasks are
periodically removed from the group0 by `finished_task_gc_fiber()`.
2025-11-25 12:14:04 +01:00
Lakshmi Narayanan Sreethar
9cb766f929 db/config: introduce new config parameter compaction_max_shares
Add support for the new configuration parameter `compaction_max_shares`,
and update the compaction manager to pass it down to the compaction
controller when it changes. The shares allocated to compaction jobs will
be limited by this new parameter.

Fixes #9431

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-11-24 12:52:29 -03:00
Tomasz Grabiec
d4b77c422f Merge 'load_stats: leaving replica could be std::nullopt' from Ferenc Szili
When migrating tablet size during the end_migration tablet transition stage, we need the pending and leaving replica hosts. The leaving and pending replicas are gathered in objects of type std::optional<tablet_replica> and are not checked if they contain a value before dereferencing which could cause an exception in the topology coordinator.

This patch adds a check for leaving and pending replicas, and only performs the tablet size migration if neither are empty.

This bug was introduced in 10f07fb95a

This change also adds the ability to create a tablet size in load_stats during end_migration stage of a tablet rebuild. We compute the new tablet size from by averaging the tablet sizes of the existing replicas.

This change also adds the virtual table tablet_sizes which contains tablet sizes of all the replicas of all the tablets in the cluster.

A version containing this bug has not yet been released, so a backport is not needed.

Closes scylladb/scylladb#27118

* github.com:scylladb/scylladb:
  test: add tests for tablet size migration during end_migration
  virtual_table: add tablet_sizes virtual table
  load_stats: update tablet sizes after migration or rebuild
2025-11-24 15:31:30 +01:00
Michał Jadwiszczak
08974e1d50 db/view/view_building_worker: change internal implementation
This commit doesn't change the logic behind the view building worker but
it changes how the worker is executing view building tasks.

Previously, the worker had a state only on shard0 and it was reacting to
changes in group0 state. When it noticed some tasks were moved to
`STARTED` state, the worker was creating a batch for it on the shard0
state.
The RPC call was used only to start the batch and to get its result.

Now, the main logic of batch management was moved to the RPC call
handler.
The worker has a local state on each shard and the state
contains:
- unique ptr to the batch
- set of completed tasks
- information for which views the base table was flushed

So currently, each batch lives on a shard where it has its work to do
exclusively. This eliminates a need to do a synchronization between
shard0 and work shard, which was a painful point in previous
implementation.

The worker still reacts to changes in group0 view building state, but
currently it's only used to observe whether any view building tasks was
aborted by setting `ABORTED` state.

To prepare for further changes to drop the view building task state,
the worker ignores `IDLE` and `STARTED` states completely.
2025-11-24 11:12:31 +01:00
Michał Jadwiszczak
6d853c8f11 db/view/view_building_coordinator: change work_on_tasks RPC return type
During the initial implementation of the view builing coordinator,
we decided that if a view building task fails locally on the worker
(example reason: view update's target replica is not available),
the worker will retry this work instead of reporting a failure to the
coordinator.

However, we left return type of the RPC, which was telling if a task was
finished successfully or aborted.
But the worker doesn't need to report that a task was aborted, because
it's the coordinator, who decides to abort a task.

So, this commit changes the return type to list of UUIDs of completed
tasks.
Previously length of the returned vector needed to be the same as length
of the vector sent in the request.
No we can drop this restriction and the RPC handler return list of UUIDs
of completed tasks (subset of vector sent in the request).

This change is required to drop `STARTED` state in next commits.

Since Scylla 2025.4 wasn't released yet and we're going to merge this
patch before releasing, no RPC versioning or cluster feature is needed.
2025-11-24 11:12:29 +01:00
Karol Nowacki
c40b3ba4b3 vector_search: Add HTTPS support for vector store connections
This commit introduces TLS encryption support for vector store connections.
A new configuration option is added:
- vector_store_encryption_options.truststore: path to the trust store file

To enable secure connections, use the https:// scheme in the
vector_store_primary_uri/vector_store_secondary_uri configuration options.

Fixes: VECTOR-327
2025-11-22 08:18:45 +01:00
Ferenc Szili
e96863be0c virtual_table: add tablet_sizes virtual table
This change adds the tablet_sizes virtual table. The contents of this
table are gathered from the current load_stats data structure.
2025-11-21 16:53:28 +01:00
Gautam Menghani
939fcc0603 db/system_keyspace: Remove the FIXME related to caching of large tables
Remove the FIXME comment for re-enabling caching of the large tables
since the tables are used infrequently [1].

[1] : github.com/scylladb/scylladb/pull/26789#issuecomment-3477540364

Fixes #26032

Signed-off-by: Gautam Menghani <gautam.opensource@gmail.com>

Closes scylladb/scylladb#26789
2025-11-21 12:34:34 +02:00
Radosław Cybulski
d589e68642 Add precompiled headers to CMakeLists.txt
Add precompiled header support to CMakeLists.txt and configure.py -
it improves compilation time by approximately 10%.

New header `stdafx.hh` is added, don't include it manually -
the compiler will include it for you. The header contains includes from
external libraries used by Scylla - seastar, standard library,
linux headers and zlib.

The feature is enabled by default, use CMake option `Scylla_USE_PRECOMPILED_HEADER`
or configure.py --disable-precompiled-header to disable.

The feature should be disabled, when trying to check headers - otherwise
you might get false negatives on missing includes from seastar / abseil and so on.

Note: following configuration needs to be added to ccache.conf:

    sloppiness = pch_defines,time_macros,include_file_mtime,include_file_ctime

Closes scylladb/scylladb#26617
2025-11-21 12:27:41 +02:00
Calle Wilund
3c4546d839 messaging_service: Add internode_compression=rack as option
Fixes #27085

Adds a "rack" option to enum/config and handles in connection
setup in messaging_service.

Closes scylladb/scylladb#27099
2025-11-21 11:50:55 +02:00
Karol Nowacki
104de44a8d vector_search: Add support for secondary vector store clients
This change adds support for secondary vector store clients, typically
located in different availability zones. Secondary clients serve as
fallback targets when all primary clients are unavailable.
New configuration option allows specifying secondary client addresses
and ports.

Fixes: VECTOR-187

Closes scylladb/scylladb#26484
2025-11-20 08:37:18 +01:00
Botond Dénes
6ee0f1f3a7 Merge 'replica/table: add a metric for hypothetical total file size without compression' from Michał Chojnowski
This patch adds a metric for pre-compression size of sstable files.

This patch adds a per-table metric
`scylla_column_family_total_disk_space_before_compression`,
which measures the hypothetical total size of sstables on disk,
if Data.db was replaced with an uncompressed equivalent.

As for the implementation:
Before the patch, tables and sstable sets are already tracking their total physical file size.
Whenever sstables are added or removed, the size delta is propagated from the sstable up through sstable sets into table_stats.
To implement the new metric, we turn the size delta that is getting passed around from a one-dimensional to a two-dimensional value, which includes both the physical and the pre-compression size.

New functionality, no backport needed.

Closes scylladb/scylladb#26996

* github.com:scylladb/scylladb:
  replica/table: add a metric for hypothetical total file size without compression
  replica/table: keep track of total pre-compression file size
2025-11-20 09:10:38 +02:00
Botond Dénes
2ca66133a4 Revert "db/config: don't use RBNO for scaling"
This reverts commit 43738298be.

This commit causes instability in dtests. Several non-gating dtests
started failing, as well as some gating ones, see #27047.

Closes scylladb/scylladb#27067

Fixes #27047
2025-11-18 08:17:17 +02:00
Botond Dénes
514c1fc719 Merge 'db: batchlog_manager: update _last_replay only if all batches were re…' from Aleksandra Martyniuk
…played

Currently, if flushing hints falls within the repair cache timeout, then the flush_time is set to batchlog_manager::_last_replay. _last_replay is updated on each replay, even if some batches weren't replayed. Due to that, we risk the data resurrection.

Update _last_replay only if all batches were replayed.

Fixes: https://github.com/scylladb/scylladb/issues/24415.

Needs backport to all live versions.

Closes scylladb/scylladb#26793

* github.com:scylladb/scylladb:
  test: extend test_batchlog_replay_failure_during_repair
  db: batchlog_manager: update _last_replay only if all batches were replayed
2025-11-18 08:17:16 +02:00
Piotr Dulikowski
f0039381d2 Merge 'db/view/view_building_worker: support staging sstables intra-node migration and tablet merge' from Michał Jadwiszczak
This PR fixes staging stables handling by view building coordinator in case of intra-node tablet migration or tablet merge.

To support tablet merge, the worker stores the sstables grouped only be `table_id`, instead of `(table_id, last_token)` pair.
There shouldn't be that many staging sstables, so selecting relevant for each `process_staging` task is fine.
For the intra-node migration support, the patch adds methods to load migrated sstables on the destination shard and to cleanup them on source shard.

The patch should be backported to 2025.4

Fixes https://github.com/scylladb/scylladb/issues/26244

Closes scylladb/scylladb#26454

* github.com:scylladb/scylladb:
  service/storage_service: migrate staging sstables in view building worker during intra-node migration
  db/view/view_building_worker: support sstables intra-node migration
  db/view_building_worker: fix indent
  db/view/view_building_worker: don't organize staging sstables by last token
2025-11-17 08:53:19 +01:00
Aleksandra Martyniuk
e3dcb7e827 test: extend test_batchlog_replay_failure_during_repair
Modify test_batchlog_replay_failure_during_repair to also check
that there isn't data resurrection if flushing hints falls within
the repair cache timeout.
2025-11-14 14:18:07 +01:00
Patryk Jędrzejczak
1141342c4f Merge 'topology: refactor excluded nodes' from Petr Gusev
This PR refactors excluded nodes handling for tablets and topology. For tablets a dedicated variable `topology::excluded_tablet_nodes` is introduced, for topology operations a method get_excluded_nodes() is inlined into topology_coordinator and renamed to `get_excluded_nodes_for_topology_request`.

The PR improves codes readability and efficiency, no behavior changes.

backport: this is a refactoring/optimization, no need to backport

Closes scylladb/scylladb#26907

* https://github.com/scylladb/scylladb:
  topology_coordinator: drop unused exec_global_command overload
  topology_coordinator: rename get_excluded_nodes -> get_excluded_nodes_for_topology_request
  topology_state_machine: inline get_excluded_nodes
  messaging_service: simplify and optimize ban_host
  storage_service: topology_state_load: extract topology variable
  topology_coordinator: excluded_tablet_nodes -> ignored_nodes
  topology_state_machine: add excluded_tablet_nodes field
2025-11-14 11:52:00 +01:00
Botond Dénes
43738298be db/config: don't use RBNO for scaling
Remove bootstrap and decomission from allowed_repair_based_node_ops.
Using RBNO over streaming for these operations has no benefits, as they
are not exposed to the out-of-date replica problem that replace,
removenode and rebuild are.
On top of that, RBNO is known to have problems with empty user tables.
Using streaming for boostrap and decomission is safe and faster
than RBNO in all condition, especially when the table is small.

One test needs adjustment as it relies on RBNO being used for all node
ops.

Fixes: #24664

Closes scylladb/scylladb#26330
2025-11-14 13:03:50 +03:00
Piotr Dulikowski
43506e5f28 Merge 'db/view: Add backoff when RPC fails' from Dawid Mędrek
The view building coordinator manages the process by sending RPC
requests to all nodes in the cluster, instructing them what to do.
If processing that message fails, the coordinator decides if it
wants to retry it or (temporarily) abandon the work.

An example of the latter scenario could be if one of the target nodes
dies and any attempts to communicate with it would fail.

Unfortunately, the current approach to it is not perfect and may result
in a storm of warnings, effectively clogging the logs. As an example,
take a look at scylladb/scylladb#26686: the gossiper failed to mark
one of the dead nodes as DOWN fast enough, and it resulted in a warning storm.

To prevent situations like that, we implement a form of backoff.
If processing an RPC message fails, we postpone finishing the task for
a second. That should reduce the number of messages in the logs and avoid
retries that are likely to fail as well.

We provide a reproducer test.

Fixes scylladb/scylladb#26686

Backport: impact on the user. We should backport it to 2025.4.

Closes scylladb/scylladb#26729

* github.com:scylladb/scylladb:
  tet/cluster/mv: Clean up test_backoff_when_node_fails_task_rpc
  db/view/view_building_coordinator: Rate limit logging failed RPC
  db/view: Add backoff when RPC fails
2025-11-14 10:17:57 +01:00
Dawid Mędrek
acd9120181 db/view/view_building_coordinator: Rate limit logging failed RPC
The view building coordinator sends tasks in form of RPC messages
to other nodes in the cluster. If processing that RPC fails, the
coordinator logs the error.

However, since tasks are per replica (so per shard), it may happen
that we end up with a large number of similar messages, e.g. if the
target node has died, because every shard will fail to process its
RPC message. It might become even worse in the case of a network
partition.

To mitigate that, we rate limit the logging by 1 seconds.

We extend the test `test_backoff_when_node_fails_task_rpc` so that
it allows the view building coordinator to have multiple tablet
replica targets. If not for rate limiting the warning messages,
we should start getting more of them, potentially leading to
a test failure.
2025-11-13 17:57:23 +01:00
Dawid Mędrek
4a5b1ab40a db/view: Add backoff when RPC fails
The view building coordinator manages the process of view building
by sending RPC requests to all nodes in the cluster, instructing them
what to do. If processing that message fails, the coordinator decides
if it wants to retry it or (temporarily) abandon the work.

An example of the latter scenario could be if one of the target nodes
dies and any attempts to communicate with it would fail.

Unfortunately, the current approach to it is not perfect and may result
in a storm of warnings, effectively clogging the logs. As an example,
take a look at scylladb/scylladb#26686: the gossiper failed to mark
one of the dead nodes as DOWN fast enough, and it resulted in a warning storm.

To prevent situations like that, we implement a form of backoff.
If processing an RPC message fails, we postpone finishing the task for
a second. That should reduce the number of messages in the logs and avoid
retries that are likely to fail as well.

We provide a reproducer test: it fails before this commit and succeeds
with it.

Fixes scylladb/scylladb#26686
2025-11-13 17:55:41 +01:00
Aleksandra Martyniuk
4d0de1126f db: batchlog_manager: update _last_replay only if all batches were replayed
Currently, if flushing hints falls within the repair cache timeout,
then the flush_time is set to batchlog_manager::_last_replay.
_last_replay is updated on each replay, even if some batches weren't
replayed. Due to that, we risk the data resurrection.

Update _last_replay only if all batches were replayed.

Fixes: https://github.com/scylladb/scylladb/issues/24415.
2025-11-13 10:40:19 +01:00
Piotr Dulikowski
2e5eb92f21 Merge 'cdc: use CDC schema that is compatible with the base schema' from Michael Litvak
When generating CDC log mutations for some base mutation, use a CDC schema that is compatible with the base schema.

The compatible CDC schema has for every base column a corresponding CDC column with the same name. If using a non-compatible schema, we may encounter a situation, especially during ALTER, that we have a mutation with a base column set with some value, but the CDC schema doesn't have a column by that name. This would cause the user request to fail with an error.

We add to the schema object a schema_ptr that for CDC-enabled tables points to the schema object of the CDC table that is compatible with the schema. It is set by the schema merge algorithm when creating the schema for a table that is created or altered. We use the fact that a base table and its CDC table are created and altered in the same group0 operation, and this way we can find and set the cdc schema for a base table.

When transporting the base schema as a frozen schema between shards, we transport with it the frozen cdc schema as well.

The patch starts with a series of refactoring commits that make extending the frozen schema easier and cleans up some duplication in the code about the frozen schema. We combine the two types `frozen_schema_with_base_info` and `view_schema_and_base_info` to a single type `extended_frozen_schema` that holds a frozen schema with additional data that is not part of the schema mutations but needs to be transported with it to unfreeze it - base_info, and the frozen cdc schema which is added in a later commit.

Fixes https://github.com/scylladb/scylladb/issues/26405

backport not needed - enhancement

Closes scylladb/scylladb#24960

* github.com:scylladb/scylladb:
  test: cdc: test cdc compatible schema
  cdc: use compatiable cdc schema
  db: schema_applier: create schema with pointer to CDC schema
  db: schema_applier: extract cdc tables
  schema: add pointer to CDC schema
  schema_registry: remove base_info from global_schema_ptr
  schema_registry: use extended_frozen_schema in schema load
  schema_registry: replace frozen_schema+base_info with extended_frozen_schema
  frozen_schema: extract info from schema_ptr in the constructor
  frozen_schema: rename frozen_schema_with_base_info to extended_frozen_schema
2025-11-13 10:11:54 +01:00
Michał Chojnowski
1cfce430f1 replica/table: keep track of total pre-compression file size
Every table and sstable set keeps track of the total file size
of contained sstables.

Due to a feature request, we also want to keep track of the hypothetical
file size if Data files were uncompressed, to add a metric that
shows the compression ratio of sstables.

We achieve this by replacing the relevant `uint_64 bytes_on_disk`
counters everywhere with a struct that contains both the actual
(post-compression) size and the hypothetical pre-compression size.

This patch isn't supposed to change any observable behavior.
In the next patch, we will use these changes to add a new metric.
2025-11-13 00:49:57 +01:00