Commit Graph

4484 Commits

Author SHA1 Message Date
Piotr Dulikowski
591a67c7e7 Merge 'view_builder: register view on all shards atomically' from Michael Litvak
When the view builder starts to build a new view, each shard registers
itself by writing the shard id and current token to the
scylla_views_builds_in_progress table.

Previously, each shard did this independently. We now change it to
register all shards "atomically" - when a shard registers itself, it
also registers all other shards with an empty status if they aren't
registered yet. This ensures that we never have a partial state in the
table where only some of the shards are registered; we always have a
status for all shards.

The reason we want to register all shards atomically is that if only
some of the shards were registered and we then restart and load the
status from the table, things go wrong for multiple reasons.

One example is that to know how many shards we had previously, we take
the maximum shard id we see in the table. If it's different than the
current shard count, we will execute the reshard code. But of course, if
the last shard is missing from the table because it didn't register
itself, this calculation will be wrong, and we can't know the previous
number of shards.

This is a problem: suppose we have two shards, and shard 0 finished
building the view but shard 1 didn't start. When we come up, we will
think that previously we had only a single shard and that it completed
building everything, when in fact we built only about half the view.
The problem is that we don't have enough information in the tables to
know that.

There are additional problems related to resharding. In the reshard
function, whether it is executed because the node actually resharded or
because we calculated the wrong number of previous shards, if the status
of some shard is missing then the calculation of new ranges will be
wrong. When some shard didn't make progress, we should start building the
view from scratch. However, this doesn't happen if we don't have a
status for the shard, because the code looks only at shards that have a
status. In effect, this shard is considered complete even though it
didn't start. This could cause the view building to get stuck or
complete without building all token ranges.

Registering all shards atomically should solve the above problems,
because we will always have statuses for all shards.
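
To illustrate the idea, here is a minimal standalone sketch (plain C++, not the actual view builder code; the `scylla_views_builds_in_progress` table is modeled as an in-memory map and all helper names are hypothetical):

```cpp
// Sketch: registering all shards "atomically" in one go.
// The real code writes rows to scylla_views_builds_in_progress;
// here the table is modeled as a map from shard id to an optional status.
#include <cstdio>
#include <map>
#include <optional>
#include <string>

struct build_status {
    std::string next_token; // empty means "registered, no progress yet"
};

using in_progress_table = std::map<unsigned, std::optional<build_status>>;

// Called by the shard that registers first: besides writing its own row,
// it also writes an empty row for every other shard that is still missing.
void register_all_shards(in_progress_table& table, unsigned self, unsigned shard_count,
                         std::string current_token) {
    for (unsigned shard = 0; shard < shard_count; ++shard) {
        if (!table.contains(shard)) {
            table[shard] = build_status{}; // empty status placeholder
        }
    }
    table[self] = build_status{std::move(current_token)};
}

int main() {
    in_progress_table table;
    register_all_shards(table, /*self=*/0, /*shard_count=*/2, "token-42");
    // Even though shard 1 did not make progress, it now has a row,
    // so max(shard id) correctly reflects the previous shard count.
    for (auto& [shard, status] : table) {
        std::printf("shard %u: %s\n", shard,
                    status->next_token.empty() ? "<empty>" : status->next_token.c_str());
    }
}
```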

Fixes https://github.com/scylladb/scylladb/issues/22989

backport not needed - the issue is probably not common and there's a workaround

Closes scylladb/scylladb#25790

* github.com:scylladb/scylladb:
  test: mv: add a test for view build interrupt during registration
  view_builder: register view on all shards atomically
2025-09-22 08:03:44 +02:00
Karol Nowacki
eedf506be5 vector_store_client: Rename vector_store_uri to vector_store_primary_uri
The configuration setting vector_store_uri is renamed to
vector_store_primary_uri according to the final design.
In the future, the vector_store_secondary_uri setting will
be introduced.

This setting now also accepts a comma-separated list of URIs to prepare
for future support for redundancy and load balancing. Currently, only the
first URI in the list is used.

This change must be included before the next release.
Otherwise, users will be affected by a breaking change.

References: VECTOR-187

Closes scylladb/scylladb#26033
2025-09-21 16:33:10 +03:00
Michael Litvak
3dffb8e0dc test: mv: add a test for view build interrupt during registration
Add a new test that reproduces issue #22989. The test starts view
building and interrupts it by restarting the node while some shards
registered their status and some didn't.
2025-09-21 10:39:30 +02:00
Michael Litvak
6043409c31 view_builder: register view on all shards atomically
When the view builder starts to build a new view, each shard registers
itself by writing the shard id and current token to the
scylla_views_builds_in_progress table.

Previously, each shard did this independently. We now change it to
register all shards "atomically" - when a shard registers itself, it
also registers all other shards with an empty status if they aren't
registered yet. This ensures that we never have a partial state in the
table where only some of the shards are registered; we always have a
status for all shards.

The reason we want to register all shards atomically is that if only
some of the shards were registered and we then restart and load the
status from the table, things go wrong for multiple reasons.

One example is that to know how many shards we had previously, we take
the maximum shard id we see in the table. If it's different than the
current shard count, we will execute the reshard code. But of course, if
the last shard is missing from the table because it didn't register
itself, this calculation will be wrong, and we can't know the previous
number of shards.

This is a problem: suppose we have two shards, and shard 0 finished
building the view but shard 1 didn't start. When we come up, we will
think that previously we had only a single shard and that it completed
building everything, when in fact we built only about half the view.
The problem is that we don't have enough information in the tables to
know that.

There are additional problems related to resharding. In the reshard
function, whether it is executed because the node actually resharded or
because we calculated the wrong number of previous shards, if the status
of some shard is missing then the calculation of new ranges will be
wrong. When some shard didn't make progress, we should start building the
view from scratch. However, this doesn't happen if we don't have a
status for the shard, because the code looks only at shards that have a
status. In effect, this shard is considered complete even though it
didn't start. This could cause the view building to get stuck or
complete without building all token ranges.

Registering all shards atomically should solve the above problems,
because we will always have statuses for all shards.

Fixes scylladb/scylladb#22989
2025-09-21 10:39:05 +02:00
Michał Chojnowski
9e70df83ab db: get rid of sstables-format-selector
Our sstable format selection logic is weird, and hard to follow.

If I'm not misunderstanding, the pieces are:
1. There's the `sstable_format` config entry, which currently
   doesn't do anything, but in the past it used to disable
   cluster features for versions newer than the specified one.
2. There are deprecated and unused config entries for individual
   versions (`enable_sstables_mc_format`, `enable_sstables_md_format`,
   etc).
3. There is a cluster feature for each version:
   ME_SSTABLE_FORMAT, MD_SSTABLE_FORMAT, etc.
   (Currently all sstable version features have been grandfathered,
   and aren't checked by the code anymore).
4. There's an entry in `system.scylla_local` which contains the
   latest enabled sstable version. (Why? Isn't this directly derived
   from cluster features anyway)?
5. There's `sstable_manager::_format` which contains the
   sstable version to be used for new writes.
   This field is updated by `sstables_format_selector`
   based on cluster features and the `system.scylla_local` entry.

I don't see why those pieces are needed. Version selection has the
following constraints:
1. New sstables must be written with a format that supports existing
   data. For example, range tombstones with an infinite bound are only
   supported by sstables since version "mc". So if a range tombstone
   with an infinite bound exists somewhere in the dataset,
   the format chosen for new sstables has to be at least as new as "mc".
2. A new format might only be used after a corresponding cluster feature
   is enabled. (Otherwise new sstables might become unreadable if they
   are sent to another node, or if a node is downgraded).
3. The user should have a way to inhibit format upgrades if they wish.

So far, constraint (1) has been fulfilled by never using formats older
than the newest format ever enabled on the node. (With an exception
for resharding and reshaping system tables).
Constraint (2) has been fulfilled by calling `sstable_manager::set_format`
only after the corresponding cluster feature is enabled.
Constraint (3) has been fulfilled by the ability to inhibit cluster
features by setting `sstable_format` to some fixed value.

The main thing I don't like about this whole setup is that it doesn't
let me downgrade the preferred sstable format. After a format is
enabled, there is no way to go back to writing the old format again.
That is no good -- after I make some performance-sensitive changes
in a new format, it might turn out to be a pessimization for the
particular workload, and I want to be able to go back.

This patch aims to give a way to downgrade formats without violating
the constraints. What it does is:
1. The entry in `system.scylla_local` becomes obsolete.
   After the patch we no longer update or read it.
   As far as I understand, the purpose of this entry is to prevent
   unwanted format downgrades (which is something cluster features
   are designed for) and it's updated if and only if relevant
   cluster features are updated. So there's no reason to have it,
   we can just directly use cluster features.
2. `sstable_format_selector` gets deleted.
   Without the `system.scylla_local` around, it's just a glorified
   feature listener.
3. The format selection logic is moved into `sstable_manager`.
   It already sees the `db::config` and the `gms::feature_service`.
   For the foreseeable future, the knowledge of enabled cluster features
   and current config should be enough information to pick the right formats.
4. The `sstable_format` entry in `db::config` is no longer intended to
   inhibit cluster features. Instead, it is intended to select the
   format for new sstables, and it becomes live-updatable.
5. Instead of writing new sstables with "highest supported" format,
   (which used to be set by `sstables_format_selector`) we write
   them with the "preferred" format, which is determined by
   `sstable_manager` based on the combination of enabled features
   and the current value of `sstable_format`.
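
To make point (5) concrete, here is a simplified standalone sketch of the selection rule (hypothetical names, not the actual `sstable_manager` code; constraint (1) is omitted and the grandfathered old formats are glossed over by assuming features are cumulative):

```cpp
// Sketch: pick the format for new sstables from (a) the live-updatable
// config value and (b) the newest cluster-feature-enabled format.
// Assumes features are cumulative: enabling "md" implies "mc" is usable.
#include <algorithm>
#include <cassert>

enum class sstable_version { mc = 1, md, me }; // ordered: newer is greater

sstable_version preferred_format(sstable_version configured,
                                 sstable_version newest_feature_enabled) {
    // Constraint (2): never write a format newer than the enabled features allow.
    // Constraint (3): otherwise honor the user's (possibly downgraded) choice.
    return std::min(configured, newest_feature_enabled);
}

int main() {
    // User prefers "me", but the cluster feature isn't enabled yet -> "md".
    assert(preferred_format(sstable_version::me, sstable_version::md) == sstable_version::md);
    // Downgrading the config is now possible: prefer "mc" even though "md" is enabled.
    assert(preferred_format(sstable_version::mc, sstable_version::md) == sstable_version::mc);
}
```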

Closes scylladb/scylladb#26092

[avi: Pavel found the reason for the scylla_local entry -
      it predates stable storage for cluster features]
2025-09-19 16:17:56 +03:00
Pavel Emelyanov
a1ea553fe1 code: Replace distributed<> with sharded<>
The latter is recommended in seastar, and the former was left as a
compatibility alias. The latest seastar explicitly marks it as deprecated,
so once the submodule is updated, compilation logs will explode.

Most of the patch is generated with

    for f in $(git grep -l '\<distributed<[A-Za-z0-9:_]*>') ; do sed -e 's/\<distributed<\([A-Za-z0-9:_]*\)>/sharded<\1>/g' -i $f; done
    for f in $(git grep -l distributed.hh); do sed -e 's/distributed.hh/sharded.hh/' -i $f ; done

and a small manual change in test/perf/perf.hh

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#26136
2025-09-19 12:22:51 +02:00
Avi Kivity
fe7e63f109 Merge 'transport: service_level_controller: create and use driver service level' from Andrzej Jackowski
This patch series:
 - Increases the number of allowed scheduling groups to allow creation of `sl:driver`
 - Implements `create_driver_service_level` that creates `sl:driver` with shares=200 if it wasn't already created
 - Implements creation of `sl:driver` for new systems and tests in `raft_initialize_discovery_leader`
 - Modifies `topology_coordinator` to create `sl:driver` after upgrades.
 - Implements using `sl:driver` for new connections in `transport/server`
 - Adds recognition of the driver's control connections to `transport/server`, forcing them to keep using `sl:driver`.
 - Adds tests to verify the new functionality
 - Modifies existing tests to let them pass after `sl:driver` is added
 - Modifies the documentation to contain new `sl:driver`

The changes were evaluated by a test with the following scenario ([test_connections-sl-driver.py](https://github.com/user-attachments/files/22021273/test_connections-sl-driver.py)):
 - Start ScyllaDB with one node
 - Create 1000 keyspaces, 1 table in each keyspace
 - Start `cassandra-stress` (`-rate threads=50  -mode native cql3`)
 - Run connection storm with 1000 sessions (100 python processes, 10 sessions each)

The maximum latency during the connection storm dropped **from 224.94ms to 41.43ms** (those numbers are averages from 20 test executions, where max latency was in [140ms, 361ms] before the change and [31.4ms, 61.5ms] after).

The snippet of cassandra-stress output from the moment of connection storm:
Before:
```
type       total ops,    op/s,    pk/s,   row/s,    mean,     med,     .95,     .99,    .999,     max,   time,   stderr, errors,  gc: #,  max ms,  sum ms,  sdv ms,      mb
...
total,        789206,   85887,   85887,   85887,     0.6,     0.3,     2.0,     2.0,     2.5,     5.0,    9.0,  0.09679,      0,      0,       0,       0,       0,       0
total,        909322,  120116,  120116,  120116,     0.4,     0.2,     1.9,     2.0,     2.1,     3.1,   10.0,  0.09053,      0,      0,       0,       0,       0,       0
total,        964392,   55070,   55070,   55070,     0.9,     0.4,     2.0,     4.5,     7.7,    18.9,   11.0,  0.09203,      0,      0,       0,       0,       0,       0
total,        975705,   11313,   11313,   11313,     4.4,     3.5,     6.5,    24.5,    82.7,    83.0,   12.0,  0.11713,      0,      0,       0,       0,       0,       0
total,        987548,   11843,   11843,   11843,     4.2,     3.5,     6.5,    33.7,    48.6,    51.5,   13.0,  0.13366,      0,      0,       0,       0,       0,       0
total,        995422,    7874,    7874,    7874,     6.3,     4.0,     7.7,    85.6,   112.9,   113.5,   14.0,  0.14753,      0,      0,       0,       0,       0,       0
total,       1007228,   11806,   11806,   11806,     4.3,     3.5,     6.5,    29.1,    43.8,    87.1,   15.0,  0.15598,      0,      0,       0,       0,       0,       0
total,       1012840,    5612,    5612,    5612,     8.2,     5.0,    11.5,   121.8,   166.6,   170.1,   16.0,  0.16535,      0,      0,       0,       0,       0,       0
total,       1016186,    3346,    3346,    3346,    13.4,     7.4,    20.1,   204.9,   207.6,   210.4,   17.0,  0.17405,      0,      0,       0,       0,       0,       0
total,       1025462,    9276,    9276,    9276,     6.3,     3.9,     9.6,    74.6,   206.8,   210.0,   18.0,  0.17800,      0,      0,       0,       0,       0,       0
total,       1035979,   10517,   10517,   10517,     4.8,     3.5,     6.7,    38.5,    82.6,    83.0,   19.0,  0.18120,      0,      0,       0,       0,       0,       0
total,       1047488,   11509,   11509,   11509,     4.3,     3.5,     6.0,    32.6,    72.3,    74.0,   20.0,  0.18334,      0,      0,       0,       0,       0,       0
total,       1077456,   29968,   29968,   29968,     1.7,     1.6,     2.9,     3.6,     7.0,     8.2,   21.0,  0.17943,      0,      0,       0,       0,       0,       0
total,       1105490,   28034,   28034,   28034,     1.8,     1.8,     3.5,     4.6,     5.3,    13.8,   22.0,  0.17609,      0,      0,       0,       0,       0,       0
total,       1132221,   26731,   26731,   26731,     1.9,     1.8,     3.8,     5.2,     8.4,    11.1,   23.0,  0.17314,      0,      0,       0,       0,       0,       0
total,       1162149,   29928,   29928,   29928,     1.7,     1.7,     3.0,     4.5,     8.0,     9.1,   24.0,  0.16950,      0,      0,       0,       0,       0,       0
...
```

After:
```
type       total ops,    op/s,    pk/s,   row/s,    mean,     med,     .95,     .99,    .999,     max,   time,   stderr, errors,  gc: #,  max ms,  sum ms,  sdv ms,      mb
...
total,        822863,   94379,   94379,   94379,     0.5,     0.3,     2.0,     2.0,     2.1,     3.7,    9.0,  0.06669,      0,      0,       0,       0,       0,       0
total,        937337,  114474,  114474,  114474,     0.4,     0.2,     2.0,     2.0,     2.1,     3.4,   10.0,  0.06301,      0,      0,       0,       0,       0,       0
total,        986630,   49293,   49293,   49293,     1.0,     1.0,     2.0,     2.1,    17.9,    19.0,   11.0,  0.07318,      0,      0,       0,       0,       0,       0
total,       1026734,   40104,   40104,   40104,     1.2,     1.0,     2.0,     2.2,     6.3,     7.1,   12.0,  0.08410,      0,      0,       0,       0,       0,       0
total,       1066124,   39390,   39390,   39390,     1.3,     1.0,     2.0,     2.2,     2.6,     3.4,   13.0,  0.09108,      0,      0,       0,       0,       0,       0
total,       1103082,   36958,   36958,   36958,     1.3,     1.1,     2.1,     2.5,     3.1,     4.2,   14.0,  0.09643,      0,      0,       0,       0,       0,       0
total,       1141987,   38905,   38905,   38905,     1.3,     1.0,     2.0,     2.4,    11.4,    12.7,   15.0,  0.09894,      0,      0,       0,       0,       0,       0
total,       1180023,   38036,   38036,   38036,     1.3,     1.0,     2.0,     3.7,     5.6,     7.1,   16.0,  0.10070,      0,      0,       0,       0,       0,       0
total,       1216481,   36458,   36458,   36458,     1.4,     1.0,     2.1,     3.6,     4.7,     5.0,   17.0,  0.10210,      0,      0,       0,       0,       0,       0
total,       1256819,   40338,   40338,   40338,     1.2,     1.0,     2.0,     2.2,     3.5,     5.4,   18.0,  0.10173,      0,      0,       0,       0,       0,       0
total,       1295122,   38303,   38303,   38303,     1.3,     1.0,     2.0,     2.4,    21.0,    21.1,   19.0,  0.10136,      0,      0,       0,       0,       0,       0
total,       1334743,   39621,   39621,   39621,     1.3,     1.0,     2.0,     2.3,     3.3,     4.0,   20.0,  0.10055,      0,      0,       0,       0,       0,       0
total,       1375579,   40836,   40836,   40836,     1.2,     1.0,     2.0,     2.1,     3.4,     5.7,   21.0,  0.09927,      0,      0,       0,       0,       0,       0
total,       1415576,   39997,   39997,   39997,     1.2,     1.0,     2.0,     2.3,     3.2,     4.1,   22.0,  0.09807,      0,      0,       0,       0,       0,       0
total,       1449268,   33692,   33692,   33692,     1.5,     1.4,     2.5,     3.2,     4.2,     5.6,   23.0,  0.09800,      0,      0,       0,       0,       0,       0
total,       1471873,   22605,   22605,   22605,     2.2,     2.0,     4.8,     5.9,     7.0,     7.9,   24.0,  0.10015,      0,      0,       0,       0,       0,       0
...
```

Fixes: https://github.com/scylladb/scylladb/issues/24411

This is a new feature, so no backport needed.

Closes scylladb/scylladb#25412

* github.com:scylladb/scylladb:
  docs: workload-prioritization: add driver service level
  test: add test to verify use of `sl:driver`
  transport: use `sl:driver` to handle driver's control connections
  transport: whitespace only change in update_scheduling_group
  transport: call update_scheduling_group for non-auth connections
  generic_server: transport: start using `sl:driver` for new connections
  test: add test_desc_* for driver service level
  test: service_levels: add tests for sl:driver creation and removal
  test: add reload_raft_topology_state() to ScyllaRESTAPIClient
  service_level_controller: automatically create `sl:driver`
  service_level_controller: methods to create driver service level
  service_level_controller: handle special sl:driver in DESC output
  topology_coordinator: add service_level_controller reference
  system_keyspace: add service_level_driver_created
  test: add MAX_USER_SERVICE_LEVELS
2025-09-18 19:45:17 +03:00
Piotr Dulikowski
5f55787e50 Merge 'CDC with tablets' from Michael Litvak
Initial implementation of CDC support in tablets-enabled keyspaces.

The design is described in https://docs.google.com/document/d/1qO5f2q5QoN5z1-rYOQFu6tqVLD3Ha6pphXKEqbtSNiU/edit?usp=sharing
It is followed closely for the most part, except for "Deciding when to change streams" - instead, streams are changed synchronously with tablet split / merge.
Instead of the stream switching algorithm with the double writes, we use a scheme similar to the previous method for vnodes - we add the new streams with a timestamp that is sufficiently far in the future.
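
A standalone sketch of that scheme (plain C++ with hypothetical names, not the actual CDC metadata code): stream sets are keyed by the timestamp from which they apply, a new set is inserted with a timestamp far enough in the future, and a write picks the newest set whose timestamp is not after the write's own timestamp.

```cpp
// Sketch: vnode-style stream switching - a write at time t uses the latest
// stream set that became effective at or before t.
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>
#include <vector>

using timestamp = int64_t;
using stream_id = uint64_t;
using stream_set = std::vector<stream_id>;

struct cdc_streams {
    std::map<timestamp, stream_set> sets_by_effective_time;

    // Called on tablet split/merge: the new set only takes effect in the
    // future, so in-flight writes keep using the current set consistently.
    void add_set(timestamp now, timestamp safety_margin, stream_set s) {
        sets_by_effective_time.emplace(now + safety_margin, std::move(s));
    }

    const stream_set& set_for_write(timestamp write_ts) const {
        auto it = sets_by_effective_time.upper_bound(write_ts);
        assert(it != sets_by_effective_time.begin()); // an initial set always exists
        return std::prev(it)->second;
    }
};

int main() {
    cdc_streams streams;
    streams.sets_by_effective_time.emplace(0, stream_set{1, 2});        // initial set
    streams.add_set(/*now=*/1000, /*safety_margin=*/500, {1, 2, 3, 4}); // after a split
    assert(streams.set_for_write(1200).size() == 2);  // before the switch point
    assert(streams.set_for_write(2000).size() == 4);  // after the switch point
}
```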

In this PR we:
* add new group0-based internal system tables for tablet stream metadata and loading it into in-memory CDC metadata
* add virtual tables for CDC consumers
* the write coordinator chooses a stream by looking it up in the CDC metadata
* enable creating tables with CDC enabled in tablets-enabled keyspaces; tablets are allocated for the CDC table, and a stream is created for each tablet.
* on tablet resize (split / merge), the topology coordinator creates a new stream set with a new stream for each new tablet.
* the cdc tablets are co-located with the base tablets

Fixes https://github.com/scylladb/scylladb/issues/22576

backport not needed - new feature

update dtests: https://github.com/scylladb/scylla-dtest/pull/5897
update java cdc library: https://github.com/scylladb/scylla-cdc-java/pull/102
update rust cdc library: https://github.com/scylladb/scylla-cdc-rust/pull/136

Closes scylladb/scylladb#23795

* github.com:scylladb/scylladb:
  docs/dev: update CDC dev docs for tablets
  doc: update CDC docs for tablets
  test: cluster_events: enable add_cdc and drop_cdc
  test/cql: enable cql cdc tests to run with tablets
  test: test_cdc_with_alter: adjust for cdc with tablets
  test/cqlpy: adjust cdc tests for tablets
  test/cluster/test_cdc_with_tablets: introduce cdc with tablets tests
  cdc: enable cdc with tablets
  topology coordinator: change streams on tablet split/merge
  cdc: virtual tables for cdc with tablets
  cdc: generate_stream_diff helper function
  cdc: choose stream in tablets enabled keyspaces
  cdc: rename get_stream to get_vnode_stream
  cdc: load tablet streams metadata from tables
  cdc: helper functions for reading metadata from tables
  cdc: colocate cdc table with base
  cdc: remove streams when dropping CDC table
  cdc: create streams when allocating tablets
  migration_listener: add on_before_allocate_tablet_map notification
  cdc: notify when creating or dropping cdc table
  cdc: move cdc table creation to pre_create
  cdc: add internal tables for cdc with tablets
  cdc: add cdc_with_tablets feature flag
  cdc: add is_log_schema helper
2025-09-18 13:39:37 +02:00
Piotr Dulikowski
4ed045a15c Merge 'db/view/view_building_worker: wrap shared_sstable in foreign_ptr' from Michał Jadwiszczak
When a staging sstable is registered with the view building worker, it needs to make a round trip from its original shard to shard 0
(in order to create a view building task) and back (to be eventually processed).
Until now this was done using a plain `sstables::shared_sstable` (= `lw_shared_ptr`), which is not safe to move between shards.

This patch fixes that by wrapping the pointer in `foreign_ptr` and obtaining the necessary information (owner shard, last token) on the original shard (instead of on shard 0).
All of these are then put into a freshly introduced structure, `staging_sstable_task_info`, which can be safely moved between shards.
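
A minimal Seastar-flavored sketch of the resulting shape (the structure and field names are hypothetical and the sstable type is a stand-in; `foreign_ptr`, `make_foreign`, `lw_shared_ptr` and `this_shard_id` are real Seastar facilities):

```cpp
// Sketch: collect everything shard-local up front, then ship a foreign_ptr
// (safe to move across shards) to shard 0 and back.
#include <seastar/core/sharded.hh>
#include <seastar/core/shared_ptr.hh>
#include <seastar/core/smp.hh>

// Stand-in for sstables::shared_sstable (an lw_shared_ptr under the hood).
struct sstable { long last_token; };
using shared_sstable = seastar::lw_shared_ptr<sstable>;

struct staging_sstable_task_info {
    seastar::foreign_ptr<shared_sstable> sst; // owned by the original shard
    unsigned owner_shard;                     // captured before leaving the shard
    long last_token;                          // ditto - no cross-shard dereference needed
};

// Built on the sstable's original shard, then moved to shard 0 and back.
staging_sstable_task_info make_task_info(shared_sstable sst) {
    auto token = sst->last_token;             // read while still on the owner shard
    return staging_sstable_task_info{
        .sst = seastar::make_foreign(std::move(sst)),
        .owner_shard = seastar::this_shard_id(),
        .last_token = token,
    };
}
```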

Fixes https://github.com/scylladb/scylladb/issues/25859

View building coordinator isn't present in any release yet, no backport needed.

Closes scylladb/scylladb#25832

* github.com:scylladb/scylladb:
  db/view/view_building_worker: fix indent
  db/view/view_building_worker: wrap `shared_sstable` in `foreign_ptr`
  db/view/view_building_worker: use table id in `register_staging_sstable_tasks()`
  db/view/view_building_worker: move helper functions higher
2025-09-18 10:24:27 +02:00
Piotr Dulikowski
b71af71ab5 Merge 'db/view/view_building_worker: change sharded<abort_source> to local abort_source' from Michał Jadwiszczak
Previously, the sharded abort_source was stopped at the end of `batch::do_work()`, which runs in parallel with the view building worker's main loop.
This leads to races because the worker may call `batch::abort()`, which accesses the abort_source.

This patch solves this by changing `sharded<abort_source>` into a plain `abort_source`.
Since `batch::do_work()` is now executed on the task's shard, all abort source checks are also done on the task's shard.
The only place where shard 0 uses the abort source is `batch::abort()`, but this method now does `smp::submit_to(replica.shard, [request abort])`, so the abort source is used exclusively on the task's shard.
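
A Seastar-flavored sketch of the resulting shape (member names are hypothetical; `abort_source` and `smp::submit_to` are the real Seastar facilities):

```cpp
// Sketch: the abort_source lives on (and is only touched from) the task's shard.
#include <seastar/core/abort_source.hh>
#include <seastar/core/future.hh>
#include <seastar/core/smp.hh>

struct batch {
    unsigned replica_shard;          // the shard that runs do_work()
    seastar::abort_source as;        // plain, shard-local abort source

    // Runs on replica_shard; checks the local abort source between steps.
    seastar::future<> do_work();

    // May be called from shard 0: hop to the task's shard before aborting,
    // so the abort_source is never accessed concurrently from two shards.
    seastar::future<> abort() {
        return seastar::smp::submit_to(replica_shard, [this] {
            if (!as.abort_requested()) {
                as.request_abort();
            }
        });
    }
};
```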

Fixes https://github.com/scylladb/scylladb/issues/25805
Fixes https://github.com/scylladb/scylladb/issues/26045

View building coordinator hasn't been released yet, so no backport needed.

Closes scylladb/scylladb#26059

* github.com:scylladb/scylladb:
  db/view/view_building_worker: fix indents
  db/view/view_building_worker: change `sharded<abort_source>` to local `abort_source`
  db/view/view_building_worker: execute entire `batch::do_work` on tasks shard
  db/view/view_building_worker: store reference to sharded worker in batch
2025-09-18 10:11:20 +02:00
Andrzej Jackowski
dd9b4c64d2 system_keyspace: add service_level_driver_created
This commit extends the system.scylla_local table with an additional
key/value pair that is used later in this patch series to
record that `sl:driver` was already created. The purpose
of storing this information is to ensure that `sl:driver` is
not recreated after being intentionally removed.

A new mutation is included in `register_raft_pull_snapshot` to keep
`service_level_driver_created` in the state machine snapshot, which is
required to properly propagate the value when a new node is added
to the cluster.

Refs: scylladb/scylladb#24411
2025-09-18 09:28:32 +02:00
Michał Jadwiszczak
1d8b41a51d db/view/view_building_worker: fix indents 2025-09-18 03:24:43 +02:00
Michał Jadwiszczak
99db5a6c30 db/view/view_building_worker: change sharded<abort_source> to local abort_source
Previously, the sharded abort_source was stopped at the end of batch::do_work(),
which runs in parallel with the view building worker's main loop.
This leads to races because the worker may call batch::abort(),
which accesses the abort_source.

This patch solves this by changing `sharded<abort_source>` into a plain
`abort_source`.
Since `batch::do_work()` is now executed on the task's shard,
all abort source checks are also done on the task's shard.
The only place where shard 0 uses the abort source is `batch::abort()`,
but this method now does `smp::submit_to(replica.shard, [request abort])`,
so the abort source is used exclusively on the task's shard.

Fixes scylladb/scylladb#25805
Fixes scylladb/scylladb#26045
2025-09-18 03:24:43 +02:00
Michał Jadwiszczak
7b9db335c0 db/view/view_building_worker: execute entire batch::do_work on tasks shard
This change will allow us to get rid of problematic
`sharded<abort_source>` and use local `abort_source` instead.
2025-09-18 03:24:43 +02:00
Michał Jadwiszczak
2f65af8aa7 db/view/view_building_worker: store reference to sharded worker in batch
Change the batch's reference to the view building worker into a reference
to the sharded container. In the next commits, I'm going to execute
`do_work()` exclusively on the task's target shard, where the sharded
reference will be more useful.
2025-09-18 03:24:43 +02:00
Michał Jadwiszczak
e4a0de53ea db/view/view_building_worker: fix indent 2025-09-18 02:57:36 +02:00
Michał Jadwiszczak
7dfb76f9a7 db/view/view_building_worker: wrap shared_sstable in foreign_ptr
When a staging sstable is registered with the view building worker,
it needs to make a round trip from its original shard to shard 0
(in order to create a view building task) and back (to be eventually
processed).
Until now this was done using a plain `sstables::shared_sstable`
(= `lw_shared_ptr`), which is not safe to move between shards.

This patch fixes that by wrapping the pointer in `foreign_ptr` and
obtaining the necessary information (owner shard, last token) on the
original shard (instead of on shard 0).
All of these are then put into a freshly introduced structure,
`staging_sstable_task_info`, which can be safely moved between shards.

Fixes scylladb/scylladb#25859
2025-09-18 02:57:36 +02:00
Michał Jadwiszczak
50678030c0 db/view/view_building_worker: use table id in register_staging_sstable_tasks()
There is no need to pass the pointer only to get the table's id.
2025-09-18 02:57:35 +02:00
Michał Jadwiszczak
b44c223d47 db/view/view_building_worker: move helper functions higher
So they can be used in
`view_building_worker::register_staging_sstable_tasks()`.
2025-09-18 02:57:35 +02:00
Ernest Zaslavsky
a1f18a8883 treewide: Move schema related files to a schema directory
As requested in #22111, moved the files and fixed other includes and the build system.
Moved files:
- frozen_schema.hh
- frozen_schema.cc
- schema_mutations.hh
- schema_mutations.cc
- column_computation.hh

Fixes: #22111

Closes scylladb/scylladb#25089
2025-09-17 17:31:05 +03:00
Botond Dénes
cc5153ef8c Merge 'db: cache: consider preempting after each partition' from Aleksandra Martyniuk
Currently, during cache invalidation we check if we need to preempt
only after the partition gets invalidated. This may lead to stalls
if we have a chain of filtered-out partitions.

Check for preemption even if the partition does not get invalidated.
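
A rough Seastar-coroutine sketch of the new loop shape (names are hypothetical, not the actual cache code):

```cpp
// Sketch: yield based on need_preempt() once per partition, whether or not
// the partition actually got invalidated.
#include <seastar/core/coroutine.hh>
#include <seastar/coroutine/maybe_yield.hh>

template <typename Range, typename Filter, typename Invalidate>
seastar::future<> invalidate_matching(Range& partitions, Filter matches, Invalidate invalidate) {
    for (auto& p : partitions) {
        if (matches(p)) {
            invalidate(p);
        }
        // Previously the preemption check lived inside the "matches" branch,
        // so a long run of filtered-out partitions never yielded -> reactor stalls.
        co_await seastar::coroutine::maybe_yield();
    }
}
```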

Refs: https://github.com/scylladb/scylladb/issues/9136.

Optimization; no backport

Closes scylladb/scylladb#26053

* github.com:scylladb/scylladb:
  db: fix indentation
  db: cache: consider preempting after each partition
2025-09-17 17:26:29 +03:00
Michael Litvak
b8442c6087 cdc: virtual tables for cdc with tablets
Define two new virtual tables in the system keyspace: cdc_timestamps and
cdc_streams. They expose the internal cdc metadata for tablets-enabled
keyspaces, to be consumed by users of the CDC log.

cdc_timestamps lists all timestamps for a table at which a stream change
occurred.

cdc_streams additionally lists the current stream set for each
table and timestamp, as well as the difference - closed and opened streams -
from the previous stream set.
2025-09-17 14:47:12 +02:00
Michael Litvak
aa61f074b5 cdc: helper functions for reading metadata from tables
Define functions in system_keyspace that read from the internal cdc
tables and construct the data into internal cdc data structures.
2025-09-17 14:47:12 +02:00
Michael Litvak
b9ee28eaab cdc: add internal tables for cdc with tablets
Add new group0-based tables in system keyspace to be used for cdc with tablets:
* cdc_streams_state - describing "base" state of CDC streams for each
  table - an initial timestamp and a stream set.
* cdc_streams_history - describing subsequent committed stream sets as
  diffs (opened / closed streams) from the previous set.
2025-09-17 14:47:11 +02:00
Piotr Dulikowski
6a90a1fd29 Merge 'db/view/view_building_worker: split batch's data preparation and execution' from Michał Jadwiszczak
The view building batch lives on shard 0 but it might be doing
work on the shard which owns the tablet replica.
Until now the batch data was accessed from multiple shards (shard 0 and
the shard where the batch was executed).

This patch fixes this by splitting task execution into:
- preparation, which always happens on shard 0
- actual execution of the tasks on the relevant shard; all necessary
  data is copied to that shard and the batch object isn't accessed.

Fixes https://github.com/scylladb/scylladb/issues/25804

View building coordinator hasn't been released yet, so no backport needed.

Closes scylladb/scylladb#26058

* github.com:scylladb/scylladb:
  db/view/view_building_worker: move try-catch outside `invoke_on()`
  db/view/view_building_worker: split batch's data preparation and execution
2025-09-17 14:17:25 +02:00
Michał Jadwiszczak
d98237b33c db/view/view_building_worker: move try-catch outside invoke_on()
It's just a stylistic change; to me, wrapping `invoke_on()` in a try-catch
block looks better than the other way around.
2025-09-16 23:15:44 +02:00
Michał Jadwiszczak
9458ceff8f db/view/view_building_worker: split batch's data preparation and execution
The view building batch lives on shard 0 but it might be doing
work on the shard which owns the tablet replica.
Until now the batch data was accessed from multiple shards (shard 0 and
the shard where the batch was executed).

This patch fixes this by splitting task execution into:
- preparation, which always happens on shard 0
- actual execution of the tasks on the relevant shard; all necessary
  data is copied to that shard and the batch object isn't accessed.

Fixes scylladb/scylladb#25804
2025-09-16 23:13:36 +02:00
Ernest Zaslavsky
d624413ddd treewide: Move query related files to a new query directory
As requested in #22120, moved the files and fixed other includes and build system.

Moved files:
- query.cc
- query-request.hh
- query-result.hh
- query-result-reader.hh
- query-result-set.cc
- query-result-set.hh
- query-result-writer.hh
- query_id.hh
- query_result_merger.hh

Fixes: #22120

This is a cleanup, no need to backport

Closes scylladb/scylladb#25105
2025-09-16 23:40:47 +03:00
Sergey Zolotukhin
2640b288c2 raft: disable caching for raft log.
This change disables caching for raft log table due to the following reasons:
* Immediate reason is a deficiency in handling emerging range tombstones in the cache, which causes stalls.
* Long-term reason is that sequential reads from the raft log do not benefit from the cache, making it better to bypass it to free up space and avoid stalls.

Fixes scylladb/scylladb#26027

Closes scylladb/scylladb#26031
2025-09-16 23:40:47 +03:00
Aleksandra Martyniuk
17e9ec11d7 db: fix indentation 2025-09-16 14:49:54 +02:00
Aleksandra Martyniuk
0024339a71 db: cache: consider preempting after each partition
Currently, during cache invalidation we check if we need to preempt
only after the partition gets invalidated. This may lead to stalls
if we have a chain of filtered-out partitions.

Check for preemption even if the partition does not get invalidated.

Refs: #9136.
2025-09-16 14:45:28 +02:00
Aleksandra Martyniuk
75b772adfb db: optimize cache invalidation following repair/streaming
Currently, if a new sstable is created during repair/streaming,
we invalidate its whole token range in the cache. If the sstable
is sparse, we unnecessarily clear too much data.

Modify cache invalidation so that only the partitions present
in the sstable are cleared.

To check whether a partition is present in the sstable, we use bloom
filters. Bloom filters may return false positives and show that
an sstable contains a partition, even though it does not. Due to that
we may invalidate a bit more than we need to, but the cache will be
in valid state.

An issue arises when we do not invalidate two consecutive partitions
that are continuous. The sstable may contain a token that falls
between these partitions, breaking the continuity. To check that, we
would need to scan the sstable index. However, such a change would
noticeably complicate the invalidation, both in performance and in code.
In this change, the sstable index reader isn't used. Instead, the continuity
flag is unset for all scanned partitions. This comes at the cost of
heavier reads, as we will need to verify continuity when reading more
than one partition from the cache.
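
A standalone sketch of the approach (plain C++; the bloom filter is just a predicate here and all names are hypothetical):

```cpp
// Sketch: walk the cache entries covered by the new sstable's token range;
// drop only entries the bloom filter may contain, but clear continuity on
// every scanned entry, since an unseen partition may fall between two
// retained neighbours.
#include <functional>
#include <string>
#include <vector>

struct cache_entry {
    std::string key;
    bool present = true;     // still cached
    bool continuous = true;  // "nothing exists between me and my predecessor"
};

void invalidate_from_sstable(std::vector<cache_entry>& range,
                             const std::function<bool(const std::string&)>& bloom_may_contain) {
    for (auto& e : range) {
        if (bloom_may_contain(e.key)) {
            e.present = false;       // false positives only over-invalidate, never corrupt
        }
        e.continuous = false;        // reads re-establish continuity later, at some cost
    }
}

int main() {
    std::vector<cache_entry> cached{{"a"}, {"b"}, {"c"}};
    invalidate_from_sstable(cached, [](const std::string& k) { return k == "b"; });
    // "b" is dropped; "a" and "c" stay but lose their continuity flag.
}
```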

Fixes: https://github.com/scylladb/scylladb/issues/9136.

Closes scylladb/scylladb#25996
2025-09-14 19:48:14 +03:00
Radosław Cybulski
30306c3375 Remove const & from tags_extension constructor
`tags_extension` constructor unnecessarily takes `std::map` by const ref,
forcing a copy. This patch removes the const ref for performance reasons.

Closes scylladb/scylladb#25977
2025-09-14 13:32:21 +03:00
Radosław Cybulski
436150eb52 treewide: fix spelling errors
Fix spelling errors reported by copilot on github.
Remove single use namespace alias.

Closes scylladb/scylladb#25960
2025-09-12 15:58:19 +03:00
Avi Kivity
c91b326d5a Merge 'transport: replace throwing protocol_exception with returns' from Dario Mirovic
Replace throwing `protocol_exception` with returning it as a result or an exceptional future in the transport server module. The goal is to improve performance.

Most of the `protocol_exception` throws were made from the `fragmented_temporary_buffer` module, by passing `exception_thrower()` to its `read*` methods. `fragmented_temporary_buffer` is changed so that it now accepts an exception creator, not an exception thrower. The `fragmented_temporary_buffer_concepts::ExceptionCreator` concept replaces `fragmented_temporary_buffer_concepts::ExceptionThrower`, and all methods that used to throw now return a failed result of type `utils::result_with_eptr`. This change is then propagated to the callers.

The scope of this patch is `protocol_exception`, so commitlog just calls the `.value()` method on the result. If the result is a failure, that throws the exception stored in the result, as defined by `utils::result_with_eptr_throw_policy`. This means that the behavior of the commitlog module stays the same.

The transport server module handles results gracefully. All caller functions that return a non-future value `T` now return `utils::result_with_eptr<T>`. When the caller is a function that returns a future and it receives a failed result, `make_exception_future(std::move(failed_result).value())` is returned. The rest of the call stack up to the transport server's `handle_error` function already works without throwing, and that's how zero throws are achieved.

cql3 module changes do the same as transport server module.
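
A simplified standalone illustration of the pattern, using `std::expected` as a stand-in for `utils::result_with_eptr` (names are hypothetical; this is not the actual transport code):

```cpp
// Sketch: parsing returns a result; the caller decides whether to propagate
// the failure as a value or (at a future boundary) as an exceptional future.
#include <cstdint>
#include <exception>
#include <expected>
#include <span>
#include <stdexcept>

template <typename T>
using result_with_eptr = std::expected<T, std::exception_ptr>;

result_with_eptr<uint16_t> read_u16(std::span<const uint8_t> buf) {
    if (buf.size() < 2) {
        // Build the exception, but don't throw it - no unwinding on the hot path.
        return std::unexpected(
            std::make_exception_ptr(std::runtime_error("truncated frame")));
    }
    return static_cast<uint16_t>((buf[0] << 8) | buf[1]);
}

result_with_eptr<uint16_t> read_stream_id(std::span<const uint8_t> frame) {
    auto r = read_u16(frame);
    if (!r) {
        return std::unexpected(r.error());  // propagate the failed result upward
    }
    return *r;
}

int main() {
    uint8_t frame[] = {0x00, 0x2a};
    auto ok = read_stream_id(frame);   // holds 42
    auto err = read_stream_id({});     // holds the truncated-frame exception
    return ok.has_value() && !err.has_value() ? 0 : 1;
}
```

In the real patch, commitlog keeps its old behavior by calling `.value()`, whose throw policy rethrows the stored exception; `std::expected` is used here only to keep the sketch self-contained.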

Benchmark that is not yet merged has commit `67fbe35833e2d23a8e9c2dcb5e04580231d8ec96`, [GitHub diff view](https://github.com/scylladb/scylladb/compare/master...nuivall:scylladb:perf_cql_raw). It uses either read or write query.

Command line used:
```
./build/release/scylla perf-cql-raw --workdir ~/tmp/scylladir --smp 1 --developer-mode 1 --workload write --duration 300 --concurrency 1000 --username cassandra --password cassandra 2>/dev/null
```
The only thing changed across runs is `--workload write`/`--workload read`.

Built and run on `release` target.

<details>

```
throughput:
        mean=   36946.04 standard-deviation=1831.28
        median= 37515.49 median-absolute-deviation=1544.52
        maximum=39748.41 minimum=28443.36
instructions_per_op:
        mean=   108105.70 standard-deviation=965.19
        median= 108052.56 median-absolute-deviation=53.47
        maximum=124735.92 minimum=107899.00
cpu_cycles_per_op:
        mean=   70065.73 standard-deviation=2328.50
        median= 69755.89 median-absolute-deviation=1250.85
        maximum=92631.48 minimum=66479.36

⏱  real=5:11.08  user=2:00.20  sys=2:25.55  cpu=85%
```

```
throughput:
        mean=   40718.30 standard-deviation=2237.16
        median= 41194.39 median-absolute-deviation=1723.72
        maximum=43974.56 minimum=34738.16
instructions_per_op:
        mean=   117083.62 standard-deviation=40.74
        median= 117087.54 median-absolute-deviation=31.95
        maximum=117215.34 minimum=116874.30
cpu_cycles_per_op:
        mean=   58777.43 standard-deviation=1225.70
        median= 58724.65 median-absolute-deviation=776.03
        maximum=64740.54 minimum=55922.58

⏱  real=5:12.37  user=27.461  sys=3:54.53  cpu=83%
```

```
throughput:
        mean=   37107.91 standard-deviation=1698.58
        median= 37185.53 median-absolute-deviation=1300.99
        maximum=40459.85 minimum=29224.83
instructions_per_op:
        mean=   108345.12 standard-deviation=931.33
        median= 108289.82 median-absolute-deviation=55.97
        maximum=124394.65 minimum=108188.37
cpu_cycles_per_op:
        mean=   70333.79 standard-deviation=2247.71
        median= 69985.47 median-absolute-deviation=1212.65
        maximum=92219.10 minimum=65881.72

⏱  real=5:10.98  user=2:40.01  sys=1:45.84  cpu=85%
```

```
throughput:
        mean=   38353.12 standard-deviation=1806.46
        median= 38971.17 median-absolute-deviation=1365.79
        maximum=41143.64 minimum=32967.57
instructions_per_op:
        mean=   117270.60 standard-deviation=35.50
        median= 117268.07 median-absolute-deviation=16.81
        maximum=117475.89 minimum=117073.74
cpu_cycles_per_op:
        mean=   57256.00 standard-deviation=1039.17
        median= 57341.93 median-absolute-deviation=634.50
        maximum=61993.62 minimum=54670.77

⏱  real=5:12.82  user=4:10.79  sys=11.530  cpu=83%
```

This shows ~240 instructions per op increase for reads and ~180 instructions per op increase for writes.

Tests have been run multiple times, with almost identical results. Each run lasted 300 seconds. Number of operations executed is roughly 38k per second * 300 seconds = 11.4m ops.

Update:

I have repeated the benchmark with clean state - reboot computer, put in performance mode, rebuild, closed other apps that might affect CPU and disk usage.

run count: 5 times before and 5 times after the patch
duration: 300 seconds

Average write throughput median before patch: 41155.99
Average write throughput median after patch: 42193.22

Median absolute deviation is also lower now, with values in range 350-550, while the previous runs' values were in range 750-1350.

</details>

Built and run on `release` target.

<details>

./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache false --bypass-cache 2>/dev/null

```
throughput:
        mean=   14910.90 standard-deviation=477.72
        median= 14956.73 median-absolute-deviation=294.16
        maximum=16061.18 minimum=13198.68
instructions_per_op:
        mean=   659591.63 standard-deviation=495.85
        median= 659595.46 median-absolute-deviation=324.91
        maximum=661184.94 minimum=658001.49
cpu_cycles_per_op:
        mean=   213301.49 standard-deviation=2724.27
        median= 212768.64 median-absolute-deviation=1403.85
        maximum=225837.15 minimum=208110.12

⏱  real=5:19.26  user=5:00.22  sys=15.827  cpu=98%
```

./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache false 2>/dev/null

```
throughput:
        mean=   93345.45 standard-deviation=4499.00
        median= 93915.52 median-absolute-deviation=2764.41
        maximum=104343.64 minimum=79816.66
instructions_per_op:
        mean=   65556.11 standard-deviation=97.42
        median= 65545.11 median-absolute-deviation=71.51
        maximum=65806.75 minimum=65346.25
cpu_cycles_per_op:
        mean=   34160.75 standard-deviation=803.02
        median= 33927.16 median-absolute-deviation=453.08
        maximum=39285.19 minimum=32547.13

⏱  real=5:03.23  user=4:29.46  sys=29.255  cpu=98%
```

./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache true 2>/dev/null

```
throughput:
        mean=   206982.18 standard-deviation=15894.64
        median= 208893.79 median-absolute-deviation=9923.41
        maximum=232630.14 minimum=127393.34
instructions_per_op:
        mean=   35983.27 standard-deviation=6.12
        median= 35982.75 median-absolute-deviation=3.75
        maximum=36008.24 minimum=35952.14
cpu_cycles_per_op:
        mean=   17374.87 standard-deviation=985.06
        median= 17140.81 median-absolute-deviation=368.86
        maximum=26125.38 minimum=16421.99

⏱  real=5:01.23  user=4:57.88  sys=0.124  cpu=98%
```

./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache false --bypass-cache 2>/dev/null

```
throughput:
        mean=   16198.26 standard-deviation=902.41
        median= 16094.02 median-absolute-deviation=588.58
        maximum=17890.10 minimum=13458.74
instructions_per_op:
        mean=   659752.73 standard-deviation=488.08
        median= 659789.16 median-absolute-deviation=334.35
        maximum=660881.69 minimum=658460.82
cpu_cycles_per_op:
        mean=   216070.70 standard-deviation=3491.26
        median= 215320.37 median-absolute-deviation=1678.06
        maximum=232396.48 minimum=209839.86

⏱  real=5:17.33  user=4:55.87  sys=18.425  cpu=99%
```

./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache false 2>/dev/null

```
throughput:
        mean=   97067.79 standard-deviation=2637.79
        median= 97058.93 median-absolute-deviation=1477.30
        maximum=106338.97 minimum=87457.60
instructions_per_op:
        mean=   65695.66 standard-deviation=58.43
        median= 65695.93 median-absolute-deviation=37.67
        maximum=65947.76 minimum=65547.05
cpu_cycles_per_op:
        mean=   34300.20 standard-deviation=704.66
        median= 34143.92 median-absolute-deviation=321.72
        maximum=38203.68 minimum=33427.46

⏱  real=5:03.22  user=4:31.56  sys=29.164  cpu=99%
```

./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache true 2>/dev/null

```
throughput:
        mean=   223495.91 standard-deviation=6134.95
        median= 224825.90 median-absolute-deviation=3302.09
        maximum=234859.90 minimum=193209.69
instructions_per_op:
        mean=   35981.41 standard-deviation=3.16
        median= 35981.13 median-absolute-deviation=2.12
        maximum=35991.46 minimum=35972.55
cpu_cycles_per_op:
        mean=   17482.26 standard-deviation=281.82
        median= 17424.08 median-absolute-deviation=143.91
        maximum=19120.68 minimum=16937.43

⏱  real=5:01.23  user=4:58.54  sys=0.136  cpu=99%
```

</details>

Fixes: #24567

This PR is a continuation of #24738 [transport: remove throwing protocol_exception on connection start](https://github.com/scylladb/scylladb/pull/24738). This PR does not solve a burning issue, but is rather an improvement in the same direction. As it is just an enhancement, it should not be backported.

Closes scylladb/scylladb#25408

* github.com:scylladb/scylladb:
  test/cqlpy: add protocol exception tests
  test/cqlpy: `test_protocol_exceptions.py` refactor message frame building
  test/cqlpy: `test_protocol_exceptions.py` refactor duplicate code
  transport: replace `make_frame` throw with return result
  cql3: remove throwing `protocol_exception`
  transport: replace throw in validate_utf8 with result_with_exception_ptr return
  transport: replace throwing protocol_exception with returns
  utils: add result_with_exception_ptr
  test/cqlpy: add unknown compression algorithm test case
2025-09-10 21:54:15 +03:00
Wojciech Mitros
1f9be235b8 mv: delete previously undetected ghost rows in PRUNE MATERIALIZED VIEW statement
The PRUNE MATERIALIZED VIEW statement is supposed to remove ghost rows from the
view. Ghost rows are rows in the view with no corresponding row in the base table.
Before this patch, only view rows whose base-table primary key columns had
values different from any of the base rows were treated as ghost rows by the PRUNE
statement. However, view rows which have a column in their primary key that's not
in the base primary key can also be ghost rows, if this column has a different value
than in the base row with the same values of the remaining primary key columns. That's
because such rows won't be deleted unless we change the value of this column in the
base table to this specific value.
In this patch we add a check for this column in the PRUNE MATERIALIZED VIEW logic.
If this column isn't the same in the base table and the view, these rows are also
deleted.
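
A standalone sketch of the extended check (plain C++, hypothetical names and a single extra view-key column; not the actual PRUNE implementation):

```cpp
// Sketch: a view row is a ghost row if its base row is missing entirely,
// OR if the extra view-key column (a base regular column promoted into the
// view's primary key) no longer matches the value in the base row.
#include <optional>
#include <string>

struct base_row { std::string pk; std::string v; };       // v is a regular column
struct view_row { std::string v; std::string base_pk; };  // view key is (v, base_pk)

// looked_up_base is the base row for the view row's base primary key, if any.
bool is_ghost_row(const view_row& vr,
                  const std::optional<base_row>& looked_up_base) {
    if (!looked_up_base) {
        return true;                   // the old check: no matching base row at all
    }
    return looked_up_base->v != vr.v;  // the new check: stale value of the view-key column
}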

Fixes https://github.com/scylladb/scylladb/issues/25655

Closes scylladb/scylladb#25720
2025-09-10 07:35:00 +02:00
Wojciech Mitros
10b8e1c51c storage_proxy: send hints to pending replicas
Consider the following scenario:
- Current replica set is [A, B, C]
- write succeeds on [A, B], and a hint is logged for node C
- before the hint is replayed, D bootstraps and the token migrates from C to D
- hint is replayed to node C while D is pending, but it's too late, since streaming for that token is already done
- C is cleaned up, replayed data is lost, and D has a stale copy until next repair.
In this scenario we effectively fail to send the hint. This scenario is also more likely to happen with tablets,
as it can happen for every tablet migration.

This issue is particularly detrimental to materialized views. View updates use hints by default and a specific
view update may be sent to just one view replica (when a single base replica has a different row state due to
reordering or missed writes). When we lose a hint for such a view update, we can generate a persistent inconsistency
between the base and view - ghost rows can appear due to a lost tombstone and rows may be missing in the view due
to a lost row update. Such inconsistencies can't be fixed by repairing either the view or the base table.

To handle this, in this patch we add the pending replicas to the list of targets of each hint, even if the original
target is still alive.
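
A standalone sketch of the target computation (plain C++, hypothetical names):

```cpp
// Sketch: when replaying a hint for a token, target the original endpoint
// plus any pending (e.g. migration/rebuild destination) replicas for that
// token, even if the original target is alive.
#include <set>
#include <string>
#include <vector>

using endpoint = std::string;

std::vector<endpoint> hint_targets(const endpoint& original_target,
                                   const std::set<endpoint>& pending_replicas) {
    std::vector<endpoint> targets{original_target};
    for (const auto& p : pending_replicas) {
        if (p != original_target) {
            targets.push_back(p);   // redundant in some cases, but never lossy
        }
    }
    return targets;
}

int main() {
    // Token moving from C to D: the hint logged for C now also goes to D.
    auto targets = hint_targets("C", {"D"});
    // targets == {"C", "D"}
}
```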

This will cause some updates to be redundant. These updates are probably unavoidable for now, but they shouldn't
be too common either. The scenarios for them are:
1. managing to send the hint to the source of a migrating replica before its token is streamed - the write will
arrive on the pending replica anyway via streaming
2. the hint target not being the source of the migration - if we managed to apply the original write of the hint to
the actual source of the migration, the pending replica will get it during streaming
3. sending the same hint to many targets at a similar time - while sending to each target, we'll see the same pending
replica for the hint so we'll send it multiple times
4. possible retries where even though the hint was successfully sent to the main target, we failed to send it to the
pending replica, so we need to retry the entire write

This patch handles both tablet migrations and tablet rebuilds. In the future, for tablet migrations, we can avoid
sending the hint to pending replicas if the hint target is not the source of the migration, which would allow us to
avoid the redundant writes (2) and (3). For rack-aware RF, this will be as simple as checking whether the replicas are
in the same rack.

We also add a test case reproducing the issue.

Co-Authored-By: Raphael S. Carvalho <raphaelsc@scylladb.com>

Fixes https://github.com/scylladb/scylladb/issues/19835

Closes scylladb/scylladb#25590
2025-09-08 09:18:20 +02:00
Radosław Cybulski
c242234552 Revert "build: add precompiled headers to CMakeLists.txt"
This reverts commit 01bb7b629a.

Closes scylladb/scylladb#25735
2025-09-03 09:46:00 +03:00
Calle Wilund
bc20861afb system_keyspace: Prune dropped tables from truncation on start/drop
Fixes #25683

Once a table drop is complete, there should be no reason to retain
truncation records for it, as any replay should skip mutations
anyway (no CF), and iff we somehow resurrect a dropped table,
this replay-resurrected data is the least problem anyway.

Adds a prune phase to the startup drop_truncation_rp_records run,
which skips updating and instead deletes records for non-existent
tables (which should also patch any existing servers with lingering
data).

Also does an explicit delete of records on actual table DROP, to
ensure we don't grow this table more than needed even in long
uptime nodes.

Small unit test included.

Closes scylladb/scylladb#25699
2025-09-03 07:25:34 +03:00
Avi Kivity
1f4c9b1528 Merge 'system_keyspace: add peers cache to get_ip_from_peers_table' from Petr Gusev
The gossiper can call `storage_service::on_change` frequently (see  scylladb/scylla-enterprise#5613), which may cause high CPU load and even trigger OOMs or related issues.

This PR adds a temporary cache for `system.peers` to resolve host_id -> ip without hitting storage on every call. The cache is short-lived to handle the unlikely case where `system.peers` is updated directly via CQL.
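
A standalone sketch of such a short-lived cache (plain C++; the names and the TTL are made up, not the actual system_keyspace code):

```cpp
// Sketch: resolve host_id -> ip via an expiring snapshot of system.peers,
// so frequent on_change callbacks don't hit storage every time.
#include <chrono>
#include <functional>
#include <optional>
#include <string>
#include <unordered_map>

using host_id = std::string;
using ip_addr = std::string;

class peers_cache {
    std::unordered_map<host_id, ip_addr> _map;
    std::chrono::steady_clock::time_point _loaded_at{};
    std::chrono::seconds _ttl{5};   // short-lived on purpose: peers can be edited via CQL

public:
    std::optional<ip_addr> get_ip(
            const host_id& id,
            const std::function<std::unordered_map<host_id, ip_addr>()>& load_peers_table) {
        auto now = std::chrono::steady_clock::now();
        if (_map.empty() || now - _loaded_at > _ttl) {
            _map = load_peers_table();   // the only path that actually reads storage
            _loaded_at = now;
        }
        if (auto it = _map.find(id); it != _map.end()) {
            return it->second;
        }
        return std::nullopt;
    }
};

int main() {
    peers_cache cache;
    auto load = [] { return std::unordered_map<host_id, ip_addr>{{"host-1", "10.0.0.1"}}; };
    auto ip = cache.get_ip("host-1", load);   // first call loads the table
    ip = cache.get_ip("host-1", load);        // subsequent calls within the TTL hit the cache
}
```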

This is a temporary fix; a more thorough solution is tracked in https://github.com/scylladb/scylladb/issues/25620.

Fixes scylladb/scylladb#25660

backport: this patch needs to be backported to all supported versions (2025.1/2/3).

Closes scylladb/scylladb#25658

* github.com:scylladb/scylladb:
  storage_service: move get_host_id_to_ip_map to system_keyspace
  system_keyspace: use peers cache in get_ip_from_peers_table
  storage_service: move get_ip_from_peers_table to system_keyspace
2025-08-31 15:34:35 +03:00
Avi Kivity
bc5773f777 Merge 'Add out of space prevention mechanisms' from Łukasz Paszkowski
When scaling out is delayed or fails, it is crucial to ensure that clusters remain operational
and recoverable even under extreme conditions. To achieve this, the following proactive measures
are implemented:
- reject writes
      - includes: inserts, updates, deletes, counter updates, hints, read+repair and lwt writes
      - applicable to: user tables, views, CDC log, audit, cql tracing
- stop running compactions/repairs and prevent from starting new ones
- reject incoming tablet migrations

The aforementioned mechanisms are automatically enabled when a node's disk utilization reaches
the critical level (default: 98%) and disabled when the utilization drops below the threshold.
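
A standalone sketch of the threshold mechanism (plain C++; the names are hypothetical and not the actual disk_space_monitor API):

```cpp
// Sketch: flip into/out of "critical disk utilization" mode around a
// configurable threshold and notify subscribers (write path, compaction
// manager, repair service, tablet migration) on each transition.
#include <functional>
#include <vector>

class disk_space_monitor {
    double _critical_level = 0.98;   // default: 98%
    bool _critical = false;
    std::vector<std::function<void(bool critical)>> _subscribers;

public:
    void subscribe(std::function<void(bool)> cb) { _subscribers.push_back(std::move(cb)); }

    // Called periodically with the current utilization in [0.0, 1.0].
    void on_utilization(double utilization) {
        bool now_critical = utilization >= _critical_level;
        if (now_critical != _critical) {
            _critical = now_critical;
            for (auto& cb : _subscribers) {
                cb(_critical);   // e.g. reject writes / stop compactions / refuse migrations
            }
        }
    }
};

int main() {
    disk_space_monitor mon;
    bool writes_rejected = false;
    mon.subscribe([&](bool critical) { writes_rejected = critical; });
    mon.on_utilization(0.99);   // crosses the threshold -> writes rejected
    mon.on_utilization(0.95);   // drops back below -> writes allowed again
}
```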

Apart from that, the series adds tests that require mounted volumes to simulate running out of space.
The paths to the volumes can be provided using a pytest argument, i.e. `--space-limited-dirs`.
When not provided, the tests are skipped.

Test scenarios:

1. Start a cluster and write data until one of the nodes reaches 90% of the disk utilization
2. Perform an **operation** that would take the nodes over 100%
3. The nodes should not exceed the critical disk utilization (98% by default)
4. Scale out the cluster by adding one node per rack
5. Retry or wait for the **operation** from step 2

The **operation** is: writing data, running compactions, building materialized views, running repair,
migrating tablets (caused by RF change, decommission).

The test is successful, if no nodes run out of space, the **operation** from step 2 is
aborted/paused/timed out and the **operation** from step 5 is successful.

`perf-simple-query --smp 1 -m 1G` results obtained for fixed 400MHz frequency:

Read path (before)

```
instructions_per_op:
	mean=   39661.51 standard-deviation=34.53
	median= 39655.39 median-absolute-deviation=23.33
	maximum=39708.71 minimum=39622.61
```

Read path (after)

```
instructions_per_op:
	mean=   39691.68 standard-deviation=34.54
	median= 39683.14 median-absolute-deviation=11.94
	maximum=39749.32 minimum=39656.63
```

Write path (before):

```
instructions_per_op:
	mean=   50942.86 standard-deviation=97.69
	median= 50974.11 median-absolute-deviation=34.25
	maximum=51019.23 minimum=50771.60
```

Write path (after):

```
instructions_per_op:
	mean=   51000.15 standard-deviation=115.04
	median= 51043.93 median-absolute-deviation=52.19
	maximum=51065.81 minimum=50795.00
```

Fixes: https://github.com/scylladb/scylladb/issues/14067
Refs: https://github.com/scylladb/scylladb/issues/2871

No backport, as it is a new feature.

Closes scylladb/scylladb#23917

* github.com:scylladb/scylladb:
  tests/cluster: Add new storage tests
  test/scylla_cluster: Override workdir when passed via cmdline
  streaming: Reject incoming migrations
  storage_service: extend locator::load_stats to collect per-node critical disk utilization flag
  repair_service: Add a facility to disable the service
  compaction_manager: Subscribe to out of space controller
  compaction_manager: Replace enabled/disabled states with running state
  database: Add critical_disk_utilization mode database can be moved to
  disk_space_monitor: add subscription API for threshold-based disk space monitoring
  docs: Add feature documentation
  config: Add critical_disk_utilization_level option
  replica/exceptions: Add a new custom replica exception
2025-08-30 18:47:57 +03:00
Calle Wilund
cc9eb321a1 commitlog: Ensure segment deletion is re-entrant
Fixes #25709

If we have large allocations spanning more than one segment, and
the internal segment references from the lead segment to the secondary
ones are the only thing keeping a segment alive, the implicit drop in
discard_unused_segments and orphan_all can cause a recursive call
to discard_unused_segments, which in turn can lead to vector
corruption/crash, or even a double free of a segment (iterator confusion).

Need to separate the modification of the vector (_segments) from
actual releasing of objects. Using temporaries is the easiest
solution.

To further reduce recursion, we can also do an early clear of
segment dependencies in callbacks from segment release (cf release).
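
A standalone sketch of the "use temporaries" fix (plain C++ with hypothetical names; the point is only that the container is updated before any segment destructor can run):

```cpp
// Sketch: releasing a segment may drop the last reference to another
// segment and re-enter discard_unused_segments(). Mutating the container
// first and destroying the objects from a local temporary makes the
// re-entrant call see a consistent vector.
#include <memory>
#include <vector>

struct segment {
    bool unused() const { return use_count == 0; }
    int use_count = 0;
    std::shared_ptr<segment> next;   // lead -> secondary reference
};

struct segment_manager {
    std::vector<std::shared_ptr<segment>> _segments;

    void discard_unused_segments() {
        std::vector<std::shared_ptr<segment>> to_release;  // temporaries
        std::vector<std::shared_ptr<segment>> keep;
        keep.reserve(_segments.size());
        for (auto& s : _segments) {
            (s->unused() ? to_release : keep).push_back(std::move(s));
        }
        _segments = std::move(keep);   // the vector is consistent from here on
        // Only now let the unused segments die; a re-entrant call triggered
        // by their destruction sees the already-updated _segments.
        to_release.clear();
    }
};

int main() {
    segment_manager mgr;
    mgr._segments.push_back(std::make_shared<segment>());
    mgr.discard_unused_segments();   // safe even if releasing triggers re-entry
}
```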

Closes scylladb/scylladb#25719
2025-08-30 08:24:57 +02:00
Piotr Dulikowski
7ccb50514d Merge 'Introduce view building coordinator' from Michał Jadwiszczak
This patch introduces `view_building_coordinator`, a single entity within the whole cluster responsible for building tablet-based views.

The view building coordinator takes a slightly different approach than the existing node-local view builder. The whole process is split into smaller view building tasks, one per tablet replica of the base table.
The coordinator builds one base table at a time and can choose another when all views of the currently processed base table are built.
The tasks are started by setting the `STARTED` state and are executed by the node-local view building worker. The tasks are scheduled so that each shard processes only one tablet at a time (multiple tasks can be started for a shard on a node because a table can have multiple views, but then all tasks share the same base table and tablet (last_token); see the sketch below). Once the coordinator starts the tasks, it sends the `work_on_view_building_tasks` RPC to execute the tasks and receive their results.
This RPC is resilient to RPC failure or raft leader change, meaning that if one RPC call started a batch of tasks but then failed (for instance, the raft leader changed and the caller aborted waiting for the response), the next RPC call will attach itself to the already started batch.
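
Purely as an illustration, with hypothetical types (the real ones live under `db/view/` and differ), a sketch of that scheduling rule: a batch started on one shard only contains tasks that all share the same base table and tablet, even when the tablet feeds several views.

```
#include <optional>
#include <utility>
#include <vector>

// Hypothetical, simplified model of a per-tablet-replica view building task.
struct view_task {
    int base_table_id;
    long last_token;      // the tablet is identified by its last token
    int view_id;
    unsigned shard;
    bool started = false;
};

// Pick the next batch of tasks to start on one shard: every task in the batch
// must target the same base table and the same tablet, so a shard works on a
// single tablet at a time even if that tablet has to feed multiple views.
std::vector<view_task*> next_batch_for_shard(std::vector<view_task>& tasks, unsigned shard) {
    std::optional<std::pair<int, long>> key;
    std::vector<view_task*> batch;
    for (auto& t : tasks) {
        if (t.shard != shard || t.started) {
            continue;
        }
        if (!key) {
            key.emplace(t.base_table_id, t.last_token);
        }
        if (std::pair{t.base_table_id, t.last_token} == *key) {
            t.started = true;   // the coordinator moves the task to STARTED
            batch.push_back(&t);
        }
    }
    return batch;
}
```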

The coordinator plugs into the handling of tablet operations (migration/resize/RF change) and adjusts its tasks accordingly. At the start of each tablet operation, the coordinator aborts the necessary view building tasks to prevent https://github.com/scylladb/scylladb/issues/21564. Then, new adjusted tasks are created at the end of the operation.
If the operation fails at any point, the aborted tasks are rolled back.

The view building coordinator can also handle staging sstables using `process_staging` view building tasks. We do this because we don't want to start generating view updates from a staging sstable prematurely, i.e. before the writes are directed to the new replica (https://github.com/scylladb/scylladb/issues/19149).

For a detailed description see `docs/dev/view-building-coordinator.md`.

Fixes https://github.com/scylladb/scylladb/issues/22288
Fixes https://github.com/scylladb/scylladb/issues/19149
Fixes https://github.com/scylladb/scylladb/issues/21564
Fixes https://github.com/scylladb/scylladb/issues/17603
Fixes https://github.com/scylladb/scylladb/issues/22586
Fixes https://github.com/scylladb/scylladb/issues/18826
Fixes https://github.com/scylladb/scylladb/issues/23930

---

This PR is a reimplementation of https://github.com/scylladb/scylladb/pull/21942

Closes scylladb/scylladb#23760

* github.com:scylladb/scylladb:
  test/cluster: add view build status tests
  test/cluster: add view building coordinator tests
  utils/error_injection: allow to abort `injection_handler::wait_for_message()`
  test: adjust existing tests
  utils/error_injection: add injection with `sleep_abortable()`
  db/view/view_builder: ignore `no_such_keyspace` exception
  docs/dev: add view building coordinator documentation
  db/view/view_building_worker: work on `process_staging` tasks
  db/view/view_building_worker: register staging sstable to view building coordinator when needed
  db/view/view_building_worker: discover staging sstables
  db/view/view_building_worker: add method to register staging sstable
  db/view/view_update_generator: add method to process staging sstables instantly
  db/view/view_update_generator: extract generating updates from staging sstables to a method
  db/view/view_update_generator: ignore tablet-based sstables
  db/view/view_building_coordinator: update view build status on node join/left
  db/view/view_building_coordinator: handle tablet operations
  db/view: add view building task mutation builder
  service/topology_coordinator: run view building coordinator
  db/view: introduce `view_building_coordinator`
  db/view/view_building_worker: update built views locally
  db/view: introduce `view_building_worker`
  db/view: extract common view building functionalities
  db/view: prepare to create abstract `view_consumer`
  message/messaging_service: add `work_on_view_building_tasks` RPC
  service/topology_coordinator: make `term_changed_error` public
  db/schema_tables: create/cleanup tasks when an index is created/dropped
  service/migration_manager: cleanup view building state on drop keyspace
  service/migration_manager: cleanup view building state on drop view
  service/migration_manager: create view building tasks on create view
  test/boost: enable proxy remote in some tests
  service/migration_manager: pass `storage_proxy` to `prepare_keyspace_drop_announcement()`
  service/migration_manager: coroutinize `prepare_new_view_announcement()`
  service/storage_proxy: expose references to `system_keyspace` and `view_building_state_machine`
  service: reload `view_building_state_machine` on group0 apply()
  service/vb_coordinator: add currently processing base
  db/system_keyspace: move `get_scylla_local_mutation()` up
  db/system_keyspace: add `view_building_tasks` table
  db/view: add view_building_state and views_state
  db/system_keyspace: add method to get view build status map
  db/view: extract `system.view_build_status_v2` cql statements to system_keyspace
  db/system_keyspace: move `internal_system_query_state()` function earlier
  db/view: ignore tablet-based views in `view_builder`
  gms/feature_service: add VIEW_BUILDING_COORDINATOR feature
2025-08-29 17:28:44 +02:00
Dario Mirovic
51995af258 transport: replace throwing protocol_exception with returns
Replace throwing `protocol_exception` with returning it as a result
or an exceptional future in the transport server module. The goal is
to improve performance.

Most of the `protocol_exception` throws were made from the
`fragmented_temporary_buffer` module, by passing `exception_thrower()`
to its `read*` methods. `fragmented_temporary_buffer` is changed so
that it now accepts an exception creator, not an exception thrower.
The `fragmented_temporary_buffer_concepts::ExceptionCreator` concept replaces
`fragmented_temporary_buffer_concepts::ExceptionThrower`, and all
methods that used to throw now return a failed result of type
`utils::result_with_exception_ptr`. This change is then propagated to the callers.

The scope of this patch is `protocol_exception`, so commitlog just calls the
`.value()` method on the result. If the result is failed, that will throw the
exception stored in the result, as defined by `utils::result_with_exception_ptr_throw_policy`.
This means that the behavior of the commitlog module stays the same.

The transport server module handles results gracefully. All the caller functions
that return a non-future value `T` now return `utils::result_with_exception_ptr<T>`.
When the caller is a function that returns a future and it receives a
failed result, `make_exception_future(std::move(failed_result).value())`
is returned. The rest of the call stack up to the transport server's `handle_error`
function already works without throwing, and that is how zero throws are
achieved.
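
A hedged, self-contained sketch of the general pattern, with `std::expected` standing in for the ScyllaDB-internal `utils::result_with_exception_ptr` (the real types, method names and error extraction differ): the read helper takes an exception *creator* and returns the failure as a value instead of throwing it.

```
#include <cstdint>
#include <cstring>
#include <exception>
#include <expected>
#include <span>
#include <stdexcept>

// Stand-in for utils::result_with_exception_ptr<T>: either a value or a
// captured exception, never a throw on the hot path.
template <typename T>
using result = std::expected<T, std::exception_ptr>;

// Exception creator: builds the exception object instead of throwing it,
// mirroring the ExceptionCreator concept described above (name assumed).
struct protocol_error_creator {
    std::exception_ptr operator()(const char* what) const {
        return std::make_exception_ptr(std::runtime_error(what));
    }
};

// Hypothetical read helper in the spirit of fragmented_temporary_buffer's
// read* methods: a short frame yields a failed result, not an exception.
template <typename ExceptionCreator>
result<uint32_t> read_u32(std::span<const std::byte> in, ExceptionCreator make_error) {
    if (in.size() < sizeof(uint32_t)) {
        return std::unexpected(make_error("unexpected end of frame"));
    }
    uint32_t v = 0;
    std::memcpy(&v, in.data(), sizeof(v));
    return v;
}
```

A future-returning caller can then wrap the stored exception into an exceptional future, as described above, so nothing on the path up to `handle_error` actually throws.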

Fixes: #24567
2025-08-28 23:31:36 +02:00
Łukasz Paszkowski
535c901e50 config: Add critical_disk_utilization_level option
The option defines the threshold at which the defensive mechanisms
preventing nodes from running out of space (e.g. rejecting user
writes) are activated.

Its default value is 98% of the disk capacity.
2025-08-28 18:06:37 +02:00
Petr Gusev
4b907c7711 storage_service: move get_host_id_to_ip_map to system_keyspace
Reimplemented the function to use the peers cache. It could be replaced
with get_ip_from_peers_table, but that would create a coroutine frame for
each call.
2025-08-28 12:48:46 +02:00
Petr Gusev
de5dc4c362 system_keyspace: use peers cache in get_ip_from_peers_table
The storage_service::on_change method can be called quite often
by the gossiper; see scylladb/scylla-enterprise#5613. In this commit
we introduce a short-lived cache for system.peers so that we don't have
to go to storage each time we need to resolve host_id -> ip.
We keep the cache only for a small amount of time to handle the
(unlikely) scenario where the user updates the system.peers table
from CQL.
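
A minimal, synchronous sketch of that idea with placeholder types (the real code is asynchronous and lives in system_keyspace): keep the whole host_id -> ip map for a short, assumed TTL and reload it from system.peers only when it has aged out, so a manual CQL edit of the table is still picked up quickly.

```
#include <chrono>
#include <optional>
#include <string>
#include <unordered_map>

// Illustrative only: host_id and ip are plain strings here, and load_peers is
// any callable returning the current system.peers contents.
class peers_cache {
    using clock = std::chrono::steady_clock;
    std::unordered_map<std::string, std::string> _map;    // host_id -> ip
    clock::time_point _loaded_at{};
    static constexpr auto ttl = std::chrono::seconds(5);  // assumed lifetime

    bool expired() const {
        return _map.empty() || clock::now() - _loaded_at > ttl;
    }

public:
    template <typename LoadFn>
    std::optional<std::string> get_ip(const std::string& host_id, LoadFn load_peers) {
        if (expired()) {
            // Hit storage only when the cache has aged out, so frequent
            // gossiper callbacks don't re-read system.peers every time.
            _map = load_peers();
            _loaded_at = clock::now();
        }
        if (auto it = _map.find(host_id); it != _map.end()) {
            return it->second;
        }
        return std::nullopt;
    }
};
```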

Fixes scylladb/scylladb#25660
2025-08-28 12:48:39 +02:00
Petr Gusev
91c633371e storage_service: move get_ip_from_peers_table to system_keyspace
We plan to add a cache to get_ip_from_peers_table in upcoming commits.
It's more convenient to do this from system_keyspace, since the only two
methods that mutate system.peers (remove_endpoint and update_peers_info)
are already there.
2025-08-28 12:30:41 +02:00
Calle Wilund
2eccd17e70 system_keyspace: Limit parallelism in drop_truncation_records
Fixes #25682
Refs scylla-enterprise#5580

If the truncation table has many entries, we might create a huge amount
of parallel execution, quite possibly consuming lots of resources
doing something quite trivial.
Limit concurrency to a smallish number; see the sketch below.
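
A hedged Seastar-style sketch of the shape of the fix (the record type, concurrency cap and deletion body are placeholders, not the values used upstream): bound the fan-out with max_concurrent_for_each instead of deleting every record in parallel.

```
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <seastar/core/loop.hh>
#include <vector>

struct truncation_record { /* placeholder for a row of the truncation table */ };

seastar::future<> drop_all(std::vector<truncation_record> records) {
    // A plain parallel_for_each over a large table would spawn one deletion
    // per entry; cap the concurrency at a small-ish number instead.
    constexpr size_t max_concurrency = 16;  // assumed cap, chosen for illustration
    co_await seastar::max_concurrent_for_each(records, max_concurrency,
        [](truncation_record&) -> seastar::future<> {
            // placeholder for the actual record deletion
            co_return;
        });
}
```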

Closes scylladb/scylladb#25678
2025-08-28 12:50:00 +03:00
Radosław Cybulski
01bb7b629a build: add precompiled headers to CMakeLists.txt
Add precompiled header support to CMakeLists.txt and configure.py -
it improves compilation time by approximately 10%.

A new header, `stdafx.hh`, is added; don't include it manually -
the compiler will include it for you. The header contains includes from
external libraries used by Scylla: seastar, the standard library,
Linux headers and zlib.

The feature is enabled by default; use the CMake option `Scylla_USE_PRECOMPILED_HEADER`
or configure.py's --disable-precompiled-header flag to disable it.

The feature should be disabled when checking headers - otherwise
you might get false negatives on missing includes from seastar/abseil and so on.

Note: the following configuration needs to be added to ccache.conf:

    sloppiness = pch_defines,time_macros

Closes #25182
2025-08-27 21:37:54 +03:00