Commit Graph

49196 Commits

Emil Maskovsky
5dac4b38fb test/gossiper: add reproducible test for race condition during node decommission
This change introduces a targeted test that simulates the gossiper race
condition observed during node decommissioning. The test delays gossip
state application and host ID lookup to reliably reproduce the scenario
where `gossiper::get_host_id()` is called on a removed endpoint,
potentially triggering an abort in `apply_new_states`.

A dedicated error injection widens the race window to increase the
likelihood of hitting the race condition. The injection delays the
application of gossip state updates for the specific node that is being
decommissioned, which should then result in the server abort in the
gossiper.

Refs: scylladb/scylladb#25621
Fixes: scylladb/scylladb#25721

Backport: The test is primarily for an issue found in 2025.1, so it
needs to be backported to all the 2025.x branches.

Closes scylladb/scylladb#25685
2025-09-01 13:59:47 +02:00
Petr Gusev
2e757d6de4 cas: pass timeout_if_partially_accepted := write to accept_proposal()
Write requests cannot be safely retried if some replicas respond with
accepts and others with rejects. In this case, the coordinator is
uncertain about the outcome of the LWT: a subsequent LWT may either
complete the Paxos round (if a quorum observed the accept) or overwrite it
(if a quorum did not). If the original LWT was actually completed by
later rounds and the coordinator retried it, the write could be applied
twice, potentially overwriting effects of other LWTs that slipped in
between. Read requests do not have this problem, so they
can be safely retried.

Before this commit, handler->accept_proposal was called with
timeout_if_partially_accepted := true. This caused both read and write
requests to throw an "uncertainty" timeout to the user in the case
of the contention described above. After this commit, we throw an
"uncertainty" timeout only for write requests, while read requests
are instead retried in the loop in sp::cas.
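
The retry policy can be sketched coordinator-side (illustrative Python, not Scylla's actual sp::cas code; `run_paxos_round` and `UncertaintyTimeout` are made-up names):

```python
class UncertaintyTimeout(Exception):
    """The outcome of a partially accepted write is unknown."""

def cas(run_paxos_round, is_write, max_rounds=10):
    # Sketch of the retry loop: a partially accepted WRITE must not be
    # retried (it might get applied twice), so it surfaces an
    # "uncertainty" timeout; a READ is safe to retry in the loop.
    for _ in range(max_rounds):
        accepted, partially_accepted = run_paxos_round()
        if accepted:
            return True
        if partially_accepted and is_write:
            raise UncertaintyTimeout("write partially accepted; outcome unknown")
        # reads (and fully rejected proposals) go around the loop again
    raise TimeoutError("CAS contention: rounds exhausted")
```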

Closes scylladb/scylladb#25602
2025-09-01 14:31:04 +03:00
Pavel Emelyanov
840cdab627 api: Move /load and /metrics/load handlers code to column_family.cc
Both handlers need the database to proceed and thus must be registered
(and unregistered) in a group that captures the database for its handlers.

Once moved, the get_cf_stats() method they use can be made local to the
column_family.cc file.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#25671
2025-09-01 08:11:00 +02:00
Dawid Mędrek
fc50e9d0a4 test/perf: Require smp=1 in perf_cache_eviction
Trying to run the test with more than one shard results in a failure
when generating sharding metadata:

```
ERROR 2025-08-27 16:00:17,551 [shard  0:main] table - Memtable flush failed due to: std::runtime_error (Failed to generate sharding metadata for /tmp/scylla-c9fa42fe/ks/cf-2938a030834e11f0a561ffa33feb022d/me-3gt6_12wh_1gifk2ijgeu1ovc1m5-big-Data.db). Aborting
```

Let's require that the test be run with a single shard.

Closes scylladb/scylladb#25703
2025-09-01 08:59:35 +03:00
Nadav Har'El
6d1abc5b2c utils/base64: fix misleading code and comment (no functional change)
utils/base64.cc had some strange code with a strange comment in
base64_begins_with().

The code had

        base.substr(operand.size() - 4, operand.size())

The comment claims that this is "last 4 bytes of base64-encoded string",
but this comment is misleading - operand is typically shorter than base
(that is the whole point of base64_begins_with()), so the real
intention of the code is not to find the *last* 4 bytes of base, but rather
the *next* four bytes after the (operand.size() - 4) bytes we already compared.
These four bytes may need the full power of base64_decode_string()
because they may or may not contain padding.

But, if we really want the next 4 bytes, why pass operand.size() as the
length of the substring? operand.size() is at least 4 (it's a multiple of
4, and if it's 0 we returned earlier), but it could be more. We don't
need more, we just need 4. It's not really wrong to take more than 4 (so
this patch doesn't *fix* any bug), but it can be wasteful. So this code
should be:

        base.substr(operand.size() - 4, 4)

We already have a test, test_base64_begins_with, in
test/boost/alternator_unit_test.cc that takes encoded base64 strings up
to 12 characters in length (corresponding to decoded strings up to 8
characters) and substrings from length 0 to the base string's length,
and checks that base64_begins_with succeeds.
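
For reference, the intended logic can be sketched in Python (an illustrative reimplementation of the idea, not the C++ code itself):

```python
import base64

def base64_begins_with(base: str, operand: str) -> bool:
    # Does the decoded form of operand prefix the decoded form of base?
    if not operand:
        return True
    if len(operand) % 4 != 0 or len(base) < len(operand):
        return False
    k = len(operand) - 4
    # All quads before the last one encode raw bytes with no padding,
    # so they can be compared textually.
    if base[:k] != operand[:k]:
        return False
    # The last quad of operand may contain '=' padding, so decode it
    # and compare against the decode of the NEXT 4 bytes of base --
    # only 4 bytes are needed, hence substr(operand.size() - 4, 4).
    op_tail = base64.b64decode(operand[k:])
    base_tail = base64.b64decode(base[k:k + 4])
    return base_tail[:len(op_tail)] == op_tail
```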

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#25712
2025-09-01 08:57:50 +03:00
Andrei Chekun
e55c8a9936 test.py: modify run to use different junit output filenames
Currently, run executes pytest twice without modifying the path of the
JUnit XML report, so the second pytest execution overwrites the first
report. This PR fixes the issue so that both reports are stored.
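
The idea can be illustrated with a small helper (hypothetical; test.py's actual implementation may differ):

```python
from pathlib import Path

def junit_report_path(base: Path, run_index: int) -> Path:
    # Give each pytest invocation its own JUnit XML path so a second
    # run does not overwrite the first run's report; the result is then
    # passed to pytest via --junitxml=<path>.
    return base.with_name(f"{base.stem}.{run_index}{base.suffix}")
```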

Closes scylladb/scylladb#25726
2025-09-01 08:56:48 +03:00
Ernest Zaslavsky
05154e131a cleanup: Add missing #pragma once
Add missing `#pragma once` to an include header

Closes scylladb/scylladb#25761
2025-09-01 06:41:57 +03:00
Botond Dénes
fbff8d3b2d Merge 'vector_store_client: disable Nagle's algorithm on the http client' from Pawel Pery
Nagle’s algorithm and the delayed ACK algorithm are enabled by default on sockets in Linux. As a result, we can experience 40ms of latency simply waiting for an ACK on the client side. Disabling Nagle’s algorithm (using TCP_NODELAY) should fix the issue (the client won’t wait 40ms for ACKs).

This change sets `TCP_NODELAY` on every socket created by the `http_client`.

Checking for dead peers or a dead network helps manage the lifetime of the http client. This change also sets the TCP_KEEPALIVE option on the http client's socket.
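
In socket-API terms the change amounts to setting two options on each new connection (sketched here with Python's socket module; the actual code uses Seastar's socket API):

```python
import socket

def tune_http_socket(sock: socket.socket) -> None:
    # Disable Nagle's algorithm: send small writes immediately instead
    # of waiting (up to ~40ms) for the peer's ACK of previous data.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    # Enable keepalive probes so that dead peers (or a dead network)
    # are detected even on an otherwise idle connection.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
```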

Fixes: VECTOR-169

Closes scylladb/scylladb#25401

* github.com:scylladb/scylladb:
  vector_store_client: set keepalive for the http client's socket
  vector_store_client: disable Nagle's algorithm on the http client
2025-09-01 06:26:06 +03:00
Jenkins Promoter
619b4102bd Update pgo profiles - x86_64 2025-09-01 05:08:56 +03:00
Jenkins Promoter
783f866bd3 Update pgo profiles - aarch64 2025-09-01 05:05:17 +03:00
Avi Kivity
dfc7957a73 Merge 'test/cluster/test_repair: test_vnode_keyspace_describe_ring: verify that describe_ring results agree with natural_endpoints' from Benny Halevy
Following up on 6129411a5e,
improve test_vnode_keyspace_describe_ring by verifying that the
endpoints listed by describe_ring match those returned by the
`natural_endpoints` API (for random tokens).
The latter are calculated using an independent code path
directly from the effective_replication_map.

* test exists currently only on master, no backport required

Closes scylladb/scylladb#25610

* github.com:scylladb/scylladb:
  test/cluster/test_repair: test_vnode_keyspace_describe_ring: verify that describe_ring results agree with natural_endpoints
  test/pylib/rest_client: add natural_endpoints function
2025-08-31 20:36:15 +03:00
Avi Kivity
bae66cc0d8 Merge 'types: add byte-comparable format support for collections' from Lakshmi Narayanan Sreethar
This PR builds on the byte comparable support introduced in #23541 to add byte comparable support for all the collection types.

This implementation adheres to the byte-comparable format specification in https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/utils/bytecomparable/ByteComparable.md

Refs https://github.com/scylladb/scylladb/issues/19407

New feature - backport not required.

Closes scylladb/scylladb#25603

* github.com:scylladb/scylladb:
  types/comparable_bytes: add compatibility testcases for collection types
  types/comparable_bytes: update compatibility testcase to support collection types
  types/comparable_bytes: support empty type
  types/comparable_bytes: support reversed types
  types/comparable_bytes: support vector cql3 type
  types/comparable_bytes: support tuple and UDT cql3 type
  types/comparable_bytes: support map cql3 type
  types/comparable_bytes: support set and list cql3 types
  types/comparable_bytes: introduce encode/decode_component
  types/comparable_bytes: introduce to_comparable_bytes/from_comparable_bytes
2025-08-31 15:53:27 +03:00
Avi Kivity
600349e29a Merge 'tasks: return task::impl from make_and_start_task ' from Aleksandra Martyniuk
Currently, make_and_start_task returns a pointer to task_manager::task
that hides the implementation details. If we need to access
the implementation (e.g. because we want a task to "return" a value),
we need to create and start the task step by step explicitly.

Return task_manager::task::impl from make_and_start_task. Use it
where possible.

Fixes: https://github.com/scylladb/scylladb/issues/22146.

Optimization; no backport

Closes scylladb/scylladb#25743

* github.com:scylladb/scylladb:
  tasks: return task::impl from make_and_start_task
  compaction: use current_task_type
  repair: add new param to tablet_repair_task_impl
  repair: add new params to shard_repair_task_impl
  repair: pass argument by value
2025-08-31 15:44:37 +03:00
Nadav Har'El
ff91027eac utils, alternator: fix detection of invalid base-64
This patch fixes an error-path bug in the base-64 decoding code in
utils/base64.cc, which among other things is used in Alternator to decode
blobs in JSON requests.

The base-64 decoding code has a lookup table, which was wrongly sized 255
bytes, but needed to be 256 bytes. This meant that if the byte 255 (0xFF)
was included in an invalid base-64 string, instead of detecting that this
is an invalid byte (since the only valid bytes in a base-64 string are
A-Z,a-z,0-9,+,/ and =), the code would either think it's valid with a
nonsense 6-bit part, or even crash on an out-of-bounds read.
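
The shape of the fix can be sketched as follows (illustrative Python; the real table lives in utils/base64.cc and may use a different invalid-byte sentinel):

```python
BASE64_ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)
INVALID = 0xFF  # sentinel for "not a base-64 character"

def build_decode_table():
    # The table must cover all 256 possible input bytes: with only 255
    # entries, looking up the byte 0xFF reads out of bounds (or treats
    # an invalid byte as a nonsense 6-bit part).
    table = [INVALID] * 256
    for i, ch in enumerate(BASE64_ALPHABET):
        table[ord(ch)] = i
    return table
```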

Besides the trivial fix, this patch also includes a reproducing test,
which tries to write a blob as a supposedly base-64 encoded string with
a 0xFF byte in it. The test fails before this patch (the write succeeds,
unexpectedly), and passes after this patch (the write fails as
expected). The test also passes on DynamoDB.

Fixes #25701

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#25705
2025-08-31 15:38:01 +03:00
Avi Kivity
1f4c9b1528 Merge 'system_keyspace: add peers cache to get_ip_from_peers_table' from Petr Gusev
The gossiper can call `storage_service::on_change` frequently (see  scylladb/scylla-enterprise#5613), which may cause high CPU load and even trigger OOMs or related issues.

This PR adds a temporary cache for `system.peers` to resolve host_id -> ip without hitting storage on every call. The cache is short-lived to handle the unlikely case where `system.peers` is updated directly via CQL.

This is a temporary fix; a more thorough solution is tracked in https://github.com/scylladb/scylladb/issues/25620.
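
The caching idea can be sketched like this (hypothetical Python; `PeersIpCache` and `loader` are illustrative names, not Scylla's API):

```python
import time

class PeersIpCache:
    """Short-lived cache over an expensive system.peers read."""

    def __init__(self, loader, ttl_seconds=5.0, clock=time.monotonic):
        self._loader = loader          # reads system.peers (expensive)
        self._ttl = ttl_seconds        # short TTL handles direct CQL updates
        self._clock = clock
        self._map = {}
        self._loaded_at = -float("inf")

    def get_ip(self, host_id):
        now = self._clock()
        if now - self._loaded_at > self._ttl:
            self._map = self._loader() # refresh the whole map at once
            self._loaded_at = now
        return self._map.get(host_id)  # no storage hit on every call
```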

Fixes scylladb/scylladb#25660

backport: this patch needs to be backported to all supported versions (2025.1/2/3).

Closes scylladb/scylladb#25658

* github.com:scylladb/scylladb:
  storage_service: move get_host_id_to_ip_map to system_keyspace
  system_keyspace: use peers cache in get_ip_from_peers_table
  storage_service: move get_ip_from_peers_table to system_keyspace
2025-08-31 15:34:35 +03:00
Piotr Wieczorek
5add43e15c alternator: streams: Address minor incompatibilities with DynamoDB in GetRecords response.
This commit adds missing fields to GetRecords responses: `awsRegion` and
`eventVersion`. We also considered changing `eventSource` from
`scylladb:alternator` to `aws:dynamodb` and setting `SizeBytes` subfield
inside the `dynamodb` field.

We set `awsRegion` to the datacenter's name of the node that received
the request. This is in line with the AWS documentation, except that
Scylla has no direct equivalent of a region, so we use the datacenter's
name, which is analogous to DynamoDB's concept of region.

The field `eventVersion` determines the structure of a Record. It is
updated whenever the structure changes. We think that adding a field
`userIdentity` bumped the version from `1.0` to `1.1`. Currently, Scylla
doesn't support this field (#11523), hence we use the older 1.0 version.

We have decided to leave `eventSource` as is, since it is easy to change
it to the `aws:dynamodb` value used by DynamoDB if problems arise.

Not setting the `SizeBytes` subfield inside the `dynamodb` field was
dictated by the lack of apparent use cases. The documentation is unclear
about how `SizeBytes` is calculated, and after experimenting a little,
I haven't found an obvious pattern.
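
Putting these decisions together, a GetRecords entry now looks roughly like this (sketch covering only the fields discussed above, not the full record schema):

```python
def make_record(datacenter: str, dynamodb_payload: dict) -> dict:
    return {
        "awsRegion": datacenter,               # DC name stands in for a region
        "eventVersion": "1.0",                 # no userIdentity (#11523) -> pre-1.1
        "eventSource": "scylladb:alternator",  # deliberately NOT aws:dynamodb
        "dynamodb": dynamodb_payload,          # SizeBytes deliberately not set
    }
```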

Fixes: #6931

Closes scylladb/scylladb#24903
2025-08-31 14:55:47 +03:00
Avi Kivity
bf9a963582 utils: mark crc barrett tables const
They're marked constinit, but constinit does not imply const. Since
they're not supposed to be modified, mark them const too.

Closes scylladb/scylladb#25539
2025-08-31 11:37:39 +03:00
Avi Kivity
bc5773f777 Merge 'Add out of space prevention mechanisms' from Łukasz Paszkowski
When scaling out is delayed or fails, it is crucial to ensure that clusters remain operational
and recoverable even under extreme conditions. To achieve this, the following proactive measures
are implemented:
- reject writes
      - includes: inserts, updates, deletes, counter updates, hints, read+repair and lwt writes
      - applicable to: user tables, views, CDC log, audit, cql tracing
- stop running compactions/repairs and prevent from starting new ones
- reject incoming tablet migrations

The aforementioned mechanisms are automatically enabled when a node's disk utilization reaches
the critical level (default: 98%) and disabled when the utilization drops below the threshold.

Apart from that, the series adds tests that require mounted volumes to simulate running out of space.
The paths to the volumes can be provided via a pytest argument, i.e. `--space-limited-dirs`.
When not provided, tests are skipped.

Test scenarios:

1. Start a cluster and write data until one of the nodes reaches 90% of the disk utilization
2. Perform an **operation** that would take the nodes over 100%
3. The nodes should not exceed the critical disk utilization (98% by default)
4. Scale out the cluster by adding one node per rack
5. Retry or wait for the **operation** from step 2

The **operation** is: writing data, running compactions, building materialized views, running repair,
migrating tablets (caused by RF change, decommission).

The test is successful if no nodes run out of space, the **operation** from step 2 is
aborted/paused/timed out, and the **operation** from step 5 is successful.

`perf-simple-query --smp 1 -m 1G` results obtained for fixed 400MHz frequency:

Read path (before)

```
instructions_per_op:
	mean=   39661.51 standard-deviation=34.53
	median= 39655.39 median-absolute-deviation=23.33
	maximum=39708.71 minimum=39622.61
```

Read path (after)

```
instructions_per_op:
	mean=   39691.68 standard-deviation=34.54
	median= 39683.14 median-absolute-deviation=11.94
	maximum=39749.32 minimum=39656.63
```

Write path (before):

```
instructions_per_op:
	mean=   50942.86 standard-deviation=97.69
	median= 50974.11 median-absolute-deviation=34.25
	maximum=51019.23 minimum=50771.60
```

Write path (after):

```
instructions_per_op:
	mean=   51000.15 standard-deviation=115.04
	median= 51043.93 median-absolute-deviation=52.19
	maximum=51065.81 minimum=50795.00
```

Fixes: https://github.com/scylladb/scylladb/issues/14067
Refs: https://github.com/scylladb/scylladb/issues/2871

No backport, as it is a new feature.

Closes scylladb/scylladb#23917

* github.com:scylladb/scylladb:
  tests/cluster: Add new storage tests
  test/scylla_cluster: Override workdir when passed via cmdline
  streaming: Reject incoming migrations
  storage_service: extend locator::load_stats to collect per-node critical disk utilization flag
  repair_service: Add a facility to disable the service
  compaction_manager: Subscribe to out of space controller
  compaction_manager: Replace enabled/disabled states with running state
  database: Add critical_disk_utilization mode database can be moved to
  disk_space_monitor: add subscription API for threshold-based disk space monitoring
  docs: Add feature documentation
  config: Add critical_disk_utilization_level option
  replica/exceptions: Add a new custom replica exception
2025-08-30 18:47:57 +03:00
Petr Gusev
898531fe7c client_state: decoroutinize check_internal_table_permissions
This function is on a hot path; better to avoid allocating
coroutine frames.

Fixes scylladb/scylladb#25501

Closes scylladb/scylladb#25689
2025-08-30 18:46:54 +03:00
Avi Kivity
5c4a8ee134 Update seastar submodule
* seastar 0a90f7945...c2d989333 (7):
  > Add missing `#pragma once` to response_parser.rl
  > simple-stream: avoid memcpy calls in fragmented streams for constant sizes
  > reactor: Move stopping activity out of main loop
  > Add sequential buffer size options to IOTune
  > disable exception interception when ASAN enabled
  > file, io_queue: Drop maybe_priority_class_ref{} from internal calls
  > reactor: Equip make_task() and lambda_task with concepts

Closes scylladb/scylladb#25737
2025-08-30 14:53:34 +03:00
Calle Wilund
cc9eb321a1 commitlog: Ensure segment deletion is re-entrant
Fixes #25709

If we have large allocations spanning more than one segment, and
the internal references from the lead segment to secondary segments
are the only thing keeping a segment alive, the implicit drop in
discard_unused_segments and orphan_all can cause a recursive call
to discard_unused_segments, which in turn can lead to vector
corruption/crash, or even a double free of a segment (iterator confusion).

We need to separate the modification of the vector (_segments) from
the actual releasing of the objects. Using temporaries is the easiest
solution.

To further reduce recursion, we can also do an early clear of
segment dependencies in callbacks from segment release (cf release).
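
The fix can be modeled in a few lines (illustrative Python mirroring the structure, not the commitlog code itself):

```python
class Segment:
    def __init__(self, mgr, keeps=None):
        self.mgr = mgr
        self.keeps = keeps   # secondary segment kept alive only by this one
        self.refs = 0

    def unused(self):
        return self.refs == 0

    def release(self):
        # Dropping the reference to the secondary segment may re-enter
        # discard_unused_segments() on the manager.
        if self.keeps is not None:
            self.keeps.refs -= 1
            self.keeps = None
            self.mgr.discard_unused_segments()

class SegmentManager:
    def __init__(self):
        self._segments = []

    def discard_unused_segments(self):
        # Move unused segments into a temporary FIRST, so any recursive
        # call triggered by release() sees a consistent _segments vector
        # instead of one being mutated (and iterated) under it.
        unused = [s for s in self._segments if s.unused()]
        self._segments = [s for s in self._segments if not s.unused()]
        for s in unused:
            s.release()
```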

Closes scylladb/scylladb#25719
2025-08-30 08:24:57 +02:00
Piotr Dulikowski
7ccb50514d Merge 'Introduce view building coordinator' from Michał Jadwiszczak
This patch introduces `view_building_coordinator`, a single entity within the whole cluster responsible for building tablet-based views.

The view building coordinator takes a slightly different approach than the existing node-local view builder. The whole process is split into smaller view building tasks, one for each tablet replica of the base table.
The coordinator builds one base table at a time and can choose another once all views of the currently processed base table are built.
The tasks are started by setting the `STARTED` state and are executed by the node-local view building worker. The tasks are scheduled so that each shard processes only one tablet at a time (multiple tasks can be started for a shard on a node because a table can have multiple views, but then all of these tasks share the same base table and tablet (last_token)). Once the coordinator starts the tasks, it sends the `work_on_view_building_tasks` RPC to start the tasks and receive their results.
This RPC is resilient to RPC failures and raft leader changes: if one RPC call started a batch of tasks but then failed (for instance, the raft leader changed and the caller aborted waiting for the response), the next RPC call will attach itself to the already started batch.

The coordinator plugs into the handling of tablet operations (migration/resize/RF change) and adjusts its tasks accordingly. At the start of each tablet operation, the coordinator aborts the necessary view building tasks to prevent https://github.com/scylladb/scylladb/issues/21564. Then, new adjusted tasks are created at the end of the operation.
If the operation fails at any point, the aborted tasks are rolled back.

The view building coordinator can also handle staging sstables using process_staging view building tasks. We do this because we don't want to start generating view updates from a staging sstable prematurely, before the writes are directed to the new replica (https://github.com/scylladb/scylladb/issues/19149).

For detailed description check: `docs/dev/view-building-coordinator.md`

Fixes https://github.com/scylladb/scylladb/issues/22288
Fixes https://github.com/scylladb/scylladb/issues/19149
Fixes https://github.com/scylladb/scylladb/issues/21564
Fixes https://github.com/scylladb/scylladb/issues/17603
Fixes https://github.com/scylladb/scylladb/issues/22586
Fixes https://github.com/scylladb/scylladb/issues/18826
Fixes https://github.com/scylladb/scylladb/issues/23930

---

This PR is a reimplementation of https://github.com/scylladb/scylladb/pull/21942

Closes scylladb/scylladb#23760

* github.com:scylladb/scylladb:
  test/cluster: add view build status tests
  test/cluster: add view building coordinator tests
  utils/error_injection: allow to abort `injection_handler::wait_for_message()`
  test: adjust existing tests
  utils/error_injection: add injection with `sleep_abortable()`
  db/view/view_builder: ignore `no_such_keyspace` exception
  docs/dev: add view building coordinator documentation
  db/view/view_building_worker: work on `process_staging` tasks
  db/view/view_building_worker: register staging sstable to view building coordinator when needed
  db/view/view_building_worker: discover staging sstables
  db/view/view_building_worker: add method to register staging sstable
  db/view/view_update_generator: add method to process staging sstables instantly
  db/view/view_update_generator: extract generating updates from staging sstables to a method
  db/view/view_update_generator: ignore tablet-based sstables
  db/view/view_building_coordinator: update view build status on node join/left
  db/view/view_building_coordinator: handle tablet operations
  db/view: add view building task mutation builder
  service/topology_coordinator: run view building coordinator
  db/view: introduce `view_building_coordinator`
  db/view/view_building_worker: update built views locally
  db/view: introduce `view_building_worker`
  db/view: extract common view building functionalities
  db/view: prepare to create abstract `view_consumer`
  message/messaging_service: add `work_on_view_building_tasks` RPC
  service/topology_coordinator: make `term_changed_error` public
  db/schema_tables: create/cleanup tasks when an index is created/dropped
  service/migration_manager: cleanup view building state on drop keyspace
  service/migration_manager: cleanup view building state on drop view
  service/migration_manager: create view building tasks on create view
  test/boost: enable proxy remote in some tests
  service/migration_manager: pass `storage_proxy` to `prepare_keyspace_drop_announcement()`
  service/migration_manager: coroutinize `prepare_new_view_announcement()`
  service/storage_proxy: expose references to `system_keyspace` and `view_building_state_machine`
  service: reload `view_building_state_machine` on group0 apply()
  service/vb_coordinator: add currently processing base
  db/system_keyspace: move `get_scylla_local_mutation()` up
  db/system_keyspace: add `view_building_tasks` table
  db/view: add view_building_state and views_state
  db/system_keyspace: add method to get view build status map
  db/view: extract `system.view_build_status_v2` cql statements to system_keyspace
  db/system_keyspace: move `internal_system_query_state()` function earlier
  db/view: ignore tablet-based views in `view_builder`
  gms/feature_service: add VIEW_BUILDING_COORDINATOR feature
2025-08-29 17:28:44 +02:00
Aleksandra Martyniuk
7fe1ad1f63 tasks: return task::impl from make_and_start_task
Currently, make_and_start_task returns a pointer to task_manager::task
that hides the implementation details. If we need to access
the implementation (e.g. because we want a task to "return" a value),
we need to create and start the task step by step explicitly.

Return task_manager::task::impl from make_and_start_task. Use it
where possible.

Fixes: https://github.com/scylladb/scylladb/issues/22146.
2025-08-29 17:12:07 +02:00
Aleksandra Martyniuk
0844a057d1 compaction: use current_task_type 2025-08-29 17:08:00 +02:00
Łukasz Paszkowski
e34deea50e tests/cluster: Add new storage tests
The storage submodule contains tests that require mounted volumes
to be executed. The volumes are created automatically with the
`volumes_factory` fixture.

The tests in this suite are executed with the custom launcher
`unshare -mr pytest`

Test scenarios (when one node reaches critical disk utilization):
1. Reject user table writes
2. Disable/Enabled compaction
3. Reject split compactions
4. New split compactions not triggered
5. Abort tablet repair
6. Disable/Enabled incoming tablet migrations
7. Restart a node while a tablet split is triggered
2025-08-29 14:56:13 +02:00
Łukasz Paszkowski
4bb5696a5d test/scylla_cluster: Override workdir when passed via cmdline
Currently, workdir is set in the ScyllaCluster constructor and does
not take into account that the value could be overridden via cmdline
arguments. When this happens, some files (logs, configs) are stored
under one path and others (data) under a different one.

The patch allows overriding the value when it is passed via cmdline
arguments, so that all files are stored under the same path.
2025-08-29 14:56:13 +02:00
Łukasz Paszkowski
7cfedb1214 streaming: Reject incoming migrations
When a replica operates in the critical disk utilization mode, all
incoming migrations are rejected by rejecting the incoming sstable
files.

In the topology_coordinator, the rejected tablet is moved into the
cleanup_target state in order to revert the migration. Otherwise, a
retry happens and the cluster stays in the tablet_migration transition
state, preventing any other topology changes from happening, e.g.
scaling out.

Once the tablet migration is rejected, the load balancer will schedule
a new migration.
2025-08-29 14:56:13 +02:00
Łukasz Paszkowski
54201960e6 storage_service: extend locator::load_stats to collect per-node critical disk utilization flag
This commit extends the TABLE_LOAD_STATS RPC with information on whether
a node operates in the critical disk utilization mode.

This information will be needed to distinguish between the causes of an
interrupted tablet migration/repair.
2025-08-29 14:56:13 +02:00
Łukasz Paszkowski
9809800aa8 repair_service: Add a facility to disable the service
The repair service currently has two functions, stop() and shutdown(),
that stop all ongoing repairs and prevent any further repairs from being
started.

The repair_service can be stopped only once; once stopped, it cannot
be restarted. We would like, however, to enable/disable the repair
service many times.
many times.

Similarly to compaction_manager, the repair service provides two new functions:
- drain() - aborts all ongoing local repair tasks and disables the service,
            i.e. no new local tasks will be scheduled and data received from
            the repair master is rejected. It is still possible, though, to
            schedule a global repair request
- enable() - enables the service

By default, the repair service is enabled immediately once started.

For tablet-based keyspaces, the new facility prevents tablets from being
repaired. Whenever the repair_service is disabled and the request to repair
a tablet arrives, an exception is returned.

Once the exception is thrown, the tablet is moved into the end_repair
state and the operation will be retried later. Hence, disabling the service
does not fail the global tablet repair request.
2025-08-29 14:56:13 +02:00
Łukasz Paszkowski
9539e80e54 compaction_manager: Subscribe to out of space controller 2025-08-29 14:56:07 +02:00
Aleksandra Martyniuk
f3b43b6384 repair: add new param to tablet_repair_task_impl
Currently, sched_info is set immediately after tablet_repair_task_impl
is created.

Pass this param to constructor instead. It's a preparation for
the following changes.
2025-08-29 14:37:00 +02:00
Aleksandra Martyniuk
57b47e282e repair: add new params to shard_repair_task_impl
Currently, neighbors and small_table_optimization_ranges_reduced_factor
are set immediately after shard_repair_task_impl is created.

Pass these params to constructor instead. It's a preparation for following
changes.
2025-08-29 14:27:00 +02:00
Aleksandra Martyniuk
6a0d8728de repair: pass argument by value
shard_repair_task_impl constructor gets some of its arguments by const
reference. Due to that those arguments are copied when they could be
moved.

Get shard_repair_task_impl constructor arguments by value. Use std::move
where possible.
2025-08-29 14:24:47 +02:00
Łukasz Paszkowski
40c40be8a6 compaction_manager: Replace enabled/disabled states with running state
Using a single state variable to keep track whether compaction
manager is enabled/disabled is insufficient, as multiple services
may independently request compactions to be disabled.

To address this, a counter is introduced to record how many times
the compaction manager has been drained. The manager is considered
enabled only when this counter reaches zero.

With the counter introduced, the enabled and disabled states become
obsolete, so they are replaced with a single running state.
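
The counter-based scheme can be sketched as (hypothetical Python; method names mirror the commit message, not the actual compaction_manager API):

```python
class CompactionManager:
    def __init__(self):
        self._drain_count = 0   # how many services asked to disable us

    def drain(self):
        self._drain_count += 1
        # ...stop ongoing compactions here...

    def enable(self):
        assert self._drain_count > 0
        self._drain_count -= 1

    def compactions_enabled(self):
        # Enabled only when no service holds an outstanding drain.
        return self._drain_count == 0
```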
2025-08-29 13:47:01 +02:00
Łukasz Paszkowski
3d03b88719 database: Add critical_disk_utilization mode database can be moved to
When the database operates in the critical disk utilization mode, all
mutation writes (including inserts, updates, deletes, counter updates,
hints, read+repair, and LWT writes) to user tables and to the tables
associated with them, such as views, CDC log, and audit, are rejected
with a clear error exception.

The mode is meant to be used with the disk space monitor in order
to prevent any user writes when node's disk utilization is too high.
2025-08-29 13:46:45 +02:00
Lakshmi Narayanan Sreethar
ce0c29e024 types/comparable_bytes: add compatibility testcases for collection types
This patch adds compatibility testcases for the following cql3 types:
set, list, map, tuple, vector, and reversed types.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-08-29 12:26:22 +05:30
Lakshmi Narayanan Sreethar
4547f6f188 types/comparable_bytes: update compatibility testcase to support collection types
The `abstract_type::from_string()` method used to parse the input data
doesn't support collections yet. So the collection testdata will be
passed as JSON strings to the testcase. This patch updates the testcase
to adapt to this workaround.

Also, extended the testcase to verify that Scylla's implementation can
successfully decode the byte comparable output encoded by Cassandra.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-08-29 12:26:22 +05:30
Lakshmi Narayanan Sreethar
0997b3533c types/comparable_bytes: support empty type
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-08-29 12:26:22 +05:30
Lakshmi Narayanan Sreethar
b799101a09 types/comparable_bytes: support reversed types
A reversed type is first encoded using the underlying type and then all
the bits are flipped to ensure that the lexicographical sort order is
reversed. During decode, the bytes are flipped first and then decoded
using the underlying type.
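
The flip itself is a one-liner (Python sketch; the ordering property is shown here for equal-length encodings — the real format's termination rules make it hold in general):

```python
def flip_bits(encoded: bytes) -> bytes:
    # Invert every bit so that lexicographic comparison is reversed.
    return bytes(b ^ 0xFF for b in encoded)
```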

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-08-29 12:26:22 +05:30
Lakshmi Narayanan Sreethar
6c2a3e2c51 types/comparable_bytes: support vector cql3 type
The CQL vector type encoding is similar to the lists, where each element
is transformed into a byte-comparable format and prefixed with a
component marker. The sequence is terminated with a terminator marker to
indicate the end of the collection.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-08-29 12:26:22 +05:30
Lakshmi Narayanan Sreethar
1ccfe522f1 types/comparable_bytes: support tuple and UDT cql3 type
The CQL tuple and UDT types share the same internal implementation and
therefore use the same byte comparable encoding. The encoding is similar
to lists, where each element is transformed into a byte-comparable
format and prefixed with a component marker. The sequence is terminated
with a terminator marker to indicate the end of the collection.

TODO: Add duplicate test items to maps, lists and sets
      For maps, add more entries that share keys
      ex map1 : key1 : value1, key2 : value2
         map2 : key1 : value4
         map3 : key2 : value5 etc

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-08-29 12:26:22 +05:30
Lakshmi Narayanan Sreethar
ca38c15a97 types/comparable_bytes: support map cql3 type
The CQL map type is encoded as a sequence of key-value pairs. Each key
and each value is individually prefixed with a component marker, and the
sequence is terminated with a terminator marker to indicate the end of
the collection.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-08-29 12:26:22 +05:30
Lakshmi Narayanan Sreethar
4d5e5f0c84 types/comparable_bytes: support set and list cql3 types
The CQL set and list types are encoded as a sequence of elements, where
each element is transformed into a byte-comparable format and prefixed
with a component marker. The sequence is terminated with a terminator
marker to indicate the end of the collection.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-08-29 12:26:22 +05:30
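A minimal model of this scheme (Python for illustration; the marker values below are made up for the sketch, the real constants live in the Scylla sources). The terminator must compare lower than the component marker so that a collection sorts before any collection it is a prefix of:

```python
NEXT_COMPONENT = 0x40   # illustrative value; must be greater than TERMINATOR
TERMINATOR = 0x38       # so a shorter collection sorts before any extension

def encode_collection(elements, encode_element) -> bytes:
    """Encode a set/list: each element is prefixed with a component
    marker and the whole sequence ends with a terminator marker."""
    out = bytearray()
    for e in elements:
        out.append(NEXT_COMPONENT)
        out += encode_element(e)
    out.append(TERMINATOR)
    return bytes(out)
```

With this layout, plain lexicographic comparison of the encoded bytes orders collections element by element, with shorter collections first.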
Lakshmi Narayanan Sreethar
8e46e8be01 types/comparable_bytes: introduce encode/decode_component
The components of a collection, such as an element from a list, set, or
vector; a key or value from a map; or a field from a tuple, share the
same encode and decode logic. During encode, the component is transformed
into the byte comparable format and is prefixed with the `NEXT_COMPONENT`
marker. During decode, the component is transformed back into its
serialized form and is prefixed with the serialized size.

A null component is encoded as a single `NEXT_COMPONENT_NULL` marker and
during decode, a `-1` is written to the serialized output.

This commit introduces a few helper methods that implement the
above-mentioned encode and decode logic.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-08-29 12:26:21 +05:30
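The two helpers can be sketched as follows (Python for illustration; the names and marker values are hypothetical, chosen only so that the null marker compares lower than the component marker and nulls therefore sort first). The decode side reports `-1` as the serialized length of a null component, matching the CQL serialized form of null:

```python
NEXT_COMPONENT = 0x40        # illustrative marker values; the real constants
NEXT_COMPONENT_NULL = 0x3E   # must keep NULL < NEXT_COMPONENT so that null
                             # components sort before non-null ones

def encode_component(value) -> bytes:
    """Prefix a byte-comparable component with its marker; a null
    component is just the NULL marker with no payload."""
    if value is None:
        return bytes([NEXT_COMPONENT_NULL])
    return bytes([NEXT_COMPONENT]) + value

def decode_component(encoded, pos, decode_payload):
    """Return (serialized_length, payload, next_pos); a null component
    decodes to length -1, with no payload bytes consumed."""
    if encoded[pos] == NEXT_COMPONENT_NULL:
        return -1, None, pos + 1
    payload, next_pos = decode_payload(encoded, pos + 1)
    return len(payload), payload, next_pos
```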
Lakshmi Narayanan Sreethar
47e88be6e0 types/comparable_bytes: introduce to_comparable_bytes/from_comparable_bytes
Added helper functions to_comparable_bytes() and from_comparable_bytes()
to let collection encode/decode methods invoke encode/decode of the
underlying types.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-08-29 12:26:09 +05:30
Łukasz Paszkowski
3e740d25b5 disk_space_monitor: add subscription API for threshold-based disk space monitoring
Introduce the `subscribe` method to disk_space_monitor, allowing clients to
register callbacks triggered when disk utilization crosses a configurable
threshold.

The API supports flexible trigger options, including the direction of
the threshold crossing (above/below). This enables more granular and
efficient disk space monitoring for consumers.
2025-08-28 18:06:37 +02:00
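A toy model of such a subscription API (Python; the names and shape are invented for illustration, not Scylla's actual C++ interface). The key behavior is edge-triggered delivery: a callback fires when utilization crosses the threshold in the subscribed direction, not on every poll while it stays on one side:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable

class Direction(Enum):
    ABOVE = "above"   # fire when utilization crosses above the threshold
    BELOW = "below"   # fire when it drops back below

@dataclass
class Subscription:
    threshold: float                    # e.g. 0.9 == 90% utilization
    direction: Direction
    callback: Callable[[float], None]

class DiskSpaceMonitor:
    """Hypothetical sketch of a threshold-based subscription API."""

    def __init__(self):
        # Track (subscription, was_above) to detect crossings.
        self._subs = []

    def subscribe(self, sub: Subscription) -> None:
        self._subs.append((sub, False))

    def poll(self, utilization: float) -> None:
        for i, (sub, was_above) in enumerate(self._subs):
            is_above = utilization >= sub.threshold  # == counts as above
            crossed = is_above != was_above
            if crossed and (
                (sub.direction is Direction.ABOVE and is_above)
                or (sub.direction is Direction.BELOW and not is_above)
            ):
                sub.callback(utilization)
            self._subs[i] = (sub, is_above)
```

A consumer such as a write-rejection mechanism would subscribe once and react only to the crossing events, rather than re-checking disk usage on every write.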
Łukasz Paszkowski
c2de678a87 docs: Add feature documentation
Adds a user-facing page in /docs/troubleshooting/error-messages.
2025-08-28 18:06:37 +02:00
Łukasz Paszkowski
535c901e50 config: Add critical_disk_utilization_level option
The option defines the threshold at which the defensive mechanisms
that prevent nodes from running out of space (e.g. rejecting user
writes) are activated.

Its default value is 98% of the disk capacity.
2025-08-28 18:06:37 +02:00
Łukasz Paszkowski
132fd1e3f2 replica/exceptions: Add a new custom replica exception
The new exception `critical_disk_utilization_exception` is thrown
when user table mutation writes are blocked, e.g. because a critical
disk utilization level has been reached.

This new exception is then handled on the coordinator side, where it
is transformed into a `mutation_write_failure_exception` with a
meaningful error message: "Write rejected due to critical disk
utilization".
2025-08-28 18:06:37 +02:00
Petr Gusev
4b907c7711 storage_service: move get_host_id_to_ip_map to system_keyspace
Reimplemented the function to use the peers cache. It could be replaced
with get_ip_from_peers_table, but that would create a coroutine frame for
each call.
2025-08-28 12:48:46 +02:00