In the next commit we want to add an optimization that relies on
precise control over the lifetime of cas_request. In particular, we
want the implementation of this interface in Alternator to operate on
raw references that are guaranteed to remain valid only until the
cas() future is resolved. We already depend on the same lifetime
assumptions in cas_request when used by modification_statement.
However, these assumptions are not clearly expressed in the current
interface: cas_request is taken by shared_ptr, and nothing prevents
cas() from storing that pointer inside paxos_response_handler, which
may outlive the cas() future.
This commit fixes that by taking cas_request by raw reference. This
makes it explicit that cas() does not assume ownership of the object.
Callers must ensure that the referenced object remains valid until
the returned future is resolved.
I noticed during tests that `maybe_get_primary_replica`
would not distribute uniformly the choice of primary replica
because `info.replicas` on some shards would have an order whilst
on others it'd be ordered differently, thus making the function choose
a node as primary replica multiple times when it clearly could've
chosen a different nodes.
This patch sorts the replica set before passing it through the
scope filter.
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
In #26408 a write_handler_destroy_promise class was introduced to
wait for abstract_write_response_handler instances destruction. We
strived to minimize the memory footprint of
abstract_write_response_handler, with write_handler_destroy_promise-es
we required only a single additional int. It turned our that in some
cases a lot of write handlers can be scheduled for deletion
at the same time, in such cases the
vector<write_handler_destroy_promise> can become big and cause
'oversized allocation' seastar warnings.
Another concern with write_handler_destroy_promise-es was that they
were more complicated than it was worth.
In this commit we replace write_handler_destroy_promise with simple
gates. One or more gates can be attached to an
abstract_write_response_handler to wait for its destruction. We use
utils::small_vector<gate::holder, 2> to store the attached gates.
The limit 2 was chosen because we expect two gates at the same time
in most cases. One is storage_proxy::_write_handlers_gate,
which is used to wait for all handlers in
cancel_all_write_response_handlers. Another one can be attached by
a caller of cancel_write_handlers. Nothing stops several
cancel_write_handlers to be called at the same time, but it should be
rare.
The sizeof(utils::small_vector<gate::holder, 2>) == 40, this is
40.0 / 488 * 100 ~ 8% increase in
sizeof(abstract_write_response_handler), which seems acceptable.
Fixesscylladb/scylladb#26788
Previously in a counter update we lock the read shard to protect the
counter's read-modify-write against concurrent updates.
This is not sufficient when the counter is migrated between different
shards, because there is a stage where the read shard switches from the
old shard to the new shard, and during that switch there can be
concurrent counter updates on both shards. If each shard takes only its
own lock, the operations will not be exclusive anymore, and this can
cause lost counter updates.
To fix this, we acquire the counter lock on both shards in the stage
write_both_read_new, when both shards can serve reads. This guarantees
that counter updates continue to be exclusive during intranode
migration.
When applying a counter mutation, use apply_on_shards to apply the
mutation on all write shards, similarly to the way other mutations are
applied in the storage proxy. Previously the mutation was applied only
on the current shard which is the read shard.
This is needed to respect the write_both stages of intranode migration
where we need to apply the mutation on both the old and the new shards.
Refactor the counter update to split the functions and have them called
by the storage proxy to prepare for a later change.
Previously in mutate_counter the storage proxy calls the replica
function apply_counter_update that does a few things:
1. checks that the operation can be done: check timeout, disk utilization
2. acquire counter locks
3. do read-modify-write and transform the counter mutation
4. apply the mutation in the replica
In this commit we change it so that these functions are split and called
from the storage proxy, so that we have better control from the storage
proxy when we change it later to work across multiple shards. For
example, we will want to acquire locks on multiple shards, transform it
on one shard, and then apply the mutation on multiple shards.
After the change it works as follows in storage proxy:
1. acquire counter locks
2. call replica prepare to check the operation and transform the mutation
3. call replica apply to apply the transformed mutation
Slightly reorganize the mutate counter function to prepare it for a
later change.
Move the code that finds the read shard and invokes the rest of the
function on the read shard to the caller function. This simplifies the
function mutate_counter_on_leader_and_replicate which now runs on the
read shard and will make it easier to extend.
Currently, batchlog replay is considered successful even if all batches fail
to be sent (they are replayed later). However, repair requires all batches
to be sent successfully. Currently, if batchlog isn't cleared, the repair never
learns and updates the repair_time. If GC mode is set to "repair", this means
that the tombstones written before the repair_time (minus propagation_delay)
can be GC'd while not all batches were replied.
Consider a scenario:
- Table t has a row with (pk=1, v=0);
- There is an entry in the batchlog that sets (pk=1, v=1) in table t;
- The row with pk=1 is deleted from table t;
- Table t is repaired:
- batchlog reply fails;
- repair_time is updated;
- propagation_delay seconds passes and the tombstone of pk=1 is GC'd;
- batchlog is replayed and (pk=1, v=1) inserted - data resurrection!
Do not update repair_time if sending any batch fails. The data is still repaired.
For tablet repair the repair runs, but at the end the exception is passed
to topology coordinator. Thanks to that the repair_time isn't updated.
The repair request isn't removed as well, due to which the repair will need
to rerun.
Apart from that, a batch is removed from the batchlog if its version is invalid
or unknown. The condition on which we consider a batch too fresh to replay
is updated to consider propagation_delay.
Fixes: https://github.com/scylladb/scylladb/issues/24415
Data resurrection fix; needs backport to all versions
Closesscylladb/scylladb#26319
* github.com:scylladb/scylladb:
db: fix indentation
test: add reproducer for data resurrection
repair: fail tablet repair if any batch wasn't sent successfully
db/batchlog_manager: fix making decision to skip batch replay
db: repair: throw if replay fails
db/batchlog_manager: delete batch with incorrect or unknown version
db/batchlog_manager: coroutinize replay_all_failed_batches
Add a reproducer to check that the repair_time isn't updated
if the batchlog replay fails.
If repair_time was updated, tombstones could be GC'd before the
batchlog is replayed. The replay could later cause the data
resurrection.
All mutation_holder::apply_locall() implementations now do the same
post fence chech. In this commit we hoist this check up to
abstract_write_response_handler::apply_locally().
This function is intended to replace start_write() in subsequent
commits. It provides the following benefits:
* Remove duplication: All start_write() call sites must run the fence
check after the operation completes. run_fenceable_write() encapsulates
this pattern.
* Fix a race: To ensure no new stale write operations occur during
cleanup, a fence check before start_write() was previously used.
However, yields in several code paths between the check and
start_write() made it non-atomic, allowing a stale operation to slip in
if the fence_version was updated in between.
* Optimize waiting: We do not need to wait for all operations—only for
vnode-based, non-local tables with versions smaller than the current
fence_version.
Future commits will extend update_fence_version, and it is simpler to do
so if the function resides in storage_proxy. Additionally, fence_version
is the only field this function accesses, and it is used solely within
storage_proxy, making this change natural on its own.
The operation must be held during the local write. Before this commit,
its scope ended after returning from apply_locally(), so it
did not actually provide any protection.
handle_write() is invoked from receive_mutation_handler() and
handle_paxos_learn(), and both previously performed a fence check in
apply_fn. This commit hoists the fence check into handle_write() to
reduce code duplication.
Additionally, move start_write() after get_schema_for_write(), since
there is no need to hold the operation while querying the schema.
As noted in the code comments, start_write() does not need to be held
during counter replication; it is required only while performing local
storage modifications. Move the start_write() call and the fence
check down to mutate_counter_on_leader_and_replicate().
Additionally, mutate_counters_on_leader() is updated to check for
possible stale_topology_exception() and properly package them
in the resulting exception_variant structure.
`shared_ptr<abstract_write_response_handler>` instances are captured in the `lmutate` and `rmutate` lambdas of `send_to_live_endpoints()`. As a result, an `abstract_write_response_handler` object may outlive its removal from the `storage_proxy::_response_handlers` map -> `cancel_all_write_response_handlers()` doesn't actually wait for requests completion -> `sp::drain_on_shutdown()` doesn't guarantee all requests are drained -> `sp::stop_remote()` completes too early and `paxos_store` is destroyed while LWT local writes might still be in progress. In this PR we introduce a `write_handler_destroy_promise` to wait for such pending instances in `cancel_write_handlers()` and `cancel_all_write_response_handlers()` to prevent the `use-after-free`.
A better long-term solution might be to replace `shared_ptr` with `unique_ptr` for `abstract_write_response_handler` and use a separate gate to track the `lmutate/rmutate` lambdas. We do not actually need to wait for these lambdas to finish before sending a timeout or error response to the client, as we currently do in `~abstract_write_response_handler`.
Fixes scylladb/scylladb#26355
backport: need to be backported to 2025.4 since #26355 is reproduced on LWT over tablets
Closesscylladb/scylladb#26408
* github.com:scylladb/scylladb:
test_tablets_lwt: add test_lwt_shutdown
storage_proxy: wait for write handler destruction
storage_proxy: coroutinize cancel_write_handlers
storage_proxy: cancel_write_handlers: don't hold a strong pointer to handler
shared_ptr<abstract_write_response_handler> instances are captured in
the lmutate/rmutate lambdas of send_to_live_endpoints(). As a result,
an abstract_write_response_handler object may outlive its removal from
the _response_handlers map. We use write_handler_destroy_promise to
wait for such pending instances in cancel_write_handlers() and
cancel_all_write_response_handlers() to prevent use-after-free.
A better long-term solution might be to replace shared_ptr with
unique_ptr for abstract_write_response_handler and use a separate gate
to track the lmutate/rmutate lambdas. We do not actually need to wait
for these lambdas to finish before sending a timeout or error response
to the client, as we currently do in ~abstract_write_response_handler.
Fixesscylladb/scylladb#26355
The cancel_write_handlers() method was assumed to be called in a thread
context, likely because it was first used from gossiper events, where a
thread context already existed. Later, this method was reused in
abort_view_writes() and abort_batch_writes(), where threads are created
on the fly and appear redundant.
The drain_on_shutdown() method also used a thread, justified by some
"delicate lifetime issues", but it is unclear what that actually means.
It seems that a straightforward co_await should work just fine.
A strong pointer was held for the duration of thread::yield(),
preventing abstract_write_response_handler destruction and possibly
delaying the sending of timeout or error responses to the client.
This commit removes the strong pointer. Instead, we compute the
next iterator before calling timeout_cb(), so if the handler is
destroyed inside timeout_cb(), we already have a valid next iterator.
This patch series introduces several tests that check number of exceptions that happens during various replica operations. The goal is to have a set of tests that can catch situations where number of exceptions per operation increases. It makes exception throw regressions easier to catch.
The tests cover apply counter update and apply functionalities in the database layer.
There are more paths that can be checked, like various semaphore wait timeouts located deeper in the code. This set of tests does not cover all code paths.
Fixes#18164
This is an improvement. No backport needed.
Closesscylladb/scylladb#25992
* github.com:scylladb/scylladb:
test: cluster: test replica write timeout
database: parameterize apply_counter_update_delay_5s injector value
test: cluster: test replica exceptions - test rate limit exceptions
This patch introduces two tests for `replica::rate_limit_exception`.
One test is for write/apply limit, the other one for read/query limit.
The tests check the number of rate limit errors reported and the
number of cpp exceptions reported. If somebody adds an exception
throw on the rate limit paths, this test will catch it and fail.
Refs #18164
This patch adds a struct `per_request_options` used to communicate between CDC and upper abstraction layers. We need this for better compatibility with DynamoDB Streams in Alternator (https://github.com/scylladb/scylladb/issues/6918) to change operation types of log rows. This patch also adds a way to conditionally forward the item read by LWT to CDC and use it as a preimage. For now, only Alternator uses this feature.
The main changes are:
- add a struct `cdc::per_request_options` to pass information between CDC and upper abstraction layers,
- add the struct to `cas_request::apply`'s signature,
- add a possibility to provide a preimage fetched by an upper abstraction layer (to propagate a row read by Alternator to CDC's preimage). This reduces the number of reads-before-write by 1 for some **Alternator** requests and it is always safe. It's possible to use this feature also in CQL.
No backport, it's a feature.
Refs https://github.com/scylladb/scylladb/issues/6918
Refs https://github.com/scylladb/scylladb/pull/26121Closesscylladb/scylladb#26149
* github.com:scylladb/scylladb:
alternator, cdc: Re-use the row read by LWT as a CDC preimage
cdc: Support prefetched preimages
storage: Add cdc options to cas_request::apply
cdc, storage: Add a struct to pass per-mutation options to CDC
cdc: Move operations enum to the top of the namespace
Batches that fail on the initial send are retired later, until they
succeed. These retires happen with CL=ALL, regardless of what the
original CL of the batch was. This is unnecessarily strict. We tried to
follow Cassandra here, but Cassandra has a big caveat in their use of
CL=ALL for batches. They accept saving just a hint for any/all of the
endpoints, so a batch which was just logged in hints is good enough for
them.
We do not plan on replicating this usage of hints at this time, so as a
middle ground, the CL is changed to EACH_QUORUM.
Fixes: scylladb/scylladb#25432Closesscylladb/scylladb#26304
This will allow us to communicate with CDC from higher layers. We plan
to use it to reduce the number of read-before-writes with preimages by
passing the row selected in upper layers.
`SELECT` commands with SERIAL consistency level are historically allowed for vnode-based views, even though they don't provide linearizability guarantees and in general don't make much sense. In this PR we prohibit LWTs for tablet-based views, but preserve old behavior for vnode-based views for compatibility. Similar logic is applied to CDC log tables.
We also add a general check that disallows colocating a table with another colocated table, since this is not needed for now.
Fixes https://github.com/scylladb/scylladb/issues/26258
backports: not needed (a new feature)
Closesscylladb/scylladb#26284
* github.com:scylladb/scylladb:
cql_test_env.cc: log exception when callback throws
lwt: prohibit for tablet-based views and cdc logs
tablets: disallow chains of colocated tables
database: get_base_table_for_tablet_colocation: extract table_id_by_name lambda
SELECT commands with SERIAL consistency level are historically allowed
for vnode-based views, even though they don't provide linearizability
guarantees. We prohibit LWTs for tablet-based views, but preserve old
behavior for vnode-based view for compatibility. Similar logic is
applied to CDC log tables.
Fixesscylladb/scylladb#26258
It belongs there, it is a completely replica-side thing. Also take the
opportunity to rename it to multishard_query.{hh,cc}, it is not just
mutation anymore (data query is also implemented).
When a node is being replaced, it enters a "left" state while still
owning tokens. Before this patch, this is also the time when we start
draining hints targeted to this node, so the hints may get sent before
the token ownership gets migrated to another replica, and these hints
may get lost.
In this patch we postpone the hint draining for the "left" nodes to
the time when we know that the target nodes no longer hold ownership
of any tokens - so they're no longer referenced in topology. I'm
calling such nodes "released".
Before this change, when we were starting draining hints, we knew the
IP addresses of the target nodes. We lose this information after entering
"left" stage, so when draining hints after a node is "released", we
can't drain the hints targeted to a specific IP instead of host_id.
We may have hints targeted to IPs if the migration rom IP-based to
host_ID-based hints didn't happen yet. The migration happens
when enabling a cluster feature since 2024.2.0, so such hints can
only exist if we perform a direct upgrade from a version before
2024.2.0 to a version that has this change (2025.4.0+).
To avoid losing hints completely when such an upgrade is combined
with a node removal/replacement, we still drain hints when the node
enters a "left" state and the migration of hints to host_id wasn't
performed yet. For these drains, the problematic scenario can't
occur because it only affects tablets, and when upgrading from
a version before 2024.2.0, no tablets can exist yet.
If we perform such a drain, we no longer need to drain hints when
entering the "released" state, so we only drain when entering that
state if the migration was already completed.
With this setup, we'll always drain hints at least once when a node
is leaving. However, if the migration to host_ids finishes between
entering the "left" state and the "released" state, we'll attempt
to drain the hints twice. This shouldn't be problem though because
each `drain_for()` is performed with the `_drain_lock` and after
a `hint_endpoint_manger` is drained, it's removed, so we won't try
to drain it twice.
This patch also includes a test for verifying that hints are properly
replayed after a node replace.
Fixes https://github.com/scylladb/scylladb/issues/24980Closesscylladb/scylladb#24981
The latter is recommended in seastar, and the former was left as
compatibility alias. Latest seastar explicitly marks it as deprecated so
once the submodule is updated, compilation logs will explode.
Most of the patch is generated with
for f in $(git grep -l '\<distributed<[A-Za-z0-9:_]*>') ; do sed -e 's/\<distributed<\([A-Za-z0-9:_]*\)>/sharded<\1>/g' -i $f; done
for f in $(git grep -l distributed.hh); do sed -e 's/distributed.hh/sharded.hh/' -i $f ; done
and a small manual change in test/perf/perf.hh
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#26136
This PR consists of three parts:
* Small refactoring of the fencing APIs in storage_proxy (renames + comments + some functions were extracted)
* Implement the fencing for LWT verbs itself. This includes checking the fencing token before and after local replica data accesses.
* Two new `test.py` tests in `test_fencing.py`, which check the fencing in some real-world scenarios.
Backport: no need -- fencing for LWT requests is needed primarily for LWT over tablets, which is not released yet.
Fixesscylladb/scylladb#22332Closesscylladb/scylladb#25550
* https://github.com/scylladb/scylladb:
test_tablets_lwt: eliminate redundant disable_tablet_balancing
test_fencing: add test_lwt_fencing_upgrade
pylib: extract upgrade helpers from test_sstable_compression_dictionaries_upgrade.py
test_fencing: add test_fenced_out_on_tablet_migration_while_handling_paxos_verb
test_fencing: test_fence_lwt_during_bootstap
pylib/rest_client.py: encode injection name
storage_proxy_stats: add fenced_out_requests metric
storage_proxy: add fencing to Paxos verbs
storage_proxy::apply_fence: add overload that throws on failure
storage_proxy: extract apply_fence_result
sp::apply_fence: rename to apply_fence_on_ready
sp::apply_fence: rename to check_fence
sp::apply_fence: make non-generic
As requested in #22120, moved the files and fixed other includes and build system.
Moved files:
- query.cc
- query-request.hh
- query-result.hh
- query-result-reader.hh
- query-result-set.cc
- query-result-set.hh
- query-result-writer.hh
- query_id.hh
- query_result_merger.hh
Fixes: #22120
This is a cleanup, no need to backport
Closesscylladb/scylladb#25105
This commit adds fencing support to all Paxos verbs:
* Pass an optional (for backward compatibility) fencing_token as a
parameter to the prepare, accept, learn, and prune verbs.
* Call apply_fence twice — before and after accessing local data. This
ensures that if the coordinator is fenced out mid-request, the replica
does not return success, which would otherwise incorrectly contribute
to achieving the target CL. Without this, a user might observe
successful writes that become unreadable after the topology
operation completes.
* For prune, call apply_fence only once because it does not return a
response to the LWT coordinator.
Fixesscylladb/scylladb#22332
This commit refactors a repeated pattern that applies the fence
and embeds the exception into the exception_variant class by
extracting it into a separate method.
This overload performs the fence check only when the future is ready.
In this commit, we give it a more descriptive name to better reflect
its behavior. Additionally, we add extensive comments explaining the
overall fencing scheme and the motivation behind this specific overload.
We plan to introduce several additional apply_fence overloads in
upcoming commits. To avoid ambiguity, this change renames the existing
base function to check_fence.
Consider the following scenario:
- Current replica set is [A, B, C]
- write succeeds on [A, B], and a hint is logged for node C
- before the hint is replayed, D bootstraps and the token migrates from C to D
- hint is replayed to node C while D is pending, but it's too late, since streaming for that token is already done
- C is cleaned up, replayed data is lost, and D has a stale copy until next repair.
In the scenario we effectively fail to send the hint. This scenario is also more likely to happen with tablets,
as it can happen for every tablet migration.
This issue is particularly detrimental to materialized views. View updates use hints by default and a specific
view update may be sent to just one view replica (when a single base replica has a different row state due to
reordering or missed writes). When we lose a hint for such a view update, we can generate a persistent inconsistency
between the base and view - ghost rows can appear due to a lost tombstone and rows may be missing in the view due
to a lost row update. Such inconsistencies can't be fixed neither by repairing the view or the base table.
To handle this, in this patch we add the pending replicas to the list of targets of each hint, even if the original
target is still alive.
This will cause some updates to be redundant. These updates are probably unavoidable for now, but they shouldn't
be too common either. The scenarios for them are:
1. managing to send the hint to the source of a migrating replica before streaming that its token - the write will
arrive on the pending replica anyway in streaming
2. the hint target not being the source of the migration - if we managed to apply the original write of the hint to
the actual source of the migration, the pending replica will get it during streaming
3. sending the same hint to many targets at a similar time - while sending to each target, we'll see the same pending
replica for the hint so we'll send it multiple times
4. possible retries where even though the hint was successfully sent to the main target, we failed to send it to the
pending replica, so we need to retry the entire write
This patch handles both tablet migrations and tablet rebuilds. In the future, for tablet migrations, we can avoid
sending the hint to pending replias if the hint target is not the source fo the migration, which would allow us to
avoid the redundant writes 2 and 3. For rack-aware RF, this will be as simple as checking whether the replicas are
in the same rack.
We also add a test case reproducing the issue.
Co-Authored-By: Raphael S. Carvalho <raphaelsc@scylladb.com>
Fixes https://github.com/scylladb/scylladb/issues/19835Closesscylladb/scylladb#25590
Write requests cannot be safely retried if some replicas respond with
accepts and others with rejects. In this case, the coordinator is
uncertain about the outcome of the LWT: a subsequent LWT may either
complete the Paxos round (if a quorum observed the accept) or overwrite it
(if a quorum did not). If the original LWT was actually completed by
later rounds and the coordinator retried it, the write could be applied
twice, potentially overwriting effects of other LWTs that slipped in
between. Read requests do not have this problem, so they
can be safely retried.
Before this commit, handler->accept_proposal was called with
timeout_if_partially_accepted := true. This caused both read and write
requests to throw an "uncertainty" timeout to the user in the case
of the contention described above. After this commit, we throw an
"uncertainty" timeout only for write requests, while read requests
are instead retried in the loop in sp::cas.
Closesscylladb/scylladb#25602
When a scaling out is delayed or fails, it is crucial to ensure that clusters remain operational
and recoverable even under extreme conditions. To achieve this, the following proactive measures
are implemented:
- reject writes
- includes: inserts, updates, deletes, counter updates, hints, read+repair and lwt writes
- applicable to: user tables, views, CDC log, audit, cql tracing
- stop running compactions/repairs and prevent from starting new ones
- reject incoming tablet migrations
The aforementioned mechanisms are automatically enabled when node's disk utilization reaches
the critical level (default: 98%) and disabled when the utilization drop below the threshold.
Apart from that, the series add tests that require mounted volumes to simulate out of space.
The paths to the volumes can be provided using the a pytest argument, i.e. `--space-limited-dirs`.
When not provided, tests are skipped.
Test scenarios:
1. Start a cluster and write data until one of the nodes reaches 90% of the disk utilization
2. Perform an **operation** that would take the nodes over 100%
3. The nodes should not exceed the critical disk utilization (98% by default)
4. Scale out the cluster by adding one node per rack
5. Retry or wait for the **operation** from step 2
The **operation** is: writing data, running compactions, building materialized views, running repair,
migrating tablets (caused by RF change, decommission).
The test is successful, if no nodes run out of space, the **operation** from step 2 is
aborted/paused/timed out and the **operation** from step 5 is successful.
`perf-simple-query --smp 1 -m 1G` results obtained for fixed 400MHz frequency:
Read path (before)
```
instructions_per_op:
mean= 39661.51 standard-deviation=34.53
median= 39655.39 median-absolute-deviation=23.33
maximum=39708.71 minimum=39622.61
```
Read path (after)
```
instructions_per_op:
mean= 39691.68 standard-deviation=34.54
median= 39683.14 median-absolute-deviation=11.94
maximum=39749.32 minimum=39656.63
```
Write path (before):
```
instructions_per_op:
mean= 50942.86 standard-deviation=97.69
median= 50974.11 median-absolute-deviation=34.25
maximum=51019.23 minimum=50771.60
```
Write path (after):
```
instructions_per_op:
mean= 51000.15 standard-deviation=115.04
median= 51043.93 median-absolute-deviation=52.19
maximum=51065.81 minimum=50795.00
```
Fixes: https://github.com/scylladb/scylladb/issues/14067
Refs: https://github.com/scylladb/scylladb/issues/2871
No backport, as it is a new feature.
Closesscylladb/scylladb#23917
* github.com:scylladb/scylladb:
tests/cluster: Add new storage tests
test/scylla_cluster: Override workdir when passed via cmdline
streaming: Reject incoming migrations
storage_service: extend locator::load_stats to collect per-node critical disk utilization flag
repair_service: Add a facility to disable the service
compaction_manager: Subscribe to out of space controller
compaction_manager: Replace enabled/disabled states with running state
database: Add critical_disk_utilization mode database can be moved to
disk_space_monitor: add subscription API for threshold-based disk space monitoring
docs: Add feature documentation
config: Add critical_disk_utilization_level option
replica/exceptions: Add a new custom replica exception
The new exception `critical_disk_utilization_exception` is thrown
when the user table mutation writes are being blocked due to e.g.
reaching a critical disk utilization level.
This new exception, is then correctly handled on the coordinator
side when transforming into `mutation_write_failure_exception` with
a meaningful error message: "Write rejected due to critical disk
utilization".
The previous implementation did not handle topology changes well:
* In `node_local_only` mode with CL=1, if the current node is pending, the CL is increased to 2, causing
`unavailable_exception`.
* If the current tablet is in `write_both_read_old` and we try to read with `node_local_only` on the new node, the replica list will be empty.
This patch changes `node_local_only` mode to always use `my_host_id` as the replica list. An explicit check ensures the current node is a replica for the operation; otherwise `on_internal_error` is called.
backport: not needed, since `node_local_only` is only used in LWT for tablets and it hasn't been released yet.
Closesscylladb/scylladb#25508
* github.com:scylladb/scylladb:
test_tablets_lwt: add test_lwt_during_migration
storage_proxy: node_local_only: always use my_host_id