1370 Commits

Author SHA1 Message Date
Dawid Mędrek
a8dd13731f Merge 'Improve debuggability of test/cluster/test_data_resurrection_in_memtable.py' from Botond Dénes
This test was observed to fail in CI recently but there is not enough information in the logs to figure out what went wrong. This PR makes a few improvements to make the next investigation easier, should it be needed:
* storage-service: add table name to mutation write failure error messages.
* database: the `database_apply` error injection used to cause trouble, catching writes to bystander tables, making tests flaky. To eliminate this, it gained a filter to apply only to non-system keyspaces. Unfortunately, this still allows it to catch writes to the trace tables. While this should not fail the test, it reduces observability, as some traces disappear. Improve this error injection to only apply to selected table. Also merge it with the `database_apply_wait` error injection, to streamline the code a bit.
* test/test_data_resurrection_in_memtable.py: dump data from the datable, before the checks for expected data, so if checks fail, the data in the table is known.

Refs: SCYLLADB-812
Refs: SCYLLADB-870
Fixes: SCYLLADB-1050 (by restricting `database_apply` error injection, so it doesn't affect writes to system traces)

Backport: test related improvement, no backport

Closes scylladb/scylladb#28899

* github.com:scylladb/scylladb:
  test/cluster/test_data_resurrection_in_memtable.py: dump rows before check
  replica/database: consolidate the two database_apply error injections
  service/storage_proxy: add name of table to error message for write errors
2026-03-17 13:35:19 +01:00
Botond Dénes
44b8cad3df service/storage_proxy: add name of table to error message for write errors
It is useful to know what table the failed write belongs to.
2026-03-05 10:51:12 +02:00
Piotr Dulikowski
85dcbfae9a Merge 'hint: Don't switch group in database::apply_hint()' from Pavel Emelyanov
The method is called from storage_proxy::mutate_hint() which is in turn called from hint_mutation::apply_locally(). The latter is either called from directly by hint sender, which already runs in streaming group, or via RPC HINT_MUTATION handler which uses index 1 that negotiates streaming group as well.

To be sure, add a debugging check for current group being the expected one.

Code cleanup, not backporting

Closes scylladb/scylladb#28545

* github.com:scylladb/scylladb:
  hint: Don't switch group in database::apply_hint()
  hint_sender: Switch to sender group on stop either
2026-03-04 09:36:38 +01:00
Marcin Maliszkiewicz
a83ee6cf66 Merge 'db/batchlog_manager: re-add v1 support for mixed clusters' from Botond Dénes
3f7ee3ce5d introduced system.batchlog_v2, with a schema designed to speed up batchlog replays and make post-replay cleanups much more effective.
It did not introduce a cluster feature for the new table, because it is node local table, so the cluster can switch to the new table gradually, one node at a time.
However, https://github.com/scylladb/scylladb/issues/27886 showed that the switching causes timeouts during upgrades, in mixed clusters. Furthermore, switching to the new table unconditionally  on upgrades nodes, means that on rollback, the batches saved into the v2 table are lost.
This PR introduces re-introduces v1 (`system.batchlog`) support and guards the use of the v2 table with a cluster feature, so mixed clusters keep using v1 and thus be rollback-compatible.
The re-introduced v1 support doesn't support post-replay cleanups for simplicity. The cleanup in v1 was never particularly effective anyway and we ended up disabling it for heavy batchlog users, so I don't think the lack of support for cleanup is a problem.

Fixes: https://github.com/scylladb/scylladb/issues/27886

Needs backport to 2026.1, to fix upgrades for clusters using batches

Closes scylladb/scylladb#28736

* github.com:scylladb/scylladb:
  test/boost/batchlog_manager_test: add tests for v1 batchlog
  test/boost/batchlog_manager_test: make prepare_batches() work with both v1 and v2
  test/boost/batchlog_manager_test: fix indentation
  test/boost/batchlog_manager_test: extract prepare_batches() method
  test/lib/cql_assertions: is_rows(): add dump parameter
  tools/scylla-sstable: extract query result printers
  tools/scylla-sstable: add std::ostream& arg to query result printers
  repair/row_level: repair_flush_hints_batchlog_handler(): add all_replayed to finish log
  db/batchlog_manager: re-add v1 support
  db/batchlog_manager: return all_replayed from process_batch()
  db/batchlog_manager: process_bath() fix indentation
  db/batchlog_manager: make batch() a standalone function
  db/batchlog_manager: make structs stats public
  db/batchlog_manager: allocate limiter on the stack
  db/batchlog_manager: add feature_service dependency
  gms/feature_service: add batchlog_v2 feature
2026-03-02 12:09:10 +01:00
Calle Wilund
fec7df7cbb topology::snapshot: Add expiry (ttl) to RPC/topo op
Not set yet, but includes it in messages so it can be properly
set in calling code. Will add entry to manifest.
2026-02-23 11:37:17 +01:00
Calle Wilund
425d6b4441 storage_proxy: Add snapshot_keyspace method
Takes set of ks->tables tuples and issues snapshot for each.
If feature is enabled, keyspace is non-local, and uses tablets,
will issue topo coordinator call across cluster.

Keyspaces not fitting the above will just go to "normal" (node
local) snapshot.
2026-02-23 11:27:15 +01:00
Calle Wilund
2bc633c3bd storage_proxy: Add handler for SNAPSHOT_WITH_TABLETS 2026-02-23 10:44:42 +01:00
Calle Wilund
6e39c3bb83 storage_proxy: Break out logic from request_truncate_with_tablets
Makes request_truncate_with_tablets use a parameterized helper,
because eventually we will want to use almost identical logic
for other ops, like snapshot.
2026-02-23 10:43:28 +01:00
Botond Dénes
dd50bd9bd4 db/batchlog_manager: re-add v1 support
system.batchlog will still have to be used while the cluster is
upgrading from an older version, which doesn't know v2 yet.
Re-add support for replaying v1 batchlogs. The switch to v2 will happen
after the BATCHLOG_V2 cluster feature is enabled.

The only external user -- storage_proxy -- only needs a minor
adjustment: switch between the table names. The rest is handled
transparently by the db/batchlog.hh interface and the batchlog_manager.
2026-02-20 07:03:46 +02:00
Petr Gusev
df73f723a6 storage_proxy: hold erms in replica handlers
Add explicit erm-holding variables in all replica-side RPC handlers.
This is required to ensure that tablet migration waits for in-flight
replica requests even if a non-replica coordinator has been fenced out.

Holding erms on the replica side may increase the global-barrier wait
time, since the barrier must drain these requests. We believe this
is acceptable because:
* We already hold erms during replica-side request execution, but in
an ad-hoc, non-systemic way in lower layers of storage_proxy
(e.g. in sp::mutate_locally and do_query_tablets).
* Replica requests are bounded by replica-side timeouts, so the
global-barrier wait time cannot exceed the maximum of these timeouts.

For Paxos verbs, we use token_metadata_guard, which wraps the ERM and
automatically refreshes it when tablet migration does not affect the
current token; see the token_metadata_guard comments for details.
We use this guard only for Paxos verbs because regular reads and writes
already hold raw erms in storage_proxy and on the coordinators.

The erms must be held in all RPC handlers that support fencing — that
is, those with a fencing_token parameter in storage_proxy.idl.
Counter updates already hold erms in
mutate_counter_on_leader_and_replicate.

Fix test_tablets2::test_timed_out_reader_after_cleanup: the tablets
barrier now waits for all nodes. As a result, the replica read
is expected to finish, rather than fail due to the tablet having
moved as it did previously. The test is renamed to
test_tablets_barrier_waits_for_replica_erms to better reflect its
purpose.

Refs scylladb/scylladb#26864
2026-02-16 08:57:42 +01:00
Pavel Emelyanov
83e64b516a hint: Don't switch group in database::apply_hint()
The method is called from storage_proxy::mutate_hint() which is in turn
called from hint_mutation::apply_locally(). The latter is either called
from directly by hint sender, which already runs in streaming group, or
via RPC HINT_MUTATION handler which uses index 1 that negotiates streaming
group as well.

To be sure, add a debugging check for current group being the expected one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-02-09 08:54:51 +03:00
Dawid Mędrek
b0afd3aa63 Merge 'storage_service: set up topology properly in maintenance mode' from Patryk Jędrzejczak
We currently make the local node the only token owner (that owns the
whole ring) in maintenance mode, but we don't update the topology properly.
The node is present in the topology, but in the `none` state. That's how
it's inserted by `tm.get_topology().set_host_id_cfg(host_id);` in
`scylla_main`. As a result, the node started in maintenance mode crashes
in the following way in the presence of a vnodes-based keyspace with the
NetworkTopologyStrategy:
```
scylla: locator/network_topology_strategy.cc:207:
    locator::natural_endpoints_tracker::natural_endpoints_tracker(
    const token_metadata &, const network_topology_strategy::dc_rep_factor_map &):
    Assertion `!_token_owners.empty() && !_racks.empty()' failed.
```
Both `_token_owners` and `_racks` are empty. The reason is that
`_tm.get_datacenter_token_owners()` and
`_tm.get_datacenter_racks_token_owners()` called above filter out nodes
in the `none` state.

This bug basically made maintenance mode unusable in customer clusters.

We fix it by changing the node state to `normal`.

We also extend `test_maintenance_mode` to provide a reproducer for

Fixes #27988

This PR must be backported to all branches, as maintenance mode is
currently unusable everywhere.

Closes scylladb/scylladb#28322

* github.com:scylladb/scylladb:
  test: test_maintenance_mode: enable maintenance mode properly
  test: test_maintenance_mode: shutdown cluster connections
  test: test_maintenance_mode: run with different keyspace options
  test: test_maintenance_mode: check that group0 is disabled by creating a keyspace
  test: test_maintenance_mode: get rid of the conditional skip
  test: test_maintenance_mode: remove the redundant value from the query result
  storage_proxy: skip validate_read_replica in maintenance mode
  storage_service: set up topology properly in maintenance mode
2026-02-02 13:28:19 +01:00
Patryk Jędrzejczak
9d4a5ade08 storage_proxy: skip validate_read_replica in maintenance mode
In maintenance mode, the local node adds only itself to the topology. However,
the effective replication map of a keyspace with tablets enabled contains all
tablet replicas. It gets them from the tablets map, not the topology. Hence,
`network_topology_strategy::sanity_check_read_replicas` hits
```
throw std::runtime_error(format("Requested location for node {} not in topology. backtrace {}", id, lazy_backtrace()));
```
for tablet replicas other than the local node.

As a result, all requests to a keyspace with tablets enabled and RF > 1 fail
in debug mode (`validate_read_replica` does nothing in other modes). We don't
want to skip maintenance mode tests in debug mode, so we skip the check in
maintenance mode.

We move the `is_debug_build()` check because:
- `validate_read_replicas` is a static function with no access to the config,
- we want the `!_db.local().get_config().maintenance_mode()` check to be
  dropped by the compiler in non-debug builds.

We also suppress `-Wunneeded-internal-declaration` with `[[maybe_unused]]`.
2026-01-30 12:55:17 +01:00
Pavel Emelyanov
c61d855250 hints: Provide explicit scheduling group for hint_sender
Currently it grabs one from database, but it's not nice to use database
as config/sched-groups provider.

This PR passes the scheduling group to use for sending hints via manager
which, in turn, gets one from proxy via its config (proxy config already
carries configuration for hints manager). The group is initialized in
main.cc code and is set to the maintenance one (nowadays it's the same
as streaming group).

This will help splitting the streaming scheduling group into more
elaborated groups under the maintenance supergroup: SCYLLADB-351

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28358
2026-01-27 12:50:11 +02:00
Petr Gusev
c45244b235 storage_proxy: drop stop() method
It's not called by main.cc and can be confusing.
2026-01-23 11:22:03 +01:00
Petr Gusev
f5ed3e9fea test_lwt_shutdown: fix flakiness by removing storage_proxy::stop injection
storage_proxy::stop() is not called by main (it is commented out due to #293),
so the corresponding message injection is never hit. When the test releases
paxos_state_learn_after_mutate, shutdown may already be in progress or even
completed by the time we try to trigger the storage_proxy::stop injection,
which makes the test flaky.

Fix this by completely removing the storage_proxy::stop injection.
The injection is not required for test correctness. Shutdown must wait for the
background LWT learn to finish, which is released via the
paxos_state_learn_after_mutate injection.

The shutdown process blocks on in-flight api HTTP requests through
seastar::httpd::http_server::stop and its _task_gate, so the
shutdown will not prevent the HTTP request that released the
paxos_state_learn_after_mutate from completing successfully.

Fixes scylladb/scylladb#28260
2026-01-23 11:20:36 +01:00
Yaniv Michael Kaul
d919aacc69 storage_proxy: mark write_timeouts metric for counter write timeouts
When a counter write times out (due to rpc::timeout_error or timed_out_error),
the code was throwing mutation_write_timeout_exception but not marking the
write_timeouts metric. This resulted in counter write timeouts not being
counted in the scylla_storage_proxy_coordinator_write_timeouts metric.

Regular writes go through mutate_internal -> mutate_end, which catches
mutation_write_timeout_exception and marks the metric. However, counter
writes use a separate code path (mutate_counters) that has its own
exception handling but was missing the metric update.

This fix adds get_stats().write_timeouts.mark() before throwing the
timeout exception in the counter write path, consistent with how the
CAS path handles cas_write_timeouts.

Refs: https://scylladb.atlassian.net/browse/SCYLLADB-245

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#28019
2026-01-14 17:50:46 +02:00
Botond Dénes
9bea842c01 service/storage_proxy: don't throw exceptions on the full-scan path
Use coroutine::try_future() to avoid exceptions taking flight and
triggering expensive stack-unwinding.
Especially bad for common exceptions like timeouts.
2026-01-13 10:47:57 +02:00
Radosław Cybulski
5f48ab3875 storage_proxy: fix invalid assert
Change invalid `assert(true)` into `SCYLLA_ASSERT(false)`, as
the latter was clearly meant.

Closes scylladb/scylladb#27900
2026-01-08 21:55:18 +02:00
Dawid Mędrek
77a934e5b9 db/hints: Prevent draining hints before hint replay is allowed
Context
-------
The procedure of hint draining boils down to the following steps:

1. Drain a hint sender. That should get rid of all hints stored
   for the corresponding endpoint.
2. Remove the hint directory corresponding to that endpoint.

Obviously, it gets more complex than this high-level perspective.
Without blurring the view, the relevant information is that step 1
in the algorithm above may not be executed.

Breaking it down, it comprises of two calls to
`hint_sender::send_hints_maybe()`. The function is responsible for
sending out hints, but it's not unconditional and will not be performed
if any of the following bullets is not satisfied:

* `hint_sender::replay_allowed()` is not `true`. This can happen when
  hint replay hasn't been turned on yet.
* `hint_sender::can_send()` is not `true`. This can happen if the
  corresponding endpoint is not alive AND it hasn't left the cluster
  AND it's still a normal token owner.

There is one more relevant point: sending hints can be stopped if
replaying hints fails and `hint_sender::send_hints_maybe()` returns
`false`. However, that's not not possible in the case of draining.
In that case, if Scylla comes across any failure, it'll simply delete
the corresponding hint segment. Because of that, we ignore it and
only focus on the two bullets.

---

Why is it a problem?
--------------------
If a hint directory is not purged of all hint segments in it,
any attempt to remove it will fail and we'll observe an error like this:

```
Exception when draining <host ID>: std::filesystem::__cxx11::filesystem_error
(error system:39, filesystem error: remove failed: Directory not empty [<path>])
```

The folder with the remaining hints will also stay on disk, which is, of
course, undesired.

---

When can it happen?
-------------------
As highlighted in the Context section of this commit message, the
key part of the code that can lead to a dangerous situation like that
is `hint_sender::send_hints_maybe()`. The function is called twice when
draining a hint endpoint manager: once to purge all of the existing
hints, and another time after flushing all hints stored in a commitlog
instances, but not listed by `hint_sender` yet. If any of those calls
misbehaves, we may end up with a problem. That's why it's crucial to
ensure that the function always goes through ALL of the hints.

Dangerous situations:

1. We try to drain hints before hint replay is allowed. That will
   violate the first bullet above.
2. The node we're draining is dead, but it hasn't left the cluster,
   and it still possesses some tokens.

---

How do we solve that?
---------------------
Hint replay is turned on in `main.cc`. Once enabled, it cannot be
disabled. So to address the first bullet above, it suffices to ensure
that no draining occurs beforehand. It's perfectly fine to prevent it.
Soon after hint replay is allowed, `main.cc` also asks the hint manager
to drain all of the endpoint managers whose endpoints are no longer
normal token owners (cf. `db::hints::manager::drain_left_nodes()`).

The other bullet is more tricky. It's important here to know that
draining only initiated in three situations:

1. As part of the call to `storage_service::notify_left()`.
2. As part of the call to `storage_service::notify_released()`.
3. As part of the call to `db::hints::manager::drain_left_nodes()`.

The last one is trivially non-problematic. The nodes that it'll try to
drain are no longer normal token owners, so `can_send()` must always
return `true`.

The second situation is similar. As we read in the commit message of
scylladb/scylladb@eb92f50413, which
introduced the notion of released nodes, the nodes are no longer
normal token owners:

> In this patch we postpone the hint draining for the "left" nodes to
> the time when we know that the target nodes no longer hold ownership
> of any tokens - so they're no longer referenced in topology. I'm
> calling such nodes "released".

I suggest reading the full commit message there because the problems
there are somewhat similar these changes try to solve.

Finally, the first situation: unfortunately, it's more tricky. The same
commit message says:

> When a node is being replaced, it enters a "left" state while still
> owning tokens. Before this patch, this is also the time when we start
> draining hints targeted to this node, so the hints may get sent before
> the token ownership gets migrated to another replica, and these hints
> may get lost.

This suggests that `storage_service::notify_left()` may be called when
the corresponding node still has some tokens! That's something that may
prevent properly draining hints.

Fortunately, no hope is lost. We only drain hints via `notify_left()`
when hinted handoff hasn't been upgraded to being host-ID-based yet.
If it has, draining always happens via `notify_released()`.

When I write this commit message, all of the supported versions of
Scylla 2025.1+ use host-ID-based hinted handoff. That means that
problems can only arise when upgrading from an older version of Scylla
(2024.1 downwards). Because of that, we don't cover it. It would most
likely require more extensive changes.

---

Non-issues
----------
There are notions that are closely related to sending hints. One of them
is the host filter that hinted handoff uses. It decides which endpoints
are eligible for receiving hints, and which are not. Fortunately, all
endpoints rejected by the host filter lose their hint endpoint managers
-- they're stopped as part of that procedure. What's more, draining
hints and changing the host filter cannot be happening at the same time,
so it cannot lead to any problems.

The solution
------------
To solve the described issue, we simply prevent draining hints before
hint replay is allowed. No reproducer test is attached because it's not
feasible to write one.

Fixes scylladb/scylladb#27693

Closes scylladb/scylladb#27713
2026-01-04 16:54:05 +02:00
Tomasz Grabiec
c077283352 Merge 'service: support conversion of tablet keyspaces to rack-list using ALTER KEYSPACE' from Aleksandra Martyniuk
If a keyspace has a numeric replication factor in a DC and rf < #racks,
then the replicas of tablets in this keyspace can be distributed among
all racks in the DC (different for each tablet). With rack list, we need all
tablet replicas to be placed on the same racks. Hence, the conversion
requires tablet co-location.

After this series, the conversion can be done using ALTER KEYSPACE
statement. The statement that does this conversion in any DC is not
allowed to change a rf in any DC. So, if we have dc1 and dc2 with 3 racks
each and a keyspace ks then with a single ALTER KEYSPACE we can do:
- {dc1 : 2} -> {dc1 : [r1, r2]};
- {dc1 : 2, dc2: 2} ->  {dc1 : [r1, r2], dc2: [r2,r3]};
- {dc1 : 2, dc2: 2} -> {dc1 : [r1, r2], dc2: 2}
- {dc1 : 2} -> {dc1 : 2, dc2 : [r1]}
But we cannot do:
- {dc1 : 2} -> {dc1 : [r1, r2, r3]};
- {dc1 : 1, dc2 : [r1, r2] → dc1: [r1], dc2: [r1].

In order to do the co-locations rf change request is paused. Tablet
load balancer examines the paused rf change requests and schedules
necessary tablet migrations. During the process of co-location, no other
cross-rack migration is allowed.

Load balancer checks whether any paused rf change request is
ready to be resumed. If so, it puts the request back to global topology
request queue.

While an rf change request for a keyspace is running, any other rf change
of this keyspace will fail.

Fixes: #26398.

New feature, no backport

Closes scylladb/scylladb#27279

* github.com:scylladb/scylladb:
  test: add est_rack_list_conversion_with_two_replicas_in_rack
  test: test creating tablet_rack_list_colocation_plan
  test: add test_numeric_rf_to_rack_list_conversion test
  tasks: service: add global_topology_request_virtual_task
  cql3: statements: allow altering from numeric rf to rack list
  service: topology_coordinator: pause keyspace_rf_change request
  service: implement make_rack_list_colocation_plan
  service: add tablet_rack_list_colocation_plan
  cql3: reject concurrent alter of the same keyspace
  test: check paused rf change requests persistence
  db: service: add paused_rf_change_requests to system.topology
  service: pass topology and system_keyspace to load_balancer ctor
  service: tablet_allocator: extract load updates
  service: tablet_allocator: extract ensure_node
  tasks, system_keyspace: Introduce get_topology_request_entry_opt()
  node_ops: Drop get_pending_ids()
  node_ops: Drop redundant get_status_helper()
2025-12-17 10:05:06 +01:00
Tomasz Grabiec
71e6ef90f4 tasks, system_keyspace: Introduce get_topology_request_entry_opt()
It's a cleanup. Better to return std::nullopt than faking an entry
with an id when require_entry == false.
2025-12-16 13:25:34 +01:00
Pavel Emelyanov
3f7ee3ce5d Merge 'batchlog: make replay (flush) faster' from Botond Dénes
The batchlog table contains an entry for each logged batch that is processed by the local node as coordinator. These entries are typically very short lived, they are inserted when the batch is processed and deleted immediately after the batch is successfully applied.
When a table has `tombstone_gc = {'mode': 'repair'}` enabled, every repair has to flush all hints and batchlogs, so that we can be certain that there is no live data in any of these, older than the last repair. Since batches can contain member queries from any number of tables, the whole batchlog has to be flushed, even if repair-mode tombstone-gc is enabled for a single table.

Flushing the batchlog table happens by doing a batchlog replay. This involves reading the entire content of this table, and attempting to replay+delete any live entries (that are old enough to be replayed).  Under normal operating circumstances, 99%+ of the content of the batchlog table is partition tombstones.  Because of this, scanning the content of this table has to process thousands to millions of tombstones. This was observed to require up to 20 minutes to finish, causing repairs to slow down to a crawl, as the batchlog-flush has to be repeated at the end of the repair of each token-range.

When trying to address this problem, the first idea was that we should expedite the garbage-collection of these accumulated tombstones. This experiment failed, see https://github.com/scylladb/scylladb/pull/23752. The commitlog proved to be an impossible to bypass barrier, preventing quick garbage-collection of tombstones. So long as a single commit-log segment is alive, holding content from the batchlog table, all tombstones written after are blocked from GC.
The second approach, represented by this PR, is to not rely in tombstone GC to reduce the tombstone amount. Instead restructure the table such that a single higher-order tombstone can be used to shadow and allow for the eviction of the myriads of individual batchlog entry tombstones. This is realized by reorganizing the batchlog table such that individual batches are rows, not partitions.
This new schema is introduced by the new `system.batchlog_v2` table, introduced by this PR:

    CREATE TABLE system.batchlog_v2 (
        version int,
        stage int,
        shard int,
        written_at timestamp,
        id uuid,
        data blob,
        PRIMARY KEY ((version, stage, shard), written_at, id));

The new schema organization has the following goals:
1) Make post-replay batchlog cleanup possible with a simple range-tombstone. This allows dropping the individual dead batchlog entries, as they are shadowed by a higher level tombstone. This enables dropping tombstones without tombstone GC.
2) To make the above possible, introduce the stage key component: batchlog entries that fail the first replay attempt, are moved to the failed_replay stage, so the initial stage can be cleaned up safely.
3) Spread out the data among Scylla shards, via the batchlog shard column.
4) Make batchlog entries ordered by the batchlog create time (id). This allows for selecting batchlogs to replay, without post-filtering of batchlogs that are too young to be replayed.

Fixes: https://github.com/scylladb/scylladb/issues/23358

This is an improvement, normally not a backport-candidate. We might override this and backport to allow wider use of `tombstone_gc: {'mode': 'repair'}`.

Closes scylladb/scylladb#26671

* github.com:scylladb/scylladb:
  db/config: change batchlog_replay_cleanup_after_replays default to 1
  test/boost/batchlog_manager_test: add test for batchlog cleanup
  replica/mutation_dump: always set position weight for clustering positions
  service/storage_proxy: s/batch_replay_throw/storage_proxy_fail_replay_batch/
  test/lib: introduce error_injection.hh
  utils/error_injection: add debug log to disable() and disable_all()
  test/lib/cql_test_env: forward config to batchlog
  test/lib/cql_test_env: add batch type to execute_batch()
  test/lib/cql_assertions: add with_size(predicate) overload
  test/lib/cql_assertions: add source location to fail messages
  test/lib/cql_assertions: columns_assertions: add assert_for_columns_of_each_row()
  test/lib/cql_assertions: rows_assertions::assert_for_columns_of_row(): add index bound check
  test/lib/cql_assertions: columns_assertions: add T* with_typed_column() overload
  db/batchlog_manager: config: s/write_timeout/reply_timeot/
  db,service: switch to system.batchlog_v2
  db/system_keyspace: introduce system.batchlog_v2
  service,db: extract generation of batchlog delete mutation
  service,db: extract get_batchlog_mutation_for() from storage-proxy
  db/batchlog_manager: only consider propagation delay with tombstone-gc=repair
  db/batchlog_manager: don't drop entire batch if one mutations' table was dropped
  data_dictionary: table: add get_truncation_time()
  db/batchlog_manager: batch(): replace map_reduce() with simple loop
  db/batchlog_manager: finish coroutinizing replay_all_failed_batches
  db/batchlog_manager: improve replayAllFailedBatches logs
2025-12-15 15:05:19 +03:00
Petr Gusev
c428645d16 storage_proxy: cas: take cas_request by raw reference
In the next commit we want to add an optimization that relies on
precise control over the lifetime of cas_request. In particular, we
want the implementation of this interface in Alternator to operate on
raw references that are guaranteed to remain valid only until the
cas() future is resolved. We already depend on the same lifetime
assumptions in cas_request when used by modification_statement.
However, these assumptions are not clearly expressed in the current
interface: cas_request is taken by shared_ptr, and nothing prevents
cas() from storing that pointer inside paxos_response_handler, which
may outlive the cas() future.

This commit fixes that by taking cas_request by raw reference. This
makes it explicit that cas() does not assume ownership of the object.
Callers must ensure that the referenced object remains valid until
the returned future is resolved.
2025-12-07 16:14:56 +01:00
Botond Dénes
e762027943 db/config: change batchlog_replay_cleanup_after_replays default to 1
Now that batchlog cleanup is cheap, on account of memtable flush on the
system.batchlog table garbage-collecting tombstones (previous patch), we
can afford to do cleanup on each replay, keeping the memtable size small
and more importantly -- the amount of tombstones in the memtable small.
2025-12-02 14:21:26 +02:00
Botond Dénes
8edd5b80ab test/boost/batchlog_manager_test: add test for batchlog cleanup
Add more tests covering different aspects of batchlog replay, cleanup,
replay timeout and finally v1 -> v2 migration.
2025-12-02 14:21:26 +02:00
Botond Dénes
8545f7eedd service/storage_proxy: s/batch_replay_throw/storage_proxy_fail_replay_batch/
Rename to make it more explicit where the error injection happens.
Also change how the error is injected, use the lambda overload instead
of is_enabled(), the former leaves better trace in logs, which helps
when debugging tests.
2025-12-02 14:21:26 +02:00
Botond Dénes
846b656610 db,service: switch to system.batchlog_v2
New batchlogs are written to the batchlog_v2 table and replay also uses
the v2 table.
The content of system.batchlog is attempted to be migrated to
system.batchlog_v2 after each start of the batchlog_manager service.
The migration is retried on each replay if it fails. This is reduntant
but simple.

Batchlog cleanup now doesn't involve flushing memtables, the only
remaining user of replica/database.hh is gone, so the include is
dropped.
2025-12-02 14:21:26 +02:00
Botond Dénes
9434ec2fd1 service,db: extract generation of batchlog delete mutation
Don't build batchlog delete mutations in storage-proxy code. Move this
code into db/batchlog_manager.cc, exposed via db/batchlog.hh.
This serves multiple goals:
1) Concentrates low-level batchlog related logic in
   db/batchlog_manager.cc
2) Reduce current and future code duplication.
3) Make future changes to this logic easier.
2025-12-02 14:21:25 +02:00
Botond Dénes
f54602daf0 service,db: extract get_batchlog_mutation_for() from storage-proxy
Don't build batchlog mutations in storage-proxy code. Move this code
into db/batchlog_manager.cc, exposed via db/batchlog.hh.
This serves multiple goals:
1) Concentrates low-level batchlog related logic in
   db/batchlog_manager.cc
2) Reduce current and future code duplication.
2) Make future changes to this logic easier.
2025-12-02 14:21:25 +02:00
Robert Bindar
817fdadd49 Improve choice distribution for primary replica
I noticed during tests that `maybe_get_primary_replica`
would not distribute uniformly the choice of primary replica
because `info.replicas` on some shards would have an order whilst
on others it'd be ordered differently, thus making the function choose
a node as primary replica multiple times when it clearly could've
chosen a different nodes.

This patch sorts the replica set before passing it through the
scope filter.

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2025-11-11 09:18:01 +02:00
Petr Gusev
5bda226ff6 storage_proxy: use coroutine::maybe_yield();
This is a small "while at it" refactoring -- better to use
coroutine::maybe_yield with co_await-s.
2025-11-05 14:38:19 +01:00
Petr Gusev
4578304b76 storage_proxy: use gates to track write handlers destruction
In #26408 a write_handler_destroy_promise class was introduced to
wait for abstract_write_response_handler instances destruction. We
strived to minimize the memory footprint of
abstract_write_response_handler, with write_handler_destroy_promise-es
we required only a single additional int. It turned our that in some
cases a lot of write handlers can be scheduled for deletion
at the same time, in such cases the
vector<write_handler_destroy_promise> can become big and cause
'oversized allocation' seastar warnings.

Another concern with write_handler_destroy_promise-es was that they
were more complicated than it was worth.

In this commit we replace write_handler_destroy_promise with simple
gates. One or more gates can be attached to an
abstract_write_response_handler to wait for its destruction. We use
utils::small_vector<gate::holder, 2> to store the attached gates.
The limit 2 was chosen because we expect two gates at the same time
in most cases. One is storage_proxy::_write_handlers_gate,
which is used to wait for all handlers in
cancel_all_write_response_handlers. Another one can be attached by
a caller of cancel_write_handlers. Nothing stops several
cancel_write_handlers to be called at the same time, but it should be
rare.

The sizeof(utils::small_vector<gate::holder, 2>) == 40, this is
40.0 / 488 * 100 ~ 8% increase in
sizeof(abstract_write_response_handler), which seems acceptable.

Fixes scylladb/scylladb#26788
2025-11-05 14:37:52 +01:00
Michael Litvak
296b116ae2 storage_proxy: lock all read shards for counter update
Previously in a counter update we lock the read shard to protect the
counter's read-modify-write against concurrent updates.

This is not sufficient when the counter is migrated between different
shards, because there is a stage where the read shard switches from the
old shard to the new shard, and during that switch there can be
concurrent counter updates on both shards. If each shard takes only its
own lock, the operations will not be exclusive anymore, and this can
cause lost counter updates.

To fix this, we acquire the counter lock on both shards in the stage
write_both_read_new, when both shards can serve reads. This guarantees
that counter updates continue to be exclusive during intranode
migration.
2025-11-03 16:04:35 +01:00
Michael Litvak
de321218bc storage_proxy: apply counter mutation on all write shards
When applying a counter mutation, use apply_on_shards to apply the
mutation on all write shards, similarly to the way other mutations are
applied in the storage proxy. Previously the mutation was applied only
on the current shard which is the read shard.

This is needed to respect the write_both stages of intranode migration
where we need to apply the mutation on both the old and the new shards.
2025-11-03 16:03:29 +01:00
Michael Litvak
c7e7a9e120 storage_proxy: move counter update coordination to storage proxy
Refactor the counter update to split the functions and have them called
by the storage proxy to prepare for a later change.

Previously in mutate_counter the storage proxy calls the replica
function apply_counter_update that does a few things:
1. checks that the operation can be done: check timeout, disk utilization
2. acquire counter locks
3. do read-modify-write and transform the counter mutation
4. apply the mutation in the replica

In this commit we change it so that these functions are split and called
from the storage proxy, so that we have better control from the storage
proxy when we change it later to work across multiple shards. For
example, we will want to acquire locks on multiple shards, transform it
on one shard, and then apply the mutation on multiple shards.

After the change it works as follows in storage proxy:
1. acquire counter locks
2. call replica prepare to check the operation and transform the mutation
3. call replica apply to apply the transformed mutation
2025-11-03 15:59:46 +01:00
Michael Litvak
579031cfc8 storage_proxy: refactor mutate_counter_on_leader
Slightly reorganize the mutate counter function to prepare it for a
later change.

Move the code that finds the read shard and invokes the rest of the
function on the read shard to the caller function. This simplifies the
function mutate_counter_on_leader_and_replicate which now runs on the
read shard and will make it easier to extend.
2025-11-03 08:43:11 +01:00
Botond Dénes
ac618a53f4 Merge 'db: repair: do not update repair_time if batchlog replay failed' from Aleksandra Martyniuk
Currently, batchlog replay is considered successful even if all batches fail
to be sent (they are replayed later). However, repair requires all batches
to be sent successfully. Currently, if batchlog isn't cleared, the repair never
learns and updates the repair_time. If GC mode is set to "repair", this means
that the tombstones written before the repair_time (minus propagation_delay)
can be GC'd while not all batches were replied.

Consider a scenario:
- Table t has a row with (pk=1, v=0);
- There is an entry in the batchlog that sets (pk=1, v=1) in table t;
- The row with pk=1 is deleted from table t;
- Table t is repaired:
    - batchlog reply fails;
    - repair_time is updated;
- propagation_delay seconds passes and the tombstone of pk=1 is GC'd;
- batchlog is replayed and (pk=1, v=1) inserted - data resurrection!

Do not update repair_time if sending any batch fails. The data is still repaired.
For tablet repair the repair runs, but at the end the exception is passed
to topology coordinator. Thanks to that the repair_time isn't updated.
The repair request isn't removed as well, due to which the repair will need
to rerun.

Apart from that, a batch is removed from the batchlog if its version is invalid
or unknown. The condition on which we consider a batch too fresh to replay
is updated to consider propagation_delay.

Fixes: https://github.com/scylladb/scylladb/issues/24415

Data resurrection fix; needs backport to all versions

Closes scylladb/scylladb#26319

* github.com:scylladb/scylladb:
  db: fix indentation
  test: add reproducer for data resurrection
  repair: fail tablet repair if any batch wasn't sent successfully
  db/batchlog_manager: fix making decision to skip batch replay
  db: repair: throw if replay fails
  db/batchlog_manager: delete batch with incorrect or unknown version
  db/batchlog_manager: coroutinize replay_all_failed_batches
2025-10-28 14:52:59 +02:00
Aleksandra Martyniuk
1935268a87 test: add reproducer for data resurrection
Add a reproducer to check that the repair_time isn't updated
if the batchlog replay fails.

If repair_time was updated, tombstones could be GC'd before the
batchlog is replayed. The replay could later cause the data
resurrection.
2025-10-23 10:39:43 +02:00
Petr Gusev
1dd05f4404 storage_proxy: use run_fenceable_write
Switch local write code sites from start_write() to
run_fenceable_write().
2025-10-22 16:31:43 +02:00
Petr Gusev
d56495fd9c storage_proxy: abstract_write_response_handler: apply_locally: extract post fence check
All mutation_holder::apply_locall() implementations now do the same
post fence chech. In this commit we hoist this check up to
abstract_write_response_handler::apply_locally().
2025-10-22 16:31:43 +02:00
Petr Gusev
24f8962938 storage_proxy: introduce run_fenceable_write
This function is intended to replace start_write() in subsequent
commits. It provides the following benefits:
* Remove duplication: All start_write() call sites must run the fence
check after the operation completes. run_fenceable_write() encapsulates
this pattern.
* Fix a race: To ensure no new stale write operations occur during
cleanup, a fence check before start_write() was previously used.
However, yields in several code paths between the check and
start_write() made it non-atomic, allowing a stale operation to slip in
if the fence_version was updated in between.
* Optimize waiting: We do not need to wait for all operations—only for
vnode-based, non-local tables with versions smaller than the current
fence_version.
2025-10-22 16:31:43 +02:00
Petr Gusev
c5f447224a storage_proxy: move update_fence_version from shared_token_metadata
Future commits will extend update_fence_version, and it is simpler to do
so if the function resides in storage_proxy. Additionally, fence_version
is the only field this function accesses, and it is used solely within
storage_proxy, making this change natural on its own.
2025-10-22 16:31:43 +02:00
Petr Gusev
659c5912e0 storage_proxy: fix start_write() operation scope in apply_locally
The operation must be held during the local write. Before this commit,
its scope ended after returning from apply_locally(), so it
did not actually provide any protection.
2025-10-22 16:31:43 +02:00
Petr Gusev
27915befac storage_proxy: move post fence check into handle_write
handle_write() is invoked from receive_mutation_handler() and
handle_paxos_learn(), and both previously performed a fence check in
apply_fn. This commit hoists the fence check into handle_write() to
reduce code duplication.

Additionally, move start_write() after get_schema_for_write(), since
there is no need to hold the operation while querying the schema.
2025-10-22 16:31:43 +02:00
Petr Gusev
41077138bf storage_proxy: move fencing into mutate_counter_on_leader_and_replicate
As noted in the code comments, start_write() does not need to be held
during counter replication; it is required only while performing local
storage modifications. Move the start_write() call and the fence
check down to mutate_counter_on_leader_and_replicate().

Additionally, mutate_counters_on_leader() is updated to check for
possible stale_topology_exception() and properly package them
in the resulting exception_variant structure.
2025-10-22 16:31:43 +02:00
Petr Gusev
a6208b2d67 storage_proxy::handle_read: add fence check before get_schema
Avoid querying the schema for outdated requests by adding a fence check
at the start of handle_read.
2025-10-22 16:31:43 +02:00
Avi Kivity
029513bee9 Merge 'storage_proxy: wait for write handlers destruction' from Petr Gusev
`shared_ptr<abstract_write_response_handler>` instances are captured in the `lmutate` and `rmutate` lambdas of `send_to_live_endpoints()`. As a result, an `abstract_write_response_handler` object may outlive its removal from the `storage_proxy::_response_handlers` map -> `cancel_all_write_response_handlers()` doesn't actually wait for requests completion -> `sp::drain_on_shutdown()` doesn't guarantee all requests are drained -> `sp::stop_remote()` completes too early and `paxos_store` is destroyed while LWT local writes might still be in progress. In this PR we introduce a `write_handler_destroy_promise` to wait for such pending instances in `cancel_write_handlers()` and `cancel_all_write_response_handlers()` to prevent the `use-after-free`.

A better long-term solution might be to replace `shared_ptr` with `unique_ptr` for `abstract_write_response_handler` and use a separate gate to track the `lmutate/rmutate` lambdas. We do not actually need to wait for these lambdas to finish before sending a timeout or error response to the client, as we currently do in `~abstract_write_response_handler`.

Fixes scylladb/scylladb#26355

backport: need to be backported to 2025.4 since #26355 is reproduced on LWT over tablets

Closes scylladb/scylladb#26408

* github.com:scylladb/scylladb:
  test_tablets_lwt: add test_lwt_shutdown
  storage_proxy: wait for write handler destruction
  storage_proxy: coroutinize cancel_write_handlers
  storage_proxy: cancel_write_handlers: don't hold a strong pointer to handler
2025-10-22 00:02:08 +03:00
Petr Gusev
8925f31596 test_tablets_lwt: add test_lwt_shutdown 2025-10-20 20:16:09 +02:00
Petr Gusev
bbcf3f6eff storage_proxy: wait for write handler destruction
shared_ptr<abstract_write_response_handler> instances are captured in
the lmutate/rmutate lambdas of send_to_live_endpoints(). As a result,
an abstract_write_response_handler object may outlive its removal from
the _response_handlers map. We use write_handler_destroy_promise to
wait for such pending instances in cancel_write_handlers() and
cancel_all_write_response_handlers() to prevent use-after-free.

A better long-term solution might be to replace shared_ptr with
unique_ptr for abstract_write_response_handler and use a separate gate
to track the lmutate/rmutate lambdas. We do not actually need to wait
for these lambdas to finish before sending a timeout or error response
to the client, as we currently do in ~abstract_write_response_handler.

Fixes scylladb/scylladb#26355
2025-10-20 20:10:42 +02:00