scylladb

Author	SHA1	Message	Date
Dawid Mędrek	a8dd13731f	Merge 'Improve debuggability of test/cluster/test_data_resurrection_in_memtable.py' from Botond Dénes This test was observed to fail in CI recently but there is not enough information in the logs to figure out what went wrong. This PR makes a few improvements to make the next investigation easier, should it be needed: * storage-service: add table name to mutation write failure error messages. * database: the `database_apply` error injection used to cause trouble, catching writes to bystander tables, making tests flaky. To eliminate this, it gained a filter to apply only to non-system keyspaces. Unfortunately, this still allows it to catch writes to the trace tables. While this should not fail the test, it reduces observability, as some traces disappear. Improve this error injection to only apply to the selected table. Also merge it with the `database_apply_wait` error injection, to streamline the code a bit. * test/test_data_resurrection_in_memtable.py: dump data from the table, before the checks for expected data, so if checks fail, the data in the table is known. Refs: SCYLLADB-812 Refs: SCYLLADB-870 Fixes: SCYLLADB-1050 (by restricting `database_apply` error injection, so it doesn't affect writes to system traces) Backport: test related improvement, no backport Closes scylladb/scylladb#28899 * github.com:scylladb/scylladb: test/cluster/test_data_resurrection_in_memtable.py: dump rows before check replica/database: consolidate the two database_apply error injections service/storage_proxy: add name of table to error message for write errors	2026-03-17 13:35:19 +01:00
Botond Dénes	44b8cad3df	service/storage_proxy: add name of table to error message for write errors It is useful to know what table the failed write belongs to.	2026-03-05 10:51:12 +02:00
Piotr Dulikowski	85dcbfae9a	Merge 'hint: Don't switch group in database::apply_hint()' from Pavel Emelyanov The method is called from storage_proxy::mutate_hint() which is in turn called from hint_mutation::apply_locally(). The latter is either called directly by the hint sender, which already runs in the streaming group, or via the RPC HINT_MUTATION handler, which uses index 1 that negotiates the streaming group as well. To be sure, add a debugging check for the current group being the expected one. Code cleanup, not backporting Closes scylladb/scylladb#28545 * github.com:scylladb/scylladb: hint: Don't switch group in database::apply_hint() hint_sender: Switch to sender group on stop either	2026-03-04 09:36:38 +01:00
Marcin Maliszkiewicz	a83ee6cf66	Merge 'db/batchlog_manager: re-add v1 support for mixed clusters' from Botond Dénes `3f7ee3ce5d` introduced system.batchlog_v2, with a schema designed to speed up batchlog replays and make post-replay cleanups much more effective. It did not introduce a cluster feature for the new table, because it is a node-local table, so the cluster can switch to the new table gradually, one node at a time. However, https://github.com/scylladb/scylladb/issues/27886 showed that the switching causes timeouts during upgrades, in mixed clusters. Furthermore, switching to the new table unconditionally on upgraded nodes means that on rollback, the batches saved into the v2 table are lost. This PR re-introduces v1 (`system.batchlog`) support and guards the use of the v2 table with a cluster feature, so mixed clusters keep using v1 and thus stay rollback-compatible. The re-introduced v1 support doesn't support post-replay cleanups for simplicity. The cleanup in v1 was never particularly effective anyway and we ended up disabling it for heavy batchlog users, so I don't think the lack of support for cleanup is a problem. 
Fixes: https://github.com/scylladb/scylladb/issues/27886 Needs backport to 2026.1, to fix upgrades for clusters using batches Closes scylladb/scylladb#28736 * github.com:scylladb/scylladb: test/boost/batchlog_manager_test: add tests for v1 batchlog test/boost/batchlog_manager_test: make prepare_batches() work with both v1 and v2 test/boost/batchlog_manager_test: fix indentation test/boost/batchlog_manager_test: extract prepare_batches() method test/lib/cql_assertions: is_rows(): add dump parameter tools/scylla-sstable: extract query result printers tools/scylla-sstable: add std::ostream& arg to query result printers repair/row_level: repair_flush_hints_batchlog_handler(): add all_replayed to finish log db/batchlog_manager: re-add v1 support db/batchlog_manager: return all_replayed from process_batch() db/batchlog_manager: process_bath() fix indentation db/batchlog_manager: make batch() a standalone function db/batchlog_manager: make structs stats public db/batchlog_manager: allocate limiter on the stack db/batchlog_manager: add feature_service dependency gms/feature_service: add batchlog_v2 feature	2026-03-02 12:09:10 +01:00
Calle Wilund	fec7df7cbb	topology::snapshot: Add expiry (ttl) to RPC/topo op Not set yet, but includes it in messages so it can be properly set in calling code. Will add entry to manifest.	2026-02-23 11:37:17 +01:00
Calle Wilund	425d6b4441	storage_proxy: Add snapshot_keyspace method Takes set of ks->tables tuples and issues snapshot for each. If feature is enabled, keyspace is non-local, and uses tablets, will issue topo coordinator call across cluster. Keyspaces not fitting the above will just go to "normal" (node local) snapshot.	2026-02-23 11:27:15 +01:00
Calle Wilund	2bc633c3bd	storage_proxy: Add handler for SNAPSHOT_WITH_TABLETS	2026-02-23 10:44:42 +01:00
Calle Wilund	6e39c3bb83	storage_proxy: Break out logic from request_truncate_with_tablets Makes request_truncate_with_tablets use a parameterized helper, because eventually we will want to use almost identical logic for other ops, like snapshot.	2026-02-23 10:43:28 +01:00
Botond Dénes	dd50bd9bd4	db/batchlog_manager: re-add v1 support system.batchlog will still have to be used while the cluster is upgrading from an older version, which doesn't know v2 yet. Re-add support for replaying v1 batchlogs. The switch to v2 will happen after the BATCHLOG_V2 cluster feature is enabled. The only external user -- storage_proxy -- only needs a minor adjustment: switch between the table names. The rest is handled transparently by the db/batchlog.hh interface and the batchlog_manager.	2026-02-20 07:03:46 +02:00
Petr Gusev	df73f723a6	storage_proxy: hold erms in replica handlers Add explicit erm-holding variables in all replica-side RPC handlers. This is required to ensure that tablet migration waits for in-flight replica requests even if a non-replica coordinator has been fenced out. Holding erms on the replica side may increase the global-barrier wait time, since the barrier must drain these requests. We believe this is acceptable because: * We already hold erms during replica-side request execution, but in an ad-hoc, non-systemic way in lower layers of storage_proxy (e.g. in sp::mutate_locally and do_query_tablets). * Replica requests are bounded by replica-side timeouts, so the global-barrier wait time cannot exceed the maximum of these timeouts. For Paxos verbs, we use token_metadata_guard, which wraps the ERM and automatically refreshes it when tablet migration does not affect the current token; see the token_metadata_guard comments for details. We use this guard only for Paxos verbs because regular reads and writes already hold raw erms in storage_proxy and on the coordinators. The erms must be held in all RPC handlers that support fencing — that is, those with a fencing_token parameter in storage_proxy.idl. Counter updates already hold erms in mutate_counter_on_leader_and_replicate. Fix test_tablets2::test_timed_out_reader_after_cleanup: the tablets barrier now waits for all nodes. As a result, the replica read is expected to finish, rather than fail due to the tablet having moved as it did previously. The test is renamed to test_tablets_barrier_waits_for_replica_erms to better reflect its purpose. Refs scylladb/scylladb#26864	2026-02-16 08:57:42 +01:00
Pavel Emelyanov	83e64b516a	hint: Don't switch group in database::apply_hint() The method is called from storage_proxy::mutate_hint() which is in turn called from hint_mutation::apply_locally(). The latter is either called directly by the hint sender, which already runs in the streaming group, or via the RPC HINT_MUTATION handler, which uses index 1 that negotiates the streaming group as well. To be sure, add a debugging check for the current group being the expected one. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-02-09 08:54:51 +03:00
Dawid Mędrek	b0afd3aa63	Merge 'storage_service: set up topology properly in maintenance mode' from Patryk Jędrzejczak We currently make the local node the only token owner (that owns the whole ring) in maintenance mode, but we don't update the topology properly. The node is present in the topology, but in the `none` state. That's how it's inserted by `tm.get_topology().set_host_id_cfg(host_id);` in `scylla_main`. As a result, the node started in maintenance mode crashes in the following way in the presence of a vnodes-based keyspace with the NetworkTopologyStrategy: ``` scylla: locator/network_topology_strategy.cc:207: locator::natural_endpoints_tracker::natural_endpoints_tracker( const token_metadata &, const network_topology_strategy::dc_rep_factor_map &): Assertion `!_token_owners.empty() && !_racks.empty()' failed. ``` Both `_token_owners` and `_racks` are empty. The reason is that `_tm.get_datacenter_token_owners()` and `_tm.get_datacenter_racks_token_owners()` called above filter out nodes in the `none` state. This bug basically made maintenance mode unusable in customer clusters. We fix it by changing the node state to `normal`. We also extend `test_maintenance_mode` to provide a reproducer for Fixes #27988 This PR must be backported to all branches, as maintenance mode is currently unusable everywhere. Closes scylladb/scylladb#28322 * github.com:scylladb/scylladb: test: test_maintenance_mode: enable maintenance mode properly test: test_maintenance_mode: shutdown cluster connections test: test_maintenance_mode: run with different keyspace options test: test_maintenance_mode: check that group0 is disabled by creating a keyspace test: test_maintenance_mode: get rid of the conditional skip test: test_maintenance_mode: remove the redundant value from the query result storage_proxy: skip validate_read_replica in maintenance mode storage_service: set up topology properly in maintenance mode	2026-02-02 13:28:19 +01:00
Patryk Jędrzejczak	9d4a5ade08	storage_proxy: skip validate_read_replica in maintenance mode In maintenance mode, the local node adds only itself to the topology. However, the effective replication map of a keyspace with tablets enabled contains all tablet replicas. It gets them from the tablets map, not the topology. Hence, `network_topology_strategy::sanity_check_read_replicas` hits ``` throw std::runtime_error(format("Requested location for node {} not in topology. backtrace {}", id, lazy_backtrace())); ``` for tablet replicas other than the local node. As a result, all requests to a keyspace with tablets enabled and RF > 1 fail in debug mode (`validate_read_replica` does nothing in other modes). We don't want to skip maintenance mode tests in debug mode, so we skip the check in maintenance mode. We move the `is_debug_build()` check because: - `validate_read_replicas` is a static function with no access to the config, - we want the `!_db.local().get_config().maintenance_mode()` check to be dropped by the compiler in non-debug builds. We also suppress `-Wunneeded-internal-declaration` with `[[maybe_unused]]`.	2026-01-30 12:55:17 +01:00
Pavel Emelyanov	c61d855250	hints: Provide explicit scheduling group for hint_sender Currently it grabs one from database, but it's not nice to use database as config/sched-groups provider. This PR passes the scheduling group to use for sending hints via manager which, in turn, gets one from proxy via its config (proxy config already carries configuration for hints manager). The group is initialized in main.cc code and is set to the maintenance one (nowadays it's the same as streaming group). This will help splitting the streaming scheduling group into more elaborated groups under the maintenance supergroup: SCYLLADB-351 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#28358	2026-01-27 12:50:11 +02:00
Petr Gusev	c45244b235	storage_proxy: drop stop() method It's not called by main.cc and can be confusing.	2026-01-23 11:22:03 +01:00
Petr Gusev	f5ed3e9fea	test_lwt_shutdown: fix flakiness by removing storage_proxy::stop injection storage_proxy::stop() is not called by main (it is commented out due to #293), so the corresponding message injection is never hit. When the test releases paxos_state_learn_after_mutate, shutdown may already be in progress or even completed by the time we try to trigger the storage_proxy::stop injection, which makes the test flaky. Fix this by completely removing the storage_proxy::stop injection. The injection is not required for test correctness. Shutdown must wait for the background LWT learn to finish, which is released via the paxos_state_learn_after_mutate injection. The shutdown process blocks on in-flight api HTTP requests through seastar::httpd::http_server::stop and its _task_gate, so the shutdown will not prevent the HTTP request that released the paxos_state_learn_after_mutate from completing successfully. Fixes scylladb/scylladb#28260	2026-01-23 11:20:36 +01:00
Yaniv Michael Kaul	d919aacc69	storage_proxy: mark write_timeouts metric for counter write timeouts When a counter write times out (due to rpc::timeout_error or timed_out_error), the code was throwing mutation_write_timeout_exception but not marking the write_timeouts metric. This resulted in counter write timeouts not being counted in the scylla_storage_proxy_coordinator_write_timeouts metric. Regular writes go through mutate_internal -> mutate_end, which catches mutation_write_timeout_exception and marks the metric. However, counter writes use a separate code path (mutate_counters) that has its own exception handling but was missing the metric update. This fix adds get_stats().write_timeouts.mark() before throwing the timeout exception in the counter write path, consistent with how the CAS path handles cas_write_timeouts. Refs: https://scylladb.atlassian.net/browse/SCYLLADB-245 Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com> Closes scylladb/scylladb#28019	2026-01-14 17:50:46 +02:00
Botond Dénes	9bea842c01	service/storage_proxy: don't throw exceptions on the full-scan path Use coroutine::try_future() to avoid exceptions taking flight and triggering expensive stack-unwinding. Especially bad for common exceptions like timeouts.	2026-01-13 10:47:57 +02:00
Radosław Cybulski	5f48ab3875	storage_proxy: fix invalid assert Change invalid `assert(true)` into `SCYLLA_ASSERT(false)`, as the latter was clearly meant. Closes scylladb/scylladb#27900	2026-01-08 21:55:18 +02:00
Dawid Mędrek	77a934e5b9	db/hints: Prevent draining hints before hint replay is allowed Context ------- The procedure of hint draining boils down to the following steps: 1. Drain a hint sender. That should get rid of all hints stored for the corresponding endpoint. 2. Remove the hint directory corresponding to that endpoint. Obviously, it gets more complex than this high-level perspective. Without blurring the view, the relevant information is that step 1 in the algorithm above may not be executed. Breaking it down, it comprises two calls to `hint_sender::send_hints_maybe()`. The function is responsible for sending out hints, but it's not unconditional and will not be performed if any of the following bullets is not satisfied: * `hint_sender::replay_allowed()` is not `true`. This can happen when hint replay hasn't been turned on yet. * `hint_sender::can_send()` is not `true`. This can happen if the corresponding endpoint is not alive AND it hasn't left the cluster AND it's still a normal token owner. There is one more relevant point: sending hints can be stopped if replaying hints fails and `hint_sender::send_hints_maybe()` returns `false`. However, that's not possible in the case of draining. In that case, if Scylla comes across any failure, it'll simply delete the corresponding hint segment. Because of that, we ignore it and only focus on the two bullets. --- Why is it a problem? -------------------- If a hint directory is not purged of all hint segments in it, any attempt to remove it will fail and we'll observe an error like this: ``` Exception when draining <host ID>: std::filesystem::__cxx11::filesystem_error (error system:39, filesystem error: remove failed: Directory not empty [<path>]) ``` The folder with the remaining hints will also stay on disk, which is, of course, undesired. --- When can it happen? 
------------------- As highlighted in the Context section of this commit message, the key part of the code that can lead to a dangerous situation like that is `hint_sender::send_hints_maybe()`. The function is called twice when draining a hint endpoint manager: once to purge all of the existing hints, and another time after flushing all hints stored in commitlog instances, but not listed by `hint_sender` yet. If any of those calls misbehaves, we may end up with a problem. That's why it's crucial to ensure that the function always goes through ALL of the hints. Dangerous situations: 1. We try to drain hints before hint replay is allowed. That will violate the first bullet above. 2. The node we're draining is dead, but it hasn't left the cluster, and it still possesses some tokens. --- How do we solve that? --------------------- Hint replay is turned on in `main.cc`. Once enabled, it cannot be disabled. So to address the first bullet above, it suffices to ensure that no draining occurs beforehand. It's perfectly fine to prevent it. Soon after hint replay is allowed, `main.cc` also asks the hint manager to drain all of the endpoint managers whose endpoints are no longer normal token owners (cf. `db::hints::manager::drain_left_nodes()`). The other bullet is more tricky. It's important here to know that draining is only initiated in three situations: 1. As part of the call to `storage_service::notify_left()`. 2. As part of the call to `storage_service::notify_released()`. 3. As part of the call to `db::hints::manager::drain_left_nodes()`. The last one is trivially non-problematic. The nodes that it'll try to drain are no longer normal token owners, so `can_send()` must always return `true`. The second situation is similar. 
As we read in the commit message of scylladb/scylladb@eb92f50413, which introduced the notion of released nodes, the nodes are no longer normal token owners: > In this patch we postpone the hint draining for the "left" nodes to > the time when we know that the target nodes no longer hold ownership > of any tokens - so they're no longer referenced in topology. I'm > calling such nodes "released". I suggest reading the full commit message there because the problems there are somewhat similar to the ones these changes try to solve. Finally, the first situation: unfortunately, it's more tricky. The same commit message says: > When a node is being replaced, it enters a "left" state while still > owning tokens. Before this patch, this is also the time when we start > draining hints targeted to this node, so the hints may get sent before > the token ownership gets migrated to another replica, and these hints > may get lost. This suggests that `storage_service::notify_left()` may be called when the corresponding node still has some tokens! That's something that may prevent properly draining hints. Fortunately, no hope is lost. We only drain hints via `notify_left()` when hinted handoff hasn't been upgraded to being host-ID-based yet. If it has, draining always happens via `notify_released()`. When I write this commit message, all of the supported versions of Scylla 2025.1+ use host-ID-based hinted handoff. That means that problems can only arise when upgrading from an older version of Scylla (2024.1 downwards). Because of that, we don't cover it. It would most likely require more extensive changes. --- Non-issues ---------- There are notions that are closely related to sending hints. One of them is the host filter that hinted handoff uses. It decides which endpoints are eligible for receiving hints, and which are not. Fortunately, all endpoints rejected by the host filter lose their hint endpoint managers -- they're stopped as part of that procedure. 
What's more, draining hints and changing the host filter cannot be happening at the same time, so it cannot lead to any problems. The solution ------------ To solve the described issue, we simply prevent draining hints before hint replay is allowed. No reproducer test is attached because it's not feasible to write one. Fixes scylladb/scylladb#27693 Closes scylladb/scylladb#27713	2026-01-04 16:54:05 +02:00
Tomasz Grabiec	c077283352	Merge 'service: support conversion of tablet keyspaces to rack-list using ALTER KEYSPACE' from Aleksandra Martyniuk If a keyspace has a numeric replication factor in a DC and rf < #racks, then the replicas of tablets in this keyspace can be distributed among all racks in the DC (different for each tablet). With rack list, we need all tablet replicas to be placed on the same racks. Hence, the conversion requires tablet co-location. After this series, the conversion can be done using ALTER KEYSPACE statement. The statement that does this conversion in any DC is not allowed to change an rf in any DC. So, if we have dc1 and dc2 with 3 racks each and a keyspace ks then with a single ALTER KEYSPACE we can do: - {dc1 : 2} -> {dc1 : [r1, r2]}; - {dc1 : 2, dc2: 2} -> {dc1 : [r1, r2], dc2: [r2,r3]}; - {dc1 : 2, dc2: 2} -> {dc1 : [r1, r2], dc2: 2} - {dc1 : 2} -> {dc1 : 2, dc2 : [r1]} But we cannot do: - {dc1 : 2} -> {dc1 : [r1, r2, r3]}; - {dc1 : 1, dc2 : [r1, r2]} -> {dc1 : [r1], dc2 : [r1]}. In order to do the co-locations, the rf change request is paused. Tablet load balancer examines the paused rf change requests and schedules necessary tablet migrations. During the process of co-location, no other cross-rack migration is allowed. Load balancer checks whether any paused rf change request is ready to be resumed. If so, it puts the request back to global topology request queue. While an rf change request for a keyspace is running, any other rf change of this keyspace will fail. Fixes: #26398. 
New feature, no backport Closes scylladb/scylladb#27279 * github.com:scylladb/scylladb: test: add est_rack_list_conversion_with_two_replicas_in_rack test: test creating tablet_rack_list_colocation_plan test: add test_numeric_rf_to_rack_list_conversion test tasks: service: add global_topology_request_virtual_task cql3: statements: allow altering from numeric rf to rack list service: topology_coordinator: pause keyspace_rf_change request service: implement make_rack_list_colocation_plan service: add tablet_rack_list_colocation_plan cql3: reject concurrent alter of the same keyspace test: check paused rf change requests persistence db: service: add paused_rf_change_requests to system.topology service: pass topology and system_keyspace to load_balancer ctor service: tablet_allocator: extract load updates service: tablet_allocator: extract ensure_node tasks, system_keyspace: Introduce get_topology_request_entry_opt() node_ops: Drop get_pending_ids() node_ops: Drop redundant get_status_helper()	2025-12-17 10:05:06 +01:00
Tomasz Grabiec	71e6ef90f4	tasks, system_keyspace: Introduce get_topology_request_entry_opt() It's a cleanup. Better to return std::nullopt than faking an entry with an id when require_entry == false.	2025-12-16 13:25:34 +01:00
Pavel Emelyanov	3f7ee3ce5d	Merge 'batchlog: make replay (flush) faster' from Botond Dénes The batchlog table contains an entry for each logged batch that is processed by the local node as coordinator. These entries are typically very short lived, they are inserted when the batch is processed and deleted immediately after the batch is successfully applied. When a table has `tombstone_gc = {'mode': 'repair'}` enabled, every repair has to flush all hints and batchlogs, so that we can be certain that there is no live data in any of these, older than the last repair. Since batches can contain member queries from any number of tables, the whole batchlog has to be flushed, even if repair-mode tombstone-gc is enabled for a single table. Flushing the batchlog table happens by doing a batchlog replay. This involves reading the entire content of this table, and attempting to replay+delete any live entries (that are old enough to be replayed). Under normal operating circumstances, 99%+ of the content of the batchlog table is partition tombstones. Because of this, scanning the content of this table has to process thousands to millions of tombstones. This was observed to require up to 20 minutes to finish, causing repairs to slow down to a crawl, as the batchlog-flush has to be repeated at the end of the repair of each token-range. When trying to address this problem, the first idea was that we should expedite the garbage-collection of these accumulated tombstones. This experiment failed, see https://github.com/scylladb/scylladb/pull/23752. The commitlog proved to be an impossible to bypass barrier, preventing quick garbage-collection of tombstones. So long as a single commit-log segment is alive, holding content from the batchlog table, all tombstones written after are blocked from GC. The second approach, represented by this PR, is to not rely on tombstone GC to reduce the tombstone amount. 
Instead restructure the table such that a single higher-order tombstone can be used to shadow and allow for the eviction of the myriads of individual batchlog entry tombstones. This is realized by reorganizing the batchlog table such that individual batches are rows, not partitions. This new schema is introduced by the new `system.batchlog_v2` table, introduced by this PR: CREATE TABLE system.batchlog_v2 ( version int, stage int, shard int, written_at timestamp, id uuid, data blob, PRIMARY KEY ((version, stage, shard), written_at, id)); The new schema organization has the following goals: 1) Make post-replay batchlog cleanup possible with a simple range-tombstone. This allows dropping the individual dead batchlog entries, as they are shadowed by a higher level tombstone. This enables dropping tombstones without tombstone GC. 2) To make the above possible, introduce the stage key component: batchlog entries that fail the first replay attempt, are moved to the failed_replay stage, so the initial stage can be cleaned up safely. 3) Spread out the data among Scylla shards, via the batchlog shard column. 4) Make batchlog entries ordered by the batchlog create time (id). This allows for selecting batchlogs to replay, without post-filtering of batchlogs that are too young to be replayed. Fixes: https://github.com/scylladb/scylladb/issues/23358 This is an improvement, normally not a backport-candidate. We might override this and backport to allow wider use of `tombstone_gc: {'mode': 'repair'}`. 
Closes scylladb/scylladb#26671 * github.com:scylladb/scylladb: db/config: change batchlog_replay_cleanup_after_replays default to 1 test/boost/batchlog_manager_test: add test for batchlog cleanup replica/mutation_dump: always set position weight for clustering positions service/storage_proxy: s/batch_replay_throw/storage_proxy_fail_replay_batch/ test/lib: introduce error_injection.hh utils/error_injection: add debug log to disable() and disable_all() test/lib/cql_test_env: forward config to batchlog test/lib/cql_test_env: add batch type to execute_batch() test/lib/cql_assertions: add with_size(predicate) overload test/lib/cql_assertions: add source location to fail messages test/lib/cql_assertions: columns_assertions: add assert_for_columns_of_each_row() test/lib/cql_assertions: rows_assertions::assert_for_columns_of_row(): add index bound check test/lib/cql_assertions: columns_assertions: add T* with_typed_column() overload db/batchlog_manager: config: s/write_timeout/reply_timeot/ db,service: switch to system.batchlog_v2 db/system_keyspace: introduce system.batchlog_v2 service,db: extract generation of batchlog delete mutation service,db: extract get_batchlog_mutation_for() from storage-proxy db/batchlog_manager: only consider propagation delay with tombstone-gc=repair db/batchlog_manager: don't drop entire batch if one mutations' table was dropped data_dictionary: table: add get_truncation_time() db/batchlog_manager: batch(): replace map_reduce() with simple loop db/batchlog_manager: finish coroutinizing replay_all_failed_batches db/batchlog_manager: improve replayAllFailedBatches logs	2025-12-15 15:05:19 +03:00
Petr Gusev	c428645d16	storage_proxy: cas: take cas_request by raw reference In the next commit we want to add an optimization that relies on precise control over the lifetime of cas_request. In particular, we want the implementation of this interface in Alternator to operate on raw references that are guaranteed to remain valid only until the cas() future is resolved. We already depend on the same lifetime assumptions in cas_request when used by modification_statement. However, these assumptions are not clearly expressed in the current interface: cas_request is taken by shared_ptr, and nothing prevents cas() from storing that pointer inside paxos_response_handler, which may outlive the cas() future. This commit fixes that by taking cas_request by raw reference. This makes it explicit that cas() does not assume ownership of the object. Callers must ensure that the referenced object remains valid until the returned future is resolved.	2025-12-07 16:14:56 +01:00
Botond Dénes	e762027943	db/config: change batchlog_replay_cleanup_after_replays default to 1 Now that batchlog cleanup is cheap, on account of memtable flush on the system.batchlog table garbage-collecting tombstones (previous patch), we can afford to do cleanup on each replay, keeping the memtable size small and more importantly -- the amount of tombstones in the memtable small.	2025-12-02 14:21:26 +02:00
Botond Dénes	8edd5b80ab	test/boost/batchlog_manager_test: add test for batchlog cleanup Add more tests covering different aspects of batchlog replay, cleanup, replay timeout and finally v1 -> v2 migration.	2025-12-02 14:21:26 +02:00
Botond Dénes	8545f7eedd	service/storage_proxy: s/batch_replay_throw/storage_proxy_fail_replay_batch/ Rename to make it more explicit where the error injection happens. Also change how the error is injected, use the lambda overload instead of is_enabled(), the former leaves better trace in logs, which helps when debugging tests.	2025-12-02 14:21:26 +02:00
Botond Dénes	846b656610	db,service: switch to system.batchlog_v2 New batchlogs are written to the batchlog_v2 table and replay also uses the v2 table. The content of system.batchlog is attempted to be migrated to system.batchlog_v2 after each start of the batchlog_manager service. The migration is retried on each replay if it fails. This is redundant but simple. Batchlog cleanup now doesn't involve flushing memtables, the only remaining user of replica/database.hh is gone, so the include is dropped.	2025-12-02 14:21:26 +02:00
Botond Dénes	9434ec2fd1	service,db: extract generation of batchlog delete mutation Don't build batchlog delete mutations in storage-proxy code. Move this code into db/batchlog_manager.cc, exposed via db/batchlog.hh. This serves multiple goals: 1) Concentrates low-level batchlog related logic in db/batchlog_manager.cc 2) Reduce current and future code duplication. 3) Make future changes to this logic easier.	2025-12-02 14:21:25 +02:00
Botond Dénes	f54602daf0	service,db: extract get_batchlog_mutation_for() from storage-proxy Don't build batchlog mutations in storage-proxy code. Move this code into db/batchlog_manager.cc, exposed via db/batchlog.hh. This serves multiple goals: 1) Concentrates low-level batchlog related logic in db/batchlog_manager.cc 2) Reduce current and future code duplication. 3) Make future changes to this logic easier.	2025-12-02 14:21:25 +02:00
Robert Bindar	817fdadd49	Improve choice distribution for primary replica I noticed during tests that `maybe_get_primary_replica` would not distribute the choice of primary replica uniformly, because `info.replicas` would be ordered one way on some shards and differently on others, making the function choose the same node as primary replica multiple times when it clearly could have chosen a different node. This patch sorts the replica set before passing it through the scope filter. Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2025-11-11 09:18:01 +02:00
Petr Gusev	5bda226ff6	storage_proxy: use coroutine::maybe_yield(); This is a small "while at it" refactoring -- better to use coroutine::maybe_yield with co_await-s.	2025-11-05 14:38:19 +01:00
Petr Gusev	4578304b76	storage_proxy: use gates to track write handlers destruction In #26408 a write_handler_destroy_promise class was introduced to wait for abstract_write_response_handler instances destruction. We strived to minimize the memory footprint of abstract_write_response_handler, with write_handler_destroy_promise-es we required only a single additional int. It turned out that in some cases a lot of write handlers can be scheduled for deletion at the same time, in such cases the vector<write_handler_destroy_promise> can become big and cause 'oversized allocation' seastar warnings. Another concern with write_handler_destroy_promise-es was that they were more complicated than it was worth. In this commit we replace write_handler_destroy_promise with simple gates. One or more gates can be attached to an abstract_write_response_handler to wait for its destruction. We use utils::small_vector<gate::holder, 2> to store the attached gates. The limit 2 was chosen because we expect two gates at the same time in most cases. One is storage_proxy::_write_handlers_gate, which is used to wait for all handlers in cancel_all_write_response_handlers. Another one can be attached by a caller of cancel_write_handlers. Nothing stops several cancel_write_handlers from being called at the same time, but it should be rare. The sizeof(utils::small_vector<gate::holder, 2>) == 40, this is 40.0 / 488 * 100 ~ 8% increase in sizeof(abstract_write_response_handler), which seems acceptable. Fixes scylladb/scylladb#26788	2025-11-05 14:37:52 +01:00
Michael Litvak	296b116ae2	storage_proxy: lock all read shards for counter update Previously in a counter update we lock the read shard to protect the counter's read-modify-write against concurrent updates. This is not sufficient when the counter is migrated between different shards, because there is a stage where the read shard switches from the old shard to the new shard, and during that switch there can be concurrent counter updates on both shards. If each shard takes only its own lock, the operations will not be exclusive anymore, and this can cause lost counter updates. To fix this, we acquire the counter lock on both shards in the stage write_both_read_new, when both shards can serve reads. This guarantees that counter updates continue to be exclusive during intranode migration.	2025-11-03 16:04:35 +01:00
Michael Litvak	de321218bc	storage_proxy: apply counter mutation on all write shards When applying a counter mutation, use apply_on_shards to apply the mutation on all write shards, similarly to the way other mutations are applied in the storage proxy. Previously the mutation was applied only on the current shard which is the read shard. This is needed to respect the write_both stages of intranode migration where we need to apply the mutation on both the old and the new shards.	2025-11-03 16:03:29 +01:00
Michael Litvak	c7e7a9e120	storage_proxy: move counter update coordination to storage proxy Refactor the counter update to split the functions and have them called by the storage proxy to prepare for a later change. Previously in mutate_counter the storage proxy calls the replica function apply_counter_update that does a few things: 1. checks that the operation can be done: check timeout, disk utilization 2. acquire counter locks 3. do read-modify-write and transform the counter mutation 4. apply the mutation in the replica In this commit we change it so that these functions are split and called from the storage proxy, so that we have better control from the storage proxy when we change it later to work across multiple shards. For example, we will want to acquire locks on multiple shards, transform it on one shard, and then apply the mutation on multiple shards. After the change it works as follows in storage proxy: 1. acquire counter locks 2. call replica prepare to check the operation and transform the mutation 3. call replica apply to apply the transformed mutation	2025-11-03 15:59:46 +01:00
Michael Litvak	579031cfc8	storage_proxy: refactor mutate_counter_on_leader Slightly reorganize the mutate counter function to prepare it for a later change. Move the code that finds the read shard and invokes the rest of the function on the read shard to the caller function. This simplifies the function mutate_counter_on_leader_and_replicate which now runs on the read shard and will make it easier to extend.	2025-11-03 08:43:11 +01:00
Botond Dénes	ac618a53f4	Merge 'db: repair: do not update repair_time if batchlog replay failed' from Aleksandra Martyniuk Currently, batchlog replay is considered successful even if all batches fail to be sent (they are replayed later). However, repair requires all batches to be sent successfully. Currently, if batchlog isn't cleared, the repair never learns and updates the repair_time. If GC mode is set to "repair", this means that the tombstones written before the repair_time (minus propagation_delay) can be GC'd while not all batches were replayed. Consider a scenario: - Table t has a row with (pk=1, v=0); - There is an entry in the batchlog that sets (pk=1, v=1) in table t; - The row with pk=1 is deleted from table t; - Table t is repaired: - batchlog replay fails; - repair_time is updated; - propagation_delay seconds pass and the tombstone of pk=1 is GC'd; - batchlog is replayed and (pk=1, v=1) inserted - data resurrection! Do not update repair_time if sending any batch fails. The data is still repaired. For tablet repair the repair runs, but at the end the exception is passed to topology coordinator. Thanks to that the repair_time isn't updated. The repair request isn't removed as well, due to which the repair will need to rerun. Apart from that, a batch is removed from the batchlog if its version is invalid or unknown. The condition on which we consider a batch too fresh to replay is updated to consider propagation_delay.
Fixes: https://github.com/scylladb/scylladb/issues/24415 Data resurrection fix; needs backport to all versions Closes scylladb/scylladb#26319 * github.com:scylladb/scylladb: db: fix indentation test: add reproducer for data resurrection repair: fail tablet repair if any batch wasn't sent successfully db/batchlog_manager: fix making decision to skip batch replay db: repair: throw if replay fails db/batchlog_manager: delete batch with incorrect or unknown version db/batchlog_manager: coroutinize replay_all_failed_batches	2025-10-28 14:52:59 +02:00
Aleksandra Martyniuk	1935268a87	test: add reproducer for data resurrection Add a reproducer to check that the repair_time isn't updated if the batchlog replay fails. If repair_time was updated, tombstones could be GC'd before the batchlog is replayed. The replay could later cause the data resurrection.	2025-10-23 10:39:43 +02:00
Petr Gusev	1dd05f4404	storage_proxy: use run_fenceable_write Switch local write code sites from start_write() to run_fenceable_write().	2025-10-22 16:31:43 +02:00
Petr Gusev	d56495fd9c	storage_proxy: abstract_write_response_handler: apply_locally: extract post fence check All mutation_holder::apply_locally() implementations now do the same post fence check. In this commit we hoist this check up to abstract_write_response_handler::apply_locally().	2025-10-22 16:31:43 +02:00
Petr Gusev	24f8962938	storage_proxy: introduce run_fenceable_write This function is intended to replace start_write() in subsequent commits. It provides the following benefits: * Remove duplication: All start_write() call sites must run the fence check after the operation completes. run_fenceable_write() encapsulates this pattern. * Fix a race: To ensure no new stale write operations occur during cleanup, a fence check before start_write() was previously used. However, yields in several code paths between the check and start_write() made it non-atomic, allowing a stale operation to slip in if the fence_version was updated in between. * Optimize waiting: We do not need to wait for all operations—only for vnode-based, non-local tables with versions smaller than the current fence_version.	2025-10-22 16:31:43 +02:00
Petr Gusev	c5f447224a	storage_proxy: move update_fence_version from shared_token_metadata Future commits will extend update_fence_version, and it is simpler to do so if the function resides in storage_proxy. Additionally, fence_version is the only field this function accesses, and it is used solely within storage_proxy, making this change natural on its own.	2025-10-22 16:31:43 +02:00
Petr Gusev	659c5912e0	storage_proxy: fix start_write() operation scope in apply_locally The operation must be held during the local write. Before this commit, its scope ended after returning from apply_locally(), so it did not actually provide any protection.	2025-10-22 16:31:43 +02:00
Petr Gusev	27915befac	storage_proxy: move post fence check into handle_write handle_write() is invoked from receive_mutation_handler() and handle_paxos_learn(), and both previously performed a fence check in apply_fn. This commit hoists the fence check into handle_write() to reduce code duplication. Additionally, move start_write() after get_schema_for_write(), since there is no need to hold the operation while querying the schema.	2025-10-22 16:31:43 +02:00
Petr Gusev	41077138bf	storage_proxy: move fencing into mutate_counter_on_leader_and_replicate As noted in the code comments, start_write() does not need to be held during counter replication; it is required only while performing local storage modifications. Move the start_write() call and the fence check down to mutate_counter_on_leader_and_replicate(). Additionally, mutate_counters_on_leader() is updated to check for possible stale_topology_exception() and properly package them in the resulting exception_variant structure.	2025-10-22 16:31:43 +02:00
Petr Gusev	a6208b2d67	storage_proxy::handle_read: add fence check before get_schema Avoid querying the schema for outdated requests by adding a fence check at the start of handle_read.	2025-10-22 16:31:43 +02:00
Avi Kivity	029513bee9	Merge 'storage_proxy: wait for write handlers destruction' from Petr Gusev `shared_ptr<abstract_write_response_handler>` instances are captured in the `lmutate` and `rmutate` lambdas of `send_to_live_endpoints()`. As a result, an `abstract_write_response_handler` object may outlive its removal from the `storage_proxy::_response_handlers` map -> `cancel_all_write_response_handlers()` doesn't actually wait for requests completion -> `sp::drain_on_shutdown()` doesn't guarantee all requests are drained -> `sp::stop_remote()` completes too early and `paxos_store` is destroyed while LWT local writes might still be in progress. In this PR we introduce a `write_handler_destroy_promise` to wait for such pending instances in `cancel_write_handlers()` and `cancel_all_write_response_handlers()` to prevent the `use-after-free`. A better long-term solution might be to replace `shared_ptr` with `unique_ptr` for `abstract_write_response_handler` and use a separate gate to track the `lmutate/rmutate` lambdas. We do not actually need to wait for these lambdas to finish before sending a timeout or error response to the client, as we currently do in `~abstract_write_response_handler`. Fixes scylladb/scylladb#26355 backport: need to be backported to 2025.4 since #26355 is reproduced on LWT over tablets Closes scylladb/scylladb#26408 * github.com:scylladb/scylladb: test_tablets_lwt: add test_lwt_shutdown storage_proxy: wait for write handler destruction storage_proxy: coroutinize cancel_write_handlers storage_proxy: cancel_write_handlers: don't hold a strong pointer to handler	2025-10-22 00:02:08 +03:00
Petr Gusev	8925f31596	test_tablets_lwt: add test_lwt_shutdown	2025-10-20 20:16:09 +02:00
Petr Gusev	bbcf3f6eff	storage_proxy: wait for write handler destruction shared_ptr<abstract_write_response_handler> instances are captured in the lmutate/rmutate lambdas of send_to_live_endpoints(). As a result, an abstract_write_response_handler object may outlive its removal from the _response_handlers map. We use write_handler_destroy_promise to wait for such pending instances in cancel_write_handlers() and cancel_all_write_response_handlers() to prevent use-after-free. A better long-term solution might be to replace shared_ptr with unique_ptr for abstract_write_response_handler and use a separate gate to track the lmutate/rmutate lambdas. We do not actually need to wait for these lambdas to finish before sending a timeout or error response to the client, as we currently do in ~abstract_write_response_handler. Fixes scylladb/scylladb#26355	2025-10-20 20:10:42 +02:00

1370 Commits