scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-22 07:42:16 +00:00

Author	SHA1	Message	Date
Asias He	0204372156	repair: Reject repair requests where start and end tokens are equal When a user calls the repair API with identical startToken and endToken values, the code creates a wrapping interval (T, T]. This causes unwrap() to split it into (-inf, T] and (T, +inf), covering the entire token ring and triggering a full repair. Reject such requests early with an error message matching Cassandra's behavior: "Start and end tokens must be different." Fixes: https://scylladb.atlassian.net/browse/CUSTOMER-358 Closes scylladb/scylladb#29821	2026-05-11 14:08:20 +03:00
Avi Kivity	0ae22a09d4	LICENSE: Update to version 1.1 Updated terms of non-commercial use (must be a never-customer).	2026-04-12 19:46:33 +03:00
Botond Dénes	3289928679	repair: fix quadratic complexity when loading repair history shared_tombstone_gc_state::update_repair_time() uses copy-on-write semantics: each call copies the entire per_table_history_maps and the per-table repair_history_map. repair_service::load_history() called this once per history entry, making the load O(N²) in both time and memory. Introduce batch_update_repair_time() which performs a single copy-on-write for any number of entries belonging to the same table. Restructure load_history() to collect entries into batches of up to 1000 and flush each batch in one call, keeping peak memory bounded. The batch size limit is intentional: the repair history table currently has no bound on the number of entries and can grow large. Note that this does not cause a problem in the in-memory history map itself: entries are coalesced internally and only the latest repair time is kept per range. The unbounded entry count only makes the batched update during load expensive. Fixes: SCYLLADB-104 Closes scylladb/scylladb#29326	2026-04-11 23:54:26 +03:00
Botond Dénes	fbbe2bdce8	Merge 'Introduce repair_service::config and cut dependency from db::config' from Pavel Emelyanov Spreading db::config around and making all services depend on it is not nice. Most other services that need configuration provide their own config that's populated from db::config in main.cc/cql_test_env.cc and use it, not the global config. This PR does the same for repair_service. Enhancing components dependencies, not backporting Closes scylladb/scylladb#29153 * github.com:scylladb/scylladb: repair: Remove db/config.hh from repair/*.cc files repair: Move repair_multishard_reader options onto repair_service::config repair: Move critical_disk_utilization_level onto repair_service::config repair: Move repair_partition_count_estimation_ratio onto repair_service::config repair: Move repair_hints_batchlog_flush_cache_time_in_ms onto repair_service::config repair: Move enable_small_table_optimization_for_rbno onto repair_service::config repair: Introduce repair_service::config	2026-04-09 11:44:25 +03:00
Raphael S. Carvalho	16e387d5f9	repair/replica: Fix race window where post-repair data is wrongly promoted to repaired During incremental repair, each tablet replica holds three SSTable views: UNREPAIRED, REPAIRING, and REPAIRED. The repair lifecycle is: 1. Replicas snapshot unrepaired SSTables and mark them REPAIRING. 2. Row-level repair streams missing rows between replicas. 3. mark_sstable_as_repaired() runs on all replicas, rewriting the SSTables with repaired_at = sstables_repaired_at + 1 (e.g. N+1). 4. The coordinator atomically commits sstables_repaired_at=N+1 and the end_repair stage to Raft, then broadcasts repair_update_compaction_ctrl which calls clear_being_repaired(). The bug lives in the window between steps 3 and 4. After step 3, each replica has on-disk SSTables with repaired_at=N+1, but sstables_repaired_at in Raft is still N. The classifier therefore sees: is_repaired(N, sst{repaired_at=N+1}) == false sst->being_repaired == null (lost on restart, or not yet set) and puts them in the UNREPAIRED view. If a new write arrives and is flushed (repaired_at=0), STCS minor compaction can fire immediately and merge the two SSTables. The output gets repaired_at = max(N+1, 0) = N+1 because compaction preserves the maximum repaired_at of its inputs. Once step 4 commits sstables_repaired_at=N+1, the compacted output is classified REPAIRED on the affected replica even though it contains data that was never part of the repair scan. Other replicas, which did not experience this compaction, classify the same rows as UNREPAIRED. This divergence is never healed by future repairs because the repaired set is considered authoritative. The result is data resurrection: deleted rows can reappear after the next compaction that merges unrepaired data with the wrongly-promoted repaired SSTable. The fix has two layers: Layer 1 (in-memory, fast path): mark_sstable_as_repaired() now also calls mark_as_being_repaired(session) on the new SSTables it writes. 
This keeps them in the REPAIRING view from the moment they are created until repair_update_compaction_ctrl clears the flag after step 4, covering the race window in the normal (no-restart) case. Layer 2 (durable, restart-safe): a new is_being_repaired() helper on tablet_storage_group_manager detects the race window even after a node restart, when being_repaired has been lost from memory. It checks: sst.repaired_at == sstables_repaired_at + 1 AND tablet transition kind == tablet_transition_kind::repair Both conditions survive restarts: repaired_at is on-disk in SSTable metadata, and the tablet transition is persisted in Raft. Once the coordinator commits sstables_repaired_at=N+1 (step 4), is_repaired() returns true and the SSTable naturally moves to the REPAIRED view. The classifier in make_repair_sstable_classifier_func() is updated to call is_being_repaired(sst, sstables_repaired_at) in place of the previous sst->being_repaired.uuid().is_null() check. A new test, test_incremental_repair_race_window_promotes_unrepaired_data, reproduces the bug by: - Running repair round 1 to establish sstables_repaired_at=1. - Injecting delay_end_repair_update to hold the race window open. - Running repair round 2 so all replicas complete mark_sstable_as_repaired (repaired_at=2) but the coordinator has not yet committed step 4. - Writing post-repair keys to all replicas and flushing servers[1] to create an SSTable with repaired_at=0 on disk. - Restarting servers[1] so being_repaired is lost from memory. - Waiting for autocompaction to merge the two SSTables on servers[1]. - Asserting that the merged SSTable contains post-repair keys (the bug) and that servers[0] and servers[2] do not see those keys as repaired. NOTE FOR MAINTAINER: Copilot initially only implemented Layer 1 (the in-memory being_repaired guard), missing the restart scenario entirely. I pointed out that being_repaired is lost on restart and guided Copilot to add the durable Layer 2 check. 
I also polished the implementation: moving is_being_repaired into tablet_storage_group_manager so it can reuse the already-held _tablet_map (avoiding an ERM lookup and try/catch), passing sstables_repaired_at in from the classifier to avoid re-reading it, and using compaction_group_for_sstable inside the function rather than threading a tablet_id parameter through the classifier. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1239. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Closes scylladb/scylladb#29244	2026-04-09 11:42:28 +03:00
Pavel Emelyanov	4bc8ec174c	repair: Remove db/config.hh from repair/*.cc files Now all the code uses repair_service::config and no longer needs global config description. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-03-20 19:36:50 +03:00
Pavel Emelyanov	35f625e5c7	repair: Move repair_multishard_reader options onto repair_service::config This actually uses two interconnected options: repair_multishard_reader_buffer_hint_size and repair_multishard_reader_enable_read_ahead. Both are propagated through repair_service::config and pass their values to repair_reader/make_reader at construction time. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-20 19:36:50 +03:00
Pavel Emelyanov	9bc0d27aae	repair: Move critical_disk_utilization_level onto repair_service::config Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-20 19:23:47 +03:00
Pavel Emelyanov	80aa0fcdc2	repair: Move repair_partition_count_estimation_ratio onto repair_service::config Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-20 19:23:47 +03:00
Pavel Emelyanov	585cb0c718	repair: Move repair_hints_batchlog_flush_cache_time_in_ms onto repair_service::config Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-20 19:23:47 +03:00
Pavel Emelyanov	d8f7f86e10	repair: Move enable_small_table_optimization_for_rbno onto repair_service::config Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-20 19:23:47 +03:00
Pavel Emelyanov	38a23ff927	repair: Introduce repair_service::config Most other services have their own configs, but repair still uses the global db::config. Add an empty config struct to repair_service to carry the db::config options the repair service needs. Subsequent patches will populate the struct with options. The config is created in main.cc as sharded_parameter because all future options are live-updateable and should capture their source from db::config on the correct shard. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-03-20 19:23:47 +03:00
Asias He	6cb263bab0	repair: Prevent CPU stall during cross-shard row copy and destruction When handling `repair_stream_cmd::end_of_current_rows`, passing the foreign list directly to `put_row_diff_handler` triggered a massive synchronous deep copy on the destination shard. Additionally, destroying the list triggered a synchronous deallocation on the source shard. This blocked the reactor and triggered the CPU stall detector. This commit fixes the issue by introducing `clone_gently()` to copy the list elements one by one, and leveraging the existing `utils::clear_gently()` to destroy them. Both utilize `seastar::coroutine::maybe_yield()` to allow the reactor to breathe during large cross-shard transfers and cleanups. Fixes SCYLLADB-403 Closes scylladb/scylladb#28979	2026-03-17 11:05:15 +02:00
Botond Dénes	475220b9c9	Merge 'Remove the rest of pre raft topology code' from Gleb Natapov Remove the rest of the code that assumes that either group0 does not exist yet or a cluster is still not upgraded to raft topology. Both of those are not supported any more. No need to backport since we remove functionality here. Closes scylladb/scylladb#28841 * github.com:scylladb/scylladb: service level: remove version 1 service level code features: move GROUP0_SCHEMA_VERSIONING to deprecated features list migration_manager: remove unused forward definitions test: remove unused code auth: drop auth_migration_listener since it does nothing now schema: drop schema_registry_entry::maybe_sync() function schema: drop make_table_deleting_mutations since it should not be needed with raft schema: remove calculate_schema_digest function schema: drop recalculate_schema_version function and its uses migration_manager: drop check for group0_schema_versioning feature cdc: drop usage of cdc_local table and v1 generation definition storage_service: no need to add yourself to the topology during reboot since raft state loading already did it storage_service: remove unused functions group0: drop with_raft() function from group0_guard since it always returns true now gossiper: do not gossip TOKENS and CDC_GENERATION_ID any more gossiper: drop tokens from loaded_endpoint_state gossiper: remove unused functions storage_service: do not pass loaded_peer_features to join_topology() storage_service: remove unused fields from replacement_info gossiper: drop is_safe_for_restart() function and its use storage_service: remove unused variables from join_topology gossiper: remove the code that was only used in gossiper topology storage_service: drop the check for raft mode from recovery code cdc: remove legacy code test: remove unused injection points auth: remove legacy auth mode and upgrade code treewide: remove schema pull code since we never pull schema any more raft topology: drop upgrade_state and 
its type from the topology state machine since it is not used any longer group0: hoist the checks for an illegal upgrade into main.cc api: drop get_topology_upgrade_state and always report upgrade status as done service_level_controller: drop service level upgrade code test: drop run_with_raft_recovery parameter to cql_test_env group0: get rid of group0_upgrade_state storage_service: drop topology_change_kind as it is no longer needed storage_service: drop check_ability_to_perform_topology_operation since no upgrades can happen any more service_storage: remove unused functions storage_service: remove non raft rebuild code storage_service: set topology change kind only once group0: drop in_recovery function and its uses group0: rename use_raft to maintenance_mode and make it sync	2026-03-11 10:24:20 +02:00
Botond Dénes	81e214237f	Merge 'Add digests for all sstable components in scylla metadata' from Taras Veretilnyk This pull request adds support for calculating and storing CRC32 digests for all SSTable components. This change replaces plain file_writer with crc32_digest_file_writer for all SSTable components that should be checksummed. The resulting component digests are stored in the sstable structure and later persisted to disk as part of the Scylla metadata component during writer::consume_end_of_stream. Several test cases were introduced to verify the expected behaviour. Additionally, this PR adds a new rewrite component mechanism for safe sstable component rewriting. Previously, rewriting an sstable component (e.g., via rewrite_statistics) created a temporary file that was renamed to the final name after sealing. This allowed crash recovery by simply removing the temporary file on startup. However, with component digests stored in scylla_metadata (#20100), replacing a component like Statistics requires atomically updating both the component and scylla_metadata with the new digest - impossible with POSIX rename. The new mechanism creates a clone sstable with a fresh generation: - Hard-links all components from the source except the component being rewritten and scylla_metadata - Copies original sstable components pointer and recognized components from the source - Invokes a modifier callback to adjust the new sstable before rewriting - Writes the modified component along with updated scylla_metadata containing the new digest - Seals the new sstable with a temporary TOC - Replaces the old sstable atomically, the same way as it is done in compaction This is built on the rewrite_sstables compaction framework to support batch operations (e.g., following incremental repair). If any step of the process fails, the new sstable is automatically deleted on node startup, because its temporary TOC is still present. 
Backport is not required, it is a new feature Fixes https://github.com/scylladb/scylladb/issues/20100, https://github.com/scylladb/scylladb/issues/27453 Closes scylladb/scylladb#28338 * github.com:scylladb/scylladb: docs: document components_digests subcomponent and trailing digest in Scylla.db sstable_compaction_test: Add tests for perform_component_rewrite sstable_test: add verification testcases of SSTable components digests persistance sstables: store digest of all sstable components in scylla metadata sstables: replace rewrite_statistics with new rewrite component mechanism sstables: add new rewrite component mechanism for safe sstable component rewriting compaction: add compaction_group_view method to specify sstable version sstables: add null_data_sink and serialized_checksum for checksum-only calculation sstables: extract default write open flags into a constant sstables: Add write_simple_with_digest for component checksumming sstables: Extract file writer closing logic into separate methods sstables: Implement CRC32 digest-only writer	2026-03-10 16:02:53 +02:00
Gleb Natapov	02fc4ad0a9	treewide: remove schema pull code since we never pull schema any more Schema pull was used by legacy schema code, which has not been supported for a long time, and during legacy recovery, which is no longer supported either. It can be dropped now.	2026-03-10 10:09:39 +02:00
Botond Dénes	509f2af8db	Merge 'repair: Fix rwlock in compaction_state and lock holder lifecycle' from Raphael Raph Carvalho Consider this: - repair takes the lock holder - tablet merge fiber destroys the compaction group and the compaction state - repair fails - repair destroys the lock holder This is observed in the test: ``` repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036] Repair 1 out of 1 tablets: table=sec_index.users range=(432345564227567615,504403158265495551] replicas=[0e9d51a5-9c99-4d6e-b9db-ad36a148b0ea:15, 498e354c-1254-4d8d-a565-2f5c6523845a:9, 5208598c-84f0-4526-bb7f-573728592172:28] ... repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036]: Started to repair 1 out of 1 tables in keyspace=sec_index, table=users, table_id=ea2072d0-ccd9-11f0-8dba-c5ab01bffb77, repair_reason=repair repair - Enable incremental repair for table=sec_index.users range=(432345564227567615,504403158265495551] table - Disabled compaction for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair table - Got unrepaired compaction and repair lock for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair table - Disabled compaction for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair table - Got unrepaired compaction and repair lock for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036]: get_sync_boundary: got error from node=0e9d51a5-9c99-4d6e-b9db-ad36a148b0ea, keyspace=sec_index, table=users, range=(432345564227567615,504403158265495551], error=seastar::rpc::remote_verb_error (Compaction state for table [0x60f008fa34c0] not found) compaction_manager - Stopping 1 tasks for 1 ongoing compactions for table sec_index.users compaction_group=238 due to tablet merge 
compaction_manager - Stopping 1 tasks for 1 ongoing compactions for table sec_index.users compaction_group=238 due to tablet merge .... scylla[10793] Segmentation fault on shard 28, in scheduling group streaming ``` The rwlock in compaction_state could be destroyed before the lock holder of the rwlock is destroyed. This causes a use-after-free when the lock holder is destroyed. To fix it, stopping a compaction group now waits for users of the repair lock. That way, the compaction group - which controls the lifetime of the rwlock - cannot be destroyed while the lock is held. Additionally, the merge completion fiber - which might remove groups - is properly serialized with incremental repair. The issue reproduces consistently with a sanitize build and no longer reproduces after the fix. Fixes #27365 Closes scylladb/scylladb#28823 * github.com:scylladb/scylladb: repair: Fix rwlock in compaction_state and lock holder lifecycle repair: Prevent repair lock holder leakage after table drop	2026-03-05 14:18:25 +02:00
Asias He	225b10b683	repair: Fix rwlock in compaction_state and lock holder lifecycle Consider this: - repair takes the lock holder - tablet merge fiber destroys the compaction group and the compaction state - repair fails - repair destroys the lock holder This is observed in the test: ``` repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036] Repair 1 out of 1 tablets: table=sec_index.users range=(432345564227567615,504403158265495551] replicas=[0e9d51a5-9c99-4d6e-b9db-ad36a148b0ea:15, 498e354c-1254-4d8d-a565-2f5c6523845a:9, 5208598c-84f0-4526-bb7f-573728592172:28] ... repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036]: Started to repair 1 out of 1 tables in keyspace=sec_index, table=users, table_id=ea2072d0-ccd9-11f0-8dba-c5ab01bffb77, repair_reason=repair repair - Enable incremental repair for table=sec_index.users range=(432345564227567615,504403158265495551] table - Disabled compaction for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair table - Got unrepaired compaction and repair lock for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair table - Disabled compaction for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair table - Got unrepaired compaction and repair lock for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036]: get_sync_boundary: got error from node=0e9d51a5-9c99-4d6e-b9db-ad36a148b0ea, keyspace=sec_index, table=users, range=(432345564227567615,504403158265495551], error=seastar::rpc::remote_verb_error (Compaction state for table [0x60f008fa34c0] not found) compaction_manager - Stopping 1 tasks for 1 ongoing compactions for table sec_index.users compaction_group=238 due to tablet merge compaction_manager - Stopping 1 tasks for 1 
ongoing compactions for table sec_index.users compaction_group=238 due to tablet merge .... scylla[10793] Segmentation fault on shard 28, in scheduling group streaming ``` The rwlock in compaction_state could be destroyed before the lock holder of the rwlock is destroyed. This causes a use-after-free when the lock holder is destroyed. To fix it, stopping a compaction group now waits for users of the repair lock. That way, the compaction group - which controls the lifetime of the rwlock - cannot be destroyed while the lock is held. Additionally, the merge completion fiber - which might remove groups - is properly serialized with incremental repair. The issue reproduces consistently with a sanitize build and no longer reproduces after the fix. Fixes #27365 Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2026-03-03 21:05:15 -03:00
Raphael S. Carvalho	1d8903d9f7	repair: Prevent repair lock holder leakage after table drop Prevent repair lock holder from being leaked in repair_service when table is dropped midway. The leakage might result in use-after-free later, since the repair lock itself will be gone after table drop. The RPC verb that removes the lock on success path will not be called by coordinator after table was dropped. Refs #27365. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-896. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2026-03-03 21:05:10 -03:00
Marcin Maliszkiewicz	a83ee6cf66	Merge 'db/batchlog_manager: re-add v1 support for mixed clusters' from Botond Dénes `3f7ee3ce5d` introduced system.batchlog_v2, with a schema designed to speed up batchlog replays and make post-replay cleanups much more effective. It did not introduce a cluster feature for the new table, because it is a node-local table, so the cluster can switch to the new table gradually, one node at a time. However, https://github.com/scylladb/scylladb/issues/27886 showed that the switch causes timeouts during upgrades, in mixed clusters. Furthermore, switching to the new table unconditionally on upgraded nodes means that on rollback, the batches saved into the v2 table are lost. This PR re-introduces v1 (`system.batchlog`) support and guards the use of the v2 table with a cluster feature, so mixed clusters keep using v1 and thus remain rollback-compatible. The re-introduced v1 support doesn't support post-replay cleanups, for simplicity. The cleanup in v1 was never particularly effective anyway and we ended up disabling it for heavy batchlog users, so I don't think the lack of support for cleanup is a problem. 
Fixes: https://github.com/scylladb/scylladb/issues/27886 Needs backport to 2026.1, to fix upgrades for clusters using batches Closes scylladb/scylladb#28736 * github.com:scylladb/scylladb: test/boost/batchlog_manager_test: add tests for v1 batchlog test/boost/batchlog_manager_test: make prepare_batches() work with both v1 and v2 test/boost/batchlog_manager_test: fix indentation test/boost/batchlog_manager_test: extract prepare_batches() method test/lib/cql_assertions: is_rows(): add dump parameter tools/scylla-sstable: extract query result printers tools/scylla-sstable: add std::ostream& arg to query result printers repair/row_level: repair_flush_hints_batchlog_handler(): add all_replayed to finish log db/batchlog_manager: re-add v1 support db/batchlog_manager: return all_replayed from process_batch() db/batchlog_manager: process_bath() fix indentation db/batchlog_manager: make batch() a standalone function db/batchlog_manager: make structs stats public db/batchlog_manager: allocate limiter on the stack db/batchlog_manager: add feature_service dependency gms/feature_service: add batchlog_v2 feature	2026-03-02 12:09:10 +01:00
Taras Veretilnyk	5bbc44ed12	sstables: replace rewrite_statistics with new rewrite component mechanism This commit migrates all callers that used rewrite_statistics to the new rewrite component mechanism.	2026-02-26 22:38:55 +01:00
Gleb Natapov	6173ea476b	node_ops: remove topology over node ops code The code is no longer called.	2026-02-25 10:08:32 +02:00
Botond Dénes	0549b61d55	repair/row_level: repair_flush_hints_batchlog_handler(): add all_replayed to finish log Provides visibility into whether batchlog replay was successful or not.	2026-02-20 07:03:46 +02:00
Aleksandra Martyniuk	3fe596d556	service: pass topology guard to RBNO Currently, raft-based node operations with streaming use topology guards, but repair-based don't. Topology guards ensure that if a respective session is closed (the operation has finished), each leftover operation being a part of this session fails. Thanks to that we won't incorrectly assume that e.g. the old rpc received late belongs to the newly started operation. This is especially important if the operation involves writes. Pass a topology_guard down from raft_topology_cmd_handler to repair tasks. Repair tasks already support topology guards. Fixes: https://github.com/scylladb/scylladb/issues/27759	2026-01-20 10:06:34 +01:00
Asias He	7ba7b25bdd	repair: Implement auto repair for tablet repair This patch implements the basic auto repair support for tablet repair. It was decided to add no per table configuration for the initial implementation, so two scylla yaml config options are introduced to set the default auto repair configs for all the tablet tables. - auto_repair_enabled_default Set true to enable auto repair for tablet tables by default. The value will be overridden by the per keyspace or per table configuration which is not implemented yet. - auto_repair_threshold_default_in_seconds Set the default time in seconds for the auto repair threshold for tablet tables. If the time since last repair is bigger than the configured time, the tablet is eligible for auto repair. The value will be overridden by the per keyspace or per table configuration which is not implemented yet. The following metrics are added: - auto_repair_needs_repair_nr The number of tablets with auto repair enabled that need repair - auto_repair_enabled_nr The number of tablets with auto repair enabled The metrics are useful to tell if auto repair is falling behind. In the future, more auto repair scheduling will be added, e.g., scheduling based on the repaired and unrepaired sstable set size, tombstone ratio and so on, in addition to the time based scheduling. Fixes SCYLLADB-99	2026-01-09 16:11:39 +08:00
Asias He	4f77dd058d	repair: Add tablet repair progress report support This patch adds tablet repair progress report support so that the user can use the /task_manager/task_status API to query the progress. In order to support this, a new system table is introduced to record the user request related info, i.e., start of the request and end of the request. The progress is accurate when tablet split or merge happens in the middle of the request, since the tokens of the tablet are recorded when the request is started and when repair of each tablet is finished. The original tablet repair is considered as finished when the finished ranges cover the original tablet token ranges. After this patch, the /task_manager/task_status API will report correct progress_total and progress_completed. Fixes #22564 Fixes #26896 Closes scylladb/scylladb#27679	2026-01-08 21:55:18 +02:00
Asias He	0aabf51380	repair: Fix sstable_list_to_mark_as_repaired with multishard writer It was observed: ``` test_repair_disjoint_row_2nodes_diff_shard_count was spuriously failing due to segfault. backtrace pointed to a failure when allocating an object from the chain of freed objects, which indicates memory corruption. (gdb) bt at ./seastar/include/seastar/core/shared_ptr.hh:275 at ./seastar/include/seastar/core/shared_ptr.hh:430 Usual suspect is use-after-free, so ran the reproducer in the sanitize mode, which indicated shared ptr was being copied into another cpu through the multi shard writer: seastar - shared_ptr accessed on non-owner cpu, at: ... -------- seastar::smp_message_queue::async_work_item<mutation_writer::multishard_writer::make_shard_writer... ``` The multishard writer itself was fine, the problem was in the streaming consumer for repair copying a shared ptr. It could work fine with the same smp setting, since there will be only 1 shard in the consumer path, from rpc handler all the way to the consumer. But with mixed smp setting, the ptr would be copied into the cpus involved, and since the shared ptr is not cpu safe, the refcount change can go wrong, causing double free, use-after-free. To fix, we pass a generic incremental repair handler to the streaming consumer. The handler is safe to be copied to different shards. It will be a no op if incremental repair is not enabled or on a different shard. A reproducer test is added. The test could reproduce the crash consistently before the fix and passes after the fix. Fixes #27666 Closes scylladb/scylladb#27870	2026-01-08 21:55:18 +02:00
Asias He	3abda7d15e	topology_coordinator: Ensure repair_update_compaction_ctrl is executed Consider this: - n1 is a coordinator and schedules tablet repair - n1 detects tablet repair failed, so it schedules tablet transition to end_repair state - n1 loses leadership and n2 becomes the new topology coordinator - n2 runs end_repair on the tablet with session_id=00000000-0000-0000-0000-000000000000 - when a new tablet repair is scheduled, it hangs since the lock is already taken because it was not removed in previous step To fix, we use the global_tablet_id to index the lock instead of the session id. In addition, we retry the repair_update_compaction_ctrl verb in case of error to ensure the verb is eventually executed. The verb handler is also updated to check if it is still in end_repair stage. Fixes #26346 Closes scylladb/scylladb#27740	2025-12-31 13:17:18 +01:00
Botond Dénes	bfdd4f7776	Merge 'Synchronize incremental repair and tablet split' from Raphael Raph Carvalho Split prepare can run concurrently with repair. Consider this: 1) split prepare starts 2) incremental repair starts 3) split prepare finishes 4) incremental repair produces unsplit sstable 5) split is not happening on sstable produced by repair 5.1) that sstable is not marked as repaired yet 5.2) might belong to repairing set (has compaction disabled) 6) split executes 7) repairing or repaired set has unsplit sstable If split was acked to coordinator (meaning prepare phase finished), repair must make sure that all sstables produced by it are split. It's not happening today with incremental repair because it disables split on sstables belonging to repairing group. And there's a window where sstables produced by repair belong to that group. To solve the problem, we want the invariant where all sealed sstables will be split. To achieve this, streaming consumers are patched to produce unsealed sstable, and the new variant add_new_sstable_and_update_cache() will take care of splitting the sstable while it's unsealed. If no split is needed, the new sstable will be sealed and attached. This solution was also needed to interact nicely with out of space prevention too. If disk usage is critical, split must not happen on restart, and the invariant aforementioned allows for it, since any unsplit sstable left unsealed will be discarded on restart. The streaming consumer will fail if disk usage is critical too. The reason interposer consumer doesn't fully solve the problem is because incremental repair can start before split, and the sstable being produced when split decision was emitted must be split before attached. So we need a solution which covers both scenarios. Fixes #26041. Fixes #27414. 
Should be backported to 2025.4 that contains incremental repair Closes scylladb/scylladb#26528 * github.com:scylladb/scylladb: test: Add reproducer for split vs intra-node migration race test: Verify split failure on behalf of repair during critical disk utilization test: boost: Add failure_when_adding_new_sstable_test test: Add reproducer for split vs incremental repair race condition compaction: Fail split of new sstable if manager is disabled replica: Don't split in do_add_sstable_and_update_cache() streaming: Leave sstables unsealed until attached to the table replica: Wire add_new_sstables_and_update_cache() into intra-node streaming replica: Wire add_new_sstable_and_update_cache() into file streaming consumer replica: Wire add_new_sstable_and_update_cache() into streaming consumer replica: Document old add_sstable_and_update_cache() variants replica: Introduce add_new_sstables_and_update_cache() replica: Introduce add_new_sstable_and_update_cache() replica: Account for sstables being added before ACKing split replica: Remove repair read lock from maybe_split_new_sstable() compaction: Preserve state of input sstable in maybe_split_new_sstable() Rename maybe_split_sstable() to maybe_split_new_sstable() sstables: Allow storage::snapshot() to leave destination sstable unsealed sstables: Add option to leave sstable unsealed in the stream sink test: Verify unsealed sstable can be compacted sstables: Allow unsealed sstable to be loaded sstables: Restore sstable_writer_config::leave_unsealed	2025-12-23 07:28:56 +02:00
Calle Wilund	5f8f724d78	repair: Don't use off-strategy as repair destination with tablet tables Fixes #17384 Bypasses enabling off-strategy storage/placement for repair streams when table repaired is using tablets. Instead, the resulting sstable(s) will be placed in the "normal" set of sstables, and bypass a post-repair off-strategy compaction. v2: Bypass off-strat for whatever reason iff dest is tablets. Closes scylladb/scylladb#27500	2025-12-16 06:54:07 +02:00
Raphael S. Carvalho	77a4f95eb8	test: Add reproducer for split vs incremental repair race condition Refs #26041. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-12-12 17:01:16 -03:00
Avi Kivity	24264e24bb	Revert "repair: Add tablet repair progress report support" This reverts commit `faad0167d7`. It causes a regression in test_two_tablets_concurrent_repair_and_migration_repair_writer_level in debug mode (with ~5%-10% probability). Fixes #27510. Closes scylladb/scylladb#27560	2025-12-11 12:18:11 +02:00
Asias He	faad0167d7	repair: Add tablet repair progress report support This patch adds tablet repair progress report support so that the user could use the /task_manager/task_status API to query the progress. In order to support this, a new system table is introduced to record the user request related info, i.e, start of the request and end of the request. The progress is accurate when tablet split or merge happens in the middle of the request, since the tokens of the tablet are recorded when the request is started and when repair of each tablet is finished. The original tablet repair is considered as finished when the finished ranges cover the original tablet token ranges. After this patch, the /task_manager/task_status API will report correct progress_total and progress_completed. Fixes #22564 Fixes #26896 Closes scylladb/scylladb#26924	2025-12-08 13:35:19 +02:00
Asias He	e97a504775	repair: Allow min max range to be updated for repair history It is observed that: repair - repair[667d4a59-63fb-4ca6-8feb-98da49946d8b]: Failed to update system.repair_history table of node d27de212-6f32-4649ad76-a9ef1165fdcb: seastar::rpc::remote_verb_error (repair[667d4a59-63fb-4ca6-8feb-98da49946d8b]: range (minimum token,maximum token) is not in the format of (start, end]) This is because repair checks the end of the range to be repaired needs to be inclusive. When small_table_optimization is enabled for regular repair, a (minimum token,maximum token) will be used. To fix, we can relax the check of (start, end] for the min max range. Fixes #27220 Closes scylladb/scylladb#27357	2025-12-05 10:41:25 +02:00
Aleksandra Martyniuk	e3e81a9a7a	repair: throw if flush failed in get_flush_time Currently, _flush_time is stored as a std::optional<gc_clock::time_point> and std::nullopt indicates that the flush was needed but failed. It's confusing for the caller and does not work as expected since the _flush_time is initialized with a value (not optional). Change _flush_time type to gc_clock::time_point. If a flush is needed but failed, get_flush_time() throws an exception. This was supposed to be a part of https://github.com/scylladb/scylladb/pull/26319 but it was mistakenly overwritten during rebases. Refs: https://github.com/scylladb/scylladb/issues/24415. Closes scylladb/scylladb#26794	2025-12-04 11:45:53 +02:00
Asias He	da5cc13e97	repair: Fix deadlock when topology coordinator steps down in the middle Consider this: 1) n1 is the topology coordinator 2) n1 schedules and executes a tablet repair with session id s1 for a tablet on n3 and n4. 3) n3 and n4 take and store the lock in _rs._repair_compaction_locks[s1] 4) n1 steps down before it executes locator::tablet_transition_stage::end_repair 5) n2 becomes the new topology coordinator 6) n2 runs locator::tablet_transition_stage::repair again 7) n3 and n4 try to take the lock again and hang since the lock is already taken. To avoid the deadlock, we can throw in step 7 so that n2 will proceed to end_repair stage and release the lock. After that, the scheduler could schedule the tablet repair request again. Fixes #26346 Closes scylladb/scylladb#27163	2025-11-28 15:14:39 +01:00
Radosław Cybulski	d589e68642	Add precompiled headers to CMakeLists.txt Add precompiled header support to CMakeLists.txt and configure.py - it improves compilation time by approximately 10%. New header `stdafx.hh` is added, don't include it manually - the compiler will include it for you. The header contains includes from external libraries used by Scylla - seastar, standard library, linux headers and zlib. The feature is enabled by default, use CMake option `Scylla_USE_PRECOMPILED_HEADER` or configure.py --disable-precompiled-header to disable. The feature should be disabled, when trying to check headers - otherwise you might get false negatives on missing includes from seastar / abseil and so on. Note: following configuration needs to be added to ccache.conf: sloppiness = pch_defines,time_macros,include_file_mtime,include_file_ctime Closes scylladb/scylladb#26617	2025-11-21 12:27:41 +02:00
Asias He	dbeca7c14d	repair: Add metric for time spent on tablet repair It is useful to check time spent on tablet repair. It can be used to compare incremental repair and non-incremental repair. The time does not include the time waiting for the tablet scheduler to schedule the tablet repair task. Fixes #26505 Closes scylladb/scylladb#26502	2025-11-06 10:00:20 +03:00
Aleksandra Martyniuk	d436233209	repair: fail tablet repair if any batch wasn't sent successfully If any batch replay failed, we cannot update repair_time as we risk the data resurrection. If replay of any batch needs to be retried, run the whole repair but fail at the very end, so that the repair_time for it won't be updated.	2025-10-23 10:39:42 +02:00
Aleksandra Martyniuk	7f20b66eff	db: repair: throw if replay fails Return a flag determining whether all the batches were sent successfully in batchlog_manager::replay_all_failed_batches (batches skipped due to being too fresh are not counted). Throw in repair_flush_hints_batchlog_handler if not all batches were replayed, to ensure that repair_time isn't updated.	2025-10-23 10:38:31 +02:00
Asias He	33bc1669c4	repair: Fix uuid and nodes_down order in the log Fixes #26536 Closes scylladb/scylladb#26547	2025-10-20 13:21:59 +03:00
Michał Chojnowski	55c4b89b88	sstables: make `sstable::estimated_keys_for_range` asynchronous Currently, `sstable::estimated_keys_for_range` works by checking what fraction of Summary is covered by the given range, and multiplying this fraction by the number of all keys. Since computing things on Summary doesn't involve I/O (because Summary is always kept in RAM), this is synchronous. In a later patch, we will modify `sstable::estimated_keys_for_range` so that it can deal with sstables that don't have a Summary (because they use BTI indexes instead of BIG indexes). In that case, the function is going to compute the relevant fraction by using the index instead of Summary. This will require making the function asynchronous. This is what we do in this patch. (The actual change to the logic of `sstable::estimated_keys_for_range` will come in the next patch. In this one, we only make it asynchronous).	2025-09-29 13:01:21 +02:00
Asias He	b31e651657	repair: Always reset node ops progress to 100% upon completion Always set the node ops progress to 100% when the operation finishes, regardless of success or failure. This ensures the progress never remains below 100%, which would otherwise indicate a pending node operation in case of an error. Fixes #26193 Closes scylladb/scylladb#26194	2025-09-25 11:05:52 +03:00
Pavel Emelyanov	a1ea553fe1	code: Replace distributed<> with sharded<> The latter is recommended in seastar, and the former was left as a compatibility alias. Latest seastar explicitly marks it as deprecated so once the submodule is updated, compilation logs will explode. Most of the patch is generated with for f in $(git grep -l '\<distributed<[A-Za-z0-9:_]*>') ; do sed -e 's/\<distributed<\([A-Za-z0-9:_]*\)>/sharded<\1>/g' -i $f; done for f in $(git grep -l distributed.hh); do sed -e 's/distributed.hh/sharded.hh/' -i $f ; done and a small manual change in test/perf/perf.hh Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#26136	2025-09-19 12:22:51 +02:00
Gleb Natapov	d3badf7406	storage_service: change node_ops_info::ignore_nodes to host id This drops a useless translation from id to ip during removenode through the topology coordinator. Closes scylladb/scylladb#25958	2025-09-15 10:18:24 +02:00
Asias He	cb7db47ae1	repair: Add incremental_mode option for tablet repair This patch introduces a new `incremental_mode` parameter to the tablet repair REST API, providing more fine-grained control over the incremental repair process. Previously, incremental repair was on and could not be turned off. This change allows users to select from three distinct modes: - `regular`: This is the default mode. It performs a standard incremental repair, processing only unrepaired sstables and skipping those that are already repaired. The repair state (`repaired_at`, `sstables_repaired_at`) is updated. - `full`: This mode forces the repair to process all sstables, including those that have been previously repaired. This is useful when a full data validation is needed without disabling the incremental repair feature. The repair state is updated. - `disabled`: This mode completely disables the incremental repair logic for the current repair operation. It behaves like a classic (pre-incremental) repair, and it does not update any incremental repair state (`repaired_at` in sstables or `sstables_repaired_at` in the system.tablets table). The implementation includes: - Adding the `incremental_mode` parameter to the `/storage_service/repair/tablet` API endpoint. - Updating the internal repair logic to handle the different modes. - Adding a new test case to verify the behavior of each mode. - Updating the API documentation and developer documentation. Fixes #25605 Closes scylladb/scylladb#25693	2025-09-09 06:50:21 +03:00
Radosław Cybulski	c242234552	Revert "build: add precompiled headers to CMakeLists.txt" This reverts commit `01bb7b629a`. Closes scylladb/scylladb#25735	2025-09-03 09:46:00 +03:00
Avi Kivity	600349e29a	Merge 'tasks: return task::impl from make_and_start_task ' from Aleksandra Martyniuk Currently, make_and_start_task returns a pointer to task_manager::task that hides the implementation details. If we need to access the implementation (e.g. because we want a task to "return" a value), we need to make and start task step by step openly. Return task_manager::task::impl from make_and_start_task. Use it where possible. Fixes: https://github.com/scylladb/scylladb/issues/22146. Optimization; no backport Closes scylladb/scylladb#25743 * github.com:scylladb/scylladb: tasks: return task::impl from make_and_start_task compaction: use current_task_type repair: add new param to tablet_repair_task_impl repair: add new params to shard_repair_task_impl repair: pass argument by value	2025-08-31 15:44:37 +03:00
Avi Kivity	bc5773f777	Merge 'Add out of space prevention mechanisms' from Łukasz Paszkowski When scaling out is delayed or fails, it is crucial to ensure that clusters remain operational and recoverable even under extreme conditions. To achieve this, the following proactive measures are implemented: - reject writes - includes: inserts, updates, deletes, counter updates, hints, read+repair and lwt writes - applicable to: user tables, views, CDC log, audit, cql tracing - stop running compactions/repairs and prevent new ones from starting - reject incoming tablet migrations The aforementioned mechanisms are automatically enabled when a node's disk utilization reaches the critical level (default: 98%) and disabled when the utilization drops below the threshold. Apart from that, the series adds tests that require mounted volumes to simulate out of space. The paths to the volumes can be provided using a pytest argument, i.e. `--space-limited-dirs`. When not provided, tests are skipped. Test scenarios: 1. Start a cluster and write data until one of the nodes reaches 90% of the disk utilization 2. Perform an operation that would take the nodes over 100% 3. The nodes should not exceed the critical disk utilization (98% by default) 4. Scale out the cluster by adding one node per rack 5. Retry or wait for the operation from step 2 The operation is: writing data, running compactions, building materialized views, running repair, migrating tablets (caused by RF change, decommission). The test is successful if no nodes run out of space, the operation from step 2 is aborted/paused/timed out and the operation from step 5 is successful. 
`perf-simple-query --smp 1 -m 1G` results obtained for fixed 400MHz frequency: Read path (before) ``` instructions_per_op: mean= 39661.51 standard-deviation=34.53 median= 39655.39 median-absolute-deviation=23.33 maximum=39708.71 minimum=39622.61 ``` Read path (after) ``` instructions_per_op: mean= 39691.68 standard-deviation=34.54 median= 39683.14 median-absolute-deviation=11.94 maximum=39749.32 minimum=39656.63 ``` Write path (before): ``` instructions_per_op: mean= 50942.86 standard-deviation=97.69 median= 50974.11 median-absolute-deviation=34.25 maximum=51019.23 minimum=50771.60 ``` Write path (after): ``` instructions_per_op: mean= 51000.15 standard-deviation=115.04 median= 51043.93 median-absolute-deviation=52.19 maximum=51065.81 minimum=50795.00 ``` Fixes: https://github.com/scylladb/scylladb/issues/14067 Refs: https://github.com/scylladb/scylladb/issues/2871 No backport, as it is a new feature. Closes scylladb/scylladb#23917 * github.com:scylladb/scylladb: tests/cluster: Add new storage tests test/scylla_cluster: Override workdir when passed via cmdline streaming: Reject incoming migrations storage_service: extend locator::load_stats to collect per-node critical disk utilization flag repair_service: Add a facility to disable the service compaction_manager: Subscribe to out of space controller compaction_manager: Replace enabled/disabled states with running state database: Add critical_disk_utilization mode database can be moved to disk_space_monitor: add subscription API for threshold-based disk space monitoring docs: Add feature documentation config: Add critical_disk_utilization_level option replica/exceptions: Add a new custom replica exception	2025-08-30 18:47:57 +03:00
Piotr Dulikowski	7ccb50514d	Merge 'Introduce view building coordinator' from Michał Jadwiszczak This patch introduces `view_building_coordinator`, a single entity within the whole cluster responsible for building tablet-based views. The view building coordinator takes a slightly different approach than the existing node-local view builder. The whole process is split into smaller view building tasks, one per tablet replica of the base table. The coordinator builds one base table at a time and it can choose another when all views of the base table currently being processed are built. The tasks are started by setting `STARTED` state and they are executed by the node-local view building worker. The tasks are scheduled in a way that each shard processes only one tablet at a time (multiple tasks can be started for a shard on a node because a table can have multiple views but then all tasks have the same base table and tablet (last_token)). Once the coordinator starts the tasks, it sends `work_on_view_building_tasks` RPC to start the tasks and receive their results. This RPC is resilient to RPC failure or raft leader change, meaning if one RPC call started a batch of tasks but then failed (for instance the raft leader was changed and caller aborted waiting for the response), next RPC call will attach itself to the already started batch. The coordinator plugs into handling tablet operations (migration/resize/RF change) and adjusts its tasks accordingly. At the start of each tablet operation, the coordinator aborts necessary view building tasks to prevent https://github.com/scylladb/scylladb/issues/21564. Then, new adjusted tasks are created at the end of the operation. If the operation fails at any moment, aborted tasks are rolled back. The view building coordinator can also handle staging sstables using process_staging view building tasks. 
We do this because we don't want to start generating view updates from a staging sstable prematurely, before the writes are directed to the new replica (https://github.com/scylladb/scylladb/issues/19149). For detailed description check: `docs/dev/view-building-coordinator.md` Fixes https://github.com/scylladb/scylladb/issues/22288 Fixes https://github.com/scylladb/scylladb/issues/19149 Fixes https://github.com/scylladb/scylladb/issues/21564 Fixes https://github.com/scylladb/scylladb/issues/17603 Fixes https://github.com/scylladb/scylladb/issues/22586 Fixes https://github.com/scylladb/scylladb/issues/18826 Fixes https://github.com/scylladb/scylladb/issues/23930 --- This PR is reimplementation of https://github.com/scylladb/scylladb/pull/21942 Closes scylladb/scylladb#23760 * github.com:scylladb/scylladb: test/cluster: add view build status tests test/cluster: add view building coordinator tests utils/error_injection: allow to abort `injection_handler::wait_for_message()` test: adjust existing tests utils/error_injection: add injection with `sleep_abortable()` db/view/view_builder: ignore `no_such_keyspace` exception docs/dev: add view building coordinator documentation db/view/view_building_worker: work on `process_staging` tasks db/view/view_building_worker: register staging sstable to view building coordinator when needed db/view/view_building_worker: discover staging sstables db/view/view_building_worker: add method to register staging sstable db/view/view_update_generator: add method to process staging sstables instantly db/view/view_update_generator: extract generating updates from staging sstables to a method db/view/view_update_generator: ignore tablet-based sstables db/view/view_building_coordinator: update view build status on node join/left db/view/view_building_coordinator: handle tablet operations db/view: add view building task mutation builder service/topology_coordinator: run view building coordinator db/view: introduce `view_building_coordinator` 
db/view/view_building_worker: update built views locally db/view: introduce `view_building_worker` db/view: extract common view building functionalities db/view: prepare to create abstract `view_consumer` message/messaging_service: add `work_on_view_building_tasks` RPC service/topology_coordinator: make `term_changed_error` public db/schema_tables: create/cleanup tasks when an index is created/dropped service/migration_manager: cleanup view building state on drop keyspace service/migration_manager: cleanup view building state on drop view service/migration_manager: create view building tasks on create view test/boost: enable proxy remote in some tests service/migration_manager: pass `storage_proxy` to `prepare_keyspace_drop_announcement()` service/migration_manager: coroutinize `prepare_new_view_announcement()` service/storage_proxy: expose references to `system_keyspace` and `view_building_state_machine` service: reload `view_building_state_machine` on group0 apply() service/vb_coordinator: add currently processing base db/system_keyspace: move `get_scylla_local_mutation()` up db/system_keyspace: add `view_building_tasks` table db/view: add view_building_state and views_state db/system_keyspace: add method to get view build status map db/view: extract `system.view_build_status_v2` cql statements to system_keyspace db/system_keyspace: move `internal_system_query_state()` function earlier db/view: ignore tablet-based views in `view_builder` gms/feature_service: add VIEW_BUILDING_COORDINATOR feature	2025-08-29 17:28:44 +02:00
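
Commits da5cc13e97 and 3abda7d15e in the list above both concern the repair compaction lock: the first makes a repeated lock attempt throw instead of hang, the second indexes the lock by global_tablet_id rather than by session id. The following is only a toy model of that idea in Python, not Scylla's C++ implementation; `RepairLockRegistry` and all of its methods are hypothetical names, and a plain dictionary stands in for the real lock.

```python
class RepairLockRegistry:
    """Toy model of per-tablet repair locks (all names are hypothetical)."""

    def __init__(self):
        # Keyed by tablet id, so release does not depend on the session id.
        self._holders = {}

    def acquire(self, tablet_id, session_id):
        # Fail fast instead of waiting forever on a lock that is still held.
        if tablet_id in self._holders:
            raise RuntimeError(f"repair lock for tablet {tablet_id} is already held")
        self._holders[tablet_id] = session_id

    def release(self, tablet_id):
        # Returns True if a lock was actually released.
        return self._holders.pop(tablet_id, None) is not None

    def holder(self, tablet_id):
        # Session id currently holding the lock, or None.
        return self._holders.get(tablet_id)
```

In this model a new coordinator that only knows the tablet id, and not the original session id, can still call `release(tablet_id)`, after which a later `acquire` for the same tablet succeeds.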
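
Commit cb7db47ae1 in the list above describes three `incremental_mode` values for tablet repair: `regular`, `full` and `disabled`. As a rough illustration of the semantics stated in that message, the Python sketch below models which sstables each mode would process and whether the repair state is updated. The `SSTable` type and both helper functions are hypothetical, and treating `disabled` as "process every sstable" is an assumption based on the message saying it behaves like a classic repair.

```python
from dataclasses import dataclass
from enum import Enum


class IncrementalMode(Enum):
    REGULAR = "regular"    # default: skip sstables already marked repaired
    FULL = "full"          # process all sstables, still update repair state
    DISABLED = "disabled"  # classic repair, no incremental state updates


@dataclass(frozen=True)
class SSTable:
    # Hypothetical stand-in for an sstable and its repaired flag.
    name: str
    repaired: bool


def select_sstables(mode, sstables):
    """Return the sstables a repair in the given mode would process."""
    if mode is IncrementalMode.REGULAR:
        return [s for s in sstables if not s.repaired]
    # FULL and (by assumption) DISABLED both look at every sstable.
    return list(sstables)


def updates_repair_state(mode):
    """Whether repaired_at / sstables_repaired_at would be updated."""
    return mode is not IncrementalMode.DISABLED
```

For example, `IncrementalMode("full")` parses the string form used by the API parameter, and `select_sstables(IncrementalMode.REGULAR, tables)` keeps only the unrepaired entries.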


1212 Commits