scylladb

Author	SHA1	Message	Date
Yaniv Michael Kaul	8818a0347e	streaming/stream_transfer_task: reformat code style Reformat indentation, brace placement, lambda formatting, and line wrapping for consistency. The seastar logger already checks is_enabled() before formatting arguments, so explicit guards around debug calls with simple variable arguments are unnecessary. AI-assisted: OpenCode / Claude Opus 4.6 Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>	2026-03-24 18:30:39 +02:00
Botond Dénes	fc8cebd671	Merge 'Verify components digests during component load and scrub in validate mode' from Taras Veretilnyk This PR adds integrity verification for SSTable component files during loading. When component digests are present in Scylla metadata, the loader now validates each component's CRC32 digest against the stored expected value, catching silent corruption of component files. Index, Rows and Partitions component digests are also validated during scrub in validate mode. Added corruption tests that write an SSTable, flip a bit in a specific component file, then verify that reloading the SSTable detects the corruption and throws the expected exception. Depends on https://github.com/scylladb/scylladb/pull/28338 Backport is not required, this is a new feature Fixes https://github.com/scylladb/scylladb/issues/20103 Closes scylladb/scylladb#28761 * github.com:scylladb/scylladb: test/cqlpy: test --ignore-component-digest-mismatch flag in scylla sstable upgrade docs: document --ignore-component-digest-mismatch flag for scylla sstable upgrade sstables: propagate ignore_component_digest_mismatch config to all load sites sstables: add option to ignore component digest mismatches sstable_compaction_test: Add scrub validate test for corrupted index sstables: add tests for component digest validation on corrupted SSTables sstables: validate index components digests during SSTable scrub in validate mode sstables: verify component digests on SSTable load sstables: add digest_file_random_access_reader for CRC32 digest computation	2026-03-13 09:55:55 +02:00
Taras Veretilnyk	7214f5a0b6	sstables: propagate ignore_component_digest_mismatch config to all load sites Add ignore_component_digest_mismatch option to db::config (default false). When set, sstable loading logs a warning instead of throwing on component digest mismatches, allowing a node to start up despite corrupted non-vital components or bugs in digest calculation. Propagate the config to all production sstable load paths: - distributed_loader (node startup, upload dir processing) - storage_service (tablet storage cloning) - sstables_loader (load-and-stream, download tasks, attach) - stream_blob (tablet streaming)	2026-03-10 19:24:05 +01:00
Gleb Natapov	02fc4ad0a9	treewide: remove schema pull code since we never pull schema any more Schema pull was used by legacy schema code which is not supported for a long time now and during legacy recovery which is no longer supported as well. It can be dropped now.	2026-03-10 10:09:39 +02:00
Raphael S. Carvalho	5b550e94a6	streaming: Release space incrementally during file streaming File streaming only releases the file descriptors of a tablet being streamed at the very end of streaming. This means that if a compaction on the largest tier of the tablet finishes after streaming started, there will always be ~2x space amplification for that single tablet. Since there can be up to 4 tablets being migrated away, it can add up to a significant amount, since nodes are pushed to a substantial usage of available space (~90%). We want to optimize this by dropping the reference to an sstable after it was fully streamed. This way, we reduce the chances of hitting 2x space amplification for a given tablet. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#28505	2026-02-18 10:10:40 +03:00
Pavel Emelyanov	6f3f30ee07	storage_service: Use stream_manager group for streaming The handler of raft_topology_cmd::command::stream_ranges switches to the streaming scheduling group to perform data streaming in it. It grabs the group from database db_config, which is not great. There's a streaming manager at hand in storage service handlers, and since the handler is using its functionality, it should use _its_ scheduling group. This will help splitting the streaming scheduling group into more elaborate groups under the maintenance supergroup: SCYLLADB-351 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#28363	2026-02-01 20:42:37 +02:00
Pavel Emelyanov	cb1d05d65a	streaming: Get streaming sched group from debug:: namespace In a lambda returned from make_streaming_consumer() there's a check for the current scheduling group being the streaming one. It came from #17090 where streaming code was launched in the wrong sched group, thus affecting user groups in a bad way. The check is nice and useful, but it abuses replica::database by getting unrelated information from it. To preserve the check and to stop using database as a provider of configs, keep the streaming scheduling group handle in the debug namespace. This emphasises that this global variable is purely for debugging purposes. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#28410	2026-01-28 19:14:59 +02:00
Asias He	0aabf51380	repair: Fix sstable_list_to_mark_as_repaired with multishard writer It was observed: ``` test_repair_disjoint_row_2nodes_diff_shard_count was spuriously failing due to segfault. backtrace pointed to a failure when allocating an object from the chain of freed objects, which indicates memory corruption. (gdb) bt at ./seastar/include/seastar/core/shared_ptr.hh:275 at ./seastar/include/seastar/core/shared_ptr.hh:430 Usual suspect is use-after-free, so ran the reproducer in the sanitize mode, which indicated shared ptr was being copied into another cpu through the multi shard writer: seastar - shared_ptr accessed on non-owner cpu, at: ... -------- seastar::smp_message_queue::async_work_item<mutation_writer::multishard_writer::make_shard_writer... ``` The multishard writer itself was fine, the problem was in the streaming consumer for repair copying a shared ptr. It could work fine with the same smp setting, since there will be only 1 shard in the consumer path, from the rpc handler all the way to the consumer. But with a mixed smp setting, the ptr would be copied into the cpus involved, and since the shared ptr is not cpu safe, the refcount change can go wrong, causing double free or use-after-free. To fix, we pass a generic incremental repair handler to the streaming consumer. The handler is safe to be copied to different shards. It will be a no-op if incremental repair is not enabled or on a different shard. A reproducer test is added. The test could reproduce the crash consistently before the fix and works well after the fix. Fixes #27666 Closes scylladb/scylladb#27870	2026-01-08 21:55:18 +02:00
Raphael S. Carvalho	48d243f32f	streaming: Leave sstables unsealed until attached to the table We want the invariant that after ACK, all sealed sstables will be split. This guarantees that on restart, no unsplit sstables will be found sealed. The paths that generate unsplit sstables are streaming and file streaming consumers. It includes intra-node streaming, which is local but can clone an unsplit sstable into destination. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-12-12 16:59:51 -03:00
Raphael S. Carvalho	ddb27488fa	replica: Wire add_new_sstable_and_update_cache() into file streaming consumer Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-12-12 16:59:51 -03:00
Raphael S. Carvalho	10225ee434	replica: Wire add_new_sstable_and_update_cache() into streaming consumer After the wiring, failure to attach the new sstable in the streaming consumer will unlink the sstable automatically. Fixes #27414. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-12-12 16:59:51 -03:00
Raphael S. Carvalho	c5e840e460	sstables: Add option to leave sstable unsealed in the stream sink That will be needed for file streaming to leave the output sstable unsealed. We want the invariant where all sealed sstables are split after split was ACKed. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-12-12 16:59:50 -03:00
Botond Dénes	296d7b8595	Merge 'Enable digest+checksum verification for file based streaming' from Taras Veretilnyk This patch enables integrity check in 'create_stream_sources()' by introducing a new 'sstable_data_stream_source_impl' class for handling the Data component of SSTables. The new implementation uses 'sstable::data_stream()' with 'integrity_check::yes' instead of the raw input_stream. These additional checks require reading the digest and CRC components from disk, which may introduce some I/O overhead. For uncompressed SSTables, this involves loading and computing checksums and digest from the data. For compressed SSTables - where checksums are already embedded - the cost comes from reading, calculating and verifying the digest. New test cases were added to verify that the integrity checks work correctly, detecting both data and digest mismatches. Backport is not required, since it is a new feature Fixes #21776 Closes scylladb/scylladb#26702 * github.com:scylladb/scylladb: file_stream_test: add sstable file streaming integrity verification test cases streaming: prioritize sender-side errors in tablet_stream_files sstables: enable integrity check for data file streaming sstables: Add compressed raw streaming support sstables: Allow to read digest and checksum from user provided file instance sstables: add overload of data_stream() to accept custom file_input_stream_options	2025-11-24 06:37:27 +02:00
Taras Veretilnyk	77dcad9484	streaming: prioritize sender-side errors in tablet_stream_files When 'send_data_to_peer' throws and closes the sink, the peer later reports its own error, masking the original sender failure. This commit preserves the original sender exception. If the status-retrieval task throws its own error before sender task rethrows its exception, we can still propagate the original exception later.	2025-11-21 12:52:31 +01:00
Taras Veretilnyk	c8d2f89de7	sstables: enable integrity check for data file streaming This patch enables integrity check in 'create_stream_sources()' by introducing a new 'sstable_data_stream_source_impl' class for handling the Data component of SSTables. The new implementation uses 'sstable::data_stream()' with 'integrity_check::yes' instead of the raw input_stream. These additional checks require reading the digest and CRC components from disk, which may introduce some I/O overhead. For uncompressed SSTables, this involves loading and computing checksums and digest from the data. For compressed SSTables - where checksums are already embedded - the cost comes from reading, calculating and verifying the digest.	2025-11-21 12:52:26 +01:00
Radosław Cybulski	d589e68642	Add precompiled headers to CMakeLists.txt Add precompiled header support to CMakeLists.txt and configure.py - it improves compilation time by approximately 10%. New header `stdafx.hh` is added, don't include it manually - the compiler will include it for you. The header contains includes from external libraries used by Scylla - seastar, standard library, linux headers and zlib. The feature is enabled by default, use CMake option `Scylla_USE_PRECOMPILED_HEADER` or configure.py --disable-precompiled-header to disable. The feature should be disabled, when trying to check headers - otherwise you might get false negatives on missing includes from seastar / abseil and so on. Note: following configuration needs to be added to ccache.conf: sloppiness = pch_defines,time_macros,include_file_mtime,include_file_ctime Closes scylladb/scylladb#26617	2025-11-21 12:27:41 +02:00
Michał Chojnowski	55c4b89b88	sstables: make `sstable::estimated_keys_for_range` asynchronous Currently, `sstable::estimated_keys_for_range` works by checking what fraction of Summary is covered by the given range, and multiplying this fraction by the number of all keys. Since computing things on Summary doesn't involve I/O (because Summary is always kept in RAM), this is synchronous. In a later patch, we will modify `sstable::estimated_keys_for_range` so that it can deal with sstables that don't have a Summary (because they use BTI indexes instead of BIG indexes). In that case, the function is going to compute the relevant fraction by using the index instead of Summary. This will require making the function asynchronous. This is what we do in this patch. (The actual change to the logic of `sstable::estimated_keys_for_range` will come in the next patch. In this one, we only make it asynchronous).	2025-09-29 13:01:21 +02:00
Michał Jadwiszczak	0f3827d509	streaming/stream_blob: register staging sstables to process them After scylladb/scylladb#22034, the staging status of sstables streamed via file streaming was ignored and view updates were never generated. This patch fixes it and now staging sstables are registered to `view_building_worker`. Then, the worker creates view building tasks for those sstables, so the view building coordinator can schedule them once the tablet migration is finished. Fixes scylladb/scylla-enterprise#4572	2025-09-23 15:34:42 +02:00
Pavel Emelyanov	a1ea553fe1	code: Replace distributed<> with sharded<> The latter is recommended in seastar, and the former was left as compatibility alias. Latest seastar explicitly marks it as deprecated so once the submodule is updated, compilation logs will explode. Most of the patch is generated with for f in $(git grep -l '\<distributed<[A-Za-z0-9:_]*>') ; do sed -e 's/\<distributed<\([A-Za-z0-9:_]*\)>/sharded<\1>/g' -i $f; done for f in $(git grep -l distributed.hh); do sed -e 's/distributed.hh/sharded.hh/' -i $f ; done and a small manual change in test/perf/perf.hh Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#26136	2025-09-19 12:22:51 +02:00
Avi Kivity	f6b6312cf4	Merge 'sstables/trie: prepare for integrating BTI indexes with sstable readers and writers' from Michał Chojnowski This is yet another part in the BTI index project. Overarching issue: https://github.com/scylladb/scylladb/issues/19191 Previous part: https://github.com/scylladb/scylladb/pull/25626 Next parts: introducing the new components, Partitions.db and Rows.db This is the preparatory, uncontroversial part of https://github.com/scylladb/scylladb/pull/26039, which has been split out to a separate PR to make the main part (which, after a revision, will be posted later) smaller. This series contains several small fixes and changes to BTI-related code added earlier, which either have to be done (i.e. propagating `reader_permit` to IO calls in index reads) or just deserved to be done. There's no single theme for the changes in this PR, refer to the individual commits for details. The changes are for the sake of new and unreleased code. No backporting should be done. 
Closes scylladb/scylladb#26075 * github.com:scylladb/scylladb: sstables/mx/reader: remove mx::make_reader_with_index_reader test/boost/bti_index_test: fix indentation sstables/trie/bti_index_reader: in last_block_offset(), return offset from the beginning of partition, not file sstables/trie: support reader_permit and trace_state properly sstables/trie/bti_node_reader: avoid calling into `cached_file` if the target position is already cached sstables/trie/bti_index_reader: get rid of the seastar::file wrapper in read_row_index_header sstables/trie/bti_index_reader: support BYPASS CACHE test/boost/bti_index_test: use read_bti_partitions_db_footer where appropriate sstables/trie: change the signature of bti_partition_index_writer::finish sstables/bti_index: improve signatures of special member functions in index writers streaming/stream_transfer_task: coroutinize `estimate_partitions()` types/comparable_bytes: add a missing implementation for date_type_impl sstables: remove an outdated FIXME storage_service: delete `get_splits()` sstables/trie: fix some comment typos in bti_index_reader.cc sstables/mx/writer: rename _pi_write_m.tomb to partition_tombstone	2025-09-18 12:10:27 +03:00
Michał Jadwiszczak	50678030c0	db/view/view_building_worker: use table id in `register_staging_sstable_tasks()` There is no need to pass the pointer only to get id of the table.	2025-09-18 02:57:35 +02:00
Michał Chojnowski	421fb8e722	streaming/stream_transfer_task: coroutinize `estimate_partitions()` In preparation for making `sstable::estimated_keys_for_range` asynchronous.	2025-09-17 12:22:40 +02:00
Ernest Zaslavsky	d624413ddd	treewide: Move query related files to a new `query` directory As requested in #22120, moved the files and fixed other includes and build system. Moved files: - query.cc - query-request.hh - query-result.hh - query-result-reader.hh - query-result-set.cc - query-result-set.hh - query-result-writer.hh - query_id.hh - query_result_merger.hh Fixes: #22120 This is a cleanup, no need to backport Closes scylladb/scylladb#25105	2025-09-16 23:40:47 +03:00
Asias He	451e1ec659	streaming: Fix use after move in the tablet_stream_files_handler The files object is moved before the log when stream finishes. We've logged the files when the stream starts. Skip it in the end of streaming. Fixes #25830 Closes scylladb/scylladb#25835	2025-09-08 11:59:52 +02:00
Radosław Cybulski	c242234552	Revert "build: add precompiled headers to CMakeLists.txt" This reverts commit `01bb7b629a`. Closes scylladb/scylladb#25735	2025-09-03 09:46:00 +03:00
Avi Kivity	bc5773f777	Merge 'Add out of space prevention mechanisms' from Łukasz Paszkowski When scaling out is delayed or fails, it is crucial to ensure that clusters remain operational and recoverable even under extreme conditions. To achieve this, the following proactive measures are implemented: - reject writes - includes: inserts, updates, deletes, counter updates, hints, read+repair and lwt writes - applicable to: user tables, views, CDC log, audit, cql tracing - stop running compactions/repairs and prevent new ones from starting - reject incoming tablet migrations The aforementioned mechanisms are automatically enabled when a node's disk utilization reaches the critical level (default: 98%) and disabled when the utilization drops below the threshold. Apart from that, the series adds tests that require mounted volumes to simulate out of space. The paths to the volumes can be provided using a pytest argument, i.e. `--space-limited-dirs`. When not provided, tests are skipped. Test scenarios: 1. Start a cluster and write data until one of the nodes reaches 90% of the disk utilization 2. Perform an operation that would take the nodes over 100% 3. The nodes should not exceed the critical disk utilization (98% by default) 4. Scale out the cluster by adding one node per rack 5. Retry or wait for the operation from step 2 The operation is: writing data, running compactions, building materialized views, running repair, migrating tablets (caused by RF change, decommission). The test is successful if no nodes run out of space, the operation from step 2 is aborted/paused/timed out and the operation from step 5 is successful. 
`perf-simple-query --smp 1 -m 1G` results obtained for fixed 400MHz frequency: Read path (before) ``` instructions_per_op: mean= 39661.51 standard-deviation=34.53 median= 39655.39 median-absolute-deviation=23.33 maximum=39708.71 minimum=39622.61 ``` Read path (after) ``` instructions_per_op: mean= 39691.68 standard-deviation=34.54 median= 39683.14 median-absolute-deviation=11.94 maximum=39749.32 minimum=39656.63 ``` Write path (before): ``` instructions_per_op: mean= 50942.86 standard-deviation=97.69 median= 50974.11 median-absolute-deviation=34.25 maximum=51019.23 minimum=50771.60 ``` Write path (after): ``` instructions_per_op: mean= 51000.15 standard-deviation=115.04 median= 51043.93 median-absolute-deviation=52.19 maximum=51065.81 minimum=50795.00 ``` Fixes: https://github.com/scylladb/scylladb/issues/14067 Refs: https://github.com/scylladb/scylladb/issues/2871 No backport, as it is a new feature. Closes scylladb/scylladb#23917 * github.com:scylladb/scylladb: tests/cluster: Add new storage tests test/scylla_cluster: Override workdir when passed via cmdline streaming: Reject incoming migrations storage_service: extend locator::load_stats to collect per-node critical disk utilization flag repair_service: Add a facility to disable the service compaction_manager: Subscribe to out of space controller compaction_manager: Replace enabled/disabled states with running state database: Add critical_disk_utilization mode database can be moved to disk_space_monitor: add subscription API for threshold-based disk space monitoring docs: Add feature documentation config: Add critical_disk_utilization_level option replica/exceptions: Add a new custom replica exception	2025-08-30 18:47:57 +03:00
Piotr Dulikowski	7ccb50514d	Merge 'Introduce view building coordinator' from Michał Jadwiszczak This patch introduces `view_building_coordinator`, a single entity within the whole cluster responsible for building tablet-based views. The view building coordinator takes a slightly different approach than the existing node-local view builder. The whole process is split into smaller view building tasks, one per each tablet replica of the base table. The coordinator builds one base table at a time and it can choose another when all views of the currently processed base table are built. The tasks are started by setting the `STARTED` state and they are executed by the node-local view building worker. The tasks are scheduled in such a way that each shard processes only one tablet at a time (multiple tasks can be started for a shard on a node because a table can have multiple views, but then all tasks have the same base table and tablet (last_token)). Once the coordinator starts the tasks, it sends the `work_on_view_building_tasks` RPC to start the tasks and receive their results. This RPC is resilient to RPC failure or raft leader change, meaning if one RPC call started a batch of tasks but then failed (for instance the raft leader was changed and the caller aborted waiting for the response), the next RPC call will attach itself to the already started batch. The coordinator plugs into handling tablet operations (migration/resize/RF change) and adjusts its tasks accordingly. At the start of each tablet operation, the coordinator aborts necessary view building tasks to prevent https://github.com/scylladb/scylladb/issues/21564. Then, new adjusted tasks are created at the end of the operation. If the operation fails at any moment, aborted tasks are rolled back. The view building coordinator can also handle staging sstables using process_staging view building tasks. 
We do this because we don't want to start generating view updates from a staging sstable prematurely, before the writes are directed to the new replica (https://github.com/scylladb/scylladb/issues/19149). For detailed description check: `docs/dev/view-building-coordinator.md` Fixes https://github.com/scylladb/scylladb/issues/22288 Fixes https://github.com/scylladb/scylladb/issues/19149 Fixes https://github.com/scylladb/scylladb/issues/21564 Fixes https://github.com/scylladb/scylladb/issues/17603 Fixes https://github.com/scylladb/scylladb/issues/22586 Fixes https://github.com/scylladb/scylladb/issues/18826 Fixes https://github.com/scylladb/scylladb/issues/23930 --- This PR is reimplementation of https://github.com/scylladb/scylladb/pull/21942 Closes scylladb/scylladb#23760 * github.com:scylladb/scylladb: test/cluster: add view build status tests test/cluster: add view building coordinator tests utils/error_injection: allow to abort `injection_handler::wait_for_message()` test: adjust existing tests utils/error_injection: add injection with `sleep_abortable()` db/view/view_builder: ignore `no_such_keyspace` exception docs/dev: add view building coordinator documentation db/view/view_building_worker: work on `process_staging` tasks db/view/view_building_worker: register staging sstable to view building coordinator when needed db/view/view_building_worker: discover staging sstables db/view/view_building_worker: add method to register staging sstable db/view/view_update_generator: add method to process staging sstables instantly db/view/view_update_generator: extract generating updates from staging sstables to a method db/view/view_update_generator: ignore tablet-based sstables db/view/view_building_coordinator: update view build status on node join/left db/view/view_building_coordinator: handle tablet operations db/view: add view building task mutation builder service/topology_coordinator: run view building coordinator db/view: introduce `view_building_coordinator` 
db/view/view_building_worker: update built views locally db/view: introduce `view_building_worker` db/view: extract common view building functionalities db/view: prepare to create abstract `view_consumer` message/messaging_service: add `work_on_view_building_tasks` RPC service/topology_coordinator: make `term_changed_error` public db/schema_tables: create/cleanup tasks when an index is created/dropped service/migration_manager: cleanup view building state on drop keyspace service/migration_manager: cleanup view building state on drop view service/migration_manager: create view building tasks on create view test/boost: enable proxy remote in some tests service/migration_manager: pass `storage_proxy` to `prepare_keyspace_drop_announcement()` service/migration_manager: coroutinize `prepare_new_view_announcement()` service/storage_proxy: expose references to `system_keyspace` and `view_building_state_machine` service: reload `view_building_state_machine` on group0 apply() service/vb_coordinator: add currently processing base db/system_keyspace: move `get_scylla_local_mutation()` up db/system_keyspace: add `view_building_tasks` table db/view: add view_building_state and views_state db/system_keyspace: add method to get view build status map db/view: extract `system.view_build_status_v2` cql statements to system_keyspace db/system_keyspace: move `internal_system_query_state()` function earlier db/view: ignore tablet-based views in `view_builder` gms/feature_service: add VIEW_BUILDING_COORDINATOR feature	2025-08-29 17:28:44 +02:00
Łukasz Paszkowski	7cfedb1214	streaming: Reject incoming migrations When a replica operates in the critical disk utilization mode, all incoming migrations are rejected by rejecting an incoming sstable file. In the topology_coordinator, the rejected tablet is moved into the cleanup_target state in order to revert the migration. Otherwise, a retry happens and the cluster stays in the tablet_migration transition state, preventing any other topology changes from happening, e.g. scaling out. Once the tablet migration is rejected, the load balancer will schedule a new migration.	2025-08-29 14:56:13 +02:00
Radosław Cybulski	01bb7b629a	build: add precompiled headers to CMakeLists.txt Add precompiled header support to CMakeLists.txt and configure.py - it improves compilation time by approximately 10%. New header `stdafx.hh` is added, don't include it manually - the compiler will include it for you. The header contains includes from external libraries used by Scylla - seastar, standard library, linux headers and zlib. The feature is enabled by default, use CMake option `Scylla_USE_PRECOMPILED_HEADER` or configure.py --disable-precompiled-header to disable. The feature should be disabled, when trying to check headers - otherwise you might get false negatives on missing includes from seastar / abseil and so on. Note: following configuration needs to be added to ccache.conf: sloppiness = pch_defines,time_macros Closes #25182	2025-08-27 21:37:54 +03:00
Michał Jadwiszczak	233f4dcee3	db/view/view_building_worker: register staging sstable to view building coordinator when needed Change return type of `check_needs_view_update_path()`. Instead of returning a bool which tells whether to use the staging directory (and register to `view_update_generator`) or the normal directory, the function now returns an enum with possible values: - `normal_directory` - use normal directory for the sstable - `staging_directly_to_generator` - use staging directory and register to `view_update_generator` - `staging_managed_by_vbc` - use staging directory but don't register it to `view_update_generator` but create view building tasks for later The third option is new, it's used when the table has any view which is currently in the building process. In this case, registering it to `view_update_generator` prematurely may lead to base-view inconsistency (for example when a replica is in a pending state).	2025-08-27 10:23:03 +02:00
Asias He	b12404ba52	streaming: Enclose potential throws in try block and ensure sink close before logging - Move the initialization of log_done inside the try block to catch any exceptions it may throw. - Relocate the failure warning log after sink.close() cleanup to guarantee sink.close() is always called before logging errors. Refs #25497 Closes scylladb/scylladb#25591	2025-08-20 19:46:56 +02:00
Avi Kivity	611918056a	Merge 'repair: Add tablet incremental repair support' from Asias He The central idea of incremental repair is to allow repair participants to select and repair only a portion of the dataset to speed up the repair process. All repair participants must utilize an identical selection method to repair and synchronize the same selected dataset. There are two primary selection methods: time-based and file-based. The time-based method selects data within a specified time frame. It is versatile but it is less efficient because it requires reading all of the dataset and omitting data beyond the time frame. The file-based method selects data from unrepaired SSTables and is more efficient because it allows the entire SSTable to be omitted. This patch implements the file-based selection method. Incremental repair will only be supported for tablet tables; it will not be supported for vnode tables. On one hand, the legacy vnode is less important to support. On the other hand, the incremental repair for vnode is much harder to implement. With vnodes, a SSTable could contain data for multiple vnode ranges. When a given vnode range is repaired, only a portion of the SSTable is repaired. This complicates the manipulation of SSTables significantly during both repair and compaction. With tablets, an entire tablet is repaired so that a sstable is either fully repaired or not repaired which is a huge simplification. This patch uses the repaired_at from sstables::statistics component to mark a sstable as repaired. It uses a virtual clock as the repair timestamp, i.e., using a monotonically increasing number for the repaired_at field of a SSTable and sstables_repaired_at column in system.tablets table. Notice that when a sstable is not repaired, the repaired_at field will be set to the default value 0. The being_repaired in memory field of a SSTable is used to explicitly mark that a SSTable is being selected. 
The following variables are used for incremental repair: The repaired_at on disk field of a SSTable is used. - A 64-bit number increases sequentially The sstables_repaired_at is added to the system.tablets table. - repaired_at <= sstables_repaired_at means the sstable is repaired The being_repaired in memory field of a SSTable is added. - A repair UUID tells which sstable has participated in the repair Initial test results: 1) Medium dataset results Node amount: 3 Instance type: i4i.2xlarge Disk usage per node: ~500GB Cluster pre-populated with ~500GB of data before starting repairs job. Results for Repair Timings: The regular repair run took 210 mins. Incremental repair 1st run took 183 mins, 2nd and 3rd runs took around 48s The speedup is: 183 mins / 48s = 228X 2) Small dataset results Node amount: 3 Instance type: i4i.2xlarge Disk usage per node: ~167GB Cluster pre-populated with ~167GB of data before starting the repairs job. Regular repair 1st run took 110s, 2nd and 3rd runs took 110s. Incremental repair 1st run took 110 seconds, 2nd and 3rd run took 1.5 seconds. 
The speedup is: 110s / 1.5s = 73X 3) Large dataset results Node amount: 6 Instance type: i4i.2xlarge, 3 racks 50% of base load, 50% read/write Dataset == Sum of data on each node Dataset Non-incremental repair (minutes) 1.3 TiB 31:07 3.5 TiB 25:10 5.0 TiB 19:03 6.3 TiB 31:42 Dataset Incremental repair (minutes) 1.3 TiB 24:32 3.0 TiB 13:06 4.0 TiB 5:23 4.8 TiB 7:14 5.6 TiB 3:58 6.3 TiB 7:33 7.0 TiB 6:55 Fixes #22472 Closes scylladb/scylladb#24291 * github.com:scylladb/scylladb: replica: Introduce get_compaction_reenablers_and_lock_holders_for_repair compaction: Move compaction_reenabler to compaction_reenabler.hh topology_coordinator: Make rpc::remote_verb_error to warning level repair: Add metrics for sstable bytes read and skipped from sstables test.py: Disable incremental for test_tombstone_gc_for_streaming_and_repair test.py: Add tests for tablet incremental repair repair: Add tablet incremental repair support compaction: Add tablet incremental repair support feature_service: Add TABLET_INCREMENTAL_REPAIR feature tablet_allocator: Add tablet_force_tablet_count_increase and decrease repair: Add incremental helpers sstable: Add being_repaired to sstable sstables: Add set_repaired_at to metadata_collector mutation_compactor: Introduce add operator to compaction_stats tablet: Add sstables_repaired_at to system.tablets table test: Fix drain api in task_manager_client.py	2025-08-19 13:13:22 +03:00
Asias He	0d7e518a26	repair: Add tablet incremental repair support The central idea of incremental repair is to allow repair participants to select and repair only a portion of the dataset to speed up the repair process. All repair participants must utilize an identical selection method to repair and synchronize the same selected dataset. There are two primary selection methods: time-based and file-based. The time-based method selects data within a specified time frame. It is versatile but it is less efficient because it requires reading all of the dataset and omitting data beyond the time frame. The file-based method selects data from unrepaired SSTables and is more efficient because it allows the entire SSTable to be omitted. This patch implements the file-based selection method. Incremental repair will only be supported for tablet tables; it will not be supported for vnode tables. On one hand, the legacy vnode is less important to support. On the other hand, the incremental repair for vnode is much harder to implement. With vnodes, an SSTable could contain data for multiple vnode ranges. When a given vnode range is repaired, only a portion of the SSTable is repaired. This complicates the manipulation of SSTables significantly during both repair and compaction. With tablets, an entire tablet is repaired so that a sstable is either fully repaired or not repaired which is a huge simplification. This patch uses the repaired_at from sstables::statistics component to mark a sstable as repaired. It uses a virtual clock as the repair timestamp, i.e., using a monotonically increasing number for the repaired_at field of a SSTable and sstables_repaired_at column in system.tablets table. Notice that when a sstable is not repaired, the repaired_at field will be set to the default value 0. The being_repaired in memory field of a SSTable is used to explicitly mark that a SSTable is being selected. 
The following variables are used for incremental repair: The repaired_at on disk field of a SSTable is used. - A 64-bit number increases sequentially The sstables_repaired_at is added to the system.tablets table. - repaired_at <= sstables_repaired_at means the sstable is repaired The being_repaired in memory field of a SSTable is added. - A repair UUID tells which sstable has participated in the repair Initial test results: 1) Medium dataset results Node amount: 3 Instance type: i4i.2xlarge Disk usage per node: ~500GB Cluster pre-populated with ~500GB of data before starting repairs job. Results for Repair Timings: The regular repair run took 210 mins. Incremental repair 1st run took 183 mins, 2nd and 3rd runs took around 48s The speedup is: 183 mins / 48s = 228X 2) Small dataset results Node amount: 3 Instance type: i4i.2xlarge Disk usage per node: ~167GB Cluster pre-populated with ~167GB of data before starting the repairs job. Regular repair 1st run took 110s, 2nd and 3rd runs took 110s. Incremental repair 1st run took 110 seconds, 2nd and 3rd run took 1.5 seconds. The speedup is: 110s / 1.5s = 73X 3) Large dataset results Node amount: 6 Instance type: i4i.2xlarge, 3 racks 50% of base load, 50% read/write Dataset == Sum of data on each node Dataset Non-incremental repair (minutes) 1.3 TiB 31:07 3.5 TiB 25:10 5.0 TiB 19:03 6.3 TiB 31:42 Dataset Incremental repair (minutes) 1.3 TiB 24:32 3.0 TiB 13:06 4.0 TiB 5:23 4.8 TiB 7:14 5.6 TiB 3:58 6.3 TiB 7:33 7.0 TiB 6:55 Fixes #22472	2025-08-18 11:01:21 +08:00
Benny Halevy	49e3b2827f	streaming: stream_blob: use the table sstable_generation_generator No need to start a local generator. Can just use the table's sstable generation generator to make new sstables now that it's stateless and doesn't depend on the highest generation found. Note that tablet_stream_files_handler used uuid generations unconditionally from inception (`4018dc7f0d`). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-08 11:46:21 +03:00
Benny Halevy	6cc964ef16	sstables: sstable_generation: get rid of uuid_identifiers bool class Now that all call sites enable uuid_identifiers. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-08 11:46:21 +03:00
Aleksandra Martyniuk	99ff08ae78	streaming: close sink when exception is thrown If an exception is thrown in result_handling_cont in streaming, then the sink does not get closed. This leads to a node crash. Close sink in exception handler. Fixes: https://github.com/scylladb/scylladb/issues/25165. Closes scylladb/scylladb#25238	2025-07-30 14:26:14 +03:00
Ernest Zaslavsky	408aa289fe	treewide: Move misc files to `utils` directory As requested in #22114, moved the files and fixed other includes and build system. Moved files: - interval.hh - Map_difference.hh Fixes: #22114 This is a cleanup, no need to backport Closes scylladb/scylladb#25095	2025-07-21 11:56:40 +03:00
Tomasz Grabiec	dff2b01237	streaming: Avoid deadlock by running view checks in a separate scheduling group This issue happens with removenode, when RBNO is disabled, so range streamer is used. The deadlock happens in a scenario like this: 1. Start 3 nodes: {A, B, C}, RF=2 2. Node A is lost 3. removenode A 4. Both B and C gain ownership of ranges. 5. Streaming sessions are started with crossed directions: B->C, C->B Readers created by sender side exhaust streaming semaphore on B and C. Receiver side attempts to obtain a permit indirectly by calling check_needs_view_update_path(), which reads local tables. That read is blocked and times-out, causing streaming to fail. The streaming writer is already using a tracking-only permit. To avoid that, run the query under a different scheduling group, which translates to the system semaphore instead of the maintenance semaphore, to break the dependency. The gossip group was chosen because it shouldn't be contended and this change should not interfere with it much. Fixes: #24807	2025-07-11 16:30:46 +02:00
Benny Halevy	15bee9f232	sstables: sstable_generation_generator: set last_generation=0 by default Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-06-18 11:30:29 +03:00
Patryk Jędrzejczak	c21692f3a6	Merge 'token_range_vector: fragment' from Avi Kivity token_range_vector is a sequence of intervals of tokens. It is used to describe vnodes or token ranges owned by shards. Since tokens are bloated (16 bytes instead of 8), and intervals are bloated (40 bytes of overhead instead of 8), and since we have plenty of token ranges, such vectors can exceed our allocation unit of 128 kB and cause allocation stalls. This series fixes that by first generalizing some helpers and then changing token_range_vector to use chunked_vector. Although this touches IDL, there is no compatibility problem since the encodings for vector and chunked_vector are identical. There is no performance concern since token_range_vector is never used on any hot path (hot paths always contain a partition key). Fixes #3335. Fixes #24115. No backport: minor performance fix that isn't a regression. Closes scylladb/scylladb#24205 * https://github.com/scylladb/scylladb: dht: fragment token_range_vector partition_range_compat: generalize wrap/unwrap helpers	2025-05-29 18:45:13 +02:00
Avi Kivity	844a49ed6e	dht: fragment token_range_vector token_range_vector is a linear vector containing intervals of tokens. It can grow quite large in certain places and so cause stalls. Convert it to utils::chunked_vector, which prevents allocation stalls. It is not used in any hot path, as it usually describes vnodes or similar things. Fixes #3335.	2025-05-27 14:47:24 +03:00
Ernest Zaslavsky	7d0d3ec1c8	load_and_stream: Add abortion flow to mutation streaming * The new abort command explicitly represents the abortion flow in mutation streaming, clearly identifying operations that are intentionally aborted. This reduces ambiguity around failures in streaming operations. * In the error-handling section, aborted operations are now explicitly marked as the cause of the streaming failure. This allows us to differentiate them from genuine errors and appropriately adjust log severity to reduce unnecessary alarm caused by aborted streaming failures. * To avoid alarming users with excessive error logs, log severity for streaming failures caused by aborted operations has been downgraded. This helps keep logs cleaner and prevents unnecessary concerns. * A new feature has been added to ensure mixed clusters during updates do not receive unsupported RPC messages, improving compatibility and stability. fixes: https://github.com/scylladb/scylladb/issues/23076 Closes scylladb/scylladb#23214	2025-05-27 14:21:58 +03:00
Aleksandra Martyniuk	2dcea5a27d	streaming: use host_id in file streaming Use host ids instead of ips in file-streaming. Fixes: #22421. Closes scylladb/scylladb#24055	2025-05-12 09:36:48 +03:00
Avi Kivity	5e764d1de2	Merge 'Drop v2 and flat from reader and related names' from Botond Dénes Following a number of similar code cleanup PRs, this one aims to be the last one, definitely dropping flat from all reader and related names. Similarly, v2 is also dropped from reader names, although it still persists in mutation_fragment_v2, mutation_v2 and related names. This won't change in the foreseeable future, as we don't have plans to drop mutation (the v1 variant). The changes in this PR are entirely mechanical, mostly just search-and-replace. Code cleanup, no backport required. Closes scylladb/scylladb#24087 * github.com:scylladb/scylladb: test/boost/mutation_reader_another_test: drop v2 from reader and related names test/boost/mutation_reader: s/puppet_reader_v2/puppet_reader/ test/boost/sstable_datafile_test: s/sstable_reader_v2/sstable_mutation_reader/ test/boost/mutation_test: s/consumer_v2/consumer/ test/lib/mutation_reader_assertions: s/flat_reader_assertions_v2/mutation_reader_assertions/ readers/mutation_readers: s/generating_reader_v2/generating_reader/ readers/mutation_readers: s/delegating_reader_v2/delegating_reader/ readers/mutation_readers: s/empty_flat_reader_v2/empty_mutation_reader/ readers/mutation_source: s/make_reader_v2/make_mutation_reader/ readers/mutation_source: s/flat_reader_v2_factory_type/mutation_reader_factory/ readers/mutation_reader: s/reader_consumer_v2/mutation_reader_consumer/ mutation/mutation_compactor: drop v2 from compactor and related names replica/table: s/make_reader_v2/make_mutation_reader/ mutation_writer: s/bucket_writer_v2/bucket_writer/ readers/queue: drop v2 from reader and related names readers/multishard: drop v2 from reader and related names readers/evictable: drop v2 from reader and related names readers/multi_range: remove flat from name	2025-05-11 22:22:35 +03:00
Botond Dénes	efc48caea5	readers/mutation_reader: s/reader_consumer_v2/mutation_reader_consumer/	2025-05-09 07:53:29 -04:00
Aleksandra Martyniuk	20c2d6210e	streaming: skip dropped tables Currently, stream_session::prepare throws when a table in requests or summaries is dropped. However, we do not want to fail streaming if the table is dropped. Delete table checks from stream_session::prepare. Further streaming steps can handle the dropped table and finish the streaming successfully. Fixes: #15257. Closes scylladb/scylladb#23915	2025-05-07 11:51:56 +03:00
Botond Dénes	c8563b9604	readers: mv generating_v2.hh generating.hh Completely mechanical change.	2025-04-16 04:46:08 -04:00
Pavel Emelyanov	bfbe802632	streaming: Relax load_sstable_for_tablet() The method does several excessive things, that can be relaxed 1. In order to transfer a table-id to another shard, finds the table on source shard, gets schema and captures schema id on invoke_on()'s lambda. It can just capture the original table-id 2. In order to get sstable parameters (format, version, etc.) generates toc_filename(), then calls parse_path() to convert it into the entry_descriptor. The descriptor can be read from sstable directly. 3. Logging "success" includes target shard into the message, but happens on the source shard. The message can be just logged on target shard. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23197	2025-03-14 15:26:48 +02:00
Botond Dénes	68b2ac541c	Merge 'streaming: fix the way a reason of streaming failure is determined' from Aleksandra Martyniuk During streaming, the receiving node gets and processes mutation fragments. If this operation fails, the receiver responds with a -1 status code, unless it failed due to no_such_column_family, in which case streaming of this table should be skipped. However, when the table was dropped, an exception handler on the receiver side may get not only data_dictionary::no_such_column_family, but also seastar::nested_exception of two no_such_column_family. Encountered example: ``` ERROR 2025-02-12 15:20:51,508 [shard 0:strm] stream_session - [Stream #f1cd6830-e954-11ef-afd9-b022e40bf72d] Failed to handle STREAM_MUTATION_FRAGMENTS (receive and distribute phase) for ks=ks, cf=cf, peer=756dd3fe-2bf0-4dcd-afbc-cfd5202669a0: seastar::nested_exception: data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14) (while cleaning up after data_dictionary::no_such_column_family (Can't find a column family with UUID ef9b1ee0-e954-11ef-ba4a-faf17acf4e14)) ``` In this case, the exception does not match the try_catch<data_dictionary::no_such_column_family> clause and gets handled the same as any other exception type. Replace the try_catch clause with table_sync_and_check, which synchronizes the schema and checks whether the table exists. Fixes: https://github.com/scylladb/scylladb/issues/22834. Needs backport to all live versions, as they all contain the bug Closes scylladb/scylladb#22868 * github.com:scylladb/scylladb: streaming: fix the way a reason of streaming failure is determined streaming: save a continuation lambda streaming: use streaming namespace in table_check.{cc,hh} repair: streaming: move table_check.{cc,hh} to streaming	2025-03-14 07:25:00 +02:00
Gleb Natapov	48a1030c91	treewide: use host id directly in endpoint state change subscribers Now that we have host ids in endpoint state change subscribers, some of them can be simplified by using the id directly instead of looking it up by ip.	2025-03-11 12:09:22 +02:00

762 Commits