scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-31 03:56:42 +00:00

Author	SHA1	Message	Date
Asias He	faad0167d7	repair: Add tablet repair progress report support This patch adds tablet repair progress report support so that the user could use the /task_manager/task_status API to query the progress. In order to support this, a new system table is introduced to record the user request related info, i.e, start of the request and end of the request. The progress is accurate when tablet split or merge happens in the middle of the request, since the tokens of the tablet are recorded when the request is started and when repair of each tablet is finished. The original tablet repair is considered as finished when the finished ranges cover the original tablet token ranges. After this patch, the /task_manager/task_status API will report correct progress_total and progress_completed. Fixes #22564 Fixes #26896 Closes scylladb/scylladb#26924	2025-12-08 13:35:19 +02:00
Pavel Emelyanov	8192f45e84	Merge 'Add option to use sstable identifier in snapshot' from Benny Halevy This change adds a new option to the REST API and, correspondingly, to scylla nodetool: use_sstable_identifier. When set, we use the sstable identifier, if available, to name each sstable in the snapshots directory and the manifest.json file, rather than using the sstable generation. This can be used by the user (e.g. Scylla Manager) for global deduplication with tablets, where an sstable may be migrated across shards or across nodes, and in this case, its generation may change, but its sstable identifier remains stable. Currently, Scylla Manager uses the sstable generation to detect sstables that are already backed up to object storage and exist in previous backed up snapshots. Historically, the sstable generation was guaranteed to be unique only per table per node, so the dedup code currently checks for deduplication in the node scope. However, with tablet migration, sstables are renamed when migrated to a different shard, i.e. their generation changes, and they may be renamed when migrated to another node, but even if they are not, the dedup logic still assumes uniqueness only within a node. To address both cases, we keep the sstable_id stable throughout the sstable life cycle (since `3a12ad96c7`). Given the globally unique sstable identifier, Scylla Manager can now detect duplicate sstables in a wider scope. This can be cluster-wide, but we practically need only rack-wide or dc-wide deduplication, as tablets are migrated across racks only on rare occasions (like when converting from a numerical replication factor to a rack list containing a subset of the available racks in a datacenter). 
Fixes #27181 * New feature, no backport required Closes scylladb/scylladb#27184 * github.com:scylladb/scylladb: database: truncate_table_on_all_shards: set use_sstable_identifier to true nodetool: snapshot: add --use-sstable-identifier option api: storage_service: take_snapshot: add use_sstable_identifier option test: database_test: add snapshot_use_sstable_identifier_works test: database_test: snapshot_works: add validate_manifest sstable: write_scylla_metadata: add random_sstable_identifier error injection table: snapshot_on_all_shards: take snapshot_options sstable: add get_format getter sstable: snapshot: add use_sstable_identifier option db: snapshot_ctl: snapshot_options: add use_sstable_identifier options db: snapshot_ctl: move skip_flush to struct snapshot_options	2025-12-08 12:56:12 +03:00
Tomasz Grabiec	082342ecad	Attach names to allocating sections for better debuggability Large reserves in allocating_section can cause stalls. We already log reserve increase, but we don't know which table it belongs to: lsa - LSA allocation failure, increasing reserve in section 0x600009f94590 to 128 segments; Allocating sections used for updating row cache on memtable flush are notoriously problematic. Each table has its own row_cache, so its own allocating_section(s). If we attached table name to those sections, we could identify which table is causing problems. In some issues we suspected system.raft, but we can't be sure. This patch allows naming allocating_sections for the purpose of identifying them in such log messages. I use abstract_formatter for this purpose to avoid the cost of formatting strings on the hot path (e.g. index_reader). And also to avoid duplicating strings which are already stored elsewhere. Fixes #25799 Closes scylladb/scylladb#27470	2025-12-07 14:14:25 +02:00
Avi Kivity	47efbdffbc	Merge 'cache, mvcc: Preempt cache update when applying range tombstone from memtable' from Tomasz Grabiec Range tombstones are represented as entry attributes, which applies to the interval between entries. So if a range tombstone covers many rows, to apply it we have to update all covered entries. In some workloads that could be many entries, even the whole cache. Before the patch, we did this update without preemption, which can cause reactor stalls in such workloads. This scenario is already covered by mvcc_tests, e.g. test_apply_to_incomplete_respects_continuity. And I verified that the new preemption point is hit in the test. perf-row-cache-update results show no significant stalls anymore (max 2ms scheduling delay, instead of previous 1.5 s): Generated 1124195 rows Memtable fill took 4179.457520 [ms], {count: 8295, 99%: 0.654949 [ms], max: 32.817176 [ms]} Draining... took 0.000616 [ms] cache: 2506/2948 [MB], memtable: 781/1024 [MB], alloc/comp: 1051/662 [MB] (amp: 0.630) update: 2874.157471 [ms], preemption: {count: 26650, 99%: 1.131752 [ms], max: 2.068762 [ms]}, cache: 3027/3973 [MB], alloc/comp: 3951/2424 [MB] (amp: 0.614), pr/me/dr 1124195/0/0 Fixes #23479 Fixes #2578 Closes scylladb/scylladb#27469 * github.com:scylladb/scylladb: cache, mvcc: Preempt cache update when applying range tombstone from memtable partition_snapshot_row_cursor: Clarify non-obvious semantic difference of range_tombstone() perf-row-cache-update: Add scenario with large tombstone covering many rows	2025-12-07 11:54:15 +02:00
Tomasz Grabiec	d4014b7970	Drop legacy schema support We switched to using v3 schema tables (in system_schema keyspace) in 2017, in `9eb91bc30b`. So no system should have the old schema any more. No need to run legacy_schema_migrator on boot. Closes scylladb/scylladb#27420	2025-12-07 00:09:13 +02:00
Tomasz Grabiec	e546143fd9	partition_snapshot_row_cursor: Clarify non-obvious semantic difference of range_tombstone()	2025-12-06 01:03:10 +01:00
Botond Dénes	9d2f7c3f52	Merge 'mv: allow setting concurrency in PRUNE MATERIALIZED VIEW' from Wojciech Mitros The PRUNE MATERIALIZED VIEW statement is performed as follows: 1. Perform a range scan of the view table from the view replicas based on the ranges specified in the statement. 2. While reading the paged scan above, for each view row perform a read from all base replicas at the corresponding primary key. If a discrepancy is detected, delete the row in the view table. When reading multiple rows, this is very slow because for each view row we need to perform a single-row query on multiple replicas. In this patch we add an option to speed this up by performing many of the single base row reads concurrently, at the concurrency specified in the USING CONCURRENCY clause. Aside from the unit test, I checked manually on a 3-node cluster with 10M rows, using vnodes. There were actually no ghost rows in the test, but we still had to iterate over all view rows and read the corresponding base rows. And actual ghost rows, if there are any, should be a tiny fraction of all rows. I compared concurrencies 1, 2, 10, 100 and the results were: * Pruning with concurrency 1 took total 1416 seconds * Pruning with concurrency 2 took total 731 seconds * Pruning with concurrency 10 took total 234 seconds * Pruning with concurrency 100 took total 171 seconds So after a concurrency of 10 or so we're hitting diminishing returns (at least in this setup). At that point we may be no longer bottlenecked by the reads, but by CPU on the shard that's handling the PRUNE. Fixes https://github.com/scylladb/scylladb/issues/27070 Closes scylladb/scylladb#27097 * github.com:scylladb/scylladb: mv: allow setting concurrency in PRUNE MATERIALIZED VIEW cql: add CONCURRENCY to the USING clause	2025-12-04 11:47:41 +02:00
Benny Halevy	1c45ad7cee	db: snapshot_ctl: snapshot_options: add use_sstable_identifier options To be used for naming sstables in the snapshot by their sstable identifiers rather than their generation, to facilitate global deduplication of sstables in backup. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-12-04 09:46:35 +02:00
Benny Halevy	c18133b6cb	db: snapshot_ctl: move skip_flush to struct snapshot_options Prepare for adding another option: use_sstable_identifier. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-12-04 09:46:35 +02:00
Szymon Wasik	4f803aad22	Improve documentation of vector search configuration parameters. This patch adds separate group for vector search parameters in the documentation and fixes small typos and formatting. Fixes: SCYLLADB-77. Closes scylladb/scylladb#27385	2025-12-03 21:02:59 +02:00
Piotr Dulikowski	654ac9099b	db/view/view_building_coordinator: skip work if no view is built Even though `view_building_coordinator::work_on_view_building` has an `if` at the very beginning which checks whether the currently processed base table is set, it only prints a message and continues executing the rest of the function regardless of the result of the check. However, some of the logic in the function assumes that the currently processed base table field is set and tries to access the value of the field. This can lead to the view building coordinator accessing a disengaged optional, which is undefined behavior. Fix the function by adding the clearly missing `co_await` to the check. A regression test is added which checks that the view building state observer - a different fiber which used to print a weird message due to erroneous view building coordinator behavior - does not print a warning. Fixes: scylladb/scylladb#27363 Closes scylladb/scylladb#27373	2025-12-03 09:44:28 +02:00
Piotr Dulikowski	44c605e59c	Merge 'Fix the types of change events in Alternator Streams' from Piotr Wieczorek This patch increases the compatibility with DynamoDB Streams by integrating the DynamoDB's event type rules (described in https://github.com/scylladb/scylladb/issues/6918) into Alternator. The main changes are: - introduce a new flag `alternator_streams_strict_compatibility`, meant as a guard of performance-intensive operations that increase the compatibility with DynamoDB Streams. If enabled, Alternator always performs a RBW before a data-modifying operation, and propagates its result to CDC. Then, the old item is compared to the new one, to determine the mutation type (INSERT vs MODIFY). This option is a no-op for tables with disabled Alternator Streams, - reduce splitting of simple Alternator mutations, - correctly distinguish event types described in #6918, except for item deletes. Deleting a missing item with DeleteItem, BatchWriteItem, or a missing field with UpdateItem still emit REMOVEs. To summarize, the emitted events of the data manipulation operations should be as follows: - DeleteItem/BatchWriteItem.DeleteItem of existing item: REMOVE (OK) - DeleteItem of nonexistent item: nothing (OK) - BatchWriteItem.DeleteItem of nonexistent item: nothing (OK) - PutItem/UpdateItem/BatchWriteItem.PutItem of existing and not equal item: MODIFY (OK) - PutItem/UpdateItem/BatchWriteItem.PutItem of existing and equal item: nothing (OK) - PutItem/UpdateItem/BatchWriteItem.PutItem of nonexistent item: INSERT (OK) No backport is necessary. 
Refs https://github.com/scylladb/scylladb/pull/26149 Refs https://github.com/scylladb/scylladb/pull/26396 Refs https://github.com/scylladb/scylladb/issues/26382 Fixes https://github.com/scylladb/scylladb/issues/6918 Closes scylladb/scylladb#26121 * github.com:scylladb/scylladb: test/alternator: Enable the tests failing because of #6918 alternator, cdc: Don't emit events for no-op removes alternator, cdc: Don't emit an event for equal items alternator/streams, cdc: Differentiate item replace and item update in CDC alternator: Change the return type of rmw_operation_return config: Add alternator_streams_strict_compatibility flag cdc: Don't split a row marker away from row cells	2025-11-30 07:20:22 +01:00
Dawid Mędrek	48a28c24c5	db/commitlog: Include position and alignment information in errors When we come across a segment truncation, this information may be helpful to determine when the error occurred exactly and hint at what code path might've led to it. Closes scylladb/scylladb#27207	2025-11-28 15:28:08 +03:00
Calle Wilund	59c87025d1	commitlog::read_log_file: Check for eof position on all data reads Fixes #24346 When reading, we check for each entry and each chunk, if advancing there will hit EOF of the segment. However, IFF the last chunk being read has the last entry _exactly_ matching the chunk size, and the chunk ending at _exactly_ segment size (preset size, typically 32Mb), we did not check the position, and instead complained about not being able to read. This has literally _never_ happened in actual commitlog (that was replayed at least), but has apparently happened more and more in hints replay. Fix is simple, just check the file position against size when advancing said position, i.e. when reading (skipping already does). v2: * Added unit test Closes scylladb/scylladb#27236	2025-11-28 15:26:46 +03:00
Wojciech Mitros	323e5cd171	mv: allow setting concurrency in PRUNE MATERIALIZED VIEW The PRUNE MATERIALIZED VIEW statement is performed as follows: 1. Perform a range scan of the view table from the view replicas based on the ranges specified in the statement. 2. While reading the paged scan above, for each view row perform a read from all base replicas at the corresponding primary key. If a discrepancy is detected, delete the row in the view table. When reading multiple rows, this is very slow because for each view row we need to perform a single-row query on multiple replicas. In this patch we add an option to speed this up by performing many of the single base row reads concurrently, at the concurrency specified in the USING CONCURRENCY clause. Fixes https://github.com/scylladb/scylladb/issues/27070	2025-11-27 00:02:28 +01:00
Nadav Har'El	9cde93e3da	Merge 'db/view/view_building_coordinator: get rid of task's state in group0' from Michał Jadwiszczak Previously, the view building coordinator relied on setting each task's state to STARTED and then explicitly removing these state entries once tasks finished, before scheduling new ones. This approach induced a significant number of group0 commits, particularly in large clusters with many nodes and tablets, negatively impacting performance and scalability. With the update, the coordinator and worker logic has been restructured to operate without maintaining per-task states. Instead, tasks are simply tracked with an aborted boolean flag, which is still essential for certain tablet operations. This change removes much of the coordination complexity, simplifies the view building code, and reduces operational overhead. In addition, the coordinator now batches reports of finished tasks before making commits. Rather than committing task completions individually, it aggregates them and reports in groups, significantly minimizing the frequency of group0 commits. This new approach is expected to improve efficiency and scalability during materialized view construction, especially in large deployments. Fixes https://github.com/scylladb/scylladb/issues/26311 This patch needs to be backported to 2025.4. Closes scylladb/scylladb#26897 * github.com:scylladb/scylladb: docs/dev/view-building-coordinator: update the docs after recent changes db/view/view_building: send coordinator's term in the RPC db/view/view_building_state: replace task's state with `aborted` flag db/view/view_building_coordinator: batch finished tasks reporting db/view/view_building_worker: change internal implementation db/view/view_building_coordinator: change `work_on_tasks` RPC return type	2025-11-26 11:35:44 +02:00
Botond Dénes	384bffb8da	Merge 'compaction: limit the maximum shares allocated to a compaction scheduling class' from Raphael Raph Carvalho This PR adds support for limiting the maximum shares allocated to a compaction scheduling class by the compaction controller. It introduces a new configuration parameter, compaction_max_shares, which, when set to a non zero value, will cap the shares allocated to compaction jobs. This PR also exposes the shares computed by the compaction controller via metrics, for observability purposes. Fixes https://github.com/scylladb/scylladb/issues/9431 Enhancement. No need to backport. NOTE: Replaces PR https://github.com/scylladb/scylladb/pull/26696 Ran a test in which the backlog raised the need for max shares (normalized backlog above normalization_factor), and played with different values for new option compaction_max_shares to see it works (500, 1000, 2000, 250, 50) Closes scylladb/scylladb#27024 * github.com:scylladb/scylladb: db/config: introduce new config parameter `compaction_max_shares` compaction_manager:config: introduce max_shares compaction_controller: add configurable maximum shares compaction_controller: introduce `set_max_shares()`	2025-11-26 06:51:30 +02:00
Michał Jadwiszczak	fb8cbf1615	db/view/view_building: send coordinator's term in the RPC To avoid case when an old coordinator (which hasn't been stopped yet) dictates what should be done, add raft term to the `work_on_view_building_tasks` RPC. The worker needs to check if the term matches the current term from raft server, and deny the request when the term is bad.	2025-11-25 12:14:05 +01:00
Michał Jadwiszczak	24d69b4005	db/view/view_building_state: replace task's state with `aborted` flag After previous commits, we can drop entire task's state and replace it with single boolean flag, which determines if a task was aborted. Once a task was aborted, it cannot get resurrected to a normal state.	2025-11-25 12:14:04 +01:00
Michał Jadwiszczak	eb04af5020	db/view/view_building_coordinator: batch finished tasks reporting In the previous implementation, to execute view building tasks the coordinator first had to set their states to `STARTED`, and then had to remove them before it could start the next ones. This logic required a lot of group0 commits, especially in large clusters with a higher number of nodes and a big tablet count. After the previous commit to the view building worker, the coordinator can start view building tasks without setting the `STARTED` state and deleting finished tasks. This patch adjusts the coordinator to save finished tasks locally, so it can continue to execute the next ones, and the finished tasks are periodically removed from group0 by `finished_task_gc_fiber()`.	2025-11-25 12:14:04 +01:00
Lakshmi Narayanan Sreethar	9cb766f929	db/config: introduce new config parameter `compaction_max_shares` Add support for the new configuration parameter `compaction_max_shares`, and update the compaction manager to pass it down to the compaction controller when it changes. The shares allocated to compaction jobs will be limited by this new parameter. Fixes #9431 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-11-24 12:52:29 -03:00
Tomasz Grabiec	d4b77c422f	Merge 'load_stats: leaving replica could be std::nullopt' from Ferenc Szili When migrating tablet size during the end_migration tablet transition stage, we need the pending and leaving replica hosts. The leaving and pending replicas are gathered in objects of type std::optional<tablet_replica> and are not checked for a value before being dereferenced, which could cause an exception in the topology coordinator. This patch adds a check for leaving and pending replicas, and only performs the tablet size migration if neither is empty. This bug was introduced in `10f07fb95a`. This change also adds the ability to create a tablet size in load_stats during the end_migration stage of a tablet rebuild. We compute the new tablet size by averaging the tablet sizes of the existing replicas. This change also adds the virtual table tablet_sizes which contains tablet sizes of all the replicas of all the tablets in the cluster. A version containing this bug has not yet been released, so a backport is not needed. Closes scylladb/scylladb#27118 * github.com:scylladb/scylladb: test: add tests for tablet size migration during end_migration virtual_table: add tablet_sizes virtual table load_stats: update tablet sizes after migration or rebuild	2025-11-24 15:31:30 +01:00
Michał Jadwiszczak	08974e1d50	db/view/view_building_worker: change internal implementation This commit doesn't change the logic behind the view building worker, but it changes how the worker executes view building tasks. Previously, the worker had a state only on shard0 and it was reacting to changes in group0 state. When it noticed some tasks were moved to the `STARTED` state, the worker was creating a batch for them on the shard0 state. The RPC call was used only to start the batch and to get its result. Now, the main logic of batch management was moved to the RPC call handler. The worker has a local state on each shard and the state contains: - unique ptr to the batch - set of completed tasks - information for which views the base table was flushed So currently, each batch lives exclusively on the shard where it has its work to do. This eliminates the need for synchronization between shard0 and the work shard, which was a pain point in the previous implementation. The worker still reacts to changes in group0 view building state, but currently it's only used to observe whether any view building task was aborted by setting the `ABORTED` state. To prepare for further changes to drop the view building task state, the worker ignores `IDLE` and `STARTED` states completely.	2025-11-24 11:12:31 +01:00
Michał Jadwiszczak	6d853c8f11	db/view/view_building_coordinator: change `work_on_tasks` RPC return type During the initial implementation of the view building coordinator, we decided that if a view building task fails locally on the worker (example reason: view update's target replica is not available), the worker will retry this work instead of reporting a failure to the coordinator. However, we left the return type of the RPC, which was telling if a task was finished successfully or aborted. But the worker doesn't need to report that a task was aborted, because it's the coordinator who decides to abort a task. So, this commit changes the return type to a list of UUIDs of completed tasks. Previously, the length of the returned vector needed to be the same as the length of the vector sent in the request. Now we can drop this restriction and the RPC handler returns a list of UUIDs of completed tasks (a subset of the vector sent in the request). This change is required to drop the `STARTED` state in next commits. Since Scylla 2025.4 wasn't released yet and we're going to merge this patch before releasing, no RPC versioning or cluster feature is needed.	2025-11-24 11:12:29 +01:00
Karol Nowacki	c40b3ba4b3	vector_search: Add HTTPS support for vector store connections This commit introduces TLS encryption support for vector store connections. A new configuration option is added: - vector_store_encryption_options.truststore: path to the trust store file To enable secure connections, use the https:// scheme in the vector_store_primary_uri/vector_store_secondary_uri configuration options. Fixes: VECTOR-327	2025-11-22 08:18:45 +01:00
Ferenc Szili	e96863be0c	virtual_table: add tablet_sizes virtual table This change adds the tablet_sizes virtual table. The contents of this table are gathered from the current load_stats data structure.	2025-11-21 16:53:28 +01:00
Gautam Menghani	939fcc0603	db/system_keyspace: Remove the FIXME related to caching of large tables Remove the FIXME comment for re-enabling caching of the large tables since the tables are used infrequently [1]. [1] : github.com/scylladb/scylladb/pull/26789#issuecomment-3477540364 Fixes #26032 Signed-off-by: Gautam Menghani <gautam.opensource@gmail.com> Closes scylladb/scylladb#26789	2025-11-21 12:34:34 +02:00
Radosław Cybulski	d589e68642	Add precompiled headers to CMakeLists.txt Add precompiled header support to CMakeLists.txt and configure.py - it improves compilation time by approximately 10%. New header `stdafx.hh` is added, don't include it manually - the compiler will include it for you. The header contains includes from external libraries used by Scylla - seastar, standard library, linux headers and zlib. The feature is enabled by default, use CMake option `Scylla_USE_PRECOMPILED_HEADER` or configure.py --disable-precompiled-header to disable. The feature should be disabled, when trying to check headers - otherwise you might get false negatives on missing includes from seastar / abseil and so on. Note: following configuration needs to be added to ccache.conf: sloppiness = pch_defines,time_macros,include_file_mtime,include_file_ctime Closes scylladb/scylladb#26617	2025-11-21 12:27:41 +02:00
Calle Wilund	3c4546d839	messaging_service: Add internode_compression=rack as option Fixes #27085 Adds a "rack" option to enum/config and handles in connection setup in messaging_service. Closes scylladb/scylladb#27099	2025-11-21 11:50:55 +02:00
Karol Nowacki	104de44a8d	vector_search: Add support for secondary vector store clients This change adds support for secondary vector store clients, typically located in different availability zones. Secondary clients serve as fallback targets when all primary clients are unavailable. New configuration option allows specifying secondary client addresses and ports. Fixes: VECTOR-187 Closes scylladb/scylladb#26484	2025-11-20 08:37:18 +01:00
Botond Dénes	6ee0f1f3a7	Merge 'replica/table: add a metric for hypothetical total file size without compression' from Michał Chojnowski This patch adds a metric for pre-compression size of sstable files. This patch adds a per-table metric `scylla_column_family_total_disk_space_before_compression`, which measures the hypothetical total size of sstables on disk, if Data.db was replaced with an uncompressed equivalent. As for the implementation: Before the patch, tables and sstable sets are already tracking their total physical file size. Whenever sstables are added or removed, the size delta is propagated from the sstable up through sstable sets into table_stats. To implement the new metric, we turn the size delta that is getting passed around from a one-dimensional to a two-dimensional value, which includes both the physical and the pre-compression size. New functionality, no backport needed. Closes scylladb/scylladb#26996 * github.com:scylladb/scylladb: replica/table: add a metric for hypothetical total file size without compression replica/table: keep track of total pre-compression file size	2025-11-20 09:10:38 +02:00
Botond Dénes	2ca66133a4	Revert "db/config: don't use RBNO for scaling" This reverts commit `43738298be`. This commit causes instability in dtests. Several non-gating dtests started failing, as well as some gating ones, see #27047. Closes scylladb/scylladb#27067 Fixes #27047	2025-11-18 08:17:17 +02:00
Botond Dénes	514c1fc719	Merge 'db: batchlog_manager: update _last_replay only if all batches were re…' from Aleksandra Martyniuk …played Currently, if flushing hints falls within the repair cache timeout, then the flush_time is set to batchlog_manager::_last_replay. _last_replay is updated on each replay, even if some batches weren't replayed. Due to that, we risk the data resurrection. Update _last_replay only if all batches were replayed. Fixes: https://github.com/scylladb/scylladb/issues/24415. Needs backport to all live versions. Closes scylladb/scylladb#26793 * github.com:scylladb/scylladb: test: extend test_batchlog_replay_failure_during_repair db: batchlog_manager: update _last_replay only if all batches were replayed	2025-11-18 08:17:16 +02:00
Piotr Dulikowski	f0039381d2	Merge 'db/view/view_building_worker: support staging sstables intra-node migration and tablet merge' from Michał Jadwiszczak This PR fixes staging sstables handling by the view building coordinator in case of intra-node tablet migration or tablet merge. To support tablet merge, the worker stores the sstables grouped only by `table_id`, instead of the `(table_id, last_token)` pair. There shouldn't be that many staging sstables, so selecting the relevant ones for each `process_staging` task is fine. For the intra-node migration support, the patch adds methods to load migrated sstables on the destination shard and to clean them up on the source shard. The patch should be backported to 2025.4 Fixes https://github.com/scylladb/scylladb/issues/26244 Closes scylladb/scylladb#26454 * github.com:scylladb/scylladb: service/storage_service: migrate staging sstables in view building worker during intra-node migration db/view/view_building_worker: support sstables intra-node migration db/view_building_worker: fix indent db/view/view_building_worker: don't organize staging sstables by last token	2025-11-17 08:53:19 +01:00
Aleksandra Martyniuk	e3dcb7e827	test: extend test_batchlog_replay_failure_during_repair Modify test_batchlog_replay_failure_during_repair to also check that there isn't data resurrection if flushing hints falls within the repair cache timeout.	2025-11-14 14:18:07 +01:00
Patryk Jędrzejczak	1141342c4f	Merge 'topology: refactor excluded nodes' from Petr Gusev This PR refactors excluded nodes handling for tablets and topology. For tablets, a dedicated variable `topology::excluded_tablet_nodes` is introduced; for topology operations, the method get_excluded_nodes() is inlined into topology_coordinator and renamed to `get_excluded_nodes_for_topology_request`. The PR improves code readability and efficiency, no behavior changes. backport: this is a refactoring/optimization, no need to backport Closes scylladb/scylladb#26907 * https://github.com/scylladb/scylladb: topology_coordinator: drop unused exec_global_command overload topology_coordinator: rename get_excluded_nodes -> get_excluded_nodes_for_topology_request topology_state_machine: inline get_excluded_nodes messaging_service: simplify and optimize ban_host storage_service: topology_state_load: extract topology variable topology_coordinator: excluded_tablet_nodes -> ignored_nodes topology_state_machine: add excluded_tablet_nodes field	2025-11-14 11:52:00 +01:00
Botond Dénes	43738298be	db/config: don't use RBNO for scaling Remove bootstrap and decommission from allowed_repair_based_node_ops. Using RBNO over streaming for these operations has no benefits, as they are not exposed to the out-of-date replica problem that replace, removenode and rebuild are. On top of that, RBNO is known to have problems with empty user tables. Using streaming for bootstrap and decommission is safe and faster than RBNO in all conditions, especially when the table is small. One test needs adjustment as it relies on RBNO being used for all node ops. Fixes: #24664 Closes scylladb/scylladb#26330	2025-11-14 13:03:50 +03:00
Piotr Dulikowski	43506e5f28	Merge 'db/view: Add backoff when RPC fails' from Dawid Mędrek The view building coordinator manages the process by sending RPC requests to all nodes in the cluster, instructing them what to do. If processing that message fails, the coordinator decides if it wants to retry it or (temporarily) abandon the work. An example of the latter scenario could be if one of the target nodes dies and any attempts to communicate with it would fail. Unfortunately, the current approach to it is not perfect and may result in a storm of warnings, effectively clogging the logs. As an example, take a look at scylladb/scylladb#26686: the gossiper failed to mark one of the dead nodes as DOWN fast enough, and it resulted in a warning storm. To prevent situations like that, we implement a form of backoff. If processing an RPC message fails, we postpone finishing the task for a second. That should reduce the number of messages in the logs and avoid retries that are likely to fail as well. We provide a reproducer test. Fixes scylladb/scylladb#26686 Backport: impact on the user. We should backport it to 2025.4. Closes scylladb/scylladb#26729 * github.com:scylladb/scylladb: tet/cluster/mv: Clean up test_backoff_when_node_fails_task_rpc db/view/view_building_coordinator: Rate limit logging failed RPC db/view: Add backoff when RPC fails	2025-11-14 10:17:57 +01:00
Dawid Mędrek	acd9120181	db/view/view_building_coordinator: Rate limit logging failed RPC The view building coordinator sends tasks in the form of RPC messages to other nodes in the cluster. If processing that RPC fails, the coordinator logs the error. However, since tasks are per replica (so per shard), it may happen that we end up with a large number of similar messages, e.g. if the target node has died, because every shard will fail to process its RPC message. It might become even worse in the case of a network partition. To mitigate that, we rate limit the logging with a 1-second interval. We extend the test `test_backoff_when_node_fails_task_rpc` so that it allows the view building coordinator to have multiple tablet replica targets. If not for rate limiting the warning messages, we should start getting more of them, potentially leading to a test failure.	2025-11-13 17:57:23 +01:00
Dawid Mędrek	4a5b1ab40a	db/view: Add backoff when RPC fails The view building coordinator manages the process of view building by sending RPC requests to all nodes in the cluster, instructing them what to do. If processing that message fails, the coordinator decides if it wants to retry it or (temporarily) abandon the work. An example of the latter scenario could be if one of the target nodes dies and any attempts to communicate with it would fail. Unfortunately, the current approach to it is not perfect and may result in a storm of warnings, effectively clogging the logs. As an example, take a look at scylladb/scylladb#26686: the gossiper failed to mark one of the dead nodes as DOWN fast enough, and it resulted in a warning storm. To prevent situations like that, we implement a form of backoff. If processing an RPC message fails, we postpone finishing the task for a second. That should reduce the number of messages in the logs and avoid retries that are likely to fail as well. We provide a reproducer test: it fails before this commit and succeeds with it. Fixes scylladb/scylladb#26686	2025-11-13 17:55:41 +01:00
Aleksandra Martyniuk	4d0de1126f	db: batchlog_manager: update _last_replay only if all batches were replayed Currently, if flushing hints falls within the repair cache timeout, then the flush_time is set to batchlog_manager::_last_replay. _last_replay is updated on each replay, even if some batches weren't replayed. Due to that, we risk data resurrection. Update _last_replay only if all batches were replayed. Fixes: https://github.com/scylladb/scylladb/issues/24415.	2025-11-13 10:40:19 +01:00
Piotr Dulikowski	2e5eb92f21	Merge 'cdc: use CDC schema that is compatible with the base schema' from Michael Litvak When generating CDC log mutations for some base mutation, use a CDC schema that is compatible with the base schema. The compatible CDC schema has for every base column a corresponding CDC column with the same name. If using a non-compatible schema, we may encounter a situation, especially during ALTER, that we have a mutation with a base column set with some value, but the CDC schema doesn't have a column by that name. This would cause the user request to fail with an error. We add to the schema object a schema_ptr that for CDC-enabled tables points to the schema object of the CDC table that is compatible with the schema. It is set by the schema merge algorithm when creating the schema for a table that is created or altered. We use the fact that a base table and its CDC table are created and altered in the same group0 operation, and this way we can find and set the cdc schema for a base table. When transporting the base schema as a frozen schema between shards, we transport with it the frozen cdc schema as well. The patch starts with a series of refactoring commits that make extending the frozen schema easier and cleans up some duplication in the code about the frozen schema. We combine the two types `frozen_schema_with_base_info` and `view_schema_and_base_info` to a single type `extended_frozen_schema` that holds a frozen schema with additional data that is not part of the schema mutations but needs to be transported with it to unfreeze it - base_info, and the frozen cdc schema which is added in a later commit. 
Fixes https://github.com/scylladb/scylladb/issues/26405 backport not needed - enhancement Closes scylladb/scylladb#24960 * github.com:scylladb/scylladb: test: cdc: test cdc compatible schema cdc: use compatiable cdc schema db: schema_applier: create schema with pointer to CDC schema db: schema_applier: extract cdc tables schema: add pointer to CDC schema schema_registry: remove base_info from global_schema_ptr schema_registry: use extended_frozen_schema in schema load schema_registry: replace frozen_schema+base_info with extended_frozen_schema frozen_schema: extract info from schema_ptr in the constructor frozen_schema: rename frozen_schema_with_base_info to extended_frozen_schema	2025-11-13 10:11:54 +01:00
Michał Chojnowski	1cfce430f1	replica/table: keep track of total pre-compression file size Every table and sstable set keeps track of the total file size of contained sstables. Due to a feature request, we also want to keep track of the hypothetical file size if Data files were uncompressed, to add a metric that shows the compression ratio of sstables. We achieve this by replacing the relevant `uint64_t bytes_on_disk` counters everywhere with a struct that contains both the actual (post-compression) size and the hypothetical pre-compression size. This patch isn't supposed to change any observable behavior. In the next patch, we will use these changes to add a new metric.	2025-11-13 00:49:57 +01:00
Petr Gusev	82da83d0e5	topology_state_machine: add excluded_tablet_nodes field The topology_coordinator::is_excluded() creates a temporary hash map for each call. This is probably not a performance problem since left_nodes_rs contains only those left nodes that are referenced from tablet replicas, this happens temporarily while e.g. a replaced node is being rebuilt. On the other hand, why not just have a dedicated field in the topology_state_machine, then this code wouldn't look suspicious.	2025-11-12 12:27:43 +01:00
Michał Jadwiszczak	4bc6361766	db/view/view_building_worker: support sstables intra-node migration We need to be able to load sstables on the target shard during intra-node tablet migration and to cleanup migrated sstables on the source shard.	2025-11-10 10:36:32 +01:00
Michał Jadwiszczak	c99231c4c2	db/view_building_worker: fix indent	2025-11-10 09:02:16 +01:00
Michał Jadwiszczak	2e8c096930	db/view/view_building_worker: don't organize staging sstables by last token There was a problem with staging sstables after tablet merge. Let's say there were 2 tablets and tablet 1 (lower last token) had a staging sstable. Then a tablet merge occurred, so there is only one tablet now (higher last token). But entries in `_staging_sstables`, which are grouped by last token, are never adjusted. Since there shouldn't be thousands of sstables, we can just hold a list of sstables per table and filter necessary entries when doing the `process_staging` view building task.	2025-11-10 09:02:16 +01:00
Nadav Har'El	25439127c8	config: make tablets_mode_for_new_keyspaces live-updatable We have a configuration option "tablets_mode_for_new_keyspaces" which determines whether new keyspaces should use tablets or vnodes. For some reason, this configuration parameter was never marked live-updatable, so in this patch we add the flag. No other changes are needed - the existing code that uses this flag always uses it through the up-to-date configuration. In the previous patches we started to honor tablets_mode_for_new_keyspaces also in Alternator CreateTable, and we wanted to test this but couldn't do this in test/alternator because the option was not live-updatable. Now that it will be, we'll be able to test this feature in test/alternator. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-11-09 12:52:29 +02:00
Piotr Szymaniak	af00b59930	Fix incorrect hint for tablets_mode_for_new_keyspaces	2025-11-09 10:49:46 +02:00
Wojciech Mitros	0a22ac3c9e	mv: don't mark the view as built if the reader produced no partitions When we build a materialized view we read the entire base table from start to end to generate all required view updates. If a view is created while another view is being built on the same base table, this is optimized - we start generating view updates for the new view from the base table rows that we're currently reading, and we read the missed initial range again after the previous view finishes building. The view building progress is only updated after generating view updates for some read partitions. However, there are scenarios where we'll generate no view updates for the entire read range. If this was not handled we could end up in an infinite view building loop like we did in https://github.com/scylladb/scylladb/issues/17293 To handle this, we mark the view as built if the reader generated no partitions. However, this is not always the correct conclusion. Another scenario where the reader won't encounter any partitions is when view building is interrupted, and then we perform a reshard. In this scenario, we set the reader for all shards to the last unbuilt token for an existing partition before the reshard. However, this partition may not exist on a shard after reshard, and if there are also no partitions with higher tokens, the reader will generate no partitions even though it hasn't finished view building. Additionally, we already have a check that prevents infinite view building loops without taking the partitions generated by the reader into account. At the end of stream, before looping back to the start, we advance current_key to the end of the built range and check for built views in that range. This handles the case where the entire range is empty - the conditions for a built view are: 1. the "next_token" is no greater than "first_token" (the view building process looped back, so we've built all tokens above "first_token") 2. 
the "current_token" is no less than "first_token" (after looping back, we've built all tokens below "first_token") If the range is empty, we'll pass these conditions on an empty range after advancing "current_key" to the end because: 1. after looping back, "next_token" will be set to `dht::minimum_token` 2. "current_key" will be set to `dht::ring_position::max()` In this patch we remove the check for partitions generated by the reader. This fixes the issue with resharding and it does not resurrect the issue with infinite view building that the check was introduced for. Fixes https://github.com/scylladb/scylladb/issues/26523 Closes scylladb/scylladb#26635	2025-11-05 17:02:32 +02:00
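The two built-view conditions described in commit 0a22ac3c9e can be sketched roughly as follows. This is only an illustration, not ScyllaDB's code: the `token` alias, the `minimum_token`/`maximum_token` constants, the `build_progress` struct and the `is_view_built` function are all hypothetical stand-ins, with plain integers replacing `dht::token` and `dht::ring_position`.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical stand-ins: plain integers instead of ScyllaDB's token types.
using token = std::int64_t;
constexpr token minimum_token = std::numeric_limits<token>::min();
constexpr token maximum_token = std::numeric_limits<token>::max();

// Hypothetical snapshot of one view's build progress on a shard.
struct build_progress {
    token first_token;    // token at which building of this view started
    token next_token;     // next token the reader is positioned at
    token current_token;  // end of the range processed so far
};

// Condition 1: next_token <= first_token (the process looped back).
// Condition 2: current_token >= first_token (tokens below first_token are covered).
bool is_view_built(const build_progress& p) {
    return p.next_token <= p.first_token && p.current_token >= p.first_token;
}
```

Under these assumptions, an empty range passes both conditions once the position has been advanced to the end: `next_token` is at the minimum and `current_token` is at the maximum, so no separate check on the number of partitions read is needed.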

4629 Commits