scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-31 03:56:42 +00:00

Author	SHA1	Message	Date
Botond Dénes	9d2f7c3f52	Merge 'mv: allow setting concurrency in PRUNE MATERIALIZED VIEW' from Wojciech Mitros The PRUNE MATERIALIZED VIEW statement is performed as follows: 1. Perform a range scan of the view table from the view replicas based on the ranges specified in the statement. 2. While reading the paged scan above, for each view row perform a read from all base replicas at the corresponding primary key. If a discrepancy is detected, delete the row in the view table. When reading multiple rows, this is very slow because for each view row we need to perform a single row query on multiple replicas. In this patch we add an option to speed this up by performing many of the single base row reads concurrently, at the concurrency specified in the USING CONCURRENCY clause. Aside from the unit test, I checked manually on a 3-node cluster with 10M rows, using vnodes. There were actually no ghost rows in the test, but we still had to iterate over all view rows and read the corresponding base rows. And actual ghost rows, if there are any, should be a tiny fraction of all rows. I compared concurrencies 1,2,10,100 and the results were: * Pruning with concurrency 1 took a total of 1416 seconds * Pruning with concurrency 2 took a total of 731 seconds * Pruning with concurrency 10 took a total of 234 seconds * Pruning with concurrency 100 took a total of 171 seconds So after a concurrency of 10 or so we're hitting diminishing returns (at least in this setup). At that point we may no longer be bottlenecked by the reads, but by CPU on the shard that's handling the PRUNE. Fixes https://github.com/scylladb/scylladb/issues/27070 Closes scylladb/scylladb#27097 * github.com:scylladb/scylladb: mv: allow setting concurrency in PRUNE MATERIALIZED VIEW cql: add CONCURRENCY to the USING clause	2025-12-04 11:47:41 +02:00
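The timings quoted in the commit message above can be turned into speedup factors to make the diminishing returns visible. This is only arithmetic on the four reported numbers; the function name `speedups` is made up for this illustration.

```python
# Total PRUNE durations in seconds, keyed by concurrency,
# exactly as reported in the commit message above.
timings = {1: 1416, 2: 731, 10: 234, 100: 171}


def speedups(timings):
    """Return the speedup of each run relative to concurrency 1."""
    base = timings[1]
    return {c: round(base / t, 2) for c, t in timings.items()}


print(speedups(timings))
```

Going from concurrency 10 to 100 raises the speedup only from roughly 6x to roughly 8x, which matches the commit's remark about diminishing returns in that setup.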
Tomasz Grabiec	1d42770936	Merge 'topology_coordinator: Add barrier to cleanup_target' from Łukasz Paszkowski Consider the following scenario: 1. A table has RF=3 and writes use CL=QUORUM 2. One node is down 3. There is a pending tablet migration from the unavailable node that is reverted During the revert, there can be a time window where the pending replica being cleaned up still accepts writes. This leads to write failures, as only two nodes (out of four) are able to acknowledge writes. This patch fixes the issue by adding a barrier to the cleanup_target tablet transition state, ensuring that the coordinator switches back to the previous replica set before cleanup is triggered. Fixes https://github.com/scylladb/scylladb/issues/26512 It's a pre existing issue. Backport is required to all recent 2025.x versions. Closes scylladb/scylladb#27413 * github.com:scylladb/scylladb: topology_coordinator: Fix the indentation for the cleanup_target case topology_coordinator: Add barrier to cleanup_target test_node_failure_during_tablet_migration: Increase RF from 2 to 3	2025-12-03 23:57:45 +01:00
Łukasz Paszkowski	6163fedd2e	topology_coordinator: Fix the indentation for the cleanup_target case	2025-12-03 16:37:33 +01:00
Łukasz Paszkowski	67f1c6d36c	topology_coordinator: Add barrier to cleanup_target Consider the following scenario: 1. A table has RF=3 and writes use CL=QUORUM 2. One node is down 3. There is a pending tablet migration from the unavailable node that is reverted During the revert, there can be a time window where the pending replica being cleaned up still accepts writes. This leads to write failures, as only two nodes (out of four) are able to acknowledge writes. This patch fixes the issue by adding a barrier to the cleanup_target tablet transition state, ensuring that the coordinator switches back to the previous replica set before cleanup is triggered. Fixes https://github.com/scylladb/scylladb/issues/26512	2025-12-03 16:19:17 +01:00
Botond Dénes	b9199e8b24	Merge 'auth: use auth cache on login path' from Marcin Maliszkiewicz Scylla currently has bad resiliency to connection storms. It is easy to overload nodes or degrade their latency through unbounded concurrency of new connections on the client side. This can easily happen in bigger deployments where there are thousands of client instances, e.g. pods. To improve resiliency we are introducing a unified, specialized auth cache to the system. This patch series is stage 1, where the cache is used only on the login path. Dependency diagram: ``` \|Authentication Layer\| \| v +--------------------------------+ \| Auth Cache \| +--------------------------------+ ^ \| \| \| \| v \|Raft Write Logic \| \| CQL Read Layer\| ``` Cache invalidation is based on raft and the cache contains the full content of the related tables. The LDAP role manager may benefit partially, as the can_login function is common and will be cached, but it still needs to query roles from an external source. Performance results: For single shard connection/disconnection scenario insns/conn decreased by 5%, allocs/conn decreased by 23%, tasks/conn decreased by 20%. Results for 20 shards are very similar. 
Raw data before: ``` ≡ ◦ ⤖ rm -rf /tmp/scylla-data && build/release/scylla perf-cql-raw --workdir /tmp/scylla-data --smp 1 --developer-mode 1 --username cassandra --password cassandra --connection-per-request true 2> /dev/null Running test with config: {workload=read, partitions=10000, concurrency=100, duration=5, ops_per_shard=0, auth, connection_per_request} Pre-populated 10000 partitions 1128.55 tps (599.2 allocs/op, 0.0 logallocs/op, 145.2 tasks/op, 2586610 insns/op, 1350912 cycles/op, 0 errors) 1157.41 tps (601.3 allocs/op, 0.0 logallocs/op, 145.2 tasks/op, 2589046 insns/op, 1356691 cycles/op, 0 errors) 1167.42 tps (603.3 allocs/op, 0.0 logallocs/op, 145.2 tasks/op, 2603234 insns/op, 1360607 cycles/op, 0 errors) 1159.63 tps (605.9 allocs/op, 0.0 logallocs/op, 145.3 tasks/op, 2609977 insns/op, 1363935 cycles/op, 0 errors) 1165.12 tps (608.8 allocs/op, 0.0 logallocs/op, 145.2 tasks/op, 2625804 insns/op, 1365736 cycles/op, 0 errors) throughput: mean= 1155.63 standard-deviation=15.66 median= 1159.63 median-absolute-deviation=9.49 maximum=1167.42 minimum=1128.55 instructions_per_op: mean= 2602934.31 standard-deviation=16063.01 median= 2603234.19 median-absolute-deviation=13887.96 maximum=2625804.05 minimum=2586609.82 cpu_cycles_per_op: mean= 1359576.30 standard-deviation=5945.69 median= 1360607.05 median-absolute-deviation=4358.94 maximum=1365736.42 minimum=1350912.10 ``` Raw data after: ``` ≡ ◦ ⤖ rm -rf /tmp/scylla-data && build/release/scylla perf-cql-raw --workdir /tmp/scylla-data --smp 1 --developer-mode 1 --username cassandra --password cassandra --connection-per-request true --duration 10 2> /dev/null Running test with config: {workload=read, partitions=10000, concurrency=100, duration=10, ops_per_shard=0, auth, connection_per_request} Pre-populated 10000 partitions 1132.09 tps (457.5 allocs/op, 0.0 logallocs/op, 115.1 tasks/op, 2432485 insns/op, 1270655 cycles/op, 0 errors) 1157.70 tps (458.4 allocs/op, 0.0 logallocs/op, 115.1 tasks/op, 2447779 insns/op, 
1283768 cycles/op, 0 errors) 1162.86 tps (459.0 allocs/op, 0.0 logallocs/op, 115.1 tasks/op, 2463225 insns/op, 1291782 cycles/op, 0 errors) 1153.15 tps (460.2 allocs/op, 0.0 logallocs/op, 115.2 tasks/op, 2469230 insns/op, 1296381 cycles/op, 0 errors) 1142.09 tps (460.6 allocs/op, 0.0 logallocs/op, 115.1 tasks/op, 2478900 insns/op, 1299342 cycles/op, 0 errors) 1124.89 tps (462.5 allocs/op, 0.0 logallocs/op, 115.2 tasks/op, 2470962 insns/op, 1305026 cycles/op, 0 errors) 1156.75 tps (464.4 allocs/op, 0.0 logallocs/op, 115.1 tasks/op, 2493823 insns/op, 1305136 cycles/op, 0 errors) 1152.16 tps (466.3 allocs/op, 0.0 logallocs/op, 115.2 tasks/op, 2497246 insns/op, 1309816 cycles/op, 0 errors) 1154.77 tps (469.8 allocs/op, 0.0 logallocs/op, 115.5 tasks/op, 2571954 insns/op, 1345341 cycles/op, 0 errors) 1152.22 tps (472.4 allocs/op, 0.0 logallocs/op, 115.3 tasks/op, 2551954 insns/op, 1334202 cycles/op, 0 errors) throughput: mean= 1148.87 standard-deviation=12.08 median= 1153.15 median-absolute-deviation=7.88 maximum=1162.86 minimum=1124.89 instructions_per_op: mean= 2487755.88 standard-deviation=43838.23 median= 2478900.02 median-absolute-deviation=24531.06 maximum=2571954.26 minimum=2432485.38 cpu_cycles_per_op: mean= 1304144.76 standard-deviation=22129.55 median= 1305025.71 median-absolute-deviation=12363.25 maximum=1345341.16 minimum=1270655.17 ``` Fixes https://github.com/scylladb/scylladb/issues/18891 Backport: no, it's a new feature Closes scylladb/scylladb#26841 * github.com:scylladb/scylladb: auth: use auth cache on login path auth: corutinize standard_role_manager::can_login main: auth: add auth cache dependency to auth service raft: update auth cache when data changes auth: storage_service: reload auth cache on v1 to v2 auth migration raft: reload auth cache on snapshot application service: add auth cache getter to storage service main: start auth cache service auth: add unified cache implementation auth: move table names to common.hh	2025-12-03 16:45:01 +02:00
Łukasz Paszkowski	0ed3452721	service/storage_service: Mark nodes excluded on shard0 Excluding nodes is a group0 operation and as such it needs to be executed only on shard0. In case the method `mark_excluded` is invoked on a different shard, redirect the request to shard0. Fixes https://github.com/scylladb/scylladb/issues/27129 Closes scylladb/scylladb#27167	2025-12-01 17:30:40 +01:00
Asias He	da5cc13e97	repair: Fix deadlock when topology coordinator steps down in the middle Consider this: 1) n1 is the topology coordinator 2) n1 schedules and executes a tablet repair with session id s1 for a tablet on n3 and n4. 3) n3 and n4 take and store the lock in _rs._repair_compaction_locks[s1] 4) n1 steps down before it executes locator::tablet_transition_stage::end_repair 5) n2 becomes the new topology coordinator 6) n2 runs locator::tablet_transition_stage::repair again 7) n3 and n4 try to take the lock again and hang since the lock is already taken. To avoid the deadlock, we can throw in step 7 so that n2 will proceed to the end_repair stage and release the lock. After that, the scheduler could schedule the tablet repair request again. Fixes #26346 Closes scylladb/scylladb#27163	2025-11-28 15:14:39 +01:00
Emil Maskovsky	37e3dacf33	topology_coordinator: handle seastar::abort_requested_exception alongside raft::request_aborted In several exception handlers, only raft::request_aborted was being caught and rethrown, while seastar::abort_requested_exception was falling through to the generic catch(...) block. This caused the exception to be incorrectly treated as a failure that triggers rollback, instead of being recognized as an abort signal. For example, during tablet draining, the error log showed: "tablets draining failed with seastar::abort_requested_exception (abort requested). Aborting the topology operation" This change adds seastar::abort_requested_exception handling alongside raft::request_aborted in all places where it was missing. When rethrown, these exceptions propagate up to the main run() loop where handle_topology_coordinator_error() recognizes them as normal abort signals and allows the coordinator to exit gracefully without triggering unnecessary rollback operations. Fixes: scylladb/scylladb#27255 No backport: The problem was only seen in tests and not reported in customer tickets, so it's enough to fix it in the main branch. Closes scylladb/scylladb#27314	2025-11-28 12:19:21 +01:00
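The handler pattern described in the commit above can be sketched as follows. This is a Python illustration only (the real code is C++); `RequestAborted` and `AbortRequested` are hypothetical stand-ins for `raft::request_aborted` and `seastar::abort_requested_exception`, and `run_step` / `run_coordinator` are made-up names, not functions from the Scylla tree.

```python
class RequestAborted(Exception):
    """Hypothetical stand-in for raft::request_aborted."""


class AbortRequested(Exception):
    """Hypothetical stand-in for seastar::abort_requested_exception."""


def run_step(step):
    """Run one topology step; only genuine failures lead to rollback."""
    try:
        step()
    except (RequestAborted, AbortRequested):
        # Both abort signals are rethrown so the main loop can see them.
        raise
    except Exception as e:
        # Any other exception is a real failure of the operation.
        return f"rollback: {e}"
    return "done"


def run_coordinator(step):
    """Main loop stand-in: abort signals end the coordinator gracefully."""
    try:
        return run_step(step)
    except (RequestAborted, AbortRequested):
        return "graceful exit"


def raise_abort():
    raise AbortRequested("abort requested")


def raise_failure():
    raise RuntimeError("streaming failed")
```

Before the fix, the equivalent of `AbortRequested` was missing from the first `except` clause, so it fell through to the generic handler and was treated as a failure that triggers rollback.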
Michael Litvak	97b7c03709	tablet: scheduler: Do not emit conflicting migration in merge colocation The tablet scheduler should not emit conflicting migrations for the same tablet. This was addressed initially in scylladb/scylladb#26038 but the check is missing in the merge colocation plan, so add it there as well. Without this check, the merge colocation plan could generate a conflicting migration for a tablet that is already scheduled for migration, as the test demonstrates. This can cause correctness problems, because if the load balancer generates two migrations for a single tablet, both will be written as mutations, and the resulting mutation could contain mixed cells from both migrations. Fixes scylladb/scylladb#27304 Closes scylladb/scylladb#27312	2025-11-28 11:17:12 +01:00
Pavel Emelyanov	54edb44b20	code: Stop using seastar::compat::source_location And switch to std::source_location. Upcoming seastar update will deprecate its compatibility layer. The patch is for f in $(git grep -l 'seastar::compat::source_location'); do sed -e 's/seastar::compat::source_location/std::source_location/g' -i $f; done and removal of few header includes. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#27309	2025-11-27 19:10:11 +02:00
Wojciech Mitros	323e5cd171	mv: allow setting concurrency in PRUNE MATERIALIZED VIEW The PRUNE MATERIALIZED VIEW statement is performed as follows: 1. Perform a range scan of the view table from the view replicas based on the ranges specified in the statement. 2. While reading the paged scan above, for each view row perform a read from all base replicas at the corresponding primary key. If a discrepancy is detected, delete the row in the view table. When reading multiple rows, this is very slow because for each view row we need to perform a single row query on multiple replicas. In this patch we add an option to speed this up by performing many of the single base row reads concurrently, at the concurrency specified in the USING CONCURRENCY clause. Fixes https://github.com/scylladb/scylladb/issues/27070	2025-11-27 00:02:28 +01:00
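The idea of running the per-row base reads with bounded concurrency can be sketched in Python with an `asyncio.Semaphore`. This is a minimal illustration, not the Scylla implementation (which is C++): the row layout, the `read_base_row` and `delete_view_row` callbacks, and the rule used to detect a discrepancy are all assumptions made for the example.

```python
import asyncio


async def prune_view_rows(view_rows, read_base_row, delete_view_row, concurrency=1):
    """Check every view row against the base table, at most `concurrency` at a time."""
    limit = asyncio.Semaphore(concurrency)
    deleted = []

    async def check(row):
        async with limit:
            base = await read_base_row(row["base_key"])
            # Assumed discrepancy rule: base row missing or value differs.
            if base is None or base["value"] != row["value"]:
                await delete_view_row(row)
                deleted.append(row["base_key"])

    await asyncio.gather(*(check(r) for r in view_rows))
    return sorted(deleted)


def demo():
    """Run the sketch on a tiny in-memory base table and view."""
    base = {1: {"value": "a"}, 2: {"value": "b"}}
    view = [
        {"base_key": 1, "value": "a"},      # consistent with the base row
        {"base_key": 2, "value": "stale"},  # value differs from the base row
        {"base_key": 3, "value": "c"},      # no base row at all (ghost row)
    ]

    async def read_base_row(key):
        await asyncio.sleep(0)  # placeholder for a read from the base replicas
        return base.get(key)

    async def delete_view_row(row):
        await asyncio.sleep(0)  # placeholder for the delete in the view table

    return asyncio.run(prune_view_rows(view, read_base_row, delete_view_row, concurrency=2))
```

With `concurrency=1` the sketch degenerates to the sequential behaviour described above; larger values let several base reads be in flight at once.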
Asias He	ab4896dc70	topology_coordinator: Send incremental repair rpc only when the feature is enabled Otherwise, in a mixed cluster, the handle_tablet_resize_finalization would fail because of the unknown rpc verb. Fixes #26309 Closes scylladb/scylladb#27218	2025-11-26 15:25:36 +01:00
Marcin Maliszkiewicz	ea3dc0b0de	raft: update auth cache when data changes When applying group0_command we now inspect whether any auth internal tables were modified, and reload affected role entries in the cache. Since one auth DML may change multiple tables, when iterating over mutations we deduplicate affected roles across those tables.	2025-11-26 12:00:50 +01:00
Marcin Maliszkiewicz	2a6bef96d6	auth: storage_service: reload auth cache on v1 to v2 auth migration	2025-11-26 12:00:50 +01:00
Marcin Maliszkiewicz	19da1cb656	raft: reload auth cache on snapshot application Receiving a snapshot is a rare event, so as a simplification we'll be reloading the whole cache instead of trying to merge states, especially since the expected size is small, below 100 records. Reloading is a non-disruptive operation: old entries are removed only after all entries are loaded. If an entry is updated, the shared pointer will be atomically replaced in the cache map.	2025-11-26 12:00:50 +01:00
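The reload order described above (load everything first, replace entries in place, drop stale entries last) can be sketched as follows. This is a simplified, single-threaded Python illustration; `reload_cache` and `load_all_roles` are hypothetical names, and a plain dict stands in for the cache map of shared pointers.

```python
def reload_cache(cache, load_all_roles):
    """Reload `cache` in place without ever leaving it empty."""
    # Load the complete new state first; if this raises, the cache is untouched.
    fresh = load_all_roles()

    # Replace or add entries one by one; readers see either the old or the new entry.
    for name, entry in fresh.items():
        cache[name] = entry

    # Only after every fresh entry is in place, remove entries that no longer exist.
    for name in list(cache):
        if name not in fresh:
            del cache[name]
    return cache
```

For example, reloading `{"alice": 1, "bob": 2}` from a source that returns `{"bob": 3, "carol": 4}` leaves `{"bob": 3, "carol": 4}`, and `bob` is never absent during the reload.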
Marcin Maliszkiewicz	2cf1ca43b5	service: add auth cache getter to storage service Prepare for use in a subsequent commit in group0_state_machine, where the auth cache will be integrated. This follows the same pattern as updates to the service-level cache, view-building state, and CDC streams.	2025-11-26 12:00:50 +01:00
Nadav Har'El	9cde93e3da	Merge 'db/view/view_building_coordinator: get rid of task's state in group0' from Michał Jadwiszczak Previously, the view building coordinator relied on setting each task's state to STARTED and then explicitly removing these state entries once tasks finished, before scheduling new ones. This approach induced a significant number of group0 commits, particularly in large clusters with many nodes and tablets, negatively impacting performance and scalability. With the update, the coordinator and worker logic has been restructured to operate without maintaining per-task states. Instead, tasks are simply tracked with an aborted boolean flag, which is still essential for certain tablet operations. This change removes much of the coordination complexity, simplifies the view building code, and reduces operational overhead. In addition, the coordinator now batches reports of finished tasks before making commits. Rather than committing task completions individually, it aggregates them and reports in groups, significantly minimizing the frequency of group0 commits. This new approach is expected to improve efficiency and scalability during materialized view construction, especially in large deployments. Fixes https://github.com/scylladb/scylladb/issues/26311 This patch needs to be backported to 2025.4. Closes scylladb/scylladb#26897 * github.com:scylladb/scylladb: docs/dev/view-building-coordinator: update the docs after recent changes db/view/view_building: send coordinator's term in the RPC db/view/view_building_state: replace task's state with `aborted` flag db/view/view_building_coordinator: batch finished tasks reporting db/view/view_building_worker: change internal implementation db/view/view_building_coordinator: change `work_on_tasks` RPC return type	2025-11-26 11:35:44 +02:00
Michał Jadwiszczak	24d69b4005	db/view/view_building_state: replace task's state with `aborted` flag After previous commits, we can drop entire task's state and replace it with single boolean flag, which determines if a task was aborted. Once a task was aborted, it cannot get resurrected to a normal state.	2025-11-25 12:14:04 +01:00
Michael Litvak	005807ebb8	Revert "storage service: add repair colocated tablets rpc" This reverts commit `11f045bb7c`. The rpc was added together with colocated tablets in 2025.4 to support a "shared repair" operation of a group of colocated tablets that repairs all of them and allows also for special behavior as opposed to repairing a single specific tablet. It is not used anymore because we decided to not repair all colocated tablets in a single shared operation, but to repair only the base table, and in a later release support repairing colocated tables individually. We can remove the rpc in 2025.4 because it is introduced in the same version.	2025-11-25 09:06:48 +01:00
Michael Litvak	273f664496	topology_coordinator: don't repair colocated tablets With the introduction of colocated tables, all tablet transitions, such as tablet migration and tablet repair, now operate on groups of colocated tablets instead of individual tablets. The tablet repair currently doesn't work on individual tablets due to the limitations in the tablet map being shared. The way it was implemented to work on a group of colocated tablets is by repairing all the colocated tablets together, using a dedicated rpc, and setting a shared repair_time in the shared tablet map. It was implemented this way because we wanted to have some way to repair the tablets of a colocated table. However, we want to change this in the next release so that it will be possible to repair the tablets of a colocated table individually. In order to simplify and prepare for the future change, we prefer until then to not repair colocated tables at all. Otherwise, we will need to support both the shared repair and individual repair together for a long time, and the upgrade will be more complicated. We change the handling of the tablet 'repair' transition to repair only the base table's tablets. It means it will not be possible to request tablet repair for a non-base colocated table such as local MV, CDC and paxos table. This restriction will be temporary until a later release where we will support repairing colocated tablets. This is a reasonable restriction because repair for these kinds of tables is not required or as important as for normal tables. Fixes scylladb/scylladb#27119	2025-11-25 09:05:59 +01:00
Gleb Natapov	39cec4ae45	topology: let banned node know that it is banned Currently if a banned node tries to connect to a cluster it fails to create connections, but has no idea why, so from inside the node it looks like it has communication problems. This patch adds a new rpc NOTIFY_BANNED which is sent back to the node when its connection is dropped. On receiving the rpc the node isolates itself and prints an informative message about why it did so. Closes scylladb/scylladb#26943	2025-11-24 17:12:13 +01:00
Tomasz Grabiec	d4b77c422f	Merge 'load_stats: leaving replica could be std::nullopt' from Ferenc Szili When migrating tablet size during the end_migration tablet transition stage, we need the pending and leaving replica hosts. The leaving and pending replicas are gathered in objects of type std::optional<tablet_replica> and are not checked if they contain a value before dereferencing which could cause an exception in the topology coordinator. This patch adds a check for leaving and pending replicas, and only performs the tablet size migration if neither are empty. This bug was introduced in `10f07fb95a` This change also adds the ability to create a tablet size in load_stats during end_migration stage of a tablet rebuild. We compute the new tablet size by averaging the tablet sizes of the existing replicas. This change also adds the virtual table tablet_sizes which contains tablet sizes of all the replicas of all the tablets in the cluster. A version containing this bug has not yet been released, so a backport is not needed. Closes scylladb/scylladb#27118 * github.com:scylladb/scylladb: test: add tests for tablet size migration during end_migration virtual_table: add tablet_sizes virtual table load_stats: update tablet sizes after migration or rebuild	2025-11-24 15:31:30 +01:00
Avi Kivity	85db7b1caf	Merge 'address_map: Use more efficient and reliable replication method' from Tomasz Grabiec Primary issue with the old method is that each update is a separate cross-shard call, and all later updates queue behind it. If one of the shards has high latency for such calls, the queue may accumulate and system will appear unresponsive for mapping changes on non-zero shards. This happened in the field when one of the shards was overloaded with sstables and compaction work, which caused frequent stalls which delayed polling for ~100ms. A queue of 3k address updates accumulated, because we update mapping on each change of gossip states. This made bootstrap impossible because nodes couldn't learn about the IP mapping for the bootstrapping node and streaming failed. To protect against that, use a more efficient method of replication which requires a single cross-shard call to replicate all prior updates. It is also more reliable, if replication fails transiently for some reason, we don't give up and fail all later updates. Fixes #26865 Closes scylladb/scylladb#26941 * github.com:scylladb/scylladb: address_map: Use barrier() to wait for replication address_map: Use more efficient and reliable replication method utils: Introduce helper for replicated data structures	2025-11-23 19:15:12 +02:00
Ferenc Szili	cede4f66af	load_stats: update tablet sizes after migration or rebuild When migrating tablet size during the end_migration tablet transition stage, we need the pending and leaving replica hosts. The leaving and pending replicas are gathered in objects of type std::optional<tablet_replica> and are not checked if they contain a value before dereferencing which could cause an exception in the topology coordinator. This patch adds a check for leaving and pending replicas, and only performs the tablet size migration if neither are empty. This bug was introduced in `10f07fb95a` This change also adds the functionality to add the tablet size to load_stats after a tablet rebuild. We compute the average tablet size from the existing replicas, and add the new size to the pending replica.	2025-11-21 16:22:20 +01:00
Radosław Cybulski	d589e68642	Add precompiled headers to CMakeLists.txt Add precompiled header support to CMakeLists.txt and configure.py - it improves compilation time by approximately 10%. New header `stdafx.hh` is added, don't include it manually - the compiler will include it for you. The header contains includes from external libraries used by Scylla - seastar, standard library, linux headers and zlib. The feature is enabled by default, use CMake option `Scylla_USE_PRECOMPILED_HEADER` or configure.py --disable-precompiled-header to disable. The feature should be disabled, when trying to check headers - otherwise you might get false negatives on missing includes from seastar / abseil and so on. Note: following configuration needs to be added to ccache.conf: sloppiness = pch_defines,time_macros,include_file_mtime,include_file_ctime Closes scylladb/scylladb#26617	2025-11-21 12:27:41 +02:00
Patryk Jędrzejczak	45ad93a52c	topology_coordinator: include all transitioning nodes in all global commands This change makes the code simpler and less vulnerable to regressions. There is no functional impact because: - we already include a decommissioning/bootstrapping/replacing node for `barrier` and `barrier_and_drain`, - we never execute global commands in the presence of a rebuilding node, - removing node always belongs to `exclude_nodes`, so it's filtered out anyway, - we execute global `stream_ranges` only for removenode, - we execute global `wait_for_ip` only for new nodes when there are no transitioning nodes. Fixes #20272 Fixes #27066 Closes scylladb/scylladb#27102	2025-11-20 11:11:32 +02:00
Botond Dénes	6ee0f1f3a7	Merge 'replica/table: add a metric for hypothetical total file size without compression' from Michał Chojnowski This patch adds a metric for pre-compression size of sstable files. This patch adds a per-table metric `scylla_column_family_total_disk_space_before_compression`, which measures the hypothetical total size of sstables on disk, if Data.db was replaced with an uncompressed equivalent. As for the implementation: Before the patch, tables and sstable sets are already tracking their total physical file size. Whenever sstables are added or removed, the size delta is propagated from the sstable up through sstable sets into table_stats. To implement the new metric, we turn the size delta that is getting passed around from a one-dimensional to a two-dimensional value, which includes both the physical and the pre-compression size. New functionality, no backport needed. Closes scylladb/scylladb#26996 * github.com:scylladb/scylladb: replica/table: add a metric for hypothetical total file size without compression replica/table: keep track of total pre-compression file size	2025-11-20 09:10:38 +02:00
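The change from a one-dimensional to a two-dimensional size delta, as described in the merge above, can be sketched like this. It is a Python illustration under assumed names (`SizeDelta`, `TableStats`, `add_sstable`, `remove_sstable` are hypothetical and do not mirror the C++ types in the Scylla tree).

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SizeDelta:
    """A size change with two components instead of a single number."""
    on_disk: int = 0             # physical file size in bytes
    before_compression: int = 0  # size if Data.db were stored uncompressed

    def __add__(self, other):
        return SizeDelta(self.on_disk + other.on_disk,
                         self.before_compression + other.before_compression)

    def __neg__(self):
        return SizeDelta(-self.on_disk, -self.before_compression)


@dataclass
class TableStats:
    """Accumulates deltas propagated up from sstables."""
    total: SizeDelta = field(default_factory=SizeDelta)

    def add_sstable(self, size):
        self.total = self.total + size

    def remove_sstable(self, size):
        self.total = self.total + (-size)
```

Because both components travel in the same value, every place that already propagated the physical size delta keeps the pre-compression total up to date as well.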
Tomasz Grabiec	f83c4ffc68	address_map: Use barrier() to wait for replication More efficient than 100 pings. There was one ping in the test which was done "so this shard notices the clock advance". It's not necessary, since observing a completed SMP call implies that the local shard sees the clock advancement done within it.	2025-11-19 15:21:02 +01:00
Tomasz Grabiec	4a85ea8eb2	address_map: Use more efficient and reliable replication method Primary issue with the old method is that each update is a separate cross-shard call, and all later updates queue behind it. If one of the shards has high latency for such calls, the queue may accumulate and system will appear unresponsive for mapping changes on non-zero shards. This happened in the field when one of the shards was overloaded with sstables and compaction work, which caused frequent stalls which delayed polling for ~100ms. A queue of 3k address updates accumulated. This made bootstrap impossible, since nodes couldn't learn about the IP mapping for the bootstrapping node and streaming failed. To protect against that, use a more efficient method of replication which requires a single cross-shard call to replicate all prior updates. It is also more reliable: if replication fails transiently for some reason, we don't give up and fail all later updates. Fixes #26865 Fixes #26835	2025-11-19 15:21:02 +01:00
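The difference between one cross-shard call per update and one call that carries all prior updates can be sketched as follows. This is a Python illustration of the general batching idea only; `Primary`, `Replica`, and `apply_batch` are hypothetical names, and the actual helper for replicated data structures in the Scylla tree may work differently.

```python
class Replica:
    """Stand-in for the copy of the mapping held on a non-zero shard."""

    def __init__(self):
        self.mapping = {}
        self.applied = 0  # number of log entries already applied
        self.calls = 0    # number of "cross-shard calls" received

    def apply_batch(self, entries):
        self.calls += 1
        for key, value in entries:
            self.mapping[key] = value
        self.applied += len(entries)


class Primary:
    """Stand-in for shard 0, which owns the authoritative update log."""

    def __init__(self, replicas):
        self.log = []
        self.replicas = replicas

    def update(self, key, value):
        self.log.append((key, value))

    def replicate(self):
        for replica in self.replicas:
            pending = self.log[replica.applied:]
            if not pending:
                continue
            try:
                # One call carries every update the replica has not seen yet.
                replica.apply_batch(pending)
            except RuntimeError:
                # Transient failure: entries stay pending for the next round.
                pass
```

With this shape, 3000 queued updates reach a slow replica in a single call rather than 3000 queued ones, and a failed round does not lose later updates because the replica's `applied` position has not moved.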
Patryk Jędrzejczak	adaa0560d9	Merge 'Automatic cleanup improvements' from Gleb Natapov This series allows an operator to reset the 'cleanup needed' flag if they have already cleaned up the node, so that automatic cleanup will not do it again. We also change 'nodetool cleanup' back to run cleanup on one node only (and reset the 'cleanup needed' flag in the end), but the new '--global' option allows running cleanup on all nodes that need it simultaneously. Fixes https://github.com/scylladb/scylladb/issues/26866 Backport to all supported versions, since automatic cleanup behaviour as it is now may create load that the operator does not expect during cluster resizing. Closes scylladb/scylladb#26868 * https://github.com/scylladb/scylladb: cleanup: introduce "nodetool cluster cleanup" command to run cleanup on all dirty nodes in the cluster cleanup: Add RESTful API to allow reset cleanup needed flag	2025-11-18 08:17:17 +02:00
Piotr Dulikowski	f0039381d2	Merge 'db/view/view_building_worker: support staging sstables intra-node migration and tablet merge' from Michał Jadwiszczak This PR fixes staging sstables handling by view building coordinator in case of intra-node tablet migration or tablet merge. To support tablet merge, the worker stores the sstables grouped only by `table_id`, instead of `(table_id, last_token)` pair. There shouldn't be that many staging sstables, so selecting the relevant ones for each `process_staging` task is fine. For the intra-node migration support, the patch adds methods to load migrated sstables on the destination shard and to clean them up on the source shard. The patch should be backported to 2025.4 Fixes https://github.com/scylladb/scylladb/issues/26244 Closes scylladb/scylladb#26454 * github.com:scylladb/scylladb: service/storage_service: migrate staging sstables in view building worker during intra-node migration db/view/view_building_worker: support sstables intra-node migration db/view_building_worker: fix indent db/view/view_building_worker: don't organize staging sstables by last token	2025-11-17 08:53:19 +01:00
Pavel Emelyanov	1c9c4c8c8c	Merge 'service: attach storage_service to migration_manager using pluggable' from Marcin Maliszkiewicz Migration manager depends on storage service. For instance, it has a reload_schema_in_bg background task which calls _ss.local() so it expects that storage service is not stopped before it stops. To solve this we use permit approach, and during storage_service stop: - we ignore new code execution in migration_manager which'd use storage_service - but wait with storage_service shutdown until all existing executions are done Fixes scylladb/scylladb#26734 Backport: no need, problem existed since very long time, code restructure in https://github.com/scylladb/scylladb/commit/389afcd (and following commits) made it hitting more often, as _ss was called earlier, but it's not released yet. Closes scylladb/scylladb#26779 * github.com:scylladb/scylladb: service: attach storage_service to migration_manager using pluggabe service: migration_manager: corutinize merge_schema_from service: migration_manager: corutinize reload_schema	2025-11-14 15:14:28 +03:00
Piotr Dulikowski	2ccc94c496	Merge 'topology_coordinator: include joining node in barrier' from Michael Litvak Previously, only nodes in the 'normal' state and decommissioning nodes were included in the set of nodes participating in barrier and barrier_and_drain commands. Joining nodes are not included because they don't coordinate requests, given their cql port is closed. However, joining nodes may receive mutations from other nodes, for which they may generate and coordinate materialized view updates. If their group0 state is not synchronized it could cause lost view updates. For example: 1. On the topology coordinator, the join completes and the joining node becomes normal, but the joining node's state lags behind. Since it's not synchronized by the barrier, it could be in an old state such as `write_both_read_old`. 2. A normal node coordinates a write and sends it to the new node as the new replica. 3. The new node applies the base mutation but doesn't generate a view update for it, because it calculates the base-view pairing according to its own state and replication map, and determines that it doesn't participate in the base-view pairing. Therefore, since the joining node participates as a coordinator for view updates, it should be included in these barriers as well. This ensures that before the join completes, the joining node's state is `write_both_read_new`, where it does generate view updates. Fixes https://github.com/scylladb/scylladb/issues/26976 backport to previous versions since it fixes a bug in MV with vnodes Closes scylladb/scylladb#27008 * github.com:scylladb/scylladb: test: add mv write during node join test topology_coordinator: include joining node in barrier	2025-11-14 12:41:16 +01:00
Patryk Jędrzejczak	1141342c4f	Merge 'topology: refactor excluded nodes' from Petr Gusev This PR refactors excluded nodes handling for tablets and topology. For tablets a dedicated variable `topology::excluded_tablet_nodes` is introduced, for topology operations a method get_excluded_nodes() is inlined into topology_coordinator and renamed to `get_excluded_nodes_for_topology_request`. The PR improves code readability and efficiency, no behavior changes. backport: this is a refactoring/optimization, no need to backport Closes scylladb/scylladb#26907 * https://github.com/scylladb/scylladb: topology_coordinator: drop unused exec_global_command overload topology_coordinator: rename get_excluded_nodes -> get_excluded_nodes_for_topology_request topology_state_machine: inline get_excluded_nodes messaging_service: simplify and optimize ban_host storage_service: topology_state_load: extract topology variable topology_coordinator: excluded_tablet_nodes -> ignored_nodes topology_state_machine: add excluded_tablet_nodes field	2025-11-14 11:52:00 +01:00
Piotr Dulikowski	833b824905	Merge 'service/qos: Fall back to default scheduling group when using maintenance socket' from Dawid Mędrek The service level controller relies on `auth::service` to collect information about roles and the relation between them and the service levels (those attached to them). Unfortunately, the service level controller is initialized way earlier than `auth::service` and so we had to prevent potential invalid queries of user service levels (cf. `46193f5e79`). Unfortunately, that came at a price: it made the maintenance socket incompatible with the current implementation of the service level controller. The maintenance socket starts early, before the `auth::service` is fully initialized and registered, and is exposed almost immediately. If the user attempts to connect to Scylla within this time window, via the maintenance socket, one of the things that will happen is choosing the right service level for the connection. Since the `auth::service` is not registered, Scylla will fail an assertion and crash. A similar scenario occurs when using maintenance mode. The maintenance socket is how the user communicates with the database, and we're not prepared for that either. To avoid unnecessary crashes, we add new branches if the passed user is absent or if it corresponds to the anonymous role. Since the role corresponding to a connection via the maintenance socket is the anonymous role, that solves the problem. Some accesses to `auth::service` are not affected and we do not modify those. Fixes scylladb/scylladb#26816 Backport: yes. This is a fix of a regression. Closes scylladb/scylladb#26856 * github.com:scylladb/scylladb: test/cluster/test_maintenance_mode.py: Wait for initialization test: Disable maintenance mode correctly in test_maintenance_mode.py test: Fix keyspace in test_maintenance_mode.py service/qos: Do not crash Scylla if auth_integration absent	2025-11-14 11:12:28 +01:00
Marcin Maliszkiewicz	958d04c349	service: attach storage_service to migration_manager using pluggabe Migration manager depends on storage service. For instance, it has a reload_schema_in_bg background task which calls _ss.local() so it expects that storage service is not stopped before it stops. To solve this we use permit approach, and during storage_service stop: - we ignore new code execution in migration_manager which'd use storage_service - but wait with storage_service shutdown until all existing executions are done Fixes scylladb/scylladb#26734	2025-11-14 08:50:19 +01:00
Marcin Maliszkiewicz	cf9b2de18b	service: migration_manager: corutinize merge_schema_from It's needed to easily keep-alive pluggable storage_service permit in a following commit.	2025-11-14 08:50:19 +01:00
Marcin Maliszkiewicz	5241e9476f	service: migration_manager: corutinize reload_schema It's needed to easily keep-alive pluggable storage_service permit in a following commit.	2025-11-14 08:50:18 +01:00
Michael Litvak	eefae4cc4e	migration_manager: pass timestamp to pre_create pass the write timestamp as parameter to the on_pre_create_column_families notification.	2025-11-13 16:59:43 +01:00
Petr Gusev	d3bd8c924d	topology_coordinator: drop unused exec_global_command overload	2025-11-13 14:19:03 +01:00
Petr Gusev	45d1302066	topology_coordinator: rename get_excluded_nodes -> get_excluded_nodes_for_topology_request This method is specific to topology requests -- node joining, replacing, decommissioning etc, everything that goes through topology::transition_state::write_both_read_old and raft_topology_cmd::command::stream_ranges. It shouldn't be used in other contexts -- to handle global topology requests (e.g. truncate table) or for tablets. Rename the method to make this more explicit.	2025-11-13 14:19:03 +01:00
Petr Gusev	bf8cc5358b	topology_state_machine: inline get_excluded_nodes The method is specific to topology_coordinator, which already contains a wrapper for it, so inline the topology method into it. Also, make the logic of the method more explicit and remove multiple transition_nodes lookups.	2025-11-13 14:18:46 +01:00
Michael Litvak	13d94576e5	topology_coordinator: include joining node in barrier Previously, only nodes in the 'normal' state and decommissioning nodes were included in the set of nodes participating in barrier and barrier_and_drain commands. Joining nodes are not included because they don't coordinate requests, given their cql port is closed. However, joining nodes may receive mutations from other nodes, for which they may generate and coordinate materialized view updates. If their group0 state is not synchronized it could cause lost view updates. For example: 1. On the topology coordinator, the join completes and the joining node becomes normal, but the joining node's state lags behind. Since it's not synchronized by the barrier, it could be in an old state such as `write_both_read_old`. 2. A normal node coordinates a write and sends it to the new node as the new replica. 3. The new node applies the base mutation but doesn't generate a view update for it, because it calculates the base-view pairing according to its own state and replication map, and determines that it doesn't participate in the base-view pairing. Therefore, since the joining node participates as a coordinator for view updates, it should be included in these barriers as well. This ensures that before the join completes, the joining node's state is `write_both_read_new`, where it does generate view updates. Fixes scylladb/scylladb#26976	2025-11-13 12:24:31 +01:00
Piotr Dulikowski	2e5eb92f21	Merge 'cdc: use CDC schema that is compatible with the base schema' from Michael Litvak When generating CDC log mutations for some base mutation, use a CDC schema that is compatible with the base schema. The compatible CDC schema has for every base column a corresponding CDC column with the same name. If using a non-compatible schema, we may encounter a situation, especially during ALTER, that we have a mutation with a base column set with some value, but the CDC schema doesn't have a column by that name. This would cause the user request to fail with an error. We add to the schema object a schema_ptr that for CDC-enabled tables points to the schema object of the CDC table that is compatible with the schema. It is set by the schema merge algorithm when creating the schema for a table that is created or altered. We use the fact that a base table and its CDC table are created and altered in the same group0 operation, and this way we can find and set the cdc schema for a base table. When transporting the base schema as a frozen schema between shards, we transport with it the frozen cdc schema as well. The patch starts with a series of refactoring commits that make extending the frozen schema easier and cleans up some duplication in the code about the frozen schema. We combine the two types `frozen_schema_with_base_info` and `view_schema_and_base_info` to a single type `extended_frozen_schema` that holds a frozen schema with additional data that is not part of the schema mutations but needs to be transported with it to unfreeze it - base_info, and the frozen cdc schema which is added in a later commit. 
Fixes https://github.com/scylladb/scylladb/issues/26405 backport not needed - enhancement Closes scylladb/scylladb#24960 * github.com:scylladb/scylladb: test: cdc: test cdc compatible schema cdc: use compatiable cdc schema db: schema_applier: create schema with pointer to CDC schema db: schema_applier: extract cdc tables schema: add pointer to CDC schema schema_registry: remove base_info from global_schema_ptr schema_registry: use extended_frozen_schema in schema load schema_registry: replace frozen_schema+base_info with extended_frozen_schema frozen_schema: extract info from schema_ptr in the constructor frozen_schema: rename frozen_schema_with_base_info to extended_frozen_schema	2025-11-13 10:11:54 +01:00
Pavel Emelyanov	f47f2db710	Merge 'Support local primary-replica-only for native restore' from Robert Bindar This PR extends the restore API so that it accepts primary_replica_only as parameter and it combines the concepts of primary-replica-only with scoped streaming so that with: - `scope=all primary_replica_only=true` The restoring node will stream to the global primary replica only - `scope=dc primary_replica_only=true` The restoring node will stream to the local primary replica only. - `scope=rack primary_replica_only=true` The restoring node will stream only to the primary replica from within its own rack (with rf=#racks, the restoring node will stream only to itself) - `scope=node primary_replica_only=true` is not allowed, the restoring node will always stream only to itself so the primary_replica_only parameter wouldn't make sense. The PR also adjusts the `nodetool refresh` restriction on running restore with both primary_replica_only and scope, it adds primary_replica_only to `nodetool restore` and it adds cluster tests for primary replica within scope. Fixes #26584 Closes scylladb/scylladb#26609 * github.com:scylladb/scylladb: Add cluster tests for checking scoped primary_replica_only streaming Improve choice distribution for primary replica Refactor cluster/object_store/test_backup nodetool restore: add primary-replica-only option nodetool refresh: Enable scope={all,dc,rack} with primary_replica_only Enable scoped primary replica only streaming Support primary_replica_only for native restore API	2025-11-13 12:11:18 +03:00
Michał Chojnowski	1cfce430f1	replica/table: keep track of total pre-compression file size Every table and sstable set keeps track of the total file size of contained sstables. Due to a feature request, we also want to keep track of the hypothetical file size if Data files were uncompressed, to add a metric that shows the compression ratio of sstables. We achieve this by replacing the relevant `uint64_t bytes_on_disk` counters everywhere with a struct that contains both the actual (post-compression) size and the hypothetical pre-compression size. This patch isn't supposed to change any observable behavior. In the next patch, we will use these changes to add a new metric.	2025-11-13 00:49:57 +01:00
Tomasz Grabiec	10b893dc27	Merge 'load_stats: fix bug in migrate_tablet_size()' from Ferenc Szili `topology_coordinator::migrate_tablet_size()` was introduced in `10f07fb95a`. It has a bug where the has_tablet_size() lambda always returns false because of bad comparison of iterators after a table and tablet search: ``` if (auto table_i = tables.find(gid.table); table_i != tables.find(gid.table)) { if (auto size_i = table_i->second.find(trange); size_i != table_i->second.find(trange)) { ``` This change also fixes a problem where the `migrate_tablet_size()` would crash with a `std::out_of_range` if the pending node was not present in load_stats. This change fixes these two problems and moves the functionality into a separate method of `load_stats`. It also adds tests for the new method. A version containing this bug has not been released yet, so no backport is needed. Closes scylladb/scylladb#26946 * github.com:scylladb/scylladb: load_stats: add test for migrate_tablet_size() load_stats: fix problem with tablet size migration	2025-11-12 23:48:37 +01:00
Petr Gusev	9fed80c4be	messaging_service: simplify and optimize ban_host We do one cross-shard call for all left+ignored nodes.	2025-11-12 12:27:44 +01:00
Petr Gusev	52cccc999e	storage_service: topology_state_load: extract topology variable It's inconvenient to always write the long expression _topology_state_machine._topology.	2025-11-12 12:27:44 +01:00
Petr Gusev	66063f202b	topology_coordinator: excluded_tablet_nodes -> ignored_nodes ignored_nodes is sufficient in these cases. excluded_tablet_nodes also includes left_nodes_rs, which are not needed here — global_token_metadata_barrier runs the barrier only on normal and transition nodes, not on left nodes.	2025-11-12 12:27:44 +01:00
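
The iterator bug described in the 'load_stats: fix bug in migrate_tablet_size()' merge (10b893dc27) can be illustrated with a minimal sketch. This is not the actual ScyllaDB code: the type aliases and both function names below are hypothetical stand-ins, and only the shape of the comparison follows the snippet quoted in the commit message. Comparing the result of `find()` with a second `find()` for the same key can never be "not equal", so the lookup always reports a miss; comparing against `end()` is the usual fix.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-ins for the real table/tablet identifiers (assumptions).
using table_id = std::string;
using tablet_range = int;
using tablet_sizes = std::map<tablet_range, std::uint64_t>;
using table_sizes = std::map<table_id, tablet_sizes>;

// Buggy shape from the commit message: each iterator is compared with the
// result of an identical find(), so both conditions are always false.
inline bool has_tablet_size_buggy(const table_sizes& tables, const table_id& table, tablet_range trange) {
    if (auto table_i = tables.find(table); table_i != tables.find(table)) {
        if (auto size_i = table_i->second.find(trange); size_i != table_i->second.find(trange)) {
            return true;
        }
    }
    return false;
}

// Corrected shape: a successful lookup is one that does not return end().
inline bool has_tablet_size(const table_sizes& tables, const table_id& table, tablet_range trange) {
    if (auto table_i = tables.find(table); table_i != tables.end()) {
        if (auto size_i = table_i->second.find(trange); size_i != table_i->second.end()) {
            return true;
        }
    }
    return false;
}
```

With a map holding one table "t1" and one tablet range 7, `has_tablet_size(tables, "t1", 7)` returns true, while `has_tablet_size_buggy(tables, "t1", 7)` returns false even though the entry exists.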


5864 Commits