scylladb

Author	SHA1	Message	Date
Pavel Emelyanov	3ebd02513a	view_builder: Start background in maintenance group Currently view_builder::start() is called in the default scheduling group. Once it initializes itself, it wakes up the step fiber, which explicitly switches to the maintenance scheduling group. This explicit switch made sense before the previous patch, when the fiber was implemented as a serialized action. Now the fiber starts directly from the .start() method and can inherit the scheduling group from it. That said, main code calls view_builder::start() in the maintenance scheduling group, killing two birds with one stone. First, the step fiber no longer needs to borrow its scheduling group indirectly via database. Second, the start_in_background() code itself runs in a more suitable scheduling group. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-01-28 18:34:59 +03:00
Pavel Emelyanov	2439d27b60	view_builder: Wake-up step fiber with condition variable View builder runs a background fiber that performs build steps. To kick the fiber it uses a serialized action, but that is overkill: nobody waits for the action to finish except on stop, when it's joined. This patch uses a condition variable to kick the fiber, and starts it instantly, in the place where the serialized action was first kicked. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-01-28 18:34:58 +03:00
Emil Maskovsky	834961c308	db/view: add missing include for coroutine::all to fix build without precompiled headers When building with `--disable-precompiled-header`, view.cc failed to compile due to missing <seastar/coroutine/all.hh> include, which provides `coroutine::all`. The problem doesn't manifest when precompiled headers are used, which is the default. So that's likely why it was missed by the CI. Adding the explicit include fixes the build. Fixes: scylladb/scylladb#28378 Ref: scylladb/scylladb#28093 No backport: This problem is only present in master. Closes scylladb/scylladb#28379	2026-01-27 18:56:56 +01:00
Piotr Dulikowski	5d5e829107	Merge 'db: view: refactor usage and building of semaphore in create and drop views plus change continuation to co routine style' from Alex Dathskovsky db: view: refactor semaphore usage in create/drop view paths Refactor the construction and usage of semaphore units in the create and drop view flows. The previous semaphore handling was hard to follow (as noted while working on https://github.com/scylladb/scylladb/pull/27929), so this change restructures unit creation and movement to follow a clearer and symmetric pattern across shards. The semaphore usage model is now documented with a detailed in-code comment to make the intended behavior and invariants explicit. As part of the refactor, the control flow is modernized by replacing continuation-based logic with coroutine-style code, improving readability and maintainability. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-250 backport: not required, this is a refactor Closes scylladb/scylladb#28093 * github.com:scylladb/scylladb: db: view: extend try/catch scope in handle_create_view_local The try/catch region is extended to cover step functions and inner helpers, which may throw or abort during view creation. This change is safe because we are just swallowing exceptions from more parts that may throw due to semaphore abortion or any other abort request, and it doesn't change the logic. db: view: refine create/drop coroutine signatures Refactor the create/drop coroutine interfaces to accept parameters as const references, enabling a clearer workflow and safer data flow. db: view: switch from continuations to coroutines Refactor the flow and style of create and drop view to use coroutines instead of continuations. This simplifies the logic, improves readability, and makes the code easier to maintain and extend. This commit also utilizes the get_view_builder_units function that was added in the previous commit.
This commit also introduces a new alias for the optional unit type, for simpler and more readable functions that use this type. db: view: introduce helper to acquire or reuse semaphore units Introduce a small helper that acquires semaphore units when needed or reuses units provided by the caller. This centralizes semaphore handling, simplifies the current logic, and enables refactoring the view create/drop path to a coroutine-based implementation instead of continuation-style code. db: view: add detailed comments on semaphore bookkeeping and serialized create/drop on shard 0	2026-01-26 17:16:01 +01:00
Alex	954d18903e	db: view: extend try/catch scope in handle_create_view_local The try/catch region is extended to cover step functions and inner helpers, which may throw or abort during view creation. This change is safe because we are just swallowing exceptions from more parts that may throw due to semaphore abortion or any other abort request, and it doesn't change the logic.	2026-01-26 13:10:37 +02:00
Alex	2c3ab8490c	db: view: refine create/drop coroutine signatures Refactor the create/drop coroutine interfaces to accept parameters as const references, enabling a clearer workflow and safer data flow.	2026-01-26 13:10:34 +02:00
Alex	114f88cb9b	db: view: switch from continuations to coroutines Refactor the flow and style of create and drop view to use coroutines instead of continuations. This simplifies the logic, improves readability, and makes the code easier to maintain and extend. This commit also utilizes the get_view_builder_units function that was added in the previous commit. This commit also introduces a new alias for the optional unit type, for simpler and more readable functions that use this type.	2026-01-26 13:08:24 +02:00
Alex	87c1c6f40f	db: view: introduce helper to acquire or reuse semaphore units Introduce a small helper that acquires semaphore units when needed or reuses units provided by the caller. This centralizes semaphore handling, simplifies the current logic, and enables refactoring the view create/drop path to a coroutine-based implementation instead of continuation-style code.	2026-01-26 13:03:26 +02:00
Alex	1aadedc596	db: view: add detailed comments on semaphore bookkeeping and serialized create/drop on shard 0	2026-01-25 14:29:09 +02:00
Michael Litvak	d5009882c6	locator: document the exception type of assert_rf_rack_valid_keyspace The function assert_rf_rack_valid_keyspace uses the exception type std::invalid_argument when the RF-rack validation fails. Document it and change all callers to catch this specific exception type when checking for RF-rack validation failures, so that other exception types can be propagated properly.	2026-01-22 16:11:35 +01:00
Avi Kivity	c6dfae5661	treewide: #include Seastar headers with angle brackets Seastar is an external library from the point of view of ScyllaDB, so should be included with angle brackets. Closes scylladb/scylladb#27947	2026-01-13 14:56:15 +02:00
Alex	e430065c92	db: views: serialize create/drop view operations via shard 0 Create and drop view operations are currently performed on all shards, and their execution is not fully serialized. On slower processors this can lead to interleavings that leave stale entries in `system.scylla_views_build` A problematic sequence looks like this: * `on_create_view()` runs on shard 0 → entries for shard 0 and shard 1 are created * `on_drop_view()` runs on shard 0 → entry for shard 0 is removed * `on_create_view()` runs on shard 1 → entries for shard 0 and shard 1 are created again * `on_drop_view()` runs on shard 1 → entry for shard 1 is removed, while the shard 0 entry remains This results in a leftover row in `system.scylla_views_builds_in_progress`, causing `view_build_test.cc` to get stuck indefinitely in an eventual state and eventually be terminated by CI. This patch fixes the issue by fully serializing all view create and drop operations through shard 0. Shard 0 becomes the single execution point and notifies other shards to perform their work in order. Requests originating. new process: - view_builder::on_create_view(...) runs only on shard 0 and kicks off dispatch_create_view(...) in the background. - dispatch_create_view(...) (shard 0) first checks should_ignore_tablet_keyspace(...) and returns early if needed. - dispatch_create_view(...) calls handle_seed_view_build_progress(...) on shard 0. That: - writes the global “build progress” row across all shards via _sys_ks.register_view_for_building_for_all_shards(...). - After seeding, dispatch_create_view(...) broadcasts to all shards with container().invoke_on_all(...). - Each shard runs handle_create_view_local(...), which: - waits for pending base writes/streams, flushes the base, - resets the reader to the current token and adds the new view, - handles errors and triggers _build_step to continue processing. Drop view - view_builder::on_drop_view(...) runs only on shard 0 and kicks off dispatch_drop_view(...) 
in the background. - dispatch_drop_view(...) (shard 0) first checks should_ignore_tablet_keyspace(...) and returns early if needed. - It broadcasts handle_drop_view_local(...) to all shards with invoke_on_all(...). - Each shard runs handle_drop_view_local(...), which: - removes the view from local build state (_base_to_build_step and _built_views) by scanning existing steps, - ignores missing keyspace cases. - After all shards finish local cleanup, shard 0 runs handle_drop_view_global_cleanup(...), which: - removes global build progress, built‑view state, and view build status in system tables, Shutdown - drain() waits on _view_notification_sem before _sem so in‑flight dispatches finish before bookkeeping is halted. In addition, the test is adjusted to remove the long eventual wait (596.52s / 30 iterations) and instead rely on the default wait of 17 iterations (~4.37 minutes), eliminating unnecessary delays while preserving correctness. Fixes: https://github.com/scylladb/scylladb/issues/27898 Backport: not required as the problem happens on master Closes scylladb/scylladb#27929	2026-01-12 09:23:22 +02:00
Michael Litvak	8df61f6d99	view: change validate_view_keyspace to allow MVs if RF=Racks The function validate_view_keyspace checks if a keyspace is eligible for having materialized views, and it is used for validation when creating a MV or a MV-based index. Previously, it was required that the rf_rack_valid_keyspaces option is set in order for tablets-based keyspaces to be considered eligible, and the RF-rack condition was enforced when the option is set. Instead of this, we change the validation to allow MVs in a keyspace if the RF-rack condition is satisfied for the keyspace - regardless of the config option. We remove the config validation for views on startup that validates the option `rf_rack_valid_keyspaces` is set if there are any views with tablets, since this is not required anymore. We can do this without worrying about upgrades because this change will be effective from 2025.4 where MVs with tablets are first out of experimental phase. We update the test for MV and index restrictions in tablets keyspaces according to the new requirements. * Create MV/index: previously the test checked that it's allowed only if the config option `rf_rack_valid_keyspaces` is set. This is changed now so it's always allowed to create MV/index if the keyspace is RF-rack-valid. Update the test to verify that we can create MV/index when the keyspace is RF-rack-valid, even if the rf_rack option is not set, and verify that it fails when the keyspace is RF-rack-invalid. * Alter: Add a new test to verify that while a keyspace has views, it can't be altered to become RF-rack-invalid.	2025-12-22 09:14:29 +01:00
Nadav Har'El	95e303faf3	Merge 'Refactor get_view_natural_endpoint' from Wojciech Mitros With the introduction of rack-lists and the reliance of materialized views on them, the `get_view_natural_endpoint` function can be greatly simplified. When using tablets, instead of doing any index-matching, we can now pair base tables with views only in the same rack. In this series we remove no longer needed code and reorganize the needed code for better clarity. After the changes, the `get_view_natural_endpoint` function goes down from 245 lines to 85 lines, while the whole pairing-related text goes down from 346 lines to 239 lines. Fixes https://github.com/scylladb/scylladb/issues/26313 Closes scylladb/scylladb#27383 * github.com:scylladb/scylladb: mv: replace the simple/complex rack-aware pairing with exact rack matching mv: split out vnode pairing code from get_view_natural_endpoint mv: unify self-pairing and rack-aware pairing into one bool mv: remove the workaround for left nodes when sending view updates	2025-12-09 13:19:13 +02:00
Wojciech Mitros	6221c58325	mv: replace the simple/complex rack-aware pairing with exact rack matching When the initial version of rack-aware pairing was introduced, materialized views with tablets were still experimental. Since then, we decided that we'll only allow materialized views in clusters where the base table and the view are replicated on the same racks, with one replica of each tablet on each rack. This allows us to remove almost all logic from our base-view pairing. The only check for the paired view replica is now whether it's in the same rack as the base replica sending the update. In this patch we replace the simple and complex rack-aware pairing with the simple check above. Because of this, we have to remove a test case from network_topology_strategy_test which was testing complex pairing. The tested topology is not supported for views with tablets (or is unlikely to be supported, as it's a random test), so there's no use keeping the test. The test case for simple rack-aware pairing was kept, but now we only test the case where each rack has one replica, not multiple. Additionally, we split finding of an unpaired replica to a separate function and partially rewrite it without reusing the helper structures that were present when calculating the simple and complex rack-aware pairing. We only look for an unpaired replica if we couldn't find a paired replica ourselves or if the number of view replicas didn't match the base replicas. If an unpaired replica appears while these conditions pass, we won't send an extra update, but that would be a new bug altogether, because we only expect the unpaired replica to appear during RF changes, so when these conditions aren't fulfilled. Fixes https://github.com/scylladb/scylladb/issues/26313	2025-12-02 10:52:36 +01:00
Wojciech Mitros	4ec0fa6eb5	mv: split out vnode pairing code from get_view_natural_endpoint To avoid repeatedly checking whether we're using tablets and having to use unnecessarily flexible code fitting both cases, we split out the base-view pairing code for the case of vnodes to another function. The get_view_natural_endpoint will now have only common steps, a call to that function, and steps specific to tablets.	2025-12-02 03:32:36 +01:00
Wojciech Mitros	c313b215e4	mv: unify self-pairing and rack-aware pairing into one bool We always use "legacy self pairing" when not using tablets, and the "rack aware pairing" has been enabled in every version where views with tablets isn't experimental. So in practice, instead of checking these variables we can just look at whether the table uses tablets.	2025-12-02 03:32:32 +01:00
Wojciech Mitros	7c612e1789	mv: remove the workaround for left nodes when sending view updates At one point, the get_view_natural_endpoint was using IP for the view update (and hint) destinations, but the hint code was using host_id for the destinations. When a node left, we could no longer have a mapping from an IP to host_id and when trying to store a hint for this IP, we'd crash. We worked around this issue by dropping the view update completely if the target is in the "left" state. Since then, we also moved to host_id's in the view update code, so there's no longer any translation needed when storing the hints. Additionally, we now drain hints not when entering the "left" state, but when the node actually stops owning tokens. Because of that, the workaround is not needed anymore, so we remove it in this commit. The existing test_mv_tablets_empty_ip case verifies that indeed, we do not crash in the original problematic scenario.	2025-12-01 12:27:28 +01:00
Wojciech Mitros	323e5cd171	mv: allow setting concurrency in PRUNE MATERIALIZED VIEW The PRUNE MATERIALIZED VIEW statement is performed as follows: 1. Perform a range scan of the view table from the view replicas based on the ranges specified in the statement. 2. While reading the paged scan above, for each view row perform a read from all base replicas at the corresponding primary key. If a discrepancy is detected, delete the row in the view table. When reading multiple rows, this is very slow because for each view row we need to perform a single-row query on multiple replicas. In this patch we add an option to speed this up by performing many of the single base row reads concurrently, at the concurrency specified in the USING CONCURRENCY clause. Fixes https://github.com/scylladb/scylladb/issues/27070	2025-11-27 00:02:28 +01:00
Wojciech Mitros	0a22ac3c9e	mv: don't mark the view as built if the reader produced no partitions When we build a materialized view we read the entire base table from start to end to generate all required view updates. If a view is created while another view is being built on the same base table, this is optimized - we start generating view updates for the new view from the base table rows that we're currently reading, and we read the missed initial range again after the previous view finishes building. The view building progress is only updated after generating view updates for some read partitions. However, there are scenarios where we'll generate no view updates for the entire read range. If this was not handled we could end up in an infinite view building loop like we did in https://github.com/scylladb/scylladb/issues/17293 To handle this, we mark the view as built if the reader generated no partitions. However, this is not always the correct conclusion. Another scenario where the reader won't encounter any partitions is when view building is interrupted, and then we perform a reshard. In this scenario, we set the reader for all shards to the last unbuilt token for an existing partition before the reshard. However, this partition may not exist on a shard after reshard, and if there are also no partitions with higher tokens, the reader will generate no partitions even though it hasn't finished view building. Additionally, we already have a check that prevents infinite view building loops without taking the partitions generated by the reader into account. At the end of stream, before looping back to the start, we advance current_key to the end of the built range and check for built views in that range. This handles the case where the entire range is empty - the conditions for a built view are: 1. the "next_token" is no greater than "first_token" (the view building process looped back, so we've built all tokens above "first_token") 2.
the "current_token" is no less than "first_token" (after looping back, we've built all tokens below "first_token") If the range is empty, we'll pass these conditions on an empty range after advancing "current_key" to the end because: 1. after looping back, "next_token" will be set to `dht::minimum_token` 2. "current_key" will be set to `dht::ring_position::max()` In this patch we remove the check for partitions generated by the reader. This fixes the issue with resharding and it does not resurrect the issue with infinite view building that the check was introduced for. Fixes https://github.com/scylladb/scylladb/issues/26523 Closes scylladb/scylladb#26635	2025-11-05 17:02:32 +02:00
Avi Kivity	d81796cae3	Merge 'Limit concurrent view updates from all sources' from Wojciech Mitros Before this patch, when a base table has many materialized views, each write to this table can start up to 128 view updates in parallel. With high client write concurrency, the actual concurrency of writes executed on the node may grow unexpectedly, which can lead to higher latency and higher memory usage compared to a sequential approach. In this patch we add a per-shard, per-service-level semaphore which limits the number of concurrent view updates processed on the shard in this service level to a constant value. We take one unit from the semaphore for each local view update write, and release it when the write finishes. The remote view updates do not take units from the semaphore because they don't consume nearly as much processing power and they are limited by another semaphore based on their memory usage. Fixes https://github.com/scylladb/scylladb/issues/25341 Closes scylladb/scylladb#25456 * github.com:scylladb/scylladb: mv: limit concurrent view updates from all sources database: rename _view_update_concurrency_sem to _view_update_memory_sem	2025-10-28 11:13:24 +02:00
Wojciech Mitros	f07a86d16e	mv: limit concurrent view updates from all sources Before this patch, when a base table has many materialized views, each write to this table can start up to 128 view updates in parallel. With high client write concurrency, the actual concurrency of writes executed on the node may grow unexpectedly, which can lead to higher latency and higher memory usage compared to a sequential approach. In this patch we add a per-shard, per-service-level semaphore which limits the number of concurrent view updates processed on the shard in this service level to a constant value. We take one unit from the semaphore for each local view update write, and release it when the write finishes. The remote view updates do not take units from the semaphore because they don't consume nearly as much processing power and they are limited by another semaphore based on their memory usage. The effect of this patch can also be observed when writing to a base table with a large number of materialized views, like in the materialized_views_test.py::TestMaterializedViews::test_many_mv_concurrent dtest. In that test, if we perform a full scan in parallel to a write workload with a concurrency of 100 to a table with 100 views, the scan would sometimes time out because it would effectively get 1/10000 of the CPU. With this patch, the CPU concurrency of view updates was limited to 128 (we ran both writes and scan in the same service level), and the scan no longer timed out. Fixes https://github.com/scylladb/scylladb/issues/25341	2025-10-27 18:55:41 +01:00
Wojciech Mitros	c0d0f8f85b	database: rename _view_update_concurrency_sem to _view_update_memory_sem In the following commit, we'll introduce a new semaphore for view updates that limits their concurrency by view update count. To avoid confusion, we rename the existing semaphore that tracks the memory used by concurrent view updates and related objects accordingly.	2025-10-23 10:00:15 +02:00
Botond Dénes	24c6476f73	mutation/mutation_compactor: add tombstone_gc_state to query ctor So tombstones can be purged correctly based on the tombstone gc mode. Currently if repair-mode is used, tombstones are not purged at all, which can lead to purged tombstones being re-replicated to replicas which already purged them via read-repair. This is not a correctness problem, since tombstones are not included in the data query result or digest; these purgeable tombstones are only a nuisance for read repair, where they can create extra differences between replicas. Note that for the read repair to trigger, some difference other than in purgeable tombstones has to exist, because as mentioned above, these are not included in digests. Fixes: scylladb/scylladb#24332 Closes scylladb/scylladb#26351	2025-10-12 17:48:15 +03:00
Piotr Dulikowski	e7907b173a	Merge 'db/view: Require rf_rack_valid_keyspaces when creating materialized view' from Dawid Mędrek Materialized views are currently in the experimental phase and using them in tablet-based keyspaces requires starting Scylla with an experimental feature, `views-with-tablets`. Any attempts to create a materialized view or secondary index when it's not enabled will fail with an appropriate error. After considerable effort, we're drawing close to bringing views out of the experimental phase, and the experimental feature will no longer be needed. However, materialized views in tablet-based keyspaces will still be restricted, and creating them will only be possible after enabling the configuration option `rf_rack_valid_keyspaces`. That's what we do in this PR. In this patch, we adjust existing tests in the tree to work with the new restriction. That shouldn't have been necessary because we've already seemingly adjusted all of them to work with the configuration option, but some tests hid well. We fix that mistake now. After that, we introduce the new restriction. What's more, when starting Scylla, we verify that there is no materialized view that would violate the contract. If there are some that do, we list them, notify the user, and refuse to start. High-level implementation strategy: 1. Name the restrictions in form of a function. 2. Adjust existing tests. 3. Restrict materialized views by both the experimental feature and the configuration option. Add validation test. 4. Drop the requirement for the experimental feature. Adjust the added test and add a new one. 5. Update the user documentation. Fixes scylladb/scylladb#23030 Backport: 2025.4, as we are aiming to support materialized views for tablets from that version. 
Closes scylladb/scylladb#25802 * github.com:scylladb/scylladb: view: Stop requiring experimental feature db/view: Verify valid configuration for tablet-based views db/view: Require rf_rack_valid_keyspaces when creating view test/cluster/random_failures: Skip creating secondary indexes test/cluster/mv: Mark test_mv_rf_change as skipped test/cluster: Adjust MV tests to RF-rack-validity test/boost/schema_loader_test.cc: Explicitly enable rf_rack_valid_keyspaces db/view: Name requirement for views with tablets	2025-10-06 12:46:46 +02:00
Dawid Mędrek	b409e85c20	view: Stop requiring experimental feature We modify the requirements for using materialized views in tablet-based keyspaces. Before, it was necessary to enable the configuration option `rf_rack_valid_keyspaces`, having the cluster feature `VIEWS_WITH_TABLETS` enabled, and using the experimental feature `views-with-tablets`. We drop the last requirement. We adjust code to that change and provide a new validation test. We also update the user documentation to reflect the changes. Fixes scylladb/scylladb#23030	2025-10-01 09:01:53 +02:00
Dawid Mędrek	00222070cd	db/view: Require rf_rack_valid_keyspaces when creating view We extend the requirements for being able to create materialized views and secondary indexes in tablet-based keyspaces. It's now necessary to enable the configuration option `rf_rack_valid_keyspaces`. This is a stepping stone towards bringing materialized views and secondary indexes with tablets out of the experimental phase. We add a validation test to verify the changes. Refs scylladb/scylladb#23030	2025-10-01 09:01:50 +02:00
Michael Litvak	c9237bf5f6	mv: generate view updates on both shards in intranode migration Similarly to the issue of tokens migrating from one host to another, where we need to generate view updates on both replicas before transitioning in order to not lose view updates, we need to do the same in case of intranode migration. In intranode migration we migrate tokens from one shard to another. Previously we checked shard_for_reads in order to generate view updates only on the single shard that is selected for reads, and not on a pending shard that is not ready yet. The problem is that shard_for_reads switches from the source shard to the destination shard in a single transition, and during that switch we can lose view updates because neither shard sees itself as the shard for reads. We fix this by having a phase before the transition when both shards are ready for reads and both will generate view updates.	2025-09-29 13:44:04 +02:00
Michael Litvak	d842ea2dc9	mv: generate view updates on pending replica Generate view updates from a pending base replica if it's a reading replica, i.e. it's in the last stage of transition write_both_read_new before becoming the new base replica. Previously we didn't generate view updates on a pending replica. The problem with that is that when a base token is migrated from one replica B1 to another B2, at one stage we generate view updates only from B1, then at the next stage we generate view updates only from B2. During this transition, it can happen that for some write neither B1 nor B2 generate view update, because each one sees the other as the base replica. We fix this by generating view updates from both base replicas in the phase before the transition. We can generate view updates on the pending replica in this case, even if it requires read-before-write, because it's in a stage where it contains all data and serves reads. Fixes scylladb/scylladb#24292	2025-09-29 13:44:04 +02:00
Dawid Mędrek	a1254fb6f3	db/view: Name requirement for views with tablets We add a named requirement, a function, for materialized views with tablets. It decides whether we can create views and secondary indexes in a given keyspace. It's a stepping stone towards modifying the requirements for it. This way, we keep the code in one place, so it's not possible to forget to modify it somewhere. It also makes it more organized and concise.	2025-09-29 13:07:08 +02:00
Michael Litvak	6bc41926e2	view_builder: reduce log level for expected aborts during view creation When draining the view builder, we abort ongoing operations using the view builder's abort source, which may cause them to fail with abort_requested_exception or raft::request_aborted exceptions. Since these failures are expected during shutdown, reduce the log level in add_new_view from 'error' to 'debug' for these specific exceptions while keeping 'error' level for unexpected failures. Closes scylladb/scylladb#26297	2025-09-28 22:55:07 +03:00
Ernest Zaslavsky	5ba5aec1f8	treewide: Move mutation related files to a `mutation` directory As requested in #22104, moved the files and fixed other includes and build system. Moved files: - combine.hh - collection_mutation.hh - collection_mutation.cc - converting_mutation_partition_applier.hh - converting_mutation_partition_applier.cc - counters.hh - counters.cc - timestamp.hh Fixes: #22104 This is a cleanup, no need to backport Closes scylladb/scylladb#25085	2025-09-24 13:23:38 +03:00
Wojciech Mitros	d9b8278178	mv: handle mismatched base/view replica count caused by RF change During an ALTER KEYSPACE statement execution where a table with a view is present, we need to perform tablet migrations for both tables. These migrations are not synchronized, so at some point the base may have a different number of non-pending replicas than the view. Because of that, we can't pair them correctly. If there are more non-pending base replicas than view replicas, we don't need to do anything because the view replica that didn't finish migrating is a pending replica and will get view updates from all base replicas. But if there are more non-pending view replicas than base replicas, we may currently lose view updates to the new view replica. This patch adds a workaround for this scenario. If after one migration we have more non-pending view replicas than base replicas, we add the extra view replica to the pending replica list so that it gets an update anyway. This patch will also take effect if the base and view replica counts differ due to some other bug. To track that, a new metric is added to count such occurrences. This patch also includes a test for this exact scenario, which is enforced by an injection. Fixes https://github.com/scylladb/scylladb/issues/21492	2025-09-22 12:50:16 +02:00
Wojciech Mitros	59c40a2edd	mv: save the nodes used for pairing calculations for later reuse In get_view_natural_endpoint() we start with the list of host_ids from the effective replication maps, which we later translate to locator::node to get the information about racks and datacenters. We check all replicas, but we only store the ones relevant for pairing, so for tablets, the ones in the same DC as the replica sending the update. In the next patch, we'll occasionally need to send cross-DC view updates, so to avoid computing the nodes again, in this patch we adjust the logic to prepare them in advance and save them so that they can be later reused.	2025-09-22 12:45:24 +02:00
Wojciech Mitros	9d4449a492	mv: move the decision about simple rack-aware pairing later We'll need to get the lists for the whole dc when fixing replica count mismatches caused by RF changes, so let's first get these lists, and only filter them later if we decide to use simple rack-aware pairing.	2025-09-22 12:45:24 +02:00
Michael Litvak	3dffb8e0dc	test: mv: add a test for view build interrupt during registration Add a new test that reproduces issue #22989. The test starts view building and interrupts it by restarting the node while some shards registered their status and some didn't.	2025-09-21 10:39:30 +02:00
Michael Litvak	6043409c31	view_builder: register view on all shards atomically When the view builder starts to build a new view, each shard registers itself by writing the shard id and current token to the scylla_views_builds_in_progress table. Previously, this happened independently by each shard. We change it now to register all shards "atomically" - when a shard registers itself, it also registers all other shards with an empty status, if they aren't registered yet. This ensures that we don't have a partial state in the table where only some of the shards are registered, but we always have a status for all shards. The reason we want to register all shards atomically is that if it happens that only some of the shards were registered, then we restart and load the status from table, this doesn't work well for multiple reasons. One example is that to know how many shards we had previously, we take the maximum shard id we see in the table. If it's different than the current shard count, we will execute the reshard code. But of course, if the last shard is missing from the table because it didn't register itself, this calculation will be wrong, and we can't know the previous number of shards. This is a problem because suppose we have two shards, and shard 0 finished building the view but shard 1 didn't start. When we come up, we will think that previously we had only a single shard and it completed building everything, when in fact we built only half the view approximately. The problem is that we don't have enough information in the tables to know that. There are additional problems related to reshard. In the reshard function, whether it is executed because we actually do node reshard or because we calculated the wrong number of previous shards, if the status of some shard is missing then the calculation of new ranges will be wrong. When some shard didn't make progress we should start building the view from scratch. 
However, this doesn't happen if we don't have a status for the shard, because the code looks only for shards that have a status. In effect, this shard is considered complete even though it didn't start. This could cause the view building to get stuck or complete without building all token ranges. By registering all shards atomically, this should solve the above problems because we will always have statuses for all shards. Fixes scylladb/scylladb#22989	2025-09-21 10:39:05 +02:00
Ernest Zaslavsky	d624413ddd	treewide: Move query related files to a new `query` directory As requested in #22120, moved the files and fixed other includes and build system. Moved files: - query.cc - query-request.hh - query-result.hh - query-result-reader.hh - query-result-set.cc - query-result-set.hh - query-result-writer.hh - query_id.hh - query_result_merger.hh Fixes: #22120 This is a cleanup, no need to backport Closes scylladb/scylladb#25105	2025-09-16 23:40:47 +03:00
Wojciech Mitros	1f9be235b8	mv: delete previously undetected ghost rows in PRUNE MATERIALIZED VIEW statement The PRUNE MATERIALIZED VIEW statement is supposed to remove ghost rows from the view. Ghost rows are rows in the view with no corresponding row in the base table. Before this patch, only rows whose primary key columns of the base table had different values than any of the base rows were treated as ghost rows by the PRUNE statement. However, view rows which have a column in their primary key that's not in the base primary can also be ghost rows if this column has a different value than the base row with the same values of remaining primary key columns. That's because these rows won't be deleted unless we change value of this column in the base table to this specific value. In this patch we add a check for this column in the PRUNE MATERIALIZED VIEW logic. If this column isn't the same in the base table and the view, these rows are also deleted. Fixes https://github.com/scylladb/scylladb/issues/25655 Closes scylladb/scylladb#25720	2025-09-10 07:35:00 +02:00
Michał Jadwiszczak	1e2fa069df	db/view/view_builder: ignore `no_such_keyspace` exception	2025-08-27 10:23:04 +02:00
Michał Jadwiszczak	233f4dcee3	db/view/view_building_worker: register staging sstable to view building coordinator when needed Change return type of `check_needs_view_update_path()`. Instead of returning a bool that tells whether to use the staging directory (and register to `view_update_generator`) or the normal directory, the function now returns an enum with possible values: - `normal_directory` - use normal directory for the sstable - `staging_directly_to_generator` - use staging directory and register to `view_update_generator` - `staging_managed_by_vbc` - use staging directory but don't register it to `view_update_generator` but create view building tasks for later The third option is new, it's used when the table has any view which is currently being built. In this case, registering it to `view_update_generator` prematurely may lead to base-view inconsistency (for example when a replica is in a pending state).	2025-08-27 10:23:03 +02:00
Michał Jadwiszczak	201c4fafec	db/view/view_building_coordinator: update view build status on node join/left Copy view build status for new node for tablet views and remove relevant statuses when a node is leaving the cluster.	2025-08-27 10:23:03 +02:00
Michał Jadwiszczak	c9e710dca3	db/view: introduce `view_building_worker` The worker is responsible for building tablet-based views by executing tasks scheduled by the view building coordinator. It observes the view building state machine and waits on the machine's condition variable (so the worker is woken up when group0 state is applied). The tasks are executed in batches, all tasks in one batch need to have the same: type, base_id, table_id. One shard can only execute one batch at a time (at least for now, in the future we might want to change that). The worker keeps track of finished and failed tasks in its local state. The state is cleared when `view_building_state::currently_processed_base_table` is changed.	2025-08-27 10:22:59 +02:00
Michał Jadwiszczak	a59624c604	db/view: extract common view building functionalities Extract common methods of view builder consumer to an abstract class and `flush_base()` and `make_partition_slice()` functions, so they can be used in view builder (vnode-based views) and view building consumer (tablet-based views; introduced in the next commit).	2025-08-27 08:55:48 +02:00
Michał Jadwiszczak	f71594738e	db/view: prepare to create abstract `view_consumer` In next commit, I'm going to introduce `view_building_worker::consumer`, with very similar functionalities to `view_builder::consumer` but it'll only consume range of one tablet per execution. Since most functions are very similar, I'll create abstract `view_consumer` which will be base for both of the consumers. In order to make the transition more readable, this commit prepares the `view_builder::consumer` by making some functions virtual and next commit will extract part of functions to the abstract class.	2025-08-27 08:55:48 +02:00
Michał Jadwiszczak	f90dd522df	db/view: extract `system.view_build_status_v2` cql statements to system_keyspace Until now, all changes to `system.view_build_status_v2` were made from view.cc and the file contained all of the helper methods. This commit introduces a `build_status` enum class to avoid using hardcoded strings and extracts the helper methods to `system_keyspace` class, so they can be later used by the view building coordinator.	2025-08-27 08:55:46 +02:00
Michał Jadwiszczak	d0826e7cb1	db/view: ignore tablet-based views in `view_builder` View building of tablet-based views will be handled by the view building coordinator later in this patch.	2025-08-27 08:55:46 +02:00
Ernest Zaslavsky	d2c5765a6b	treewide: Move keys related files to a new keys directory As requested in #22102, #22103 and #22105 moved the files and fixed other includes and build system. Moved files: - clustering_bounds_comparator.hh - keys.cc - keys.hh - clustering_interval_set.hh - clustering_key_filter.hh - clustering_ranges_walker.hh - compound_compat.hh - compound.hh - full_position.hh Fixes: #22102 Fixes: #22103 Fixes: #22105 Closes scylladb/scylladb#25082	2025-07-25 10:45:32 +03:00
Benny Halevy	3feb759943	everywhere: use utils::chunked_vector for list of mutations Currently, we use std::vector<*mutation> to keep a list of mutations for processing. This can lead to large allocation, e.g. when the vector size is a function of the number of tables. Use a chunked vector instead to prevent oversized allocations. `perf-simple-query --smp 1` results obtained for fixed 400MHz frequency and PGO disabled: Before (read path): ``` enable-cache=1 Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no} Disabling auto compaction Creating 10000 partitions... 89055.97 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39417 insns/op, 18003 cycles/op, 0 errors) 103372.72 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39380 insns/op, 17300 cycles/op, 0 errors) 98942.27 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39413 insns/op, 17336 cycles/op, 0 errors) 103752.93 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39407 insns/op, 17252 cycles/op, 0 errors) 102516.77 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39403 insns/op, 17288 cycles/op, 0 errors) throughput: mean= 99528.13 standard-deviation=6155.71 median= 102516.77 median-absolute-deviation=3844.59 maximum=103752.93 minimum=89055.97 instructions_per_op: mean= 39403.99 standard-deviation=14.25 median= 39406.75 median-absolute-deviation=9.30 maximum=39416.63 minimum=39380.39 cpu_cycles_per_op: mean= 17435.81 standard-deviation=318.24 median= 17300.40 median-absolute-deviation=147.59 maximum=18002.53 minimum=17251.75 ``` After (read path) ``` enable-cache=1 Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no} Disabling auto compaction Creating 10000 partitions... 
59755.04 tps ( 66.2 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39466 insns/op, 22834 cycles/op, 0 errors) 71854.16 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39417 insns/op, 17883 cycles/op, 0 errors) 82149.45 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39411 insns/op, 17409 cycles/op, 0 errors) 49640.04 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.3 tasks/op, 39474 insns/op, 19975 cycles/op, 0 errors) 54963.22 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.3 tasks/op, 39474 insns/op, 18235 cycles/op, 0 errors) throughput: mean= 63672.38 standard-deviation=13195.12 median= 59755.04 median-absolute-deviation=8709.16 maximum=82149.45 minimum=49640.04 instructions_per_op: mean= 39448.38 standard-deviation=31.60 median= 39466.17 median-absolute-deviation=25.75 maximum=39474.12 minimum=39411.42 cpu_cycles_per_op: mean= 19267.01 standard-deviation=2217.03 median= 18234.80 median-absolute-deviation=1384.25 maximum=22834.26 minimum=17408.67 ``` `perf-simple-query --smp 1 --write` results obtained for fixed 400MHz frequency and PGO disabled: Before (write path): ``` enable-cache=1 Running test with config: {partitions=10000, concurrency=100, mode=write, query_single_key=no, counters=no} Disabling auto compaction 63736.96 tps ( 59.4 allocs/op, 16.4 logallocs/op, 14.3 tasks/op, 49667 insns/op, 19924 cycles/op, 0 errors) 64109.41 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 49992 insns/op, 20084 cycles/op, 0 errors) 56950.47 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50005 insns/op, 20501 cycles/op, 0 errors) 44858.42 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50014 insns/op, 21947 cycles/op, 0 errors) 28592.87 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50027 insns/op, 27659 cycles/op, 0 errors) throughput: mean= 51649.63 standard-deviation=15059.74 median= 56950.47 median-absolute-deviation=12087.33 maximum=64109.41 minimum=28592.87 instructions_per_op: mean= 49941.18 standard-deviation=153.76 median= 
50005.24 median-absolute-deviation=73.01 maximum=50027.07 minimum=49667.05 cpu_cycles_per_op: mean= 22023.01 standard-deviation=3249.92 median= 20500.74 median-absolute-deviation=1938.76 maximum=27658.75 minimum=19924.32 ``` After (write path) ``` enable-cache=1 Running test with config: {partitions=10000, concurrency=100, mode=write, query_single_key=no, counters=no} Disabling auto compaction 53395.93 tps ( 59.4 allocs/op, 16.5 logallocs/op, 14.3 tasks/op, 50326 insns/op, 21252 cycles/op, 0 errors) 46527.83 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50704 insns/op, 21555 cycles/op, 0 errors) 55846.30 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50731 insns/op, 21060 cycles/op, 0 errors) 55669.30 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50735 insns/op, 21521 cycles/op, 0 errors) 52130.17 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50757 insns/op, 21334 cycles/op, 0 errors) throughput: mean= 52713.91 standard-deviation=3795.38 median= 53395.93 median-absolute-deviation=2955.40 maximum=55846.30 minimum=46527.83 instructions_per_op: mean= 50650.57 standard-deviation=182.46 median= 50731.38 median-absolute-deviation=84.09 maximum=50756.62 minimum=50325.87 cpu_cycles_per_op: mean= 21344.42 standard-deviation=202.86 median= 21334.00 median-absolute-deviation=176.37 maximum=21554.61 minimum=21060.24 ``` Fixes #24815 Improvement for rare corner cases. No backport required Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#24919	2025-07-13 19:13:11 +03:00
Avi Kivity	16fb68bb5e	interval: rename start_ref() back to start() (and end_ref() etc). To reduce noise, rename start_ref() back to its original name start(), after it was changed in the previous patch to force an audit of all calls.	2025-06-14 21:26:16 +03:00

557 Commits