scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-01 12:36:56 +00:00

Author	SHA1	Message	Date
Patryk Jędrzejczak	73db5c94de	Merge 'db: api: service: introduce system.client_routes table and related API endpoints' from Andrzej Jackowski `system.client_routes` is a system table that sets the target address and ports for each `host_id`, for one or more connections (e.g., Private Link) represented by `connection_id`. Cloud will write the table via REST, and drivers will read it via CQL to override values obtained from `system.local` and `system.peers`. This patch series contains: - Introduction of `CLIENT_ROUTES` feature flag. - Implementation of raft-based `system.client_routes` table - Implementation of `v2/client-routes` POST/DELETE/GET endpoints - Implementation of new `CLIENT_ROUTES_CHANGE` event that is sent to drivers when `system.client_routes` is changed - New tests that verify the aforementioned features Ref: scylladb/scylla-enterprise#5699 For now, no automatic backport. However, the changes are planned to be released on `2025.4` either as a backport or a private build. Closes scylladb/scylladb#27323 * https://github.com/scylladb/scylladb: docs: describe CLIENT_ROUTES_CHANGE extension test: add test for CLIENT_ROUTES event service: transport: add CLIENT_ROUTES_CHANGE event test: add cluster tests for client routes test: add API tests for client_routes endpoints test: add `timeout` parameter to `delete` in RESTClient test: allow json_body in send api: implement client_routes endpoints api: add client_routes.json service: main: add client_routes_service db: add system.client_routes table gms: add CLIENT_ROUTES feature	2025-12-16 10:38:27 +01:00
Botond Dénes	dace39fd6c	Merge 'Make commitlog replay handle files with corrupt file header (non-zero) as data loss, not startup failure' from Calle Wilund Fixes #26744 If a segment to replay is broken such that the main header is not zero, but still broken, we throw header_checksum_error. This was not handled in replayer, which grouped this into the "user error/fundamental problem" category. However, assuming we allow for "real" disk corruption, this should really be treated same as data corruption, i.e. reported data loss, not failure to start up. The `test_one_big_mutation_corrupted_on_startup` test accidentally sometimes provoked this issue, by doing random file wrecking, which on rare occasions provoked this, and thus failed test due to scylla not starting up, instead of losing data as expected. Closes scylladb/scylladb#27556 * github.com:scylladb/scylladb: test::cluster::dtest::tools::files: Remove file commitlog_replay: Handle fully corrupt files same as partial corruption. test::pylib::suite::base: Split options.name test specifier only once	2025-12-16 06:55:42 +02:00
Patryk Jędrzejczak	844545bb74	Merge 'treewide: fix cases of improper re-throwing of `std::exception_ptr`' from Emil Maskovsky Fix multiple cases where the captured `std::exception_ptr` has been re-thrown via simple `throw eptr;`, which results in losing the original exception type and details. Resolved at various places found by clang-tidy: 1. db::schema_applier When applying schema changes, the previous implementation attempted to handle exceptions by catching and rethrowing them, but did so incorrectly: using `throw ex` with a `std::exception_ptr` loses the original exception type and details. However, in this case, explicit exception handling is unnecessary. The only reason for catching was to ensure `ap.destroy()` is called before propagating the exception. This can be more cleanly and safely achieved using Seastar's `.finally()` continuation, which guarantees cleanup regardless of success or failure. 2. directories The `std::exception_ptr()` has been captured for logging and then again re-thrown incorrectly via `throw ex;`. We could use `std::rethrow_exception()` here instead, but it seems to be simpler to just use regular `throw;` to rethrow the original exception, and only use the `std::current_exception()` for logging (which is a pattern used in other places as well). 3. storage_service Here the exception has been re-thrown incorrectly in a coroutine. There it is best to use the `co_await coroutine::return_exception_ptr` to propagate exception more efficiently in a coroutine-friendly manner. Fixes: SCYLLADB-94 Refs: scylladb/scylladb#27501 No backport: This fixes an error logging issue, that isn't a production problem by itself (only found in test), therefore not backporting to older branches. 
Closes scylladb/scylladb#27613 * https://github.com/scylladb/scylladb: db: schema_applier: improve exception-safe cleanup directories: fix exception rethrowing storage_service: use coroutine-friendly exception propagation in join_node_response_handler	2025-12-15 13:56:45 +01:00
Andrzej Jackowski	8dde70d04c	db: add system.client_routes table Introduce `system.client_routes`, a system table that sets the target address and ports for each `host_id`, for one or more connections (e.g., Private Link) represented by `connection_id`. Cloud will write the table via REST, and drivers will read it via CQL to override values obtained from `system.local` and `system.peers`. The table is Raft-managed to provide consistent replication across nodes. Schema overview: each row is identified by `(connection_id, host_id)` and describes where clients should connect: `address` and one or more of `port`, `tls_port`, `alternator_port`, `alternator_https_port`. `host_id` is a UUID (just as in ScyllaDB) but `connection_id` can be any string to accept formats of all cloud providers. `address` is also a regular string because it can represent either an IP address or a domain. Ports are optional in the sense that at least one of the four must be provided. Ref: scylladb/scylla-enterprise#5699	2025-12-15 13:08:05 +01:00
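The row shape described in the commit above, and the rule that at least one of the four ports must be provided, can be sketched in C++ as follows. This is only an illustrative model: the `client_route_entry` struct and `has_any_port` helper are hypothetical names, `host_id` is kept as a plain string rather than a UUID type, and nothing here is taken from the ScyllaDB sources.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical in-memory model of one system.client_routes row.
// The row is identified by (connection_id, host_id).
struct client_route_entry {
    std::string connection_id;  // any string, to accept all cloud providers' formats
    std::string host_id;        // a UUID in the real table; a string in this sketch
    std::string address;        // an IP address or a domain name
    std::optional<std::uint16_t> port;
    std::optional<std::uint16_t> tls_port;
    std::optional<std::uint16_t> alternator_port;
    std::optional<std::uint16_t> alternator_https_port;
};

// Returns true if at least one of the four optional ports is set.
bool has_any_port(const client_route_entry& e) {
    return e.port || e.tls_port || e.alternator_port || e.alternator_https_port;
}

// Small usage example: an entry without ports is rejected, one with a port passes.
void example_client_route() {
    client_route_entry e{"pl-1", "host-a", "10.0.0.1", {}, {}, {}, {}};
    assert(!has_any_port(e));
    e.tls_port = 9142;
    assert(has_any_port(e));
}
```

Modeling the ports as `std::optional` keeps "not provided" distinct from any numeric value, which is what makes the "at least one of four" check straightforward.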
Pavel Emelyanov	3f7ee3ce5d	Merge 'batchlog: make replay (flush) faster' from Botond Dénes The batchlog table contains an entry for each logged batch that is processed by the local node as coordinator. These entries are typically very short lived, they are inserted when the batch is processed and deleted immediately after the batch is successfully applied. When a table has `tombstone_gc = {'mode': 'repair'}` enabled, every repair has to flush all hints and batchlogs, so that we can be certain that there is no live data in any of these, older than the last repair. Since batches can contain member queries from any number of tables, the whole batchlog has to be flushed, even if repair-mode tombstone-gc is enabled for a single table. Flushing the batchlog table happens by doing a batchlog replay. This involves reading the entire content of this table, and attempting to replay+delete any live entries (that are old enough to be replayed). Under normal operating circumstances, 99%+ of the content of the batchlog table is partition tombstones. Because of this, scanning the content of this table has to process thousands to millions of tombstones. This was observed to require up to 20 minutes to finish, causing repairs to slow down to a crawl, as the batchlog-flush has to be repeated at the end of the repair of each token-range. When trying to address this problem, the first idea was that we should expedite the garbage-collection of these accumulated tombstones. This experiment failed, see https://github.com/scylladb/scylladb/pull/23752. The commitlog proved to be an impossible to bypass barrier, preventing quick garbage-collection of tombstones. So long as a single commit-log segment is alive, holding content from the batchlog table, all tombstones written after are blocked from GC. The second approach, represented by this PR, is to not rely on tombstone GC to reduce the tombstone amount. 
Instead restructure the table such that a single higher-order tombstone can be used to shadow and allow for the eviction of the myriads of individual batchlog entry tombstones. This is realized by reorganizing the batchlog table such that individual batches are rows, not partitions. This new schema is introduced by the new `system.batchlog_v2` table, introduced by this PR: CREATE TABLE system.batchlog_v2 ( version int, stage int, shard int, written_at timestamp, id uuid, data blob, PRIMARY KEY ((version, stage, shard), written_at, id)); The new schema organization has the following goals: 1) Make post-replay batchlog cleanup possible with a simple range-tombstone. This allows dropping the individual dead batchlog entries, as they are shadowed by a higher level tombstone. This enables dropping tombstones without tombstone GC. 2) To make the above possible, introduce the stage key component: batchlog entries that fail the first replay attempt, are moved to the failed_replay stage, so the initial stage can be cleaned up safely. 3) Spread out the data among Scylla shards, via the batchlog shard column. 4) Make batchlog entries ordered by the batchlog create time (id). This allows for selecting batchlogs to replay, without post-filtering of batchlogs that are too young to be replayed. Fixes: https://github.com/scylladb/scylladb/issues/23358 This is an improvement, normally not a backport-candidate. We might override this and backport to allow wider use of `tombstone_gc: {'mode': 'repair'}`. 
Closes scylladb/scylladb#26671 * github.com:scylladb/scylladb: db/config: change batchlog_replay_cleanup_after_replays default to 1 test/boost/batchlog_manager_test: add test for batchlog cleanup replica/mutation_dump: always set position weight for clustering positions service/storage_proxy: s/batch_replay_throw/storage_proxy_fail_replay_batch/ test/lib: introduce error_injection.hh utils/error_injection: add debug log to disable() and disable_all() test/lib/cql_test_env: forward config to batchlog test/lib/cql_test_env: add batch type to execute_batch() test/lib/cql_assertions: add with_size(predicate) overload test/lib/cql_assertions: add source location to fail messages test/lib/cql_assertions: columns_assertions: add assert_for_columns_of_each_row() test/lib/cql_assertions: rows_assertions::assert_for_columns_of_row(): add index bound check test/lib/cql_assertions: columns_assertions: add T* with_typed_column() overload db/batchlog_manager: config: s/write_timeout/reply_timeot/ db,service: switch to system.batchlog_v2 db/system_keyspace: introduce system.batchlog_v2 service,db: extract generation of batchlog delete mutation service,db: extract get_batchlog_mutation_for() from storage-proxy db/batchlog_manager: only consider propagation delay with tombstone-gc=repair db/batchlog_manager: don't drop entire batch if one mutations' table was dropped data_dictionary: table: add get_truncation_time() db/batchlog_manager: batch(): replace map_reduce() with simple loop db/batchlog_manager: finish coroutinizing replay_all_failed_batches db/batchlog_manager: improve replayAllFailedBatches logs	2025-12-15 15:05:19 +03:00
Emil Maskovsky	5e7456936e	db: schema_applier: improve exception-safe cleanup When applying schema changes, the previous implementation attempted to handle exceptions by catching and rethrowing them, but did so incorrectly: using `throw ex` with a `std::exception_ptr` loses the original exception type and details. The correct approach is to use `std::rethrow_exception()`. However, in this case, explicit exception handling is unnecessary. The only reason for catching was to ensure `ap.destroy()` is called before propagating the exception. This can be more cleanly and safely achieved using Seastar's `.finally()` continuation, which guarantees cleanup regardless of success or failure. This change removes the manual try/catch/rethrow and uses `.finally()` to ensure proper cleanup, letting exceptions propagate naturally and preserving their type and information. Fixes: SCYLLADB-94 Refs: scylladb/scylladb#27501	2025-12-12 18:18:31 +01:00
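The difference between `throw ex;` on a `std::exception_ptr` and `std::rethrow_exception(ex)` can be reproduced with standard C++ alone. The sketch below is not taken from the ScyllaDB sources; the function names and the "disk full" message are illustrative assumptions, and the Seastar `.finally()` variant is not shown because it needs the Seastar library.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <string>

// Wrong pattern: `throw ep;` throws the std::exception_ptr object itself,
// so handlers for the original exception type are not matched.
std::string rethrow_wrong() {
    std::exception_ptr ep = std::make_exception_ptr(std::runtime_error("disk full"));
    try {
        throw ep;
    } catch (const std::exception& e) {
        return std::string("std::exception: ") + e.what();
    } catch (const std::exception_ptr&) {
        return "exception_ptr";  // original type and message are not visible here
    }
}

// Correct pattern: std::rethrow_exception() throws the stored exception object,
// preserving its dynamic type and its what() message.
std::string rethrow_right() {
    std::exception_ptr ep = std::make_exception_ptr(std::runtime_error("disk full"));
    try {
        std::rethrow_exception(ep);
    } catch (const std::runtime_error& e) {
        return std::string("runtime_error: ") + e.what();
    } catch (...) {
        return "unknown";
    }
}

// Small usage example contrasting the two patterns.
void example_rethrow() {
    assert(rethrow_wrong() == "exception_ptr");
    assert(rethrow_right() == "runtime_error: disk full");
}
```

Inside a `catch` block that is still handling the exception, a bare `throw;` achieves the same preservation without needing the `std::exception_ptr` at all, which matches the approach the `directories` fix in this series describes.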
copilot-swe-agent[bot]	77ee7f3417	Revert "Merge 'Add option to use sstable identifier in snapshot' from Benny Halevy" This reverts commit `8192f45e84`. The merge exposed a bug where truncate (via drop) fails and causes Raft errors, leading to schema inconsistencies across nodes. This results in test_table_drop_with_auto_snapshot failures with 'Keyspace test does not exist' errors. The specific problematic change was in commit `19b6207f` which modified truncate_table_on_all_shards to set use_sstable_identifier = true. This causes exceptions during truncate that are not properly handled, leading to Raft applier fiber stopping and nodes losing schema synchronization.	2025-12-12 03:55:13 +00:00
Avi Kivity	24264e24bb	Revert "repair: Add tablet repair progress report support" This reverts commit `faad0167d7`. It causes a regression in test_two_tablets_concurrent_repair_and_migration_repair_writer_level in debug mode (with ~5%-10% probability). Fixes #27510. Closes scylladb/scylladb#27560	2025-12-11 12:18:11 +02:00
Calle Wilund	e48170ca8e	commitlog_replay: Handle fully corrupt files same as partial corruption. Fixes #26744 If a segment to replay is broken such that the main header is not zero, but still broken, we throw header_checksum_error. This was not handled in replayer, which grouped this into the "user error/fundamental problem" category. However, assuming we allow for "real" disk corruption, this should really be treated same as data corruption, i.e. reported data loss, not failure to start up. The `test_one_big_mutation_corrupted_on_startup` test accidentally sometimes provoked this issue, by doing random file wrecking, which on rare occasions provoked this, and thus failed test due to scylla not starting up, instead of losing data as expected. Changed test to consistently cause this exact error instead.	2025-12-10 15:37:04 +01:00
Nadav Har'El	95e303faf3	Merge 'Refactor get_view_natural_endpoint' from Wojciech Mitros With the introduction of rack-lists and the reliance of materialized views on them, the `get_view_natural_endpoint` function can be greatly simplified. When using tablets, instead of doing any index-matching, we can now pair base tables with views only in the same rack. In this series we remove no longer needed code and reorganize the needed code for better clarity. After the changes, the `get_view_natural_endpoint` function goes down from 245 lines to 85 lines, while the whole pairing-related text goes down from 346 lines to 239 lines. Fixes https://github.com/scylladb/scylladb/issues/26313 Closes scylladb/scylladb#27383 * github.com:scylladb/scylladb: mv: replace the simple/complex rack-aware pairing with exact rack matching mv: split out vnode pairing code from get_view_natural_endpoint mv: unify self-pairing and rack-aware pairing into one bool mv: remove the workaround for left nodes when sending view updates	2025-12-09 13:19:13 +02:00
Asias He	faad0167d7	repair: Add tablet repair progress report support This patch adds tablet repair progress report support so that the user could use the /task_manager/task_status API to query the progress. In order to support this, a new system table is introduced to record the user request related info, i.e, start of the request and end of the request. The progress is accurate when tablet split or merge happens in the middle of the request, since the tokens of the tablet are recorded when the request is started and when repair of each tablet is finished. The original tablet repair is considered as finished when the finished ranges cover the original tablet token ranges. After this patch, the /task_manager/task_status API will report correct progress_total and progress_completed. Fixes #22564 Fixes #26896 Closes scylladb/scylladb#26924	2025-12-08 13:35:19 +02:00
Pavel Emelyanov	8192f45e84	Merge 'Add option to use sstable identifier in snapshot' from Benny Halevy This change adds a new option to the REST api and correspondingly, to scylla nodetool: use_sstable_identifier. When set, we use the sstable identifier, if available, to name each sstable in the snapshots directory and the manifest.json file, rather than using the sstable generation. This can be used by the user (e.g. Scylla Manager) for global deduplication with tablets, where an sstable may be migrated across shards or across nodes, and in this case, its generation may change, but its sstable identifier remains stable. Currently, Scylla manager uses the sstable generation to detect sstables that are already backed up to object storage and exist in previous backed up snapshots. Historically, the sstable generation was guaranteed to be unique only per table per node, so the dedup code currently checks for deduplication in the node scope. However, with tablet migration, sstables are renamed when migrated to a different shard, i.e. their generation changes, and they may be renamed when migrated to another node, but even if they are not, the dedup logic still assumes uniqueness only within a node. To address both cases, we keep the sstable_id stable throughout the sstable life cycle (since `3a12ad96c7`). Given the globally unique sstable identifier, scylla manager can now detect duplicate sstables in a wider scope. This can be cluster-wide, but we practically need only rack-wide deduplication or dc-wide, as tablets are migrated across racks only in rare occasions (like when converting from a numerical replication factor to a rack list containing a subset of the available racks in a datacenter). 
Fixes #27181 * New feature, no backport required Closes scylladb/scylladb#27184 * github.com:scylladb/scylladb: database: truncate_table_on_all_shards: set use_sstable_identifier to true nodetool: snapshot: add --use-sstable-identifier option api: storage_service: take_snapshot: add use_sstable_identifier option test: database_test: add snapshot_use_sstable_identifier_works test: database_test: snapshot_works: add validate_manifest sstable: write_scylla_metadata: add random_sstable_identifier error injection table: snapshot_on_all_shards: take snapshot_options sstable: add get_format getter sstable: snapshot: add use_sstable_identifier option db: snapshot_ctl: snapshot_options: add use_sstable_identifier options db: snapshot_ctl: move skip_flush to struct snapshot_options	2025-12-08 12:56:12 +03:00
Tomasz Grabiec	082342ecad	Attach names to allocating sections for better debuggability Large reserves in allocating_section can cause stalls. We already log reserve increase, but we don't know which table it belongs to: lsa - LSA allocation failure, increasing reserve in section 0x600009f94590 to 128 segments; Allocating sections used for updating row cache on memtable flush are notoriously problematic. Each table has its own row_cache, so its own allocating_section(s). If we attached table name to those sections, we could identify which table is causing problems. In some issues we suspected system.raft, but we can't be sure. This patch allows naming allocating_sections for the purpose of identifying them in such log messages. I use abstract_formatter for this purpose to avoid the cost of formatting strings on the hot path (e.g. index_reader). And also to avoid duplicating strings which are already stored elsewhere. Fixes #25799 Closes scylladb/scylladb#27470	2025-12-07 14:14:25 +02:00
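The idea of attaching a name to a section without paying for string formatting on the hot path can be sketched with a deferred name provider. This is an assumption-laden illustration: `named_section` and its members are hypothetical, a `std::function` stands in for the `abstract_formatter` mentioned in the commit, and the message text only imitates the log line quoted above.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>

// Hypothetical section that stores a callable producing its name.
// The name is only formatted when a message is actually built.
class named_section {
    std::function<std::string()> _name;
    int _reserve = 1;
public:
    explicit named_section(std::function<std::string()> name)
        : _name(std::move(name)) {}

    // Doubles the reserve and returns the message that would be logged.
    // This is the only place where the name provider is invoked.
    std::string increase_reserve() {
        _reserve *= 2;
        return "increasing reserve in section " + _name() +
               " to " + std::to_string(_reserve) + " segments";
    }
};

// Small usage example: the name is not formatted until a message is produced.
void example_named_section() {
    int calls = 0;
    named_section s([&calls] { ++calls; return std::string("row_cache(ks.tbl)"); });
    assert(calls == 0);
    assert(s.increase_reserve() ==
           "increasing reserve in section row_cache(ks.tbl) to 2 segments");
    assert(calls == 1);
}
```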
Avi Kivity	47efbdffbc	Merge 'cache, mvcc: Preempt cache update when applying range tombstone from memtable' from Tomasz Grabiec Range tombstones are represented as entry attributes, which applies to the interval between entries. So if a range tombstone covers many rows, to apply it we have to update all covered entries. In some workloads that could be many entries, even the whole cache. Before the patch, we did this update without preemption, which can cause reactor stalls in such workloads. This scenario is already covered by mvcc_tests, e.g. test_apply_to_incomplete_respects_continuity. And I verified that the new preemption point is hit in the test. perf-row-cache-update results show no significant stalls anymore (max 2ms scheduling delay, instead of previous 1.5 s): Generated 1124195 rows Memtable fill took 4179.457520 [ms], {count: 8295, 99%: 0.654949 [ms], max: 32.817176 [ms]} Draining... took 0.000616 [ms] cache: 2506/2948 [MB], memtable: 781/1024 [MB], alloc/comp: 1051/662 [MB] (amp: 0.630) update: 2874.157471 [ms], preemption: {count: 26650, 99%: 1.131752 [ms], max: 2.068762 [ms]}, cache: 3027/3973 [MB], alloc/comp: 3951/2424 [MB] (amp: 0.614), pr/me/dr 1124195/0/0 Fixes #23479 Fixes #2578 Closes scylladb/scylladb#27469 * github.com:scylladb/scylladb: cache, mvcc: Preempt cache update when applying range tombstone from memtable partition_snapshot_row_cursor: Clarify non-obvious semantic difference of range_tombstone() perf-row-cache-update: Add scenario with large tombstone covering many rows	2025-12-07 11:54:15 +02:00
Tomasz Grabiec	d4014b7970	Drop legacy schema support We switched to using v3 schema tables (in system_schema keyspace) in 2017, in `9eb91bc30b`. So no system should have the old schema any more. No need to run legacy_schema_migrator on boot. Closes scylladb/scylladb#27420	2025-12-07 00:09:13 +02:00
Tomasz Grabiec	e546143fd9	partition_snapshot_row_cursor: Clarify non-obvious semantic difference of range_tombstone()	2025-12-06 01:03:10 +01:00
Botond Dénes	9d2f7c3f52	Merge 'mv: allow setting concurrency in PRUNE MATERIALIZED VIEW' from Wojciech Mitros The PRUNE MATERIALIZED VIEW statement is performed as follows: 1. Perform a range scan of the view table from the view replicas based on the ranges specified in the statement. 2. While reading the paged scan above, for each view row perform a read from all base replicas at the corresponding primary key. If a discrepancy is detected, delete the row in the view table. When reading multiple rows, this is very slow because for each view row we need to perform a single row query on multiple replicas. In this patch we add an option to speed this up by performing many of the single base row reads concurrently, at the concurrency specified in the USING CONCURRENCY clause. Aside from the unit test, I checked manually on a 3-node cluster with 10M rows, using vnodes. There were actually no ghost rows in the test, but we still had to iterate over all view rows and read the corresponding base rows. And actual ghost rows, if there are any, should be a tiny fraction of all rows. I compared concurrencies 1,2,10,100 and the results were: * Pruning with concurrency 1 took total 1416 seconds * Pruning with concurrency 2 took total 731 seconds * Pruning with concurrency 10 took total 234 seconds * Pruning with concurrency 100 took total 171 seconds So after a concurrency of 10 or so we're hitting diminishing returns (at least in this setup). At that point we may be no longer bottlenecked by the reads, but by CPU on the shard that's handling the PRUNE. Fixes https://github.com/scylladb/scylladb/issues/27070 Closes scylladb/scylladb#27097 * github.com:scylladb/scylladb: mv: allow setting concurrency in PRUNE MATERIALIZED VIEW cql: add CONCURRENCY to the USING clause	2025-12-04 11:47:41 +02:00
Benny Halevy	1c45ad7cee	db: snapshot_ctl: snapshot_options: add use_sstable_identifier options To be used for naming sstables in the snapshot by their sstable identifiers rather than their generation, to facilitate global deduplication of sstables in backup. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-12-04 09:46:35 +02:00
Benny Halevy	c18133b6cb	db: snapshot_ctl: move skip_flush to struct snapshot_options Prepare for adding another option: use_sstable_identifier. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-12-04 09:46:35 +02:00
Szymon Wasik	4f803aad22	Improve documentation of vector search configuration parameters. This patch adds separate group for vector search parameters in the documentation and fixes small typos and formatting. Fixes: SCYLLADB-77. Closes scylladb/scylladb#27385	2025-12-03 21:02:59 +02:00
Piotr Dulikowski	654ac9099b	db/view/view_building_coordinator: skip work if no view is built Even though `view_building_coordinator::work_on_view_building` has an `if` at the very beginning which checks whether the currently processed base table is set, it only prints a message and continues executing the rest of the function regardless of the result of the check. However, some of the logic in the function assumes that the currently processed base table field is set and tries to access the value of the field. This can lead to the view building coordinator accessing a disengaged optional, which is undefined behavior. Fix the function by adding the clearly missing `co_await` to the check. A regression test is added which checks that the view building state observer - a different fiber which used to print a weird message due to erroneous view building coordinator behavior - does not print a warning. Fixes: scylladb/scylladb#27363 Closes scylladb/scylladb#27373	2025-12-03 09:44:28 +02:00
Botond Dénes	e762027943	db/config: change batchlog_replay_cleanup_after_replays default to 1 Now that batchlog cleanup is cheap, on account of memtable flush on the system.batchlog table garbage-collecting tombstones (previous patch), we can afford to do cleanup on each replay, keeping the memtable size small and more importantly -- the amount of tombstones in the memtable small.	2025-12-02 14:21:26 +02:00
Botond Dénes	8edd5b80ab	test/boost/batchlog_manager_test: add test for batchlog cleanup Add more tests covering different aspects of batchlog replay, cleanup, replay timeout and finally v1 -> v2 migration.	2025-12-02 14:21:26 +02:00
Botond Dénes	e309b5dbe1	db/batchlog_manager: config: s/write_timeout/reply_timeot/ Although the value of this item is indeed derived from the write timeout config, the name doesn't reflect what it is used for. Change it to reflect it better.	2025-12-02 14:21:26 +02:00
Botond Dénes	846b656610	db,service: switch to system.batchlog_v2 New batchlogs are written to the batchlog_v2 table and replay also uses the v2 table. The content of system.batchlog is attempted to be migrated to system.batchlog_v2 after each start of the batchlog_manager service. The migration is retried on each replay if it fails. This is redundant but simple. Batchlog cleanup now doesn't involve flushing memtables, the only remaining user of replica/database.hh is gone, so the include is dropped.	2025-12-02 14:21:26 +02:00
Botond Dénes	ee851266be	db/system_keyspace: introduce system.batchlog_v2 Rearranges the system.batchlog schema as follows: CREATE TABLE system.batchlog_v2 ( version int, stage int, shard int, written_at timestamp, id uuid, data blob, PRIMARY KEY ((version, stage, shard), written_at, id)); With the following goals: 1) Make post-replay batchlog cleanup possible with a simple range-tombstone. This allows dropping the individual dead batchlog entries, as they are shadowed by a higher level tombstone. This enables dropping tombstones without tombstone GC. 2) To make the above possible, introduce the stage key component: batchlog entries that fail the first replay attempt, are moved to the failed_replay stage, so the initial stage can be cleaned up safely. 3) Spread out the data among Scylla shards, via the batchlog shard column. 4) Make batchlog entries ordered by the batchlog create time (id). This allows for selecting batchlogs to replay, without post-filtering of batchlogs that are too young to be replayed.	2025-12-02 14:21:25 +02:00
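Goals 1 and 4 of the schema above rely on entries within a partition being ordered by `(written_at, id)`, so that everything older than a cutoff forms one contiguous range. The toy model below illustrates that property with an ordered `std::map`; it is not how ScyllaDB stores or deletes data, and `partition`, `clustering_key`, and `cleanup_before` are hypothetical names. Erasing the range stands in for writing a single range tombstone.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <map>
#include <string>
#include <utility>

// Toy model of one (version, stage, shard) partition of system.batchlog_v2:
// rows are ordered by the clustering key (written_at, id).
using clustering_key = std::pair<std::int64_t, std::string>;  // (written_at, id)
using partition = std::map<clustering_key, std::string>;      // key -> data blob

// Removes every entry with written_at < cutoff in one range operation,
// the analogue of a single range tombstone. Returns the number removed.
std::size_t cleanup_before(partition& p, std::int64_t cutoff) {
    // First entry whose key is not less than (cutoff, ""), i.e. written_at >= cutoff.
    auto end = p.lower_bound(clustering_key{cutoff, std::string{}});
    auto removed = static_cast<std::size_t>(std::distance(p.begin(), end));
    p.erase(p.begin(), end);
    return removed;
}

// Small usage example: entries written before the cutoff go away together.
void example_cleanup() {
    partition p;
    p.emplace(clustering_key{100, "a"}, "batch-1");
    p.emplace(clustering_key{200, "b"}, "batch-2");
    p.emplace(clustering_key{300, "c"}, "batch-3");
    assert(cleanup_before(p, 250) == 2);
    assert(p.size() == 1);
    assert(p.begin()->first.first == 300);
}
```

Because the range is contiguous, no per-entry filtering is needed to decide which entries fall before the cutoff.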
Botond Dénes	9434ec2fd1	service,db: extract generation of batchlog delete mutation Don't build batchlog delete mutations in storage-proxy code. Move this code into db/batchlog_manager.cc, exposed via db/batchlog.hh. This serves multiple goals: 1) Concentrates low-level batchlog related logic in db/batchlog_manager.cc 2) Reduce current and future code duplication. 3) Make future changes to this logic easier.	2025-12-02 14:21:25 +02:00
Botond Dénes	f54602daf0	service,db: extract get_batchlog_mutation_for() from storage-proxy Don't build batchlog mutations in storage-proxy code. Move this code into db/batchlog_manager.cc, exposed via db/batchlog.hh. This serves multiple goals: 1) Concentrates low-level batchlog related logic in db/batchlog_manager.cc 2) Reduce current and future code duplication. 3) Make future changes to this logic easier.	2025-12-02 14:21:25 +02:00
Botond Dénes	097c2cd676	db/batchlog_manager: only consider propagation delay with tombstone-gc=repair The propagation delay has no effect for other tombstone gc strategies, so ignore it when tombstone-gc != repair.	2025-12-02 14:21:25 +02:00
Botond Dénes	4f30807f01	db/batchlog_manager: don't drop entire batch if one mutations' table was dropped Just skip the mutation(s) whose tables were dropped instead. Use the newly introduced data_dictionary::table::get_truncation_time() to avoid looking up real table object.	2025-12-02 14:21:25 +02:00
Botond Dénes	337f417b13	db/batchlog_manager: batch(): replace map_reduce() with simple loop The map_reduce achieves no concurrency, both map and reduce are synchronous. It only achieves two redundant lookups for the table and hard-to-read code. Convert it into a simple loop. Preserve the stall-protection by adding a maybe_yield() to the loop.	2025-12-02 12:05:10 +02:00
Wojciech Mitros	6221c58325	mv: replace the simple/complex rack-aware pairing with exact rack matching When the initial version of rack-aware pairing was introduced, materialized views with tablets were still experimental. Since then, we decided that we'll only allow materialized views in clusters where the base table and the view are replicated on the same racks, with one replica of each tablet on each rack. This allows us to remove almost all logic from our base-view pairing. The only check for the paired view replica is now whether it's in the same rack as the base replica sending the update. In this patch we replace the simple and complex rack-aware pairing with the simple check above. Because of this, we have to remove a test case from network_topology_strategy_test which was testing complex pairing. The tested topology is not supported for views with tablets (or is unlikely to be supported, as it's a random test), so there's no use keeping the test. The test case for simple rack aware pairing was kept, but now we only test the case where each rack has one replica, not multiple. Additionally, we split finding of an unpaired replica to a separate function and partially rewrite it without reusing the helper structures that were present when calculating the simple and complex rack-aware pairing. We only look for an unpaired replica if we couldn't find a paired replica ourselves or if the number of view replicas didn't match the base replicas. If an unpaired replica appears while these conditions pass, we won't send an extra update, but that would be a new bug altogether, because we only expect the unpaired replica to appear during RF changes, so when these conditions aren't fulfilled. Fixes https://github.com/scylladb/scylladb/issues/26313	2025-12-02 10:52:36 +01:00
Botond Dénes	705af2bc16	db/batchlog_manager: finish coroutinizing replay_all_failed_batches It was coroutinized already but strangely, some continuations also remained. The `batch` lambda is still left in continuation style.	2025-12-02 10:42:28 +02:00
Botond Dénes	5b5f9120d0	db/batchlog_manager: improve replayAllFailedBatches logs Add cleanup flag value to start message and drop cpu, it is redundant as Scylla already adds the shard number to the logs. Add all_replayed to finish message.	2025-12-02 10:42:28 +02:00
Wojciech Mitros	4ec0fa6eb5	mv: split out vnode pairing code from get_view_natural_endpoint To avoid repeatedly checking whether we're using tablets and having to use unnecessarily flexible code fitting both cases, we split out the base-view pairing code for the case of vnodes to another function. The get_view_natural_endpoint will now have only common steps, a call to that function, and steps specific to tablets.	2025-12-02 03:32:36 +01:00
Wojciech Mitros	c313b215e4	mv: unify self-pairing and rack-aware pairing into one bool We always use "legacy self pairing" when not using tablets, and the "rack aware pairing" has been enabled in every version where views with tablets isn't experimental. So in practice, instead of checking these variables we can just look at whether the table uses tablets.	2025-12-02 03:32:32 +01:00
Wojciech Mitros	7c612e1789	mv: remove the workaround for left nodes when sending view updates At one point, the get_view_natural_endpoint was using IP for the view update (and hint) destinations, but the hint code was using host_id for the destinations. When a node left, we could no longer have a mapping from an IP to a host_id and when trying to store a hint for this IP, we'd crash. We worked around this issue by dropping the view update completely if the target is in the "left" state. Since then, we also moved to host_id's in the view update code, so there's no longer any translation needed when storing the hints. Additionally, we now drain hints not when entering the "left" state, but when the node actually stops owning tokens. Because of that, the workaround is not needed anymore, so we remove it in this commit. The existing test_mv_tablets_empty_ip case verifies that indeed, we do not crash in the original problematic scenario.	2025-12-01 12:27:28 +01:00
Piotr Dulikowski	44c605e59c	Merge 'Fix the types of change events in Alternator Streams' from Piotr Wieczorek This patch increases the compatibility with DynamoDB Streams by integrating the DynamoDB's event type rules (described in https://github.com/scylladb/scylladb/issues/6918) into Alternator. The main changes are: - introduce a new flag `alternator_streams_strict_compatibility`, meant as a guard of performance-intensive operations that increase the compatibility with DynamoDB Streams. If enabled, Alternator always performs a RBW before a data-modifying operation, and propagates its result to CDC. Then, the old item is compared to the new one, to determine the mutation type (INSERT vs MODIFY). This option is a no-op for tables with disabled Alternator Streams, - reduce splitting of simple Alternator mutations, - correctly distinguish event types described in #6918, except for item deletes. Deleting a missing item with DeleteItem, BatchWriteItem, or a missing field with UpdateItem still emit REMOVEs. To summarize, the emitted events of the data manipulation operations should be as follows: - DeleteItem/BatchWriteItem.DeleteItem of existing item: REMOVE (OK) - DeleteItem of nonexistent item: nothing (OK) - BatchWriteItem.DeleteItem of nonexistent item: nothing (OK) - PutItem/UpdateItem/BatchWriteItem.PutItem of existing and not equal item: MODIFY (OK) - PutItem/UpdateItem/BatchWriteItem.PutItem of existing and equal item: nothing (OK) - PutItem/UpdateItem/BatchWriteItem.PutItem of nonexistent item: INSERT (OK) No backport is necessary. 
Refs https://github.com/scylladb/scylladb/pull/26149 Refs https://github.com/scylladb/scylladb/pull/26396 Refs https://github.com/scylladb/scylladb/issues/26382 Fixes https://github.com/scylladb/scylladb/issues/6918 Closes scylladb/scylladb#26121 * github.com:scylladb/scylladb: test/alternator: Enable the tests failing because of #6918 alternator, cdc: Don't emit events for no-op removes alternator, cdc: Don't emit an event for equal items alternator/streams, cdc: Differentiate item replace and item update in CDC alternator: Change the return type of rmw_operation_return config: Add alternator_streams_strict_compatibility flag cdc: Don't split a row marker away from row cells	2025-11-30 07:20:22 +01:00
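The event rules summarized in merge 44c605e59c can be read as a small decision table. The sketch below is a minimal illustration, not Alternator's code: the function name `classify_event`, the `"put"`/`"delete"` operation labels, and the use of plain dicts for items are all assumptions made for the example. It follows the summary list in the commit message, where the previous item is known from a read-before-write.

```python
from typing import Optional


def classify_event(op: str, old_item: Optional[dict],
                   new_item: Optional[dict]) -> Optional[str]:
    """Return the stream event type, or None when no event is emitted.

    `op` is a hypothetical label: "put" stands for PutItem, UpdateItem and
    BatchWriteItem.PutItem; "delete" stands for DeleteItem and
    BatchWriteItem.DeleteItem. `old_item` is the result of the
    read-before-write (None if the item did not exist).
    """
    if op == "delete":
        # Deleting an existing item emits REMOVE; deleting a missing one is a no-op.
        return "REMOVE" if old_item is not None else None
    if op == "put":
        if old_item is None:
            return "INSERT"  # the item did not exist before
        # An equal item produces no event; a different one is a MODIFY.
        return None if old_item == new_item else "MODIFY"
    raise ValueError(f"unknown operation: {op!r}")
```

For example, `classify_event("put", {"k": 1}, {"k": 1})` yields no event, while the same call with a changed attribute yields `"MODIFY"`.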
Dawid Mędrek	48a28c24c5	db/commitlog: Include position and alignment information in errors When we come across a segment truncation, this information may be helpful to determine when the error occurred exactly and hint at what code path might've led to it. Closes scylladb/scylladb#27207	2025-11-28 15:28:08 +03:00
Calle Wilund	59c87025d1	commitlog::read_log_file: Check for eof position on all data reads Fixes #24346 When reading, we check for each entry and each chunk, if advancing there will hit EOF of the segment. However, IFF the last chunk being read has the last entry _exactly_ matching the chunk size, and the chunk ending at _exactly_ segment size (preset size, typically 32Mb), we did not check the position, and instead complained about not being able to read. This has literally _never_ happened in actual commitlog (that was replayed at least), but has apparently happened more and more in hints replay. Fix is simple, just check the file position against size when advancing said position, i.e. when reading (skipping already does). v2: * Added unit test Closes scylladb/scylladb#27236	2025-11-28 15:26:46 +03:00
Wojciech Mitros	323e5cd171	mv: allow setting concurrency in PRUNE MATERIALIZED VIEW The PRUNE MATERIALIZED VIEW statement is performed as follows: 1. Perform a range scan of the view table from the view replicas based on the ranges specified in the statement. 2. While reading the paged scan above, for each view row perform a read from all base replicas at the corresponding primary key. If a discrepancy is detected, delete the row in the view table. When reading multiple rows, this is very slow because for each view row we need to perform a single row query on multiple replicas. In this patch we add an option to speed this up by performing many of the single base row reads concurrently, at the concurrency specified in the USING CONCURRENCY clause. Fixes https://github.com/scylladb/scylladb/issues/27070	2025-11-27 00:02:28 +01:00
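The two steps described in commit 323e5cd171 amount to a scan with a bounded number of in-flight base reads. The sketch below illustrates that pattern only; it is not Scylla's implementation. The names `prune_view`, `read_base_value` and `delete_view_row`, the row layout (`pk`, `v`), and the definition of a "discrepancy" as a value mismatch are all assumptions made for the example.

```python
import asyncio
from typing import Awaitable, Callable, Iterable


async def prune_view(
    view_rows: Iterable[dict],
    read_base_value: Callable[[object], Awaitable[object]],  # hypothetical base read
    delete_view_row: Callable[[dict], Awaitable[None]],      # hypothetical view delete
    concurrency: int = 1,
) -> None:
    """Check each scanned view row against the base table, at most
    `concurrency` base reads at a time, and delete mismatching view rows."""
    limit = asyncio.Semaphore(concurrency)

    async def check(row: dict) -> None:
        async with limit:  # bounds the number of concurrent base reads
            base_value = await read_base_value(row["pk"])
            # Assumption: a discrepancy means the base value differs or is missing.
            if base_value != row["v"]:
                await delete_view_row(row)

    await asyncio.gather(*(check(row) for row in view_rows))
```

With `concurrency=1` the reads happen one at a time, which corresponds to the slow behavior the commit describes; larger values let more single-row base reads overlap.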
Nadav Har'El	9cde93e3da	Merge 'db/view/view_building_coordinator: get rid of task's state in group0' from Michał Jadwiszczak Previously, the view building coordinator relied on setting each task's state to STARTED and then explicitly removing these state entries once tasks finished, before scheduling new ones. This approach induced a significant number of group0 commits, particularly in large clusters with many nodes and tablets, negatively impacting performance and scalability. With the update, the coordinator and worker logic has been restructured to operate without maintaining per-task states. Instead, tasks are simply tracked with an aborted boolean flag, which is still essential for certain tablet operations. This change removes much of the coordination complexity, simplifies the view building code, and reduces operational overhead. In addition, the coordinator now batches reports of finished tasks before making commits. Rather than committing task completions individually, it aggregates them and reports in groups, significantly minimizing the frequency of group0 commits. This new approach is expected to improve efficiency and scalability during materialized view construction, especially in large deployments. Fixes https://github.com/scylladb/scylladb/issues/26311 This patch needs to be backported to 2025.4. Closes scylladb/scylladb#26897 * github.com:scylladb/scylladb: docs/dev/view-building-coordinator: update the docs after recent changes db/view/view_building: send coordinator's term in the RPC db/view/view_building_state: replace task's state with `aborted` flag db/view/view_building_coordinator: batch finished tasks reporting db/view/view_building_worker: change internal implementation db/view/view_building_coordinator: change `work_on_tasks` RPC return type	2025-11-26 11:35:44 +02:00
Botond Dénes	384bffb8da	Merge 'compaction: limit the maximum shares allocated to a compaction scheduling class' from Raphael Raph Carvalho This PR adds support for limiting the maximum shares allocated to a compaction scheduling class by the compaction controller. It introduces a new configuration parameter, compaction_max_shares, which, when set to a non zero value, will cap the shares allocated to compaction jobs. This PR also exposes the shares computed by the compaction controller via metrics, for observability purposes. Fixes https://github.com/scylladb/scylladb/issues/9431 Enhancement. No need to backport. NOTE: Replaces PR https://github.com/scylladb/scylladb/pull/26696 Ran a test in which the backlog raised the need for max shares (normalized backlog above normalization_factor), and played with different values for new option compaction_max_shares to see it works (500, 1000, 2000, 250, 50) Closes scylladb/scylladb#27024 * github.com:scylladb/scylladb: db/config: introduce new config parameter `compaction_max_shares` compaction_manager:config: introduce max_shares compaction_controller: add configurable maximum shares compaction_controller: introduce `set_max_shares()`	2025-11-26 06:51:30 +02:00
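The capping rule described in merge 384bffb8da is simple enough to state as a one-line function. This is a sketch under stated assumptions, not the compaction controller's code: `cap_shares` is a hypothetical name, and treating any non-positive limit as "no cap" is an assumption (the commit message only says a non-zero value enables the cap).

```python
def cap_shares(computed_shares: float, max_shares: float = 0) -> float:
    """Limit the shares computed by the controller to `max_shares`.

    A value of 0 means the limit is disabled. Treating negative values
    the same way is an assumption made for this sketch.
    """
    if max_shares > 0:
        return min(computed_shares, max_shares)
    return computed_shares
```

For example, `cap_shares(1000, 250)` gives 250, while `cap_shares(1000)` leaves the computed value unchanged.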
Michał Jadwiszczak	fb8cbf1615	db/view/view_building: send coordinator's term in the RPC To avoid case when an old coordinator (which hasn't been stopped yet) dictates what should be done, add raft term to the `work_on_view_building_tasks` RPC. The worker needs to check if the term matches the current term from raft server, and deny the request when the term is bad.	2025-11-25 12:14:05 +01:00
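The term check described in commit fb8cbf1615 can be sketched as a guard at the top of the RPC handler. Everything named here is hypothetical (`StaleCoordinatorError`, `handle_work_on_tasks`, and the way the terms are passed in); the sketch only shows the comparison the commit message describes, with the actual task execution left out.

```python
from typing import Iterable, List


class StaleCoordinatorError(Exception):
    """Raised when a request carries a term other than the worker's current one."""


def handle_work_on_tasks(request_term: int, current_term: int,
                         task_ids: Iterable[str]) -> List[str]:
    """Accept the request only if the coordinator's term matches the
    worker's current raft term; otherwise deny it."""
    if request_term != current_term:
        raise StaleCoordinatorError(
            f"request term {request_term} != current term {current_term}")
    # Placeholder: a real handler would execute the tasks here.
    return list(task_ids)
```

A request from a coordinator that lost leadership carries an older term and is rejected before any work is started.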
Michał Jadwiszczak	24d69b4005	db/view/view_building_state: replace task's state with `aborted` flag After previous commits, we can drop entire task's state and replace it with single boolean flag, which determines if a task was aborted. Once a task was aborted, it cannot get resurrected to a normal state.	2025-11-25 12:14:04 +01:00
Michał Jadwiszczak	eb04af5020	db/view/view_building_coordinator: batch finished tasks reporting In previous implementation to execute view building tasks, the coordinator needed to firstly set their states to `STARTED` and then it needed to remove them before it could start the next ones. This logic required a lot of group0 commits, especially in large clusters with higher number of nodes and big tablet count. After previous commit to the view building worker, the coordinator can start view building tasks without setting the `STARTED` state and deleting finished tasks. This patch adjusts the coordinator to save finished tasks locally, so it can continue to execute next ones and the finished tasks are periodically removed from the group0 by `finished_task_gc_fiber()`.	2025-11-25 12:14:04 +01:00
Lakshmi Narayanan Sreethar	9cb766f929	db/config: introduce new config parameter `compaction_max_shares` Add support for the new configuration parameter `compaction_max_shares`, and update the compaction manager to pass it down to the compaction controller when it changes. The shares allocated to compaction jobs will be limited by this new parameter. Fixes #9431 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-11-24 12:52:29 -03:00
Tomasz Grabiec	d4b77c422f	Merge 'load_stats: leaving replica could be std::nullopt' from Ferenc Szili When migrating tablet size during the end_migration tablet transition stage, we need the pending and leaving replica hosts. The leaving and pending replicas are gathered in objects of type std::optional<tablet_replica> and are not checked if they contain a value before dereferencing which could cause an exception in the topology coordinator. This patch adds a check for leaving and pending replicas, and only performs the tablet size migration if neither are empty. This bug was introduced in `10f07fb95a` This change also adds the ability to create a tablet size in load_stats during end_migration stage of a tablet rebuild. We compute the new tablet size by averaging the tablet sizes of the existing replicas. This change also adds the virtual table tablet_sizes which contains tablet sizes of all the replicas of all the tablets in the cluster. A version containing this bug has not yet been released, so a backport is not needed. Closes scylladb/scylladb#27118 * github.com:scylladb/scylladb: test: add tests for tablet size migration during end_migration virtual_table: add tablet_sizes virtual table load_stats: update tablet sizes after migration or rebuild	2025-11-24 15:31:30 +01:00
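The two behaviors described in merge d4b77c422f, guarding against a missing leaving or pending replica and averaging existing replica sizes after a rebuild, can be sketched as follows. This is an illustration only: the function names, the nested-dict layout `sizes[host][tablet]`, and the use of integer division for the average are assumptions, not details taken from load_stats.

```python
from typing import Dict, Optional, Sequence


def migrate_tablet_size(sizes: Dict[str, Dict[str, int]], tablet: str,
                        leaving: Optional[str], pending: Optional[str]) -> bool:
    """Move a tablet's recorded size from the leaving host to the pending
    host. Does nothing (returns False) if either replica is absent."""
    if leaving is None or pending is None:
        return False  # the check the fix adds before dereferencing
    size = sizes.get(leaving, {}).pop(tablet, None)
    if size is None:
        return False  # no size recorded for this tablet on the leaving host
    sizes.setdefault(pending, {})[tablet] = size
    return True


def estimate_rebuilt_tablet_size(existing_sizes: Sequence[int]) -> Optional[int]:
    """Average the sizes of the existing replicas (integer division is an
    assumption); None when there is nothing to average."""
    if not existing_sizes:
        return None
    return sum(existing_sizes) // len(existing_sizes)
```

In this sketch, a rebuild with existing replica sizes of 100, 200 and 300 would record 200 for the new replica.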
Michał Jadwiszczak	08974e1d50	db/view/view_building_worker: change internal implementation This commit doesn't change the logic behind the view building worker but it changes how the worker is executing view building tasks. Previously, the worker had a state only on shard0 and it was reacting to changes in group0 state. When it noticed some tasks were moved to `STARTED` state, the worker was creating a batch for it on the shard0 state. The RPC call was used only to start the batch and to get its result. Now, the main logic of batch management was moved to the RPC call handler. The worker has a local state on each shard and the state contains: - unique ptr to the batch - set of completed tasks - information for which views the base table was flushed So currently, each batch lives on a shard where it has its work to do exclusively. This eliminates a need to do a synchronization between shard0 and work shard, which was a painful point in previous implementation. The worker still reacts to changes in group0 view building state, but currently it's only used to observe whether any view building task was aborted by setting `ABORTED` state. To prepare for further changes to drop the view building task state, the worker ignores `IDLE` and `STARTED` states completely.	2025-11-24 11:12:31 +01:00
Michał Jadwiszczak	6d853c8f11	db/view/view_building_coordinator: change `work_on_tasks` RPC return type During the initial implementation of the view building coordinator, we decided that if a view building task fails locally on the worker (example reason: view update's target replica is not available), the worker will retry this work instead of reporting a failure to the coordinator. However, we left return type of the RPC, which was telling if a task was finished successfully or aborted. But the worker doesn't need to report that a task was aborted, because it's the coordinator, who decides to abort a task. So, this commit changes the return type to list of UUIDs of completed tasks. Previously length of the returned vector needed to be the same as length of the vector sent in the request. Now we can drop this restriction and the RPC handler returns a list of UUIDs of completed tasks (subset of vector sent in the request). This change is required to drop `STARTED` state in next commits. Since Scylla 2025.4 wasn't released yet and we're going to merge this patch before releasing, no RPC versioning or cluster feature is needed.	2025-11-24 11:12:29 +01:00


4655 Commits